At some sites, if I try to add passages by URL, it finds nothing. If I copy to clipboard, it finds plenty of passages. Am I doing something wrong?
Here is a sample that for me finds no passages by URL, but finds 87 passages by clipboard:
http://www.vatican.va/holy_father/john_paul_ii/encyclicals/documents/hf_jp-ii_enc_17042003_ecclesia-de-eucharistia_en.html
Here is another that finds 79 from the clipboard but none by URL:
http://www.vatican.va/holy_father/john_paul_ii/encyclicals/documents/hf_jp-ii_enc_15101998_fides-et-ratio_en.html
I can reproduce that. Could the underlying HTML be protecting it?
Is there any comment on this, should I enter it as a bug or something?
I presume this is because of the formatting. All the book names are in italics, which means Mk 26:26 actually looks like <i>Mk</i> 26:26 when you download it. It's therefore not parsed correctly.
Interesting...but it's on the clipboard that way as well, if you paste in Word you can see the italics, if you paste in Notepad you do not - being text-based, it strips that out.
Wouldn't one think that if they can parse it properly when reading it off the clipboard, they could also parse it properly when reading it via URL?
It doesn't work like that, but like this:
So, it's the browser that is essentially stripping out the formatting, not Logos. It would be very hard for Logos to do. At least, it would be very hard to come up with a system that always worked (HTML is a complicated beast), though getting something that worked perhaps 90% of the time would be much more achievable. I'm not sure it's worth it though.
Thanks Mark, for the analysis.
So what we are saying, is that if a web site does virtually any formatting of the bible notations, it won't be able to be recognized by Logos? That hardly seems like a very useful feature then. FYI I was a programmer for over 30 years, and stripping any HTML formatting out of the text before it's parsed is very trivial. That's why it seems like a half-baked feature to me: if it won't work a lot of the time given how they implemented it, why implement it at all?
I'll just use copy to clipboard from now on, and hope they didn't spend a lot of time implementing that feature [:^)]
So what we are saying, is that if a web site does virtually any formatting of the bible notations, it won't be able to be recognized by Logos?
No, I'm saying that if a website does any formatting within the notation it's a problem. So <i>Matt</i> 26:3 is a problem, but <i>Matt 26:3</i> isn't.
FYI I was a programmer for over 30 years, and stripping any HTML formatting out of the text before it's parsed is very trivial.
Yes and no. I'm sure Logos are aware of whatever the .NET equivalent of KSES is. But just stripping HTML formatting does not solve the problem. Take this example:
<li><b>1.</b> I went for a walk with Matt.</li><li><b>2.</b> Then I got pizza.</li>
Strip out the tags and you end up with: 1. I went for a walk with Matt. 2. Then I got pizza.
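So stripping tags in that scenario makes things worse, not better: "Matt. 2" now looks like a reference to Matthew 2, even though the two pieces came from separate list items.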
When I said that something that works 90% of the time might be feasible, I had in mind a script that removed inline-level HTML, but not block-level (so it kept the <p>'s, <div>'s and <li>'s, but removed the <b>'s, <i>'s, <span>'s, etc.). But it would still fail with references added by JavaScript, and might also introduce some unintended consequences.
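To make the difference concrete, here is a rough Python sketch of the two approaches: stripping every tag versus stripping only inline tags and treating whatever is left as a block boundary. This is only an illustration, not how Logos does it. The tag list and the function names are my own assumptions, and regex-based tag handling like this is crude compared with a real HTML parser.

```python
import re

# Illustrative, incomplete list of inline tags (an assumption, not Logos's list).
INLINE_TAGS = ("a", "b", "i", "em", "strong", "span", "u", "sup", "sub")
INLINE_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(INLINE_TAGS), re.IGNORECASE)
ANY_TAG_RE = re.compile(r"<[^>]+>")


def strip_all_tags(html):
    """Naive approach: delete every tag, so neighbouring blocks run together."""
    return ANY_TAG_RE.sub("", html)


def strip_inline_only(html):
    """Delete inline tags, then turn each remaining tag into a line break.

    Returns the non-empty lines, one per block of text.
    """
    text = INLINE_RE.sub("", html)
    text = ANY_TAG_RE.sub("\n", text)
    return [line.strip() for line in text.split("\n") if line.strip()]


sample = "<li><b>1.</b> I went for a walk with Matt.</li><li><b>2.</b> Then I got pizza.</li>"
print(strip_all_tags(sample))     # the two list items are fused into one string
print(strip_inline_only(sample))  # the two list items stay on separate lines
```

With the inline-only version, <i>Mt</i> 26:28 collapses to Mt 26:28 while the list items stay apart. It does nothing for references inserted by JavaScript, since those never appear in the downloaded HTML.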
In a past life I actually did a lot of work "screen scraping" scripture verses off of public web sites using .NET and C#. I used regular expressions (http://en.wikipedia.org/wiki/Regular_expression), a very powerful yet obscure/unreadable "language" that .NET and many other languages and platforms support. Using regexp I could get accurate scripture verses with pretty much anything embedded in or around them, even if the verse notation went across multiple lines. It could even correct and reformat the abbreviations where a web site would use non-standard book abbreviations. The actual extraction/formatting code was probably under 20 lines, because regexp is so powerful (did I mention obscure? <g>).
It's really not that hard, someone just has to believe it's important that this work correctly 100% of the time. I guess with all the features that need to go in, and all the more important bug fixes, some features are just going to be an 80% implementation first pass.
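For anyone curious what that kind of expression might look like, here is a minimal sketch in Python rather than C#. It is not the code described above. The abbreviation table, the pattern, and the find_refs name are all assumptions made up for illustration, and the table covers only a handful of books. The idea is to allow any run of whitespace or tags between the book name, the chapter, and the verse, and to require the colon so that something like "Matt.</li><li><b>2.</b>" is not mistaken for a reference.

```python
import re

# Hypothetical abbreviation table: a tiny illustrative subset, not a full list.
BOOKS = {
    "mt": "Matthew", "matt": "Matthew",
    "mk": "Mark", "lk": "Luke", "jn": "John",
    "phil": "Philippians",
}

# Any run of whitespace and/or HTML tags (so a reference may span tags or lines).
GAP = r"(?:\s|<[^>]+>)*"

REF_RE = re.compile(
    r"\b(?P<book>%s)\b\.?" % "|".join(sorted(BOOKS, key=len, reverse=True))
    + GAP + r"(?P<chapter>\d+)"
    + GAP + ":"
    + GAP + r"(?P<verse>\d+)",
    re.IGNORECASE,
)


def find_refs(html):
    """Return normalized 'Book chapter:verse' strings found in raw HTML."""
    return [
        "%s %s:%s" % (BOOKS[m.group("book").lower()], m.group("chapter"), m.group("verse"))
        for m in REF_RE.finditer(html)
    ]


html = "(cf. <i>Mt</i> 26:28; <i>Mk</i> 14:24; <i>Lk</i> 22:20; <i>Jn</i> 10:15)"
print(find_refs(html))
```

A real version would also need verse ranges, chapter-only references, numbered books such as 1 Cor, and the full abbreviation list, which is where most of the remaining effort would go.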
In this particular case the source code makes the problem evident:
Certainly it is a gift given for our sake, and indeed that of all humanity (cf. <i>Mt</i> 26:28; <i>Mk</i> 14:24; <i>Lk</i> 22:20; <i>Jn</i> 10:15), yet it is<i> first and foremost a gift to the Father</i>: “a sacrifice that the Father accepted, giving, in return for this total self-giving by his Son, who 'became obedient unto death' (<i>Phil</i> 2:8),
I'd make an educated guess - based on more evidence than this - that the routine expects the string to all fall within a pair of tags rather than being split by an end tag. This shouldn't be a hard fix ... but perhaps at the cost of creating other problems? More analysis needed.
I have dabbled in RegEx for a multi-lingual version of RefTagger I started to code (like most of my projects, still unfinished!). You should test your expression on the HTML in question, and if it works, send it to Bradley!