I remember that in L3, a word in the search YYY WITHIN 5 WORDS ZZZ was defined as 5 characters, or something like that. In L4, is it still defined this way, or by the grammatical definition of a word?
My guess would be that it hasn't changed.
While I can't answer the question, I believe something has changed. I remember Dave Hooton being excited that WITHIN, BEFORE, etc. were now counting "real" words (or something to that effect), which was an improvement over 3.0.
I remember Dave Hooton being excited about the fact that WITHIN, BEFORE, etc. were now counting "real" words
Yup, I am still excited. L4 counts grammatical words. You can try a simple test:
Logos 4 uses the Unicode word-breaking algorithm (http://unicode.org/reports/tr29/) to split text into words when indexing. (Putting it simplistically, it splits on spaces, but there are some special cases: colons between letters break a token into separate words, but colons between numbers don't, and the parts of a hyphenated phrase are treated as separate words. There is currently no CJKV support (e.g., bigram indexing); Asian languages typically require a dictionary for good word-breaking, and we haven't implemented that.) There are also some smarts to ignore footnote characters when counting words in the surface text, except when those superscripted characters are significant (think "P71").
In general, though, it should do what you expect, so the "WORDS" unit in Logos 4 searches counts actual words; it is no longer a character-based simulation.
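To make the "special cases" concrete, here is a toy sketch of the simplified rules described above (splitting on spaces and hyphens, breaking on a colon between letters but not between numbers). This is just an illustration under those stated assumptions, not Logos's actual implementation, which follows the full UAX #29 algorithm:

```python
import re

def simple_word_break(text):
    """Toy word-breaker: spaces and hyphens always break; a colon between
    letters breaks (e.g. a title like "re:subject"), but a colon between
    digits does not (e.g. the verse reference "3:16")."""
    words = []
    # Whitespace and hyphens are unconditional breaks.
    for token in re.split(r"[\s\-]+", text):
        if not token:
            continue
        # Split on a colon only when a letter sits on both sides of it.
        parts = re.split(r"(?<=[^\W\d_]):(?=[^\W\d_])", token)
        # Strip any leftover leading/trailing colons (e.g. "Note:").
        words.extend(p.strip(":") for p in parts if p.strip(":"))
    return words
```

For example, `simple_word_break("John 3:16 is well-known")` keeps "3:16" as one word but breaks "well-known" into two, matching the behavior described in the post.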
There is currently no CJKV support (e.g., bigram indexing); Asian languages typically require a dictionary for good word-breaking, and we haven't implemented that
RE: Japanese,
Is this for lack of a usable dictionary (Jim Breen's JDIC)? Is this on the future agenda?
Does Logos 4 search Hebrew by word-breaking? And is it only double-byte text that presents this problem in word-break searching?
I don't know what the roadmap is, but we worked closely with the Japan Bible Society to get the NIT etc. indexed correctly in LDLS3, so I expect a similar thing will happen in Logos 4.
Hebrew text is basically broken on spaces and punctuation (including maqqef and sof pasuq).
Double-byte text doesn't cause any problems in itself; it's the script being encoded that determines whether algorithmic word-breaking is feasible.
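The Hebrew rule above (break on spaces and punctuation, including maqqef and sof pasuq) can be sketched as a one-line splitter. This is an illustration of the described rule, not Logos's code; `HEBREW_BREAKS` and `split_hebrew` are names invented here:

```python
import re

# Maqqef (U+05BE) visually joins words, and sof pasuq (U+05C3) ends a
# verse; both act as word breaks alongside ordinary whitespace.
HEBREW_BREAKS = re.compile(r"[\s\u05BE\u05C3]+")

def split_hebrew(text):
    """Break Hebrew text on spaces, maqqef, and sof pasuq."""
    return [w for w in HEBREW_BREAKS.split(text) if w]
```

So a maqqef-joined pair like כָּל־הָאָרֶץ would index as two words, which is what you'd want for a WITHIN search.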
I am years behind on Unicode.
Can Logos not use the "Grapheme Cluster Boundaries" for searching? Must there be spaces? I thought Unicode Extended was supposed to solve all this.
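For what it's worth, grapheme cluster boundaries operate at the character level, not the word level, so they don't help with word search in space-less scripts. A small Python sketch (my own illustration, not anything from Logos) of the distinction:

```python
import unicodedata

# "e" + U+0301 COMBINING ACUTE ACCENT is two code points but one
# grapheme cluster -- the same user-perceived character as precomposed
# U+00E9. Grapheme boundaries tell you where *characters* begin and end;
# they say nothing about where *words* do. Word segmentation (UAX #29
# word boundaries) is a separate problem, and for scripts written
# without spaces it generally needs a dictionary.
decomposed = "e\u0301"
composed = "\u00E9"
assert unicodedata.normalize("NFC", decomposed) == composed
assert len(decomposed) == 2  # two code points, one grapheme cluster
```

That's why a dictionary (or bigram indexing) is still needed for Japanese even with full Unicode support.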
I was very excited to see the Greek/Japanese Interlinear in Logos. I hope someday to see it run in Version 4.