Logos 4.0d suggestion: Optimise phrase searching
Logos 4 is superfast for most searching, but very slow with phrase searching. Take the following example:
- "look out for the dogs, look out for the evildoers" which takes about 30s in my library. That's too long.
- but dogs NEAR look NEAR evildoers takes 0.55s - that's roughly one fifty-fifth of the time!
Now it seems to me that if you can run the latter search in such a short time, it surely must be possible to optimise the former one somehow.
This is my personal Faithlife account. On 1 March 2022, I started working for Faithlife, and have a new 'official' user account. Posts on this account shouldn't be taken as official Faithlife views!
Comments
-
I agree Mark. I typically do phrase searching so your suggestion resonates with me.
Sarcasm is my love language. Obviously I love you.
0 -
Try a phrase without "the" in it. :-)
Because of memory limitations, we store the index on the hard drive. When you search for a phrase with "the" in it, we need to load and check every instance of "the" in your entire library against the position of the adjacent words in your query.
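To make that cost concrete, here is a minimal Python sketch of phrase matching over a positional index. It is only an illustration and not the actual Logos index format; `build_index` and `phrase_search` are made-up names, and the toy index lives in memory rather than on disk. The point is that every word in the phrase, including "the", contributes a position list that has to be read and checked.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]} (a toy positional index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return {doc_id: [start positions]} where the words occur consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return {}
    hits = {}
    # Every occurrence of the first word is a candidate start...
    for doc_id, starts in index[words[0]].items():
        matches = [
            s for s in starts
            # ...and each later word must sit at exactly the right offset.
            if all(s + k in index[w].get(doc_id, ())
                   for k, w in enumerate(words[1:], 1))
        ]
        if matches:
            hits[doc_id] = matches
    return hits

docs = {
    "phil": "look out for the dogs look out for the evildoers",
    "other": "the dogs look for the cat",
}
index = build_index(docs)
print(phrase_search(index, "for the dogs"))
```

In a real library the position list for "the" would be by far the longest one touched, which is where the disk time goes.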
We spend a lot of time reading and studying the art of full-text information retrieval. There are solutions that make searching for "the" (and other "stop words") faster, but they all involve less precision. Over the years our users have objected to some of these time-saving techniques. (The classic one is to simply ignore the stop words.)
There are also probabilistic search techniques that are faster, but which sometimes return false hits or miss hits. We chose not to use these, too.
There are probably some small optimizations we can do, but the bottom line is that if you want precise phrase searches (which our users told us they do, back when we didn't have them!), there's simply no avoiding looking at all the hits. We already use compression and smart algorithms to try and reduce disk time in loading and analyzing those hits.
Google helps address this by keeping thousands of computers running 24x7 with (according to what I've read) the entire Internet in memory. The small-time equivalent for Logos users would be to use a solid-state hard drive, but that is still an expensive option.
(Our developers use an SSD for a small compiling/working drive, and a slower, traditional hard drive for data files.)
0 -
Why not do a two-pass solution? Ignore all the stop words for the first pass and then check the hits that resulted from the first pass against the actual search string? That would probably end up being faster than the current solution and wouldn't involve a loss of precision.
0 -
Rosie Perera said:
Why not do a two-pass solution?
YES !!!
Run the first scan on some uncommon word to make the first list. Then run the full "slow" but complete search only on the 1st hit list. The rule could be to run the first pass using the longest word or something simple.
(Same for expanding a book in the search: only look within that book. I.e., the code to support two-pass may well share much of what's already in place for searching collections, Passage Lists, Notes etc. Good reusable plumbing ...)
0 -
JimT said:
Run the first scan on some uncommon word to make the first list. Then run the full "slow" but complete search only on the 1st hit list. The rule could be to run the first pass using the longest word or something simple.
I would run the first pass on all the non-trivial words in the phrase ANDed together. That will be super fast and will weed down the number of hits significantly over just using one word. Then you can do the full "slow" phrase search within the resulting hit list. Should be able to bring a 30-second search down to just a few seconds at most, except for short phrases that have nothing but very common words in them.
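A rough Python sketch of that two-pass idea, under stated assumptions: the `STOP_WORDS` set is an illustrative list made up for this example, `two_pass_phrase_search` is a hypothetical name, and the documents are plain strings rather than a real index.

```python
STOP_WORDS = {"the", "for", "out", "a", "of", "and"}  # illustrative list only

def two_pass_phrase_search(docs, phrase):
    """Return (candidates, results): pass-1 survivors and exact phrase matches."""
    words = phrase.lower().split()
    rare = [w for w in words if w not in STOP_WORDS]

    # Pass 1: cheap AND over the non-trivial words only.
    word_sets = {d: set(text.lower().split()) for d, text in docs.items()}
    candidates = [d for d, ws in word_sets.items() if all(w in ws for w in rare)]

    # Pass 2: exact (slow) phrase check, but only on the survivors.
    n = len(words)
    results = []
    for d in candidates:
        tokens = docs[d].lower().split()
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            results.append(d)
    return candidates, results

docs = {
    "a": "look out for the dogs",
    "b": "the dogs look out",
    "c": "the cat sat",
}
print(two_pass_phrase_search(docs, "look out for the dogs"))
```

If the phrase contains nothing but stop words, pass 1 filters nothing and every document goes through the slow check, which is the exception noted above.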
0 -
The issue here is with complex AND, OR, NOT, NEAR etc rules.
Pass One would ALWAYS need to get the superset of all possible resources (or a smaller breakdown) that might ever be in the final answer.
THEN, pass two could be as complete, painful, and slow as it needs to be to get the exact results of whatever query the user requested, but it would only need to run against the first superset of possible hits. Even if the first pass only built a per-resource list, that might be enough for the needed improvements.
Already, we know that Logos4 does not have a solid syntax parser, given the strange and unexpected results people get if they feed it a "bad" query. If the parser was more strict, it could generate the control inputs needed for both passes, as it would "understand" what had been requested.
A few days ago, I searched something like [""the quick brown fox"] and it found "the" and "fox" etc all over the place. On account of the double-quote, it acted in an unexpected way. Maybe better if I got an error, beep, or some kind of feedback, in place of a bad search.
0 -
JimT said:
If the parser was more strict, it could generate the control inputs needed for both passes, as it would "understand" what had been requested.
I would like very much to be able to see the actual query generated in strict logical form. At the moment it is difficult to determine if my search failed because it was poorly formed or because I ran into a Logos bug. I know that this would not be useful to many Logos users but I think the payoff to Logos would be worth it - we could actually give accurate advice on the forums.[:P]
Okay, casual reader. Most of your queries are working fine. It is in the combination of operators in the query that we can force unusual results.
Orthodox Bishop Alfeyev: "To be a theologian means to have experience of a personal encounter with God through prayer and worship."; Orthodox proverb: "We know where the Church is, we do not know where it is not."
0 -
Rosie Perera said:
Why not do a two-pass solution?
A multi-pass solution is the obvious optimisation, though I appreciate it's difficult to code. It's easy for a human to know that dogs NEAR look NEAR evildoers would make a good first pass because dogs and evildoers are relatively uncommon (at least compared to look, out, for, the) - though I still have more than 11,500 dogs in my library!
But if your index kept a note of the frequency of words, then Logos itself could perform the optimisation for phrase searching - it could do a WITHIN x WORDS search for the least frequent words, and then just search the results.
With AND searches, the part of the query that should yield the fewest results could be identified and run first, and other parts of the query run on a subset of those results.
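As a sketch of how a frequency table could drive that choice (Python, for illustration only): apart from the 11,500 dogs mentioned above, every count below is invented, and `least_frequent` and `first_pass_query` are hypothetical helpers. The generated string just follows the WITHIN x WORDS wording used in this post.

```python
def least_frequent(word_counts, phrase, keep=2):
    """Return the `keep` rarest distinct words of the phrase, rarest first."""
    words = set(phrase.lower().split())
    # Words missing from the table count as 0 hits, i.e. rarest of all.
    return sorted(words, key=lambda w: (word_counts.get(w, 0), w))[:keep]

def first_pass_query(word_counts, phrase):
    """Build a proximity query from the two rarest words of the phrase."""
    span = len(phrase.split())
    rare = least_frequent(word_counts, phrase)
    if len(rare) < 2:
        return phrase  # nothing to optimise for a one-word phrase
    return f"{rare[0]} WITHIN {span} WORDS {rare[1]}"

# Made-up counts, except for the dogs.
counts = {
    "the": 2_000_000, "for": 900_000, "out": 300_000,
    "look": 80_000, "dogs": 11_500, "evildoers": 1_200,
}
phrase = "look out for the dogs look out for the evildoers"
print(first_pass_query(counts, phrase))
```

The full phrase check would then only need to run against the hits of that much cheaper query.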
Somehow other search engines manage to optimise for phrase-searching. It must be possible.
0 -
Mark Barnes said:
I still have more than 11,500 dogs in my library!
Goodness! Let them out. They're probably all peeing on the floor in there!
0 -
Rosie Perera said:
Goodness! Let them out. They're probably all peeing on the floor in there!
It's OK. They're well behaved [:D]
0 -
Bob Pritchett said:
Because of memory limitations, we store the index on the hard drive
I'm out of my depth in terms of thinking through how you'd code something like this to accommodate different computer specs, but could you change this behaviour depending on available RAM? My Bible index is ~750 MB, my systems have 6 GB and 8 GB of RAM, and it is rare that more than 50% is used except when video editing and encoding. I for one would be perfectly happy if Logos4 used up its potential (32-bit limited) 4 GB and cached my Bible index, Bibles and other frequently used books. [:)]
0 -
Hmmm... there may be other bottlenecks. I decided to test my own theory:
Setup one: normal Logos installation (on 2 x 750 GB RAID 0 array)
Setup two: created RAMdisk, placed the LibraryIndex on it and mounted it into the LibraryIndex folder.
Results for Mark's search:
1: 29s
2: 29s
Hmm... [:O]
0 -
Jon said:
Hmm...
I'm fairly sure that the index only includes pointers to other resources, and that for phrase searching Logos has to actually open the resource to check the result. If I'm right, then you'd have to copy all your resources into memory to see a speed increase. (Also, as a minimum you'd also want to add BibleIndex to your RAM Drive.)
Also, my LibraryIndex and BibleIndex together are more than 5 GB. Even if I could spare 6 GB (which I can't - it's all I have!), just copying that into RAM takes a long time. It's probably not worth it.
0 -
Mark Barnes said:
I'm fairly sure that the index only includes pointers to other resources,
other?
Mark Barnes said:
for phrase searching Logos has to actually open the resource to check the result.
After studying disk activity in Resource Monitor, I saw that all resources with the expected results are opened, but I didn't see others being opened!
But phrase search is interesting, as it will search footnotes and glosses for an exact match within the footnote or within the gloss of a single manuscript word, which is something you cannot do with a proximity search! For example, the BEFORE 1 word evildoers will find results from the glosses of adjacent manuscript words, whereas "the evildoers" will find the exact match from the gloss of a single manuscript word.
Dave
===Windows 11 & Android 13
0 -
Dave Hooton said:
other?
Sorry, I meant "I'm fairly sure that the index only includes pointers to the actual resources".
Regarding proximity searching, I'm pretty sure position is always measured in relation to the surface text, so your finding would make sense.
0 -
MJ. Smith said:
I would like very much to be able to see the actual query generated in strict logical form. At the moment it is difficult to determine if my search failed because it was poorly formed or because I ran into a Logos bug. I know that this would not be useful to many Logos users but I think the payoff to Logos would be worth it - we could actually give accurate advice on the forums.
Me too. And it would be nice to have a Wiki tutorial on how to construct strict logical searches. I learned a few new ideas at Morris Proctor's Camp Logos on how to construct different searches. Sometimes I get weird results because I didn't think through my query.
Have you ever noticed Logos.Bible.com appears to use fuzzy logic in returning search results? When I type a phrase there, my search phrase is almost always worded from the KJV or NASB vocabulary. Even though the ESV is the version searched, I get hits consistent with my preferred versions. I have not observed this behavior within my Logos program, only with the online Logos search. Is there version cross-referencing going on behind the scenes? If so, is it possible to duplicate a wider fuzzy search within the Logos program if desired? I'm guessing searching multiple Bibles is the closest I can get.
Logos 7 Collectors Edition
0 -
Matthew C Jones said:
MJ. Smith said:
I would like very much to be able to see the actual query generated in strict logical form. At the moment it is difficult to determine if my search failed because it was poorly formed or because I ran into a Logos bug. I know that this would not be useful to many Logos users but I think the payoff to Logos would be worth it - we could actually give accurate advice on the forums.
Me too.
I agree with this, and I vaguely remember someone from Logos saying that they were working towards it. Actually, my preference would be to educate the user by suggesting the correct form of the search. So, for example, if someone searched for (cat OR dog) NEAR (evil OR good), Logos would say Did you mean (cat, dog) NEAR (evil, good)?. Likewise for missing parentheses, lowercase operators like within, etc. Ideally, there would also be an explain button, so they could see why Logos wanted to make the correction.
That way it's only an extra click for users, but they learn how to do them properly.
0 -
Mark Barnes said:
Actually, my preference would be to educate the user by suggesting the correct form of the search. So, for example, if someone searched for (cat OR dog) NEAR (evil OR good), Logos would say Did you mean (cat, dog) NEAR (evil, good)?. Likewise for missing parentheses, lowercase operators like within, etc. Ideally, there would also be an explain button, so they could see why Logos wanted to make the correction.
That way it's only an extra click for users, but they learn how to do them properly.
As one uneducated user...I say yes to this suggestion...(I've had Dave and Mark try and explain it all to me more than once..!)
Robert Pavich
For help go to the Wiki: http://wiki.logos.com/Table_of_Contents__
0 -
Would it be possible for the search box to somehow indicate that a given word is one of these "stop" words? Maybe turn it red with a little floater that says you might get better/faster results if you don't include this word?
0 -
Rosie Perera said:
Why not do a two-pass solution? Ignore all the stop words for the first pass and then check the hits that resulted from the first pass against the actual search string? That would probably end up being faster than the current solution and wouldn't involve a loss of precision.
We don't open the resources to check the phrase. We read the hit list. The hit list is stored in one long list, with each book's hits one after another. Smart "skip" logic already avoids checking unnecessary resources. (Though there may be more optimization we can do, at the trade-off of taking more space.)
Checking the rarer words first, and then using their shorter document lists to speed up scanning of the more common words' hit lists, is a classic optimization, and I believe we've already implemented it. (This does what the two-pass would do, only more efficiently.)
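For readers who want to see the shape of that classic optimization, here is a minimal Python sketch, assuming each word's hit list is a sorted list of document ids. It is not Logos's implementation; `intersect_with_skips` is a hypothetical name, and a binary search stands in for whatever skip structure a real on-disk index would use.

```python
from bisect import bisect_left

def intersect_with_skips(rare, common):
    """Intersect two sorted doc-id lists, driven by the shorter (rarer) one."""
    out = []
    lo = 0
    for doc_id in rare:
        # Jump straight to the first entry >= doc_id instead of scanning
        # every entry of the long list.
        lo = bisect_left(common, doc_id, lo)
        if lo == len(common):
            break  # the long list is exhausted
        if common[lo] == doc_id:
            out.append(doc_id)
    return out

evildoers_docs = [4, 8, 77]            # short list for a rare word
the_docs = list(range(0, 100, 2))      # long list for a common word
print(intersect_with_skips(evildoers_docs, the_docs))
```

The work done is proportional to the short list (times a logarithmic search), not to the length of the common word's list.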
If you're really interested in this, I'd suggest the book "Managing Gigabytes," by Witten, Moffat, and Bell.
0 -
Bob Pritchett said:
Checking the rarer words first, and then using their shorter document lists to speed up scanning of the more common words' hit lists, is a classic optimization, and I believe we've already implemented it. (This does what the two-pass would do, only more efficiently.)
Thanks for that info Bob. I'm very impressed with the way a phrase search works and would trust that Search will be made easier by appropriate parsing and a suggested restructuring to avoid invalid queries.
0 -
Dave Hooton said:
Thanks for that info Bob. I'm very impressed with the way a phrase search works and would trust that Search will be made easier by appropriate parsing and a suggested restructuring to avoid invalid queries.
Logos 3 parsed your query on every keystroke and turned the search edit box border red whenever the query was in an invalid state.
User feedback, and our reading of academic research, revealed that only a small number (< 2%) of users EVER use a boolean operator, and that most people expect (and prefer) a search system like Google's.
So in Logos 4 we implemented a very forgiving parser, and removed the requirement of any syntax. If you do something that does parse, we run it through the "formal" system; otherwise (like Google), we treat it as a "bag of words".
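A toy Python sketch of that "formal if it parses, otherwise bag of words" behaviour. Everything here is an assumption made for illustration: the grammar (term, then operator and term, repeated), the operator list, and the names `parse_formal` and `interpret` are stand-ins, not the real Logos parser.

```python
import re

OPERATORS = {"AND", "OR", "NOT", "NEAR"}  # assumed subset, for illustration

class QuerySyntaxError(ValueError):
    pass

def parse_formal(query):
    """Tiny stand-in for a strict parser: term (OPERATOR term)*."""
    if query.count('"') % 2:
        raise QuerySyntaxError("unbalanced quote")
    tokens = re.findall(r'"[^"]*"|\S+', query)
    if len(tokens) % 2 == 0:
        raise QuerySyntaxError("expected: term OPERATOR term ...")
    for i, tok in enumerate(tokens):
        # Operators must sit at odd positions, terms at even positions.
        if (tok in OPERATORS) != (i % 2 == 1):
            raise QuerySyntaxError(f"unexpected token {tok!r}")
    return ("formal", tokens)

def interpret(query):
    """Use the formal parse when it succeeds, else fall back to bag of words."""
    try:
        return parse_formal(query)
    except QuerySyntaxError:
        return ("bag_of_words", re.findall(r"\w+", query.lower()))

print(interpret("dogs NEAR evildoers"))
print(interpret('""the quick brown fox"'))
```

The second call shows the failure mode described earlier in the thread: one stray quote silently turns an intended phrase search into a search for the individual words.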
The one place this fails is if you thought you were typing something formal, and just slightly messed it up. I believe we had an early spec that showed you how the system interpreted your query, right down to wildcard expansion, but it got messy, and sometimes the formal rendering of the parsing looked so different it could be confusing. We cut this to avoid duplication and because we were running out of time.
One option would be to show the formal rendering only when we formally parsed a tree; then, if you didn't get the rendering, you'd know we hadn't parsed what you meant. Another option would be to get "smarter", and detect more common errors, like Google does with likely spelling mistakes, typos, etc. The first parts of this are shown in our optional query expansions for topical words; internally we have early (but still not perfected) implementations of "Did you mean correct spelling?" up and running.
We'll be revisiting query parsing (and planning next generation search features) later this summer, and we'll put some more energy into making it even easier.
0 -
Bob Pritchett said:
Another option would be to get "smarter", and detect more common errors, like Google does with likely spelling mistakes, typos, etc. The first parts of this are shown in our optional query expansions for topical words; internally we have early (but still not perfected) implementations of "Did you mean correct spelling?" up and running.
I've argued for this elsewhere as the best solution. It should include errors like not putting WITHIN in capitals, missing parentheses, and invalid proximity searches like (a OR b) NEAR (c OR d), which should be corrected to (a, b) NEAR (c, d)
0 -
Bob Pritchett said:
... later this summer ...
When is summer?
It's early winter here. Please use month names or something for your international customers.
0 -
Mark Barnes said:
I've argued for this elsewhere as the best solution. It should include errors like not putting WITHIN in capitals, missing parentheses, and invalid proximity searches like (a OR b) NEAR (c OR d), which should be corrected to (a, b) NEAR (c, d)
I'm glad the idea is to do this as a suggestion: "Did you mean ___ ?" because there are legitimate times when you might spell a special search word in lower case and not mean it as a special search word. For example, whoever before heaven is a legitimate search in its own right, and it means something quite different from whoever BEFORE heaven. I wouldn't want it automatically corrected for me. When Google detects a possible misspelling, it goes ahead and does the search anyway, as you typed it, but gives you the opportunity to easily rerun the search with the corrected spelling if you choose. That's the way I'd hope Logos would do this.
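Something like the following Python sketch is what I have in mind: run the query exactly as typed and only attach a suggestion. The operator list and the names `suggest` and `search_with_hint` are assumptions for the sake of the example, not anything Logos exposes.

```python
# Operator words mentioned in this thread; the real list is an assumption.
OPERATOR_WORDS = {"and", "or", "not", "near", "before", "within"}

def suggest(query):
    """Return an alternative query to offer the user, or None."""
    tokens = query.split()
    fixed = [
        # Only uppercase an operator-like word that sits between two others.
        t.upper()
        if t.lower() in OPERATOR_WORDS and t != t.upper() and 0 < i < len(tokens) - 1
        else t
        for i, t in enumerate(tokens)
    ]
    suggestion = " ".join(fixed)
    return suggestion if suggestion != query else None

def search_with_hint(query):
    """Always run the query as typed; attach a hint if there is one."""
    return {"run": query, "did_you_mean": suggest(query)}

print(search_with_hint("whoever before heaven"))
print(search_with_hint("whoever BEFORE heaven"))
```

The query the user typed is never rewritten; the hint is just an extra click away.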
Why should (a OR b) NEAR (c OR d) be considered an invalid search? I know it doesn't work currently, while (a, b) NEAR (c, d) does. But that seems to me to be a bug. The user shouldn't be forced to use a comma when OR and comma are supposed to be synonyms.
0 -
Rosie Perera said:
Why should (a OR b) NEAR (c OR d) be considered an invalid search? I know it doesn't work currently, while (a, b) NEAR (c, d) does. But that seems to me to be a bug. The user shouldn't be forced to use a comma when OR and comma are supposed to be synonyms.
Bradley has explained the logical (OR) vs. physical (NEAR) incompatibility and indicated that he would allow that type of search to work as the user intended, so there should be no hinting "Did you mean ...".
0 -
Dave Hooton said:
Rosie Perera said:
Why should (a OR b) NEAR (c OR d) be considered an invalid search? I know it doesn't work currently, while (a, b) NEAR (c, d) does. But that seems to me to be a bug. The user shouldn't be forced to use a comma when OR and comma are supposed to be synonyms.
Bradley has explained the logical (OR) vs. physical (NEAR) incompatibility and indicated that he would allow that type of search to work as the user intended.
I don't get it. Either there's an incompatibility between OR and NEAR, or Bradley can make that type of search work as the user intended. I don't see how it can be both. I do remember a thread explaining why (a,b) NEAR c works while (a OR b) NEAR c doesn't, but I don't remember why and can't find the thread.
0 -
Rosie Perera said:
Why should (a OR b) NEAR (c OR d) be considered an invalid search? I know it doesn't work currently, while (a, b) NEAR (c, d) does. But that seems to me to be a bug. The user shouldn't be forced to use a comma when OR and comma are supposed to be synonyms.
They aren't supposed to be synonyms--they just seem like it. Bradley has made this point before. The comma list is one term logically, whereas the OR clause just generates a boolean value that has no meaning when used with a NEAR operator. (Does "true NEAR true" mean anything?)
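A small Python sketch of that distinction, as I understand it. This is my own illustration, not how Logos is written: `positions`, `near`, and `boolean_or` are made-up names, the index is a toy word-to-positions map for one resource, and the window of 5 words is arbitrary.

```python
def positions(index, term):
    """A term is a word or a tuple of alternatives; either way it yields positions."""
    alts = term if isinstance(term, tuple) else (term,)
    return sorted(p for w in alts for p in index.get(w, []))

def near(index, left, right, window=5):
    """NEAR compares positions: pairs of hits at most `window` words apart."""
    return [
        (a, b)
        for a in positions(index, left)
        for b in positions(index, right)
        if a != b and abs(a - b) <= window
    ]

def boolean_or(index, a, b):
    """A logical OR only answers yes/no, so no positions survive for NEAR."""
    return bool(index.get(a)) or bool(index.get(b))

index = {"cat": [2], "dog": [40], "evil": [4], "good": [90]}
print(near(index, ("cat", "dog"), ("evil", "good")))  # comma lists keep positions
print(boolean_or(index, "cat", "dog"))                # just True, nothing to be NEAR
```

The comma list behaves like one term with a merged position list, so NEAR has something to measure; the OR result is only True or False.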
See here for where he discusses this and says they have plans to make the parsing figure things out better:
http://community.logos.com/forums/p/4896/72043.aspx#72043
MacBook Pro (2019), ThinkPad E540
0 -
Mark Barnes said:
Bob Pritchett said:
Another option would be to get "smarter", and detect more common errors, like Google does with likely spelling mistakes, typos, etc. The first parts of this are shown in our optional query expansions for topical words; internally we have early (but still not perfected) implementations of "Did you mean correct spelling?" up and running.
I've argued for this elsewhere as the best solution. It should include errors like not putting WITHIN in capitals, missing parentheses, and invalid proximity searches like (a OR b) NEAR (c OR d), which should be corrected to (a, b) NEAR (c, d)
I agree with the "smarter" approach, which might also provide an option for possibly ambiguous queries like a OR b AND c OR d, e.g. "Did you mean (a OR b) AND (c OR d)?", otherwise run as written.
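To show why the grouping matters, here is the same expression in Python, where `and` binds tighter than `or`. Whether Logos uses the same precedence is an assumption I haven't checked; the point is only that the two readings can give different answers.

```python
def as_written(a, b, c, d):
    # Python reads this as: a or (b and c) or d
    return a or b and c or d

def regrouped(a, b, c, d):
    # The "Did you mean ...?" reading.
    return (a or b) and (c or d)

# A resource that contains only the word a:
print(as_written(True, False, False, False))
print(regrouped(True, False, False, False))
```

Under the first reading that resource is a hit; under the regrouped reading it is not.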
0 -
JimT said:
Bob Pritchett said:
... later this summer ...
When is summer?
It's early winter here. Please use month names or something for your international customers.
In the US Summer is June, July, and August.
Prov. 15:23
0