Logos 4.0d suggestion: Optimise phrase searching

This post has 65 Replies | 1 Follower

Posts 13379
Forum MVP
Mark Barnes | Forum Activity | Posted: Thu, Jun 3 2010 8:18 AM | Locked

Logos 4 is superfast for most searching, but very slow with phrase searching. Take the following example:

  • "look out for the dogs, look out for the evildoers" which takes about 30s in my library. That's too long.
  • but dogs NEAR look NEAR evildoers takes 0.55s - that's one sixtieth of the time!

Now it seems to me that if you can run the latter search in such a short time, it surely must be possible to optimise the former one somehow.

Posts 8602
TCBlack | Forum Activity | Replied: Thu, Jun 3 2010 4:26 PM | Locked

I agree Mark.  I typically do phrase searching so your suggestion resonates with me. 

Truth Is Still Truth Even if You Don't Believe It

Check the Wiki

Warning: Sarcasm is my love language. I may inadvertently express my love to you.

Posts 1692
LogosEmployee
Bob Pritchett | Forum Activity | Replied: Thu, Jun 3 2010 11:09 PM | Locked

Try a phrase without "the" in it. :-)

Because of memory limitations, we store the index on the hard drive. When you search for a phrase with "the" in it, we need to load and check every instance of "the" in your entire library against the position of the adjacent words in your query.
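To make the cost described above concrete, here is a minimal sketch of an exact phrase match over a positional index. It is a toy illustration, not Logos's actual index format: `build_index` and `phrase_search` are hypothetical names, and the index is held in memory rather than on disk. Note that the loop has to visit every occurrence of the first word, which is why a phrase starting with "the" is expensive.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy positional index: word -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted (doc_id, start_position) pairs where the phrase occurs."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    # Every occurrence of the first word is a candidate start position.
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Each following word must sit at the next consecutive position.
            if all(
                start + offset in index.get(word, {}).get(doc_id, ())
                for offset, word in enumerate(words[1:], 1)
            ):
                hits.append((doc_id, start))
    return sorted(hits)
```

With two toy resources, "look out for the dogs" and "the dogs look out", searching for "the dogs" checks all occurrences of "the" before it can report the two matches.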

We spend a lot of time reading and studying the art of full-text information retrieval. There are solutions that make searching for "the" (and other "stop words") faster, but they all involve less precision. Over the years our users have objected to some of these time-saving techniques. (The classic one is to simply ignore the stop words.)

There are also probabilistic search techniques that are faster, but which sometimes return false hits or miss hits. We chose not to use these, too.

There are probably some small optimizations we can do, but the bottom line is that if you want precise phrase searches (which our users told us they do, back when we didn't have them!), there's simply no avoiding looking at all the hits. We already use compression and smart algorithms to try and reduce disk time in loading and analyzing those hits.

Google helps address this by keeping thousands of computers running 24x7 with (according to what I've read) the entire Internet in memory. The small-time equivalent for Logos users would be to use a solid-state hard drive, but that is still an expensive option.

(Our developers use an SSD for a small compiling/working drive, and a slower, traditional hard drive for data files.)

Posts 18822
Rosie Perera | Forum Activity | Replied: Thu, Jun 3 2010 11:20 PM | Locked

Why not do a two-pass solution? Ignore all the stop words for the first pass and then check the hits that resulted from the first pass against the actual search string? That would probably end up being faster than the current solution and wouldn't involve a loss of precision.

Posts 1367
JimTowler | Forum Activity | Replied: Thu, Jun 3 2010 11:56 PM | Locked

Rosie Perera:
Why not do a two-pass solution?

YES !!!

Run the first scan on some uncommon word to make the first list. Then run the full "slow" but complete search only on the 1st hit list. The rule could be to run the first pass using the longest word or something simple.

(Same for expanding a book in the search: only look within that book. I.e. write code to support two-pass searching, and it may well share much of what's already in place for searching collections, Passage Lists, Notes etc. Good reusable plumbing ...)

Posts 18822
Rosie Perera | Forum Activity | Replied: Fri, Jun 4 2010 12:02 AM | Locked

JimT:

Run the first scan on some uncommon word to make the first list. Then run the full "slow" but complete search only on the 1st hit list. The rule could be to run the first pass using the longest word or something simple.

I would run the first pass on all the non-trivial words in the phrase ANDed together. That will be super fast and will weed down the number of hits significantly over just using one word. Then you can do the full "slow" phrase search within the resulting hit list. Should be able to bring a 30-second search down to just a few seconds at most, except for short phrases that have nothing but very common words in them.
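The two-pass idea can be sketched as follows. This is only an illustration under stated assumptions: the `STOP_WORDS` list, the whitespace tokenizer, and all function names are hypothetical, and the second pass simply rescans the text of each candidate rather than using a positional index.

```python
STOP_WORDS = {"the", "for", "a", "of", "and", "to", "in"}  # assumed list


def tokenize(text):
    """Lowercase and split on whitespace (punctuation handling omitted)."""
    return text.lower().split()


def build_doc_index(docs):
    """Map each word to the set of doc ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(tokenize(text)):
            index.setdefault(word, set()).add(doc_id)
    return index


def contains_phrase(tokens, words):
    """True if `words` appears as a consecutive run inside `tokens`."""
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def two_pass_phrase_search(docs, index, phrase):
    words = tokenize(phrase)
    if not words:
        return []
    # Pass 1: AND together the non-trivial words to get a candidate superset.
    rare = [w for w in words if w not in STOP_WORDS]
    if rare:
        candidates = set.intersection(*(index.get(w, set()) for w in rare))
    else:
        candidates = set(docs)  # nothing but stop words: no shortcut
    # Pass 2: run the exact (slow) phrase check on the candidates only.
    return sorted(d for d in candidates if contains_phrase(tokenize(docs[d]), words))
```

Because pass 1 only ever discards resources that lack one of the phrase's words, the candidate list is a superset of the true hits, so the exact check in pass 2 loses no precision. A phrase made up entirely of stop words falls back to checking everything.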

Posts 1367
JimTowler | Forum Activity | Replied: Fri, Jun 4 2010 12:15 AM | Locked

The issue here is with complex AND, OR, NOT, NEAR etc rules.

Pass One would need to ALWAYS get the super-set of all possible resources (or smaller breakdown) that might ever be in the final answer.

THEN, pass two could be as complete, painful, and slow as it needs to get the exact results of whatever query the user requested, but it would only need to run against the first superset of possible hits. Even if the first pass was to only build a per-resource list, that might be enough for the needed improvements.

Already, we know that Logos4 does not have a solid syntax parser, given the strange and unexpected results people get if they feed it a "bad" query. If the parser was more strict, it could generate the control inputs needed for both passes, as it would "understand" what had been requested.

A few days ago, I searched something like [""the quick brown fox"] and it found "the" and "fox" etc all over the place. On account of the double-quote, it acted in an unexpected way. Maybe better if I got an error, beep, or some kind of feedback, in place of a bad search.

Posts 27926
Forum MVP
MJ. Smith | Forum Activity | Replied: Fri, Jun 4 2010 12:32 AM | Locked

JimT:
If the parser was more strict, it could generate the control inputs needed for both passes, as it would "understand" what had been requested.

I would like very much to be able to see the actual query generated in strict logical form. At the moment it is difficult to determine if my search failed because it was poorly formed or because I ran into a Logos bug. I know that this would not be useful to many Logos users but I think the payoff to Logos would be worth it - we could actually give accurate advice on the forums.

Okay, casual reader. Most of your queries are working fine. It is in the combination of operators in the query that we can force unusual results.

Orthodox Bishop Hilarion Alfeyev: "To be a theologian means to have experience of a personal encounter with God through prayer and worship."

Posts 13379
Forum MVP
Mark Barnes | Forum Activity | Replied: Fri, Jun 4 2010 12:49 AM | Locked

Rosie Perera:
Why not do a two-pass solution?

A multi-pass solution is the obvious optimisation, though I appreciate it's difficult to code. It's easy for a human to know that dogs NEAR look NEAR evildoers would make a good first pass because both words are relatively uncommon (at least compared to look, out, for, the) - though I still have more than 11,500 dogs in my library!

But if your index kept a note of the frequency of words, then Logos itself could perform the optimisation for phrase searching - it could do a WITHIN x WORDS search for the least frequent words, and then just search the results.

With AND searches, the part of the query that should yield the least results could be identified and run first, and other parts of the query run on a subset of those results.
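A rough sketch of that frequency-based ordering, assuming the index can report how many resources contain each word (here simply the size of a set). The function names and the toy index are hypothetical, not anything Logos exposes.

```python
def order_by_frequency(index, words):
    """Return the query words sorted from rarest to most common."""
    return sorted(words, key=lambda w: len(index.get(w, set())))


def intersect_rarest_first(index, words):
    """AND the words together, starting from the smallest posting set."""
    ordered = order_by_frequency(index, words)
    if not ordered:
        return set()
    result = set(index.get(ordered[0], set()))
    for word in ordered[1:]:
        if not result:
            break  # nothing left to narrow down, so stop early
        result &= index.get(word, set())
    return result


# Toy index: word -> ids of the resources that contain it.
toy_index = {"the": {1, 2, 3, 4}, "dogs": {2, 4}, "evildoers": {4}}
```

Starting with the rarest word keeps every intermediate result small, so the common words are only ever checked against a short list of candidates.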

Somehow other search engines manage to optimise for phrase-searching. It must be possible.

Posts 18822
Rosie Perera | Forum Activity | Replied: Fri, Jun 4 2010 12:52 AM | Locked

Mark Barnes:

I still have more than 11,500 dogs in my library!

Goodness! Let them out. They're probably all peeing on the floor in there!

Posts 13379
Forum MVP
Mark Barnes | Forum Activity | Replied: Fri, Jun 4 2010 12:54 AM | Locked

Rosie Perera:
Goodness! Let them out. They're probably all peeing on the floor in there!

It's OK. They're well behaved.

Posts 687
Jon | Forum Activity | Replied: Fri, Jun 4 2010 11:23 PM | Locked

Bob Pritchett:
Because of memory limitations, we store the index on the hard drive

I'm out of my depth in terms of thinking through how you'd code something like this to accommodate different computer specs, but could you change this behaviour depending on available RAM? My bible index is ~750 MB, my systems have 6 GB and 8 GB of RAM, and it is rare that more than 50% is used except when video editing and encoding. I for one would be perfectly happy if Logos4 used up its potential (32-bit limited) 4 GB and cached my bible index, bibles and other frequently used books.

Posts 687
Jon | Forum Activity | Replied: Sat, Jun 5 2010 12:00 AM | Locked

Hmmm... there may be other bottlenecks, I decided to test my own theory:

Setup one: normal Logos installation (on 2 x 750 GB RAID 0 array)

Setup two: created RAMdisk, placed the LibraryIndex on it and mounted it into the LibraryIndex folder.

Results for Mark's search:

1: 29s

2: 29s

Hmm...

Posts 13379
Forum MVP
Mark Barnes | Forum Activity | Replied: Sat, Jun 5 2010 1:16 AM | Locked

Jon:

Hmm...

I'm fairly sure that the index only includes pointers to other resources, and that for phrase searching Logos has to actually open the resource to check the result. If I'm right, then you'd have to copy all your resources into memory to see a speed increase. (Also, as a minimum you'd also want to add BibleIndex to your RAM Drive.)

Also, my LibraryIndex and BibleIndex together are more than 5 GB. Even if I could spare 6 GB (which I can't - it's all I have!), just copying that into RAM takes a long time. It's probably not worth it.

 

Posts 25276
Forum MVP
Dave Hooton | Forum Activity | Replied: Sun, Jun 6 2010 5:39 AM | Locked

Mark Barnes:
I'm fairly sure that the index only includes pointers to other resources,

other?

Mark Barnes:
for phrase searching Logos has to actually open the resource to check the result.

After studying disk activity in Resource Monitor all resources with the expected results are opened but I didn't see others being opened!

But phrase search is interesting as it will search footnotes and glosses for an exact match within the footnote or within the gloss of a single manuscript word, which is something you cannot do with a proximity search! For example the BEFORE 1 word evildoers will find results from the glosses of adjacent manuscript words whereas "the evildoers" will find the exact match from the gloss of a single manuscript word.

Dave
===

Windows 10 & Android 8

Posts 13379
Forum MVP
Mark Barnes | Forum Activity | Replied: Sun, Jun 6 2010 7:42 AM | Locked

Dave Hooton:
other?

Sorry, I meant "I'm fairly sure that the index only includes pointers to the actual resources".

Regarding proximity searching, I'm pretty sure position is always measured in relation to the surface text, so your finding would make sense.

Posts 8967
RIP
Matthew C Jones | Forum Activity | Replied: Sun, Jun 6 2010 8:37 AM | Locked

MJ. Smith:
I would like very much to be able to see the actual query generated in strict logical form. At the moment it is difficult to determine if my search failed because it was poorly formed or because I ran into a Logos bug. I know that this would not be useful to many Logos users but I think the payoff to Logos would be worth it - we could actually give accurate advice on the forums.

Me too. And it would be nice to have a Wiki tutorial on how to construct strict logical searches. I learned a few new ideas at Morris Proctor's Camp Logos on how to construct different searches. Sometimes I get weird results because I didn't think through my query.  

Have you ever noticed Logos.Bible.com appears to use fuzzy logic in returning search results? When I type a phrase there, my search phrase is almost always worded from the KJV or NASB vocabulary. Even though the ESV is the version searched, I get hits consistent with my preferred versions. I have not observed this behavior within my Logos program, only with the online Logos search. Is there version cross-referencing going on behind the scenes? If so, is it possible to duplicate a wider fuzzy search within the Logos program if desired? I'm guessing searching multiple Bibles is the closest I can get.

Logos 7 Collectors Edition

Posts 13379
Forum MVP
Mark Barnes | Forum Activity | Replied: Sun, Jun 6 2010 8:48 AM | Locked

Matthew C Jones:

MJ. Smith:
I would like very much to be able to see the actual query generated in strict logical form. At the moment it is difficult to determine if my search failed because it was poorly formed or because I ran into a Logos bug. I know that this would not be useful to many Logos users but I think the payoff to Logos would be worth it - we could actually give accurate advice on the forums.

Me too.

I agree with this, and I vaguely remember someone from Logos saying that they were working towards it. Actually, my preference would be to educate the user by suggesting the correct form of the search. So, for example, if someone searched for (cat OR dog) NEAR (evil OR good), Logos would say Did you mean (cat, dog) NEAR (evil OR good)?. Likewise for missing parentheses, lowercase operators like within, etc. Ideally, there would also be an explain button, so they could see why Logos wanted to make the correction.

That way it's only an extra click for users, but they learn how to do them properly.
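As a rough illustration of the "Did you mean" idea for lowercase operators, here is a minimal sketch. The operator list is an assumption drawn only from operators mentioned in this thread, `suggest_query` is a hypothetical name, and nothing here reflects how Logos actually parses queries. Text inside double quotes is left alone, since it is a phrase rather than an operator.

```python
import re

# Assumed operator names, taken from those mentioned in this thread.
OPERATORS = {"and", "or", "not", "near", "within", "before"}


def suggest_query(query):
    """Return a 'Did you mean ...?' string, or None if nothing would change."""

    def fix_word(match):
        word = match.group(0)
        return word.upper() if word.lower() in OPERATORS else word

    # Split out quoted phrases so their contents are never rewritten.
    parts = re.split(r'("[^"]*")', query)
    fixed = [
        part if part.startswith('"') else re.sub(r"[A-Za-z]+", fix_word, part)
        for part in parts
    ]
    suggestion = "".join(fixed)
    return None if suggestion == query else f"Did you mean {suggestion}?"
```

For example, `suggest_query("cat near dog")` proposes `cat NEAR dog`, while a query that already uses uppercase operators produces no suggestion. A real version would also need to decide when a lowercase "and" or "or" is meant as an ordinary search term.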

Posts 4508
Robert Pavich | Forum Activity | Replied: Sun, Jun 6 2010 10:41 AM | Locked

Mark Barnes:

Actually, my preference would be to educate the user by suggesting the correct form of the search. So, for example, if someone searched for (cat OR dog) NEAR (evil OR good), Logos would say Did you mean (cat, dog) NEAR (evil OR good)?. Likewise for missing parentheses, lowercase operators like within, etc. Ideally, there would also be an explain button, so they could see why Logos wanted to make the correction.

That way it's only an extra click for users, but they learn how to do them properly.

 

As one uneducated user...I say yes to this suggestion...(I've had Dave and Mark try and explain it all to me more than once..!)

 

Robert Pavich

For help go to the Wiki: http://wiki.logos.com/Table_of_Contents__

Posts 3810
spitzerpl | Forum Activity | Replied: Sun, Jun 6 2010 12:35 PM | Locked

Would it be possible for the search box to somehow indicate that a given word is one of these "stop" words? Maybe turn it red with a little floater that says you might get better/faster results if you don't include this word?
