Logos 4.0d suggestion: Optimise phrase searching

This post has 65 Replies | 1 Follower

Posts 1692
LogosEmployee
Bob Pritchett | Forum Activity | Replied: Sun, Jun 6 2010 2:53 PM | Locked

Rosie Perera:
Why not do a two-pass solution? Ignore all the stop words for the first pass and then check the hits that resulted from the first pass against the actual search string? That would probably end up being faster than the current solution and wouldn't involve a loss of precision.

We don't open the resources to check the phrase. We read the hit list. The hit list is stored as one long list, with each book's hits one after another. Smart "skip" logic already avoids checking unnecessary resources. (There may be more optimization we can do, at the cost of taking more space.)

Checking the rarer words first, and then using their shorter document lists to speed up scanning of the more common words' hit lists, is a classic optimization, and I believe we've already implemented it. (This does what the two-pass would do, only more efficiently.)
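
A minimal sketch of the rarest-word-first idea described above, not Logos's actual implementation. The in-memory `INDEX` layout (word to document to sorted positions) and the names `phrase_search` and `_contains` are assumptions made for illustration only. The rarest word picks the candidate documents and candidate start positions, so the long hit lists of common words are only probed at a few places.

```python
from bisect import bisect_left

# Hypothetical positional index: word -> {doc_id: sorted word positions}.
INDEX = {
    "in": {1: [0, 7, 12], 2: [3], 3: [1, 5]},
    "the": {1: [1, 8], 2: [0, 4], 3: [2]},
    "beginning": {1: [2], 3: [9]},
}


def _contains(sorted_positions, target):
    """Binary search for target in a sorted position list."""
    j = bisect_left(sorted_positions, target)
    return j < len(sorted_positions) and sorted_positions[j] == target


def phrase_search(index, phrase):
    """Return {doc_id: [start positions]} for an exact phrase."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return {}
    # Visit the phrase's words from rarest to most common (by document count).
    order = sorted(range(len(words)), key=lambda i: len(index[words[i]]))
    rare = order[0]
    # Only documents containing the rarest word can match at all.
    candidates = set(index[words[rare]])
    for i in order[1:]:
        candidates.intersection_update(index[words[i]])
    hits = {}
    for doc in sorted(candidates):
        starts = []
        for pos in index[words[rare]][doc]:
            start = pos - rare  # where the phrase would have to begin
            if start < 0:
                continue
            # Probe the commoner words only at the positions the phrase needs.
            if all(_contains(index[words[i]][doc], start + i) for i in order[1:]):
                starts.append(start)
        if starts:
            hits[doc] = starts
    return hits


result = phrase_search(INDEX, "in the beginning")
```

Here only the two documents containing "beginning" are examined, and each needs a single probe into the lists for "in" and "the".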

If you're really interested in this, I'd suggest the book "Managing Gigabytes," by Witten, Moffat, and Bell.

Posts 25282
Forum MVP
Dave Hooton | Forum Activity | Replied: Sun, Jun 6 2010 5:31 PM | Locked

Bob Pritchett:

Checking the rarer words first, and then using their shorter document lists to speed up scanning of the more common words' hit lists, is a classic optimization, and I believe we've already implemented it. (This does what the two-pass would do, only more efficiently.)

Thanks for that info Bob. I'm very impressed with the way a phrase search works and would trust that Search will be made easier by appropriate parsing and a suggested restructuring to avoid invalid queries.

Dave
===

Windows 10 & Android 8

Posts 1692
LogosEmployee
Bob Pritchett | Forum Activity | Replied: Sun, Jun 6 2010 10:40 PM | Locked

Dave Hooton:
Thanks for that info Bob. I'm very impressed with the way a phrase search works and would trust that Search will be made easier by appropriate parsing and a suggested restructuring to avoid invalid queries.

Logos 3 parsed your query on every keystroke and turned the search edit box border red whenever the query was in an invalid state.

User feedback, and our reading of academic research, revealed that only a small number (< 2%) of users EVER use a boolean operator, and that most people expect (and prefer) a search system like Google's. 

So in Logos 4 we implemented a very forgiving parser, and removed the requirement of any syntax. If you do something that does parse, we run it through the "formal" system; otherwise (like Google), we treat it as a "bag of words".
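
The fallback behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Logos 4 parser: the tiny grammar (words, AND, OR, parentheses) and the names `tokenize`, `parse_formal` and `interpret` are all hypothetical. A query that parses strictly is returned as a tree; anything else is treated as a bag of words.

```python
import re

OPERATORS = {"AND", "OR"}  # assumed operator set for this sketch


def tokenize(query):
    """Split a query into parentheses and runs of other non-space characters."""
    return re.findall(r"\(|\)|[^\s()]+", query)


def parse_formal(tokens):
    """Strict parser: expr := term (OP term)* ; term := WORD | '(' expr ')'."""
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok == ")" or tok in OPERATORS:
            raise ValueError(f"unexpected token {tok!r}")
        return tok

    def expr():
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            op = tokens[pos]
            pos += 1
            node = (op, node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree


def interpret(query):
    """Use the formal parse if it succeeds, otherwise fall back to a bag of words."""
    tokens = tokenize(query)
    try:
        return ("formal", parse_formal(tokens))
    except ValueError:
        words = [t.lower() for t in tokens
                 if t not in {"(", ")"} and t not in OPERATORS]
        return ("bag", words)


formal = interpret("(faith OR hope) AND love")
fallback = interpret("(faith OR hope AND")
```

The second query is the "slightly messed up" case: the user meant something formal, but the sketch silently degrades it to the bag `["faith", "hope"]`, which is why showing the interpretation back to the user is attractive.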

The one place this fails is if you thought you were typing something formal, and just slightly messed it up. I believe we had an early spec that showed you how the system interpreted your query, right down to wildcard expansion, but it got messy, and sometimes the formal rendering of the parsing looked so different it could be confusing. We cut this to avoid duplication and because we were running out of time.

One option would be to show the formal rendering only when we formally parsed a tree; then, if you didn't get the rendering, you'd know we hadn't parsed what you meant. Another option would be to get "smarter", and detect more common errors, like Google does with likely spelling mistakes, typos, etc. The first parts of this are shown in our optional query expansions for topical words; internally we have early (but still not perfected) implementations of "Did you mean correct spelling?" up and running.

We'll be revisiting query parsing (and planning next generation search features) later this summer, and we'll put some more energy into making it even easier.

Posts 13379
Forum MVP
Mark Barnes | Forum Activity | Replied: Sun, Jun 6 2010 11:54 PM | Locked

Bob Pritchett:
Another option would be to get "smarter", and detect more common errors, like Google does with likely spelling mistakes, typos, etc. The first parts of this are shown in our optional query expansions for topical words; internally we have early (but still not perfected) implementations of "Did you mean correct spelling?" up and running.

I've argued for this elsewhere as the best solution. It should include errors like not putting WITHIN in capitals, missing parentheses, and invalid proximity searches like (a OR b) NEAR (c OR d), which should be corrected to (a, b) NEAR (c, d).

Posts 1367
JimTowler | Forum Activity | Replied: Mon, Jun 7 2010 12:03 AM | Locked

Bob Pritchett:
... later this summer ...

When is summer?

It's early winter here. Please use month names or something for your international customers.

Posts 18826
Rosie Perera | Forum Activity | Replied: Mon, Jun 7 2010 12:11 AM | Locked

Mark Barnes:

I've argued for this elsewhere as the best solution. It should include errors like not putting WITHIN in capitals, missing parentheses, and invalid proximity searches like (a OR b) NEAR (c OR d), which should be corrected to (a, b) NEAR (c, d).

I'm glad the idea is to do this as a suggestion ("Did you mean ___ ?"), because there are legitimate times when you might spell a special search word in lower case and not mean it as a special search word. For example, whoever before heaven is a legitimate search in its own right, and it means something quite different from whoever BEFORE heaven. I wouldn't want it automatically corrected for me. When Google detects a possible misspelling, it goes ahead and does the search anyway, as you typed it, but gives you the opportunity to easily rerun the search with the corrected spelling if you choose. That's the way I'd hope Logos would do this.
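
A small sketch of the "suggest, don't auto-correct" behaviour described above. The operator list and the function name `did_you_mean` are assumptions for illustration; nothing here reflects how Logos actually implements suggestions. The query is left exactly as typed, and a separate suggestion is offered only when a lower-case word matches an operator name.

```python
# Assumed set of operator words, taken from those mentioned in this thread.
OPERATOR_WORDS = {"and", "or", "near", "before", "within"}


def did_you_mean(query):
    """Return a suggested rewrite with operators capitalised, or None."""
    tokens = query.split()
    corrected = [t.upper() if t in OPERATOR_WORDS else t for t in tokens]
    if corrected == tokens:
        return None  # nothing looks like a mistyped operator
    return " ".join(corrected)


typed = "whoever before heaven"
suggestion = did_you_mean(typed)  # offered to the user; `typed` still runs as written
```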

Why should (a OR b) NEAR (c OR d) be considered an invalid search? I know it doesn't work currently, while (a, b) NEAR (c, d) does. But that seems to me to be a bug. The user shouldn't be forced to use a comma when OR and comma are supposed to be synonyms.

Posts 25282
Forum MVP
Dave Hooton | Forum Activity | Replied: Mon, Jun 7 2010 1:36 AM | Locked

Rosie Perera:
Why should (a OR b) NEAR (c OR d) be considered an invalid search? I know it doesn't work currently, while (a, b) NEAR (c, d) does. But that seems to me to be a bug. The user shouldn't be forced to use a comma when OR and comma are supposed to be synonyms.

Bradley has explained the logical (OR) vs. physical (NEAR) incompatibility and indicated that he would allow that type of search to work as the user intended, so there should be no hinting "Did you mean ...".

Dave
===

Windows 10 & Android 8

Posts 18826
Rosie Perera | Forum Activity | Replied: Mon, Jun 7 2010 1:43 AM | Locked

Dave Hooton:

Rosie Perera:
Why should (a OR b) NEAR (c OR d) be considered an invalid search? I know it doesn't work currently, while (a, b) NEAR (c, d) does. But that seems to me to be a bug. The user shouldn't be forced to use a comma when OR and comma are supposed to be synonyms.

Bradley has explained the logical (OR) vs. physical (NEAR) incompatibility and indicated that he would allow that type of search to work as the user intended.

 

I don't get it. Either there's an incompatibility between OR and NEAR, or Bradley can make that type of search work as the user intended. I don't see how it can be both. I do remember a thread explaining why (a,b) NEAR c works while (a OR b) NEAR c doesn't, but I don't remember why and can't find the thread.

Posts 5615
Todd Phillips | Forum Activity | Replied: Mon, Jun 7 2010 1:47 AM | Locked

Rosie Perera:
Why should (a OR b) NEAR (c OR d) be considered an invalid search? I know it doesn't work currently, while (a, b) NEAR (c, d) does. But that seems to me to be a bug. The user shouldn't be forced to use a comma when OR and comma are supposed to be synonyms.


They aren't supposed to be synonyms--they just seem like it. Bradley has made this point before. The comma list is one term logically, whereas the OR clause just generates a boolean value that has no meaning when used with a NEAR operator. (Does "true NEAR true" mean anything?)

See here for where he discusses this and says they have plans to make the parsing figure things out better:
http://community.logos.com/forums/p/4896/72043.aspx#72043

Wiki Links: Enabling Logging / Detailed Search Help - MacBook Pro (2014), ThinkPad E570

Posts 25282
Forum MVP
Dave Hooton | Forum Activity | Replied: Mon, Jun 7 2010 1:54 AM | Locked

Mark Barnes:

Bob Pritchett:
Another option would be to get "smarter", and detect more common errors, like Google does with likely spelling mistakes, typos, etc. The first parts of this are shown in our optional query expansions for topical words; internally we have early (but still not perfected) implementations of "Did you mean correct spelling?" up and running.

I've argued for this elsewhere as the best solution. It should include errors like not putting WITHIN in capitals, missing parentheses, and invalid proximity searches like (a OR b) NEAR (c OR d), which should be corrected to (a, b) NEAR (c, d).

I agree with the "smarter" approach, which might also offer a suggestion for possibly ambiguous queries like a OR b AND c OR d, e.g. "Did you mean (a OR b) AND (c OR d)?", and otherwise run the query as written.

Dave
===

Windows 10 & Android 8

Posts 5337
Kevin Becker | Forum Activity | Replied: Mon, Jun 7 2010 5:13 AM | Locked

JimT:

Bob Pritchett:
... later this summer ...

When is summer?

It's early winter here. Please use month names or something for your international customers.

In the US, summer is June, July, and August.

Posts 25282
Forum MVP
Dave Hooton | Forum Activity | Replied: Mon, Jun 7 2010 6:53 AM | Locked

Rosie Perera:
I don't get it. Either there's an incompatibility between OR and NEAR, or Bradley can make that type of search work as the user intended. I don't see how it can be both.

Sorry, Rosie - that was a poor "explanation".

This might have been better: "Bradley has explained the incompatibility between the output of a Boolean expression (OR, AND) and a proximity term (BEFORE, NEAR), but it is still possible to make that type of search work as the user intended." However, Todd has provided a more complete explanation.

Dave
===

Windows 10 & Android 8

Posts 8967
RIP
Matthew C Jones | Forum Activity | Replied: Mon, Jun 7 2010 7:06 AM | Locked

Kevin Becker:
In the US, summer is June, July, and August.

 

Not really. It is June 21st through September 20th. Astronomically based, of course.

And since God told us the heavenly lights were for signs, I'd defer to Genesis over my local school board's schedule.

Logos 7 Collectors Edition

Posts 5337
Kevin Becker | Forum Activity | Replied: Mon, Jun 7 2010 9:30 AM | Locked

Matthew C Jones:

Not really. It is June 21st through September 20th. Astronomically based, of course.

And since God told us the heavenly lights were for signs, I'd defer to Genesis over my local school board's schedule.

I knew someone would go and look up the specifics!

Posts 8967
RIP
Matthew C Jones | Forum Activity | Replied: Mon, Jun 7 2010 9:37 AM | Locked

Kevin Becker:
I knew someone would go and look up the specifics!

I didn't need to look it up.

My father bought me a telescope for my 12th birthday and took me to the top of Mt Zao to show me the rings of Saturn. It sparked a love of astronomy in my young heart. The telescope is long gone but the memory is treasured forever.

Logos 7 Collectors Edition

Posts 1367
JimTowler | Forum Activity | Replied: Mon, Jun 7 2010 9:38 AM | Locked

Thanks for the answers re summer.

I never get why mags come out titled "Fall 2010 Issue" or something. I never know when that is or when to expect the next one.

It's winter here, with a massive storm outside, very wet and windy, and it's 4:38 am Tuesday. See my problem?

Posts 13379
Forum MVP
Mark Barnes | Forum Activity | Replied: Mon, Jun 7 2010 10:56 AM | Locked

Rosie Perera:
I don't get it. Either there's an incompatibility between OR and NEAR, or Bradley can make that type of search work as the user intended. I don't see how it can be both. I do remember a thread explaining why (a,b) NEAR c works while (a OR b) NEAR c doesn't, but I don't remember why and can't find the thread.

OR and comma are not the same, though they give the same results in some situations. This is from the Wiki:

Using lists

Lists are a very useful feature which provide shortcuts in a number of searches. A list is written like this: (term1, term2, term3, etc.). When Logos encounters a list, it performs the search using just term1. Then it repeats the search using just term2, then with just term3, etc. Once it has finished, it then ORs the results. Here are some examples:

  • (Jesus, Christ) is equivalent to Jesus OR Christ
  • (Jesus, Christ) AND love is equivalent to (Jesus AND love) OR (Christ AND love)

Lists are most useful when used with fields (see below), or when trying to ensure proximity operators are only used in the outer terms of your search. For example:

  • The search described earlier (master NEAR love) OR (master NEAR serve) OR (neighbor NEAR love) OR (neighbor NEAR serve) can be simplified to (master, neighbor) NEAR (love, serve). Logos treats the two lists separately, iterating through them until every combination has been covered, like this:
    • master NEAR love
    • master NEAR serve
    • neighbor NEAR love
    • neighbor NEAR serve

Please note: Some people get confused as they equate the list with the OR command. They are not the same, even though in a very basic search they will perform in the same way. Remember, Logos iterates through lists, then ORs the results.
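
The "iterate through the lists, then OR the results" behaviour described in the Wiki excerpt can be sketched like this. The function names and the `run_query` callback are hypothetical stand-ins, not Logos APIs; the sketch only illustrates the expansion order and the final union.

```python
from itertools import product


def expand_lists(left, operator, right):
    """Spell out every pairing of the two lists as a separate simple query."""
    return [f"{a} {operator} {b}" for a, b in product(left, right)]


def search_with_lists(left, operator, right, run_query):
    """Run each simple query on its own, then OR (union) the result sets.

    `run_query` is a hypothetical callback taking (term, operator, term)
    and returning a set of matching article identifiers.
    """
    results = set()
    for a, b in product(left, right):
        results |= run_query(a, operator, b)
    return results


queries = expand_lists(["master", "neighbor"], "NEAR", ["love", "serve"])
```

`queries` holds the four simple searches from the Wiki example, in the same order. Each one is a plain proximity search between two terms, which is why the list form works with NEAR.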

Posts 1228
Ron | Forum Activity | Replied: Mon, Jun 7 2010 11:37 AM | Locked

Mark Barnes:
OR and comma are not the same, though they give the same results in some situations. This is from the Wiki:

So the short version is that 99% of the time we should be using commas instead of ORs in our searches?

Can anyone explain to me a case where using OR would be correct and appropriate?

Posts 13379
Forum MVP
Mark Barnes | Forum Activity | Replied: Mon, Jun 7 2010 12:33 PM | Locked

I can't think of any situation where you mean OR, but where a comma wouldn't produce the result you intend - so the 'safest' course of action would be to always use commas. Commas will always give you the results you expect. (But that doesn't mean you should be using commas. OR is not incorrect most of the time.)

The reason why it's wrong some of the time is that OR is a boolean logical operator. In logic, a OR b produces either TRUE or FALSE. It can have no other result. In Logos, searching for Jesus OR Christ will return TRUE if the article contains at least one of those words, FALSE if the article contains neither of those words. After a search, Logos lists all the articles that are TRUE. This is quite satisfactory and quite correct. It's exactly what we want.

But what if you type (Jesus OR Christ) NEAR (love OR compassion)? The left-hand side is evaluated, and let's imagine it returns the result TRUE. Then the right-hand side is evaluated. Let's imagine that returns TRUE as well. So we're left with TRUE NEAR TRUE. Logically, that makes no sense, hence Logos returns no results for that type of search.
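
To make the TRUE NEAR TRUE problem concrete, here is a minimal sketch under assumptions: the sample article, the function names, and the 5-word window are all invented for illustration and say nothing about how Logos defines NEAR. The point is only that a boolean OR throws away the word positions a proximity test needs, while a list of terms keeps them.

```python
def boolean_or(article_words, *terms):
    """Logical OR: keeps only True/False for the whole article."""
    return any(t in article_words for t in terms)


def positions(article_words, *terms):
    """List-style lookup: keeps the positions where any of the terms occurs."""
    return sorted(i for i, w in enumerate(article_words) if w in terms)


def near(left_positions, right_positions, window):
    """True if some left hit lies within `window` words of some right hit."""
    return any(abs(l - r) <= window
               for l in left_positions for r in right_positions)


article = "jesus wept and showed great compassion to the crowd".split()

left_bool = boolean_or(article, "jesus", "christ")      # just a bool
right_bool = boolean_or(article, "love", "compassion")  # just a bool
# near(left_bool, right_bool, 5) cannot work: a bool carries no positions.

left_hits = positions(article, "jesus", "christ")
right_hits = positions(article, "love", "compassion")
found = near(left_hits, right_hits, window=5)
```

With positions available, the proximity test has something to measure. With two booleans it has nothing, which matches the "no results" behaviour described above.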

 

 

Posts 13379
Forum MVP
Mark Barnes | Forum Activity | Replied: Mon, Jun 7 2010 12:34 PM | Locked

JimT:
It's winter here, with a massive storm outside, very wet and windy, and it's 4:38 am Tuesday. See my problem?

Sounds to me like your problem is insomnia!
