Bug: 2-word string search - included 2 words across punctuation

LimJK · November 2016

Hi,

To illustrate the problem, consider the string search "看见", which I think should not include hits like "看,见" where "看" and "见" each belong to a different word or phrase. Let's take Dan 7:9 as an example:

我观看，
见有宝座设立

Therefore, returning Dan 7:9 as a hit based on "看,见" is not acceptable in the Chinese Language.

Please consider filtering out hits that have punctuation inside a Chinese string, unless I specify punctuation in my search string.

Lawrence Rafferty · November 2016

This is working as currently intended, although we are discussing ways to improve this kind of search in response to your post. Thank you for sharing your feedback.

For now, you can search for 看 BEFORE 1 CHAR 见 to not include the places where they are separated by punctuation, although this has the unfortunate side effect of treating each character as a separate term.

LimJK said:
Please consider filtering out hits that have punctuation inside a Chinese string, unless I specify punctuation in my search string.

Punctuation is not indexed, so there is no way to search for it. Punctuation's influence is to increase the CHAR (not WORD) separation, so although we are now treating every character as a word, the search 看 BEFORE 1 CHAR 见 is different from 看 BEFORE 1 WORD 见 when there is intervening punctuation.
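To make the CHAR versus WORD distinction concrete, here is a minimal sketch in Python. It is not the Logos implementation; it only assumes what is described above: punctuation produces no term, but it still takes up a character position. The function names `index_positions` and `before` are hypothetical, and treating whitespace like punctuation is also an assumption.

```python
import unicodedata

def index_positions(text):
    """Return (term, char_offset, word_offset) for each indexed character.

    Assumption: punctuation and whitespace are not indexed, but they still
    advance the character offset. Only indexed characters advance the word
    offset, because every character is treated as a word.
    """
    positions = []
    word_offset = 0
    for char_offset, ch in enumerate(text):
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            continue  # no term is stored, but char_offset has moved on
        positions.append((ch, char_offset, word_offset))
        word_offset += 1
    return positions

def before(positions, first, second, max_gap, unit):
    """True if `first` precedes `second` by 1..max_gap units (CHAR or WORD)."""
    col = 1 if unit == "CHAR" else 2
    firsts = [p[col] for p in positions if p[0] == first]
    seconds = [p[col] for p in positions if p[0] == second]
    return any(0 < s - f <= max_gap for f in firsts for s in seconds)

verse = index_positions("我观看，见有宝座设立")
print(before(verse, "看", "见", 1, "WORD"))  # comma does not count as a word
print(before(verse, "看", "见", 1, "CHAR"))  # comma does count as a character
```

In this model the Dan 7:9 text satisfies the WORD form but not the CHAR form, which is why the CHAR query filters out the hit across the comma.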

PL · November 2016

https://community.logos.com/discussion/comment/865044#Comment_865044

I agree with LimJK. This seems wrong. When I use quotation marks, I do not expect the string to be separated by punctuation marks in the search results.

Just like when you do an English search for "Jesus Christ" (with quotation marks), you would not expect "...Jesus, Christ..." or "...Jesus. Christ..." to be returned as hits.

Lawrence, I think this can be easily solvable by excluding the very small finite set of Chinese punctuation marks (perhaps a dozen of them) from your bigram indexing. I could be wrong.

I do, however, agree that this is not a high priority issue (not a showstopper because it is returning more hits rather than missing hits).

Thanks,

Peter

MJ. Smith · November 2016

LimJK said:
Please consider filtering out hits that have punctuation inside a Chinese string, unless I specify punctuation in my search string.

FYI Chinese is behaving in the same way as all other languages. Note I chose a fake example to show the behaviour.

Bradley Grainger (Logos) · November 2016

https://community.logos.com/discussion/comment/865050#Comment_865050

PL said:
Just like when you do an English search for "Jesus Christ" (with quotation marks), you would not expect "...Jesus, Christ..." or "...Jesus. Christ..." to be returned as hits.

This is not actually true for English (but we're discussing internally how to improve Chinese search so that it doesn't have to follow the English rules):

Lawrence Rafferty · November 2016

https://community.logos.com/discussion/comment/865050#Comment_865050

PL said:

I agree with LimJK. This seems wrong. When I use quotation marks, I do not expect the string to be separated by punctuation marks in the search results.

Thanks for your feedback. We will take your expectations into consideration as we determine how to improve this kind of search.

PL said:

Just like when you do an English search for "Jesus Christ" (with quotation marks), you would not expect "...Jesus, Christ..." or "...Jesus. Christ..." to be returned as hits.

While expectations may vary, that is the behavior (as MJ points out). That is why I said it is currently working as intended. It matches the behavior of all phrase searches. But don't worry, we are looking into changing it for Chinese to better match user expectations.

PL said:

Lawrence, I think this can be easily solvable by excluding the very small finite set of Chinese punctuation marks (perhaps a dozen of them) from your bigram indexing. I could be wrong.

We are not indexing punctuation at all, either individually or in bigrams. Furthermore, we do not index bigrams across punctuation, e.g. in AB,C we index the terms A AB B C. Indexing bigrams allows us to determine when two unigrams are actually adjacent, which we leverage to improve rankings for longer matching terms. We will also make use of them to improve phrase search.
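As an illustration of the AB,C example, the following Python sketch produces unigrams plus bigrams that never span punctuation. It is only a sketch under the assumptions stated in this post, not the actual indexer; `is_break` and `index_terms` are hypothetical names, and treating whitespace as a break is an assumption.

```python
import unicodedata

def is_break(ch):
    """Assumption: whitespace and any Unicode punctuation break a bigram."""
    return ch.isspace() or unicodedata.category(ch).startswith("P")

def index_terms(text):
    """Emit each character as a unigram, followed by the bigram it starts,
    but only when the very next character is not punctuation."""
    terms = []
    for i, ch in enumerate(text):
        if is_break(ch):
            continue  # punctuation itself is never indexed
        terms.append(ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt and not is_break(nxt):
            terms.append(ch + nxt)
    return terms

print(index_terms("AB,C"))     # ['A', 'AB', 'B', 'C']
print(index_terms("观看，见"))  # ['观', '观看', '看', '见']
```

Because no 看见 bigram exists for 观看，见, the presence of a bigram is enough to tell truly adjacent characters from characters separated by punctuation.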

PL · November 2016

https://community.logos.com/discussion/comment/865068#Comment_865068

Thanks for the quick responses, all. I should have checked the English search behavior first before posting. Perhaps maintaining consistency within the software for all languages may be an important consideration. (For example, if one tries to conduct a bilingual search including English and Chinese in a bilingual book, having different rules for the two languages may give rise to confusion.)

Personally I can live with the current behavior. Like I said, as long as it's not missing search hits (like it did before), then it's okay to me.

Thanks,

Peter

LimJK · November 2016

https://community.logos.com/discussion/comment/865064#Comment_865064

Bradley Grainger (Faithlife) said:

... (but we're discussing internally how to improve Chinese search so that it doesn't have to follow the English rules):

Bradley, Thank you very much for the considerations [Y]

Lawrence Rafferty said:

For now, you can search for 看 BEFORE 1 CHAR 见 to not include the places where they are separated by punctuation, although this has the unfortunate side effect of treating each character as a separate term.

Lawrence, Thanks .... I learn something new today; I did not know that we can search 看 BEFORE 1 CHAR 见

MJ. Smith said:

FYI Chinese is behaving in the same way as all other languages. Note I chose a fake example to show the behaviour.

MJ, I am not a language person ... maybe an example of the effect I was trying to articulate might help.

Say we search for a word like Butterfly, can you imagine Logos returning search results of both "Butter" and "Fly" or "Butter, Fly". So think of 看见 like the Butterfly example[:)]

PL said:

I agree with LimJK. This seems wrong.

PL, Thanks for supporting the case

MJ. Smith · November 2016

https://community.logos.com/discussion/comment/865130#Comment_865130

LimJK said:
Say we search for a word like Butterfly, can you imagine Logos returning search results of both "Butter" and "Fly" or "Butter, Fly". So think of 看见 like the Butterfly example

I did understand you - I (used to) read a bit of Chinese - old Confucian and Buddhist texts. But a similar problem exists in English where British and American English disagree over the use of a hyphen - aardvark / aard-vark. Compound lexemes in English may be bound together (grownup), hyphenated (grown-up) or unmarked (New York) ... languages such as German and Sanskrit that have heavy use of compounds provide yet another similar problem.

Lawrence Rafferty · November 2016

https://community.logos.com/discussion/comment/865064#Comment_865064

Bradley Grainger (Faithlife) said:
This is not actually true for English (but we're discussing internally how to improve Chinese search so that it doesn't have to follow the English rules)

We are thinking that it would be possible to change the phrase search for Chinese so that "看见" only matches where the two characters are adjacent, with no intervening punctuation. If you include a space in the phrase search, as in "看 见", then punctuation will be allowed (but not required) to appear between the characters where the space appears.

This has both an advantage and a drawback: by default you will see fewer results. Users will have to know to include spaces if they want to see phrase results with intervening punctuation.

Could you please confirm that this change is preferable to the current behavior? Thanks again for your feedback.
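To help evaluate the proposal, here is a rough Python sketch of the matching rule as described: characters in the quoted phrase must be adjacent, and a space in the phrase (written here as 看 见) allows, but does not require, punctuation at that point. This is an illustration of the proposed behavior only, not how the search engine is built; `phrase_matches` and `is_break` are hypothetical names.

```python
import unicodedata

def is_break(ch):
    """Assumption: whitespace and any Unicode punctuation count as a break."""
    return ch.isspace() or unicodedata.category(ch).startswith("P")

def phrase_matches(text, phrase):
    """Return the start offsets in `text` where the quoted `phrase` matches."""
    phrase = phrase.strip()
    if not phrase:
        return []
    hits = []
    for start in range(len(text)):
        pos = start
        matched = True
        for q in phrase:
            if q == " ":
                # A space in the query permits (but does not require) breaks.
                while pos < len(text) and is_break(text[pos]):
                    pos += 1
            elif pos < len(text) and text[pos] == q:
                pos += 1
            else:
                matched = False
                break
        if matched:
            hits.append(start)
    return hits

verse = "我观看，见有宝座设立"
print(phrase_matches(verse, "看见"))   # no hit: the comma intervenes
print(phrase_matches(verse, "看 见"))  # hit: the space allows the comma
```

Under this rule the spaced query returns a superset of the unspaced query's hits, so nothing is lost for users who know to type the space.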

LimJK · November 2016

https://community.logos.com/discussion/comment/865608#Comment_865608

Lawrence Rafferty said:

We are thinking that it would be possible to change the phrase search for Chinese so that "看见" only matches where the two characters are adjacent, with no intervening punctuation. If you include a space in the phrase search, as in "看 见", then punctuation will be allowed (but not required) to appear between the characters where the space appears.

This has both an advantage and a drawback: by default you will see fewer results. Users will have to know to include spaces if they want to see phrase results with intervening punctuation.

Could you please confirm that this change is preferable to the current behavior? Thanks again for your feedback.

Lawrence,

Once again, appreciate the consideration for the Chinese Search challenge[:)] I am not sure if I understand the implications to the results of the question you posted ... Let me see if I make sense as I cannot visualize possible negative effects:

1. I assume that searching 看见 (without quotes) will still search the various permutations of 看见, 看 and 见. That will give us more results and mitigate the drawback you mentioned, that we will see fewer results by default with the change you are contemplating for Chinese string search.
2. Can you help me visualize which search results will be included if I search the string "看,见" or "看 见"? I am assuming that the hits will only include strings with 看 as the 1st Char and 见 as the 3rd Char, with a valid punctuation mark as the 2nd Char. Maybe we should expand from 2 to 3 or more Word/Char so that we can flush out other scenarios that we have not considered, e.g. the string “观看，见”, longer strings, etc.
3. In string searches, I would expect the results to contain the same string in the same Word/Char order. I understand that punctuation is not indexed for some good reasons. I think it is fine for the punctuation not to be searched; however, please display the hit with the punctuation mark in the text, as in a normal string search in English (e.g. the example used by MJ).
4. Yes, I would confirm that this change is preferable in the above-mentioned context. However, I am not a search or language specialist, so I feel unqualified to confirm. It would be great if others more knowledgeable could also help to confirm.

Sorry for the long winded way of answering your question[:)]

Lawrence Rafferty · November 2016

https://community.logos.com/discussion/comment/865665#Comment_865665

LimJK said:

1. I assume that searching 看见 (without quotes) will still search the various permutations of 看见, 看 and 见. That will give us more results and mitigate the drawback you mentioned, that we will see fewer results by default with the change you are contemplating for Chinese string search.

2. Can you help me visualize which search results will be included if I search the string "看,见" or "看 见"? I am assuming that the hits will only include strings with 看 as the 1st Char and 见 as the 3rd Char, with a valid punctuation mark as the 2nd Char. Maybe we should expand from 2 to 3 or more Word/Char so that we can flush out other scenarios that we have not considered, e.g. the string “观看，见”, longer strings, etc.

3. In string searches, I would expect the results to contain the same string in the same Word/Char order. I understand that punctuation is not indexed for some good reasons. I think it is fine for the punctuation not to be searched; however, please display the hit with the punctuation mark in the text, as in a normal string search in English (e.g. the example used by MJ).

4. Yes, I would confirm that this change is preferable in the above-mentioned context. However, I am not a search or language specialist, so I feel unqualified to confirm. It would be great if others more knowledgeable could also help to confirm.

1. You are right. The proposed change would not affect searches without the quotation marks. The drawback would only apply if someone were expecting the behavior to more closely match other languages. I think this is mitigated by the fact that in other languages you also inherently have to type spaces between words.
2. "看,见" is not a valid search string, or at least not what you want. You can't search for punctuation, so you shouldn't include it in your queries, especially considering that the comma is a search operator in other contexts.
   With our proposed change, "看 见" will return hits for 看见 as well as 看，见.
   With our proposed change, "观看 见" would match 观看，见 but "观看见" would not.
3. I think you are referring to MJ's "murder thou" English example, where the results include murder, Thou. The results in Chinese would work the same way. If we match text that contains punctuation, the punctuation will still appear in the result even though it wasn't in the query.
4. We will wait for further confirmation. Thank you again for taking the time to evaluate this.

LimJK · November 2016

https://community.logos.com/discussion/comment/865975#Comment_865975

Lawrence Rafferty said:

1. You are right. The proposed change would not affect searches without the quotation marks. The drawback would only apply if someone were expecting the behavior to more closely match other languages. I think this is mitigated by the fact that in other languages you also inherently have to type spaces between words.

2. "看,见" is not a valid search string, or at least not what you want. You can't search for punctuation, so you shouldn't include it in your queries, especially considering that the comma is a search operator in other contexts.
   With our proposed change, "看 见" will return hits for 看见 as well as 看，见.
   With our proposed change, "观看 见" would match 观看，见 but "观看见" would not.

3. I think you are referring to MJ's "murder thou" English example, where the results include murder, Thou. The results in Chinese would work the same way. If we match text that contains punctuation, the punctuation will still appear in the result even though it wasn't in the query.

4. We will wait for further confirmation. Thank you again for taking the time to evaluate this.

Lawrence,

Thanks for the replies on items (1) and (3) , ... I have a few points for your consideration for item (2):

Searching "看,见" has to be a valid search, simply because the most common way to compose a search is to mark a typical string with the mouse pointer while reading the CUV text and then search for it. It is therefore very likely that punctuation will be included in the search string this way (that is how it works in English now). I think what you meant is that when the search string is "看,见", Logos will instead conduct the search based on "看见". Specifically, I hope Logos is not going to return an "Invalid Search String" message in such cases. Eg. of such a search.

I have some concerns ... with your examples of 2 Char/Word and 3 Char/Word, let me enumerate the various cases as follows:

1. "看 见" should NOT return hits for 看见, because, as I illustrated in the earlier posts, 看 and 见 are each part of their respective compound word "观看" or phrase "见有一块非人手凿出来的石头" from Daniel 2:34.
2. "看 见" should return hits for "看?见", where ? can be a space or any of the valid punctuation marks found in the CUV text.
3. "观看 见" similarly should return hits for "观看?见", as above.
4. "看见" should return hits for "看见".
5. "观看见" should return hits for "观看见" (this is not a good example).
6. 看见 without quotes should return the various permutations of 看 and 见 and 看见, as it does now in Logos 7.2 RC3.

Thank you for listening [:)] much appreciated.

Lawrence Rafferty · November 2016

https://community.logos.com/discussion/comment/866001#Comment_866001

LimJK said:
Searching "看,见" has to be a valid search, simply because the most common way to compose a search is to mark a typical string with the mouse pointer while reading the CUV text and then search for it. It is therefore very likely that punctuation will be included in the search string this way (that is how it works in English now). I think what you meant is that when the search string is "看,见", Logos will instead conduct the search based on "看见". Specifically, I hope Logos is not going to return an "Invalid Search String" message in such cases.

You are right. It is not invalid in that sense. Searching for a phrase from selected resource text is common and will remain supported as best as possible. I only meant to emphasize that a user should usually not type in punctuation in a query unless it is as a search operator.

LimJK said:
"看 见" should NOT return hits for 看见, because, as I illustrated in the earlier posts, 看 and 见 are each part of their respective compound word "观看" or phrase "见有一块非人手凿出来的石头" from Daniel 2:34.

I don't know if this is possible. If it is, I think it will be complicated to implement. Given the building blocks of the search index, we have terms that occur at character and word offsets. For the terms 看 and 见 to match in the way you describe we'd have to find all the places where they occur together and then ensure that 看 precedes 见 by only one word, but that they are separated by at least one character.
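For discussion, the condition described above (one word apart, but more than one character apart) can be sketched in Python as follows. This only models the idea using the character and word offsets mentioned here; it says nothing about how costly it would be inside the real index, and `positions` and `separated_hits` are hypothetical names.

```python
import unicodedata

def positions(text):
    """Assumption: punctuation and whitespace get a character offset but no
    word offset, because they are not indexed as terms."""
    out, word = [], 0
    for char, ch in enumerate(text):
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            continue
        out.append((ch, char, word))
        word += 1
    return out

def separated_hits(text, first, second):
    """Character offsets where `first` is followed by `second` exactly one
    word later, with at least one unindexed character in between."""
    pos = positions(text)
    hits = []
    # Consecutive entries in `pos` are always exactly one word apart.
    for (a, a_char, _), (b, b_char, _) in zip(pos, pos[1:]):
        if a == first and b == second and b_char - a_char > 1:
            hits.append(a_char)
    return hits

print(separated_hits("我观看，见有宝座设立", "看", "见"))  # matches across the comma
print(separated_hits("我看见了", "看", "见"))              # adjacent, so no match
```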

I would appreciate more feedback and usage examples about this idea. It seems like it could be inconvenient for other queries, where A and B could occur as AB or A，B in the text, and you are interested in seeing the results together (i.e. the normal behavior of a phrase search.)
