Bible search is unreliable
I have sometimes felt that Bible search in the CUV (both the Shen and Shangti editions) is wrong, and I have now found a good example that shows it. The results seem totally wrong and useless. Can anyone have a look and fix it? Thank you.
1. Search for "瞎了眼": I get 4 results.
2. Search for "瞎了": I get another 4 results.
3. Search for "瞎": I get only 2 results, even though every verse matching the two longer queries also contains "瞎".
Comments
-
I forgot to mention my version. It's Logos 6.12 SR-2.
0 -
We have a case in our bug tracker for fixing problems searching Chinese text. I've added a link to this thread to the case.
0 -
Bradley,
Adding another case for the search reliability issue:
I am trying to search for all occurrences of "国" (kingdom); for illustration I am limiting the search to Daniel chapter 2. The occurrences I circled in red are the ones the search missed.
By the way, I noticed that when I select the single character "国" in the CUV text, a neighboring character always gets selected with it, e.g. "国度", "一国", etc. I mention this because I think it may be related to the search issue (just a guess).
JK
MacBookPro Retina 15" Late 2013 2.6GHz RAM:16GB SSD:500GB macOS Sierra 10.12.3 | iPhone 7 Plus iOS 10.2.1
0 -
We will be releasing some improvements for Chinese search in the next version of the program (7.2, currently in beta). See the release notes here and the screenshot below: https://wiki.logos.com/Logos_7.2_Beta_3
0 -
[Y][Y]
0 -
Bradley,
Thanks for the heads up on beta ... look forward to that.
While I am at it, I noticed that text in the "notes" of the CUV Bible is not being searched the way it is in the English Bibles. An example: I am searching for "定命" ("command" in English) and get no hits.
Thanks!
JK
0 -
LimJK said:
While I am at it, I noticed that text in the "notes" of the CUV Bible is not being searched the way it is in the English Bibles.
Unless I'm misunderstanding, that behaviour is what I would expect, and matches English Bibles. A "Bible" search will not find results in translators' notes, but they can be found with a "Basic" search.
0 -
Oops ... sorry,
I meant Basic search. As shown below, I searched Bible Text, Footnote Text and Translator's Note.
- I have "Bible Text" selected just to show that the search is a valid search.
- With "Footnote" and "Translator's Note" turned on, the search yields no result; specifically, it misses the "定命" in the footnote at Dan 2:5.
JK
0 -
I think this should also be fixed with the new search code.
0 -
I suppose I should test it after 7.2 is released. Just to complete the various permutations, so that your tester(s) can test these ... thanks
- First test: "定" was not picked up in the notes of Dan 2:5.
- Second test: "命" picks up Dan 2:5; however, "定命" is missed. It sounds like 7.2 should resolve this.
- Third test: "定" AND "命" is missed.
- Fourth test: "定" WITHIN 2 WORDS "命" is missed.
JK
0 -
Bradley,
Thanks
JK
0 -
Hi Bradley,
Let me know if I should move this to the Beta forum.
A search query consisting of consecutive Han characters without spaces is treated as an AND search for all of the characters, regardless of order, with the added consideration that any consecutive pairs will be merged into a single search hit. E.g. if A, B, and C are Han characters, then the search query ABC will return hits in articles or verses that contain A, B, AND C in any order, including consecutively.
To restrict the result to just ABC you could specify the query as a phrase search “ABC”, although for ranked search results you should use ABC INTERSECTS “ABC”. This causes the search to use the better rankings of the search results for ABC, while limiting the results to the same set as the phrase search.
This new search behavior in Beta is extremely confusing for Chinese users. It swings the pendulum from missing a lot of hits to returning too many hits. E.g. A search for 和平 ("peace") also returns verses with 和平安祭 ("and peace offering"). These results are confusing and misleading.
I think as a start, for Chinese searches you should at least do the same as Korean - default the search for ABC as ABC INTERSECTS "ABC". A search for ABC XYZ should be interpreted as (ABC INTERSECTS "ABC") AND (XYZ INTERSECTS "XYZ").
In addition, for the indexer to blindly index any two consecutive characters is a flawed approach, for two reasons:
1. Many Chinese terms (especially transliterated names) contain 3, 4, 5 characters.
2. Two consecutive characters may not necessarily imply they are related as one unit of meaning (as seen in the above 和平安祭 example).
You almost need to have a dictionary of multi-character terms (耶和華,和平,平安祭) for the indexer to know which consecutive characters form legitimate units of meaning, and which do not, in the absence of spaces in CJK languages.
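As a rough illustration of what a term dictionary does and does not settle, here is a minimal Python sketch of greedy longest-match segmentation over the three terms mentioned above. The function name `segment` and the tiny dictionary are hypothetical, made up for this example; this is not how Logos or ICU segments text. It shows that a dictionary supplies candidate terms, but the segmenter still has to choose between overlapping candidates such as 和平 and 平安祭.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (illustrative only).

    At each position, take the longest dictionary term that starts there;
    fall back to a single character when no term matches.
    """
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionary built from the terms named in this post.
terms = {"耶和華", "和平", "平安祭"}

# Greedy matching commits to 和平 first, so 平安祭 is never recognized.
print(segment("和平安祭", terms))  # ['和平', '安', '祭']
```

A real segmenter would need some way to prefer 和 + 平安祭 here, for example by scoring alternative segmentations, which is beyond this sketch.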
Many Chinese input methods (輸入法) provide such a "dictionary" (詞典) of multi-character phrases, but I'm not sure if they can be leveraged for this purpose (technically and legally). I know Sougou Input Method 搜狗輸入法 has many user-created dictionaries 細胞詞典 and many users have created Christianity and Bible-related ones for other users to use.
My few cents. Thanks for trying to solve this hard problem with CJK languages. Prior to this I have avoided using Logos for Chinese Bible searches all these years because I know the search results are completely unreliable. Now with the Chinese Bronze version pending, this problem comes to the forefront.
Thanks,
Peter
0 -
PL said:
This new search behavior in Beta is extremely confusing for Chinese users. It swings the pendulum from missing a lot of hits to returning too many hits. E.g. A search for 和平 ("peace") also returns verses with 和平安祭 ("and peace offering"). These results are confusing and misleading.
I think as a start, for Chinese searches you should at least do the same as Korean - default the search for ABC as ABC INTERSECTS "ABC". A search for ABC XYZ should be interpreted as (ABC INTERSECTS "ABC") AND (XYZ INTERSECTS "XYZ").
Thank you for this feedback. If we change the search behavior to match Korean, you will also be forced to put spaces in the search query where you want to explicitly allow the terms to be treated separately. I take your feedback to mean that you would find this preferable: that you would rather be forced to always put spaces between words in your search query, for the benefit of not also getting unhelpful hits for every character in the search query. If I am mistaken please let me know.
PL said:
In addition, for the indexer to blindly index any two consecutive characters is a flawed approach, for two reasons:
1. Many Chinese terms (especially transliterated names) contain 3, 4, 5 characters.
2. Two consecutive characters may not necessarily imply they are related as one unit of meaning (as seen in the above 和平安祭 example).
I think you misunderstand the reason for indexing every two consecutive characters. We index every *overlapping* two consecutive characters so we can merge the hits together into longer hits. That means we can find any one or two character word because it is in the search index, and we can find any longer word by merging the hits for the overlapping bigrams.
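A minimal Python sketch of this idea, assuming a plain in-memory index keyed by unigrams and overlapping bigrams. The function names and the sample string are illustrative assumptions, not taken from the Logos code. A longer term is located by requiring each successive bigram to start exactly one character after the previous one.

```python
from collections import defaultdict


def build_index(text):
    """Map every unigram and every overlapping bigram to its start offsets."""
    index = defaultdict(list)
    for i, ch in enumerate(text):
        index[ch].append(i)
        if i + 1 < len(text):
            index[text[i:i + 2]].append(i)
    return index


def find_term(index, term):
    """Return start offsets of `term` by chaining overlapping bigram hits."""
    if len(term) <= 2:
        # One- and two-character terms are looked up directly.
        return list(index.get(term, []))
    bigrams = [term[k:k + 2] for k in range(len(term) - 1)]
    starts = set(index.get(bigrams[0], []))
    for k, bigram in enumerate(bigrams[1:], start=1):
        positions = set(index.get(bigram, []))
        # Keep a candidate only if the k-th bigram begins k characters later.
        starts = {s for s in starts if s + k in positions}
    return sorted(starts)


# Illustrative sample string (not actual verse text).
index = build_index("献和平安祭给耶和华")
print(find_term(index, "和"))      # [1, 7]
print(find_term(index, "和平"))    # [1]
print(find_term(index, "平安祭"))  # [2]
print(find_term(index, "耶和华"))  # [6]
```

No word boundaries are decided at index time in this sketch; any substring of any length can be found from the same two kinds of entries.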
The prior indexer was using ICU's word breaker, which uses a dictionary, which has its own set of problems.
PL said:
You almost need to have a dictionary of multi-character terms (耶和華,和平,平安祭) for the indexer to know which consecutive characters form legitimate units of meaning, and which do not, in the absence of spaces in CJK languages.
Many Chinese input methods (輸入法) provide such a "dictionary" (詞典) of multi-character phrases, but I'm not sure if they can be leveraged for this purpose (technically and legally). I know Sougou Input Method 搜狗輸入法 has many user-created dictionaries 細胞詞典 and many users have created Christianity and Bible-related ones for other users to use.
Our new approach is based on the scholarly research and software practices for CJK information retrieval. The options are basically to index unigrams, bigrams, or dictionary words, or some combination thereof, and the query parser must be coded to match the indexing method.
Our approach of indexing unigrams and overlapping bigrams solves the problem of unknowable word breaks by abandoning any attempt to predetermine them. Rather, we make sure we can find any string of characters the user wishes to find, in any combination. The main drawback to this approach is the loss of meaningful word-based proximity searches, but that seems worth sacrificing for actually being able to find every occurrence of a given search term.
I'd like to thank you again for your thoughtful feedback and assure you we will investigate changing the parsing to be more like Korean as you suggested.
0 -
Hi Lawrence,
Thanks for your prompt response and for your working for CJK users. I deeply appreciate Logos paying attention and investing time and resources to this longstanding search reliability issue.
Before you make the change, I think we should hear the preference from other users first.
I spent some time playing with the Beta again, and it actually is more usable than I first thought:
- In most cases I can just do "ABC" instead of ABC INTERSECTS "ABC" if I don't care about ranking.
- I can easily solve the problem of 和平 finding 和平安祭 by searching for “和平” -平安祭 (the search engine even recognizes the curly quotation marks that my Chinese input method uses by default, which is very cool).
- Searching for 耶穌福音 (Jesus Gospel) finds verses with 耶穌 and/or 福音 in whichever order, which is fine. (With the proposed change, users will have to search for 耶穌 福音, which I'm not sure everyone is OK with. To find the exact string in that exact order, users will have to use quotation marks.)
- Searching for 和平 without quotation marks will find MANY extraneous verses with 和 (and) and 平 (flat) in whichever order, which is confusing and seems wrong, but this can be easily solved by using quotation marks, and it only happens with bigrams or trigrams where each of the characters is also a common word that can be used in totally different contexts. Such phrases may be rarer than I originally thought. I also tried 主的恩 (Lord's grace), which has the same issue, but again putting quotation marks around it solves it.
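The three query styles in the list above can be sketched in a few lines of Python. This is only a model of the behavior described in this thread, not the Logos query engine: the function names are hypothetical, and the "verses" are made-up toy strings rather than actual Bible text.

```python
def and_search(verses, query):
    """Unquoted query: every character must occur somewhere, in any order."""
    return [ref for ref, text in verses.items()
            if all(ch in text for ch in query)]


def phrase_search(verses, phrase, exclude=()):
    """Quoted query: characters must be consecutive; optionally drop verses
    containing any excluded string (the "-term" part of the query)."""
    return [ref for ref, text in verses.items()
            if phrase in text and not any(x in text for x in exclude)]


# Toy data for illustration only (not real verse text).
verses = {
    "toy-1": "他们和平相处",
    "toy-2": "献和平安祭",
    "toy-3": "平原和山地",
}

print(and_search(verses, "和平"))                         # ['toy-1', 'toy-2', 'toy-3']
print(phrase_search(verses, "和平"))                      # ['toy-1', 'toy-2']
print(phrase_search(verses, "和平", exclude=("平安祭",)))  # ['toy-1']
```

The unquoted query picks up toy-3, where 和 and 平 are unrelated; the quoted query still picks up toy-2 because 和平 occurs inside 和平安祭; only the quoted query with the exclusion isolates toy-1.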
I also tried similar searches in other Chinese Bible search apps and engines (e.g. Bible Gateway) and compared the results. It looks like 和平 trips up almost all of them.
Other CJK users, please chime in?
Thanks again!
Peter
0 -
PL said:
- Searching for 耶穌福音 (Jesus Gospel) finds verses with 耶穌 and/or 福音 in whichever order, which is fine. (With the proposed change, users will have to search for 耶穌 福音, which I'm not sure everyone is OK with. To find the exact string in that exact order, users will have to use quotation marks.)
Hi, Peter,
Lawrence and I have worked on this for some time, and we think the Korean style of requiring a space, as in 耶穌 福音, might NOT be OK with other CJK users, since there is usually no space between Chinese characters when typing or writing. Therefore, Lawrence and I decided to use quotation marks for searching for an exact Chinese phrase. We plan to tell users in our instructions to use quotation marks to get exact-phrase search results. However, we welcome suggestions and input from other Chinese users on your proposal. As Lawrence said, we will take all suggestions into consideration in order to solve this Chinese search issue for the majority of Chinese users.
I very much appreciate your time and effort in helping us improve the Logos Bible Software user experience. We strive to give our Chinese users the best Bible software in the world. Thank you once again!
Best regards,
Philip
0 -
Thank you Philip and Lawrence (and Logos!) Either implementation will be acceptable to me. By using quotation marks, your search results are already better than the other apps/websites I've tested.
I'm eagerly looking forward to the Bronze package for a full Chinese Bible software experience!
Thank YOU for all of your diligence, professionalism, and service to the global Church!
Peter
0 -
PL said:
Thank you Philip and Lawrence (and Logos!) Either implementation will be acceptable to me. By using quotation marks, your search results are already better than the other apps/websites I've tested.
I'm eagerly looking forward to the Bronze package for a full Chinese Bible software experience!
Thank YOU for all of your diligence, professionalism, and service to the global Church!
Peter
Hi, Peter,
You are welcome. Your encouragement and loyal support of Logos make our jobs more meaningful. Thanks again for your contributions.
Best regards,
Philip
0 -
Philip,
I have not participated in the beta for a long time ... after hearing about 7.2 from Bradley above, I finally installed 7.2 RC1.
Chinese search is finally working now [:)]
See my input in the following beta forum post about a two-word search being treated as a single string when I initiate the search from the CUV text.
http://community.logos.com/forums/p/132973/864140.aspx#864140
JK
0 -
Hi,
Can someone with a better command of the Chinese language chip in on this, so that Logos can consider fixing it before releasing 7.2? Chinese is my second language [:)]
JK
0 -
I just did. Thanks for cross-referencing that thread here. I missed that thread.
Thanks,
Peter
0 -
Lawrence + Philip,
http://community.logos.com/forums/t/133070.aspx
I posted in the beta forum over 10 days ago on the same subject. It is probably lost amid all the new 7.3 beta discussion, so I thought I should post here to ask whether Lawrence saw my reply to his questions about the preferred search results for multi-word/character queries across punctuation.
Thanks.
JK
0 -
I just replied there. Sorry for the delayed response [:$]
0 -
Lawrence Rafferty said:
LimJK said:
"看,见" OR "看 见" should NOT return hits for 看见. As I illustrated in the earlier posts, 看 and 见 are actually part of the compound word "观看" and the phrase "见有一块非人手凿出来的石头", respectively, in Daniel 2:34.
I don't know if this is possible. If it is, I think it will be complicated to implement. Given the building blocks of the search index, we have terms that occur at character and word offsets. For the terms 看 and 见 to match in the way you describe we'd have to find all the places where they occur together and then ensure that 看 precedes 见 by only one word, but that they are separated by at least one character.
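To make the required check concrete, here is a minimal Python sketch. It assumes, purely for illustration, that punctuation and whitespace occupy a character offset but no word offset, and that each Han character counts as one word; the function names are hypothetical and the real index layout may differ.

```python
import unicodedata


def index_positions(text):
    """Return (char, char_offset, word_offset) for each non-punctuation,
    non-space character. Punctuation advances only the character offset."""
    entries = []
    word_offset = 0
    for char_offset, ch in enumerate(text):
        if unicodedata.category(ch).startswith("P") or ch.isspace():
            continue
        entries.append((ch, char_offset, word_offset))
        word_offset += 1
    return entries


def split_by_punctuation(text, first, second):
    """Char offsets where `first` precedes `second` by exactly one word
    but at least one other character separates them."""
    entries = index_positions(text)
    hits = []
    # Neighbouring entries always differ by exactly one word offset.
    for (a, a_char, _), (b, b_char, _) in zip(entries, entries[1:]):
        if a == first and b == second and b_char - a_char > 1:
            hits.append(a_char)
    return hits


print(split_by_punctuation("观看,见有", "看", "见"))  # [1]
print(split_by_punctuation("我看见了", "看", "见"))   # []
```

The first call matches because a comma sits between 看 and 见; the second does not, because the two characters are directly adjacent.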
I would appreciate more feedback and usage examples about this idea. It seems like it could be inconvenient for other queries, where A and B could occur as AB or A,B in the text, and you are interested in seeing the results together (i.e. the normal behavior of a phrase search.)
Lawrence,
I thought the discussion would be more relevant here, if we want others to suggest other use cases.
(1) I must admit I did not look at the technical possibilities of the implementation [:)]
(2) Rethinking ... in real life I am not likely to construct a search with 2 words/chars separated by punctuation, such as the "看,见" in our discussion. In the example I was actually looking for the equivalent of "... looked, and behold ..."; I would search for "观看,见", and the search results in the current implementation are correct.
JK
0 -
Lawrence,
I was looking for use cases to prove my point ... instead, I found cases showing that your proposed search behavior is better [:)], such as these 2 examples:
- “一载、二载、半载” ... time, times and half a time
- "金、银、铜" ... Gold, Silver, bronze
JK
0