Details on Search Improvements for Chinese, Japanese and Korean Languages

Philana R. Crouch Member, Logos Employee Posts: 4,597
edited November 20 in English Forum

In Chinese, Japanese, and Korean resources, each CJK character is now indexed separately and treated as a "word" in the search index. Overlapping pairs of CJK characters (called bigrams) are also indexed, so that search hits can be merged into one longer hit, which improves ranked search results.
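To make the indexing scheme concrete, here is a minimal Python sketch of emitting each CJK character plus each overlapping bigram as index terms. It is only an illustration, not the actual Logos indexer: the function names `is_cjk` and `index_terms` are hypothetical, and the Unicode ranges are a rough assumption that covers only the most common blocks.

```python
def is_cjk(ch: str) -> bool:
    """Rough check (assumption): common Han, kana, and Hangul syllable blocks only."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs (Han)
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def index_terms(text: str) -> list[str]:
    """Return single-character terms and overlapping bigrams for CJK text."""
    terms = []
    for i, ch in enumerate(text):
        if not is_cjk(ch):
            continue
        terms.append(ch)  # every CJK character is its own "word"
        if i + 1 < len(text) and is_cjk(text[i + 1]):
            terms.append(ch + text[i + 1])  # overlapping pair (bigram)
    return terms


print(index_terms("神愛世人"))
# ['神', '神愛', '愛', '愛世', '世', '世人', '人']
```

Because adjacent bigrams overlap by one character, a run of matching bigrams can be stitched back together into one longer hit.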

A search query consisting of consecutive Han characters without spaces is treated as an AND search for all of the characters, regardless of order, with the added consideration that any consecutive pairs will be merged into a single search hit. For example, if A, B, and C are Han characters, then the search query ABC will return hits in articles or verses that contain A, B, AND C in any order, including consecutively. To restrict the results to just ABC, you can specify the query as a phrase search, "ABC"; for ranked search results, however, you should use ABC INTERSECTS "ABC". This causes the search to use the better rankings of the search results for ABC, while limiting the results to the same set as the phrase search.
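The difference between the three query forms can be sketched with a toy model in Python. This is a simplified illustration under stated assumptions, not how Logos implements search: each "verse" is just a string, matching is done per verse rather than per hit, ranking is ignored, and the names `and_search`, `phrase_search`, and `intersects` are hypothetical.

```python
def and_search(query: str, docs: dict[str, str]) -> set[str]:
    """Verses containing every character of the query, in any order."""
    return {name for name, text in docs.items() if all(ch in text for ch in query)}


def phrase_search(query: str, docs: dict[str, str]) -> set[str]:
    """Verses containing the query characters consecutively, in order."""
    return {name for name, text in docs.items() if query in text}


def intersects(query: str, docs: dict[str, str]) -> set[str]:
    """Toy stand-in for: query INTERSECTS "query" (AND results limited to phrase results)."""
    return and_search(query, docs) & phrase_search(query, docs)


# Made-up sample verses, for illustration only.
docs = {"v1": "神愛世人", "v2": "人愛神", "v3": "世人"}

print(sorted(and_search("神愛", docs)))     # ['v1', 'v2']
print(sorted(phrase_search("神愛", docs)))  # ['v1']
print(sorted(intersects("神愛", docs)))     # ['v1']
```

In this toy model the INTERSECTS result is the same set as the phrase search. The practical difference described above lies in ranking, which the sketch does not model.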

For search queries consisting of Hangul (Korean) characters, or a mixed-script query starting with at most one Hanja character, the query parser automatically treats ABC similarly to a search for ABC INTERSECTS "ABC". It does so because it expects Korean queries to separate words with spaces, whereas Chinese and Japanese queries are not expected to separate words with spaces.
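One way to picture this parser behavior is as a query rewrite. The Python sketch below is a guess at the rule, not the real parser: it assumes "starting with at most one Hanja character" means an optional single leading Han character followed only by Hangul syllables, it checks only the basic Unicode blocks, and the names `is_han`, `is_hangul`, and `rewrite_query` are hypothetical.

```python
def is_han(ch: str) -> bool:
    """Assumption: only the basic CJK Unified Ideographs block."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def is_hangul(ch: str) -> bool:
    """Assumption: only precomposed Hangul syllables."""
    return 0xAC00 <= ord(ch) <= 0xD7A3


def rewrite_query(query: str) -> str:
    """Rewrite a Korean-looking term as: term INTERSECTS "term"."""
    if not query or " " in query:
        return query  # spaces already separate terms; leave the query alone
    # Allow at most one leading Hanja character (interpretation of the rule).
    rest = query[1:] if is_han(query[0]) else query
    if rest and all(is_hangul(ch) for ch in rest):
        return f'{query} INTERSECTS "{query}"'
    return query  # e.g. Chinese or Japanese: keep the AND-style search


print(rewrite_query("하나님"))  # 하나님 INTERSECTS "하나님"
print(rewrite_query("神愛"))    # 神愛
```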

Chinese and Japanese queries can separate words with spaces to force the query parser to treat the words as separate terms, rather than merging them.