Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints
Robert Stepanek
rsto at fastmailteam.com
Thu Jan 4 16:50:22 GMT 2024
I think I found a bug in Xapian 1.5 when using FLAG_WORD_BREAKS for input that contains characters in Unicode Halfwidth and Fullwidth Forms (https://unicode.org/charts/PDF/UFF00.pdf).
Since I am still undecided on whether and how to fix this in Xapian, I haven't prepared a pull request. Because Trac is currently offline, I could not file a bug. I hope it's OK to post my analysis here first; I'll be happy to follow up by reporting the bug properly later (should we conclude that it actually is a bug).
Imagine indexing the following Japanese text "三菱ＵＦＪファクター株式会社", which in English denotes the "Mitsubishi UFJ Factors Limited" bank.
Using word segmentation in Xapian 1.5 this causes the following terms to get indexed:
ファクター
三菱
株式会社
ＵＦＪ
Note the last term, which starts with 'FULLWIDTH LATIN CAPITAL LETTER U' (U+FF35). Xapian's Unicode library correctly assigns this character the UPPERCASE_LETTER category, and the term is indexed verbatim.
However, querying for ＵＦＪ produces the query Query(ｕｆｊ@1). That is, it queries for the lowercase form, which seems to be the result of unconditional lower-casing at https://github.com/xapian/xapian/blob/master/xapian-core/queryparser/queryparser.lemony#L1459. As a result, the query returns no results.
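The mismatch can be illustrated outside of Xapian with a small sketch. This is plain Python using only the standard library; `str.lower()` merely stands in for the query parser's lower-casing, and the names `INDEXED_TERM`, `lowercased_query` and `exact_match` are made up for illustration. It is not Xapian code and makes no claim about how Xapian implements either step.

```python
import unicodedata

# The term as indexed verbatim: fullwidth "UFJ" (U+FF35 U+FF26 U+FF2A).
INDEXED_TERM = "\uff35\uff26\uff2a"


def lowercased_query(user_input):
    """Mimic unconditional lower-casing of a query term (an assumption)."""
    return user_input.lower()


def exact_match(indexed_term, query_term):
    """Term lookup is an exact comparison of the two strings."""
    return indexed_term == query_term


if __name__ == "__main__":
    # The first character is an uppercase letter, so it has a lowercase form.
    print(unicodedata.category(INDEXED_TERM[0]))
    query = lowercased_query(INDEXED_TERM)
    # The lower-cased query uses U+FF55 U+FF46 U+FF4A, not the indexed form.
    print([hex(ord(c)) for c in query])
    print(exact_match(INDEXED_TERM, query))
```

The point is only that the fullwidth letters have distinct lowercase codepoints, so a term indexed in its original case and a query that is always lower-cased can never compare equal.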
I have written code that demonstrates this at https://gist.github.com/rsto/168a61536793e10a0a07c3920977e5eb
Now, I think that much of this issue can be prevented by normalizing both indexed text and queries before passing them to Xapian, but this requires rewriting indexes, so it isn't necessarily a quick fix. As a workaround, I chose to detect such queries in our systems and query for both the lower-cased and the original uppercase forms.
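Both ideas can be sketched in a few lines of standard-library Python. The choice of NFKC as the normalization form is an assumption (the message does not name one), and `normalize_text` and `query_variants` are hypothetical helper names, not part of Xapian or of the systems mentioned above.

```python
import unicodedata


def normalize_text(text):
    """Fold compatibility characters such as fullwidth Latin letters.

    NFKC is assumed here; it maps U+FF35 (fullwidth U) to plain "U".
    The same function would be applied to indexed text and to queries.
    """
    return unicodedata.normalize("NFKC", text)


def query_variants(term):
    """Workaround sketch: return the lower-cased term and, if it
    differs, the original term as well, so both can be queried."""
    lowered = term.lower()
    if lowered == term:
        return [lowered]
    return [lowered, term]


if __name__ == "__main__":
    fullwidth_ufj = "\uff35\uff26\uff2a"  # fullwidth "UFJ"
    print(normalize_text(fullwidth_ufj))  # prints "UFJ" in ASCII
    print(query_variants(fullwidth_ufj))
```

With normalization, the indexed term and the lower-cased query both end up in ASCII and match after case folding, at the cost of reindexing. The variants approach leaves existing indexes untouched but has to combine the returned terms (for example with OR) at query time.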
Still, I do think it is a bug for Xapian not to return a result when querying for a term that appears verbatim in both the original input and the database. Should you agree, I will be happy to discuss how to fix this and might come up with a pull request once we have agreed on a solution.