Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

Robert Stepanek rsto at fastmailteam.com
Mon Jan 8 13:01:46 GMT 2024


On Sun, Jan 7, 2024, at 7:45 PM, Olly Betts wrote:
> I've restarted trac.

I have now created a pull request: https://github.com/xapian/xapian/pull/329 Should I create a trac issue as well?

> Assuming the latter is valid, just removing this block (or removing the
> parts of it which are Lu or Ll) should fix the problem as then
> tokenisation will switch mode - I tried this and it fixes your case at
> least:

Removing the whole block would stop word-breaker from handling halfwidth Katakana correctly: a string such as "シーサイドライナー" would be treated as a single term, whereas it should be split into two (シーサイド and ライナー).

My pull request makes word-breaker treat only the halfwidth Katakana and Hangul codepoints in this block as unbroken script, and treat Latin characters, numbers, symbols and punctuation as broken script. There are a couple of unit tests that check for this.

diff --git a/xapian-core/queryparser/word-breaker.cc b/xapian-core/queryparser/word-breaker.cc
index 8108523ccd53..6122dcdccc97 100644
--- a/xapian-core/queryparser/word-breaker.cc
+++ b/xapian-core/queryparser/word-breaker.cc
@@ -102,8 +102,10 @@ is_unbroken_script(unsigned p)
        0xF900 - 1, 0xFAFF,
        // FE30..FE4F; CJK Compatibility Forms
        0xFE30 - 1, 0xFE4F,
-       // FF00..FFEF; Halfwidth and Fullwidth Forms
-       0xFF00 - 1, 0xFFEF,
+       // FF00..FF60: Fullwidth Numbers, Latin Characters, Punctuation
+       // FF61..FF64: Halfwidth Punctuation
+       0xFF65 - 1, 0xFFDC, // Halfwidth Katakana and Hangul
+       // FFE0..FFEF; Fullwidth and Halfwidth Symbols
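
To illustrate the effect of the narrowed range, here is a simplified sketch. It assumes the table is searched with a binary chop over sorted (first - 1, last) pairs, which is what the "0xFF65 - 1, 0xFFDC" layout suggests. The names splits_sketch and in_unbroken_range_sketch and the trimmed three-entry table are hypothetical and are not copied from word-breaker.cc.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical, trimmed-down table: only the three ranges visible in the
// diff above, stored as (first - 1, last) pairs in ascending order.
static const unsigned splits_sketch[] = {
    0xF900 - 1, 0xFAFF, // CJK Compatibility Ideographs
    0xFE30 - 1, 0xFE4F, // CJK Compatibility Forms
    0xFF65 - 1, 0xFFDC, // Halfwidth Katakana and Hangul (as in the patch)
};

// Assumed lookup: find the first table entry >= p.  An odd index means p
// lies inside a (first, last) range; an even index means it is in a gap.
bool in_unbroken_range_sketch(unsigned p)
{
    auto it = std::lower_bound(std::begin(splits_sketch),
                               std::end(splits_sketch), p);
    return ((it - std::begin(splits_sketch)) & 1) != 0;
}
```

With this table, U+FF48 (fullwidth "h") falls in a gap and is not reported as unbroken script, while U+FF76 (halfwidth Katakana KA) falls inside the last range and is.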

The fullwidth "hello ,world" test suggests to me that either Xapian should support Unicode normalization, or application developers must take care of this themselves before indexing.
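
As a rough illustration of the application-side option, the sketch below folds fullwidth ASCII variants (U+FF01..U+FF5E) and the ideographic space (U+3000) to their ASCII counterparts before the text is handed to the indexer. This is not full Unicode normalization (NFKC covers far more, and a real application would more likely use a library such as ICU), and fold_fullwidth_ascii is a hypothetical helper, not part of Xapian.

```cpp
#include <string>

// Hypothetical helper: fold fullwidth ASCII variants to plain ASCII.
// U+FF01..U+FF5E mirror U+0021..U+007E at a fixed offset of 0xFEE0.
std::u32string fold_fullwidth_ascii(std::u32string text)
{
    for (char32_t& c : text) {
        if (c >= 0xFF01 && c <= 0xFF5E) {
            c -= 0xFEE0;        // fullwidth letter, digit or punctuation
        } else if (c == 0x3000) {
            c = U' ';           // ideographic space becomes ASCII space
        }
        // Everything else (including halfwidth Katakana) is left alone.
    }
    return text;
}
```

Applied to the fullwidth spelling of "hello,world", this yields the plain ASCII string, so the usual Latin tokenisation applies to it regardless of how word-breaker classifies the fullwidth block.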