<!DOCTYPE html><html><head><title></title><style type="text/css">p.MsoNormal,p.MsoNoSpacing{margin:0}</style></head><body><div>On Sun, Jan 7, 2024, at 7:45 PM, Olly Betts wrote:<br></div><blockquote type="cite" id="qt" style=""><div>I've restarted trac.<br></div></blockquote><div><br></div><div>I have now created a pull request: <a href="https://github.com/xapian/xapian/pull/329">https://github.com/xapian/xapian/pull/329</a>. Should I create a trac issue, too?<br></div><div><br></div><blockquote type="cite" id="qt" style=""><div>Assuming the latter is valid, just removing this block (or removing the<br></div><div>parts of it which are Lu or Ll) should fix the problem as then<br></div><div>tokenisation will switch mode - I tried this and it fixes your case at<br></div><div>least:<br></div></blockquote><div><br></div><div>Removing the whole block would stop word-breaker from handling halfwidth Katakana correctly. For example, it would treat "ｼｰｻｲﾄﾞﾗｲﾅｰ" as a single term, whereas it should be two (ｼｰｻｲﾄﾞ and ﾗｲﾅｰ).<br></div><div><br></div><div>My pull request makes word-breaker treat only the halfwidth Katakana and Hangul codepoints from this block as unbroken script, and treat Latin characters, numbers, symbols and punctuation as broken script. 
There are a couple of unit tests that check for this.<br></div><div><br></div><div>diff --git a/xapian-core/queryparser/word-breaker.cc b/xapian-core/queryparser/word-breaker.cc<br></div><div>index 8108523ccd53..6122dcdccc97 100644<br></div><div>--- a/xapian-core/queryparser/word-breaker.cc<br></div><div>+++ b/xapian-core/queryparser/word-breaker.cc<br></div><div>@@ -102,8 +102,10 @@ is_unbroken_script(unsigned p)<br></div><div>        0xF900 - 1, 0xFAFF,<br></div><div>        // FE30..FE4F; CJK Compatibility Forms<br></div><div>        0xFE30 - 1, 0xFE4F,<br></div><div>-       // FF00..FFEF; Halfwidth and Fullwidth Forms<br></div><div>-       0xFF00 - 1, 0xFFEF,<br></div><div>+       // FF00..FF60: Fullwidth Numbers, Latin Characters, Punctuation<br></div><div>+       // FF61..FF64: Halfwidth Punctuation<br></div><div>+       0xFF65 - 1, 0xFFDC, // Halfwidth Katakana and Hangul<br></div><div>+       // FFE0..FFEF; Fullwidth and Halfwidth Symbols<br></div><div><br></div><div>The fullwidth "ｈｅｌｌｏ ，ｗｏｒｌｄ" tests suggest to me that either Xapian should allow for Unicode normalization, or application developers must take care of this before indexing.<br></div></body></html>