Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

Olly Betts olly at survex.com
Tue Jan 9 02:28:43 GMT 2024


On Mon, Jan 08, 2024 at 02:01:46PM +0100, Robert Stepanek wrote:
> Removing the whole block will cause word-breaker to not correctly
> handle halfwidth Katakana, such as "シーサイドライナー" which it would treat
> as a single term, whereas it should be two: シーサイド and ライナー.
> 
> My pull request causes word-breaker to only handle halfwidth Katakana
> and Hangul codepoints as unbroken script and treats Latin characters,
> numbers, symbols and punctuation as broken script. There's a couple of
> unit tests that check for this.

Thanks, that looks good - now merged.

I think we probably should backport this to 1.4 - it's a behaviour
change, but limited to text containing these fullwidth Latin characters
and the change fixes a bug.  The awkward wrinkle is that you need to
reindex to get the full benefits of the fix.

Did you already check the other ranges for cased letters?  I can, but if
you already have, there's not much point.

> The fullwidth "hello ,world" test suggests to me that
> either Xapian should allow for Unicode normalization, or application
> developers must take care of this before indexing.

We currently leave it to the API user to normalise Unicode
representation, though maybe we should provide support for doing so.
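
As a rough sketch of what normalising on the application side could look
like: the following uses NFKC from Python's standard unicodedata module,
which folds fullwidth Latin letters and punctuation to their ASCII
equivalents.  This is not something Xapian provides, the helper name is
made up for illustration, and NFKC is only one possible choice of
normalisation form.

```python
import unicodedata


def normalise_for_indexing(text: str) -> str:
    """Hypothetical helper: fold compatibility forms before indexing.

    NFKC maps fullwidth Latin letters, digits and punctuation to their
    ordinary counterparts. It is an assumption here that this is the
    normalisation an application wants; it is not part of Xapian.
    """
    return unicodedata.normalize("NFKC", text)


# Fullwidth "hello,world" written with escapes so the codepoints are
# unambiguous (U+FF48 is FULLWIDTH LATIN SMALL LETTER H, and so on).
fullwidth = "\uff48\uff45\uff4c\uff4c\uff4f\uff0c\uff57\uff4f\uff52\uff4c\uff44"

print(normalise_for_indexing(fullwidth))  # hello,world
```

The same normalisation would need to be applied to query strings as well
as to indexed text, otherwise the two won't match.  Note that NFKC also
rewrites other compatibility characters (ligatures, superscripts and the
like), which may or may not be wanted.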

Cheers,
    Olly


