Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

Thu Jan 18 04:02:39 GMT 2024

On Wed, Jan 10, 2024 at 09:02:03AM +0100, Robert Stepanek wrote:
> On Tue, Jan 9, 2024, at 3:28 AM, Olly Betts wrote:
> > We currently leave it to the API user to normalise Unicode
> > representation, though maybe we should provide support for doing so.
> 
> Thinking some more about this, I think it's sane to leave this out of
> Xapian. Unless there is also some bookkeeping added within Xapian to
> tell which normalisation was applied to terms, which can get complex
> for sub-databases or mixed normalisations within one database.

We could provide an "opinionated" implementation where we pick one
normalisation and only support that (we essentially do that for
encodings - Xapian features which care about an encoding only support
UTF-8).

A composed form is probably the more sensible choice here:

* The Snowball stemmers all support composed forms, and few if any
  support decomposed forms
* It makes for smaller terms
* It seems to be by far the dominant form in real-world data
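
To illustrate the term size point, here is a minimal sketch using
Python's standard unicodedata module (not anything in Xapian; the
helper name utf8_len is made up for this example). It compares the
UTF-8 length of the same word in composed and decomposed form:

```python
import unicodedata

def utf8_len(text, form):
    """Return the UTF-8 byte length of text after normalising to form."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))

word = "caf\u00e9"  # "café" written with precomposed U+00E9

# NFC keeps U+00E9 (2 bytes in UTF-8); NFD splits it into "e" (1 byte)
# plus U+0301 COMBINING ACUTE ACCENT (2 bytes).
composed = utf8_len(word, "NFC")
decomposed = utf8_len(word, "NFD")
print(composed, decomposed)  # prints: 5 6
```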

That means NFC or NFKC - the latter seems helpful in some cases (e.g.
"oﬃce" -> "office") but less so in others (e.g. "4²" -> "42").
I think this would need a deeper analysis, but possibly we could define
a subset of the Unicode compatibility mappings to apply here.
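
As a rough sketch of what such a subset might look like (again using
Python's unicodedata module, not Xapian API), the code below applies
NFC everywhere but folds a character to its compatibility form only
when its decomposition tag is in an allowed set. The name
fold_selected and the choice of tags in FOLD_TAGS are assumptions
made purely for illustration, not a worked-out policy:

```python
import unicodedata

# Hypothetical policy: compatibility decomposition tags we choose to fold.
# "<super>" is deliberately left out so that "4²" does not become "42".
FOLD_TAGS = {"<compat>", "<wide>", "<narrow>"}

def fold_selected(text, fold_tags=FOLD_TAGS):
    """NFC-normalise text, applying NFKC only to characters whose
    compatibility decomposition tag is listed in fold_tags."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        # decomposition() returns e.g. "<wide> 0041", "<super> 0032",
        # a tagless canonical mapping, or "" if there is none.
        decomp = unicodedata.decomposition(ch)
        tag = decomp.split(" ", 1)[0] if decomp.startswith("<") else None
        out.append(unicodedata.normalize("NFKC", ch) if tag in fold_tags else ch)
    # Recompose in case a folded character introduced combining marks.
    return unicodedata.normalize("NFC", "".join(out))

print(unicodedata.normalize("NFKC", "4\u00b2"))  # prints: 42
print(fold_selected("4\u00b2"))                  # prints: 4²
print(fold_selected("o\ufb03ce"))                # prints: office
print(fold_selected("\uff21"))                   # fullwidth A, prints: A
```

The "<compat>" tag covers a lot more than ligatures, so a real
implementation would likely need a finer-grained list than this.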

Cheers,
    Olly