Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

Robert Stepanek rsto at fastmailteam.com
Wed Jan 10 08:02:03 GMT 2024


On Tue, Jan 9, 2024, at 3:28 AM, Olly Betts wrote:
> Thanks, that looks good - now merged.

Thanks!

> Did you already check the other ranges for cased letters?  I can but if
> you have already there's not much point.

I did not. If you find time, that'd be great. Otherwise I can make room for it in the next few days.

> > The fullwidth "hello ,world" tests suggest to me that
> > either Xapian should allow for Unicode normalization, or application
> > developers must take care of this before indexing.
> 
> We currently leave it to the API user to normalise Unicode
> representation, though maybe we should provide support for doing so.

Thinking some more about this, I think it's sane to leave this out of Xapian. Supporting it properly would also require bookkeeping within Xapian to record which normalisation was applied to terms, and that can get complex for sub-databases or mixed normalisations within one database.
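
For illustration, here is a minimal sketch of what normalising on the application side before indexing could look like. It uses Python's standard unicodedata module. The helper name normalise_for_indexing is hypothetical, and the choice of NFKC is my assumption (it folds fullwidth forms to their ASCII counterparts); nothing here is Xapian API.

```python
import unicodedata


def normalise_for_indexing(text: str, form: str = "NFKC") -> str:
    """Return text in the given Unicode normalisation form.

    Hypothetical helper: NFKC is an assumed default, chosen because it
    maps fullwidth letters and punctuation to their ASCII equivalents.
    """
    return unicodedata.normalize(form, text)


# Fullwidth "hello,world" (U+FF48 ... U+FF0C ...), written as escapes so the
# example survives mail clients that mangle non-ASCII text.
fullwidth = (
    "\uff48\uff45\uff4c\uff4c\uff4f"  # fullwidth "hello"
    "\uff0c"                          # fullwidth comma
    "\uff57\uff4f\uff52\uff4c\uff44"  # fullwidth "world"
)

normalised = normalise_for_indexing(fullwidth)
print(normalised)
```

The same function would have to be applied to query strings as well, otherwise indexed terms and query terms would not match. That is the bookkeeping problem in miniature: the application has to remember which form it used.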