Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

Olly Betts olly at survex.com
Thu Jan 18 03:46:12 GMT 2024


On Wed, Jan 10, 2024 at 09:02:03AM +0100, Robert Stepanek wrote:
> On Tue, Jan 9, 2024, at 3:28 AM, Olly Betts wrote:
> > Did you already check the other ranges for cased letters?  I can but if
> > you have already there's not much point.
> 
> I did not. If you find time, that'd be great. Otherwise I can make
> room for it in the next days.

I hacked up a quick Perl script, and no character in any of the ranges
changes when I apply Perl's lc or uc function (whereas with the ranges
as they were before your fix, the script reports problems).
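
For illustration, a check of this kind could look roughly like the
sketch below. It is written in Python rather than Perl and is not the
script mentioned above. The function name and the two ranges are
assumptions chosen for the example (one block with no cased letters and
the fullwidth block, which contains cased Latin letters); they are not
the ranges from the actual code.

```python
def cased_code_points(first, last):
    """Return the code points in [first, last] that lower() or upper() changes."""
    changed = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        # A character is "cased" for this purpose if either mapping alters it.
        if ch.lower() != ch or ch.upper() != ch:
            changed.append(cp)
    return changed


# Hypothetical example ranges, not taken from the Xapian source:
EXAMPLE_RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]

for first, last in EXAMPLE_RANGES:
    hits = cased_code_points(first, last)
    status = "ok" if not hits else "%d cased code point(s)" % len(hits)
    print("U+%04X..U+%04X: %s" % (first, last, status))
```

A range that comes back empty is safe to treat as caseless; any code
point it lists would be altered by case-folding.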

I said I was leaning towards backporting this to 1.4.x, but having looked
into it I no longer think that's a good idea, as it would result in some
queries no longer matching unless the affected documents were reindexed.

Cheers,
    Olly


