[Xapian-discuss] UTF-8: what is done and what is not?

Olly Betts olly at survex.com
Fri Nov 3 01:34:51 GMT 2006


On Thu, Nov 02, 2006 at 08:01:44PM -0500, tata 668 wrote:
> Isn't a UTF-8 queryparser useless until it uses the exact same word 
> splitter as the one used for indexing the documents?

It would be more convenient if a compatible word splitter were available
in the core library, but "useless" is much too strong a summary of the
situation.

I'm intending to improve this situation before releasing 1.0.  Prior to
that, I suggest cribbing from indextext.cc in Omega - that's what
omindex and scriptindex use for tokenising UTF-8 text.
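
To illustrate why the splitter used at query time needs to be compatible
with the one used at index time, here is a minimal Python sketch.  It is
not Xapian's API and not the code in indextext.cc: the names split_terms,
tokenize, ascii_tokenize, build_index and search are all hypothetical, and
the rule "Unicode letters and digits are word characters" is an assumption
made purely for the example.

```python
import unicodedata


def split_terms(text, is_word_char):
    """Split text into lowercased terms using the given word-character test."""
    terms, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch.lower())
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


def tokenize(text):
    # Assumed rule: any Unicode letter (L*) or number (N*) is a word character.
    return split_terms(text, lambda ch: unicodedata.category(ch)[0] in "LN")


def ascii_tokenize(text):
    # Deliberately incompatible splitter: only ASCII letters and digits count.
    return split_terms(text, lambda ch: ch.isascii() and ch.isalnum())


def build_index(docs, tokenizer):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokenizer(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, tokenizer):
    """Return the ids of documents containing every term of the query."""
    terms = tokenizer(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Index one document with the Unicode-aware splitter.
index = build_index({1: "Caf\u00e9 Z\u00fcrich"}, tokenize)

# Same splitter on both sides: the query term matches the indexed term.
same = search(index, "caf\u00e9", tokenize)

# Mismatched splitter: the query is cut at the accented letter and misses.
mismatched = search(index, "caf\u00e9", ascii_tokenize)
```

With the same splitter on both sides the accented query finds document 1;
with the ASCII-only splitter the query is cut short at the accented letter
and finds nothing.  Pure ASCII queries behave identically under both
splitters, which is why a mismatch degrades matching for some text rather
than breaking it for all.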

Cheers,
    Olly


