[Xapian-discuss] Problem getting Xapian working with Burmese

Olly Betts olly at survex.com
Sun Jan 31 23:53:34 GMT 2010


On Sun, Jan 31, 2010 at 11:31:03AM +0100, Emmanuel Engelhart wrote:
> I think, I more or less have understood what is wrong.
> 
> "???????????????" is the name of "Paris" in Burmese.
> 
> Here is the result of delve -r 1:
> Term List for record #1: ??? ??????
> 
> We can see that the diacritics were removed... and I think here is the
> issue: the diacritics are interpreted as SEPARATOR by the tokenizer and
> that should not be the case because they are not "alone", but "belongs
> to a letter".

Thanks for the example and analysis.

There seem to be two issues here.

The first is with NON_SPACING_MARK characters (which I think is what
you are referring to above).  In 1.1.x, these are treated as part of the
word, but this issue was reported when we were at about 1.0.11, so we
couldn't just change the behaviour of 1.0.x without breaking existing
databases.  So we went for the less good but compatible approach of making
QueryParser treat these characters as phrase generators.

The ticket for that issue has more detail:

http://trac.xapian.org/ticket/355
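
To make the difference between the two behaviours concrete, here is a rough
Python sketch.  It is a toy model only, not Xapian's actual tokenizer code:
the function name `tokenize` and its `marks_are_word_chars` flag are made up
for this illustration, and Unicode category "Mn" is assumed to correspond to
what Xapian calls NON_SPACING_MARK.

```python
import unicodedata

def tokenize(text, marks_are_word_chars):
    """Split text into terms (toy model, NOT Xapian's real tokenizer)."""
    terms, current = [], []
    for ch in text:
        cat = unicodedata.category(ch)
        is_letter_or_digit = cat[0] in ("L", "N")
        is_mark = (cat == "Mn")  # assumed equivalent of NON_SPACING_MARK
        if is_letter_or_digit or (is_mark and marks_are_word_chars):
            current.append(ch)   # character is part of the current word
        elif current:
            terms.append("".join(current))  # anything else ends the word
            current = []
    if current:
        terms.append("".join(current))
    return terms

# "n", "e", COMBINING ACUTE ACCENT (category Mn), "e"
sample = "ne\u0301e"
old_terms = tokenize(sample, marks_are_word_chars=False)  # mark splits the word
new_terms = tokenize(sample, marks_are_word_chars=True)   # mark stays attached
```

With the flag off, the combining mark splits the word in two and is itself
lost, which is the kind of term list you saw from delve; with it on, the
word survives as a single term.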

The second issue in your case is that the text also contains zero-width space
characters (U+200B), which currently act as word breaks.  These are present to
indicate acceptable places to split a word when wrapping text, so we should
ideally just strip them out when generating terms.
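
Until that happens, one possible workaround is to strip them yourself before
handing text to the indexer (and before parsing queries, so both sides
match).  A minimal Python sketch, assuming U+200B is the only such character
in your data; the helper names are hypothetical, and `split_words` is only a
stand-in for a tokenizer to show the effect:

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE

def strip_zero_width_spaces(text):
    """Remove U+200B so it can no longer act as a word break."""
    return text.replace(ZWSP, "")

def split_words(text):
    """Stand-in tokenizer: every non-word character ends a word."""
    return re.findall(r"\w+", text)

sample = "hyper" + ZWSP + "text index"
before = split_words(sample)                           # ZWSP splits "hypertext"
after = split_words(strip_zero_width_spaces(sample))   # word is kept whole
```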

Cheers,
    Olly


