[Xapian-devel] QueryParser and utf-8 strings

Olly Betts olly at survex.com
Fri Dec 9 17:17:53 GMT 2005


On Fri, Dec 09, 2005 at 11:50:35AM +0100, Radovan Garabik wrote:
> since most of the characters above 0x80 are meant as letters, only with 
> very few exceptions (non breaking spaces and punctuation, and 
> people generally do not write queries using these characters).

Users typically won't type them, but pasted text can easily include
non-ASCII apostrophes and hyphens, even for English queries.
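
To make that concrete, here is a rough sketch of what a byte-oriented
tokenizer would do with pasted text if every byte >= 0x80 counted as a
letter.  This is purely an illustration: naive_is_wordchar() and
naive_terms() are hypothetical names, not code from the queryparser or
from the gmane patch.

```python
def naive_is_wordchar(byte):
    # Assumption for illustration: every byte >= 0x80 is a "letter",
    # and ASCII bytes are word characters only if alphanumeric.
    return byte >= 0x80 or chr(byte).isalnum()


def naive_terms(text):
    """Split UTF-8 encoded text into terms using the byte rule above."""
    terms, current = [], bytearray()
    for b in text.encode("utf-8"):
        if naive_is_wordchar(b):
            current.append(b)
        elif current:
            terms.append(current.decode("utf-8"))
            current = bytearray()
    if current:
        terms.append(current.decode("utf-8"))
    return terms


# U+2019 (right single quotation mark) and U+2010 (hyphen) each encode
# to three bytes in UTF-8, all of them >= 0x80.
pasted = "user\u2019s well\u2010known caf\u00e9"
print(naive_terms(pasted))
```

With this rule the typographic apostrophe and hyphen end up inside the
terms ("user’s" and "well‐known" each come out as a single term), so
they won't match the same words indexed with ASCII punctuation.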

> Of course, the same effect can be achieved by modifying is_tab.

The gmane index is UTF-8.  Here's the patch I use for the queryparser
there:

http://thread.gmane.org/gmane.comp.search.xapian.general/1925

I intend to integrate this patch, but it really needs UTF-8 stemming,
which means I need to upgrade the version of Snowball we use (the gmane
index currently doesn't use stemming, so this isn't a problem there).
Upgrading Snowball is worth doing anyway, but it means there's more to
do.

I'm currently working on producing 0.9.3 which is mainly a bug-fix
release.  I suspect after that we'll go for 1.0.0 with this in.

> I would suggest making transliteration optional (or, if not, removing it,
> since it does more harm than good)

That seems to be the consensus (and any transliterations which are
worthwhile should be folded into the stemmers).

> and to consider
> all the chars above 0x80 to be letters (at least there is no
> better solution unless full Unicode support is implemented, and THAT is
> probably not worth the effort)

I tend to think it's worth doing this properly.  There are libraries we
can use which implement the Unicode equivalent of "isalnum()" and
friends, so it's very little extra work to get it right.
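
As a minimal sketch of the idea (using Python's standard unicodedata
module as a stand-in for whichever library we'd actually pick), a
Unicode-aware "isalnum()" can be built on the general category of each
code point.  The names is_wordchar() and unicode_terms() are
hypothetical, and this isn't meant as the eventual implementation.

```python
import unicodedata


def is_wordchar(ch):
    # General categories starting with "L" are letters and those starting
    # with "N" are numbers.  Combining marks ("M") are not handled here,
    # so decomposed accented letters would be split.
    return unicodedata.category(ch)[0] in ("L", "N")


def unicode_terms(text):
    """Split text into terms on any character that is not a letter/number."""
    terms, current = [], []
    for ch in text:
        if is_wordchar(ch):
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


pasted = "user\u2019s well\u2010known caf\u00e9"
print(unicode_terms(pasted))
```

Here the non-ASCII apostrophe and hyphen are classified as punctuation
and split terms just as their ASCII counterparts would, while the
accented letter in "café" stays part of its word.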

Cheers,
    Olly



