[Xapian-devel] QueryParser and utf-8 strings

Olly Betts olly at survex.com
Fri Dec 9 17:17:53 GMT 2005

On Fri, Dec 09, 2005 at 11:50:35AM +0100, Radovan Garabik wrote:
> since most of the characters above 0x80 are meant as letters, only with 
> very few exceptions (non breaking spaces and punctuation, and 
> people generally do not write queries using these characters).

Users typically won't type them, but pasted text can easily include
non-ASCII apostrophes and hyphens, even for English queries.

> Of course, the same effect can be achieved by modifying is_tab.

The gmane index is utf-8 - here's the patch I use for the queryparser


I intend to integrate this patch, but it really needs utf-8 stemming,
which means upgrading the version of snowball we use (the gmane index
currently doesn't use stemming, so this isn't a problem there).
Upgrading snowball is worth doing anyway, but it means there's more
to do.

I'm currently working on producing 0.9.3 which is mainly a bug-fix
release.  I suspect after that we'll go for 1.0.0 with this in.

> I would suggest making transliteration optional (or if not, removing it,
> since it does more harm than good)

That seems to be the consensus (and any transliterations which are
worthwhile should be folded into the stemmers).

> and to consider
> all the chars above 0x80 to be letters (at least there is no
> better solution unless full Unicode support is implemented, and THAT is
> probably not worth the effort)

I tend to think it's worth doing this properly - there are libraries we
can use which implement the Unicode equivalent of "isalnum()" and friends
so it's very little extra work to get it right.
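
To make the difference concrete, here is a rough sketch in Python (not
the QueryParser code, and not the patch mentioned above) comparing the
two approaches on a pasted query.  The function names are made up for
illustration, and Python's built-in str.isalnum() is only standing in
for whatever Unicode library would actually be used.

```python
def split_bytes_heuristic(text):
    """Hypothetical splitter: works on UTF-8 bytes and treats ASCII
    alphanumerics plus every byte >= 0x80 as part of a word."""
    terms, current = [], bytearray()
    for b in text.encode("utf-8"):
        if b >= 0x80 or chr(b).isalnum():
            current.append(b)
        elif current:
            terms.append(current.decode("utf-8"))
            current = bytearray()
    if current:
        terms.append(current.decode("utf-8"))
    return terms


def split_unicode_aware(text):
    """Hypothetical splitter: works on code points and asks a
    Unicode-aware isalnum() whether each one belongs to a word."""
    terms, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


# A pasted query containing U+2019 (right single quotation mark),
# U+2010 (hyphen) and U+00E9 (e with acute accent).
query = "don\u2019t re\u2010index caf\u00e9"

print(split_bytes_heuristic(query))
print(split_unicode_aware(query))
```

With the byte heuristic the typographic apostrophe and hyphen are
swallowed into the terms, so "re\u2010index" stays one term and won't
match documents indexed as "re" and "index".  The Unicode-aware version
splits at both characters while still keeping the accented letter
inside "caf\u00e9".  What a real parser should then do with the pieces
around an apostrophe is a separate question this sketch doesn't address.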


