[Xapian-devel] QueryParser and utf-8 strings
olly at survex.com
Fri Dec 9 17:17:53 GMT 2005
On Fri, Dec 09, 2005 at 11:50:35AM +0100, Radovan Garabik wrote:
> since most of the characters above 0x80 are meant as letters, only with
> very few exceptions (non breaking spaces and punctuation, and
> people generally do not write queries using these characters).
Users typically won't type them, but pasted text can easily include
non-ASCII apostrophes and hyphens, even for English queries.
> Of course, the same effect can be achieved by modifying is_tab.
The gmane index is UTF-8 - here's the patch I use for the queryparser
I intend to integrate this patch, but it really needs UTF-8
stemming, which means I need to upgrade the version of Snowball we
use (the gmane index currently doesn't use stemming, so this isn't
a problem there). Upgrading Snowball is worth doing anyway, but it
means there's more to do.
I'm currently working on producing 0.9.3 which is mainly a bug-fix
release. I suspect after that we'll go for 1.0.0 with this in.
> I would suggest to make transliteration optional (or if not, remove it,
> since it makes more harm than benefit)
That seems to be the consensus (and any transliterations which are
worthwhile should be folded into the stemmers).
> and to consider
> all the chars above 0x80 to be letters (at least there is no
> better solution unless full Unicode support is implemented, and THAT is
> probably not worth the effort)
I tend to think it's worth doing this properly - there are libraries we
can use which implement the Unicode equivalent of "isalnum()" and friends,
so it's very little extra work to get it right.
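
To illustrate the difference, here is a minimal Python sketch (not the
Xapian queryparser code, and not the patch mentioned above) comparing the
"everything above 0x80 is a letter" heuristic with a classification based
on Unicode general categories. The function names and the sample query are
made up for illustration, and Python's standard unicodedata module stands
in for whichever Unicode library would actually be used.

```python
import unicodedata


def is_word_char_naive(ch):
    """Heuristic from the thread: ASCII alphanumerics, plus everything >= 0x80.

    Applied to code points here; on raw UTF-8 bytes the effect is the same,
    since every byte of a non-ASCII character is >= 0x80.
    """
    if ord(ch) >= 0x80:
        return True
    return ch.isalnum()


def is_word_char_unicode(ch):
    """Unicode-aware check: general category Letter (L*), Mark (M*) or Number (N*)."""
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def split_terms(text, is_word_char):
    """Split text into maximal runs of characters accepted by is_word_char."""
    terms, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


# Hypothetical pasted query containing U+2019 (right single quotation mark
# used as an apostrophe) and U+2010 (hyphen), alongside accented letters.
query = "na\u00efve user\u2019s well\u2010known caf\u00e9"

naive_terms = split_terms(query, is_word_char_naive)
unicode_terms = split_terms(query, is_word_char_unicode)

print(naive_terms)
print(unicode_terms)
```

With the heuristic, the pasted apostrophe and hyphen end up glued inside
the terms; with the category-based check they act as separators while the
accented letters are still kept. A real parser would also need to decide
how to treat apostrophes inside words, which this sketch does not attempt.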