[Xapian-discuss] Xapian::Queryparser / Encoding Problem (Utf8)
R. Mattes
rm at seid-online.de
Wed Aug 10 15:41:41 BST 2005
On Wed, 2005-08-10 at 15:29 +0100, Richard Boulton wrote:
> I believe that there haven't been any updates since the last flurry of
> messages on the list. (But feel free to check the commit logs for the
> relevant module.)
I was afraid of that - just wanted to make sure.
> Part of the problem has been that the stemming algorithms used not to
> support UTF-8 - however, the upstream algorithms (at
> http://snowball.tartarus.org/) now support this quite happily. However,
> other changes to the output of the stemmers have also occurred since the
> algorithms were imported into the Xapian source tree, so updating the
> algorithms has been waiting for a major release (since changing the
> stemming algorithms will force all databases to be rebuilt with the new
> algorithms). That said, don't let that stop you taking a look at the
> work, and changing them locally (and submitting a patch...)
Well, the stemmer is the lesser problem - I'd be happy if at least
unstemmed terms would stay correct (and _not_ be truncated at the first
non-ASCII character :-/ ).
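To make the truncation concrete, here is a minimal Python sketch of the symptom. It is not Xapian's query parser code; both function names are hypothetical, and the byte-wise loop is only an assumption about how an ASCII-only tokenizer ends up cutting a UTF-8 term short.

```python
import re


def ascii_only_term(word: str) -> str:
    """Hypothetical byte-oriented tokenizer: stops at the first byte that
    is not an ASCII letter or digit, so UTF-8 multibyte sequences end the term."""
    out = bytearray()
    for b in word.encode("utf-8"):
        if b >= 0x80 or not chr(b).isalnum():
            break
        out.append(b)
    return out.decode("ascii").lower()


def unicode_term(word: str) -> str:
    """Hypothetical Unicode-aware tokenizer: works on decoded characters,
    so non-ASCII letters are kept as part of the term."""
    match = re.match(r"\w+", word)
    return match.group(0).lower() if match else ""


if __name__ == "__main__":
    for w in ("Größe", "Müller", "plain"):
        print(w, ascii_only_term(w), unicode_term(w))
```

With German input such as "Größe", the byte-wise version keeps only the leading ASCII bytes, while the character-wise version keeps the whole word.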
> The query parser itself shouldn't need too much work - you'll probably
> need to look at the accent normalising code (see accentnormalisingitor.h
> and symboltab.h).
Well, looks like this will be my next task on the stack ...
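For reference, accent normalising on decoded UTF-8 text can be sketched in Python as below. This is an illustration only, under the assumption that "normalising" means dropping combining marks after Unicode decomposition; it does not reproduce what accentnormalisingitor.h or symboltab.h actually do, and the function name is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose characters (NFKD) and drop the
    combining marks, leaving the base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for w in ("Müller", "café", "Größe"):
        print(w, strip_accents(w))
```

Note that characters without a decomposition, such as the German "ß", pass through unchanged, so a real normaliser for German data would need an explicit mapping for them.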
> Oh, and note that the very latest English stemming algorithm from
> snowball makes use of apostrophe characters if it's presented with them,
> so it would be good to stop stripping them out of the input to the
> stemmer, if the language is English.
Unfortunately we are dealing with German data (where stemming is pretty
hard -- well, we would even have access to a great stemmer, but it has
a 500MB+ memory footprint and isn't reentrant ..).
Thanks for your input
RalfD