[Xapian-discuss] queryparser thinks ø is o

Tue Sep 13 05:08:08 BST 2005

On Sun, Aug 28, 2005 at 02:14:15PM +0200, R. Mattes wrote:
> Yes, the queryparser itself modifies characters. The code that does this
> is in 'xapian/xapian-core/queryparser/accentnormalisingitor.h'. IMHO
> this is a rather "murky" and anglocentric part of the Xapian codebase.

It is perhaps murky, but not really anglocentric - very few English
words use diacritical marks, and the remaining few seem to be
disappearing.

It's more germanocentric if anything.  This accent normalisation arises
out of what we usually used to do with Muscat 3.6.  Back then the
stemming algorithms had some quirky scheme of their own for representing
accents (it involved '^'), but we eschewed this in favour of simply
normalising accents before stemming.  This was easier than trying to
translate them into '^'-form, and had the additional benefit that
searches with the accents transliterated would match documents where
they weren't and vice versa.
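
To make the idea concrete, here is a minimal Python sketch of normalising
accents before stemming. It is not the Xapian code: the names EXTRA,
strip_accents and normalise_term are made up for this illustration, and the
explicit table entry for 'ø' is an assumption (that character has no Unicode
decomposition, so it has to be mapped by hand).

```python
import unicodedata

# Characters with no Unicode decomposition need an explicit entry.
# Illustrative table only, not the one Xapian uses.
EXTRA = str.maketrans({"ø": "o", "Ø": "O"})


def strip_accents(text):
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise_term(word):
    """Normalise a word the same way at index time and at query time."""
    return strip_accents(word).lower()
```

Because the same function would be applied to both documents and queries,
"Müller" and "Muller" end up as the same term, so either spelling matches
the other.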

The main downside is occasional conflation of terms which shouldn't
be conflated (not just in Norwegian - for example the French for "peach"
and "fish" differ only by accents, and I suspect examples can be found
in other languages).

The transliteration should also really be language dependent - in German
ä -> ae, but that's not appropriate in Swedish I believe.  But
language dependent normalisation is what the stemming algorithms do!  So
I think this really should get folded into the stemming algorithms in
languages where it makes sense (and languages where it doesn't wouldn't
do anything).
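
A rough Python sketch of what language-dependent normalisation could look
like follows. Everything in it is an assumption made for illustration: the
TRANSLITERATIONS table, the language codes, and the normalise function are
hypothetical, and falling back to plain accent stripping for anything not in
a table is just one possible choice, not what any stemming algorithm
actually does.

```python
import unicodedata

# Hypothetical per-language tables, applied before generic accent stripping.
TRANSLITERATIONS = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    "sv": {},  # no digraph expansion for Swedish in this sketch
}


def normalise(word, lang):
    """Apply the language's own transliterations, then strip what is left."""
    # Compose first so that 'a' + combining diaeresis is seen as 'ä'.
    word = unicodedata.normalize("NFC", word.lower())
    table = TRANSLITERATIONS.get(lang, {})
    word = "".join(table.get(ch, ch) for ch in word)
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With these tables, normalise("Bär", "de") gives "baer" while
normalise("Bär", "sv") gives "bar", so the German expansion no longer leaks
into other languages.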

Cheers,
    Olly