[Xapian-discuss] Unicode QueryParser

Olly Betts olly at survex.com
Fri Oct 7 17:13:03 BST 2005


On Fri, Oct 07, 2005 at 10:39:46AM +0200, Marcus Ramberg wrote:
> I've applied this patch now and reindexed with utf8. However, it seems
> to me that xapian still normalizes ø -> o and such. I was under the
> impression that this would go away with the utf8 qp. Did I miss
> something?

It still does that.  I'm intending to look at that when I upgrade the snowball
stemmers to the latest snowball version.

It's pretty easy to hack off in your own copy though.  Just nuke the bits in
accentnormalisingitor.h which refer to TRANSLIT1 and TRANSLIT2.  The two
methods that refer to them should then look something like this (totally
untested):

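    // Decode the UTF-8 character at the current position, with no
    // transliteration.  glib returns (gunichar)-1 or (gunichar)-2 here for
    // an invalid or truncated sequence.
    char_type operator*() const {
        return g_utf8_get_char_validated(itor, end - itor);
    }
    // Advance by one UTF-8 character, clamping at the end of the input if
    // the lead byte claims more bytes than are left.
    AccentNormalisingItor & operator++() {
        size_t skip = g_utf8_skip[*reinterpret_cast<const guchar *>(itor)];
        if (size_t(end - itor) < skip) {
            itor = end;
        } else {
            itor += skip;
        }
        return *this;
    }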

With that change, runs of characters which glib's unicode functions say are
alphanumeric will form terms (just like they do for characters not in the
transliteration table without the change).
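
To see the effect without building against glib, here is a rough standalone
sketch of the same iteration.  Utf8Itor, utf8_skip() and decode() are made-up
names for illustration only (they are not part of Xapian or glib), and the
decoding is far less strict than g_utf8_get_char_validated.  The point is just
that each character comes out as its own code point rather than being mapped
to an ASCII lookalike.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for glib's g_utf8_skip table: the length in bytes of
// the UTF-8 sequence introduced by a lead byte (1 for anything unexpected).
static std::size_t utf8_skip(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical iterator for illustration; not the real AccentNormalisingItor.
class Utf8Itor {
    const char *itor;
    const char *end;
  public:
    Utf8Itor(const char *begin_, const char *end_) : itor(begin_), end(end_) {}

    bool at_end() const { return itor == end; }

    // Decode the character at the current position with no transliteration.
    // Returns 0xFFFFFFFF for a truncated or malformed sequence.
    std::uint32_t operator*() const {
        assert(itor != end);
        unsigned char lead = static_cast<unsigned char>(*itor);
        std::size_t len = utf8_skip(lead);
        if (std::size_t(end - itor) < len) return 0xFFFFFFFFu;
        if (len == 1) return lead < 0x80 ? lead : 0xFFFFFFFFu;
        // Keep only the payload bits of the lead byte.
        std::uint32_t ch = lead & (0xFF >> (len + 1));
        for (std::size_t i = 1; i < len; ++i) {
            unsigned char c = static_cast<unsigned char>(itor[i]);
            if ((c & 0xC0) != 0x80) return 0xFFFFFFFFu;
            ch = (ch << 6) | (c & 0x3F);
        }
        return ch;
    }

    // Advance by one character, clamping at the end of the input.
    Utf8Itor & operator++() {
        std::size_t skip = utf8_skip(static_cast<unsigned char>(*itor));
        if (std::size_t(end - itor) < skip) {
            itor = end;
        } else {
            itor += skip;
        }
        return *this;
    }
};

// Collect the code points of a UTF-8 string using the iterator above.
std::vector<std::uint32_t> decode(const std::string &s) {
    std::vector<std::uint32_t> out;
    for (Utf8Itor it(s.data(), s.data() + s.size()); !it.at_end(); ++it)
        out.push_back(*it);
    return out;
}
```

For example, decode("s\xC3\xB8" "k") (that is, "søk" in UTF-8) gives the three
code points U+0073, U+00F8 and U+006B, so the ø survives as U+00F8 instead of
turning into an o.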

Cheers,
    Olly


