[Xapian-discuss] TermGenerator incorrectly tokenizes German text which contains special characters

Olly Betts olly at survex.com
Fri Jun 25 13:59:48 BST 2010


On Mon, Jun 14, 2010 at 06:22:16PM +0200, Bjorn Lamers wrote:
> After my test I can conclude the following:
> 
> $lIndexer->index_text("wo&ouml;rd")

Yes, Xapian::TermGenerator doesn't parse HTML - you need to decode entities
and strip tags first.
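
As a rough sketch of that preprocessing step (in Python rather than PHP, using
only the standard library; html_to_text is a made-up helper name, not part of
Xapian), something along these lines decodes entities and drops tags before
the text is handed to index_text():

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text content of a piece of HTML, discarding the tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode character references
        # such as &ouml; or &amp; before handle_data() sees the text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    """Hypothetical helper: return markup with tags removed, entities decoded."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(html_to_text("<p>wo&ouml;rd</p>"))  # prints: woörd

# The result is what you would then pass to the indexer, e.g. (assuming a
# TermGenerator object named indexer, not created here):
#     indexer.index_text(html_to_text(markup))
```

This is deliberately minimal: it keeps the contents of <script> and <style>
elements as text, so real pages probably need more filtering than this.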

> The terms after execution of the statement above are: {woö, rd}. After
> replacing the HTML entities with their actual characters it works like a
> charm. The only weird thing is that the indexer recognized part of an HTML
> entity as a character.

In some cases, it will tokenise & as part of a term (the aim is to handle
things like "AT&T" as a single term).  So I suspect the terms are:

wo&ouml rd

And when you print that from PHP into a web page, browsers will treat &ouml
(even without the trailing semicolon) the same as ö.

You can print out terms in the database from the command line using the
delve utility if you want to check such things.

Cheers,
    Olly


