[Xapian-discuss] TermGenerator incorrectly tokenizes German text which contains special characters

Olly Betts olly at survex.com
Thu Jun 10 03:57:45 BST 2010


On Wed, Jun 09, 2010 at 04:48:24PM +0200, Bjorn Lamers wrote:
> I try to index some German text with Xapian using the xapian_php bindings. I
> run Apache 2.2 on Windows using PHP 5.2.13 with the prebuilt Xapian
> bindings from Flax:
> Xapian Support enabled Xapian
> Compiled Version @PACKAGE_VERSION@

Charlie, can you fix that?

> Xapian Linked Version 1.2.0
> 
> The problem is that after indexing text which contains special characters
> like ä, ö, ü and ß, using TermGenerator::index_text (
> http://xapian.org/docs/sourcedoc/html/classXapian_1_1TermGenerator.html#b358784fa685139e8bdd71d37f39573e),
> terms get cut off (stopped) after the special character. For example the
> term gesundheitsschädlich is indexed as gesundheitsschä and Zgesundheitsschä
> (stemmed).
> 
> All character encodings are set to UTF-8, the MySql database is also in
> UTF-8 encoding.
> *
> #1 $lIndexer = new XapianTermGenerator();
> #2 $lStemmer = new XapianStem(XapianHelper::GetStemmer($pLanguage)); // "german"
> #3 $lIndexer->set_stemmer($lStemmer);
> #4 $lDoc = new XapianDocument();
> #5 $lDoc->add_term($lObj->Id);
> #6 $lIndexer->set_document($lDoc);
> #7 $lIndexer->index_text("Nahrungsergänzungsmittel Ausreißer");
> #8 $lIndexer->index_text($lSomeStringFromDb);*
> 
> In the code example just above here the problem only occurs when I try to
> index text on line #8. The string which gets indexed on line #7 is indexed
> correctly ({Zausreiss, Znahrungserganzungsmittel, ausreißer,
> nahrungsergänzungsmittel}). When I force *$lSomeStringFromDb* to be in UTF-8
> encoding the tokens are also incorrect.
> *$lSomeStringFromDb* can either come from the database or from MemCache.

If it works with a literal string but not with a variable containing that
string, it sounds to me like there's something funny about the variable.

What does this show:

    var_dump($lSomeStringFromDb);

I'm wondering if it's an object with some conversion magic which isn't
working quite right.
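
If the var_dump() output is hard to read, looking at the raw bytes may help as
well. The following is only a rough sketch: describe_value() is a hypothetical
helper, not part of Xapian or of the code quoted above. It reports the type of
a value and, for strings, the byte length and a hex dump, so a UTF-8 "ä" (bytes
c3 a4) can be told apart from an ISO-8859-1 "ä" (byte e4) or from a value that
is not a plain string at all.

```php
<?php
// Hypothetical helper (an assumption for illustration, not from the thread):
// summarise what a value really is before it is passed to index_text().
function describe_value($value)
{
    if (!is_string($value)) {
        // A non-string type (e.g. an object) would point at conversion magic.
        return gettype($value);
    }
    // strlen() counts bytes, so a UTF-8 "ä" adds 2 and an ISO-8859-1 "ä" adds 1.
    return sprintf('string(%d) hex=%s', strlen($value), bin2hex($value));
}

echo describe_value("sch\xC3\xA4dlich"), "\n"; // "schädlich" encoded as UTF-8
echo describe_value("sch\xE4dlich"), "\n";     // "schädlich" encoded as ISO-8859-1
```

Calling describe_value($lSomeStringFromDb) just before line #8 would show which
of these cases applies to the database value.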

> I checked the character encoding of the different inputs with the
> PHP-method: mb_detect_encoding. Strings containing special characters have
> encoding UTF-8, strings which do not contain special characters are detected as
> ASCII. The string from #7 is detected as UTF-8.

That's what I'd expect (since ASCII is a subset of UTF-8).

> When I do the checks on the MemCache/database variable I get these results:
> *mb_detect_encoding($lSomeStringFromDb) => ASCII
> mb_check_encoding($lSomeStringFromDb, "UTF-8")  => true*
> 
> No matter what conversions I do the variable is detected as ASCII.

That's OK if it only has ASCII characters in.  A UTF-8 string with only ASCII
characters in is byte for byte identical to an ASCII string with the same
characters.
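
To illustrate the point, here is a minimal sketch. has_non_ascii_bytes() is a
hypothetical helper introduced just for this example. It checks whether a
string contains any byte outside the ASCII range 0x00-0x7F; a string with no
such bytes is valid as both ASCII and UTF-8.

```php
<?php
// Hypothetical helper (an assumption for illustration): returns true if the
// string contains at least one byte outside the ASCII range 0x00-0x7F.
function has_non_ascii_bytes($text)
{
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/[^\x00-\x7F]/', $text) === 1;
}

$plain  = 'gesundheit';                    // ASCII characters only
$umlaut = "gesundheitssch\xC3\xA4dlich";   // contains a UTF-8 encoded "ä"

var_dump(has_non_ascii_bytes($plain));     // bool(false)
var_dump(has_non_ascii_bytes($umlaut));    // bool(true)

// An ASCII-only string is also valid UTF-8 (needs the mbstring extension).
if (function_exists('mb_check_encoding')) {
    var_dump(mb_check_encoding($plain, 'UTF-8')); // bool(true)
}
```

If a string that is supposed to contain "ä" has no non-ASCII bytes, that would
suggest the umlaut is no longer present as raw UTF-8 by the time the string
reaches index_text().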

Cheers,
    Olly


