[Xapian-discuss] TermGenerator incorrectly tokenizes German text which contains special characters

Bjorn Lamers bjorn.lamers at gmail.com
Wed Jun 9 15:48:24 BST 2010


Dear Xapian users,

I am trying to index some German text with Xapian using the xapian_php
bindings. I run Apache 2.2 on Windows using PHP 5.2.13 with the prebuilt
Xapian bindings from Flax. phpinfo() reports:
Xapian Support:          enabled
Xapian Compiled Version: @PACKAGE_VERSION@
Xapian Linked Version:   1.2.0

The problem is that after indexing text which contains special characters
like ä, ö, ü and ß, using TermGenerator::index_text (
http://xapian.org/docs/sourcedoc/html/classXapian_1_1TermGenerator.html#b358784fa685139e8bdd71d37f39573e),
terms get cut off (stopped) after the special character. For example the
term gesundheitsschädlich is indexed as gesundheitsschä and Zgesundheitsschä
(stemmed).

All character encodings are set to UTF-8; the MySQL database is also in
UTF-8 encoding.

#1 $lIndexer = new XapianTermGenerator();
#2 $lStemmer = new XapianStem(XapianHelper::GetStemmer($pLanguage)); // "german"
#3 $lIndexer->set_stemmer($lStemmer);
#4 $lDoc = new XapianDocument();
#5 $lDoc->add_term($lObj->Id);
#6 $lIndexer->set_document($lDoc);
#7 $lIndexer->index_text("Nahrungsergänzungsmittel Ausreißer");
#8 $lIndexer->index_text($lSomeStringFromDb);

In the code example above, the problem only occurs for the text indexed on
line #8. The string indexed on line #7 is indexed correctly ({Zausreiss,
Znahrungserganzungsmittel, ausreißer, nahrungsergänzungsmittel}). Even when
I force *$lSomeStringFromDb* to UTF-8 encoding, the tokens are still
incorrect. *$lSomeStringFromDb* can come either from the database or from
MemCache.

I checked the character encoding of the different inputs with the PHP
function mb_detect_encoding. Strings containing special characters are
detected as UTF-8; strings that do not contain special characters are
detected as ASCII. The string from line #7 is detected as UTF-8.

mb_detect_encoding("Nahrungsergänzungsmittel Ausreißer")  => UTF-8
mb_detect_encoding("Nahrungserganzungsmittel Ausreisser") => ASCII
mb_check_encoding("Nahrungsergänzungsmittel Ausreißer", "UTF-8")  => true
mb_check_encoding("Nahrungserganzungsmittel Ausreisser", "UTF-8") => true

When I do the checks on the MemCache/database variable I get these results:
mb_detect_encoding($lSomeStringFromDb) => ASCII
mb_check_encoding($lSomeStringFromDb, "UTF-8")  => true
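
Since mb_detect_encoding only guesses from the bytes it sees, a byte-level
dump would show what such a string really contains. A minimal sketch using
the built-in bin2hex() and chunk_split(); dumpBytes() is a hypothetical
helper, not part of my code, and the sample strings are the ones from the
checks above:

```php
<?php
// Hypothetical helper (not part of the original code): render a string's
// raw bytes as space-separated hex pairs.
function dumpBytes($str)
{
    // bin2hex() emits two hex digits per byte; chunk_split() appends a
    // space after each pair, and trim() drops the trailing one.
    return trim(chunk_split(bin2hex($str), 2, ' '));
}

// "\xC3\x9F" is the UTF-8 byte sequence for "ß", written with escapes so the
// result does not depend on the encoding this source file is saved in.
echo dumpBytes("Ausrei\xC3\x9Fer"), "\n"; // 41 75 73 72 65 69 c3 9f 65 72

// The transliterated form contains only 7-bit bytes (all below 0x80).
echo dumpBytes("Ausreisser"), "\n";
```

Running dumpBytes($lSomeStringFromDb) would show whether the special
characters arrive as multi-byte UTF-8 sequences or as something else.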

No matter what conversions I apply, the variable is still detected as ASCII:

// http://www.php.net/manual/en/function.utf8-encode.php#89789
function fixEncoding($in_str)
{
    $cur_encoding = mb_detect_encoding($in_str);
    if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8")) {
        return $in_str;
    } else {
        return utf8_encode($in_str);
    }
} // fixEncoding

if (mb_detect_encoding($lString) == "ASCII") {
    $lString = mb_convert_encoding($lString, "UTF-8", "ASCII");
}
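
Note that converting from ASCII to UTF-8 cannot change a string that is
already detected as ASCII, because every 7-bit ASCII byte sequence is also
valid UTF-8. A small check, assuming the mbstring extension is loaded; the
variable names are illustrative only:

```php
<?php
// Illustrative sample: contains only 7-bit ASCII characters.
$ascii = "Nahrungserganzungsmittel Ausreisser";

// Converting ASCII to UTF-8 leaves the bytes untouched.
$converted = mb_convert_encoding($ascii, "UTF-8", "ASCII");

var_dump($converted === $ascii);              // bool(true)
var_dump(mb_check_encoding($ascii, "UTF-8")); // bool(true)
```

So a result of ASCII from mb_detect_encoding means the string holds no bytes
above 0x7F at all, whatever conversion is applied afterwards.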

Only when I prepend a special character to the variable is it detected as
UTF-8 (in every case mb_check_encoding confirmed that the string is valid
UTF-8). The generated terms are still incorrect, though.

mb_detect_encoding("ä " . $lSomeStringFromDb) => UTF-8
mb_check_encoding("ä " . $lSomeStringFromDb, "UTF-8")  => true

To sum up my encoding problems:
Text which contains special characters is not correctly indexed (German
text). Terms are cut off just after a special character. I’m pretty sure my
variables/objects are all in UTF-8 format, but they are not properly
indexed. When I copy the contents of my variables/objects into strings in
PHP the content is properly indexed.

What could be wrong with the variables? Why aren’t they indexed properly?

Thanks in advance.

Best regards,

Bjorn

