[Xapian-discuss] TermGenerator incorrectly tokenizes German text which contains special characters

Charlie Hull charlie at juggler.net
Thu Jun 10 15:00:28 BST 2010


On 10/06/2010 03:57, Olly Betts wrote:
> On Wed, Jun 09, 2010 at 04:48:24PM +0200, Bjorn Lamers wrote:
>> I am trying to index some German text with Xapian using the xapian_php
>> bindings. I run Apache 2.2 on Windows using PHP 5.2.13 with the prebuilt
>> Xapian bindings from Flax:
>> Xapian Support            enabled
>> Xapian Compiled Version   @PACKAGE_VERSION@
>
> Charlie, can you fix that?

I could, if I knew where it came from! I've checked all the Windows 
build files and I'm not sure where this is defined.

Bjorn, can you tell me where the string "Compiled Version 
@PACKAGE_VERSION@" comes from? How did you display this?

Thanks

Charlie

>
>> Xapian Linked Version 1.2.0
>>
>> The problem is that after indexing text which contains special characters
>> like ä, ö, ü and ß, using TermGenerator::index_text (
>> http://xapian.org/docs/sourcedoc/html/classXapian_1_1TermGenerator.html#b358784fa685139e8bdd71d37f39573e),
>> terms get cut off (stopped) after the special character. For example the
>> term gesundheitsschädlich is indexed as gesundheitsschä and Zgesundheitsschä
>> (stemmed).
>>
>> All character encodings are set to UTF-8, the MySQL database is also in
>> UTF-8 encoding.
>>
>> #1 $lIndexer = new XapianTermGenerator();
>> #2 $lStemmer = new XapianStem(XapianHelper::GetStemmer($pLanguage)); // "german"
>> #3 $lIndexer->set_stemmer($lStemmer);
>> #4 $lDoc = new XapianDocument();
>> #5 $lDoc->add_term($lObj->Id);
>> #6 $lIndexer->set_document($lDoc);
>> #7 $lIndexer->index_text("Nahrungsergänzungsmittel Ausreißer");
>> #8 $lIndexer->index_text($lSomeStringFromDb);
>>
>> In the code example above, the problem only occurs when I index the text
>> on line #8. The string which gets indexed on line #7 is indexed
>> correctly ({Zausreiss, Znahrungserganzungsmittel, ausreißer,
>> nahrungsergänzungsmittel}). When I force $lSomeStringFromDb to be in UTF-8
>> encoding the tokens are still incorrect.
>> $lSomeStringFromDb can either come from the database or from MemCache.
>
> If it works with a literal string but not with a variable containing that
> string, it sounds to me like there's something funny about the variable.
>
> What does this show:
>
>      var_dump($lSomeStringFromDb);
>
> I'm wondering if it's an object with some conversion magic which isn't
> working quite right.
>
>> I checked the character encoding of the different inputs with the PHP
>> function mb_detect_encoding. Strings containing special characters are
>> detected as UTF-8; strings which do not contain special characters are
>> detected as ASCII. The string from #7 is detected as UTF-8.
>
> That's what I'd expect (since ASCII is a subset of UTF-8).
>
>> When I do the checks on the MemCache/database variable I get these results:
>> mb_detect_encoding($lSomeStringFromDb) => ASCII
>> mb_check_encoding($lSomeStringFromDb, "UTF-8") => true
>>
>> No matter what conversions I do the variable is detected as ASCII.
>
> That's OK if it only has ASCII characters in.  A UTF-8 string with only ASCII
> characters in is byte for byte identical to an ASCII string with the same
> characters.
>
> Cheers,
>      Olly
>
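
A minimal sketch of the kind of byte-level check that may help here, in the
spirit of Olly's var_dump() suggestion. It uses only standard PHP functions
(is_string, gettype, strlen, bin2hex, preg_match, and mb_check_encoding from
the mbstring extension). describe_string() and the sample strings are
hypothetical, made up for illustration; they are not part of Xapian and are
not taken from Bjorn's code or data.

```php
<?php
// Hypothetical helper (not part of Xapian or PHP): report what a value
// actually contains, byte for byte, before it is passed to index_text().
function describe_string($s)
{
    if (!is_string($s)) {
        // An object or other non-string type would point at a conversion issue.
        return array('type' => gettype($s));
    }
    return array(
        'type'       => 'string',
        'bytes'      => strlen($s),                        // byte count, not character count
        'hex'        => bin2hex($s),                       // raw bytes as hex
        'valid_utf8' => mb_check_encoding($s, 'UTF-8'),
        'ascii_only' => !preg_match('/[\x80-\xFF]/', $s),  // true if no byte is >= 0x80
    );
}

// Illustrative inputs (assumptions, not data from the thread):
$utf8   = "sch\xC3\xA4dlich"; // "schädlich" in UTF-8: ä is the two bytes C3 A4
$latin1 = "sch\xE4dlich";     // the same word in ISO-8859-1: ä is the single byte E4
$ascii  = "Ausreisser";       // ASCII only, so it is also valid UTF-8

foreach (array($utf8, $latin1, $ascii) as $s) {
    print_r(describe_string($s));
}
```

If the database string reports ascii_only as true, it contains no ä, ö, ü or
ß as UTF-8 bytes at all, which would be consistent with mb_detect_encoding()
returning ASCII. Comparing its hex dump with that of the literal on line #7
should show where the two strings differ.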
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>



