[Xapian-discuss] TermGenerator incorrectly tokenizes German text which contains special characters

Bjorn Lamers bjorn.lamers at gmail.com
Mon Jun 14 17:22:16 BST 2010


After my test I can conclude the following:

$lIndexer->index_text("woörd")

The terms after executing the statement above are: {woö, rd}. After
replacing the HTML entities with their actual characters, it works like a
charm. The only weird thing is that the indexer recognized part of an HTML
entity as a character.
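
A minimal sketch of that workaround, assuming the text is UTF-8 and that
PHP's standard html_entity_decode() is an acceptable way to resolve the
entities. The names decode_entities() and index_decoded_text() are
hypothetical helpers, not part of the Xapian bindings; the only Xapian call
used is index_text(), as in the statement above.

```php
<?php
// Hypothetical helper (not part of Xapian): turn HTML entities such as
// &ouml; into real UTF-8 characters before the text reaches the indexer.
function decode_entities($text)
{
    // ENT_QUOTES also decodes &quot; and &#039;. Passing the charset
    // explicitly makes the result UTF-8 regardless of PHP's default.
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

// Hypothetical wrapper: $indexer is assumed to be any object that has an
// index_text() method, such as the $lIndexer used above.
function index_decoded_text($indexer, $text)
{
    $indexer->index_text(decode_entities($text));
}
```

With this in place, index_decoded_text($lIndexer, "wo&ouml;rd") hands the
indexer the decoded word instead of the raw entity.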


On Mon, Jun 14, 2010 at 8:57 AM, Bjorn Lamers <bjorn.lamers at gmail.com> wrote:

> Sorry for my late reply.
>
> I downloaded my binaries from:
> http://www.flax.co.uk/xapian_binaries
> http://www.flax.co.uk/xapian/120/xapian-1.2.0-bindings-php.zip
>
> Besides that, I think I found my problem; I want to do some extra checks
> later today. But I think it has to do with HTML entities. The only thing I
> don't understand, and which I want to find out, is why &auml; somehow
> gets indexed as ä. So why does it ignore the & and "stop" at the ;
>
> Kind regards,
> Bjorn
>
>
> On Thu, Jun 10, 2010 at 4:07 PM, Olly Betts <olly at survex.com> wrote:
>
>> On Thu, Jun 10, 2010 at 03:00:28PM +0100, Charlie Hull wrote:
>> > On 10/06/2010 03:57, Olly Betts wrote:
>> >> On Wed, Jun 09, 2010 at 04:48:24PM +0200, Bjorn Lamers wrote:
>> >>> Xapian Support enabled Xapian
>> >>> Compiled Version @PACKAGE_VERSION@
>> >>
>> >> Charlie, can you fix that?
>> >
>> > I could, if I knew where it came from! I've checked all the Windows
>> > build files and I'm not sure where this is defined.
>>
>> xapian-bindings/xapian-version.h.in
>>
>> Cheers,
>>    Olly
>>
>>
>
>

