[Xapian-discuss] UTF-8 Corruption

Thu Mar 20 14:08:00 GMT 2008

Thanks Olly

Very much appreciated as always.

>
> If you pass data through Xapian::Utf8Iterator before doing anything with
> it, then this will fix bad UTF-8.  This is essentially what omindex
> does to deal with this problem.

I take it that Xapian::Utf8Iterator will only fix bad UTF-8, not
convert to UTF-8?
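
To make the question concrete, here is a rough standalone sketch of what
"fixing" could look like if invalid sequences are treated as ISO-8859-1,
which is the behaviour described further down this thread for the UTF-8
parsing in Xapian::QueryParser and Xapian::TermGenerator. This is not
Xapian's code: the function names (decode_one_sketch, append_utf8_sketch,
repair_utf8_sketch) are made up for illustration, and the real
Xapian::Utf8Iterator may differ in detail.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode one well-formed UTF-8 sequence starting at
// s[i]. Returns its length in bytes, or 0 if the bytes there are not
// well-formed UTF-8 (truncated, overlong, surrogate, or out of range).
std::size_t decode_one_sketch(const std::string& s, std::size_t i,
                              unsigned long& cp) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (c < 0x80) { cp = c; return 1; }  // plain ASCII
    std::size_t len;
    unsigned long min_cp;
    if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min_cp = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min_cp = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min_cp = 0x10000; }
    else return 0;  // stray continuation byte or invalid lead byte
    if (s.size() - i < len) return 0;  // sequence runs off the end
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char cc = static_cast<unsigned char>(s[i + k]);
        if ((cc & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (cc & 0x3F);
    }
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

// Hypothetical helper: append code point cp to out, encoded as UTF-8.
void append_utf8_sketch(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Re-encode the input as UTF-8. Well-formed sequences pass through
// unchanged; any byte that does not start one is reinterpreted as a
// single ISO-8859-1 character (assumption taken from this thread).
std::string repair_utf8_sketch(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned long cp;
        std::size_t len = decode_one_sketch(in, i, cp);
        if (len == 0) {
            cp = static_cast<unsigned char>(in[i]);  // Latin-1 fallback
            len = 1;
        }
        append_utf8_sketch(out, cp);
        i += len;
    }
    return out;
}
```

Under that assumption, text that is already valid UTF-8 comes out
unchanged, and typical ISO-8859-1 text ends up converted to UTF-8 as a
side effect. Text in any other encoding would still come out as valid
UTF-8, but with the wrong characters.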

> On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
>> I was wondering if anyone ever came across a problem I seem to be
>> having. I'm indexing text files using some basic code written in C++.
>> The text files may or may not be in UTF-8, ISO 8859-1 or possibly
>> (but very rarely) even some other format - I have no way of knowing.
>
> There are ways to detect the character set of a file, though not always
> 100% reliably.

Can anyone recommend some C++ code to do this?
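
For the UTF-8 versus ISO 8859-1 case, one simple heuristic is to check
whether the bytes are well-formed UTF-8 and fall back to assuming
ISO 8859-1 if they are not. Below is a minimal sketch of that check using
only the standard library. The function name looks_like_utf8 is made up
for illustration, and this does not detect any other encodings.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8. Overlong forms, UTF-16
// surrogates and code points above U+10FFFF are rejected.
bool looks_like_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {  // plain ASCII byte
            ++i;
            continue;
        }
        std::size_t len;       // expected sequence length in bytes
        unsigned long cp;      // code point being assembled
        unsigned long min_cp;  // smallest code point allowed at this length
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min_cp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min_cp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (n - i < len) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp) return false;                     // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate
        if (cp > 0x10FFFF) return false;                   // beyond Unicode
        i += len;
    }
    return true;
}
```

Pure ASCII passes the check, which is harmless since ASCII is the same in
both encodings. ISO 8859-1 text containing accented characters will
usually fail it, because a high byte followed by an ordinary letter is
not a valid UTF-8 sequence, but the heuristic can be fooled in rare cases.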

Regards

Colin

On 18 Mar 2008, at 03:56, Olly Betts wrote:

> On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
>> I was wondering if anyone ever came across a problem I seem to be
>> having. I'm indexing text files using some basic code written in C++.
>> The text files may or may not be in UTF-8, ISO 8859-1 or possibly
>> (but very rarely) even some other format - I have no way of knowing.
>
> There are ways to detect the character set of a file, though not always
> 100% reliably.
>
>> Question is, does Xapian convert non-UTF-8 characters when it stores
>> the document? I think I read that UTF-8 is the default encoding for
>> Xapian, which is exactly what I am after.
>
> Most of Xapian treats things as opaque data.  The classes which need
> to know are Xapian::Stem, Xapian::QueryParser, and
> Xapian::TermGenerator.  The UTF-8 parsing used by the latter two will
> treat invalid sequences as if they were ISO-8859-1, which for
> real-world examples will almost always do the right thing when fed
> ISO-8859-1.  Xapian::Stem uses Snowball's UTF-8 parsing code currently -
> I'm not sure how that handles invalid sequences.
>
>> The reason I'm asking is that I am getting some seriously corrupted
>> characters in the index. When they are displayed on Tomcat I get a
>> "sun.io.MalformedInputException" when trying to display the search
>> results. I have set the page's charset to UTF-8 and apparently this
>> error is thrown when the stream reader detects characters that are
>> not proper UTF-8 characters.
>
> If you set document data, document values, or directly add terms (using
> Document::add_posting() or Document::add_term()) then you'll get back
> what you put in verbatim.  So if you pass in something which is invalid
> UTF-8, it will still be invalid.
>
> If you pass data through Xapian::Utf8Iterator before doing anything with
> it, then this will fix bad UTF-8.  This is essentially what omindex
> does to deal with this problem.
>
> Cheers,
>    Olly