[Xapian-discuss] UTF-8 Corruption

Tue Mar 18 03:56:58 GMT 2008

On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
> I was wondering if anyone ever came across a problem I seem to be
> having. I'm indexing text files using some basic code written in C++.
> The text files may be in UTF-8, ISO 8859-1, or possibly (but very
> rarely) even some other format - I have no way of knowing.

There are ways to detect the character set of a file, though not always
100% reliably.
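
One cheap heuristic is to check whether the bytes are well-formed UTF-8:
non-ASCII ISO-8859-1 text rarely happens to be valid UTF-8, so a failure is a
strong hint that the file uses some other encoding.  The following is only a
minimal sketch of that idea; is_valid_utf8() is a hypothetical helper, not part
of Xapian, and it cannot tell ISO-8859-1 apart from other 8-bit encodings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Xapian API): returns true if `s` consists
// entirely of well-formed UTF-8 sequences.  Rejects overlong encodings,
// UTF-16 surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {          // Plain ASCII byte.
            ++i;
            continue;
        }
        std::size_t len;         // Total length of the sequence in bytes.
        unsigned long cp;        // Code point being accumulated.
        unsigned long min_cp;    // Smallest code point allowed for this length.
        if ((c & 0xE0) == 0xC0) {
            len = 2; cp = c & 0x1F; min_cp = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            len = 3; cp = c & 0x0F; min_cp = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            len = 4; cp = c & 0x07; min_cp = 0x10000;
        } else {
            return false;        // Stray continuation byte or invalid lead byte.
        }
        if (i + len > n) return false;   // Sequence truncated at end of input.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;   // Not a continuation byte.
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp) return false;                   // Overlong encoding.
        if (cp > 0x10FFFF) return false;                 // Beyond Unicode range.
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // Surrogate.
        i += len;
    }
    return true;
}
```

A file for which is_valid_utf8() returns false could then be treated as
ISO-8859-1 (or passed to a fuller charset detector) before indexing.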

> Question is, does Xapian convert non-UTF-8 characters when it stores
> the document? I think I read that UTF-8 is the default encoding for
> Xapian, which is exactly what I am after.

Most of Xapian treats text as opaque data.  The classes which need to
know the encoding are Xapian::Stem, Xapian::QueryParser, and
Xapian::TermGenerator.  The UTF-8 parsing used by the latter two
treats invalid sequences as if they were ISO-8859-1, which for
real-world examples will almost always do the right thing when fed
ISO-8859-1.  Xapian::Stem currently uses Snowball's UTF-8 parsing code -
I'm not sure how that handles invalid sequences.

> The reason I'm asking is that I am getting some seriously corrupted
> characters in the index. When they are displayed on Tomcat I get a
> "sun.io.MalformedInputException" when trying to display the search
> results. I have set the page's charset to UTF-8 and apparently this
> error is thrown when the stream reader detects characters that are
> not proper UTF-8 characters.

If you set document data, document values, or directly add terms (using
Document::add_posting() or Document::add_term()) then you'll get back
what you put in verbatim.  So if you pass in something which is invalid
UTF-8, it will still be invalid.

If you pass data through Xapian::Utf8Iterator before doing anything with
it, then this will fix bad UTF-8.  This is essentially what omindex
does to deal with this problem.
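
To illustrate the idea without depending on Xapian, here is a standalone
sketch of the same fallback: decode the input as UTF-8, treat any byte that
does not start a well-formed sequence as an ISO-8859-1 character, and
re-encode everything as UTF-8.  The function names are hypothetical, and this
is not Xapian's code; details such as how overlong sequences are handled may
differ from what Xapian::Utf8Iterator actually does.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append code point `cp` (assumed <= 0x10FFFF) to `out`
// encoded as UTF-8.
void append_as_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: try to decode one UTF-8 sequence starting at s[i].
// Returns its length in bytes and stores the code point in `cp`, or returns 0
// if the bytes at that position are not a well-formed sequence.
std::size_t decode_utf8_at(const std::string& s, std::size_t i,
                           unsigned long& cp) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t len;
    unsigned long min_cp;
    if (c < 0x80) { cp = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min_cp = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min_cp = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min_cp = 0x10000; }
    else return 0;                       // Invalid lead byte.
    if (i + len > s.size()) return 0;    // Truncated sequence.
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char cc = static_cast<unsigned char>(s[i + k]);
        if ((cc & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (cc & 0x3F);
    }
    if (cp < min_cp || cp > 0x10FFFF) return 0;   // Overlong or out of range.
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   // Surrogate.
    return len;
}

// Returns a UTF-8 string in which every byte of `in` that is not part of a
// well-formed UTF-8 sequence has been reinterpreted as ISO-8859-1.
std::string repair_utf8(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned long cp = 0;
        std::size_t len = decode_utf8_at(in, i, cp);
        if (len == 0) {
            // ISO-8859-1 byte values map directly to U+0000..U+00FF.
            cp = static_cast<unsigned char>(in[i]);
            len = 1;
        }
        append_as_utf8(out, cp);
        i += len;
    }
    return out;
}
```

Running text through something like repair_utf8() before passing it to
Document::set_data(), Document::add_value(), or Document::add_term() means
that whatever comes back out of the index is valid UTF-8.  Already valid UTF-8
passes through unchanged.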

Cheers,
    Olly