[Xapian-discuss] UTF-8 Corruption
Colin Bell
colinabell at gmail.com
Thu Mar 20 14:08:00 GMT 2008
Thanks Olly
Very much appreciated as always.
>
> If you pass data through Xapian::Utf8Iterator before doing anything
> with it, then this will fix bad UTF-8. This is essentially what
> omindex does to deal with this problem.
I take it that Xapian::Utf8Iterator will only fix bad UTF-8, not
convert other encodings to UTF-8?
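
To make the question concrete: the quoted reply further down says Xapian's UTF-8 parsing treats invalid sequences as if they were ISO-8859-1. The following is a minimal standalone sketch of that idea, not Xapian's own code. The names repair_utf8, decode_utf8 and append_utf8 are hypothetical helpers written for this illustration, and the exact validity rules Xapian applies may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append code point cp to out, encoded as UTF-8.
void append_utf8(std::string& out, unsigned cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Try to decode one UTF-8 sequence starting at s[i].
// Returns its length in bytes and stores the code point in cp,
// or returns 0 if the bytes at s[i] are not a valid sequence.
std::size_t decode_utf8(const std::string& s, std::size_t i, unsigned& cp) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    unsigned min_cp = 0;
    if (b0 < 0x80) { cp = b0; return 1; }
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return 0;                      // stray continuation or invalid lead byte
    if (i + len > s.size()) return 0;   // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong forms, surrogates and out-of-range values.
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// Return a valid UTF-8 copy of in. Any byte that does not start a valid
// sequence is assumed to be ISO-8859-1, where the byte value equals the
// Unicode code point, and is re-encoded as UTF-8.
std::string repair_utf8(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned cp = 0;
        std::size_t n = decode_utf8(in, i, cp);
        if (n == 0) {
            cp = static_cast<unsigned char>(in[i]);
            n = 1;
        }
        append_utf8(out, cp);
        i += n;
    }
    return out;
}
```

With this sketch, valid UTF-8 passes through unchanged, while an ISO-8859-1 string such as "caf\xE9" comes out as "caf\xC3\xA9". So under this fallback, ISO-8859-1 input is in effect converted, but text in any other 8-bit encoding would be misread as ISO-8859-1.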
> On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
>> I was wondering if anyone ever came across a problem I seem to be
>> having. I'm indexing text files using some basic code written in
>> C++. The text files may or may not be in UTF-8, ISO 8859-1 or possibly
>> (but very rarely) even some other format - I have no way of knowing.
>
> There are ways to detect the character set of a file, though not
> always 100% reliably.
Can anyone recommend some C++ code to do this?
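
To show the kind of thing I mean, here is a rough sketch of the simplest heuristic I can think of: if the bytes are all ASCII, call it ASCII; if they form well-formed UTF-8, call it UTF-8; otherwise assume some 8-bit encoding such as ISO-8859-1. The names Charset, is_valid_utf8 and guess_charset are made up for this example, and it cannot tell one 8-bit encoding from another, so it is only a starting point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class Charset { Ascii, Utf8, Unknown8Bit };

// True if s consists entirely of well-formed UTF-8 sequences.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned cp = 0, min_cp = 0;
        if (b0 < 0x80) { ++i; continue; }
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;                    // invalid lead byte
        if (i + len > s.size()) return false; // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and out-of-range values.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// Guess the encoding of s. Text with no bytes >= 0x80 is reported as ASCII;
// anything else is UTF-8 if it validates, otherwise an unknown 8-bit charset.
Charset guess_charset(const std::string& s) {
    bool has_high_byte = false;
    for (unsigned char c : s) {
        if (c >= 0x80) { has_high_byte = true; break; }
    }
    if (!has_high_byte) return Charset::Ascii;
    return is_valid_utf8(s) ? Charset::Utf8 : Charset::Unknown8Bit;
}
```

The reasoning is that ISO-8859-1 text containing accented characters rarely happens to form valid UTF-8 sequences, so a successful validation is a fairly strong hint. A file read into a std::string could be passed to guess_charset() and then converted only when the result is Unknown8Bit.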
Regards
Colin
On 18 Mar 2008, at 03:56, Olly Betts wrote:
> On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
>> I was wondering if anyone ever came across a problem I seem to be
>> having. I'm indexing text files using some basic code written in
>> C++. The text files may or may not be in UTF-8, ISO 8859-1 or possibly
>> (but very rarely) even some other format - I have no way of knowing.
>
> There are ways to detect the character set of a file, though not
> always 100% reliably.
>
>> Question is, does Xapian convert non-UTF-8 characters when it stores
>> the document? I think I read that UTF-8 is the default encoding for
>> Xapian, which is exactly what I am after.
>
> Most of Xapian treats things as opaque data. The classes which need
> to know are Xapian::Stem, Xapian::QueryParser, and
> Xapian::TermGenerator. The UTF-8 parsing used by the latter two will
> treat invalid sequences as if they were ISO-8859-1, which for
> real-world examples will almost always do the right thing when fed
> ISO-8859-1. Xapian::Stem uses Snowball's UTF-8 parsing code
> currently - I'm not sure how that handles invalid sequences.
>
>> The reason I'm asking is that I am getting some seriously corrupted
>> characters in the index. When they are displayed on Tomcat I get a
>> "sun.io.MalformedInputException" when trying to display the search
>> results. I have set the page's charset to UTF-8 and apparently this
>> error is thrown when the stream reader detects characters that are
>> not proper UTF-8 characters.
>
> If you set document data, document values, or directly add terms
> (using Document::add_posting() or Document::add_term()) then you'll
> get back what you put in verbatim. So if you pass in something which
> is invalid UTF-8, it will still be invalid.
>
> If you pass data through Xapian::Utf8Iterator before doing anything
> with it, then this will fix bad UTF-8. This is essentially what
> omindex does to deal with this problem.
>
> Cheers,
> Olly