[Xapian-discuss] UTF-8 Corruption

Olly Betts olly at survex.com
Mon Mar 31 03:30:42 BST 2008


On Thu, Mar 20, 2008 at 02:08:00PM +0000, Colin Bell wrote:
> 
> > If you pass data through Xapian::Utf8Iterator before doing anything  
> > with it, then this will fix bad UTF-8.  This is essentially what
> > omindex does to deal with this problem.
> 
> I take it that Xapian::Utf8Iterator will only fix bad UTF-8 not  
> convert to UTF-8?

I'm not quite sure what you're asking.  Utf8Iterator returns Unicode
code point values; when it hits an invalid UTF-8 sequence, it returns
the offending bytes interpreted as ISO-8859-1 characters instead.

But it doesn't ever modify the bytes being iterated over - generally the
fixed sequence would be longer, so this couldn't be done in place
anyway.

So if you want the data as valid UTF-8, you need to read with
Utf8Iterator and write the returned Unicode code point values out again
as UTF-8, e.g. using:

    void Xapian::Unicode::append_utf8 (std::string &s, unsigned ch);

    Append the UTF-8 representation of a single unicode character to
    a std::string. 

Cheers,
    Olly


