[Xapian-discuss] UTF-8 Corruption
Olly Betts
olly at survex.com
Mon Mar 31 03:30:42 BST 2008
On Thu, Mar 20, 2008 at 02:08:00PM +0000, Colin Bell wrote:
>
> > If you pass data through Xapian::Utf8Iterator before doing anything
> > with it, then this will fix bad UTF-8. This is essentially what
> > omindex does to deal with this problem.
>
> I take it that Xapian::Utf8Iterator will only fix bad UTF-8 not
> convert to UTF-8?
I'm not quite sure what you're asking. Utf8Iterator returns Unicode
code point values. Where it hits an invalid UTF-8 sequence, it returns
the value of each offending byte as if that byte were ISO-8859-1.
But it doesn't ever modify the bytes being iterated over - generally the
fixed sequence would be longer, so this couldn't be done in place
anyway.
So if you want the data as valid UTF-8, you need to read with
Utf8Iterator and write the returned Unicode code point values out again
as UTF-8, e.g. using:
    void Xapian::Unicode::append_utf8 (std::string &s, unsigned ch);
        Append the UTF-8 representation of a single Unicode character to
        a std::string.
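
The read-then-rewrite loop described above can be sketched without Xapian
as follows. This is a stand-alone illustration and not Xapian's own code:
append_utf8_sketch, decode_one and repair_utf8 are hypothetical names, and
the rules used here for what counts as an invalid sequence are an
assumption which may differ in detail from what Utf8Iterator does. With
Xapian itself, the loop would walk a Utf8Iterator over the input and pass
each returned value to Xapian::Unicode::append_utf8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for Xapian::Unicode::append_utf8(): append the
// UTF-8 encoding of code point ch (assumed <= 0x10FFFF) to s.
void append_utf8_sketch(std::string &s, unsigned ch) {
    if (ch < 0x80) {
        s += static_cast<char>(ch);
    } else if (ch < 0x800) {
        s += static_cast<char>(0xC0 | (ch >> 6));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    } else if (ch < 0x10000) {
        s += static_cast<char>(0xE0 | (ch >> 12));
        s += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (ch >> 18));
        s += static_cast<char>(0x80 | ((ch >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    }
}

// Hypothetical stand-in for one step of Utf8Iterator: decode the code
// point starting at in[pos] and advance pos past it. If the bytes there
// are not a valid UTF-8 sequence, consume a single byte and return its
// value, i.e. read that byte as ISO-8859-1.
unsigned decode_one(const std::string &in, std::size_t &pos) {
    const unsigned char b0 = static_cast<unsigned char>(in[pos]);
    if (b0 < 0x80) {
        ++pos;
        return b0;
    }
    std::size_t len = 0;
    unsigned ch = 0;
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2; ch = b0 & 0x1F;
    } else if ((b0 & 0xF0) == 0xE0) {
        len = 3; ch = b0 & 0x0F;
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4; ch = b0 & 0x07;
    }
    bool ok = len != 0 && pos + len <= in.size();
    for (std::size_t i = 1; ok && i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(in[pos + i]);
        if ((b & 0xC0) != 0x80) {
            ok = false;  // not a continuation byte
        } else {
            ch = (ch << 6) | (b & 0x3F);
        }
    }
    // Assumed validity rules: reject overlong forms, surrogates and
    // values beyond U+10FFFF.
    if (ok && ((len == 3 && ch < 0x800) ||
               (len == 4 && (ch < 0x10000 || ch > 0x10FFFF)) ||
               (ch >= 0xD800 && ch <= 0xDFFF))) {
        ok = false;
    }
    if (!ok) {
        ++pos;
        return b0;  // fall back to the ISO-8859-1 reading of this byte
    }
    pos += len;
    return ch;
}

// Read every code point from in and write it back out as UTF-8. The
// input is never modified; a new string is built instead.
std::string repair_utf8(const std::string &in) {
    std::string out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        append_utf8_sketch(out, decode_one(in, pos));
    }
    return out;
}
```

In this sketch a lone ISO-8859-1 byte 0xE9 comes out as the two bytes
0xC3 0xA9, so the repaired string is longer than the input, which is why
the fix can't be done in place.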
Cheers,
Olly