[Xapian-discuss] UTF-8 Corruption
Colin Bell
colinabell at gmail.com
Wed Apr 2 11:31:15 BST 2008
This is really good stuff. Many thanks for all your help, Olly. I hope
all goes well in NZ.
Regards
Colin
On 31 Mar 2008, at 03:30, Olly Betts wrote:
> On Thu, Mar 20, 2008 at 02:08:00PM +0000, Colin Bell wrote:
>>
>>> If you pass data through Xapian::Utf8Iterator before doing anything
>>> with it, then this will fix bad UTF-8. This is essentially what
>>> omindex does to deal with this problem.
>>
>> I take it that Xapian::Utf8Iterator will only fix bad UTF-8 not
>> convert to UTF-8?
>
> I'm not quite sure what you're asking. Utf8Iterator returns Unicode
> code point values, and for bad UTF-8 sequences, it returns the values
> of those bytes read as ISO-8859-1.
>
> But it doesn't ever modify the bytes being iterated over - generally the
> fixed sequence would be longer, so this couldn't be done in place
> anyway.
>
> So if you want the data as valid UTF-8, you need to read with
> Utf8Iterator and write the returned Unicode code point values out again
> as UTF-8, e.g. using:
>
> void Xapian::Unicode::append_utf8 (std::string &s, unsigned ch);
>
> Append the UTF-8 representation of a single Unicode character to
> a std::string.
>
> Cheers,
> Olly
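
To make the read-then-rewrite idea above concrete, here is a minimal standalone sketch of the technique. It does not use Xapian and is not Xapian's code: repair_utf8_sketch and append_utf8_sketch are hypothetical names invented for this illustration. It decodes each well-formed UTF-8 sequence to a code point, treats any byte that does not start a well-formed sequence as ISO-8859-1, and writes every code point back out as UTF-8. It is deliberately simplified (for example, it does not reject surrogate code points), so edge cases may differ from what Utf8Iterator does.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for this sketch: append the UTF-8 encoding of code
// point ch to s (the role Xapian::Unicode::append_utf8 plays in the thread).
static void append_utf8_sketch(std::string& s, unsigned ch) {
    if (ch < 0x80) {
        s += static_cast<char>(ch);
    } else if (ch < 0x800) {
        s += static_cast<char>(0xC0 | (ch >> 6));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    } else if (ch < 0x10000) {
        s += static_cast<char>(0xE0 | (ch >> 12));
        s += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (ch >> 18));
        s += static_cast<char>(0x80 | ((ch >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    }
}

// Hypothetical helper for this sketch: return a valid UTF-8 copy of `in`.
// The input is never modified; the result may be longer than the input.
std::string repair_utf8_sketch(const std::string& in) {
    std::string out;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;   // bytes consumed by this code point
        unsigned ch = b;       // default: the byte value itself
        if (b >= 0xC2 && b <= 0xDF)      { len = 2; ch = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { len = 3; ch = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { len = 4; ch = b & 0x07; }

        // Check that every continuation byte is present and well-formed.
        bool ok = (i + len <= n);
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else ch = (ch << 6) | (c & 0x3F);
        }
        // Reject overlong and out-of-range encodings.
        if (ok && len == 3 && ch < 0x800) ok = false;
        if (ok && len == 4 && (ch < 0x10000 || ch > 0x10FFFF)) ok = false;

        // Bad sequence: read just this one byte as ISO-8859-1.
        if (!ok) { len = 1; ch = b; }

        append_utf8_sketch(out, ch);
        i += len;
    }
    return out;
}
```

With Xapian itself, the same shape applies as described in the quoted message: iterate over the input with Xapian::Utf8Iterator and pass each returned value to Xapian::Unicode::append_utf8, building the result in a separate std::string.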