[Xapian-discuss] UTF-8 Corruption

Colin Bell colinabell at gmail.com
Wed Apr 2 11:31:15 BST 2008


This is really good stuff. Many thanks for all your help, Olly. I hope
all goes well in NZ.

Regards

Colin

On 31 Mar 2008, at 03:30, Olly Betts wrote:

> On Thu, Mar 20, 2008 at 02:08:00PM +0000, Colin Bell wrote:
>>
>>> If you pass data through Xapian::Utf8Iterator before doing anything
>>> with it, then this will fix bad UTF-8.  This is essentially what
>>> omindex does to deal with this problem.
>>
>> I take it that Xapian::Utf8Iterator will only fix bad UTF-8, not
>> convert to UTF-8?
>
> I'm not quite sure what you're asking.  Utf8Iterator returns Unicode
> code point values, and for a bad UTF-8 sequence it returns the
> offending bytes interpreted as ISO-8859-1.
>
> But it doesn't ever modify the bytes being iterated over - generally
> the fixed sequence would be longer, so this couldn't be done in place
> anyway.
>
> So if you want the data as valid UTF-8, you need to read with
> Utf8Iterator and write the returned Unicode code point values out
> again as UTF-8, e.g. using:
>
>   void Xapian::Unicode::append_utf8 (std::string &s, unsigned ch);
>
>   Append the UTF-8 representation of a single unicode character to
>   a std::string.
>
> Cheers,
>   Olly
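
For reference, the read-and-rewrite approach Olly describes can be sketched
without depending on Xapian. This is only an illustration of the idea, not
Xapian's actual code: append_utf8_sketch, next_code_point and fix_utf8 are
hypothetical names, and the handling of edge cases (surrogates, for instance)
is an assumption that may differ from what Utf8Iterator really does. With
Xapian itself, the loop would instead iterate a Xapian::Utf8Iterator over the
input and call Xapian::Unicode::append_utf8 on each value.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for Xapian::Unicode::append_utf8: append the
// UTF-8 encoding of code point ch to s.
void append_utf8_sketch(std::string &s, unsigned ch) {
    if (ch < 0x80) {
        s += static_cast<char>(ch);
    } else if (ch < 0x800) {
        s += static_cast<char>(0xC0 | (ch >> 6));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    } else if (ch < 0x10000) {
        s += static_cast<char>(0xE0 | (ch >> 12));
        s += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (ch >> 18));
        s += static_cast<char>(0x80 | ((ch >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    }
}

// Hypothetical stand-in for one step of Utf8Iterator: decode the
// sequence starting at in[pos] and advance pos past it. If the bytes
// are not a well-formed sequence, consume a single byte and return it
// as its ISO-8859-1 value (which equals the byte value).
unsigned next_code_point(const std::string &in, std::size_t &pos) {
    unsigned char b0 = static_cast<unsigned char>(in[pos]);
    if (b0 < 0x80) {
        ++pos;
        return b0;
    }
    std::size_t len = 0;   // expected sequence length, 0 if b0 is no lead byte
    unsigned cp = 0;       // code point being assembled
    unsigned min = 0;      // smallest value allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }

    bool ok = len != 0 && pos + len <= in.size();
    for (std::size_t i = 1; ok && i < len; ++i) {
        unsigned char b = static_cast<unsigned char>(in[pos + i]);
        if ((b & 0xC0) != 0x80) ok = false;   // not a continuation byte
        else cp = (cp << 6) | (b & 0x3F);
    }
    if (!ok || cp < min || cp > 0x10FFFF) {   // malformed, overlong or too big
        ++pos;
        return b0;
    }
    pos += len;
    return cp;
}

// Read code points from possibly bad UTF-8 and write them back out as
// valid UTF-8. The input is never modified; a new string is built,
// since the fixed data can be longer than the original.
std::string fix_utf8(const std::string &in) {
    std::string out;
    std::size_t pos = 0;
    while (pos < in.size())
        append_utf8_sketch(out, next_code_point(in, pos));
    return out;
}
```

For example, the ISO-8859-1 bytes "caf\xE9" would come back as the five-byte
UTF-8 string "caf\xC3\xA9", while input that is already valid UTF-8 passes
through unchanged.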



