[Xapian-discuss] Xapian 1.0.0 released!

Fri May 18 13:35:18 BST 2007

On Fri, May 18, 2007 at 02:13:14PM +0200, Ralf Mattes wrote:
> On Fri, 2007-05-18 at 12:42 +0100, Olly Betts wrote:
> 
> > In fact, while any UTF-8 string is trivially a valid ISO-8859-1 string,
> > "real world" ISO-8859-1 doesn't look like valid UTF-8, 
> 
> ??? Could you explain? By "valid" you mean "won't roll on the floor
> making silly noises"?

In UTF-8, some sequences of bytes aren't valid (many sequences actually).
But in ISO-8859-1, there's no requirement for which bytes can follow
others.  It won't make sense of course, so yes, something like that.
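
To make that concrete, here is a minimal Python sketch. The byte string is
an illustrative example chosen for this note, not something from Xapian:

```python
# "café" encoded as ISO-8859-1: the final byte 0xE9 is "é".
latin1_bytes = b"caf\xe9"

# ISO-8859-1 assigns a character to every byte value, so decoding
# can never fail, whatever follows whatever.
as_latin1 = latin1_bytes.decode("iso-8859-1")

# In UTF-8, 0xE9 is the lead byte of a three-byte sequence, but no
# continuation bytes follow it, so a strict decoder rejects the input.
try:
    latin1_bytes.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(as_latin1, is_valid_utf8)
```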

In fact, my point is that the top-bit-set bytes in a UTF-8 sequence,
interpreted as ISO-8859-1, mix letters and punctuation in an unnatural
way, so they don't look like real ISO-8859-1 text.
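
A small Python illustration of that effect, using a sample string invented
for this note:

```python
# Encode some accented text as UTF-8, then misread the bytes as ISO-8859-1.
text = "Zürich, Málaga, naïve"
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")

# Each accented letter turns into a capital A-tilde followed by a symbol
# such as a fraction, an inverted exclamation mark, or a macron.
print(misread)
```

Pairs like that essentially never occur in text that was genuinely written
in ISO-8859-1.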

> > Any UTF-8 string using characters with code points > 127
> > _will_ have a binary representation different from the same string
> > encoded in ISO-8859-1 (all characters with code points > 127 will be
> > encoded with 2 octets).

Yes, it will.
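
For the record, a short Python check of the byte-level difference. The
sample word is arbitrary, and the two-octet figure applies to code points
128 to 255, which is the range ISO-8859-1 can represent at all:

```python
word = "café"

latin1 = word.encode("iso-8859-1")  # one byte per character
utf8 = word.encode("utf-8")         # "é" (U+00E9) becomes two bytes

# Pure ASCII text is byte-for-byte identical in both encodings.
ascii_same = "cafe".encode("iso-8859-1") == "cafe".encode("utf-8")

print(latin1, utf8, ascii_same)
```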

> > and our UTF-8
> > handling code deals with invalid and overlong sequences by assuming
> > they're really ISO-8859-1, so you can probably just feed in ISO-8859-1
> > and it will be indexed, magically converted to UTF-8.  This hasn't been
> > tested much though so test carefully before deploying.
> 
> This makes me feel slightly uneasy I have to say ... trying to guess an
> encoding seems like a fast lane to insanity.

We don't try to "guess an encoding".  As you say, that's madness.

We need to do something when we encounter a sequence of bytes which is
invalid UTF-8 or represents an overlong encoding of a valid character
(RFC 3629 says we shouldn't just decode overlong sequences, for good
reasons).
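
As an illustration of what "overlong" means, here is a Python sketch. The
helper naive_decode_2byte is hypothetical and deliberately wrong: it does
the bit arithmetic without the range check that RFC 3629 requires.

```python
# 0xC0 0xAF is a two-byte overlong form of "/" (U+002F), which should
# only ever be encoded as the single byte 0x2F.
overlong_slash = b"\xc0\xaf"

def naive_decode_2byte(b: bytes) -> str:
    """Combine the payload bits of a two-byte sequence, with no checks."""
    return chr(((b[0] & 0x1F) << 6) | (b[1] & 0x3F))

# A byte-level filter looking for "/" does not see one here...
filter_sees_slash = b"/" in overlong_slash

# ...but a decoder that accepts overlong forms would produce one anyway.
naive_result = naive_decode_2byte(overlong_slash)

# Python's strict UTF-8 decoder refuses the sequence.
try:
    overlong_slash.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

print(filter_sees_slash, naive_result, strict_accepts)
```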

What we choose to do is assume that the bytes represent Unicode code
points, i.e. that the text is actually in ISO-8859-1.  That's option
3 here:

http://en.wikipedia.org/wiki/Utf-8#Overlong_forms.2C_invalid_input.2C_and_security_considerations

So, we need to do something in this case.  From that list, option 5 is
really out (halting indexing on a single bad byte of data amongst
terabytes is most unhelpful), and option 4 is bad too, as it could
allow potentially unsanitised data to be indexed, and then placed into a
web page.

Of options 1, 2, and 3, option 3 seems the best to me by far: it so
happens that this fallback handling means that real world ISO-8859-1
encoded text "just works".  Either it's just ASCII (which is also valid
UTF-8), or it doesn't create valid UTF-8 multibyte sequences, so the
fallback handling for invalid and overlong sequences kicks in.
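
The idea can be sketched in a few lines of Python.  This is not Xapian's
code: the handler name latin1_fallback and the function
decode_with_fallback are made up for this example, and it relies on
Python's own UTF-8 decoder, whose error boundaries may differ in detail
from Xapian's.

```python
import codecs

def latin1_fallback(err):
    """Treat each undecodable byte as the ISO-8859-1 character it encodes."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # ISO-8859-1 maps byte N to code point N, so this cannot fail.
    return bad.decode("iso-8859-1"), err.end

# Register the hypothetical handler under a name of our choosing.
codecs.register_error("latin1_fallback", latin1_fallback)

def decode_with_fallback(data: bytes) -> str:
    """Decode as UTF-8, falling back to ISO-8859-1 for invalid bytes."""
    return data.decode("utf-8", errors="latin1_fallback")

from_utf8 = decode_with_fallback("naïve café".encode("utf-8"))
from_latin1 = decode_with_fallback("naïve café".encode("iso-8859-1"))
from_overlong = decode_with_fallback(b"\xc0\xaf")

print(from_utf8, from_latin1, from_overlong)
```

Valid UTF-8 and ISO-8859-1 input both come out as the same text, and the
overlong form of "/" never turns into a slash.  The caveat is the one
already discussed: ISO-8859-1 bytes that happen to form a valid UTF-8
sequence would be read as UTF-8.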

I hope that's clearer.

Cheers,
    Olly