[Xapian-discuss] Python bindings and unicode strings

Tue Sep 4 05:50:17 BST 2007

On 9/3/07, James Aylett <james-xapian at tartarus.org> wrote:
> On Mon, Sep 03, 2007 at 12:07:39AM +0100, Olly Betts wrote:
> > > > I understand that the Xapian core uses UTF-8, but is there a way to
> > > > get the Python bindings to always work with Python's native unicode
> > > > string type so that the underlying UTF-8 is not exposed?
> > >
> > > This isn't true, and therein lies the problem. Xapian core treats
> > > everything as blobs of bytes;
> >
> > Except that Xapian::Stem, Xapian::QueryParser, and Xapian::TermGenerator
> > all assume UTF-8 (since 1.0.0).
>
> Well, yes. While they appear in core, they aren't really part of the
> underlying model, though. (Again, this doesn't help the clarity.)

Hmm.  Clarity is rather important.  I suspect this may just need
some additional documentation (or maybe it's already there? Xapian
does have a lot of technical documentation, but it's a bit scattered).

Obviously it makes sense for Stem to work with Unicode, since it
must deal with written languages.  It gets a bit more clouded
beyond that.  Is the core intentionally designed to allow indexing
arbitrary binary stuff, or is that just a side-effect of it not making any
assumptions or trying to interpret the bytes in any way?

Even though you can stuff UTF-8 into a raw byte sequence, the
other way around doesn't work.  For example the byte 0xFF is
illegal in UTF-8 "text".  And it also needs to be clear how the byte
0x00 is treated (as a character or as an end-of-string terminator).
Basically all parts of Xapian, as well as its users, must agree
on whether things are raw bytes or UTF-8 strings.  It can't really
be both, at least not safely.
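
To make the asymmetry concrete, here is a minimal sketch in plain
Python (written in Python 3 syntax; it does not touch Xapian at all,
and the variable names are only illustrative):

```python
# Any Unicode text can be encoded to UTF-8 bytes and decoded back.
text = "caf\u00e9"
encoded = text.encode("utf-8")
round_trip = encoded.decode("utf-8")

# Arbitrary bytes are not necessarily valid UTF-8: 0xFF never is.
raw = b"\xff\x00abc"
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# In Python itself, 0x00 is an ordinary character, not a terminator.
nul_text = b"a\x00b".decode("utf-8")
```

Whether a C or C++ layer underneath treats 0x00 the same way is a
separate question, which is exactly why it needs to be documented.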

> > The Python bindings will convert any unicode string to UTF-8 before
> > passing it to Xapian.  The reverse conversion isn't performed when a
> > string is returned to Python though.  I don't really remember the
> > rationale for that, but looking at bindings.html, I think it might be
> > that it allows binary data to be stored and recalled.
>
> Yes, absolutely.

And if the arbitrary data were to contain, say, 0xFF, then trying
to UTF-8 decode it would raise a UnicodeDecodeError.  So
if it's possible in Xapian for some data someplace to contain
a 0xFF, then no code should assume it can always UTF-8
decode it; at the very least it must deal with the possibility
of failure.
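
Dealing with that possibility could look something like the sketch
below.  The helper name decode_or_raw is made up for illustration and
is not part of the Xapian bindings; the code is Python 3 syntax.

```python
def decode_or_raw(data):
    """Try to decode bytes as UTF-8.

    Returns (text, True) on success, or (original bytes, False) if the
    data is not valid UTF-8.  Hypothetical helper, not a Xapian API.
    """
    try:
        return data.decode("utf-8"), True
    except UnicodeDecodeError:
        return data, False
```

A caller can then branch on the flag instead of being surprised by an
exception deep inside unrelated code.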

> > Perhaps it would be better to convert to unicode strings and add %extra
> > methods (e.g. get_data_raw()) which return a non-unicode string?
>
> That seems a better balance, and will trip up fewer people.

Yes, I like that too.  Although it would be nice to completely
hide any UTF-8 from Python, the more important thing is
being consistent.

If you can put in a Unicode string then you expect to get a
Unicode string back out.  And if you have to return a raw
byte string, then Xapian should raise an error when you attempt
to insert Unicode strings rather than silently UTF-8 encode them.
Making the error explicit reduces surprises and makes the
interface more "Pythonic".
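
As a rough sketch of what I mean by consistent (Python 3 syntax; the
classes TextSlot and RawSlot are purely hypothetical and are not
proposed names for anything in the bindings):

```python
class TextSlot:
    """Hypothetical text-only storage: str in, str out, UTF-8 hidden."""

    def __init__(self):
        self._raw = b""

    def set(self, value):
        if not isinstance(value, str):
            raise TypeError("TextSlot accepts only Unicode strings")
        self._raw = value.encode("utf-8")

    def get(self):
        return self._raw.decode("utf-8")


class RawSlot:
    """Hypothetical binary storage: bytes in, bytes out, no encoding."""

    def __init__(self):
        self._raw = b""

    def set(self, value):
        if isinstance(value, str):
            # Refuse rather than silently UTF-8 encode.
            raise TypeError("RawSlot accepts only bytes")
        self._raw = bytes(value)

    def get(self):
        return self._raw
```

Either policy is fine on its own; the trouble comes from accepting one
type and handing back the other.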

Also, even though there's plenty of time, if you're rethinking the
interface keep in mind that Python 3000 just had its first alpha
release (final release to occur sometime in 2009).  This will be
the *big* release of Python that breaks backwards compatibility
to try to clean up past language warts.

The big change is that all strings in Python will be Unicode strings,
which at the C/C++ interface means either UCS-2 or UCS-4.
There will be a new Python "bytes" type for raw octets, but they
will definitely not behave like strings.  The good-ole ASCII
style single-byte strings will be no more.
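
For a feel of how separate the two types are, this small sketch shows
the behaviour in Python 3 as it was eventually released (variable
names are only illustrative):

```python
text = "na\u00efve"            # a Unicode string, 5 characters
data = text.encode("utf-8")    # a bytes object, 6 octets

# A str and a bytes object never compare equal, even for ASCII content.
same = ("abc" == b"abc")

# Indexing bytes yields an integer, not a one-character string.
first = data[0]

# Mixing the two types is an error rather than an implicit conversion.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
```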

I'm sure not much can be done until the SWIG folks start
addressing Python 3000, but it will be coming.

-- 
Deron Meranda