[Xapian-discuss] Python bindings and unicode strings

Tue Sep 4 09:29:54 BST 2007

Deron Meranda wrote:
> Obviously it makes sense for Stem to work with Unicode, since it
> must deal with written languages.  It gets a bit more clouded
> beyond that.  Is the core intentionally designed to allow indexing
> arbitrary binary stuff, or is that just a side-effect of it not making any
> assumptions or trying to interpret the bytes in any way?

The core of Xapian (by which I mean the database, the document, and the
matcher, but not the text processing parts: the stemmers, the query
parser, and the term generator) deals with terms as byte strings, and
puts no interpretation on their meaning.  This means that higher levels
can encode and generate terms in any way that they like.

The text processing parts all assume that their input is UTF-8.

> Even though you can stuff UTF-8 into a raw byte sequence, the
> other way around doesn't work.  For example the byte 0xFF is
> illegal in UTF-8 "text".  And it also needs to be clear how the byte
> 0x00 is treated (as a character or as an end-of-string terminator).
> Basically all parts of Xapian, as well as users of it must agree
> whether things are raw bytes or UTF-8 strings.  It can't really be
> both, safely anyway.

Well, the set of UTF-8 strings is a subset of the set of raw bytes, of 
course.

Currently, if you have a database which you create and only put valid 
UTF-8 terms into it, you will always get valid UTF-8 terms out. 
However, if you don't know that the terms put into the database were 
UTF-8, you cannot rely on this.

> And if the arbitrary data were to contain say 0xFF, then trying
> to UTF-8 decode it would raise an UnicodeDecodeError.  So
> if it's possible in Xapian for some data someplace to contain
> a 0xFF, then nothing should assume it can always UTF-8
> decode it (or should deal with the possibility of failure).

If a term containing 0xFF was put into the database (either from Python,
or from an indexer written in something else), then that value can be
returned at some point.
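
To make that concrete, here is a minimal sketch in plain Python (no
Xapian calls at all, and written in Python 3 syntax, where "bytes" is the
raw type and "str" is Unicode).  The term values are invented for the
example.

```python
# Hypothetical terms as they might come back from a database.
utf8_term = "café".encode("utf-8")   # valid UTF-8: b'caf\xc3\xa9'
binary_term = b"id\xff\x00"          # arbitrary bytes; 0xFF never occurs in UTF-8

# A term that went in as valid UTF-8 decodes back to the same text.
roundtrip = utf8_term.decode("utf-8")

# A term containing 0xFF cannot be decoded as UTF-8, so the reader
# has to be prepared for the failure.
try:
    binary_term.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# In a Python byte string the 0x00 is an ordinary byte, not a terminator.
binary_length = len(binary_term)
```

So code that reads terms it did not write itself needs to handle the
decode failure rather than assume it away.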

>>> Perhaps it would be better to convert to unicode strings and add %extra
>>> methods (e.g. get_data_raw()) which return a non-unicode string?
>> That seems a better balance, and will trip up fewer people.
> 
> Yes, I like that too.  Although it would be nice to completely
> hide any UTF-8 from Python, the more important thing is
> being consistent.

In the past, Unicode objects have been a bit of a second class citizen
in Python.  However, that is changing, and, as you say, in Python 3000
they will be the default text handling mechanism.  I agree that it might
be nice to change the default return type of methods which return
terms to Unicode.  However, this would require a second set of methods
to get the "raw" values, and the implementation would always be such
that you can insert a Unicode value into a document, and then get it
out again as a "raw" value, with magic translations happening in the
background.

> If you can put in a Unicode string then you expect to get a
> Unicode string back out.

Hmm.  Currently, you can also put in a "str" string - surely the 
argument is just as strong that you should then expect to get a "str" 
string out.

> And if you have to return a raw
> byte string, then Xapian should raise an error when you attempt
> to insert Unicode strings rather than silently UTF-8 encode them.
> Making the error explicit reduces surprises and makes the
> interface more "Pythonic".

That's probably the clean way to solve this problem - remove the 
automatic UTF-8 conversion.  On the other hand, converting unicode 
strings to UTF-8 is always going to be the right thing to do, so isn't 
it helpful to just do it?
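
The two input policies can be sketched side by side.  This is plain
Python 3, not the actual bindings code, and both function names are made
up for illustration.

```python
def term_bytes_auto(term):
    """Convenience policy: Unicode input is silently UTF-8 encoded."""
    if isinstance(term, str):
        return term.encode("utf-8")
    if isinstance(term, bytes):
        return term
    raise TypeError("expected str or bytes, got %s" % type(term).__name__)


def term_bytes_strict(term):
    """Strict policy: only raw bytes are accepted; Unicode is an error."""
    if not isinstance(term, bytes):
        raise TypeError("expected bytes, got %s" % type(term).__name__)
    return term


# Under the strict policy the caller does the encoding explicitly;
# the bytes that end up stored are the same either way.
auto_result = term_bytes_auto("zoë")
strict_result = term_bytes_strict("zoë".encode("utf-8"))
```

The only difference is who writes the encode call: the library or the
caller.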

> There will be a new Python "bytes" type for raw octets, but they
> will definitely not behave like strings.  The good-ole ASCII
> style single-byte strings will be no more.

At that point, document data and values set in Xapian or returned from
Xapian will be "bytes", since they _usually_ contain arbitrary data.
There's a strong argument that terms should be the same, since they can
also contain arbitrary data.

My view is that there are only two viable options:

1. Leave the API as it is (and perhaps put a more prominent pointer
somewhere to the documentation on the subject: python/docs/bindings.html,
section "Unicode").  When Python 3000 comes along, the return type of
terms will be "bytes", and the permissible input types will be "bytes"
or Unicode (which will be UTF-8 encoded and stored as "bytes").

2. Remove the automatic conversion of input unicode strings.  This 
removes the "magic" from the API, which is generally a good thing, but 
it's a simple enough piece of magic to explain (and it is clearly 
documented in bindings.html), and I have personally found it useful.

Changing to return Unicode strings isn't viable, because it would 
require a whole set of alternative access functions to get at the raw 
value - much simpler to tell users to convert the output value to 
Unicode if they know that they only put valid UTF-8 into the database.
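
A caller-side conversion along those lines might look like the sketch
below (plain Python 3; the helper name is hypothetical, and returning
the raw bytes on failure is just one possible choice for databases whose
contents are not known to be UTF-8).

```python
def term_to_text(raw):
    """Decode a raw term as UTF-8, or hand back the bytes if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave it to the caller to decide what it means.
        return raw


text_term = term_to_text(b"caf\xc3\xa9")   # valid UTF-8, comes back as str
other_term = term_to_text(b"\xff\x00")     # not UTF-8, comes back as bytes
```

If you know that only valid UTF-8 went into the database, a bare
raw.decode("utf-8") is enough and the fallback is never taken.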

I currently think that option 1 is the right approach, but I'm open to
persuasion.  However, changing to option 2 should only happen when
Xapian moves to version 1.1.0, since it's a backwards-incompatible API
change.

-- 
Richard