[Xapian-discuss] Python bindings and unicode strings
Richard Boulton
richard at lemurconsulting.com
Tue Sep 4 09:29:54 BST 2007
Deron Meranda wrote:
> Obviously it makes sense for Stem to work with Unicode, since it
> must deal with written languages. It gets a bit more clouded
> beyond that. Is the core intentionally designed to allow indexing
> arbitrary binary stuff, or is that just a side-effect of it not making any
> assumptions or trying to interpret the bytes in any way?
The core of Xapian (the database, the document, and the matcher, as
opposed to the text processing parts: the stemmers, the query parser,
and the term generator) deals with terms as byte strings, and puts no
interpretation on their meaning. This means that higher levels can
encode and generate terms in any way that they like.
The text processing parts all assume that their input is UTF-8.
> Even though you can stuff UTF-8 into a raw byte sequence, the
> other way around doesn't work. For example the byte 0xFF is
> illegal in UTF-8 "text". And it also needs to be clear how the byte
> 0x00 is treated (as a character or as an end-of-string terminator).
> Basically all parts of Xapian, as well as users of it must agree
> whether things are raw bytes or UTF-8 strings. It can't really be
> both, safely anyway.
Well, the set of UTF-8 strings is a subset of the set of raw bytes, of
course.
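
To make the subset relationship concrete, here is a small sketch in
plain Python 3 (where "str" is Unicode and "bytes" is raw octets). It
does not touch Xapian at all; the variable names are just for
illustration.

```python
# Every Unicode string has a UTF-8 encoding, so it always fits in a
# byte string and round-trips unchanged.
text = "na\u00efve caf\u00e9"
encoded = text.encode("utf-8")
assert encoded.decode("utf-8") == text

# The reverse does not hold: the byte 0xFF never occurs in well-formed
# UTF-8, so this byte string is not a UTF-8 string.
raw = b"term\xff"
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# At the Python level, 0x00 is an ordinary character, not a terminator.
with_nul = b"a\x00b".decode("utf-8")
```

Here is_valid_utf8 ends up False, and with_nul is a three-character
string. Whether a NUL byte survives elsewhere in the stack is a
separate question that this sketch does not answer.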
Currently, if you create a database and only ever put valid UTF-8 terms
into it, you will always get valid UTF-8 terms out. However, if you
don't know that the terms put into the database were UTF-8, you cannot
rely on this.
> And if the arbitrary data were to contain say 0xFF, then trying
> to UTF-8 decode it would raise an UnicodeDecodeError. So
> if it's possible in Xapian for some data someplace to contain
> a 0xFF, then nothing should assume it can always UTF-8
> decode it (or should deal with the possibility of failure).
If a term containing 0xFF was put into the database (either from python,
or from an indexer written in something else), then that value can be
returned at some point.
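
Code that reads terms from a database it did not build itself therefore
has to allow for decoding to fail. A minimal sketch in Python 3 follows;
term_to_text is a hypothetical helper, not part of the Xapian bindings.

```python
def term_to_text(term):
    """Decode a term as UTF-8, or return None if it is not valid UTF-8.

    Hypothetical helper for illustration; `term` is a bytes object as
    it might come back from the database.
    """
    try:
        return term.decode("utf-8")
    except UnicodeDecodeError:
        return None


ok = term_to_text(b"caf\xc3\xa9")   # valid UTF-8
bad = term_to_text(b"term\xff")     # contains 0xFF, so decoding fails

# Alternative: never fail, but lose information by substituting
# U+FFFD for each undecodable byte.
lossy = b"a\xffb".decode("utf-8", errors="replace")
```

Returning None keeps the failure visible to the caller; the "replace"
variant is only suitable for display, since the original bytes cannot
be recovered from it.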
>>> Perhaps it would be better to convert to unicode strings and add %extra
>>> methods (e.g. get_data_raw()) which return a non-unicode string?
>> That seems a better balance, and will trip up fewer people.
>
> Yes, I like that too. Although it would be nice to completely
> hide any UTF-8 from Python, the more important thing is
> being consistent.
In the past, Unicode objects have been a bit of a second class citizen
in Python. However, that is changing, and as you say, in Python 3000
they will be the default text handling mechanism. I agree that it might
be nice to change the default returned type of methods which return
terms to Unicode. However, this would require a second set of methods
to get the "raw" values, and it would still be possible to insert a
Unicode value into a document and then get it out again as a "raw"
value, with magic translations happening in the background.
> If you can put in a Unicode string then you expect to get a
> Unicode string back out.
Hmm. Currently, you can also put in a "str" string - surely the
argument is just as strong that you should then expect to get a "str"
string out.
> And if you have to return a raw
> byte string, then Xapian should raise an error when you attempt
> to insert Unicode strings rather than silently UTF-8 encode them.
> Making the error explicit reduces surprises and makes the
> interface more "Pythonic".
That's probably the clean way to solve this problem - remove the
automatic UTF-8 conversion. On the other hand, converting unicode
strings to UTF-8 is always going to be the right thing to do, so isn't
it helpful to just do it?
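
The two input policies being weighed here can be sketched as follows,
in Python 3. Both functions are hypothetical stand-ins for whatever
conversion the bindings apply to an incoming term; neither is the
actual bindings code.

```python
def encode_lenient(value):
    """Accept text or bytes; text is silently UTF-8 encoded."""
    if isinstance(value, str):
        return value.encode("utf-8")
    if isinstance(value, bytes):
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


def encode_strict(value):
    """Accept bytes only; the caller must encode text explicitly."""
    if isinstance(value, bytes):
        return value
    raise TypeError("expected bytes, got %s" % type(value).__name__)


stored = encode_lenient("caf\u00e9")             # conversion happens for you
explicit = encode_strict("caf\u00e9".encode("utf-8"))  # caller spells it out
```

Both calls store the same bytes. The only difference is whether passing
a Unicode string is a convenience or a TypeError.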
> There will be a new Python "bytes" type for raw octets, but they
> will definitely not behave like strings. The good-ole ASCII
> style single-byte strings will be no more.
At that point (Python 3000), document data and values set in Xapian or
returned from Xapian will be "bytes", since they _usually_ contain
arbitrary data.
There's a strong argument that terms should be the same, since they can
also contain arbitrary data.
My view is that there are only two viable options:
1. Leave the API as it is (and perhaps add a more prominent pointer to
the "Unicode" section of python/docs/bindings.html). When Python 3000
comes along, the return type of terms will be "bytes", and the
permissible input types will be "bytes" or Unicode (which will be UTF-8
encoded and stored as "bytes").
2. Remove the automatic conversion of input unicode strings. This
removes the "magic" from the API, which is generally a good thing, but
it's a simple enough piece of magic to explain (and it is clearly
documented in bindings.html), and I have personally found it useful.
Changing to return Unicode strings isn't viable, because it would
require a whole set of alternative access functions to get at the raw
value - much simpler to tell users to convert the output value to
Unicode if they know that they only put valid UTF-8 into the database.
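
As a rough illustration of that approach, here is a Python 3 sketch
using a tiny in-memory stand-in for a database. TermStoreSketch and its
methods are invented for this example and are not part of Xapian; the
point is only that text goes in, bytes come out, and the caller decodes.

```python
class TermStoreSketch:
    """Hypothetical in-memory stand-in for a term store (not Xapian)."""

    def __init__(self):
        self._terms = []

    def add_term(self, term):
        # Unicode input is UTF-8 encoded; bytes are stored untouched.
        if isinstance(term, str):
            term = term.encode("utf-8")
        self._terms.append(term)

    def terms(self):
        # Output is always raw bytes, whatever type was put in.
        return list(self._terms)


store = TermStoreSketch()
store.add_term("caf\u00e9")   # Unicode in
store.add_term(b"plain")      # bytes in

raw_terms = store.terms()
# Safe only because this caller knows every term it stored was UTF-8.
text_terms = [t.decode("utf-8") for t in raw_terms]
```

If the same store had been filled by an indexer that wrote arbitrary
bytes, the final decode could raise UnicodeDecodeError, which is why the
conversion is left to the user rather than done on output.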
I currently think that option 1 is the right approach, but I'm open to
persuasion. However, changing to option 2 should only happen when
xapian moves to version 1.1.0, since it's a backwards incompatible API
change.
--
Richard