[Xapian-discuss] Python bindings and unicode strings

Tue Sep 4 17:55:54 BST 2007

On Tue, Sep 04, 2007 at 09:29:54AM +0100, Richard Boulton wrote:
> Deron Meranda wrote:
> >Even though you can stuff UTF-8 into a raw byte sequence, the
> >other way around doesn't work.  For example the byte 0xFF is
> >illegal in UTF-8 "text".

So you need to avoid putting it in a place where UTF-8 text is expected.
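
To illustrate the asymmetry, here is a minimal sketch in plain Python (Python 3 syntax, nothing Xapian-specific; the helper name is_valid_utf8 is made up for this example). Any text encodes to UTF-8 bytes, but a byte string containing 0xFF cannot be decoded as UTF-8:

```python
# Any text can be encoded to UTF-8 bytes and decoded back again...
text = "caf\u00e9"
raw = text.encode("utf-8")
assert raw.decode("utf-8") == text


# ...but not every byte sequence is valid UTF-8: 0xFF never occurs in it.
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A check like this could be applied before passing data to anything which expects UTF-8 text.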

> >And it also needs to be clear how the byte
> >0x00 is treated (as a character or as an end-of-string terminator).

Xapian just treats a zero byte like any other byte value.  How you treat
it is up to you (and perhaps to the language you're using - the way the
C# bindings work means that they don't transparently handle zero bytes).
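
As a rough illustration of the two treatments (a sketch at the Python level only, using ctypes to mimic a C-style string; it says nothing about how any particular binding behaves), a length-counted byte string keeps the zero byte, while a NUL-terminated view stops at it:

```python
import ctypes

data = b"foo\x00bar"

# Python byte strings carry an explicit length, so the zero byte is
# just another byte value.
full_length = len(data)          # counts the zero byte too
parts = data.split(b"\x00")      # the zero byte can be searched for

# Viewed as a NUL-terminated C string, everything from the zero byte
# onwards is lost.
as_c_string = ctypes.c_char_p(data).value
```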

(Actually, there's one exception - a zero byte in a term is currently
internally encoded as two bytes in some places, so the term length
limit is lower for terms containing zero bytes.  Essentially this is a
bug, and it will be resolved eventually.)

> In the past, Unicode objects have been a bit of a second class citizen 
> in Python.  However, that is changing, and as you say in Python 3000 
> they will be the default text handling mechanism.  I agree that it might 
> be nice to change the default returned type of methods which return 
> terms to Unicode - however, this would require a second set of methods 
> to get the "raw" values, and the implementation is always going to be 
> such that you can insert a Unicode value to a document, and then get it 
> out again as a "raw" value, and have magic translations happening in the 
> background.

Could we return a Python object which turns itself into a unicode string
if used as one, but otherwise just acts as a "str" string?
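
Something along these lines, perhaps. This is only a sketch in Python 3 syntax, where the raw type is "bytes" and the text type is "str"; the class name Term is hypothetical and not part of the bindings. In Python 2 the equivalent would be a "str" subclass which defines __unicode__.

```python
class Term(bytes):
    """Hypothetical wrapper: raw bytes which decode to text on request."""

    def __str__(self):
        # Only meaningful if the raw value is valid UTF-8; otherwise
        # this raises UnicodeDecodeError.
        return self.decode("utf-8")


term = Term(b"caf\xc3\xa9")
as_text = str(term)        # explicit conversion gives Unicode text
as_raw = bytes(term)       # the raw bytes are still available
```

Note that this only covers explicit conversion; code which mixes the object with text implicitly would not necessarily go through the conversion method.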

Cheers,
    Olly