[Xapian-discuss] Python bindings and unicode strings
Olly Betts
olly at survex.com
Tue Sep 4 17:55:54 BST 2007
On Tue, Sep 04, 2007 at 09:29:54AM +0100, Richard Boulton wrote:
> Deron Meranda wrote:
> >Even though you can stuff UTF-8 into a raw byte sequence, the
> >other way around doesn't work. For example the byte 0xFF is
> >illegal in UTF-8 "text".
So you need to avoid putting it in a place where UTF-8 text is expected.
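To make the asymmetry concrete, here is a minimal sketch in Python 3 syntax (where `bytes` plays the role of the raw byte string). It is only an illustration and does not touch Xapian; the variable names are made up for the example.

```python
# Any text can be encoded to UTF-8 bytes and decoded back unchanged.
text = "caf\u00e9"
round_trip = text.encode("utf-8").decode("utf-8")

# The reverse is not guaranteed: the byte 0xFF never occurs in valid UTF-8,
# so arbitrary binary data cannot safely be treated as UTF-8 text.
raw = b"\xff"
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

So raw binary values need to stay as byte strings, and only values known to hold UTF-8 should be decoded.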
> >And it also needs to be clear how the byte
> >0x00 is treated (as a character or as an end-of-string terminator).
Xapian just treats a zero byte like any other byte value. How you treat
it is up to you (and perhaps to the language you're using - the way the
C# bindings work means that they don't transparently handle zero bytes).
(Actually, there's one exception - a zero byte in a term is currently
encoded internally as two bytes in some places, so the term length
limit is lower for terms containing zero bytes. This is essentially a
bug, and it will be resolved eventually.)
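As a rough illustration of how zero-byte handling depends on the language layer rather than on the data, the sketch below uses Python 3 and the standard `ctypes` module. It does not call Xapian; the buffer simply stands in for any C-style interface that treats a zero byte as a terminator.

```python
import ctypes

# In Python a zero byte is just another byte in the string.
term = b"foo\x00bar"
python_length = len(term)

# A C-style view of the same data stops at the first zero byte.
buf = ctypes.create_string_buffer(term)
c_style_view = buf.value   # NUL-terminated interpretation
full_contents = buf.raw    # every byte, including the trailing NUL ctypes adds
```

Here `python_length` is 7, while the C-style view only sees the first three bytes, which is the kind of truncation a binding layer can introduce if it passes strings as NUL-terminated.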
> In the past, Unicode objects have been a bit of a second class citizen
> in Python. However, that is changing, and as you say in Python 3000
> they will be the default text handling mechanism. I agree that it might
> be nice to change the default returned type of methods which return
> terms to Unicode - however, this would require a second set of methods
> to get the "raw" values, and the implementation is always going to be
> such that you can insert a Unicode value to a document, and then get it
> out again as a "raw" value, and have magic translations happening in the
> background.
Could we return a Python object which turns itself into a unicode string
if used as one, but otherwise just acts as a "str" string?
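A minimal sketch of that idea, written in Python 3 syntax where `bytes` takes the place of the old "str" and `__str__` takes the place of `__unicode__`. The class name `RawTerm` is hypothetical and is not part of the Xapian bindings; it also assumes the stored value really is UTF-8.

```python
class RawTerm(bytes):
    """Hypothetical return type: a byte string that decodes itself on request."""

    def __str__(self):
        # Only used when the caller explicitly asks for text.
        # Assumes the bytes are valid UTF-8; raises UnicodeDecodeError otherwise.
        return self.decode("utf-8")


term = RawTerm("caf\u00e9".encode("utf-8"))

as_text = str(term)          # decoded text
as_raw = bytes(term)         # unchanged raw bytes
raw_length = len(term)       # length in bytes, not characters
```

Comparisons, hashing and length behave exactly as for a plain byte string, so existing code that expects raw values keeps working. One limitation of this simple approach is that operations such as slicing return plain `bytes` rather than `RawTerm`, so the text conversion is lost on derived values.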
Cheers,
Olly