[Xapian-discuss] Python bindings and unicode strings

Deron Meranda deron.meranda at gmail.com
Tue Sep 4 18:50:23 BST 2007


On 9/4/07, Richard Boulton <richard at lemurconsulting.com> wrote:
> We could return an object which has a ".encode()" method, which produces
> a str (with the specified encoding), as well as a ".decode()" method,
> which produces a unicode object.  But that's probably a lot more
> surprising than inserting a unicode object to a document and getting
> UTF-8 back.

It's clever, but I agree that that bit of magic would probably cause more
confusion than it solved.

If the core really does just deal with binary data, then it is probably
best to just stick to "str" for current Pythons, and change to "bytes"
for Python 3000.  From there an application can, if it wants, use
the decode() method to get unicode out.  That also gives the caller
more flexibility in dealing with possible decoding errors.
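
As a rough sketch of what that flexibility looks like: the snippet below
is written for Python 3, where the binary type is "bytes" (on current
Pythons the same decode() calls would be made on a "str").  The value
"raw" is just a made-up stand-in for data coming back from the bindings,
not anything Xapian actually returns.

```python
# Made-up stand-in for raw data handed back by the bindings.
# The final byte (0xff) is deliberately not valid UTF-8.
raw = b"caf\xc3\xa9 \xff"

# Strict decoding (the default) raises, so the caller decides what to do.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Other error handlers trade strictness for robustness.
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped
```

The point is only that the choice of error handler stays with the
application instead of being baked into the bindings.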

I think the most immediate need is just to properly document the
current behavior.  And really the only "surprise" in the way it works
now is that it will silently convert unicode into UTF-8-encoded str
objects rather than raising an error.  And just a little bit of
documentation can smooth that out.
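
To make the "surprise" concrete, here is a minimal sketch in Python 3
terms (text is "str", binary is "bytes").  Both functions are
hypothetical stand-ins that I made up for illustration; they are not
part of the Xapian API.  The first mimics the behavior described above,
the second shows the raise-an-error alternative.

```python
def store_lenient(value):
    """Mimic the described behavior: text is silently encoded as UTF-8."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value


def store_strict(value):
    """Alternative: accept only binary data and refuse text outright."""
    if not isinstance(value, bytes):
        raise TypeError("expected bytes, got %s" % type(value).__name__)
    return value


# Text goes in, UTF-8 encoded bytes come out, with no error or warning.
stored = store_lenient("na\u00efve")
```

With the lenient version, what comes back out is not the same type that
went in, which is exactly the thing the documentation needs to spell out.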

Having to call encode() or decode() is not so large a burden that we
need to hide it with magic, as long as it is clear to the user that
they need to do so if they want to work with unicode and not binary
blobs.


> A better option might be to return an object with a "unicode" method
> (and the magic method to make "unicode(val)" work), and a "str" method
> (and magic method).  That way, users have to explicitly say how they
> want the returned value to be interpreted.
>
> Such an object could behave just like "unicode", except that the raw
> data from xapian would be converted to unicode lazily, when any method
> of the object was first called.

This might be possible, but it would be weird.  First, strings are
immutable, so something that lazily changed its value might break
expectations.  Also, it would make code like "isinstance(val, basestring)"
not work, although that bit of dubious code would break anyway when
Python 3000 brings the bytes type.
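
A minimal sketch of the isinstance problem, again in Python 3 terms
(so __str__ and "str" play the roles of __unicode__ and "basestring").
The LazyText class is purely hypothetical, my own guess at what such a
wrapper might look like, not a proposed implementation.

```python
class LazyText:
    """Hypothetical wrapper: holds raw bytes and decodes them on first use."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw
        self._encoding = encoding
        self._text = None  # filled in lazily by __str__

    def __str__(self):
        # Decode only when text is first requested, then cache the result.
        if self._text is None:
            self._text = self._raw.decode(self._encoding)
        return self._text

    def __bytes__(self):
        # The raw data is always available unchanged.
        return self._raw


val = LazyText(b"caf\xc3\xa9")
as_text = str(val)                # explicit request for text
as_bytes = bytes(val)             # explicit request for raw data
is_string = isinstance(val, str)  # False: the wrapper is not a string type
```

So the explicit conversions work, but any code that type-checks the
value before using it will not treat it as a string.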

-- 
Deron Meranda


