[Xapian-discuss] Python bindings and unicode strings
Richard Boulton
richard at lemurconsulting.com
Tue Sep 4 18:15:59 BST 2007
Olly Betts wrote:
> Could we return a Python object which turns itself into a unicode string
> if used as one, but otherwise just acts as a "str" string?
Nice idea, if maybe a bit evil. But I don't think we can, because there
isn't really any way to know how it's being used: for example, val[0]
should return the first character in val, but does that mean the first
unicode character, or the first "str" character? val has no way of knowing.
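To make the ambiguity concrete, here is a small sketch. It uses Python 3 naming, where "bytes" plays the role of today's "str" and "str" plays the role of "unicode"; the sample value is made up for illustration and is not taken from any real document.

```python
# Hypothetical value as it might come back from a document: UTF-8 bytes.
raw = "\u00e91".encode("utf-8")   # e-acute followed by "1": b'\xc3\xa91'
text = raw.decode("utf-8")        # the same data viewed as text

# "First character" means different things for the two views.
first_byte = raw[:1]    # b'\xc3', only half of the two-byte sequence
first_char = text[0]    # the whole e-acute character

print(len(raw), len(text))
```

The same object cannot give both answers for val[0], which is the problem with a value that tries to be both types at once.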
We could return an object which has a ".encode()" method, which produces
a str (with the specified encoding), as well as a ".decode()" method,
which produces a unicode object. But that's probably a lot more
surprising than inserting a unicode object into a document and getting
UTF-8 back.
A better option might be to return an object with a "unicode" method
(and the magic method to make "unicode(val)" work), and a "str" method
(and magic method). That way, users have to explicitly say how they
want the returned value to be interpreted.
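Roughly, such an object might look like the sketch below. This is only an illustration, written with Python 3 naming (so "__str__" stands in for the "unicode" magic method and "__bytes__" for the "str" one); the class name ExplicitValue is hypothetical and nothing like it exists in the bindings, and it assumes the stored data is meant to be UTF-8.

```python
class ExplicitValue:
    """Hypothetical wrapper; not part of the xapian bindings."""

    def __init__(self, data):
        self._data = data  # raw bytes as stored, assumed to be UTF-8

    def __str__(self):
        # Called by str(val): decode, raising UnicodeDecodeError on bad data.
        return self._data.decode("utf-8")

    def __bytes__(self):
        # Called by bytes(val): hand back the stored bytes untouched.
        return self._data


val = ExplicitValue("caf\u00e9".encode("utf-8"))
as_text = str(val)     # the user explicitly asks for text
as_raw = bytes(val)    # the user explicitly asks for raw data
```

The object itself offers no indexing or string methods, so there is no way to use it without first choosing an interpretation.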
Such an object could behave just like "unicode", except that the raw
data from xapian would be converted to unicode lazily, when any method
of the object was first called. If the data wasn't valid UTF-8, an
exception would be raised at that point. Users would then be able
to treat the returned data as Unicode strings (and hence, we'd have no
problems when Python 3000 comes along), but could deal with non-UTF-8
data stored in xapian simply by calling a method on the returned data to
get it as a "str" (or as "bytes" in Python 3000).
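A minimal sketch of the lazy idea, again with Python 3 naming and assuming UTF-8. The class LazyText and its raw() method are hypothetical names chosen for this example. It forwards ordinary method calls to the decoded text via __getattr__, which is one possible approach rather than a worked-out design.

```python
class LazyText:
    """Hypothetical lazily-decoded value; not an actual xapian type."""

    def __init__(self, data):
        self._data = data    # raw bytes from storage
        self._text = None    # filled in on first use

    def _decoded(self):
        if self._text is None:
            # Raises UnicodeDecodeError here if the data is not valid UTF-8.
            self._text = self._data.decode("utf-8")
        return self._text

    def __getattr__(self, name):
        # Only reached for names not defined above, e.g. .upper()
        return getattr(self._decoded(), name)

    def __str__(self):
        return self._decoded()

    def raw(self):
        # Escape hatch for non-UTF-8 data: no decoding is attempted.
        return self._data


good = LazyText(b"abc")
upper = good.upper()            # first method call triggers the decode

bad = LazyText(b"\xff\xfe")     # not valid UTF-8
raw_bytes = bad.raw()           # fine, nothing is decoded
try:
    bad.upper()
    failed = False
except UnicodeDecodeError:
    failed = True               # the error only appears on first text use
```

Note that operations such as val[0] or len(val) look up special methods on the class and bypass __getattr__, so a complete version would need each of those methods spelled out.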
I'm still not entirely convinced that implementing something like that
is possible, though.
--
Richard