[Xapian-discuss] Python bindings and unicode strings

Richard Boulton richard at lemurconsulting.com
Tue Sep 4 18:15:59 BST 2007


Olly Betts wrote:
> Could we return a Python object which turns itself into a unicode string
> if used as one, but otherwise just acts as a "str" string?

Nice idea, if maybe a bit evil.  But I don't think we can, because there 
isn't really any way to know how it's being used: for example, val[0] 
should return the first character in val, but does that mean the first 
unicode character, or the first "str" character?  val has no way of knowing.
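
To make the ambiguity concrete, here is a small sketch.  It is written for
Python 3, where bytes plays the role of the "str" type discussed in this
thread and str plays the role of "unicode"; the sample value is just an
assumption chosen for illustration.

```python
# Python 3 sketch: bytes stands in for Python 2 "str",
# and str stands in for Python 2 "unicode".
raw = "\u00e91".encode("utf-8")   # e-acute followed by "1", as UTF-8 bytes
text = raw.decode("utf-8")        # the same data viewed as Unicode

first_byte = raw[:1]    # only the first half of the encoded e-acute
first_char = text[:1]   # the whole e-acute character

print(len(raw), len(text))
print(first_byte, first_char)
```

The same stored data gives two different "first characters", and nothing
about the indexing expression itself says which one the caller wanted.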

We could return an object which has a ".encode()" method, which produces 
a str (with the specified encoding), as well as a ".decode()" method, 
which produces a unicode object.  But that's probably a lot more 
surprising than inserting a unicode object into a document and getting 
UTF-8 back.

A better option might be to return an object with a "unicode" method 
(and the magic method to make "unicode(val)" work), and a "str" method 
(and magic method).  That way, users have to explicitly say how they 
want the returned value to be interpreted.

Such an object could behave just like "unicode", except that the raw 
data from Xapian would be converted to unicode lazily, the first time 
any method of the object was called; if the data wasn't valid UTF-8, 
an exception would be raised at that point.  Users would then be able 
to treat the returned data as Unicode strings (and hence we'd have no 
problems when Python 3000 comes along), but could deal with non-UTF-8 
data stored in Xapian simply by calling a method on the returned data to 
get it as a "str" (or as "bytes" in Python 3000).
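
A rough sketch of what such a lazily decoding object might look like is
below.  It is written for Python 3, where the magic methods are __str__
(for the Unicode view) and __bytes__ (for the raw view).  The class name
LazyText and everything in it are hypothetical; nothing like this exists in
the Xapian bindings, and a real implementation would need to cover far more
of the string interface.

```python
class LazyText:
    """Hypothetical wrapper around raw bytes; not part of any Xapian API."""

    def __init__(self, raw):
        self._raw = raw      # bytes exactly as stored
        self._text = None    # decoded form, filled in on first use

    def _decoded(self):
        if self._text is None:
            # Raises UnicodeDecodeError here if the data is not valid UTF-8.
            self._text = self._raw.decode("utf-8")
        return self._text

    def __bytes__(self):
        # The raw view never decodes, so it works for any stored data.
        return self._raw

    def __str__(self):
        return self._decoded()

    def __len__(self):
        return len(self._decoded())

    def __getitem__(self, index):
        return self._decoded()[index]

    def __getattr__(self, name):
        # Reached only when normal lookup fails: forward public names
        # (upper, split, ...) to the decoded string.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._decoded(), name)


good = LazyText("caf\u00e9".encode("utf-8"))
bad = LazyText(b"\xff\xfe")          # not valid UTF-8

print(str(good), len(good), good.upper())
print(bytes(bad))                    # raw access still works
try:
    str(bad)
except UnicodeDecodeError:
    print("not valid UTF-8")
```

Note that the object is not a real string, so code that checks the type
with isinstance() would not accept it; that is one of the reasons the idea
may not be workable in practice.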

I'm still not entirely convinced that implementing something like that 
is possible, though.

-- 
Richard


