[Xapian-discuss] Python bindings and unicode strings

Olly Betts olly at survex.com
Mon Sep 3 00:07:39 BST 2007


On Sun, Sep 02, 2007 at 01:38:10PM +0100, James Aylett wrote:
> On Thu, Aug 30, 2007 at 03:02:22PM -0400, Deron Meranda wrote:
> 
> > I understand that the Xapian core uses UTF-8, but is there a way to
> > get the Python bindings to always work with Python's native unicode
> > string type so that the underlying UTF-8 is not exposed?
> 
> This isn't true, and therein lies the problem. Xapian core treats
> everything as blobs of bytes;

Except that Xapian::Stem, Xapian::QueryParser, and Xapian::TermGenerator
all assume UTF-8 (since 1.0.0).

> > It appears that I can store unicode strings, like;
> > 
> > >>>  document.set_term( u'panach\u00e9' )
> > 
> > but then when I get them back out they're plain byte sequences (UTF-8
> > encoded) rather than nice unicode strings,
> > 
> > >>>  [t.term for t in document.allterms()]
> > ['panach\xc3\xa9']
> > 
> > I would have expected to get [u'panach\u00e9'] out instead.
> 
> I'm not sure what the right way of solving this is. Ideally we want a
> way of saying what encoding is being used, and have Python do the
> right thing. It would probably always come out as a Unicode string,
> but the deserialisation would depend on the encoding used.

The Python bindings convert any unicode string to UTF-8 before
passing it to Xapian.  The reverse conversion isn't performed when a
string is returned to Python, though.  I don't really remember the
rationale for that, but looking at bindings.html, I think it might be
so that binary data can be stored and recalled.
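
As a rough illustration of that asymmetry, here is a sketch in plain
Python with no Xapian calls at all.  It assumes Python 3, where the
"unicode" type discussed in this thread is str and a plain byte string
is bytes; the variable names are made up for the example.

```python
# Illustration only: mimics the one-way conversion described above.
term = "panach\u00e9"               # a unicode string supplied by the caller

stored = term.encode("utf-8")       # what is handed to Xapian: UTF-8 bytes
returned = stored                   # what comes back: the same bytes, undecoded

decoded = returned.decode("utf-8")  # the reverse conversion, done by hand
```

Decoding by hand like this only works when the stored bytes really are
UTF-8, which is not guaranteed for arbitrary binary data.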

Perhaps it would be better to convert to unicode strings and add %extra
methods (e.g. get_data_raw()) which return a non-unicode string?
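
One possible shape for that, sketched in plain Python 3.  The class
DocumentSketch is a hypothetical stand-in and not the real bindings
code; only the name get_data_raw() comes from the suggestion above, and
the in-memory storage is an assumption made to keep the example
self-contained.

```python
class DocumentSketch:
    """Hypothetical stand-in for a wrapped document; not the real bindings."""

    def __init__(self):
        self._data = b""

    def set_data(self, data):
        # Accept either text or bytes; text is encoded to UTF-8 on the way in.
        if isinstance(data, str):
            data = data.encode("utf-8")
        self._data = data

    def get_data(self):
        # Proposed default: decode UTF-8 and return a unicode string.
        return self._data.decode("utf-8")

    def get_data_raw(self):
        # Escape hatch for binary data: return the stored bytes untouched.
        return self._data
```

With this shape, get_data() raises UnicodeDecodeError if the stored
bytes are not valid UTF-8, which is the case the raw accessor would be
there to cover.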

Cheers,
    Olly


