[Xapian-discuss] Python bindings and unicode strings
Olly Betts
olly at survex.com
Mon Sep 3 00:07:39 BST 2007
On Sun, Sep 02, 2007 at 01:38:10PM +0100, James Aylett wrote:
> On Thu, Aug 30, 2007 at 03:02:22PM -0400, Deron Meranda wrote:
>
> > I understand that the Xapian core uses UTF-8, but is there a way to
> > get the Python bindings to always work with Python's native unicode
> > string type so that the underlying UTF-8 is not exposed?
>
> This isn't true, and therein lies the problem. Xapian core treats
> everything as blobs of bytes;

Except that Xapian::Stem, Xapian::QueryParser, and Xapian::TermGenerator
all assume UTF-8 (since 1.0.0).

> > It appears that I can store unicode strings, like;
> >
> > >>> document.set_term( u'panach\u00e9' )
> >
> > but then when I get them back out they're plain byte sequences (UTF-8
> > encoded) rather than nice unicode strings,
> >
> > >>> [t.term for t in document.allterms()]
> > ['panach\xc3\xa9']
> >
> > I would have expected to get [u'panach\u00e9'] out instead.
>
> I'm not sure what the right way of solving this is. Ideally we want a
> way of saying what encoding is being used, and have Python do the
> right thing. It would probably always come out as a Unicode string,
> but the deserialisation would depend on the encoding used.

The Python bindings will convert any unicode string to UTF-8 before
passing it to Xapian. The reverse conversion isn't performed when a
string is returned to Python, though, so you get back a UTF-8 encoded
byte string. I don't really remember the rationale for that, but
looking at bindings.html, I think it might be that it allows binary
data to be stored and recalled.
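
To make that round trip concrete, here is a minimal sketch in plain Python
that does not import xapian at all. The helper names to_xapian() and
from_xapian() are made up for illustration and are not part of the
bindings; they only mimic the encode-on-input, no-decode-on-output
behaviour described above, plus the manual decode a caller can apply.

```python
# Sketch only: plain Python, no xapian import. The helper names below are
# hypothetical and merely mimic the behaviour described in the message.

def to_xapian(text):
    """Encode text to UTF-8 bytes, as the bindings do on the way in."""
    if isinstance(text, bytes):
        return text  # already a byte string: passed through unchanged
    return text.encode("utf-8")


def from_xapian(raw):
    """Decode UTF-8 bytes by hand, since the bindings return them as-is."""
    return raw.decode("utf-8")


stored = to_xapian(u"panach\u00e9")          # what reaches Xapian
restored = from_xapian(stored)               # caller decodes explicitly
print(restored == u"panach\u00e9")
```

With the real bindings, the equivalent workaround would presumably be to
call .decode('utf-8') on each returned term or data string.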

Perhaps it would be better to convert to unicode strings and add %extra
methods (e.g. get_data_raw()) which return a non-unicode string?

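
As a rough sketch of what that could look like from the Python side, the
class below is a hypothetical stand-in (FakeDocument is not a xapian
class, and no such behaviour exists in the bindings as far as this
message states). It assumes get_data() would decode as UTF-8 while
get_data_raw() would hand back the stored bytes untouched.

```python
# Sketch only: FakeDocument is a hypothetical stand-in, not the xapian API.

class FakeDocument(object):
    def __init__(self):
        self._data = b""

    def set_data(self, data):
        # Text is encoded to UTF-8 on the way in; bytes are stored unchanged.
        if not isinstance(data, bytes):
            data = data.encode("utf-8")
        self._data = data

    def get_data(self):
        # Assumed default under the proposal: decode to a unicode string.
        return self._data.decode("utf-8")

    def get_data_raw(self):
        # Assumed escape hatch: return the stored bytes without decoding.
        return self._data


doc = FakeDocument()
doc.set_data(u"panach\u00e9")
text = doc.get_data()        # unicode string
raw = doc.get_data_raw()     # UTF-8 encoded byte string
```

In this sketch, binary data that is not valid UTF-8 makes get_data() raise
UnicodeDecodeError, which is the case the raw accessor is meant to cover.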
Cheers,
Olly