[Xapian-discuss] Python bindings and unicode strings
James Aylett
james-xapian at tartarus.org
Mon Sep 3 16:11:01 BST 2007
On Mon, Sep 03, 2007 at 12:07:39AM +0100, Olly Betts wrote:
> > > I understand that the Xapian core uses UTF-8, but is there a way to
> > > get the Python bindings to always work with Python's native unicode
> > > string type so that the underlying UTF-8 is not exposed?
> >
> > This isn't true, and therein lies the problem. Xapian core treats
> > everything as blobs of bytes;
>
> Except that Xapian::Stem, Xapian::QueryParser, and Xapian::TermGenerator
> all assume UTF-8 (since 1.0.0).
Well, yes. While they appear in core, they aren't really part of the
underlying model, though. (Again, this doesn't help the clarity.)
> The Python bindings will convert any unicode string to UTF-8 before
> passing it to Xapian. The reverse conversion isn't performed when a
> string is returned to Python though. I don't really remember the
> rationale for that, but looking at bindings.html, I think it might be
> that it allows binary data to be stored and recalled.
Yes, absolutely.
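To make the asymmetry concrete, here is a minimal sketch in Python 3 terms, where `str` plays the role of the old `unicode` type and `bytes` that of the plain string. `ByteStore` and its methods are hypothetical stand-ins written for illustration, not the real xapian bindings. The sketch only mimics the behaviour described above: text is encoded to UTF-8 on the way in, and nothing is decoded on the way out.

```python
# Hypothetical stand-in for a bindings layer; this is NOT the real xapian module.
class ByteStore:
    """Holds opaque bytes, the way the core is described as treating data."""

    def __init__(self):
        self._data = b""

    def set_data(self, value):
        # Inbound: text is encoded to UTF-8, bytes pass through unchanged.
        if isinstance(value, str):
            value = value.encode("utf-8")
        self._data = value

    def get_data(self):
        # Outbound: no decoding, so the caller always gets raw bytes.
        return self._data


store = ByteStore()

store.set_data("café")
text_out = store.get_data()           # bytes; the caller must decode it

store.set_data(b"\xff\xfe\x00\x01")   # arbitrary binary, not valid UTF-8
binary_out = store.get_data()         # round-trips untouched
```

Because nothing is decoded on the way out, the binary payload survives intact, at the cost of handing UTF-8 bytes back to callers who stored text.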
> Perhaps it would be better to convert to unicode strings and add %extra
> methods (e.g. get_data_raw()) which return a non-unicode string?
That seems a better balance, and will trip up fewer people.
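A rough sketch of what that proposal could look like from the Python side, again in Python 3 terms. `TextFirstStore` is a hypothetical class made up for illustration, and `get_data_raw()` is only the example name floated above, not an existing method. The assumption is that the default getter decodes UTF-8 and the raw variant skips decoding.

```python
# Hypothetical illustration of the proposal; NOT the real xapian bindings.
class TextFirstStore:
    def __init__(self):
        self._data = b""

    def set_data(self, value):
        # Inbound behaviour is unchanged: text becomes UTF-8 bytes.
        if isinstance(value, str):
            value = value.encode("utf-8")
        self._data = value

    def get_data(self):
        # Default getter returns text, assuming the stored bytes are UTF-8.
        return self._data.decode("utf-8")

    def get_data_raw(self):
        # Escape hatch for binary payloads: no decoding at all.
        return self._data


doc = TextFirstStore()
doc.set_data("café")
text = doc.get_data()        # str
raw = doc.get_data_raw()     # the same content as UTF-8 bytes

blob = TextFirstStore()
blob.set_data(b"\xff\xfe\x00\x01")
try:
    blob.get_data()          # not valid UTF-8, so decoding fails
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
payload = blob.get_data_raw()  # binary data is still recoverable
```

The trade-off is that anyone storing binary data has to know to call the raw variant, while the common text case no longer exposes the underlying UTF-8.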
J
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james at tartarus.org uncertaintydivision.org