[Xapian-discuss] Python bindings and unicode strings

James Aylett james-xapian at tartarus.org
Mon Sep 3 16:11:01 BST 2007


On Mon, Sep 03, 2007 at 12:07:39AM +0100, Olly Betts wrote:

> > > I understand that the Xapian core uses UTF-8, but is there a way to
> > > get the Python bindings to always work with Python's native unicode
> > > string type so that the underlying UTF-8 is not exposed?
> > 
> > This isn't true, and therein lies the problem. Xapian core treats
> > everything as blobs of bytes;
> 
> Except that Xapian::Stem, Xapian::QueryParser, and Xapian::TermGenerator
> all assume UTF-8 (since 1.0.0).

Well, yes. They ship in core, but they aren't really part of the
underlying model. (Again, this doesn't help with clarity.)

> The Python bindings will convert any unicode string to UTF-8 before
> passing it to Xapian.  The reverse conversion isn't performed when a
> string is returned to Python though.  I don't really remember the
> rationale for that, but looking at bindings.html, I think it might be
> that it allows binary data to be stored and recalled.

Yes, absolutely.
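
To make the asymmetry concrete, here is a minimal sketch in plain Python 3
(where `str` plays the role of the old `unicode` type and `bytes` the old
byte string). It makes no Xapian calls at all; `to_xapian()` and
`can_decode()` are hypothetical helpers that only mimic the behaviour
described above, not the bindings' actual code.

```python
# Illustration only: plain Python 3, no Xapian involved.
# to_xapian() and can_decode() are made-up names for this sketch.

def to_xapian(value):
    """Mimic the input side: unicode text becomes UTF-8, bytes pass through."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value


def can_decode(blob):
    """Report whether a stored blob could be turned back into unicode."""
    try:
        blob.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text_blob = to_xapian("caf\u00e9")            # text, stored as UTF-8 bytes
binary_blob = to_xapian(b"\xff\xfe\x00\x01")  # arbitrary binary, stored as-is

print(can_decode(text_blob))    # the text round-trips
print(can_decode(binary_blob))  # the binary blob is not valid UTF-8
```

If the return path always decoded as UTF-8, the second blob could not be
handed back at all, which is presumably why the raw bytes are returned.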

> Perhaps it would be better to convert to unicode strings and add %extra
> methods (e.g. get_data_raw()) which return a non-unicode string?

That seems a better balance, and will trip up fewer people.
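
As a rough sketch of how that might look from the Python side (again plain
Python 3, with no Xapian calls): `ToyDocument` is a hypothetical stand-in,
and only the name `get_data_raw()` comes from the suggestion above. The
real bindings may well end up looking different.

```python
# Hypothetical stand-in for a document object; not the Xapian API.

class ToyDocument:
    def __init__(self):
        self._data = b""  # the stored blob is always bytes

    def set_data(self, value):
        """Accept unicode text (encoded to UTF-8) or raw bytes."""
        if isinstance(value, str):
            value = value.encode("utf-8")
        self._data = value

    def get_data(self):
        """Return unicode; raises UnicodeDecodeError for non-UTF-8 blobs."""
        return self._data.decode("utf-8")

    def get_data_raw(self):
        """Return the stored bytes untouched, for binary payloads."""
        return self._data


doc = ToyDocument()
doc.set_data("na\u00efve")
print(doc.get_data())      # unicode text comes back as unicode
print(doc.get_data_raw())  # the underlying UTF-8 bytes are still reachable
```

The common case (text in, text out) then needs no manual decoding, and
anyone storing binary data has an explicit way to ask for it.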

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


