[Xapian-discuss] Python bindings and unicode strings

James Aylett james-xapian at tartarus.org
Sun Sep 2 13:38:10 BST 2007


On Thu, Aug 30, 2007 at 03:02:22PM -0400, Deron Meranda wrote:

> I understand that the Xapian core uses UTF-8, but is there a way to
> get the Python bindings to always work with Python's native unicode
> string type so that the underlying UTF-8 is not exposed?

That isn't quite true, and therein lies the problem. Xapian core
doesn't use UTF-8 as such: it treats everything as blobs of bytes,
and in many cases the sensible choice for applications is to put
UTF-8 in there.
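
To make the bytes-versus-text distinction concrete, here is a minimal
sketch in plain Python (no Xapian calls; written for Python 3, where
str is the Unicode type). The variable names are illustrative only.

```python
# Text and the bytes that represent it are different objects; the
# encoding is what connects them.
text = 'panach\u00e9'

# What a byte-oriented store would hold if the application chose UTF-8.
stored = text.encode('utf-8')

# Getting text back requires knowing which encoding was used.
restored = stored.decode('utf-8')

print(repr(stored))
print(restored == text)
```

The same bytes decoded with a different encoding would give different
text, which is why the store alone cannot say what they mean.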

> It appears that I can store unicode strings, like;
> 
> >>>  document.set_term( u'panach\u00e9' )
> 
> but then when I get them back out they're plain byte sequences (UTF-8
> encoded) rather than nice unicode strings,
> 
> >>>  [t.term for t in document.allterms()]
> ['panach\xc3\xa9']
> 
> I would have expected to get [u'panach\u00e9'] out instead.

I'm not sure what the right way of solving this is. Ideally we want a
way of saying what encoding is being used, and have Python do the
right thing. It would probably always come out as a Unicode string,
but the deserialisation would depend on the encoding used.
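
As a rough illustration of that idea, and not anything the bindings
currently provide, a small helper could carry the encoding and convert
at the boundary. TermCodec and its method names are hypothetical.

```python
class TermCodec:
    """Hypothetical helper: converts between text and stored bytes."""

    def __init__(self, encoding='utf-8'):
        self.encoding = encoding

    def to_bytes(self, term):
        # Accept either text or already-encoded bytes on the way in.
        if isinstance(term, bytes):
            return term
        return term.encode(self.encoding)

    def to_text(self, raw):
        # Always hand text back, decoded with the declared encoding.
        return raw.decode(self.encoding)


codec = TermCodec('utf-8')
raw_terms = [codec.to_bytes('panach\u00e9')]   # what would be stored
terms = [codec.to_text(raw) for raw in raw_terms]
```

With this shape the caller only ever sees text, and the choice of
encoding lives in one place.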

We might be okay having one encoding for everything, rather than
separate for terms and doc data... and values. Hmm. And I guess we
could stuff this into database metadata, which would make it
automatic. Some more thought may be required here first, though.
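
A sketch of how the metadata idea might look, using a plain dict as a
stand-in for a database's key/value metadata. The key name and both
function names are assumptions made up for this example.

```python
# Hypothetical metadata key under which the encoding name is recorded.
ENCODING_KEY = 'term_encoding'


def remember_encoding(metadata, encoding):
    # Store the encoding name as bytes, like everything else.
    metadata[ENCODING_KEY] = encoding.encode('ascii')


def stored_encoding(metadata, default='utf-8'):
    # Fall back to a default when nothing was recorded.
    raw = metadata.get(ENCODING_KEY, b'')
    return raw.decode('ascii') if raw else default


metadata = {}                      # stand-in for database metadata
remember_encoding(metadata, 'latin-1')

# A reader looks up the encoding first, then decodes what it reads.
encoding = stored_encoding(metadata)
term = b'panach\xe9'.decode(encoding)
```

Recording the encoding once per database is what would make the
decoding automatic for anyone opening it later.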

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


