[Xapian-tickets] [Xapian] #346: Python 3 support

Fri Sep 20 00:04:04 BST 2013

#346: Python 3 support
--------------------------------------+------------------------------
 Reporter:  olly                      |             Owner:  richard
     Type:  defect                    |            Status:  assigned
 Priority:  highest                   |         Milestone:  1.3.2
Component:  Xapian-bindings (Python)  |           Version:  SVN trunk
 Severity:  normal                    |        Resolution:
 Keywords:                            |        Blocked By:
 Blocking:                            |  Operating System:  All
--------------------------------------+------------------------------

Comment (by olly):

 OK, so we need to decide what the Python 3 API is actually going to look
 like.

 I agree we don't want anything too clever, but just mapping std::string to
 bytes everywhere still seems too simplistic to me.  I just tried rewriting
 smoketest.py for that API as an experiment, and it's full of b'...',
 x.encode('utf-8') and y.decode('utf-8').  That's inevitably going to lead
 to people writing wrapper functions or classes so they can pass in Unicode
 strings without having to convert at every single call site, so we'd end up
 with a proliferation of Python wrappers, or at best one dominant wrapper
 which everyone uses.  So it seems saner to just create the bindings with
 that wrapper included.  It's likely to be more efficient if integrated
 too.
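
 To make the call-site noise concrete, here is a minimal sketch.  The
 RawDocument class is a hypothetical stand-in for a bytes-only binding
 class (it is not the real xapian module), and Document is the sort of
 wrapper people would likely end up writing for themselves:

```python
class RawDocument:
    """Hypothetical stand-in for a bytes-only binding class.

    This is NOT the real xapian API; it only mimics "std::string maps
    to bytes everywhere".
    """

    def __init__(self):
        self._data = b""

    def set_data(self, data):
        if not isinstance(data, bytes):
            raise TypeError("bytes required")
        self._data = data

    def get_data(self):
        return self._data


class Document(RawDocument):
    """The kind of wrapper users would otherwise write themselves."""

    def set_data(self, data):
        # Accept str or bytes; encode str as UTF-8 at the boundary.
        if isinstance(data, str):
            data = data.encode("utf-8")
        super().set_data(data)

    def get_data(self):
        # Assumes the stored data is valid UTF-8.
        return super().get_data().decode("utf-8")


# Bytes-only API: a conversion at every call site.
raw = RawDocument()
raw.set_data("café".encode("utf-8"))
text_from_raw = raw.get_data().decode("utf-8")

# Wrapped API: the conversion happens once, inside the wrapper.
doc = Document()
doc.set_data("café")
text_from_doc = doc.get_data()
```

 The wrapper is only a few lines per method, which is partly why it seems
 likely that everyone would write their own slightly different one.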

 But I'm not a big python user, so it doesn't seem entirely sane for me to
 be designing the python API alone.

 FWIW, adding "_raw" (or "_unicode") variants of methods returning
 std::string looks quite feasible to do - sabrina's trick of using
 {{{typedef std::string pybytes;}}} and then having an output typemap for
 bytes allows this to be done for a {{{%extend}}} method easily, and we can
 generate a set of forwarding wrappers from a list of methods to be handled
 this way (possibly even via markup in the C++ API headers; this would
 also be useful for some other languages, e.g. the Java bindings currently
 blindly convert std::string return values assuming UTF-8).
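
 As a rough Python-level analogy of "generate forwarding wrappers from a
 list of methods" (the real thing would be done in the SWIG interface,
 not like this), something along these lines would work.  FakeDocument,
 TEXT_METHODS and add_text_variants are all hypothetical names invented
 for this sketch, and it assumes the "_raw" suffix with Unicode as the
 default:

```python
def _make_text_forwarder(raw_method):
    """Return a method which calls raw_method and decodes the result."""
    def forwarder(self, *args, **kwargs):
        return raw_method(self, *args, **kwargs).decode("utf-8")
    forwarder.__name__ = raw_method.__name__
    return forwarder


def add_text_variants(cls, method_names, suffix="_raw"):
    """For each listed bytes-returning method NAME, keep the original
    as NAME + suffix and make NAME return a decoded str."""
    for name in method_names:
        raw_method = getattr(cls, name)
        setattr(cls, name + suffix, raw_method)
        setattr(cls, name, _make_text_forwarder(raw_method))
    return cls


class FakeDocument:
    """Hypothetical stand-in for a class whose methods return bytes."""

    def __init__(self, data, values):
        self._data = data
        self._values = values

    def get_data(self):
        return self._data

    def get_value(self, slot):
        return self._values[slot]


# The list of methods to be handled this way.
TEXT_METHODS = ["get_data", "get_value"]
add_text_variants(FakeDocument, TEXT_METHODS)

doc = FakeDocument(b"na\xc3\xafve", {0: b"2013"})
text = doc.get_data()        # str, decoded as UTF-8
raw = doc.get_data_raw()     # the original bytes
```

 Flipping the default (bytes for NAME, str for NAME + "_unicode" or
 similar) would just mean swapping the two setattr() calls.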

 If we go that route, I'm not sure which is the better default.  The more
 common case is almost certainly Unicode, and making the common case the
 shorter one seems sane, but "bytes as default with a Unicode variant where
 it makes sense" avoids having to decide what to do with things like
 sortable_serialise() which definitely only return binary data.  However,
 there are very few of those (possibly it's the only one), so it could be an
 exception to the rule, or we could only have the bytes form for that
 function.  I can't really think of a case where the bytes variant makes no
 sense.
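
 To illustrate why a Unicode form makes no sense for a function returning
 binary data, here is a small sketch.  fake_serialise() is a hypothetical
 stand-in which uses struct.pack(); it is not the encoding that
 sortable_serialise() actually uses, it just produces arbitrary binary
 bytes in the same spirit:

```python
import struct


def fake_serialise(value):
    """Hypothetical stand-in for a function returning binary data.

    This is NOT xapian's actual encoding; it just packs a double as
    8 big-endian bytes.
    """
    return struct.pack(">d", value)


blob = fake_serialise(1.5)

# Arbitrary binary data is generally not valid UTF-8, so a
# str-returning variant would have nothing sensible to return.
try:
    blob.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

 Here the decode fails, so for such a function the bytes form is the only
 one worth having, whichever way round the default ends up.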

 Also not sure what the naming should be.  Something short would be good:
 "_raw" is tolerable; "_unicode" seems too long.  I wonder about "_b" or
 "_u" (to match the b'...' or u'...' notation).

--
Ticket URL: <http://trac.xapian.org/ticket/346#comment:54>
Xapian <http://xapian.org/>