[Xapian-tickets] [Xapian] #346: Python 3 support
Xapian
nobody at xapian.org
Fri Sep 20 00:04:04 BST 2013
#346: Python 3 support
--------------------------------------+------------------------------
Reporter: olly | Owner: richard
Type: defect | Status: assigned
Priority: highest | Milestone: 1.3.2
Component: Xapian-bindings (Python) | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
--------------------------------------+------------------------------
Comment (by olly):
OK, so we need to decide what the python3 API is actually going to look
like.
I agree we don't want anything too clever, but just mapping std::string to
bytes everywhere still seems too simplistic to me - I just tried rewriting
smoketest.py for that API as an experiment, and it's full of b'...',
x.encode('utf-8') and y.decode('utf-8'). That's inevitably going to lead
to people writing wrapper functions or classes so they can pass in Unicode
strings without having to convert at every single call site. We'd end up
with a proliferation of Python wrappers, or at best one dominant wrapper
which everyone uses, so it seems saner to just create the bindings with
that wrapper included. It's likely to be more efficient if integrated
too.
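To make the call-site noise concrete, here is a minimal sketch. Both classes
are hypothetical stand-ins written purely for illustration (they are not the
real xapian.Document, and the method set is an assumption): the first models
a binding that maps std::string to bytes everywhere, the second is the sort
of wrapper users would end up writing on top of it.

```python
class BytesOnlyDocument:
    """Hypothetical stand-in for a binding mapping std::string to bytes."""

    def __init__(self):
        self._data = b""
        self._terms = []

    def set_data(self, data):
        if not isinstance(data, bytes):
            raise TypeError("expected bytes")
        self._data = data

    def get_data(self):
        return self._data

    def add_term(self, term):
        if not isinstance(term, bytes):
            raise TypeError("expected bytes")
        self._terms.append(term)

    def termlist(self):
        return list(self._terms)


class TextDocument:
    """Hypothetical user-written wrapper that converts at the boundary."""

    def __init__(self, doc=None):
        self._doc = doc if doc is not None else BytesOnlyDocument()

    def set_data(self, data):
        self._doc.set_data(data.encode("utf-8"))

    def get_data(self):
        return self._doc.get_data().decode("utf-8")

    def add_term(self, term):
        self._doc.add_term(term.encode("utf-8"))

    def termlist(self):
        return [t.decode("utf-8") for t in self._doc.termlist()]


# Bytes-only API: every call site has to encode or decode.
raw = BytesOnlyDocument()
raw.set_data("café".encode("utf-8"))
raw.add_term(b"Kcafe")
raw_text = raw.get_data().decode("utf-8")

# With the wrapper, the conversions live in one place.
wrapped = TextDocument()
wrapped.set_data("café")
wrapped.add_term("Kcafe")
wrapped_text = wrapped.get_data()
```

If the bindings shipped the second form themselves, nobody would need to
maintain their own copy of TextDocument.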
But I'm not a big Python user, so it doesn't seem entirely sane for me to
be designing the Python API alone.
FWIW, adding "_raw" (or "_unicode") variants of methods returning
std::string looks quite feasible to do - sabrina's trick of using
{{{typedef std::string pybytes;}}} and then having an output typemap for
bytes allows this to be done for a {{{%extend}}} method easily, and we can
generate a set of forwarding wrappers from a list of methods to be handled
this way. That list could possibly even come from markup in the C++ API
headers, which would also be useful for some other languages - e.g. the
Java bindings currently blindly convert std::string return values assuming
UTF-8.
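The real work here would happen at the SWIG level (typemaps and
{{{%extend}}}), so the following is only a pure-Python sketch of the
"generate forwarding wrappers from a list of methods" idea. RawDocument,
add_unicode_forwarders() and the method list are all hypothetical names made
up for this example, and it assumes the "_raw" methods return bytes while
the unsuffixed ones decode as UTF-8.

```python
class RawDocument:
    """Hypothetical stand-in whose *_raw methods return bytes."""

    def __init__(self, data, description):
        self._data = data
        self._description = description

    def get_data_raw(self):
        return self._data

    def get_description_raw(self):
        return self._description


def add_unicode_forwarders(cls, method_names, suffix="_raw"):
    """For each NAME listed, add NAME() which decodes the result of NAME_raw()."""

    def make_forwarder(raw_name):
        def forwarder(self, *args, **kwargs):
            return getattr(self, raw_name)(*args, **kwargs).decode("utf-8")
        return forwarder

    for name in method_names:
        forwarder = make_forwarder(name + suffix)
        forwarder.__name__ = name
        setattr(cls, name, forwarder)
    return cls


# The list of methods to handle; in practice this might be generated.
STRING_RETURNING_METHODS = ["get_data", "get_description"]
add_unicode_forwarders(RawDocument, STRING_RETURNING_METHODS)

doc = RawDocument("naïve".encode("utf-8"), b"Document(...)")
text = doc.get_data()        # str, decoded as UTF-8
blob = doc.get_data_raw()    # bytes, untouched
```

Swapping which form gets the suffix only changes the list-driven loop, so the
choice of default doesn't really affect how this would be generated.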
If we go that route, I'm not sure which is the better default. The more
used case is almost certain to be Unicode, and making the common case the
shorter one seems sane, but "bytes as default with a Unicode variant where
it makes sense" avoids having to decide what to do with things like
sortable_serialise() which definitely only return binary data. However,
there are very few of those (possibly it's the only one), so it could be an
exception to the rule, or we could provide only the bytes form for that
function. I can't really think of a case where the bytes variant makes no
sense.
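As a rough illustration of why a function returning pure binary data can't
sensibly get a Unicode form: arbitrary bytes are frequently not valid UTF-8.
Note that fake_serialise() below is a made-up stand-in using struct.pack();
it does NOT reproduce the encoding sortable_serialise() actually uses, it
just returns some opaque binary for a number.

```python
import struct


def fake_serialise(value):
    """Stand-in returning opaque binary; not sortable_serialise's encoding."""
    return struct.pack(">d", value)


blob = fake_serialise(-1.5)

# Trying to treat the binary result as text fails for this value.
try:
    blob.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# The bytes form round-trips fine when handled as bytes.
roundtrip = struct.unpack(">d", blob)[0]
```

So for such functions the bytes form is the only one that's always safe,
whichever way round the default ends up.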
Also not sure what the naming should be. Something short would be good:
"_raw" is tolerable; "_unicode" seems too long. I wonder about "_b" or
"_u" (to match the b'...' or u'...' notation).
--
Ticket URL: <http://trac.xapian.org/ticket/346#comment:54>
Xapian <http://xapian.org/>
Xapian