[Xapian-tickets] [Xapian] #346: Python 3 support
Xapian
nobody at xapian.org
Fri Jun 15 02:12:39 BST 2012
#346: Python 3 support
--------------------------------------+-------------------------------------
Reporter: olly | Owner: richard
Type: defect | Status: assigned
Priority: highest | Milestone: 1.3.2
Component: Xapian-bindings (Python) | Version: SVN trunk
Severity: normal | Keywords:
Blockedby: | Platform: All
Blocking: |
--------------------------------------+-------------------------------------
Comment(by olly):
There are certainly some useful changes in Sabrina's patch, which we
hopefully should get in for the 1.3.2 development snapshot, such as the
PEP 3147 support (though that should use {{{imp.get_tag()}}} rather than
hard-coding cpython-32) and probably the {{{__next__}}} rename (though
I've not looked at the reasons for that). However, I'm afraid it doesn't
address the major remaining issue (Unicode strings) in the right way. If
we had a patch which fixed all the remaining issues, we would have applied
it by now! The patch changes the testcases to match the output it
produces, so passing the tests is largely meaningless.
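To illustrate the point about not hard-coding cpython-32, here is a minimal sketch of building a PEP 3147 cache filename from the running interpreter's tag. The helper names cache_tag() and pyc_name() are made up for this example and are not part of the patch or of Xapian. It assumes that {{{sys.implementation.cache_tag}}} (available from Python 3.3) carries the same value as {{{imp.get_tag()}}}, and only falls back to {{{imp}}} on Python 3.2, since the {{{imp}}} module was later deprecated and removed.

```python
import os
import sys


def cache_tag():
    """Return the running interpreter's PEP 3147 tag, e.g. 'cpython-32'."""
    if hasattr(sys, "implementation"):
        # Python 3.3 and later expose the tag here.
        return sys.implementation.cache_tag
    # Python 3.2 only: the call suggested in the comment above.
    import imp
    return imp.get_tag()


def pyc_name(module_name):
    """Hypothetical helper: relative path of the cached bytecode file."""
    filename = "%s.%s.pyc" % (module_name, cache_tag())
    return os.path.join("__pycache__", filename)


print(pyc_name("xapian"))
```

The printed path changes with the interpreter that runs it, which is the reason for asking the interpreter rather than writing the tag into the build files.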
The fundamental issue (as mentioned in the bug description) is that the
Xapian C++ API uses std::string as both a UTF-8 string and a byte string.
Some methods will only ever return one of these - e.g.
Xapian::sortable_serialise() always returns a byte string, while
Xapian::Stem::operator() always returns a UTF-8 string (well, unless you
create a user stemming algorithm which doesn't...), but some can return
either, generally depending on what you stored earlier (e.g.
Xapian::Document::get_value()). Similarly, some methods which take
strings accept only one kind and some accept either, but for parameters we
can just handle whichever we are passed when the C++ API accepts a std::string.
The key difference is that for a return value, we have to pick a Python
type to return.
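The input side can be sketched in plain Python as follows. This is only an illustration of the idea of handling whichever type we are passed; the function name to_std_string() is hypothetical, and in the real bindings this logic would live in a SWIG typemap rather than in Python code.

```python
def to_std_string(value):
    """Return the bytes that would be handed to a C++ std::string parameter.

    Hypothetical helper: Unicode strings are encoded as UTF-8, byte strings
    are passed through unchanged.
    """
    if isinstance(value, str):
        return value.encode("utf-8")
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# Both kinds of argument end up as a plain sequence of bytes.
print(to_std_string("caf\u00e9"))
print(to_std_string(b"\xa0\x00"))
```

No such simple rule exists in the other direction, because the bytes coming back from C++ do not say whether they were meant as UTF-8 text or as binary data.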
So to fix this, for each API method which returns std::string we need to
decide whether it returns Unicode, bytes, or both. If it's both, the best
solution is probably to add a second form (e.g.
xapian.Document.get_value_unicode()) which does the conversion for the
user, rather than forcing them to sprinkle explicit conversions around
calls to xapian.Document.get_value() in their code. SWIG's %extend makes
this pretty easy to do.
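From the Python user's side, the "second form" idea might look like the sketch below. The Document class here is a pure-Python stand-in written for this example, not the real xapian.Document, and get_value_unicode() is the proposed (not yet existing) method. In the bindings the extra method would be added with SWIG's %extend instead.

```python
class Document:
    """Stand-in for xapian.Document, for illustration only."""

    def __init__(self):
        self._values = {}

    def add_value(self, slot, value):
        # Accept either type on input; store bytes, as std::string would.
        if isinstance(value, str):
            value = value.encode("utf-8")
        self._values[slot] = value

    def get_value(self, slot):
        """Return the stored value as bytes (empty if the slot is unset)."""
        return self._values.get(slot, b"")

    def get_value_unicode(self, slot):
        """Proposed second form: decode the value from UTF-8 for the caller."""
        return self.get_value(slot).decode("utf-8")


doc = Document()
doc.add_value(0, "na\u00efve")
print(doc.get_value(0))          # bytes
print(doc.get_value_unicode(0))  # str
```

The caller picks the form that matches what was stored, so no explicit decode() calls are needed around get_value().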
Or perhaps the standard should be for get_value() to return Unicode with a
get_value_bytes() or get_value_raw() alternative. Or perhaps what we do
should depend on how the method will usually be used (e.g. terms can be
arbitrary binary strings, but in practice they're almost always UTF-8).
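The alternative convention, with Unicode as the default and a bytes variant alongside, can be sketched the same way. Again this is a pure-Python stand-in with a made-up class name, not the real binding, and it assumes that get_value() would raise on data which is not valid UTF-8 (other error policies are possible).

```python
class UnicodeFirstDocument:
    """Stand-in document where get_value() returns str by default."""

    def __init__(self):
        self._values = {}

    def add_value(self, slot, value):
        if isinstance(value, str):
            value = value.encode("utf-8")
        self._values[slot] = bytes(value)

    def get_value(self, slot):
        """Return the value decoded from UTF-8 (raises on binary data)."""
        return self._values.get(slot, b"").decode("utf-8")

    def get_value_bytes(self, slot):
        """Return the raw stored bytes, whatever they contain."""
        return self._values.get(slot, b"")


doc = UnicodeFirstDocument()
doc.add_value(0, "title")
doc.add_value(1, b"\xff\x00\x80")  # binary data, not valid UTF-8

print(doc.get_value(0))
try:
    doc.get_value(1)
except UnicodeDecodeError:
    # The caller has to know this slot holds binary data.
    print(doc.get_value_bytes(1))
```

This is convenient when values are almost always text, but anything stored as binary data has to be fetched through the bytes variant.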
I guess if you're trying to get everyone onto Python 3 for Ubuntu, you've
looked at quite a few upstreams already - has a standard pattern for
resolving such situations already emerged?
One further complication may be the user sub-classable API classes (which
SWIG calls "directors"). Here C++ calls back to Python, so it's the
arguments rather than the return types which matter. I'm not sure if
there are any cases there which could take either Unicode or bytes, but if
there are, I think we probably have to always pass bytes and let the
Python subclass explicitly convert if it wants to.
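A rough sketch of the "always pass bytes" rule for directors is below. The Stopper base class and call_from_cpp() are stand-ins written for this example to simulate the C++ side calling back into Python; they are not the real SWIG-generated classes, and EnglishStopper is a hypothetical user subclass.

```python
class Stopper:
    """Stand-in for a user sub-classable API class (a SWIG director)."""

    def __call__(self, term):
        raise NotImplementedError


def call_from_cpp(stopper, term):
    """Simulate the C++ callback: the std::string always arrives as bytes."""
    if not isinstance(term, bytes):
        raise TypeError("the binding would always pass bytes here")
    return bool(stopper(term))


class EnglishStopper(Stopper):
    """Hypothetical subclass which converts explicitly because it wants text."""

    STOPWORDS = {"the", "a", "of"}

    def __call__(self, term):
        try:
            word = term.decode("utf-8")
        except UnicodeDecodeError:
            # A binary term cannot be a stopword.
            return False
        return word in self.STOPWORDS


print(call_from_cpp(EnglishStopper(), b"the"))
print(call_from_cpp(EnglishStopper(), b"\xff\xfe"))
```

The subclass decides how to treat terms which are not valid UTF-8, which the binding could not decide on its behalf.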
It looks like the feature freeze date for Ubuntu 12.10 is 23rd August, which is
only just over 2 months away - if you want to see Python 3 support in a
stable Xapian release by then, realistically you're going to have to be
the one to actually make that happen. As things are currently, it's not
looking at all likely that it would even be fixed on trunk by then. It would
certainly be good to sort out Python 3 support, but there's not yet much
evidence of actual user demand. Richard was the main one driving this, and
he isn't very active in Xapian development right now.
--
Ticket URL: <http://trac.xapian.org/ticket/346#comment:28>
Xapian <http://xapian.org/>