[Xapian-tickets] [Xapian] #346: Python 3 support
Xapian
nobody at xapian.org
Fri Jun 15 20:12:15 BST 2012
#346: Python 3 support
--------------------------------------+-------------------------------------
Reporter: olly | Owner: richard
Type: defect | Status: assigned
Priority: highest | Milestone: 1.3.2
Component: Xapian-bindings (Python) | Version: SVN trunk
Severity: normal | Keywords:
Blockedby: | Platform: All
Blocking: |
--------------------------------------+-------------------------------------
Comment(by barry):
Thanks for the detailed response. A few comments:
Replying to [comment:28 olly]:
> There are certainly some useful changes in Sabrina's patch, which we
> hopefully should get in for the 1.3.2 development snapshot, such as the
> PEP 3147 support (though that should use {{{imp.get_tag()}}} rather than
> hard-coding cpython-32) and probably the {{{__next__}}} rename (though
> I've not looked at the reasons for that). However, I'm afraid it doesn't
> address the major remaining issue (Unicode strings) in the right way. If
> we had a patch which fixed all the remaining issues, we would have applied
> it by now! The patch changes the testcases to match the output it
> produces, so passing the tests is largely meaningless.
Right, mainly I was responding just on the currency of the latest patch.
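As a rough sketch of the PEP 3147 point quoted above: the tag can be looked up at run time instead of hard-coding cpython-32. The ticket names imp.get_tag(); the imp module was later deprecated and removed in Python 3.12, so this sketch assumes a newer interpreter and uses sys.implementation.cache_tag, which holds the same tag. The file name example.py is purely illustrative.

```python
import importlib.util
import sys

# Tag for the running interpreter, e.g. "cpython-32" on CPython 3.2.
# imp.get_tag() returned the same string on the versions that shipped imp.
tag = sys.implementation.cache_tag

# PEP 3147 location of the cached bytecode for a (hypothetical) module file.
pyc_path = importlib.util.cache_from_source("example.py")

print(tag)
print(pyc_path)
```

Whatever the interpreter, the tag appears inside the computed path, so nothing needs to be hard-coded in the build scripts.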
> The fundamental issue (as mentioned in the bug description) is that the
> Xapian C++ API uses std::string as both a UTF-8 string and a byte string.
> Some methods will only ever return one of these - e.g.
> Xapian::sortable_serialise() always returns a byte string, while
> Xapian::Stem::operator() always returns a UTF-8 string (well, unless you
> create a user stemming algorithm which doesn't...), but some can return
> either, generally depending on what you stored earlier (e.g.
> Xapian::Document::get_value()). Similarly, some methods which take
> strings can take only one sort, or either, but in this case we can just
> handle whichever we are passed when the C++ API accepts a std::string.
> The key difference is that for a return value, we have to pick a Python
> type to return.
Wow, this does make it even more challenging. Usually if the API you're
interfacing to has a strong model of bytes v. strings, it's not too hard
to work out the details (e.g. my earlier work on dbus-python), but it's
certainly more difficult if the semantics are ambiguous.
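To make the ambiguity concrete, here is a small sketch in plain Python 3 with no Xapian calls. struct.pack stands in for a binary value of the kind Xapian::sortable_serialise() might produce; that stand-in, the helper try_decode, and the variable names are assumptions for illustration, not the real serialisation format.

```python
import struct

# UTF-8 text: decodes cleanly to a Python 3 str.
utf8_value = "caf\u00e9".encode("utf-8")

# Binary payload: a big-endian double, standing in for a serialised number.
binary_value = struct.pack(">d", -1.5)


def try_decode(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(try_decode(utf8_value))    # the text decodes
print(try_decode(binary_value))  # the binary payload does not
```

Both values would arrive from C++ as the same std::string type, so the wrapper cannot tell from the type alone which Python type to hand back.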
> So to fix this, for each API method which returns std::string we need to
> decide whether it returns Unicode, bytes, or both. If it's both, the best
> solution is probably to add a second form (e.g.
> xapian.Document.get_value_unicode()) which does the conversion for the
> user, rather than forcing them to sprinkle explicit conversions around
> calls to xapian.Document.get_value() in their code. SWIG's %extend makes
> this pretty easy to do.
>
> Or perhaps the standard should be for get_value() to return Unicode with
> a get_value_bytes() or get_value_raw() alternative. Or perhaps what we do
> should depend on how the method will usually be used (e.g. terms can be
> arbitrary binary strings, but in practice they're almost always UTF-8).
I don't know the Xapian API very well, but at first blush I do like the
idea of .get_value() returning one or the other consistently, and either
adding a new API for the alternative, or leaving it to the user to do the
conversion, though as you say, you don't want to make client code horribly
less readable. I'm not sure whether get_value() should always return
bytes or strings, though it seems like get_value_bytes() might be a good
first approach.
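One possible shape for the "consistent return type plus a second form" idea, sketched in pure Python 3. FakeDocument is a hypothetical stand-in and not the real binding; only the method names get_value() and get_value_unicode() come from the discussion above, and the choice of bytes as the default return type is just one of the options being weighed.

```python
class FakeDocument:
    """Hypothetical stand-in for a wrapped document; not the real binding."""

    def __init__(self):
        self._values = {}  # slot number -> raw bytes, like a std::string

    def add_value(self, slot, value):
        # On input either type is acceptable: text is encoded as UTF-8.
        if isinstance(value, str):
            value = value.encode("utf-8")
        self._values[slot] = bytes(value)

    def get_value(self, slot):
        # Always bytes, whatever was stored (b"" for an unset slot here).
        return self._values.get(slot, b"")

    def get_value_unicode(self, slot):
        # Convenience form; raises UnicodeDecodeError for binary data.
        return self.get_value(slot).decode("utf-8")


doc = FakeDocument()
doc.add_value(0, "caf\u00e9")      # text in
doc.add_value(1, b"\xbf\xf8\x00")  # arbitrary binary in
print(doc.get_value(0))
print(doc.get_value_unicode(0))
```

With this arrangement the caller states which type is wanted by the method chosen, and a wrong guess fails loudly instead of silently returning mangled text.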
> I guess if you're trying to get everyone onto Python 3 for Ubuntu,
> you've looked at quite a few upstreams already - has a standard pattern
> for resolving such situations already emerged?
Well, the only upstream I currently have to support is software-center,
since we're only converting to Python 3 on the standard desktop image (for
12.10 anyway). So its use case will be my primary driver. We have maybe
a dozen reverse depends on python-xapian in total.
> One further complication may be the user sub-classable API classes
> (which SWIG calls "directors"). Here C++ calls back to Python, so it's
> the arguments rather than the return types which matter. I'm not sure if
> there are any cases there which could take either Unicode or bytes, but if
> there are, I think we probably have to always pass bytes and let the
> Python subclass explicitly convert if it wants to.
That sounds reasonable.
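A sketch of what "always pass bytes and let the subclass convert" could look like from the Python side. Everything here is hypothetical: FakeDirectorBase, on_term, and simulate_cpp_callback are invented names that only imitate a SWIG director callback in pure Python 3, and no real Xapian class is involved.

```python
class FakeDirectorBase:
    """Hypothetical stand-in for a SWIG director base class."""

    def on_term(self, term):
        raise NotImplementedError


def simulate_cpp_callback(handler, raw_terms):
    # Plays the role of the C++ side: every argument is passed as bytes.
    return [handler.on_term(bytes(raw)) for raw in raw_terms]


class TextHandler(FakeDirectorBase):
    def on_term(self, term):
        # The subclass chooses the conversion; invalid UTF-8 sequences
        # are replaced with U+FFFD rather than raising.
        return term.decode("utf-8", errors="replace")


results = simulate_cpp_callback(TextHandler(), [b"caf\xc3\xa9", b"\xbf\xf8"])
print(results)
```

The base class never guesses at an encoding, so a subclass that really does handle binary terms can simply skip the decode step.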
>
> It looks like the feature freeze date for 12.10 is 23rd August, which is
> only just over 2 months away - if you want to see Python 3 support in a
> stable Xapian release by then, realistically you're going to have to be
> the one to actually make that happen. As things are currently, it's not
> looking at all likely it would even be fixed on trunk by then. It would
> certainly be good to sort out Python 3 support, but there's not yet much
> evidence of actual user demand, and Richard was the main one driving this,
> but isn't very active in Xapian development right now.
Note that we don't necessarily need a stable Xapian release supporting
Python 3. It would be okay if upstream blessed the patches (ideally, by
committing them to svn), then we could package that up and feel fairly
confident that when the stable release is made, we can switch to it
without a ton of churn.
One big question is this: what version of Python 2 do you still need to
support (please tell me, nothing earlier than 2.6 :), and how should we
handle cases where the API has to change for Python 3?
--
Ticket URL: <http://trac.xapian.org/ticket/346#comment:30>
Xapian <http://xapian.org/>