[Xapian-tickets] [Xapian] #346: Python 3 support

Xapian nobody at xapian.org
Fri Jun 15 20:12:15 BST 2012


#346: Python 3 support
--------------------------------------+-------------------------------------
 Reporter:  olly                      |       Owner:  richard  
     Type:  defect                    |      Status:  assigned 
 Priority:  highest                   |   Milestone:  1.3.2    
Component:  Xapian-bindings (Python)  |     Version:  SVN trunk
 Severity:  normal                    |    Keywords:           
Blockedby:                            |    Platform:  All      
 Blocking:                            |  
--------------------------------------+-------------------------------------

Comment(by barry):

 Thanks for the detailed response.  A few comments:

 Replying to [comment:28 olly]:
 > There are certainly some useful changes in Sabrina's patch, which we
 hopefully should get in for the 1.3.2 development snapshot, such as the
 PEP3147 support (though that should use {{{imp.get_tag()}}} rather than
 hard-coding cpython-32) and probably the {{{__next__}}} rename (though
 I've not looked at the reasons for that).  However, I'm afraid it doesn't
 address the major remaining issue (Unicode strings) in the right way.  If
 we had a patch which fixed all the remaining issues, we would have applied
 it by now!  The patch changes the testcases to match the output it
 produces, so passing the tests is largely meaningless.

 Right, mainly I was responding just on the currency of the latest patch.
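
 For anyone skimming the ticket, here is a rough sketch of the two smaller
 items mentioned in the quoted text: the {{{__next__}}} rename and the
 PEP 3147 tag.  It is not taken from the patch.  The class name
 {{{CountDown}}} and the helper {{{bytecode_tag()}}} are hypothetical, and
 {{{sys.implementation.cache_tag}}} (Python 3.3 and later) is used as a
 stand-in for {{{imp.get_tag()}}}, which reports the same tag on
 interpreters that provide both.

```python
import sys


class CountDown(object):
    """Hypothetical iterator usable from both Python 2 and Python 3."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        # Python 3 iterator protocol.
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1

    # Python 2 looks up next() instead, so keep an alias.
    next = __next__


def bytecode_tag():
    """Return the PEP 3147 tag (e.g. 'cpython-32'), or None if unavailable."""
    impl = getattr(sys, "implementation", None)
    return getattr(impl, "cache_tag", None)


values = list(CountDown(3))
tag = bytecode_tag()
```

 Asking the interpreter for the tag, rather than hard-coding it, is what
 keeps the build working when the Python version changes.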

 > The fundamental issue (as mentioned in the bug description) is that the
 Xapian C++ API uses std::string as both a UTF-8 string and a byte string.
 Some methods will only ever return one of these - e.g.
 Xapian::sortable_serialise() always returns a byte string, while
 Xapian::Stem::operator() always returns a UTF-8 string (well, unless you
 create a user stemming algorithm which doesn't...), but some can return
 either, generally depending what you stored earlier (e.g.
 Xapian::Document::get_value()).  Similarly, some methods which take
 strings can take only one sort, or either, but in this case we can just
 handle whichever we are passed when the C++ API accepts a std::string.
 The key difference is that for a return value, we have to pick a Python
 type to return.

 Wow, this does make it even more challenging.  Usually if the API you're
 interfacing to has a strong model of bytes v. strings, it's not too hard
 to work out the details (e.g. my earlier work on dbus-python), but it's
 certainly more difficult if the semantics are ambiguous.
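
 To illustrate why the return type is the hard part, here is a small
 sketch in plain Python.  It does not call Xapian at all: the two values
 are stand-ins I made up, one holding arbitrary packed binary data and one
 holding UTF-8 text, and {{{try_decode()}}} is a hypothetical helper.

```python
import struct

# Stand-in for a value holding plain binary data.  This is NOT the real
# output of Xapian::sortable_serialise(), just arbitrary packed bytes.
binary_value = struct.pack(">d", -1.5)

# Stand-in for a value that happens to hold UTF-8 text.
text_value = u"caf\u00e9".encode("utf-8")


def try_decode(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


decoded_text = try_decode(text_value)
decoded_binary = try_decode(binary_value)
```

 Both values would come back from C++ as a std::string, but only one of
 them can be turned into a Python 3 {{{str}}}, so a wrapper that always
 decodes would fail on the other.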

 > So to fix this, for each API method which returns std::string we need to
 decide whether it returns Unicode, bytes, or both.  If it's both, the best
 solution is probably to add a second form (e.g.
 xapian.Document.get_value_unicode()) which does the conversion for the
 user, rather than forcing them to sprinkle explicit conversions around
 calls to xapian.Document.get_value() in their code.  SWIG's %extend makes
 this pretty easy to do.
 >
 > Or perhaps the standard should be for get_value() to return Unicode with
 a get_value_bytes() or get_value_raw() alternative.  Or perhaps what we do
 should depend on how the method will usually be used (e.g. terms can be
 arbitrary binary strings, but in practice they're almost always UTF-8).

 I don't know the Xapian API very well, but at first blush I do like the
 idea of .get_value() returning one or the other consistently, and either
 adding a new API for the alternative, or leaving it to the user to do the
 conversion, though as you say, you don't want to make client code horribly
 less readable.  I'm not sure whether get_value() should always return
 bytes or strings, though it seems like get_value_bytes() might be a good
 first approach.
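
 As a sketch of what that could look like from the Python side, assuming
 {{{get_value()}}} returns text and {{{get_value_bytes()}}} returns the raw
 data (the opposite split is equally possible).  {{{FakeDocument}}} is a
 hypothetical pure-Python stand-in, not the real xapian.Document, and says
 nothing about how the SWIG wrapper would implement it.

```python
class FakeDocument(object):
    """Hypothetical stand-in for a document; not the real xapian.Document."""

    def __init__(self):
        self._values = {}

    def add_value(self, slot, value):
        # Accept either text or bytes on input; always store bytes.
        if not isinstance(value, bytes):
            value = value.encode("utf-8")
        self._values[slot] = value

    def get_value_bytes(self, slot):
        # Raw form: always bytes, empty if the slot is unset.
        return self._values.get(slot, b"")

    def get_value(self, slot):
        # Text form: decodes as UTF-8 and raises UnicodeDecodeError
        # if the stored data is not valid UTF-8.
        return self.get_value_bytes(slot).decode("utf-8")


doc = FakeDocument()
doc.add_value(0, u"h\u00e9llo")   # text value
doc.add_value(1, b"\xbf\xf8")     # binary value, not valid UTF-8
```

 Client code then picks the accessor that matches what it stored, instead
 of wrapping every call in an explicit conversion.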

 > I guess if you're trying to get everyone onto Python 3 for Ubuntu,
 you've looked at quite a few upstreams already - has a standard pattern
 for resolving such situations already emerged?

 Well, the only upstream I currently have to support is software-center,
 since we're only converting to Python 3 on the standard desktop image (for
 12.10 anyway).  So its use case will be my primary driver.  We have maybe
 a dozen reverse dependencies on python-xapian in total.

 > One further complication may be the user sub-classable API classes
 (which SWIG calls "directors").  Here C++ calls back to Python, so it's
 the arguments rather than the return types which matter.  I'm not sure if
 there are any cases there which could take either Unicode or bytes, but if
 there are, I think we probably have to always pass bytes and let the
 Python subclass explicitly convert if it wants to.

 That sounds reasonable.
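
 A rough sketch of that convention, again in plain Python with no SWIG
 involved.  {{{FakeTermFilter}}}, {{{accept_term()}}} and
 {{{call_back_from_cpp()}}} are hypothetical names that only simulate C++
 calling into a Python subclass; they are not part of the Xapian API.

```python
class FakeTermFilter(object):
    """Hypothetical base class standing in for a SWIG director class."""

    def accept_term(self, term):
        raise NotImplementedError


def call_back_from_cpp(term_filter, raw_terms):
    """Simulate the C++ side invoking the Python override with bytes."""
    return [t for t in raw_terms if term_filter.accept_term(t)]


class PrefixFilter(FakeTermFilter):
    """Subclass that converts explicitly because it wants text."""

    def accept_term(self, term):
        # The argument always arrives as bytes; decoding is our choice.
        try:
            text = term.decode("utf-8")
        except UnicodeDecodeError:
            return False  # binary term, nothing to match against
        return text.startswith(u"x")


kept = call_back_from_cpp(PrefixFilter(), [b"xapian", b"\xbf\xf8", b"other"])
```

 Passing bytes every time means the subclass never gets a surprise type,
 and the cost of decoding is only paid by subclasses that need text.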

 >
 > It looks like the feature freeze date for 12.10 is 23rd August, which is
 only just over 2 months away - if you want to see Python 3 support in a
 stable Xapian release by then, realistically you're going to have to be
 the one to actually make that happen.  As things are currently, it's not
 looking at all likely it would even be fixed on trunk by then.  It would
 certainly be good to sort out Python 3 support, but there's not yet much
 evidence of actual user demand, and Richard was the main one driving this,
 but isn't very active in Xapian development right now.

 Note that we don't necessarily need a stable Xapian release supporting
 Python 3.  It would be okay if upstream blessed the patches (ideally, by
 committing them to svn), then we could package that up and feel fairly
 confident that when the stable release is made, we can switch to it
 without a ton of churn.

 One big question is this: what version of Python 2 do you still need to
 support (please tell me, nothing earlier than 2.6 :), and how should we
 handle cases where the API has to change for Python 3?

-- 
Ticket URL: <http://trac.xapian.org/ticket/346#comment:30>
Xapian <http://xapian.org/>