[Xapian-tickets] [Xapian] #346: Python 3 support

Xapian nobody at xapian.org
Fri Jun 15 23:21:09 BST 2012


#346: Python 3 support
--------------------------------------+-------------------------------------
 Reporter:  olly                      |       Owner:  richard  
     Type:  defect                    |      Status:  assigned 
 Priority:  highest                   |   Milestone:  1.3.2    
Component:  Xapian-bindings (Python)  |     Version:  SVN trunk
 Severity:  normal                    |    Keywords:           
Blockedby:                            |    Platform:  All      
 Blocking:                            |  
--------------------------------------+-------------------------------------

Comment(by richard):

 To chip in here: after much reflection, I think the strings/bytes problem
 for Xapian is essentially that Python 3 requires us to build an
 abstraction on top of Xapian's API to separate strings and bytes.
 However, because Xapian doesn't record whether the strings passed to many
 of its methods were text or binary, such an abstraction must be leaky.
 Therefore, we should aim to make the abstraction as thin as possible, so
 that the leaks are easy to document.
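
 As a rough illustration of the leak (a toy stand-in only: ToyStore and
 its methods are hypothetical and are not part of Xapian's API), a backend
 that keeps only raw bytes cannot tell afterwards whether a value was
 originally given as text or as binary:

```python
class ToyStore:
    """Hypothetical stand-in for a backend that only stores raw bytes."""

    def __init__(self):
        self._values = {}

    def put(self, key, value):
        # "Magic" conversion: silently encode text on the way in.
        if isinstance(value, str):
            value = value.encode("utf-8")
        self._values[key] = value

    def get(self, key):
        # The original type is gone, so bytes is all that can be returned.
        return self._values[key]


store = ToyStore()
store.put("title", "caf\u00e9")        # given as text
store.put("blob", b"caf\xc3\xa9")      # given as binary, same bytes

# Both entries are indistinguishable once stored.
same_bytes = store.get("title") == store.get("blob")
# What comes back is bytes, so it does not compare equal to the text put in.
round_trips = store.get("title") == "caf\u00e9"
```

 The caller who passed text gets bytes back, which is the kind of leak
 that a thinner abstraction at least makes easy to explain.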

 The easiest and cleanest way to do this is to avoid any magic
 conversions: use the bytes type wherever binary strings may be used (both
 in parameters and in return values), and use the unicode string type
 wherever only text is valid (i.e. where Xapian always expects UTF-8
 encoded strings).  Specifically:

  - all methods to which it can ever make sense to pass arbitrary binary
 strings should accept only bytes.  Those which are sometimes used to
 store text will therefore need the user to explicitly call
 .encode('utf8') before passing the arguments.  However, this is actually a
 good thing, since the user will get that same encoded value back if they
 later ask Xapian for the string again.

  - methods which can return arbitrary binary strings should return only
 bytes.  No magic "return unicode if the arguments passed to the call were
 unicode" or similar hackery, because this just makes the API harder to
 document and understand, and leads to subtle, hard-to-track-down bugs.

  - there should be no special method variants added as syntactic sugar to
 do conversions, i.e. no "add_value_unicode" methods.  It's clearer to tell
 the user to use .encode('utf8') before making the call, because then the
 user knows what actually happened without having to learn about each
 special API method in turn.

  - methods in the C++ API which always expect UTF-8 encoded strings (such
 as the QueryParser::parse_query() method) should accept only unicode
 strings.  This will ensure that we're never making assumptions about the
 character set of the input, and is appropriate because users should
 already be using unicode strings when manipulating text (so there
 shouldn't be a need to do foo.decode('utf8') or anything similar).

  - callbacks from C++ to Python which receive strings from Xapian as
 parameters should be passed byte strings if those parameters can ever be
 arbitrary binary strings, and unicode strings if they cannot.
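
 The rules above could be sketched roughly as follows.  This is only an
 illustration of the proposed convention: ToyDocument and toy_parse_query
 are hypothetical stand-ins written in pure Python, not the real bindings.

```python
class ToyDocument:
    """Hypothetical stand-in; not the real xapian.Document."""

    def __init__(self):
        self._values = {}

    def add_value(self, slot, value):
        # Values may hold arbitrary binary data, so only bytes is accepted.
        if not isinstance(value, bytes):
            raise TypeError("value must be bytes; call .encode('utf8') first")
        self._values[slot] = value

    def get_value(self, slot):
        # Always bytes, regardless of what the caller started with.
        return self._values.get(slot, b"")


def toy_parse_query(text):
    """Hypothetical stand-in for a text-only entry point."""
    if not isinstance(text, str):
        raise TypeError("query text must be str, not bytes")
    # Encoding to UTF-8 happens in one place, at the boundary.
    return text.encode("utf-8")


doc = ToyDocument()
doc.add_value(0, "na\u00efve".encode("utf8"))   # explicit encode by the user
stored = doc.get_value(0)                        # bytes come back
text_again = stored.decode("utf8")               # explicit decode by the user

# Passing text where bytes is required is rejected rather than converted.
try:
    doc.add_value(1, "plain text")
    rejected = False
except TypeError:
    rejected = True
```

 The user sees every encode and decode step, so what is stored is exactly
 what was passed in.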

 Following this scheme will cause some backwards compatibility problems:
 in Xapian's Python 2 bindings, many methods accept unicode strings and
 automatically convert them to UTF-8 before passing them to Xapian, and we
 wouldn't carry this over to the Python 3 bindings.  However, the bytes /
 unicode support in the Python 2 bindings is a mess and needs sorting out,
 so what better time than the Python 3 transition to do this?

-- 
Ticket URL: <http://trac.xapian.org/ticket/346#comment:31>
Xapian <http://xapian.org/>