[Xapian-tickets] [Xapian] #346: Python 3 support

Sat Sep 21 01:07:32 BST 2013

#346: Python 3 support
--------------------------------------+------------------------------
 Reporter:  olly                      |             Owner:  richard
     Type:  defect                    |            Status:  assigned
 Priority:  highest                   |         Milestone:  1.3.2
Component:  Xapian-bindings (Python)  |           Version:  SVN trunk
 Severity:  normal                    |        Resolution:
 Keywords:                            |        Blocked By:
 Blocking:                            |  Operating System:  All
--------------------------------------+------------------------------

Comment (by olly):

 I discussed this with michelp on IRC yesterday.  His feeling was that
 making the conversions explicit was better, but that having methods accept
 str or bytes for convenience would be reasonable.  The "accept bytes or
 str, return bytes" approach is exactly what dbm (in the Python standard
 library) does:

 {{{
 import dbm

 # Open database, creating it if necessary.
 db = dbm.open('cache', 'c')

 # Record some values
 db[b'hello'] = b'there'
 db['www.python.org'] = 'Python Website'
 db['www.cnn.com'] = 'Cable News Network'

 # Note that the keys are considered bytes now.
 assert db[b'www.python.org'] == b'Python Website'
 # Notice how the value is now in bytes.
 assert db['www.cnn.com'] == b'Cable News Network'

 # Often-used methods of the dict interface work too.
 print(db.get('python.org', b'not present'))

 # Storing a non-string key or value will raise an exception (most
 # likely a TypeError).
 db['www.yahoo.com'] = 4

 # Close when done.
 db.close()
 }}}

 Example taken from: http://docs.python.org/3.1/library/dbm.html

 I tried to see if anyone's criticised this API, but failed to find
 anything online, so I've been thinking this seemed a reasonable approach.
 There's a clear logic to it, which is simple to explain and to understand
 (and as a bonus, also to implement):

   The API is wrapped as "C++ std::string <-> Python bytes".  For
 convenience, if you pass in str, then rather than raising an exception, we
 convert it to bytes containing UTF-8 (since that's Xapian's standard
 representation for Unicode text).
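
 As a rough illustration of that rule (a sketch only: to_bytes is a
 hypothetical helper name, and the real conversion would live in the SWIG
 typemaps rather than in Python code):

```python
def to_bytes(value):
    """Hypothetical helper: normalise a str-or-bytes argument to bytes."""
    if isinstance(value, bytes):
        # Already binary data: pass it through untouched.
        return value
    if isinstance(value, str):
        # Text is converted to UTF-8, the representation assumed for Unicode.
        return value.encode("utf-8")
    # Anything else is still an error, as with dbm.
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


print(to_bytes("caf\u00e9"))   # b'caf\xc3\xa9'
print(to_bytes(b"\x00\xff"))   # b'\x00\xff'
```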

 Returning bytes works for UTF-8 or binary data (and we could leave adding
 variants returning Unicode until later without breaking code written to
 the bytes-returning API).  It also matches what we do for Python 2, so
 should be familiar to existing users.  It suffers from a round-tripping
 issue (add a term as str, get it back as bytes), but so does the dbm API,
 and nobody seems to be ranting about how confusing they find that.
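
 To make the round-tripping issue concrete, here is a toy stand-in (the
 ToyTermStore class and its methods are invented for illustration and are
 not part of the Xapian bindings):

```python
class ToyTermStore:
    """Hypothetical stand-in for a wrapped class; not real Xapian API."""

    def __init__(self):
        self._terms = []

    def add_term(self, term):
        # Accept str or bytes; str is stored as UTF-8 bytes.
        if isinstance(term, str):
            term = term.encode("utf-8")
        elif not isinstance(term, bytes):
            raise TypeError("expected str or bytes")
        self._terms.append(term)

    def terms(self):
        # Always return bytes, whatever type was passed in.
        return list(self._terms)


store = ToyTermStore()
store.add_term("hello")      # added as str...
store.add_term(b"world")
print(store.terms())         # ...but both come back as bytes

# A caller who wants text back decodes explicitly.
text_terms = [t.decode("utf-8") for t in store.terms()]
print(text_terms)
```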

 It's possible to pick what happens on a function by function basis with
 SWIG, but unless there's some way to generate the lists of methods and
 parameters automatically, that will be a maintenance headache.  It's also
 a more complicated rule for users to grasp.

 Richard mentions text input to the !QueryParser or !TermGenerator as a
 case where only str should be accepted, but for the common case where your
 source data is in UTF-8, that would force you to convert from UTF-8 to
 <whatever Python uses internally> to get a Python str object, and then
 you'd pass that to the bindings which would have to convert it back to
 UTF-8 again to pass to xapian-core.  Even if the changes in PEP 393 mean
 that Python 3.3 would keep the data as UTF-8 (I'm not clear whether that's
 actually the case), it still complicates the rules that the API user has
 to remember.

 The number of cases which inherently ''only'' take binary data is very
 small - I think it's only sortable_unserialise() currently.  I think not
 accepting str for those cases would be desirable and feasible to do - here
 Unicode makes no sense, so an error really is better.
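
 A sketch of what "bytes only" could look like for such a function.  The
 toy_unserialise name is hypothetical, and struct's big-endian double is
 used purely as a placeholder encoding (it is not the format
 sortable_serialise() actually produces):

```python
import struct


def toy_unserialise(data):
    """Hypothetical binary-only function: accepts bytes, rejects str."""
    if not isinstance(data, bytes):
        # No implicit UTF-8 conversion here: text input is a caller bug.
        raise TypeError("expected bytes, got %s" % type(data).__name__)
    (value,) = struct.unpack(">d", data)
    return value


packed = struct.pack(">d", 1.5)
print(toy_unserialise(packed))

try:
    toy_unserialise("1.5")
except TypeError as e:
    print("rejected:", e)
```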

 I don't think it's really feasible to have a "unicode version" of classes
 like Document or Database - the distinction can't really be made at the
 class level - e.g. values will often be binary but terms usually text.

 I'm not sure you can make it at the function/method level either - a
 function could have two std::string parameters, or a std::string parameter
 and a std::string return type, where one is "text" and the other is
 "binary data".  E.g. the former pattern matches !QueryParser's
 parse_query() and add_prefix() methods if you think term prefixes are
 binary data (as suggested above); the latter pattern is what an API for
 serialising a string would look like.  So if
 "foo_str()" means all std::string parameters and return values are
 unicode, that doesn't really work - this is one reason why I think
 accepting either str or bytes for parameters is more workable - then the
 unicode variant only affects the return value, and there's only one of
 those (well, strictly speaking, returning std::pair or via passed in
 pointers/references is possible, and would naturally map to returning a
 tuple in Python, but only worrying about return values greatly reduces the
 number of cases, and multiple return values in C++ are a bit awkward so
 tend to be rare).
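
 In other words, something along these lines, where parameters always take
 either type and a "_str" variant differs only in its return value (a
 sketch: ToyDocument and get_data_str are made-up names, not existing
 bindings API):

```python
class ToyDocument:
    """Hypothetical wrapped class; only the return type has two variants."""

    def __init__(self):
        self._data = b""

    def set_data(self, data):
        # Parameters accept str or bytes, so no separate "text" setter.
        if isinstance(data, str):
            data = data.encode("utf-8")
        elif not isinstance(data, bytes):
            raise TypeError("expected str or bytes")
        self._data = data

    def get_data(self):
        # Default variant: always bytes.
        return self._data

    def get_data_str(self):
        # Hypothetical Unicode variant: same data, decoded as UTF-8.
        return self._data.decode("utf-8")


doc = ToyDocument()
doc.set_data("na\u00efve")
print(doc.get_data())       # b'na\xc3\xafve'
print(doc.get_data_str())
```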

 !QueryParser assumes that text is encoded as UTF-8, so providing a way to
 specify that str should be converted to a different encoding just seems to
 be setting a trap for users.  Those who really want to do this can convert
 to bytes in the encoding they want in their own code.
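
 For example, using only standard Python (is_valid_utf8 is a throwaway
 helper written for this illustration), the caller can pick the encoding
 explicitly, and it's easy to see why anything other than UTF-8 is a trap
 for code which assumes UTF-8:

```python
text = "caf\u00e9"

# The caller chooses the encoding explicitly before passing bytes in.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")


def is_valid_utf8(data):
    """Throwaway helper: report whether data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(utf8_bytes, is_valid_utf8(utf8_bytes))
print(latin1_bytes, is_valid_utf8(latin1_bytes))
```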

--
Ticket URL: <http://trac.xapian.org/ticket/346#comment:57>
Xapian <http://xapian.org/>