[Xapian-tickets] [Xapian] #346: Python 3 support
Xapian
nobody at xapian.org
Fri Jun 15 23:21:09 BST 2012
#346: Python 3 support
--------------------------------------+-------------------------------------
Reporter: olly | Owner: richard
Type: defect | Status: assigned
Priority: highest | Milestone: 1.3.2
Component: Xapian-bindings (Python) | Version: SVN trunk
Severity: normal | Keywords:
Blockedby: | Platform: All
Blocking: |
--------------------------------------+-------------------------------------
Comment(by richard):
To chip in here: after much reflection, I think the strings/bytes problem
for Xapian is essentially that Python 3 requires us to build an
abstraction on top of Xapian's API to separate strings and bytes. However,
because Xapian doesn't store information about the types of strings that
many of its methods were given, such an abstraction must be leaky.
Therefore, we should aim to make the abstraction as thin as possible, to
make it easy to document the leaks.
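To illustrate why the abstraction has to leak, here is a minimal sketch.
ByteStore is a hypothetical stand-in for any Xapian structure that keeps
only raw bytes; it is not part of the real bindings, and the slot numbers
are arbitrary.

```python
class ByteStore:
    """Hypothetical stand-in for a backend that keeps only raw bytes."""

    def __init__(self):
        self._slots = {}

    def put(self, slot, data):
        # Only the bytes survive; whether they started life as text is lost.
        self._slots[slot] = bytes(data)

    def get(self, slot):
        return self._slots[slot]


def try_decode(data):
    """Return the UTF-8 decoding of data, or None if it is not valid UTF-8."""
    try:
        return data.decode("utf8")
    except UnicodeDecodeError:
        return None


store = ByteStore()
store.put(0, "café".encode("utf8"))   # text, explicitly encoded by the caller
store.put(1, b"\xff\xfe\x00\x01")     # arbitrary binary data

text_back = store.get(0)
binary_back = store.get(1)
```

Both values come back as bytes, and nothing in the store says which one
was text. Decoding the first gives the original string, while decoding
the second fails, so a binding that guessed "decode everything" would
break on binary data.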
The easiest and cleanest way to do this is to avoid doing any magic
conversions, use the bytes type in all places where binary strings may be
used (both in parameters and in return values), and use the unicode string
type in any places where only text strings (which Xapian expects to be
UTF-8 encoded) may be used. Specifically:
- all methods to which it can ever make sense to pass arbitrary binary
strings should accept only bytes. Those which are sometimes used to
store text strings will therefore need the user to call .encode('utf8')
explicitly before passing the arguments. However, this is actually a
good thing, since the user will get that encoded value back if they later
ask Xapian for the string again.
- methods which can return arbitrary binary strings should return only
bytes. No magic "return unicode if the arguments passed to the call were
unicode" or similar hackery, because this just makes the API harder to
document and understand, and leads to subtle, hard-to-track-down bugs.
- there should be no special method variants added as syntactic sugar to
do conversions, i.e., no "add_value_unicode" methods. It's clearer to tell
the user to use .encode('utf8') before making the call, because then the
user knows what actually happened without having to learn about each
special API method in turn.
- methods in the C++ API which always expect UTF-8 encoded strings (such
as the QueryParser::parse_query() method) should accept only unicode
strings. This will ensure that we're never making assumptions about the
character set of the input, and is appropriate because users should be
using unicode strings when manipulating text already (so there shouldn't
be a need to do foo.decode('utf8') or anything similar).
- callbacks from C++ to Python whose parameters include strings coming
from Xapian should pass byte strings if those parameters can be arbitrary
binary strings, and should pass unicode strings if there is no situation
where that may happen.
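The rules above can be sketched as plain type checks. This is only an
illustration of the proposed behaviour, not the bindings' implementation:
SketchDocument and sketch_parse_query are hypothetical stand-ins for a
value-storing method and a text-only method, and nothing here calls the
real xapian module.

```python
def require_bytes(value, name):
    """Reject anything but bytes, with a hint about the explicit encode."""
    if not isinstance(value, bytes):
        raise TypeError(
            "%s must be bytes, not %s; call .encode('utf8') first"
            % (name, type(value).__name__)
        )
    return value


def require_text(value, name):
    """Reject anything but str (the unicode string type in Python 3)."""
    if not isinstance(value, str):
        raise TypeError(
            "%s must be str, not %s" % (name, type(value).__name__)
        )
    return value


class SketchDocument:
    """Hypothetical stand-in for an object holding arbitrary binary values."""

    def __init__(self):
        self._values = {}

    def add_value(self, slot, value):
        # Binary-capable parameter: bytes only, no automatic conversion.
        self._values[slot] = require_bytes(value, "value")

    def get_value(self, slot):
        # Binary-capable return value: always bytes.
        return self._values.get(slot, b"")


def sketch_parse_query(query_text):
    """Hypothetical text-only entry point: str in, UTF-8 handed to C++."""
    # The wrapper, not the caller, does the encoding for text-only methods.
    return require_text(query_text, "query_text").encode("utf8")


doc = SketchDocument()
doc.add_value(0, "zürich".encode("utf8"))
stored = doc.get_value(0)
```

With this shape, doc.add_value(0, "zürich") and
sketch_parse_query(b"zurich") both raise TypeError, so the caller always
knows which type crosses the boundary.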
Following this scheme will cause some backwards compatibility problems,
since in Xapian's Python 2 bindings, many methods accept unicode strings
and automatically convert them to UTF-8 before passing to Xapian, but we
wouldn't carry this over to the Python 3 bindings. However, the bytes /
unicode support in Python 2 is a mess and needs sorting out, so what
better time than the Python 3 transition for doing this?
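As a rough sketch of the difference for callers, the two helpers below
are hypothetical and use a plain dict in place of Xapian. One mimics the
automatic UTF-8 conversion described above; the other follows the strict
scheme.

```python
def py2_style_store(slots, slot, value):
    """Hypothetical: mimic automatic UTF-8 conversion of unicode input."""
    if isinstance(value, str):
        value = value.encode("utf8")  # silent conversion
    slots[slot] = value


def strict_store(slots, slot, value):
    """Hypothetical: the proposed behaviour, bytes only."""
    if not isinstance(value, bytes):
        raise TypeError("value must be bytes; call .encode('utf8') first")
    slots[slot] = value


legacy = {}
py2_style_store(legacy, 0, "naïve")
# The caller passed text but gets bytes back, so the comparison fails.
silent_round_trip = legacy[0] == "naïve"

strict = {}
strict_store(strict, 0, "naïve".encode("utf8"))
# The caller encoded explicitly, so the matching decode is the obvious step.
explicit_round_trip = strict[0].decode("utf8") == "naïve"
```

Under Python 3 the first comparison is False and the second is True,
which is the asymmetry the strict scheme makes visible at the call site.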
--
Ticket URL: <http://trac.xapian.org/ticket/346#comment:31>
Xapian <http://xapian.org/>