[Xapian-tickets] [Xapian] #346: Python 3 support
Xapian
nobody at xapian.org
Sat Sep 21 01:07:32 BST 2013
#346: Python 3 support
--------------------------------------+------------------------------
Reporter: olly | Owner: richard
Type: defect | Status: assigned
Priority: highest | Milestone: 1.3.2
Component: Xapian-bindings (Python) | Version: SVN trunk
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
--------------------------------------+------------------------------
Comment (by olly):
I discussed this some with michelp on IRC yesterday. His feeling was
making the conversions explicit was better, but having methods accept str
or bytes for convenience would be reasonable. The "accept bytes or str,
return bytes" is exactly what dbm (in the Python standard library) does:
{{{
import dbm
# Open database, creating it if necessary.
db = dbm.open('cache', 'c')
# Record some values
db[b'hello'] = b'there'
db['www.python.org'] = 'Python Website'
db['www.cnn.com'] = 'Cable News Network'
# Note that the keys are considered bytes now.
assert db[b'www.python.org'] == b'Python Website'
# Notice how the value is now in bytes.
assert db['www.cnn.com'] == b'Cable News Network'
# Often-used methods of the dict interface work too.
print(db.get('python.org', b'not present'))
# Storing a non-string key or value will raise an exception (most
# likely a TypeError).
db['www.yahoo.com'] = 4
# Close when done.
db.close()
}}}
Example taken from: http://docs.python.org/3.1/library/dbm.html
I tried to see if anyone's criticised this API, but failed to find
anything online, so I've been thinking this seemed a reasonable approach.
There's a clear logic to it, which is simple to explain and to understand
(and as a bonus, also to implement):
The API is wrapped as "C++ std::string <-> Python bytes". For
convenience, if you pass in str, then rather than raising an exception, we
convert it to bytes containing UTF-8 (since that's Xapian's standard
representation for Unicode text).
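
A minimal sketch of that rule in pure Python, for illustration only. The helper name `to_bytes` is hypothetical; the real bindings are generated with SWIG, so the conversion would not live in Python code like this.

```python
def to_bytes(value):
    """Hypothetical helper: accept bytes or str, always hand back bytes."""
    if isinstance(value, bytes):
        # Already bytes: pass through untouched, so binary data is safe.
        return value
    if isinstance(value, str):
        # Convenience path: encode text as UTF-8 rather than raising.
        return value.encode('utf-8')
    # Anything else is still an error, as with dbm.
    raise TypeError('expected bytes or str, got %s' % type(value).__name__)


print(to_bytes('caf\u00e9'))   # text in, UTF-8 bytes out
print(to_bytes(b'\xff\x00'))   # arbitrary binary data is unchanged
```

The point of the sketch is that one rule covers every std::string parameter, so there is nothing to decide per method.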
Returning bytes works for UTF-8 or binary data (and we could leave adding
variants returning Unicode until later without breaking code written to
the bytes-returning API). It also matches what we do for Python 2, so
should also be familiar to existing users. It suffers from a round-
tripping issue (add a term as str, get it back as bytes), but so does the
dbm API, and nobody seems to be ranting about how confusing they find
that.
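
The round-tripping issue can be seen directly with dbm. This sketch uses `dbm.dumb` (the pure-Python backend) in a temporary directory so it leaves no files behind; the function name `round_trip` is only for illustration.

```python
import dbm.dumb
import os
import tempfile


def round_trip(key, value):
    """Store one item in a throwaway dbm.dumb database and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        db = dbm.dumb.open(os.path.join(tmp, 'demo'), 'c')
        try:
            db[key] = value
            # Both the stored value and the listed keys come back as bytes.
            return db[key], list(db.keys())
        finally:
            db.close()


stored, keys = round_trip('term', 'caf\u00e9')
print(type(stored), type(keys[0]))
```

A str goes in and bytes come out, which is the same asymmetry the bindings would have for terms.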
It's possible to pick what happens on a function by function basis with
SWIG, but unless there's some way to generate the lists of methods and
parameters automatically, that will be a maintenance headache. It's also
a more complicated rule for users to grasp.
Richard mentions text input to the !QueryParser or !TermGenerator as a
case where only str should be accepted, but for the common case where your
source data is in UTF-8, that would force you to convert from UTF-8 to
<whatever Python uses internally> to get a Python str object, and then
you'd pass that to the bindings which would have to convert it back to
UTF-8 again to pass to xapian-core. Even if the changes in PEP 393 mean
that Python 3.3 would keep the data as UTF-8 (I'm not clear on whether that's
actually the case), it still complicates the rules for the API user to
remember.
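
To make the double conversion concrete, here is a small sketch comparing the two paths. Both function names are hypothetical and neither touches Xapian; they only show what happens to the data on its way to the C++ side.

```python
def via_str(raw_utf8):
    """Path forced by a str-only API: decode, then re-encode for xapian-core."""
    text = raw_utf8.decode('utf-8')   # UTF-8 -> Python str
    return text.encode('utf-8')       # Python str -> UTF-8 again


def direct(raw_utf8):
    """Path allowed by a bytes-accepting API: hand the UTF-8 over unchanged."""
    return raw_utf8


# Stands in for source data that is already UTF-8 (e.g. read from a file).
source = 'na\u00efve caf\u00e9'.encode('utf-8')
print(via_str(source) == direct(source))
```

Both paths deliver the same bytes; the str-only path just does a decode and an encode to get there.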
The number of cases which inherently ''only'' take binary data is very
small - I think it's only sortable_unserialise() currently. I think not
accepting str for those cases would be desirable and feasible to do - here
Unicode makes no sense, so an error really is better.
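
A sketch of what "bytes only" could look like for such a function. The name `unserialise_strict` is hypothetical, and the `struct` format below is only a stand-in so the example runs; it is not the encoding sortable_unserialise() actually uses.

```python
import struct


def unserialise_strict(data):
    """Hypothetical binary-only function: str is rejected, not encoded."""
    if not isinstance(data, bytes):
        # No convenience conversion here: Unicode text is meaningless input.
        raise TypeError('expected bytes, got %s' % type(data).__name__)
    # Stand-in decoding (big-endian double), NOT Xapian's real format.
    return struct.unpack('>d', data)[0]


packed = struct.pack('>d', 1.5)
print(unserialise_strict(packed))
```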
I don't think it's really feasible to have a "unicode version" of classes
like Document or Database - the distinction can't really be made at the
class level - e.g. values will often be binary but terms usually text.
I'm not sure you can make it at the function/method level either - a
function could have two std::string parameters, or a std::string parameter
and return type, with one being "text" and the other "binary data". E.g. the former
pattern matches !QueryParser's parse_query() and add_prefix() methods if
you think term prefixes are binary data (as suggested above); the latter
pattern is what an API for serialising a string would look like. So if
"foo_str()" means all std::string parameters and return values are
unicode, that doesn't really work - this is one reason why I think
accepting either str or bytes for parameters is more workable - then the
unicode variant only affects the return value, and there's only one of
those (well, strictly speaking, returning std::pair or via passed in
pointers/references is possible, and would naturally map to returning a
tuple in Python, but only worrying about return values greatly reduces the
number of cases, and multiple return values in C++ are a bit awkward so
tend to be rare).
!QueryParser assumes the encoding of text is UTF-8, so providing a way to
specify that str should be converted to a different encoding just seems to
be setting a trap for users. For those who really want to do this,
converting to bytes in the encoding they want in their own code allows
this.
--
Ticket URL: <http://trac.xapian.org/ticket/346#comment:57>
Xapian <http://xapian.org/>