Xapian-Haystack is available in Python 3

Olly Betts olly at survex.com
Thu Dec 10 00:02:48 GMT 2015


On Sat, Nov 14, 2015 at 01:27:32PM +0100, Jorge Cardoso Leitão wrote:
> Here I report some of the "features" that hindered this task from our
> perspective, so that Xapian devs are aware of the kind of problems a user
> may face on this process.
> 
> 1. Dev version of Xapian has different names for their tools, namely
> xapian-config and delve. xapian-config became xapian-config-1.3, delve
> became xapian-delve-1.3.
> 
> Suggestion: make names independent of oddity of the minor version.

The default program-suffix of "-1.3" allows clean parallel installation
of development versions alongside stable versions, which helps people
avoid getting confused about which version they're running when they
have both installed.  This is only enabled for development releases
(1.4.x won't add "-1.4" by default), and can be turned off by
configuring xapian-core with:

./configure --program-suffix=

The "delve" to "xapian-delve" rename is an entirely separate thing - the
other tools we install all have a "xapian-" prefix, and some packaged
versions of Xapian already rename "delve" to "xapian-delve", so we
decided to rename "delve" upstream for consistency (and so this rename
will be in 1.4.x).

If you want to run delve from a script, the reality is you already
need to check for both names to be portable to those packaged versions,
so by renaming upstream we actually make things simpler in the longer
term (as you'll be able to just assume "xapian-delve" once 1.2.x isn't
relevant).
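
As a rough sketch of that check, a script could probe for the known
names in order.  This uses Python's standard shutil.which(); the
candidate list and the helper name find_first_executable are
illustrative assumptions, not something Xapian provides:

```python
import shutil
from typing import Iterable, Optional

# Names mentioned in this thread; the order is an illustrative choice.
DELVE_CANDIDATES = ("xapian-delve", "xapian-delve-1.3", "delve")


def find_first_executable(candidates: Iterable[str]) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    delve = find_first_executable(DELVE_CANDIDATES)
    print(delve if delve else "no delve binary found on PATH")
```

Once 1.2.x is no longer relevant, the candidate list would shrink to
just "xapian-delve".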

> 2. Almost all Xapian bindings output is in non-unicode that can be
> converted to unicode via `decode('utf-8')`, which is great. Yet, this is
> still not perfect because e.g. `xapian.sortable_unserialise(12.345)` is not
> decodable to utf-8.

Your example doesn't make sense, as sortable_unserialise() doesn't
accept a float - did you mean this?

xapian.sortable_serialise(12.345)

> Thus, depending on the type of field (string, int,
> float) (in the user side), its value will be either a string or byte
> strings, something that is against any Python idiom.
> 
> Suggestion: make all public interface of Xapian in Python to return either
> unicode or utf-8 decodable strings. IMO, at the current state of Python
> development where unicode is *the* standard, it is the bindings
> responsibility to return unicodes. If that is not possible in Xapian
> bindings, at least consider making the output totally decodable, so
> a user can be sure that any Xapian public interface allows .decode('utf-8').

To implement what you suggest, we'd have to come up with a whole new
serialisation which produces data that is also valid UTF-8.  That's
inherently going to be less compact, and incompatible with existing
databases serialised with the current algorithm.

So there are definite downsides to this, and the benefits of being able
to handle opaque serialised data as UTF-8 seem pretty thin.  If you
really don't like the serialised form being a binary blob, you can just
eschew these functions and use your own serialisation functions instead.
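
For illustration, here is a minimal sketch of such a do-it-yourself
serialisation in Python.  It is not part of Xapian and the function
names are made up.  It maps a float to 16 hex digits which sort as
strings in the same order as the numbers do, so the result is plain
ASCII (and hence valid UTF-8), at the cost of a fixed 16 characters
per value:

```python
import struct

_SIGN_BIT = 1 << 63
_ALL_BITS = (1 << 64) - 1


def float_to_sortable_text(value: float) -> str:
    """Encode a float as 16 hex digits whose string order matches numeric order.

    NaN is not given any meaningful position by this sketch.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    if bits & _SIGN_BIT:
        bits ^= _ALL_BITS   # negative: flip every bit so larger magnitudes sort first
    else:
        bits |= _SIGN_BIT   # non-negative: set the top bit so it sorts after negatives
    return "%016x" % bits


def sortable_text_to_float(text: str) -> float:
    """Invert float_to_sortable_text()."""
    bits = int(text, 16)
    if bits & _SIGN_BIT:
        bits ^= _SIGN_BIT   # was non-negative: clear the top bit again
    else:
        bits ^= _ALL_BITS   # was negative: flip every bit back
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


if __name__ == "__main__":
    for v in (-100.5, 0.0, 12.345):
        print(v, float_to_sortable_text(v))
```

Values stored this way can only be compared with other values encoded
the same way; they aren't interchangeable with the output of
sortable_serialise().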

As for the more general question of bytes vs unicode, we already
discussed that at great length in the ticket for Python 3 support, and
ended up with the current plan:

http://trac.xapian.org/ticket/346

I'm not keen to rehash that whole discussion, and doing so would
inevitably delay 1.4.0, for which the loudest clamouring is from the
Python 3 people.

> 3. In Xapian-Haystack we use TravisCI to build against different Python,
> Django and Xapian versions. Installing Xapian takes 95% of the total build
> time. Any suggestion how to reduce this? For concreteness, here is the
> installation file we are using:
> https://github.com/notanumber/xapian-haystack/blob/master/install_xapian.sh

You can configure core with --disable-static, which will probably halve
the time for the build of the core library itself:

./configure --disable-static

The default CXXFLAGS are "-O2 -g" for GCC (and masquerading compilers
like clang).  The "-g" option generates debug information, which
probably adds a bit of time to the build, so you could try:

./configure --disable-static CXXFLAGS=-O2

Turning down the optimisation level would help too, though that also
changes what you're testing more significantly.  For example:

./configure --disable-static CXXFLAGS=-O1

You could even try "CXXFLAGS=-O0", though unoptimised code can be much
larger and run significantly more slowly, so you might end up losing as
much as you gain there.
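
To try these combinations from a build script, something like the
following Python sketch could assemble the configure command line.  The
helper name and its parameters are hypothetical; only the flags
themselves come from the suggestions in this mail:

```python
from typing import List, Optional


def xapian_configure_args(disable_static: bool = True,
                          cxxflags: Optional[str] = "-O2",
                          program_suffix: Optional[str] = None) -> List[str]:
    """Build an argument list for xapian-core's configure script."""
    args = ["./configure"]
    if disable_static:
        args.append("--disable-static")
    if cxxflags is not None:
        # Overriding CXXFLAGS replaces the default "-O2 -g", which drops "-g".
        args.append("CXXFLAGS=" + cxxflags)
    if program_suffix is not None:
        # An empty string turns off the "-1.3" suffix on development releases.
        args.append("--program-suffix=" + program_suffix)
    return args


if __name__ == "__main__":
    # To actually run it, pass the list to subprocess.run(..., check=True)
    # from inside the xapian-core source directory.
    print(" ".join(xapian_configure_args(cxxflags="-O1")))
```

Timing a few of these variants on the CI machine is the only way to
know which one pays off for a given test suite.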

Cheers,
    Olly
