[Xapian-discuss] using Xapian as backend for google

Wed Dec 13 04:41:56 GMT 2006

On Mon, Dec 11, 2006 at 06:04:40PM +0000, James Aylett wrote:
> On Fri, Dec 08, 2006 at 05:23:27AM +0000, Olly Betts wrote:
> > It's likely to be.  Note that there's scope for improving matters with
> > enhancements to Xapian here - there are some obvious things to improve
> > (which I'm working my way through), and profiling should reveal more.
> 
> We keep on vaguely mentioning getting a set of tests which stretch a
> setup to make this kind of tuning investigation easier. Of course that
> only enables tuning for that particular profile, but even so it would
> point the way for others.

I recently bought myself a large disk (SATA) and I've deliberately
partitioned it and only formatted half.  At some point I'm hoping to
try some tests indexing a large test collection using Xapian with
the various Linux filing system options - I suspect there's quite
a variation in performance, but it's hard to make recommendations
without benchmarking.

> If someone has some time to help on building the initial tests, that
> would probably be worth doing.

I think it would.  I was thinking that the wikipedia data (you can
download dumps of it) would be a reasonable large test collection
which should be redistributable without worries, but that's just a
suggestion - anything suitable without onerous redistribution
restrictions would be fine.

> (But Olly, Richard, feel free to correct me if there's more useful
> stuff in the short term.)

I actually had other things in mind when I wrote the text quoted above -
for example, when indexing a lot of new documents, we could use a lot
less memory by holding the changes to posting lists in delta-compressed
form, like how they are stored on disk.  This would mean we could store a
lot more changes between flushes with the same memory usage, or free
up more memory for caching disk blocks, or some combination.  That
should speed up this indexing case significantly.
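
As a rough illustration of why delta-compression saves memory, here is a
minimal Python sketch.  It is not Xapian's actual encoding: the function
names and the 7-bits-per-byte variable-length integer format are assumptions
made purely for the example.  It stores each document id as the gap from the
previous one, so a run of nearby ids takes about a byte each.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 data bits per byte.

    The high bit of each byte is set when more bytes follow.
    """
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def compress_postings(docids):
    """Delta-compress a strictly increasing list of positive document ids."""
    out = bytearray()
    prev = 0
    for docid in docids:
        if docid <= prev:
            raise ValueError("document ids must be positive and strictly increasing")
        # Store only the gap from the previous id.
        out += encode_varint(docid - prev)
        prev = docid
    return bytes(out)


def decompress_postings(data):
    """Invert compress_postings, returning the original list of ids."""
    docids = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            # Continuation bit set: more bytes belong to this gap.
            shift += 7
        else:
            prev += value
            docids.append(prev)
            value = 0
            shift = 0
    return docids


if __name__ == "__main__":
    postings = [1000, 1001, 1005, 1100]
    packed = compress_postings(postings)
    print(len(packed), "bytes for", len(postings), "postings")
    assert decompress_postings(packed) == postings
```

The four ids above pack into five bytes, where an uncompressed in-memory
list of integers would need several bytes per entry plus container overhead.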

There are a few other obvious areas to look at like this, but they all
probably require a good knowledge of Xapian's internals, whereas a test
collection and set of scripts for benchmarking with is a project that
could be tackled by anyone with suitable generic Unix skills (and time!)
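
The core of such a benchmarking script could be as small as the following
Python sketch, which times repeated runs of an arbitrary command.  The
function name `time_command` is hypothetical, and the command shown is only a
placeholder; a real run would substitute whatever indexing command and test
collection are being measured, and would probably want to control for the
disk cache between runs.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken run is not
        # silently recorded as a fast one.
        subprocess.run(argv, check=True)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Placeholder command: replace with the indexing command under test.
    command = [sys.executable, "-c", "pass"]
    for i, seconds in enumerate(time_command(command), start=1):
        print("run %d: %.3f s" % (i, seconds))
```

Running the same script once per filing system, with the database directory
on each in turn, would give directly comparable timings.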

The other useful thing to do is run profiling tests to identify where
time is spent - some slow areas are obvious, but often time is spent in
places that don't jump out at you.  On Linux, it seems oprofile is best
for this as it profiles the whole system, so we see where I/O fits in.
Just by using oprofile, Arjen found that the unpacking of position lists
in flint was taking ~5% of the time for a phrase search.

Cheers,
    Olly