[Xapian-discuss] Xapian performance on gmane.org compared

Fri Aug 28 05:32:12 BST 2009

On Thu, Aug 27, 2009 at 08:34:58PM +0200, Henry wrote:
> Quoting "Richard Boulton" <richard at tartarus.org>:
> > and there are certainly issues
> > with its performance (eg http://trac.xapian.org/ticket/326 ).  Have you
> > tried this with flint, and if so, how do the times compare?

We should be careful not to create FUD about ourselves here.

The case in the ticket is fairly particular - searches for a *single
term* when everything is cached are slower for chert, by a factor of
7 in the benchmarked case, due to extra CPU time spent parsing the
document length data.

But to put things in perspective, that means that the average time per
search is 0.001258 seconds rather than 0.000178 seconds.  We should
try to reclaim as much of that difference as we can (and I'm working on
that currently), but millisecond search times aren't actually a problem
per se, while a 40 second phrase search definitely is.
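
To make the scale concrete, here is a quick back-of-the-envelope check in
Python.  It uses only the two averages quoted above and the 40 second
phrase search from this thread; the variable names are just illustrative.

```python
# Average single-term search times quoted above (seconds per search).
chert_avg = 0.001258
flint_avg = 0.000178

# The phrase search time reported earlier in the thread (seconds).
phrase_search = 40.0

ratio = chert_avg / flint_avg              # how many times slower chert is
extra_ms = (chert_avg - flint_avg) * 1000  # extra cost per search, in ms
share = (chert_avg - flint_avg) / phrase_search  # fraction of the 40 s

print(f"chert/flint ratio: {ratio:.1f}")
print(f"extra time per search: {extra_ms:.2f} ms")
print(f"share of a 40 s phrase search: {share:.4%}")
```

The ratio comes out at roughly 7 and the extra cost at roughly a
millisecond per search, which is a tiny fraction of a 40 second query.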

The relevant change here is that while flint stores the document length
with every posting list entry, chert only stores it once.  So if a
document has 1000 different terms, flint stores its document length 1000
times, but chert stores it just once.
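
As a rough illustration of the difference, here is a toy model in Python.
This is only a sketch and not the actual on-disk format of either backend:
the corpus, the function names and the tuple layouts are all made up for
the example, and "document length" is simply taken to be the number of
term occurrences.

```python
from collections import defaultdict

# Hypothetical toy corpus: docid -> list of term occurrences.
docs = {
    1: ["xapian", "search", "index"],
    2: ["search", "phrase"],
    3: ["xapian"],
}

def flint_like(docs):
    """Toy layout: every posting entry carries (docid, wdf, doclen)."""
    postlists = defaultdict(list)
    for docid, terms in docs.items():
        doclen = len(terms)
        for term in sorted(set(terms)):
            postlists[term].append((docid, terms.count(term), doclen))
    return postlists

def chert_like(docs):
    """Toy layout: postings carry (docid, wdf); doclen is kept once per doc."""
    postlists = defaultdict(list)
    doclens = {}
    for docid, terms in docs.items():
        doclens[docid] = len(terms)
        for term in sorted(set(terms)):
            postlists[term].append((docid, terms.count(term)))
    return postlists, doclens

flint_postlists = flint_like(docs)
chert_postlists, chert_doclens = chert_like(docs)

# Count how many document length values each toy layout stores.
flint_copies = sum(len(entries) for entries in flint_postlists.values())
chert_copies = len(chert_doclens)

print("doclen values stored, flint-like:", flint_copies)
print("doclen values stored, chert-like:", chert_copies)
```

In the flint-like layout the number of stored lengths grows with the number
of (document, term) pairs, while in the chert-like layout it grows only
with the number of documents.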

This change reduces the size of the postlist table quite substantially
(by 44% for gmane), which makes quite a difference if you don't have
enough memory to cache the whole database, and even if you do, it will
help the cold cache case.

The chert layout will also tend to do better for searches with more terms.

And as a side benefit, it provides us with a compact list of all
documents present in the database which allows "pure NOT" to work faster
when there are gaps in the document id usage, and takes us one step
nearer being able to make the termlist table optional.

> I haven't tried flint; perhaps I should give it a try (and give up on  
> the dream-like belief that the latest bleeding-edge chert *must* be  
> better :).  Do you have any idea how it compares search  
> performance-wise?  Especially with phrase searching?

A phrase by definition has at least two terms, so the effects noted in
#326 shouldn't be as pronounced here.  And most of that 40 seconds will
be fetching positional information from disk, so even a whole
millisecond slowdown wouldn't matter.

Cheers,
    Olly