[Xapian-discuss] Chert vs Flint performance (was Re: Ranking and term proximity)

Thu Sep 15 07:35:24 BST 2011

On Tue, Sep 06, 2011 at 08:35:32AM +0200, goran kent wrote:
> Reminds me of
> http://trac.xapian.org/ticket/326 - chert (without patches, but even
> with, it's still bad) is 7x SLOWER than the older flint format.

This is incorrect.

What is true is that at one point during development chert was about 7x
slower than flint for a particular test case.  But we made changes (see
the comments on the ticket) and this was brought down substantially.
Richard never reported updated results sadly, and I don't have access to
his test dataset or code, so I can't tell you what the true slow-down
for this case actually is for chert in stable releases.

The ticket is still open because there's still scope for improvement
here, not because we just ignored the issue.

> That's embarrassing.  Yes, one can argue that chert *may* perform
> better with larger indexes, but hell, that's still a bad start...

This really isn't an change motivated purely by performance with larger
databases at the expense of smaller ones.

In flint (and earlier) we stored the document length alongside every
posting list entry.  So if a document has 1000 terms, we store its
length in 1000 places.  That's great for the case where you want to run
a single term query which needs the document lengths, but for every
other case it's not really helpful.  Queries with multiple terms (or
which don't want the document length - e.g. using BoolWeight) have to
skip over the document length data in all but one of the posting lists.

We've felt for some time this was not the right approach, and in chert
we fixed it to store a separate chunked list of the document lengths.

For the single term case when using document lengths, this means reading
two lists, which is going to be at least a little slower on average.
But the trade off is that more complex queries will be faster than
before.  And even with chert as it was when #326 was originally
reported, the queries were taking an average of 0.001258 seconds rather
than 0.000178 seconds.  Also, if you really care that much about
performance, you're going to be caching generated results at higher
levels, and single term queries are inherently more amenable to such
caching.

Splitting out the document lengths has other benefits - for example, it
means we now have a list of all the document ids which are in use, which
allows for implementing things like pure-NOT (`NOT <subquery>') much
more efficiently in the general case.  It also means you can add/remove
terms to/from an existing document and only have to update the posting
lists for those terms.  With flint, if the document length changes then
every posting list entry for that document needs to be updated with the
new document length.

This change does also reduce the size of the database quite a bit, which
is indeed likely to benefit larger databases - in particular it can make
the difference between being able to cache your working set in RAM
comfortably or not.  But that's also good for smaller databases - modern
machines have tiered caches, so smaller will tend to be faster even if
you can comfortably fit everything in RAM.  And performance when nothing
is cached matters for some applications - for example, an index of all
my files could easily fit in the memory of my laptop, but unless I've
recently searched, the database is probably not cached in RAM at all.

Cheers,
    Olly