Xapian 1.3.5 snapshot performance and index size

Olly Betts olly at survex.com
Mon Apr 11 02:47:30 BST 2016


On Sun, Apr 10, 2016 at 04:47:01PM +0200, Jean-Francois Dockes wrote:
> Some might notice the 50% index size increase. Excessive index size is
> already one relatively rare, but recurring complaint. Except if I did
> something wrong: I'm actually quite surprised by it.

Did you try compacting the resulting databases?

Creating a database by calling add_document() repeatedly would have
resulted in a close to compact position table with chert, but that's not
true with glass (because the position table is no longer sorted
primarily by the document id).  But if you compact the result, it should
be a fair bit smaller with glass than chert.
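
As a rough illustration of why the key order matters here, the sketch
below counts how many position-table entries would be added at the end
of the key space when documents are indexed in docid order. It is a toy
model, not Xapian's B-tree code: the function name, the "docid_first"
and "term_first" labels and the sample documents are all made up for the
example. The assumption is that a B-tree packs better when new keys
arrive in ascending order than when they land in the middle of existing
blocks.

```python
# Toy model only: this is NOT Xapian's B-tree implementation.
# "docid_first" stands for a chert-like key order, "term_first" for a
# glass-like one; both labels are hypothetical names for this sketch.

def count_append_inserts(docs, key_order):
    """Return (appends, scattered) for one entry per (document, term).

    An insert counts as an append when its key is greater than every key
    inserted so far; otherwise it lands somewhere inside the existing
    key space and counts as scattered.
    """
    appends = 0
    scattered = 0
    highest = None
    for docid, terms in docs:
        for term in sorted(terms):
            if key_order == "docid_first":
                key = (docid, term)
            else:
                key = (term, docid)
            if highest is None or key > highest:
                appends += 1
                highest = key
            else:
                scattered += 1
    return appends, scattered


# Three tiny made-up documents, added in increasing docid order.
docs = [
    (1, {"to", "be", "or", "not"}),
    (2, {"be", "here", "now"}),
    (3, {"not", "now", "to"}),
]

print(count_append_inserts(docs, "docid_first"))  # (10, 0)
print(count_append_inserts(docs, "term_first"))   # (5, 5)
```

With the docid-first order every insert is an append, while with the
term-first order half of them fall inside the existing key range, which
is the kind of pattern that would leave blocks partly full until the
database is compacted.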

Creating a database from scratch is the worst case for this (but of course
a common one).  In general day to day use, this effect should be less
marked.
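
To compare sizes before and after compaction, something along these
lines might do. It is a minimal sketch which assumes the xapian-compact
tool is on the PATH and that the database is a directory; the helper
names and any paths passed in are hypothetical.

```python
import os
import shutil
import subprocess


def compact_command(source_db, dest_db):
    """Build the basic xapian-compact invocation: source, then destination."""
    return ["xapian-compact", source_db, dest_db]


def compact_if_available(source_db, dest_db):
    """Run xapian-compact if it is installed; return False if it is not."""
    if shutil.which("xapian-compact") is None:
        return False
    subprocess.run(compact_command(source_db, dest_db), check=True)
    return True


def dir_size(path):
    """Total size in bytes of all regular files under a directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Calling compact_if_available() on the freshly built database and then
dir_size() on both the source and the destination directory gives the
two figures to compare.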

> Of course, having faster phrase searches is a good thing. Maybe I have not
> run the right tests to display the maximum effect of the new code ?

The cases that motivated these changes were really those taking tens of
seconds (or even minutes for the extreme ones), and were generally
sub-second afterwards - 5.8 to 2.1 seconds is at the unimpressive end
of the improvements seen.  One particular issue with "to be or not to
be" will be that we don't currently try to reuse the postlist or
positional data for "to" and "be", so it has to decode them twice.
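
To make the repeated decoding concrete, here is a small sketch of a
phrase check over one document, with and without reusing the decoded
position list for a repeated term. It is only an illustration and not
how Xapian's matcher is written: the comma-separated "encoding", the
function names and the sample positions are invented for the example.

```python
def decode_positions(encoded):
    """Stand-in decoder: the 'encoded' form is a comma-separated string."""
    return [int(p) for p in encoded.split(",")]


def phrase_matches(phrase_terms, encoded_positions, reuse=True):
    """Return (start positions of exact phrase matches, decode count).

    With reuse=True each distinct term is decoded once and the result is
    shared; with reuse=False every occurrence of a term in the phrase is
    decoded again.
    """
    decode_calls = 0
    cache = {}
    decoded = []
    for term in phrase_terms:
        if reuse and term in cache:
            positions = cache[term]
        else:
            positions = decode_positions(encoded_positions[term])
            decode_calls += 1
            cache[term] = positions
        decoded.append(positions)
    # The phrase starts at p if term i occurs at position p + i.
    position_sets = [set(p) for p in decoded]
    starts = [
        p for p in decoded[0]
        if all(p + i in position_sets[i] for i in range(len(position_sets)))
    ]
    return starts, decode_calls


# Positions for a made-up document "to be or not to be that is".
encoded = {"to": "1,5", "be": "2,6", "or": "3", "not": "4"}
phrase = ["to", "be", "or", "not", "to", "be"]

print(phrase_matches(phrase, encoded, reuse=False))  # ([1], 6)
print(phrase_matches(phrase, encoded, reuse=True))   # ([1], 4)
```

The match result is the same either way; only the number of decodes
changes, from one per phrase term to one per distinct term.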

> As it is, and still hoping that more 1.3 optimization will improve the
> situation, I have to wonder if the price paid for faster phrase searches
> is not a bit too high, given that these are rather infrequent queries, and

It's difficult to make the call on changes like this, but I do feel
that searches taking minutes are completely unacceptable.  How much users
use phrase searches varies a lot, but even if it's a small fraction of
queries, active users will hit such cases and form the impression that
the system is unreliable (and for multi-user systems, it affects the
speed of other queries, as you can end up with the server bogged down
with the long-running searches).  It's made worse by users often
responding to an apparently stalled search by hitting reload in their
browser.

> that the improvement, while very significant, does not completely solve the
> issue.

2.1 seconds is slower than I'd like, but it's at least in the realms of
"that took a while" rather than "the computer has hung".

We're closing in on 1.4.0, so there's not scope for much of this to
change markedly before then.  But I do have plans for internal
improvements which should help the indexing speed and memory usage, and
should be suitable for 1.4.x.

I'm not sure there's an easy solution to the position table not coming
out compact in this case.  Supporting a choice of which key order to use
is possible, but adds some complexity.

Cheers,
    Olly


