Xapian 1.3.5 snapshot performance and index size

Mon Apr 11 08:54:36 BST 2016

Olly Betts writes:
 > On Sun, Apr 10, 2016 at 04:47:01PM +0200, Jean-Francois Dockes wrote:
 > > Some might notice the 50% index size increase. Excessive index size is
 > > already a relatively rare but recurring complaint. Unless I did
 > > something wrong, I'm actually quite surprised by it.
 > 
 > Did you try compacting the resulting databases?
 > 
 > Creating a database by calling add_document() repeatedly would have
 > resulted in a close to compact position table with chert, but that's not
 > true with glass (because the position table is no longer sorted
 > primarily by the document id).  But if you compact the result, it should
 > be a fair bit smaller with glass than chert.
 > 
 > Creating a database from scratch is the worst case for this (but of course
 > a common one).  In general day to day use, this effect should be less
 > marked.

I had not compacted. After compacting, the 1.3 index is indeed smaller
than the 1.2 one.
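
For anyone wanting to repeat the comparison, here is a minimal sketch of
the "index, then compact, then compare sizes" step. It assumes the
xapian-compact tool is on the PATH and is called as "xapian-compact
SOURCE DESTINATION"; the helper names (dir_size, compact_command,
compact_if_available) are made up for this illustration and are not part
of Xapian.

```python
import os
import shutil
import subprocess


def dir_size(path):
    """Total size in bytes of all files under a database directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def compact_command(source_dir, dest_dir):
    """Build the command line: source database first, destination last."""
    return ["xapian-compact", source_dir, dest_dir]


def compact_if_available(source_dir, dest_dir):
    """Run xapian-compact if it is installed; return True if it ran."""
    if shutil.which("xapian-compact") is None:
        return False  # tool not installed, nothing done
    subprocess.run(compact_command(source_dir, dest_dir), check=True)
    return True
```

Calling dir_size() on the source directory before compacting and on the
destination afterwards gives the two figures to compare.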

 > > Of course, having faster phrase searches is a good thing. Maybe I have not
 > > run the right tests to display the maximum effect of the new code?
 > 
 > The cases that motivated these changes were really those taking tens of
 > seconds (or even minutes for the extreme ones), and were generally
 > sub-second afterwards - 5.8 to 2.1 seconds is at the unimpressive end
 > of the improvements seen.  One particular issue with "to be or not to
 > be" will be that we don't currently try to reuse the postlist or
 > positional data for "to" and "be", so it has to decode them twice.
 > 
 > > As it is, and still hoping that more 1.3 optimization will improve the
 > > situation, I have to wonder if the price paid for faster phrase searches
 > > is not a bit too high, given that these are rather infrequent queries, and
 > 
 > It's difficult to make the call on changes like this, but I do feel
 > that searches taking minutes are completely unacceptable.  How much users
 > use phrase searches varies a lot, but even if it's a small fraction of
 > queries, active users will hit such cases and form the impression that
 > the system is unreliable (and for multi-user systems, it affects the
 > speed of other queries, as you can end up with the server bogged down
 > with the long-running searches).  It's made worse by users often
 > responding to an apparently stalled search by hitting reload in their
 > browser.
 > 
 > > that the improvement, while very significant, does not completely solve the
 > > issue.
 > 
 > 2.1 seconds is slower than I'd like, but it's at least in the realms of
 > "that took a while" rather than "the computer has hung".

My spinning disk machine was actually "too cold": I should have thought a
bit more and run a query on another index first to get the program text
pages in memory.

This way, "to be or not to be" goes from 11 s to 0.6 s, and "to be of
the" goes from 12 s to 0.9 s. Which is of course brilliant!
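
As a toy illustration of the remark quoted above about "to" and "be"
being decoded twice, the sketch below counts how many position lists get
decoded for a phrase with repeated terms, with and without reusing the
decoded data. Everything in it is an assumption made for the example:
the delta-encoded lists, the single-document setup and the function
names are hypothetical and say nothing about how Xapian stores or
matches positions internally.

```python
from itertools import accumulate

# Hypothetical delta-encoded positions for one document whose text is
# "to be or not to be" (word positions 1..6). Not Xapian's real format.
ENCODED = {"to": [1, 4], "be": [2, 4], "or": [3], "not": [4]}

# Counter of decode operations, kept in a dict so callers can reset it.
stats = {"decodes": 0}


def decode(term):
    """Turn the delta-encoded list for a term into absolute positions."""
    stats["decodes"] += 1
    return list(accumulate(ENCODED[term]))


def phrase_positions(terms, reuse):
    """Return the start positions where the terms occur consecutively.

    With reuse=True a term repeated in the phrase is decoded only once.
    """
    cache = {}
    decoded = []
    for term in terms:
        if reuse and term in cache:
            positions = cache[term]
        else:
            positions = decode(term)
            cache[term] = positions
        decoded.append(set(positions))
    return [
        start
        for start in sorted(decoded[0])
        if all(start + i in decoded[i] for i in range(1, len(terms)))
    ]
```

For the six-term phrase "to be or not to be", the version without reuse
decodes six lists and the version with reuse decodes four, one per
distinct term.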

I think that I can dump my plan of indexing compound terms for runs of
common words :)

 > We're closing in on 1.4.0, so there's not scope for much of this to
 > change markedly before then.  But I do have plans for internal
 > improvements which should help the indexing speed and memory usage, and
 > should be suitable for 1.4.x.
 > 
 > I'm not sure there's an easy solution to the position table not coming
 > out compact in this case.  Supporting a choice of which key order to use
 > is possible, but adds some complexity.

The question which remains for me is whether I should run xapian-compact
after an initial indexing operation. I guess that this depends on the number
of expected updates and that there is no easy answer?

jf