Xapian 1.3.5 snapshot performance and index size
Jean-Francois Dockes
jf at dockes.org
Mon Apr 11 08:54:36 BST 2016
Olly Betts writes:
> On Sun, Apr 10, 2016 at 04:47:01PM +0200, Jean-Francois Dockes wrote:
> > Some might notice the 50% index size increase. Excessive index size is
> > already a relatively rare, but recurring, complaint. Unless I did
> > something wrong, I'm actually quite surprised by it.
>
> Did you try compacting the resulting databases?
>
> Creating a database by calling add_document() repeatedly would have
> resulted in a close to compact position table with chert, but that's not
> true with glass (because the position table is no longer sorted
> primarily by the document id). But if you compact the result, it should
> be a fair bit smaller with glass than chert.
>
> Creating a database from scratch is the worst case for this (but of course
> a common one). In general day-to-day use, this effect should be less
> marked.
I had not compacted. After compacting, the 1.3 index is indeed smaller
than the 1.2 one.
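(For anyone trying to reproduce this: a plain run of the compaction tool
over the freshly built index is enough, something like the following, with
placeholder paths:

    xapian-compact /path/to/xapiandb /path/to/xapiandb-compact

The compacted copy then gets used in place of the original.)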
> > Of course, having faster phrase searches is a good thing. Maybe I have not
> > run the right tests to show the maximum effect of the new code?
>
> The cases that motivated these changes were really those taking tens of
> seconds (or even minutes for the extreme ones), and were generally
> sub-second afterwards - 5.8 to 2.1 seconds is at the unimpressive end
> of the improvements seen. One particular issue with "to be or not to
> be" will be that we don't currently try to reuse the postlist or
> positional data for "to" and "be", so it has to decode them twice.
>
> > As it is, and still hoping that more 1.3 optimization will improve the
> > situation, I have to wonder if the price paid for faster phrase searches
> > is not a bit too high, given that these are rather infrequent queries, and
>
> It's difficult to make the call on changes like this, but I do feel
> that searches taking minutes is completely unacceptable. How much users
> use phrase searches varies a lot, but even if it's a small fraction of
> queries, active users will hit such cases and form the impression that
> the system is unreliable (and for multi-user systems, it affects the
> speed of other queries, as you can end up with the server bogged down
> with the long-running searches). It's made worse by users often
> responding to an apparently stalled search by hitting reload in their
> browser.
>
> > that the improvement, while very significant, does not completely solve the
> > issue.
>
> 2.1 seconds is slower than I'd like, but it's at least in the realms of
> "that took a while" rather than "the computer has hung".
My spinning disk machine was actually "too cold": I should have thought a
bit more and run a query on another index first to get the program text
pages into memory.
Done that way, "to be or not to be" goes from 11 s to 0.6 s, and "to be of
the" goes from 12 s to 0.9 s, which is of course brilliant!
I think that I can dump my plan of indexing compound terms for runs of
common words :)
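(For the record, the abandoned plan was roughly: at indexing time, also emit
a compound term for each adjacent pair of common words, so that a phrase made
only of very frequent words can be answered through much rarer terms. A rough
sketch of the idea, not real Recoll code, with an invented stopword list and
an invented compound-term spelling:

    #include <xapian.h>
    #include <set>
    #include <string>
    #include <vector>

    // Alongside the normal positional postings, also post "to_be",
    // "be_or", ... for runs of common words.
    void index_with_compounds(Xapian::WritableDatabase& db,
                              const std::vector<std::string>& words) {
        static const std::set<std::string> common =
            {"to", "be", "or", "not", "of", "the"};
        Xapian::Document doc;
        Xapian::termpos pos = 1;
        for (size_t i = 0; i < words.size(); ++i, ++pos) {
            doc.add_posting(words[i], pos);
            if (i + 1 < words.size() &&
                common.count(words[i]) && common.count(words[i + 1])) {
                // Invented compound-term spelling.
                doc.add_posting(words[i] + "_" + words[i + 1], pos);
            }
        }
        db.add_document(doc);
    }

The query side would then have to rewrite pure-stopword phrases to use these
terms; with the new phrase search speed that extra machinery no longer seems
worth it.)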
> We're closing in on 1.4.0, so there's not scope for much of this to
> change markedly before then. But I do have plans for internal
> improvements which should help the indexing speed and memory usage, and
> should be suitable for 1.4.x.
>
> I'm not sure there's an easy solution to the position table not coming
> out compact in this case. Supporting a choice of which key order to use
> is possible, but adds some complexity.
The question which remains for me is whether I should run xapian-compact
after an initial indexing operation. I guess that this depends on the amount
of expected updates, and that there is no easy answer?
jf