Xapian 1.3.5 snapshot performance and index size

Sat Apr 23 07:49:20 BST 2016

On Tue, Apr 12, 2016 at 11:28:52AM +0200, Jean-Francois Dockes wrote:
> Olly Betts writes:
>  > Ideally we'd find a way to make it come out more compact to start with.
>  > 
>  > One thing which could help is making glass more willing to switch to
>  > "sequential mode".  If you fancy some more benchmarking, you could
>  > try changing SEQ_START_POINT in backends/glass/glass_table.cc.
>  > 
>  > It defaults to -10, but I don't think anyone has tried tuning it
>  > recently (this value comes from Martin's original code in commit
>  > 26bd647ff6084c60d8869f27d6abbd99e06c3f30 back in 2000 - he may have done
>  > tests to select it, but even if he did, so much has changed since).
>  > Something like -3 or -4 might work well - probably enough that it
>  > shouldn't enable when it's not useful, and by default we ensure at least
>  > 4 items fit in a block.
> 
> Ok, I tried this, with not much luck.

Many thanks for taking a look at this.

If you have the databases from your test around still, what's the
size of the tables in one of them after compaction?  It shouldn't
make a difference which version of the output database you compact to
find this.

> I used a script to edit the SEQ_START_POINT value, then rebuild and
> install Xapian, then run the indexing.
> 
> Sizes don't change much... Maybe I did something wrong, 

I've been pondering your results, and have a few insights.

Looking at the variations in table size, the postlist table actually
benefits more from changing SEQ_START_POINT than the position table
does, with a size reduction of 8% in the best case, which is pretty
significant.

I think the reason it makes more difference there is that the items
in the postlist table tend to be larger.  A lot of the positional data
entries are actually very small, so for the position table we'll often
have inserted enough items sequentially to have switched to sequential
mode before we need to split a block, whatever SEQ_START_POINT is.
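
To make that reasoning concrete, here's a toy model in Python (not the
actual glass code).  It assumes a counter which starts at
SEQ_START_POINT, goes up by one for each consecutive sequential insert,
and enables sequential mode once it reaches zero.  The function name and
the items-per-block figures are made up for illustration.

```python
def sequential_before_first_split(seq_start_point, items_per_block):
    """Toy model: is sequential mode on by the time a block fills up?

    Assumes every insert is sequential and that the block needs to be
    split after items_per_block inserts (both are simplifications).
    """
    seq_count = seq_start_point
    for _ in range(items_per_block):
        if seq_count < 0:
            seq_count += 1  # one more consecutive sequential insert
    return seq_count >= 0


# Large items (only 4 fit in a block) vs small items (50 fit in a block).
for start in (-10, -4):
    for per_block in (4, 50):
        print(start, per_block, sequential_before_first_split(start, per_block))
```

In this model, with -10 a block holding only 4 large items fills up
before sequential mode is enabled, while a block holding 50 small items
doesn't, so lowering the threshold to -4 only changes the outcome for
the large-item case.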

And making the wrong call about an uneven split can make things worse
as it creates a block < 50% full and a block much fuller than 50%.  If
the next batch of updates doesn't touch the under-full block but
splits the fuller one, we can end up with more unused space than if
we'd just split evenly.
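
As a rough worked example of that effect, here's a small Python sketch.
The 90%/10% split is a made-up figure, and the model assumes the next
batch only touches the fuller block and that this second split is an
even one; it's arithmetic to show the effect, not a measurement.

```python
def blocks_after_refill(fuller_percent):
    """Fill levels (in percent) after a split and one further batch.

    A full block is split so that one half is fuller_percent full and
    the other holds the remainder.  The next batch then fills the
    fuller block to 100% and it is split evenly; the other block is
    never touched.
    """
    emptier = 100 - fuller_percent
    return [50, 50, emptier]


def utilisation(blocks):
    """Average fill level across the blocks, in percent."""
    return sum(blocks) / len(blocks)


even = blocks_after_refill(50)    # [50, 50, 50]
uneven = blocks_after_refill(90)  # [50, 50, 10]
print(utilisation(even), utilisation(uneven))
```

Under these assumptions the even split leaves three blocks averaging
50% full, while the wrongly-guessed uneven split leaves them averaging
under 37% full.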

There looks to be scope for improvement here, but it's not as simple
as just reducing SEQ_START_POINT, as I'd naively hoped.  If we had
an "oracle" which could predict with perfect foresight where we
should split a block for the best end result, we could expect at least
an 8% improvement for the postlist table, and probably significantly
better.  I'd expect good gains for the position table too.

So the question is, can we build at least a useful approximation to an
oracle?

And the answer is likely yes, since we have all the data batched up at
the point this is relevant, so we can look ahead to see what's coming
(or pack it in a speculative way, or something along those lines).
I think with care the overhead of doing so can be kept low too.

A change like this isn't going to happen before 1.4.0, but since it
doesn't require format changes, it could be done in 1.4.x.

Cheers,
    Olly