Xapian 1.3.5 snapshot performance and index size
olly at survex.com
Tue Apr 12 01:44:14 BST 2016
On Mon, Apr 11, 2016 at 09:54:36AM +0200, Jean-Francois Dockes wrote:
> This way, "to be or not to be" gets from 11 S to 0.6 S, and "to be of
> the" gets from 12 S to 0.9 S. Which is of course brilliant !
> I think that I can dump my plan of indexing compound terms for runs of
> common words :)
We had been experimenting with bigrams to accelerate phrases, and not
having to go that route was one motivation for the key order change.
The bigram terms would add significantly to DB size, and to cache
> > I'm not sure there's an easy solution to the position table not coming
> > out compact in this case. Supporting a choice of which key order to use
> > is possible, but adds some complexity.
> The question which remains for me is if I should run xapian-compact after an
> initial indexing operation. I guess that this depends on the amount of
> expected updates and that there is no easy answer ?
I don't think it's obvious whether that's a good plan or not.
Ideally we'd find a way to make it come out more compact to start with.
One thing which could help is making glass more willing to switch to
"sequential mode". If you fancy some more benchmarking, you could
try changing SEQ_START_POINT in backends/glass/glass_table.cc.
It defaults to -10, but I don't think anyone has tried tuning it
recently (this value comes from Martin's original code in commit
26bd647ff6084c60d8869f27d6abbd99e06c3f30 back in 2000 - he may have done
tests to select it, but even if he did, so much has changed since).
Something like -3 or -4 might work well - probably enough that it
shouldn't enable when it's not useful, and by default we ensure at least
4 items fit in a block.
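
To make the effect of the threshold concrete, here is a toy model of the
idea. It is a simplified sketch and not the actual code in glass_table.cc:
the SeqDetector class, its members and inserts_until_sequential() are
hypothetical names made up for illustration, and the assumption is that a
counter starts at the start point, climbs by one for each insert in
ascending key order, resets on an out-of-order insert, and that sequential
mode is on once the counter reaches zero.

```cpp
// Toy model only: NOT the glass backend's actual implementation.
// All names here are hypothetical and exist purely for illustration.
class SeqDetector {
    int start_point_;        // e.g. -10 (current default) or -4 (suggested)
    int seq_count_;          // climbs towards 0 on ascending inserts
    bool have_last_ = false; // false until the first insert is seen
    long last_key_ = 0;

  public:
    explicit SeqDetector(int start_point)
        : start_point_(start_point), seq_count_(start_point) {}

    // Record an insert of `key`.  Returns true if sequential mode is
    // considered active after this insert.
    bool insert(long key) {
        if (have_last_) {
            if (key > last_key_) {
                // In order: move one step closer to sequential mode.
                if (seq_count_ < 0) ++seq_count_;
            } else {
                // Out of order: start counting again from scratch.
                seq_count_ = start_point_;
            }
        }
        have_last_ = true;
        last_key_ = key;
        return seq_count_ >= 0;
    }
};

// Number of strictly ascending inserts needed before sequential mode
// turns on in this model, for a given start point.
int inserts_until_sequential(int start_point) {
    SeqDetector d(start_point);
    int n = 1;
    long key = 0;
    while (!d.insert(key++)) ++n;
    return n;
}
```

In this model a start point of -10 needs 11 ascending inserts before
sequential mode turns on, while -4 needs 5, so a run of a handful of
in-order items would already benefit. Whether the real table behaves
the same way would need checking against the source and benchmarking.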