[Xapian-devel] goodbye QuartzBufferedTable

Fri Aug 13 12:54:40 BST 2004

On Fri, Aug 13, 2004 at 09:10:44AM +0200, Arjen van der Meijden wrote:
> On 13-8-2004 1:56, Olly Betts wrote:
> > I should add something so the batch size can be set without
> > recompiling though.
> 
> I'll watch the cvs-commits for this. Will you also allow a switch (or an 
> environment variable or whatever) on scriptindex to adjust this?

For the time being, I'll probably just pull the value from an
environment variable inside quartz itself.  We should also look at
whether a document-count-based flush is the best approach - now that
we only cache changed postings in memory, counting the number of
cached postings might be more appropriate since that'll mostly
dictate memory usage and how much work the merging step does.
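
As a rough illustration of the two ideas above (a threshold read from the
environment, and flushing on the number of cached postings rather than on
the number of documents), here is a minimal Python sketch.  It is not the
quartz code: the variable name EXAMPLE_FLUSH_THRESHOLD, the default of
1000 and the PostingBuffer class are all hypothetical, and one posting is
assumed to mean one distinct term in one document.

```python
import os

DEFAULT_THRESHOLD = 1000  # assumed default, not taken from quartz


def flush_threshold(environ=os.environ, name="EXAMPLE_FLUSH_THRESHOLD"):
    """Read the threshold from the environment, else use the default."""
    raw = environ.get(name, "")
    try:
        value = int(raw)
    except ValueError:
        # Unset or not a number: fall back rather than fail.
        return DEFAULT_THRESHOLD
    return value if value > 0 else DEFAULT_THRESHOLD


class PostingBuffer:
    """Counts changed postings held in memory and flushes on that count."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.cached = 0   # postings currently held in memory
        self.flushes = 0  # number of flushes performed so far

    def add_document(self, terms):
        # Assumption: each distinct term in the document is one posting.
        self.cached += len(set(terms))
        if self.cached >= self.threshold:
            self.flush()

    def flush(self):
        # A real backend would merge the cached postings into its tables here.
        self.cached = 0
        self.flushes += 1
```

With this shape, a run of short documents and a single very long one
trigger a flush at about the same memory cost, which a plain document
count can't guarantee.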

> Making it runtime/startuptime adjustable will at least allow easier 
> searching for semi-optimal values. Finding the real-optimal values will 
> probably cost a lot of extra time, while not really improving the 
> performance that much.

I believe we can pick a reasonable default for most users.  If you've
got 10,000,000 documents, it's worth your while spending a bit of time
tuning.

Also, with a smaller collection, it's nice to be able to see documents
searchable while the indexer is still running.  With a large collection
you'd rather get the indexing done sooner.

Perhaps omindex (and maybe scriptindex) ought to force a flush after 10,
100, 1000 documents or something like that.  Mind you, my first batch of
2000 documents is currently taking 4.5 seconds to index - the box is an
Athlon 64 3000+ with 3G of memory and SCSI disks, but I doubt the first
batch uses much memory at all.
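
One way to picture that schedule, as a sketch only: flush early at 10, 100
and 1000 documents so a small collection becomes searchable quickly, then
fall back to the regular batch size.  The function names and the way the
two rules are combined are assumptions for illustration, not what omindex
or scriptindex do.

```python
def should_flush(doc_count, batch_size, early=(10, 100, 1000)):
    """Return True if a flush is due after indexing doc_count documents."""
    if doc_count <= 0:
        return False
    if doc_count in early:
        # Early flushes make the first documents searchable sooner.
        return True
    # Otherwise flush once per full batch.
    return doc_count % batch_size == 0


def flush_points(total_docs, batch_size, early=(10, 100, 1000)):
    """List the document counts at which a flush would happen."""
    return [n for n in range(1, total_docs + 1)
            if should_flush(n, batch_size, early)]
```

For example, flush_points(4500, 2000) gives [10, 100, 1000, 2000, 4000].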

> Currently we allow scriptindex to either run with 1000 documents or a 
> set of documents that results in 16MB of data (whichever limit comes 
> first), and that makes scriptindex use somewhere in the range of 150-250MB 
> of RAM.
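
The "whichever limit comes first" batching described in the quoted text
can be sketched like this.  It is a hypothetical helper, not the actual
scriptindex wrapper: documents are assumed to be bytes objects, and 16MB
is taken to mean 16 * 1024 * 1024 bytes.

```python
def batches(documents, max_docs=1000, max_bytes=16 * 1024 * 1024):
    """Yield lists of documents, closing a batch when either limit is hit."""
    batch, size = [], 0
    for doc in documents:
        batch.append(doc)
        size += len(doc)
        # Close the batch as soon as the count or the byte total is reached.
        if len(batch) >= max_docs or size >= max_bytes:
            yield batch
            batch, size = [], 0
    if batch:
        # Whatever is left over forms a final, smaller batch.
        yield batch
```

Note that the byte limit is checked after adding a document, so a batch
can overshoot max_bytes by up to one document's size.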

You should find the memory usage is a lot lower now.  Before, we were
buffering changes to the tables in memory, which used a lot of it.
Now we just update the Btree and leave it to the OS to cache blocks
from the Btree, which appears to be a better use of the memory.

I've just checked how my build is doing, and interestingly after 3
million documents, the indexing rate is a little under twice what it was
before I changed the batch size.

Cheers,
    Olly