[Xapian-devel] goodbye QuartzBufferedTable

Fri Aug 13 18:19:17 BST 2004

On Fri, Aug 13, 2004 at 12:54:40PM +0100, Olly Betts wrote:
> On Fri, Aug 13, 2004 at 09:10:44AM +0200, Arjen van der Meijden wrote:
> > On 13-8-2004 1:56, Olly Betts wrote:
> > > I should add something so the batch size can be set without
> > > recompiling though.
> > 
> > I'll watch the cvs-commits for this. Will you also allow a switch (or an 
> > environment value or whatever) on scriptindex to adjust this?
> 
> For the time being, I'll probably just pull the value from an
> environment variable inside quartz itself.  We should also look at
> whether a document count based flush is the best approach - now that
> we only cache changed postings in memory, counting the number of
> cached postings might be more appropriate since that'll mostly
> dictate memory usage and how much work the merging step does.

I don't have easy access to the number of cached postings (I'd have to
tally them myself), but I do have access to the change in document
length which is similar (except it adds the wdfs for the postings
rather than counting them) so I've added an option to flush on that
too.
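
To make the difference between the two measures concrete, here is a
rough sketch (the TermWdfs type and both function names are made up for
illustration and are not quartz's internal structures).  A document
contributes one posting per distinct term, whereas its length is the sum
of the wdfs of those terms:

```cpp
#include <map>
#include <string>

// Hypothetical representation of one document's terms: term -> wdf
// (within-document frequency).  Not Xapian's internal data structure.
using TermWdfs = std::map<std::string, unsigned>;

// Number of postings the document contributes: one per distinct term.
inline unsigned long posting_count(const TermWdfs& doc) {
    return doc.size();
}

// Document length as used here: the sum of the wdfs of all its terms.
inline unsigned long document_length(const TermWdfs& doc) {
    unsigned long total = 0;
    for (const auto& entry : doc) total += entry.second;
    return total;
}
```

The two values track each other closely for typical text, which is why
the length change can stand in for the posting count.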

Set XAPIAN_FLUSH_THRESHOLD=5000 in the environment (and export it!) to
flush every 5000 documents, or XAPIAN_FLUSH_THRESHOLD_LENGTH=1000000 to
flush each time the total change in document length reaches 1000000.
Set both to flush at whichever is reached first.  Set neither and the
default is to flush every 1000 documents as before.
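
The rules above could be sketched roughly as follows.  This is an
illustration only, not the code in quartz: FlushPolicy, env_ulong,
make_policy and policy_from_environment are hypothetical names, and the
sketch assumes that setting only the length variable leaves the
document-count limit switched off.

```cpp
#include <cstdlib>

// Hypothetical flush policy; a threshold of 0 means "not set".
struct FlushPolicy {
    unsigned long doc_threshold = 0;
    unsigned long length_threshold = 0;

    // Flush as soon as either configured threshold has been reached.
    bool should_flush(unsigned long docs_changed,
                      unsigned long length_changed) const {
        if (doc_threshold != 0 && docs_changed >= doc_threshold) return true;
        if (length_threshold != 0 && length_changed >= length_threshold)
            return true;
        return false;
    }
};

// Read an unsigned value from the environment; 0 if unset or unparsable.
inline unsigned long env_ulong(const char* name) {
    const char* p = std::getenv(name);
    return p ? std::strtoul(p, nullptr, 10) : 0;
}

// Apply the default: with neither value set, flush every 1000 documents.
inline FlushPolicy make_policy(unsigned long docs, unsigned long length) {
    FlushPolicy policy;
    policy.doc_threshold = docs;
    policy.length_threshold = length;
    if (docs == 0 && length == 0) policy.doc_threshold = 1000;
    return policy;
}

inline FlushPolicy policy_from_environment() {
    return make_policy(env_ulong("XAPIAN_FLUSH_THRESHOLD"),
                       env_ulong("XAPIAN_FLUSH_THRESHOLD_LENGTH"));
}
```

An indexer would call should_flush() after each document with the
running counts since the last flush, and reset both counts whenever it
flushes.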

I'm now trying flushing every 5000 documents - this seems to work
very well.  After an initial period I'm so far getting a sustained 160
documents per second (it has currently done 519K documents).  I'm using
CVS HEAD with the "dangerous" quartz patch (which I sent to the mailing
list about 2 months ago).

You can see some graphs here:

http://www.survex.com/~olly/gmaneindexrate.html

The plots are of indexing rate (documents indexed per second) against
database size (in 1000s of documents).  Newer plots are at the top -
ignore those from the old box as the hardware isn't even close to
comparable.

There's one point per 1000 documents, so with a flush every 5000
documents only every fifth point includes the cost of a flush, which
gives the slight rippling effect.

> > Currently we allow scriptindex to either run with 1000 documents or a 
> > set of documents that results in 16MB of data (whichever limit comes 
> > first) and that makes scriptindex use amounts in the range of 150-250MB 
> > of ram.
> 
> You should find the memory usage is a lot lower now.  Before we were
> buffering changes to the tables in memory, which used a lot of memory.
> Now we just update the Btree and leave it to the OS to cache blocks
> from the Btree which appears to be a better use of the memory.

As a data point, my indexer (a custom one, but of similar complexity to
scriptindex) seems to level off at around 60MB with CVS HEAD.

Cheers,
    Olly