[Xapian-discuss] How to speed up indexing ?

Olly Betts olly at survex.com
Thu Aug 21 14:42:10 BST 2008


On Thu, Aug 21, 2008 at 07:17:00PM +1000, cel tix44 wrote:
> I'm trying to test-index my dataset -- some 200'000 docs, each
> document being (on average) 50 bytes long and having 6 words.
> 
> I tried (a) not to use stemmer, (b) commit_transaction() on every
> 50/100/etc. docs, (c) not to use transactions at all -- but in all
> scenarios indexing goes at ~10 doc/sec or 500 bytes per second.

Forcing flushes more frequently will make indexing slower, not faster.
Assuming you have a decent amount of memory, you want to flush changes
in *larger* batches than the default.  To do this, set
XAPIAN_FLUSH_THRESHOLD in the environment.  It is counted in documents,
and the default is 10000.  With such short documents, I suspect you
could index all 200000 in a single batch.
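
As a rough illustration of setting the variable, here is a minimal Python
sketch that launches an indexer process with XAPIAN_FLUSH_THRESHOLD raised.
The helper names (env_with_flush_threshold, run_indexer) are made up for
this example, and the indexer command is whatever program you already use.
The only thing taken from Xapian itself is the environment variable name.
The variable is set before the child process starts, so it is already in
place when the database is opened.

```python
import os
import subprocess
import sys


def env_with_flush_threshold(threshold, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set.

    Hypothetical helper for illustration; threshold is a document count.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(threshold)
    return env


def run_indexer(command, threshold):
    """Run an indexer command (a list of arguments) with the threshold set.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    return subprocess.run(
        command, env=env_with_flush_threshold(threshold), check=True
    )


if __name__ == "__main__":
    # Stand-in for a real indexer: a child process that just echoes the
    # variable, to show that the setting reaches the child.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['XAPIAN_FLUSH_THRESHOLD'])",
    ]
    run_indexer(child, 200000)
```

With a real indexer you would replace the stand-in command with your own
program and its arguments.  From a shell, setting the variable on the command
line or with your platform's usual mechanism has the same effect.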

I've not seen any performance studies of Xapian on Windows, and
I'm not aware of any large deployments, so it is possible that the
Windows virtual memory subsystem just sucks badly for Xapian's usage
patterns.  If
you're still struggling to get good performance after setting
XAPIAN_FLUSH_THRESHOLD, I'd suggest trying the same code on a similar
spec box running Linux or similar to see how it compares.  If there's a
problem here it might be possible to improve things.

Cheers,
    Olly


