[Xapian-discuss] Indexing speed benchmark - Xapian, Solr

Arjen van der Meijden acmmailing at tweakers.net
Sat Apr 18 08:12:12 BST 2009


On 18-4-2009 5:42 Olly Betts wrote:
> Regarding the indexing time, by default Xapian auto-commits every 10000
> documents, which is pretty conservative on modern hardware.  The article
> doesn't mention tuning this (by setting XAPIAN_FLUSH_THRESHOLD) so I
> assume he didn't.  If you have plenty of RAM, increasing that will speed
> up indexing a lot.  I'd imagine on the hardware described you could
> index all million documents in one go, especially since they are
> truncated to 2000 characters which is really short.  And if you index in
> one go, the database shouldn't need compacting either.

A while back I reindexed our entire document set, which is about 1.3M
documents, 8GB in text size or so. I did that on a not-so-fast machine,
a quad-core Xeon with 1.6GHz cores and 8GB of RAM on a 10-disk RAID5.
The reindex was done inside a virtual machine, so it had an additional
penalty for I/O.

I generated a single input file for scriptindex, which took seven and a
half hours to process it, creating an (uncompacted) database of about
25GB. I forgot to tune the flush threshold, so it could've been faster.
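
To put rough numbers on that run, here is a minimal Python sketch. It
only does arithmetic on the figures quoted above and shows one way the
XAPIAN_FLUSH_THRESHOLD environment variable could be set for a child
indexing process. The helper names and the threshold of 1,000,000 are
illustrative assumptions, not a tested recommendation for this hardware.

```python
import os

# Default auto-commit interval, as stated in the quoted message.
DEFAULT_FLUSH_THRESHOLD = 10_000


def commits_needed(num_docs: int, threshold: int) -> int:
    """Automatic commits triggered when flushing every `threshold` documents."""
    return -(-num_docs // threshold)  # ceiling division


def indexer_env(threshold: int) -> dict:
    """Copy of the current environment with XAPIAN_FLUSH_THRESHOLD set.

    Hypothetical helper: the result would be passed as `env=` to
    subprocess.run() when launching the indexer.
    """
    env = dict(os.environ)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(threshold)
    return env


num_docs = 1_300_000        # "about 1.3M documents"
elapsed_s = 7.5 * 3600      # "seven and a half hours"

default_commits = commits_needed(num_docs, DEFAULT_FLUSH_THRESHOLD)
docs_per_second = num_docs / elapsed_s

print(default_commits)             # 130
print(round(docs_per_second, 1))   # 48.1

# Assumed, untested value: flush once per million documents.
env = indexer_env(1_000_000)
print(env["XAPIAN_FLUSH_THRESHOLD"])  # 1000000
```

So with the default threshold the run went through about 130 automatic
commits at an average of roughly 48 documents per second. Whether a
much larger threshold fits in 8GB of RAM would have to be measured.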

What I did notice is that the process started out with a few seconds of
small reads, then a large write batch, and then the small reads again.
Towards the end, though, it did the small reads, then several very
large read batches, then the writes, and so on.
So I can imagine that the "small" amount of RAM and/or a somewhat
underpowered I/O subsystem can influence these results pretty badly.

> To give an idea how much difference that would make, Gmane's index
> (running on the new chert backend) is 130GB of which the termlist table
> is 62GB.  Gmane doesn't currently index positional data - if it did I
> guess the database would be roughly twice as large, but that's still
> about a 25% space saving if the termlist table were removed.

Our database is as follows: 12GB position, 11GB postlist (4.6GB
compacted), 3.4GB termlist (3.1GB compacted), giving a 26GB uncompacted
and 19GB compacted total.
So dropping the termlist won't help that much with relatively large
documents, I guess?
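
The comparison can be worked out from the figures in this thread. The
Python sketch below is plain arithmetic: the doubling of Gmane's
database with positional data is Olly's guess as quoted, and the
function and variable names are only illustrative.

```python
def share(part_gb: float, total_gb: float) -> float:
    """Percentage of the total database size taken by one table."""
    return 100.0 * part_gb / total_gb


# Gmane (quoted): 130GB total, 62GB termlist, no positional data.
gmane_total_gb = 130.0
gmane_termlist_gb = 62.0
# Quoted guess: positional data would roughly double the database.
gmane_with_positions_gb = 2 * gmane_total_gb
gmane_saving = share(gmane_termlist_gb, gmane_with_positions_gb)

# Our database, uncompacted table sizes in GB.
tables_gb = {"position": 12.0, "postlist": 11.0, "termlist": 3.4}
uncompacted_total_gb = sum(tables_gb.values())
our_saving_uncompacted = share(tables_gb["termlist"], uncompacted_total_gb)

# Compacted: 3.1GB termlist out of the stated 19GB total.
our_saving_compacted = share(3.1, 19.0)

print(round(gmane_saving, 1))            # 23.8
print(round(our_saving_uncompacted, 1))  # 12.9
print(round(our_saving_compacted, 1))    # 16.3
```

That is roughly 13% (uncompacted) to 16% (compacted) for our database,
against the roughly 24% that the quoted "about 25%" corresponds to.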

Best regards,

Arjen


