[Xapian-discuss] Indexing speed benchmark - Xapian, Solr
Arjen van der Meijden
acmmailing at tweakers.net
Sat Apr 18 08:12:12 BST 2009
On 18-4-2009 5:42 Olly Betts wrote:
> Regarding the indexing time, by default Xapian auto-commits every 10000
> documents, which is pretty conservative on modern hardware. The article
> doesn't mention tuning this (by setting XAPIAN_FLUSH_THRESHOLD) so I
> assume he didn't. If you have plenty of RAM, increasing that will speed
> up indexing a lot. I'd imagine on the hardware described you could
> index all million documents in one go, especially since they are
> truncated to 2000 characters which is really short. And if you index in
> one go, the database shouldn't need compacting either.
A while back I reindexed our entire document set, which is about 1.3M
documents, or roughly 8GB of text. I did that on a fairly slow machine:
a quad-core Xeon with 1.6GHz cores and 8GB of RAM on a 10-disk RAID5.
The reindex ran inside a virtual machine, so it incurred an additional
I/O penalty.
I generated a single input file for scriptindex, which took seven and a
half hours to process it, creating an (uncompacted) database of
about 25GB. I forgot to tune the flush threshold, so it could have
been faster.
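As a rough sketch, raising the flush threshold for a scriptindex run could
look like this in Python. XAPIAN_FLUSH_THRESHOLD is the environment variable
Olly mentions; the database path, script name, input file and the value
1300000 are assumptions for illustration, not the settings used for the run
described here.

```python
import os
import subprocess


def scriptindex_invocation(database, index_script, input_file, flush_threshold):
    """Build (argv, env) for a scriptindex run with a custom flush threshold."""
    env = dict(os.environ)
    # Number of documents Xapian buffers before it commits automatically.
    env["XAPIAN_FLUSH_THRESHOLD"] = str(flush_threshold)
    # scriptindex takes the database, the index script, then the input file(s).
    argv = ["scriptindex", database, index_script, input_file]
    return argv, env


def run_scriptindex(database, index_script, input_file, flush_threshold):
    """Run scriptindex; requires the scriptindex binary to be on PATH."""
    argv, env = scriptindex_invocation(
        database, index_script, input_file, flush_threshold
    )
    subprocess.run(argv, env=env, check=True)


# Hypothetical names; a threshold of 1.3M would buffer the whole document set,
# which only makes sense if the machine has enough RAM for it.
argv, env = scriptindex_invocation("db", "index.script", "dump.txt", 1300000)
```

Whether a threshold this large is sensible depends on available memory; with
8GB of RAM a smaller value may be the safer choice.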
What I did notice is that the process started out with a few seconds of
small reads, then a large write batch, and then the small reads again.
Towards the end, though, it did the small reads, then several very large
read batches, then the writes, and so on.
So I can imagine that a "small" amount of RAM, a somewhat underpowered
I/O subsystem, or both can influence these results pretty badly.
> To give an idea how much difference that would make, Gmane's index
> (running on the new chert backend) is 130GB of which the termlist table
> is 62GB. Gmane doesn't currently index positional data - if it did I
> guess the database would be roughly twice as large, but that's still
> about a 25% space saving if the termlist table were removed.
Our database breaks down as follows: 12GB position, 11GB postlist (4.6GB
compacted), 3.4GB termlist (3.1GB compacted), giving 26GB uncompacted
and 19GB compacted in total.
So dropping the termlist won't help that much with relatively large
documents I guess?
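To put the two cases side by side, here is a small calculation of the
termlist's share of each database. It uses only the sizes quoted in this
thread; treating "roughly twice as large" as exactly 2 x 130GB for Gmane is
an assumption made for the arithmetic.

```python
def share(part_gb, total_gb):
    """Return part as a percentage of total."""
    return 100.0 * part_gb / total_gb


# Gmane: 62GB termlist in a 130GB database without positional data;
# with positions the total is assumed to double to 260GB.
gmane = share(62, 2 * 130)

# Our database, using the termlist sizes and totals given above.
ours_uncompacted = share(3.4, 26)
ours_compacted = share(3.1, 19)

print(f"Gmane (estimated, with positions): {gmane:.1f}%")
print(f"Ours, uncompacted: {ours_uncompacted:.1f}%")
print(f"Ours, compacted: {ours_compacted:.1f}%")
```

On these figures the termlist is around 13% of our uncompacted database and
16% of the compacted one, against roughly 24% in the Gmane estimate.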
Best regards,
Arjen