[Xapian-discuss] indexing performance

Hongyan Ma hym at ucla.edu
Fri Oct 8 17:14:55 BST 2004


I'm having some trouble with my indexer, which builds on simpleindex.cc. The 
problem is that the indexing process becomes very slow once we have indexed 
2000k docs (though the indexer works quite well for the first 2000k docs). It 
took almost three weeks to index 8 million docs, but we need to index about 
20 million docs. I had to stop the indexer because of its performance.

I think my question is similar to Jim Lynch's of Sept 1, 2004 and RACHEL 
NAPPER's of Jan 14, 2004, in that it involves scaling performance. The 
difference is that I do have a good computer and I'm using Xapian 0.8.2. As I 
understand it, database update speed with quartz is already greatly improved 
in that version.

We work on a G5 Mac with about 2GB of RAM, running Xapian 0.8.2.

Source file: a big 1.5G ASCII file containing data for 20 million docs. 
Each doc has the same fixed structure, so the .txt file looks like this:
ID
Title
Abstract
ID
Title
Abstract

We only need the subject data in the titles and abstracts, keeping each 
title+abstract as one doc.
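
For reference, here is a minimal sketch of how a file in this layout could be 
read one record at a time, so that the whole 1.5G file never has to sit in 
memory. It assumes each ID, title and abstract occupies exactly one line. The 
names for_each_record and self_check are hypothetical and are not taken from 
simpleindex.cc, and the actual Xapian indexing call is left to the handler.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Read consecutive ID / Title / Abstract triples from `in` and pass each
// one to `handle(id, text)`, where text is "title abstract".
// Assumption: every field is exactly one line. A trailing incomplete
// record is silently dropped. Returns the number of complete records.
template <typename Handler>
std::size_t for_each_record(std::istream& in, Handler handle) {
    std::string id, title, abstract_text;
    std::size_t count = 0;
    while (std::getline(in, id) &&
           std::getline(in, title) &&
           std::getline(in, abstract_text)) {
        // The indexer would build one document from this combined text.
        handle(id, title + " " + abstract_text);
        ++count;
    }
    return count;
}

// Small self-check on an in-memory sample; the second record is incomplete.
void self_check() {
    std::istringstream sample("1\nTitle A\nAbstract A\n2\nTitle B\n");
    std::string seen;
    std::size_t n = for_each_record(
        sample,
        [&](const std::string&, const std::string& text) { seen = text; });
    assert(n == 1);
    assert(seen == "Title A Abstract A");
}
```

In a real indexer the handler would be the place to create and add one 
document per record; passing an std::ifstream opened on the source file works 
the same way as the string stream used in the self-check.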

Performance: It took 2 minutes to index 390k docs, 20 minutes for 1000k docs 
(about 10M of data), and 90 minutes for 2000k docs. After that, it became 
very slow. It took about 3 weeks to get the following database:
number of documents = 8330000
average document length = 10.8826
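
To put the slowdown in numbers, the figures above can be turned into 
cumulative indexing rates. This is only a rough sketch: it treats "about 3 
weeks" as exactly 21 days, which is an assumption, and docs_per_minute is a 
hypothetical helper name.

```cpp
#include <cassert>
#include <cstdio>

// Cumulative indexing rate: documents indexed so far / minutes elapsed.
double docs_per_minute(double docs, double minutes) {
    return docs / minutes;
}

// Print the cumulative rate at each checkpoint reported above.
void print_rates() {
    // Assumption: "about 3 weeks" is taken as exactly 21 days.
    const double three_weeks_in_minutes = 21.0 * 24.0 * 60.0;

    std::printf("390k docs:  %.0f docs/min\n", docs_per_minute(390000.0, 2.0));
    std::printf("1000k docs: %.0f docs/min\n", docs_per_minute(1000000.0, 20.0));
    std::printf("2000k docs: %.0f docs/min\n", docs_per_minute(2000000.0, 90.0));
    std::printf("8330k docs: %.0f docs/min\n",
                docs_per_minute(8330000.0, three_weeks_in_minutes));
}
```

Under that assumption the cumulative rate falls from 195000 docs per minute 
at the first checkpoint to roughly 275 docs per minute over the whole run.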

I noticed that when the indexing process became very slow, CPU use was only 
0%-1%, while memory use rose to VSZ 244M, RSS 180M. Considering we have 2G of 
RAM, I wonder whether there is a way to make fuller use of our machine to get 
better indexing performance.

Question: 
How can I speed up our indexer? Did I do something wrong with my indexer?

BTW, I set the following environment variables:
XAPIAN_FLUSH_THRESHOLD_LENGTH=5000000
XAPIAN_FLUSH_THRESHOLD=10000

Many many thanks.

Hongyan Ma
