[Xapian-discuss] indexing performance

Sat Oct 9 03:23:00 BST 2004

On Fri, Oct 08, 2004 at 04:14:55PM +0000, Hongyan Ma wrote:
> I think my question is similar to Jim Lynch's on Sept 1, 2004 and RACHEL 
> NAPPER's on Jan 14, 2004, in that it involves scaling performance. The 
> difference is that I do have a good computer and I'm using Xapian 0.8.2 - 
> according to my understanding, database update speed with quartz is already 
> greatly improved.

It's improved, but there's still plenty of scope for further improvement.

> Source file: it's a big 1.5G ASCII file, containing data of 20 million docs.

So each document averages only about 80 bytes?

> I noticed that when the indexing process became very slow, CPU use was
> only 0%-1%, but memory use mounted to VSZ 244M, RSS 180M. Considering
> we have 2G RAM, 

The machine is I/O bound at this point, which is why the CPU is almost
completely idle.  Although the indexer process is only explicitly using
244M, in fact the OS is probably using most of the rest of the 2G to
buffer recently accessed disk blocks from the Xapian database Btrees.

> Performance: It took 2 minutes to index 390k docs, 20 minutes for
> 1000k docs (about 10M of data), and 90 minutes for 2000k docs. But
> after that, it's very slow. It took about 3 weeks to get the following
> database: number of documents = 8330000 average document length =
> 10.8826
> 
> I wonder whether we have a way to utilize our machine more to get
> better performance with indexing.

Can you just build 10 databases with 2 million documents in each?  That
should take about 15 hours, and you can search over all of them at once
as if they were all one database (see Xapian::Database::add_database()).
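
As a rough sketch of the split suggested above: the 15 hour figure is just
ten batches at the 90 minutes quoted for 2000k documents.  The search part
assumes the Python bindings of a recent Xapian release are installed (the
import is deferred so the arithmetic runs without them).  The helper names
and the database directory names are made up for illustration.

```python
def plan_batches(total_docs, batch_size):
    """Return (start, end) document ranges, end exclusive, one per database."""
    return [(start, min(start + batch_size, total_docs))
            for start in range(0, total_docs, batch_size)]


def estimated_hours(n_batches, minutes_per_batch):
    """Total build time if every batch takes about the same time."""
    return n_batches * minutes_per_batch / 60.0


def open_combined(paths):
    """Open several databases as one searchable database.

    Assumes the xapian Python bindings; imported here so that the
    planning helpers above work without them.
    """
    import xapian
    combined = xapian.Database()
    for path in paths:
        combined.add_database(xapian.Database(path))
    return combined


batches = plan_batches(20_000_000, 2_000_000)
hours = estimated_hours(len(batches), 90)  # 90 minutes per 2000k docs, as quoted
paths = ["db%02d" % i for i in range(len(batches))]  # hypothetical directory names
```

Once the separate databases exist, open_combined(paths) would return a
single object that can be handed to an Enquire in the usual way.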

There's currently no way to merge databases at the quartz level but it
could be done quite efficiently I think.  You need to adjust the keys
for the termlist, record, values, and position tables, and adjust the
keys and tags for the postlist table, but both of these operations can
be done in a serial fashion without needing much to be buffered if
you simply renumber the documents for each database to come after
those in the preceding database.
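
The renumbering idea can be illustrated without touching the quartz format
at all.  What follows is a toy model, not quartz code: each "database" is a
dict holding a document count and postlists as sorted (docid, wdf) pairs,
and every name in it is invented for the example.  Because each database's
docids are shifted past those of the one before, a term's postings can
simply be appended in order and never need re-sorting.

```python
def merge_postlists(databases):
    """Merge toy databases by renumbering docids serially.

    Each database is {"doccount": int, "postlists": {term: [(docid, wdf), ...]}}
    with docids running from 1 to doccount (an assumption of this toy model).
    """
    merged = {}
    offset = 0
    for db in databases:
        for term, postings in db["postlists"].items():
            out = merged.setdefault(term, [])
            # Shifted docids are all greater than anything already in `out`,
            # so appending keeps the merged list sorted.
            out.extend((docid + offset, wdf) for docid, wdf in postings)
        offset += db["doccount"]
    return merged, offset


db_a = {"doccount": 3, "postlists": {"cat": [(1, 2), (3, 1)], "dog": [(2, 1)]}}
db_b = {"doccount": 2, "postlists": {"cat": [(2, 4)], "emu": [(1, 1)]}}
merged, total = merge_postlists([db_a, db_b])
```

Only the running offset is carried from one database to the next, which is
why little needs to be buffered.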

> BTW, I set the following env parameters:
> XAPIAN_FLUSH_THRESHOLD_LENGTH=5000000 XAPIAN_FLUSH_THRESHOLD=10000

XAPIAN_FLUSH_THRESHOLD_LENGTH doesn't do anything (it was in the CVS
version at one point, but removed prior to release).

XAPIAN_FLUSH_THRESHOLD defaults to 10000 anyway.  I'd suggest trying
a larger value for that - 50000 works well for the gmane box which
has 3G of RAM.  Your documents are tiny, so you could consider going
higher still.
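
One way to try different thresholds is to set the variable only for the
indexer's own process.  A minimal sketch, where the helper names are
invented and the command passed in stands for whatever indexing program is
actually being run:

```python
import os
import subprocess


def indexer_env(flush_threshold):
    """Copy the current environment, overriding XAPIAN_FLUSH_THRESHOLD."""
    env = dict(os.environ)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(flush_threshold)
    return env


def run_indexer(command, flush_threshold):
    """Run `command` (a list of arguments) with the given flush threshold."""
    return subprocess.run(command, env=indexer_env(flush_threshold), check=True)


# Example call with a placeholder command (not run here):
# run_indexer(["./my_indexer", "source.txt"], 50000)
```

Timing the same batch at a few different values should show where extra
buffering stops helping on a given machine.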

Cheers,
    Olly