[Xapian-discuss] Improving indexing speed

Richard Boulton richard at lemurconsulting.com
Thu Jun 26 20:33:00 BST 2008


Robert Kaye wrote:
> Hi!
> 
> After more work I've managed to get Xapian to work better all around  
> than our previous text search engine. I've been able to tweak, or work  
> around the idiosyncrasies of our data/setup and am getting results I'm  
> quite happy with. Big thumbs up to the Xapian dev team!
> 
> I often times get rewarded with good chocolate from various corners of  
> the world. Do you folks like good chocolate? I can share!

I probably eat too much chocolate already, but thanks for the thought!

> Onward: However, indexing speed is a bit of a problem for me;  
> smaller indexes build faster than the previous system, large indexes  
> take about 2-3 times as long.
> 
> I noticed disk access is very spikey -- every 3-5 seconds utilization  
> goes to 100%. Then there are long periods of 100% disk utilization. My  
> CPU is never very busy -- at most I find a 50% - 60% load.  And the  
> indexing process only uses about 5% of available RAM. Is there any way  
> I can instruct Xapian to use more resources to speed up indexing?

Yes - you can control how many document changes Xapian buffers in memory 
during an indexing session using the XAPIAN_FLUSH_THRESHOLD environment 
variable.  The 
default is to buffer changes to 10000 documents in memory, and then 
apply them to disk.  This is probably a little low for modern systems 
(unless the documents are very large).  Too low a setting will result in 
slow indexing, due to having to do lots of extra IO.  Too high a setting 
will be even slower, due to the indexing process getting into swap.  The 
ideal is probably to find a value which results in around half of your 
memory being used by the indexing process (leaving the other half of the 
memory available for the system to cache disk pages).

If you're currently only seeing around 5% of RAM used, I'd try setting 
XAPIAN_FLUSH_THRESHOLD=100000 - hopefully that will result in about 50% 
being used.
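
As a rough sketch of the arithmetic behind that suggestion, and of one way 
to pass the setting to an indexer: the Python below assumes memory use 
grows roughly linearly with the threshold (a simplification, so treat the 
result as a starting point to measure from).  The function names and the 
indexer command are hypothetical; only XAPIAN_FLUSH_THRESHOLD and the 
10000 default come from the discussion above.

```python
import os
import subprocess

DEFAULT_FLUSH_THRESHOLD = 10000  # default quoted in this thread


def scaled_flush_threshold(current_threshold, observed_ram_pct, target_ram_pct=50):
    """Scale the threshold towards a target RAM percentage.

    Assumes memory use is roughly proportional to the threshold,
    which is only an approximation.
    """
    if observed_ram_pct <= 0:
        raise ValueError("observed_ram_pct must be positive")
    return current_threshold * target_ram_pct // observed_ram_pct


def indexing_env(flush_threshold, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set."""
    env = dict(os.environ if base_env is None else base_env)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(flush_threshold)
    return env


def run_indexer(indexer_cmd, flush_threshold):
    """Run an indexer command (hypothetical) with the threshold applied."""
    return subprocess.run(indexer_cmd, env=indexing_env(flush_threshold), check=True)
```

With 5% of RAM observed at the default, 
scaled_flush_threshold(DEFAULT_FLUSH_THRESHOLD, 5) gives 100000, and 
run_indexer(["./my_indexer"], 100000) would start a (made-up) indexer 
program with that value in its environment.  It is worth checking actual 
memory use after the change rather than trusting the estimate.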

Ideally, this would tune itself automatically, but we've not had time to 
get around to that yet.  There are also lots of other things we could 
work on to improve indexing speed, which we've not got around to either.

Another approach, if your index is large, is to build several small 
indexes, and then merge them together with "xapian-compact".  (Probably 
with the "-m" option to do multipass merging, if you end up with _lots_ 
of small indexes.)  This method is a bit clunky, but can build large 
indexes much faster than doing it in one go.  At some point, we'll 
probably merge xapian-compact into the main API, but for now it's only 
available as a standalone executable.
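
A minimal sketch of driving that merge step from a script, assuming the 
small indexes have already been built in separate directories and that 
"xapian-compact" is on the PATH.  The helper names and directory names are 
hypothetical; the command layout (source databases followed by the 
destination, with "-m" for multipass) is the only part taken from the 
tool itself.

```python
import subprocess


def compact_command(source_dbs, dest_db, multipass=False):
    """Build the xapian-compact command line for merging several indexes."""
    sources = list(source_dbs)
    if not sources:
        raise ValueError("need at least one source database")
    cmd = ["xapian-compact"]
    if multipass:
        cmd.append("-m")  # multipass merging, useful with many small indexes
    cmd.extend(sources)   # source databases come first
    cmd.append(dest_db)   # destination database is the last argument
    return cmd


def merge_indexes(source_dbs, dest_db, multipass=False):
    """Run xapian-compact; raises CalledProcessError if the merge fails."""
    subprocess.run(compact_command(source_dbs, dest_db, multipass), check=True)
```

For example, merge_indexes(["part1", "part2", "part3"], "merged", 
multipass=True) would run "xapian-compact -m part1 part2 part3 merged".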

> My
> index could also be built on a RAM disk -- I suspect that would help,  
> but I'm curious as to what the best practices are...

It might well do; if you experiment with this, I'd be interested to know 
how the speed compares.

-- 
Richard


