[Xapian-discuss] Indexing and commiting

Mon Apr 16 20:40:21 BST 2007

Hi,

On Mon, Apr 16, 2007 at 04:19:43PM +0200, Andreas Marienborg wrote:
> Right now I fetch items from the database, add them to the index,  
> with terms etc,  and commit every 1000 documents.

Apart from creating the index in Xapian, you should also keep in mind how
fast you can get your data out of the database.

The largest SQL database I use holds 2.8 million bibliographic entries
(only the OPAC of the University Library of Cologne, without institutes
etc.). To actually build the index I use two tables in the database: one
with the terms to index and another with the corresponding serialized
data structures consisting of author/title/year etc.

I use MySQL 4.1.x. Unfortunately, MySQL gets quite slow when performing the
select over these two tables, especially after the first million datasets
have been fetched.

I benchmarked the performance by measuring the time to index each batch of
1000 datasets. From the 1st to the 200,000th set, the time per 1000 sets grew
from 0.8 seconds to around 3.6 seconds. It kept growing after that, until
after 2 million sets it was around 59 seconds. I first suspected Xapian and
fine-tuned XAPIAN_FLUSH_THRESHOLD, but it soon became clear that MySQL was
responsible.
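
A measurement like this can be set up so that it shows directly which side is
slow. The following is only a minimal sketch, not the script behind the
numbers above: it times the wait for each record separately from the time
spent indexing it and reports both per batch. The names time_batches and
index_one are made up for illustration; index_one stands for whatever adds
one record to the Xapian index.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_batches(
    records: Iterable[object],
    index_one: Callable[[object], None],
    batch_size: int = 1000,
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[Tuple[int, float, float]]:
    """Yield (records_done, fetch_seconds, index_seconds) for each batch.

    fetch_seconds is the time spent waiting for the data source (for
    example a database cursor); index_seconds is the time spent inside
    index_one (hypothetical callback wrapping the indexer).
    """
    fetch = index = 0.0
    done = 0
    it = iter(records)
    while True:
        t0 = clock()
        try:
            rec = next(it)      # waiting for the data source
        except StopIteration:
            break
        t1 = clock()
        index_one(rec)          # handing the record to the indexer
        t2 = clock()
        fetch += t1 - t0
        index += t2 - t1
        done += 1
        if done % batch_size == 0:
            yield done, fetch, index
            fetch = index = 0.0
    if done % batch_size:
        # Report the final, partially filled batch as well.
        yield done, fetch, index
```

If fetch_seconds grows from batch to batch while index_seconds stays roughly
flat, the data source rather than the indexer is the bottleneck.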

As an alternative (in my setup), I indexed from two flat files containing
the same data as the MySQL database. Using flat files was quite a
performance boost: every 1000 sets took around 0.8 to 2.0 seconds to index,
across the whole 2.8 million bibliographic records.
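
Reading two such files in lockstep might look like the sketch below. The file
layout is purely an assumption for illustration, since the actual format is
not described here: one record per line, both files in the same order, and
each line starting with a record id followed by a tab. The name read_records
is made up as well.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_records(
    terms_file: TextIO, data_file: TextIO
) -> Iterator[Tuple[str, List[str], str]]:
    """Yield (record_id, terms, serialized_data) from two parallel files.

    Assumed layout (hypothetical): "<id>\t<term> <term> ..." in the terms
    file and "<id>\t<serialized data>" in the data file, in the same order.
    zip() stops at the shorter file, so a length mismatch is not detected.
    """
    pairs = zip(terms_file, data_file)
    for line_no, (terms_line, data_line) in enumerate(pairs, 1):
        term_id, _, terms = terms_line.rstrip("\n").partition("\t")
        data_id, _, data = data_line.rstrip("\n").partition("\t")
        if term_id != data_id:
            raise ValueError(
                f"line {line_no}: id mismatch {term_id!r} != {data_id!r}"
            )
        yield term_id, terms.split(), data


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two flat files.
    terms = io.StringIO("1\tgoethe faust 1808\n2\tkafka prozess 1925\n")
    data = io.StringIO("1\tGoethe/Faust/1808\n2\tKafka/Der Prozess/1925\n")
    for record in read_records(terms, data):
        print(record)
```

Both files are read sequentially exactly once, so the cost per record does
not depend on how many records have already been read.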

Just my 0.02 EUR.

Regards,

Oliver

-- 
!- Oliver Flimm - Cologne/Germany | flimm at sigtrap.de | http://www.sigtrap.de/ -!
!       The Ten Commandments have 279 words, the American Declaration of       !
!    Independence has 300 words. The EU regulation on the import of caramel    !
!--------------------------  candies has 25911 words  -------------------------!