[Xapian-discuss] Strange behavior with the Python bindings and threads

jarrod roberson jarrod at vertigrated.com
Sun Jul 16 01:23:49 BST 2006


I have a program that walks the file system indexing files.
There are three steps:

1. open the file and read the info I need to index.
2. create a xapian document, populate it and add or replace as needed
3. update the file with the docid and write it back out to disk.

Doing this serially averages 10ms per file on my test server. At that
rate it is going to take me ~9 days to index the 75.5 million files!
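
The estimate can be checked with a few lines of arithmetic. This only uses the two figures quoted above (10ms per file, 75.5 million files) and assumes the average holds for the whole run:

```python
# Back-of-the-envelope check of the serial estimate.
files = 75_500_000          # files to index
seconds_per_file = 0.010    # 10 ms average per file, serial
seconds_per_day = 24 * 60 * 60

serial_days = files * seconds_per_file / seconds_per_day
print(f"serial run: about {serial_days:.1f} days")
```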

the test server is a QUAD processor Linux machine with 6GB of RAM and a
RAID5 disk with 5 high speed disks, filesystem is Reiser.

So I decided to try threading the process, since it is COMPLETELY IO bound.

1. the first thread walks the files and inserts them into a queue ( using
the python queue.Queue() )
2. a second thread "gets" from the first queue, creates the document,
adds/replaces the doc in the index, updates the docid in the file and puts
it in another queue for closure.
3. a third thread "gets" from the second ( the already indexed files queue )
and then closes the file ( which writes the updates to the disk )
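
The three stages above can be sketched roughly as follows. This is a minimal illustration, not the actual program: the `_DONE` sentinel used for shutdown, the function names, and the `index_one` callable (a stand-in for the Xapian add/replace step) are all assumptions made for the example, and the "close" stage only records the item instead of writing a file.

```python
import queue
import threading

_DONE = object()  # hypothetical end-of-stream marker; not from the original program

def walker(paths, to_index):
    """Stage 1: put every path on the first queue."""
    for path in paths:
        to_index.put(path)
    to_index.put(_DONE)

def indexer(to_index, to_close, index_one):
    """Stage 2: index each path, then hand it on for closing."""
    while True:
        path = to_index.get()
        if path is _DONE:
            to_close.put(_DONE)  # propagate shutdown to stage 3
            return
        docid = index_one(path)  # stand-in for the add/replace call
        to_close.put((path, docid))

def closer(to_close, closed):
    """Stage 3: 'close' each indexed file (here it is only recorded)."""
    while True:
        item = to_close.get()
        if item is _DONE:
            return
        closed.append(item)

def run_pipeline(paths, index_one):
    to_index, to_close, closed = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=walker, args=(paths, to_index)),
        threading.Thread(target=indexer, args=(to_index, to_close, index_one)),
        threading.Thread(target=closer, args=(to_close, closed)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return closed

# Usage with fake docids in place of a real index.
fake_docids = {"a.txt": 1, "b.txt": 2, "c.txt": 3}
result = run_pipeline(list(fake_docids), fake_docids.get)
print(result)
```

Because each stage is a single thread reading a FIFO queue, items come out in the order they went in.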

On my 17" Powerbook G4 the threaded version works great: it queues up the
files, indexes them and closes them in a completely async pattern. Each
thread reports batches of hits when it runs, and it sped up the processing
by about 33%.

When I moved this code to my test server, everything runs SERIALLY.
I coded all the put() and get() calls with a 3 second timeout, and it
basically puts the file in the first queue and times out, the second thread
indexes it and times out, the third thread closes it and times out, and only
then does the next file run.
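
For reference, a get() with a timeout in the standard library looks roughly like this. The helper name `get_or_none` is made up for the example; only the 3 second default comes from the description above.

```python
import queue

def get_or_none(q, timeout=3.0):
    """Return the next item, or None if nothing arrives within `timeout` seconds."""
    try:
        return q.get(timeout=timeout)
    except queue.Empty:
        # Nothing was queued in time; the caller decides whether to retry.
        return None

q = queue.Queue()
q.put("some/file")
first = get_or_none(q, timeout=0.05)   # item is already there, returns at once
second = get_or_none(q, timeout=0.05)  # queue is empty, waits and gives up
print(first, second)
```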

The only thing we can think of is the SWIG Python bindings aren't releasing
the GIL correctly or something?
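
One rough way to test that guess is to time the same call serially and from several threads at once. The sketch below uses time.sleep() as a stand-in, since sleeping is known to release the GIL; swapping in the real indexing call is left as an assumption, and whatever is swapped in must be safe to call from several threads at the same time. The helper names are made up for the example.

```python
import threading
import time

def timed_serial(fn, n):
    """Wall time for n calls of fn, one after another."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return time.perf_counter() - start

def timed_threaded(fn, n):
    """Wall time for n calls of fn, each in its own thread."""
    threads = [threading.Thread(target=fn) for _ in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

def blocking_call():
    time.sleep(0.1)  # stand-in for a call that releases the GIL while it waits

serial = timed_serial(blocking_call, 4)
threaded = timed_threaded(blocking_call, 4)
print(f"serial {serial:.2f}s, threaded {threaded:.2f}s")
```

If the threaded time is close to the serial time for the real call, the threads are not overlapping, which would be consistent with the GIL being held. It would not prove it, though; a lock inside the library or a saturated disk would look the same.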

Any ideas? A 30% speed up on ~9 days is significant!

