[Xapian-discuss] Python bindings not freeing memory during indexing

EJ Johnson ej.johnson at rackspace.com
Sat Jul 7 18:05:08 BST 2007


Hi list,

I'm new to Xapian (great stuff!!) but am running into a problem that I haven't seen explicitly mentioned on the list before.

I'm using the Python bindings for Xapian 1.0.1 on Ubuntu Dapper 6.06 LTS using xapian.org as my repository.  My hardware is an HP DL385 G2, two dual-core AMD Opterons with 8G RAM.

I'm trying to index a good chunk of documents and have a Python indexer iterating through the docs and adding them to the DB.  I get up to about 45,000 docs and it croaks.  Sometimes it throws a malloc error, and the last time it just segfaulted.  Essentially, the indexer process uses more and more RAM until it dies.  It only makes it up to about 3G of RAM before dying, and it never hits swap.

I tried tweaking XAPIAN_FLUSH_THRESHOLD, but that doesn't seem to matter.  I've tried using transactions and even called flush() after every 1000 docs, to no avail.  I've even tried destroying my DB handle (setting it to None) and re-opening the DB every 1000 docs to see if that would free up memory.
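
To make the flush() attempt concrete, here is a minimal sketch of a loop that flushes every 1000 docs. It is an illustration only, not the actual ticketloader.py code: the function name and the flush_every parameter are hypothetical, and db is assumed to be any object with add_document() and flush() methods, as xapian.WritableDatabase has.

```python
def index_with_periodic_flush(db, docs, flush_every=1000):
    """Add each document to db, calling db.flush() every flush_every docs.

    db is assumed to behave like xapian.WritableDatabase, i.e. to provide
    add_document() and flush(). Returns the number of documents added.
    """
    count = 0
    for doc in docs:
        db.add_document(doc)
        count += 1
        if count % flush_every == 0:
            db.flush()  # write the pending batch out to disk
    db.flush()  # flush whatever is left after the last full batch
    return count
```

With the real bindings, db would be something like xapian.WritableDatabase("xapdb", xapian.DB_CREATE_OR_OPEN) and docs an iterable of xapian.Document objects.  As noted above, flushing this way did not stop the memory growth.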

I've finally found a work-around: a wrapper script calls my indexer as a separate process for each batch of 1000 docs.  That seems to have solved my memory consumption problem, and it appears that I'll finally be able to index my entire data set.
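
A minimal sketch of that kind of wrapper, under stated assumptions: the indexer is taken to accept a start and end document offset as its last two command-line arguments, which is a hypothetical interface, and the function names are made up for illustration.  Each batch runs in a fresh process, so whatever memory the indexer accumulates is returned to the OS when that process exits.

```python
import subprocess
import sys


def batch_ranges(total_docs, batch_size=1000):
    """Yield half-open (start, end) offsets that cover total_docs."""
    for start in range(0, total_docs, batch_size):
        yield start, min(start + batch_size, total_docs)


def run_in_batches(command, total_docs, batch_size=1000):
    """Run `command START END` as a separate process for each batch.

    command is a list such as [sys.executable, "ticketloader.py"].
    Raises subprocess.CalledProcessError if any batch exits non-zero.
    Returns the number of batches run.
    """
    batches = 0
    for start, end in batch_ranges(total_docs, batch_size):
        subprocess.run(command + [str(start), str(end)], check=True)
        batches += 1
    return batches
```

For example, run_in_batches([sys.executable, "ticketloader.py"], 45000) would launch the indexer 45 times, once per 1000 docs.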

Here's a snippet of a log that was tracking my DB/memory usage.  The snippet was essentially the same for every failed attempt to work around the memory consumption.  In each block, the first line is the output of "du -sh" on my DB directory, the second line is from "top" (in the first block it shows 1.3g of RAM for the process, 16.2% of total RAM, after 16 minutes of CPU time), and the remaining lines are from delve.

==================================================================
=> Space on disk: 403M  xapdb
=> 26611 ej.johns  17   0 1338m 1.3g 3728 D   97 16.2  16:38.51 ticketloader.py
=> Number of documents: 40000
=> Highest doc number: 40000
=> Average doc length: 630.6937
==================================================================
=> Space on disk: 407M  xapdb
=> 26611 ej.johns  21   0 1347m 1.3g 3728 R  101 16.3  16:45.63 ticketloader.py
=> Number of documents: 41000
=> Highest doc number: 41000
=> Average doc length: 630.53604878
==================================================================
=> Space on disk: 409M  xapdb
=> 26611 ej.johns  16   0 1357m 1.3g 3728 S   28 16.5  16:52.09 ticketloader.py
=> Number of documents: 41000
=> Highest doc number: 41000
=> Average doc length: 630.53604878
==================================================================

This next snippet shows the same output when I use my wrapper script to call the indexer in a separate process.

==================================================================
=> Space on disk: 409M  xapdb
=> 29376 ej.johns  15   0 29000  23m 3676 S   36  0.3   0:03.51 ticketloader.py    
=> Number of documents: 40203
=> Highest doc number: 40203
=> Average doc length: 698.684103176
==================================================================
=> Space on disk: 411M  xapdb
=> 29389 ej.johns  16   0 23076  17m 3676 S   22  0.2   0:01.27 ticketloader.py    
=> Number of documents: 40625
=> Highest doc number: 40625
=> Average doc length: 697.493636923
==================================================================
=> Space on disk: 413M  xapdb
=> 29405 ej.johns  16   0 19304  13m 3676 S   28  0.2   0:00.51 ticketloader.py    
=> Number of documents: 40953
=> Highest doc number: 40953
=> Average doc length: 697.187190194
==================================================================
=> Space on disk: 414M  xapdb
=> 29421 ej.johns  16   0 21328  15m 3672 S   18  0.2   0:00.73 ticketloader.py    
=> Number of documents: 41177
=> Highest doc number: 41177
=> Average doc length: 696.544842995
==================================================================

So, you can see that the number of docs, disk space, doc length, etc. are basically the same.  The only difference is the amount of memory consumed during a single long run versus individual runs in separate processes.

My next step is to recompile Xapian and the Python bindings from source (1.0.2 is out now) and see if that helps.  Any other thoughts or suggestions are greatly appreciated!

Thanks in advance,
Eric

