[Xapian-devel] Problem in Indexing

Jean-Francois Dockes jf at dockes.org
Tue May 10 11:09:24 BST 2011


Parth Gupta writes:
 >    I checked the memory usage. The documents which I wished to index
 >    were fairly large (some were 1MB, of type text/plain), so for them
 >    the threshold was too large. I tried to set the
 >    XAPIAN_FLUSH_THRESHOLD env. variable, but I think it wasn't set
 >    properly, so I just added a commit statement in omindex.cc after
 >    every 1000 docs indexed.
 >    When that threshold is 10K, memory is exhausted because it keeps
 >    adding documents until it reaches 10K documents, which hung my
 >    system because it has only 1GB of RAM. Changing that threshold to
 >    1K or 1.5K worked: indexing became possible and smooth, with only
 >    the commit delays, and that helped to keep memory in bounds.
 >    For another collection, where the documents are quite small (at
 >    most 1KB, or standard HTML pages), that threshold worked fine, and
 >    there was even scope to increase it to 20K.
 >    Best,
 >    Parth.

For what it may be worth, Recoll uses an explicit flush after an adjustable
amount of data (cumulative document lengths), and this seems to work much
more smoothly than when it used the doc-count Xapian thresholds.
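
A minimal sketch of that kind of data-based flush, in Python. This is not
Recoll's code: the names ByteBudgetIndexer, CountingDb, flush_bytes and
text_length are made up for illustration. The only assumption about the
database is that it has add_document() and commit() methods, which is the
case for xapian.WritableDatabase in the Python bindings. CountingDb is a
stand-in so the policy can be tried without Xapian installed.

```python
class ByteBudgetIndexer:
    """Commit once the data added since the last commit reaches a byte budget."""

    def __init__(self, db, flush_bytes):
        # db: any object with add_document(doc) and commit() methods
        # (assumed interface; xapian.WritableDatabase provides both).
        # flush_bytes: hypothetical tunable, the budget between commits.
        self.db = db
        self.flush_bytes = flush_bytes
        self.pending_bytes = 0

    def add(self, doc, text_length):
        # text_length is the size of the document's text as measured by
        # the caller; it is what gets accumulated, not the document count.
        self.db.add_document(doc)
        self.pending_bytes += text_length
        if self.pending_bytes >= self.flush_bytes:
            self.flush()

    def flush(self):
        self.db.commit()
        self.pending_bytes = 0


class CountingDb:
    """Stand-in database that only counts calls (illustration only)."""

    def __init__(self):
        self.added = 0
        self.commits = 0

    def add_document(self, doc):
        self.added += 1

    def commit(self):
        self.commits += 1
```

With a real Xapian database the built-in doc-count threshold would
presumably still apply as well, so it would have to be set high enough
(through XAPIAN_FLUSH_THRESHOLD) for the explicit commits to be the ones
that matter.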

Document sizes vary a lot and averages have no meaning because the doc size
distribution during an indexing pass is arbitrary (e.g. let's index a few
bibles first, then unwind with a million 1k web pages).
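
To put rough numbers on this, here is a small Python sketch that computes
how much document data piles up between two doc-count flushes. The function
name and the workload are hypothetical, and it only measures raw document
bytes, not what Xapian actually holds in memory.

```python
def peak_pending_bytes(sizes, doc_threshold):
    """Largest amount of document data buffered between two doc-count flushes."""
    peak = 0
    pending = 0
    count = 0
    for size in sizes:
        pending += size
        count += 1
        peak = max(peak, pending)
        if count >= doc_threshold:
            # Doc-count flush: buffered data is written out, counters reset.
            pending = 0
            count = 0
    return peak


KB = 1024
MB = 1024 * KB

# Hypothetical pass: 5 documents of 4 MB each, then 100,000 pages of 1 KB.
mixed_pass = [4 * MB] * 5 + [1 * KB] * 100_000
# Same number of small pages, without the large documents.
small_pass = [1 * KB] * 100_000
```

Calling peak_pending_bytes(mixed_pass, 1000) and
peak_pending_bytes(small_pass, 1000) gives very different peaks for the same
threshold, which is the reason a single doc count is hard to tune.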

There may be internal reasons why the doc-count thresholds make sense for
Xapian, and a data count makes none, but in actual usage, the latter works
better for me.

jf


