I checked the memory usage. The documents which I wished to index were fairly large (some were 1MB in size, of type text/plain), so for them the threshold was too large. I tried to set the XAPIAN_FLUSH_THRESHOLD env. variable, but I think it wasn't set properly, so I just added a commit statement in omindex.cc which runs after every 1000 docs are indexed.<br>
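<br>Roughly, the periodic commit looks like the sketch below. This is not the actual change to omindex.cc. FakeDatabase is a hypothetical stand-in that only mirrors the add_document() and commit() method names of Xapian::WritableDatabase so the sketch compiles without Xapian, and index_in_batches is an illustrative name as well.<br>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for Xapian::WritableDatabase (assumption: only the
// add_document() and commit() method names are mirrored; no real indexing).
struct FakeDatabase {
    std::size_t pending = 0;       // documents buffered since the last commit
    std::size_t committed = 0;     // documents written out so far
    std::size_t commits = 0;       // number of commit() calls made
    std::size_t peak_pending = 0;  // largest number of documents held at once

    void add_document(const std::string&) {
        ++pending;
        if (pending > peak_pending) peak_pending = pending;
    }

    void commit() {
        committed += pending;
        pending = 0;
        ++commits;
    }
};

// Add every document, committing after each batch_size additions so that
// the number of buffered documents never exceeds batch_size.
// A batch_size of 0 means "commit only once, at the end".
template <typename Db>
void index_in_batches(Db& db, const std::vector<std::string>& docs,
                      std::size_t batch_size) {
    std::size_t since_commit = 0;
    for (const auto& doc : docs) {
        db.add_document(doc);
        if (++since_commit == batch_size) {
            db.commit();
            since_commit = 0;
        }
    }
    // Commit whatever is left over from the last, partial batch.
    if (since_commit > 0) db.commit();
}
```

<br>With a batch size of 1000, at most 1000 documents are buffered between commits, which is what keeps memory bounded at the cost of a pause on each commit.<br>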
<br>When the threshold is 10K, memory gets overloaded because it keeps adding documents until it reaches 10K documents, which hung my system because it has only 1GB of RAM. Changing the threshold to 1K or 1.5K worked: indexing became possible and smooth, with only the commit delays, and that helped to keep memory within bounds.<br>
<br>For another collection where the documents are quite small (at most 1KB, or standard HTML pages), the 10K threshold worked fine, and there was even scope to increase it to 20K.<br><br>Best,<br>Parth.<br>
<br><div class="gmail_quote">On Tue, May 10, 2011 at 5:51 AM, Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com">olly@survex.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im">On Wed, May 04, 2011 at 08:33:46PM +0530, Parth Gupta wrote:<br>
> Types of Files: text files with .txt extension<br>
> Size of the collection: 11400 documents [1.6 GB]<br>
><br>
> This takes a lot of time to index; it has been indexing for the last 20 hrs or so. I am<br>
> using omindex.<br>
><br>
> I notice that around 2900 docs are indexed very smoothly and suddenly after<br>
> that indexing becomes very sluggish.<br>
><br>
> I have tried a couple of tricks, like replacing the index_text() call with<br>
> index_text_without_positions(). I also tried setting<br>
> XAPIAN_FLUSH_THRESHOLD to 1500 documents from the default of 10000. The time mentioned<br>
> above is after these tricks.<br>
<br>
</div>You probably want to *raise* the threshold, not lower it. Bigger<br>
batches are more efficient, provided you have sufficient memory.<br>
For typical size documents, 10000 is fairly conservative on modern<br>
hardware - you should be able to index 11400 documents in a single<br>
batch I'd think.<br>
<br>
You've told Xapian to commit every 1500 document changes, so at 3000<br>
docs it will be merging postlist changes - that's why there's apparently<br>
a pause at that point. Once the changes are committed, it should go<br>
faster up to 4500 documents, then up to 6000, etc<br>
<br>
If you do need to index in several batches, you can build several<br>
databases, each smaller than your flush threshold. Then you can either<br>
just search these together, or merge them into a single database with<br>
xapian-compact.<br>
<br>
Cheers,<br>
<font color="#888888"> Olly<br>
</font></blockquote></div><br>