[Xapian-devel] Problem in Indexing

Parth Gupta parthg.88 at gmail.com
Tue May 10 10:27:15 BST 2011


I checked the memory usage. The documents which I wished to index were
fairly large (some were 1MB, of type text/plain), so for them the default
threshold was too high. I tried to set the XAPIAN_FLUSH_THRESHOLD env.
variable, but I think it wasn't set properly, so I just added a commit
statement in omindex.cc which runs after every 1000 docs are indexed.
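
Roughly, the change amounts to something like the sketch below. This is not
the actual omindex.cc patch: BatchCommitter and its members are names made up
for illustration, and the commit callback is an assumption. With a
Xapian::WritableDatabase called db, the callback would be something like
[&db] { db.commit(); }.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical helper (not part of Xapian or omindex): runs a commit
// callback after every `batch_size` indexed documents.
class BatchCommitter {
public:
    BatchCommitter(std::size_t batch_size, std::function<void()> commit)
        : batch_size_(batch_size), commit_(std::move(commit)) {}

    // Call once after each document has been added to the database.
    void document_indexed() {
        if (++pending_ >= batch_size_) flush();
    }

    // Commit whatever is still pending, e.g. at the end of the run.
    void flush() {
        if (pending_ == 0) return;
        commit_();
        pending_ = 0;
        ++commits_;
    }

    std::size_t commits() const { return commits_; }
    std::size_t pending() const { return pending_; }

private:
    std::size_t batch_size_;
    std::function<void()> commit_;
    std::size_t pending_ = 0;   // documents added since the last commit
    std::size_t commits_ = 0;   // number of commits issued so far
};
```

The indexing loop would call document_indexed() after each document and
flush() once at the end so the last partial batch is committed too.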

When the threshold is 10K, memory gets overloaded because it keeps adding
documents until it reaches 10K documents, which hung my system because it
has only 1GB RAM. Changing the threshold to 1K or 1.5K worked: indexing
became possible and smooth, with only the commit delays, and memory stayed
within bounds.

For another collection, where the documents are quite small (1KB at most, or
standard HTML pages), the 10K threshold worked fine, and there was even
scope to increase it to 20K.
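
As a back-of-the-envelope check on why the threshold matters so much more for
the large collection, here is a small sketch that works out how much raw text
passes through the indexer between two commits. The function names are made
up, and raw text size is only a crude proxy (an assumption on my part) for
what Xapian actually buffers in memory. With the collection quoted below
(about 1.6 GB in 11400 documents), a threshold of 10000 means roughly 1.4 GB
of raw text per batch, against about 210 MB at 1500.

```cpp
#include <cstdint>

// Average raw size of one document in the collection, in bytes.
// Returns 0 for an empty collection.
std::uint64_t average_doc_bytes(std::uint64_t collection_bytes,
                                std::uint64_t doc_count) {
    return doc_count == 0 ? 0 : collection_bytes / doc_count;
}

// Raw text indexed between two commits. ASSUMPTION: this is only a rough
// stand-in for memory use, not how Xapian accounts for buffered changes.
std::uint64_t raw_bytes_per_batch(std::uint64_t avg_doc_bytes,
                                  std::uint64_t flush_threshold) {
    return avg_doc_bytes * flush_threshold;
}
```

Comparing the result with the RAM available gives at least a hint of whether
a given threshold is worth trying on a given collection.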

Best,
Parth.

On Tue, May 10, 2011 at 5:51 AM, Olly Betts <olly at survex.com> wrote:

> On Wed, May 04, 2011 at 08:33:46PM +0530, Parth Gupta wrote:
> > Types of Files: text files with .txt extension
> > Size of the collection: 11400 documents [1.6 GB]
> >
> > This takes a lot of time to index and indexing for last 20 hrs or so.
> > I am using omindex.
> >
> > I notice that around 2900 docs are indexed very smoothly and suddenly
> > after that indexing becomes very sluggish.
> >
> > I have tried couple of tricks like replacing the index_text() call to
> > index_text_without_positions(). I also tried after setting the
> > XAPIAN_FLUSH_THRESHOLD to 1500 documents from 10000 default. Above
> > mentioned time is after these tricks.
>
> You probably want to *raise* the threshold, not lower it.  Bigger
> batches are more efficient, provided you have sufficient memory.
> For typical size documents, 10000 is fairly conservative on modern
> hardware - you should be able to index 11400 documents in a single
> batch I'd think.
>
> You've told Xapian to commit every 1500 document changes, so at 3000
> docs it will be merging postlist changes - that's why there's apparently
> a pause at that point.  Once the changes are committed, it should go
> faster up to 4500 documents, then up to 6000, etc.
>
> If you do need to index in several batches, you can build several
> databases, each smaller than your flush threshold.  Then you can either
> just search these together, or merge them into a single database with
> xapian-compact.
>
> Cheers,
>     Olly
>
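
To illustrate the several-databases suggestion quoted above, here is a
minimal sketch of how a collection could be split into batches that each
stay below the flush threshold. plan_batches is a hypothetical helper, not
part of Xapian; each batch would be indexed into its own database, and the
databases then merged with xapian-compact or searched together.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: split `total_docs` into the fewest evenly sized
// batches such that every batch is strictly smaller than `flush_threshold`.
// Returns an empty plan if there is nothing to index or no valid batch size.
std::vector<std::size_t> plan_batches(std::size_t total_docs,
                                      std::size_t flush_threshold) {
    std::vector<std::size_t> sizes;
    if (total_docs == 0 || flush_threshold < 2) return sizes;

    const std::size_t max_per_batch = flush_threshold - 1;
    // Ceiling division: fewest batches that can hold all documents.
    const std::size_t batches =
        (total_docs + max_per_batch - 1) / max_per_batch;
    const std::size_t base = total_docs / batches;
    const std::size_t extra = total_docs % batches;

    // The first `extra` batches take one additional document each.
    for (std::size_t i = 0; i < batches; ++i)
        sizes.push_back(base + (i < extra ? 1 : 0));
    return sizes;
}
```

For the collection in this thread, plan_batches(11400, 1500) would give the
per-database document counts to aim for.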