[Xapian-discuss] XAPIAN_FLUSH_THRESHOLD

Thu Jul 16 04:30:36 BST 2009

I see. Thanks.

I did some silly testing where I wrote a python script to generate text 
files with randomly generated words from randomly generated characters 
(so they were not real words).
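
For reference, a generator of this kind might look roughly like the sketch below. It is not the original script: the function names, word lengths, file names and counts are all assumptions chosen for illustration.

```python
import random
import string
from pathlib import Path


def random_word(rng, min_len=3, max_len=10):
    """Return a 'word' made of random lowercase letters (not a real word)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def write_random_files(out_dir, n_files=10, words_per_file=1000, seed=0):
    """Write n_files text files of random words into out_dir; return the paths."""
    rng = random.Random(seed)  # seeded so a test run can be repeated
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_files):
        words = [random_word(rng) for _ in range(words_per_file)]
        path = out / f"random_{i:05d}.txt"  # hypothetical naming scheme
        path.write_text(" ".join(words) + "\n", encoding="ascii")
        paths.append(path)
    return paths
```

With input like this almost every word is a term the index has not seen before, so the vocabulary keeps growing in a way that real office documents would not.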

I noticed that omindex consumed a lot of memory, but then figured this
is not a good test: there are no efficiencies to be found in organising
random text, so it was a silly idea.

But I did notice that no matter what XAPIAN_FLUSH_THRESHOLD was set to,
omindex's memory footprint grew larger and larger.

My questions are:

Is there stuff that XAPIAN_FLUSH_THRESHOLD doesn't "flush"?

If so, what are the typical workarounds? E.g. throw more memory at it,
index smaller amounts on each run but have more databases, etc.

Is there a rule of thumb relating how many documents I am indexing to
how much memory and how much disk space the index requires? (I'm
dealing with a typical office filled with lots of Microsoft and PDF
file types.)

Thanks Again,

Frank

Olly Betts wrote:
> On Thu, Jul 16, 2009 at 02:30:39AM +0930, Frank John Bruzzaniti wrote:
>   
>> Am I right in saying that for my setup I should be doing export 
>> XAPIAN_FLUSH_THRESHOLD=1000 because:
>>
>> 1000 documents * 2MB doc size = 2gig of memory required before a flush 
>> to disk?
>>     
>
> That's a bit simplistic, but probably a reasonable starting point.
>
> What is stored in memory are changes to the postlist and spelling tables
> - changes to other tables are written out (but not switched live).  For
> the postlist table, there are the terms which have changed, the docids
> and wdfs for those changes, and data structure overheads.  That's
> probably going to come out significantly smaller than the raw text size
> of the documents, so you can probably go higher than 1000.
>
> Cheers,
>     Olly
>
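
The back-of-envelope estimate quoted above can be written out as a small sketch. As the reply says, it is simplistic: it treats the raw text of every buffered document as held in memory, so it should be read as a rough upper bound rather than a prediction. The helper names and the decimal MB/GB units are assumptions made for this example; only the XAPIAN_FLUSH_THRESHOLD variable name comes from the thread.

```python
import os

MB = 1000 ** 2
GB = 1000 ** 3


def naive_flush_memory_bytes(flush_threshold, avg_doc_bytes):
    """Naive estimate: documents buffered before a flush times average size."""
    return flush_threshold * avg_doc_bytes


def threshold_for_budget(memory_budget_bytes, avg_doc_bytes):
    """Invert the naive estimate: largest threshold that fits the budget."""
    return max(1, memory_budget_bytes // avg_doc_bytes)


def omindex_env(flush_threshold):
    """Copy of the current environment with XAPIAN_FLUSH_THRESHOLD set."""
    env = dict(os.environ)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(flush_threshold)
    return env


# 1000 documents * 2 MB each, as in the question.
estimate = naive_flush_memory_bytes(1000, 2 * MB)
print(estimate / GB)  # 2.0
```

The dictionary returned by omindex_env() could be passed as the env argument of subprocess.run() when launching the indexer, which has the same effect as the export line in the question.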