[Xapian-discuss] XAPIAN_FLUSH_THRESHOLD
Frank John Bruzzaniti
frank.bruzzaniti at gmail.com
Thu Jul 16 04:30:36 BST 2009
I see. Thanks.
I did some silly testing where I wrote a Python script to generate text
files with randomly generated words from randomly generated characters
(so they were not real words).
But I noticed omindex consumes a lot of memory. I then figured it's not
a good test, because there are no efficiencies to be found in organising
random text, so it was a silly idea.
But I did notice that no matter what XAPIAN_FLUSH_THRESHOLD was
set to, omindex's memory footprint grew larger and larger.
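For anyone wanting to repeat this kind of test, here is a minimal sketch of
a random-text generator along the lines described above. It is not the
original script: the function names, file naming, word lengths and counts
are all assumptions chosen for illustration.

```python
import random
import string
from pathlib import Path


def random_word(rng, min_len=3, max_len=10):
    """Build one nonsense word from random lowercase letters."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def write_random_corpus(out_dir, n_files=10, words_per_file=1000, seed=0):
    """Write n_files text files of random words into out_dir.

    Hypothetical helper: the sizes and names are illustrative only.
    Returns the list of paths that were written.
    """
    rng = random.Random(seed)  # seeded so a test run can be repeated
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_files):
        words = [random_word(rng) for _ in range(words_per_file)]
        path = out / f"doc{i:05d}.txt"
        path.write_text(" ".join(words) + "\n", encoding="ascii")
        paths.append(path)
    return paths
```

Because nearly every generated word is a new term, a corpus like this
produces far more distinct terms than real office documents would, which is
the reason it is a poor stand-in for them.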
My questions are:
Is there stuff that XAPIAN_FLUSH_THRESHOLD doesn't "flush"?
If so, what are the typical workarounds? E.g. throw more memory at it,
index smaller amounts on each run but have more databases, etc.
Is there a rule of thumb regarding how many documents I am indexing
versus how much memory and how much disk space is required for the
index (I'm dealing with a typical office filled with lots of Microsoft
and PDF file types)?
Thanks Again,
Frank
Olly Betts wrote:
> On Thu, Jul 16, 2009 at 02:30:39AM +0930, Frank John Bruzzaniti wrote:
>
>> Am I right in saying that for my setup I should be doing export
>> XAPIAN_FLUSH_THRESHOLD=1000 because:
>>
>> 1000 documents * 2MB doc size = 2gig of memory required before a flush
>> to disk?
>>
>
> That's a bit simplistic, but probably a reasonable starting point.
>
> What is stored in memory are changes to the postlist and spelling tables
> - changes to other tables are written out (but not switched live). For
> the postlist table, there are the terms which have changed, the docids
> and wdfs for those changes, and data structure overheads. That's
> probably going to come out significantly smaller than the raw text size
> of the documents, so you can probably go higher than 1000.
>
> Cheers,
> Olly
>
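The back-of-envelope estimate from the quoted exchange can be written down
as a small sketch. The function name and the decimal MB/GB units are
assumptions for illustration. As the reply notes, the estimate is
simplistic: the buffered postlist and spelling changes are probably much
smaller than the raw text, so the result is a conservative starting point
rather than a limit.

```python
MB = 1000 ** 2  # decimal units, matching "1000 * 2MB = 2gig"
GB = 1000 ** 3


def estimate_flush_threshold(memory_budget_bytes, avg_doc_bytes):
    """Documents per flush if each buffered document cost its full raw size.

    Hypothetical helper: this deliberately overestimates memory per document.
    """
    if avg_doc_bytes <= 0:
        raise ValueError("avg_doc_bytes must be positive")
    return max(1, memory_budget_bytes // avg_doc_bytes)


threshold = estimate_flush_threshold(2 * GB, 2 * MB)
# Print a line that could be pasted into the shell before running omindex.
print(f"export XAPIAN_FLUSH_THRESHOLD={threshold}")  # threshold is 1000
```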