[Xapian-discuss] XAPIAN_FLUSH_THRESHOLD

Thu Jul 16 10:46:06 BST 2009

I tried the same omindex on a production system with about 50GB of data 
(MS Office and PDF files) and it was OK.
But if I am indexing 2GB worth of text files filled with made-up words 
built from random ASCII characters, 
I get the same problem.
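
For anyone wanting to reproduce this, a generator along the following lines would produce a similar corpus. It is only a minimal sketch, not the script actually used for the test: the function names, word lengths, words per line and file naming are all assumptions made for illustration.

```python
import random
import string
from pathlib import Path


def random_word(rng, min_len=3, max_len=10):
    """Return a made-up word of random lowercase ASCII letters."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def write_random_file(path, target_chars, rng, words_per_line=12):
    """Write lines of random words to path until at least target_chars are written."""
    written = 0
    with open(path, "w", encoding="ascii") as out:
        while written < target_chars:
            line = " ".join(random_word(rng) for _ in range(words_per_line)) + "\n"
            out.write(line)
            written += len(line)
    return written


def make_corpus(directory, n_files, chars_per_file, seed=0):
    """Create n_files text files of random words under directory (hypothetical layout)."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    root = Path(directory)
    root.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_files):
        path = root / f"random_{i:05d}.txt"
        write_random_file(path, chars_per_file, rng)
        paths.append(path)
    return paths
```

Calling make_corpus("corpus", 1000, 2_000_000) would write roughly 2GB of such files. Nearly every word is a new term, which is very different from real office documents.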

What kind of data are you trying to index?

Eric Voisard wrote:
> Hi Frank,
>
> Maybe it's not related, but what you see makes me think of the problem
> I had a couple of months ago: increasing memory usage finally causing
> failures with external converters.
>
> I opened a bug: http://trac.xapian.org/ticket/358
>
> Yours, Eric
>
>
> Frank John Bruzzaniti wrote:
>   
>> I see. Thanks.
>>
>> I did some silly testing where I wrote a Python script to generate text 
>> files with randomly generated words from randomly generated characters 
>> (so they were not real words).
>>
>> But I noticed omindex consumes a lot of memory. I then figured it's not 
>> a good test, because there are no efficiencies to be found in organising 
>> random text, so it was a silly idea.
>>
>> But I did notice that no matter what XAPIAN_FLUSH_THRESHOLD was 
>> set to, omindex's memory footprint grew larger and larger.
>>
>> My questions are:
>>
>> Is there stuff that XAPIAN_FLUSH_THRESHOLD doesn't "flush"?
>>
>> If so, what are the typical workarounds? E.g. throw more memory at it, 
>> index smaller amounts on each run but have more databases, etc.
>>
>> Is there a rule of thumb for how many documents I am indexing 
>> versus how much memory and how much disk space is required for the 
>> index? (I'm dealing with a typical office filled with lots of Microsoft 
>> Office and PDF files.)
>>
>> Thanks Again,
>>
>> Frank
>>
>> Olly Betts wrote:
>>     
>>> On Thu, Jul 16, 2009 at 02:30:39AM +0930, Frank John Bruzzaniti wrote:
>>>   
>>>       
>>>> Am I right in saying that for my setup I should be doing export 
>>>> XAPIAN_FLUSH_THRESHOLD=1000 because:
>>>>
>>>> 1000 documents * 2MB doc size = 2gig of memory required before a flush 
>>>> to disk?
>>>>     
>>>>         
>>> That's a bit simplistic, but probably a reasonable starting point.
>>>
>>> What is stored in memory are changes to the postlist and spelling tables
>>> - changes to other tables are written out (but not switched live).  For
>>> the postlist table, there are the terms which have changed, the docids
>>> and wdfs for those changes, and data structure overheads.  That's
>>> probably going to come out significantly smaller than the raw text size
>>> of the documents, so you can probably go higher than 1000.
>>>
>>> Cheers,
>>>     Olly
>>>   
>>>       
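
The arithmetic quoted above (1000 documents * 2MB = 2GB) can be written down as a small helper. This is a back-of-the-envelope sketch only. The function names and the in_memory_ratio parameter are assumptions introduced here, and the ratio is a placeholder to be replaced by something measured on real data. It just stands in for the point that the buffered postlist changes are probably smaller than the raw text.

```python
import os


def flush_threshold_for(memory_budget_bytes, avg_doc_bytes, in_memory_ratio=1.0):
    """Estimate how many documents to batch before a flush.

    in_memory_ratio is an assumed figure: buffered bytes per byte of raw
    document text. 1.0 reproduces the simple "documents * size" estimate.
    """
    if memory_budget_bytes <= 0 or avg_doc_bytes <= 0 or in_memory_ratio <= 0:
        raise ValueError("all arguments must be positive")
    per_doc_bytes = avg_doc_bytes * in_memory_ratio
    return max(1, int(memory_budget_bytes // per_doc_bytes))


def env_with_flush_threshold(threshold, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set."""
    env = dict(os.environ if base_env is None else base_env)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(threshold)
    return env
```

With a 2GB budget and 2MB documents, flush_threshold_for(2_000_000_000, 2_000_000) gives 1000. An assumed ratio of 0.5 would give 2000. The returned environment can be passed to whatever launches omindex, for example via the env argument of subprocess.run.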