[Xapian-discuss] bigrams search speed and index documents

Fri Nov 27 02:59:55 GMT 2009

Hi Olly,
> This means you ran out of memory.
>
> You're attempting to add 239 million term postings to a single document.
> Document objects are built up in memory, and internally that is a C++
> std::map container, with an entry for each unique term.  So what you're
> doing here is using (or abusing perhaps) Xapian::Document as a memory-based
> associative array.
>
>   
>> Is there other way to index this 3.3G file? It works well on smaller  
>> files. I am testing some extreme cases. Thank you very much!
>>     
>
> If you are just doing this as a way to count frequencies, you could simply
> start a new document every N lines read.  The collection frequency of each term
> at the end will be the total number of times it appeared.
>   
When I start a new document every 1 million lines read, indexing the
input with 239 million terms still eventually runs out of memory. If I
start a new document every 0.5 million lines, it runs a little longer
than with 1 million lines per document, but it also runs out of memory.
I guess it keeps the whole database in memory and builds the B-tree
while indexing. Do you know what the proper size of one database is for
querying or indexing? Or the ratio between memory and database size?
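
For reference, here is a rough sketch of the batching idea as I understand
it. It does not use Xapian at all: ToyIndex, TermCounts and
index_in_batches are hypothetical names made up for this illustration, and
the toy keeps its totals in memory, whereas a real database would write them
to disk. It only shows that starting a new document every N lines leaves the
summed per-term counts (the collection frequency) unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Term -> number of occurrences. Stands in for the per-document term map.
using TermCounts = std::map<std::string, unsigned long>;

// Hypothetical stand-in for the database (NOT the Xapian API): it only
// keeps the running collection frequency of each term and a document count.
struct ToyIndex {
    TermCounts collection_freq;
    std::size_t doc_count = 0;

    // "Adds a document" by folding its term counts into the totals.
    void add_document(const TermCounts& doc) {
        for (const auto& entry : doc)
            collection_freq[entry.first] += entry.second;
        ++doc_count;
    }
};

// Splits each line on whitespace and starts a new document every
// batch_size lines, so the per-document map never holds more than one batch.
ToyIndex index_in_batches(const std::vector<std::string>& lines,
                          std::size_t batch_size) {
    assert(batch_size > 0);
    ToyIndex index;
    TermCounts current;
    std::size_t lines_in_doc = 0;
    for (const std::string& line : lines) {
        std::istringstream words(line);
        std::string term;
        while (words >> term)
            ++current[term];
        if (++lines_in_doc == batch_size) {
            index.add_document(current);
            current.clear();
            lines_in_doc = 0;
        }
    }
    if (lines_in_doc > 0)  // flush the final, partial batch
        index.add_document(current);
    return index;
}
```

Calling index_in_batches(lines, 1) and index_in_batches(lines, 100) on the
same input should give different document counts but the same
collection_freq map.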

Thank you,
Ying