[Xapian-discuss] bigrams search speed and index documents

Ying Liu liux0395 at umn.edu
Fri Nov 27 02:59:55 GMT 2009


Hi Olly,
> This means you ran out of memory.
>
> You're attempting to add 239 million term postings to a single document.
> Document objects are built up in memory, and internally that is a C++
> std::map container, with an entry for each unique term.  So what you're
> doing here is using (or abusing perhaps) Xapian::Document as a memory-based
> associative array.
>
>   
>> Is there other way to index this 3.3G file? It works well on smaller  
>> files. I am testing some extreme cases. Thank you very much!
>>     
>
> If you are just doing this as a way to count frequencies, you could simply
> start a new document every N lines read.  The collection frequency of each term
> at the end will be the total number of times it appeared.
>   
When I start a new document every 1 million lines, indexing the input 
with 239 million terms still runs out of memory eventually. If I start a 
new document every 0.5 million lines, it runs a little longer than with 
1 million lines per document, but it also runs out of memory. My guess is 
that it keeps the whole database in memory and builds the B-tree while 
indexing. Do you know what a proper size is for a single database, for 
querying or indexing? Or what the ratio between memory and database size 
should be?
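
To make the batching scheme concrete, here is a minimal sketch in plain
Python. It does not use Xapian. Each per-batch Counter stands in for the
term map of one Xapian::Document, and the running total stands in for the
collection frequency the database would report. The names bigrams,
batches and count_in_batches are hypothetical, chosen only for this
example, and keeping the total in an in-memory Counter is an assumption
made for illustration (in a real index the totals would live on disk).

```python
from collections import Counter
from itertools import islice


def bigrams(line):
    """Yield the word bigrams of one line as 'w1 w2' strings."""
    words = line.split()
    for first, second in zip(words, words[1:]):
        yield f"{first} {second}"


def batches(lines, batch_size):
    """Yield successive lists of at most batch_size lines."""
    it = iter(lines)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_in_batches(lines, batch_size):
    """Count bigrams one batch at a time and sum the per-batch counts."""
    collection_freq = Counter()  # stand-in for the database-wide totals
    for batch in batches(lines, batch_size):
        doc_terms = Counter()  # stand-in for one document's term map
        for line in batch:
            doc_terms.update(bigrams(line))
        # Only one batch's terms are held here at a time; the per-batch
        # counts add up to the same totals whatever the batch size is.
        collection_freq.update(doc_terms)
    return collection_freq
```

The point of the sketch is only that the totals do not depend on the
batch size, so N can be chosen to fit the memory available for one
document.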

Thank you,
Ying





