[Xapian-discuss] bigrams search speed and index documents

Sat Nov 28 03:06:52 GMT 2009

Ying Liu <liux0395 <at> umn.edu> writes:

> 
> Hi Olly,
> > This means you ran out of memory.
> >
> > You're attempting to add 239 million term postings to a single document.
> > Document objects are built up in memory, and internally that is a C++
> > std::map container, with an entry for each unique term.  So what you're
> > doing here is using (or abusing perhaps) Xapian::Document as a memory-based
> > associative array.
> >
> >   
> >> Is there other way to index this 3.3G file? It works well on smaller  
> >> files. I am testing some extreme cases. Thank you very much!
> >>     
> >
> > If you are just doing this as a way to count frequencies, you could simply
> > start a new document every N lines read.  The collection frequency of each
> > term at the end will be the total number of times it appeared.
> >   
> When I start a new document every 1 million lines read, indexing the file
> with 239 million terms still eventually runs out of memory. If I start a
> new document every 0.5 million lines, it runs a little longer than with
> 1 million lines per document, but it also runs out of memory. I guess it
> keeps the whole database in memory and builds the B-tree while indexing.
> Do you know what the proper size of one database is for querying or
> indexing? Or the ratio between memory and database size?

Xapian buffers changes in memory and periodically writes them to disk. You
can either flush these changes explicitly yourself, or set the
XAPIAN_FLUSH_THRESHOLD environment variable to a lower number to make Xapian
do this more frequently. It defaults to 10000 documents.
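
A minimal sketch of the explicit-flush approach, combined with the
new-document-every-N-lines idea quoted above, assuming the Python bindings.
The function name index_in_batches and its parameters are made up for
illustration. db is expected to behave like xapian.WritableDatabase (it needs
add_document() and commit(); commit() is named flush() in the 1.0 series), and
new_document is expected to behave like xapian.Document. The batch sizes are
arbitrary and would need tuning to the memory you have.

```python
def index_in_batches(db, new_document, lines,
                     lines_per_doc=1000, docs_per_commit=100):
    """Add whitespace-separated terms from `lines` to `db`.

    A new document is started every `lines_per_doc` lines, and pending
    changes are committed every `docs_per_commit` documents so they do
    not pile up in memory. Returns the number of documents added.

    `db` and `new_document` are duck-typed stand-ins (assumptions) for
    xapian.WritableDatabase and xapian.Document.
    """
    doc = new_document()
    lines_in_doc = 0
    docs_added = 0

    for line in lines:
        for term in line.split():
            # Each add_term() call bumps the term's frequency in this document.
            doc.add_term(term)
        lines_in_doc += 1

        if lines_in_doc >= lines_per_doc:
            db.add_document(doc)
            docs_added += 1
            # Start a fresh document so the in-memory term map stays small.
            doc = new_document()
            lines_in_doc = 0
            if docs_added % docs_per_commit == 0:
                db.commit()  # write buffered changes to disk now

    # Add any partially filled final document, then commit what is left.
    if lines_in_doc:
        db.add_document(doc)
        docs_added += 1
    db.commit()
    return docs_added
```

With real Xapian objects the call would look something like
index_in_batches(xapian.WritableDatabase(path, xapian.DB_CREATE_OR_OPEN),
xapian.Document, open(filename)), where path and filename are placeholders.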

There is more detail here:
http://xapian.org/docs/apidoc/html/classXapian_1_1WritableDatabase.html#d0077acafa9485c97b73b8726c375732

Cheers,

Shane