[Xapian-discuss] bigrams search speed and index documents
shane
shane.evans at gmail.com
Sat Nov 28 03:06:52 GMT 2009
Ying Liu <liux0395 <at> umn.edu> writes:
>
> Hi Olly,
> > This means you ran out of memory.
> >
> > You're attempting to add 239 million term postings to a single document.
> > Document objects are built up in memory, and internally that is a C++
> > std::map container, with an entry for each unique term. So what you're
> > doing here is using (or abusing perhaps) Xapian::Document as a memory-based
> > associative array.
> >
> >
> >> Is there any other way to index this 3.3G file? It works well on smaller
> >> files. I am testing some extreme cases. Thank you very much!
> >>
> >
> > If you are just doing this as a way to count frequencies, you could simply
> > start a new document every N lines read. The collection frequency of each term
> > at the end will be the total number of times it appeared.
> >
> When I start a new document every 1 million lines read, for the document
> with 239 million terms, it still eventually runs out of memory. If I start a
> new document every 0.5 million lines, it runs a little longer than
> with 1 million lines per document, but it also runs out of memory. I guess it
> keeps the whole database in memory and builds the B-tree while indexing. Do
> you know what the proper size of one database is for querying or indexing? Or
> the ratio between memory and database size?
Xapian buffers changes in memory and periodically writes them to disk. You can
either flush these changes explicitly yourself, or set the
XAPIAN_FLUSH_THRESHOLD environment variable to a lower number so that Xapian
flushes more frequently. The threshold is a count of changed documents and
defaults to 10000, so with very large documents a lot of data can build up in
memory before an automatic flush happens.
There is more detail here:
http://xapian.org/docs/apidoc/html/classXapian_1_1WritableDatabase.html#d0077acafa9485c97b73b8726c375732
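
As a rough illustration of combining the two suggestions in this thread (a new
document every N lines, plus an explicit flush every few documents), here is a
minimal sketch using the Xapian Python bindings. The helper names (batches,
index_in_batches, make_xapian_document), the parameters lines_per_doc and
docs_per_commit, and the paths in the usage comment are all hypothetical, and
the numbers are placeholders rather than recommendations. It assumes the
bindings are installed and that each whitespace-separated token is short
enough to be a valid term. commit() is the current name for the explicit
flush; older releases call it flush().

```python
from itertools import islice


def batches(lines, size):
    """Yield successive lists of at most `size` lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def make_xapian_document(chunk):
    """Build one Xapian document from a batch of lines (hypothetical helper)."""
    import xapian  # assumption: the Xapian Python bindings are installed

    doc = xapian.Document()
    for line in chunk:
        for term in line.split():
            # Each call adds 1 to the term's within-document frequency.
            doc.add_term(term)
    return doc


def index_in_batches(db, make_document, lines, lines_per_doc, docs_per_commit):
    """Add one document per batch of lines, committing every few documents."""
    pending = 0
    total = 0
    for chunk in batches(lines, lines_per_doc):
        db.add_document(make_document(chunk))
        pending += 1
        total += 1
        if pending >= docs_per_commit:
            db.commit()  # write buffered changes to disk now
            pending = 0
    db.commit()  # write out whatever is left
    return total


# Possible usage (paths and numbers are placeholders):
#   import xapian
#   db = xapian.WritableDatabase("bigrams.db", xapian.DB_CREATE_OR_OPEN)
#   with open("bigrams.txt") as f:
#       index_in_batches(db, make_xapian_document, f, 10000, 10)
```

Keeping both the lines per document and the documents per commit small bounds
how much is held in memory at once; the collection frequency of each term
should still come out as the total across all the documents.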
Cheers,
Shane