[Xapian-discuss] Re: Big process using Xapian

Kevin Duraj kevin.softdev at gmail.com
Tue Feb 6 19:14:25 GMT 2007


I run a big process on Gentoo Linux with 8 CPUs and 32 GB of memory,
indexing 30 million small documents. My flush threshold is set to 10 million
and the process grew to 6.4 GB; I was expecting it to grow to 10 GB. When I
had the threshold set to 10,000, my indexing did not finish over the weekend.
Now, with a threshold of 10 million, indexing takes only 8 hours. I am
increasing the threshold to 40 million to see whether the process will
surpass 10 GB of used memory and whether indexing will be faster. That means
the entire B-tree with all its 30 million records would be constructed in
memory and then flushed to disk. :-)

-Kevin
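
The figures above allow a rough estimate of what a larger threshold might
cost in memory. The following is only a back-of-the-envelope sketch in
Python: it assumes the process size scales linearly with the number of
buffered documents and that the whole 6.4 GB is buffer, neither of which the
thread confirms. The helper names are hypothetical. XAPIAN_FLUSH_THRESHOLD
is the environment variable Xapian reads for the threshold; it needs to be
set before the writable database is opened.

```python
import os

# Figures reported in the message above.
OBSERVED_THRESHOLD = 10_000_000    # documents buffered between flushes
OBSERVED_PROCESS_BYTES = 6.4e9     # process size at that threshold (6.4 GB)
TOTAL_DOCS = 30_000_000            # documents in the whole collection

# Assumption: memory use is linear in buffered documents and the whole
# process size is buffer. Real usage depends on document and term sizes.
BYTES_PER_DOC = OBSERVED_PROCESS_BYTES / OBSERVED_THRESHOLD


def estimated_buffer_bytes(threshold, total_docs=TOTAL_DOCS):
    """Estimate peak buffer size; no more than total_docs are ever buffered."""
    return min(threshold, total_docs) * BYTES_PER_DOC


def threshold_for_budget(budget_bytes):
    """Largest threshold whose estimated buffer fits in budget_bytes."""
    return int(budget_bytes // BYTES_PER_DOC)


# A threshold of 40 million exceeds the 30 million documents, so
# everything would be buffered before the first flush.
estimate_gb = estimated_buffer_bytes(40_000_000) / 1e9

# Pick a threshold aimed at roughly 10 GB and hand it to Xapian through
# the environment, before the writable database is opened.
chosen_threshold = threshold_for_budget(10e9)
os.environ["XAPIAN_FLUSH_THRESHOLD"] = str(chosen_threshold)
```

Under these assumptions the estimate works out to about 640 bytes per
document, so buffering all 30 million documents would need roughly twice
the 10 GB target, while a threshold near 15.6 million would stay around
10 GB.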





On 2/5/07, Olly Betts <olly at survex.com> wrote:
>
> On Fri, Feb 02, 2007 at 03:54:57PM +0000, James Aylett wrote:
> > On Fri, Feb 02, 2007 at 07:17:04AM -0800, Rafael SDM Sierra wrote:
> >
> > > >[1] - 736M   694M biord  182:27  2.15% python
> > >
> > > I changed the xapian flush threshold from 1000 to 10000, and the
> > > process became bigger oO...
> > >
> > > 2008M  1504M swread  10:36  0.00% python
> > >
> > > That's all the memory I have (2GB); my swap is in use now...
> >
> > Some systems cannot free main memory from the process back to the
> > operating system, even if it is unused within the process.
>
> Also, the GNU C++ STL implementation likes to "hoard" memory it has been
> allocated to avoid lots of calls to malloc() and free().  That's
> generally great for speed, but makes it harder to release memory back to
> the OS even where this is possible (I think it's uncommon anyway on
> Unix-like platforms).
>
> My long term plan is to buffer this information in memory allocated
> outside the C/C++ heap (using anon mmap or similar) so that we can just
> release it straight back to the OS once flush() or cancel() has been
> called.
>
> > The larger the flush threshold, the more data has to be held in
> > memory, so this might explain what you're seeing.
>
> Indeed.  But it should get reused by the next batch of documents being
> added so you shouldn't see the process size continue to grow.  Also I
> suspect that most of the now unused space can just get paged out until
> another batch gets added so this shouldn't actually be a big problem.
>
> Remember it's not a problem to be using swap per se - it's only a
> problem when the working set of a process is getting swapped out.
>
> Cheers,
>    Olly
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>
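
The idea quoted above, of buffering in memory allocated outside the C/C++
heap so that it can be handed straight back to the OS, can be illustrated
with an anonymous mapping. This is only a sketch of the general technique
in Python, not Xapian's implementation; the function name and buffer size
are hypothetical.

```python
import mmap

BUFFER_SIZE = 16 * 1024 * 1024  # 16 MiB, an arbitrary size for illustration


def buffer_then_release(payload):
    """Hold payload in an anonymous mapping, then unmap it.

    Passing -1 as the file descriptor requests anonymous memory that is
    not backed by a file and not carved out of the malloc() heap.
    """
    buf = mmap.mmap(-1, BUFFER_SIZE)
    buf[:len(payload)] = payload          # buffer some pending data
    copy = bytes(buf[:len(payload)])      # stand-in for flushing it to disk
    buf.close()                           # unmap: pages go back to the OS
    return copy, buf.closed


flushed, released = buffer_then_release(b"pending postings")
```

Because the mapping is removed as a whole when it is closed, the memory
does not linger in an allocator's free lists the way freed heap blocks can.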

