[Xapian-discuss] Two questions

Sun May 15 22:47:53 BST 2005

On Fri, May 13, 2005 at 10:17:28AM +0200, roki roki wrote:
> as promised, here is a test Perl script with some data. As you can see, I only
> use add_term and replace_document.

Thanks!  That's very much appreciated.

> When I generate the database for the first time with
> this script and then run quartzcompact, I get the following result:
> 
> postlist: Reduced by 49.7817% 912K (1832K -> 920K)
> termlist: Reduced by 39.4366% 224K (568K -> 344K)
> 
> When I run this test script 5 times and then do quartzcompact, I get:
> 
> postlist: Reduced by 74.8908% 2744K (3664K -> 920K)
> termlist: Reduced by 69.7183% 792K (1136K -> 344K)

I'm pretty confident that I can improve things, though it's unlikely to
be an instant fix.
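
The figures quoted above can be checked with a little arithmetic. The sketch below only recomputes the percentages from the sizes in the quartzcompact output; the function name `reduction_percent` is made up for illustration. Note that the compacted sizes (920K and 344K) are identical in both cases, so everything added by the repeated runs is space that compaction can reclaim.

```python
# Sizes in KB, copied from the quartzcompact output quoted above:
# (size before compaction, size after compaction).
runs = {
    "postlist, 1 run": (1832, 920),
    "postlist, 5 runs": (3664, 920),
    "termlist, 1 run": (568, 344),
    "termlist, 5 runs": (1136, 344),
}


def reduction_percent(before_kb, after_kb):
    """Percentage of the original size removed by compaction."""
    return 100.0 * (before_kb - after_kb) / before_kb


for name, (before, after) in runs.items():
    saved = before - after
    print(f"{name}: reduced by {reduction_percent(before, after):.4f}% "
          f"{saved}K ({before}K -> {after}K)")
```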

> I have also noticed that building a new database from my 2.5 million records
> starts at 60,000 records per hour, but after a few hundred thousand records it
> drops to 30,000 records per hour.

It's not at all surprising that the initial rate isn't sustained.  The
first batch of records flushed can just be plonked straight in, but
subsequent batches need to be merged in.  There's another threshold
when the working set fails to fit entirely in RAM.
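
The difference between the first flush and later ones can be seen in a toy model. This is a minimal sketch in pure Python, not how Quartz is implemented: the `on_disk` dictionary merely stands in for the on-disk tables, and `index_in_batches` is a hypothetical name. It counts, for each flush, how many terms were written fresh and how many had to be merged into an existing entry.

```python
from collections import defaultdict


def index_in_batches(docs, flush_threshold):
    """Return one (new_terms, merged_terms) pair per flush."""
    on_disk = {}                  # term -> docids; stand-in for the disk tables
    buffered = defaultdict(list)  # term -> docids not yet flushed
    stats = []

    def flush():
        new = merged = 0
        for term, docids in buffered.items():
            if term in on_disk:
                # An existing entry has to be read and updated.
                on_disk[term].extend(docids)
                merged += 1
            else:
                # Nothing there yet, so the entry goes straight in.
                on_disk[term] = list(docids)
                new += 1
        buffered.clear()
        stats.append((new, merged))

    pending = 0
    for docid, terms in enumerate(docs, start=1):
        for term in terms:
            buffered[term].append(docid)
        pending += 1
        if pending >= flush_threshold:
            flush()
            pending = 0
    if pending:
        flush()
    return stats


docs = [["a", "b"], ["a", "c"], ["a", "b"], ["d"]]
print(index_in_batches(docs, flush_threshold=2))
```

In this model the first flush never merges anything, while every later flush does more merging as the vocabulary already on disk grows.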

You can see some performance graphs for gmane indexing here:

http://www.survex.com/~olly/gmaneindexrate.html

I'm afraid I forget exactly what the 3 colours mean or what the
difference between the 3 graphs is, but you can see that the early
performance does drop off.

> This is on the machine with 1 GB RAM and XAPIAN_FLUSH_THRESHOLD = 20000; 
> Can I improve this speed?

Have you experimented to see if 20000 is the best setting?
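
One way to run that experiment is to time the same indexing job at several settings. The sketch below assumes the indexer picks up XAPIAN_FLUSH_THRESHOLD from its environment, as in the setup described above; the helper names and the example command are hypothetical and need replacing with the real indexing script.

```python
import os
import subprocess
import time


def time_indexer(command, threshold):
    """Run `command` with XAPIAN_FLUSH_THRESHOLD set; return elapsed seconds."""
    env = dict(os.environ, XAPIAN_FLUSH_THRESHOLD=str(threshold))
    start = time.perf_counter()
    subprocess.run(command, env=env, check=True)
    return time.perf_counter() - start


def sweep(command, thresholds):
    """Time the same command once per threshold value."""
    return {threshold: time_indexer(command, threshold)
            for threshold in thresholds}


# Hypothetical usage; "index.pl" stands for whatever script builds the database:
#   results = sweep(["perl", "index.pl"], [5000, 10000, 20000, 50000])
#   for threshold, seconds in sorted(results.items()):
#       print(threshold, round(seconds, 1))
```

For a fair comparison each run should start from an empty database and index the same records.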

Otherwise you can build several smaller databases and then merge them
using quartzcompact.  For details and some figures, see:

http://thread.gmane.org/gmane.comp.search.xapian.general/1462

The changes it talks about are in the 0.9.0 release.
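
As a rough sketch of that workflow: split the records into batches, index each batch into its own database, then merge. The code below only plans the batches and builds the merge command without running anything. It assumes quartzcompact takes the source databases followed by the destination; the batch size and database names are made up for illustration.

```python
def split_into_batches(total_records, batch_size):
    """Yield (first, last) record numbers, 1-based and inclusive, per batch."""
    for first in range(1, total_records + 1, batch_size):
        yield first, min(first + batch_size - 1, total_records)


def merge_command(source_dbs, destination_db):
    """Build the argument list for merging the partial databases."""
    # Assumption: the sources come first and the destination comes last.
    return ["quartzcompact", *source_dbs, destination_db]


# 2.5 million records as in the question; 500,000 per batch is an arbitrary choice.
batches = list(split_into_batches(2_500_000, 500_000))
sources = [f"part{n}.db" for n in range(1, len(batches) + 1)]  # hypothetical names
command = merge_command(sources, "merged.db")

for (first, last), db in zip(batches, sources):
    print(f"{db}: records {first}..{last}")
print(" ".join(command))
```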

Posting list changes are buffered in RAM before being flushed to disk,
and that buffering currently uses standard STL data structures.  We could
be a lot more compact here by exploiting what we know about the
structure of the data (at least in the common case of adding new
documents), which would greatly increase the amount that could be
buffered in a particular size of spare memory.  This is something I'll
be working on fairly soon.
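
To illustrate the kind of saving that knowledge of the data makes possible: when documents are only being added, the document ids for a term arrive in increasing order, so they can be stored as small gaps rather than full integers. The sketch below is one well-known technique (delta encoding with variable-length bytes), written in Python for brevity. It is an assumption chosen for illustration, not the structure planned for Xapian, which the message does not describe.

```python
def encode_docids(docids):
    """Pack strictly increasing positive docids as variable-length byte gaps."""
    out = bytearray()
    previous = 0
    for docid in docids:
        delta = docid - previous
        if delta <= 0:
            raise ValueError("docids must be positive and strictly increasing")
        previous = docid
        # 7 bits per byte; the high bit is set on every byte except the last.
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_docids(data):
    """Inverse of encode_docids."""
    docids = []
    previous = 0
    delta = 0
    shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # more bytes of this gap follow
        else:
            previous += delta   # gap complete: recover the absolute docid
            docids.append(previous)
            delta = 0
            shift = 0
    return docids


ids = [1, 2, 3, 130, 131, 100000]
packed = encode_docids(ids)
print(len(packed), "bytes for", len(ids), "docids")
```

A term that occurs in every one of a run of consecutive documents costs one byte per document in this scheme, which is far less than a general-purpose container of integers needs per entry.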

> P.S. I am sending this mail only to you because the file is a little too big

That's fine.  I've replied to the list because this is probably of wider
interest.

Cheers,
    Olly