errors on rebuild

Mon Apr 3 02:29:39 BST 2017

On Sat, Mar 25, 2017 at 06:36:25PM -0500, Ryan Cross wrote:
> After upgrades my stack is now:
> 
> Python 2.7
> Django 1.8
> Haystack 2.6.0
> Xapian 1.4.3 (latest xapian-haystack backend with some modifications)
> 
> Using the same rebuild command as below but with --batch-size=50000
> 
> The issue has now become one of performance.  I am indexing 2.2 million
> documents.  Using delve I can see that performance starts off at about
> 100,000 records an hour.  This is consistent with the roughly 24 hour
> rebuild time I was experiencing with Xapian 1.2.21 (chert).  However,
> after 75 hours of build time, the index is about 75% complete and
> records are processing at a rate of 10,000/hr.  The index is 51GB in
> size, of which 30GB is position.glass.

One of the big differences between chert and glass is that glass stores
positional data in a different order such that phrase searches are much
more I/O efficient.  The downside is that this means extra work at index
time, and more data to batch up in memory.  There's a thread discussing
this here:

https://lists.xapian.org/pipermail/xapian-discuss/2016-April/009368.html

> Here is a one minute strace summary
> 
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  63.97    1.272902          13    100240           pread
>  33.71    0.670733          14     48175           pwrite

A one-minute sample is hard to extrapolate from, as the indexing process
currently goes through phases of flushing changes, so whichever phase the
one minute is from isn't going to be representative.
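
A more representative picture than a single one-minute window could come
from recording the document count at intervals (for example from delve)
across several flush cycles and comparing the rate per interval.  A minimal
sketch, assuming the (elapsed seconds, document count) pairs are collected
by hand; the function name and the sample readings are made up for
illustration:

```python
def rates_per_hour(samples):
    """Return documents/hour for each consecutive pair of samples.

    samples: list of (elapsed_seconds, document_count) tuples in time order.
    """
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("samples must be in strictly increasing time order")
        # Scale the per-interval document delta up to an hourly rate.
        rates.append((n1 - n0) * 3600.0 / (t1 - t0))
    return rates


# Hypothetical readings taken every 30 minutes (not real measurements).
samples = [(0, 0), (1800, 50000), (3600, 100000), (5400, 105000)]
print(rates_per_hour(samples))
```

Intervals that span at least one full flush should smooth out the
phase-to-phase variation that a short strace sample picks up.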

But from the information you give, my guess is that the extra memory
used for batching up changes is pushing you over an I/O cliff, and
you would get better throughput by reducing the batch size (assuming
the "batch size" you specify maps to XAPIAN_FLUSH_THRESHOLD or something
equivalent).  That's especially likely if you tuned that batch size for chert.
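
To illustrate what a smaller batch means in practice: XAPIAN_FLUSH_THRESHOLD
is an environment variable, but the same effect can be had by calling
commit() explicitly every N documents.  A rough sketch, assuming db is
something like a xapian.WritableDatabase (any object with add_document()
and commit() will do).  The helper name and the default of 10000 are
illustrative only, not a tuned recommendation, and how the haystack backend
actually drives flushing is not covered here:

```python
def index_in_batches(db, documents, batch_size=10000):
    """Add documents to db, committing after every batch_size documents.

    db is assumed to expose add_document() and commit(), as
    xapian.WritableDatabase does.  Returns the number of commits made.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    commits = 0
    pending = 0
    for doc in documents:
        db.add_document(doc)
        pending += 1
        if pending >= batch_size:
            # Flush this batch so less data is held in memory at once.
            db.commit()
            commits += 1
            pending = 0
    if pending:
        # Commit whatever is left over after the last full batch.
        db.commit()
        commits += 1
    return commits
```

Timing a few runs over the same subset of documents with different
batch_size values would show where throughput falls off.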

There are some longer-term plans to rework the batching and flush process
which should improve matters a lot (and hopefully remove the need for
manually tweaking such settings).  I'm hoping that will land in the
next release series, so you could consider sticking with chert for 1.4.x,
assuming the problematic phrase search cases aren't an issue for you.
There are various other improvements between chert and glass (better
tracking of free space, less on-disk overhead) which you'd lose out on
though.

Cheers,
    Olly