errors on rebuild

Fri Apr 7 18:17:34 BST 2017

Thanks for the information on the differences between chert and glass.
This explains the performance and index size changes I’m seeing.  For the
time being, chert under 1.4.3 is working, and I’ll keep an eye out for new
releases.

Thanks,
Ryan

> On Apr 2, 2017, at 6:29 PM, Olly Betts <olly at survex.com> wrote:
> 
> On Sat, Mar 25, 2017 at 06:36:25PM -0500, Ryan Cross wrote:
>> After upgrades my stack is now:
>> 
>> Python 2.7
>> Django 1.8
>> Haystack 2.6.0
>> Xapian 1.4.3 (latest xapian-haystack backend, with some modifications)
>> 
>> Using the same rebuild command as below but with --batch-size=50000
>> 
>> The issue has now become one of performance.  I am indexing 2.2 million
>> documents.  Using delve I can see that performance starts off at about
>> 100,000 records an hour.  This is consistent with the roughly 24 hour
>> rebuild time I was experiencing with Xapian 1.2.21 (chert).  However,
>> after 75 hours of build time, the index is about 75% complete and
>> records are processing at a rate of 10,000/hr.  The index is 51GB in
>> size, of which 30GB is position.glass.
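
To put the quoted rates in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the figures from the message above; the assumption that the rate holds steady at 10,000/hr for the rest of the build is mine and may well be optimistic if throughput keeps falling.

```python
# Figures quoted in the message above.
TOTAL_DOCS = 2200000      # documents to index
INITIAL_RATE = 100000     # docs/hour at the start of the build
CURRENT_RATE = 10000      # docs/hour after 75 hours
FRACTION_DONE = 0.75      # "about 75% complete"


def hours_to_index(docs, docs_per_hour):
    """Hours needed to index `docs` documents at a constant rate."""
    return docs / float(docs_per_hour)


# A full build at the initial rate, for comparison with the ~24 hour
# rebuild reported for 1.2.21 (chert).
full_build_at_initial_rate = hours_to_index(TOTAL_DOCS, INITIAL_RATE)

# Time still needed, ASSUMING the rate stays at 10,000/hr.
remaining_docs = TOTAL_DOCS * (1 - FRACTION_DONE)
remaining_hours = hours_to_index(remaining_docs, CURRENT_RATE)

print("Full build at initial rate: %.0f hours" % full_build_at_initial_rate)
print("Remaining at current rate:  %.0f hours" % remaining_hours)
```

So the initial rate matches the old build time of roughly a day, while the last quarter of the index alone would take more than two further days at the current rate.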
> 
> One of the big differences between chert and glass is that glass stores
> positional data in a different order such that phrase searches are much
> more I/O efficient.  The downside is that this means extra work at index
> time, and more data to batch up in memory.  There's a thread discussing
> this here:
> 
> https://lists.xapian.org/pipermail/xapian-discuss/2016-April/009368.html
> 
>> Here is a one-minute strace summary:
>> 
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  63.97    1.272902          13    100240           pread
>>  33.71    0.670733          14     48175           pwrite
> 
> A one-minute sample is hard to extrapolate from, as the indexing process
> currently goes through phases of flushing changes, so whichever phase the
> one minute is from isn't going to be representative.
> 
> But from the information you give, my guess is that the extra memory
> used for batching up changes is pushing you over an I/O cliff, and
> you would get better throughput by reducing the batch size (assuming
> the "batch size" you specify maps to XAPIAN_FLUSH_THRESHOLD or something
> equivalent).  Especially likely if you tuned that batch size for chert.
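
A minimal sketch of how one might experiment with a smaller batch is below. XAPIAN_FLUSH_THRESHOLD is a real Xapian environment variable (number of changed documents before an automatic flush, default 10000), but everything else here is an assumption: the helper names are hypothetical, the exact rebuild command is not shown in this message, and whether Haystack's --batch-size or the environment variable actually controls flushing depends on how the backend commits.

```python
import os
import subprocess
import sys


def rebuild_env(flush_threshold, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set.

    Xapian reads this variable when a writable database is opened.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(int(flush_threshold))
    return env


def rebuild_command(batch_size, manage_py="manage.py"):
    """Build the command line for the rebuild (path to manage.py is assumed)."""
    return [sys.executable, manage_py, "rebuild_index",
            "--batch-size=%d" % batch_size]


def run_rebuild(batch_size, flush_threshold):
    """Run the rebuild with both settings applied; raises on failure."""
    subprocess.check_call(rebuild_command(batch_size),
                          env=rebuild_env(flush_threshold))
```

For example, `run_rebuild(10000, 10000)` would try a batch size one fifth of the 50000 used above. The value 10000 is only a starting point for experimentation, not a recommendation from this thread.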
> 
> There are some longer term plans to rework the batching and flush process
> which should improve matters a lot (and hopefully remove the need for
> manually tweaking such settings).  I'm hoping that will land in the
> next release series, so you could consider sticking with chert for 1.4.x,
> assuming the problematic phrase search cases aren't an issue for you.
> There are various other improvements between chert and glass (better
> tracking of free space, less on-disk overhead) which you'd lose out on
> though.
> 
> Cheers,
>    Olly