errors on rebuild

Mon Apr 3 04:40:22 BST 2017

On Sun, 2 Apr 2017, at 20:29, Olly Betts wrote:
> On Sat, Mar 25, 2017 at 06:36:25PM -0500, Ryan Cross wrote:
> > After upgrades my stack is now:
> > 
> > Python 2.7
> > Django 1.8
> > Haystack 2.6.0
> > Xapian 1.4.3 (latest xapian haystack backend with some modifications)
> > 
> > Using the same rebuild command as below but with --batch-size=50000
> > 
> > The issue has now become one of performance.  I am indexing 2.2 million
> > documents.  Using delve I can see that performance starts off at about
> > 100,000 records an hour.  This is consistent with the roughly 24 hour
> > rebuild time I was experiencing with Xapian 1.2.21 (chert).  However,
> > after 75 hours of build time, the index is about 75% complete and
> > records are processing at a rate of 10,000/hr.  The index is 51GB in
> > size, 30GB is position.glass.
> 
> One of the big differences between chert and glass is that glass stores
> positional data in a different order such that phrase searches are much
> more I/O efficient.  The downside is that this means extra work at index
> time, and more data to batch up in memory.  There's a thread discussing
> this here:
> 
> https://lists.xapian.org/pipermail/xapian-discuss/2016-April/009368.html
> 
> > Here is a one minute strace summary
> > 
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  63.97    1.272902          13    100240           pread
> >  33.71    0.670733          14     48175           pwrite
> 
> A one minute sample is hard to extrapolate from, as the indexing process
> currently goes through phases of flushing changes, so whichever phase the
> one minute is from isn't going to be representative.
> 
> But from the information you give, my guess is that the extra memory
> used for batching up changes is pushing you over an I/O cliff, and
> you would get better throughput by reducing the batch size (assuming
> the "batch size" you specify maps to XAPIAN_FLUSH_THRESHOLD or something
> equivalent).  Especially likely if you tuned that batch size for chert.
> 
> There are some longer term plans to rework the batching and flush process
> which should improve matters a lot (and hopefully remove the need for
> manually tweaking such settings).  I'm hoping that will land in the
> next release series, so you could consider sticking with chert for 1.4.x,
> assuming the problematic phrase search cases aren't an issue for you.
> There are various other improvements between chert and glass (better
> tracking of free space, less on-disk overhead) which you'd lose out on
> though.
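
For anyone wanting to try the smaller-threshold suggestion above, here is a minimal sketch of setting XAPIAN_FLUSH_THRESHOLD for a single rebuild run.  The environment variable is the one named above; the helper names, the value 10000 in the usage note, and the `manage.py rebuild_index` command line are assumptions for illustration only, and whether Haystack's --batch-size interacts with the threshold depends on the backend.

```python
import os
import subprocess


def rebuild_env(flush_threshold, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set."""
    if flush_threshold <= 0:
        raise ValueError("flush_threshold must be positive")
    env = dict(os.environ if base_env is None else base_env)
    # Xapian reads this variable when a writable database is opened.
    env["XAPIAN_FLUSH_THRESHOLD"] = str(flush_threshold)
    return env


def run_rebuild(flush_threshold, batch_size):
    """Run a rebuild with the given threshold (hypothetical command line)."""
    cmd = [
        "python", "manage.py", "rebuild_index", "--noinput",
        "--batch-size=%d" % batch_size,
    ]
    return subprocess.call(cmd, env=rebuild_env(flush_threshold))
```

Something like `run_rebuild(10000, 10000)` would then be a starting point to measure against the current 50000 setting, not a recommended value.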

The trick that FastMail/Cyrus IMAPd uses, batching into smaller indexes and then compacting a few of them together at once, may be interesting as well.  I have no idea how it performs on really massive indexes though, because we index per user.
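
A rough sketch of the general shape of that approach, not the Cyrus code.  The function names, shard size and fan-in are made up, and the two callables are placeholders: `index_shard` would write one small database and return its path, and `compact` would merge several into one (with Xapian that step could be the xapian-compact tool, which takes several source databases and a destination).

```python
import itertools


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def build_sharded(docs, shard_size, merge_fanin, index_shard, compact):
    """Index docs into small shards, compacting every `merge_fanin` of them.

    index_shard(n, chunk) -> shard path   (hypothetical callable)
    compact(paths)        -> merged path  (hypothetical callable)
    """
    pending = []   # small shards not yet compacted
    merged = []    # results of each compaction pass
    for n, chunk in enumerate(chunked(docs, shard_size)):
        pending.append(index_shard(n, chunk))
        if len(pending) == merge_fanin:
            merged.append(compact(pending))
            pending = []
    if pending:
        # Compact whatever is left over at the end.
        merged.append(compact(pending))
    return merged
```

Each small shard stays cheap to write because it never grows large, and the merge cost is paid in a few sequential passes instead of on every flush.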

Bron.

-- 
  Bron Gondwana
  brong at fastmail.fm