errors on rebuild
Bron Gondwana
brong at fastmail.fm
Mon Apr 3 04:40:22 BST 2017
On Sun, 2 Apr 2017, at 20:29, Olly Betts wrote:
> On Sat, Mar 25, 2017 at 06:36:25PM -0500, Ryan Cross wrote:
> > After upgrades my stack is now:
> >
> > Python 2.7
> > Django 1.8
> > Haystack 2.6.0
> > Xapian 1.4.3 (latest xapian-haystack backend with some modifications)
> >
> > Using the same rebuild command as below but with --batch-size=50000
> >
> > The issue has now become one of performance. I am indexing 2.2 million
> > documents. Using delve I can see that performance starts off at about
> > 100,000 records an hour. This is consistent with the roughly 24 hour
> > rebuild time I was experiencing with Xapian 1.2.21 (chert). However,
> > after 75 hours of build time, the index is about 75% complete and
> > records are processing at a rate of 10,000/hr. The index is 51GB in
> > size, of which 30GB is position.glass.
>
> One of the big differences between chert and glass is that glass stores
> positional data in a different order such that phrase searches are much
> more I/O efficient. The downside is that this means extra work at index
> time, and more data to batch up in memory. There's a thread discussing
> this here:
>
> https://lists.xapian.org/pipermail/xapian-discuss/2016-April/009368.html
>
> > Here is a one minute strace summary
> >
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  63.97    1.272902          13    100240           pread
> >  33.71    0.670733          14     48175           pwrite
>
> A one minute sample is hard to extrapolate from, as the indexing process
> currently goes through phases of flushing changes, so whichever phase the
> one minute is from isn't going to be representative.
>
> But from the information you give, my guess is that the extra memory
> used for batching up changes is pushing you over an I/O cliff, and
> you would get better throughput by reducing the batch size (assuming
> the "batch size" you specify maps to XAPIAN_FLUSH_THRESHOLD or something
> equivalent). Especially likely if you tuned that batch size for chert.
>
> There are some longer term plans to rework the batching and flush process
> which should improve matters a lot (and hopefully remove the need for
> manually tweaking such settings). I'm hoping that will land in the
> next release series, so you could consider sticking with chert for 1.4.x,
> assuming the problematic phrase search cases aren't an issue for you.
> There are various other improvements between chert and glass (better
> tracking of free space, less on-disk overhead) which you'd lose out on
> though.
The trick that FastMail/Cyrus IMAPd uses, indexing in batches into smaller indexes and then compacting a few of them together at once, may be interesting as well. I have no idea how it performs on really massive indexes though, because we index per user.
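
As a rough illustration of the shape of that approach (not the actual Cyrus code), here is a small Python sketch that only plans the merges: given a list of small shard directories, it groups them a few at a time and builds the xapian-compact command lines that would merge each group, repeating until one index is left. xapian-compact takes the source databases followed by the destination; everything else here (the function names, the "merged-L<round>-<n>" naming, the fan-in of 4) is made up for the example, and nothing is executed.

```python
def group_shards(shard_paths, fan_in=4):
    """Split shard paths into consecutive groups of at most fan_in."""
    return [shard_paths[i:i + fan_in]
            for i in range(0, len(shard_paths), fan_in)]


def compact_command(sources, destination):
    """Build a xapian-compact argv: source databases first, destination last."""
    return ["xapian-compact"] + list(sources) + [destination]


def plan_rounds(shard_paths, fan_in=4, prefix="merged"):
    """Plan repeated merge rounds until a single index remains.

    Returns (commands, final_path). The destination naming scheme is a
    hypothetical convention for this sketch, not anything Xapian requires.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    commands = []
    current = list(shard_paths)
    level = 0
    while len(current) > 1:
        merged = []
        for n, group in enumerate(group_shards(current, fan_in)):
            if len(group) == 1:
                # Nothing to merge with in this round; carry it forward.
                merged.append(group[0])
                continue
            dest = "%s-L%d-%d" % (prefix, level, n)
            commands.append(compact_command(group, dest))
            merged.append(dest)
        current = merged
        level += 1
    final = current[0] if current else None
    return commands, final


if __name__ == "__main__":
    shards = ["shard-%d" % i for i in range(10)]
    cmds, final = plan_rounds(shards, fan_in=4)
    for cmd in cmds:
        print(" ".join(cmd))
    print("final index: %s" % final)
```

Each planned command could be handed to subprocess once the shards in its group have been written and closed. Whether this beats a single large index with a smaller flush threshold at this scale is something you'd have to measure.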
Bron.
--
Bron Gondwana
brong at fastmail.fm