compact checkpoints while doing days-long indexing jobs?

Eric Wong e at 80x24.org
Thu Aug 28 06:26:53 BST 2025


Olly Betts <olly at survex.com> wrote:
> On Wed, Aug 27, 2025 at 05:56:49AM +0000, Eric Wong wrote:
> > One caveat is one of the indexers will avoid creating new Xapian
> > docs for cross-posted messages but add new List-IDs to existing
> > docs.
> > 
> > For example, if one message gets cross-posted to multiple
> > mailing lists and we process each mailing list sequentially, the
> > initial message would be indexed with List-Id:<a.example.com>.
> > 
> > However, when we're processing List-Id:<b.example.com> somewhere
> > down the line, we'll add the new List-Id value to the original
> > message we saw (possibly millions of messages ago), so
> > non-sequential performance does end up being important, too.
> > 
> > IOW, if a message is cross-posted to a dozen lists, we end up
> > doing replace_document on the same docid a dozen times (ick!)
> 
> If you do replace_document() with a Document object you got from
> get_document() and use its existing docid, then the update is optimised
> provided you've not modified the database in between.
> 
> I'm not clear how the "List-Id" is stored, but e.g. if it's a boolean
> term then only that term's posting list is actually updated.

Currently it's both a boolean term and a text phrase; in case of
domain changes, maybe wildcards could be useful *shrug*
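
Spelled out, I understand the optimised path to look roughly like
this (sketched with the Xapian Python bindings for brevity since the
actual indexer is Perl; the "XL" and "XLISTID" prefixes below are
made-up placeholders, not the real prefix scheme):

import xapian

def add_list_id(db, docid, list_id):
    # Fetch the existing document and modify it in place.  As long as
    # nothing else modifies the DB between get_document() and
    # replace_document() with the same docid, Xapian only rewrites the
    # postings that actually changed.
    doc = db.get_document(docid)

    # boolean term for exact filtering ("XL" prefix is just a placeholder)
    doc.add_boolean_term('XL' + list_id.lower())

    # also index it as text so phrase searches work ("XLISTID" is
    # likewise a placeholder prefix)
    tg = xapian.TermGenerator()
    tg.set_document(doc)
    tg.index_text(list_id, 1, 'XLISTID')

    db.replace_document(docid, doc)

db = xapian.WritableDatabase('0', xapian.DB_CREATE_OR_OPEN)
add_list_id(db, 12345, 'b.example.com')
db.commit()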

> > > I'd have gone for making the docids in the combined database match
> > > NNTP_ARTICLE_NUMBER, which would mean they're sequential in each shard
> > > (except if there are ever gaps in NNTP_ARTICLE_NUMBER) and the smaller
> > > docid values and smaller gaps between them will encode a little more
> > > efficiently.
> > 
> > Understood; but would it be possible to continue doing parallel
> > indexing?  Since NNTP article numbers are allocated sequentially,
> > they round-robin across the shards to allow parallelism during
> > indexing (I rely on Perl to extract terms and such, so there's a
> > CPU-limited component).
> 
> If the existing approach works, the new one should - it's really just
> the same except the docids in the shards are changed by this mapping:
> 
>     new_docid = (old_docid + 2) / 3    (using integer division)
> 
> > Changing the way docids are allocated now could be very
> > disruptive to users with existing DBs and might be a
> > maintainability/support nightmare.
> 
> Yes.  You could perhaps store a flag in a user metadata entry
> in the DB and use that to select the mapping function to use.
> 
> Not sure what the overhead reduction would actually amount to - cases
> where the gap between consecutive entries in a posting list for a term
> is between 43 and 128 documents will shrink from 2 bytes to 1 byte
> each time.  Probably about 2/3 of keys containing docids would reduce
> in size by a byte too.  I'd guess it's probably noticeable but not
> dramatic.

OK, I'll leave it for a later date after more dramatic
improvements are done.
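
To make sure I have the arithmetic right, though, here's the mapping
spelled out for 3 shards (plain Python; it assumes the current scheme
uses the article number itself as the per-shard docid, and that the
combined docid follows Xapian's usual multi-database interleaving):

NSHARDS = 3  # my default shard count since 2018

def placement(article_num):
    # article numbers round-robin across shards
    shard = (article_num - 1) % NSHARDS
    old_docid = article_num                # current: sparse per shard
    new_docid = (article_num + 2) // 3     # Olly's mapping: dense per shard
    # with dense per-shard docids, the combined multi-database docid
    # lines up with the NNTP article number again:
    combined = (new_docid - 1) * NSHARDS + shard + 1
    return shard, old_docid, new_docid, combined

for n in range(1, 8):
    print(placement(n))
# shard cycles 0,1,2,...; new_docid runs 1,1,1,2,2,2,3; combined == n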

<snip>

> > So, a side question: even on ext4 and ignoring cross-posted
> > messages, I notice Xapian shard commits taking more time as the
> > shards get bigger.  Trying to commit shards one-at-a-time doesn't
> > seem to help, so it doesn't seem bound by I/O contention with 3
> > shards (I capped the default shard count at 3 back in 2018 due to
> > I/O contention).
> 
> This is with Xapian 1.4.x and the glass backend?

Yes.

> Before that, a commit required writing O(file size of DB) data because
> the freelist for a table was stored in a bitmap with one bit per block
> in the table.  This was not problematic for smaller databases, but
> because we need to ensure this data is actually synced to disc and
> we can only write it out just before we sync it, it gradually caused
> more I/O contention as the DB grew in size.
> 
> Glass instead stores the freelist in blocks which are on the freelist.
> 
> The table data still needs to be synced, though that gets written out
> over a period of time so has more chance to get written to disk before
> we sync it.  I'd guess that's what you're seeing.

I actually disable fsync in all my own use and testing, but I
configure my kernels to flush dirty data fairly aggressively
(200MB or so w/ SATA-2).
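
(For reference, Xapian itself can skip the syncs if the writer is
opened with DB_NO_SYNC; a minimal Python sketch, not how the Perl
side actually wires it up:)

import xapian

# DB_NO_SYNC makes glass skip fsync/fdatasync at commit time, trading
# crash safety for speed; the kernel's dirty-writeback limits then
# decide when data actually reaches disk.
flags = xapian.DB_CREATE_OR_OPEN | xapian.DB_NO_SYNC
db = xapian.WritableDatabase('0', flags)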

> > Thus I'm considering allowing the option to split the shards
> > into epochs during the indexing phase, leaving the original set
> > (0, 1, 2) untouched above a certain interval (say >100K)
> > until the end of indexing.
> > 
> > During indexing, there'd be (0.1, 1.1, 2.1) set of shards for
> > 100K..199999, a (0.2, 1.2, 2.2) set for 200K..299999, and so forth.
> > To finalize indexing, `xapian-compact --no-renumber' would
> > combine all the 0.* into 0, 1.* into 1, and 2.* into 2 to
> > maintain compatibility with existing readers.
> > 
> > One downside of this approach would be needing much more temporary
> > space, so it can't be the default, but I'm hoping the extra work
> > required by compact would offset the high commit times for giant
> > shards when adding a lot of messages to the index.
> > 
> > Small incremental indexing jobs would continue to write directly
> > to (0, 1, 2); only large jobs would use the epochs.
> > 
> > Does that sound reasonable?  Thanks.
> 
> Yes, that's roughly what I'd do if I wanted to maximise the indexing
> rate for an initial build of the DB.

OK, good to know.
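
The finalize step would amount to something like this (sketched with
the Python bindings; the real thing is Perl and would more likely just
run xapian-compact, and the directory names are illustrative):

import os
import xapian

def merge_epochs(main_shard, epoch_dirs, tmp_out):
    # Combine a main shard (e.g. '0') with its epoch sub-shards
    # ('0.1', '0.2', ...) into a fresh tree.  DBCOMPACT_NO_RENUMBER is
    # the API equivalent of `xapian-compact --no-renumber` and keeps
    # existing docids, which only works because each source covers a
    # disjoint docid range.
    src = xapian.Database()
    for d in [main_shard] + list(epoch_dirs):
        src.add_database(xapian.Database(d))
    src.compact(tmp_out, xapian.DBCOMPACT_NO_RENUMBER)
    src.close()
    # swap the result into place; real code would have to coordinate
    # with any readers holding the old shard open
    os.rename(main_shard, main_shard + '.old')
    os.rename(tmp_out, main_shard)

merge_epochs('0', ['0.1', '0.2'], '0.tmp')  # likewise for shards 1 and 2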

> You could try picking the size of each x.y to be indexable as a
> single commit so all the merging happens via xapian-compact.

IOW, with the default XAPIAN_FLUSH_THRESHOLD, that means
limiting each x.y to only 10000 documents and doing one commit?

I actually wasn't happy with memory use at the default 10K
threshold, so I flush more frequently.  It's currently based on
raw bytes processed; my default is to flush every 8MB indexed on
64-bit systems(!).  I use a 64-bit laptop with only 2GB RAM and
my $5/month VPS has 1GB RAM (32-bit userspace), but I often work
via (mosh|ssh) to a busy system with 16GB ECC RAM.
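
The byte-based accounting is roughly this shape (Python sketch; the
class name and structure are made up, the real logic lives in the
Perl indexer):

import xapian

FLUSH_BYTES = 8 * 1024 * 1024  # ~8MB of raw input per commit on 64-bit

class ByteBudgetWriter:
    """Flush by bytes of input processed instead of Xapian's own
    document-count threshold (XAPIAN_FLUSH_THRESHOLD, default 10000)."""

    def __init__(self, path):
        self.db = xapian.WritableDatabase(path, xapian.DB_CREATE_OR_OPEN)
        self.pending = 0

    def add(self, raw_message, doc):
        self.db.add_document(doc)
        self.pending += len(raw_message)
        if self.pending >= FLUSH_BYTES:
            self.db.commit()
            self.pending = 0

    def finish(self):
        self.db.commit()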

Thanks.


