compact checkpoints while doing days-long indexing jobs?
Olly Betts
olly at survex.com
Thu Aug 28 03:30:19 BST 2025
On Wed, Aug 27, 2025 at 05:56:49AM +0000, Eric Wong wrote:
> One caveat is that one of the indexers will avoid creating new
> Xapian docs for cross-posted messages and instead add new List-Ids
> to existing docs.
>
> For example, if one message gets cross-posted to multiple
> mailing lists and we process each mailing list sequentially, the
> initial message would be indexed with List-Id:<a.example.com>.
>
> However, when somewhere down the line we're processing
> List-Id:<b.example.com>, we'll add the new List-Id value to the
> original message we saw (possibly millions of messages ago), so
> non-sequential performance does end up being important, too.
>
> IOW, if a message is cross-posted to a dozen lists, we end up
> doing replace_document on the same docid a dozen times (ick!)
If you do replace_document() with a Document object you got from
get_document() and use its existing docid, then the update is optimised
provided you've not modified the database in between.
I'm not clear how the "List-Id" is stored, but e.g. if it's a boolean
term then only that term's posting list is actually updated.
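For illustration, a minimal C++ sketch of that pattern (the "XL"
prefix and the helper name are made up, since I don't know how the
List-Id is actually stored):

    #include <string>
    #include <xapian.h>

    // Add a List-Id boolean term to an existing document and write it
    // back under the same docid.  The "XL" prefix is illustrative only.
    void add_list_id(Xapian::WritableDatabase& db, Xapian::docid did,
                     const std::string& list_id)
    {
        Xapian::Document doc = db.get_document(did);
        doc.add_boolean_term("XL" + list_id);
        // doc came from get_document() and keeps its docid, and nothing
        // else has been written to db in between, so only the new term's
        // posting list actually needs updating.
        db.replace_document(did, doc);
    }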
> > I'd have gone for making the docids in the combined database match
> > NNTP_ARTICLE_NUMBER, which would mean they're sequential in each shard
> > (except if there are ever gaps in NNTP_ARTICLE_NUMBER) and the smaller
> > docid values and smaller gaps between them will encode a little more
> > efficiently.
>
> Understood; but would it be possible to continue to do parallel
> indexing? Since NNTP article numbers are allocated sequentially,
> they round-robin across the shards to allow parallelism during
> indexing (I rely on Perl to extract terms and such, so there's a
> CPU-limited component).
If the existing approach works, the new one should - it's really just
the same except the docids in the shards are changed by this mapping:
new_docid = (old_docid + 2) / 3 (using integer division)
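To make that concrete, here's a small C++ sketch of the mapping for 3
shards, assuming the current per-shard docid is the NNTP article
number (the helper names are just for illustration):

    #include <xapian.h>

    const unsigned NSHARDS = 3;

    // Which shard an article lands in (round-robin, as now).
    unsigned shard_of(Xapian::docid article) {
        return (article - 1) % NSHARDS;
    }

    // New per-shard docid: articles 1,2,3 get docid 1 in shards 0,1,2;
    // articles 4,5,6 get docid 2; and so on.
    Xapian::docid shard_docid(Xapian::docid article) {
        return (article + NSHARDS - 1) / NSHARDS;
    }

    // Inverse: recover the article number from (shard, per-shard docid).
    // This is also the docid seen through the combined read-only
    // Database, so combined docids match NNTP article numbers.
    Xapian::docid article_number(unsigned shard, Xapian::docid did) {
        return (did - 1) * NSHARDS + shard + 1;
    }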
> Changing the way docids are allocated now could be very
> disruptive to users with existing DBs and might be a
> maintainability/support nightmare.
Yes. You could perhaps store a flag in a user metadata entry
in the DB and use that to select which mapping function to use.
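Something along these lines (a C++ sketch; the metadata key and value
are made-up names):

    #include <xapian.h>

    // Record the docid scheme when the DB is first created...
    void mark_new_scheme(Xapian::WritableDatabase& db) {
        db.set_metadata("docid_scheme", "article-div-nshards");
    }

    // ...and check it when opening.  get_metadata() returns an empty
    // string for a key which was never set, so existing DBs fall
    // through to the legacy mapping.
    bool uses_new_scheme(const Xapian::Database& db) {
        return db.get_metadata("docid_scheme") == "article-div-nshards";
    }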
Not sure what the overhead reduction would actually amount to - the
cases where a gap between consecutive entries in a posting list for a
term is between 43 and 128 documents will reduce from 2 bytes to 1 byte
each time. Probably about 2/3 of keys containing docids would reduce in
size by a byte too. I'd guess it's probably noticeable but not
dramatic.
> Yeah, I actually /just/ noticed WritableDatabase supported
> shards while rechecking the docs this week. I see it was added
> in the 1.3.x days but I started with 1.2.x and supported 1.2
> for ages due to LTS distros.
>
> And I suppose using the combined WritableDatabase feature would
> require using a single process for indexing and lose parallelism.
Yes, so that's a reason to keep doing the sharding yourself.
> So, a side question: even on ext4 and ignoring cross-posted
> messages, I notice Xapian shard commits taking more time as the
> shards get bigger. Trying to commit shards one at a time doesn't
> seem to help, so it doesn't seem bound by I/O contention with 3
> shards (I capped the default shard count at 3 back in 2018 due to
> I/O contention).
This is with Xapian 1.4.x and the glass backend?
Before that, a commit required writing O(file size of DB) data because
the freelist for a table was stored in a bitmap with one bit per block
in the table. This was not problematic for smaller databases, but
because we need to ensure this data is actually synced to disc and
we can only write it out just before we sync it, it gradually caused
more I/O contention as the DB grew in size.
Glass instead stores the freelist in blocks which are on the freelist.
The table data still needs to be synced, though that gets written out
over a period of time so has more chance to get written to disk before
we sync it. I'd guess that's what you're seeing.
> Thus I'm considering allowing the option to split the shards
> into epochs during the indexing phase, leaving the original set
> (0, 1, 2) untouched above a certain interval (say >100K)
> until the end of indexing.
>
> During indexing, there'd be a (0.1, 1.1, 2.1) set of shards for
> 100K..199999, a (0.2, 1.2, 2.2) set for 200K..299999, and so forth.
> To finalize indexing, `xapian-compact --no-renumber' would
> combine all the 0.* into 0, 1.* into 1, and 2.* into 2 to
> maintain compatibility with existing readers.
>
> One downside of this approach would be needing much more temporary
> space, so it can't be the default, but I'm hoping the extra work
> required by compact would offset the high commit times for giant
> shards when adding a lot of messages to the index.
>
> Small incremental indexing jobs would continue to write directly
> to (0, 1, 2); only large jobs would use the epochs.
>
> Does that sound reasonable? Thanks.
Yes, that's roughly what I'd do if I wanted to maximise the indexing
rate for an initial build of the DB.
You could try picking the size of each x.y to be indexable as a
single commit so all the merging happens via xapian-compact.
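For the final merge, the API equivalent of `xapian-compact
--no-renumber' for one shard would look roughly like this (a sketch
with placeholder paths):

    #include <xapian.h>

    void merge_epochs_into_shard0() {
        Xapian::Database src("0");                  // original shard
        src.add_database(Xapian::Database("0.1"));  // epoch shards
        src.add_database(Xapian::Database("0.2"));
        // DBCOMPACT_NO_RENUMBER keeps each document's existing docid;
        // that requires the sources to have non-overlapping docid
        // ranges, which the epoch split gives you.
        src.compact("0.new", Xapian::DBCOMPACT_NO_RENUMBER);
        // Then rename "0.new" into place as the new shard 0.
    }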
Cheers,
Olly