compact checkpoints while doing days-long indexing jobs?

Olly Betts olly at survex.com
Mon Aug 25 23:17:08 BST 2025


On Thu, Aug 21, 2025 at 01:05:10AM +0000, Eric Wong wrote:
> Hello, I'm trying to get Xapian to work better on btrfs which is
> prone to fragmentation regardless on whether or not btrfs CoW
> is enabled.
> 
> Thus, I'm wondering if running xapian-compact occasionally
> during a multi-day indexing can improve indexing performance.

tldr: I'd expect it to harm performance.

Glass database tables have two update modes - sequential and random
access.  Each table automatically switches to/from sequential mode
based on its update pattern.  E.g. if you only add_document() then some
tables should mostly operate in sequential mode, which is why you'll
often see significant differences in how compact the different tables
are.

In random access mode, blocks are split to leave some unused space in
each half, which is necessary to get the theoretical O() performance
from a B-tree.  Compacting removes that unused space, so after
compacting any update will have to split a block.  Gradually, updates
will require fewer block splits (because it becomes increasingly likely
that the blocks involved will already have been split and so have free
space for the update).

This also means that the unused space shouldn't grow without limit.
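To make the effect concrete, here's a toy simulation of the behaviour
described above.  This is not the glass B-tree code - just a sketch of
the idea: a "compacted" tree (every block packed full) has to split a
block on almost every scattered update, while an uncompacted tree has
free space that absorbs most updates without splitting.

```python
import bisect

CAP = 8  # entries per block (a toy; real glass blocks are 8KB of bytes)

def insert(blocks, key, stats):
    """Insert key into the sorted list of leaf blocks, splitting a full
    block in half (leaving free space in each half) when necessary."""
    lasts = [blk[-1] for blk in blocks]
    i = bisect.bisect_left(lasts, key)
    if i == len(blocks):
        i -= 1
    blk = blocks[i]
    if len(blk) == CAP:               # no free space left: split in half
        stats["splits"] += 1
        mid = CAP // 2
        left, right = blk[:mid], blk[mid:]
        blocks[i:i + 1] = [left, right]
        blk = left if key <= left[-1] else right
    bisect.insort(blk, key)

def compact(blocks):
    """Repack every entry into completely full blocks, like compaction."""
    keys = [k for blk in blocks for k in blk]
    return [keys[i:i + CAP] for i in range(0, len(keys), CAP)]

# Build a tree by inserting ascending even keys; the naive half-splits
# leave free space in most blocks.
blocks = [[0]]
stats = {"splits": 0}
for k in range(2, 512, 2):
    insert(blocks, k, stats)

packed = compact(blocks)

# Apply the same scattered updates to both trees and count splits.
updates = list(range(1, 512, 8))
plain_stats, packed_stats = {"splits": 0}, {"splits": 0}
for k in updates:
    insert(blocks, k, plain_stats)
    insert(packed, k, packed_stats)

print("splits without compaction:", plain_stats["splits"])
print("splits after compaction:  ", packed_stats["splits"])
```

Running this, the compacted tree needs far more splits for the same set
of updates, and (matching the point above) each split re-creates free
space, so later updates into the same regions split less and less.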

> I'll be using the BTRFS_IOC_DEFRAG ioctl to periodically defrag
> glass files after some (probably not all) transaction commits.

Note the format is block-based, and as long as individual blocks (which
are 8KB by default) aren't fragmented I would not expect fragmentation
at the filesystem level to make much difference to search performance.
We sometimes need to step to the next leaf block in tree order, but
defragmenting at the filesystem level won't help with that unless the
leaf block order in the tree and in the file match up, which will
generally only be the case right after compaction.
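For reference, the ioctl you mention can be driven from Python too.
This is a hedged sketch, not tested here against a real btrfs mount:
the request number is derived from the kernel's _IOW() encoding, on the
assumption that BTRFS_IOC_DEFRAG is _IOW(0x94, 2, struct
btrfs_ioctl_vol_args) with sizeof(struct btrfs_ioctl_vol_args) == 4096,
as in linux/btrfs.h - verify against your own headers before relying on
it.

```python
import fcntl
import os

def _IOW(ioctl_type, nr, size):
    # Matches the Linux asm-generic _IOW() encoding:
    # dir (2 bits) | size (14 bits) | type (8 bits) | nr (8 bits)
    _IOC_WRITE = 1
    return (_IOC_WRITE << 30) | (size << 16) | (ioctl_type << 8) | nr

# Assumption: _IOW(0x94, 2, sizeof(struct btrfs_ioctl_vol_args)).
BTRFS_IOC_DEFRAG = _IOW(0x94, 2, 4096)

def defrag(path):
    """Ask btrfs to defragment one file.  Passing 0 as the argument
    sends a NULL pointer, which means "whole file, default options".
    Some kernels may want the file opened read-write instead."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_DEFRAG, 0)
    finally:
        os.close(fd)
```

You'd call defrag() on each .glass file after a commit, but per the
above I'd only expect it to pay off for search right after a compaction
has put the leaf blocks into tree order.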

> I've noticed that even on small, fresh imports (with few/minimal
> deletes) compact can reduce file sizes by 20-60%, so I'm
> wondering if compact before btrfs defrag is helpful even if I
> intend to add more docs right after the compact+defrag.

I'd be wary of compaction if you're about to index more unless you
can benchmark and show it actually helps (in which case I'd be very
curious how it is helping).

If you delete a large proportion of documents then perhaps it might make
sense to compact after that to reclaim the space, but even then I'd
suggest leaving it until after indexing unless you're short on disk
space.  The point here is that the .glass files never shrink, but
blocks unused after deletion are added to the freelist and so will
get reused in preference to growing the .glass file.
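A toy model of that freelist behaviour (again, not glass code - just
the idea that the file only ever grows, while freed blocks are reused
in preference to appending new ones):

```python
class BlockFile:
    """Toy block-based file: blocks are never removed, only freed."""

    def __init__(self):
        self.blocks = []      # the "file": only ever grows
        self.freelist = []    # block numbers available for reuse

    def alloc(self, payload):
        if self.freelist:                 # reuse a freed block first
            n = self.freelist.pop()
            self.blocks[n] = payload
        else:                             # otherwise grow the file
            n = len(self.blocks)
            self.blocks.append(payload)
        return n

    def free(self, n):
        self.blocks[n] = None
        self.freelist.append(n)

f = BlockFile()
nums = [f.alloc(f"doc{i}") for i in range(100)]
size_after_index = len(f.blocks)

for n in nums[:50]:                       # "delete" half the documents
    f.free(n)
size_after_delete = len(f.blocks)         # unchanged: the file never shrinks

for i in range(100, 140):                 # further indexing reuses freed blocks
    f.alloc(f"doc{i}")
size_after_reindex = len(f.blocks)        # still unchanged: freelist absorbed it

print(size_after_index, size_after_delete, size_after_reindex)
```

So the "wasted" space after deletes isn't lost to future updates - it
just isn't returned to the filesystem until you compact.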

> I'm dealing with over 20 million docs across 3 (adjustable)
> shards in parallel (Perl is probably a bottleneck, too :x).
> Document numbers are assigned to shards based on
> $NNTP_ARTICLE_NUMBER % $SHARD_COUNT so I rely on --no-renumber.

Do you process in ascending NNTP_ARTICLE_NUMBER order?

If so you should get sequential update mode for tables like the
"data" one.  If not you're probably triggering random access mode
for all tables.
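To illustrate why the ordering matters: the modulo mapping from your
message preserves order within each shard, so ascending article numbers
give every shard an ascending document stream.  (The % scheme is from
your message; the function name here is illustrative, not public-inbox
code.)

```python
SHARD_COUNT = 3  # adjustable, per the original message

def shard_for(article_number):
    # The mapping from the question: $NNTP_ARTICLE_NUMBER % $SHARD_COUNT
    return article_number % SHARD_COUNT

# Processing articles in ascending number order means each shard also
# sees its documents in ascending order - the pattern that lets tables
# stay in sequential update mode.
per_shard = {s: [] for s in range(SHARD_COUNT)}
for n in range(1, 13):
    per_shard[shard_for(n)].append(n)

for s, nums in per_shard.items():
    assert nums == sorted(nums)   # ascending within every shard
    print(f"shard {s}: {nums}")
```

Processing out of order breaks this property per shard, which is what
would push the tables into random access mode.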

Cheers,
    Olly



More information about the Xapian-discuss mailing list