compact checkpoints while doing days-long indexing jobs?
Olly Betts
olly at survex.com
Wed Aug 27 04:41:21 BST 2025
On Tue, Aug 26, 2025 at 02:31:15AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > Do you process in ascending NNTP_ARTICLE_NUMBER order?
> >
> > If so you should get sequential update mode for tables like the
> > "data" one. If not you're probably triggering random access mode
> > for all tables.
>
> Yes, it's ascending, but I use ->replace_document to keep docid
> matching NNTP_ARTICLE_NUMBER so each shard sees docids
> incrementing by SHARD_COUNT instead of 1. Would doing
> replace_document with ascending docid gaps allow glass to work
> in sequential mode?

Yes - what matters is that each item added to the table goes immediately
after the previously added one, which is still true if there are unused
docids between them.

Are you saying the docids in the shards match NNTP_ARTICLE_NUMBER
and so one has 1, 4, 7, ...; another 2, 5, 8, ...; the third 3, 6, 9,
...?

I'd have gone for making the docids in the combined database match
NNTP_ARTICLE_NUMBER, which would mean they're sequential in each shard
(except if there are ever gaps in NNTP_ARTICLE_NUMBER) and the smaller
docid values and smaller gaps between them will encode a little more
efficiently.

(Also means you could grow SHARD_COUNT times larger before you run out of
docid space, though with 3 shards and your numbering you can still have
~1.4 billion documents so it sounds like you're nowhere near that being
an issue.)
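
To illustrate the docid arithmetic above, here is a rough sketch in plain
Perl. It does not call Xapian at all: shard_for_docid is a made-up helper
name, the mapping is the round-robin one a multi-shard database uses
(combined docid d lives in shard (d - 1) % N as docid (d - 1) / N + 1), and
the headroom figure assumes the default 32-bit docid type.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Xapian API: given a docid in the
# combined database and the number of shards, return the 0-based shard
# index and the docid within that shard.
sub shard_for_docid {
    my ($docid, $shards) = @_;
    my $shard       = ($docid - 1) % $shards;
    my $shard_docid = int(($docid - 1) / $shards) + 1;
    return ($shard, $shard_docid);
}

# With combined docid == NNTP_ARTICLE_NUMBER and 3 shards, each shard
# sees consecutive docids.
for my $article (1 .. 9) {
    my ($shard, $shard_docid) = shard_for_docid($article, 3);
    print "article $article -> shard $shard, shard docid $shard_docid\n";
}

# Assumption: 32-bit docids. If instead each shard's docid equals the
# article number, the combined docid is roughly 3 times larger, so the
# article number can only reach about a third of the docid space.
my $max_docid    = 2**32 - 1;
my $max_articles = int($max_docid / 3);
print "max article number with per-shard numbering: $max_articles\n";
```

Articles 1, 4 and 7 land in shard 0 as docids 1, 2 and 3, and likewise for
the other two shards, so every shard is appended to without gaps.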

You could also then use a sharded WritableDatabase object and let Xapian
do the sharding for you:

    $db->replace_document($nntp_article_number, $doc);

The docids in the sharded database would also then match NNTP article
numbers at search time.

Cheers,
    Olly