Using a document id as metadata key and merges

Olly Betts olly at survex.com
Fri Dec 13 01:21:29 GMT 2024


On Thu, Dec 12, 2024 at 09:51:44AM +0100, Jean-Francois Dockes wrote:
> Following a discussion a few years ago, Recoll stores the documents' text
> contents in database metadata entries, with keys derived from document ids.
> 
> More recently an index creation method using several temporary indexes
> merged on completion was implemented. This is still a bit experimental. It
> brings a significant speed increase in some cases.
> 
> I just realised that the merge lost many metadata entries because of the
> document id collisions (I was just using add_document() on the temporary
> dbs). It was not immediately obvious because this only affects snippets
> generation. 

I assume you're merging using Xapian::Database::compact() (or the
xapian-compact tool)?

> Would using replace_document() on the temporary dbs, with unique document
> ids (modulo), ensure that the document ids are preserved during the merge so
> that the metadata keys remain valid?

Compaction maps document ids by adding/subtracting a per-source-database
offset.  By default this is calculated to abut the ranges from each
source but you can force the offset to be 0 with
Xapian::DBCOMPACT_NO_RENUMBER (--no-renumber).

However, compaction can't handle overlapping input ranges (after
offsetting), so "modulo" isn't going to help here.  The reason for this
restriction is that compaction mostly copies the encoded postlist data
rather than decoding and re-encoding it, which would be slower.
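
To make the offset and overlap rules concrete, here is a rough model in
plain C++.  It does not call Xapian and is not how compaction is actually
implemented: DocidRange, abutting_offsets() and ranges_disjoint() are
hypothetical names, and the way the default offset is derived here (pack
the used ranges end to end starting from 1) is an assumption based only on
the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive range of document ids used by one source database.
// Hypothetical type for this sketch, not part of the Xapian API.
struct DocidRange {
    std::int64_t first;
    std::int64_t last;
};

// Assumed model of the default behaviour: pick a signed offset per source
// so that the shifted ranges abut, starting from docid 1.
std::vector<std::int64_t> abutting_offsets(const std::vector<DocidRange>& sources) {
    std::vector<std::int64_t> offsets;
    std::int64_t next_free = 1;
    for (const DocidRange& r : sources) {
        offsets.push_back(next_free - r.first);
        next_free += r.last - r.first + 1;
    }
    return offsets;
}

// True if no two source ranges overlap once their offsets are applied.
// Expects one offset per source. All-zero offsets model the no-renumber case.
bool ranges_disjoint(const std::vector<DocidRange>& sources,
                     const std::vector<std::int64_t>& offsets) {
    std::vector<DocidRange> shifted;
    for (std::size_t i = 0; i < sources.size(); ++i)
        shifted.push_back({sources[i].first + offsets[i],
                           sources[i].last + offsets[i]});
    std::sort(shifted.begin(), shifted.end(),
              [](const DocidRange& a, const DocidRange& b) { return a.first < b.first; });
    for (std::size_t i = 1; i < shifted.size(); ++i)
        if (shifted[i].first <= shifted[i - 1].last) return false;
    return true;
}
```

In this model, two temporary DBs that each hold ids 1 to 3 get offsets 0
and 3 by default, so the second DB's ids change.  With zero offsets they
overlap, and so do "modulo" style ranges such as 1..5 and 2..6, since only
the range covered matters and not which ids inside it are used.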

What you'd need to do is index (say) document ids 1 to 10000 to one
DB, then 10001 to 20000 to the next, etc., then merge using
DBCOMPACT_NO_RENUMBER.  (You can indeed use replace_document()
to force a particular document ID to be used, though actually you
only need to set the first one and then you could add_document()
which would number from there, if that was easier to implement for
some reason...)
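
The range bookkeeping for that scheme could look something like the sketch
below.  first_docid() and last_docid() are hypothetical helpers, and the
fixed block of ids per temporary DB is an assumption for illustration; the
values are 32-bit unsigned on the assumption that document ids are too.

```cpp
#include <cstdint>

// Hypothetical helper: first document id reserved for temporary DB number
// db_index (0-based), when each DB is given block_size consecutive ids.
constexpr std::uint32_t first_docid(std::uint32_t db_index, std::uint32_t block_size) {
    return db_index * block_size + 1;
}

// Hypothetical helper: last document id that temporary DB db_index may use
// before it would run into the next DB's range.
constexpr std::uint32_t last_docid(std::uint32_t db_index, std::uint32_t block_size) {
    return (db_index + 1) * block_size;
}
```

Each indexing process would then replace_document() its first document at
first_docid(i, block_size) and stop (or switch to a new DB) before going
past last_docid(i, block_size), since spilling into the next block would
bring back the overlap problem.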

> Or is there another obvious approach which I am missing?

You could use a unique ID for the document to build metadata keys if you
have one (for recoll, that's probably the filename, or a hash of it with
enough bits that collisions aren't really a concern).
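
A minimal sketch of the hashed-key idea, assuming the filename is the
unique ID.  The "doctext:" prefix and the function names are made up for
illustration, and 64-bit FNV-1a is only a small, dependency-free stand-in:
a wider hash may be a better fit if the collision odds at 64 bits are a
worry.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// 64-bit FNV-1a over the bytes of s (standard offset basis and prime).
std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Hypothetical key scheme: a fixed prefix plus the hash of the file path
// as 16 hex digits. The key depends only on the path, so it is unaffected
// by whatever document ids the merge ends up assigning.
std::string metadata_key_for(const std::string& path) {
    char buf[17];
    std::snprintf(buf, sizeof buf, "%016llx",
                  static_cast<unsigned long long>(fnv1a64(path)));
    return "doctext:" + std::string(buf);
}
```

The resulting string would be what gets passed as the key to
set_metadata() at index time and to get_metadata() when generating
snippets.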

Cheers,
    Olly


