Using a document id as metadata key and merges
Jean-Francois Dockes
jf at dockes.org
Fri Dec 13 08:56:04 GMT 2024
Olly Betts writes:
> On Thu, Dec 12, 2024 at 09:51:44AM +0100, Jean-Francois Dockes wrote:
> > Following a discussion a few years ago, Recoll stores the documents text
> > contents in database metadata entries, with keys derived from document ids.
> >
> > More recently an index creation method using several temporary indexes
> > merged on completion was implemented. This is still a bit experimental. It
> > brings a significant speed increase in some cases.
> >
> > I just realised that the merge lost many metadata entries because of the
> > document id collisions (I was just using add_document() on the temporary
> > dbs). It was not immediately obvious because this only affects snippets
> > generation.
>
> I assume you're merging using Xapian::Database::compact() (or the
> xapian-compact tool)?
Xapian::Database::compact()
> > Would using replace_document() on the temporary dbs, with unique document
> > ids (modulo) ensure that the document ids are preserved during the merge so
> > that the metadata keys remain valid ?
>
> Compaction maps document ids by adding/subtracting a per-source-database
> offset. By default this is calculated to abut the ranges from each
> source but you can force the offset to be 0 with
> Xapian::DBCOMPACT_NO_RENUMBER (--no-renumber).
>
> However compaction can't handle input ranges overlapping (after
> offsetting) so "modulo" isn't going to help here. The reason for this
> restriction is that compaction mostly copies the encoded postlist data
> rather than decoding and reencoding it which would be slower.
>
> What you'd need to do is index (say) document ids 1 to 10000 to one
> DB, then 10001 to 20000 to the next, etc then merge using
> DBCOMPACT_NO_RENUMBER. (You can indeed use replace_document()
> to force a particular document ID to be used, though actually you
> only need to set the first one and then you could add_document()
> which would number from there, if that was easier to implement for
> some reason...)
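For reference, the renumbering Olly describes can be sketched like this. This is a plain-Python model of the behaviour, not actual Xapian code; the function names are mine:

```python
# Model of how compaction maps document ids (an assumption based on the
# description above, not the real Xapian implementation).

def default_offsets(last_docids):
    """By default, compact() abuts the id ranges of the source databases:
    each source gets an offset equal to the sum of the highest docids of
    the sources before it."""
    offsets = []
    base = 0
    for last in last_docids:
        offsets.append(base)
        base += last
    return offsets

def merged_id(docid, offset):
    """Id of a source document in the compacted database."""
    return docid + offset

# Three temporary DBs whose highest docids are 10000, 8000 and 12000:
offsets = default_offsets([10000, 8000, 12000])
# offsets == [0, 10000, 18000], so document 1 of the second DB becomes
# 10001 in the merged DB, and a metadata key derived from docid 1 no
# longer matches it. DBCOMPACT_NO_RENUMBER forces all offsets to 0,
# which is why the source ranges must then be disjoint.
```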
The problem is that I have no way to know the number of documents before
indexing (for the anecdote, an actual case: a single zip archive with
around 400 000 epubs resulting in 1.3 million chapters/documents).
Archives and other compound files (e.g. mbox, tar: no content index) make
it very slow and expensive to pre-count documents. I guess that I could
use 10E8 offsets and hope for the best...
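The "hope for the best" scheme could be sketched as a fixed docid stride per temporary index. Reading 10E8 as a stride of 10**8 is my assumption here, and the exact value only matters if a single input ever yields more documents than the stride:

```python
# Hypothetical sketch: give each temporary database a fixed docid base,
# so that compacting with DBCOMPACT_NO_RENUMBER cannot hit overlapping
# ranges unless one input exceeds the stride.

STRIDE = 10 ** 8  # assumed value; the 400 000-epub zip above produced
                  # 1.3 million documents, well under this

def docid_base(db_index):
    """First docid to use in temporary database number db_index (0-based).

    Per Olly's note, only the first document needs replace_document()
    with a forced id; subsequent add_document() calls number upwards
    from there."""
    return db_index * STRIDE + 1
```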
> > Or is there another obvious approach which I am missing ?
>
> You could use a unique ID for the document to build metadata keys if you
> have one (for recoll, that's probably the filename, or a hash of it with
> enough bits that collisions aren't really a concern).
Yes, I've thought of this, but it implies an incompatible index format
change. I could handle that by trying the new method and falling back on
the docid, as it is easy to ensure that the two key forms cannot collide.
I need to check whether the double lookup could seriously affect
performance or not. The problem will disappear over time with reindexes,
and this looks like the more reasonable approach.
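The fallback could look something like this hedged sketch. The key prefix, the choice of SHA-1, and the helper names are all my assumptions, not Recoll's actual scheme; the metadata getter is passed in as a callable to stand in for Xapian's get_metadata():

```python
import hashlib

def text_key_from_udi(udi):
    """Docid-independent metadata key, built from a unique document
    identifier such as the file path (hypothetical key format).
    160 hash bits make collisions a non-concern."""
    return "rcltext:" + hashlib.sha1(udi.encode("utf-8")).hexdigest()

def get_doc_text(db_get_metadata, udi, docid):
    """Try the new hash-based key first, then fall back to the legacy
    docid-based key, which disappears as documents get reindexed."""
    text = db_get_metadata(text_key_from_udi(udi))
    if not text:
        text = db_get_metadata("rcltext:%d" % docid)
    return text
```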
Thanks,
jf