Can't handle insanely large tags

Olly Betts olly at survex.com
Wed Mar 12 19:39:32 GMT 2025


On Wed, Mar 12, 2025 at 11:47:29AM +0100, Jean-Francois Dockes wrote:
> I am getting a "Can't handle insanely large tags" exception from a
> replace_document() call (for a new document).
> 
> This happens on a user's very big file system, it's remote and not
> very easy to test.
> 
> This is quite probably a Recoll bug, but, to help with my
> investigation, would someone have any idea of the potential causes ?

In the glass backend, at the B-tree level each table can be thought of
as a key->value store.  Internally in the code, this "value" is called
"tag" (for historical reasons really, but it helps to avoid confusion
with document value slots so the terminology has been kept).

Each entry in the table is limited in size - approximately:

  size(key)+size(tag)+per_entry_overhead <= (block_size-per_block_overhead)/4

That works out at a maximum tag size of a bit under 2K for the default
8K block size - longer tags are supported but get split over multiple
entries.  There's a counter for these which is 2 bytes, so that limits
the total tag size to very roughly 2K * 65536 which is 128MB.  That's
an overestimate as it ignores the overheads and the key size - if the
key is long this limit will be a bit lower (from a quick rough
calculation you should be able to store 109MB with 8K blocks).
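
The arithmetic above can be sketched as follows. This is only an illustration: the function names and the overhead parameters are hypothetical, and the overheads default to zero because their real values are not given here. With zero overheads and an empty key it reproduces the "very rough" 128MB upper bound for 8K blocks, and any real overhead or key length pushes the result lower.

```python
MIB = 1024 * 1024


def max_entry_payload(block_size, per_block_overhead=0,
                      per_entry_overhead=0, key_size=0):
    """Largest tag chunk that fits in one entry, from the inequality
    size(key)+size(tag)+per_entry_overhead <= (block_size-per_block_overhead)/4.

    The overhead values are placeholders supplied by the caller; they are
    not Xapian's actual constants.
    """
    return ((block_size - per_block_overhead) // 4
            - per_entry_overhead - key_size)


def max_total_tag_size(block_size, counter_bytes=2, **overheads):
    """Upper bound on a tag split over multiple entries, given that the
    number of entries is limited by a counter of `counter_bytes` bytes."""
    max_entries = 2 ** (8 * counter_bytes)  # 65536 for a 2 byte counter
    return max_entry_payload(block_size, **overheads) * max_entries


# Zero overheads, empty key: the rough upper bound for 8K blocks.
print(max_total_tag_size(8192) // MIB)  # 128
```

Passing a larger `block_size` shows why increasing the block size raises the limit, since the bound scales linearly with it.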

The entries in some tables are deflate-compressed - for those tables
these limits are on the compressed data size.

It seems most likely this is triggered by storing very large document
data, but it would need to be over 109MB after compression.  It's
probably theoretically possible to hit for other tables but I'd be much
more surprised.  It's really a Xapian size limit, but if it is the
document data and you aren't intending to store something that large it
could be a Recoll bug too.
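
One way to investigate on the application side is to estimate the deflate-compressed size of the document data before storing it. The sketch below uses Python's zlib for that. It is an approximation only: the helper names are hypothetical, the compression settings Xapian uses may differ from zlib's defaults, and the 109MB threshold (taken here as 109 * 1024 * 1024 bytes, which is an assumption) is the rough figure for 8K blocks quoted above.

```python
import zlib

# Rough figure from the discussion above for 8K blocks; treating "MB" as
# 1024 * 1024 bytes is an assumption made for this sketch.
ROUGH_LIMIT_BYTES = 109 * 1024 * 1024


def deflated_size(chunks):
    """Return the zlib-compressed size of the data given as an iterable of
    bytes chunks, without keeping the compressed output in memory."""
    compressor = zlib.compressobj()
    total = 0
    for chunk in chunks:
        total += len(compressor.compress(chunk))
    total += len(compressor.flush())
    return total


def might_exceed_limit(chunks, limit=ROUGH_LIMIT_BYTES):
    """True if the estimated compressed size is above `limit` bytes."""
    return deflated_size(chunks) > limit
```

Logging this estimate for each document just before the replace_document() call would show whether any document data is anywhere near the limit.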

There is a simple workaround which is to increase the block size.  That
needs to be done when you create the database, or you can convert an
existing database to a different block size with xapian-compact (also
available via the API).
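
As a sketch of the conversion route, the helper below builds a xapian-compact command line using its --blocksize option, with the size given in bytes. The function name and the database paths are hypothetical, and the check that the block size is a power of two between 2K and 64K reflects my understanding of the limits for glass rather than anything stated above.

```python
def compact_command(source, destination, block_size=65536):
    """Build the argument list for converting the database at `source` to
    a new database at `destination` with the given block size in bytes."""
    is_power_of_two = block_size > 0 and block_size & (block_size - 1) == 0
    if not is_power_of_two or not 2048 <= block_size <= 65536:
        raise ValueError("block size must be a power of 2 from 2048 to 65536")
    return ["xapian-compact", "--blocksize=%d" % block_size,
            source, destination]


# Hypothetical paths; run the result with subprocess.run(cmd, check=True).
cmd = compact_command("/path/to/xapiandb", "/path/to/xapiandb.64k")
print(" ".join(cmd))
```

The output database is written to a new location, so the original stays untouched until the converted one has been checked and swapped in.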

Honey isn't block based and won't need to split up entries like this.
It doesn't yet support updates, but once it's actually finished
it won't have this problem.

I'll improve the exception message to (a) report the tag size
encountered and (b) suggest using a larger block size.

Cheers,
    Olly


