Can't handle insanely large tags
Olly Betts
olly at survex.com
Fri Mar 14 21:58:53 GMT 2025
On Fri, Mar 14, 2025 at 09:29:14AM +0100, Jean-Francois Dockes wrote:
> I could not reproduce the exact error on a similar document with a body
> made of concatenated bibles: this caused an error on set_metadata() for the
> text, instead of replace_document(). I guess that the vocabulary was too
> small.
Concatenating multiple copies of the same text makes little
difference to the encoded termlist size, because the set of distinct
terms stays the same. The wdf (within-document frequency) values get
scaled up, and some will take more space to store as a result, but
it's not equivalent to indexing a genuinely larger text.
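Something like this sketch (using the Python bindings; the sample
text and copy counts are made up) shows the effect - the distinct
term count stays flat however many copies you concatenate:

    import xapian

    text = "the quick brown fox jumps over the lazy dog"

    for copies in (1, 10, 100):
        doc = xapian.Document()
        tg = xapian.TermGenerator()
        tg.set_document(doc)
        tg.index_text(" ".join([text] * copies))
        # The number of distinct terms is unchanged; only each
        # term's wdf grows with the number of copies.
        print(copies, doc.termlist_count())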
If you add each copy with the alphabet permuted differently, that's
probably more likely to reproduce it, since each copy then
contributes fresh terms. I'm not sure it's worth the effort though,
as we already know which limit is being hit, and having a reproducer
doesn't really help.
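If someone does want to try, one way to build such input (a
hypothetical helper in plain Python, no Xapian involved) would be:

    import random
    import string

    def permuted_copy(text, seed):
        # Map a-z through a random permutation chosen by `seed`, so
        # each copy yields a mostly disjoint set of nonsense terms.
        rng = random.Random(seed)
        shuffled = list(string.ascii_lowercase)
        rng.shuffle(shuffled)
        table = str.maketrans(string.ascii_lowercase,
                              "".join(shuffled))
        return text.lower().translate(table)

    base_text = "in the beginning was the word"
    # 1000 differently-permuted copies give up to 1000x as many
    # distinct terms as the base text alone.
    big_text = " ".join(permuted_copy(base_text, i)
                        for i in range(1000))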
Honey shouldn't have this limitation, but we wouldn't want a
regression testcase to verify that if it requires indexing gigabytes
of data.
Cheers,
Olly