Can't handle insanely large tags

Jean-Francois Dockes jf at dockes.org
Fri Mar 14 08:29:14 GMT 2025


Olly Betts writes:
 > On Wed, Mar 12, 2025 at 10:01:50PM +0100, Jean-Francois Dockes wrote:
 > > 
 > > Thanks for the fast answer ! I've certainly no plan to store such big objects in
 > > Xapian. It just means that there is a missing sanity check somewhere.
 > > 
 > > The user succeeded in pinpointing the problem to a 900 MBytes mbox file.
 > > 
 > > A possible reason would be that a really bad mbox would be misparsed, producing
 > > e.g. an enormous Subject: or From: field which would get as an attribute into the data
 > > record. I see that I have no size checks on this at the moment. I'll investigate in this
 > > direction.
 > > 
 > > Can this come from anything other than the data record ?
 > 
 > Probably - the document data is the simplest to reason about (because
 > it gets compressed with zlib and we have a reasonable idea how well
 > zlib will compress typical data).
 > 
 > Postlists are chunked at a higher level to support efficient
 > skipping forwards so postlist table entries shouldn't be more than about
 > 2000 bytes, but I'd think it's probably possible for at least some other
 > tables.
 > 
 > Some other tables might be possible - for example, if you indexed a
 > document by enough distinct terms you'd probably end up with a termlist
 > entry that's too big to store, but the encoding used tends to become
 > more compact the more terms there are so it's hard to say at what point
 > this would happen without testing.

Thanks, this is very helpful. I had already eliminated the two obvious
candidates, the data record and the stored document text (as metadata), so
I was wondering whether something more mysterious was going on.

The file was some kind of mail archive not in mbox format. It was detected
as a single message (like a Maildir file), which resulted in a few headers
and a 900 MB body. I had a safeguard test on mbox member size, not on email
body...
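
For illustration, the missing safeguard could look roughly like the sketch
below. It is in Python for brevity and is not the indexer's actual code: the
function name, the parameter names and both limits are hypothetical, chosen
only to show a size check on header fields and on the body before anything
is handed to Xapian.

```python
from typing import Dict, List

# Hypothetical limits, for illustration only.
MAX_HEADER_BYTES = 8 * 1024
MAX_BODY_BYTES = 50 * 1024 * 1024


def oversized_parts(headers: Dict[str, str], body: bytes,
                    max_header: int = MAX_HEADER_BYTES,
                    max_body: int = MAX_BODY_BYTES) -> List[str]:
    """Return the names of the message parts that exceed the size limits."""
    problems = []
    for name, value in headers.items():
        # Measure the encoded size, since that is what would be stored.
        if len(value.encode("utf-8", "replace")) > max_header:
            problems.append(name)
    if len(body) > max_body:
        problems.append("body")
    return problems
```

A caller would skip or truncate whatever this returns instead of storing it
as a document attribute or indexing it as a single text.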

I could not reproduce the exact error on a similar document with a body
made of concatenated bibles: this caused an error on set_metadata() for the
text, instead of replace_document(). I guess that the vocabulary was too
small.
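
The vocabulary guess can be illustrated with a small sketch, again in Python.
The tokenizer here is a naive regular expression, not Xapian's TermGenerator,
and the sample sentence is only a stand-in for the real test document: the
point is that repeating the same text makes the body larger without adding
any distinct terms.

```python
import re


def distinct_terms(text: str) -> int:
    """Count distinct lowercase word tokens (naive tokenizer, an assumption)."""
    return len(set(re.findall(r"\w+", text.lower())))


sample = "In the beginning God created the heaven and the earth. "

once = distinct_terms(sample)
many = distinct_terms(sample * 1000)

# The text is 1000 times longer, but the number of distinct terms is unchanged.
print(len(sample), once)
print(len(sample * 1000), many)
```

So a body made of a few concatenated books has a bounded number of distinct
terms however large it gets, which would be consistent with the failure
showing up on the stored text rather than on the term data.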

Nicely enough, nothing crashed, and I now know that Xapian is somewhat
limited in its ability to index single gigabyte-sized documents :) Which I'll
try to avoid in the future as it is not that useful...

Regards,

jf






More information about the Xapian-discuss mailing list