Storing the document's text: data record or value?

Thu Jan 4 19:02:46 GMT 2018

Olly Betts writes:
 > On Wed, Jan 03, 2018 at 04:18:18PM +0100, Jean-Francois Dockes wrote:
 > > Seen from the outside, it would appear to make sense to use values, so that
 > > code which needs to access the data record but not the full document text
 > > does not pay a performance penalty.
 > > 
 > > I am wondering if there are other arguments for using either method?
 > 
 > I wouldn't recommend using a value to store large data - fundamentally
 > it's not what they're intended for, and that's likely to end up biting
 > you because design decisions get made based on their intended uses.
 > 
 > A minor current example is that the backend tracks upper and lower
 > bounds on all the values in a given slot, so you get a pointless (for
 > you) extra copy of the text of two documents, plus a lot of pointless
 > comparing of document texts to keep track of which is the largest and
 > smallest.  We've discussed tracking a binned distribution for each slot,
 > which would allow optimisations when sorting or doing value ranges, but
 > would mean more pointless overhead for your case.

Ok, no values then...

 > If you want to store the document text separately, I'd put it in the
 > user metadata (build a key from the docid, ideally one which sorts in
 > the same order as the integer docids do so that append works very
 > efficiently - you could copy Xapian's pack_uint_preserving_sort() for
 > that).
 > 
 > You'll want to compress the document text yourself (currently at least,
 > though I wonder if we should support transparent compression of user
 > metadata entries - mostly they aren't compressed because they're stored
 > in the postlist table which doesn't have transparent compression on
 > because it's unhelpful for updating postlist chunks, and currently
 > transparent compression is either on or off per table, but doing it
 > based on the type of entry wouldn't be hard).
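
For illustration, here is a minimal sketch of a docid-derived metadata key
that sorts in the same order as the integer docids. It is not a copy of
Xapian's pack_uint_preserving_sort(); it only assumes a similar idea (a
length byte followed by the big-endian bytes of the number). The function
name docid_key and the "doctext:" prefix are hypothetical.

```python
def docid_key(docid: int, prefix: bytes = b"doctext:") -> bytes:
    """Build a key whose byte-wise order matches numeric docid order."""
    if docid <= 0:
        raise ValueError("Xapian docids start at 1")
    # Big-endian bytes of the docid, without leading zero bytes.
    body = docid.to_bytes((docid.bit_length() + 7) // 8, "big")
    # The leading length byte makes shorter (hence smaller) numbers sort
    # first; a plain decimal string would put "10" before "9".
    return prefix + bytes([len(body)]) + body


docids = [1, 2, 255, 256, 65535, 65536, 2**32 - 1]
keys = [docid_key(d) for d in docids]
```

Because the keys compare byte-wise in docid order, entries for newly added
documents always land at the end of the key space, which is the append case
described above.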

The compression is not the problem (already doing it when storing in values).
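
As a rough sketch of what compressing the text yourself could look like,
assuming zlib and UTF-8 text (the thread does not say which compressor is
actually used, and pack_text/unpack_text are hypothetical names):

```python
import zlib


def pack_text(text: str) -> bytes:
    """Compress a document text before storing it."""
    return zlib.compress(text.encode("utf-8"), 6)


def unpack_text(blob: bytes) -> str:
    """Reverse pack_text() on a stored blob."""
    return zlib.decompress(blob).decode("utf-8")


sample = "user metadata " * 200
blob = pack_text(sample)
```

The resulting bytes would be what gets stored as the user metadata entry,
via WritableDatabase::set_metadata(), and read back with
Database::get_metadata().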

What makes user metadata records less convenient is that they are not
linked to a Xapian document by Xapian itself. This makes several things
slightly more complicated.

 > We could also add a way to read document data in chunks rather than
 > all at once, and then if you put the document text last in the document
 > data you should be able to read the other fields without much penalty.

Thanks. For now, I'll reluctantly take a closer look at using user metadata
records.

jf