Storing the document's text: data record or value?

Olly Betts olly at survex.com
Thu Jan 4 05:42:07 GMT 2018


On Wed, Jan 03, 2018 at 04:18:18PM +0100, Jean-Francois Dockes wrote:
> Seen from the outside, it would appear to make sense to use values, so that
> code which needs to access the data record but not the full document text
> does not pay a performance penalty.
> 
> I am wondering if there are other arguments for using either method ?

I wouldn't recommend using a value to store large data - fundamentally
it's not what they're intended for, and that's likely to end up biting
you because design decisions get made based on their intended uses.

A minor current example is that the backend tracks upper and lower
bounds on all the values in a given slot, so you get a pointless (for
you) extra copy of the text of two documents, plus a lot of pointless
comparing of document texts to keep track of which is the largest and
smallest.  We've discussed tracking a binned distribution for each slot,
which would allow optimisations when sorting or doing value ranges, but
would mean more pointless overhead for your case.
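
To illustrate the cost described above, here is a toy model in Python of
per-slot bounds tracking. It is only a sketch of the idea, not the
backend's actual code; the class name SlotBounds and its attributes are
made up for this example.

```python
class SlotBounds:
    """Toy model of tracking lower and upper bounds for one value slot."""

    def __init__(self):
        self.lower = None      # smallest value seen so far (a full copy)
        self.upper = None      # largest value seen so far (a full copy)
        self.comparisons = 0   # how many value comparisons were needed

    def add(self, value: bytes) -> None:
        if self.lower is None:
            # First value in the slot: it is both bounds.
            self.lower = self.upper = value
            return
        # Every further value is compared against both bounds.
        self.comparisons += 2
        if value < self.lower:
            self.lower = value
        if value > self.upper:
            self.upper = value


bounds = SlotBounds()
for text in (b"mmm some document text", b"aaa another text", b"zzz a third"):
    bounds.add(text)
```

With short sortable values the bounds are small and useful; with whole
document texts, each bound is a full copy of some document's text and
every add compares long strings for no benefit.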

If you want to store the document text separately, I'd put it in the
user metadata.  Build a key from the docid, ideally one which sorts in
the same order as the integer docids do, so that appending works very
efficiently; you could copy Xapian's pack_uint_preserving_sort() for
that.
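
A minimal sketch of such a key, in Python for brevity. This is not a
copy of pack_uint_preserving_sort(), just the same general idea: a
length byte followed by the big-endian bytes of the docid, so that
bytewise key order matches numeric docid order. The "text:" prefix and
the function names are assumptions made for this example.

```python
def docid_key(docid: int, prefix: bytes = b"text:") -> bytes:
    """Build a key whose bytewise order matches numeric docid order."""
    if docid <= 0:
        raise ValueError("docids are positive integers")
    # Big-endian bytes with no leading zero bytes.
    body = docid.to_bytes((docid.bit_length() + 7) // 8, "big")
    # The leading length byte makes shorter (smaller) numbers sort first.
    return prefix + bytes([len(body)]) + body


def key_docid(key: bytes, prefix: bytes = b"text:") -> int:
    """Recover the docid from a key built by docid_key()."""
    if not key.startswith(prefix):
        raise ValueError("unexpected key prefix")
    length = key[len(prefix)]
    body = key[len(prefix) + 1:]
    if len(body) != length:
        raise ValueError("malformed key")
    return int.from_bytes(body, "big")
```

The resulting key can then be used with set_metadata() and
get_metadata(). Because new docids are normally larger than existing
ones, new keys land at the end of the key range, which is the append
case.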

You'll want to compress the document text yourself, at least currently.
User metadata entries mostly aren't compressed because they're stored
in the postlist table, which has transparent compression turned off
because it's unhelpful for updating postlist chunks.  Transparent
compression is currently either on or off per table, but doing it based
on the type of entry wouldn't be hard, so I wonder if we should support
it for user metadata entries.
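
Compressing the text yourself could look roughly like this, assuming
zlib and UTF-8 text. The helper names pack_text and unpack_text are
invented for this sketch, and any other compressor would do as well.

```python
import zlib


def pack_text(text: str, level: int = 6) -> bytes:
    """Compress document text before storing it as a metadata value."""
    return zlib.compress(text.encode("utf-8"), level)


def unpack_text(blob: bytes) -> str:
    """Reverse pack_text() on a value read back from the metadata."""
    return zlib.decompress(blob).decode("utf-8")


sample = "the quick brown fox jumps over the lazy dog. " * 200
blob = pack_text(sample)
```

The compressed bytes would be what gets passed to set_metadata(), and
unpack_text() would be applied to whatever get_metadata() returns.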

We could also add a way to read document data in chunks rather than
all at once, and then if you put the document text last in the document
data you should be able to read the other fields without much penalty.
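
As a sketch of what "document text last" might mean for the layout of
the document data: a small length-prefixed header with the other fields,
followed by the text. The format and the function names here are
assumptions for illustration only, and reading the data in chunks is
not something the API offers today.

```python
import struct


def pack_data(fields: dict, text: str) -> bytes:
    """Pack small fields first, then the (large) document text last."""
    # One "name=value" per line; values must not contain newlines.
    header = "\n".join(f"{k}={v}" for k, v in fields.items()).encode("utf-8")
    # 4-byte big-endian header length, then header, then the text.
    return struct.pack(">I", len(header)) + header + text.encode("utf-8")


def read_fields(prefix: bytes) -> dict:
    """Parse the fields from just the first chunk of the packed data."""
    (header_len,) = struct.unpack_from(">I", prefix)
    if len(prefix) < 4 + header_len:
        raise ValueError("need a longer prefix to read all the fields")
    header = prefix[4:4 + header_len].decode("utf-8")
    return dict(line.split("=", 1) for line in header.split("\n") if line)


data = pack_data({"url": "http://example.org/a", "title": "A"}, "body " * 5000)
fields = read_fields(data[:64])  # only the first 64 bytes are needed
```

With a layout like this, code that only wants the fields never has to
touch the text, provided the first chunk read covers the header.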

Cheers,
    Olly


