[Xapian-discuss] Most efficient update of already existing document?

Olly Betts olly at survex.com
Wed Jun 1 03:13:58 BST 2011


On Tue, May 31, 2011 at 05:26:39PM +0400, Do. wrote:
> So, as I understood - term is most efficient to update, then value and
> then data (which is said to be "expensive operation").

No, that's not what I was trying to say at all.

My main point was that Xapian attempts to update things efficiently, so
you should base your decision on the way the information is used, rather
than trying to guess what might be efficient to update.

The warning about document data being "expensive" is at match time.
This is due to implementation decisions based on intended use.  The
document data is intended to be used to store all the things you need
to display a "result", so it's a single blob which contains all the
fields you might need for that, and it is stored in a separate B-tree
entry for each document, and is compressed with zlib.

So it's not a great idea to ask for the document data for every
document being considered by a MatchDecider or PostingSource or similar
- you want to use a document value there.  Values are stored in a chunked
stream for each slot, so you get the values in the same slot for
neighbouring documents too.  But that also means that using document
values as fields to display a result isn't a great idea - for each
result you display you need to fetch and decode a value chunk for each
field, and you'll load the data for a whole chunk of neighbouring
documents at the same time.
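
As a minimal sketch of the match-time side of this: the check below reads
only a value slot and never touches the document data.  The function name
and the slot layout are hypothetical.  It is written as a template so that
it works with anything offering a Xapian::Document-style get_value(slot);
in real code you would call it from the operator() of a
Xapian::MatchDecider subclass.

```cpp
#include <string>

// Hypothetical helper: accept a document only if the value stored in
// `slot` equals `wanted`.  Doc is any type with a get_value(slot) member
// returning a std::string, as Xapian::Document has.
template <class Doc>
bool value_equals(const Doc& doc, unsigned slot, const std::string& wanted) {
    // Only the value for this slot is fetched; get_data() is never called.
    return doc.get_value(slot) == wanted;
}
```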

In fact *updating* the document data for a single document is pretty
cheap, since it's a single Btree entry update.  Adding a unique id
term requires a couple of Btree entry updates, but if you need to be
able to find the document given the external id, then putting it in
the document data isn't going to achieve that.  Putting it in a value
is probably a couple of updates too, but again doesn't allow you to
efficiently find a document given the external id.

(The above is true for chert - the details are likely to evolve with
time, but guided by how each of terms, values, and data is intended
to be used, so using them as intended is strongly recommended).
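
If you do need to find the document given the external id, a rough sketch
of the unique id term approach looks like this.  The helper name is
hypothetical, and the "Q" prefix is just the usual convention for unique
id terms.  It assumes the database and document types provide
add_boolean_term() and the replace_document() overload which takes a
unique term, as Xapian::WritableDatabase and Xapian::Document do.

```cpp
#include <string>

// Hypothetical helper: index `doc` under a unique term built from the
// external id, replacing any document already indexed by that term.
// Returns whatever replace_document() returns (a docid in Xapian).
template <class Db, class Doc>
auto upsert_by_external_id(Db& db, Doc& doc, const std::string& external_id)
    -> decltype(db.replace_document(std::string(), doc)) {
    const std::string unique_term = "Q" + external_id;
    // The term makes the document findable by its external id later.
    doc.add_boolean_term(unique_term);
    // Replaces the document indexed by unique_term, or adds a new one.
    return db.replace_document(unique_term, doc);
}
```

With real Xapian types the call would be along the lines of
upsert_by_external_id(db, doc, "12345") on a Xapian::WritableDatabase.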

> Or maybe more
> efficient could be to use metadata key-value index functionality.

Well, that's a single Btree entry update too, though I think you'll
want to make sure you apply the changes in ascending key order if
there are a lot of updates, which might add to the costs if you
don't naturally get them in that order.
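
A minimal sketch of applying a batch of metadata updates in ascending key
order follows.  The helper name is hypothetical, and it assumes the
database type has a set_metadata(key, value) method like
Xapian::WritableDatabase.  It also assumes that plain std::string
ordering is the order that matters here.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> MetadataUpdate;

// Hypothetical helper: sort the pending updates by key, then apply them.
// The vector is taken by value so the caller's copy keeps its order.
template <class Db>
void apply_metadata_sorted(Db& db, std::vector<MetadataUpdate> updates) {
    // stable_sort keeps the arrival order of updates to the same key,
    // so the most recent value for a key is the one applied last.
    std::stable_sort(updates.begin(), updates.end(),
                     [](const MetadataUpdate& a, const MetadataUpdate& b) {
                         return a.first < b.first;
                     });
    for (std::vector<MetadataUpdate>::const_iterator it = updates.begin();
         it != updates.end(); ++it) {
        db.set_metadata(it->first, it->second);
    }
}
```

The sort is the extra cost mentioned above; if the updates already arrive
in key order it can simply be skipped.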

> Search will give me docids which I will need to resolve to IDs.

If you just want to be able to get the external ID for each match, then
putting it in the document data sounds most sensible, unless you're
storing a lot of other data there (since updating it will require
the whole thing to be rewritten, which is painful if it's megabytes
of data).
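
For resolving each match to its external ID, here is a rough sketch of
pulling a field back out of the document data.  It assumes the data is
laid out as one "name=value" field per line, which is only one possible
convention (the data is an opaque blob as far as Xapian is concerned),
and the function name is hypothetical.

```cpp
#include <string>

// Hypothetical helper: find the field `name` in a document data blob made
// of newline-separated "name=value" lines.  If the field occurs more than
// once, the last occurrence wins.  Returns false if it is not present.
bool get_data_field(const std::string& data, const std::string& name,
                    std::string& value_out) {
    const std::string prefix = name + "=";
    bool found = false;
    std::string::size_type pos = 0;
    while (pos <= data.size()) {
        // Find the end of the current line (or the end of the blob).
        std::string::size_type eol = data.find('\n', pos);
        if (eol == std::string::npos) eol = data.size();
        // Check whether this line starts with "name=".
        if (data.compare(pos, prefix.size(), prefix) == 0) {
            value_out = data.substr(pos + prefix.size(),
                                    eol - pos - prefix.size());
            found = true;
        }
        pos = eol + 1;
    }
    return found;
}
```

For each match you would pass in the string returned by the document's
get_data() and ask for the "id" field.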

> I feared that updating document with 1 term changed actually deletes
> and inserts document, with rebuilding postings, etc, which sounds
> quite expensive operation.

Unless you're using a rather old version, this isn't the case.  In
particular, this sequence will just update the document data and won't
even fetch the terms or values:

    Xapian::Document doc = db.get_document(did);
    doc.set_data(doc.get_data() + "\nid=" + id);
    db.replace_document(did, doc);

Cheers,
    Olly


