[Xapian-discuss] Multiple boolean tags per document

Olly Betts olly at survex.com
Fri Nov 11 05:33:59 GMT 2011


On Mon, Nov 07, 2011 at 05:43:33PM +0000, Richard Boulton wrote:
> On 7 November 2011 17:35, Justin Finkelstein <justin at redwiredesign.com> wrote:
> > I'm about to write something for my search service which updates a
> > number of documents by adding a boolean tag to a series of documents.
> > Is there an efficient way where I can do "only add this tag if it isn't
> > set already"?
> 
> I think the following will be pretty efficient:
> 
>  - fetch each document
>  - call Document.add_boolean_term(tag_term)
>  - use Database.replace_document() to put the modified document back
> in the database.

If the term is likely to often be present already, then you can collect
and sort the document ids of all the documents you wish to add the term
to, then iterate the posting list for the term, calling skip_to() for
each document id in turn and checking whether it is already present.
Only the documents which lack the term then need to be fetched and
replaced.  I don't know what counts as "often", but I suspect this
becomes worthwhile quite quickly, as it's not a lot of extra data to
read, unpack and manipulate.
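
To make the skip_to() approach concrete, here is a minimal sketch in plain
Python.  It does not use Xapian at all: PostingCursor is a toy stand-in for
a posting list iterator, and docs_missing_term() and the sample docids are
made up for illustration.  With a real database you would walk the posting
list for the tag term instead of a Python list.

```python
from bisect import bisect_left


class PostingCursor:
    """Toy stand-in for a posting list iterator (NOT a Xapian class)."""

    def __init__(self, docids):
        self._docids = sorted(docids)
        self._pos = 0

    def skip_to(self, docid):
        """Advance to the first entry >= docid; return it, or None at the end."""
        # Searching from the current position means the cursor never moves back.
        self._pos = bisect_left(self._docids, docid, self._pos)
        if self._pos == len(self._docids):
            return None
        return self._docids[self._pos]


def docs_missing_term(wanted_docids, cursor):
    """Return the wanted docids which are absent from the posting list."""
    missing = []
    # The docids must be visited in ascending order for skip_to() to work.
    for docid in sorted(set(wanted_docids)):
        if cursor.skip_to(docid) != docid:
            missing.append(docid)
    return missing


# Hypothetical data: documents already carrying the tag, and the documents
# we would like to carry it.
posting_list = [2, 3, 7, 10, 15]
wanted = [10, 1, 3, 8, 20]

to_update = docs_missing_term(wanted, PostingCursor(posting_list))
print(to_update)  # [1, 8, 20]
```

Only the documents in to_update would then need the fetch, add term and
replace treatment; the others already have the tag.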

> The current disk backends (ie, chert, and for that matter, also flint)
> load the Document contents lazily, and do minimal updating work when
> writing modified documents back.  So, the above won't access the
> document data and value tables at all, and will only modify the
> posting list table for the tag_term (not for all the other terms in
> the document).
> 
> I think it will currently read the termlist for the document; that
> could theoretically be avoided in this case, I think, but isn't
> currently.

You need to rewrite the termlist to actually add the term to a document,
so it could only possibly be avoided in the case where the tag is
already present.

The only way I can see to avoid reading the termlist here is if the
Document object tracked the changes being made to a document rather
than reading the existing termlist into a map the first time a term
change is made (which is a change I've suggested before).  Then the
backend could special case adding/removing the same single term (or
perhaps terms from a small set) to multiple documents.

Which seems quite a lot of work and special-case code, though if it is a
significant speed-up for a case which matters to some people then it
could be worth it.
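
As a rough illustration of the change-tracking idea, here is a toy sketch
in plain Python.  Nothing in it is Xapian code or a planned API: TermDelta,
batch_single_term_adds() and the XTAG term names are all hypothetical.  It
only shows how recording the changes, rather than the full termlist, would
let a backend spot the "same single term added to many documents" case.

```python
from collections import defaultdict


class TermDelta:
    """Hypothetical record of term changes for one document (NOT Xapian)."""

    def __init__(self):
        self.added = set()
        self.removed = set()

    def add_term(self, term):
        # A later add cancels an earlier removal of the same term.
        self.removed.discard(term)
        self.added.add(term)

    def remove_term(self, term):
        # A later removal cancels an earlier add of the same term.
        self.added.discard(term)
        self.removed.add(term)


def batch_single_term_adds(deltas):
    """Split docids into per-term batches and a general-case list.

    A document goes into a batch if its only change is adding one term;
    anything else falls back to the general update path.
    """
    batches = defaultdict(list)
    general = []
    for docid in sorted(deltas):
        delta = deltas[docid]
        if len(delta.added) == 1 and not delta.removed:
            (term,) = delta.added
            batches[term].append(docid)
        else:
            general.append(docid)
    return dict(batches), general


# Hypothetical pending changes for three documents.
deltas = {docid: TermDelta() for docid in (9, 4, 5)}
deltas[9].add_term("XTAGnew")
deltas[4].add_term("XTAGnew")
deltas[5].add_term("XTAGnew")
deltas[5].remove_term("XTAGold")

batches, general = batch_single_term_adds(deltas)
print(batches, general)  # {'XTAGnew': [4, 9]} [5]
```

Documents 4 and 9 could be handled together as one batch for the term,
while document 5 would still go through the ordinary path.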

Cheers,
    Olly


