[Xapian-discuss] Replace a term in a document

Olly Betts olly at survex.com
Fri Apr 20 20:34:39 BST 2007


On Thu, Apr 19, 2007 at 03:10:42PM +0100, Mark Clarkson wrote:
> On Thu, 2007-04-19 at 12:54 +0100, Olly Betts wrote:
> > it's not been a case anyone has been concerned about the
> > performance of before, as far as I can recall.
> 
> I am a bit surprised about that. I've found that I can speed up my
> database queries (by incredible amounts) by simply using Xapian as a
> backend to sql queries. This is especially true when a database query
> returns many results and gives up on its index.

Interesting.  I think this is an unusual way to use Xapian at present,
but if it generally works well, it would be great to be able to
support it better.

> > I suggest filing a wishlist bug about this (unless you feel up to
> > implementing it yourself, in which case I can point you in the right
> > direction).  Our bug tracker is at:
> 
> Thanks very much, I'd really appreciate some guidance.

OK, the place to look is in backends/flint.  I'll assume you're using
flint.  Changing quartz wouldn't be too different, but flint is the
future and will be the default backend as of 1.0, so we'd definitely
want to patch flint in preference.

The call to replace_document() is handled in flint_database.cc -
search for FlintWritableDatabase::replace_document - and changes to
terms affect three tables:

* the termlist - this stores a list of terms for each document.  You'll
  still need to give it all the terms, as it compresses them into a
  single btree tag value.  The update to this btree is done right away
  (but not switched live until the next flush, thanks to the btree
  versioning scheme).

* the position table - this stores positional information for the terms.
  There's one btree entry for each (docid, term) pair, and currently
  they're all updated.  Like the termlist, we update the btree right
  away (but the update isn't live until the next flush).

* the postlist table - this stores the inverted file, i.e. mappings from
  a term to a list of docids (and associated information like wdf).  We
  don't update this table right away, but buffer up changes and then
  apply them once we have a whole batch inverted.  

Changes are needed for the position and postlist tables.  Currently
we loop over the existing document in the database and remove its
entries, then later loop over the Document object passed in and add
new entries for that.  So you would need to combine the two loops and
only update the entries for terms which have been added, modified, or
removed (this only applies when a document is replaced with a modified
version of itself).
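
A minimal sketch of what the combined loop could compute, using plain
std::map rather than the real flint classes.  TermInfo, TermMap, Change
and diff_terms() are hypothetical names invented for illustration and
are not part of Xapian; the point is only that two sorted term lists
can be walked in one pass to find the terms that need touching.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-term data of one document
// (not Xapian's real type).
struct TermInfo {
    unsigned wdf = 0;
    std::vector<unsigned> positions;
    bool operator==(const TermInfo& o) const {
        return wdf == o.wdf && positions == o.positions;
    }
};

using TermMap = std::map<std::string, TermInfo>;

enum class Change { Added, Modified, Removed };

// Walk both sorted term maps in a single pass and report only the
// terms that differ.  Unchanged terms do not appear in the result.
std::map<std::string, Change> diff_terms(const TermMap& old_terms,
                                         const TermMap& new_terms) {
    std::map<std::string, Change> changes;
    auto o = old_terms.begin();
    auto n = new_terms.begin();
    while (o != old_terms.end() || n != new_terms.end()) {
        if (n == new_terms.end() ||
            (o != old_terms.end() && o->first < n->first)) {
            // Term only in the old document.
            changes[o->first] = Change::Removed;
            ++o;
        } else if (o == old_terms.end() || n->first < o->first) {
            // Term only in the new document.
            changes[n->first] = Change::Added;
            ++n;
        } else {
            // Term in both: only report it if its data differs.
            if (!(o->second == n->second))
                changes[o->first] = Change::Modified;
            ++o;
            ++n;
        }
    }
    return changes;
}
```

Only the terms in the returned map would need their position and
postlist entries rewritten; everything else could be left alone.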

To be able to do that, the document object must track which terms have
been updated.  Look at common/document.h and api/omdocument.cc.
Currently we store a flag "terms_here" which says whether we are using
"local" term information (in the map "terms"), or getting the terms
from the database.  As soon as any term is modified, we fetch the
termlist entry, populate "terms" with it, and then modify that.
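
As a rough illustration of that lazy scheme (not the actual code in
common/document.h), here is a self-contained sketch.  LazyDocument and
its fetch callback are hypothetical; the callback stands in for reading
the termlist entry from the database.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical simplification of the document internals described
// above: terms are only pulled from the database on first use.
class LazyDocument {
  public:
    using TermMap = std::map<std::string, unsigned>;  // term -> wdf

    // `fetch` stands in for reading the termlist entry from the database.
    explicit LazyDocument(std::function<TermMap()> fetch)
        : fetch_(std::move(fetch)) {}

    void add_term(const std::string& term, unsigned wdf_inc = 1) {
        need_terms();
        terms_[term] += wdf_inc;
    }

    void remove_term(const std::string& term) {
        need_terms();
        terms_.erase(term);
    }

    // True once we hold a local copy of the terms.
    bool terms_here() const { return terms_here_; }

    const TermMap& terms() {
        need_terms();
        return terms_;
    }

  private:
    // Populate the local map from the database the first time it is needed.
    void need_terms() {
        if (terms_here_) return;
        terms_ = fetch_();
        terms_here_ = true;
    }

    std::function<TermMap()> fetch_;
    TermMap terms_;
    bool terms_here_ = false;
};
```

Note that after the first modification the local map holds every term,
with nothing to say which ones were actually changed.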

So we need either something extra in "terms" (or a second structure)
to track addition/modification/deletion, or to make "terms" a `delta'
against the document (if there is one) in the database, so that we
store just the changes rather than pulling in all the information.
The delta approach is neater in a way, but makes open_term_list(), etc
harder to implement.
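
The "second structure" option might look something like the sketch
below.  TrackedTerms and its members are hypothetical names, not
existing Xapian code, and a real version would also have to carry
positional information; this only shows the bookkeeping.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch: the full term map plus a second structure
// recording which terms have been touched since it was loaded.
class TrackedTerms {
  public:
    explicit TrackedTerms(std::map<std::string, unsigned> initial)
        : terms_(std::move(initial)) {}

    void add_term(const std::string& term, unsigned wdf_inc = 1) {
        terms_[term] += wdf_inc;
        touched_.insert(term);
    }

    void remove_term(const std::string& term) {
        // Only record the term if there was something to remove.
        if (terms_.erase(term) > 0) touched_.insert(term);
    }

    bool is_present(const std::string& term) const {
        return terms_.count(term) != 0;
    }

    // wdf of a term, or 0 if the term is not present.
    unsigned wdf(const std::string& term) const {
        auto it = terms_.find(term);
        return it == terms_.end() ? 0 : it->second;
    }

    // Terms whose entries may need rewriting.  A touched term that is
    // still present was added or modified; one that is absent was
    // removed (or added and then removed again, in which case the
    // database has no entry for it and the removal must be a no-op).
    const std::set<std::string>& touched() const { return touched_; }

  private:
    std::map<std::string, unsigned> terms_;  // term -> wdf
    std::set<std::string> touched_;
};
```

This keeps open_term_list() and friends simple, since the full term
map is still available, at the cost of still pulling in all the terms.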

Then replace_document() can just check if a document object came from
the same database object and has the same docid (we already store the
database and docid so we can lazily fetch information).  If so, it can
optimise the replacement.
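
The check itself would be roughly as follows.  This is only a sketch:
Database, DocumentOrigin and can_replace_incrementally() are
hypothetical stand-ins, not the real internal classes.

```cpp
#include <cstdint>

// Hypothetical stand-in for the database object.
struct Database {};

// Hypothetical record of where a document was read from.
struct DocumentOrigin {
    const Database* database = nullptr;  // null if built from scratch
    std::uint32_t docid = 0;
};

// The optimised path is only safe when the document being written was
// read from this very database object under the same docid.
bool can_replace_incrementally(const DocumentOrigin& origin,
                               const Database& target,
                               std::uint32_t target_docid) {
    return origin.database == &target && origin.docid == target_docid;
}
```

In any other case replace_document() would fall back to the existing
remove-everything-then-add-everything behaviour.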

Cheers,
    Olly


