[Xapian-devel] buffered tables, sessions, and transactions

Olly Betts olly at survex.com
Wed May 19 03:18:55 BST 2004


On Wed, May 19, 2004 at 01:41:57AM +0100, Richard Boulton wrote:
> On Wed, May 05, 2004 at 01:10:56PM +0100, Olly Betts wrote:
> > However, all these changes are being buffered in memory.  We could use
> > that memory to cache more posting list changes, or just let the OS use
> > it to cache more disk blocks.
> 
> I would guess that we would usually be better off letting the OS use the
> memory than using it ourselves.

I feel there may be too many factors at play to make a reliable guess.

But the sweet spot is going to be in there somewhere and we can
experiment to find a good default, and perhaps allow the threshold
to be changed by the user so they can tune for their particular
application and hardware.
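
To make the trade-off concrete, here is a rough sketch of a tunable flush threshold. It's a toy Python model, not Xapian code: BufferedWriter, flush_threshold and flushes_needed are all names invented for illustration, and a list stands in for the on-disk database.

```python
class BufferedWriter:
    """Toy model: hold changes in memory, flush once a threshold is reached."""

    def __init__(self, flush_threshold=1000):
        self.flush_threshold = flush_threshold  # hypothetical user-tunable knob
        self.pending = []     # changes buffered in memory
        self.on_disk = []     # stands in for the database on disk
        self.flush_count = 0  # how many times we have paid for a flush

    def add(self, change):
        self.pending.append(change)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Flushing with nothing pending costs nothing in this model.
        if self.pending:
            self.on_disk.extend(self.pending)
            self.pending.clear()
            self.flush_count += 1


def flushes_needed(n_changes, threshold):
    """Count the flushes needed to write n_changes at a given threshold."""
    writer = BufferedWriter(flush_threshold=threshold)
    for i in range(n_changes):
        writer.add(i)
    writer.flush()  # final explicit flush for whatever is left over
    return writer.flush_count
```

A larger threshold means fewer flushes but more memory held back from the OS disk cache, which is why the right default probably has to be found by experiment.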

> > That's a lot of flushing though - the currently specified API makes
> > transactions inherently inefficient I think.
> 
> For small transactions, this will indeed produce a high flushing overhead.
> However, in some situations I think the guarantee that a transaction has
> completed would be useful - for example, if indexing data from a news feed
> where we can't recover an item if we lose it again.

Does this require transactions though?

I assume the news arrives in batches (or you're batching it up) and that
you add a batch as a transaction.  Once that's safely written to disk
in the Xapian database, the batch can be deleted.

You could just add the batch normally, then call flush().  What
transactions buy you over this is atomicity - they make it easy to pick
up the pieces if an update is interrupted, albeit at the expense of
efficiency if the batches are small.

But if you've got a unique ID in each article, it's pretty trivial to
finish applying an interrupted batch.  And pretty much every application
I've seen has some sort of unique ID...
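
Roughly what I mean, as a toy Python sketch. A dict keyed on the unique ID stands in for the database, and apply_batch and fail_after are invented for illustration. The assumption is that adding an article under an ID which is already present replaces it rather than duplicating it.

```python
def apply_batch(db, batch, fail_after=None):
    """Add each (uid, text) article to db, a dict standing in for the database.

    fail_after simulates an interruption after that many articles.
    """
    for n, (uid, text) in enumerate(batch):
        if fail_after is not None and n >= fail_after:
            raise RuntimeError("update interrupted")
        # Keyed on the unique ID, so re-adding replaces and never duplicates.
        db[uid] = text


batch = [("a1", "first"), ("a2", "second"), ("a3", "third")]
db = {}

try:
    apply_batch(db, batch, fail_after=2)  # interrupted part-way through
except RuntimeError:
    pass

# Recovery: just apply the whole batch again, then it is safe to delete it.
apply_batch(db, batch)
```

Because each write is idempotent, recovery needs no record of how far the interrupted run got.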

> It seems to me that it might be useful to implement the interface as it
> stands, but add an "end_transaction" method.  This would be equivalent to
> commit_transaction, but without flushing to disk.  Thus, begin_transaction
> followed by end_transaction provides a facility for grouping changes, and
> begin_transaction followed by commit_transaction provides guaranteed
> storage of a set of changes.

Unfortunately, begin_transaction also needs to flush (so we can
implement cancel_transaction, at least in the design I sketched out).
So this doesn't help a series of transactions - only the last flush is
avoided, and only if there are non-transactional changes after the
series of transactions.

If we remove cancel_transaction, then I think your suggestion is
effectively my second one, but with a "commit_transaction" method which
does "end_transaction" followed by "flush".  I'd probably go for keeping
the methods separate - one fewer API method, begin_X and end_X pair
naturally, and it keeps the "calling this method often will hurt
performance" warnings to just flush().
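
To put numbers on it, here's a toy Python model which just counts flushes. None of this is the real backend: ToyDatabase, the begin_flushes switch and run_series are invented for illustration, and the model assumes a flush with nothing pending is free.

```python
class ToyDatabase:
    """Counts flushes only; stores nothing."""

    def __init__(self, begin_flushes):
        # True models a design with cancel_transaction, where
        # begin_transaction must flush so there is a state to roll back to.
        self.begin_flushes = begin_flushes
        self.dirty = False
        self.flush_count = 0

    def add(self, item):
        self.dirty = True

    def flush(self):
        if self.dirty:  # assumption: flushing with nothing pending is free
            self.flush_count += 1
            self.dirty = False

    def begin_transaction(self):
        if self.begin_flushes:
            self.flush()

    def end_transaction(self):
        pass  # groups the changes, but does not touch the disk

    def commit_transaction(self):
        self.end_transaction()
        self.flush()


def run_series(n, begin_flushes, finish):
    """Run n one-item transactions, each finished by the named method."""
    db = ToyDatabase(begin_flushes)
    for i in range(n):
        db.begin_transaction()
        db.add(i)
        getattr(db, finish)()
    return db


committed = run_series(5, True, "commit_transaction").flush_count
ended = run_series(5, True, "end_transaction").flush_count

grouped_db = run_series(5, False, "end_transaction")
grouped_db.flush()  # one explicit flush after the whole series
grouped = grouped_db.flush_count
```

In this model five committed transactions cost five flushes, swapping in end_transaction while begin_transaction still flushes saves only the last one, and dropping cancel_transaction gets the whole series down to a single explicit flush().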

> The references in the public API should be removed.  I'm not sure about the
> implementation in the backends.

I've had a look, and I think it can go.  The only useful part is calling
flush() if necessary when a database's destructor is called, and the
session framework is overkill for doing that.

Cheers,
    Olly



