[Xapian-devel] buffered tables, sessions, and transactions

Richard Boulton richard at tartarus.org
Wed May 19 01:41:57 BST 2004


On Wed, May 05, 2004 at 01:10:56PM +0100, Olly Betts wrote:
> However, as of 0.8.0 we now buffer changes to the posting lists in
> QuartzWritableDatabase (in the "Private attributes" totlen_added through
> to mod_plists):
> 
> This probably removes the main advantage of having QuartzBufferedTable.

Certainly.  If I can remember back 4 years or so, the original idea of
having a buffered table was that it would avoid having to seek around all
over the place when adding items to a table - instead, a buffer of the
changed items would be kept to avoid rereading bits of the table which are
frequently modified.  It wasn't very efficient though, largely I think
because it wasted so much memory on bits of the table which hadn't changed
at all.
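
For anyone who hasn't looked at that code, the general idea can be sketched
roughly as below.  This is only an illustration of a write-back buffer, not
the QuartzBufferedTable code: the names DiskTable, BufferedTable, set, get,
flush and buffered are all made up for the example, and a std::map stands in
for the table on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for an on-disk table; a real table would be a
// B-tree on disk, not a std::map in memory.
using DiskTable = std::map<std::string, std::string>;

// Illustrative write-back buffer over a table (not Quartz code).
class BufferedTable {
  public:
    explicit BufferedTable(DiskTable& disk) : disk_(disk) {}

    // Record a change in memory only; the underlying table is untouched.
    void set(const std::string& key, const std::string& value) {
        buffer_[key] = value;
    }

    // Look in the buffer first, so frequently modified entries are served
    // from memory instead of being reread from the table.
    bool get(const std::string& key, std::string& value) const {
        auto b = buffer_.find(key);
        if (b != buffer_.end()) {
            value = b->second;
            return true;
        }
        auto d = disk_.find(key);
        if (d != disk_.end()) {
            value = d->second;
            return true;
        }
        return false;
    }

    // Apply the buffered changes in key order (std::map iterates in sorted
    // order), then empty the buffer.  Returns the number of entries written.
    std::size_t flush() {
        std::size_t written = 0;
        for (const auto& kv : buffer_) {
            disk_[kv.first] = kv.second;
            ++written;
        }
        buffer_.clear();
        assert(buffer_.empty());
        return written;
    }

    // Number of entries currently held in memory.
    std::size_t buffered() const { return buffer_.size(); }

  private:
    DiskTable& disk_;
    std::map<std::string, std::string> buffer_;
};
```

Every modified entry sits in memory until flush() is called, which is the
memory cost being weighed up in this thread.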

> If we're adding new documents to a database, it probably doesn't help us
> at all.  It probably still helps efficiency a little for scattered
> updates to a database (as these will be applied sorted by key, which
> is probably a small win).
>
> However, all these changes are being buffered in memory.  We could use
> that memory to cache more posting list changes, or just let the OS use
> it to cache more disk blocks.

I would guess that we would usually be better off letting the OS use the
memory than using it ourselves.

> So I propose stripping out QuartzBufferedTable.  I believe any remaining
> benefit it provides is small, that it uses a lot of memory which could
> be better used, and that it's really just unneeded code which serves to
> make quartz harder to understand and debug.
>
> In place of this, changes to all but the posting list would be written
> straight to the btrees on disk, but wouldn't be visible to readers until
> the changes were flushed.
> 
> Opinions?

It makes sense to me.  Even if it resulted in a slight performance hit, the
code simplification would be worthwhile.

> While looking at this, I noticed the currently unimplemented transaction
> methods.  The idea is to allow a group of operations to be specified as
> being applied as a unit.  Either they all are applied, or none are.

I hadn't realised these weren't implemented!  Your suggested implementation
is pretty much what I intended, though.

> That's a lot of flushing though - the currently specified API makes
> transaction inherently inefficient I think.

For small transactions, this will indeed produce a high flushing overhead.
However, in some situations I think the guarantee that a transaction has
completed would be useful - for example, if indexing data from a news feed
where we can't fetch an item again if we lose it.

It seems to me that it might be useful to implement the interface as it
stands, but add an "end_transaction" method.  This would be equivalent to
commit_transaction, but without flushing to disk.  Thus, begin_transaction
followed by end_transaction provides a facility for grouping changes, and
begin_transaction followed by commit_transaction provides guaranteed
storage of a set of changes.
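
To show the difference I have in mind, here is a toy model.  It is only a
sketch under assumptions: ToyWritableDb, add and the counters are invented
for the example, end_transaction is the proposed method and doesn't exist
yet, and "disk" is just a vector.  The only point is which calls cause a
flush.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model only; none of this is Xapian code.
class ToyWritableDb {
  public:
    void begin_transaction() {
        if (in_txn_) throw std::logic_error("transaction already open");
        in_txn_ = true;
    }

    // Inside a transaction, changes are held apart until it is closed.
    void add(const std::string& item) {
        if (in_txn_) txn_.push_back(item);
        else pending_.push_back(item);
    }

    // Proposed: apply the group as a unit, but don't force it to disk.
    void end_transaction() { close_transaction(); }

    // As currently specified: apply the group as a unit, then flush.
    void commit_transaction() {
        close_transaction();
        flush();
    }

    // Throw the whole group away: none of its changes are applied.
    void cancel_transaction() {
        require_open();
        txn_.clear();
        in_txn_ = false;
    }

    // Stand-in for writing pending changes to disk.
    void flush() {
        durable_.insert(durable_.end(), pending_.begin(), pending_.end());
        pending_.clear();
        ++flushes_;
    }

    std::size_t pending_count() const { return pending_.size(); }
    std::size_t durable_count() const { return durable_.size(); }
    std::size_t flush_count() const { return flushes_; }

  private:
    void require_open() const {
        if (!in_txn_) throw std::logic_error("no transaction open");
    }

    // Move the whole group into the pending changes in one step.
    void close_transaction() {
        require_open();
        pending_.insert(pending_.end(), txn_.begin(), txn_.end());
        txn_.clear();
        assert(txn_.empty());
        in_txn_ = false;
    }

    bool in_txn_ = false;
    std::size_t flushes_ = 0;
    std::vector<std::string> txn_;      // changes in the open transaction
    std::vector<std::string> pending_;  // applied, but not yet flushed
    std::vector<std::string> durable_;  // flushed to "disk"
};
```

In this model, begin_transaction followed by end_transaction leaves the
flush count alone, so many small groups can share one later flush, while
commit_transaction flushes every time.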

> And lastly, begin_transaction's docs currently say:
> 
>   "A transaction may only be begun within a session, see begin_session()."
> 
> This method isn't on the public API.  Sessions are mentioned elsewhere
> in the public API docs, but I don't see them as relevant without methods
> to control them!  I propose to remove these references (and perhaps also
> the session mechanism in the backends).  Alternatively we could add
> sessions to the public API.  This would potentially allow multiple
> concurrent writers to the same DB with locking to prevent collisions.
> But I'm not sure that's appropriate for Xapian.  We're not trying to
> build a relational database here...

The references in the public API should be removed.  I'm not sure about the
implementation in the backends.

-- 
Richard



