[Xapian-devel] Full MVCC in Brass

Austin Clements aclements at csail.mit.edu
Wed Aug 20 02:19:12 BST 2014


I'm one of the developers of Notmuch, an email client that makes
extensive use of Xapian.  For some time, full multi-version
concurrency control (MVCC) has been on our wish list.  Olly mentioned
that brass now uses free lists---a key step toward full MVCC---and I
offered to pitch in on putting the other pieces into place.

Before I wade too deep into this, I wanted to propose my plan.


Full MVCC would enable Xapian to keep any database revision valid as
long as any reader is using it, thus eliminating the dreaded
DatabaseModifiedError and simplifying application logic.

The high level idea is to use file range locking to represent locks on
database revisions.  Readers will acquire a shared lock on the latest
revision.  Writers will acquire an exclusive lock from revision 0 to
the highest revision they can lock, up to but not including the latest
committed version.  Writers are then prevented from allocating (and
hence overwriting) blocks unless they were freed within the writer's
locked revision range.

A detailed design follows.

Locking changes
---------------

On Unix platforms, Xapian already uses fcntl byte range locking to
lock byte 0 of "flintlock" to protect against concurrent writers.
Create a new "brasslock" that extends this to use byte R+1 to
represent a lock on revision R.

1) In the writer: Lock byte 0 like usual.  Then let RSAFE be the
current database revision minus 1 (omitting the latest revision
ensures it's available to readers that start during this writer.)  Use
F_GETLK to test for conflicting shared locks on revision range [0,
RSAFE].  If F_GETLK returns a conflicting lock, set RSAFE to one less
than the lower bound of the conflicting revision range and retry.
(Multiple retries may be necessary because it is unspecified which of
multiple conflicting locks F_GETLK will return.)  Once a lockable
range has been found, use F_SETLK to acquire an exclusive lock on it.
This may fail in an unlikely race with a reader; if it does, continue
to back off RSAFE with F_GETLK.  Remember RSAFE.

2) In the reader: Once the version file has been read, attempt to
acquire a shared lock on the current revision.  This may fail if a new
revision is committed after reading the version file; if it does,
re-read the version file and try again.

This has the annoying side-effect of requiring readers to start a
lock-holder subprocess like writers currently do.  I propose also
adding support for F_OFD_SETLK, which won't require a lock-holder
subprocess for either readers or writers.  F_OFD_SETLK was added in
Linux 3.12 (which will be in the next Debian stable) and will
hopefully appear in the next version of POSIX.1, so with any luck it
will become widely available in the future.  If F_OFD_SETLK isn't
available, Xapian can fall back to what it does currently.

Windows has LockFileEx, which is essentially the same as F_OFD_SETLK.

I don't know what to do about EMX or if we care.

Free list changes
-----------------

For free blocks, Xapian will need to keep track of which revision each
block was freed at to know when it's safe to reuse that block.  The
most efficient place to store this is in the on-disk free list itself.
The free list is currently a FIFO queue of block numbers ordered from
oldest free revision to newest free revision.  This is already the
desired order; we just have to add revision information.

1) Add an additional free list entry of the form {pack_uint4(-2)
pack_uint4(revision)} that indicates all block numbers up to the next
revision marker were freed in 'revision'.

2) Track the free revision of the entry at the head of the free list
in BrassFreeList and serialize this to the RootInfo metadata.  When
get_block encounters a revision entry, it updates the head revision
and moves on to the next free list entry.

3) In mark_block_unused, if this is the first block added since the
free list was loaded or committed, then before adding the block
number, add a revision entry with the uncommitted revision number.

4) Modify get_block to take the maximum reusable revision.  If the
head revision exceeds the reusable revision, return an unused block
rather than the next block on the free list.  (This supersedes the
current fl_end check done in get_block and walk.)

5) When a writer calls get_block, pass RSAFE as the maximum reusable
revision.

Bonus points
------------

When a writer exhausts free blocks that are available in the locked
revisions, but there are more available in revisions it couldn't lock,
it could attempt to extend its range of locked revisions.  If a reader
has since closed the database, this could allow the writer to reuse
more blocks rather than extend the database file.  This could be
important for long-running writers.  This would be easy with
F_OFD_SETLK/LockFileEx and huge pain without it (probably requiring a
dedicated lock helper binary).

In DANGEROUS mode, the writer could lock the latest committed revision
as well, thus preventing readers from even attempting to open the
database while it's being stomped on.

Double bonus points
-------------------

In the above design, a long-running reader could prevent writers from
reusing blocks that were both allocated and freed after that reader's
locked revision.  We could solve this by having the writer lock all of
the revision ranges it can, not just the range starting at revision 0.
Then, it could reuse any block whose revision and freed revision both
lie in a contiguous range of locked revisions.  This complicates the
locking procedure and substantially complicates the free list, since
it would no longer be a simple FIFO queue.



More information about the Xapian-devel mailing list