[Xapian-discuss] Threaded test (in C++) to reproduce our database problems

Olly Betts olly@survex.com
Thu, 17 Jun 2004 04:24:29 +0100


On Wed, Jun 16, 2004 at 10:30:46PM -0400, Eric B. Ridge wrote:
> On 6/16/04 9:43 PM, "Olly Betts" <olly@survex.com> wrote:
> 
> > That's less odd.  After the writer has flushed twice, readers will get
> > this exception - they need to call reopen() and restart the operation.
> 
> Woah, can you elaborate here?  We're not calling reopen() anywhere.  Not in
> this little test, and not in our production code.
> 
> <short pause to RTFM>
> 
> Hmm.  Sounds like we should just catch the DatabaseModifiedError, do the
> reopen() and just try again, eh?

That's generally the best approach if you open the database, do a
search, then close it.  In that case you'd only see the exception if
the search lasted longer than the time between flushes by the writer
(up to twice as long if the search starts right after a flush).  So this
exception really is exceptional unless the writer is really hammering
away and searches are slow.

> Our production code (like this test) does call flush() every-so-often, and
> it's less than every 1000 add_document() calls, so we can probably track
> when we do a flush() and do the re-open ourselves.

If you are holding readable databases open longer term it may be worth
calling reopen() preemptively.  I'm not sure it's worth tracking when
you flush - you can probably just call reopen() before any search.  I'd
advise adding logic to retry on the exception too.
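
As a rough illustration of "reopen before the search, and retry on the
exception", here is a minimal C++ sketch.  It is deliberately written
without Xapian so it compiles on its own: the DatabaseModifiedError
struct below is only a stand-in for Xapian::DatabaseModifiedError, and
search_with_retry and max_attempts are hypothetical names, not part of
any library.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for Xapian::DatabaseModifiedError, so the sketch builds
// without Xapian installed.  Not the real class.
struct DatabaseModifiedError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical helper: calls reopen(), then op().  If op() reports that
// the database was modified underneath it, reopens and restarts op()
// from scratch, up to max_attempts times in total.
template <typename Reopen, typename Op>
auto search_with_retry(Reopen reopen, Op op, int max_attempts = 3)
    -> decltype(op())
{
    assert(max_attempts > 0);
    for (int attempt = 1; ; ++attempt) {
        reopen();  // pre-emptive: move to the latest revision first
        try {
            return op();  // the whole operation is restarted each time
        } catch (const DatabaseModifiedError&) {
            if (attempt >= max_attempts) throw;  // give up, tell caller
        }
    }
}
```

With a real database the reopen argument would presumably be something
like a lambda calling db.reopen(), and the catch clause would name
Xapian's own exception type.  Capping the attempts is a design choice
here, so that a writer flushing very rapidly can't keep a reader
looping forever.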

> However, what if a reader thread is in the middle of reading back documents
> while we do the flush() in the writer thread?  Sounds like these need to be
> synchronized in some way.

The backend keeps the current version of each Btree and one previous
version, so a single flush during a read is harmless.  It's only if
flush is called (or happens implicitly) twice during the read that you
get problems.
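
To make the "current plus one previous version" rule concrete, here is
a toy C++ model.  It is not how the backend is implemented; ToyStore,
ToyReader and ModifiedError are hypothetical names invented for this
sketch, and revisions are just counters.

```cpp
#include <cassert>
#include <stdexcept>

// Toy stand-in for the exception discussed in this thread.
struct ModifiedError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Toy writer side: each flush() creates a new revision, and only the
// current revision and the one before it remain readable.
class ToyStore {
    unsigned current_ = 1;
public:
    unsigned flush() { return ++current_; }
    unsigned current() const { return current_; }
    bool readable(unsigned rev) const {
        return rev == current_ || rev + 1 == current_;
    }
};

// Toy reader side: pinned to the revision it saw when it was opened
// (or last reopened).
class ToyReader {
    const ToyStore& store_;
    unsigned rev_;
public:
    explicit ToyReader(const ToyStore& store)
        : store_(store), rev_(store.current()) {}
    void reopen() {
        rev_ = store_.current();
        assert(store_.readable(rev_));  // current revision always is
    }
    void read() const {
        if (!store_.readable(rev_))
            throw ModifiedError("revision has been discarded");
    }
};
```

In this model a reader survives one flush() by the writer, throws on
its next read() after a second flush(), and works again once reopen()
has been called.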

It would be better to avoid all this, but it's tricky to do so without a
central server.  My best idea so far would be for readers to use fcntl
locking to get a shared lock to indicate the revision they're working
with.  Then a writer would only delete old revisions for which it could
obtain an exclusive lock (otherwise it would preserve them).  The Btree
manager is generally written with multiple old revisions in mind, so
this shouldn't be a huge project.
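
A minimal sketch of that locking idea, using POSIX fcntl record locks
from C++.  Everything about the layout is an assumption made for
illustration: the convention that byte N of a lock file stands for
revision N, the function names, and the revision number 7 are all
hypothetical, and nothing like this exists in the backend today.  The
demo forks because fcntl locks are per-process, so a reader and writer
in the same process would not conflict.

```cpp
#include <cassert>
#include <cstdlib>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Assumed convention: byte N of the lock file stands for revision N.
// F_SETLK never blocks; it fails if another process holds a
// conflicting lock on that byte.
static bool lock_revision_byte(int fd, short type, off_t revision) {
    assert(revision >= 0);
    struct flock fl = {};
    fl.l_type = type;        // F_RDLCK = shared, F_WRLCK = exclusive
    fl.l_whence = SEEK_SET;
    fl.l_start = revision;
    fl.l_len = 1;
    return fcntl(fd, F_SETLK, &fl) == 0;
}

// Reader: shared lock marks the revision as in use.
bool reader_pin_revision(int fd, off_t revision) {
    return lock_revision_byte(fd, F_RDLCK, revision);
}

// Writer: may delete a revision only if it can lock it exclusively.
bool writer_may_delete_revision(int fd, off_t revision) {
    return lock_revision_byte(fd, F_WRLCK, revision);
}

// Demo: a child process pins revision 7; the parent (the "writer") is
// refused while the child lives and allowed once it has exited.
bool demo_reader_blocks_writer() {
    char path[] = "/tmp/revlock_sketch_XXXXXX";
    int fd = mkstemp(path);              // opened read/write
    if (fd < 0) return false;
    int ready[2], done[2];
    if (pipe(ready) != 0 || pipe(done) != 0) return false;
    pid_t child = fork();
    if (child < 0) return false;
    if (child == 0) {                    // reader process
        close(ready[0]);
        close(done[1]);
        char c = reader_pin_revision(fd, 7) ? 'y' : 'n';
        if (write(ready[1], &c, 1) != 1) _exit(1);
        ssize_t n = read(done[0], &c, 1);  // returns when parent closes
        (void)n;
        _exit(0);                        // exiting drops the lock
    }
    close(ready[1]);
    close(done[0]);
    char c = 'n';
    bool pinned = read(ready[0], &c, 1) == 1 && c == 'y';
    bool refused = pinned && !writer_may_delete_revision(fd, 7);
    close(done[1]);                      // let the reader finish
    waitpid(child, nullptr, 0);
    bool allowed = writer_may_delete_revision(fd, 7);
    close(ready[0]);
    close(fd);
    unlink(path);
    return refused && allowed;
}
```

A real implementation would also have to decide what the writer does
with revisions it can't lock (presumably keep them and try again at a
later flush), which this sketch doesn't attempt.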

> > I can't see anything wrong from a quick read through.  I'm right in the
> > middle of something right now, but I'll give this a whirl in the next
> > day or so.
> 
> If you happen to get to this sooner, send me your mailing address offline...
> free pizza and beer.  ;)

It looks like you (and I!) may be in luck - I seem to have finally
tracked down the cause of the problems Arjen reported.  So assuming that
his testcase is fixed, you're next.  It'd be cool if that fix also cured
your problems, but it seems less likely now I've seen a testcase.

Cheers,
    Olly