[Xapian-discuss] Error msg xapian-compact: The revision being read has been discarded - you should call Xapian::Database::reopen() and retry the operation

Fri Jun 9 19:52:09 BST 2006

Olly,

The disk has write cache enabled. I will disable this. I've noticed from strace that when I have many scriptindex processes running concurrently on different flint dbs, the fsync operation sometimes takes 30 seconds to a minute to complete. This has had a noticeable impact on my rate of processing. I was running 150 spiders in parallel, but had to scale back to 30 or fewer. My hope is that with the write cache disabled the updates will occur at a faster rate, but it could be that disabling the writeback cache would not really help.

Can I modify the copydatabase tool to ignore the revision check in order to attempt to recover the flint index? I have identified which dbs are busted, and precisely which record -- but as you say I haven't studied the index structure at all. Will I need to make specific decisions about each index error in order to rebuild?

Thanks,
OSC

> ----- Original Message -----
> From: "Olly Betts" <olly at survex.com>
> To: oscaruser at programmer.net
> Subject: Re: [Xapian-discuss] Error msg xapian-compact: The revision being read has been discarded - you should call Xapian::Database::reopen() and retry the operation
> Date: Thu, 8 Jun 2006 02:37:45 +0100
> 
> 
> On Wed, Jun 07, 2006 at 02:40:33PM -0800, oscaruser at programmer.net wrote:
> > While running xapian-compact across a number of flint indices, I
> > receive the following error message. [...]
> 
> > record .../home/oscar/xapian/bin/xapian-compact: The revision being
> > read has been discarded - you should call Xapian::Database::reopen()
> > and retry the operation
> 
> This means that while reading from an input database's record table we
> encountered part of the tree which has a newer revision than the current
> root block.
> 
> This is almost invariably caused by updating a database while reading
> from it.  If two updates are committed before the read completes, you
> get this error (it's DatabaseModifiedError).  It's a bit of a pain
> and will be going away in the future, but it's not too hard to design
> to avoid it happening at least.
> 
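The "reopen and retry" advice in the error message can be sketched as a small loop. This is only an illustration: DatabaseModifiedError and FakeDatabase below are local stand-ins written for this example, not the real Xapian bindings, and read_with_reopen() is a hypothetical helper name. With the real bindings one would presumably pass the binding's own exception class and a real database object instead.

```python
class DatabaseModifiedError(Exception):
    """Local stand-in for Xapian's DatabaseModifiedError (assumption: not the real class)."""


def read_with_reopen(db, operation, modified_error=DatabaseModifiedError, max_attempts=3):
    """Run operation(db); if the revision was discarded, reopen and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(db)
        except modified_error:
            if attempt == max_attempts:
                raise  # still failing after the last allowed attempt: give up
            db.reopen()  # move the reader to the latest committed revision


class FakeDatabase:
    """Toy reader whose revision starts out stale; exists only to exercise the loop."""

    def __init__(self):
        self.stale = True
        self.reopen_calls = 0

    def reopen(self):
        self.stale = False
        self.reopen_calls += 1

    def get_document(self, docid):
        if self.stale:
            raise DatabaseModifiedError("The revision being read has been discarded")
        return "document %d" % docid


db = FakeDatabase()
result = read_with_reopen(db, lambda d: d.get_document(1112))
```

The retry count is capped so that a genuinely damaged table, which would raise the same error on every attempt, does not loop forever.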
> The alternative is a low-level bug in Xapian's flint backend, or a
> hardware or system software problem in the server.
> 
> I wouldn't rule out a Xapian bug, but the code in question was in
> the quartz backend too, so it's been well hammered and every previous
> occurrence of this has been attributed to simultaneous update or ailing
> hardware.
> 
> > There are no other clients attempting to read or write the databases
> > than xapian-compact. It could be that I killed the scriptindex process
> > while a flint index was being updated, which may have caused
> > corruption.
> 
> It shouldn't be possible to cause corruption in this way.  Even a power
> failure should leave the database in a consistent state (assuming the
> power failure doesn't corrupt the actual data being written of course!)
> 
> The only loophole is that we assume fsync/fdatasync actually syncs
> data to disk before returning, but the Linux man page notes:
> 
>         In case the hard disk has write cache enabled, the data may not
>         really be on permanent storage when fsync/fdatasync return.
> 
> After the sync we write a new "baseA" or "baseB" file to point to
> the new root, so there's perhaps a possibility that the base file could
> get written before the sync completes if the disk subsystem writes
> blocks out of order (you'd hope it wouldn't across a flush though), but
> I think this could only be a problem anyway with a kernel crash or power
> failure.
> 
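The ordering described above (sync the table data first, then write the base file that points at the new root) can be sketched roughly as follows. This is a toy model under stated assumptions, not flint's on-disk format: commit_revision() is a hypothetical function, the file names are illustrative, and the "base file" here just holds a root block number as text.

```python
import os
import tempfile


def commit_revision(table_path, base_path, block, root_block_number):
    """Append a block to the table, sync it, and only then publish the new root."""
    # Step 1: write the new block and ask the OS to push it to the disk.
    with open(table_path, "ab") as table:
        table.write(block)
        table.flush()
        os.fsync(table.fileno())
    # Step 2: only after the sync returns, write the base file naming the new root.
    # If the disk's write cache acknowledges step 1 early, this ordering is not
    # guaranteed to hold across a power failure.
    with open(base_path, "wb") as base:
        base.write(str(root_block_number).encode("ascii"))
        base.flush()
        os.fsync(base.fileno())


with tempfile.TemporaryDirectory() as tmp:
    table_path = os.path.join(tmp, "record.DB")      # illustrative name
    base_path = os.path.join(tmp, "record.baseA")    # illustrative name
    commit_revision(table_path, base_path, b"\x00" * 8192, 0)
    table_size = os.path.getsize(table_path)
    with open(base_path, "rb") as f:
        committed_root = f.read()
```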
> > Is there a way to repair the index in that case?
> 
> We just don't get corrupt indexes, so nobody's bothered to write a
> repair tool!
> 
> If it's the postlist table that's broken, copydatabase should be able
> to help, since it effectively reconstructs the postlist table from
> the termlist table (which is why it's so much slower than xapian-compact
> and quartzcompact).  But the data in the other tables isn't redundant.
> 
> If the database has a full set of "baseA" and "baseB" files, you could
> try removing all of one set and see if you can run xapian-compact;
> if not, restore those, remove all of the other set, and see if that works.
> If this helps, you'll lose the last batch of documents flushed.
> 
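A cautious way to try the base-file swap is to move the files aside rather than delete them, ideally on a copy of the database directory. The sketch below assumes the base files are named like "record.baseA" and "postlist.baseB" inside the database directory; the function names and the backup directory are hypothetical.

```python
import shutil
from pathlib import Path


def set_aside_base_files(db_dir, suffix, backup_dir):
    """Move every '*.<suffix>' file from db_dir into backup_dir; return the names moved."""
    db_dir, backup_dir = Path(db_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for base_file in sorted(db_dir.glob("*." + suffix)):
        shutil.move(str(base_file), str(backup_dir / base_file.name))
        moved.append(base_file.name)
    return moved


def restore_base_files(db_dir, backup_dir):
    """Move everything in backup_dir back into db_dir."""
    for base_file in sorted(Path(backup_dir).iterdir()):
        shutil.move(str(base_file), str(Path(db_dir) / base_file.name))
```

The idea would be to call set_aside_base_files(db, "baseA", backup), run xapian-compact on db, and if that fails, call restore_base_files(db, backup) and repeat with "baseB" and a separate backup directory.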
> > Are there other reasons why this could
> > have happened? Is there a way to validate the integrity of an index?
> 
> For quartz databases, you can run quartzcheck.  I've not yet written a
> version for flint, since the database format is due to change
> substantially.
> 
> > I tried to use the copydatabase utility to sort out what the problem
> > is with this db, and found that at record 1112, there seems to be some
> > corruption. How can I fix? Do I need to look at the binary data
> > structure to determine how to fix this issue? Part of the problem is
> > that it is not trivial to regenerate the data that was in the file.
> 
> The database format isn't too easy to follow from a hex dump.  It's not
> impossible to work out what's wrong and probably recover much of the
> data, but it's likely to be very time consuming, especially if you
> don't know the format to start with.
> 
> Sorry not to have better news.
> 
> Cheers,
>      Olly
