[Xapian-discuss] Error msg xapian-compact: The revision being read has been discarded - you should call Xapian::Database::reopen() and retry the operation

Thu Jun 8 02:37:45 BST 2006

On Wed, Jun 07, 2006 at 02:40:33PM -0800, oscaruser at programmer.net wrote:
> While running xapian-compact across a number of flint indices, I
> receive the following error message. [...]

> record .../home/oscar/xapian/bin/xapian-compact: The revision being
> read has been discarded - you should call Xapian::Database::reopen()
> and retry the operation

This means that while reading from an input database's record table we
encountered part of the tree which has a newer revision than the current 
root block.

This is almost invariably caused by updating a database while reading
from it.  If two updates are committed before the read completes, you
get this error (it's DatabaseModifiedError).  It's a bit of a pain
and will be going away in the future, but at least it's not too hard to
design your system so that it doesn't happen.
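
For code of your own that reads a database while an indexer may be
committing, the "reopen and retry" advice in the error message can be
sketched roughly as below.  This is only an illustration in Python, not
anything xapian-compact does itself: the helper name and its parameters
are made up, and the exception type and reopen function are passed in
so the sketch runs without Xapian installed.

```python
import time


def retry_after_reopen(operation, reopen, modified_error,
                       max_attempts=3, delay=0.0):
    """Run operation(); if it raises modified_error, reopen() and retry.

    All names here are hypothetical.  'modified_error' is whatever
    exception class signals "the revision being read was discarded".
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except modified_error:
            if attempt == max_attempts:
                raise      # still failing: let the caller see the error
            reopen()       # move the reader to the latest revision
            if delay:
                time.sleep(delay)
```

With the Xapian Python bindings you would presumably pass db.reopen as
the reopen argument and xapian.DatabaseModifiedError as the exception
class, and wrap the read (a get_mset() call, say) in the operation.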

The alternative is a low-level bug in Xapian's flint backend, or a
hardware or system software problem in the server.

I wouldn't rule out a Xapian bug, but the code in question was in
the quartz backend too, so it's been well hammered and every previous
occurrence of this has been attributed to simultaneous update or ailing
hardware.

> There are no other clients attempting to read or write the databases
> than xapian-compact. It could be that I killed the scriptindex process
> while a flint index was being updated, which may have caused
> corruption.

It shouldn't be possible to cause corruption in this way.  Even a power
failure should leave the database in a consistent state (assuming the
power failure doesn't corrupt the actual data being written, of course!)

The only loophole is that we assume fsync/fdatasync actually syncs
data to disk before returning, but the Linux man page notes:

       In case the hard disk has write cache enabled, the data may not
       really be on permanent storage when fsync/fdatasync return.

After the sync we write a new "baseA" or "baseB" file to point to
the new root, so there's perhaps a possibility that the base file could
get written before the sync completes if the disk subsystem writes
blocks out of order (you'd hope it wouldn't across a flush though), but
I think this could only be a problem anyway with a kernel crash or power
failure.
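
To make the ordering concrete, here is a minimal Python sketch of the
general "sync the data, then write the pointer to the new root" pattern
described above.  It is not Xapian's actual code: the function name,
the "table.DB" and "table.base" file names and the contents written are
all invented for illustration.

```python
import os


def commit_revision(db_dir, blocks, root_info, base_letter):
    """Write new blocks, sync them, and only then write the base file.

    Hypothetical names throughout; this only illustrates the ordering.
    """
    data_path = os.path.join(db_dir, "table.DB")
    base_path = os.path.join(db_dir, "table.base" + base_letter)

    # 1. Append the new blocks and ask the OS to push them to disk.
    with open(data_path, "ab") as data:
        data.write(blocks)
        data.flush()
        os.fsync(data.fileno())

    # 2. Only after the sync has returned, write the base file which
    #    names the new root.  A reader that sees this file relies on
    #    step 1 really having reached the disk.
    with open(base_path, "wb") as base:
        base.write(root_info)
        base.flush()
        os.fsync(base.fileno())
    return base_path
```

If the disk's write cache lets fsync return early, step 2 can reach
stable storage before step 1 does, which is the loophole in question.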

> Is there a way to repair the index in that case?

We just don't get corrupt indexes, so nobody's bothered to write a
repair tool!

If it's the postlist table that's broken, copydatabase should be able
to help, since it effectively reconstructs the postlist table from
the termlist table (which is why it's so much slower than xapian-compact
and quartzcompact).  But the data in the other tables isn't redundant.

If the database has a full set of "baseA" and "baseB" files, you could
try removing all of one set and see if you can run xapian-compact then.
If not, restore those files, remove all of the other set instead, and
see if that works.  If this helps, you'll lose the last batch of
documents flushed.
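
If you'd rather script the base file experiment than do it by hand,
something like the following Python sketch would do.  It moves the
files aside instead of deleting them so they can be put back.  The
function names are made up, and it assumes the base files can be
recognised by names ending in "baseA" or "baseB"; check that against
your own database directory first, and ideally work on a copy.

```python
import glob
import os
import shutil


def set_aside_base_files(db_dir, letter, backup_dir):
    """Move every file ending in base<letter> from db_dir to backup_dir.

    Assumes base files end in "baseA" / "baseB" (not verified here).
    """
    if letter not in ("A", "B"):
        raise ValueError("letter must be 'A' or 'B'")
    os.makedirs(backup_dir, exist_ok=True)
    moved = []
    for path in sorted(glob.glob(os.path.join(db_dir, "*base" + letter))):
        name = os.path.basename(path)
        shutil.move(path, os.path.join(backup_dir, name))
        moved.append(name)
    return moved


def restore_base_files(db_dir, backup_dir):
    """Move previously set-aside base files back into db_dir."""
    restored = []
    for name in sorted(os.listdir(backup_dir)):
        shutil.move(os.path.join(backup_dir, name),
                    os.path.join(db_dir, name))
        restored.append(name)
    return restored
```

The idea would be to set aside the "A" files, try xapian-compact, and
if that fails restore them and repeat with "B", using a separate backup
directory for each attempt.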

> Are there other reasons why this could
> have happened? Is there a way to validate the integrity of an index?

For quartz databases, you can run quartzcheck.  I've not yet written a
version for flint, since the database format is due to change
substantially.

> I tried to use the copydatabase utility to sort out what the problem
> is with this db, and found that at record 1112, there seems to be some
> corruption. How can I fix? Do I need to look at the binary data
> structure to determine how to fix this issue? Part of the problem is
> that it is not trivial to regenerate the data that was in the file.

The database format isn't too easy to follow from a hex dump.  It's not
impossible to work out what's wrong and probably recover much of the
data, but it's likely to be very time consuming, especially if you
don't know the format to start with.

Sorry not to have better news.

Cheers,
    Olly