Xapian 1.4.3 "Db block overwritten - are there multiple writers?"

Olly Betts olly at survex.com
Wed May 24 03:40:32 BST 2017


On Mon, May 22, 2017 at 07:45:59AM +0200, Jean-Francois Dockes wrote:
> Olly Betts writes:
>  > Assuming nobody deleted the log file, this could be a Xapian bug.  This

I meant "lock file" not "log file" here.

>  > isn't something we're drowning in reports of, so presumably it doesn't
>  > trigger easily, so finding a way to reproduce would be good.
>  > 
>  > It could also be memory or disk corruption.  We don't currently store
>  > a checksum for each block, so there's no explicit detection of this.
>  > 
>  > Or something in the same process wrote to an fd that has since been
>  > closed and reused for one of the database tables (Xapian avoids reusing
>  > fds 0, 1 and 2 to avoid this for the standard streams, but it's hard to
>  > fully protect against this given how fds work).
> 
> This is certainly a possibility of course. In this case, we might be able
> to get an idea by looking at the actual data (with luck). What would be the
> best approach to get a peek?

In this case, the output of xapian-check strongly suggests this is
unlikely to be the cause, or at least not the only cause.
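
For anyone unfamiliar with the fd reuse scenario, here is a minimal sketch
of how a stale descriptor number can end up pointing at a different file.
It assumes a POSIX system and a single-threaded process; the file names
and the function name are made up for illustration and have nothing to do
with Xapian's actual files or code.

```python
import os
import tempfile


def demonstrate_fd_reuse():
    """Return (stale_fd, table_fd, bytes that ended up in the 'table' file)."""
    with tempfile.TemporaryDirectory() as workdir:
        log_path = os.path.join(workdir, "app.log")         # hypothetical log file
        table_path = os.path.join(workdir, "postlist.tbl")  # hypothetical table file

        # Some component opens a log file and remembers the descriptor number.
        log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT, 0o600)
        stale_fd = log_fd

        # The log is closed, but the number is (wrongly) kept around.
        os.close(log_fd)

        # POSIX hands out the lowest free descriptor, so with nothing else
        # opening files in between, this open gets the same number back.
        table_fd = os.open(table_path, os.O_RDWR | os.O_CREAT, 0o600)

        # A late "log" write through the stale number now lands in the table.
        if stale_fd == table_fd:
            os.write(stale_fd, b"log line\n")

        os.lseek(table_fd, 0, os.SEEK_SET)
        data = os.read(table_fd, 100)
        os.close(table_fd)
        return stale_fd, table_fd, data
```

Damage of this kind would normally show up as foreign data inside a block,
which is why looking at the raw block contents can sometimes identify the
culprit.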

>  > Or something else perhaps.
>  > 
>  > > I've asked the kind user to run xapian-check on the index and post the
>  > > output.
>  > 
>  > That's a good thing to check.  If xapian-check finds no problems, then
>  > it's presumably just an in-core issue, which points to a Xapian bug or
>  > memory issues.
> 
> The output of xapian-check follows.

> xapian-check ~/.recoll/xapiandb
[...]
> postlist:
> baseB blocksize=8K items=8872525 lastblock=524452 revision=6207 levels=3 root=238
> B-tree checked okay
> termfreq 197211 != # of entries 197210
> collfreq 10861536 != sum wdf 10861533
> termfreq 14189 != # of entries 14188
> collfreq 98354 != sum wdf 98344
> termfreq 9866 != # of entries 9865
> collfreq 56453 != sum wdf 56443
> termfreq 195141 != # of entries 195137
> collfreq 8126093 != sum wdf 8126079
> postlist table errors found: 8
[...]
> Total errors found: 8

Two interesting things here:

Firstly, the parent vs child block revision inconsistency seems to have
gone (xapian-check includes a check for this situation).

Secondly, the only inconsistencies seem to be in the term and collection
frequencies of 4 terms.
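
To spell out what those messages compare: the stored term frequency should
equal the number of entries in the term's posting list, and the stored
collection frequency should equal the sum of the wdf values of those
entries.  A rough sketch of that comparison follows.  It is not
xapian-check's actual code; the function name and the list-of-pairs
representation are assumptions made for illustration, and only the message
wording is copied from the output above.

```python
def check_term(stored_termfreq, stored_collfreq, postings):
    """Compare stored statistics with a posting list.

    postings is a list of (docid, wdf) pairs (a hypothetical in-memory
    stand-in for a posting list).  Returns a list of mismatch messages.
    """
    problems = []
    entries = len(postings)
    sum_wdf = sum(wdf for _docid, wdf in postings)
    if stored_termfreq != entries:
        problems.append(
            f"termfreq {stored_termfreq} != # of entries {entries}")
    if stored_collfreq != sum_wdf:
        problems.append(
            f"collfreq {stored_collfreq} != sum wdf {sum_wdf}")
    return problems
```

For example, if a posting with wdf 10 disappears from the list but the
stored statistics are not adjusted, check_term(3, 15, [(1, 2), (4, 3)])
reports both a termfreq and a collfreq mismatch.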

I suspect both are a consequence of the exception you originally reported
during commit() (flush() is just a compatibility alias for commit()).
Some updates had been made but not yet committed when we hit the
exception, which meant the corresponding updates didn't get applied.

Then when the database is closed, those pending updates get committed,
which leaves the database inconsistent, but also would likely have fixed
the mismatching revisions (if they were on disk) by writing out a new
version of the child and parent.

We do have code to handle clearing pending changes in such cases, but
it looks to me like it's not applied broadly enough.  I'll take a look
at addressing that.
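
As a rough illustration of the failure mode described above, here is a toy
model.  It is only a sketch: ToyIndex, its methods and run() are invented
for this example and are not Xapian APIs, and the real backend is far more
involved.  The point is just that an update made of two buffered halves
leaves the table inconsistent if an exception interrupts it and the buffer
is still written out on close.

```python
class ToyIndex:
    """Toy stand-in for a table with buffered changes (not Xapian's design)."""

    def __init__(self):
        self.disk = {"entries": 0, "termfreq": 0}  # committed state
        self.pending = {}                           # buffered, uncommitted

    def _current(self, key):
        return self.pending.get(key, self.disk[key])

    def add_posting(self, fail=False):
        # First half: buffer the posting list change.
        self.pending["entries"] = self._current("entries") + 1
        if fail:
            raise RuntimeError("simulated failure mid-update")
        # Second half: buffer the matching statistics change.
        self.pending["termfreq"] = self._current("termfreq") + 1

    def discard_pending(self):
        self.pending.clear()

    def close(self):
        # Closing writes out whatever is still buffered.
        self.disk.update(self.pending)
        self.pending.clear()


def run(clear_on_error):
    index = ToyIndex()
    index.add_posting()
    try:
        index.add_posting(fail=True)
    except RuntimeError:
        if clear_on_error:
            index.discard_pending()
    index.close()
    return index.disk
```

Here run(clear_on_error=False) ends with 2 entries but a termfreq of 1,
while run(clear_on_error=True) loses the uncommitted changes but leaves
the two values in agreement.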

However, that only affects what happens after the original exception was
thrown, so it couldn't have caused it.  Sadly, exactly what caused the
original exception is obscured by the effects of this bug, and I can't
really narrow it down much.  About all I can say is that if it was due to
random corruption or overwriting, it was fairly localised.

Is this a reproducible (or at least recurring) issue?

Cheers,
    Olly


