Strange index consistency issue

Thu Jan 14 22:09:58 GMT 2016

On Thu, Jan 14, 2016 at 11:04:29AM +0100, Jean-Francois Dockes wrote:
> Olly Betts writes:
>  > On Sun, Jan 10, 2016 at 02:53:14AM +0000, Bob Cargill wrote:
>  > > I will look into the bug you listed to see if it might be related. If there
>  > > is anything else that I can do, please let me know. 
>  > 
>  > If that bug is not the cause, it would be good to get to the bottom of this -
>  > the database shouldn't become corrupt like this.
> 
> I remembered something: I could only reproduce issue #645 with separate
> read/write database objects, but this one is with recoll 1.21, which uses a
> single object, so maybe a different problem. 

The underlying bug for #645 was that cursors weren't getting rebuilt in
some situations where they needed to be, so they could end up holding
bad data, and that bad data could be stale data.  So it's plausible a
write might go to the wrong block, which could explain "lost" data like
we have here.

It could easily be a different problem, but testing with the latest
1.2.x would be useful to make sure we aren't trying to track down a bug
we've already fixed.

> While a Xapian bug might be involved, there are many reasons why a Recoll
> indexer can meet an abrupt end in the general case (not saying this is
> the case here).
> A pulled power cord would be the most radical example. Recoll usually does
> not run in a datacenter...
> 
> In most cases, the data is replaceable without too much effort, so that
> reliable detection of an issue is almost as good as assurance that it won't
> occur. The latter seems very difficult to attain when running in an
> uncontrolled environment.

It may not matter for recoll, but more generally we don't want Xapian
databases getting corrupt.  And we do aim to survive power failures,
kernel panics, etc - achieving that in all cases is rather hard, but I
don't think that's a reason to drop it as an aim.

Examples of corruption that can be reproduced (even if it's not entirely
on demand) are very useful - if you can see the corruption happen it's
a lot easier to work out what is going wrong than if you just see the
aftermath.

> There is one weird thing though, which is why, in this situation,
> replace_document() appears to repeatedly accept data which goes into a
> black hole.

Are you replacing the document with the same data?

If so, I think what happens is that it looks in the termlist table to
see if the document exists.  It does, so it compares the terms and sees
they are the same, and decides there's nothing to do.

It never looks at the document length list, so it doesn't see that it
is damaged.

Or if it's different data, but with the same "document length" (i.e.
sum(wdf)) then it'll update the termlist, but spot the length hasn't
changed, so again it doesn't bother to look at the document length list.

If you replaced the document with a modified version with a different
length, I'd expect this would actually "self-heal".
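
The three cases above can be sketched as a toy model in Python.  This is
only an illustration of the decision logic as described in this message,
not Xapian's actual code: ToyDatabase, its two dicts and the returned
labels are all hypothetical names, and it assumes the old length can be
derived from the stored termlist rather than read from the document
length list.

```python
# Toy model only: every name here is hypothetical, not a Xapian internal.

class ToyDatabase:
    def __init__(self):
        self.termlists = {}  # docid -> {term: wdf}, stands in for the termlist table
        self.doclens = {}    # docid -> sum(wdf), stands in for the document length list

    def replace_document(self, docid, terms):
        """Store `terms` for `docid` and return a label for the path taken."""
        new_len = sum(terms.values())
        old_terms = self.termlists.get(docid)
        if old_terms is None:
            # Document not present: write both structures.
            self.termlists[docid] = dict(terms)
            self.doclens[docid] = new_len
            return "added"
        if old_terms == terms:
            # Same terms: nothing to do, the length list is never consulted.
            return "unchanged"
        old_len = sum(old_terms.values())  # assumption: length known from the termlist side
        self.termlists[docid] = dict(terms)
        if new_len == old_len:
            # Terms differ but the length does not: the length list is left alone.
            return "termlist-only"
        # Length changed: the length entry is rewritten, repairing any damage.
        self.doclens[docid] = new_len
        return "rewritten"


def demo():
    db = ToyDatabase()
    db.replace_document(1, {"a": 2, "b": 1})
    del db.doclens[1]  # simulate the damaged document length list
    results = []
    for terms in ({"a": 2, "b": 1}, {"a": 1, "b": 2}, {"a": 3, "b": 1}):
        path = db.replace_document(1, terms)
        results.append((path, 1 in db.doclens))
    return results
```

In this model demo() replaces the damaged document three times: with
identical terms, with different terms of the same total wdf, and with
terms of a different total wdf.  Only the last call touches the length
entry, which is the "self-heal" case.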

Cheers,
    Olly