Strange index consistency issue

Fri Jan 15 07:48:30 GMT 2016

Olly Betts writes:
 > On Thu, Jan 14, 2016 at 11:04:29AM +0100, Jean-Francois Dockes wrote:
 > > Olly Betts writes:
 > >  > On Sun, Jan 10, 2016 at 02:53:14AM +0000, Bob Cargill wrote:
 > >  > > I will look into the bug you listed to see if it might be
 > >  > > related. If there is anything else that I can do, please let me
 > >  > > know.
 > >  > 
 > >  > If that bug is not the cause, it would be good to get to the bottom
 > >  > of this - the database shouldn't become corrupt like this.
 > > 
 > > I remembered something: I could only reproduce issue #645 with
 > > separate read/write database objects, but this one is with recoll
 > > 1.21, which uses a single object, so maybe a different problem.
 > 
 > The underlying bug for #645 was that cursors weren't getting rebuilt in
 > some situations where they needed to be, and could end up with bad data
 > in, and that bad data could be stale data.  So it's plausible a write
 > might go to the wrong block, which could explain "lost" data like we
 > have here.
 > 
 > It could easily be a different problem, but testing with the latest
 > 1.2.x would be useful to make sure we aren't trying to track down a bug
 > we've already fixed.
 > 
 > > While a Xapian bug might be involved, there are many reasons why a
 > > Recoll indexer can meet an abrupt end in the general case (not saying
 > > this is the case here).
 > > A pulled power cord would be the most radical example. Recoll usually
 > > does not run in a datacenter...
 > > 
 > > In most cases, the data is replaceable without too much effort, so
 > > that reliable detection of an issue is almost as good as assurance
 > > that it won't occur. The latter seems very difficult to attain when
 > > running in an uncontrolled environment.
 > 
 > It may not matter for recoll, but more generally we don't want Xapian
 > databases getting corrupt.  And we do aim to survive power failures,
 > kernel panics, etc - achieving that in all cases is rather hard, but I
 > don't think that's a reason to drop it as an aim.

It was not my intention to suggest this.

As an aside, it *does* matter for Recoll that its index survives such
events. A few Recoll users have gigantic indexes (hopefully in sane
environments), needing multiple days to rebuild.

Being oldish and having spent 30 years around data management issues, I
just happen to believe that datacenter RDBMS-type reliability is *not
possible* for the typical Recoll installation, on a random machine, with an
arbitrary filesystem and IO subsystem (haven't there been a few issues
around Linux fs data post-crash consistency?).

This is why I believe that, faced with uncertain reliability, and equipped
with backed-up data, corruption detection is a very important feature, even
if it can't be completely reliable either.

 > Examples of corruption that can be reproduced (even if it's not entirely
 > on demand) are very useful - if you can see the corruption happen it's
 > a lot easier to work out what is going wrong than if you just see the
 > aftermath.

And I do intend to provide such examples whenever possible. I was just
trying to make it clear that I was not necessarily looking for a fault in
Xapian code.

 > > There is one weird thing though, which is why, in this situation,
 > > replace_document() appears to repeatedly accept data which goes into a
 > > black hole.
 > 
 > Are you replacing the document with the same data?

Bob answered this, yes, mystery solved.

Cheers,

jf

 > If so, I think what happens is that it looks in the termlist table to
 > see if the document exists.  It does, so it compares the terms and sees
 > they are the same, and decides there's nothing to do.
 > 
 > It never looks at the document length list, so doesn't see that it is
 > damaged.
 > 
 > Or if it's different data, but with the same "document length" (i.e.
 > sum(wdf)) then it'll update the termlist, but spot the length hasn't
 > changed so again not bother to look at the document length list.
 > 
 > If you replaced the document with a modified version with a different
 > length, I'd expect this would actually "self-heal".
 > 
 > Cheers,
 >     Olly
 >
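
The three cases Olly describes above can be sketched with a toy model.
This is not Xapian code: the class ToyIndex, its two dictionaries and the
returned status strings are hypothetical names chosen for illustration, and
the assumption that the old length is derived from the stored termlist is
mine. It only shows why a replace with identical data, or with different
data of the same length, would never touch a damaged document length entry,
while a replace with a different length would rewrite it.

```python
# Toy model only. Names are hypothetical and do not mirror Xapian internals.
class ToyIndex:
    def __init__(self):
        self.termlist = {}  # docid -> {term: wdf}
        self.doclen = {}    # docid -> document length, i.e. sum(wdf)

    def add_document(self, docid, terms):
        self.termlist[docid] = dict(terms)
        self.doclen[docid] = sum(terms.values())

    def replace_document(self, docid, terms):
        old = self.termlist.get(docid)
        if old is None:
            self.add_document(docid, terms)
            return "added"
        if old == terms:
            # Same terms: nothing to do, the length list is never consulted.
            return "unchanged"
        self.termlist[docid] = dict(terms)
        # Assumption: the old length is taken from the old termlist.
        if sum(old.values()) == sum(terms.values()):
            # Length unchanged: the length list is still not consulted.
            return "termlist-only"
        # Length changed: the length entry is rewritten ("self-heal").
        self.doclen[docid] = sum(terms.values())
        return "full-update"

    def is_consistent(self, docid):
        return self.doclen.get(docid) == sum(self.termlist[docid].values())


idx = ToyIndex()
idx.add_document(1, {"a": 2, "b": 1})
del idx.doclen[1]  # simulate the damaged document length entry
same = idx.replace_document(1, {"a": 2, "b": 1})
same_len = idx.replace_document(1, {"a": 1, "b": 2})
healed = idx.replace_document(1, {"a": 3, "b": 1})
```

In this model the first two calls leave the simulated damage in place, and
only the third one, which changes the document length, repairs it.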