[Xapian-devel] Problems with /bin/cat and flintlock?

Fri Apr 8 06:33:02 BST 2011

Dear Olly,

I've done some further debugging and recreated the issue at least once.

I used gdb and strace to find out some more details. It seems my initial assessment may have been wrong.

Here is where the Ruby/Rack process was hung:

#0  0xb7787424 in __kernel_vsyscall ()
#1  0xb768ef4b in waitpid () from /lib/i686/cmov/libpthread.so.0
#2  0xb592ba87 in FlintLock::release (this=0x939e6c8) at backends/flint_lock.cc:271
#3  0xb596b505 in ChertDatabase::close (this=0x939dee8) at backends/chert/chert_database.cc:494
#4  0xb58f6098 in Xapian::Database::close (this=0x90663e8) at api/omdatabase.cc:118
#5  0xb5ab31c5 in _wrap_Database_close (argc=0, argv=0x0, self=3041254400) at xapian_wrap.cc:19470
#6  0xb76c6b38 in ?? () from /usr/lib/libruby1.8.so.1.8
#7  0xb76d1cb8 in ?? () from /usr/lib/libruby1.8.so.1.8

So the process was hung in FlintLock::release (blocked in waitpid()), not in FlintLock::lock as I first reported.

I attached to the /bin/cat process:

# strace -p 25694
Process 25694 attached - interrupt to quit
read(0, 0x859b000, 32768)               = ? ERESTARTSYS (To be restarted)
--- SIGHUP (Hangup) @ 0 (0) ---
read(0, 
^C <unfinished ...>
Process 25694 detached

I noticed that it was sent SIGHUP but didn't quit for some reason. Maybe this needs to be changed to SIGKILL? Also, do you know what "= ? ERESTARTSYS (To be restarted)" means?
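
My current reading, which may well be wrong, is that ERESTARTSYS is a kernel-internal code that strace prints when a system call is interrupted by a signal and is about to be restarted. That would fit the trace: cat receives SIGHUP, does not die, and goes straight back into read(). One possible explanation (only a guess, nothing in the trace proves it) is that SIGHUP is set to "ignore" somewhere in the parent, since an ignored disposition is inherited across fork() and exec(), whereas SIGKILL cannot be ignored. Here is a minimal sketch of that effect. The function child_survives_sighup() is a name made up for this example and is not Xapian code, and the child just blocks in read() instead of exec'ing /bin/cat:

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not Xapian code): fork a child that blocks in read(),
// send it SIGHUP, and report whether it is still running shortly afterwards.
// Returns false if the child died or if setup failed.
bool child_survives_sighup(bool ignore_sighup) {
    int ready[2], block[2];
    if (pipe(ready) < 0) return false;
    if (pipe(block) < 0) {
        close(ready[0]);
        close(ready[1]);
        return false;
    }

    pid_t child = fork();
    if (child < 0) {
        close(ready[0]);
        close(ready[1]);
        close(block[0]);
        close(block[1]);
        return false;
    }

    if (child == 0) {
        close(ready[0]);
        close(block[1]);
        // Make sure SIGHUP is not blocked, then choose its disposition.
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, SIGHUP);
        sigprocmask(SIG_UNBLOCK, &set, nullptr);
        signal(SIGHUP, ignore_sighup ? SIG_IGN : SIG_DFL);
        // Tell the parent we are set up, then block on a pipe that the
        // parent never writes to (standing in for cat waiting on stdin).
        char ch = 'r';
        if (write(ready[1], &ch, 1) != 1) _exit(1);
        while (read(block[0], &ch, 1) > 0) {
        }
        _exit(0);
    }

    close(ready[1]);
    close(block[0]);
    char ch;
    bool ready_ok = (read(ready[0], &ch, 1) == 1);
    close(ready[0]);

    bool survived = ready_ok;
    if (ready_ok) {
        kill(child, SIGHUP);
        // Poll for up to about half a second to see whether the child exits.
        for (int i = 0; i < 50 && survived; ++i) {
            if (waitpid(child, nullptr, WNOHANG) == child) {
                survived = false;
            } else {
                usleep(10000);
            }
        }
    }

    bool reaped = ready_ok && !survived;
    if (!reaped) {
        // SIGKILL cannot be caught or ignored, so this always ends the child.
        kill(child, SIGKILL);
        waitpid(child, nullptr, 0);
    }
    close(block[1]);
    return survived;
}
```

If that guess is right, child_survives_sighup(true) would correspond to what I am seeing with cat, and child_survives_sighup(false) to the behaviour FlintLock::release presumably expects.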

I'm trying to reproduce the issue with strace attached to the parent process to find out what happened. The first attempt gave a confusing result, so I am repeating it to check whether that result was real. I'll send it here as soon as I get some useful log data.

Kind regards,
Samuel

On 8/04/2011, at 2:41 PM, Olly Betts wrote:

> On Fri, Apr 08, 2011 at 02:56:22AM +1200, Samuel Williams wrote:
>> I've been having intermittent issues with the flintlock code - it
>> seems that the function FlintLock::lock is never returning and this is
>> locking up the Ruby process.
> 
> What OS is this on?  That's likely to be highly relevant.
> 
>> At this point, using strace I found that the application process
>> seemed to be stuck on
>>     ssize_t n = read(fds[0], &ch, 1);
>> 
>> Obviously the child process was cat, nothing really interesting about that.
> 
> The child process should send a single character before it execs
> /bin/cat, which is what the parent is waiting to read there.
> 
> If the write() call in the child fails, then the child exits, so
> unless the OS fails to transfer the byte across the pipe, I struggle
> to see how we can end up in this situation.
> 
>>     // Connect pipe to stdin and stdout.
>>     dup2(fds[1], 0);
>>     dup2(fds[1], 1);
>> 
>> Isn't this setting stdin and stdout to the same end of an existing
>> pipe? Does this make sense?
> 
> It's a bidirectional socket, so that's fine.
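
To convince myself of this, here is a minimal sketch, assuming fds comes from socketpair() as the answer above implies. The function echo_through_socketpair() is a made-up name for illustration and is not Xapian code, and the child echoes bytes itself instead of exec'ing /bin/cat. With a plain pipe() on Linux this would not work, because fds[1] would be write-only:

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not Xapian code): the child connects ONE end of a
// socketpair to both stdin and stdout and echoes a byte back to the parent.
// Returns the echoed byte, or -1 on failure.
int echo_through_socketpair(char sent) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) return -1;

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (child == 0) {
        close(fds[0]);
        // Same end of the socket for both directions.
        dup2(fds[1], 0);
        dup2(fds[1], 1);
        if (fds[1] > 1) close(fds[1]);
        // Behave like cat: copy stdin to stdout until EOF.
        char ch;
        while (read(0, &ch, 1) == 1) {
            if (write(1, &ch, 1) != 1) _exit(1);
        }
        _exit(0);
    }

    close(fds[1]);
    int result = -1;
    char ch;
    if (write(fds[0], &sent, 1) == 1 && read(fds[0], &ch, 1) == 1) {
        result = static_cast<unsigned char>(ch);
    }
    // Closing our end gives the child EOF on stdin, so it exits.
    close(fds[0]);
    waitpid(child, nullptr, 0);
    return result;
}
```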
> 
>> Anyway, I thought I'd mention this because it is a consistent problem.
>> If there is anything you think I should do with strace, gdb, etc on
>> the processes next time it hangs, let me know.
> 
> It would be useful to attach gdb to the parent and child and do a
> backtrace in each (bt) to see exactly where we are.
> 
>> One option to fix the bug without really understanding the real issue
>> would be to use select in the parent thread, rather than read. Then,
>> use a timeout of a few seconds so that if the child doesn't acquire
>> the lock within x seconds, it is as good as failed.
> 
> I'd prefer to understand the issue rather than paper over it.  Locking
> is rather a critical operation to get right!
> 
> Also, it's rather unclear what a suitable threshold is - you can use
> fcntl locking over NFS if you run the lock daemon, so a few seconds to
> get a lock is probably not impossible with a busy NFS server.
> 
> Cheers,
>    Olly
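
For reference, the select-with-timeout idea quoted above could look roughly like the sketch below. This is not a proposed patch: read_byte_with_timeout() is a made-up name, and the timeout value is exactly the arbitrary threshold Olly objects to, so it only illustrates the mechanism. A signal interrupting select() is simply reported as an error here.

```cpp
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (not Xapian code): wait up to timeout_ms for one byte.
// Returns 1 if a byte was read into *ch, 0 on timeout, -1 on error or EOF.
int read_byte_with_timeout(int fd, char* ch, int timeout_ms) {
    if (fd < 0 || fd >= FD_SETSIZE) return -1;

    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(fd, &readable);

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int rc = select(fd + 1, &readable, nullptr, nullptr, &tv);
    if (rc < 0) return -1;  // error (including EINTR)
    if (rc == 0) return 0;  // nothing arrived in time

    // select() reported the descriptor readable; 0 bytes here means EOF.
    ssize_t n = read(fd, ch, 1);
    return n == 1 ? 1 : -1;
}
```

The parent would call this in place of the bare read(fds[0], &ch, 1) and treat a return of 0 as a failed lock attempt, which is where the choice of threshold becomes the real problem.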