[Xapian-devel] Problems with /bin/cat and flintlock?
space.ship.traveller at gmail.com
Fri Apr 8 06:33:02 BST 2011
I've done some further debugging and recreated the issue at least once.
I used gdb and strace to find out some more details. It seems my initial assessment may have been wrong.
Here is where the Ruby/Rack process was hung:
#0 0xb7787424 in __kernel_vsyscall ()
#1 0xb768ef4b in waitpid () from /lib/i686/cmov/libpthread.so.0
#2 0xb592ba87 in FlintLock::release (this=0x939e6c8) at backends/flint_lock.cc:271
#3 0xb596b505 in ChertDatabase::close (this=0x939dee8) at backends/chert/chert_database.cc:494
#4 0xb58f6098 in Xapian::Database::close (this=0x90663e8) at api/omdatabase.cc:118
#5 0xb5ab31c5 in _wrap_Database_close (argc=0, argv=0x0, self=3041254400) at xapian_wrap.cc:19470
#6 0xb76c6b38 in ?? () from /usr/lib/libruby1.8.so.1.8
#7 0xb76d1cb8 in ?? () from /usr/lib/libruby1.8.so.1.8
So the process was hung in FlintLock::release, blocked in waitpid(), rather than in FlintLock::lock as I first reported.
I attached to the /bin/cat process:
# strace -p 25694
Process 25694 attached - interrupt to quit
read(0, 0x859b000, 32768) = ? ERESTARTSYS (To be restarted)
--- SIGHUP (Hangup) @ 0 (0) ---
^C <unfinished ...>
Process 25694 detached
I noticed that it was sent SIGHUP, but it didn't quit for some reason. Maybe you need to change this to SIGKILL? Do you know what "= ? ERESTARTSYS (To be restarted)" means?
I'm trying to reproduce the issue with strace attached to the parent process to find out what happened. The first time I did this the result was confusing, so I am repeating it to check whether that result was correct. I'll send it here as soon as I get some useful log data.
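For reference, strace prints ERESTARTSYS when a signal interrupts a blocking system call and the kernel intends to restart the call afterwards, so the trace above shows cat receiving SIGHUP and then carrying on with its read() rather than terminating. One possible explanation (an assumption, not something established in this thread) is that SIGHUP was set to be ignored in the parent: an ignored disposition is inherited across fork() and exec, whereas SIGKILL cannot be ignored. A minimal sketch of that behaviour follows; cat_dies_on_signal is a hypothetical helper written for illustration and is not part of Xapian.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not Xapian code): start /bin/cat blocked on an empty
// pipe, send it `sig`, and report whether that signal terminated it.
// If ignore_sighup is true, SIGHUP is set to SIG_IGN before fork(), so the
// child inherits the ignored disposition across exec.
bool cat_dies_on_signal(int sig, bool ignore_sighup) {
    int fds[2];
    if (pipe(fds) < 0) return false;

    // Set the SIGHUP disposition explicitly so the result does not depend
    // on how the calling process was started (e.g. under nohup).
    struct sigaction sa = {}, old = {};
    sa.sa_handler = ignore_sighup ? SIG_IGN : SIG_DFL;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGHUP, &sa, &old);

    pid_t pid = fork();
    if (pid < 0) {
        sigaction(SIGHUP, &old, nullptr);
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (pid == 0) {
        // Child: read end of the pipe becomes stdin, then become cat.
        if (fds[0] != 0) {
            dup2(fds[0], 0);
            close(fds[0]);
        }
        close(fds[1]);
        execl("/bin/cat", "cat", (char*)nullptr);
        _exit(127);
    }

    // Parent: restore its own disposition and keep the write end open so
    // that cat stays blocked in read().
    sigaction(SIGHUP, &old, nullptr);
    close(fds[0]);
    kill(pid, sig);

    // Poll for up to about half a second to see whether the child died.
    bool reaped = false, died = false;
    for (int i = 0; i < 25 && !reaped; ++i) {
        int status = 0;
        if (waitpid(pid, &status, WNOHANG) == pid) {
            reaped = true;
            died = WIFSIGNALED(status);
        } else {
            usleep(20000);
        }
    }
    if (!reaped) {
        // The signal had no effect; clean up with SIGKILL.
        kill(pid, SIGKILL);
        waitpid(pid, nullptr, 0);
    }
    close(fds[1]);
    return died;
}
```

Calling cat_dies_on_signal(SIGHUP, true) models the situation in the strace output: the child survives the SIGHUP, and a parent that then calls waitpid() without a timeout would block.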
On 8/04/2011, at 2:41 PM, Olly Betts wrote:
> On Fri, Apr 08, 2011 at 02:56:22AM +1200, Samuel Williams wrote:
>> I've been having intermittent issues with the flintlock code - it
>> seems that the function FlintLock::lock is never returning and this is
>> locking up the Ruby process.
> What OS is this on? That's likely to be highly relevant.
>> At this point, using strace I found that the application process
>> seemed to be stuck on
>> ssize_t n = read(fds[0], &ch, 1);
>> Obviously child process was cat, nothing really interesting about that.
> The child process should send a single character before it execs
> /bin/cat, which is what the parent is waiting to read there.
> If the write() call in the child fails, then the child exits, so
> unless the OS fails to transfer the byte across the pipe, I struggle
> to see how we can end up in this situation.
>> // Connect pipe to stdin and stdout.
>> dup2(fds[1], 0);
>> dup2(fds[1], 1);
>> Isn't this setting stdin and stdout to the same end of an existing
>> pipe? Does this make sense?
> It's a bidirectional socket, so that's fine.
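As an illustration of the handshake described above, here is a minimal sketch, assuming a Unix socketpair() as the "bidirectional socket". It is not the flint_lock.cc source: the lock acquisition itself is omitted, and start_child, echo_roundtrip and the 'Y' byte are names and values made up for the example.

```cpp
#include <cerrno>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: fork a child that reports readiness with one byte
// and then execs /bin/cat with the SAME socket end as stdin and stdout.
// Returns the child's pid (or -1) and stores the parent's end in *parent_fd.
pid_t start_child(int* parent_fd) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        // (A real lock implementation would try to take the lock here.)
        char ok = 'Y';
        if (write(fds[1], &ok, 1) != 1) _exit(1);  // exit if the write fails
        // One socket end serves both directions, so it can be both fds.
        dup2(fds[1], 0);
        dup2(fds[1], 1);
        if (fds[1] > 1) close(fds[1]);
        execl("/bin/cat", "cat", (char*)nullptr);
        _exit(127);
    }

    close(fds[1]);
    // Parent: block until the child's single byte arrives.
    char ch = 0;
    ssize_t n;
    do {
        n = read(fds[0], &ch, 1);
    } while (n < 0 && errno == EINTR);
    if (n != 1 || ch != 'Y') {
        close(fds[0]);
        kill(pid, SIGKILL);
        waitpid(pid, nullptr, 0);
        return -1;
    }
    *parent_fd = fds[0];
    return pid;
}

// Send one byte to cat and check that it comes back on the same socket.
bool echo_roundtrip(int fd, char c) {
    if (write(fd, &c, 1) != 1) return false;
    char back = 0;
    ssize_t n;
    do {
        n = read(fd, &back, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 && back == c;
}
```

In this sketch, closing the parent's end gives cat end-of-file on its stdin, so it exits on its own and waitpid() returns without any signal being needed.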
>> Anyway, I thought I'd mention this because it is a consistent problem.
>> If there is anything you think I should do with strace, gdb, etc on
>> the processes next time it hangs, let me know.
> It would be useful to attach gdb to the parent and child and do a
> backtrace in each (bt) to see exactly where we are.
>> One option to fix the bug without really understanding the real issue
>> would be to use select in the parent thread, rather than read. Then,
>> use a timeout of a few seconds so that if the child doesn't acquire
>> the lock within x seconds, it is as good as failed.
> I'd prefer to understand the issue rather than paper over it. Locking
> is rather a critical operation to get right!
> Also, it's rather unclear what a suitable threshold is - you can use
> fcntl locking over NFS if you run the lock daemon, so a few seconds to
> get a lock is probably not impossible with a busy NFS server.
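For concreteness, the select()-with-timeout idea proposed above could look roughly like the following sketch. It only illustrates the suggestion and is not a recommended fix: read_byte_with_timeout and WaitResult are hypothetical names, and the timeout value is left to the caller precisely because, as noted, a suitable threshold is unclear.

```cpp
#include <cerrno>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

enum class WaitResult { Got, TimedOut, Failed };

// Hypothetical helper: wait up to timeout_ms for one byte on fd.
// Got      -> a byte was read into *out
// TimedOut -> nothing arrived within the timeout
// Failed   -> select()/read() error, or EOF (the other end was closed)
WaitResult read_byte_with_timeout(int fd, char* out, long timeout_ms) {
    int r;
    do {
        // Rebuild the fd set and timeout on every attempt, since their
        // contents are not reliable after an interrupted select().
        // Simplification: the full timeout restarts after each EINTR.
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        struct timeval tv;
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        r = select(fd + 1, &rfds, nullptr, nullptr, &tv);
    } while (r < 0 && errno == EINTR);

    if (r == 0) return WaitResult::TimedOut;
    if (r < 0) return WaitResult::Failed;

    ssize_t n;
    do {
        n = read(fd, out, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1 ? WaitResult::Got : WaitResult::Failed;
}
```

A caller would still have to decide what TimedOut means for the lock, which is the difficulty raised above for slow cases such as fcntl locking over a busy NFS server.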