[Xapian-discuss] minor problem

Thu Jan 10 14:47:14 GMT 2008

On Thu, Jan 10, 2008 at 01:09:14AM -0500, Deron Meranda wrote:
> On Dec 23, 2007 10:58 PM, Olly Betts <olly at survex.com> wrote:
> > The semantics of fcntl() locking within a process are rather unhelpful,
> > so we fork a child process to take and hold the lock for us.  To
> > minimise VM use, we just exec /bin/cat once the lock is obtained.
> 
> Olly, I've been curious about this; what kind of troublesome fcntl
> semantics were you running into that necessitated this child
> lock-holding process?  This locking style is rather unusual to me.

There are two problems.  Quoting from the Linux man page:

       F_SETLK
              Acquire  a lock (when l_type is F_RDLCK or F_WRLCK) or release a
              lock (when l_type is F_UNLCK) on  the  bytes  specified  by  the
              l_whence,  l_start,  and l_len fields of lock.  If a conflicting
              lock is held by another process, this call returns -1  and  sets
                              ^^^^^^^^^^^^^^^
              errno to EACCES or EAGAIN.

Attempting to lock the same file again from the process which already
holds the lock will succeed, which is very unhelpful if threads are
involved.  This could potentially be solved using a process-global
map to track locks, but then you need a mutex to protect this, and
it introduces an O(n.log(n)) behaviour in the number of
WritableDatabases open, which seems less than ideal.  It also means
we need to add thread-specific code to the library, which caused a lot
of pain in the early days.  Perhaps pthreads was just too immature then,
though.

I previously found a post on lkml (which I can't seem to relocate now)
where someone queried this behaviour and was told it was as specified by
POSIX, so it seems fcntl() is just broken by design (or more kindly, it
was probably designed before threads were an issue).
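
To make the first problem concrete, here is a minimal sketch in Python.
It assumes a Linux-like system, and relies on Python's fcntl.lockf()
being a wrapper around fcntl() locking.  The helper names (try_lock,
child_can_lock, demo_relock) are made up for illustration.  The process
which already holds the lock "succeeds" in taking it again through a
second descriptor, while a forked child is refused:

```python
import fcntl
import os
import tempfile


def try_lock(fd):
    """Try to take an exclusive fcntl()-style lock without blocking."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


def child_can_lock(path):
    """Fork, try to lock `path` in the child, report whether it succeeded."""
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            status = 0 if try_lock(fd) else 1
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


def demo_relock():
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)
    fd1 = os.open(path, os.O_RDWR)
    fd2 = os.open(path, os.O_RDWR)
    try:
        first = try_lock(fd1)         # takes the lock
        second = try_lock(fd2)        # same process: also "succeeds"
        other = child_can_lock(path)  # different process: refused
        return first, second, other
    finally:
        os.close(fd2)
        os.close(fd1)
        os.unlink(path)
```

The second call succeeding is exactly what makes the lock useless for
protecting against another thread in the same process.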

       As well as being removed by an explicit F_UNLCK, record locks are auto-
       matically released when the process terminates or if it closes any file
       descriptor referring to a file on which locks are held.  This  is  bad:
       it  means  that a process can lose the locks on a file like /etc/passwd
       or /etc/mtab when for some reason a library function decides  to  open,
       read and close it.

This is also a big problem - it means that even if fcntl() worked
sensibly between threads we'd be out of luck.  Suppose a process holds
the lock, and another thread (or even the same thread) in the process
opens the lock file, tries to fcntl lock it, fails, and closes the lock
file: that close releases the lock that the process held.  Argghhh!  It
also means that user code can smash locks just by opening and closing
the lock file - that may seem like something that would never happen,
but consider an indexer which traverses the filesystem (like omindex)
which was accidentally set to index a tree including its own database
directory...
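
A minimal Python sketch of this second problem, again assuming a
Linux-like system and using made-up helper names (child_can_lock,
demo_close_drops_lock).  The process takes the lock, then opens and
closes the same file through an unrelated descriptor, and a forked
child is used to check whether the lock is still in force:

```python
import fcntl
import os
import tempfile


def child_can_lock(path):
    """Fork, try to lock `path` in the child, report whether it succeeded."""
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            status = 0
        except OSError:
            status = 1
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


def demo_close_drops_lock():
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)
    held_fd = os.open(path, os.O_RDWR)
    try:
        fcntl.lockf(held_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        blocked_before = not child_can_lock(path)

        # Some unrelated code merely opens and closes the lock file...
        os.close(os.open(path, os.O_RDONLY))

        # ...and the lock is gone, although held_fd is still open.
        blocked_after = not child_can_lock(path)
        return blocked_before, blocked_after
    finally:
        os.close(held_fd)
        os.unlink(path)
```

Note that the process never unlocked anything explicitly, and still
has the descriptor it locked through.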

I don't know why these problems don't seem to be more widely known.
It's less of an issue in an application than a library, since you have
more control over the process.  Other than that, all I can assume is
that people either haven't noticed the problem and have flawed locking,
or that they don't use fcntl() locking.

Suggestions for a better locking approach are certainly welcome.  I've
already considered the obvious ones: lockf() isn't a solution as on
Linux it's just a wrapper for fcntl(), and flock() locks between
processes, according to the Linux man page.

Using the existence of a lockfile (created in an NFS-safe way as
described in the O_EXCL section of Linux's "man 2 open") is what we did
for quartz, but it leaves stale locks behind if the process is killed.
You can store host and pid in the lock file, but it's hard to recover
while avoiding obscure race conditions, and that doesn't help if the
database might be on NFS or similar as you can't tell if a process is
still active on another host.
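
For anyone who hasn't read that part of the man page, the technique is
roughly: create a uniquely named file, link() it to the lock file name,
and then check the unique file's link count rather than trusting the
return value of link().  The Python below is only a sketch of that
recipe, not the actual quartz code, and acquire_lockfile is a made-up
name:

```python
import os
import socket
import tempfile


def acquire_lockfile(lock_path):
    """Try to create `lock_path` using the link() recipe; True on success."""
    host = socket.gethostname()
    pid = os.getpid()
    unique = "%s.%s.%d" % (lock_path, host, pid)

    # Create a file whose name nobody else should be using.
    fd = os.open(unique, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        # Record who we are, for whatever use that is during recovery.
        os.write(fd, ("%s %d\n" % (host, pid)).encode())
    finally:
        os.close(fd)

    try:
        try:
            os.link(unique, lock_path)
        except OSError:
            pass  # Don't trust the result; check the link count instead.
        # Two links means lock_path now refers to our file.
        return os.stat(unique).st_nlink == 2
    finally:
        os.unlink(unique)
```

Releasing the lock is just unlinking lock_path, and that is the
weakness: if the holder is killed, nothing ever performs that unlink.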

> Also, I first stumbled into this accidentally when trying to run
> under Linux with some rather tight SELinux security policies
> in place...the exec of /bin/cat was failing because of denied
> permissions that I had no idea that the library required.
> I assume most people won't try using it in such an environment though.

Would this have been less of an issue if we had our own helper binary in
place of /bin/cat?  Or would exec() of anything have been denied?
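
For reference, the fork-and-exec scheme under discussion can be
sketched as below.  This is Python rather than the library's C++, the
function names (hold_lock_in_child, release) are made up, and the use
of a socketpair to tell the child when to exit is an assumption of the
sketch, so treat it as an illustration of what the process needs to be
allowed to do (fork, open the lock file, exec /bin/cat) rather than as
the real code:

```python
import fcntl
import os
import socket
import tempfile


def hold_lock_in_child(path):
    """Fork a child which locks `path` and then execs /bin/cat.

    Returns (sock, pid) if the child got the lock, or None if it didn't.
    """
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        try:
            parent_sock.close()
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o666)
            # Python opens fds close-on-exec; the lock must survive exec.
            os.set_inheritable(fd, True)
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                child_sock.send(b"N")
                os._exit(1)
            child_sock.send(b"Y")
            # cat blocks reading the socket until the parent closes it.
            os.dup2(child_sock.fileno(), 0)
            os.dup2(child_sock.fileno(), 1)
            os.execv("/bin/cat", ["/bin/cat"])
        finally:
            # Only reached if something above raised (e.g. exec denied).
            os._exit(127)
    child_sock.close()
    if parent_sock.recv(1) != b"Y":
        parent_sock.close()
        os.waitpid(pid, 0)
        return None
    return parent_sock, pid


def release(holder):
    """Close our end so cat sees EOF and exits, dropping the lock."""
    sock, pid = holder
    sock.close()
    os.waitpid(pid, 0)
```

Because the lock belongs to the child, a second attempt from the same
parent process is refused like any other, and opening and closing the
lock file in the parent can't release it.  One caveat of the sketch:
the child reports success before the exec, so if the exec is denied
the parent won't find out from the status byte.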

Can you write a paragraph describing what's needed that can go in a
suitable place in the documentation?

Cheers,
    Olly