Path: utzoo!utgpu!water!watmath!clyde!rutgers!rochester!daemon
From: stuart@cs.rochester.edu
Newsgroups: comp.unix.wizards
Subject: Wait, Select, and a SIGCHLD Race Condition
Message-ID: <5105@sol.ARPA>
Date: 11 Dec 87 05:43:40 GMT
Sender: daemon@cs.rochester.edu
Lines: 53

I need advice (or sympathy) for handling a race condition in
4.3BSD-flavored UNIX.  Briefly, I want to use wait3 to reap all the dead
or stopped children of a process, then use select to wait for the first
new I/O or child activity.  The sketch is something like this:

  while (0 < (pid = wait3(..., WNOHANG, ...))) {
    /* do something with child */
  }

/* XXX Race condition is here */

  numfds = select(...);
  if (numfds < 0) {
    if (errno == EINTR) {
      /* caught a signal, what kind was it, etc */
    }
  }

There is a race condition between reaping children and starting the
select.  A child can change status and the SIGCHLD can be delivered
*before* I enter select;  I don't notice it, enter select, and hang
forever.  Even if I have a handler for SIGCHLD that sets a flag and I
check that flag immediately before calling select, there is still a
(small) window of vulnerability between the check and the call.
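
To make the window concrete, here is a minimal sketch of the
flag-checking approach.  It is written against modern POSIX calls
(sigaction, and waitpid standing in for wait3) rather than the 4.3BSD
interfaces, and every name in it (child_event, CHILD_PENDING,
wait_for_activity, and so on) is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L   /* expose sigaction under strict -std modes */
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sentinel: the flag was set, go reap before sleeping. */
#define CHILD_PENDING (-2)

/* Hypothetical flag, set by the handler whenever a SIGCHLD arrives. */
static volatile sig_atomic_t child_event = 0;

static void on_sigchld(int signo)
{
    (void)signo;
    child_event = 1;
}

/* Install the flag-setting handler; returns 0 on success, -1 on error. */
int install_sigchld_flag(void)
{
    struct sigaction sa;

    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Reap every dead or stopped child without blocking; returns the count. */
int reap_children(void)
{
    int status, count = 0;

    while (waitpid(-1, &status, WNOHANG | WUNTRACED) > 0)
        count++;
    return count;
}

/* Check the flag, then sleep in select().  Returns CHILD_PENDING if the
 * flag was already set, otherwise whatever select() returns. */
int wait_for_activity(int nfds, fd_set *readfds, struct timeval *timeout)
{
    if (child_event) {
        child_event = 0;
        return CHILD_PENDING;
    }
    /* WINDOW: a SIGCHLD delivered right here sets child_event, but
     * nothing looks at the flag again before select() blocks. */
    return select(nfds, readfds, NULL, NULL, timeout);
}
```

The check narrows the window to the few instructions between the test
of child_event and the entry into select, but it cannot close it.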

Ideally, I would like to set the signal mask to block SIGCHLD and have
select release the signal *after* starting to wait.  That would allow
me to ensure that *all* dead children are noticed.  However, select
does not release any signals as far as I can tell.  Berkeley truly
improved the signal handling features going to 4.3, but the (improved)
features don't seem to let me write this code safely.  (In particular,
the sigblock, signal, sigpause, signal, sigsetmask idiom is of no help
here.)
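
For illustration only, the semantics wished for above look roughly
like the sketch below.  It leans on pselect, which POSIX standardized
later and which is not a 4.3BSD feature, so it shows the shape of the
desired behavior rather than an answer to the question as posed.  The
names block_sigchld, wait_for_events and sigchld_seen are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L   /* expose pselect and sigaction */
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical flag, set by the handler once SIGCHLD is delivered. */
static volatile sig_atomic_t sigchld_seen = 0;

static void note_sigchld(int signo)
{
    (void)signo;
    sigchld_seen = 1;
}

/* Install the handler and block SIGCHLD for normal execution.  On
 * return, *wait_mask is the mask to use while waiting: the previous
 * mask with SIGCHLD guaranteed unblocked.  Returns 0 or -1. */
int block_sigchld(sigset_t *wait_mask)
{
    struct sigaction sa;
    sigset_t block;

    sa.sa_handler = note_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, wait_mask) < 0)
        return -1;
    sigdelset(wait_mask, SIGCHLD);
    return 0;
}

/* Wait for I/O or a signal.  pselect() swaps in wait_mask and starts
 * waiting as one atomic step, so a SIGCHLD that became pending while
 * blocked is delivered at once and the call fails with EINTR. */
int wait_for_events(int nfds, fd_set *readfds,
                    const struct timespec *timeout,
                    const sigset_t *wait_mask)
{
    return pselect(nfds, readfds, NULL, NULL, timeout, wait_mask);
}
```

With SIGCHLD blocked everywhere except inside the wait, the reaping
loop can run first and any child that changes status afterwards simply
leaves the signal pending until wait_for_events is entered.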

I would appreciate advice on how to safely avoid this race condition
given 4.3BSD features.  I suspect that it's not possible, but would be
delighted to learn otherwise (see next paragraph for an equivocation
for "not possible").  It's not essential that the skeleton code look
like that given above;  all that's needed is that I/O and child
activity is processed as soon as *either* is available.  Neither kind
of activity is guaranteed to happen, and some events may already have
happened, which must not be ignored.

There *is* a kludge that I can fall back on, but I would really like to
avoid it:  Put a maximum on the timeout given to select and check for
more children when select times out.  Even if I miss a SIGCHLD, I would
still reap the child.  This is doable, but a pain, because I am
managing timer requests in addition to I/O and child requests in the
same package;  keeping the real timeouts straight from the kludge
timeouts (which might coincide!) is real ugly.  The whole point of this
package is to multiplex lots of requests and AVOID POLLING.  The kludge
is, of course, nothing but polling.
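
A rough sketch of the bookkeeping the kludge needs, assuming a made-up
ceiling CHILD_POLL_SECS and a hypothetical helper cap_timeout.  The
return value is what lets the caller tell a kludge timeout from a real
one.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical ceiling on how long select() may sleep between child checks. */
#define CHILD_POLL_SECS 5

/*
 * Fill *out with the timeout to hand to select().  `requested` is the
 * real timer interval, or NULL for "wait forever".  Returns 1 if the
 * ceiling was substituted (a select() timeout then only means "go poll
 * for children"), 0 if the real interval was used (a timeout then
 * means the real timer is due).  An interval exactly equal to the
 * ceiling counts as real, which covers the case where they coincide.
 */
int cap_timeout(const struct timeval *requested, struct timeval *out)
{
    if (requested == NULL ||
        requested->tv_sec > CHILD_POLL_SECS ||
        (requested->tv_sec == CHILD_POLL_SECS && requested->tv_usec > 0)) {
        out->tv_sec = CHILD_POLL_SECS;
        out->tv_usec = 0;
        return 1;
    }
    *out = *requested;
    return 0;
}
```

After a capped timeout the caller still has to recompute how much of
the real interval remains before the next select, which is exactly
the extra bookkeeping complained about above.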

Stu Friedberg  {ames,cmcl2,rutgers}!rochester!stuart  stuart@cs.rochester.edu