Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 alpha 4/15/85; site basser.oz
Path: utzoo!watmath!clyde!burl!ulysses!allegra!mit-eddie!think!harvard!seismo!munnari!basser!boyd
From: boyd@basser.oz (Boyd Roberts)
Newsgroups: net.unix-wizards,net.bugs.usg
Subject: Bug in newproc() in 32V and Sys5.0?
Message-ID: <470@basser.oz>
Date: Thu, 24-Oct-85 01:51:07 EDT
Article-I.D.: basser.470
Posted: Thu Oct 24 01:51:07 1985
Date-Received: Sat, 26-Oct-85 04:12:11 EDT
Organization: Dept. of Comp. Science, Uni of Sydney, Australia
Lines: 109
Xref: watmath net.unix-wizards:15437 net.bugs.usg:369

This could be a duplicate posting as my original got garbled. Do you
ever get the feeling that the news software doesn't like you?

Our VAX 780 has been crashing a lot lately with "bad memfree"s and
"lost text"s. While looking for this bug I'm pretty sure I've found a
bug in newproc(). It's certainly in 32V and probably in 5.0 (and other
5.N swapping systems). I'm not completely sure about 5.0, but I am about
32V (5.0 has this "curproc" thing, and I'm not too sure about its
function as I haven't got all of the source).

Anyway, this bug may cause the problems that we've been experiencing,
but I can't think of a scenario. But, it will do hideous things.

The situation is this:

The machine has run out of memory and it's desperate for core. Some
process fork()s, and during the out-swap of the child (by the parent)
the parent gets swapped by sched().

This is probably very rare because it is not certain that the parent
will be a candidate for swapping. The situation is even rarer on our
system where we've got copy-on-write fork()s.

The problem is that the parent is not prevented from becoming a
candidate for swapping. There is an attempt to do this but it just
doesn't work. The code goes like this:

"op" is the parent and "np" is the child

        np->p_stat = SRUN;
        np->p_flag = SLOAD;
        u.u_procp = np;
        ...
        if (save(u.u_ssav))
                return 1;

        if (procdup(np) == NULL)
        {
                /*
                 * We've run out of core here, so swap the current
                 * process to generate the copy.
                 */
                ...
                op->p_stat = SIDL;
                xswap(np, 0, 0, -1);
                op->p_stat = SRUN;
        }
        ...
        u.u_procp = op;
        setrq(np);
        np->p_flag |= SSWAP;
        ...
        return 0;

Now, the "op->p_stat = SIDL" is an attempt to put the parent in a state
where it's not a candidate to be swapped. To be a candidate you've got
to be in the SRUN, SSTOP or SSLEEP state and not SLOCKed. So everything
is fine until the parent goes to sleep in swap() (waiting for the io to
complete). When this happens you're in the shit. The parent's state then
becomes SSLEEP, and from that point on it will cycle between SSLEEP and
SRUN (because of the sleep()/wakeup() cycle). Given that sched() can
wake up while the parent is in either of these states, the parent is
then in a position to be swapped WHILE IT IS SWAPPING THE CHILD.

Oh dear. Once sched() invokes xswap() on the parent you then have two
xswap()s working with the same core. The one called from sched() will
cause the core to be freed after the swap. The other won't free the
core.

The results of the core being freed from underneath the xswap() in
newproc() are not really known. But, they are certainly not conducive
to data integrity. Random processes dumping core or crash city...

My fix would be to tear out that revolting mess that is text.c and
re-write it. I mean, it's time to use some algorithms and real
data-structures. That h-h-h-hideous mess that's there turns my stomach.
Do you know how partial swaps work? Unfortunately I do.

However, I'm not in a position to do that as we can't afford the
development time and my resignation becomes effective in a week and a
half.

Sooo, the fix is this:

        op->p_flag |= SLOCK;
        xswap(np, 0, 0, -1);
        op->p_flag &= ~SLOCK;

Just lock the parent across the swap. Normally you don't have to worry
because xswap() is called by a process that is SSYS (sched()) or will
be SLOCKed (ie. itself for core expansion swaps).
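To make the difference between the two approaches concrete, here is a
small user-space model of the race. It is only a sketch, not kernel
code: the numeric values of the states and flags, the SLOAD check in
swap_candidate(), and the helpers model_sleep(), model_wakeup() and
parent_exposed() are all assumptions made up for the illustration. The
only things taken from the description above are that a swap candidate
must be SRUN, SSTOP or SSLEEP and not SLOCKed (or SSYS), and that
sleeping in swap() overwrites the parent's p_stat.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's states and flags; the
 * numeric values are assumptions, not copied from any header. */
enum pstat { SSLEEP = 1, SRUN, SIDL, SSTOP };
enum { SLOAD = 01, SSYS = 02, SLOCK = 04 };

struct proc {
    enum pstat p_stat;
    int p_flag;
};

/* Swap-out eligibility as described in the post: in the SRUN, SSLEEP or
 * SSTOP state and not SLOCKed or SSYS. Requiring SLOAD (in core) is an
 * assumption of this model. */
static bool swap_candidate(const struct proc *p)
{
    if ((p->p_flag & (SSYS | SLOCK | SLOAD)) != SLOAD)
        return false;
    return p->p_stat == SRUN || p->p_stat == SSLEEP || p->p_stat == SSTOP;
}

/* Model of sleep(): it overwrites p_stat, so an earlier SIDL is lost. */
static void model_sleep(struct proc *p)
{
    assert(p != NULL);
    p->p_stat = SSLEEP;
}

/* Model of wakeup(): the process becomes runnable again. */
static void model_wakeup(struct proc *p)
{
    assert(p != NULL);
    p->p_stat = SRUN;
}

/* Walk the parent through "swap the child, sleeping on the io" and
 * report whether sched() could have picked it at any point. */
bool parent_exposed(bool use_slock)
{
    struct proc op = { SRUN, SLOAD };
    bool exposed = false;

    if (use_slock)
        op.p_flag |= SLOCK;     /* the proposed fix */
    else
        op.p_stat = SIDL;       /* the existing attempt */

    exposed |= swap_candidate(&op);   /* before the io is started */
    model_sleep(&op);                 /* waiting for the io in swap() */
    exposed |= swap_candidate(&op);
    model_wakeup(&op);                /* io done, back on the run queue */
    exposed |= swap_candidate(&op);

    if (use_slock)
        op.p_flag &= ~SLOCK;
    else
        op.p_stat = SRUN;
    return exposed;
}
```

In this model parent_exposed(false) comes back true, because the SIDL
state is gone as soon as the parent sleeps, while parent_exposed(true)
comes back false, because sleep() and wakeup() never touch p_flag.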

Also I'd change things so that across the swap the child's state is
SIDL and the parent's state is not changed (ie. it just stays SRUN).
These are really style choices. But, changing the child's state to SIDL
will doubly protect it from being swapped (it's SLOCKed by xswap()).

Boyd Roberts

...!seismo!munnari!basser.oz!boyd

"Stand back -- and hold this..."