Path: utzoo!attcan!uunet!mcvax!cernvax!ethz!zeller
From: zeller@ethz.UUCP (Lukas Zeller)
Newsgroups: comp.os.os9
Subject: Interrupt handling error in most OSK drivers ???
Message-ID: <1772@ethz.UUCP>
Date: 15 Aug 89 19:36:28 GMT
Reply-To: zeller@bernina.ethz.ch.UUCP (Lukas Zeller)
Organization: ETH Zuerich, Switzerland
Lines: 101

I have been using and programming OS-9/68k for several years now. I have
written some drivers from scratch and have modified many existing drivers
for version updates and system ports. The question I'd like to ask the net
arose from this experience. In particular, I had to fix various drivers that
tended to "hang" *sometimes* and, as a consequence, caused the system to
block.
I hope there are some OSK gurus out there on the net  (hello,  Microware  !)
who can give an answer.

The problem can occur in all I/O drivers that initiate some action in the
main process and then do an infinite sleep. The completion of the action
generates an interrupt, which causes the main process to be woken up. If for
some obscure reason this interrupt gets lost, the main process will
obviously never stop sleeping and the driver hangs. Aside from truly
"obscure" reasons that eat up interrupts before they can be serviced, there
is one way for this to happen that is inherent in *all* original Microware
drivers I know of (and in all other drivers derived from Microware code,
which covers most existing drivers):

The standard outline for an interrupt controlled I/O driver is  as  follows,
according to existing source code as well as to P.Dibble's "OS-9  Insights",
paragraph 20.6:

   repeat
        mask interrupts
        if (IO request cannot be satisfied until hardware
            generates an interrupt) then
            UNMASK INTERRUPTS
            sleep
            continue
   until (IO request can be satisfied)

Now, what happens if the interrupt occurs *after* the decision that we need
to wait for an interrupt, but *before* the main process is asleep ? The
interrupt routine is called immediately after the "UNMASK INTERRUPTS" step
and sends a wakeup signal to the main process. But the main process is not
sleeping yet, and thus the wakeup signal is ignored (according to the
documentation, S$Wake only ensures that the process is running and is *not*
queued). Then the main process goes to sleep and will remain asleep forever,
because the wakeup event already occurred before it went to sleep.
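
The race can be replayed in a toy model. This is a Python sketch, not OS-9
code: the Process class and the standard_outline function are invented for
illustration, and the wakeup is modeled only as described above (it is not
queued, so it has no effect on a process that is not yet asleep).

```python
class Process:
    """Toy model of a driver's main process; not real OS-9 code."""

    def __init__(self):
        self.sleeping = False

    def sleep(self):
        self.sleeping = True

    def wake(self):
        # The wakeup is not queued: if the process is not asleep yet,
        # clearing the flag changes nothing and the event is lost.
        self.sleeping = False


def standard_outline(irq_fires_before_sleep):
    """Replay one pass of the standard outline.

    Returns True if the process ends up asleep although the interrupt
    has already been serviced (the driver hangs).
    """
    proc = Process()
    # mask interrupts; decide that we must wait for the hardware
    # UNMASK INTERRUPTS
    if irq_fires_before_sleep:
        proc.wake()      # interrupt routine runs here: wakeup is lost
    proc.sleep()
    if not irq_fires_before_sleep:
        proc.wake()      # interrupt routine runs after the sleep: fine
    return proc.sleeping
```

With a slow device, standard_outline(False) leaves the process running; with
a fast one, standard_outline(True) leaves it asleep with nobody left to wake
it.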

This problem is *not* a theoretical one at all. For example, when I had to
replace an old, slow SCSI controller with a new, fast one, the system
suddenly hung *sometimes*: While the old controller was simply slow enough
to ensure that the interrupt was *always* issued after the main process was
asleep, this was not true for the new controller. Sometimes it responded so
quickly that the interrupt got served before the F$Sleep call. I ran into
similar problems with several other drivers from many different sources. As
said above, the prerequisites for this problem are present in virtually all
existing drivers, but it actually occurs only with fast hardware.

But how to avoid this problem ? The condition is obvious: The interrupts
MUST NOT BE ENABLED BEFORE THE MAIN PROCESS IS ASLEEP. The only way to meet
this condition is to call F$Sleep WHILE THE INTERRUPTS ARE STILL DISABLED,
and to rely on F$Sleep itself enabling the interrupts when it is safe. I
could not find any hint in the documentation as to whether this is legal or
not, but the experiments done by several members of our local OS-9 interest
group show: IT WORKS. We modified most of our drivers of all types (SCF,
RBF, SBF and even NFM) and have had no problems so far (there is one caveat
described at the end of this message), and we have used this technique for
more than a year now.
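
The effect of sleeping with interrupts still masked can be sketched in a toy
model. This is Python, not OS-9 code: the Cpu class, its methods and
fixed_outline are invented for illustration, and the assumption that the
sleep call unmasks interrupts only once the process is safely asleep is
taken from the experiments described above, not from any documentation.

```python
class Cpu:
    """Toy model of one process plus an interrupt mask; not real OS-9 code."""

    def __init__(self):
        self.masked = False      # are interrupts masked?
        self.pending = False     # interrupt raised while masked
        self.sleeping = False    # is the main process asleep?
        self.data_ready = False  # set by the interrupt routine

    def raise_irq(self):
        # A masked interrupt is held pending instead of being lost.
        if self.masked:
            self.pending = True
        else:
            self._service()

    def _service(self):
        self.pending = False
        self.data_ready = True
        self.sleeping = False    # non-queued wakeup of the main process

    def sleep_with_irqs_masked(self):
        # Assumed behaviour: the process is marked asleep first, and only
        # then are interrupts enabled, so a pending one finds it asleep.
        self.sleeping = True
        self.masked = False
        if self.pending:
            self._service()


def fixed_outline(irq_fires_early):
    """Return True if the process is left asleep forever (a hang)."""
    cpu = Cpu()
    cpu.masked = True            # mask interrupts; decide we must wait
    if irq_fires_early:
        cpu.raise_irq()          # fast hardware: held pending by the mask
    cpu.sleep_with_irqs_masked()
    if not irq_fires_early:
        cpu.raise_irq()          # slow hardware: arrives during the sleep
    return cpu.sleeping
```

In this model neither ordering hangs: fixed_outline(True) and
fixed_outline(False) both leave the process awake with the data ready.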

So the problem is solved for practical purposes. But the solution is still
based on experiment only, and therefore we cannot be sure that it will work
in all systems, although it seems to.
Also, we were very puzzled to realize during the last year that the
potential problem exists not only in some European VME card manufacturers'
OS-9 ports (which show - sad, but true in our experience - very poor
programming in general), but in many other drivers of excellent programming
quality. This applies even to the sample 68681 driver described in "OS-9
Insights". The "wrong way" (in my opinion) seems to be the official one.

As a conclusion of all this, I'd like to ask the following questions:

   -  Any similar or contradictory experiences ?
   -  Is the solution described above "legal" and reliable ? (especially  in
      future versions of OSK).  If  not,  how  can  the  problem  be  solved
      otherwise ?
   -  How could this fault (if I am not completely wrong, it *is*  a  fault)
      propagate through most existing drivers without getting discovered ?

----------------------------------------------------------------------------
For you real OS-9 hackers interested in details: As written above, there is
one caveat for the solution above in OS-9 V2.2 (most probably also in V2.1,
but I could not verify this). As long as the system tick is running,
everything works fine. But if the tick has not yet been started, F$Sleep
returns immediately without error, and the interrupts remain masked during
the execution of F$Sleep. Thus, drivers that call F$Sleep with interrupts
disabled will hang in this case unless special handling is provided. I
mention this because it caused me quite a headache when I had to upgrade a
system from V2.0 to V2.2 a few days ago, and the hard disk driver suddenly
did not work any more...
----------------------------------------------------------------------------
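
The tick caveat can also be replayed in a toy model. This is a Python
sketch, not OS-9 code: driver_wait and its parameters are invented for
illustration. The "special handling" shown (explicitly unmasking interrupts
after the sleep call returns) is only one conceivable form and is an
assumption, not something described above. The sleep itself is not modeled,
only whether the pending interrupt ever gets serviced.

```python
def driver_wait(tick_running, unmask_after_sleep, max_iterations=5):
    """Replay the driver loop with the sleep call made under masked interrupts.

    Returns True if the request completes, False if the loop is still
    spinning after max_iterations (standing in for a hang).
    """
    irq_pending = False
    data_ready = False
    for _ in range(max_iterations):
        masked = True                # mask interrupts
        if data_ready:
            return True              # IO request can be satisfied
        irq_pending = True           # hardware finishes while masked
        # Sleep call made with interrupts masked:
        if tick_running:
            masked = False           # sleep enables interrupts when safe
        # else: returns immediately, interrupts remain masked
        if unmask_after_sleep:
            masked = False           # hypothetical special handling
        if irq_pending and not masked:
            irq_pending = False
            data_ready = True        # interrupt routine finally runs
    return False
```

In this model driver_wait(True, False) completes, driver_wait(False, False)
spins with the interrupt never serviced, and driver_wait(False, True)
completes again.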

==========================  +---------------------------+  *****************
      Lukas Zeller          |\         E-Mail:         /|  *    MS-DOS...  *
 ETH Zurich, Switzerland    | \_______________________/ |  *               *
  (SFIT, Swiss Federal      |  /  zeller@ethz.UUCP   \  |  * just say NO ! *
 Institute of Technology)   | / ..cernvax!ethz!zeller \ |  *               *
==========================  +---------------------------+  *****************