Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!cornell!uw-beaver!rice!sun-spots-request
From: mcvax!cs.vu.nl!sater@uunet.uu.net (Hans van Staveren)
Newsgroups: comp.sys.sun
Subject: ie1 problems on Sun 4/280 solved
Message-ID: <1693@sater.cs.vu.nl>
Date: 9 Dec 88 14:27:10 GMT
Sender: usenet@rice.edu
Organization: Rice University, Houston, Texas
Lines: 48
Approved: Sun-Spots@rice.edu
Original-Date: 22 Nov 88 17:02:42 GMT
X-Sun-Spots-Digest: Volume 7, Issue 39, message 12 of 13

About two months ago we had big problems with the ie1 board on a Sun 4/280: it lost large numbers of packets, and the IP queue filled up and never emptied again. We ran Sys4-3.2EXPORT, and Sun Netherlands was supposed to figure it out. Well, they didn't; we did.

The first thing I thought of when I saw the symptoms was a race. I asked Sun whether the interrupt priority of the board was right, and they claimed it was. Now, two months and a lot of pain later, I have found out that the interrupt priority is wrong, although the problem is more subtle than I originally suspected. Bear with me while I go technical for the next three paragraphs:

In the SunOS kernel all networking is supposed to be done at CPU priority splimp() or higher, to prevent devices from interrupting critical queue manipulations. On Sun 3 workstations splimp() is level 3, and ie0 and ie1 also interrupt at level 3, so all is well.

The SPARC chip in the Sun4 has twice as many interrupt levels as the MC68020 in the Sun3, and Sun made up a way to map the VMEbus interrupt request levels to SPARC interrupt levels. It *seems* that all offboard interrupts come in at odd levels (1, 3, 5, 7, ...) and all onboard interrupts at even levels (2, 4, 6, 8, ...). This means that the onboard ie0 and the offboard ie1 *cannot* interrupt at the same level: ie0 comes in at level 6, and ie1 at level 5. On the Sun4 splimp() is level 6.

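The level arithmetic above can be made concrete with a small model. This is only a sketch: it assumes the usual rule that an interrupt is taken when its level is strictly higher than the level the CPU is currently running at, the numbers are the ones reported in this article rather than values from Sun documentation, and every name in it (can_interrupt, handler_level, ie0_can_hit_ie1, check_model) is hypothetical and not taken from the SunOS kernel.

```c
#include <assert.h>

/* Levels as reported in this article (not from Sun documentation). */
enum { SUN3_IE0 = 3, SUN3_IE1 = 3, SUN3_SPLIMP = 3 };
enum { SUN4_IE0 = 6, SUN4_IE1 = 5, SUN4_SPLIMP = 6 };

/* Assumed model: an interrupt is taken only if its level is strictly
 * higher than the level the CPU is currently running at. */
static int can_interrupt(int irq_level, int cpu_level)
{
    return irq_level > cpu_level;
}

/* Level a handler runs at: its own interrupt level, or the splimp()
 * level if the handler raises itself and that level is higher. */
static int handler_level(int irq_level, int raises_to_splimp, int splimp)
{
    if (raises_to_splimp && splimp > irq_level)
        return splimp;
    return irq_level;
}

/* Returns 1 if ie0 can break into the ie1 handler, 0 otherwise. */
static int ie0_can_hit_ie1(int ie0, int ie1, int splimp, int raises)
{
    return can_interrupt(ie0, handler_level(ie1, raises, splimp));
}

/* The three cases of interest, checked against the model. */
static void check_model(void)
{
    /* Sun3: everything at level 3, so ie0 cannot preempt ie1. */
    assert(!ie0_can_hit_ie1(SUN3_IE0, SUN3_IE1, SUN3_SPLIMP, 0));
    /* Sun4: ie1 runs at 5, ie0 arrives at 6 and gets through. */
    assert(ie0_can_hit_ie1(SUN4_IE0, SUN4_IE1, SUN4_SPLIMP, 0));
    /* Sun4, if the ie1 handler raised itself to splimp(): blocked. */
    assert(!ie0_can_hit_ie1(SUN4_IE0, SUN4_IE1, SUN4_SPLIMP, 1));
}
```

Calling check_model() from a test program passes all three assertions under these assumptions. The model says nothing about how the real driver is written; it only shows that a level-5 handler leaves a window for a level-6 device.
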
Now this would still have worked if the interrupt routine for ie1, running at level 5, had made a call to raise the level to 6. Almost needless to say, this call is not there. The effect of all this is that while ie1 is queuing packets, ie0 can still interrupt, destroying the consistency of the system. End of technical mode.

I am annoyed. I was right within a minute, and I had to suffer for two months and then figure it out myself, without documentation or source. Does Sun assume all customers are dumb? They could at least have checked it; I suggested the priority several times as a possible cause.

The strangest thing is that this must have happened to lots of other people, but a message to this worthy list brought up nothing. Is there anybody out there who has seen this before?

	Hans van Staveren
	Vrije Universiteit
	Amsterdam, Holland