Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.unix.questions,comp.unix.wizards
Subject: Re: Help on deciphering crash
Message-ID: <4891@mimsy.UUCP>
Date: Tue, 30-Dec-86 21:29:37 EST
Article-I.D.: mimsy.4891
Posted: Tue Dec 30 21:29:37 1986
Date-Received: Thu, 1-Jan-87 00:37:56 EST
References: <3645@sdcrdcf.UUCP>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 43
Xref: mnetor comp.unix.questions:497 comp.unix.wizards:469

In article <3645@sdcrdcf.UUCP> davem@sdcrdcf.UUCP (David Melman) writes:
>Our Vax 750 running 4.2BSD has occasionally been crashing with:
>machine check 2: cp tbuf par fault [lots of registers]
>panic: mchk
>panic: sleep

There are two interrelated fixes for this. Both are already in 4.3BSD.

The first is that some tbuf parity errors can be corrected by
flushing the translation buffer. As I recall, 4.2 has code to do
this, but has the wrong test to determine whether it will suffice,
masking with 0xf somewhere where it should be masking with 0xe.

The second is a `jelloware' (writable control store) fix for a
timing problem in one CPU module. The 4.3 boot program knows to
load the file `pcs750.bin' into the 750 patch store. The code to
do this is not terribly large, and is all contained in
/sys/stand/boot.c at your nearest 4.3 site, which also has
/pcs750.bin.

Incidentally, the `panic: sleep' is due to a bug in sleep that
affects things only after a previous panic. I fixed this in our
4.2 kernels back when Jim O'Toole and I were writing a kernel XNS.
I was rather amused to find the very same fix in the 4.3-alpha
kernel. It helps considerably when you crash your machine several
times a day!

Also incidentally, the 4.3 boot program has no way to avoid loading
the /pcs750.bin file, something I consider a bug (now that I have
been bit by it). We recently had a 750 go down for two weeks.
The long downtime was caused by three virtually simultaneous
failures. First, one of two CDC9771 HDAs died suddenly. Second,
our standby disk system (two RK07s) had some sort of controller
backplane problem (considering how often we use the RK07s, it may
have developed long ago). Third, and only discovered last Friday,
our WCS board went out at the same time as the HDA. As long as I
did not load the microcode update, the machine would boot. With
the microcode in place, the machine would hang completely: not even
control-P did anything.

While this hardware failure might be quite rare, it forced me to
consider what would happen if part of /pcs750.bin were overwritten.
I added another boot flag to prevent the microcode update.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP: seismo!mimsy!chris
ARPA/CSNet: chris@mimsy.umd.edu
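
For readers who want to picture the boot-flag change described above,
here is a minimal sketch in C. It is not the code from /sys/stand/boot.c,
which the post does not reproduce. The flag name BOOT_NOPCS, its value,
and the helpers want_pcs_load(), load_pcs() and maybe_load_pcs() are all
assumptions made up for illustration. Only the idea comes from the post:
test one bit in the boot flags word and skip the /pcs750.bin load when
that bit is set.

```c
#include <stdio.h>

/* Hypothetical flag bit: the post names neither the flag nor its value. */
#define BOOT_NOPCS 0x1000

/* Returns 1 when the microcode update should be attempted, 0 to skip it. */
static int want_pcs_load(int howto)
{
    return (howto & BOOT_NOPCS) == 0;
}

/* Stand-in for the real patch-store loader, which is not shown here. */
static int load_pcs(const char *path)
{
    printf("loading %s into the patch store\n", path);
    return 1;
}

/*
 * Returns 1 if a load was attempted, 0 if the flag suppressed it.
 * With the flag set, a damaged microcode file (or a failed WCS board)
 * cannot keep the machine from booting.
 */
static int maybe_load_pcs(int howto)
{
    if (!want_pcs_load(howto)) {
        printf("microcode update suppressed by boot flag\n");
        return 0;
    }
    return load_pcs("/pcs750.bin");
}
```

Calling maybe_load_pcs(0) takes the normal path and attempts the load,
while maybe_load_pcs(BOOT_NOPCS) skips it. Keeping the default as
"load" preserves the stock 4.3 behavior unless the operator asks
otherwise.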