Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.unix.questions,comp.unix.wizards
Subject: Re: Help on deciphering crash
Message-ID: <4914@mimsy.UUCP>
Date: Sun, 4-Jan-87 11:38:24 EST
Article-I.D.: mimsy.4914
Posted: Sun Jan 4 11:38:24 1987
Date-Received: Sun, 4-Jan-87 21:56:59 EST
References: <3645@sdcrdcf.UUCP> <4891@mimsy.UUCP> <1419@cit-vax.Caltech.Edu>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 60
Xref: mnetor comp.unix.questions:522 comp.unix.wizards:495

>In article <3645@sdcrdcf.UUCP> davem@sdcrdcf.UUCP (David Melman) writes:
>>Our Vax 750 running 4.2BSD has occassionally been crashing with:
>>machine check 2: cp tbuf par fault
>>	va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
>>	busserr 6 mcesr 9 pc 8000394e ps1 40c0008 mcsr 80016

>In article <4891@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>>There are two interrelated fixes for this. Both are already in
>>4.3BSD. The first is that some tbuf parity errors can be corrected
[...]

In article <1419@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu
(System Mangler) writes:
>Read the registers. This is a cache parity error, not a tbuf parity
>error. Never mind that 4.[23] doesn't distinguish between the two.

Sure enough. I never bothered to read the bits, knowing that `this
occurs all the time and is always a tbuf error'.

>We get these all the time. There are two ways to "fix" it: swap
>L0003 boards until you get a good one ($$$), or change the machine
>check handler to flush the cache and return. Now, can anyone tell
>me how to flush the cache? Maybe the microcode fix helps this too?

I have never seen a cache error here (but tb errors were extremely
rare too: probably a consequence of our ordering our 750s with Ultrix
1.0 way back when.)
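For anyone who wants to read the bits in the quoted dump by hand, here is a minimal sketch in plain user-level C, not kernel code. Every name and mask in it is an assumption: the `0x4` tbuf-parity mask, the `0xe` cause mask and the type code 2 are as I recall them from the 4.3BSD 750 machine check handler, and the `0x8` bus-error mask is read off the mcesr value 9 rather than taken from DEC documentation. The helper names are made up for illustration.

```c
#include <stdio.h>

/* Assumed VAX-11/750 values; check them against your own sources. */
#define MC750_TBERR   2u    /* machine check type: "cp tbuf par fault" */
#define MCESR_CAUSE   0xeu  /* assumed: cause bits examined in mcesr */
#define MCESR_TBPAR   0x4u  /* assumed: translation buffer parity */
#define MCESR_BUSERR  0x8u  /* assumed: bus error */

/* Nonzero if the frame looks like a plain tbuf parity error, the
   one case in which flushing and returning is plausible. */
int is_plain_tbuf_parity(unsigned type, unsigned mcesr)
{
	return type == MC750_TBERR && (mcesr & MCESR_CAUSE) == MCESR_TBPAR;
}

/* Nonzero if the assumed bus-error bit is set in mcesr. */
int has_bus_error(unsigned mcesr)
{
	return (mcesr & MCESR_BUSERR) != 0;
}

/* Print a one-line verdict for a given machine check type and mcesr. */
void describe(unsigned type, unsigned mcesr)
{
	printf("type %x mcesr %x: %s%s\n", type, mcesr,
	    is_plain_tbuf_parity(type, mcesr) ?
		"plain tbuf parity" : "not a plain tbuf parity error",
	    has_bus_error(mcesr) ? ", bus error bit set" : "");
}
```

Calling `describe(2, 0x9)` with the values from the dump (the dump prints hex) reports that this is not a plain tbuf parity error and that the assumed bus-error bit is set, which agrees with reading the registers by eye.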
Anyway, you could try disabling the cache:

	mtpr(CADR, 1);	/* CADR is register 0x25 */

but that will probably slow the machine to a crawl. Disabling and
reenabling the cache might well flush it, though. If

	mtpr(CADR, 1);	/* disable the cache */
	mtpr(CADR, 0);	/* and turn it back on */

does not clear the problem, perhaps reenabling it after a long delay
will:

	mtpr(CADR, 1);
	timeout(cacheenable, (caddr_t) 0, 10*hz);	/* about 10 seconds */
	...
	cacheenable()
	{
		mtpr(CADR, 0);
	}

But according to the registers I can read above (DEC's latest VAX
Hardware Handbook does NOT include machine check frames---why?),
returning may not help much in this case, because the machine check
error summary register (mcesr) reads 9, which has the 8 bit set: bus
error. Returning to the failed instruction may well not retry the
failed read. Since the fault occurred in kernel mode, that might bring
the machine down anyway.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris
ARPA/CSNet:	chris@mimsy.umd.edu