Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!cmcl2!brl-adm!umd5!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.unix.wizards,comp.sys.dec
Subject: Re: mcr0: errors
Message-ID: <9609@mimsy.UUCP>
Date: Thu, 3-Dec-87 20:31:51 EST
Article-I.D.: mimsy.9609
Posted: Thu Dec 3 20:31:51 1987
Date-Received: Tue, 8-Dec-87 02:49:30 EST
References: <192@hal.UUCP>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 45
Keywords: syslog mcr0 ecc errors
Xref: mnetor comp.unix.wizards:5790 comp.sys.dec:481

[I overrode the followup-to header because I know various people do
not get comp.sys.dec, particularly those in ARPAland.]

In article <192@hal.UUCP> ane@hal.UUCP (Aydin "Bif" Edguer) writes:
>... I have noticed a large number of soft ecc errors appearing in my system
>log. It looks alot like there may be a bad chip on one of my 6 memory
>boards. I can isolate which board (probably) by board swapping, but
>how can I determine which chip?

Not even board swapping is necessary; the address and syndrome values
tell which board and which chip, although you will need a table to
decode syndrome numbers.

>The memory boards all pass the software diagnostic tests from Digital.

Memory diagnostics are notoriously unreliable. There are too many
ways for the chips to fail to test them all, so diagnostics usually
look only for `serious' trouble.

>Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4

Drag in your DEC FieldServicePerson and tell (her,him) that if you
were running VMS it would have printed

	%FOO-W-BAR, Some very long message that somewhere mentions
	something about memory chip corrected errors without giving
	anyone any clue as to what that means even though it takes
	several thousand characters to say it,* ADDR=00120D04

The exact meaning of all those bits depends on the memory controller
and memory boards in your system; typically the first few bits specify
an array number, the middle bits the address within the array, and the
last 7 or 8 bits the failing chip. (This is why the address may vary,
but not the syndrome number.)

-----
*Just out of curiosity, I would like to know the actual message
format, and the description under the %FOO-W-BAR key in the VMS
manuals (which I suspect is something like this: `VMS has detected
and corrected a minor hardware fault; call your Field Service
Engineer'---i.e., utterly undescriptive).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain: chris@mimsy.umd.edu		Path: uunet!mimsy!chris
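
As a rough illustration of how the `addr 120d syn 4' pair in the log
line lines up with ADDR=00120D04, here is a minimal sketch. It assumes
both fields are printed in hexadecimal and that the syndrome occupies
the low 8 bits of the combined word; the helper names are made up for
the example, and the real field widths depend on the memory controller.

```python
import re

# Assumption: the kernel prints both the address and the syndrome in hex.
SOFT_ECC = re.compile(r"soft ecc addr ([0-9a-fA-F]+) syn ([0-9a-fA-F]+)")


def parse_soft_ecc(line):
    """Return (addr, syndrome) from an 'mcrN: soft ecc' log line, or None."""
    m = SOFT_ECC.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


def pack_addr(addr, syndrome):
    """Hypothetical packing: address bits on top, syndrome in the low 8 bits."""
    return (addr << 8) | (syndrome & 0xFF)


def unpack_addr(word):
    """Inverse of pack_addr under the same 8-bit assumption."""
    return word >> 8, word & 0xFF


line = "Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4"
addr, syn = parse_soft_ecc(line)
print("ADDR=%08X" % pack_addr(addr, syn))
```

Mapping the resulting syndrome to a physical chip still needs the
decode table for the particular memory boards, which is not
reproduced here.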