Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!cmcl2!brl-adm!umd5!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.unix.wizards,comp.sys.dec
Subject: Re: mcr0: errors
Message-ID: <9609@mimsy.UUCP>
Date: Thu, 3-Dec-87 20:31:51 EST
Article-I.D.: mimsy.9609
Posted: Thu Dec  3 20:31:51 1987
Date-Received: Tue, 8-Dec-87 02:49:30 EST
References: <192@hal.UUCP>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 45
Keywords: syslog mcr0 ecc errors
Xref: mnetor comp.unix.wizards:5790 comp.sys.dec:481

[I overrode the followup-to header because I know various people
do not get comp.sys.dec, particularly those in ARPAland.]

In article <192@hal.UUCP> ane@hal.UUCP (Aydin "Bif" Edguer) writes:
>... I have noticed a large number of soft ecc errors appearing in my system
>log.  It looks alot like there may be a bad chip on one of my 6 memory
>boards.  I can isolate which board (probably) by board swapping, but
>how can I determine which chip?

Not even board swapping is necessary; the address and syndrome values
tell which board and which chip, although you will need a table to
decode syndrome numbers.

>The memory boards all pass the software diagnostic tests from Digital.

Memory diagnostics are notoriously unreliable.  There are too many
ways for the chips to fail to test them all, so diagnostics usually
look only for `serious' trouble.

>Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4

Drag in your DEC FieldServicePerson and tell (her,him) that if you
were running VMS it would have printed

	%FOO-W-BAR, Some very long message that somewhere mentions
	something about memory chip corrected errors without giving
	anyone any clue as to what that means even though it takes
	several thousand characters to say it,*
		ADDR=00120D04

The exact meaning of all those bits depends on the memory controller
and memory boards in your system; typically the first few bits specify
an array number, the middle bits the address within the array, and the
last 7 or 8 bits the failing chip.  (This is why the address may vary
from one report to the next while the syndrome number stays the same:
a single bad chip serves many addresses.)
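
As a rough sketch of the bookkeeping described above, the following
Python pulls the addr and syn fields out of a log line like the one
quoted and packs them into a single word.  Several things here are
assumptions, not facts about any particular controller: that both
fields are printed in hex, that the syndrome occupies the low 8 bits
(which is simply what the 120d / 4 / 00120D04 example suggests), and
all of the function names, which are made up for illustration.
Mapping a syndrome to an actual chip still needs the table for your
memory boards.

```python
import re
from collections import Counter

# Matches the log line quoted in the article, e.g.
#   "... vmunix: mcr0: soft ecc addr 120d syn 4"
# Assumption: both addr and syn are printed in hexadecimal.
LINE_RE = re.compile(r"mcr(\d+): soft ecc addr ([0-9a-fA-F]+) syn ([0-9a-fA-F]+)")


def parse_soft_ecc(line):
    """Return (controller, addr, syndrome), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2), 16), int(m.group(3), 16)


def combined_word(addr, syn, syn_bits=8):
    """Pack addr and syndrome into one word.

    Assumed layout: syndrome in the low syn_bits bits, address above it.
    The real layout depends on the memory controller.
    """
    return (addr << syn_bits) | (syn & ((1 << syn_bits) - 1))


def syndrome_counts(lines):
    """Count how often each syndrome appears across a set of log lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_soft_ecc(line)
        if parsed is not None:
            counts[parsed[2]] += 1
    return counts


if __name__ == "__main__":
    sample = "Nov 28 19:39:55 hal vmunix: mcr0: soft ecc addr 120d syn 4"
    ctlr, addr, syn = parse_soft_ecc(sample)
    print("mcr%d ADDR=%08X" % (ctlr, combined_word(addr, syn)))
```

Feeding a whole system log through syndrome_counts gives a quick
picture: one syndrome dominating across many different addresses is
consistent with a single bad chip.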

-----
*Just out of curiosity, I would like to know the actual message
format, and the description under the %FOO-W-BAR key in the VMS
manuals (which I suspect is something like this: `VMS has detected
and corrected a minor hardware fault; call your Field Service
Engineer'---i.e., utterly undescriptive).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris