Path: utzoo!utgpu!water!watmath!clyde!att!pacbell!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!decwrl!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Message-ID: <2465@winchester.mips.COM>
Date: 25 Jun 88 23:40:29 GMT
References: <1082@nud.UUCP> <2438@winchester.mips.COM> <1098@nud.UUCP>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 70

In article <1098@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
>   But why not interlock on loads as well?....

>1) More compact code...
This is a legitimate issue, although it's about a 5% effect.
We considered doing load-interlocks (and can do so in the future in
an upward-compatible way), but didn't for philosophical reasons,
i.e., we had a rule that we wouldn't put in features that we'd have to
live with forever if we couldn't prove they were worthwhile for
performance (at least a 1% gain); there was also some concern that, for
some design directions, there might be a cycle-time hit if this ended up
in the critical path somewhere.
Anyway, it is a legitimate topic for debate.

>2) As one other poster pointed out, it makes the coding easier and more
>bug proof since the compiler or coder doesn't have to remember to insert
>a nop on the loads with no useful delay slot instruction.  On the 88k,
>there is no way for a user program can access stale registers
>regardless of how dumb/bug ridden the compiler or writer is.

This one, however, is completely a non-issue, although, to be fair,
it's frequently asked by people running into the R2000 for the first time:
	a) It is trivial to make the assembler insert load-nops where
	needed.  I doubt that our code took more than 10 minutes to write
	and debug, and I don't ever remember having problems with that
	in almost 4 years.
	b) Any software system that couldn't be trusted with this problem
	shouldn't be trusted with ANY problem: really, there are zillions
	of harder problems that need to be solved.  (well, maybe not zillions,
	but many :-)
	c) Figuring that a RISC compiler should do optimization, but worrying
	that this feature might be buggy, is like worrying about the safety
	of flying in a 747 and bringing your own seat-belt because you
	don't really trust Boeing to remember to include them :-)
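
To give a feel for how small the job in (a) is, here is a minimal sketch
of such a pass. It is not our assembler's code: the Insn tuple, the LOADS
set and insert_load_nops are hypothetical names made up for illustration,
and the sketch looks only at straight-line code (a real assembler also has
to be conservative around labels and branches).

```python
from typing import List, NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    """Hypothetical, simplified instruction record (not a real assembler's IR)."""
    op: str                   # mnemonic, e.g. "lw", "addu", "nop"
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read


NOP = Insn("nop", None, ())
LOADS = {"lw", "lh", "lb"}    # assumed set of load mnemonics


def insert_load_nops(code: List[Insn]) -> List[Insn]:
    """Insert a nop after any load whose result is read by the very next
    instruction (one load delay slot, straight-line code only)."""
    out: List[Insn] = []
    for insn in code:
        prev = out[-1] if out else None
        if prev is not None and prev.op in LOADS and prev.dest in insn.srcs:
            out.append(NOP)   # nothing useful was scheduled into the slot
        out.append(insn)
    return out


if __name__ == "__main__":
    program = [
        Insn("lw", "r2", ("r3",)),     # load r2 from memory at r3
        Insn("addu", "r2", ("r2",)),   # uses r2 immediately: needs a nop
    ]
    for insn in insert_load_nops(program):
        print(insn.op)
```

If the instruction after the load does not read the loaded register, the
pass leaves the sequence alone, so the nop costs space only where the
reorganizer found nothing useful to put in the slot.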

....
>    The 88k scoreboarding logic doesn't perform this complex of a task.  It
>merely monitors the instruction stream and stalls the instruction unit if
>an attempt is made to execute an instruction which uses stale
>registers and then lets the instruction unit continue once the correct 
>operands are available.  For example, in the following code sequence
>(assume the ld is a cache miss):
>
>1)	ld	r2,r3,0		; Get value.
>2)	add	r3,r3,16	; Bump pointer
>3)	add	r2,r2,1		; Increment value.
>4)	sub	r4,r4,1		; Dec count.
>
>the instruction unit will stall on instruction 3 since it attempts to
>use stale data.  Even though instruction 4 theoretically could be 
>executed (as it isn't dependent on the ld results), it won't be started
>until the ld is complete, and instruction 3 is completed.

Thanx: we weren't sure whether it had multiple streams or not.
The example seems to indicate that the 88K indeed has a load with
2 cycles of latency (i.e., cycles 2 & 3 above).  From the example in
<1097@nud.UUCP> that gave cycles for the ld/st/ld code, one would have
thought there was only 1 latency cycle.  Can you say:
	a) Are there indeed 2 latency cycles (i.e., that instruction 3
	will indeed stall above)?
	b) If so, what is the reason for the second latency slot?
	(I realize that you may not want to answer this one :-)
Note that our numbers say that, on our machines, going from 1 cycle of
load latency to 2 would cost 10-15% in overall performance; given how
similar the machines are, the cost is probably about the same for an 88K.
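
The reasoning behind question (a) can be checked with a toy model. This
is a sketch under stated assumptions, not a description of the 88K or the
R2000: it assumes in-order, single-issue execution, loads that always hit
the cache, and ALU results usable by the next instruction. The function
name count_stalls and the tuple encoding are invented for illustration.

```python
def count_stalls(code, load_latency):
    """Count stall cycles for in-order, single-issue code.

    code is a list of (op, dest, srcs) tuples. A load's result is usable
    load_latency cycles later than an ALU result would be; every load is
    assumed to hit the cache.
    """
    ready = {}    # register -> first cycle in which it may be read
    cycle = 0     # issue cycle of the previous instruction
    stalls = 0
    for op, dest, srcs in code:
        earliest = cycle + 1
        issue = max([earliest] + [ready.get(r, 0) for r in srcs])
        stalls += issue - earliest
        cycle = issue
        if dest is not None:
            extra = load_latency if op == "ld" else 0
            ready[dest] = issue + 1 + extra
    return stalls


# The quoted four-instruction sequence, encoded as (op, dest, srcs).
SEQUENCE = [
    ("ld", "r2", ("r3",)),    # 1) ld  r2,r3,0
    ("add", "r3", ("r3",)),   # 2) add r3,r3,16
    ("add", "r2", ("r2",)),   # 3) add r2,r2,1
    ("sub", "r4", ("r4",)),   # 4) sub r4,r4,1
]

if __name__ == "__main__":
    for latency in (1, 2):
        print(latency, count_stalls(SEQUENCE, latency))
```

In this model, with 1 latency cycle instruction 2 fills the slot and
nothing stalls; with 2 latency cycles instruction 3 stalls for one cycle,
and instruction 4 waits behind it because issue is in order.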
-- 
-john mashey	DISCLAIMER: 
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086