Path: utzoo!utgpu!water!watmath!clyde!mcdchg!nud!tom
From: tom@nud.UUCP (Tom Armistead)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Message-ID: <1098@nud.UUCP>
Date: 22 Jun 88 18:13:50 GMT
References: <1082@nud.UUCP> <2438@winchester.mips.COM>
Reply-To: tom@nud.UUCP (Tom Armistead)
Organization: Motorola Microcomputer Division, Tempe, Az.
Lines: 100

In article <2438@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>Following are:
>1) How the MIPS R2000/R3000 does it.
>result of b) - e), and it isn't ready, it is interlocked by the hardware,
>rather than software.  You can consider this "partial scoreboarding",
>although we don't particularly call it that, as the main pipeline does
>completely freeze, rather than trying to issue the following instruction.

    This is similar (sort of) to what the 88k does, although we do call it
scoreboarding, since we interlock on a per-register stale flag rather than a
per-functional-unit done flag.  We use the scoreboard on memory loads as well
as on the FP unit.

>a) Note that the 2 things that are NOT interlocked (load and branch):
>	Have small (1-cycle delays) that are unlikely to be shortened,
>		and are undesirable to lengthen
>	Have about a 70% chance of having a 1-cycle delay slot filled
>	by software.

   But why not interlock on loads as well?  It removes the *requirement* that
a load be followed by a nop when nothing useful can be found to put there,
yet still allows the compiler/coder to put a useful instruction in
the delay slot if one exists.  This provides the following benefits.

1) More compact code, since loads that have no useful delay slot
instruction don't need a nop.  This is the less common case (only about
30% of loads, going by the 70% fill rate quoted above), so the difference
will likely be slight, but it is still an improvement over not interlocking.

2) As one other poster pointed out, it makes coding easier and more
bug-proof, since the compiler or coder doesn't have to remember to insert
a nop on the loads with no useful delay slot instruction.  On the 88k,
there is no way a user program can access stale registers,
regardless of how dumb or bug-ridden the compiler or writer is.

   I see the above 2 items as improvements created by interlocking on
load results.  Since the interlock circuitry is already partially in
place (for the FP instructions), I don't see any disadvantages, so
why not interlock on loads?
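
   To make the code-size point concrete, here is a small Python sketch of
what a compiler for a machine WITHOUT load interlocks has to do when it
cannot fill a delay slot.  It is only an illustration: the tuple format,
the name pad_load_delay_slots and the 1-cycle load delay are assumptions
of the sketch, and it only pads, it does not try to reorder instructions.

```python
NOP = ("nop", None, ())

def pad_load_delay_slots(program):
    """program: list of (opcode, dest, sources) tuples.

    Returns a copy with a nop inserted after every load whose result is
    read by the very next instruction (a 1-cycle load delay is assumed).
    """
    padded = []
    for i, instr in enumerate(program):
        padded.append(instr)
        opcode, dest, _ = instr
        next_sources = program[i + 1][2] if i + 1 < len(program) else ()
        if opcode == "ld" and dest in next_sources:
            padded.append(NOP)  # nothing useful to put in the delay slot
    return padded

# Hypothetical sequence in which every instruction depends on the one before.
program = [
    ("ld",  "r2", ("r3",)),        # load value
    ("add", "r2", ("r2",)),        # uses the loaded value immediately
    ("st",  None, ("r2", "r3")),   # stores the incremented value
]

padded = pad_load_delay_slots(program)
print(len(program), len(padded))
```

With a load interlock the original 3-instruction sequence can be emitted
as is; the padded version needs 4, and forgetting the nop would be a bug.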

>SCOREBOARDING
>1) If you look at where scoreboarding came from (I remember it first from
>CDC 6600 (see Bell & Newell book, for example), but there may be earlier ones),
>you had machines with:
>	a) Many independent functional units
>	b) Many multiple-cycle operations (integer add was 3 cycles in 6600,
>	for example, and the FP operations were of course longer).
>	c) Multiple memory pipes, with a long latency to memory, and no caches.
>2) Note that RISC micros probably have a), but normally, only the FP ops,
>and maybe a few others have long multi-cycle operations, and they sure don't
>have multiple memory pipes, and they often have caches.
      ^^^^^^^^^^^^^^^^^^^^^
    The 88k has this as well as a) and b) (on-board FP unit).

>On a machine with a simple load-interlock, and 1-cycle-latency loads,
>the machine would stall for 1 cycle before instruction 3.
>Suppose you do more complex scoreboarding, i.e., you continue attempting
>to issue instructions?  Then, you might do 1, 2, try 3 and save it,
>and then either come back and do 3 again (probably) or go on to 4 and
>discover that it stalls also.  If one looks at integer code sequences,
>one discovers that it is hard to discover many things that don't quickly
>depend on something else (barring, of course, Multiflow-style compiler
>technology and hardware designs not currently feasible in VLSI micros).

    The 88k scoreboarding logic doesn't perform a task this complex.  It
merely monitors the instruction stream, stalls the instruction unit if
an attempt is made to execute an instruction that uses stale
registers, and lets the instruction unit continue once the correct
operands are available.  For example, in the following code sequence
(assume the ld is a cache miss):

1)	ld	r2,r3,0		; Get value.
2)	add	r3,r3,16	; Bump pointer
3)	add	r2,r2,1		; Increment value.
4)	sub	r4,r4,1		; Dec count.

the instruction unit will stall on instruction 3 since it attempts to
use stale data.  Even though instruction 4 theoretically could be 
executed (as it isn't dependent on the ld results), it won't be started
until the ld is complete, and instruction 3 is completed.
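
   The stall behavior described above can be sketched as a toy Python
model.  This is not a description of the real 88k pipeline: the function
name issue_cycles, the 10-cycle miss latency and the 1-cycle ALU latency
are all assumptions made for illustration, and so is the choice to also
stall when the destination register is still stale.

```python
def issue_cycles(program):
    """program: list of (dest, sources, latency) tuples, in program order.

    Returns the cycle in which each instruction issues, for a strictly
    in-order machine with one stale bit per register.
    """
    ready_at = {}   # register -> first cycle in which its value is valid
    cycle = 0
    issued = []
    for dest, sources, latency in program:
        # Stall while any register this instruction touches is stale.
        # (Checking dest as well is an assumption of this sketch.)
        touched = list(sources) + [dest]
        cycle = max([cycle] + [ready_at.get(r, 0) for r in touched])
        issued.append(cycle)
        ready_at[dest] = cycle + latency   # dest is stale until then
        cycle += 1                         # next instruction issues no earlier
    return issued

MISS = 10  # assumed cache-miss latency in cycles
program = [
    ("r2", ("r3",), MISS),  # 1) ld  r2,r3,0
    ("r3", ("r3",), 1),     # 2) add r3,r3,16
    ("r2", ("r2",), 1),     # 3) add r2,r2,1
    ("r4", ("r4",), 1),     # 4) sub r4,r4,1
]
print(issue_cycles(program))  # [0, 1, 10, 11]
```

Instruction 4 issues at cycle 11, right behind instruction 3, even though
none of its registers were ever stale.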

  Efficient code on the 88k is just as dependent on a good compiler
(or coder) as on any other RISC micro.  Scoreboarding on the 88k simply
ensures you won't use stale data inadvertently - it's up to the
compiler to produce efficient code that minimizes stalls.

>	Might be useful for long multi-cycle operations (such as FP),
>	maybe (although Earl Killian has a long analysis that
>	argues otherwise that I don't have time to condense & post;
>	maybe Earl will),

   Consider the case where multiple FP operations can be in progress at
the same time (as on the 88k).  How could you handle this situation without
a per-register good/stale indicator (read: scoreboard)?  Without a
scoreboard, the processor would seem to be required to process memory
accesses and FP operations sequentially.  This might be acceptable, but
it would probably hurt performance in FP- or memory-intensive
applications.

    So it seems to me that scoreboarding is not required as long as the
multi-cycle functional units are single-threaded.  There, the per-functional-
unit done flag seems like it would work ok.  However, to obtain the
performance boost from pipelined functional units (like on the 88k),
register scoreboarding is a requirement and not an option.
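
   A rough way to see the difference is the Python sketch below, which
compares the two interlock schemes on three independent FP operations.
Everything in it is an assumption for illustration: the 5-cycle FP
latency, the name finish_time, and the simplification that a single done
flag allows only one operation in the unit at a time.

```python
FP_LATENCY = 5  # assumed latency in cycles; not an actual 88k figure

def finish_time(ops, per_register):
    """ops: list of (dest, sources) FP operations, issued in order.

    Returns the cycle in which the last result becomes available.
    """
    ready_at = {}      # per-register scoreboard: register -> cycle value is valid
    unit_done_at = 0   # per-unit done flag: cycle the unit finishes its current op
    cycle = 0
    for dest, sources in ops:
        if per_register:
            # Pipelined unit: wait only for the registers this op touches.
            waits = [ready_at.get(r, 0) for r in (*sources, dest)]
        else:
            # Single done flag: wait for whatever the unit is working on.
            waits = [unit_done_at]
        cycle = max([cycle] + waits)
        ready_at[dest] = unit_done_at = cycle + FP_LATENCY
        cycle += 1
    return max(ready_at.values())

# Three independent operations on hypothetical registers.
ops = [
    ("r1", ("r10", "r11")),
    ("r2", ("r12", "r13")),
    ("r3", ("r14", "r15")),
]
print(finish_time(ops, per_register=True))   # 7
print(finish_time(ops, per_register=False))  # 15
```

Under these assumptions the per-register scheme overlaps all three
operations, while the done-flag scheme runs them back to back.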
-- 
Just a few more bits in the stream.

The Sneek