Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!husc6!uwvax!vanvleck!uwmcsd1!ig!agate!ucbvax!decwrl!pyramid!prls!mips!earl
From: earl@mips.COM (Earl Killian)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Message-ID: <2459@gumby.mips.COM>
Date: 25 Jun 88 15:25:09 GMT
References: <1082@nud.UUCP> <2438@winchester.mips.COM> <1098@nud.UUCP>
Lines: 54

In article <1098@nud.UUCP>, tom@nud.UUCP (Tom Armistead) writes:
> >SCOREBOARDING
> >1) If you look at where scoreboarding came from (I remember it
> >first from CDC 6600 (see Bell & Newell book, for example), but
> >there may be earlier ones), you had machines with:
> >	a) Many independent functional units
> >	b) Many multiple-cycle operations (integer add was 3 cycles in 6600,
> >	for example, and the FP operations were of course longer).
> >	c) Multiple memory pipes, with a long latency to memory, and no caches.
> >2) Note that RISC micros probably have a), but normally, only the FP ops,
> >and maybe a few others have long multi-cycle operations, and they sure don't
> >have multiple memory pipes, and they often have caches.
>       ^^^^^^^^^^^^^^^^^^^^^
>     The 88k has this as well as a & b (on board FP unit).

John meant the ability to do multiple data references per cycle, not
the ability to do an instruction fetch and a data fetch per cycle (which
any well-designed RISC supports).  So I don't think the 88k qualifies
here.

>    Consider the case where multiple FP operations can be in progress at 
> the same time (as on the 88k).  How could you handle this situation without
> a per register good/stale indicator (read scoreboard)?  Without a 
> scoreboard, the processor would seem to be required to process memory
> accesses and FP operations sequentially.  This might be acceptable but
> it probably will impact performance adversely in FP or memory intensive
> applications.

You don't need a scoreboard to do what you say, but when the number of
pending results is large, it is the most appropriate technique.  The
alternative, providing a register field comparator for each pending
result, is appropriate when the number of pending results is small.
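
A minimal sketch of the two interlock schemes contrasted above, in
Python purely for illustration.  The class and method names are
hypothetical, and neither class models the 88000 or any MIPS part; the
default sizes (32 registers, 2 pending results) are assumptions.

```python
class Scoreboard:
    """One 'result pending' bit per register (size is an assumption)."""

    def __init__(self, num_regs=32):
        self.pending = [False] * num_regs

    def issue(self, dest):
        # Mark the destination register as stale until the result arrives.
        self.pending[dest] = True

    def complete(self, dest):
        self.pending[dest] = False

    def must_stall(self, srcs, dest):
        # Stall if any source or the destination is still being produced.
        return any(self.pending[r] for r in list(srcs) + [dest])


class ComparatorInterlock:
    """One destination latch plus comparators per in-flight result."""

    def __init__(self, max_pending=2):
        self.max_pending = max_pending
        self.in_flight = []  # destination register numbers

    def issue(self, dest):
        if len(self.in_flight) >= self.max_pending:
            raise RuntimeError("no free pending-result slot")
        self.in_flight.append(dest)

    def complete(self, dest):
        self.in_flight.remove(dest)

    def must_stall(self, srcs, dest):
        # Compare every register field against every pending destination.
        regs = list(srcs) + [dest]
        return any(p == r for p in self.in_flight for r in regs)
```

Both classes give the same stall decision.  In this sketch the
scoreboard's state is fixed by the register count however many results
are pending, while the comparator scheme does one comparison per pending
result per register field, which is why it only looks attractive when
few results can be outstanding.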

But my question in all this is why did the 88000 choose to fully
pipeline floating point and thus allow such a large number of pending
results?  I understand why the 6600 and its successors did it, but the
same analysis for the 88000 suggests it is unnecessary.  You don't
have the cache bandwidth to make fp pipelining useful even on large
highly vectorizable problems (i.e., 32 bits per cycle isn't enough).  You
can't feed the fp pipeline fast enough.

For example, for Linpack on a load/store machine with 32 bits/cycle of
memory bandwidth, you need only start a new add every 8 cycles and a new
multiply every 8 cycles to run at memory speed.  With fp latencies < 8
cycles, no pipelining is necessary in the individual fp units.  Go
through the 24 Livermore loops and count the number of words
loaded/stored per fp op and you'll see similar results.
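
The arithmetic behind that argument can be sketched as follows.  This is
an illustration under stated assumptions, not a model of either chip:
the function names are hypothetical, the 8-cycle issue interval is taken
from the paragraph above as given, and the data-traffic bound assumes
64-bit operands and a Linpack inner loop of the form y[i] += a*x[i]
(load x, load y, store y).

```python
def data_traffic_cycles(doubles_moved, bus_bits=32, operand_bits=64):
    """Lower bound on cycles per iteration from data traffic alone.

    Assumes every operand crosses a bus that moves `bus_bits` per cycle.
    """
    return doubles_moved * operand_bits // bus_bits


def needs_pipelining(fp_latency_cycles, issue_interval_cycles=8):
    """True if an unpipelined fp unit could not keep up with memory.

    An unpipelined unit can start a new op only once the previous one
    finishes, i.e. every `fp_latency_cycles` cycles.  The default
    interval of 8 is the figure quoted in the post.  Treating a latency
    equal to the interval as sufficient is an assumption of this sketch.
    """
    return fp_latency_cycles > issue_interval_cycles


if __name__ == "__main__":
    # Assumed inner loop: two 64-bit loads and one 64-bit store.
    print("data-traffic lower bound:", data_traffic_cycles(3), "cycles")
    for latency in (2, 5, 8, 10):
        print("latency", latency, "needs pipelining:",
              needs_pipelining(latency))
```

The data-traffic figure is only a lower bound; the post's 8 cycles is
not derived here.  Either way, the comparison is the same: if an fp op
finishes before the next one of its kind can be fed, pipelining inside
that unit buys nothing.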

The price you paid for pipelining appears to be enormous: the 88100's
fp op latencies average 2.5x longer than the R3010's when measured in
cycles; 3x longer when measured in ns.
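
As a side calculation, the two ratios above imply a clock-period ratio
between the parts.  The helper below is a hypothetical name, and the
result is only as precise as the rounded 2.5x and 3x figures; it also
assumes both averages cover the same set of operations.

```python
def clock_period_ratio(latency_ratio_ns, latency_ratio_cycles):
    """Ratio of cycle times implied by latency ratios in ns and in cycles.

    latency_ns = latency_cycles * cycle_time, so dividing the ns ratio
    by the cycle ratio leaves the ratio of the two cycle times.
    """
    return latency_ratio_ns / latency_ratio_cycles


if __name__ == "__main__":
    # Figures from the post: 3x in ns, 2.5x in cycles.
    print(round(clock_period_ratio(3.0, 2.5), 2))
```

With the post's figures this comes out to roughly 1.2, i.e. the 88100's
cycle would be about 20% longer than the R3010's.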
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086