Path: utzoo!utgpu!water!watmath!clyde!mcdchg!nud!tom
From: tom@nud.UUCP (Tom Armistead)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Message-ID: <1098@nud.UUCP>
Date: 22 Jun 88 18:13:50 GMT
References: <1082@nud.UUCP> <2438@winchester.mips.COM>
Reply-To: tom@nud.UUCP (Tom Armistead)
Organization: Motorola Microcomputer Division, Tempe, Az.
Lines: 100

In article <2438@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>Following are:
>1) How the MIPS R2000/R3000 does it.

>result of b) - e), and it isn't ready, it is interlocked by the hardware,
>rather than software. You can consider this "partial scoreboarding",
>although we don't particularly call it that, as the main pipeline does
>completely freeze, rather than trying to issue the following instruction.

This is similar (sort of) to what the 88k does, although we do call it
scoreboarding since we interlock on a per-register stale flag rather
than a per-functional-unit done flag. We use the scoreboard on memory
loads as well as the FP unit.

>a) Note that the 2 things that are NOT interlocked (load and branch):
>    Have small (1-cycle delays) that are unlikely to be shortened,
>    and are undesirable to lengthen
>    Have about a 70% chance of having a 1-cycle delay slot filled
>    by software.

But why not interlock on loads as well? It removes the *requirement*
that a load be followed by a nop if something useful can't be found to
put there, and yet still allows the compiler/coder to put a useful
instruction in the delay slot if one exists. This provides the
following benefits.

1) More compact code, since the loads that don't have a useful delay
   slot instruction don't need a nop. Of course, this is the less
   common case (only 30% of loads, going by the 70% fill rate quoted
   above), so the difference will likely be slight, but it is
   nonetheless an improvement over not interlocking.

2) As one other poster pointed out, it makes the coding easier and
   more bug proof, since the compiler or coder doesn't have to remember
   to insert a nop on the loads with no useful delay slot instruction.
   On the 88k, there is no way a user program can access stale
   registers, regardless of how dumb/bug ridden the compiler or writer
   is.

I see the above 2 items as improvements created by interlocking on load
results, and since the interlock circuitry is already partially in
place (for the FP instructions), I don't see any additional
disadvantages. So why not interlock on loads?

>SCOREBOARDING
>1) If you look at where scoreboarding came from (I remember it first from
>CDC 6600 (see Bell & Newell book, for example), but there may be earlier ones),
>you had machines with:
>    a) Many independent functional units
>    b) Many multiple-cycle operations (integer add was 3 cycles in 6600,
>    for example, and the FP operations were of course longer).
>    c) Multiple memory pipes, with a long latency to memory, and no caches.
>2) Note that RISC micros probably have a), but normally, only the FP ops,
>and maybe a few others have long multi-cycle operations, and they sure don't
>have multiple memory pipes, and they often have caches.
      ^^^^^^^^^^^^^^^^^^^^^
The 88k has this as well as a) and b) (on board FP unit).

>On a machine with a simple load-interlock, and 1-cycle-latency loads,
>the machine would stall for 1 cycle before instruction 3.
>Suppose you do more complex scoreboarding, i.e., you continue attempting
>to issue instructions? Then, you might do 1, 2, try 3 and save it,
>and then either come back and do 3 again (probably) or go on to 4 and
>discover that it stalls also. If one looks at integer code sequences,
>one discovers that it is hard to discover many things that don't quickly
>depend on something else (barring, of course, Multiflow-style compiler
>technology and hardware designs not currently feasible in VLSI micros).

The 88k scoreboarding logic doesn't perform a task this complex. It
merely monitors the instruction stream, stalls the instruction unit if
an attempt is made to execute an instruction which uses stale
registers, and then lets the instruction unit continue once the correct
operands are available. For example, in the following code sequence
(assume the ld is a cache miss):

    1) ld   r2,r3,0     ; Get value.
    2) add  r3,r3,16    ; Bump pointer
    3) add  r2,r2,1     ; Increment value.
    4) sub  r4,r4,1     ; Dec count.

the instruction unit will stall on instruction 3 since it attempts to
use stale data. Even though instruction 4 theoretically could be
executed (as it isn't dependent on the ld results), it won't be started
until the ld is complete and instruction 3 has been issued.

Efficient code on the 88k is just as dependent on a good compiler (or
coder) as on any other RISC micro. Scoreboarding on the 88k simply
ensures you won't use stale data inadvertently - it's up to the
compiler to produce efficient code that minimizes stalls.

>    Might be useful for long multi-cycle operations (such as FP),
>    maybe (although Earl Killian has a long analysis that
>    argues otherwise that I don't have time to condense & post;
>    maybe Earl will),

Consider the case where multiple FP operations can be in progress at
the same time (as on the 88k). How could you handle this situation
without a per-register good/stale indicator (read: scoreboard)? Without
a scoreboard, the processor would seem to be required to process memory
accesses and FP operations sequentially. This might be acceptable, but
it will probably hurt performance in FP or memory intensive
applications.

So it seems to me that scoreboarding is not required as long as the
multi-cycle functional units are single threaded. There, a
per-functional-unit done flag seems like it would work ok. However, to
obtain the performance boost from pipelined functional units (as on the
88k), register scoreboarding is a requirement and not an option.

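To make the stall on the four-instruction ld/add/add/sub sequence
concrete, here is a minimal sketch in Python of a per-register stale
flag with strictly in-order issue. It is a toy model, not a description
of the actual 88k hardware: the function name simulate, the tuple
layout, the 10-cycle miss latency and the 1-cycle ALU latency are all
assumptions made up for illustration, and the sketch checks source
registers only.

```python
# Toy model of a per-register "stale" scoreboard with in-order issue.
# Everything here (names, latencies) is an illustrative assumption and
# is not taken from 88k documentation.

MISS_LATENCY = 10  # assumed cycles until a cache-missing ld delivers r2
ALU_LATENCY = 1    # assumed latency of add/sub

# (instruction number, destination register, source registers, latency)
PROGRAM = [
    (1, "r2", ["r3"], MISS_LATENCY),  # ld  r2,r3,0
    (2, "r3", ["r3"], ALU_LATENCY),   # add r3,r3,16
    (3, "r2", ["r2"], ALU_LATENCY),   # add r2,r2,1
    (4, "r4", ["r4"], ALU_LATENCY),   # sub r4,r4,1
]


def simulate(program):
    """Return {instruction number: cycle at which it issues}."""
    ready_at = {}     # register -> first cycle its pending result is readable
    issue_cycle = {}
    cycle = 0
    for number, dest, sources, latency in program:
        # Stall the whole instruction unit while any source is stale.
        # Later instructions wait too, because issue is strictly in order.
        while any(ready_at.get(reg, 0) > cycle for reg in sources):
            cycle += 1
        issue_cycle[number] = cycle
        # Mark the destination stale until the result arrives.
        ready_at[dest] = cycle + latency
        cycle += 1
    return issue_cycle


if __name__ == "__main__":
    for number, cycle in simulate(PROGRAM).items():
        print(f"instruction {number} issues at cycle {cycle}")
```

Under these assumed latencies, instructions 1 and 2 issue back to back
at cycles 0 and 1, instruction 3 waits until cycle 10 for r2, and
instruction 4 issues at cycle 11 even though r4 was never stale. Moving
the sub ahead of the second add in PROGRAM lets it issue at cycle 2,
which is the kind of scheduling the compiler is still responsible for.
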
--
Just a few more bits in the stream.
The Sneek