Path: utzoo!utgpu!water!watmath!clyde!att!pacbell!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!decwrl!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Message-ID: <2465@winchester.mips.COM>
Date: 25 Jun 88 23:40:29 GMT
References: <1082@nud.UUCP> <2438@winchester.mips.COM> <1098@nud.UUCP>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 70

In article <1098@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:

>    But why not interlock on loads as well?....

>1) More compact code...

This is a legitimate issue, although it's about a 5% effect.
We considered doing load-interlocks (and can do so in the future in an
upward-compatible way), but didn't for philosophical reasons, i.e.,
we had a rule that we wouldn't put features in that we'd have to live
with forever if we couldn't prove they were worthwhile for performance
(1%); there was some concern that for some design directions, there
might be a cycle-time hit if this were in the critical path somewhere.
Anyway, it is a legitimate topic for debate.

>2) As one other poster pointed out, it makes the coding easier and more
>bug proof since the compiler or coder doesn't have to remember to insert
>a nop on the loads with no useful delay slot instruction.  On the 88k,
>there is no way for a user program can access stale registers
>regardless of how dumb/bug ridden the compiler or writer is.

This one, however, is completely a non-issue, although, to be fair,
it's frequently asked by people running into the R2000 for the first
time:
a) It is trivial to make the assembler insert load-nops where needed.
   I doubt that our code took more than 10 minutes to write and debug,
   and I don't ever remember having problems with that in almost 4 years.
b) Any software system that couldn't be trusted with this problem,
   shouldn't be trusted with ANY problem: really, there are zillions
   of harder problems that need to be solved.
   (well, maybe not zillions, but many :-)
c) Figuring that a RISC compiler should do optimization, but worrying
   that this feature might be buggy, is like worrying about the safety
   of flying in a 747 and bringing your own seat-belt because you don't
   really trust Boeing to remember to include them :-)

....

>    The 88k scoreboarding logic doesn't perform this complex of a task.  It
>merely monitors the instruction stream and stalls the instruction unit if
>an attempt is made to execute an instruction which uses stale
>registers and then lets the instruction unit continue once the correct
>operands are available.  For example, in the following code sequence
>(assume the ld is a cache miss):
>
>1)      ld      r2,r3,0         ; Get value.
>2)      add     r3,r3,16        ; Bump pointer
>3)      add     r2,r2,1         ; Increment value.
>4)      sub     r4,r4,1         ; Dec count.
>
>the instruction unit will stall on instruction 3 since it attempts to
>use stale data.  Even though instruction 4 theoretically could be
>executed (as it isn't dependent on the ld results), it won't be started
>until the ld is complete, and instruction 3 is completed.

Thanx: we weren't sure whether it had multiple streams or not.

The example seems to indicate that the 88K indeed has a load with
2 cycles of latency (i.e., cycles 2 & 3 above).  From the example in
<1097@nud.UUCP> that gave cycles for the ld/st/ld code, one would have
thought there was only 1 latency cycle.  Can you say:
a) Are there indeed 2 latency cycles (i.e., that instruction 3 will
   indeed stall above)?
b) If so, what is the reason for the second latency slot?
   (I realize that you may not want to answer this one :-)

Note that our numbers say that in our machines, it would cost us 10-15%
in overall performance to go from 1 cycle latency to 2, and the
similarity of machines probably means about the same amount for an 88K.
-- 
-john mashey    DISCLAIMER:
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
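
For readers new to the load-nop question in point (a) of the post, here
is a minimal sketch of the idea in Python. It is an illustration only,
not the MIPS assembler's actual code. It assumes a single load delay
slot (the 1-cycle latency the post mentions for the R2000), a
straight-line instruction list with no branches or labels, and
hypothetical names throughout (Instr, LOAD_OPS, insert_load_nops).

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, sources."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


NOP = Instr("nop", None, ())

# Assumed subset of load opcodes, for illustration only.
LOAD_OPS = {"lw", "lh", "lb"}


def insert_load_nops(program: List[Instr]) -> List[Instr]:
    """Pad each load whose result is read by the very next instruction.

    Assumes one delay slot and straight-line code. A load that ends the
    list is left alone here; a real assembler would also have to handle
    branches, labels, and block boundaries.
    """
    out: List[Instr] = []
    for ins in program:
        prev = out[-1] if out else None
        if prev is not None and prev.op in LOAD_OPS and prev.dest in ins.srcs:
            out.append(NOP)  # nothing useful fills the slot, so pad it
        out.append(ins)
    return out


program = [
    Instr("lw", "r2", ("r3",)),
    Instr("addu", "r2", ("r2", "r1")),  # reads r2 right after the load
    Instr("lw", "r5", ("r3",)),
    Instr("addu", "r3", ("r3", "r6")),  # independent, so it fills the slot
    Instr("addu", "r5", ("r5", "r1")),
]
fixed = insert_load_nops(program)
print([ins.op for ins in fixed])
```

Only the first load gets a nop in this example; the second already has
an independent instruction in its delay slot. The check is a single
comparison against the previous instruction, which is consistent with
the post's claim that the feature is trivial to implement.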