Path: utzoo!attcan!uunet!wyse!mips!earl
From: earl@mips.COM (Earl Killian)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Message-ID: <2484@gumby.mips.COM>
Date: 1 Jul 88 02:47:01 GMT
References: <1082@nud.UUCP> <2438@winchester.mips.COM> <1098@nud.UUCP> <2459@gumby.mips.COM> <1110@nud.UUCP>
Lines: 76
In-reply-to: tom@nud.UUCP's message of 29 Jun 88 18:23:09 GMT

In article <1110@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:

eak> You don't have the cache bandwidth to make fp pipelining useful
eak> even on large highly vectorizable problems (i.e. 32b per cycle
eak> isn't enough).  You can't feed the fp pipeline fast enough.

TA# Assuming the FP operands are not in cache, this is true.  However,
TA# there will be some class of problems which can make effective use
TA# of the FP pipelining and assuming that FP pipelining has no bad
TA# side effects (see paragraph below), it only makes sense to provide
TA# the feature.

.........................................................................

I *was* talking about when the operands are in cache.  Let me repeat
the example from my original posting:

  For example, for linpack on a load/store machine with 32b/cycle you
  need only start a new add every 8 cycles and a new multiply every 8
  cycles to run at cache speed.  With fp latencies < 8 cycles, no
  pipelining is necessary in the individual fp units.

Go through the 24 livermore loops and count the number of words
loaded/stored per fp op and you'll see similar results.

What's going on here?  The DAXPY inner loop of linpack

	DY(I) = DY(I) + DA*DX(I)

has an fp add, an fp multiply, two loads, and one store (ignoring loop
overhead, which can be effectively eliminated by unrolling).  With a
32b interface to the data cache, the two loads and one store take 6
cycles.  The two fp ops take 2 cycles to issue.  Total: 8 cycles for
2 flops.  So if your add latency and multiply latency are <= 8 cycles,
you can execute DAXPY without any pipelining, running at 75% of your
cache bandwidth.

Let's formalize this.  Let m be the ratio of memory reference cycles
to floating point ops for a computation.  Let f1, f2, ..., fn be the
frequencies of the different floating point ops (they sum to 1).  Let
t1, t2, ..., tn be the latencies of the different floating point ops,
and let r1, r2, ..., rn be the pipeline rates of the different op
units.  Assume the computation is infinitely parallelizable (the most
favorable circumstance for pipelining), and consider a load/store
architecture with one instruction issue per cycle.  Then the time to
do one flop is bounded below by m+1 cycles.  A given functional unit
will require pipelining (ri < ti) to run at this rate iff ti*fi > m+1.
The pipelining required is a new op every ri = (m+1)/fi cycles.
Alternatively, the latency required for no pipelining is
ti = (m+1)/fi.

Example: linpack with one 32-bit access per cycle (e.g. R3000 or
88000): n=2, f1=0.5, f2=0.5, m=3.  Thus r1=r2=8.

Example: linpack with one 64-bit access per cycle, or single-precision
linpack with one 32-bit access per cycle: m=1.5.  So 5-cycle adds and
multiplies are sufficient (R3010 latencies work fine); or, with
pipelining, a new op started every 5 cycles.

So to refute my claim, you need to show important problems with low m
values and with some fi close to one.  One such computation is
something like livermore loops 11 and 12 in single precision, for
which n=1, f1=1, m=2, so you need a new add every 3 cycles.  But of
course the R3010's 2-cycle add handles that just fine without
pipelining.

Another such computation is something like the multiple-vector
techniques for LU-decomposition.  E.g. for 8-way vectors, m=0.625,
f1=0.5, f2=0.5, so you need a 3.25-cycle add and multiply to run at
peak rate in single precision, and a 4.5-cycle add and multiply
(m=1.25) in double precision.  Still, I would not consider these to
prove the need for pipelining with 32b cache interfaces.  Do you know
a computation which does?  What are its parameters in the above terms?
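
For concreteness, here is the DAXPY loop written out in C, with the
32b-interface accounting above restated as comments.  This is a
sketch of the bookkeeping, not code for any particular machine, and
the function name is mine:

/* DY(I) = DY(I) + DA*DX(I), double precision.
 * Per iteration, on a load/store machine with a 32b data cache path
 * (so each 64-bit load or store takes 2 cycles):
 *   load dx[i]    2 cycles
 *   load dy[i]    2 cycles
 *   multiply      1 issue cycle
 *   add           1 issue cycle
 *   store dy[i]   2 cycles
 * Total: 8 cycles for 2 flops, with loop overhead amortized by
 * unrolling; latencies <= 8 cycles need no pipelining.
 */
void daxpy(int n, double da, double *dx, double *dy)
{
    int i;
    for (i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}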
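
And a few lines to grind through ti = (m+1)/fi for the parameter sets
quoted above (the numbers are the ones from this posting, nothing
new; "needed" is just my name for the formula):

#include <stdio.h>

/* Latency usable without pipelining (equivalently, the pipeline
 * rate required with it): ti = (m+1)/fi, from the lower bound of
 * m+1 cycles per flop on a single-issue load/store machine. */
static double needed(double m, double fi)
{
    return (m + 1.0) / fi;
}

int main(void)
{
    printf("linpack, 32b/cycle:  %g\n", needed(3.0, 0.5));   /* 8    */
    printf("linpack, 64b/cycle:  %g\n", needed(1.5, 0.5));   /* 5    */
    printf("livermore 11/12 sp:  %g\n", needed(2.0, 1.0));   /* 3    */
    printf("8-way LU, sp:        %g\n", needed(0.625, 0.5)); /* 3.25 */
    printf("8-way LU, dp:        %g\n", needed(1.25, 0.5));  /* 4.5  */
    return 0;
}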

.........................................................................
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086