Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Memory-mapped floating point (was Re: ZISC computers)
Keywords: ZISC
Message-ID: <9061@winchester.mips.COM>
Date: 30 Nov 88 21:53:22 GMT
References: <22115@sgi.SGI.COM> <278@antares.UUCP> <2958@ima.ima.isc.com> <8939@winchester.mips.COM> <1044@microsoft.UUCP>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 125

In article <1044@microsoft.UUCP> w-colinp@microsoft.UUCP (Colin Plumb) writes:
>In article <8939@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>Note, of course, that this is a model of the world likely to go away,
>>for anybody who is serious about scalar floating point.  The structure
>>is only likely to happen when you have a long-cycle-count FP device,...

>Gee, this is interesting, considering that the MIPS integer multiply/divide
>is done basically the same way: move operands to special registers, issue
>instruction, if a read is attempted before the results are ready, stall.

Actually, it doesn't work this way: the mult instruction picks up the
values from the regular registers, i.e., one instruction both fetches the
data and initiates the operation.  The only "extra" instruction is the one
that moves the result back.

>On, say, a 68020, you simply can't have loads and stores complete in one
>cycle, so this wouldn't work, and the address computation on the MIPS might
>cost a cycle, but on the 29000, where the address is known by the end of
>the decode stage, the load/store can leave the chip during the execute
>phase and complete that same cycle, so you're only losing one cycle
>transferring operands (less, if you make setup time assumptions).
>This seems pretty tightly coupled to me.
>(For the confused: the 29000 has some extra address modifier bits, some of
>which are used to address the floating-point unit, and the others are
>used to indicate what sort of transaction this is - are you writing
>operand 1, operand 2, or an instruction.  You can use both address and
>data buses to send 64 bits of data in one cycle.  Reads are only 32
>bits at a time, sorry.)

>Basically, I claim that this model of FPU operation is very close
>in speed to one where the CPU has more knowledge of the FPU, if
>done properly.  Here's an example.

Maybe this is close in speed, or maybe not, or maybe I don't understand
the 29K FPU interface well enough.  Here's a small example (*always be wary
of small examples; I don't have time right now for anything more
substantive):

main() {
        double x, y, z;
        x = y + z;
}

This generates (for an R3000, and assuming data & instrs in-cache):

 #   2          double x, y, z;
 #   3          x = y + z;
        l.d     $f4, 8($sp)
        l.d     $f6, 0($sp)
        add.d   $f8, $f4, $f6
        s.d     $f8, 16($sp)

The l.d and s.d actually turn into pairs of loads & stores, and this
sequence takes 9 cycles: 4 lwc1's, a nop (which might be scheduled away),
2 cycles for the add.d, and 2 swc1's.

Assume a 29K with cache, so it has loads like the R3000's.  As far as I can
tell, a 29K would use 17:
        4 load cycles (best case: in some cases, offset calculations, or
          use of load multiple with count-setup, would take another one
          or two)
        2 writes (get the data over to the FPU)
        1 write (start the add-double)
        6 cycles (do the add.double)
        2 reads (get the 64-bit result back)
        2 stores (put the result back in memory; again assume no offset
          calculations)

Now, the biggest difference is the 6 cycles versus 2, and if this were
* instead of +, it would be 6 versus 5.
Still, as it stands, making worst-case assumptions about the R3000 (that
the nop gets left in), and some best-case assumptions about the 29K, you
get a 9:17 ratio for this case, and 12:17 for y*z; if you do the
single-precision case with the same assumptions, you get a 6:12 ratio for
y+z, and 8:12 for y*z.  So, we get:

                DP +    DP *    SP +    SP *
R3000            9      12       6       8      (subtract 1 if the nop gets scheduled)
29K             17      17      12      12

Also, it's interesting to see what would happen if you compare two
non-existent machines: an R3000* whose FP operation times increase to match
the 29K's, and a 29K* whose FP op times decrease to match the R3000's:

R3000*          13      15      10      10      An R3000 with 29K FP cycle times
29K*            13      16       8      10      A 29K whose FP operations had R3000 cycle times

What does this mean?
ANS: statistically, not much, except that the extra cycles of overhead
getting data in and out would cancel the advantage of decreasing the FP
cycle times.  A better comparison might be between R3000 and 29K*, which
shows that even with the same FP operation times, one is paying the
following % cycle count penalties: 44%, 33%, 33%, 25%, at a minimum.
If you guess that you schedule the nop away 70% of the time, you get:

R3000            8.3    11.3     5.3     7.3
29K*            13      16       8      10
% hit           57%     42%     51%     37%

Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THRU COMPILERS.
This is a microscopic example, and one should never believe in them
too much.  However, the general effect is to add at least 1 cycle for
every FP variable load/store, just by the difference in interface.
Having stood on their heads to get things like 2-cycle adds, MIPS designers
would roll in their graves before adding a cycle to each variable
load/store!  (Of course, AMD and we aim at somewhat different markets, and
each company has its own tradeoffs, so this tradeoff is not inherently
evil, and there are certainly classes of programs where this is less of a
problem.  Still...)

If I've misunderstood the 29K interface, somebody (Tim?) correct me.

>(P.S.  Question to MIPS: I think you only need to back up one extra instruction
>on an exception if that instruction is a *taken* branch.  Do you do it this
>way, or on all branches?)

I'm not sure what you mean.  When there is an exception, the Exception
Program Counter (EPC) points at the instruction that should be resumed to
restart execution.  If the exception occurred in the branch-delay slot,
the cause register's Branch Delay bit is set, so that the OS can analyze
those cases where the exception is caused by an instruction in the BD-slot,
and where you want to do something different, like emulating an FP
operation on a system that has no FP.  There is normally no difference in
handling taken or untaken branches.  The only place you'd need to figure
that out is in the case where you want to go back to the user and skip the
instruction in the delay slot.  Then, you figure out whether or not the
branch is being taken, and either take it, or not.
-- 
-john mashey    DISCLAIMER:
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086