Path: utzoo!utgpu!watmath!clyde!att!rutgers!mit-eddie!uw-beaver!microsoft!w-colinp
From: w-colinp@microsoft.UUCP (Colin Plumb)
Newsgroups: comp.arch
Subject: Re: Memory-mapped floating point (was Re: ZISC computers)
Keywords: ZISC
Message-ID: <1054@microsoft.UUCP>
Date: 1 Dec 88 13:13:57 GMT
References: <22115@sgi.SGI.COM> <278@antares.UUCP> <2958@ima.ima.isc.com> <8939@winchester.mips.COM> <1044@microsoft.UUCP> <9061@winchester.mips.COM>
Reply-To: w-colinp@microsoft.UUCP (Colin Plumb)
Organization: Microsoft Corp., Redmond WA
Lines: 99
Confusion: Microsoft Corp., Redmond WA

In article <9061@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Actually, it doesn't work this way: the mult instruction picks up the values
>from the regular registers, i.e., one instruction both fetches the data and
>initiates the instruction.  The only "extra" instruction is the one that
>moves the result back.

I stand corrected.  Still, if I use the 29027 simply as an integer
multiplier, I can do exactly the same thing.

>Here's an example.  Maybe this is close in speed, or maybe not, or maybe
>I don't understand the 29K FPU interface well enough.  Here's a small
>example (*always be wary of small examples; I don't have time right now
>for anything more substantive*):
>
>main() {
>	double x, y, z;
>	x = y + z;
>}

This is a *bit* heavy on the loads and stores - register allocation, anyone?
While people do add up vectors (which is 1 access per add, not 3), the
heaviest thing I can think of that's popular is dot products.  Now if I'm
allowed to put the 29027 into pipeline mode, it can eat 4 words of data
every 3 clocks, which is gonna tax anybody's memory system, but I'll factor
the time to do the f.p. op out of the comparison.

You need to get 4 words per step of the dot product from memory to the
floating-point chip.  That's 4 loads, either way, and an additional 2 stores
for the 29027.  That's only 50% overhead, assuming all cache hits.
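For reference, here's a minimal C sketch of the dot-product kernel I'm counting memory traffic for (not from either posting; just the obvious loop).  Each step consumes two doubles, i.e. 4 32-bit words of loads on either machine; the memory-mapped 29027 then needs 2 extra stores to push the operands at the chip:

```c
#include <stddef.h>

/* Minimal sketch of the dot-product step under discussion.  Each
 * iteration moves two doubles (4 32-bit words) from memory toward the
 * FPU: those are the 4 loads per multiply-accumulate step, either way.
 * On the memory-mapped 29027, pushing the fetched operands at the chip
 * costs an additional 2 stores on top of them. */
static double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];     /* one multiply-accumulate per step */
    return sum;
}
```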
On the 29027, I can just leave the multiply-accumulate opcode in the
instruction register and keep feeding it operands, while on the MIPS chip I
have to issue add.d and mul.d, which is an extra 2-cycle penalty it pays.

But sigh, I'm losing track of the point of the argument.  If I try to code
up a dot-product loop, it gets worse, as I can unroll the pointer
incrementing on the MIPS chip by using the base-plus-offset addressing mode,
which the 29000 doesn't have.  Of course, if it's a small array (up to 30
64-bit words each, or so), I'll just keep the whole thing in registers on
the 29000...

>Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THROUGH COMPILERS.
>This is a microscopic example, and one should never believe in them
>too much.  However, the general effect is to add at least 1 cycle
>for every FP variable load/store, just by the difference in interface.
>Having stood on their heads to get things like 2-cycle adds, MIPS designers
>would roll in their graves before adding a cycle to each variable load/store!

Actually, that's half a cycle for loads and a cycle for stores.  But you're
right, this is a silly sort of thing to benchmark.  How about some figures
on the frequencies of the various ops in floating-point programs, O great
benchmark oracle? :-)  Just how often do I need to load and store?  And
don't forget that on the 29000, if it's just a local variable, I store it in
the register file/stack cache (guaranteed one cycle) and the actual memory
move may be obviated entirely.

Even without all this logic, I think I can safely say that for vector
operations, the memory->fpu->memory speed is essential; thus all the tricky
things they do in Crays, avoiding the two steps of mem->reg and reg->fpu.
For all-register work, like Mandelbrot kernels, it doesn't matter.  And in
between, I dunno.  I still think it doesn't hurt *that* bad.  What's the
silicon cost for the coprocessor interface on the R2000/R3000?

>>(P.S.
>>Question to MIPS: I think you only need to back up one extra instruction
>>on an exception if that instruction is a *taken* branch.  Do you do it this
>>way, or on all branches?)
>
>I'm not sure what you mean.

Quote from the R2000 architecture manual:

	[The Exception Program Counter register] This register contains
	the virtual address of the instruction that caused the exception.
	When that instruction resides in a branch delay slot, the EPC
	register contains the virtual address of the immediately preceding
	Branch or Jump instruction.

What I'm wondering is, in the instruction sequence

	foo
	bar
	jump (untaken)
	baz
	quux

where the jump is not taken, is "baz" considered to be in the jump's delay
slot?  I.e., if baz faults, will the EPC point to it, or to the jump?  Of
course, if the jump *is* taken, then EPC will point to the jump, but I'm
not sure if a "branch delay slot" means the instruction after a
change-flow-of-control instruction, or the instruction after a change in
the flow of control.  I.e., is the labelling static or dynamic?  If
dynamic, an instruction emulator wouldn't have to recompute the branch
condition; it would know that the branch should be taken to find the
correct return address.

(It's late... er, early.  Sorry if this could use a little polishing.)
-- 
	-Colin (microsof!w-colinp@sun.com)