Path: utzoo!utgpu!watmath!clyde!att!rutgers!mailrus!cornell!uw-beaver!microsoft!w-colinp
From: w-colinp@microsoft.UUCP (Colin Plumb)
Newsgroups: comp.arch
Subject: Memory-mapped floating point (was Re: ZISC computers)
Keywords: ZISC
Message-ID: <1044@microsoft.UUCP>
Date: 30 Nov 88 02:16:40 GMT
References: <22115@sgi.SGI.COM> <278@antares.UUCP> <2958@ima.ima.isc.com> <8939@winchester.mips.COM>
Reply-To: w-colinp@microsoft.UUCP (Colin Plumb)
Organization: Microsoft Corp., Redmond WA
Lines: 43

In article <8939@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Note, of course, that this is a model of the world likely to go away,
>for anybody who is serious about scalar floating point. The structure
>is only likely to happen when you have a long-cycle-count FP device,
>which is still fast enough to be desirable, but where the addition of a few
>cycles' overhead doesn't clobber the performance. As micros with
>tight-coupled FPUs (MIPS R3000/R3010, Moto 88K, Cypress SPARC + TI 8847)
>and low-cycle count operations (2-10 cycles for 64-bit add & mult)
>become more common, you just lose too much performance moving data
>around to afford that kind of interface, at least in a scalar unit.

Gee, this is interesting, considering that the MIPS integer
multiply/divide is done basically the same way: move operands to
special registers, issue instruction, and if a read is attempted before
the results are ready, stall. The only differences are that the
registers are on-chip and that the instruction that starts the multiply
isn't a store.

On, say, a 68020, you simply can't have loads and stores complete in
one cycle, so this wouldn't work, and the address computation on the
MIPS might cost a cycle. But on the 29000, where the address is known
by the end of the decode stage, the load/store can leave the chip
during the execute phase and complete that same cycle, so you're only
losing one cycle transferring operands (less, if you make setup time
assumptions).
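To put rough numbers on the "move operands, issue, stall on an early read" sequence, here is a toy cycle-counting sketch. Everything in it is an assumption made for illustration: the 12-cycle latency, the one-cycle cost for each of the operand transfer, the issue, and the read, and the names MulUnit and run. None of it is taken from a MIPS or 29000 data sheet.

```python
from dataclasses import dataclass

MUL_LATENCY = 12  # assumed latency in cycles; not a real chip's figure


@dataclass
class MulUnit:
    """Toy model of a multiply unit with an interlocked result register."""
    cycle: int = 0      # current cycle count
    a: int = 0          # operand register 1
    b: int = 0          # operand register 2
    ready_at: int = 0   # cycle at which the result becomes readable
    stalls: int = 0     # cycles lost waiting on the result

    def write_operands(self, a, b):
        # Assumption: both operands reach the unit in a single cycle.
        self.a, self.b = a, b
        self.cycle += 1

    def start(self):
        # Issuing the operation costs one cycle; the result is ready
        # MUL_LATENCY cycles after the issue completes.
        self.cycle += 1
        self.ready_at = self.cycle + MUL_LATENCY

    def other_work(self, n):
        # n cycles of unrelated instructions overlapping the multiply.
        self.cycle += n

    def read_result(self):
        # A read before the result is ready stalls until it is.
        if self.cycle < self.ready_at:
            self.stalls += self.ready_at - self.cycle
            self.cycle = self.ready_at
        self.cycle += 1
        return self.a * self.b


def run(overlap):
    """Return (product, total cycles, stall cycles) for one multiply."""
    u = MulUnit()
    u.write_operands(6, 7)
    u.start()
    u.other_work(overlap)
    product = u.read_result()
    return product, u.cycle, u.stalls


for overlap in (0, 12, 20):
    print(overlap, run(overlap))
```

In this model, reading immediately and reading after 12 cycles of other work both finish at cycle 15; the stall just fills whatever part of the latency the program did not cover. The fixed cost of the interface is the two cycles spent moving operands and issuing the operation.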
This seems pretty tightly coupled to me.

(For the confused: the 29000 has some extra address modifier bits, some
of which are used to address the floating-point unit, and the others
are used to indicate what sort of transaction this is - are you writing
operand 1, operand 2, or an instruction. You can use both address and
data buses to send 64 bits of data in one cycle. Reads are only 32
bits at a time, sorry.)

Basically, I claim that this model of FPU operation is very close in
speed to one where the CPU has more knowledge of the FPU, if done
properly.

(P.S. Question to MIPS: I think you only need to back up one extra
instruction on an exception if that instruction is a *taken* branch.
Do you do it this way, or on all branches?)
-- 
	-Colin (microsof!w-colinp@sun.com)