Path: utzoo!utgpu!watmath!clyde!att!rutgers!mailrus!cornell!uw-beaver!microsoft!w-colinp
From: w-colinp@microsoft.UUCP (Colin Plumb)
Newsgroups: comp.arch
Subject: Memory-mapped floating point (was Re: ZISC computers)
Keywords: ZISC
Message-ID: <1044@microsoft.UUCP>
Date: 30 Nov 88 02:16:40 GMT
References: <22115@sgi.SGI.COM> <278@antares.UUCP> <2958@ima.ima.isc.com> <8939@winchester.mips.COM>
Reply-To: w-colinp@microsoft.UUCP (Colin Plumb)
Organization: Microsoft Corp., Redmond WA
Lines: 43

In article <8939@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Note, of course, that this is a model of the world likely to go away,
>for anybody who is serious about scalar floating point.  The structure
>is only likely to happen when you have a long-cycle-count FP device,
>which is still fast enough to be desirable, but where the addition of a few
>cycles' overhead doesn't clobber the performance.  As micros with
>tight-coupled FPUs (MIPS R3000/R3010, Moto 88K, Cypress SPARC + TI 8847)
>and low-cycle count operations (2-10 cycles for 64-bit add & mult)
>become more common, you just lose too much performance moving data
>around to afford that kind of interface, at least in a scalar unit.

Gee, this is interesting, considering that the MIPS integer multiply/divide
is done basically the same way: move operands to special registers, issue the
instruction, and stall if a read is attempted before the results are ready.

The only difference is that the registers are on-chip and that the
instruction that starts the multiply isn't a store.
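
To make the stall-on-read model concrete, here is a rough cycle-counting
sketch in C. Everything in it is made up for illustration: the names
(MulUnit, mul_issue, mul_read) are hypothetical and the 12-cycle latency is
an assumed figure, not a number from any data sheet. It only shows the
shape of the interlock: issuing costs one cycle, and a read waits until
the result is ready.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed latency of the multiply, in cycles (illustrative only). */
enum { MUL_LATENCY = 12 };

/* Hypothetical model of a multiply unit with an interlocked result. */
typedef struct {
    unsigned long long result;   /* value the unit will deliver */
    unsigned long ready_at;      /* first cycle at which a read succeeds */
} MulUnit;

/* Start a multiply at cycle `now`. The issue itself costs one cycle;
 * the operation then proceeds in the background. Returns the cycle
 * at which the CPU can run its next instruction. */
static unsigned long mul_issue(MulUnit *u, unsigned long now,
                               unsigned a, unsigned b)
{
    assert(u != NULL);
    u->result = (unsigned long long)a * b;
    u->ready_at = now + MUL_LATENCY;
    return now + 1;
}

/* Read the result at cycle `now`. If the result is not ready, the CPU
 * stalls until it is; the read itself then costs one cycle. Returns
 * the cycle at which the CPU can run its next instruction. */
static unsigned long mul_read(const MulUnit *u, unsigned long now,
                              unsigned long long *out)
{
    assert(u != NULL && out != NULL);
    if (now < u->ready_at)
        now = u->ready_at;       /* interlock: wait for the unit */
    *out = u->result;
    return now + 1;
}
```

In this model a read issued right after the multiply stalls for the full
latency, while a read issued after enough independent instructions costs
only its own cycle, which is the whole point of the interlock.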

On, say, a 68020, you simply can't have loads and stores complete in one
cycle, so this wouldn't work, and the address computation on the MIPS might
cost a cycle, but on the 29000, where the address is known by the end of
the decode stage, the load/store can leave the chip during the execute
phase and complete that same cycle, so you're only losing one cycle
transferring operands (less, if you make setup time assumptions).

This seems pretty tightly coupled to me.

(For the confused: the 29000 has some extra address modifier bits, some of
which are used to address the floating-point unit, and the others are
used to indicate what sort of transaction this is - are you writing
operand 1, operand 2, or an instruction.  You can use both address and
data buses to send 64 bits of data in one cycle.  Reads are only 32
bits at a time, sorry.)
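
A sketch of what such a transfer might look like, in C. The bit layout
here is invented for illustration and is NOT the real 29000 encoding: I am
assuming one modifier bit selects the FPU and two more say what kind of
transfer it is, and that the high half of a 64-bit operand rides on the
address bus while the low half rides on the data bus. All the names
(BusCycle, fpu_write64, and so on) are hypothetical.

```c
#include <stdint.h>

/* Kind of transfer, as described above (encoding values are assumed). */
typedef enum {
    XFER_OPERAND1    = 0,
    XFER_OPERAND2    = 1,
    XFER_INSTRUCTION = 2
} XferKind;

/* Assumed modifier layout: bit 2 selects the FPU, bits 1..0 hold XferKind. */
enum { FPU_SELECT = 1u << 2, XFER_MASK = 3u };

/* One bus cycle as seen by the FPU (hypothetical structure). */
typedef struct {
    uint8_t  modifier;   /* address modifier bits */
    uint32_t addr_bus;   /* assumed to carry the high 32 bits of the operand */
    uint32_t data_bus;   /* assumed to carry the low 32 bits of the operand */
} BusCycle;

/* CPU side: build the single cycle that writes 64 bits to the FPU. */
static BusCycle fpu_write64(XferKind kind, uint64_t bits)
{
    BusCycle c;
    c.modifier = (uint8_t)(FPU_SELECT | (unsigned)kind);
    c.addr_bus = (uint32_t)(bits >> 32);
    c.data_bus = (uint32_t)(bits & 0xFFFFFFFFu);
    return c;
}

/* FPU side: is this cycle addressed to us, and what is it? */
static int targets_fpu(BusCycle c)
{
    return (c.modifier & FPU_SELECT) != 0;
}

static XferKind xfer_kind(BusCycle c)
{
    return (XferKind)(c.modifier & XFER_MASK);
}

/* FPU side: reassemble the 64-bit value from both buses. */
static uint64_t fpu_latch(BusCycle c)
{
    return ((uint64_t)c.addr_bus << 32) | c.data_bus;
}
```

Since the address bus is carrying data rather than an address, the
modifier bits are the only thing telling the FPU that the cycle is for it.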

Basically, I claim that this model of FPU operation is very close
in speed to one where the CPU has more knowledge of the FPU, if
done properly.

(P.S. Question to MIPS: I think you only need to back up one extra instruction
on an exception if that instruction is a *taken* branch.  Do you do it this
way, or on all branches?)
-- 
	-Colin (microsof!w-colinp@sun.com)