Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Memory-mapped floating point (was Re: ZISC computers)
Keywords: ZISC
Message-ID: <9061@winchester.mips.COM>
Date: 30 Nov 88 21:53:22 GMT
References: <22115@sgi.SGI.COM> <278@antares.UUCP> <2958@ima.ima.isc.com> <8939@winchester.mips.COM> <1044@microsoft.UUCP>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 125

In article <1044@microsoft.UUCP> w-colinp@microsoft.UUCP (Colin Plumb) writes:
>In article <8939@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>Note, of course, that this is a model of the world likely to go away,
>>for anybody who is serious about scalar floating point.  The structure
>>is only likely to happen when you have a long-cycle-count FP device,...

>Gee, this is interesting, considering that the MIPS integer multiply/divide
>is done basically the same way: move operands to special registers, issue
>instruction, if a read is attempted before the results are ready, stall.
Actually, it doesn't work this way: the mult instruction picks up the values
from the regular registers, i.e., one instruction both fetches the data and
starts the operation.  The only "extra" instruction is the one that
moves the result back.

>On, say, a 68020, you simply can't have loads and stores complete in one
>cycle, so this wouldn't work, and the address computation on the MIPS might
>cost a cycle, but on the 29000, where the address is known by the end of
>the decode stage, the load/store can leave the chip during the execute
>phase and complete that same cycle, so you're only losing one cycle
>transferring operands (less, if you make setup time assumptions).
>This seems pretty tightly coupled to me.

>(For the confused: the 29000 has some extra address modifier bits, some of
>which are used to address the floating-point unit, and the others are
>used to indicate what sort of transaction this is - are you writing
>operand 1, operand 2, or an instruction.  You can use both address and
>data buses to send 64 bits of data in one cycle.  Reads are only 32
>bits at a time, sorry.)

>Basically, I claim that this model of FPU operation is very close
>in speed to one where the CPU has more knowledge of the FPU, if
>done properly.

Maybe this is close in speed, or maybe not, or maybe I don't understand
the 29K FPU interface well enough.  Here's a small example (always be
wary of small examples; I don't have time right now for anything more
substantive):
main() {
	double x, y, z;
	x = y + z;
}

This generates (for an R3000, assuming data & instructions are in-cache):
 #   2		double x, y, z;
 #   3		x = y + z;
	l.d	$f4, 8($sp)
	l.d	$f6, 0($sp)
	add.d	$f8, $f4, $f6
	s.d	$f8, 16($sp)
The l.d and s.d actually turn into pairs of loads & stores, so this sequence
takes 9 cycles: 4 lwc1's, a nop (which might be scheduled away), 2 cycles for
the add.d, and 2 swc1's.

Assume a 29K with cache, so it has loads like the R3000's.
As far as I can tell, a 29K would use 17 cycles:
4 load cycles (best case: in some cases, offset calculations, or use of
	load multiple with count-setup would take another one or two)
2 writes (get the data over to the FPU)
1 write (start the add-double)
6 cycles (do the add.double)
2 reads (get the 64-bit result back)
2 stores (put the result back in memory; again assume no offset calculations)

Now, the biggest difference is the 6 cycles versus 2 for the add itself; if
this were * instead of +, it would be 6 versus 5.  Still, as it stands, making
worst-case assumptions about the R3000 (that the nop gets left in), and some
best-case assumptions about the 29K, you get a 9:17 ratio for y+z and
12:17 for y*z.  If you do the single-precision case with the same assumptions,
you get 6:12 for y+z and 8:12 for y*z.
So, in cycles, we get:
	DP +	DP *	SP +	SP *
R3000	9	12	6	8	(subtract 1 if the nop gets scheduled)
29K	17	17	12	12
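
For anyone who wants to check the arithmetic, the tallies above can be
reproduced with a short C sketch.  The function names are purely
illustrative (they come from neither vendor's documentation), and the
per-step counts are just the ones assumed in this article: in-cache
loads and stores, and 6-cycle FP operations on the 29K.  The R3000
single-precision latencies (2 for add, 4 for multiply) are inferred
from the table's totals rather than stated outright.

```c
/* Cycle tally for x = y + z (or y * z), using the counts assumed in the text.
 * All names are hypothetical; only the numbers come from the article.
 * words = 32-bit words per operand: 2 for double, 1 for single. */

/* R3000: loads go straight into FP registers, so the only overhead
 * is the load-delay nop (pass nop = 0 if it gets scheduled away). */
int r3000_cycles(int words, int fp_op, int nop)
{
    int loads  = 2 * words;   /* two operands, one lwc1 per word */
    int stores = words;       /* one swc1 per result word */
    return loads + nop + fp_op + stores;
}

/* 29K with a memory-mapped FPU: operands pass through the integer
 * registers on the way in, and the result passes through on the way out. */
int am29k_cycles(int words, int fp_op)
{
    int loads  = 2 * words;   /* memory -> CPU registers */
    int writes = words;       /* address + data buses move 64 bits per write */
    int start  = 1;           /* the write that starts the operation */
    int reads  = words;       /* results come back 32 bits at a time */
    int stores = words;       /* CPU registers -> memory */
    return loads + writes + start + fp_op + reads + stores;
}
```

For instance, r3000_cycles(2, 2, 1) gives 9 and am29k_cycles(2, 6) gives 17,
matching the DP + column; passing the other machine's FP latency gives the
what-if figures for a hypothetical 29K with R3000 operation times.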

Also, it's interesting to see what happens if you compare two nonexistent
machines: an R3000* whose FP operation times increase to match the 29K's,
and a 29K* whose FP operation times decrease to match the R3000's:
R3000*	13	15	10	10	(an R3000 with 29K FP cycle times)
29K*	13	16	8	10	(a 29K with R3000 FP cycle times)
What does this mean?
ANS: statistically, not much, except that the extra cycles of overhead
getting data in and out would cancel the advantage of decreasing the
FP cycle times.  A better comparison might be between R3000 and 29K*,
which shows that even with the same FP operation times, one is paying the
following cycle-count penalties: 44%, 33%, 33%, 25%, at a minimum.
If you guess that you schedule the nop away 70% of the time, you get:
R3000	8.3	11.3	5.3	7.3
29K*	13	16	8	10
% hit	57%	42%	51%	37%
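
The percentage rows are plain arithmetic, sketched here in C so the
rounding is visible.  Both function names are made up for illustration;
the 0.7 is the guessed nop-scheduling rate from the text, not a
measurement.

```c
/* Expected R3000 cycle count when the load-delay nop is scheduled away
 * with probability p_sched.  with_nop is the count with the nop left in. */
double expected_cycles(int with_nop, double p_sched)
{
    return with_nop - p_sched;   /* the nop costs 1 cycle, saved p_sched of the time */
}

/* Extra cycles of `slow` relative to `fast`, as a percentage of `fast`. */
double penalty_percent(double fast, double slow)
{
    return 100.0 * (slow - fast) / fast;
}
```

For the DP + column, penalty_percent(9, 13) is about 44.4, and
penalty_percent(expected_cycles(9, 0.7), 13) is about 56.6, which rounds
to the 57% shown.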

Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THRU COMPILERS.
This is a microscopic example, and one should never believe in them
too much.  However, the general effect is to add at least 1 cycle
for every FP variable load/store, just by the difference in interface.
Having stood on their heads to get things like 2-cycle adds, MIPS designers
would roll in their graves before adding a cycle to each variable load/store!
(Of course, AMD and we aim at somewhat different markets, and each company
has its own tradeoffs, so this tradeoff is not inherently evil, and there
are certainly classes of programs where this is less of a problem.  Still...)
If I've misunderstood the 29K interface, somebody (Tim?) correct me.

>(P.S. Question to MIPS: I think you only need to back up one extra instruction
>on an exception if that instruction is a *taken* branch.  Do you do it this
>way, or on all branches?)

I'm not sure what you mean.
When there is an exception, the Exception Program Counter (EPC) points at
the instruction that should be resumed to restart execution.  If the
exception occurred in the branch-delay slot, the Cause register's
Branch Delay bit is set, so that the OS can pick out those cases where
the exception is caused by an instruction in the BD-slot and where you
want to do something different, like emulating an FP operation on a system
that has no FP.

There is normally no difference in handling taken or untaken branches.
The only place you'd need to figure that out is in the case where you
want to go back to the user and skip the instruction in the delay slot.
Then, you figure out whether or not the branch is being taken, and either
take it, or not.
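
A minimal C sketch of that last case, as an OS handler might compute it.
This is an illustration, not actual kernel code: the function name is
hypothetical, branch_taken and branch_target are assumed to come from the
handler decoding the branch itself, and it relies on the R3000 convention
that the Branch Delay bit is bit 31 of the Cause register and that EPC
points at the branch when that bit is set.

```c
#include <stdint.h>

#define CAUSE_BD (UINT32_C(1) << 31)   /* Branch Delay bit in the Cause register */

/* Address at which to resume so that the faulting instruction is skipped
 * (e.g. after the OS has emulated it).  branch_taken and branch_target are
 * hypothetical inputs: the handler must work them out from the branch at EPC. */
uint32_t resume_pc_skipping(uint32_t epc, uint32_t cause,
                            int branch_taken, uint32_t branch_target)
{
    if (!(cause & CAUSE_BD))
        return epc + 4;        /* ordinary case: step past the instruction */

    /* The faulting instruction sits in the delay slot at epc + 4;
     * epc itself points at the branch. */
    return branch_taken ? branch_target : epc + 8;
}
```

If the instruction is simply to be re-executed rather than skipped, none of
this is needed: returning to EPC re-runs the branch and its delay slot.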
-- 
-john mashey	DISCLAIMER: 
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086