Path: utzoo!utgpu!watmath!clyde!att!rutgers!mit-eddie!uw-beaver!microsoft!w-colinp
From: w-colinp@microsoft.UUCP (Colin Plumb)
Newsgroups: comp.arch
Subject: Re: Memory-mapped floating point (was Re: ZISC computers)
Keywords: ZISC
Message-ID: <1054@microsoft.UUCP>
Date: 1 Dec 88 13:13:57 GMT
References: <22115@sgi.SGI.COM> <278@antares.UUCP> <2958@ima.ima.isc.com> <8939@winchester.mips.COM> <1044@microsoft.UUCP> <9061@winchester.mips.COM>
Reply-To: w-colinp@microsoft.UUCP (Colin Plumb)
Organization: Microsoft Corp., Redmond WA
Lines: 99
Confusion: Microsoft Corp., Redmond WA

In article <9061@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Actually, it doesn't work this way: the mult instruction picks up the values
>from the regular registers, i.e., one instruction both fetches the data and
>initiates the instruction.  The only "extra" instruction is the one that
>moves the result back.

I stand corrected.  Still, if I use the 29027 simply as an integer multiplier,
I can do exactly the same thing.

>Here's an example.  Maybe this is close in speed, or maybe not, or maybe
>I don't understand the 29K FPU interface well enough.  here's a small
>example (*always be wary of small examples; I don't have time right now
>for anything more substantive):
>main() {
>	double x, y, z;
>	x = y + z;
>}

This is a *bit* heavy on the loads and stores - register allocation, anyone?

While people do add up vectors (which is 1 access per add, not 3), the
heaviest thing I can think of that's popular is dot products.  Now if
I'm allowed to put the 29027 into pipeline mode, it can eat 4 words of
data every 3 clocks, which is gonna tax anybody's memory system, but
I'll factor time to do the f.p. op out of the comparison.

You need to get 4 words per step of the dot product from memory to the
floating point chip.  That's 4 loads, either way, and an additional 2
stores for the 29027.  That's only 50% overhead, assuming all cache hits.

On the 29027, I can just leave the multiply-accumulate opcode in the
instruction register and keep feeding it operands, while on the MIPS
chip I have to issue add.d and mul.d, which is an extra 2 cycles penalty
it pays.

But sigh, I'm losing track of the point of the argument.  If I try and
code up a dot-product loop, it gets worse, as I can unroll the pointer
incrementing on the MIPS chip by using the base-plus-offset addressing
mode, which the 29000 doesn't have.  Of course, if it's a small array
(up to 30 64-bit words each, or so), I'll just keep the whole thing in
registers on the 29000...

>Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THROUGH COMPILERS.
>This is a microscopic example, and one should never believe in them
>too much.  However, the general effect is to add at least 1 cycle
>for every FP variable load/store, just by the difference in interface.
>Having stood on their heads to get things like 2-cycle adds, MIPS designers
>would roll in their graves before adding a cycle to each variable load/store!

Actually, that's half a cycle to loads and a cycle to stores.  But you're
right, this is a silly sort of thing to benchmark.  How about some
figures on the frequencies of the various ops in floating point programs,
O great benchmark oracle? :-)  Just how often do I need to load and store?

And don't forget that on the 29000, if it's just a local variable, I
store it in the register file/stack cache (guaranteed one cycle) and
the actual memory move may be obviated entirely.

Even without all this logic, I think I can safely say that for vector
operations, the memory->fpu->memory speed is essential, thus all the
tricky things they do in Crays, avoiding the two steps of mem->reg
and reg->fpu.  For all-register work, like Mandelbrot kernels, it
doesn't matter.  And in between, I dunno.  I still think it doesn't
hurt *that* bad.  What's the silicon cost for the coprocessor interface
on the R2000/R3000?

>>(P.S. Question to MIPS: I think you only need to back up one extra instruction
>>on an exception if that instruction is a *taken* branch.  Do you do it this
>>way, or on all branches?)
>
>I'm not sure what you mean.

Quote from the R2000 architecture manual:
[The Exception Program Counter register]
This register contains the virtual address of the instruction that caused
the exception.  When that instruction resides in a branch delay slot, the
EPC register contains the virtual address of the immediately preceding
Branch or Jump instruction.

What I'm wondering is, in the instruction sequence

	foo
	bar
	jump (untaken)
	baz
	quux

where the jump is not taken, is "baz" considered to be in the jump's delay
slot?  I.e., if baz faults, will the EPC point to it, or to the jump?
Of course, if the jump *is* taken, then EPC will point to the jump, but
I'm not sure if a "branch delay slot" is the instruction after a change-
flow-of-control instruction, or a change in the flow of control.  I.e.
is the labelling static or dynamic?  If dynamic, an instruction emulator
wouldn't have to recompute the condition; it would know that the branch
should be taken to find the correct return address.

(It's late... er, early.  Sorry if this could use a little polishing.)
-- 
	-Colin (microsof!w-colinp@sun.com)