Path: utzoo!utgpu!watmath!clyde!att!rutgers!gatech!purdue!decwrl!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Memory-mapped floating point (was Re: ZISC computers)
Keywords: ZISC
Message-ID: <9136@winchester.mips.COM>
Date: 2 Dec 88 05:05:07 GMT
References: <22115@sgi.SGI.COM> <278@antares.UUCP> <2958@ima.ima.isc.com> <8939@winchester.mips.COM> <1044@microsoft.UUCP> <9061@winchester.mips.COM> <1054@microsoft.UUCP>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 71

In article <1054@microsoft.UUCP> w-colinp@microsoft.UUCP (Colin Plumb) writes:
.....
>But sigh, I'm losing track of the point of the argument.  ....
>...  Of course, if it's a small array
>(up to 30 64-bit words each, or so), I'll just keep the whole thing in
>registers on the 29000...
It would be very interesting to see how often realistic high-level language
programs actually end up allocating floating-point arrays in the registers...
especially given FORTRAN call-by-reference.
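To make the call-by-reference point concrete, here's a small illustrative
fragment (mine, not compiler output; the routine name is made up): once an
array is passed by reference, the callee sees only a pointer, so the elements
generally have to live in memory and be loaded/stored on every use, no matter
how many registers the machine has.

	/* hypothetical example: FORTRAN-style call-by-reference, in C.   */
	/* The callee receives only a pointer, so a[i] and b[i] must come */
	/* from memory; the compiler can't keep the arrays in registers   */
	/* without seeing every caller and proving there is no aliasing.  */
	double dot(double *a, double *b, int n)
	{
		int i;
		double s = 0.0;

		for (i = 0; i < n; i++)
			s += a[i] * b[i];	/* two FP loads per iteration */
		return s;
	}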

>>Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THROUGH COMPILERS.

>Actually, that's half a cycle to loads and a cycle to stores.  But you're
>right, this is a silly sort of thing to benchmark.  How about some
>figures on the frequencies of the various ops in floating point programs,
>O great benchmark oracle? :-)  Just how often do I need to load and store?
Here are a few numbers, real quick -- the percentage of instructions that are FP load/store:
		load	store
spice		15.2%	8.8%		(scalar)
doduc		26.1%	8%		(scalar)
linpack, 64bit	34.5%	18.6%		(vector)

>And don't forget that on the 29000, if it's just a local variable, I
>store it in the register file/stack cache (guaranteed one cycle) and
>the actual memory move may be obviated entirely.
>
>Even without all this logic, I think I can safely say that for vector
>operations, the memory->fpu->memory speed is essential, thus all the
>tricky things they do in Crays, avoiding the two steps of mem->reg
>and reg->fpu.  For all-register work, like Mandelbrot kernels, it
>doesn't matter.  And in between, I dunno.  I still think it doesn't
>hurt *that* bad.  What's the silicon cost for the coprocessor interface
>on the R2000/R3000?

There are algorithms where FP values stick in the registers.  But many
very scalar real programs do loads & stores that simply will not
go away with zillions of on-chip registers [unless they're a stack cache
like CRISP's, where the registers have addresses just like memory].  Even
then, it appears that typical stack-allocatable arrays will blow
away any reasonable on-chip register cache for a while yet.
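As a (hypothetical) illustration of why those loads and stores don't
disappear: as soon as a local array's address escapes, or the array is
bigger than the register file, its elements have to sit in addressable
memory, and the FP memory traffic stays.  The routine names below are
made up.

	/* hypothetical example: a stack-allocated scratch array whose    */
	/* address is passed to another routine.  The elements must live  */
	/* in memory, so the FP stores remain no matter how large the     */
	/* register file is.                                              */
	extern void solve(double *work, int n);	/* assumed external routine */

	void step(void)
	{
		double work[100];	/* bigger than any register file  */
		int i;

		for (i = 0; i < 100; i++)
			work[i] = 0.0;	/* FP stores to the stack         */
		solve(work, 100);	/* address escapes: possible alias */
	}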

>Quote from the R2000 architecture manual:
>[The Exception Program Counter register]
>This register contains the virtual address of the instruction that caused
>the exception.  When that instruction resides in a branch delay slot, the
>EPC register contains the virtual address of the immediately preceding
>Branch or Jump instruction.
>
>What I'm wondering is, in the instruction sequence
>
>	foo
>	bar
>	jump (untaken)
>	baz
>	quux
>
>where the jump is not taken, is "baz" considered to be in the jump's delay
>slot?  I.e. if baz faults, will the EPC point to it, or to the jump.
>Of course, if the jump *is* taken, then EPC will point to the jump, but
>I'm not sure if a "branch delay slot" is the instruction after a change-
>flow-of-control instruction, or a change in the flow of control.  I.e.
>is the labelling static or dynamic?  If dynamic, an instruction emulator
>wouldn't have to recompute the condition; it would know that the branch
>should be taken to find the correct return address.

It's static, i.e., it's irrelevant whether the jump is taken or not: the
delay slot is simply the instruction after a branch or jump instruction,
so if baz faults, EPC points at the jump either way.
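Because the labelling is static, a handler can find the instruction that
actually faulted from nothing more than the BD bit in the Cause register;
as you note, to find the correct continuation address after emulating a
delay-slot instruction it still has to decode the branch at EPC and
evaluate its condition.  A rough sketch follows (mine, not MIPS-supplied
code; the branch-evaluation helpers at the end are assumed, not real):

	/* sketch of delay-slot handling in an exception handler.         */
	/* CAUSE_BD is bit 31 of the Cause register on the R2000/R3000.   */
	#define CAUSE_BD	0x80000000UL

	unsigned long
	faulting_pc(unsigned long epc, unsigned long cause)
	{
		if (cause & CAUSE_BD)
			return epc + 4;	/* fault was in the delay slot   */
		else
			return epc;	/* fault was the instruction at EPC */
	}

	/* To resume after emulating a delay-slot instruction, the handler */
	/* must still look at the branch itself (hypothetical helpers):    */
	/*   new_pc = branch_taken(epc) ? branch_target(epc) : epc + 8;    */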
-- 
-john mashey	DISCLAIMER: 
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086