Path: utzoo!utgpu!watmath!clyde!att!rutgers!mailrus!cwjcc!gatech!amdcad!crackle!tim
From: tim@crackle.amd.com (Tim Olson)
Newsgroups: comp.arch
Subject: Re: Memory-mapped floating point (was Re: ZISC computers)
Message-ID: <23656@amdcad.AMD.COM>
Date: 1 Dec 88 20:52:59 GMT
References: <22115@sgi.SGI.COM> <278@antares.UUCP> <2958@ima.ima.isc.com> <8939@winchester.mips.COM> <1044@microsoft.UUCP> <9061@winchester.mips.COM>
Sender: news@amdcad.AMD.COM
Reply-To: tim@crackle.amd.com (Tim Olson)
Organization: Advanced Micro Devices, Inc. Sunnyvale CA
Lines: 98

In article <9061@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
| Here's an example.  Maybe this is close in speed, or maybe not, or maybe
| I don't understand the 29K FPU interface well enough.  Here's a small
| example (*always be wary of small examples; I don't have time right now
| for anything more substantive):
| main() {
| 	double x, y, z;
| 	x = y + z;
| }
| 
| this generates (for an R3000), and assuming data & instrs in-cache:
|  #   2		double x, y, z;
|  #   3		x = y + z;
| 	l.d	$f4, 8($sp)
| 	l.d	$f6, 0($sp)
| 	add.d	$f8, $f4, $f6
| 	s.d	$f8, 16($sp)
| The l.d and s.d actually turn into pairs of loads  & stores, and this sequence
| takes 9 cycles: 4 lwc1's, a nop (which might be scheduled away), 2 cycles for
| the add.d, and 2 swc1's.
| Assume a 29K with cache, so it has loads like the R3000's.
| As far as I can tell, a 29K would use 17:
| 4 load cycles (best case: in some cases, offset calculations, or use of
| 	load multiple with count-setup would take another one or two)
| 2 writes (get the data over to the FPU)
| 1 write (start the add-double)
| 6 cycles (do the add.double)
| 2 reads (get the 64-bit result back)
| 2 stores (put the result back in memory; again assume no offset calculations)

No, local doubles are kept in the Am29000 register file, so no
loads/stores will occur to the memory stack.  The Am29000 has two
methods of generating floating-point code, either emitting floating
point instructions (which trap in the current Am29000 implementation) or
emitting inline '027 code directly.  The fp instruction code for:

double
g(double x, double y)
{
	return x+y;
}

(essentially the same as your test case, but I had to revise it to make
it emit *any* code) is:

	jmpi	lr0
	dadd	gr96,lr4,lr2

(i.e., return, adding the incoming parameters (lr2-3, lr4-5) into the
return-result registers (gr96, gr97) in the delay slot of the return.)

The in-line '027 code for this looks like:

	const	gr96,1	; (0x1)
	consth	gr96,65536	; (0x10000)
	store	1,38,gr96,gr96
	store	1,32,lr4,lr5
	store	1,97,lr2,lr3
	load	1,1,gr97,gr96
	load	1,0,gr96,gr96

The first two instructions "build" the '027 instruction that is to be
performed (in this case, a DADD).  The first store writes that
instruction to the '027 coprocessor.  The second store transfers the 'y'
parameter to the coprocessor in a single cycle, and the third store
transfers the 'x' parameter and starts the coprocessor operation.  The
Am29000 then stalls on the load of the result lsb's (5 cycles), then
grabs the msb's and returns.  This takes 12 cycles total,
counting the building of the add instruction (which would be cached in a
local register if it were to be used again). 

| * instead of +, it would be 6 versus 5.  Still, as it stands, making worst
| case assumptions about the R3000 (that the nop gets left in), and some
| best-case assumptions about the 29K, you get a 9:17 ratio for this case,
| a 12:17 for y*z;  if you do the single-precision case with the same assumptions,
| you get a 6:12 ratio for x+y, and 8:12 for x*y.
| So, we get:
| 	DP +	DP *	SP +	SP *
| R3000	9	12	6	8	(subtract 1 if the nop gets scheduled)
| 29K	17	17	12	12

Nope, it is:

	DP +	DP *	SP +	SP *
R3000	9	12	6	8
29K	12	12	9	9

This again assumes that the '027 instruction is not reused (which it
would be in "real" code).  If it were reused, the counts would drop by 2
cycles.

| Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THRU COMPILERS.

Agreed.

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)