Path: utzoo!utgpu!watmath!clyde!att!rutgers!mailrus!cwjcc!gatech!amdcad!crackle!tim
From: tim@crackle.amd.com (Tim Olson)
Newsgroups: comp.arch
Subject: Re: Memory-mapped floating point (was Re: ZISC computers)
Message-ID: <23656@amdcad.AMD.COM>
Date: 1 Dec 88 20:52:59 GMT
References: <22115@sgi.SGI.COM> <278@antares.UUCP> <2958@ima.ima.isc.com> <8939@winchester.mips.COM> <1044@microsoft.UUCP> <9061@winchester.mips.COM>
Sender: news@amdcad.AMD.COM
Reply-To: tim@crackle.amd.com (Tim Olson)
Organization: Advanced Micro Devices, Inc. Sunnyvale CA
Lines: 98
Summary:
Expires:
Sender:
Followup-To:

In article <9061@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
| Here's an example. Maybe this is close in speed, or maybe not, or maybe
| I don't understand the 29K FPU interface well enough. Here's a small
| example (*always be wary of small examples; I don't have time right now
| for anything more substantive):
| main() {
|     double x, y, z;
|     x = y + z;
| }
|
| this generates (for an R3000), and assuming data & instrs in-cache:
| #   2   double x, y, z;
| #   3   x = y + z;
|     l.d     $f4, 8($sp)
|     l.d     $f6, 0($sp)
|     add.d   $f8, $f4, $f6
|     s.d     $f8, 16($sp)
| The l.d and s.d actually turn into pairs of loads & stores, and this sequence
| takes 9 cycles: 4 lwc1's, a nop (which might be scheduled away), 2 cycles for
| the add.d, and 2 swc1's.
| Assume a 29K with cache, so it has loads like the R3000's.
| As far as I can tell, a 29K would use 17:
|     4 load cycles (best case: in some cases, offset calculations, or use of
|         load multiple with count-setup would take another one or two)
|     2 writes (get the data over to the FPU)
|     1 write (start the add-double)
|     6 cycles (do the add.double)
|     2 reads (get the 64-bit result back)
|     2 stores (put the result back in memory; again assume no offset calculations)

No, local doubles are kept in the Am29000 register file, so no loads/stores
will occur to the memory stack.
The Am29000 has two methods of generating floating-point code: either
emitting floating-point instructions (which trap in the current Am29000
implementation) or emitting inline '027 code directly. The fp instruction
code for:

    double g(double x, double y)
    {
        return x+y;
    }

(essentially the same as your test case, but I had to revise it to make it
emit *any* code) is:

    jmpi    lr0
    dadd    gr96,lr4,lr2

(i.e. return, adding the incoming parameters (lr2-3, lr4-5) into the
return-result registers (gr96,gr97) in the delay slot of the return).

The in-line '027 code for this looks like:

    const   gr96,1          ; (0x1)
    consth  gr96,65536      ; (0x10000)
    store   1,38,gr96,gr96
    store   1,32,lr4,lr5
    store   1,97,lr2,lr3
    load    1,1,gr97,gr96
    load    1,0,gr96,gr96

The first two instructions "build" the '027 instruction that is to be
performed (in this case, a DADD). The first store stores that instruction
to the '027 coprocessor. The second store transfers the 'y' parameter to
the coprocessor in a single cycle, and the third store transfers the 'x'
parameter and also starts the coprocessor operation. The Am29000 then
stalls on the load of the result lsb's (5 cycles), then grabs the msb's and
returns. This takes 12 cycles total, counting the building of the add
instruction (which would be cached in a local register if it were to be
used again).

| * instead of +, it would be 6 versus 5. Still, as it stands, making worst
| case assumptions about the R3000 (that the nop gets left in), and some
| best-case assumptions about the 29K, you get a 9:17 ratio for this case,
| a 12:17 for y*z; if you do the single-precision case with the same assumptions,
| you get a 6:12 ratio for x+y, and 8:12 for x*y.
| So, we get:
|          DP +   DP *   SP +   SP *
| R3000    9      12     6      8      (subtract 1 if the nop gets scheduled)
| 29K      17     17     12     12

Nope, it is:

           DP +   DP *   SP +   SP *
  R3000    9      12     6      8
  29K      12     12     9      9

This again assumes that the '027 instruction is not reused (which it would
be in "real" code). If it were reused, the counts would drop by 2 cycles.
| Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THRU COMPILERS.

Agreed.

    -- Tim Olson
    Advanced Micro Devices
    (tim@crackle.amd.com)