Path: utzoo!utgpu!watmath!clyde!att!rutgers!gatech!purdue!decwrl!pyramid!prls!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Memory-mapped floating point (was Re: ZISC computers)
Keywords: ZISC
Message-ID: <9136@winchester.mips.COM>
Date: 2 Dec 88 05:05:07 GMT
References: <22115@sgi.SGI.COM> <278@antares.UUCP> <2958@ima.ima.isc.com> <8939@winchester.mips.COM> <1044@microsoft.UUCP> <9061@winchester.mips.COM> <1054@microsoft.UUCP>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 71

In article <1054@microsoft.UUCP> w-colinp@microsoft.UUCP (Colin Plumb) writes:
.....
>But sigh, I'm losing track of the point of the argument.
....
>... Of course, if it's a small array
>(up to 30 64-bit words each, or so), I'll just keep the whole thing in
>registers on the 29000...

It would be very interesting to see whether realistic high-level-language
programs actually end up allocating floating-point arrays in the
registers... especially given FORTRAN call-by-reference.....

>>Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THROUGH COMPILERS.

>Actually, that's half a cycle to loads and a cycle to stores.  But you're
>right, this is a silly sort of thing to benchmark.  How about some
>figures on the frequencies of the various ops in floating point programs,
>O great benchmark oracle? :-)  Just how often do I need to load and store?

Here's a few numbers, real quick: % of instructions that are FP
loads/stores:

                         load     store
    spice                15.2%     8.8%    (scalar)
    doduc                26.1%     8%      (scalar)
    linpack, 64-bit      34.5%    18.6%    (vector)

>And don't forget that on the 29000, if it's just a local variable, I
>store it in the register file/stack cache (guaranteed one cycle) and
>the actual memory move may be obviated entirely.
>
>Even without all this logic, I think I can safely say that for vector
>operations, the memory->fpu->memory speed is essential, thus all the
>tricky things they do in Crays, avoiding the two steps of mem->reg
>and reg->fpu.  For all-register work, like Mandelbrot kernels, it
>doesn't matter.  And in between, I dunno.  I still think it doesn't
>hurt *that* bad.  What's the silicon cost for the coprocessor interface
>on the R2000/R3000?

There are algorithms where FP values stick in the registers.  But many
very scalar real programs do many loads & stores that simply will not go
away with zillions of on-chip registers [unless they're a stack cache
like CRISP's, where the registers have addresses just like memory].
Even then, it appears that typical allocatable-on-the-stack arrays will
blow away any reasonable on-chip register cache for a while yet; a small
C sketch of why appears further down.

>Quote from the R2000 architecture manual:
>[The Exception Program Counter register]
>This register contains the virtual address of the instruction that caused
>the exception.  When that instruction resides in a branch delay slot, the
>EPC register contains the virtual address of the immediately preceding
>Branch or Jump instruction.
>
>What I'm wondering is, in the instruction sequence
>
>        foo
>        bar
>        jump    (untaken)
>        baz
>        quux
>
>where the jump is not taken, is "baz" considered to be in the jump's delay
>slot?  I.e. if baz faults, will the EPC point to it, or to the jump?
>Of course, if the jump *is* taken, then EPC will point to the jump, but
>I'm not sure if a "branch delay slot" means the instruction after a
>change-flow-of-control instruction, or after a change in the flow of
>control.  I.e. is the labelling static or dynamic?  If dynamic, an
>instruction emulator wouldn't have to recompute the condition; it would
>know that the branch should be taken to find the correct return address.

It's static, i.e., it's irrelevant whether the jump is taken or not.
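To make "static" concrete, here's a rough C sketch (not from the manual;
the helper names are made up) of what it implies for a trap handler or
instruction emulator.  It assumes only EPC plus the usual R2000
Cause-register BD bit, which is set whenever the exception was taken in
a branch delay slot:

    #include <stdint.h>

    #define CAUSE_BD (1u << 31)   /* exception was taken in a delay slot */

    /* Hypothetical helpers that decode the branch word at 'epc'; this is
       exactly the recomputation the poster was hoping to avoid. */
    extern int      branch_taken(uint32_t epc);
    extern uint32_t branch_target(uint32_t epc);

    /* Address of the instruction that actually faulted.  With static
       labelling, BD is set for "baz" above whether or not the jump was
       taken, and EPC then points at the jump itself. */
    uint32_t faulting_pc(uint32_t epc, uint32_t cause)
    {
        return (cause & CAUSE_BD) ? epc + 4 : epc;
    }

    /* Where to resume after emulating the faulting instruction.  Because
       BD says nothing about whether the branch was taken, the emulator
       has to evaluate the branch condition itself. */
    uint32_t resume_pc(uint32_t epc, uint32_t cause)
    {
        if (!(cause & CAUSE_BD))
            return epc + 4;
        return branch_taken(epc) ? branch_target(epc) : epc + 8;
    }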
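And back to the earlier point about arrays and FORTRAN call-by-reference:
here's a minimal C analogue (hypothetical routine names, nothing to do
with the programs measured above) of why a small stack-allocated array
still generates loads and stores even on a machine with plenty of
registers.  Once the array's address is passed to a separately compiled
routine, the compiler has to keep it in memory across the call:

    #include <stddef.h>

    /* Separately compiled, FORTRAN-style: it sees only an address. */
    extern void daxpy(double *y, const double *x, double a, size_t n);

    double kernel(double a)
    {
        double x[32], y[32];    /* small enough to "fit in registers"... */
        size_t i;

        for (i = 0; i < 32; i++) {
            x[i] = (double)i;
            y[i] = 0.0;
        }

        /* ...but their addresses escape here, so x[] and y[] must be
           materialized on the stack before the call, and y[] reloaded
           afterwards: those loads and stores don't go away. */
        daxpy(y, x, a, 32);
        return y[31];
    }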
-- 
-john mashey    DISCLAIMER:
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086