Path: utzoo!mnetor!uunet!husc6!rutgers!umd5!ames!sgi!baskett
From: baskett@baskett
Newsgroups: comp.arch
Subject: Why is SPARC so slow?
Message-ID: <8809@sgi.SGI.COM>
Date: 10 Dec 87 02:42:03 GMT
References: <1078@quacky.UUCP>
Sender: daemon@sgi.SGI.COM
Organization: Silicon Graphics Inc, Mountain View, CA
Lines: 148
Summary: Loads, stores, branches, and floating point ops take too many cycles.

I have been asking myself the question, why is SPARC so slow?  I've been
sparked by John Mashey's fascinating "Performance Brief" and by continuing
reports from our customers that our own 4D/70 12.5 MHz MIPS-based
workstations outperform Sun-4's on their CPU-intensive applications,
including image rendering and mechanical design and analysis, in a manner
consistent with the benchmarks reported in the Performance Brief.

SPARC is not slow compared to traditional microprocessors, granted.  But as
a RISC microprocessor it seems to have some problems, at least in the first
two implementations.  Below are my observations so far on why the Fujitsu
version of SPARC is slow compared to the MIPS RISC microprocessor.  At least
some of the problems of the Fujitsu version (the one in the Sun-4) are also
present in the Cypress version, according to the preliminary data sheets.
These problems don't necessarily mean that the SPARC architecture has
problems, but I'd be reluctant to accept SPARC as the basis for an
Application Binary Interface standard until I saw some evidence that
high-performance implementations of SPARC are possible.

Loads and stores are slow.  Loads on both implementations take two cycles
and stores take three cycles for 32-bit words, compared to one cycle for
each on a MIPS R2000.  There are several interrelated reasons for this
situation.  Briefly, they are lack of a separate address adder, lack of
split instruction and data caches, and inability to cycle the address and
data bus twice per main clock cycle.  Details follow.

Lack of a separate address adder for loads and stores.  The R2000 can start
the address generation for a load or a store in the second stage of the
pipeline because the register access is fast and an address adder is
present.  Thus the load or store can "execute" in stage 3 of the pipeline,
just like the rest of the instructions.  On the SPARCs so far, address
generation appears to use the regular ALU in the third stage of the
pipeline, so the actual cache access cannot begin until the fourth stage.
For a load, you then need an extra stage to get the data back.

Lack of split instruction and data caches.  Because both SPARCs have a
single cache rather than the separate instruction and data caches of the
R2000, the extra pipeline stage needed to get the data back for a load
can't be used to fetch an instruction anyway.  For a store, the relevant
cache line is read on the fourth cycle and updated and written back on the
fifth cycle.  So there are two cycles that can't be used to fetch
instructions, bringing the total cost of a store to three cycles.

Inability to cycle the address and data bus twice per main clock cycle.
The SPARC chips don't double-cycle the address and data buses, so during
both loads and stores no instructions can be fetched.  The R2000 also has
a single address bus and a single data bus, but it can use them twice per
cycle.  That lets you split your cache into an instruction cache and a
data cache and make use of the extra bandwidth by fetching an instruction
every cycle in spite of loads and stores.
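To make the fetch-slot arithmetic concrete, here is a small illustrative C
fragment of my own (not anything from either vendor): it charges the
per-class cycle costs above to a synthetic instruction mix of 20 percent
loads, 10 percent stores, and 70 percent everything else, and it ignores
cache misses, branches, and everything else that would complicate the
picture.

/*
 * Illustrative sketch only: charge extra cycles when a unified,
 * single-ported cache lets loads and stores steal instruction-fetch
 * slots, versus a split I/D cache with double-cycled buses.
 * Per-class costs (load = 2, store = 3, other = 1 on the "unified"
 * machine; 1 cycle each on the "split" machine) follow the estimates
 * above; the instruction mix is a made-up example.
 */
#include <stdio.h>

enum op { OP_LOAD, OP_STORE, OP_OTHER };

static long run(const enum op *code, int n, int unified_cache)
{
    long cycles = 0;
    for (int i = 0; i < n; i++) {
        switch (code[i]) {
        case OP_LOAD:
            /* unified cache: the data access blocks the next fetch */
            cycles += unified_cache ? 2 : 1;
            break;
        case OP_STORE:
            /* unified cache: read-modify-write of the line blocks two fetches */
            cycles += unified_cache ? 3 : 1;
            break;
        default:
            cycles += 1;
        }
    }
    return cycles;
}

int main(void)
{
    /* synthetic mix: 20% loads, 10% stores, 70% other (100 instructions) */
    enum op code[100];
    for (int i = 0; i < 100; i++)
        code[i] = (i % 10 < 2) ? OP_LOAD : (i % 10 == 2) ? OP_STORE : OP_OTHER;

    printf("unified cache: %ld cycles\n", run(code, 100, 1));
    printf("split cache:   %ld cycles\n", run(code, 100, 0));
    return 0;
}

On that toy mix the unified-cache machine spends 140 cycles where the
split-cache machine spends 100, which is the flavor of the penalty being
described.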
However, if register windows eliminated enough loads and stores, these two
SPARC implementations might represent reasonable engineering design
decisions.  Both benchmarks and careful studies of code sequences indicate
that the load and store savings are not that great, generally less than
five percent.  We can also ask whether the overhead of register windows
leaves enough time in the second stage of the pipe to do an address add,
assuming we could fit such an adder into the implementation.  (Windowed
registers take up a lot of space.)

Branches are slow.  Since taken branches need only one delay slot, there
must be an address adder for the program counter.  But with a single cache
you have to decide early what the next instruction address is.  Both SPARC
chips always decide that a branch will be taken, so there is an additional
cycle penalty when the condition isn't satisfied and you have to junk the
instruction you fetched and fetch the right one.  On the R2000, the
instruction address comes out in the second half of the cycle on the
double-cycled address bus, so you have time to check the condition in the
first half of the cycle and put out the right target address every time.
The separate instruction and data caches run only at single-cycle rates,
but they run a half cycle out of phase with each other, so it all works
out.  (Pretty slick, don't you think?)  The first delay slot can be used
by a useful instruction a majority of the time on both architectures, so
they are even there.

However, the SPARC architecture requires that conditional branches be based
on a value in a condition code register rather than the value in a regular
register, as in the MIPS architecture.  Honest people can (and do) disagree
about which approach is better.  But the compiler studies I have seen
indicate that, on the average, you need an extra instruction for setting
the condition code a noticeable fraction of the time.  So my guesstimate is
that the average conditional branch on a SPARC is 2.5 cycles and on an
R2000 is 1.5 cycles.  (Further study is needed here.)

Floating point is very slow.  Here we know only about the Fujitsu version
of the architecture.  The Cypress version is likely to be better, since the
Weitek parts that the Fujitsu version uses are rather old designs (WTL 1164
and WTL 1165) and Weitek's more recent designs are faster.  Nevertheless,
here are the numbers (from the data sheets).  I use cycle counts just to
keep it simple.

                        Fujitsu SPARC       MIPS R2000
                          SP     DP          SP     DP
        add/subtract       9     11           2      2
        multiply           9     12           4      5
        divide            34     65          12     19

These are the total latency times from start to finish for both systems.
Both systems can execute other integer operations in parallel with floating
point operations after the floating point operations are launched.
However, the launch cost on SPARC is two cycles while it is one cycle on
the R2000.  The launch time is included in the above table.  Both systems
appear able to do simultaneous multiplies and adds with no pipelining.
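To see how those latencies add up, here is another rough sketch of my own,
a back-of-the-envelope model rather than a measurement: it serializes a
multiply-add chain of the kind you get in a dot product accumulation, uses
the single precision entries from the table above, assumes each operation
must wait for its predecessor, and ignores loads, stores, and loop
overhead.

/*
 * Back-of-the-envelope sketch: floating point cycles for a fully
 * serialized multiply-add chain, using the single precision latencies
 * (launch included) from the table above.  The trip count and the
 * serialization assumption are purely illustrative.
 */
#include <stdio.h>

struct fp_latency {
    const char *name;
    int add;   /* SP add/subtract latency, cycles */
    int mul;   /* SP multiply latency, cycles */
};

int main(void)
{
    const struct fp_latency chip[] = {
        { "Fujitsu SPARC", 9, 9 },
        { "MIPS R2000",    2, 4 },
    };
    const int iterations = 100;  /* hypothetical loop trip count */

    for (int i = 0; i < 2; i++) {
        /* one multiply feeding one accumulate per iteration */
        long cycles = (long)iterations * (chip[i].mul + chip[i].add);
        printf("%-14s ~%ld fp cycles for %d multiply-adds\n",
               chip[i].name, cycles, iterations);
    }
    return 0;
}

On those assumptions the Fujitsu SPARC spends roughly three times the
floating point cycles of the R2000 on the same chain (18 versus 6 cycles
per multiply-add), before counting any overlap with integer work.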
If we summarize these cycles per instruction by looking at a conservative
estimate of instruction frequencies, we get the following results, first
for integer programs and then for single precision floating point programs.

                     SPARC    MIPS    frequency
                     cycles   cycles  (percent)
        loads          2        1        20
        stores         3        1        10
        branches       2.5      1.5      15
        most other     1        1        55
        rare other    >1       >1        ~0

        average        1.63     1.08     ratio = 1.51

                     SPARC    MIPS    frequency
                     cycles   cycles  (percent)
        loads          2        1        20
        stores         3        1        10
        branches       2.5      1.5      15
        most other     1        1        45
        sp fp other    9        2        10

        average        2.43     1.18     ratio = 2.06

These ratios are also consistent with the benchmark results in the
Performance Brief.  Since MIPS and Sun seem to be producing these systems
with similar technologies, at similar clock rates, at similar times in
history, these differences in the cycle counts for our favorite and most
popular instructions seem to go a long way toward explaining why SPARC is
so slow.

Forest Baskett
Silicon Graphics Computer Systems