Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!gaas!garner
From: garner@gaas.Sun.COM (Robert Garner)
Newsgroups: comp.arch
Subject: Re: Why is SPARC so slow?
Message-ID: <36626@sun.uucp>
Date: 16 Dec 87 00:47:11 GMT
References: <6964@apple.UUCP> <8885@sgi.SGI.COM> <1115@winchester.UUCP> <6993@apple.UUCP> <1941@ncr-sd.SanDiego.NCR.COM>
Sender: news@sun.uucp
Reply-To: garner@sun.UUCP (Robert Garner)
Followup-To: <8809@sgi.SGI.COM>
Organization: Sun Microsystems, Mountain View, CA
Lines: 151
Keywords: SPARC, RISC, Sun-4/200, R2000, M/1000
Summary: SPARC Implementations vs. Architecture, cpi & performance, register windows

The expositions on comp.arch about SPARC and its gate-array implementation
are interesting.  Some of the inaccuracies have been addressed, but others
remain unanswered.  Mashey's recent article <1115@winchester.UUCP> did clear
up the confusion surrounding the implementation of conditional branches that
was incorrectly portrayed by Forest Baskett <8809@sgi.SGI.COM> and Dennis
Russell <1941@ncr-sd.SanDiego.NCR.COM>.  Brian Case has taken a fairly
impartial look at the architecture in <6964@apple.UUCP> and <6993@apple.UUCP>.

Baskett's message was refreshing in that he accurately differentiated
between implementation and architecture.  (Quite unlike previous criticisms,
such as those in the so-called "MIPS Performance Brief.")  However, Baskett's
article continues to incorrectly portray the integer performance of
Sun-4/200 workstations and of SPARC in general.  Sun's data on MIPS
performance implies that the Sun-4/200 has approximately the same INTEGER
performance as the M/1000.  This fact is frequently ignored, since
Sun-4/200 floating-point performance is generally (but not always) lower
than the M/1000's.  Baskett correctly deduces that this is due to the use
of the Weitek 1164/1165 floating-point chips, which are slow compared to
MIPS' custom FPU.
The Fujitsu gate arrays plus the Weitek chips were a reasonable vehicle for
a SYSTEMS company like Sun to prove and quickly bring to market an OPEN,
RISC-based workstation/server plus a wide range of application SOFTWARE.
Sun, unlike MIPS, is not organized around the task of designing and
fine-tuning custom ICs.  It has even taken MIPS, whose lifeblood depends on
a fast processor, more time than expected to deliver parts at speed
(15-16 MHz).  Now that SPARC is established, Sun is working closely with
the semiconductor companies themselves; this work includes improved
floating-point implementations.

Forest concluded his article by saying:

> Since MIPS and Sun seem to be producing these systems with similar
> technologies at similar clock rates at similar times in history, these
> differences in the cycle counts for our most favorite and popular
> instructions seem to go a long way toward explaining why SPARC is so slow.

This hand waving is too fast!  A standard, off-the-shelf gate array is NOT
in the same league as a custom CMOS design.  Indeed, the fact that a gate
array matches the integer performance of a tuned, full-custom, "similar
technology" implementation is an indication of the strength of the
architecture!

Forest attempted to deduce the gate array's CPI for integer and
floating-point programs.  From this analysis, he concluded:

> These ratios [based on CPIs] are also consistent with the benchmark
> results in the Performance Brief.

Yes, floating point suffers because of the Weitek chips.  And yes, MIPS'
"Performance Brief" attempts to stigmatize SPARC by dwelling on this: its
benchmark suite and MIPS-rate calculations are conveniently based almost
entirely on floating-point programs!  But no, one cannot accurately judge
different processors by comparing their implementation-dependent
cycles-per-instruction (CPI) values.  Performance also depends on the
number of instructions (N) issued by the compiler.
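In rough terms, execution time is N x CPI x cycle time, so neither factor
alone decides the race.  The sketch below makes that concrete; every number
in it is invented purely for illustration and is NOT a measurement of any
Sun or MIPS machine:

```python
# Back-of-the-envelope model: execution time = N x CPI x cycle time.
# All figures are made up for illustration only.

def exec_time_ns(n_instr, cpi, cycle_ns):
    """Total execution time in nanoseconds."""
    return n_instr * cpi * cycle_ns

# A machine with interlocked loads: fewer instructions, higher CPI.
interlocked = exec_time_ns(n_instr=1_000_000, cpi=1.65, cycle_ns=60)

# A machine with delayed loads: lower CPI, but NOP padding inflates N.
delayed = exec_time_ns(n_instr=1_150_000, cpi=1.45, cycle_ns=60)

print(interlocked, delayed)   # ~99.0 ms vs ~100.1 ms: a near tie

# Cache misses fold into CPI the same way, which is why miss cost
# matters as much as the pipeline's "ideal" cycle counts:
def effective_cpi(base_cpi, misses_per_instr, miss_penalty_cycles):
    return base_cpi + misses_per_instr * miss_penalty_cycles

print(effective_cpi(1.3, 0.02, 15))   # 1.3 ideal rises to ~1.6
```

The machine with the lower CPI loses here because its compiler had to emit
15% more instructions; only a comparison that accounts for both N and CPI
(and the memory system) means anything.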
For example, MIPS's delayed load does not affect its CPI but increases its
N whenever NOPs are required, whereas SPARC's interlocked load decreases N
but counts against its CPI.  SPARC's register windows, and the
correspondingly fewer loads and stores, also decrease its N relative to
MIPS.  Avoiding a more detailed analysis that includes N (via simulations)
ignores the state of the compilers and their optimizations (via SPARC's
annul bit, for instance).  In general, there is always room for improvement
in compiler-generated code.

For LARGE C integer programs, the Sun-4/200 runs at about 1.65 CPI.  This
figure includes 15% loads and 5% stores AND the miss cost associated with
the 128K-byte cache and the large, asynchronous main memory.  (Baskett's
calculation assumed MIPS' instruction distribution, 20% loads and 10%
stores, which is not applicable to SPARC.  Since cache effects can dominate
performance, I suspect that the M/1000's large-C-program CPI could be near
1.6 once its cache/memory system is taken into account.)

As processor cycle times shrink, the CPI of CPUs of all types increases
because the miss cost rises: main memory access times are not scaling as
rapidly as processor cycle times.  This negative effect on CPI must be
offset by improvements in CPU pipelines and is even more pronounced in
low-CPI machines.  SPARC implementations are balanced so as to achieve
shorter cycle times without increasing CPI, while carefully considering
chip-edge bandwidth issues.  SPARC implementations include single-cycle
loads and single-cycle untaken branches.

Of course, the most error-free measure of performance is wall-clock time.
Until there are more results from large integer programs running on both
the Sun-4 and the M/1000, speculation can be unproductive.

Now, what about register windows?  In Baskett's second article
<8885@sgi.SGI.COM>, he writes:

> It may have been the best they could do but it looks like a mistake to me.
> In higher performance technologies the speed of register access becomes
> more and more critical so about the only thing you can do with register
> windows is to scale them down.  And as the number of windows goes down,
> the small gain that you might have had goes away and procedure call
> overhead goes up.  Attacking the procedure call overhead problem at
> compile time rather than at run time is a more scalable approach.

Two points: (1) It is hard to visualize the future difference between
implementing 1K-bit vs. 4K-bit register files (i.e., 32 registers versus
128 registers).  Memories can turn out larger and faster than intuition
indicates.  (2) SPARC does NOT PRECLUDE interprocedural register allocation
(IRA) optimizations and thus ALLOWS for "attacking the procedure call
overhead problem at compile time rather than at run time."  SPARC has two
mechanisms to reduce load/store traffic: register windows and IRA!

In SPARC, the procedure call and return instructions are distinct from the
ones that increment and decrement the window pointer.  (SPARC's "save" and
"restore" instructions decrement and increment the window pointer.  They
also perform an "add", which usually adjusts the stack pointer.  The
pc-relative "call" and register-indirect "jump-and-link" do NOT affect the
window pointer.)

A minimum SPARC implementation could have 40 registers: 8 ins, 8 locals,
8 outs, 8 globals, and 8 local registers for the trap handler.  Such an
implementation is not precluded by the architecture, but it would probably
imply IRA-type optimizations.  It would function as if there were no
windows, although window-based code would execute correctly, albeit
inefficiently.

Register windows have several advantages over a fixed set of registers,
besides reducing the number of loads and stores by about 30%: They work
well in LISP (incremental compilation) and object-oriented environments
(type-specific procedure linking) where IRA is impractical.
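The window mechanics described above can be sketched as a toy model.  This
is NOT the actual SPARC definition: the window count, register naming, and
the omission of overflow/underflow traps and the WIM mask are all
simplifications for illustration.

```python
# Toy model of SPARC-style overlapping register windows.
# Simplified: no overflow/underflow traps, no WIM, no condition codes.

NWINDOWS = 8   # implementation-dependent in SPARC

class WindowFile:
    def __init__(self):
        # Each window contributes 16 fresh registers (8 locals + 8 outs);
        # a window's ins physically overlap the adjacent window's outs.
        self.regs = [0] * (NWINDOWS * 16)
        self.glob = [0] * 8            # globals are shared by all windows
        self.cwp = 0                   # current window pointer

    def _slot(self, name):
        kind, idx = name[0], int(name[1:])
        base = lambda w: (w % NWINDOWS) * 16
        if kind == 'g':                # %g0-%g7: global
            return self.glob, idx
        if kind == 'l':                # %l0-%l7: private locals
            return self.regs, base(self.cwp) + idx
        if kind == 'o':                # %o0-%o7: this window's outs
            return self.regs, base(self.cwp) + 8 + idx
        if kind == 'i':                # %i0-%i7: alias the caller's outs
            return self.regs, base(self.cwp + 1) + 8 + idx
        raise ValueError(name)

    def read(self, name):
        bank, i = self._slot(name)
        return bank[i]

    def write(self, name, val):
        bank, i = self._slot(name)
        bank[i] = val

    # "save"/"restore" move the window pointer (the "add" that adjusts
    # the stack pointer is omitted here) ...
    def save(self):
        self.cwp = (self.cwp - 1) % NWINDOWS

    def restore(self):
        self.cwp = (self.cwp + 1) % NWINDOWS

    # ... whereas "call"/"jmpl" change only the PC, never the CWP.

f = WindowFile()
f.write('o0', 42)      # caller places an argument in an out register
f.save()               # callee entered: no loads or stores needed
print(f.read('i0'))    # 42 -- the argument is now the callee's in
f.restore()
print(f.read('o0'))    # 42 -- back in the caller's view
```

The point of the model is that arguments are passed by moving the window
pointer alone: the caller's outs become the callee's ins with no memory
traffic.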
They can also be used in specialized controller applications that require
extremely fast context switching: a pair of windows (32 registers) can be
allocated per context.

--------------------------------
    Robert Garner
    Sun Microsystems

P.S.  There will be two sessions devoted to SPARC at the IEEE Spring
Compcon: one will cover the architecture, compilers, and the SunOS port;
the other will cover the Fujitsu, Cypress, and BIT implementations.

DISCLAIMER: I speak for myself only and do not represent the views of
Sun Microsystems, or any other company.