Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!gaas!garner
From: garner@gaas.Sun.COM (Robert Garner)
Newsgroups: comp.arch
Subject: Re: Why is SPARC so slow?
Message-ID: <36626@sun.uucp>
Date: 16 Dec 87 00:47:11 GMT
References: <6964@apple.UUCP> <8885@sgi.SGI.COM> <1115@winchester.UUCP> <6993@apple.UUCP> <1941@ncr-sd.SanDiego.NCR.COM>
Sender: news@sun.uucp
Reply-To: garner@sun.UUCP (Robert Garner)
Followup-To: <8809@sgi.SGI.COM>
Organization: Sun Microsystems, Mountain View, CA
Lines: 151
Keywords: SPARC, RISC, Sun-4/200, R2000, M/1000
Summary: SPARC Implementations vs. Architecture, cpi & performance, register windows


The expositions on comp.arch about SPARC and the gate array implementation 
are interesting.  Some of the inaccuracies have been addressed, 
but others remain.   Mashey's recent article <1115@winchester.UUCP>
did clear up the confusion surrounding the implementation of conditional
branches that was incorrectly portrayed by Forest Baskett <8809@sgi.SGI.COM>
and Dennis Russell <1941@ncr-sd.SanDiego.NCR.COM>.  Brian Case has taken
a fairly impartial look at the architecture in <6964@apple.UUCP>
and <6993@apple.UUCP>.

Baskett's message was refreshing in that he accurately differentiated
between implementation and architecture.  (Quite unlike previous
criticisms, such as those in the so-called "MIPS Performance Brief.")
  
However, Baskett's article continues to incorrectly portray the integer
performance of Sun-4/200 workstations and SPARC in general.
Sun's data on MIPS performance implies that the Sun-4/200
has approximately the same INTEGER performance as the M/1000.
This fact is frequently ignored since the Sun-4/200's floating-point
performance is generally (but not always) lower than the M/1000's.
Baskett correctly deduces that this is due to the use of the Weitek
1164/54 floating-point chips, which are slow compared to MIPS' custom FPU.   

The Fujitsu gate arrays plus the Weitek chips were a reasonable vehicle 
for a SYSTEMS company like Sun to prove and quickly bring to market an OPEN,
RISC-based workstation/server plus a wide range of application SOFTWARE.
Sun, unlike MIPS, is not organized around the task of designing 
and fine-tuning custom ICs.  It has even taken MIPS, 
whose lifeblood depends on a fast processor, more time than expected
to deliver parts at speed (15-16 MHz).  Now that SPARC is
established, Sun is working closely with the semiconductor companies
themselves; this work includes improved floating-point implementations.

Forest concluded his article by saying:

> Since MIPS and Sun seem to be producing these systems with similar
> technologies at similar clock rates at similar times in history, these
> differences in the cycle counts for our most favorite and popular
> instructions seem to go a long way toward explaining why SPARC is so slow.

This hand-waving is too fast!  A standard, off-the-shelf gate array is 
NOT in the same league as a custom CMOS design.  Indeed, the fact that 
a gate array matches the integer performance of a tuned, full-custom, 
"similar technology" implementation is an indication of the strength
of the architecture!
 

Forest attempted to deduce the gate-array CPI value for integer 
and floating-point programs.  From this analysis, he concluded:

> These ratios [based on CPIs] are also consistent with the benchmark 
> results in the Performance Brief. 

Yes, floating-point suffers because of the Weitek chips.
And yes, MIPS' "Performance Brief" attempts to stigmatize SPARC 
by dwelling on this:  its benchmark suite and MIPS-rate calculations
are conveniently based almost entirely on floating-point programs!

But no, one cannot accurately judge different processors
by comparing their implementation-dependent "cycles per instruction" 
(CPI) values.  Performance also depends on the number of instructions (N) 
issued by a compiler.  For example, MIPS' delayed load does not affect
its CPI but increases its N when NOPs are required, whereas 
SPARC's interlocked load decreases N but counts against its CPI.  
SPARC's register windows, with their correspondingly fewer loads 
and stores, also decrease its N relative to MIPS.  By avoiding a more 
detailed analysis that includes N (via simulations), one ignores the state
of the compilers and their optimizations (SPARC's annul bit, 
for instance).  In general, there is always room for improvement in
compiler-generated code.
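
To make the N-versus-CPI point concrete, here is a toy calculation.
(Every number in it is my own assumption, not a measurement from
either machine.)  A delayed load that needs a filler NOP inflates N;
an interlocked load that stalls inflates CPI; yet the cycle count,
N * CPI, can come out identical:

    /* Toy model: execution cycles = N * CPI.  All numbers assumed. */
    #include <stdio.h>

    int main(void)
    {
        double base_n   = 1000000.0;     /* useful instructions (assumed) */
        double loads    = 0.15 * base_n; /* 15% loads, per the text above */
        double unfilled = 0.30;          /* delay slots the compiler can't
                                            fill -- pure assumption       */

        /* Delayed load: CPI stays 1.0; N grows by the filler NOPs. */
        double n_del   = base_n + loads * unfilled;
        double cpi_del = 1.0;

        /* Interlocked load: N stays put; CPI absorbs the stalls. */
        double n_int   = base_n;
        double cpi_int = 1.0 + (loads * unfilled) / base_n;

        printf("delayed:     cycles = %.0f\n", n_del * cpi_del);
        printf("interlocked: cycles = %.0f\n", n_int * cpi_int);
        return 0;
    }

Both printfs report 1045000 cycles.  Quoting the second machine's
higher CPI without its lower N would "prove" it slower when it is not.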

For LARGE, integer C programs, the Sun-4/200 runs at about 1.65 CPI.  
This includes 15% loads and 5% stores AND the miss cost associated
with the 128K-byte cache and the large, asynchronous main memory.
(Baskett's calculation assumed MIPS' instruction distribution, 20% loads 
and 10% stores, which is not applicable to SPARC.  Since cache effects 
can dominate performance, I suspect that the M/1000's CPI on large 
C programs could be near 1.6 once its cache/memory system is taken 
into account.)
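
For the curious, a CPI near 1.65 decomposes plausibly as follows.
This is only a sketch: the 15%/5% load/store mix is from above, but
the base CPI, miss rate, and miss penalty are values I am assuming
for illustration, not Sun figures.

    /* Effective CPI = base CPI + refs/instr * miss rate * miss penalty.
       Only the 15%/5% load/store mix comes from the article; the rest
       are assumed, illustrative values. */
    #include <stdio.h>

    int main(void)
    {
        double base_cpi = 1.30;              /* pipeline-only CPI (assumed) */
        double refs     = 1.0 + 0.15 + 0.05; /* ifetch + loads + stores     */
        double missrate = 0.02;              /* 128K-byte cache (assumed)   */
        double penalty  = 15.0;              /* cycles per miss (assumed)   */

        printf("effective CPI = %.2f\n",
               base_cpi + refs * missrate * penalty);
        return 0;
    }

This prints 1.66, in the neighborhood of the 1.65 quoted above.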

As processor cycle time shrinks, the CPI for CPUs of all types increases 
because the miss cost rises.  This is because main memory access
times are not scaling as rapidly as processor cycle times.  
This negative effect on CPIs must be offset by improvements 
in CPU pipelines and is even more pronounced in low-CPI
machines.  SPARC implementations are balanced in a way that achieves
shorter cycle times, avoids an increase in CPI, and carefully
considers chip-edge bandwidth issues.  They include
single-cycle loads and single-cycle untaken branches.
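
The scaling argument can also be made numerically.  With a fixed
main-memory access time (250 ns here, an assumed figure) and an ideal
base CPI of 1.0 to isolate the memory effect, the same miss rate costs
progressively more as the clock speeds up:

    /* A fixed-latency miss costs more CYCLES at a shorter cycle time,
       so effective CPI rises as the clock speeds up.  All numbers are
       assumed for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double mem_ns   = 250.0;  /* miss service time, fixed (assumed) */
        double missrate = 0.02;   /* per reference (assumed)            */
        double refs     = 1.2;    /* references per instruction         */
        double ns;

        for (ns = 100.0; ns >= 40.0; ns -= 20.0) {
            double penalty = mem_ns / ns;  /* cycles per miss */
            double cpi     = 1.0 + refs * missrate * penalty;
            printf("cycle %3.0f ns: miss = %4.1f cycles, CPI = %.2f\n",
                   ns, penalty, cpi);
        }
        return 0;
    }

The memory contribution to CPI more than doubles as the cycle time
drops from 100 ns to 40 ns; that is the effect a balanced
implementation has to absorb.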

Of course, the least error-prone measure of performance is wall-clock time.
Until there are more results from large integer programs running on both
the Sun-4 and the M/1000, speculation is unproductive.


Now, what about register windows?  In Baskett's second article
<8885@sgi.SGI.COM>, he writes:

> It may have been the best they could do but it looks like a mistake to me.
> In higher performance technologies the speed of register access becomes
> more and more critical so about the only thing you can do with register
> windows is to scale them down.  And as the number of windows goes down,
> the small gain that you might have had goes away and procedure call
> overhead goes up.  Attacking the procedure call overhead problem at
> compile time rather than at run time is a more scalable approach.

Two points:

(1)  It is hard to foresee the difference between implementing 
1K-bit and 4K-bit register files (i.e., 32 versus 128 32-bit registers).  
Memories can turn out larger and faster than intuition suggests.

(2)  SPARC does NOT PRECLUDE interprocedural register allocation (IRA)
optimizations and thus ALLOWS for "attacking the procedure call 
overhead problem at compile time rather than at run time."
SPARC has two mechanisms to reduce load/store traffic:  
register windows and IRA!   

In SPARC, the procedure call and return instructions are different 
from the ones that increment and decrement the window pointer.  
(SPARC's "save" and "restore" instructions decrement and increment 
the window pointer.  They also perform an "add", which usually adjusts 
the stack pointer.  The pc-relative "call" and register-indirect
"jump-and-link" do NOT affect the window pointer.)

A minimum SPARC implementation could have 40 registers:  8 ins, 
8 locals, 8 outs, 8 globals, and 8 local registers for the trap handler.
Such an implementation is not precluded by the architecture, but
it would probably require IRA-type optimizations.  It would function
as if there were no windows, although window-based code would
still execute properly, albeit inefficiently.

Register windows have several advantages over a fixed set of registers,
besides reducing the number of loads and stores by about 30%:
They work well in LISP (incremental compilation) and object-oriented
environments (type-specific procedure linking) where IRA is impractical.
They can also be used in specialized controller applications that
require extremely fast context switching:  a pair of windows (32 registers)
can be allocated per context.
--------------------------------
Robert Garner
Sun Microsystems             

P.S.  There will be two sessions devoted to SPARC at the IEEE Spring Compcon:
One session will cover the architecture, compilers, and the SunOS port
and the other will cover the Fujitsu, Cypress, and BIT implementations.

DISCLAIMER:  I speak for myself only and do not represent the views 
of Sun Microsystems, or any other company.