Path: utzoo!mnetor!uunet!husc6!rutgers!umd5!ames!sgi!baskett
From: baskett@baskett
Newsgroups: comp.arch
Subject: Why is SPARC so slow?
Message-ID: <8809@sgi.SGI.COM>
Date: 10 Dec 87 02:42:03 GMT
References: <1078@quacky.UUCP>
Sender: daemon@sgi.SGI.COM
Organization: Silicon Graphics Inc, Mountain View, CA
Lines: 148
Summary: Loads, stores, branches, and floating point ops take too many cycles.



I have been asking myself the question, why is SPARC so slow?
I've been sparked by John Mashey's fascinating "Performance Brief"
and by continuing reports from our customers that our own 4D/70
12.5 MHz MIPS-based workstations outperform Sun-4's on their
CPU-intensive applications, including image rendering and mechanical
design and analysis, in a manner consistent with the benchmarks
reported in the Performance Brief.

SPARC is not slow compared to traditional microprocessors, granted.
But as a RISC microprocessor it seems to have some problems, at least
in the first two implementations.  Below are my observations so far on
why the Fujitsu version of SPARC is slow compared to the MIPS RISC
microprocessor.  At least some of the problems of the Fujitsu version
(the one in the Sun-4) are also present in the Cypress version,
according to the preliminary data sheets.  These problems don't
necessarily mean that the SPARC architecture has problems but I'd be
reluctant to accept SPARC as the basis for an Application Binary 
Interface standard until I saw some evidence that high performance
implementations of SPARC are possible.

Loads and stores are slow.  Loads on both implementations take two
cycles and stores take three cycles for 32-bit words, compared to one cycle
for each on a MIPS R2000.  There are several interrelated reasons for
this situation.  Briefly, they are lack of a separate address adder,
lack of split instruction and data caches, and inability to cycle the
address and data bus twice per main clock cycle.  Details follow.

Lack of a separate address adder for loads and stores.  The R2000 can
start the address generation for a load or a store in the second stage
of the pipeline because the register access is fast and an address adder
is present.  Thus the load or store can "execute" in stage 3 of the
pipeline, just like the rest of the instructions.  On SPARCs (so far),
address generation appears to use the regular ALU in the third stage of
the pipeline, with the actual cache access beginning only in the fourth stage.
For a load, you then need an extra stage to get the data back.

Lack of split instruction and data caches.  Because both SPARCs have a
single cache rather than the separate instruction and data caches of
the R2000, the extra pipeline stage needed to get the data back for a
load can't be used to fetch an instruction anyway.  For a store the
relevant cache line is read on the fourth cycle and updated and written
back on the fifth cycle.  So there are two cycles that can't be used
to fetch instructions, bringing the total cost of a store to three cycles.

Inability to cycle the address and data bus twice per main clock cycle.
The SPARC chips don't double-cycle the address and data bus, so both
loads and stores keep you from fetching instructions.  The R2000
also has a single address bus and a single data bus but it can use them
twice per cycle.  This means you can then split your cache into an
instruction cache and a data cache and make use of the extra bandwidth
by fetching an instruction every cycle in spite of loads and stores.
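
To make the cost concrete, here is a toy C calculation.  It is purely
illustrative (my numbers, not a measurement of either machine): it
charges an imaginary instruction sequence the per-instruction cycle
counts described above, namely two cycles per load and three per store
on these SPARCs, one cycle each on the R2000, and one cycle for
everything else on both.

    #include <stdio.h>

    /* Illustrative per-instruction cycle costs from the discussion above. */
    enum kind { LOAD, STORE, OTHER };

    static int sparc_cycles(enum kind k)
    {
        switch (k) {
        case LOAD:  return 2;   /* address add, then cache access */
        case STORE: return 3;   /* read line, update, write back  */
        default:    return 1;
        }
    }

    static int r2000_cycles(enum kind k)
    {
        (void)k;    /* loads and stores overlap instruction fetching */
        return 1;
    }

    int main(void)
    {
        /* An imaginary basic block: two loads, three ALU ops, one store. */
        enum kind seq[] = { LOAD, LOAD, OTHER, OTHER, OTHER, STORE };
        int n = sizeof seq / sizeof seq[0];
        int sparc = 0, r2000 = 0, i;

        for (i = 0; i < n; i++) {
            sparc += sparc_cycles(seq[i]);
            r2000 += r2000_cycles(seq[i]);
        }
        printf("%d instrs: SPARC %d, R2000 %d cycles\n", n, sparc, r2000);
        return 0;
    }

That comes to 10 cycles against 6 for this little block, before we
even get to branches or floating point.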

However, if register windows eliminated enough loads and stores, these
two SPARC implementations might represent reasonable engineering design
decisions.  Both benchmarks and careful studies of code sequences
indicate that the load and store savings are not that great, generally
less than five percent.  We can also ask if the overhead of register
windows leaves enough time in the second stage of the pipe to do an
address add, assuming we could fit such an adder into the implementation.
(Windowed registers take up a lot of space.)
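
As a rough check on that argument, here is a small C calculation.  It
uses the conservative instruction mix from the summary tables at the
end of this article and reads "five percent", perhaps generously, as
register windows eliminating five percent of the loads and stores
outright.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed mix and SPARC costs: 20% loads at 2 cycles, 10% stores
           at 3 cycles, 15% branches at 2.5 cycles, 55% other at 1 cycle. */
        double memops = 0.20 * 2.0 + 0.10 * 3.0;    /* loads and stores   */
        double rest   = 0.15 * 2.5 + 0.55 * 1.0;    /* branches and other */
        double saved  = 0.05;   /* fraction of loads/stores windows remove */

        printf("avg cycles per original instruction, no windows:   %.3f\n",
               memops + rest);
        printf("avg cycles per original instruction, with windows: %.3f\n",
               (1.0 - saved) * memops + rest);
        return 0;
    }

The average only drops from about 1.63 to about 1.59 cycles per
instruction of the original program, which comes nowhere near paying
for the two and three cycle loads and stores.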

Branches are slow.  Since taken branches need only one delay slot,
there must be an address adder for the program counter.  But with a
single cache you have to decide early what the next instruction address
is.  Both SPARC chips always decide that a branch will be taken so there
is an additional cycle penalty when the condition isn't satisfied and you
have to junk the instruction you fetched and fetch the right one.  On
the R2000, the instruction address comes out in the second half of the
cycle on the double-cycled address bus so you have time to check the
condition in the first half of the cycle and put out the right target
address every time.  The separate instruction and data caches run only
at single-cycle rates, but they run a half cycle out of phase with each
other, so it all works out.  (Pretty slick, don't you think?)  The first
delay slot can be used by a useful instruction a majority of the time
on both architectures so they are even there.  However, the SPARC
architecture requires that conditional branches be based on a value in a
condition code register rather than the value in a regular register, as
in the MIPS architecture.  Honest people can (and do) disagree about
which approach is better.  But the compiler studies I have seen indicate
that, on the average, you need an extra instruction for setting the
condition code a noticeable fraction of the time.  So my guesstimate is
that the average conditional branch on a SPARC is 2.5 cycles and on an
R2000 is 1.5 cycles.  (Further study is needed here.)
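
For what it's worth, here is one way to put together numbers in that
neighborhood.  Every fraction below is an assumption of mine (how often
the delay slot goes unfilled, how often the branch falls through, how
often SPARC needs a separate instruction to set the condition codes),
so take it as the shape of the estimate rather than a measurement.

    #include <stdio.h>

    int main(void)
    {
        double slot_wasted  = 0.5;  /* delay slot not usefully filled     */
        double untaken      = 0.4;  /* branches whose condition fails     */
        double extra_cc_set = 0.6;  /* branches needing a separate cc set */

        /* R2000: one cycle for the branch, plus the occasionally wasted
           delay slot.                                                    */
        double r2000 = 1.0 + slot_wasted;

        /* SPARC: the same, plus a squashed fetch whenever the "always
           taken" guess is wrong, plus the extra condition-code-setting
           instruction some fraction of the time.                         */
        double sparc = 1.0 + slot_wasted + untaken + extra_cc_set;

        printf("cycles per conditional branch: SPARC %.1f, R2000 %.1f\n",
               sparc, r2000);
        return 0;
    }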

Floating point is very slow.  Here we only know about the Fujitsu
version of the architecture.  The Cypress version is likely to be
better, since the Weitek parts the Fujitsu version uses (the WTL 1164
and WTL 1165) are rather old designs and Weitek's more recent parts
are faster.
Nevertheless, here are the numbers (from the data sheets).  I use cycle
counts just to keep it simple.

                   Fujitsu SPARC      MIPS R2000
                      SP    DP         SP    DP

    add/subtract       9    11          2     2

    multiply           9    12          4     5

    divide             34   65         12    19

These are the total latency times from start to finish for both
systems.  Both systems can execute integer operations in parallel
with floating point operations after the floating point operations are
launched.  However, the launch cost on SPARC is two cycles while it is
one cycle on the R2000.  The launch time is included in the above table.
Both systems appear able to do simultaneous multiplies and adds with no
pipelining.
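
Taking those latencies at face value (the launch cost is already
included), a quick calculation of the per-operation ratios looks like
this.

    #include <stdio.h>

    int main(void)
    {
        /* Latencies in cycles from the table above, launch cost included. */
        const char *op[] = { "sp add/sub", "sp multiply", "sp divide",
                             "dp add/sub", "dp multiply", "dp divide" };
        double sparc[]   = {  9,  9, 34, 11, 12, 65 };
        double r2000[]   = {  2,  4, 12,  2,  5, 19 };
        int i;

        for (i = 0; i < 6; i++)
            printf("%-12s  SPARC %2.0f  R2000 %2.0f  ratio %.2f\n",
                   op[i], sparc[i], r2000[i], sparc[i] / r2000[i]);
        return 0;
    }

That's a factor of roughly two to five and a half per floating point
operation, before counting anything else the program is doing.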

If we summarize these cycles per instruction using a conservative
estimate of instruction frequencies, we get the following results, first
for integer programs and then for single precision floating point programs.

                     SPARC      MIPS     frequency
                     cycles     cycles   (percent)

    loads              2          1         20
    stores             3          1         10
    branches           2.5        1.5       15
    most other         1          1         55
    rare other        >1         >1         ~0

    average            1.63       1.08      ratio = 1.51

                     SPARC      MIPS     frequency
                     cycles     cycles   (percent)

    loads              2          1         20
    stores             3          1         10
    branches           2.5        1.5       15
    most other         1          1         45
    sp fp other        9          2         10

    average            2.43       1.18      ratio = 2.06

These ratios are also consistent with the benchmark results in the
Performance Brief.
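
Those averages are nothing more than frequency-weighted sums of the
cycle counts in the tables.  For anyone who wants to check the
arithmetic or vary the assumptions, it amounts to this:

    #include <stdio.h>

    int main(void)
    {
        /* Integer mix: loads 20%, stores 10%, branches 15%, other 55%.    */
        double sparc_int = 0.20*2 + 0.10*3 + 0.15*2.5 + 0.55*1;
        double mips_int  = 0.20*1 + 0.10*1 + 0.15*1.5 + 0.55*1;

        /* SP floating point mix: "other" drops to 45% and 10% of the
           instructions are single precision fp ops at 9 vs 2 cycles.      */
        double sparc_fp  = 0.20*2 + 0.10*3 + 0.15*2.5 + 0.45*1 + 0.10*9;
        double mips_fp   = 0.20*1 + 0.10*1 + 0.15*1.5 + 0.45*1 + 0.10*2;

        printf("integer:  SPARC %.3f  MIPS %.3f  ratio %.2f\n",
               sparc_int, mips_int, sparc_int / mips_int);
        printf("sp float: SPARC %.3f  MIPS %.3f  ratio %.2f\n",
               sparc_fp,  mips_fp,  sparc_fp  / mips_fp);
        return 0;
    }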

Since MIPS and Sun seem to be producing these systems with similar
technologies, at similar clock rates, and at similar times in history,
these differences in the cycle counts for our favorite and most popular
instructions seem to go a long way toward explaining why SPARC is so
slow.

Forest Baskett
Silicon Graphics Computer Systems