Path: utzoo!mnetor!uunet!husc6!rutgers!lll-lcc!pyramid!prls!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: comp.arch
Subject: Re: Why is SPARC so slow?
Message-ID: <1115@winchester.UUCP>
Date: 14 Dec 87 04:17:52 GMT
References: <1078@quacky.UUCP> <8809@sgi.SGI.COM> <1941@ncr-sd.SanDiego.NCR.COM>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 80
Keywords: RISC, R2000, SPARC

In article <1941@ncr-sd.SanDiego.NCR.COM> dennisr@ncr-sd.SanDiego.NCR.COM (0000-Dennis Russell) writes:
>In article <8809@sgi.SGI.COM> baskett@baskett writes:
......
>>Branches are slow.  Since taken branches need only one delay slot
>>there must be an address adder for the program counter.  But with a
>>single cache you have to decide early what the next instruction address
>>is.  Both SPARC chips always decide that a branch will be taken so there
>>is an additional cycle penalty when the condition isn't satisfied and you
>>have to junk the instruction you fetched and fetch the right one.  On
>>
>I think there might be some confusion here on the operation of the Annul
>Bit during conditional branches.  It is my understanding that when this bit
>is 0 then the delay instruction (the instruction following the branch) is
>executed whether the branch is taken or not.  When this bit is 1 then the
>delay instruction is executed only if the branch is taken - if the branch
>is not taken then the delay instruction which is already in the pipeline is
>aborted.
>
>Therefore, with the Annul Bit equal to 0 branches execute in 1 cycle
>whether the branch is taken or not.  With the Annul Bit at 1 a taken branch
>executes in 1 cycle while an untaken branch takes 2 cycles - 1 cycle for the
>branch and 1 cycle for the aborted delay instruction.

Forrest and Dennis are talking about different things.
See the Fujitsu SPARC datasheet, and Namjoo & Agrawal, "Preserve high speed in
CPU-to-cache transfers", Electronic Design, August 20, 1987, 91-96.
These are consistent in saying:
Fujitsu: "In performing delayed control transfer, the MB86900 processor always
fetches the next instruction following a control transfer.  Then the processor
either executes this instruction or annuls it....This enables the pipeline to
advance while the control target instruction is being fetched...By assuming
a conditional branch to be taken, the processor minimizes pipeline interlock
by providing one cycle execution for taken branches, or two cycle execution
for untaken branches."

Namjoo, Agrawal: "In this pipeline, the fetch address for instruction n is
generated during the decoding stage of instruction n-2.  Since all
branch instructions are delayed by one cycle, all relative branch instructions
take one cycle if the branch condition is true because the target instruction
is fetched before the condition codes are ready.  If, after condition codes
are evaluated, it was determined that the branch was not taken, the processor
ignores the target instruction and continues to fetch the next instruction
in the sequence."

Thus, given instructions:
1: conditional branch
2: branch delay slot
3: after branch delay slot
N: target of branch

Taken branch:
1, 2*, N   (*= might be annulled)
Untaken branch:
1, 2*, N**, 3  (** = ignored)
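
The two sequences above can be written out as a small model.  This is only
a sketch of my reading of the datasheet text, not a description of the actual
chip logic: fetch_sequence and wasted_fetches are made-up names, it covers
conditional branches only, and the annul rule is the one Dennis describes
(annul bit set and branch untaken => delay slot aborted).

```python
def fetch_sequence(taken, annul=False):
    """Return (slot, status) pairs in fetch order for one conditional branch.

    Slots: "1" = branch, "2" = delay slot, "3" = instruction after the
    delay slot, "N" = branch target.  Hypothetical model, not chip logic.
    """
    seq = [("1", "executed")]

    # The delay slot is always fetched.  Assumption: it is annulled only
    # when the annul bit is set and the branch turns out to be untaken.
    delay_status = "annulled" if (annul and not taken) else "executed"
    seq.append(("2", delay_status))

    # The target is always fetched, because the processor assumes "taken"
    # before the condition codes are known.
    seq.append(("N", "executed" if taken else "ignored"))

    # A wrong guess means the sequential instruction must be fetched after all.
    if not taken:
        seq.append(("3", "executed"))
    return seq


def wasted_fetches(seq):
    """Count fetches thrown away because the guess was wrong."""
    return sum(1 for _, status in seq if status == "ignored")


if __name__ == "__main__":
    for taken in (True, False):
        seq = fetch_sequence(taken)
        print("taken" if taken else "untaken",
              [slot for slot, _ in seq],
              "wasted fetches:", wasted_fetches(seq))
```

The one wasted fetch in the untaken case is the extra cycle the Fujitsu text
refers to; an annulled delay slot is a separate cost and is not counted here.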

The implication is that the CPU doesn't know the condition-code result
in time, and thus has to guess.  I can't tell from the Cypress datasheet
whether or not they do the same thing.  [Does anybody know who can say?]

Given that one has decided to take some hit, guessing "taken" is probably the
right way, in that taken conditional branches are on the order of 15% of
instructions and untaken ones are on the order of 5% (on our machines),
although this does vary: 1/3 of the programs we looked at had more untaken
than taken branches.  [I think Earl Killian posted this data a while back].
Thus, the SPARC branch design has (in terms of +=good, -=bad):
	+ annul bit
	+ ability to set condition codes on ALU ops
	- extra cycle for untaken conditional branch
	- condition-code based branch, i.e., tests for eq, neq, etc. often
	  require a separate compare that could actually be done as 1-cycle
	  cmp-branches
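
To put a rough number on the "extra cycle for untaken conditional branch"
item, here is the arithmetic using the frequencies quoted above (about 15%
taken, 5% untaken).  It is a back-of-the-envelope sketch, not a measurement:
the function name is made up, and it assumes a wrong guess costs exactly one
cycle in either direction, which the sources only state for the guess-taken
case.

```python
def extra_cycles_per_instruction(taken_frac, untaken_frac, guess_taken,
                                 penalty=1):
    """Average extra cycles per instruction caused by wrong branch guesses.

    taken_frac / untaken_frac: conditional branches as a fraction of all
    instructions.  penalty: cycles lost per wrong guess (assumed to be 1).
    """
    # Guessing "taken" is wrong exactly on the untaken branches, and
    # guessing "untaken" is wrong exactly on the taken ones.
    wrong_frac = untaken_frac if guess_taken else taken_frac
    return wrong_frac * penalty


if __name__ == "__main__":
    TAKEN, UNTAKEN = 0.15, 0.05   # rough figures from the text above
    print("guess taken:  ",
          extra_cycles_per_instruction(TAKEN, UNTAKEN, guess_taken=True))
    print("guess untaken:",
          extra_cycles_per_instruction(TAKEN, UNTAKEN, guess_taken=False))
```

Under those assumptions, guessing taken costs about 0.05 extra cycles per
instruction against about 0.15 for guessing untaken.  For the third of
programs with more untaken than taken branches the comparison goes the
other way.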

Also, in looking at SPARC assembly code, one notes that cmp's are usually
moved away from the conditional branches, so perhaps these CPUs, or later
ones, will take advantage of cases where the condition codes are set
early enough to avoid the extra I-fetch.
-- 
-john mashey	DISCLAIMER: 
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086