Path: utzoo!mnetor!uunet!husc6!rutgers!lll-lcc!pyramid!prls!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: comp.arch
Subject: Re: Why is SPARC so slow?
Message-ID: <1115@winchester.UUCP>
Date: 14 Dec 87 04:17:52 GMT
References: <1078@quacky.UUCP> <8809@sgi.SGI.COM> <1941@ncr-sd.SanDiego.NCR.COM>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 80
Keywords: RISC, R2000, SPARC

In article <1941@ncr-sd.SanDiego.NCR.COM> dennisr@ncr-sd.SanDiego.NCR.COM (0000-Dennis Russell) writes:
>In article <8809@sgi.SGI.COM> baskett@baskett writes:
......
>>Branches are slow. Since taken branches need only one delay slot
>>there must be an address adder for the program counter. But with a
>>single cache you have to decide early what the next instruction address
>>is. Both SPARC chips always decide that a branch will be taken so there
>>is an additional cycle penalty when the condition isn't satisfied and you
>>have to junk the instruction you fetched and fetch the right one. On
>>
>I think there might be some confusion here on the operation of the Annul
>Bit during conditional branches. It is my understanding that when this bit
>is 0 then the delay instruction (the instruction following the branch) is
>executed whether the branch is taken or not. When this bit is 1 then the
>delay instruction is executed only if the branch is taken - if the branch
>is not taken then the delay instruction which is already in the pipeline is
>aborted.
>
>Therefore, with the Annul Bit equal to 0 branches execute in 1 cycle
>whether the branch is taken or not. With the Annul Bit at 1 a taken branch
>executes in 1 cycle while an untaken branch takes 2 cycles - 1 cycle for the
>branch and 1 cycle for the aborted delay instruction.

Forrest and Dennis are talking about different things. See the Fujitsu SPARC datasheet, and Namjoo & Agrawal, "Preserve high speed in CPU-to-cache transfers", Electronic Design, August 20, 1987, 91-96.
These are consistent in saying:

Fujitsu: "In performing delayed control transfer, the MB86900 processor always fetches the next instruction following a control transfer. Then the processor either executes this instruction or annuls it....This enables the pipeline to advance while the control target instruction is being fetched...By assuming a conditional branch to be taken, the processor minimizes pipeline interlock by providing one cycle execution for taken branches, or two cycle execution for untaken branches."

Namjoo, Agrawal: "In this pipeline, the fetch address for instruction n is generated during the decoding stage of instruction n-2. Since all branch instructions are delayed by one cycle, all relative branch instructions take one cycle if the branch condition is true because the target instruction is fetched before the condition codes are ready. If, after condition codes are evaluated, it was determined that the branch was not taken, the processor ignores the target instruction and continues to fetch the next instruction in the sequence."

Thus, given instructions:
    1: conditional branch
    2: branch delay slot
    3: after branch delay slot
    N: target of branch

    Taken branch:    1, 2*, N        (* = might be annulled)
    Untaken branch:  1, 2*, N**, 3   (** = ignored)

The implication is that the CPU doesn't know the condition-code result in time, and thus has to guess. I can't tell from the Cypress datasheet whether or not they do the same thing. [Does anybody know who can say?]

Given that one has decided to take some hit, this is probably the right way: taken conditional branches are on the order of 15% of instructions and untaken ones on the order of 5% (on our machines), so guessing "taken" pays the extra cycle in the less frequent case. This does vary, though: 1/3 of the programs we looked at had more untaken than taken branches. [I think Earl Killian posted this data a while back.]
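The Taken/Untaken fetch orders and the 15%/5% branch mix above can be sketched as a small Python model. This is a hypothetical illustration, not Fujitsu's or Cypress's actual pipeline logic: the function names fetch_trace, branch_cycles and expected_penalty are invented here, and the penalty calculation assumes that each wrong guess costs exactly one wasted fetch cycle, for either guessing policy.

```python
# Hypothetical model of the "assume taken" fetch policy quoted above.
# Labels follow the post: 1 = conditional branch, 2 = delay slot,
# 3 = instruction after the delay slot, N = branch target.


def fetch_trace(taken: bool, annul: bool) -> list[tuple[str, str]]:
    """Return (label, status) pairs in fetch order, starting at the branch."""
    trace = [("1", "executed")]
    # The delay slot is always fetched; with the annul bit set, a
    # conditional branch squashes it when the branch is not taken.
    trace.append(("2", "annulled" if annul and not taken else "executed"))
    # The target is fetched before the condition codes are ready.
    if taken:
        trace.append(("N", "executed"))
    else:
        trace.append(("N", "ignored"))   # wrong guess: discard the target
        trace.append(("3", "executed"))  # then fetch the fall-through path
    return trace


def branch_cycles(trace: list[tuple[str, str]]) -> int:
    """Cycles charged to the branch by the assume-taken policy: one, plus one
    per discarded target fetch. The delay slot is not counted here."""
    return 1 + sum(1 for _, status in trace if status == "ignored")


def expected_penalty(taken_frac: float, untaken_frac: float,
                     guess_taken: bool) -> float:
    """Extra cycles per instruction, assuming each wrong guess costs
    exactly one wasted fetch cycle."""
    # Guessing "taken" loses on untaken branches, and vice versa.
    return untaken_frac if guess_taken else taken_frac


if __name__ == "__main__":
    for taken in (True, False):
        for annul in (False, True):
            t = fetch_trace(taken, annul)
            print(f"taken={taken} annul={annul}: {t} -> {branch_cycles(t)} cycle(s)")
    # Branch mix quoted in the post: ~15% taken, ~5% untaken.
    print(expected_penalty(0.15, 0.05, guess_taken=True))   # 0.05
    print(expected_penalty(0.15, 0.05, guess_taken=False))  # 0.15
```

Under this model the annul bit only changes whether the delay slot does useful work; the extra cycle on an untaken branch comes from the discarded target fetch, which is the distinction between Forrest's and Dennis's descriptions. With the rough mix above, guessing "taken" wastes about 0.05 cycles per instruction against about 0.15 for the opposite guess; for the third of programs with more untaken than taken branches, the comparison flips.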
Thus, the SPARC branch design has (+ = good, - = bad):
    + annul bit
    + ability to set condition codes on ALU ops
    - extra cycle for untaken conditional branch
    - condition-code based branch, i.e., often requires a compare for eq, neq,
      etc., that could actually be done as 1-cycle compare-and-branch instructions

Also, in looking at SPARC assembly code, one notes that cmp's are usually moved away from the conditional branches, so perhaps these CPUs, or later ones, will take advantage of cases where the condition codes are set early enough to avoid the extra I-fetch.
--
-john mashey    DISCLAIMER:
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086