Path: utzoo!utgpu!water!watmath!clyde!cbosgd!ncr-sd!dennisr
From: dennisr@ncr-sd.SanDiego.NCR.COM (Dennis Russell)
Newsgroups: comp.arch
Subject: Re: Why is SPARC so slow?
Keywords: RISC, R2000, SPARC
Message-ID: <1941@ncr-sd.SanDiego.NCR.COM>
Date: 11 Dec 87 23:27:14 GMT
References: <1078@quacky.UUCP> <8809@sgi.SGI.COM>
Reply-To: dennisr@ncr-sd.SanDiego.NCR.COM (0000-Dennis Russell)
Distribution: world
Organization: NCR Corporation, Rancho Bernardo
Lines: 108

In article <8809@sgi.SGI.COM> baskett@baskett writes:
>
>
>I have been asking myself the question, why is SPARC so slow?
>.......
>Loads and stores are slow. Loads on both implementations take two
>cycles and stores take 3 cycles for 32-bit words compared to one cycle
>for each on a MIPS R2000. There are several interrelated reasons for
>this situation. Briefly, they are lack of a separate address adder,
>lack of split instruction and data caches, and inability to cycle the
>address and data bus twice per main clock cycle. Details follow.
>
>Lack of a separate address adder for loads and stores. The R2000 can
>start the address generation for a load or a store in the second stage
>of the pipeline because the register access is fast and an address adder
>is present. Thus the load or store can "execute" in stage 3 of the
>pipeline, just like the rest of the instructions. On SPARCs (so far)
>address generation appears to use the regular ALU in the third stage of
>the pipeline and then begin the actual cache access in the fourth stage.
>For a load, you then need an extra stage to get the data back.
>

The block diagram in the data sheet of the Fujitsu SPARC shows an
Address Generation Unit that is separate from the Arithmetic and Logic
Unit. Both branch target addresses and load/store addresses are
calculated in the AGU. Further on in the data sheet the four stage
pipeline is described: Fetch, Decode, Execute, and Write.
It is stated explicitly that "Memory addresses are evaluated for loads,
stores, and control transfers" in the Decode stage. It can be concluded
that the Fujitsu SPARC does indeed have a separate address adder and
that load/store addresses are generated in the second stage (Decode) of
the pipeline.

The R2000 has a five stage pipeline: Fetch, Decode, Execute, Memory
Access, Write Back. Memory address generation occurs in the third stage
(Execute) and the load/store "executes" in the fourth stage (Memory
Access).

The reason for the 2 cycle load in the Fujitsu SPARC is the multiplexing
of the external address and data busses between instructions and memory
data. A SPARC load requires 1 cycle of the external busses so that
instruction fetching stalls for this 1 cycle.

>Lack of split instruction and data caches. Because both SPARCs have a
>single cache rather than the separate instruction and data caches of
>the R2000, the extra pipeline stage needed to get the data back for a
>load can't be used to fetch an instruction anyway. For a store the
>relevant cache line is read on the fourth cycle and updated and written
>back on the fifth cycle. So there are two cycles that can't be used
>to fetch instructions, bringing the total cost of a store to three cycles.
>

SPARC supports base register plus index register memory addressing.
During the first half of the Decode stage the base and index registers
are accessed. During the second half they are added together to form the
virtual memory address. Since the register file in the Fujitsu SPARC has
only 2 ports, store data cannot be accessed from the register file until
the third stage (Execute). Thus, on a store the address goes out during
the third stage (Execute) and the data during the fourth stage (Write).
Since stores use the external busses for two consecutive cycles during
which time fetching of instructions is suspended, the execution time for
stores is 3 cycles.
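
The cycle accounting above can be sketched in a few lines of Python.
This is only a rough model under the assumption that each instruction
costs one fetch slot plus one stall cycle for every cycle in which data
traffic occupies the shared external busses. The names DATA_BUS_CYCLES,
cycles and total_cycles are made up for illustration and do not come
from the data sheet.

```python
# Hypothetical model of the shared-bus accounting described above.
# Assumption: every instruction takes 1 cycle for its own fetch slot,
# and each cycle of data traffic on the external busses stalls
# instruction fetching for 1 cycle. No other stalls are modeled.

# Bus cycles taken by data traffic, per instruction class.
DATA_BUS_CYCLES = {
    "alu": 0,    # no data transfer on the external busses
    "load": 1,   # one bus cycle to bring the data in
    "store": 2,  # address goes out in Execute, data in Write
}


def cycles(kind):
    """Cycles charged to one instruction: 1 fetch slot + bus stalls."""
    return 1 + DATA_BUS_CYCLES[kind]


def total_cycles(instructions):
    """Total cycles for a straight-line sequence of instruction kinds."""
    return sum(cycles(kind) for kind in instructions)
```

Under these assumptions cycles("load") gives 2 and cycles("store")
gives 3, matching the figures quoted for the Fujitsu part.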

>Inability to cycle the address and data bus twice per main clock cycle.
>The SPARC chips aren't double cycling the address and data bus so that
>both loads and stores mean that you can't fetch instructions. The R2000
>also has a single address bus and a single data bus but it can use them
>twice per cycle. This means you can then split your cache into an
>instruction cache and a data cache and make use of the extra bandwidth
>by fetching an instruction every cycle in spite of loads and stores.
>

This is indeed true. The price the R2000 pays for this is a complex
clocking scheme whereby a 4 phase input clock at double frequency is
required in order to control the double cycle external busses. Since at
16.7 MHz the R2000's I/O interface runs at 33.3 MHz, it remains to be
seen whether the H/W architecture of the R2000 is scalable: can it be
carried to 25-30 MHz, where the bus must run at 50-60 MHz?

>Branches are slow. Since taken branches need only one delay slot
>there must be an address adder for the program counter. But with a
>single cache you have to decide early what the next instruction address
>is. Both SPARC chips always decide that a branch will be taken so there
>is an additional cycle penalty when the condition isn't satisfied and you
>have to junk the instruction you fetched and fetch the right one. On
>

I think there might be some confusion here on the operation of the
Annul Bit during conditional branches. It is my understanding that when
this bit is 0 the delay instruction (the instruction following the
branch) is executed whether the branch is taken or not. When this bit
is 1 the delay instruction is executed only if the branch is taken; if
the branch is not taken, the delay instruction, which is already in the
pipeline, is aborted. Therefore, with the Annul Bit equal to 0, branches
execute in 1 cycle whether the branch is taken or not.
With the Annul Bit at 1 a taken branch executes in 1 cycle while an
untaken branch takes 2 cycles - 1 cycle for the branch and 1 cycle for
the aborted delay instruction.

The advantage of the Annul Bit is in conditional branches that terminate
loops. With the Annul Bit at 1 a loop instruction can be placed in the
delay slot. This instruction is executed when the loop is executed and
is not executed when you fall thru the loop.

-- 
Dennis Russell                          | NCR Corp., M/S 4720
phone: 619-485-3214                     | 16550 W. Bernardo Dr.
UUCP: ...{ihnp4|pyramid}!ncr-sd!dennisr | San Diego, CA 92128