Path: utzoo!utgpu!water!watmath!clyde!rutgers!rochester!cornell!batcomputer!pyramid!voder!apple!bcase
From: bcase@apple.UUCP (Brian Case)
Newsgroups: comp.arch
Subject: Re: Why is SPARC so slow?
Message-ID: <6993@apple.UUCP>
Date: 14 Dec 87 19:56:47 GMT
References: <8809@sgi.SGI.COM> <6964@apple.UUCP> <8885@sgi.SGI.COM>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc., Cupertino, USA
Lines: 93

In article <8885@sgi.SGI.COM> baskett@baskett writes:
>In article <6964@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
>> ...
>> >The separate instruction and data cache only run
>> >at single cycle rates but they run a half cycle out of phase with each
>> >other so it all works out.  (Pretty slick, don't you think?)
>>
>> Yes, I do think it is pretty slick, but I also think this is a liability
>> at clock speeds higher than 16 Mhz (and maybe even at 16MHz).  I am sure,
>> though, that MIPS has a plan to fix this problem.  It sure seems like the
>> way to go at 8 Mhz.  Preventing bus crashes (i.e. meeting real-world
>> timing constraints) can be problem.
>
>The 16 MHz MIPS parts we have work fine.  If it becomes a problem, the fix
>is pretty obvious, too.

Oh, I am sure they work great.  I didn't mean that they would be flaky or
intermittent or something, just that the system design is trickier.

>> I am sure one of their chief concerns was future ECL implementation.
>I have an ECL implementation of an experimental Risc processor (board)

[Yes, that's a good machine!  I hear it is the "DEC Dorado."]

>in my office.  My experience with the team that designed and built it
>(a great group of people at DEC's Western Research Lab, by the way)
>tells me that the MIPS architecture is more suitable for ECL implementation
>than the SPARC architecture.

(see next comment)

>
>> by choosing register windows (which lets them vary the number of registers,
>> in window increments, for a given implementation) and a very simple
>> definition otherwise, SUN simply did the best they could to make future
>> implementation easy.
>
>It may have been the best they could do but it looks like a mistake to me.

Well, notice that it was *I* who said that they were doing "the best they
could."  Please don't take my word as the official SUN position!  Seldom
does anyone really do "the best they could."  One man's mistake is another
man's stroke of genius.

>In higher performance technologies the speed of register access becomes
>more and more critical so about the only thing you can do with register
>windows is to scale them down.

Yes, in the first ECL single-chip implementation.  Then, as the technology
gets denser, you can scale them back up to the desired level.  I was not
talking about discrete ECL implementation; I should have made that clear.
You may think that even single-chip ECL implementations suffer with large
register files, but I don't believe so (but I'm still youngish and naive).

>And as the number of windows goes down,
>the small gain that you might have had goes away and procedure call
>overhead goes up.  Attacking the procedure call overhead problem at
>compile time rather than at run time is a more scalable approach.

Well, I understand what you are saying:  "the available density of the
technology is irrelevant, to a degree, with a smallish [my opinion],
fixed-size register file."  On the other hand, *by definition,* the SUN
approach is more scalable since there is at least some opportunity for
scaling; a fixed-size register file cannot, by definition, be scaled.
(Or, have I missed something?  Sorry if so.)

1)  Notice that if SUN decides to dump the overlapping register window
approach, they can!  They can treat one procedure context as the only
context available and use a procedure calling mechanism like MIPS.
Compatibility can be maintained by having the old instructions trap and do
the right thing.  This will allow them to implement a register file the
same size as the MIPS register file.  Presumably, we'll be at such
processing speeds then that old binaries, which use the old procedure
calling mechanism, will run fast enough, even with the trap overhead.
(The idea here makes sense, but I'm not sure I'm communicating it well.)

2)  Didn't David Wall do research on register allocation at link time that
showed that lots of registers are better?  Admittedly, his approach needed
a large pool of registers, like in the Am29000, not the overlapping
register windows of the SPARC (couldn't resist! :-).  Do you now think that
the MIPS 32-entry file is as good as the 64-entry file on the experimental
machine to which you refer?  I'm genuinely curious here, not asking a
rhetorical question.  I was under the impression that register allocation
at link time was sorta "the wave of the future" (I hate that expression);
if so, wouldn't 32 be too small?

3)  You have to remember that it will be necessary to have at least some
TLB-type or other cache-type function finish in one machine cycle.  True,
the array technology used for TLBs can be denser, and therefore a little
faster, than multi-ported register file array technology.  However, if you
can get your TLB array access and compare in one cycle, why do you think
that you can't get your register-file-array access and address compute (be
it add, or whatever) in one cycle?

What was the cycle-limiting factor in the experimental machine that you
have in your office?  Thanks in advance.
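
For anyone who wants to play with the "fewer windows means more procedure
call overhead" argument quoted above, here is a toy model in Python.  It is
only a sketch: the name count_window_traps, the +1/-1 call trace, and the
spill-one-window-per-trap policy are all assumptions made up for
illustration, not a description of any real SPARC implementation (it
ignores, for instance, any window an implementation reserves for trap
handling).

```python
def count_window_traps(trace, num_windows):
    """Count register-window overflow and underflow traps for a call trace.

    trace       -- iterable of +1 (procedure call) and -1 (return); hypothetical format
    num_windows -- number of windows in the register file (assumed >= 1)

    Assumed policy: each trap spills or restores exactly one window.
    """
    resident = 1          # frames currently held in the register file
    overflows = 0
    underflows = 0
    for step in trace:
        if step == +1:
            if resident == num_windows:
                overflows += 1    # file is full: spill the oldest window to memory
            else:
                resident += 1     # a free window is available, no trap
        else:
            if resident == 1:
                underflows += 1   # caller's window was spilled: restore it
            else:
                resident -= 1     # caller's window is still resident, no trap
    return overflows, underflows


if __name__ == "__main__":
    # Hypothetical trace: ten nested calls followed by ten returns.
    trace = [+1] * 10 + [-1] * 10
    for n in (2, 4, 8, 16):
        print(n, count_window_traps(trace, n))
```

Running the same trace through several values of num_windows shows how the
trap count grows as the file shrinks.  What it cannot show is the
cycle-time cost of the bigger file, which is the other half of the
argument.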