Path: utzoo!utgpu!water!watmath!clyde!rutgers!rochester!cornell!batcomputer!pyramid!voder!apple!bcase
From: bcase@apple.UUCP (Brian Case)
Newsgroups: comp.arch
Subject: Re: Why is SPARC so slow?
Message-ID: <6993@apple.UUCP>
Date: 14 Dec 87 19:56:47 GMT
References: <8809@sgi.SGI.COM> <6964@apple.UUCP> <8885@sgi.SGI.COM>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc., Cupertino, USA
Lines: 93

In article <8885@sgi.SGI.COM> baskett@baskett writes:
>In article <6964@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
>> ...
>> >The separate instruction and data cache only run
>> >at single cycle rates but they run a half cycle out of phase with each
>> >other so it all works out.  (Pretty slick, don't you think?)
>> 
>> Yes, I do think it is pretty slick, but I also think this is a liability
>> at clock speeds higher than 16 MHz (and maybe even at 16 MHz).  I am sure,
>> though, that MIPS has a plan to fix this problem.  It sure seems like the
>> way to go at 8 MHz.  Preventing bus crashes (i.e. meeting real-world
>> timing constraints) can be a problem.
>
>The 16 MHz MIPS parts we have work fine.  If it becomes a problem, the fix
>is pretty obvious, too.

Oh, I am sure they work great.  I didn't mean that they would be flaky or
intermittent or something, just that the system design is trickier.

>> I am sure one of their chief concerns was future ECL implementation.
>I have an ECL implementation of an experimental Risc processor (board)

[Yes, that's a good machine!  I hear it is the "DEC Dorado."]

>in my office.  My experience with the team that designed and built it
>(a great group of people at DEC's Western Research Lab, by the way)
>tells me that the MIPS architecture is more suitable for ECL implementation
>than the SPARC architecture.  (see next comment)
>
>> by choosing register windows (which lets them vary the number of registers,
>> in window increments, for a given implementation) and a very simple
>> definition otherwise, SUN simply did the best they could to make future
>> implementation easy.
>
>It may have been the best they could do but it looks like a mistake to me.

Well, notice that it was *I* who said that they were doing "the best they
could."  Please don't take my word as the official SUN position!  Seldom
does anyone really do "the best they could."  One man's mistake is another
man's stroke of genius.

>In higher performance technologies the speed of register access becomes
>more and more critical so about the only thing you can do with register
>windows is to scale them down.

Yes, in the first ECL single-chip implementation.  Then, as the technology
gets denser, you can scale them back up to the desired level.  I was not
talking about discrete ECL implementation; I should have made that clear.
You may think that even single-chip ECL implementations suffer with large
register files, but I don't believe so (though I'm still youngish and naive).

>And as the number of windows goes down,
>the small gain that you might have had goes away and procedure call
>overhead goes up.  Attacking the procedure call overhead problem at
>compile time rather than at run time is a more scalable approach.

Well, I understand what you are saying: "the available density of the
technology is irrelevant, to a degree, with a smallish [my opinion],
fixed-size register file."  On the other hand, *by definition,* the SUN
approach is more scalable since there is at least some opportunity for
scaling; a fixed-size register file cannot, by definition, be scaled.
(Or, have I missed something?  Sorry if so.)
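
To make the scaling argument concrete, here is a rough sketch of how the
number of window overflow/underflow traps grows as the number of windows
shrinks.  It is a toy model, not a description of any real SPARC part: the
names count_window_traps, depth_trace, and nwindows are made up for
illustration, and it assumes every window is usable (a real implementation
may reserve one for the trap handler).

```python
def count_window_traps(depth_trace, nwindows):
    """Count (overflows, underflows) for a sequence of call depths.

    depth_trace: call depth after each call or return, starting from
                 depth 0; consecutive values differ by exactly 1.
    nwindows:    number of register windows in the (hypothetical) file.
    """
    resident = 1          # frames currently held in the register file
    overflows = underflows = 0
    depth = 0
    for new_depth in depth_trace:
        if new_depth > depth:            # procedure call
            if resident == nwindows:
                overflows += 1           # spill the oldest window to memory
            else:
                resident += 1
        else:                            # procedure return
            if resident == 1:
                underflows += 1          # refill the caller's window
            else:
                resident -= 1
        depth = new_depth
    return overflows, underflows


# Four nested calls followed by four returns.
trace = [1, 2, 3, 4, 3, 2, 1, 0]
for n in (8, 4, 2, 1):
    print(n, count_window_traps(trace, n))
# 8 windows -> (0, 0); 4 -> (1, 1); 2 -> (3, 3); 1 -> (4, 4)
```

With one window every call and return traps, which is the degenerate case
discussed in point 1 below.  Real traces oscillate around some depth rather
than nesting straight down, so real trap rates depend heavily on the program.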

1) Notice that if SUN decides to dump the overlapping register window
approach, they can!  They can treat one procedure context as the only
context available and use a procedure calling mechanism like MIPS. 
Compatibility can be maintained by having the old instructions trap and
do the right thing.  This will allow them to implement a register file
the same size as the MIPS register file.  Presumably, we'll be at such
processing speeds then that old binaries, which use the old procedure
calling mechanism, will run fast enough, even with the trap overhead.
(The idea here makes sense, but I'm not sure I'm communicating it well.)
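
Since the idea in point 1 is easier to show than to describe, here is a
minimal sketch of it.  Everything in it is an assumption for illustration:
the class SingleWindowCPU and its fields are hypothetical, the window is
modeled as 8 ins, 8 locals, and 8 outs, and the add that the real
save/restore instructions also perform is left out.  The hardware holds one
context only; save and restore always trap, and the handler keeps the
caller's registers on a memory stack so old binaries still see the
overlapping-window behavior.

```python
class SingleWindowCPU:
    """Toy model: one register context, windows emulated by traps."""

    def __init__(self):
        self.ins = [0] * 8
        self.locals = [0] * 8
        self.outs = [0] * 8
        self.stack = []   # memory stack maintained by the trap handler
        self.traps = 0

    def save(self):
        """Emulated 'save': trap, spill the caller, outs become ins."""
        self.traps += 1
        self.stack.append((self.ins, self.locals))
        self.ins = self.outs        # caller's outs are the callee's ins
        self.locals = [0] * 8       # fresh locals for the callee
        self.outs = [0] * 8         # fresh outs for the callee

    def restore(self):
        """Emulated 'restore': trap, ins become outs, reload the caller."""
        self.traps += 1
        self.outs = self.ins        # callee's ins are the caller's outs
        self.ins, self.locals = self.stack.pop()


cpu = SingleWindowCPU()
cpu.locals[3] = 99      # caller's private value
cpu.outs[0] = 5         # argument passed in an out register
cpu.save()              # callee entry (traps)
cpu.ins[0] *= 2         # callee computes a result in place
cpu.restore()           # callee exit (traps)
print(cpu.outs[0], cpu.locals[3], cpu.traps)
```

The caller gets its result back in the same out register and its locals are
intact, at the cost of two traps per call/return pair.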

2) Didn't David Wall do research on register allocation at link time
that showed that lots of registers are better?  Admittedly, his approach
needed a large pool of registers, like in the Am29000, not the overlapping
register windows of the SPARC (couldn't resist!  :-).  Do you now think
that the MIPS 32-entry file is as good as the 64-entry file on the
experimental machine to which you refer?  I'm genuinely curious here,
not asking a rhetorical question.  I was under the impression that
register allocation at link time was sorta "the wave of the future"
(I hate that expression); if so, wouldn't 32 be too small?

3) You have to remember that it will be necessary to have at least some
TLB-type or other cache-type function finish in one machine cycle.  True,
the array technology used for TLBs can be denser, and therefore a little
faster, than multi-ported register file array technology.  However, if you
can get your TLB array access and compare in one cycle, why do you think
that you can't get your register-file-array access and address compute
(be it add, or whatever) in one cycle?  What was the cycle-limiting
factor in the experimental machine that you have in your office?

Thanks in advance.