Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!bloom-beacon!husc6!yale!mfci!colwell
From: colwell@mfci.UUCP (Robert Colwell)
Newsgroups: comp.arch
Subject: Re: getting rid of branches
Message-ID: <460@m3.mfci.UUCP>
Date: 8 Jul 88 12:58:00 GMT
References: <1941@pt.cs.cmu.edu> <3208@ubc-cs.UUCP> <1986@pt.cs.cmu.edu> <12258@mimsy.UUCP> <236@lfm.fpssun.fps.com>
Sender: root@mfci.UUCP
Reply-To: colwell@mfci.UUCP (Robert Colwell)
Organization: Multiflow Computer Inc., Branford Ct. 06405
Lines: 65

In article <236@lfm.fpssun.fps.com> lfm@fpssun.fps.com (Larry Meadows) writes:
>In article <12258@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>> In article <91odrecXKL1010YEOek@amdahl.uts.amdahl.com>
>> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
>> >You guys aren't thinking big enough.  How about multiple parallel
>> >pipelines to compute all the various instruction threads in parallel
>> >and just keep the results of the one that is actually taken?
>> 
>> Actually, this sort of idea is contained in some research and thesis
>> work that is (was?) going on here at Maryland.
>
>Sounds a lot like multiflow to me.
>
>But what do you do about exceptions????

Please, that's Multiflow with a capital M (they made me say that).

About the multiple pipelines:  I view our VLIW as being something
close to the converse of what you suggest above.  We don't try to do
anything with multiple instruction threads in parallel.  The essence
of trace scheduling is that you find the right path through the code
and remove all the non-essential branch dependencies.  Whatever is
left can be compacted only subject to data dependencies.  Given that
this approach works (and we certainly think it does), it wouldn't
make much sense to tie up N hardware pipelines with calculations,
only to dump N-1 of them sometime in the future, when you could have
gotten all N pipes doing real work.

The only thing we do that's close to what you suggest is in
performing code motions that sometimes result in operations being
kicked off whose results are never used.  There's a whole raft of
interesting problems that come knocking when you do this, including
taking page faults on memory references that aren't going to be used,
avoiding access violations on references that are way out of bounds
due to array subscripts that have walked off the end, etc.

And then there's exceptions.  Suppose there's a loop that's doing
floating point divides along with a lot of other stuff.  You'll
probably want to kick off the divide early due to its long latency.
But on the last trip through you start the divide, then find out that
this trace is done, and you branch out.  Meanwhile, that divide was
unnecessary and its operands are highly likely to be bogus, so it
is liable to try 0/0 or some other offensive calculation.  You don't
want to trap on an exception here, since that calculation wasn't
"real" anyway.

It's a hard problem.  The TRACE has a "store-check" operation that
watches for NaNs being written to memory from the Floating register
file, on the theory that if you're actually writing the result of a
flop to memory, you must really care about it.  At that point you
know the computation went awry.  Problem is, of course, you may no
longer be in the vicinity of the code where the "event" occurred.
You are still in the routine where the NaN showed up, and if
that's enough info you're done.  If it isn't, you will probably need
to re-compile in a non-optimizing mode and re-run; in that mode any
computation you do, you want, so the functional units can then report
their exceptions directly and immediately, and you can pinpoint your
problem that way.

I suspect that nobody's got all the answers to this problem yet.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090