Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!bloom-beacon!husc6!yale!mfci!colwell
From: colwell@mfci.UUCP (Robert Colwell)
Newsgroups: comp.arch
Subject: Re: getting rid of branches
Message-ID: <460@m3.mfci.UUCP>
Date: 8 Jul 88 12:58:00 GMT
References: <1941@pt.cs.cmu.edu> <3208@ubc-cs.UUCP> <1986@pt.cs.cmu.edu> <12258@mimsy.UUCP> <236@lfm.fpssun.fps.com>
Sender: root@mfci.UUCP
Reply-To: colwell@mfci.UUCP (Robert Colwell)
Organization: Multiflow Computer Inc., Branford Ct. 06405
Lines: 65

In article <236@lfm.fpssun.fps.com> lfm@fpssun.fps.com (Larry Meadows) writes:
>In article <12258@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>> In article <91odrecXKL1010YEOek@amdahl.uts.amdahl.com>
>> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
>> >You guys aren't thinking big enough.  How about multiple parallel
>> >pipelines to compute all the various instruction threads in parallel
>> >and just keep the results of the one that is actually taken?
>>
>> Actually, this sort of idea is contained in some research and thesis
>> work that is (was?) going on here at Maryland.
>
>Sounds a lot like multiflow to me.
>
>But what do you do about exceptions????

Please, that's Multiflow with a capital M (they made me say that).

About the multiple pipelines: I view our VLIW as being something close
to the converse of what you suggest above.  We don't try to do anything
with multiple instruction threads in parallel.  The essence of trace
scheduling is that you find the right path through the code and remove
all the non-essential branch dependencies.  Whatever is left can be
compacted subject only to data dependencies.  Given that this approach
works (and we certainly think it does), it wouldn't make much sense to
tie up N hardware pipelines with calculations, only to dump N-1 of them
sometime in the future, when you could have gotten all N pipes doing
real work.
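To make the trace-scheduling idea concrete, here is a small hand-sketched
C analogue (my own toy example, not Multiflow compiler output): the
compiler picks the almost-always-taken path as the trace and moves the
load and the add above the branch, so the branch no longer serializes the
pipeline and only selects which result to keep.  Both functions compute
the same thing.

```c
#include <assert.h>

/* Source loop: suppose the 'then' path is taken nearly every trip,
   so that path becomes the trace. */
int sum_positive(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > 0)       /* branch guards the add */
            sum += a[i];
    }
    return sum;
}

/* After trace scheduling (sketched by hand): the load and the add are
   hoisted above the test.  The speculative add writes only a scratch
   register, so the off-trace path can simply discard it -- no
   compensation code is needed for this particular motion. */
int sum_positive_traced(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int v = a[i];       /* load hoisted above the branch */
        int t = sum + v;    /* add performed speculatively   */
        if (v > 0)          /* branch now just selects which */
            sum = t;        /* result to commit              */
    }
    return sum;
}
```

The point of the transformation is that the load and add can now be
packed into the same wide instruction as operations from neighboring
iterations, constrained only by data dependencies.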
The only thing we do that's close to what you suggest is in performing
code motions that sometimes result in operations being kicked off whose
results are never used.  There's a whole raft of interesting problems
that come knocking when you do this, including taking page faults on
memory references that aren't going to be used, avoiding access
violations on references that are way out of bounds due to array
subscripts that have walked off the end, etc.

And then there's exceptions.  Suppose there's a loop that's doing
floating point divides along with a lot of other stuff.  You'll
probably want to kick off the divide early due to its long latency.
But on the last trip through you start the divide, then find out that
this trace is done, and you branch out.  Meanwhile, that divide was
unnecessary and its operands are highly likely to be bogus, so it is
liable to try 0/0 or some other offensive calculation.  You don't want
to trap on an exception here, since that calculation wasn't "real"
anyway.  It's a hard problem.

The TRACE has a "store-check" operation that watches for NaNs being
written to memory from the floating register file, on the theory that
if you're actually writing the result of a flop to memory, you must
really care about it.  At that point you know the computation went
awry.  Problem is, of course, you may no longer be in the vicinity of
the code where the "event" occurred.  You are still in the routine
where the NaN showed up, and if that's enough info you're done.  If it
isn't, you will probably need to re-compile in a non-optimizing mode
and re-run; in that mode, any computation you do is one you want, so
the functional units can then report their exceptions directly and
immediately, and you can pinpoint your problem that way.

I suspect that nobody's got all the answers to this problem yet.

Bob Colwell               mfci!colwell@uunet.uucp
Multiflow Computer        175 N. Main St.
Branford, CT 06405        203-488-6090
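To illustrate the speculative-divide hazard and the store-check idea,
here is a toy C sketch (the function and data are my own invention, and
the isnan() test stands in for the TRACE's hardware store-check, which
this is only loosely analogous to): with IEEE arithmetic, a bogus 0/0
from a speculatively started divide quietly produces a NaN rather than
trapping, and the check fires only when a value is actually about to be
committed.

```c
#include <assert.h>
#include <math.h>

/* The divide is "kicked off early", above the guard that would have
   skipped it, so on some trips it runs on bogus operands.  0.0/0.0
   yields a quiet NaN under IEEE default handling instead of trapping,
   which is exactly what you want for a speculative op. */
double scale_sum(const double *num, const double *den, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double q = num[i] / den[i];  /* speculative: may be 0/0 (NaN) */
        if (den[i] != 0.0)           /* off-trace: discard bogus q    */
            sum += q;
    }
    /* store-check analogue: only when the result is committed do we
       look for a NaN that leaked out of a "real" computation */
    assert(!isnan(sum));
    return sum;
}
```

The discarded NaN never reaches the committed sum, so the check stays
quiet; had a NaN flowed into a result we actually store, the check
would fire -- possibly far from where the bad divide happened, which is
the diagnosis problem described above.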