Path: utzoo!attcan!uunet!lll-winken!lll-lcc!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!decwrl!labrea!rutgers!gatech!hubcap!ian
From: ian@esl.UUCP (Ian Kaplan)
Newsgroups: comp.parallel
Subject: Re: parallel numerical algorithms
Message-ID: <1826@hubcap.UUCP>
Date: 6 Jun 88 13:00:17 GMT
Sender: fpst@hubcap.UUCP
Lines: 139
Approved: parallel@hubcap.clemson.edu

In article <1776@hubcap.UUCP> gerald@umb.umb.edu (Gerald Ostheimer)
writes (in response to a note from George Nelan):
>
>You should take a look at the work on the tagged-token data flow machine of
>MIT's Computation Structures Group under Arvind. (Their work, for some reason
>unbeknownst to me, did not yet receive any attention in this newsgroup.)
[ much deleted ]
>
> A (possibly) surprising problem that turns up is that there can be too
> much parallelism in a program, which can overflow the pipelines and choke the
> machine.
[ text deleted ]
>
> There are of course more problems.
> I for one never quite understood how the program is distributed over the
> CPU's. This must happen dynamically (when calling functions or
> entering loops, for example), if parallelism is to be exploited.

  To make up for the deficit of data flow discussion in this
newsgroup, I will submit this overly long note discussing some of
Arvind's work.  Data flow is complex, and I cannot provide an
introduction in the space of a note that I have time to write and you
are likely to read.  As a result, I will assume some familiarity with
data flow on the part of the reader.  If there is enough interest, I
might be persuaded to put together a brief bibliography.

  All programs must have a method of handling global state, which is
usually stored in global data structures.  As Mr. Ostheimer points out,
Arvind's I-structures handle this well.  Arvind's group has also
looked at reducing the overhead normally associated with handling
these data structures.  Some of these methods are algorithmic and
others are architectural.  For example, Steve Heller in Arvind's lab
did some work on a hardware implementation of an I-structure store,
which is a sort of "smart memory".
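The core idea behind I-structures can be sketched in a few lines (my own illustrative model, not the MIT design): each cell is write-once, and a read of a still-empty cell is deferred until the producer's write arrives, at which point the waiting reader is released.

```python
# Sketch of I-structure semantics: write-once cells with deferred reads.
# Illustrative model only; names and structure are my own, not Arvind's.

class IStructure:
    def __init__(self, size):
        self.cells = [None] * size                 # None marks an empty cell
        self.deferred = [[] for _ in range(size)]  # readers waiting per cell

    def read(self, i, continuation):
        """If cell i is full, run the continuation now; otherwise
        defer it until the producer writes the cell."""
        if self.cells[i] is not None:
            continuation(self.cells[i])
        else:
            self.deferred[i].append(continuation)

    def write(self, i, value):
        """Write-once: a second write to the same cell is an error."""
        if self.cells[i] is not None:
            raise RuntimeError("I-structure cell written twice")
        self.cells[i] = value
        for k in self.deferred[i]:                 # release deferred readers
            k(value)
        self.deferred[i].clear()

results = []
s = IStructure(4)
s.read(2, results.append)   # read arrives before the write: deferred
s.write(2, 42)              # the write releases the deferred read
s.read(2, results.append)   # a read after the write proceeds at once
print(results)              # [42, 42]
```

The point of the write-once rule is that consumers never need to synchronize with each other, only with the single producer of each cell.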

  Arvind's model of data flow is referred to as dynamic data flow (this
is in contrast to Prof. Dennis' model of static data flow).  In
dynamic data flow, several instances of a loop can be active at the
same time.  As each loop instance is instantiated, a loop field in the
data flow tag is incremented.  This allows separate data flow matching
on the tokens of each loop instance.  While this technique generates a
lot of parallelism for loops, as Mr. Ostheimer points out, it can also
generate so much parallelism that the data flow matching store
overflows.  One way to solve this problem is to allow the loops to
only expand so far.  For example, a given loop would only be allowed
to have five instantiations.  This would be decided in advance by the
compiler.  How the compiler decides this, I have not seen explained.
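A toy model of the tagged-token scheme may make this concrete (a simplification of my own; the field names and the bound of five iterations are illustrative, not taken from Arvind's machine).  Tokens destined for a two-input instruction wait in the matching store until a partner with the same (instruction, iteration) tag arrives; the iteration field keeps tokens from different loop instances apart, and the compiler-chosen bound limits how far the loop may expand.

```python
# Toy model of dynamic data flow token matching.  A token waits in the
# matching store until its partner with the same tag arrives; the
# iteration field in the tag lets several loop instances be active at
# once.  MAX_ITERATIONS stands in for a compiler-chosen loop bound.

MAX_ITERATIONS = 5          # illustrative compiler-chosen bound

matching_store = {}         # tag -> first operand to arrive
fired = []                  # (tag, left, right) triples ready to execute

def send_token(instr, iteration, value):
    if iteration >= MAX_ITERATIONS:
        raise RuntimeError("loop expansion exceeds compiler bound")
    tag = (instr, iteration)
    if tag in matching_store:                      # partner already waiting
        fired.append((tag, matching_store.pop(tag), value))
    else:
        matching_store[tag] = value                # wait for the partner

# Two loop iterations active simultaneously, tokens arriving out of order:
send_token("add", 0, 1)
send_token("add", 1, 10)
send_token("add", 1, 20)    # completes iteration 1's operand pair
send_token("add", 0, 2)     # completes iteration 0's operand pair
print(fired)                # [(('add', 1), 10, 20), (('add', 0), 1, 2)]
```

The overflow problem the text describes corresponds here to `matching_store` growing without bound when too many iterations send tokens whose partners have not yet arrived.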

  Arvind's implementation of dynamic data flow assumes that as
constructs like loops parallelize at run time, new loop instances run
on additional processors.  This means that the code for the loop must
be resident on the processor.  One easy way to handle this was used by
the Caltech Cosmic Cube group: a complete copy of the program is
loaded onto every processor.  Of course this is a very inefficient use
of memory, especially in a small grain parallel computer.  Like
Mr. Ostheimer, I have never seen an explanation of how code is allocated
to processors in a dynamic data flow system.  On Arvind's data flow
machine I don't think that assignment is dynamic (it is on the
Manchester data flow machine).

  The data flow model is an asynchronous model.  A data producer sends
data to a data consumer whenever it wants, and the consumer uses
the data when it is ready.  There are no rendezvous in data flow.
This has the advantage of allowing pipelining, which is an important
source of parallelism.  It has the disadvantage of consuming large,
potentially unbounded, amounts of buffer space.  If a data producer
(perhaps a data flow sub-graph) produces data faster than a consumer
can use it, the consumer's buffers will overflow.  Arvind wrote two
papers on "demand driven data flow", which allows the consumer to tell
the producer that it is ready for more data.  I do not know if the
current Id compiler that is being used by Arvind's group uses demand
driven data flow or not.  Without something like demand driven
data flow (or reduction) an asynchronous data flow machine has the
potential of consuming all of its buffer memory and crashing.
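One way to picture what demand-driven data flow buys (a sketch under my own assumptions, not Arvind's actual scheme): the consumer issues explicit demands, and the producer may only emit a datum when it holds an outstanding demand, so the amount of buffered data never exceeds the demand window.

```python
# Sketch of demand-driven producer/consumer flow control.  The producer
# may only emit when the consumer has signaled demand, so the buffer
# between them stays bounded.  Names and the window of 2 are my own
# illustrative choices, not taken from Arvind's papers.

from collections import deque

class Channel:
    def __init__(self):
        self.buffer = deque()   # data in flight, producer -> consumer
        self.demands = 0        # consumer-granted permission to produce
        self.peak = 0           # high-water mark of buffered data

    def demand(self, n=1):      # consumer: "I am ready for n more"
        self.demands += n

    def produce(self, value):   # producer side
        if self.demands == 0:
            return False        # no demand outstanding: hold the datum
        self.demands -= 1
        self.buffer.append(value)
        self.peak = max(self.peak, len(self.buffer))
        return True

    def consume(self):          # consumer side
        return self.buffer.popleft()

ch = Channel()
ch.demand(2)                                    # demand window of two
accepted = [ch.produce(v) for v in range(5)]
print(accepted)                                 # [True, True, False, False, False]
print(ch.peak)                                  # 2 -- buffering stayed bounded
```

Without the demand check in `produce`, a fast producer would grow `buffer` without limit, which is exactly the crash scenario described above.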
      
  Arvind and Dennis propose data flow supercomputers.  Supercomputers
are used primarily for numeric computation (as opposed to symbolic
computation).  This means that numeric programs must run well on the
system being proposed.  As many people have noted, numeric programs
spend most of their time in loops.  While existing programs might be
better restated for a data flow machine (in Id for example), they will
still have a heavy loop content.  Arvind's data flow model takes
advantage of the parallelism in loops and produces a great deal of
parallelism while the loop is executing.  Many loops must synchronize,
either during the loop execution (if the loop is pipelining) or at the
end of the loop.  If one looks at a performance graph for a numeric
program executing on a data flow machine, there will be large peaks,
with a lot of parallelism, and troughs, where synchronization must
take place.  The synchronization is taking place on a relatively
small number of processors.  At this point, the entire computation is
limited by the speed of the processors.  If the computation is highly
parallel, these synchronization points will come to dominate the
computation time.  The data flow model assumes that many (e.g., 10^4
to 10^6) relatively inexpensive, relatively low performance,
processors are used.  This is fine for the "peaks" of parallelism, but
these slow processors become the limiting factor for synchronization.
This is one of the reasons that Prof. Kuck at the Univ. of Ill.
proposes a few very powerful processors for his Cedar parallel
processor.  Since these processors are fast, they will execute the
synchronization sections rapidly.
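The argument above can be made concrete with a back-of-the-envelope Amdahl-style calculation (my own illustration, with made-up numbers): if even a small fraction of the work is serialized onto slow processors at synchronization points, overall speedup is capped no matter how many processors handle the parallel peaks.

```python
# Amdahl-style illustration of synchronization troughs.  If a fraction
# "serial" of the work runs at synchronization points, speedup on p
# processors is bounded regardless of p.  The 5% figure is made up
# purely for illustration.

def speedup(p, serial):
    return 1.0 / (serial + (1.0 - serial) / p)

serial = 0.05   # 5% of the work spent in synchronization troughs
for p in (10, 100, 10_000, 1_000_000):
    print(f"{p:>9} processors: speedup {speedup(p, serial):6.1f}")

# The limit as p grows is 1/serial = 20, so beyond a point the many
# cheap processors buy almost nothing; making the synchronization
# sections fast (Kuck's approach in Cedar) raises this ceiling.
```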

Conclusion

  Arvind's data flow model has a reasonable solution for handling
global state (global data structures).  The dynamic data flow model
has the potential of producing so much loop parallelism that the data
flow matching stores overflow.  This can be managed by choking loop
parallelism.  Arvind also has a solution for handling the
"producer-consumer" problem, but it is very theoretical and needs work
before it can be implemented.  Using a combination of reduction and
data flow may provide an elegant solution, but I have not seen Arvind
propose this (although there are people in his lab who know reduction
well).  Also, it is not clear, at least to me, that small grain data
flow is a good model for numeric computation.  Symbolic computation
might be a better application, since less synchronization may be
required.  Finally, I have heard a lot of grumbling at conferences
along the lines of "Why is there no real hardware implementation of a
data flow machine?"  One person even suggested that no machine had
been built "because Arvind knew that it would not work" (a view that I
don't subscribe to).  Although some of the criticism is unfair, it
still remains true that there is no "iron".  Compared to a couple of
years ago, data flow work has lost momentum if the ACM Computer
Architecture Conference proceedings and the International Conference
on Parallel Processing proceedings are any indication.  Of the data
flow papers I saw presented at the last ICPP, none of them addressed
the hard problems in data flow.  All in all, data flow in America does
not appear as lively as it once was.  However, the Japanese just
finished a fairly large data flow parallel processor named Sigma, so
there is still work going on.  I have not seen any information on the
latest Sigma work.

  Well, I hope that this generates some light, in addition to heat.
Sorry it was so long.  Any views expressed here are personal, and do
not necessarily reflect those of ESL or my department.

           Ian L. Kaplan
           ESL Inc.
           Advanced Technology Systems
           ian@esl.COM
           ames!esl!ian