Path: utzoo!utgpu!water!watmath!clyde!rutgers!ames!amelia!orville.nas.nasa.gov!fouts
From: fouts@orville.nas.nasa.gov (Marty Fouts)
Newsgroups: comp.arch
Subject: Re: Single tasking the wave of the future?
Message-ID: <25@amelia.nas.nasa.gov>
Date: 16 Dec 87 19:04:51 GMT
References: <201@PT.CS.CMU.EDU> <388@sdcjove.CAM.UNISYS.COM> <988@edge.UUCP> <1227@sugar.UUCP> <151@sdeggo.UUCP> <1423@cuuxb.ATT.COM> <439@xyzzy.UUCP> <440@xyzzy.UUCP> <36083@sun.uucp> <18@amelia.nas.nasa.gov> <2341@encore.UUCP>
Sender: news@amelia.nas.nasa.gov
Reply-To: fouts@orville.nas.nasa.gov (Marty Fouts)
Lines: 188

In article <2341@encore.UUCP> fay@encore.UUCP (Peter Fay) writes:
>In regard to general-purpose multiprocessors:
>
>Every fork() (or thread_create() in Mach) in every program can get
>scheduled on a different cpu (that includes every shell per user,
>daemon, background task, ... Also, don't forget all those kernel
>processes (or tasks, threads) running on different cpus (pagers,
>signal handlers, ...). How difficult is it when the O.S.  does it
>transparently? 

Mr. Fay is using "transparently" in a way with which I am unfamiliar.
It is true that Mach provides primitives which allow the programmer to
introduce multitasking into a program, but these are in no sense
transparent.  Task creation and deletion, synchronization, and task
scheduling all require explicit code in the program which is to take
advantage of the tasking.  Even the ancient Unix fork() has to be
explicitly coded for.

>
>And then there are more sophisticated mechanisms ("micro-tasking", gang
>scheduling, vectorizing fortran compilers) available to any user who
>wants more capability. 
>

A problem with these more sophisticated mechanisms is that they can
lead to parallel execution in which the wall clock time goes up as a
function of the number of processors, rather than down.  Another
problem is that no compiler can optimize perfectly.  The better the
programmer understands what the optimizer can optimize, the easier it
is to write optimizable code.  With vector code this is fairly
straightforward to do; with parallel code it is much more difficult.
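
To illustrate (this is a sketch of my own, not taken from any vendor's
documentation): the first loop below has independent iterations and is
the kind of code a current vectorizer handles easily; the second
carries a value from one iteration into the next and is normally left
scalar.

      SUBROUTINE DEMO(A, B, C, N)
      REAL A(N), B(N), C(N)
C Independent iterations: a vectorizer handles this easily.
      DO 10 I = 1, N
         A(I) = B(I) + 2.0 * C(I)
 10   CONTINUE
C A(I) needs A(I-1), so this loop is normally left scalar.
      DO 20 I = 2, N
         A(I) = A(I) + A(I-1)
 20   CONTINUE
      RETURN
      END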

>Writing software which exploits the FULL parallelism of a machine MAY
>be hard to do in CERTAIN cases. 

It has been my experience in five years of writing code to exploit
distributed and concurrent parallelism that it is hard in a large
number of cases.  There are well behaved algorithms, such as those
involved in PDE solution, for which parallelism is trivial.  There are
also algorithms requiring high communication cost, much
synchronization, and unpredictable work per task, which are nearly
intractable.  There appears to be an entire class of algorithms for
which no parallel solution is more efficient than a sequential
solution on the same processor.
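
For the well behaved case, a sketch of the heart of a Jacobi
relaxation sweep (my own illustration, with made up names) shows why
the parallelism is trivial: every new point depends only on the old
grid, so the rows of the sweep can be dealt out to processors with no
synchronization inside the sweep.

      SUBROUTINE SWEEP(UOLD, UNEW, N)
      REAL UOLD(N,N), UNEW(N,N)
C Each new point uses only the old grid, so rows (or blocks of
C rows) can be handed to separate processors independently.
      DO 20 J = 2, N - 1
         DO 10 I = 2, N - 1
            UNEW(I,J) = 0.25 * (UOLD(I-1,J) + UOLD(I+1,J) +
     &                          UOLD(I,J-1) + UOLD(I,J+1))
 10      CONTINUE
 20   CONTINUE
      RETURN
      END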

>Debugging is a whole other soapbox, but my
>experience is that debugging coding errors is not much more difficult
>than on uniprocessors.  What is hard (or "impossible" with current tools)
>is detecting race conditions and bottlenecks - i.e. CONCEPTUAL errors. 
>This is one of the many time lags in parallel software tool development,
>not an inherent defect in architecture. Race conditions are not a
>common occurrence for users to debug.
>

I will not quibble over whether the defect is inherent; I will only
agree that it is a most difficult problem, and one which decades of
parallel processing research have not solved adequately.

>Price/performance advantage (for small multiprocessor minis and up) is 
>huge.
>

Again, this is only true for well behaved algorithms.  Throwing a
parallel machine at an average workload is like throwing a vector
machine at it, only more so: the difference between the vendor's 'not
to be exceeded' speed for the machine and the delivered throughput for
the workload can be two orders of magnitude, dramatically reducing the
real price/performance.

>I/O can be a bottleneck for PC's, minis and Crays. The solution is to 
>parallelize I/O, which is what multiprocessors do. (General-purpose
>multis are NOT "very high-end" -- they run from $70K - $750K).
>

Oddly enough, the Cray 2 does not solve the I/O bottleneck problem by
providing parallel I/O processors.  It is one of the best I/O balanced
multiprocessor machines, and it accomplishes this with a single I/O
(foreground) processor performing work for all four compute
(background) processors.  The issue isn't how the scaling is achieved,
but that it frequently isn't achieved at all.

>> ...I still remember the PDP 11/782 ...
>
>That was a long time ago in multiprocessor history. All I/O was done
>by one CPU (master/slave) - sequential, not parallel. It was NOT a
>symmetric multi like those today. 

Actually, the 11/782 wasn't a long time ago in multiprocessor history.
The earliest multiprocessor machines were the earliest machines.  It
was John von Neumann's greatest gift to get us out of the completely
parallel processor business and into the uniprocessor business.  But
even recognizable dyadic multiprocessors started appearing a decade
before the 11/782.  The point of the analogy, which was poorly drawn,
is that many vendors are still falling into the trap of scaling up one
function of a machine while not scaling up the rest of the system to
match, leading to very unbalanced systems.  If your favorite PC, which
has a faster IP/FLOP rating than a 780, can't support 30 users because
it has poor I/O performance, adding N Transputers isn't going to allow
it to support 30 users.

(By the way, not all current multiprocessors are symmetrical.  They
 actually fall into a fairly wide range of classes with respect to how
 I/O load is distributed among the processors.)

>
>>Saturation of usable HW technology is close.
>
>What does this cliche mean?
>

It means I was in a hurry to finish, so I didn't draw this point out.
Sorry about that.  It appears that the rate of performance improvement
from new implementation technology is slowing dramatically.  Top end
clock rates are down to a few nanoseconds, and it doesn't appear
likely that subnanosecond rates are going to arrive before the end of
the century.  This means the bottom end is within two orders of
magnitude of the top and rapidly getting closer.  To me this means
that radical performance improvements over the spectrum of machines
aren't as likely as a narrowing of the spectrum, thus "saturation of
usable technology."

When you couple this with the fact that after twenty years of various
kinds of software/hardware research, parallel processing is still
limited to the technology described by Dijkstra in '63, and still has
the same problems in software development, it is probably fair to say
that little progress is being made in software either.

>By the way, I don't fault people for not understanding
>about multis (e.g., from past exposure or school). It takes some
>time for common misconceptions to catch up to current reality. 
>

Since I don't see a smiley face on this, I will assume that it was
intended to be as obnoxious and condescending as it sounds.  I won't
argue over who misunderstands what, but I will point out to Mr. Fay
that I daily program a range of multiprocessors in both distributed
and parallel applications, and that my opinions are the result of
direct experience with all of the currently available kinds of
multiprocessors (MIMD, SIMD, Multiflow-style, massively or sparsely
parallel), as well as many of the dinosaurs.

As an example of a class of algorithms which is difficult to vectorize
or parallelize, let me pull out the ancient prime finder algorithm:

      PARAMETER (MAXP = 1000, MAXN = 100000)
      INTEGER IPRIME(MAXP)
C MAXP bounds the prime table and MAXN bounds the search; the
C values are arbitrary.
      IPRIME(1) = 1
      IPRIME(2) = 2
      IPRIME(3) = 3
      NPRIME = 3
      DO 50 N = 5, MAXN, 2
         DO 10 I = 3, NPRIME
            IQ = N / IPRIME(I)
            IR = N - (IPRIME(I) * IQ)
C           A zero remainder means N is composite; try the next N.
            IF (IR .EQ. 0) GO TO 40
C           No divisor found up through the square root of N: prime.
            IF (IQ .LT. IPRIME(I)) GO TO 20
 10      CONTINUE
 20      NPRIME = NPRIME + 1
         IPRIME(NPRIME) = N
         IF (NPRIME .GE. MAXP) GO TO 60
 40      CONTINUE
 50   CONTINUE
 60   CONTINUE
      END


Although there are different algorithms for finding primes, I use this
one to illustrate a class of problems which comes up frequently in my
work.  There exists some set from which must be drawn a subset.  There
exists a rule for ordering the set and another rule for determining if
an element is a member of the subset.  The determination rule requires
that the subset drawn from all elements ordered before X in the initial
set already be known before X can be tested.  The test itself compares
X with some of the elements of that known subset to decide whether X
should be added to the subset.
Usually, although not always, there is an ordering of the subset (not
necessarily the same as the ordering of the set) such that by
comparing X with members of the subset in order, it is frequently 
unnecessary to compare X with all of the members.
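
In skeleton form (again my own restatement, with invented names;
MEMBER stands in for whatever the problem specific test happens to
be), the pattern looks like this, and the outer loop carries a true
dependence through NSUB and ISUB:

      INTEGER FUNCTION BUILD(ISET, NSET, ISUB, MAXSUB, MEMBER)
      INTEGER ISET(NSET), ISUB(MAXSUB)
      LOGICAL MEMBER
      EXTERNAL MEMBER
C The test for ISET(I) needs the subset built from everything
C ordered before it, so the loop cannot be split across
C processors without extra machinery.
      NSUB = 0
      DO 10 I = 1, NSET
         IF (MEMBER(ISET(I), ISUB, NSUB)) THEN
            NSUB = NSUB + 1
            ISUB(NSUB) = ISET(I)
            IF (NSUB .GE. MAXSUB) GO TO 20
         END IF
 10   CONTINUE
 20   BUILD = NSUB
      RETURN
      END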

None of the vectorizing compilers that I have access to will attempt
to vectorize this algorithm.  The Alliant automatic parallelizer will
not attempt to parallelize it.  Most of the mechanisms I have tried for
handcrafting a vector or parallel variant which remains true to the
description above have added sufficient extra work to the algorithm
that it runs more slowly as the number of processors increases.

I would be indebted to Mr. Fay, or to anyone else, for a parallel or
vectorizable version of the algorithm which keeps it an example of the
kind of problem I have described above.