Path: utzoo!utgpu!water!watmath!clyde!rutgers!ames!amelia!orville.nas.nasa.gov!fouts
From: fouts@orville.nas.nasa.gov (Marty Fouts)
Newsgroups: comp.arch
Subject: Re: Single tasking the wave of the future?
Message-ID: <25@amelia.nas.nasa.gov>
Date: 16 Dec 87 19:04:51 GMT
References: <201@PT.CS.CMU.EDU> <388@sdcjove.CAM.UNISYS.COM> <988@edge.UUCP> <1227@sugar.UUCP> <151@sdeggo.UUCP> <1423@cuuxb.ATT.COM> <439@xyzzy.UUCP> <440@xyzzy.UUCP> <36083@sun.uucp> <18@amelia.nas.nasa.gov> <2341@encore.UUCP>
Sender: news@amelia.nas.nasa.gov
Reply-To: fouts@orville.nas.nasa.gov (Marty Fouts)
Lines: 188

In article <2341@encore.UUCP> fay@encore.UUCP (Peter Fay) writes:
>In regard to general-purpose multiprocessors:
>
>Every fork() (or thread_create() in Mach) in every program can get
>scheduled on a different cpu (that includes every shell per user,
>daemon, background task, ... Also, don't forget all those kernel
>processes (or tasks, threads) running on different cpus (pagers,
>signal handlers, ...). How difficult is it when the O.S. does it
>transparently?

Mr. Fay is using "transparently" in a way with which I am unfamiliar.
It is true that Mach provides primitives which allow the programmer to
introduce multitasking into the program, but these are in no sense
transparent.  Task creation and deletion, synchronization, and task
scheduling all require explicit code in the program which is to take
advantage of the tasking.  Even the ancient Unix fork() has to be
explicitly coded for.

>
>And then there are more sophisticated mechanisms ("micro-tasking", gang
>scheduling, vectorizing fortran compilers) available to any user who
>wants more capability.
>

A problem with these more sophisticated mechanisms is that they can
lead to parallel execution in which the wall clock time goes up as a
function of the number of processors, rather than down. . .

Another problem is that no compiler can perfectly optimize.  The better
the programmer understands what the optimizer can optimize, the easier
it is to write optimizable code.  With vector code, this is fairly
straightforward to do; with parallel code, it is much more difficult.

>Writing software which exploits the FULL parallelism of a machine MAY
>be hard to do in CERTAIN cases.

It has been my experience in five years of writing code to exploit
distributed and concurrent parallelism that it is hard in a large
number of cases.  There are well behaved algorithms, such as those
involved in PDE solution, for which parallelism is trivial.  There are
also algorithms requiring high communication cost, much
synchronization, and unpredictable work per task which are nearly
intractable.  There appears to be an entire class of algorithms for
which no parallel solution is more efficient than a sequential
solution on the same processor.

>Debugging is a whole other soapbox, but my
>experience is that debugging coding errors is not much more difficult
>than uniprocessors. What is hard (or "impossible" with current tools)
>is detecting race conditions and bottlenecks - i.e. CONCEPTUAL errors.
>This is one of the many time lags in parallel software tool development,
>not an inherent defect in architecture. Race conditions are not a
>common occurrence for users to debug.
>

I will not quibble over whether the defect is inherent; I will only
agree that it is a most difficult problem, and one which decades of
parallel processing research have not solved adequately.

>Price/performance advantage (for small multiprocessor minis and up) is
>huge.
>

Again, this is only true for well behaved algorithms.
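To make the distinction concrete, here is a purely illustrative C
fragment (the array names and sizes are invented for the example) of
the kind of "well behaved" computation I mean: one Jacobi-style
relaxation sweep, in which every result depends only on values from
the previous sweep, so the iterations are independent and could be
handed to separate processors in any order.

#include <stdio.h>

#define N 64                        /* grid size chosen arbitrarily */

double old_grid[N][N];
double new_grid[N][N];

/* One relaxation sweep.  Each new_grid[i][j] reads only old_grid,
 * never another entry of new_grid, so the (i,j) iterations carry no
 * dependence on one another.
 */
void relax(void)
{
    int i, j;

    for (i = 1; i < N - 1; i++)
        for (j = 1; j < N - 1; j++)
            new_grid[i][j] = 0.25 * (old_grid[i-1][j] + old_grid[i+1][j]
                                   + old_grid[i][j-1] + old_grid[i][j+1]);
}

int main(void)
{
    relax();
    printf("center value: %f\n", new_grid[N/2][N/2]);
    return 0;
}

The algorithms I complain about below have exactly the opposite
property: no step can proceed until the results of earlier steps are
known.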
Throwing a parallel machine at an average workload is like throwing a
vector machine at it, only more so: the difference between the vendor's
'not to be exceeded' speed for the machine and the delivered throughput
for the workload can be two orders of magnitude, dramatically reducing
the real price/performance.

>I/O can be a bottleneck for PC's, minis and Crays. The solution is to
>parallelize I/O, which is what multiprocessors do. (General-purpose
>multis are NOT "very high-end" -- they run from $70K - $750K).
>

Oddly enough, the Cray 2 does not solve the I/O bottleneck problem by
providing parallel I/O processors.  It is one of the best I/O balanced
multiprocessor machines, and accomplishes this with a single I/O
(foreground) processor performing work for all four compute
(background) processors.  The point isn't how the scaling is achieved,
but that it frequently isn't achieved.

>> ...I still remember the PDP 11/782 ...
>
>That was a long time ago in multiprocessor history. All I/O was done
>by one CPU (master/slave) - sequential, not parallel. It was NOT a
>symmetric multi like those today.

Actually, the 11/782 wasn't a long time ago in multiprocessor history.
The earliest multiprocessor machines were the earliest machines.  It
was John von Neumann's greatest gift to get us out of the completely
parallel processor business and into the uniprocessor business.  But
even recognizable dyadic multiprocessors started appearing a decade
before the 11/782.

The point of the analogy, which was poorly drawn, is that many vendors
are still falling into the trap of scaling up one function of a
machine while not scaling up the rest of the system to match, leading
to very unbalanced systems.  If your favorite PC, which has a faster
IP/FLOP rating than a 780, can't support 30 users because it has poor
I/O performance, adding N Transputers isn't going to allow it to
support 30 users. . .

(By the way, not all current multiprocessors are symmetrical.  They
actually fall into a fairly wide range of classes with respect to how
I/O load is distributed among the processors.)

>
>>Saturation of usable HW technology is close.
>
>What does this cliche mean?
>

It means I was in a hurry to finish, so I didn't draw this point out.
Sorry about that.  It appears that the rate at which performance
improves through new implementation technology is slowing dramatically.
Top end clock rates are down to a few nanoseconds, and it doesn't
appear likely that subnanosecond rates are going to arrive before the
end of the century.  This means the bottom end is within two orders of
magnitude of the top and rapidly getting closer.  To me this means that
radical performance improvements over the spectrum of machines aren't
as likely as a narrowing of the spectrum, thus "saturation of usable
technology."

When you couple this with the fact that after twenty years of various
kinds of software/hardware research, parallel processing is still
limited to the technology described by Dijkstra in '63, and still has
the same problems in software development, it is probably fair to say
that little progress is being made in software either.

>By the way, I don't fault people for not understanding
>about multis (e.g., from past exposure or school). It takes some
>time for common misconceptions to catch up to current reality.
>

Since I don't see a smiley face on this, I will assume that it was
intended to be as obnoxious and condescending as it sounds.  I won't
argue over who misunderstands what, but I will point out to Mr.
Fay that I daily program a range of multiprocessors in both
distributed and parallel applications, and that my opinions are the
result of direct experience with all of the currently available kinds
of multiprocessors (MIMD, SIMD, multiflow, massive or sparse), as well
as many of the dinosaurs.

As an example of a class of algorithms which is difficult to vectorize
or parallelize, let me pull out the ancient prime finder algorithm:

      IPRIME(1) = 1
      IPRIME(2) = 2
      IPRIME(3) = 3
      NPRIME = 3
      DO 50 N = 5, MAXN, 2
         DO 10 I = 3, NPRIME
            IQ = N / IPRIME(I)
            IR = N - (IPRIME(I) * IQ)
            IF (IR .EQ. 0) GO TO 40
            IF (IQ .LT. IPRIME(I)) GO TO 20
   10    CONTINUE
   20    NPRIME = NPRIME + 1
         IPRIME(NPRIME) = N
         IF (NPRIME .GE. MAXP) GO TO 60
   40    CONTINUE
   50 CONTINUE
   60 CONTINUE

Although there are different algorithms for finding primes, I use this
one to illustrate a class of problems which comes up frequently in my
work.  There exists some set from which a subset must be drawn.  There
exists a rule for ordering the set and another rule for determining
whether an element is a member of the subset.  The determination rule
requires that the subset drawn from all elements ordered before X in
the initial set be known before it can be decided whether X is a
member of the subset.  The determination test requires a comparison
with some of the elements of the known subset to decide if X should be
added to the subset.  Usually, although not always, there is an
ordering of the subset (not necessarily the same as the ordering of
the set) such that by comparing X with members of the subset in order,
it is frequently unnecessary to compare X with all of the members.

None of the vectorizing compilers that I have access to will attempt
to vectorize this algorithm.  The Alliant automatic parallelizer will
not attempt to parallelize it.  Most of the mechanisms I have tried in
order to handcraft a vector or parallel variant which remains true to
the previous paragraph have added sufficient extra work to the
algorithm that it runs more slowly as the number of processors
increases.

I would be indebted to Mr. Fay, or anyone else, who could provide me
with a parallel or vectorizable version of the algorithm which
maintained it as an example of the kind of problem I have described
above.
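For anyone who wants to take up that challenge, here is a rough C
transcription of the loop above (a sketch only; the names mirror the
Fortran, and the MAXN/MAXP values are arbitrary).  The classification
of candidate N reads entries of IPRIME that were appended while
classifying earlier candidates, and the early exit past sqrt(N) makes
the amount of work per candidate unpredictable - which is exactly what
defeats the vectorizers and parallelizers I have tried.

#include <stdio.h>

#define MAXN 1000                   /* arbitrary limits for the sketch */
#define MAXP 200

int iprime[MAXP + 1];               /* 1-based, to mirror the Fortran  */

int main(void)
{
    int nprime, n, i;

    iprime[1] = 1;
    iprime[2] = 2;
    iprime[3] = 3;
    nprime = 3;

    for (n = 5; n <= MAXN && nprime < MAXP; n += 2) {
        int is_prime = 1;

        /* This loop reads iprime[] entries that were appended by
         * earlier candidates; candidate n cannot be classified until
         * every prime up to sqrt(n) is already in the table.
         */
        for (i = 3; i <= nprime; i++) {
            if (n % iprime[i] == 0) {       /* factor found: composite */
                is_prime = 0;
                break;
            }
            if (n / iprime[i] < iprime[i])  /* iprime[i] > sqrt(n), so */
                break;                      /* no factor can exist     */
        }
        if (is_prime)
            iprime[++nprime] = n;           /* extend the known subset */
    }

    printf("%d entries, largest %d\n", nprime, iprime[nprime]);
    return 0;
}

Any vector or parallel version has to respect that chain: the subset
being searched is also the subset being built.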