Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!rutgers!iuvax!pur-ee!uiucdcs!uxc.cso.uiuc.edu!ccvaxa!aglew
From: aglew@ccvaxa.UUCP
Newsgroups: comp.arch
Subject: Re: Horizontal Pipelining -- a pair
Message-ID: <28200071@ccvaxa>
Date: Sat, 21-Nov-87 15:43:00 EST
Article-I.D.: ccvaxa.28200071
Posted: Sat Nov 21 15:43:00 1987
Date-Received: Sun, 29-Nov-87 18:00:40 EST
References: <391@sdcjove.CAM.UNISYS.COM>
Lines: 70
Nf-ID: #R:sdcjove.CAM.UNISYS.COM:391:ccvaxa:28200071:000:3506
Nf-From: ccvaxa.UUCP!aglew    Nov 21 14:43:00 1987

..> Superlinear speedups

> > Just for drill, consider the fact that a 2-CPU S/370-168-MP ran a
> > throughput of 2.1 to 2.25 over a 1-CPU system, depending on just what we
> > gave it for workload.

> If I understand correctly, you are saying that going from 1 to 2 CPUs
> _more than doubled_ the throughput of the system? This is counterintuitive.
> If you could expand on it, I would appreciate it.
> For instance, what are we talking about as a measure of throughput, and
> was the job mix similar?

Many people and companies are finding superlinear speedups, where N
processors are more than N times faster than a single processor, especially
on job mixes but also on single programs, and on many different performance
indices, including the most important one, elapsed time.

The reason is basically reduced context switching - reduced in ways that you
cannot necessarily get by applying tricks such as interrupt stacking on a
single processor.

Now, it should be obvious how multiple processors reduce context switching
relative to a single processor running a timeshared system, but how does this
improve the performance of a single program? Well, programs typically have a
lot of hidden, internal context.

Consider a program that has four sections, A1, A2, B1, B2, where both A1 and
B1 have to run before either A2 or B2. But suppose that A1 and A2 share a lot
of context - e.g. they manipulate a huge array - and B1 and B2 share a lot of
context, but there is little context shared between the Ai and Bi families.
E.g. A1 does enough processing on the huge array it shares with A2 that it
completely fills the cache and flushes out any data that B1 might have
loaded, and so on. Now say that the dependency from B1 to A2 (and from A1 to
B2) is just a small amount of data.

On a single processor, the time for this program to run is

	P(A1) + P(B1) + P(A2) + P(B2) + 2*C(AB)

where P(jobstep) = processing time, and C(AB) = time to change context
between the A and B families. (Schedule the steps as A1, B1, B2, A2 and you
pay for two context changes; you can analyze this in more ways, but that's
the basic idea.)

On a multiprocessor, the total run time is

	P(A1) + ... + P(B2) + 2*M(AB)

but the elapsed time is

	max(P(A1),P(B1)) + max(P(A2),P(B2)) + M(AB)

where M(AB) is the cost to send a message between the two processors
containing the information from B1 that A2 depends on, and vice versa. (The
two messages cross in parallel, so only one of them shows up in the elapsed
time.) If C(AB) is large, it is quite easy for the elapsed time on the
multiprocessor to be less than half that on the uniprocessor.

The key point here appears to be context. We are comparing N processors with
N*M memory to 1 processor with M memory, not N processors with small
memories M/N. Or actually not memories, but "context" - caches can play the
same role on shared memory machines, as can private register sets. The small
memory multiprocessor can still win out in certain cases, but in fewer of
them, I suspect, than the large memory machine.

So, not only has superlinear speedup been observed in practice, but a
theoretical understanding of it is possible too.
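To make the arithmetic concrete, here is a toy calculation in C of the model
above. All the numbers are invented, purely for illustration; they are not
measurements of any real machine.

	/* Toy model of the A1/B1 -> A2/B2 example above.
	   All times are made-up illustrative units. */
	#include <stdio.h>

	#define MAX(x,y)	((x) > (y) ? (x) : (y))

	int main(void)
	{
	    double pA1 = 10.0, pA2 = 10.0;	/* step times, A family */
	    double pB1 = 10.0, pB2 = 10.0;	/* step times, B family */
	    double cAB = 8.0;	/* context change cost, e.g. cache refill */
	    double mAB = 1.0;	/* cross-processor message cost */

	    /* Uniprocessor, best ordering A1, B1, B2, A2:
	       two context changes. */
	    double uni = pA1 + pB1 + pA2 + pB2 + 2.0*cAB;

	    /* Two processors: A1 || B1, exchange messages, A2 || B2.
	       The two messages cross in parallel, so elapsed time
	       pays for only one M(AB). */
	    double mp = MAX(pA1, pB1) + MAX(pA2, pB2) + mAB;

	    printf("uniprocessor elapsed: %g\n", uni);	/* 56 */
	    printf("2-CPU elapsed:        %g\n", mp);	/* 21 */
	    printf("speedup:              %g\n", uni/mp);	/* ~2.67 */
	    return 0;
	}

With these numbers the two-processor elapsed time is 21 units against 56 on
one processor, a speedup of about 2.67 - superlinear, because the 8-unit
context change cost is paid twice on the uniprocessor and not at all on the
multiprocessor.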
Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people
who let themselves be affected by silly things, so: my opinions are my own,
and not the opinions of my employer, or any other organisation with which I
am affiliated. I indicate my employer only so that other people may account
for any possible bias I may have towards my employer's products or systems.