Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!rutgers!iuvax!pur-ee!uiucdcs!uxc.cso.uiuc.edu!ccvaxa!aglew
From: aglew@ccvaxa.UUCP
Newsgroups: comp.arch
Subject: Re: Horizontal Pipelining -- a pair
Message-ID: <28200071@ccvaxa>
Date: Sat, 21-Nov-87 15:43:00 EST
Article-I.D.: ccvaxa.28200071
Posted: Sat Nov 21 15:43:00 1987
Date-Received: Sun, 29-Nov-87 18:00:40 EST
References: <391@sdcjove.CAM.UNISYS.COM>
Lines: 70
Nf-ID: #R:sdcjove.CAM.UNISYS.COM:391:ccvaxa:28200071:000:3506
Nf-From: ccvaxa.UUCP!aglew    Nov 21 14:43:00 1987


..> Superlinear speedups

> >      Just for drill, consider the fact that a 2-CPU S/370-168-MP ran a
> > throughput of 2.1 to 2.25 over a 1-CPU system, depending on just what we
> > gave it for workload.
>      If I understand correctly, you are saying that going from 1 to 2 CPUs
> _more than doubled_ the throughput of the system? This is counter intuitive.
> If you could expand on it, I would appreciate it.
>      For instance, what are we talking about as a measure of throughput, and
> was the job mix similar?

Many people and companies are finding superlinear speedups, where N
processors do more than N times better than a single processor, especially
on job mixes but also in single programs, and by many different performance
indices, including the most important one: elapsed time.

The reason is basically reduced context switching, of a kind that you can't
necessarily obtain on a single processor by applying tricks such as
interrupt stacking.

Now, it should be obvious how multiple processors reduce context switching
wrt. a single processor when it's running a timeshared system, but how does
this improve the performance of a single program?

Well, programs typically have a lot of hidden, internal context. Consider
a program with four sections, A1, A2, B1, and B2, where both A1 and B1 have
to run before either A2 or B2. Suppose that A1 and A2 share a lot of
context - e.g. they manipulate the same huge array - and B1 and B2 likewise,
but that little context is shared between the Ai and Bi families. E.g. A1
does enough processing on the huge array it shares with A2 to completely
fill the cache, flushing out any data that B1 might have loaded, and so on.
	Now say that the dependency from B1 to A2 (and likewise from A1 to
B2) is just a small amount of data.
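
A minimal sketch of such a program, in C (the array size, the function
names, and the one-double handoff are all invented here, just to make the
shape concrete):

	#include <stdio.h>

	#define N (1L << 20)            /* big enough to overwhelm any cache */

	static double a[N], b[N];       /* the A-family and B-family context */

	/* A1: sweeps all of a[], yielding a small summary for B2. */
	static double a1(void) {
	    double s = 0.0;
	    for (long i = 0; i < N; i++) { a[i] = (double)i; s += a[i]; }
	    return s;
	}

	/* B1: sweeps all of b[], yielding a small summary for A2. */
	static double b1(void) {
	    double s = 0.0;
	    for (long i = 0; i < N; i++) { b[i] = (double)(N - i); s += b[i]; }
	    return s;
	}

	/* A2: sweeps a[] again; its only B-family input is one double. */
	static double a2(double from_b1) {
	    double s = from_b1;
	    for (long i = 0; i < N; i++) s += a[i];
	    return s;
	}

	/* B2: sweeps b[] again; its only A-family input is one double. */
	static double b2(double from_a1) {
	    double s = from_a1;
	    for (long i = 0; i < N; i++) s += b[i];
	    return s;
	}

	int main(void) {
	    double ra = a1();           /* loads a[] into the cache          */
	    double rb = b1();           /* flushes a[], loads b[]: 1 change  */
	    double s2 = b2(ra);         /* b[] still warm: no change         */
	    double s1 = a2(rb);         /* reloads a[]: a second change      */
	    printf("%g %g\n", s1, s2);
	    return 0;
	}

On one processor the sweeps thrash the cache; on two processors each CPU
keeps its own array warm, and only the two summary doubles ever cross
between them.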

On a single processor, the time for this program to run is
	P(A1) + P(B1) + P(A2) + P(B2) + 2*C(AB)
where P(jobstep) is processing time and C(AB) is the time to change context
between the A and B families - two changes, with the steps scheduled
A1, B1, B2, A2 (you can analyze this in more ways, but that's the basic
idea).

On a multiprocessor, the total run time is
	P(A1) +..+ P(B2) + 2*M(AB)
but the elapsed time is
	max(P(A1),P(B1)) + max(P(A2),P(B2)) + M(AB)
where M(AB) is the cost of sending a message between the two processors
carrying the information from B1 that A2 depends on, and vice versa (the
two messages overlap, so elapsed time pays for only one). If C(AB) is
large, it is quite easy for the elapsed time on the multiprocessor to be
less than half that on the uniprocessor.
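
To put invented numbers on it (purely illustrative): with P = 10 for every
jobstep, C(AB) = 15, and M(AB) = 2, the uniprocessor takes 4*10 + 2*15 = 70
units, while the two-processor elapsed time is 10 + 10 + 2 = 22 units - a
speedup of about 3.2 from only two processors. A toy C program that does
this arithmetic:

	#include <stdio.h>

	int main(void) {
	    double P = 10.0;  /* processing time per jobstep (invented)      */
	    double C = 15.0;  /* uniprocessor context-change cost (invented) */
	    double M =  2.0;  /* cross-processor message cost (invented)     */

	    double uni = 4.0*P + 2.0*C;  /* P(A1)+..+P(B2) + 2*C(AB)  */
	    double mp  = P + P + M;      /* max(..) + max(..) + M(AB) */

	    printf("uni %g  mp %g  speedup %g\n", uni, mp, uni/mp);
	    return 0;
	}

In this model the speedup exceeds 2 whenever C(AB) > M(AB), i.e. whenever
changing context on one processor costs more than passing a message
between two.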

The key point here appears to be context. We are comparing N processors
with N*M memory against 1 processor with M memory, not against N processors
with small memories of M/N each. Or, actually, not memories but "context" -
caches can play the same role on shared-memory machines, as can private
register sets.
	The small-memory multiprocessor can still win out in certain
cases, but in fewer of them, I suspect, than the large-memory machine.

So, not only has superlinear speedup been observed in practice, but a
theoretical understanding of it is possible too.


Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people
who let themselves be affected by silly things, so: my opinions are my own,
and not the opinions of my employer, or any other organisation with which I am
affiliated. I indicate my employer only so that other people may account for
any possible bias I may have towards my employer's products or systems.