Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!uwvax!astroatc!johnw
From: johnw@astroatc.UUCP (John F. Wardale)
Newsgroups: comp.arch
Subject: Re: What with these Vector's anyways? (nuts & bolts)
Message-ID: <369@astroatc.UUCP>
Date: Fri, 24-Jul-87 18:12:11 EDT
Article-I.D.: astroatc.369
Posted: Fri Jul 24 18:12:11 1987
Date-Received: Sat, 25-Jul-87 18:21:40 EDT
References: <2378@ames.arpa> <687@elmgate.UUCP> <2806@phri.UUCP>
Reply-To: johnw@astroatc.UUCP (John F. Wardale)
Organization: Astronautics Technology Cntr, Madison, WI
Lines: 69
Keywords: vector Cray Cyber CDC Cpu Supercomputers
Summary: vector lengths

In article <2806@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <687@elmgate.UUCP> jdg@aurora.UUCP (Jeff Gortatowsky) writes:
>wants to know what "vector" means in the context of "vector processors"
>
>	for i goes from 1 to upper-limit-of-x,y,z
>	do
>		z[i] = x[i] * y[i]
>	end
>
>	The problem is that the cpu wastes a lot of time doing the dunky
>work of executing the loop, (increment the index and check for upper
>limit), computing the addresses for the array references, fectching and
>decoding the multiply instruction opcode, etc, and only after all that does
>it get to do the "real" work of doing the floating-point multiply.  On a
>vector processor, you would have a single instruction to do the whole loop.

Generally, there is a limit on the vector length (64 elements on
the Crays), so the compiler breaks the loop into strips:
for i goes from 1 to max by 64
	z[i..i+63] = x[i..i+63] * y[i..i+63]
(with a special set-length and multiply for the last group of
max mod 64 elements)
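
Here is a rough sketch of that strip-mining in Python, just to make
the bookkeeping concrete.  The names (vector_multiply,
strip_mined_multiply, MAX_VL) are mine and purely illustrative;
vector_multiply stands in for a single vector instruction, and no
real compiler necessarily lays the loop out exactly this way.

```python
MAX_VL = 64  # assumed hardware vector-length limit (the Cray figure above)

def vector_multiply(x, y, start, length):
    """Hypothetical stand-in for ONE vector instruction on `length` elements."""
    return [x[start + k] * y[start + k] for k in range(length)]

def strip_mined_multiply(x, y, max_vl=MAX_VL):
    """Compute z[i] = x[i] * y[i] in strips of at most max_vl elements."""
    n = len(x)
    z = [0.0] * n
    strips = []  # record each strip length, for illustration only
    for i in range(0, n, max_vl):
        vl = min(max_vl, n - i)  # "set-length": shorter only for the last strip
        z[i:i + vl] = vector_multiply(x, y, i, vl)
        strips.append(vl)
    return z, strips
```

With 150 elements this issues three "vector instructions": two full
strips of 64 and a final short strip of 150 mod 64 = 22.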

On a scalar processor, this code need not be as grim as Roy would
lead you to think.  The loop will be in an I-cache, and the CPU
could have auto-increment addressing modes.

There are 3 common limits:
* issue limited:  An instruction is issued every clock, and the
memory and functional units keep up.  The bottleneck is in the
pipeline (decoding etc.).  This is an argument for vectors, and
for RISC.

* memory limited:  Memory bandwidth is saturated.  To improve
this you'll (probably) have to change the processor bus (or take
some other drastic measure).  Changing the program may help
but is labeled as "cheating."

* compute limited:  Functional units are busy, so "issues" must wait.
(common with vectors; very rare without vectors)

The real question is: can the scalar loop fetch and store data
(into and out of memory) faster than the multiplier can multiply?
This is an obvious requirement for vector instructions to be
practical.  [Side question:  Are there any "micros" that have (or
could benefit from) vectors, or are their memory interfaces too
low-bandwidth for this?]
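
To put some numbers on the three limits, here is a back-of-envelope
sketch in Python.  Everything in it is an assumption for
illustration: the function names are mine, and the instruction
counts and rates are made up, not measurements of any real machine.
It just computes clocks per element for issue, memory, and compute,
and reports whichever is largest.

```python
def clocks_per_element(instrs_per_elem, words_per_elem, ops_per_elem,
                       issue_rate, mem_words_per_clock, ops_per_clock):
    """Clocks each resource needs per loop element (all inputs hypothetical)."""
    return {
        "issue": instrs_per_elem / issue_rate,
        "memory": words_per_elem / mem_words_per_clock,
        "compute": ops_per_elem / ops_per_clock,
    }

def bottleneck(costs):
    """The limit is whichever resource needs the most clocks per element."""
    return max(costs, key=costs.get)

# z[i] = x[i] * y[i] moves 3 words (2 loads, 1 store) and does 1 multiply.
# Assumed scalar loop: 5 instructions per element (2 loads, multiply,
# store, loop branch), one instruction issued per clock.
scalar = clocks_per_element(5, 3, 1, 1.0, 1.0, 1.0)

# Assumed vector loop: 4 instructions per 64-element strip.
vector = clocks_per_element(4 / 64, 3, 1, 1.0, 1.0, 1.0)

# Same vector loop with an assumed slow multiplier (one result per 4 clocks).
slow_mult = clocks_per_element(4 / 64, 3, 1, 1.0, 1.0, 0.25)
```

Under these made-up numbers the scalar loop is issue limited, the
vector loop is memory limited, and the slow-multiplier case is
compute limited.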

>
>	Furthur, if you look carefully at a floating multiply operation,
>you see it takes a dozen or so atomic steps; multiply the mantissas, add
>the exponents, normalize the result, check for under/overflow, etc.  On a
>scalar machine these operations get done in series.  On a vector machine,
===================================================
>....[description of a pipelined multiplyer]

Not necessarily so!  A scalar machine *CAN* have pipelined
functional units!  Another concern is segment time (how often
you can start a new operation: once a clock?  once every two
clocks? ...)
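
A small sketch of why segment time matters, using the usual
textbook timing model for a pipelined unit: the first result takes
the full latency, and each later result follows one segment time
after the previous one.  The function names and the 7-clock latency
are assumptions of mine for illustration, and the model assumes the
operations are independent (no waiting on earlier results).

```python
def pipelined_clocks(n_ops, latency, segment_time):
    """Clocks for n_ops independent operations through a pipelined unit."""
    if n_ops <= 0:
        return 0
    # First result after `latency` clocks, then one more per segment time.
    return latency + (n_ops - 1) * segment_time

def unpipelined_clocks(n_ops, latency):
    """Clocks if each operation must finish before the next can start."""
    return max(n_ops, 0) * latency

# 64 multiplies with an assumed 7-clock multiply latency.
every_clock = pipelined_clocks(64, 7, 1)       # start one op per clock
every_other = pipelined_clocks(64, 7, 2)       # start one op per two clocks
serial = unpipelined_clocks(64, 7)             # no pipelining at all
```

None of this depends on whether the 64 multiplies come from one
vector instruction or from 64 scalar ones, which is the point: the
pipelining belongs to the functional unit, not to the instruction
set.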


			John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Name:	John F. Wardale
UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:	608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!