Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!uwvax!astroatc!johnw
From: johnw@astroatc.UUCP (John F. Wardale)
Newsgroups: comp.arch
Subject: Re: What with these Vector's anyways? (nuts & bolts)
Message-ID: <369@astroatc.UUCP>
Date: Fri, 24-Jul-87 18:12:11 EDT
Article-I.D.: astroatc.369
Posted: Fri Jul 24 18:12:11 1987
Date-Received: Sat, 25-Jul-87 18:21:40 EDT
References: <2378@ames.arpa> <687@elmgate.UUCP> <2806@phri.UUCP>
Reply-To: johnw@astroatc.UUCP (John F. Wardale)
Organization: Astronautics Technology Cntr, Madison, WI
Lines: 69
Keywords: vector Cray Cyber CDC Cpu Supercomputers
Summary: vector lengths

In article <2806@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <687@elmgate.UUCP> jdg@aurora.UUCP (Jeff Gortatowsky) writes:
>wants to know what "vector" means in the context of "vector processors"
>
>	for i goes from 1 to upper-limit-of-x,y,z
>	do
>		z[i] = x[i] * y[i]
>	end
>
>	The problem is that the cpu wastes a lot of time doing the donkey
>work of executing the loop, (increment the index and check for upper
>limit), computing the addresses for the array references, fetching and
>decoding the multiply instruction opcode, etc, and only after all that does
>it get to do the "real" work of doing the floating-point multiply.  On a
>vector processor, you would have a single instruction to do the whole loop.

Generally, there is a limit on the "vector-length" (64 for the Crays),
so the compiler will break the loop into

	for i goes from 1 to max by 64
		z[i..i+63] = x[i..i+63] * y[i..i+63]

(with a special set-length and multiply for the last group of
max mod 64 elements).

On a scalar processor, this code need not be as grim as Roy would
lead you to think.  The loop will be in an I-cache, and the CPU
could have auto-increment addressing modes.

There are 3 common limits:
  * issue limited: An instruction is issued each clock.  The memory and
    functional units keep up.  The bottleneck is in the pipeline
    (decoding etc.)
    This is an argument for vectors, and for RISC.
  * memory limited: Memory bandwidth is saturated.  To improve this
    you'll (probably) have to change the processor bus (or take some
    other drastic measure).  Changing the program may help but is
    labeled as "cheating."
  * compute limited: The functional units are busy, so "issues" must
    wait.  (Common with vectors; very rare without vectors.)

The real question is: can the scalar loop fetch and store data
(into and out of memory) faster than the multiplier can multiply?
This is an obvious requirement for vector instructions to be
practical.

[Side question:  Are there any "micros" that have (or could benefit
from) vectors, or are their memory interfaces too low-bandwidth for
this?]

>
>	Further, if you look carefully at a floating multiply operation,
>you see it takes a dozen or so atomic steps; multiply the mantissas, add
>the exponents, normalize the result, check for under/overflow, etc.  On a
>scalar machine these operations get done in series.  On a vector machine,
 ===================================================
>....[description of a pipelined multiplier]

Not necessarily so!  A scalar machine *CAN* have pipelined
functional units!  Another concern is segment time (how often you
can start a new operation: once a clock?  once in two clocks? ...)

			John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name:	John F. Wardale
UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:	astroatc!johnw@rsch.wisc.edu
snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:	608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!
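
P.S.  For anyone who wants the loop-splitting (strip-mining) described
above spelled out, here is a rough sketch in C.  It is only an
illustration, not what any particular compiler emits.  The names VLEN,
vmul_strip and strip_mined_multiply are made up for this example, and
vmul_strip is a plain scalar loop standing in for a single vector
multiply instruction whose length has been set to len.  Arrays are
0-based here, unlike the pseudocode.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 64  /* assumed maximum vector length (64, as on the Crays) */

/* Hypothetical stand-in for ONE vector multiply instruction operating
 * on len elements, where len has already been set to at most VLEN. */
static void vmul_strip(double *z, const double *x, const double *y,
                       size_t len)
{
    assert(len <= VLEN);
    for (size_t k = 0; k < len; k++)
        z[k] = x[k] * y[k];
}

/* Strip-mined z[i] = x[i] * y[i] for i in [0, n).
 * Full strips of VLEN elements are issued first; the final strip
 * handles the leftover n mod VLEN elements with a shorter length.
 * Returns the number of strips (vector operations) issued. */
size_t strip_mined_multiply(double *z, const double *x, const double *y,
                            size_t n)
{
    size_t strips = 0;
    for (size_t i = 0; i < n; i += VLEN) {
        size_t remaining = n - i;
        size_t len = (remaining < VLEN) ? remaining : VLEN;
        vmul_strip(z + i, x + i, y + i, len);
        strips++;
    }
    return strips;
}
```

With n = 130 this issues two full strips of 64 and one short strip of
2, so the loop overhead is paid 3 times instead of 130.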