Path: utzoo!attcan!uunet!husc6!rutgers!iuvax!pur-ee!hankd
From: hankd@pur-ee.UUCP (Hank Dietz)
Newsgroups: comp.arch
Subject: Re: Memory latency / cacheing / scientific programs
Summary: A comment & more info on Regs != Cache
Message-ID: <8444@pur-ee.UUCP>
Date: 4 Jul 88 20:24:25 GMT
References: <243@granite.dec.com> <779@garth.UUCP> <2033@pt.cs.cmu.edu> <11106@ames.arc.nasa.gov>
Organization: Purdue University Engineering Computer Network
Lines: 63

In article <11106@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> In article <8429@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
> >Registers are not a valid substitute for cache: they are fundamentally more
> >restricted in function (although they are efficiently managed at
> >compile-time).  For example, both a[i] and a[j] can be kept in cache ad
> >infinitum, however, if we (the compiler) don't know if i==j, we can't put
> >either one in a register without having to flush both from registers every
> >time either one is stored into.  It's the classic ambiguous alias problem.
>
> Only scalars are allocated in registers using these compilers so there
> isn't an aliasing problem.  What these compilers do, and what
> the architecture supports, is the use of registers for local scalars,
> and the use of memory for everything else: arrays and global variables
> of all kinds.  While this is patently not the best arrangement for scalar
> oriented C, it works very well for Fortran because:

Actually, it "works" because FORTRAN doesn't have pointers -- scalars in C
are just like array references alias-wise because any pointer could point at
just about any datum (and nobody knows which :-), scalars included.

> >... As for the number of registers, we've
> >recently found that a perfect (or just really good) global register
> >allocator should rarely want more than about 10 registers per processor --
>
> I have seen the number "32" bandied about previously as the ideal number
> of registers for C.
> 10 looks pretty small to me....
> What sort of programs is "10" based on?

We used the standard MIPS benchmarks, which are admittedly small programs,
but they are the same ones "everybody" uses (so I guess we did cheat :-).
We also constructed random interference graphs and obtained very similar
results even for graph-coloring register allocation using very large graphs
which are very strongly connected (up to thousands of nodes, with each node
connected to as many as 90% of all other nodes) -- remember that all planar
graphs can be 4-colored and the graph complexity which can be N-colored
grows MUCH (incredible understatement here :-) faster than N.

By the way, although fewer-than-10 colorings nearly always existed and our
algorithm usually found them, the standard node-removal technique (see
Chaitin, "Register Allocation and Spilling via Graph Coloring," SIGPLAN
Symp. on Compiler Construction, June 1982, pp. 201-207) often does not find
them.  The technique we used was a modified graph walk -- details and sample
C code available on request.

> Again, the assumption that a load/store costs about the same as another
> instruction is not true on a fast pipelined machine with no cache.
> If a single load takes a lot of cycles, you need more registers.

For the most part, no assumption was needed about load/store cost versus
anything -- each datum was in a register while it was live (i.e., while its
value might be referenced again) and not involved with an alias.  Further
note that we didn't put "variables" in registers; we put "live values" in
registers and that makes quite a difference.  Of course, your mileage may
vary... but 16 registers seems to be plenty.

Compiler-oriented Architecture Researcher from Purdue
Prof. Hank Dietz, (317) 494 3357
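
To make the node-removal remark concrete, here is a minimal Python sketch
(not the modified graph walk mentioned above, whose details are not given in
this post).  It assumes an interference graph stored as a dict of neighbour
sets; the names chaitin_color, exact_color and ring are hypothetical, and
the exhaustive search is only a stand-in for "some allocator that does find
the coloring".  The example graph is a 4-node ring: it can be colored with 2
registers, yet node removal with k=2 gets stuck because every node has
exactly 2 neighbours.

```python
def chaitin_color(graph, k):
    """Node-removal (simplify/select) coloring in the style of Chaitin.

    graph maps each live value to the set of values it interferes with
    (assumed symmetric, no self-loops).  Returns {value: register} or
    None when no node with fewer than k neighbours is left, which is
    where a real allocator would pick something to spill.
    """
    remaining = {n: set(adj) for n, adj in graph.items()}
    stack = []
    while remaining:
        # Any node with fewer than k remaining neighbours is trivially colorable.
        node = next((n for n in sorted(remaining) if len(remaining[n]) < k), None)
        if node is None:
            return None
        for m in list(remaining[node]):
            remaining[m].discard(node)
        del remaining[node]
        stack.append(node)
    colors = {}
    while stack:
        # Re-insert nodes in reverse order, taking the lowest free register.
        node = stack.pop()
        used = {colors[m] for m in graph[node] if m in colors}
        colors[node] = next(c for c in range(k) if c not in used)
    return colors


def exact_color(graph, k):
    """Exhaustive backtracking search; exponential, for tiny graphs only."""
    nodes = sorted(graph)
    colors = {}

    def assign(i):
        if i == len(nodes):
            return True
        node = nodes[i]
        for c in range(k):
            if all(colors.get(m) != c for m in graph[node]):
                colors[node] = c
                if assign(i + 1):
                    return True
                del colors[node]
        return False

    return dict(colors) if assign(0) else None


# Hypothetical interference graph: a ring a-b-c-d-a.
ring = {
    "a": {"b", "d"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"a", "c"},
}

if __name__ == "__main__":
    print("node removal, k=2:", chaitin_color(ring, 2))
    print("exhaustive,   k=2:", exact_color(ring, 2))
    print("node removal, k=3:", chaitin_color(ring, 3))
```

With k=2 the node-removal pass returns None while the exhaustive search
still finds a valid assignment; with k=3 node removal succeeds.  This says
nothing about how often the gap shows up on real programs; it only shows
that the gap can exist.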