Path: utzoo!attcan!uunet!husc6!rutgers!iuvax!pur-ee!hankd
From: hankd@pur-ee.UUCP (Hank Dietz)
Newsgroups: comp.arch
Subject: Re: Memory latency / cacheing / scientific programs
Summary: A comment & more info on Regs != Cache
Message-ID: <8444@pur-ee.UUCP>
Date: 4 Jul 88 20:24:25 GMT
References: <243@granite.dec.com> <779@garth.UUCP> <2033@pt.cs.cmu.edu> <11106@ames.arc.nasa.gov>
Organization: Purdue University Engineering Computer Network
Lines: 63

In article <11106@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> In article <8429@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
> >Registers are not a valid substitute for cache:  they are fundamentally more
> >restricted in function (although they are efficiently managed at
> >compile-time).  For example, both a[i] and a[j] can be kept in cache ad
> >infinitum, however, if we (the compiler) don't know if i==j, we can't put
> >either one in a register without having to flush both from registers every
> >time either one is stored into.  It's the classic ambiguous alias problem.
> 
> Only scalars are allocated in registers using these compilers so there
> isn't an aliasing problem.  What these compilers do, and what
> the architecture supports, is the use of registers for local scalars,
> and the use of memory for everything else: arrays and global variables
> of all kinds.  While this is patently not the best arrangement for scalar
> oriented C, it works very well for Fortran because:

Actually, it "works" because FORTRAN doesn't have pointers -- scalars in C
are just like array references alias-wise because any pointer could point at
just about any datum (and nobody knows which :-), scalars included.
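
A minimal sketch of the scalar case in C. The names `g` and
`read_store_read` are hypothetical, chosen only for illustration. Unless the
compiler can prove that `p` never points at `g`, it cannot hold `g` in a
register across the store through `p`:

```c
#include <assert.h>

int g = 1;   /* hypothetical global scalar */

/* Reads g, stores through p, then reads g again.
   If p might point at g, the second read must come from memory,
   so g cannot simply stay in a register across the store. */
int read_store_read(int *p)
{
    assert(p != 0);
    int before = g;   /* first read of the scalar */
    *p = 10;          /* may or may not overwrite g */
    int after = g;    /* must be reloaded: p could alias g */
    return before + after;
}
```

With `g` set to 1, calling `read_store_read` on an unrelated int returns 2,
while `read_store_read(&g)` returns 11. The compiler has to generate code
that is correct for both calls.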

> >...  As for the number of registers, we've
> >recently found that a perfect (or just really good) global register
> >allocator should rarely want more than about 10 registers per processor --
> 
> I have seen the number "32" bandied about previously as the ideal number
> of registers for C.  10 looks pretty small to me....
> What sort of programs is "10" based on?

We used the standard MIPS benchmarks, which are admittedly small programs,
but they are the same ones "everybody" uses (so I guess we did cheat :-).

We also constructed random interference graphs and obtained very similar
results even for graph-coloring register allocation using very large graphs
which are very strongly connected (up to thousands of nodes, with each node
connected to as many as 90% of all other nodes) -- remember that all planar
graphs can be 4-colored, and the complexity of graphs that can be N-colored
grows MUCH (incredible understatement here :-) faster than N.  By the way,
although fewer-than-10 colorings nearly always existed and our algorithm
usually found them, the standard node-removal technique (see Chaitin,
"Register Allocation and Spilling via Graph Coloring," SIGPLAN Symp. on
Compiler Construction, June 1982, pp.  201-207) often does not find them.
The technique we used was a modified graph walk -- details and sample C code
available on request.
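
For reference, here is a rough sketch of the basic node-removal idea in C.
It is NOT the modified graph walk mentioned above, and it is only a
simplified reading of the cited technique with no spill handling. The names
`MAX_NODES`, `add_edge`, and `color_by_removal` are hypothetical, and the
adjacency matrix is assumed to be symmetric:

```c
#include <assert.h>

#define MAX_NODES 64

/* Record that values a and b are live at the same time. */
void add_edge(unsigned char adj[MAX_NODES][MAX_NODES], int a, int b)
{
    adj[a][b] = 1;
    adj[b][a] = 1;
}

/* Try to color an n-node interference graph with k colors by node removal.
   Returns 1 and fills color[0..n-1] with values in [0,k) on success.
   Returns 0, leaving color[] untouched, if the heuristic gets stuck
   (a real allocator would pick something to spill at that point). */
int color_by_removal(int n, unsigned char adj[MAX_NODES][MAX_NODES],
                     int k, int color[MAX_NODES])
{
    int degree[MAX_NODES], removed[MAX_NODES], stack[MAX_NODES];
    int top = 0;

    assert(n >= 0 && n <= MAX_NODES && k > 0);

    for (int i = 0; i < n; i++) {
        removed[i] = 0;
        degree[i] = 0;
        for (int j = 0; j < n; j++)
            if (j != i && adj[i][j]) degree[i]++;
    }

    /* Simplify: repeatedly remove a node with fewer than k neighbors left. */
    while (top < n) {
        int pick = -1;
        for (int i = 0; i < n; i++)
            if (!removed[i] && degree[i] < k) { pick = i; break; }
        if (pick < 0) return 0;   /* every remaining node has >= k neighbors */
        removed[pick] = 1;
        stack[top++] = pick;
        for (int j = 0; j < n; j++)
            if (j != pick && adj[pick][j] && !removed[j]) degree[j]--;
    }

    /* Select: put nodes back in reverse order, lowest free color first. */
    for (int i = 0; i < n; i++) color[i] = -1;
    while (top > 0) {
        int v = stack[--top];
        int c;
        for (c = 0; c < k; c++) {
            int clash = 0;
            for (int j = 0; j < n; j++)
                if (j != v && adj[v][j] && color[j] == c) { clash = 1; break; }
            if (!clash) break;
        }
        color[v] = c;   /* c < k: v had fewer than k neighbors when removed */
    }
    return 1;
}
```

A 4-node cycle (edges 0-1, 1-2, 2-3, 3-0) shows how this kind of heuristic
can miss a coloring that exists. The cycle is 2-colorable, but every node
has exactly 2 neighbors, so with k = 2 nothing can be removed and the
function returns 0. With k = 3 it succeeds.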

> Again, the assumption that a load/store costs about the same as another
> instruction is not true on a fast pipelined machine with no cache.
> If a single load takes a lot of cycles, you need more registers.

For the most part, no assumption was needed about load/store cost versus
anything -- each datum was in a register while it was live (i.e., while its
value might be referenced again) and not involved with an alias.  Further
note that we didn't put "variables" in registers; we put "live values" in
registers and that makes quite a difference.
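
One way to picture the "variables" versus "live values" difference is the
toy C sketch below. Everything in it is an assumption made for illustration:
a straight-line instruction sequence numbered 0..6, the `LiveRange` type, the
`max_live` helper, and the rule that a value occupies a register from its
definition up to (but not including) its last use. The variable t holds two
unrelated values, one defined at 0 and last used at 1, the other defined at
5 and last used at 6:

```c
#include <assert.h>

/* A value occupies a register over the half-open span [def, last_use). */
typedef struct { int def; int last_use; } LiveRange;

/* Largest number of ranges that are live at the same time.
   The maximum is always reached at some definition point, so only
   those points are checked. */
int max_live(const LiveRange *r, int n)
{
    int best = 0;
    assert(r != 0 || n == 0);
    for (int i = 0; i < n; i++) {
        int live = 0;
        for (int j = 0; j < n; j++)
            if (r[j].def <= r[i].def && r[i].def < r[j].last_use) live++;
        if (live > best) best = live;
    }
    return best;
}

/* One range per variable: t is pinned from its first def to its last use. */
static const LiveRange by_variable[] = {
    {0, 6},   /* t */
    {2, 4},   /* a */
    {3, 4},   /* b */
    {4, 6},   /* c */
    {6, 7},   /* d */
};

/* One range per live value: t splits into two short ranges. */
static const LiveRange by_value[] = {
    {0, 1},   /* first value held in t */
    {2, 4},   /* a */
    {3, 4},   /* b */
    {4, 6},   /* c */
    {5, 6},   /* second value held in t */
    {6, 7},   /* d */
};
```

Under these assumptions `max_live(by_variable, 5)` is 3, while
`max_live(by_value, 6)` is 2: the register is free while t holds nothing
that will be referenced again.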

Of course, your mileage may vary...  but 16 registers seems to be plenty.

     __         /|
  _ |  |  __   / |  Compiler-oriented
 /  |--| |  | |  |  Architecture
/   |  | |__| |_/   Researcher from
\__ |  | | \  |     Purdue
    \    |  \  \
	 \      \   Prof. Hank Dietz, (317) 494 3357