Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!husc6!purdue!decwrl!granite!jmd
From: jmd@granite.dec.com (John Danskin)
Newsgroups: comp.arch
Subject: Memory latency / cacheing / scientific programs
Keywords: cache latency bus memory
Message-ID: <243@granite.dec.com>
Date: 21 Jun 88 22:03:23 GMT
Organization: DEC Technology Development, Palo Alto, CA
Lines: 69


I am interested in running a class of programs that process large
(bigger than cache but smaller than memory) arrays of data repeatedly.

The inner loop of the program is well behaved, so I can expect that
all of my instructions will fit in cache, and all of my intermediates will
fit in either registers or cache depending on how many registers there are.

The amount of work done per byte of data is such that the total throughput
required from the bus (on a well-balanced, very-high-performance system)
is not a limiting factor.

However, if I have to absorb the full miss latency on every cache line,
the performance of my program is halved.

I get 50% utilization of the CPU and 25% utilization of the bus.
If I could hide the latency, then I could get 100% of the CPU and 50% of
the bus.
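
To make those numbers concrete (the figures below are illustrative,
not measured: assume processing one cache line takes C cycles of
useful work, a miss stalls the CPU for another C cycles, and the bus
holds the line for C/2 cycles of that):

	cycles per line, latency exposed:  C compute + C stall = 2C
	CPU utilization:                   C / 2C     = 50%
	bus utilization:                   (C/2) / 2C = 25%

	cycles per line, latency hidden:   C
	CPU utilization:                   C / C      = 100%
	bus utilization:                   (C/2) / C  = 50%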

This problem seems to be common to all of the new RISC machines (and
all of our old vaxen).  A few manufacturers (Amdahl) have tried to fix
the problem by prefetching into the cache. Others hide memory latency
with pipelined vector instructions (Cray) or by exposing the memory
latency to the compiler (Cydrome).

Now vectors are expensive, and exposing the memory latency directly to
the compiler seems to be a very short-term solution (I like to think
that binary compatibility across two models of the same machine is a
reasonable goal). So I would like to look elsewhere for my mips/flops.

How has prefetching worked out? Alan Jay Smith seems to recommend
prefetching, provided it is implemented well. Has anybody been able to
do it?

How about multiple pending scoreboarded loads? The compiler emits the
load instruction as early as possible and only references the target
register when it is needed.  A trace scheduling compiler could space
loads out through a loop so that most/all of the bus latency is
hidden.  The compiler still has to know the memory latency for full
efficiency, but if it doesn't (the numbers change), the code still works.
This scheme is also register-intensive, in that several registers may be
tied up waiting for relatively distant events.  Besides, most references
hit in the cache (it's just that the misses are so much more important...).
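
As a sketch of the idea (illustrative C, not any shipping compiler's
output; sum_squares is an invented stand-in for my inner loop, and I
am assuming a machine that interlocks only when the target register
of a load is actually read):

	/* The load for iteration i+1 is issued before the work on
	 * iteration i, so the pipeline stalls at `cur' only if that
	 * load has not completed by the time `cur' is used. */
	double sum_squares(const double *a, int n)
	{
	    double sum = 0.0;
	    double next;
	    int i;

	    if (n <= 0)
	        return sum;
	    next = a[0];                  /* first load issued up front */
	    for (i = 0; i < n; i++) {
	        double cur = next;
	        if (i + 1 < n)
	            next = a[i + 1];      /* load spaced one iteration early */
	        sum += cur * cur;         /* stand-in for the real work */
	    }
	    return sum;
	}

A smarter scheduler could issue loads several iterations ahead, at
the cost of tying up that many more registers.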

Does anybody do this with scalar machines? Is it too hard?

Software caching: people talk about it, but who has done it?
If I could say
	load	Cache7 with Page 12
and compute using other cache lines (while the load occurred) until I said
	csynch	Cache7
things would work just fine. Too hard for compilers to do? Unimplementable?
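
To spell out what I have in mind, here is a double-buffered loop over
those two primitives (cache_load, cache_sync, page_base, process, and
npages are all names invented for the sketch):

	int p, cur;

	cache_load(0, 0);                   /* "load Cache0 with Page 0" */
	for (p = 0; p < npages; p++) {
	    cur = p & 1;                    /* which half holds page p */
	    if (p + 1 < npages)
	        cache_load(1 - cur, p + 1); /* start filling the other half */
	    cache_sync(cur);                /* "csynch" -- wait for page p */
	    process(page_base(cur));        /* compute while the fill runs */
	}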


Is my class of problem interesting? It is my understanding that many
large scientific programs have similar behavior, but that the standard
UNIX timesharing load (whatever that is) has significantly different behavior.

A lot of research seems to have gone into making UNIX run fast, which is a
laudable goal. But I don't want a workstation that runs UNIX fast. I want a
workstation that runs my application fast.


I am eagerly awaiting a full list of the errors in my thinking.
-- 
John Danskin				| decwrl!jmd
DEC Technology Development		| (415) 853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94306			| I do not speak for DEC.