Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!husc6!purdue!decwrl!granite!jmd
From: jmd@granite.dec.com (John Danskin)
Newsgroups: comp.arch
Subject: Memory latency / cacheing / scientific programs
Keywords: cache latency bus memory
Message-ID: <243@granite.dec.com>
Date: 21 Jun 88 22:03:23 GMT
Organization: DEC Technology Development, Palo Alto, CA
Lines: 69

I am interested in running a class of programs that process large
(bigger than cache but smaller than memory) arrays of data repeatedly.
The inner loop of the program is well behaved, so I can expect that all
of my instructions will fit in cache, and all of my intermediates will
fit in either registers or cache, depending on how many registers there
are.

The amount of work done per byte of data is such that the total
throughput required from the bus (on a well balanced / very high
performance system) is not a limiting factor. However, if I have to
accept the latency of a cache miss on every line, the performance of my
program is halved: the compute time per line is about equal to the miss
latency, so I get 50% utilization of the CPU and 25% utilization of the
bus. If I could hide the latency, the same work would finish in half
the time, and I could get 100% of the CPU and 50% of the bus.

This problem seems to be common to all of the new RISC machines (and
all of our old vaxen). A few manufacturers (Amdahl) have tried to fix
the problem by prefetching into the cache. Others hide memory latency
with pipelined vector instructions (Cray) or by exposing the memory
latency to the compiler (Cydrome).

Now, vectors are expensive, and exposing the memory latency directly to
the compiler seems to be a very short-term solution (I like to think
that binary compatibility across two models of the same machine is a
reasonable goal). So I would like to look elsewhere for my mips/flops.

How has prefetching worked out? Alan Jay Smith seems to recommend
prefetching, provided that it is implemented well. Has anybody been
able to do it?

How about multiple pending scoreboarded loads? The compiler emits the
load instruction as early as possible and only references the target
register when it is needed. A trace-scheduling compiler could space
loads out through a loop so that most or all of the bus latency is
hidden. The compiler still has to know what the memory latency is for
full efficiency, but if it doesn't (the numbers change), the code still
works. This scheme is also register-intensive, in that several
registers may be waiting for relatively distant events. Besides, most
references are to the cache (it's just that the misses are so much more
important...). Does anybody do this with scalar machines? Is it too
hard? (A sketch of what I mean follows my signature.)

Software caching: people talk about it, but who has done it? If I could
say "load Cache7 with Page 12" and compute using other cache lines
(while the load occurred) until I said "csynch Cache7", things would
work just fine. Too hard for compilers to do? Unimplementable? (Again,
see the second sketch below.)

Is my class of problem interesting? It is my understanding that many
large scientific programs have similar behavior, but that the standard
UNIX timesharing load (whatever that is) has significantly different
behavior. A lot of research seems to have gone into making UNIX run
fast, which is a laudable goal. But I don't want a workstation that
runs UNIX fast. I want a workstation that runs my application fast.

I am eagerly awaiting a full list of the errors in my thinking.
-- 
John Danskin                   | decwrl!jmd
DEC Technology Development     | (415) 853-6724
100 Hamilton Avenue            | My comments are my own.
Palo Alto, CA 94306            | I do not speak for DEC.
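
P.S. Here is a minimal sketch, in C at the source level, of the kind of
load scheduling I mean. The names and the one-iteration distance are
made up; on a machine with scoreboarded (non-blocking) loads a compiler
would do this with the load instructions themselves, spacing them as
far ahead of their uses as the memory latency requires.

    /* Naive inner loop: when a[i] or b[i] misses the cache, the
     * multiply stalls for the full memory latency.
     */
    double dot(double *a, double *b, int n)
    {
        int i;
        double sum = 0.0;

        for (i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /* Same loop with the loads pulled one iteration ahead.  The
     * temporaries stand in for the registers that sit waiting for
     * relatively distant loads; a miss now overlaps the multiply/add
     * instead of stalling it.
     */
    double dot_early_loads(double *a, double *b, int n)
    {
        int i;
        double sum = 0.0;
        double ai, bi, a_next, b_next;

        if (n <= 0)
            return sum;
        ai = a[0];                   /* loads for iteration 0       */
        bi = b[0];
        for (i = 0; i < n - 1; i++) {
            a_next = a[i + 1];       /* issue next loads early...   */
            b_next = b[i + 1];
            sum += ai * bi;          /* ...overlapped with the work */
            ai = a_next;
            bi = b_next;
        }
        sum += ai * bi;              /* last iteration              */
        return sum;
    }

The cost is exactly the one mentioned above: extra registers per data
stream, and the right load-to-use distance depends on the latency you
are trying to hide.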
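
P.P.S. And the software caching idea, said as (imaginary) C.
cache_load() and cache_sync() are made-up names for the "load Cache7
with Page 12" and "csynch Cache7" operations above; nothing like them
exists on the machines I know of. work_on_page() stands for the real
inner loop.

    #define PAGE_BYTES   4096
    #define PAGE_DOUBLES (PAGE_BYTES / sizeof(double))

    void cache_load(int bank, double *addr, int nbytes); /* start a fill, don't wait   */
    void cache_sync(int bank);                           /* wait until that fill lands */
    void work_on_page(double *page);                     /* the real computation       */

    /* Walk a large array one page at a time, double-buffered across two
     * cache banks: while the CPU works on the page already sitting in
     * one bank, the fill of the next page proceeds into the other bank.
     */
    void process_all(double *data, int npages)
    {
        int p, bank = 0;

        cache_load(bank, data, PAGE_BYTES);
        for (p = 0; p < npages; p++) {
            if (p + 1 < npages)      /* start fetching page p+1 into the other bank */
                cache_load(1 - bank, data + (p + 1) * PAGE_DOUBLES, PAGE_BYTES);
            cache_sync(bank);        /* "csynch CacheN": wait for this bank's page   */
            work_on_page(data + p * PAGE_DOUBLES);  /* compute while the other fills */
            bank = 1 - bank;
        }
    }

Whether a compiler could be trusted to generate this pattern, or
whether the programmer would have to write it by hand, is exactly the
question.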