Path: utzoo!utgpu!water!watmath!clyde!att!rutgers!mit-eddie!husc6!purdue!decwrl!granite!jmd
From: jmd@granite.dec.com (John Danskin)
Newsgroups: comp.arch
Subject: Re: Memory latency / cacheing / scientific programs
Keywords: cache latency bus memory
Message-ID: <244@granite.dec.com>
Date: 27 Jun 88 17:56:22 GMT
References: <243@granite.dec.com> <443@m3.mfci.UUCP> <448@m3.mfci.UUCP>
Reply-To: jmd@granite.UUCP (John Danskin)
Organization: DEC Technology Development, Palo Alto, CA
Lines: 61

In article <448@m3.mfci.UUCP> you write:
>In article <443@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>|In article <243@granite.dec.com> jmd@granite.dec.com (John Danskin) writes:
>|> ... A trace scheduling compiler could space
>|>loads out through a loop so that most/all of the bus latency is
>|>hidden. The compiler still has to know what memory latency is for full
>|>efficiency, but if it doesn't (numbers change) the code still works.
>|I consider our VLIW a scalar machine, and our compiler does what you
>|said. I'm not sure what you meant by "if compiler doesn't know
>|memory latency, the code still works" though. If the latency becomes
>|shorter, it'll still work, ...
>| ... Not to mention that you'd have to change
>|the register files, which are already pretty crammed with gates in
>|order to achieve the 4-read/4-write ports per instr that we need.
>
>Even if the memory latency decreases, the code will *only* work if you
>don't run into register write port and/or other resource conflicts.
>If resource constraints are maintained by the compiler, then you are
>likely :-) to have to recompile your programs when these constraints
>change. If your original code still runs, though, then it wasn't very
>compact in the first place.
>
>Stefan Freudenberger   mfci!freuden@uunet.uucp

Somewhere up in my original posting I said 'scoreboarding'. With
scoreboarding, the code still runs. It may not run fast anymore, but
that is another problem.

You tell the compiler what the latencies are so that it can produce
good code. You implement scoreboards so that binaries port from your
mark 1 machine to your mark 2 machine (they don't run as fast as they
should, but they probably run a little faster).

The real problem seems to be that scoreboards are expensive.

You guys may be on the right track (blow off compatibility, make the
machine soooo fast and relatively cheap that people will use it
despite the headache). Tradeoffs are shifting in that direction, but I
still see a lot of value in binary compatibility, at least for a few
models of a design.

Have you guys thought about keeping an intermediate language copy of
each executable IN the executable, along with the 'cached' binary?
Have the loader check the binary to see if it has the right tag for
the current machine: if it does, run the code; if it doesn't,
regenerate code from the intermediate-language spec and then run it.
You would want to provide a utility for executable conversion, since
regenerating at load time is not a real performance answer. If the
intermediate language was sophisticated enough you might even be able
to do the code generation reasonably quickly...

You would pay a disk space penalty now, but avoid being bitten by the
inevitable backwards compatibility problems that seem so irrelevant
when you build your first machine...
-- 
John Danskin                    | decwrl!jmd
DEC Technology Development      | (415) 853-6724
100 Hamilton Avenue             | My comments are my own.
Palo Alto, CA 94306             | I do not speak for DEC.
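
[Editor's appendix: a minimal sketch of the loader check described in the
posting above, added for illustration only. Every name here (xqt_header,
MACHINE_TAG, regenerate_from_il, run_native) is invented for the example;
none of it comes from the original article or from any real loader.]

/* Sketch: an executable carries both an intermediate-language (IL) copy
 * and a 'cached' native binary.  The loader runs the cached code when the
 * machine tag matches, and regenerates it from the IL when it doesn't. */
#include <stdio.h>
#include <string.h>

#define MACHINE_TAG "mark2"           /* tag of the machine we run on (assumed) */

struct xqt_header {
    char tag[8];                      /* machine the cached binary was built for */
    long il_offset, il_size;          /* where the IL copy lives in the file     */
    long native_offset, native_size;  /* where the cached native code lives      */
};

/* Stand-ins for the interesting parts: the compiler back end that turns
 * the IL section into native code, and the jump into native code. */
static int regenerate_from_il(FILE *f, struct xqt_header *h)
{
    (void)f;
    printf("regenerating %ld bytes of IL for this machine...\n", h->il_size);
    return 0;                         /* would also update native_offset/size */
}

static void run_native(FILE *f, struct xqt_header *h)
{
    (void)f;
    printf("running %ld bytes of cached native code\n", h->native_size);
}

int load_and_run(const char *path)
{
    FILE *f = fopen(path, "r+b");
    struct xqt_header h;

    if (f == NULL || fread(&h, sizeof h, 1, f) != 1) {
        fprintf(stderr, "%s: cannot read executable header\n", path);
        return -1;
    }

    if (strncmp(h.tag, MACHINE_TAG, sizeof h.tag) != 0) {
        /* Wrong machine: rebuild the cached binary from the IL copy,
         * stamp the new tag into the header, then fall through and run.
         * A separate conversion utility could do the same thing off-line. */
        if (regenerate_from_il(f, &h) != 0)
            return -1;
        strncpy(h.tag, MACHINE_TAG, sizeof h.tag);
        fseek(f, 0L, SEEK_SET);
        fwrite(&h, sizeof h, 1, f);
    }

    run_native(f, &h);                /* tag matches: just run the cached code */
    fclose(f);
    return 0;
}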