Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!husc6!yale!mfci!colwell
From: colwell@mfci.UUCP (Robert Colwell)
Newsgroups: comp.arch
Subject: Re: Memory latency / cacheing / scientific programs
Keywords: cache latency bus memory
Message-ID: <443@m3.mfci.UUCP>
Date: 22 Jun 88 13:07:42 GMT
References: <243@granite.dec.com>
Sender: root@mfci.UUCP
Reply-To: colwell@mfci.UUCP (Robert Colwell)
Organization: Multiflow Computer Inc., Branford Ct. 06405
Lines: 43

In article <243@granite.dec.com> jmd@granite.dec.com (John Danskin) writes:
>
>I am interested in running a class of programs that process large
>(bigger than cache but smaller than memory) arrays of data repeatedly.
>
>How about multiple pending scoreboarded loads? The compiler emits the
>load instruction as early as possible and only references the target
>register when it is needed. A trace scheduling compiler could space
>loads out through a loop so that most/all of the bus latency is
>hidden. The compiler still has to know what memory latency is for full
>efficiency, but if it doesn't (numbers change) the code still works.
>This scheme is also register intensive in that several registers may be
>waiting for relatively distant events. Besides, most references are to
>the cache (it's just that the misses are so much more important...).
>
>Does anybody do this with scalar machines? Is it too hard?

I consider our VLIW a scalar machine, and our compiler does what you
said. I'm not sure what you meant by "if the compiler doesn't know the
memory latency, the code still works," though. If the latency becomes
shorter, it'll still work, but if it gets longer, the code stops
working (the compiler will try to use the operand that it assumes has
by then made it into the assigned register, before the data actually
arrives).
One could, I suppose, put in the scoreboarding logic that would make
the code keep working even when the memory pipes are longer, by
watching which registers are targeted, and making sure that they've
been loaded before they are next used.

That could be a lot of hardware in a VLIW like ours, though; any
cluster (of which there are 1, 2, or 4 in the TRACE) can do a load and
target some other cluster's registers. Since the CPU is implemented as
4 separate cluster-pairs, I think the cross-instruction-stream
watching hardware would be pretty difficult. Not to mention that you'd
have to change the register files, which are already pretty crammed
with gates in order to achieve the 4-read/4-write ports per
instruction that we need. A high performance price to pay to
accommodate potentially slower memories without re-compiling.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090