Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!husc6!yale!mfci!colwell
From: colwell@mfci.UUCP (Robert Colwell)
Newsgroups: comp.arch
Subject: Re: Memory latency / cacheing / scientific programs
Keywords: cache latency bus memory
Message-ID: <443@m3.mfci.UUCP>
Date: 22 Jun 88 13:07:42 GMT
References: <243@granite.dec.com>
Sender: root@mfci.UUCP
Reply-To: colwell@mfci.UUCP (Robert Colwell)
Organization: Multiflow Computer Inc., Branford Ct. 06405
Lines: 43

In article <243@granite.dec.com> jmd@granite.dec.com (John Danskin) writes:
>
>I am interested in running a class of programs that process large
>(bigger than cache but smaller than memory) arrays of data repeatedly.
>
>How about multiple pending scoreboarded loads? The compiler emits the
>load instruction as early as possible and only references the target
>register when it is needed.  A trace scheduling compiler could space
>loads out through a loop so that most/all of the bus latency is
>hidden.  The compiler still has to know what memory latency is for full
>efficiency, but if it doesn't (numbers change) the code still works.
>This scheme is also register intensive in that several registers may be
>waiting for relatively distant events.  Besides, most references are to
>the cache (it's just that the misses are so much more important...).
>
>Does anybody do this with scalar machines? Is it too hard?

I consider our VLIW a scalar machine, and our compiler does what you
said.  I'm not sure what you meant by "if compiler doesn't know
memory latency, the code still works" though.  If the latency becomes
shorter, it'll still work, but if it gets longer, the code stops
working (the compiled code reads the operand at the cycle where the
compiler assumed it had made it into the assigned register, which is
before the data actually arrives).

One could, I suppose, put in the scoreboarding logic that would make
the code keep working even when the memory pipes are longer, by
watching which registers are targeted, and making sure that they've
been loaded before they are next used.  That could be a lot of
hardware in a VLIW like ours, though; any cluster (of which there are
1, 2, or 4 in the TRACE) can do a load and target some other
cluster's registers.  Since the CPU is implemented as 4 separate
cluster-pairs, I think the cross-instruction-stream watching hardware
would be pretty difficult.  Not to mention that you'd have to change
the register files, which are already pretty crammed with gates in
order to achieve the 4-read/4-write ports per instr that we need.
A high performance price to pay to accommodate potentially slower
memories without re-compiling.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090