Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!lll-lcc!pyramid!prls!mips!mash
From: mash@mips.UUCP
Newsgroups: comp.arch
Subject: Re: Horizontal Pipelining -- a pair
Message-ID: <1007@winchester.UUCP>
Date: Mon, 30-Nov-87 03:05:38 EST
Article-I.D.: winchest.1007
Posted: Mon Nov 30 03:05:38 1987
Date-Received: Wed, 2-Dec-87 03:55:26 EST
References: <391@sdcjove.CAM.UNISYS.COM> <28200073@ccvaxa>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 84

In article <28200073@ccvaxa> aglew@ccvaxa.UUCP writes:
>..> John Mashey talking about barrel architectures

>>Assume we're using split I & D caches. Assume that the cache line
>>is N words long, filled 1 word/cycle after a latency of L cycles.
>>One would expect that efficient cache designs have L <= N.
>I expect this is quite basic, but how do you show this for barrel>..

This wasn't intended as a proof or design, just background for a
first-order back-of-the-envelope analysis.

>>When filling an I-cache miss, you can do L more barrel slots,
>>then you must stall for N slots (or equivalent), because it doesn't
>>make sense to have the I-cache run faster than the chip (if it did,
>>you would run the chip faster). 
>Why not have the I-cache run faster than the chip? I-caches are more regular
>structures than the cpu, and are probably that much easier to make run
>faster.
The original assumption, perhaps not stated strongly enough, is that
the I-cache was built from ordinary SRAMS (to ride the SRAM cost curve).
(There are other ways to do this, but this is The Right Way :-) see below.)

>    Also, why stall for N slots? There are several schemes to deliver data
>from a partially filled cache line as soon as it is available. 
There are, of course, all sorts of schemes.  All I meant was that if you want
to fill N words of I-cache, 1/cycle, you will chew up N cycles where either:
	a) No process will be making progress, because all of the bandwidth
	will be eaten up.
	OR
	b) At best, the process that caused the I-cache miss gets to
	make progress for up to N cycles (but this looks suspiciously like a
	single pipelined processor).
However you cut it, given the original assumptions, I think you end up
eating N cycles of I-cache bandwidth.
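To put rough numbers on that, here's a toy C fragment (N and the miss
rates are picked out of the air, purely to show the shape of the curve,
under blocking case (a) above):

	#include <stdio.h>

	/*
	 * Toy model of case (a): each I-cache miss fills an N-word line
	 * at 1 word/cycle and eats N barrel slots in which nobody issues.
	 * The L latency slots overlap useful work, so the net cost is N
	 * slots per miss.  Miss rates below are invented, not measured.
	 */
	int main(void)
	{
		int n = 8;	/* N: words per I-cache line */
		int i;

		for (i = 1; i <= 10; i++) {
			double m   = i / 100.0;   /* I-misses per issued instruction */
			double spi = 1.0 + m * n; /* barrel slots per instruction */
			printf("m = %.2f  slots/instr = %.2f  utilization = %2.0f%%\n",
			       m, spi, 100.0 / spi);
		}
		return 0;
	}

E.g., at m = 0.02 and N = 8, roughly 1 slot in 7 goes to cache filling,
which is the bandwidth I was talking about.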
>    Finally, why not continue on a completely separate thread while
>the cache is filling for the thread that caused the cache miss?
>Barrel need not imply round robin.
Again, the assumption was that each access to write a word into the cache
blocked either all threads, or at least all threads except the one that
caused the fill.

re: The Right Way (with SRAMs)
I'm no expert in this either, but there are a lot of world-class experts
here, and maybe one of them will correct me if the following summary of
what they've said is too wrong:

1) CMOS VLSI CPU and SRAM technologies are very closely coupled:
	a) SRAMs are what VLSI folks often use to debug new processes.
	b) SRAMs are usually a VLSI-process generation ahead of CPUs.

2) One approach is to drive the CPU design entirely by your model of
projected SRAM performance curves (sort of like surfboarding):
	a) Aim a CPU design to come out about when the SRAM vendors have
	debugged the process technology you want to use.  The first CPUs
	work with some expensive SRAMs obtained in small quantities.
	b) By the time the CPU is yielding reasonably, the SRAMs you need
	are getting to be reasonably available.
	c) By the time you want to ship lots, the SRAM prices have started
	dropping reasonably.
MEANWHILE:
	d) The SRAM guys are working on their next generation and
	e) you're working on your next generation.  Goto a)

Note: although the CPU is more irregular:
- Some critical pieces, like register files, are pretty similar to SRAMs.
- The CPU is all there on one chip, so it pays no chip-crossing
time penalties.  The round trip thru the external SRAMs costs you those
delays, hence, in some sense, the SRAMs need to be faster just to stay
even with the CPU.
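To make the penalty concrete with invented numbers: if the CPU cycle is
40ns and each chip crossing costs 5ns, an external SRAM has only

	40ns - (2 * 5ns) = 30ns

left for its access, i.e. the SRAM access must fit in 75% of the CPU
cycle just to break even.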

This strategy assumes that one can track the SRAM performance curves,
and generally get better cost/performance by using standard SRAMs,
whose cost rapidly drops, and whose technology is constantly being
pushed by multiple vendors.  It also assumes that the highest-performance
systems of the future will want large caches.  Of course, this
strategy remains to be verified, although the early evidence seems in
favor of it.  Of delivered RISC machines, the ones that use standard
SRAMs {MIPS, SPARC} outperform those that have special-purpose
cache-mmu chips {Clipper}.  The AMD29000 (no special RAM chip designs)
and Motorola 78000 {cache/mmu chips} will add a few more data points.
-- 
-john mashey	DISCLAIMER: 
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086