Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!iuvax!pur-ee!hankd
From: hankd@pur-ee.UUCP (Hank Dietz)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Summary: Power dissipation & the concept of MAST
Message-ID: <8479@pur-ee.UUCP>
Date: 8 Jul 88 18:52:52 GMT
References: <1362@oakhill.UUCP> <11474@steinmetz.ge.com>
Organization: Purdue University Engineering Computer Network
Lines: 194

A while back, my brother and I sent the following abstract to ICCD88... we never heard anything as to how they liked it, but it seems a very reasonable thing to post here, given the recent discussion of power dissipation....

Limiting Switching Transients in High-Speed Digital Processors

Henry G. Dietz              Paul H. Dietz
hankd@ee.ecn.purdue.edu     phd@speech1.cs.cmu.edu

Abstract

Pin counts on CMOS VLSI processors are currently very high and will probably continue to grow. This causes a variety of problems, not the least of which is the possibility of encountering unacceptable switching transients when many output pins change state simultaneously. These transients can drastically reduce the noise immunity of internal gates, severely limiting performance. To limit the number of output pins simultaneously changing state, we propose to directly manage output requests on the basis of predictions of the switching transients implied in each output request. Each chip would be designed assuming a well-specified parameter, the Maximum Allowable Switching Transient (MAST), and an output request which could exceed the MAST would be serialized so that the MAST is not exceeded. This direct control of switching transients can be implemented in either a hardware-intensive or software-intensive style. The overall effect is that a processor chip may incorporate many pins, yet need not be designed to survive the worst case of all output pins attempting to change state simultaneously.

1. Background

There are a number of techniques that have been used to limit switching transients. These can be grouped into two major categories: reducing the number of output pins that are active at any one time, or reducing the observed transient itself.

The number of output pins can be reduced by transmitting data serially or by time multiplexing data buses to serve multiple functions. Alternatively, output times for various signals can be slightly skewed so that the outputs are not set simultaneously. Unfortunately, the quest for higher operating speeds often precludes the obvious application of these techniques.

To reduce the switching transient generated per output pin, some manufacturers have devoted large die areas to decoupling capacitors; but this is not practical for designs which are already pushing die-size constraints. Other manufacturers use off-chip capacitors mounted in the same package as the die, which can provide much larger decoupling capacitances. However, the series inductance inherent in going off-chip is greater, limiting their effectiveness. Another approach, perhaps more generally applicable, is to maintain separate power buses for output buffers and internal state logic [Car88]. Also, by careful design of the output buffer [GaT88], one can make buffer power consumption more consistent, hence reducing the worst-case values and achieving significant improvement.

It is reasonable to assume that next-generation devices will incorporate some combination of these methods. Yet all of these techniques require that the chip be designed for the worst case; additional performance gains can be made by restricting simultaneous output operations only when the MAST would otherwise be exceeded.

2. Approach

There are two difficulties in directly controlling output state transitions based on potential MAST violation.
The first problem is how to detect or predict when a MAST violation may occur; this may be done by placing the main burden either on hardware (detection) or on software (prediction). The second problem is, given that a particular output request would exceed the MAST if done simultaneously, how the hardware can arrange to perform the output pin state transitions without exceeding the MAST. We will discuss this second problem first.

2.1. Output Serialization

Given that a particular logically-simultaneous output operation would exceed the MAST, hardware must intervene to ensure that the limit is observed. This can be done by permitting only a fraction of the output pins to change simultaneously in one cycle and performing the rest of the output on successive cycles which are inserted just for that purpose. We say that such an output operation has been serialized.

Although output serialization requires only relatively simple hardware, some care must be taken. For example, strobe/ready bits must change state only once the corresponding data bits are in the correct state.

When the MAST is not exceeded, the requested output is performed in a single cycle. (In this case, the additional circuitry has no effect.) This is an efficient technique because, for example, localities in instruction address space often correspond to minor bit changes in the address outputs.[1] In some cases, an optimizing compiler/linker/loader can significantly enhance this kind of locality - these code transformations are discussed in detail in the full version of the paper. A simple example of the type of optimization possible is to generate code so that jump and call targets (labels and function/procedure entries) are placed at addresses which differ from the invoking instruction's address in only a few bits (more precisely, causing changes in fewer than MAST bits).
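
The serialization described in this section can be illustrated with a small software model. This is only a sketch under stated assumptions, not the circuit from the full paper: it treats the MAST as a simple count of pins allowed to flip in one cycle, it ignores the strobe/ready ordering requirement mentioned above, and the names bit_changes, serialize, and gray are hypothetical helpers invented for this illustration.

```python
from typing import List


def bit_changes(old: int, new: int) -> int:
    """Count the output pins whose state differs between old and new."""
    return bin(old ^ new).count("1")


def serialize(old: int, new: int, mast: int) -> List[int]:
    """Return the pin states driven on successive cycles.

    Assumption: MAST is modeled as the maximum number of pins that
    may flip in a single cycle. No cycle in the result flips more
    than `mast` pins, and the last state equals `new`.
    """
    if mast < 1:
        raise ValueError("mast must be at least 1")
    states = []
    current = old
    diff = old ^ new          # pins that still have to change
    while diff:
        step = 0
        for _ in range(mast):  # take up to `mast` pending pins
            if not diff:
                break
            low = diff & -diff  # lowest pending pin
            step |= low
            diff ^= low
        current ^= step
        states.append(current)
    # An output with no pin changes still occupies one cycle.
    return states or [new]


def gray(n: int) -> int:
    """Binary-reflected Gray code of n (see footnote [1])."""
    return n ^ (n >> 1)


if __name__ == "__main__":
    # Crossing from 0x00FF to 0x0100 flips 9 address pins at once.
    print(bit_changes(0x00FF, 0x0100))
    # With a hypothetical MAST of 4, the output takes three cycles.
    print([hex(s) for s in serialize(0x00FF, 0x0100, 4)])
    # The same step in Gray code flips a single pin.
    print(bit_changes(gray(0x00FF), gray(0x0100)))
```

In this model, an output that stays within the limit comes back as a single state, matching the case where the additional circuitry has no effect; only outputs that flip more pins than the limit pay for extra cycles.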
Another example is that a loop whose code would normally span a high-order memory address bit change could be moved to a portion of the address map where fewer address bits change. Even data-related outputs sometimes can be transformed to minimize simultaneous bit changes, either by careful layout of data structures or by recognition of properties of operations being performed.

2.2. MAST Violation Detection/Prediction

As discussed in section 2.1, compiler technology (e.g., flow and other static analyses [AhS86] [DiC88]) can be used to predict, and hence to alleviate by code motion, etc., possible violations of the MAST. This same compiler technology can be used to predict when the MAST will be violated and to directly encode that information in the instructions it generates; hardware would simply serialize any operation which the compiler tagged as suspect.[2] Of course, the compiler must conservatively assume that any operation which it cannot prove stays within the MAST actually exceeds the MAST. This assumption is not always true: some output changes are unknown until runtime, and the compiler must assume that all of these change.

The more hardware-intensive alternative is to simply use a circuit to detect, at runtime, when a proposed output would actually exceed the MAST, and to invoke serialization only then. In the full-length paper, several techniques are presented for constructing such a circuit.

Compared to the software prediction, hardware detection ensures that all outputs that can be done in a single cycle are so accomplished, whereas the compiler tagging may cause some to be unnecessarily serialized. The trade-off is that the hardware is fairly complex and that the compiler cannot know precisely how long each instruction will take to execute (which reduces the effectiveness of many conventional compiler optimizations).

3. Conclusion

Using either the software-intensive or the hardware-intensive technique proposed, the concept of directly managing output pin state changes can provide substantial performance increases with only minor impact on the processor design. Typically, a circuit using these techniques will be running at or near its MAST, thus making the best possible use of the available bandwidth.

_________________________

[1] Although it might not be practical, use of Gray-coded rather than 2's-complement integers to represent addresses would ensure that sequential addresses differ by only a single bit.

[2] For those who would rather not place such faith in the compiler, a simple circuit can detect a glitch on the power bus, thereby detecting an instruction which the compiler failed to tag but which exceeds the MAST. The circuit would simply initiate a cold-start.

References

[AhS86] Aho, A. V., Sethi, R., and Ullman, J. D., Compilers: Principles, Techniques, and Tools, Addison Wesley, Reading, Massachusetts, 1986.

[Car88] Carley, L. R., Personal communication, Jan. 31, 1988.

[DiC88] Dietz, H. G. and Chi, C-H., "A Compiler-Writer's View of GaAs Computer System Design," IEEE Proc. of HICSS-21, pp. 256-265, Jan. 1988.

[GaT88] Gabara, T. and Thompson, D., "Ground Bounce Control in CMOS Integrated Circuits," to appear in IEEE Proc. of International Solid-State Circuits Conference, 1988.