Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!iuvax!pur-ee!hankd
From: hankd@pur-ee.UUCP (Hank Dietz)
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Summary: Power dissipation & the concept of MAST
Message-ID: <8479@pur-ee.UUCP>
Date: 8 Jul 88 18:52:52 GMT
References: <1362@oakhill.UUCP> <11474@steinmetz.ge.com>
Organization: Purdue University Engineering Computer Network
Lines: 194

A while back, my brother and I sent the following abstract to ICCD88...  we
never heard anything as to how they liked it, but it seems a very reasonable
thing to post here, given the recent discussion of power dissipation....

Limiting Switching Transients in High-Speed Digital Processors

     Henry G. Dietz            Paul H. Dietz
     hankd@ee.ecn.purdue.edu   phd@speech1.cs.cmu.edu

Abstract

     Pin counts on CMOS VLSI processors are  currently  very
high  and  will  probably  continue  to grow.  This causes a
variety of problems, not least of which is  the  possibility
of  encountering unacceptable switching transients when many
output pins change state simultaneously.   These  transients
can drastically reduce the noise immunity of internal gates,
severely limiting performance.

     To limit  the  number  of  output  pins  simultaneously
changing   state,  we  propose  to  directly  manage  output
requests on the basis of predictions of the switching  tran-
sients  implied  in each output request.  Each chip would be
designed assuming a well-specified  parameter,  the  Maximum
Allowable  Switching Transient (MAST), and an output request
which could exceed the MAST would be serialized so that  the
MAST  is  not  exceeded.   This  direct control of switching
transients can be implemented in either a hardware-intensive
or  software-intensive  style.  The overall effect is that a
processor chip may incorporate many pins, yet  need  not  be
designed  to  survive  the  worst  case  of  all output pins
attempting to change state simultaneously.

1.  Background

     There are a number of techniques that have been used to
limit  switching  transients.  These can be grouped into two
major categories:  reduction in the number  of  output  pins
that are active at any one time or reduction of the observed
transient itself.

     The number of output pins can be reduced  by  transmit-
ting  data  serially  or  by time multiplexing data buses to
serve multiple functions.  Alternatively, output  times  for
various  signals  can be slightly skewed so that the outputs
are not set simultaneously.  Unfortunately,  the  quest  for
higher   operating   speeds   often  precludes  the  obvious
application of these techniques.

     To reduce the switching transient generated per  output
pin,  some  manufacturers  have  devoted  large die areas to
decoupling capacitors; but this is not practical for designs
which  are  already  pushing  die-size  constraints.   Other
manufacturers use off-chip capacitors mounted in the same
package as the die, which can provide much larger decoupling
capacitances.  However, the series  inductance  inherent  in
going  off-chip  is  greater,  limiting  the  effectiveness.
Another approach, perhaps more generally applicable,  is  to
maintain  separate power buses for output buffers and inter-
nal state logic [Car88].  Also, by  careful  design  of  the
output buffer [GaT88], one can make buffer power consumption
more consistent, hence reducing the  worst-case  values  and
achieving significant improvement.

     It is reasonable to assume that next-generation devices
will incorporate some combination of these methods.  However,
all of these techniques require that the chip be designed
for the worst case.  Additional performance gains can be made
by restricting simultaneous output operations only when the
MAST would otherwise be exceeded.

2.  Approach

     There are two difficulties in directly controlling
output state transitions based on potential MAST violation.
The first problem is how to detect or predict when a MAST
violation may occur; this may be done by placing the main
burden either on hardware (detection) or on software
(prediction).  The second problem is how, given that a
particular output request would exceed the MAST if done
simultaneously, the hardware can arrange to perform the
output pin state transitions without exceeding the MAST.
We will discuss this second problem first.

2.1.  Output Serialization

     Given that a particular  logically-simultaneous  output
operation  would exceed the MAST, hardware must intervene to
ensure that the limit is observed.  This can be done by
permitting only a fraction of the output pins to change
simultaneously in one cycle and performing the rest of the output
on  successive  cycles which are inserted just for that pur-
pose.  We say that such an output operation has been serial-
ized.
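
     As a rough software model of the serialization described
above (not the circuit itself), the sketch below splits one
logical output into as many cycles as needed.  It assumes the
MAST can be expressed as a simple count of pins allowed to
switch per cycle, which is a simplification; the function name
and the choice to update low-order pins first are hypothetical.

```python
def serialize_output(current, target, mast):
    """Return the pin states driven on each cycle.

    Assumption: `mast` is the largest number of pins that may
    change state in a single cycle. Pins are updated from the
    low-order end first; a real design might choose differently.
    """
    if mast < 1:
        raise ValueError("mast must be at least 1")
    states = []
    pending = current ^ target          # pins that still have to change
    while True:
        step = 0
        for _ in range(mast):
            if not pending:
                break
            low = pending & -pending    # isolate the lowest pending pin
            step |= low
            pending ^= low
        current ^= step                 # flip this cycle's group of pins
        states.append(current)
        if not pending:
            return states


# Four pins must change but only two may switch per cycle,
# so one extra cycle is inserted.
print([format(s, "04b") for s in serialize_output(0b0000, 0b1111, 2)])
```

When the request is already within the limit, the list has a
single entry, matching the single-cycle case.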

     Although output serialization requires only  relatively
simple  hardware,  some  care  must  be taken.  For example,
strobe/ready  bits  must  change   state   only   once   the
corresponding data bits are in the correct state.

     When the MAST is not exceeded, the requested output  is
performed  in a single cycle.  (In this case, the additional
circuitry has no effect.)  This is  an  efficient  technique
because,  for  example,  localities  in  instruction address
space often correspond to minor bit changes in  the  address
outputs.[1]
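
     To illustrate the locality argument, the following sketch
counts how many address bits change between consecutive
instruction addresses.  It assumes plain binary addresses that
step by one; the helper names are hypothetical.

```python
def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def transition_histogram(start, count):
    """Map 'bits changed' -> how many of the `count` sequential
    address steps beginning at `start` change that many bits."""
    hist = {}
    for addr in range(start, start + count):
        n = bits_changed(addr, addr + 1)
        hist[n] = hist.get(n, 0) + 1
    return hist


print(transition_histogram(0, 16))
```

Half of the sixteen steps change a single bit, and only the
occasional carry (for example 15 to 16) changes many bits at
once.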

     In some cases, an optimizing compiler/linker/loader can
significantly  enhance  this  kind  of locality - these code
transformations are discussed in detail in the full  version
of  the paper.  A simple example of the type of optimization
possible is to generate code so that jump and  call  targets
(labels   and  function/procedure  entries)  are  placed  at
addresses  which  differ  from  the   invoking-instruction's
address  in only a few bits (more precisely, causing changes
in fewer than MAST bits).  Another example is  that  a  loop
whose  code  would normally span a high-order memory address
bit change could be moved to a portion of  the  address  map
where  fewer address bits change.  Even data-related outputs
sometimes can be transformed to  minimize  simultaneous  bit
changes,  either  by careful layout of data structures or by
recognition of properties of operations being performed.
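
     A minimal sketch of the jump-target placement idea follows.
It assumes the linker has a list of candidate addresses where
the target could be placed and that the MAST is a bit count;
pick_target_address and its parameters are hypothetical names,
not part of any real toolchain.

```python
def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def pick_target_address(call_site, candidates, mast):
    """Return the candidate that changes the fewest address bits
    relative to call_site, or None if no candidate changes fewer
    than `mast` bits (the strict comparison follows the text's
    'fewer than MAST bits')."""
    best = min(candidates,
               key=lambda addr: bits_changed(call_site, addr),
               default=None)
    if best is None or bits_changed(call_site, best) >= mast:
        return None
    return best


# 0x1010 differs from the call site in one bit, so it is preferred.
print(hex(pick_target_address(0x1000, [0x2000, 0x1010, 0x1FFF], 4)))
```

A None result would mean the jump cannot be placed cheaply and
would be left to serialization.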

2.2.  MAST Violation Detection/Prediction

     As discussed in section 2.1, compiler technology (e.g.,
flow  and other static analyses [AhS86] [DiC88]) can be used
to predict, and hence to alleviate  by  code  motion,  etc.,
possible  violations  of the MAST.  This same compiler tech-
nology can be used to predict when the MAST will be violated
and  to directly encode that information in the instructions
it generates; hardware would simply serialize any  operation
which the compiler tagged as suspect.[2]

     Of course, the compiler must conservatively assume that
any operation which it cannot prove stays within the MAST
actually exceeds the MAST.  This assumption is not always
true: some output changes cannot be known until runtime, and
the compiler must assume that all of these bits change.
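
     The conservative rule can be sketched as a worst-case bit
count.  This assumes the compiler's analysis yields a mask of
pins whose old and new values are both known statically and
that the MAST is a count of switching pins; all names here are
hypothetical.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def worst_case_changes(old, new, known_mask, width):
    """Upper bound on pins that may switch: pins known to change
    plus every pin whose value is unknown until runtime."""
    all_pins = (1 << width) - 1
    certain = popcount((old ^ new) & known_mask & all_pins)
    unknown = popcount(~known_mask & all_pins)
    return certain + unknown


def tag_for_serialization(old, new, known_mask, width, mast):
    """True if the compiler cannot prove the output stays within mast."""
    return worst_case_changes(old, new, known_mask, width) > mast


# Only two known bits change, but four unknown pins push the bound to six.
print(tag_for_serialization(0b00000000, 0b00000011, 0b00001111, 8, 4))
```

With all eight pins known, the same output would need no tag,
which shows how unknown bits alone can force serialization.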

     The more hardware-intensive alternative  is  to  simply
use  a circuit to detect, at runtime, when a proposed output
would actually exceed the MAST, and to invoke  serialization
only then.  In the full-length paper, several techniques are
presented for constructing such a circuit.

     Compared to software prediction, hardware detection
ensures that all outputs that can be done in a single cycle
are done in a single cycle, whereas compiler tagging may cause
some to be serialized unnecessarily.  The trade-off is that
the hardware is fairly complex and that the compiler  cannot
know  precisely  how long each instruction will take to exe-
cute (which reduces the effectiveness of  many  conventional
compiler optimizations).

3.  Conclusion

     Using either the software-intensive  or  the  hardware-
intensive technique proposed, the concept of directly manag-
ing output pin state changes can provide substantial perfor-
mance  increases  with  only  minor  impact on the processor
design.  Typically, a circuit using these techniques will be
running  at  or near its MAST, thus making the best possible
use of the available bandwidth.

_________________________
  [1] Although it might not be practical, use  of  Gray
coded  rather than 2's-complement integers to represent
addresses would ensure that sequential addresses differ
by only a single bit.
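
  The single-bit property can be checked with a short sketch.
It uses the reflected binary Gray code, one common Gray code;
the footnote does not say which code is meant, so that choice
is an assumption.

```python
def to_gray(n):
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Every step between sequential addresses flips exactly one bit,
# including steps such as 15 -> 16 that flip five bits in binary.
print(all(bits_changed(to_gray(i), to_gray(i + 1)) == 1
          for i in range(4096)))
```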

  [2] For  those  who would rather not place such faith
in the compiler, a simple circuit can detect  a  glitch
on  the  power  bus,  thereby  detecting an instruction
which the compiler failed to tag but which exceeds  the
MAST.  The circuit would simply initiate a cold-start.

References

[AhS86]   Aho, A. V., Sethi, R., and Ullman, J. D.,
          Compilers: Principles, Techniques, and Tools,
          Addison-Wesley, Reading, Massachusetts, 1986.

[Car88]   Carley, L. R., Personal  communication,  Jan.  31,
          1988.

[DiC88]   Dietz, H. G. and Chi, C-H.,  "A  Compiler-Writer's
          View  of  GaAs Computer System Design," IEEE Proc.
          of HICSS-21, pp. 256-265, Jan. 1988.

[GaT88]   Gabara, T. and Thompson, D., "Ground  Bounce  Con-
          trol  in  CMOS  Integrated Circuits," to appear in
          IEEE Proc. of International  Solid-State  Circuits
          Conference, 1988.