Algorithms for formal circuit optimization on a static timing basis.

Algorithms for Formal
Circuit Optimization on
a Static Timing Basis
Chandu Visweswariah
IBM Thomas J. Watson Research Center
Yorktown Heights, NY
1
Acknowledgments
• Partners in crime
– Andy Conn
– Ruud Haring
– David Ling
– Phil Strenski
– Chai Wah Wu
– Ee Cho
– Walt Molzen
– Mike Henderson
– Katya Scheinberg
– Abe Elfadel
– Peter O’Brien
– Greg Northrop
– Pat Williams
– several summer students along the way
– all the users of our tuners in IBM!
2
Outline
• The benefits of tuning/optimization
• Taxonomy
– heuristic methods in brief (e.g., TILOS)
• Algorithms for formal methods
– formulation
– incremental sensitivity computation
– nonlinear optimization
– pruning
– optimality conditions
– Lagrangian Relaxation (LR)
• Methodology implications
3
Benefits of tuning/optimization
• Better circuits
– area, delay, power, noise
• Enhanced designer productivity
– designers’ focus shifts to a higher level like
comparing different topologies
• Better understanding of tradeoffs
– better use of silicon technology
• Effective tuning/optimization is an enabler
of design for manufacturability
– improved parametric yields
4
Taxonomy
Tuners
Dynamic
Static
Formal
Heuristic
• Today’s tutorial will focus mostly on formal
static tuning
5
Dynamic vs. static tuning
• Dynamic tuning is to transient simulation as
static tuning is to static timing analysis
– examples: DELIGHT.SPICE, JiffyTune
– user provides input patterns
– user lists paths to be considered
– user poses carefully thought-out optimization
problem
Transistor/wire sizes
Nonlinear
optimizer
Tuner
Simulator
Function/gradient values
6
Static tuning
• Uses static timing analysis as a basis
• All paths through the logic implicitly
considered
• User does not specify paths or input
patterns; easier to use
• Inherits weaknesses of static timing
analysis
– pessimism
– false paths
• More difficult to include power/noise
considerations during tuning
7
Formal vs. heuristic static tuning
• The formal approach attempts to solve the
problem “exactly”
– better quality of results
– long run times
– sophisticated mathematical algorithms
– significant development effort
• The heuristic method repeatedly upsizes
the most sensitive transistor(s) on the most
critical path(s)
– greedy algorithm
– generates an area vs. speed tradeoff curve
– can handle larger circuits
8
Heuristic tuning
• Example: TILOS (Fishburn and Dunlop, ’85)
• Proved that transistor sizing is a convex
problem under an Elmore delay assumption
• Unfortunately
– Elmore delays are too inaccurate
– slew effects are not taken into account
– meaningful tuning requires additional
constraints like slew limits, input loading,  ratio
constraints which may not be convex
– all of these destroy the convexity of the
problem
• We give up convexity for accuracy
9
TILOS algorithm
• Start all transistors at minimum size
• Determine the critical path
• Find the sensitivity of the critical path delay
to the width of each transistor on the path
• Increase width of transistor with highest
negative sensitivity by a fixed step size
• Repeat
• Generates a speed vs. area tradeoff curve
• Can consider more than one transistor
and/or more than one path at a time
10
Comments on heuristic tuning
• The answer can be sub-optimal
Most critical
Next-most critical
• Can exploit incremental timing algorithms to
determine sensitivities by finite differences
• Can use more realistic delay models
• Difficult to incorporate additional constraints
(loading, slew,  ratio)
11
Algorithms for formal static tuning
• Goals
– obtain an optimal answer (local minimum if
problem is not convex)
– fully automated solution
– take all paths into account
– delay calculation using time-domain simulation
– flexibility to express wide range of constraints
– transistor-level tuner
• Components
– fast simulation and incremental sensitivity
– nonlinear optimization package
12
Static tuning formulation
AT1
AT2
d15
AT5
d 25
d 58
s.t.
s.t.
s.t.
s.t.
AT3
d 36
AT4
d 46
AT7
d 57
d 67
d 68
AT8
AT6
min [max ( AT , AT )]
7 8
AT  max [ AT  d , AT  d ]
7
5 57
6 67
AT  max [ AT  d , AT  d ]
5 58
8
6 68
AT  max [ AT  d , AT  d ]
5
1 15
2 25
AT  max [ AT  d , AT  d ]
6
3 36
4 46
13
Digression: minimax optimization
min [ max{ f1 ( x), f 2 ( x)} ]
x
• This problem is not smooth!
• Re-map to:
min z
x,z
s.t. z  f1 ( x)
s.t. z  f 2 ( x)
• This trick converts a minimax problem into a
standard twice continuously differentiable
nonlinear optimization problem
14
Digression: minimax optimization
f1
f2
z
z*
x0
x

x
15
Remapped problem formulation
min z

min[max( AT7 , AT8 )]  s.t. z  AT7
s.t. z  AT
8

 AT7
AT7  max[ AT5  d 57 , AT6  d 67 ]  
 AT7
 AT8
AT8  max[ AT5  d 58 , AT6  d 68 ]  
 AT8
 AT5
AT5  max[ AT1  d15 , AT2  d 25 ]  
 AT5
 AT5  d 57
 AT6  d 67
 AT5  d 58
 AT6  d 68
 AT1  d15
 AT2  d 25
 AT6  AT3  d 36
AT6  max[ AT3  d 36 , AT4  d 46 ]  
 AT6  AT4  d 46
16
Reformulated problem
• ATs are variables of the problem
• At the solution
– original problem is solved because by definition
z is minimized and the constraints are feasible
– ATs on critical path are correct because these
constraints are tight
– off-critical ATs may be “wrong”
• dij’s are functions of transistor widths
• Amenable to addition of any additional
general (nonlinear) constraints
– slew, input loading,  ratio
17
Springs and planks analogy
z
AT7
d 57
d 58
d 67
AT8
d 68
AT6
AT5
d 15
d 25
AT1
d 36
AT2
d 46
AT3
AT4
18
Springs and planks analogy
z
AT7
AT7
d 67
d 58 d 68
67
d 58 d 68
d 57 d
AT7
d 57 d
67
AT
AT7
5
d 57
AT5
d 57
AT5
d
AT5 15
d
dd 15
15
15
d 67
dd 25
d 25
d
AT
AT
AT11
25
25
AT11
AT
AT
AT22
AT22
dd 36
36
d
d 36
36
AT8
z
AT8
z
AT
AT6
8
d 58
d 68
AT6
d 58
d 68
AT6
z
AT8
d 46
AT6
d 46
d 46
d
AT
AT
AT33
AT33
46
AT
AT
AT44
AT44
19
Constraint generation
wn , w p
i
j
cout j
AT jr  ATi f  d ijr ( wn , w p , cout j , slewinf )
AT j f  ATi r  d ijf ( wn , w p , cout j , slewinr )
slew rj  sijr ( wn , w p , cout j , slewinf )
slew jf  sijf ( wn , w p , cout j , slewinr )
• Each delay, slew depends on a single
input slew
20
Statement of the problem
min z
s.t. z  ATi  RATi
s.t. AT j  ATi  dij (W , S )
s.t. s j  sij (W , S )
s.t. W  area target
s.t. pincap i (W )  target i
s.t. min  i (W )  max
s.t. Wmin Wi Wmax
internal
s.t. Si  Smax
PO
s.t. Si  Smax
for all POs
for each timing arc
for each timing arc
for all PIs
for all gates
for all FETs
for all internal nets
for all POs
21
Simplified view of assertions
Launching
latch
Primary input
arrival time
Combinational
macro being
tuned
Capturing
latch
Delay Primary output
through required arrival
time
macro
z
22
Components of a tuner
Read netlist; create timing graph
Formulate optimization problem
Feed problem to nonlinear optimizer
Solve optimization problem, call simulator
for delays/slews and gradients thereof
Fast simulation and incremental
sensitivity computation
Obtain converged solution
Snap-to-grid; back-annotate; re-time
23
Components of a tuner
• Transistor-level static timing analyzer
– timing graph, sensitizations, simulation of
CCCs, slack calculations
• Fast time-domain circuit simulator
• Fast time-domain incremental sensitivity
– adjoint method
– direct method
• State-of-the-art nonlinear optimizer
– should be able to handle general nonlinear
inequalities and objective functions
– should be able to handle large problems
24
Components: fast simulator
• Many fast transistor-level time-domain
simulators in the literature
– orders of magnitude speedup over SPICE
– event-driven algorithms
– simplified device models (tables for i-v
characteristics, simplified parasitic models)
– 5% typical, 20% worst-case errors on a
stage-delay basis
– typically invoked via an API by the timer with
many “tricks” to ensure efficiency
– simulator must handle simulation of CCCs at
multiple process corners (best-case, nominal,
worst-case)
25
Gradients are indispensable
26
Accurate gradients indispensable
Go West,
young man!
Mount Reality
Mount Elmore
27
Gradient computation
• Direct method
– directly differentiate branch characteristics
– any number of functions, one parameter
– easily extended to the time-domain
• Adjoint method
– best approached via Tellegen’s theorem
– one function, any number of parameters
– requires backward-in-time simulation of the
adjoint circuit and convolution of waveforms
• Either case
– take advantage of event-driven paradigm!
28
Nonlinear optimization
• Much progress in nonlinear optimization in
the last two decades
• A few state-of-the-art packages available
• Can solve large problems (50,000
constraints, 50,000 variables)
• Can accommodate general nonlinear
objective functions and inequality
constraints, and simple bounds on variables
• Must exploit partial separability (structure)
of the problem to render solution practical
29
Customization of the optimizer
• Do not use the optimizer as a black box!
• Customization to circuit tuning yields
tremendous benefits
• Remember:
– any simulation-based scheme involves function
and gradient data that are numerically noisy
– circuit tuning is simulation-intensive; try to
reduce the number of iterations
• tune the optimizer to be aggressive
– gradient computation is expensive
• consider adjoint Lagrangian methods
• consider reducing the number of variables
30
Customization of the optimizer
• Examples of optimizer customization:
– choose tolerances and stopping criteria based
on level of noise
– methods to reduce numerical noise
– failure recovery when the optimizer is too
aggressive
– “2-step updates” for accelerated convergence
– initialization of variables and multipliers
– posing of a well-scaled problem; sensible
choices of units
– reduction of dimensionality of the problem;
e.g., treating fanout capacitances as
“internal variables” of the optimization
31
Outline
• The benefits of tuning/optimization
• Taxonomy
– Heuristic methods in brief (e.g., TILOS)
• Algorithms for formal methods
– formulation
– incremental sensitivity computation
– nonlinear optimization
– pruning
– optimality conditions
– Lagrangian Relaxation (LR)
• Methodology implications
32
Springs and planks analogy
z
33
Degeneracy!
Springs and planks analogy
z
34
Problems with formulation
• Size (problem with 2,388 gates has 24,768
variables and 19,175 inequality constraints)
• Degeneracy
– many equally good solutions since off-critical
arrival times and slews can take one of several
equally correct values
– active (tight) constraints have no impact on the
final solution (i.e., their Lagrange multiplier = 0)
• Redundancy
– many constraints can be removed without
changing the solution
35
Definitions (at the solution)
• A “normal” constraint is active and has a
unique, non-zero multiplier
• A degenerate constraint is active and has a
zero or non-unique multiplier
• A redundant constraint can be removed
without changing the solution; it has a
unique, zero multiplier, but may or may not
be active
• Note: all active degenerate constraints are
redundant; not all degenerate constraints
are redundant
36
Pruning: an example
1
2
3
4
min z
min z
s.t. z  AT4r
r
f
r
f
s.t. z  AT1  d12  d 23  d 34
s.t. z  AT4f
f
r
f
r
r
f
r
s.t.
z

AT

d

d

d
1
12
23
34
s.t. AT4  AT3  d 34
f
r
f
s.t. AT4  AT3  d 34 • 3 timing variables
r
instead of 9
s.t. AT3r  AT2f  d 23
s.t. AT3 f  AT2r  d 23f • 2 nonlinear constraints
s.t. AT2r  AT1 f  d12r
instead of (6 nonlinear
f
r
f
s.t. AT2  AT1  d12
+ 2 linear) constraints
37
Pruning: an example
1
2
1r
2r
3
3r
4
4r
Sink
1f
2f
3f
4f
1r
2f, 3r, 4f
Sink
2r, 3f, 4r
1f
38
Block-based & path-based timing
5
4
AT5  AT4  d45
AT6  AT4  d46
AT4  AT1  d14
AT4  AT2  d24
AT4  AT3  d34
6
AT5  AT1  d14  d 45
AT5  AT2  d 24  d 45
AT5  AT3  d 34  d 45
AT6  AT1  d14  d 46
AT6  AT2  d 24  d 46
AT6  AT3  d 34  d 46
Path-based
Block-based
1
2
3
39
Block-based & path-based timing
1
2
3
d14
d 24
4
d 34
d 45
5
d 46
6
d14  d 45
5
d 24  d 45
d14  d 46
2
d 34  d 45
6
3 d d
34
46 d
24  d 46
1
• In timing graph, if node has n fanins, m
fanouts, eliminating it causes mn
constraints instead of (m+n)
• Criterion: if mn  (m+n)+1, prune!
• Can take slack variable into account
40
Pruning strategy
• During pruning, number of fanins of any
un-pruned node monotonically increases
• During pruning, number of fanouts of any
un-pruned node monotonically increases
• Hence, if a node is not pruned in the first
pass, it will never be pruned, since the
chances of it getting pruned monotonically
worsen
• Therefore, a one-pass algorithm can be
used for a given pruning criterion
41
Three-pass greedy pruning
# fanins 1 2 3 4
#fanouts
1
2 2 2 2
2
2 1 0 X
3
2 0 X X
4
2 X X X
• First pass gain=2, then gain=1, then gain=0
• Not optimal, but yields excellent results
42
Pruning observations
• If either m or n is 1, pruning is good!
• Even if a node is pruned, its rising/falling
slews continue to be variables
• Pruning can be done purely topologically
• Duplicating nonlinear elements (the dijs)
does not increase simulation run time
• Duplicating nonlinear elements does not
adversely impact the optimization
– the gradient/Hessian of each nonlinear
element is only computed and stored once
• Three-pass greedy pruning provides
tremendous pruning benefits efficiently
43
Detailed pruning example
1
2
3
7
9
11
12
4
5
6
8
10
13
14
15
16
44
Detailed pruning example
Edges = 26
Nodes = 16 (+2)
1
2
7
9
3
Source
11
14
12
15
13
16
Sink
4
5
8
10
6
45
Detailed pruning example
Edges = 26  20
Nodes = 16  10
7
1
9
11
14
12
15
13
16
2
3
Source
5
6
Sink
4
8
10
46
Detailed pruning example
Edges = 20  17
Nodes = 10  7
7
1
9
11
14
12
14
2
3
Source
5
6
Sink
15
4
8
10
13
16
47
Detailed pruning example
Edges = 17  16
Nodes = 7  6
9
1,7
11
14
12
14
2,7
Source
5
6
3,7
4
Sink
15
8
10
13
16
48
Detailed pruning example
Edges = 16  15
Nodes = 6  5
9
1,7
11,14
2,7
12
Source
5
6
3,7
4
14
Sink
15
8
10
13
16
49
Detailed pruning example
Edges = 15  14
Nodes = 5  4
9
1,7
11,14
2,7
12
Source
5
6
3,7
4
14
Sink
15
8
10
13,16
50
Detailed pruning example
Edges = 14  13
Nodes = 4  3
9
1,7
11,14
2,7
12
Source
5
6
3,7
4
14
Sink
15
10
8
10,13,16
51
Detailed pruning example
Edges = 13  13
Nodes = 3  2
9
1,7
12,14
2,7
Source
5
6
11,14
12,15
3,7
4
10,12,14
Sink
10,12,15
8
10,13,16
52
Wrap-up on pruning
• Pruning
– reduces the number of variables (88% fewer
arrival time variables, 25% fewer variables)
– reduces the number of constraints (40% fewer
arrival time constraints, 19% fewer total)
– reduces degeneracy and redundancy (93%
fewer arrival time constraints, 38% fewer total)
– reduces memory and CPU time requirements
– leads to superior numerical results (more
consistent/robust in presence of noise because
it is easier to identify correct active set)
– achieved by simple timing graph manipulation!
– can consider pruning slew variables
53
Optimality conditions: minimax
min z
x, z
s.t. z  f ( x )
1
s.t. z  f ( x )
2
L  z   ( f  z)   ( f  z)
1 1
2 2
L / z  1    
1 2
1*  *2  1
*
In general,    1
54
Arrival time optimality conditions
4
1
3
2
5
AT3  AT1  d13 ; AT3  AT2  d 23 ; AT4  AT3  d 34 ; AT5  AT3  d 35
L  z  ...  13 ( AT1  d13  AT3 )  23 ( AT2  d 23  AT3 )
 34 ( AT3  d 34  AT4 )  35 ( AT3  d 35  AT5 )  ...
L
 13  23  34  35
AT3
*  *  *  * or *
*
13


23
34
35  upstream  downstream
55
Slew optimality conditions
4
1
3
5
2
s3  s13 ( s1 , w); s3  s23 ( s2 , w); s4  s34 ( s3 , w); s5  s35 ( s3 , w)
AT4  AT3  d 34 ( s3 , w); AT5  AT3  d 35 ( s3 , w)
S
S
S
S
L  z  ...  13
( s13  s3 )  23
( s23  s3 )  34
( s34  s4 )  35
( s35  s5 )
D
D
 34
( AT3  d 34  AT4 )  35
( AT3  d 35  AT5 )  ...
L
S  S  S s34  S s35  D d34  D d35
 13
23
34 s
35 s
34 s
35 s
s3
3
3
3
3

S*
upstream
 
S*
downstream
s
d
D*
  downstream
s j
s j
56
Observations
• Can start at graph sink (z) and work
backwards
• In the case of a unique critical path, the
multipliers can be determined uniquely
• If all downstream multipliers at a timing
point are zero, this timing point is not part
of the critical path; therefore all upstream
multipliers are also zero (even if the
constraint is active)
• Slew relations are more complicated, but
follow the same logic
57
Using optimality conditions
• Timer knows criticality of each constraint
• Timer knows the sign of the multipliers
• If there is a unique critical path, multipliers
can be determined uniquely
• If there are multiple critical paths, timer
can help the optimizer by staying within the
family of multipliers that satisfy the
optimality conditions
• Quickly ignores off-critical constraints
• Helps optimizer obtain correct active set
58
Lagrangian Relaxation (LR)
• To solve
min f ( x )
s.t. c ( x )  0  complicating constraint
1
s.t. c ( x )  0
2
• Equivalent problem is
min f ( x)  c1( x)
x,   0
s.t. c2 ( x)  0
• We have made the dual variable  a
variable of the problem
59
LR procedure
• Procedure to solve min f ( x)  c1( x)
x,   0
s.t. c2 ( x)  0
– pick a starting value of 
– solve the “inner” problem for fixed 
– now update  using a formula such as

k 1
 k c1 ( x ) 
 max  0,  

k 

k
– intuition: if c1(xk) < 0,  is reduced, ultimately to
zero; if c1(xk) > 0,  is increased, adding to the
weight of the c1(x) term; if c1(xk) = 0,  does not
change
– repeat solution of “inner” problem
60
Bounds in Lagrangian Relaxation
P1
min f ( x)
s.t. c1( x)  0  complicating constraint
s.t. c2( x)  0
P2
min f ( x)  c1( x)
0
s.t. c1( x)  0  complicating constraint
s.t. c2( x)  0
P3
min f ( x)  c1( x)
0
s.t. c2( x)  0
P1  P 2  P3
f ( x ) is an upper bound and f ( x )  c1 ( x )
a lower bound to the solution of P1.
61
Approaching the solution
k
upper bound : f ( x )
“Duality gap”
Solution
Iteration k
lower bound : f ( x k )  k c1 ( x k )
62
LR in circuit tuning: arrival times
• Recall
L  z  ...  13 ( AT1  d13  AT3 )  23 ( AT2  d 23  AT3 )
 34 ( AT3  d 34  AT4 )  35 ( AT3  d 35  AT5 )  ...
*
13
 *23  *34  *35
• Assume that the multipliers are in , i.e.,
they belong to the family of solutions that
satisfy the optimality conditions
• Substituting, we see that AT3 disappears!
• Similarly, all arrival times cancel off, and
L  ...  13d13  23d 23  34 d 34  35d 35  ...
 ...   ij d ij  ...
63
LR in circuit tuning: slews
• No such luck with slews, they do not cancel
even if the s are in ; so slews stay as
variables of the “inner” problem
• However, many slew constraints may have
zero multipliers
• Therefore, there may be benefit to
“relaxing” the slew constraints
• In any case, the slew multipliers can be
chosen to satisfy the slew multiplier
optimality conditions
64
LR in circuit tuning: procedure
• Pose problem with no arrival time variables
• Guess initial s in 
• Solve “inner” problem with fixed s; choose
to relax or not relax the other constraints
(slew, area,  ratio, input loading)
• If upper bound and lower bound are
sufficiently close, or no improvement, quit
• Update the s (subgradient method)
• Project the s to the nearest point in 
• Re-solve the “inner” problem with new s
65
Comments on LR in circuit tuning
• LR is not favored among nonlinear math.
programmers; convergence is at best linear
• But, in the case of Elmore delay models:
– there are no slew complications
– objective function is a linear combination of
Elmore delays
– “inner” implementation can be very efficient
– large number of “outer” iterations is not a
problem
• For simulation-based slew-aware tuning,
however, LR is not recommended
66
What to expect from a static tuner
• 25% or more improvement at constant area
on untuned circuits
• 10-15% on previously hand-tuned circuits
• Impressive area reductions on synthesized
random logic macros
• Capacity limitations (~5K gates)
• Long run times (overnight run to tune a
circuit with a few thousand gates)
• Many minimum-sized non-critical gates;
constraints pushed to the edge of feasibility
67
Characteristics of a formal tuner
• Does the tuner use nonlinear optimization?
Is it gradient-based?
• Does it come with an optimality guarantee?
Is it consistent and robust?
• Does the tuner employ a sufficiently
accurate static timing analysis as its basis?
• Can it accommodate additional constraints
(area, slew, input loading,  ratios)?
• Capacity? CPU time? Methodology fit?
• Is it flexible (hierarchical vs. flat, schematic
vs. extracted, etc.)?
68
Methodology implications
• Consider how layout will be performed
before embarking on tuning!
• If every transistor tuned as an independent
variable:
– data explosion
– who will do your physical design?
• Solution could be to tune on the basis of a
parameterized cell library
• Automated layout solutions can be used for
circuits containing only parameterized cells
69
Possible methodology
Synthesized macros
Customs
Tuning
P-cell
library
Automated layout
Extraction
Timing
70
Methodology implications
• Tuning of extracted netlists
– correlation between diffusion capacitances and
transistor widths is lost
– need more information from extractor
– alternatively, must rely on estimated diffusions
within cells and extractor for other parasitics
• Hierarchical vs. non-hierarchical tuning
– hierarchy preservation reduces data volume
– fewer independent variables; fewer distinct
layouts
– circuit is amenable to future editing by designer
– but performance is left on the table!
71
Conclusions
• Due to many recent innovations...
– formulation/pruning
– fast simulation and incremental sensitivity
computation
– advances in nonlinear optimization
– exploitation of optimality conditions
• … it is possible to optimize fairly large
circuits with reasonable consistency …
• … making formal static optimization of
transistor and wire sizes possible …
• … and leading to very high quality results!
72
Key references
• TILOS: Fishburn and Dunlop, ICCAD ’85
• Gradients: Brayton, Hachtel and
Sangiovanni-Vincentelli survey, Proc. IEEE,
10/81 and Pileggi, Rohrer and
Visweswariah, McGraw Hill book ’95
• Nonlinear optimization: Conn, Gould and
Toint, Spedicato book, Kluwer, ’94
• Formulation/pruning: Visweswariah and
Conn, ICCAD ’99
• Lagrangian Relaxation: Chen, Chu and
Wong, TCAD 7/99
73
Similar pages