Algorithms for Formal Circuit Optimization on a Static Timing Basis
Chandu Visweswariah
IBM Thomas J. Watson Research Center, Yorktown Heights, NY

Acknowledgments
• Partners in crime
– Andy Conn
– Ruud Haring
– David Ling
– Phil Strenski
– Chai Wah Wu
– Ee Cho
– Walt Molzen
– Mike Henderson
– Katya Scheinberg
– Abe Elfadel
– Peter O'Brien
– Greg Northrop
– Pat Williams
– several summer students along the way
– all the users of our tuners in IBM!

Outline
• The benefits of tuning/optimization
• Taxonomy
– heuristic methods in brief (e.g., TILOS)
• Algorithms for formal methods
– formulation
– incremental sensitivity computation
– nonlinear optimization
– pruning
– optimality conditions
– Lagrangian Relaxation (LR)
• Methodology implications

Benefits of tuning/optimization
• Better circuits
– area, delay, power, noise
• Enhanced designer productivity
– designers' focus shifts to a higher level, such as comparing different topologies
• Better understanding of tradeoffs
– better use of silicon technology
• Effective tuning/optimization is an enabler of design for manufacturability
– improved parametric yields

Taxonomy
[figure: taxonomy tree; Tuners split into Dynamic and Static, and Static splits into Formal and Heuristic]
• Today's tutorial will focus mostly on formal static tuning

Dynamic vs. static tuning
• Dynamic tuning is to transient simulation as static tuning is to static timing analysis
– examples: DELIGHT.SPICE, JiffyTune
– user provides input patterns
– user lists paths to be considered
– user poses a carefully thought-out optimization problem
[figure: a dynamic tuner couples a nonlinear optimizer with a simulator; the optimizer sends transistor/wire sizes to the simulator and receives function/gradient values in return]

Static tuning
• Uses static timing analysis as a basis
• All paths through the logic are implicitly considered
• User does not specify paths or input patterns; easier to use
• Inherits the weaknesses of static timing analysis
– pessimism
– false paths
• More difficult to include power/noise considerations during tuning

Formal vs. heuristic static tuning
• The formal approach attempts to solve the problem "exactly"
– better quality of results
– long run times
– sophisticated mathematical algorithms
– significant development effort
• The heuristic approach repeatedly upsizes the most sensitive transistor(s) on the most critical path(s)
– greedy algorithm
– generates an area vs. speed tradeoff curve
– can handle larger circuits

Heuristic tuning
• Example: TILOS (Fishburn and Dunlop, '85)
• Proved that transistor sizing is a convex problem under an Elmore delay assumption
• Unfortunately
– Elmore delays are too inaccurate
– slew effects are not taken into account
– meaningful tuning requires additional constraints such as slew limits, input loading, and ratio constraints, which may not be convex
– all of these destroy the convexity of the problem
• We give up convexity for accuracy

TILOS algorithm (a code sketch follows below)
• Start all transistors at minimum size
• Determine the critical path
• Find the sensitivity of the critical path delay to the width of each transistor on the path
• Increase the width of the transistor with the highest negative sensitivity by a fixed step size
• Repeat
• Generates a speed vs. area tradeoff curve
• Can consider more than one transistor and/or more than one path at a time
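The loop below is a minimal Python sketch of this greedy flow, not the original TILOS implementation; the delay and critical_path callbacks are hypothetical stand-ins for an incremental static timer.

```python
def tilos_tune(widths, delay, critical_path, step, w_max, target):
    """Greedy TILOS-style sizing sketch (hypothetical callbacks).

    widths           : dict transistor -> current width (start at minimum size)
    delay(w)         : circuit delay under widths w (e.g., from a fast timer)
    critical_path(w) : transistors on the current critical path
    """
    while delay(widths) > target:
        base = delay(widths)
        best, best_gain = None, 0.0
        for t in critical_path(widths):
            if widths[t] + step > w_max:
                continue                      # width limit reached for this device
            widths[t] += step                 # trial upsize
            gain = base - delay(widths)       # finite-difference sensitivity
            widths[t] -= step                 # undo the trial
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:                      # no upsizing helps any more: stop
            break
        widths[best] += step                  # commit the greedy move
    return widths
```

Recording (delay, total width) after each committed move traces out the speed vs. area tradeoff curve mentioned on the slide.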
Comments on heuristic tuning
• The answer can be sub-optimal
[figure: a most-critical and a next-most-critical path; greedy moves on the most critical path alone can miss the better joint solution]
• Can exploit incremental timing algorithms to determine sensitivities by finite differences
• Can use more realistic delay models
• Difficult to incorporate additional constraints (loading, slew, ratio)

Algorithms for formal static tuning
• Goals
– obtain an optimal answer (a local minimum if the problem is not convex)
– fully automated solution
– take all paths into account
– delay calculation using time-domain simulation
– flexibility to express a wide range of constraints
– transistor-level tuner
• Components
– fast simulation and incremental sensitivity
– nonlinear optimization package

Static tuning formulation
[figure: timing graph with arrival times AT_1 ... AT_8 and arc delays d_{15}, d_{25}, d_{36}, d_{46}, d_{57}, d_{58}, d_{67}, d_{68}]
\min [\max(AT_7, AT_8)]
where
AT_7 = \max[AT_5 + d_{57},\ AT_6 + d_{67}]
AT_8 = \max[AT_5 + d_{58},\ AT_6 + d_{68}]
AT_5 = \max[AT_1 + d_{15},\ AT_2 + d_{25}]
AT_6 = \max[AT_3 + d_{36},\ AT_4 + d_{46}]

Digression: minimax optimization
\min_x [\max\{f_1(x), f_2(x)\}]
• This problem is not smooth!
• Re-map to:
\min_{x,z} z
s.t. z \ge f_1(x)
s.t. z \ge f_2(x)
• This trick converts a minimax problem into a standard twice continuously differentiable nonlinear optimization problem (a small numeric sketch of the remapping appears after the problem statement below)
[figure: f_1 and f_2 plotted against x; z^* is the minimum of the upper envelope \max(f_1, f_2)]

Remapped problem formulation
\min z
s.t. z \ge AT_7;\ z \ge AT_8
s.t. AT_7 \ge AT_5 + d_{57};\ AT_7 \ge AT_6 + d_{67}
s.t. AT_8 \ge AT_5 + d_{58};\ AT_8 \ge AT_6 + d_{68}
s.t. AT_5 \ge AT_1 + d_{15};\ AT_5 \ge AT_2 + d_{25}
s.t. AT_6 \ge AT_3 + d_{36};\ AT_6 \ge AT_4 + d_{46}

Reformulated problem
• ATs are variables of the problem
• At the solution
– the original problem is solved because, by definition, z is minimized and the constraints are feasible
– ATs on the critical path are correct because those constraints are tight
– off-critical ATs may be "wrong"
• The d_{ij} are functions of transistor widths
• Amenable to the addition of any general (nonlinear) constraints
– slew, input loading, ratio

Springs and planks analogy
[figure: arrival times drawn as planks connected by delay "springs" d_{ij}; plank z rests on AT_7 and AT_8, which rest via springs on AT_5 and AT_6, down to the inputs AT_1 ... AT_4; minimizing z compresses the springs along the critical path]

Constraint generation
[figure: stage i driving stage j, with device widths w_n, w_p and output load cout_j]
AT_j^r \ge AT_i^f + d_{ij}^r(w_n, w_p, cout_j, slewin^f)
AT_j^f \ge AT_i^r + d_{ij}^f(w_n, w_p, cout_j, slewin^r)
slew_j^r = s_{ij}^r(w_n, w_p, cout_j, slewin^f)
slew_j^f = s_{ij}^f(w_n, w_p, cout_j, slewin^r)
• Each delay and slew depends on a single input slew

Statement of the problem
\min z
s.t. z \ge AT_i - RAT_i                          for all POs
s.t. AT_j \ge AT_i + d_{ij}(W, S)                for each timing arc
s.t. s_j \ge s_{ij}(W, S)                        for each timing arc
s.t. \sum W \le area\ target
s.t. pincap_i(W) \le target_i                    for all PIs
s.t. \beta_{min} \le \beta_i(W) \le \beta_{max}  for all gates
s.t. W_{min} \le W_i \le W_{max}                 for all FETs
s.t. S_i \le S_{max}^{internal}                  for all internal nets
s.t. S_i \le S_{max}^{PO}                        for all POs
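As a concrete instance of the minimax remapping above, this sketch minimizes the auxiliary variable z subject to z \ge f_1(x) and z \ge f_2(x) for two toy quadratics of my own choosing (not from the tutorial), using SciPy's SLSQP solver.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-ins for two path-delay functions of a single width x.
f1 = lambda x: (x - 1.0) ** 2
f2 = lambda x: (x + 1.0) ** 2

# Variables v = [x, z]; minimize z subject to z - f_i(x) >= 0.
res = minimize(
    lambda v: v[1],                 # objective: just z
    x0=np.array([3.0, 20.0]),       # feasible initial guess for [x, z]
    method="SLSQP",
    constraints=[
        {"type": "ineq", "fun": lambda v: v[1] - f1(v[0])},
        {"type": "ineq", "fun": lambda v: v[1] - f2(v[0])},
    ],
)
x_star, z_star = res.x
print(x_star, z_star)   # about (0, 1): the minimum of max(f1, f2)
```

At the solution both constraints are tight, which mirrors the tuning formulation: only constraints along the critical path(s) bind, while off-critical arrival-time constraints stay slack.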
Simplified view of assertions
[figure: a launching latch feeds the combinational macro being tuned, which feeds a capturing latch; primary input arrival times and the primary output required arrival time bound the delay through the macro, and z measures the worst violation]

Components of a tuner
[figure: flow] Read netlist; create timing graph → Formulate optimization problem → Feed problem to nonlinear optimizer → Solve optimization problem, calling the simulator for delays/slews and gradients thereof (fast simulation and incremental sensitivity computation) → Obtain converged solution → Snap-to-grid; back-annotate; re-time

Components of a tuner
• Transistor-level static timing analyzer
– timing graph, sensitizations, simulation of CCCs (channel-connected components), slack calculations
• Fast time-domain circuit simulator
• Fast time-domain incremental sensitivity
– adjoint method
– direct method
• State-of-the-art nonlinear optimizer
– should be able to handle general nonlinear inequalities and objective functions
– should be able to handle large problems

Components: fast simulator
• Many fast transistor-level time-domain simulators in the literature
– orders of magnitude speedup over SPICE
– event-driven algorithms
– simplified device models (tables for i-v characteristics, simplified parasitic models)
– 5% typical, 20% worst-case errors on a stage-delay basis
– typically invoked via an API by the timer, with many "tricks" to ensure efficiency
– the simulator must handle simulation of CCCs at multiple process corners (best-case, nominal, worst-case)

Gradients are indispensable / Accurate gradients are indispensable
[figure: "Go West, young man!" cartoon; optimizing with inaccurate (Elmore) gradients climbs Mount Elmore instead of Mount Reality]

Gradient computation (a small sketch of the direct method follows below)
• Direct method
– directly differentiate branch characteristics
– any number of functions, one parameter
– easily extended to the time domain
• Adjoint method
– best approached via Tellegen's theorem
– one function, any number of parameters
– requires backward-in-time simulation of the adjoint circuit and convolution of waveforms
• In either case
– take advantage of the event-driven paradigm!
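As a static stand-in for the direct method, the sketch below analytically differentiates the branch characteristics of an Elmore RC chain with respect to the stage widths, then checks the result by finite differences. All element values are made-up illustrations; a real tuner performs the analogous differentiation inside a time-domain, event-driven simulator.

```python
import numpy as np

R_UNIT, C_UNIT, C_FIXED = 1.0, 0.5, 0.2   # hypothetical per-width R/C values

def elmore_delay(w):
    """Elmore delay of an RC chain: stage i has R_i = R/w_i, C_i = C*w_i + C0."""
    R = R_UNIT / w
    C = C_UNIT * w + C_FIXED
    downstream_cap = np.cumsum(C[::-1])[::-1]   # capacitance at and beyond stage i
    return float(np.sum(R * downstream_cap))

def elmore_grad(w):
    """Direct method: differentiate each branch characteristic w.r.t. its width."""
    R = R_UNIT / w
    downstream_cap = np.cumsum((C_UNIT * w + C_FIXED)[::-1])[::-1]
    upstream_res = np.cumsum(R)                 # resistance up to and including stage i
    # dD/dw_i = (dR_i/dw_i)*downstream_cap_i + (dC_i/dw_i)*upstream_res_i
    return -R_UNIT / w**2 * downstream_cap + C_UNIT * upstream_res

w = np.array([2.0, 1.0, 3.0])
h = 1e-6
fd = np.array([(elmore_delay(w + h * e) - elmore_delay(w - h * e)) / (2 * h)
               for e in np.eye(len(w))])
assert np.allclose(elmore_grad(w), fd, atol=1e-5)   # analytic matches finite diff
```

The whole gradient falls out of one sweep here only because the model is closed-form; in the time domain the direct method yields all functions per parameter, while the adjoint method yields one function for all parameters.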
Nonlinear optimization
• Much progress in nonlinear optimization in the last two decades
• A few state-of-the-art packages are available
• Can solve large problems (50,000 constraints, 50,000 variables)
• Can accommodate general nonlinear objective functions and inequality constraints, and simple bounds on variables
• Must exploit the partial separability (structure) of the problem to render the solution practical

Customization of the optimizer
• Do not use the optimizer as a black box!
• Customization to circuit tuning yields tremendous benefits
• Remember:
– any simulation-based scheme involves function and gradient data that are numerically noisy
– circuit tuning is simulation-intensive; try to reduce the number of iterations
• tune the optimizer to be aggressive
– gradient computation is expensive
• consider adjoint Lagrangian methods
• consider reducing the number of variables

Customization of the optimizer
• Examples of optimizer customization:
– choose tolerances and stopping criteria based on the level of noise
– methods to reduce numerical noise
– failure recovery when the optimizer is too aggressive
– "2-step updates" for accelerated convergence
– initialization of variables and multipliers
– posing a well-scaled problem; sensible choices of units
– reduction of the dimensionality of the problem, e.g., treating fanout capacitances as "internal variables" of the optimization

Outline
• The benefits of tuning/optimization
• Taxonomy
– heuristic methods in brief (e.g., TILOS)
• Algorithms for formal methods
– formulation
– incremental sensitivity computation
– nonlinear optimization
– pruning
– optimality conditions
– Lagrangian Relaxation (LR)
• Methodology implications

Springs and planks analogy / Degeneracy!
[figure: the springs-and-planks structure revisited; off-critical planks can settle at many equally good heights, illustrating degeneracy]

Problems with the formulation
• Size (a problem with 2,388 gates has 24,768 variables and 19,175 inequality constraints)
• Degeneracy
– many equally good solutions, since off-critical arrival times and slews can take one of several equally correct values
– active (tight) constraints that have no impact on the final solution (i.e., their Lagrange multiplier = 0)
• Redundancy
– many constraints can be removed without changing the solution

Definitions (at the solution)
• A "normal" constraint is active and has a unique, non-zero multiplier
• A degenerate constraint is active and has a zero or non-unique multiplier
• A redundant constraint can be removed without changing the solution; it has a unique, zero multiplier, but may or may not be active
• Note: all active degenerate constraints are redundant; not all degenerate constraints are redundant

Pruning: an example
[figure: a four-stage inverting path 1 → 2 → 3 → 4 with rising and falling arcs]
Before pruning:
\min z
s.t. z \ge AT_4^r;\ z \ge AT_4^f
s.t. AT_4^r \ge AT_3^f + d_{34}^r;\ AT_4^f \ge AT_3^r + d_{34}^f
s.t. AT_3^r \ge AT_2^f + d_{23}^r;\ AT_3^f \ge AT_2^r + d_{23}^f
s.t. AT_2^r \ge AT_1^f + d_{12}^r;\ AT_2^f \ge AT_1^r + d_{12}^f
After pruning:
\min z
s.t. z \ge AT_1^r + d_{12}^f + d_{23}^r + d_{34}^f
s.t. z \ge AT_1^f + d_{12}^r + d_{23}^f + d_{34}^r
• 3 timing variables instead of 9
• 2 nonlinear constraints instead of (6 nonlinear + 2 linear) constraints

Pruning: an example (continued)
[figure: the timing graph with separate rising/falling nodes 1r/1f ... 4r/4f collapses so that 1r reaches the sink through the merged arc (2f, 3r, 4f) and 1f through the merged arc (2r, 3f, 4r)]

Block-based & path-based timing
[figure: node 4 with fanins 1, 2, 3 and fanouts 5, 6]
Block-based (m + n = 5 constraints):
AT_4 \ge AT_1 + d_{14};\ AT_4 \ge AT_2 + d_{24};\ AT_4 \ge AT_3 + d_{34}
AT_5 \ge AT_4 + d_{45};\ AT_6 \ge AT_4 + d_{46}
Path-based (mn = 6 constraints, AT_4 eliminated):
AT_5 \ge AT_1 + d_{14} + d_{45};\ AT_5 \ge AT_2 + d_{24} + d_{45};\ AT_5 \ge AT_3 + d_{34} + d_{45}
AT_6 \ge AT_1 + d_{14} + d_{46};\ AT_6 \ge AT_2 + d_{24} + d_{46};\ AT_6 \ge AT_3 + d_{34} + d_{46}
• In the timing graph, if a node has n fanins and m fanouts, eliminating it creates mn constraints in place of (m + n)
• Criterion: if mn \le (m + n) + 1, prune! (the + 1 credits the arrival-time variable that elimination removes; see the sketch below)
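A minimal sketch of this pruning pass, assuming the timing graph is held as fanin/fanout adjacency sets (all names are my own): nodes whose gain (m + n) + 1 − mn meets the pass threshold are eliminated, and their incident arcs are spliced together serially. The three-pass greedy schedule described on the next slides simply repeats the pass with thresholds 2, 1, 0.

```python
def prune_pass(fanin, fanout, threshold, protected=("source", "sink")):
    """Prune every node whose gain (m+n)+1 - m*n meets the threshold.

    fanin, fanout : dict node -> set of neighbor nodes (a DAG, purely topological)
    Eliminating a node replaces its m+n incident arcs with m*n serial arcs.
    """
    for v in list(fanin):
        if v in protected:
            continue                          # never eliminate source or sink
        n, m = len(fanin[v]), len(fanout[v])
        if (m + n) + 1 - m * n < threshold:
            continue                          # keep v in this pass
        for u in fanin[v]:                    # splice arcs u -> v -> w into u -> w
            fanout[u].discard(v)
            fanout[u] |= fanout[v]
        for w in fanout[v]:
            fanin[w].discard(v)
            fanin[w] |= fanin[v]
        del fanin[v], fanout[v]

def greedy_prune(fanin, fanout):
    """Three-pass greedy pruning: highest-gain nodes (gain 2) go first."""
    for threshold in (2, 1, 0):
        prune_pass(fanin, fanout, threshold)
```

Each merged arc carries the serial composition of the underlying delay functions; as the observations below note, duplicating a d_{ij} this way costs nothing extra because its gradient is computed and stored only once.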
• Can take the slack variable into account

Pruning strategy
• During pruning, the number of fanins of any un-pruned node monotonically increases
• During pruning, the number of fanouts of any un-pruned node monotonically increases
• Hence, if a node is not pruned in the first pass, it will never be pruned, since its chances of being pruned monotonically worsen
• Therefore, a one-pass algorithm can be used for a given pruning criterion

Three-pass greedy pruning
Gain (m + n) + 1 − mn as a function of #fanins and #fanouts (X = do not prune):

#fanins \ #fanouts:   1   2   3   4
            1         2   2   2   2
            2         2   1   0   X
            3         2   0   X   X
            4         2   X   X   X

• First pass gain = 2, then gain = 1, then gain = 0
• Not optimal, but yields excellent results

Pruning observations
• If either m or n is 1, pruning is good!
• Even if a node is pruned, its rising/falling slews continue to be variables
• Pruning can be done purely topologically
• Duplicating nonlinear elements (the d_{ij}) does not increase simulation run time
• Duplicating nonlinear elements does not adversely impact the optimization
– the gradient/Hessian of each nonlinear element is computed and stored only once
• Three-pass greedy pruning provides tremendous pruning benefits efficiently

Detailed pruning example
[figure sequence: a 16-node, 26-edge timing graph (plus source and sink) is pruned step by step; (edges, nodes) shrink from (26, 16) to (20, 10), (17, 7), (16, 6), (15, 5), (14, 4), (13, 3), and finally (13, 2)]

Wrap-up on pruning
• Pruning
– reduces the number of variables (88% fewer arrival-time variables, 25% fewer variables overall)
– reduces the number of constraints (40% fewer arrival-time constraints, 19% fewer total)
– reduces degeneracy and redundancy (93% fewer arrival-time constraints, 38% fewer total)
– reduces memory and CPU time requirements
– leads to superior numerical results (more consistent/robust in the presence of noise because it is easier to identify the correct active set)
– is achieved by simple timing-graph manipulation!
– pruning of slew variables can also be considered

Optimality conditions: minimax
\min_{x,z} z
s.t. z \ge f_1(x)
s.t. z \ge f_2(x)
L = z + \lambda_1 (f_1 - z) + \lambda_2 (f_2 - z)
\partial L / \partial z = 1 - \lambda_1 - \lambda_2 = 0 \Rightarrow \lambda_1^* + \lambda_2^* = 1
• In general, \sum_i \lambda_i^* = 1 (a tiny numeric check appears below, after the slew conditions)

Arrival time optimality conditions
[figure: node 3 with fanins 1, 2 and fanouts 4, 5]
AT_3 \ge AT_1 + d_{13};\ AT_3 \ge AT_2 + d_{23};\ AT_4 \ge AT_3 + d_{34};\ AT_5 \ge AT_3 + d_{35}
L = z + \ldots + \lambda_{13}(AT_1 + d_{13} - AT_3) + \lambda_{23}(AT_2 + d_{23} - AT_3) + \lambda_{34}(AT_3 + d_{34} - AT_4) + \lambda_{35}(AT_3 + d_{35} - AT_5) + \ldots
\partial L / \partial AT_3 = -\lambda_{13} - \lambda_{23} + \lambda_{34} + \lambda_{35} = 0
or \lambda_{13}^* + \lambda_{23}^* = \lambda_{34}^* + \lambda_{35}^* (\sum upstream = \sum downstream)

Slew optimality conditions
[figure: the same node 3, now with slews s_1 ... s_5]
s_3 = s_{13}(s_1, w);\ s_3 = s_{23}(s_2, w);\ s_4 = s_{34}(s_3, w);\ s_5 = s_{35}(s_3, w)
AT_4 \ge AT_3 + d_{34}(s_3, w);\ AT_5 \ge AT_3 + d_{35}(s_3, w)
L = z + \ldots + \lambda^S_{13}(s_{13} - s_3) + \lambda^S_{23}(s_{23} - s_3) + \lambda^S_{34}(s_{34} - s_4) + \lambda^S_{35}(s_{35} - s_5) + \lambda^D_{34}(AT_3 + d_{34} - AT_4) + \lambda^D_{35}(AT_3 + d_{35} - AT_5) + \ldots
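A tiny numeric instance of the minimax multiplier condition above, reusing the toy quadratics from the earlier SLSQP sketch: at x* = 0 both constraints are tight, and stationarity of L in both x and z forces the unique multipliers to be one half each.

```python
# f1(x) = (x-1)^2 and f2(x) = (x+1)^2 cross at x* = 0, where z* = 1.
x_star = 0.0
df1, df2 = 2 * (x_star - 1), 2 * (x_star + 1)   # f1' = -2, f2' = +2
lam1, lam2 = 0.5, 0.5                           # the unique multipliers

assert abs(lam1 * df1 + lam2 * df2) < 1e-12     # dL/dx = lam1*f1' + lam2*f2' = 0
assert abs(1.0 - lam1 - lam2) < 1e-12           # dL/dz = 1 - lam1 - lam2 = 0
```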
\partial L / \partial s_3 = -\lambda^S_{13} - \lambda^S_{23} + \lambda^S_{34} \frac{\partial s_{34}}{\partial s_3} + \lambda^S_{35} \frac{\partial s_{35}}{\partial s_3} + \lambda^D_{34} \frac{\partial d_{34}}{\partial s_3} + \lambda^D_{35} \frac{\partial d_{35}}{\partial s_3} = 0
or \sum_{upstream} \lambda^{S*} = \sum_{downstream} \lambda^{S*} \frac{\partial s}{\partial s_j} + \sum_{downstream} \lambda^{D*} \frac{\partial d}{\partial s_j}

Observations
• Can start at the graph sink (z) and work backwards
• In the case of a unique critical path, the multipliers can be determined uniquely
• If all downstream multipliers at a timing point are zero, that timing point is not part of the critical path; therefore all upstream multipliers are also zero (even if the constraint is active)
• Slew relations are more complicated, but follow the same logic

Using optimality conditions
• The timer knows the criticality of each constraint
• The timer knows the sign of the multipliers
• If there is a unique critical path, the multipliers can be determined uniquely
• If there are multiple critical paths, the timer can help the optimizer by staying within the family of multipliers that satisfies the optimality conditions
• Quickly ignores off-critical constraints
• Helps the optimizer obtain the correct active set

Lagrangian Relaxation (LR)
• To solve
\min f(x)
s.t. c_1(x) \le 0 (complicating constraint)
s.t. c_2(x) \le 0
• an equivalent problem is
\min_{x,\ \lambda \ge 0} f(x) + \lambda c_1(x)
s.t. c_2(x) \le 0
• We have made the dual variable \lambda a variable of the problem

LR procedure
• Procedure to solve \min_{x,\ \lambda \ge 0} f(x) + \lambda c_1(x) s.t. c_2(x) \le 0
– pick a starting value of \lambda
– solve the "inner" problem for fixed \lambda
– now update \lambda using a formula such as
\lambda^{k+1} = \max[0,\ \lambda^k + \theta_k\, c_1(x^k)]
– intuition: if c_1(x^k) < 0, \lambda is reduced, ultimately to zero; if c_1(x^k) > 0, \lambda is increased, adding to the weight of the c_1(x) term; if c_1(x^k) = 0, \lambda does not change
– repeat the solution of the "inner" problem

Bounds in Lagrangian Relaxation
P1: \min f(x) s.t. c_1(x) \le 0 (complicating constraint), c_2(x) \le 0
P2: \min_{\lambda \ge 0} f(x) + \lambda c_1(x) s.t. c_1(x) \le 0, c_2(x) \le 0
P3: \min_{\lambda \ge 0} f(x) + \lambda c_1(x) s.t. c_2(x) \le 0
P3 \le P2 \le P1
f(x) is an upper bound and f(x) + \lambda c_1(x) a lower bound to the solution of P1.

Approaching the solution
[figure: the upper bound f(x^k) and the lower bound f(x^k) + \lambda^k c_1(x^k) close in on the solution over the iterations; the gap between them is the "duality gap"]

LR in circuit tuning: arrival times
• Recall
L = z + \ldots + \lambda_{13}(AT_1 + d_{13} - AT_3) + \lambda_{23}(AT_2 + d_{23} - AT_3) + \lambda_{34}(AT_3 + d_{34} - AT_4) + \lambda_{35}(AT_3 + d_{35} - AT_5) + \ldots
• Assume that the multipliers are in \Lambda, i.e., they belong to the family of solutions that satisfies the optimality conditions (\lambda_{13}^* + \lambda_{23}^* = \lambda_{34}^* + \lambda_{35}^*)
• Substituting, we see that AT_3 disappears!
• Similarly, all arrival times cancel, and
L = \ldots + \lambda_{13} d_{13} + \lambda_{23} d_{23} + \lambda_{34} d_{34} + \lambda_{35} d_{35} + \ldots = \ldots + \sum \lambda_{ij} d_{ij} + \ldots

LR in circuit tuning: slews
• No such luck with slews; they do not cancel even if the \lambdas are in \Lambda, so slews stay as variables of the "inner" problem
• However, many slew constraints may have zero multipliers
• Therefore, there may be benefit to "relaxing" the slew constraints
• In any case, the slew multipliers can be chosen to satisfy the slew-multiplier optimality conditions

LR in circuit tuning: procedure (a toy subgradient sketch follows below)
• Pose the problem with no arrival-time variables
• Guess initial \lambdas in \Lambda
• Solve the "inner" problem with fixed \lambdas; choose whether or not to relax the other constraints (slew, area, ratio, input loading)
• If the upper and lower bounds are sufficiently close, or there is no improvement, quit
• Update the \lambdas (subgradient method)
• Project the \lambdas to the nearest point in \Lambda
• Re-solve the "inner" problem with the new \lambdas
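The sketch below runs the LR outer loop on a one-variable toy problem of my own choosing: minimize f(x) = x² with complicating constraint c1(x) = 1 − x ≤ 0 (i.e., x ≥ 1). The inner problem min_x x² + λ(1 − x) has the closed-form solution x = λ/2, and the multiplier update is the projected subgradient formula from the procedure slide.

```python
f = lambda x: x * x
c1 = lambda x: 1.0 - x           # complicating constraint: x >= 1
inner = lambda lam: lam / 2.0    # argmin_x of x^2 + lam*(1 - x), closed form

lam = 0.0
for k in range(1, 101):
    x = inner(lam)                                    # solve the "inner" problem
    lower = f(x) + lam * c1(x)                        # lower bound on the optimum
    upper = f(x) if c1(x) <= 0 else float("inf")      # upper bound when feasible
    lam = max(0.0, lam + (1.0 / k) * c1(x))           # projected subgradient step

# After 100 outer iterations x is near 0.94 and lam near 1.9, still creeping
# toward the true solution (x*, lam*) = (1, 2) as the duality gap shrinks.
print(x, lam, upper - lower)
```

The slow creep toward (1, 2) is a direct illustration of the at-best-linear convergence criticized on the next slide.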
Comments on LR in circuit tuning
• LR is not favored among nonlinear mathematical programmers; convergence is at best linear
• But, in the case of Elmore delay models:
– there are no slew complications
– the objective function is a linear combination of Elmore delays
– the "inner" implementation can be very efficient
– a large number of "outer" iterations is not a problem
• For simulation-based, slew-aware tuning, however, LR is not recommended

What to expect from a static tuner
• 25% or more improvement at constant area on untuned circuits
• 10-15% on previously hand-tuned circuits
• Impressive area reductions on synthesized random-logic macros
• Capacity limitations (~5K gates)
• Long run times (an overnight run to tune a circuit with a few thousand gates)
• Many minimum-sized non-critical gates; constraints pushed to the edge of feasibility

Characteristics of a formal tuner
• Does the tuner use nonlinear optimization? Is it gradient-based?
• Does it come with an optimality guarantee? Is it consistent and robust?
• Does the tuner employ a sufficiently accurate static timing analysis as its basis?
• Can it accommodate additional constraints (area, slew, input loading, ratios)?
• Capacity? CPU time? Methodology fit?
• Is it flexible (hierarchical vs. flat, schematic vs. extracted, etc.)?

Methodology implications
• Consider how layout will be performed before embarking on tuning!
• If every transistor is tuned as an independent variable:
– data explosion
– who will do your physical design?
• A solution could be to tune on the basis of a parameterized cell library
• Automated layout solutions can be used for circuits containing only parameterized cells

Possible methodology
[figure: flow; synthesized macros and customs feed Tuning against a P-cell library, then Automated layout, then Extraction, then Timing]

Methodology implications
• Tuning of extracted netlists
– the correlation between diffusion capacitances and transistor widths is lost
– need more information from the extractor
– alternatively, must rely on estimated diffusions within cells and on the extractor for other parasitics
• Hierarchical vs. non-hierarchical tuning
– hierarchy preservation reduces data volume
– fewer independent variables; fewer distinct layouts
– the circuit is amenable to future editing by the designer
– but performance is left on the table!

Conclusions
• Due to many recent innovations...
– formulation/pruning
– fast simulation and incremental sensitivity computation
– advances in nonlinear optimization
– exploitation of optimality conditions
• ... it is possible to optimize fairly large circuits with reasonable consistency...
• ... making formal static optimization of transistor and wire sizes possible...
• ... and leading to very high quality results!

Key references
• TILOS: Fishburn and Dunlop, ICCAD '85
• Gradients: Brayton, Hachtel and Sangiovanni-Vincentelli, survey in Proc. IEEE, 10/81; and Pileggi, Rohrer and Visweswariah, McGraw-Hill book, '95
• Nonlinear optimization: Conn, Gould and Toint, in the Spedicato book, Kluwer, '94
• Formulation/pruning: Visweswariah and Conn, ICCAD '99
• Lagrangian Relaxation: Chen, Chu and Wong, TCAD 7/99