Optimization Challenges in Transistor Sizing Chandu Visweswariah IBM Thomas J. Watson Research Center Yorktown Heights, NY Acknowledgments The entire EinsTuner team at IBM, especially Andy Conn of the Applied Mathematics department at IBM Research Two acts focusing on optimization • Act I: what we have done with existing nonlinear optimization techniques – what is EinsTuner? – how does it work? – what is its impact? • Act II: what we would love to do and currently cannot – numerical noise – capacity – robustness, scaling, weighting, degeneracy – recursive minimax support – mixed integer continuous problems Act I: What is EinsTuner? • A static, formal, transistor sizing tool geared towards custom high-performance digital circuits – static: based on static transistor-level timing analysis, so implicitly takes into account all paths through the logic – custom: timing based on real time-domain simulation of all possible transitions through each clump of strongly-connected transistors – formal: guaranteed optimal • “local” according to engineers • “global” according to mathematicians – transistor-level, therefore inherently flat – see DAC ’99, ICCAD ’99, SIAM J. Opt 9/99 Features of EinsTuner • Extensive (sweeping and fine-grained) grouping, ratio-ing and no-touch commands • 25K transistor capacity (60K variables) • Allows delay, area, ratio, input loading, slew and transistor width constraints • Allows minimization of delay, area or linear combination thereof • Easy-to-use; full GUI support; good fit into (semi-custom and custom) methodologies • Timer benefits such as accuracy for custom circuits, pattern-matching, state analysis EinsTuner: formal static optimizer Transistor and wire sizes Nonlinear optimizer LANCELOT Static transistorlevel timer EinsTLT Function and gradient values Embedded timedomain simulator SPECS Components of EinsTuner 1 Read netlist; create 3 timing graph (EinsTLT) Formulate pruned optimization problem 2 Feed problem to nonlinear optimizer (LANCELOT) Solve optimization problem, call simulator for delays/slews and gradients thereof Fast simulation and incremental sensitivity computation (SPECS) Obtain converged solution Snap-to-grid; back-annotate; re-time Static optimization formulation AT1 AT2 d15 AT5 d 25 d 58 s.t. s.t. s.t. s.t. AT3 d 36 AT4 d 46 AT7 d 57 d 67 d 68 AT8 AT6 min [max ( AT , AT )] 7 8 AT max [ AT d , AT d ] 7 5 57 6 67 AT max [ AT d , AT d ] 5 58 8 6 68 AT max [ AT d , AT d ] 5 1 15 2 25 AT max [ AT d , AT d ] 6 3 36 4 46 Digression: minimax optimization f1 f2 z z* x0 x x min z x, z min [ max{ f1 ( x), f 2 ( x)} ] s.t. z f1 ( x) x s.t. z f 2 ( x) Remapped problem min z min[max( AT7 , AT8 )] s.t. z AT7 s.t. z AT 8 AT7 AT7 max[ AT5 d 57 , AT6 d 67 ] AT7 AT8 AT8 max[ AT5 d 58 , AT6 d 68 ] AT8 AT5 AT5 max[ AT1 d15 , AT2 d 25 ] AT5 AT1 d15 AT2 d 25 AT5 d 58 AT3 d 36 AT4 d 46 AT5 d 57 AT6 d 67 AT5 d 58 AT6 d 68 AT1 d15 AT2 d 25 AT6 AT3 d 36 AT6 max[ AT3 d 36 , AT4 d 46 ] AT6 AT4 d 46 AT7 d 57 d 67 d 68 AT6 AT8 Springs and planks analogy z AT7 d 57 d 58 d 67 AT8 d 68 AT6 AT5 d 15 d 25 AT1 d 36 AT2 d 46 AT3 AT4 Springs and planks analogy z AT7 AT7 d 67 d 58 d 68 67 d 58 d 68 d 57 d AT7 d 57 d 67 AT AT7 5 d 57 AT5 d 57 AT5 d AT5 15 d dd 15 15 15 d 67 dd 25 d 25 d AT AT AT11 25 25 AT11 AT AT AT22 AT22 dd 36 36 d d 36 36 AT8 AT8 z AT AT6 8 d 58 d 68 AT6 d 58 d 68 AT6 AT8 d 46 AT6 d 46 d 46 d AT AT AT33 AT33 z 46 AT AT AT44 AT44 z Algorithm demonstration: inv3 Algorithm demonstration: 1b ALU Constraint generation wn , w p i j cout j AT jr ATi f d ijr ( wn , w p , cout j , slewinf ) AT j f ATi r d ijf ( wn , w p , cout j , slewinr ) slew rj sijr ( wn , w p , cout j , slewinf ) slew jf sijf ( wn , w p , cout j , slewinr ) Statement of the problem min z s.t. z ATi RATi s.t. AT j ATi dij (W , S ) s.t. s j sij (W , S ) s.t. W area target s.t. pincap i (W ) target i s.t. min i (W ) max s.t. Wmin Wi Wmax internal s.t. Si Smax PO s.t. Si Smax for all POs for each timing arc for each timing arc for all PIs for all gates for all FETs for all internal nets for all POs LANCELOT • State-of-the-art general-purpose nonlinear optimizer • Trust-region method • Exploitation of group partial separability; augmented Lagrangian merit function • Handles general linear/nonlinear equalities/ inequalities and objective functions • Simple bounds handled by projections • Several choices of preconditioners, exact/inexact BQP solution, etc. LANCELOT algorithms f x1 Trust-region Simple bounds a w u v min f ( x1 , x2 ) 0 x1 a xk 0 x2 b b x2 Demonstration of degeneracy z Degeneracy! Demonstration of degeneracy z Why do we need pruning? Pruning: an example 1 2 min z s.t. z AT4r s.t. z AT4f r f r s.t. AT4 AT3 d 34 f r f s.t. AT4 AT3 d 34 r s.t. AT3r AT2f d 23 s.t. AT3 f AT2r d 23f s.t. AT2r AT1 f d12r f r f s.t. AT2 AT1 d12 3 4 min z r f r f s.t. z AT1 d12 d 23 d 34 f r f r s.t. z AT1 d12 d 23 d 34 Block-based vs. path-based timing 5 4 AT5 AT4 d45 AT6 AT4 d46 AT4 AT1 d14 AT4 AT2 d24 AT4 AT3 d34 6 AT5 AT1 d14 d 45 AT5 AT2 d 24 d 45 AT5 AT3 d 34 d 45 AT6 AT1 d14 d 46 AT6 AT2 d 24 d 46 AT6 AT3 d 34 d 46 Path-based Block-based 1 2 3 Detailed pruning example 1 2 3 7 9 11 12 4 5 6 8 10 13 14 15 16 Detailed pruning example Edges = 26 Nodes = 16 (+2) 1 2 7 9 3 Source 11 14 12 15 13 16 4 5 6 8 10 Sink Detailed pruning example Edges = 26 20 Nodes = 16 10 7 1 9 11 14 12 15 13 16 2 3 Source 5 6 4 8 10 Sink Detailed pruning example Edges = 20 17 Nodes = 10 7 7 1 9 11 14 12 14 2 3 Source 5 6 15 4 8 10 13 16 Sink Detailed pruning example Edges = 17 16 Nodes = 7 6 9 1,7 11 14 12 14 2,7 Source 5 6 3,7 4 15 8 10 13 16 Sink Detailed pruning example Edges = 16 15 Nodes = 6 5 9 1,7 11,14 2,7 12 Source 5 6 3,7 4 14 15 8 10 13 16 Sink Detailed pruning example Edges = 15 14 Nodes = 5 4 9 1,7 11,14 2,7 12 Source 5 6 3,7 4 14 15 8 10 13,16 Sink Detailed pruning example Edges = 14 13 Nodes = 4 3 9 1,7 11,14 2,7 12 Source 5 6 3,7 4 14 15 10 8 10,13,16 Sink Detailed pruning example Edges = 13 13 Nodes = 3 2 9 1,7 12,14 2,7 Source 5 6 11,14 12,15 3,7 4 10,12,14 Sink 10,12,15 8 10,13,16 Pruning vs. no pruning No time to mention... • Simultaneous switching • Tuning of transmission-gate MUXes • Adjoint sensitivity computation • The role of pattern matching • Strength calculations/symmetrized stacks • State analysis • Detection of impossible problems • Latches • Uncertainty-aware tuning EinsTuner impact • Better circuits – 15%+ on previously hand-tuned circuits! • Improved productivity in conjunction with semi-custom design flow (see Northrop et al, DAC ’01) • Better understanding of tradeoffs – lifts designers’ thinking to a higher level • Currently being applied to third generation of z-series (S/390) and PowerPC microprocessors to benefit from EinsTuner Act II: If only... Rich source of research problems What we are able to do What we could do with a robust, fast, high-capacity, noise-immune, degeneracy-immune, easy-to-use, non-finicky nonlinear optimizer Open research problems • Immunity to numerical noise Voltage Time Numerical noise • Need a formal basis for optimization in the presence of inevitable numerical noise • Formal methods for measuring noise level • Tie all tolerances to a multiple of noise level – tolerances for bounds – tolerances for updating multipliers, etc. – stopping criteria • Currently our optimization is a race against noise and the conclusion is not pleasing! • All optimization decisions noise-based Numerical noise • Inject various types of noise (correlated and uncorrelated) with different amplitude into function and gradient evaluation of analytic optimization problems • Easy experimentation thus possible to test theories of noiseimmune optimization Capacity • Today, our biggest problem has about 30,000 transistors, 70,000 variables and 70,000 constraints • Runs for as long as a week! • We go to great lengths to reduce problem size – in the tuner code – in the pre-processor – in our methodology What higher capacity would permit • Tuning of larger macros • Early mode considerations • Manufacturability – replicate constraints and variables at each process corner – rich possibilities in choice of objective function • Explicit (circuit) noise constraints • Benefits of explicit grouping constraints in post-optimality analysis Digression: noise constraints area = 0 v NML t1 ts te t2 v( x, t ) NM L for all t in [t1 , t2 ] te {v( x, t ) NM L } dt 0 t ts • Combine Harmony and EinsTuner – essential for dynamic circuits (see TCAD 6/00) – replace rules-of-thumb by actual noise constraints for static circuits Robustness • Scaling – requires unit change in all variables to have roughly same order of magnitude of impact – need non-uniform trust-region; would subsume 2-step updating [Conn et al, SIAM J. Opt. 9/99] – very difficult in general • Weighting – essential in any large-scale optimizer – need problem-size-dependent weights • Sensitivity to choices should be minimized – finicky behavior not helpful in engineering application Ease of use • Nonlinear optimizers should stop assuming they are the “master” program • There is no standard API for nonlinear optimization packages • Aspects such as error/message handling and memory management are important; FORTRAN does not help • Not easy to help the optimizer with domainspecific knowledge – e.g., availability of information about Lagrange multipliers in recursive minimax problems Mixed integer/continuous problems • Assignment of low threshold voltage devices • Swapping pin order in multi-input gates (combinatorial problem) • Working from a discrete gate library • Discrete choice of taper ratios • In general, the type of re-structuring that a logic synthesis program may consider Recursive minimax support min z s.t. z AT d ( x), i 1,2,, I i iz s.t. AT AT d ( x), j 1,2,, J i j ji s.t. AT AT d ( x), k 1,2,, K j k kj . . . • Need a way to force “timing correctness” – arrival times and slews at lowest feasible value – one of them tight at every stage – product of slacks = 0? Degeneracy and redundancy • Nonlinear optimizers easy to mislead • Most engineering problems have plenty of inherent degeneracy and redundancy • Can optimizers be made less sensitive to degeneracy? • Example: min f ( x) s.t. c(x) 0 s.t. c(x) 0 is degenerate because x f ( x) x c( x)[1 2 ] Practical considerations • Detection of impossible problems – bound problems – obviously impossible constraints – not-so-obviously impossible constraints that need a little function evaluation to detect • Post-optimality analysis – final values of Lagrange multipliers can be used to give the designer guidance on obtaining further improvement – may need to re-solve for multipliers at end – information needs to be collected and presented to designer in nice form – formulation of problem may change to take advantage of this analysis Conclusion (Act I) • Can robustly carry out static circuit tuning • Consistent and high-quality results • Improvement in designers’ productivity • Moves designers’ thinking to a higher level • Presently impacting third generation of z-series (S/390) and PowerPC microprocessors Conclusion (Act II) • Open problems in nonlinear optimization whose solution will have a real impact – numerical noise – capacity – robustness, scaling, weighting – ease of use, common API – discrete variables – support for recursive minimax problems – degeneracy – post-optimality analysis – detection of impossible problems