Chandu Visweswariah IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 [email protected] Dynamic Tuning for High–Performance Custom Digital Design Of 77:1 Dynamic Tuning for High–Performance Custom Digital Design Chandu Visweswariah Acknowledgements Dynamic Tuning for High–Performance Custom Digital Design Andrew R. Conn, Applied Mathematics, IBM Watson Ruud A. Haring, Integrated System Design, IBM Watson Chandu Visweswariah Of 77:2 1. Outline Introduction and motivation 2. Circuit simulation 3. Time–domain gradient computation 4. Nonlinear optimization 5. Case study – JiffyTune: algorithms and implementation – human interface – application to PowerPC microprocessor design – methodology impact and implications – historical perspective Dynamic Tuning for High–Performance Custom Digital Design 6. Conclusions, future work and pointers to the reading list Chandu Visweswariah Of 77:3 Introduction and Motivation Update design custom design poses challenges in performance (delay and slew), area, power, noise, etc.,. competitive environment drives aggressive circuit tuning formerly a manual, tedious, error–prone procedure Simulate Good enough? Dynamic Tuning for High–Performance Custom Digital Design Design re–use not easy! Chandu Visweswariah Of 77:4 Dynamic Tuning dynamic tuning implies automatic tuning based on dynamic or ‘‘pattern–driven” simulation hence dynamic tuning requires ‘‘input patterns” or ‘‘vectors” Static Tuning 25% Dynamic Tuning Piecewise approximate 5% Dynamic Tuning for High–Performance Custom Digital Design SPICE– based 1% Accuracy Sensitivity computation tuning only as good as patterns capture switching situations Chandu Visweswariah 10 30 1K 10K Size Of 77:5 Dynamic vs. Static Tuning Advantages of static tuning can handle larger problems (or problems of the same size faster) does not require sensitization patterns tunes only the ‘‘worst–case” or ‘‘most pessimistic” condition more applicable to control–flow than data–flow circuits requires either pre–characterized cells or simulation on the fly Advantages of dynamic tuning more accurate than static tuning does not suffer from the false–path problem more applicable to data–flow than control–flow circuits has no problem with custom circuits Dynamic Tuning for High–Performance Custom Digital Design The two types of tuning complement each other at different stages of the design process and on different kinds of circuits. Chandu Visweswariah Of 77:6 Statement of the Dynamic Tuning Problem Given a working circuit schematic with initial transistor sizes input signals or patterns or stimuli a precise statement of circuit performance requirements Determine optimal transistor size assignments While supporting... flexible objective functions and constraints simple bounds ratio–ing of transistors and grouping of similar structures i delay, rise/fall time, area and power measurements minimax optimization: min [ max { f i(x) } ] x easy–to–use graphical user interface in schematic environment Dynamic Tuning for High–Performance Custom Digital Design Must be able to tune a circuit with a few 100s of tunable transistors in a few minutes of CPU time. Chandu Visweswariah Of 77:7 Sample Dynamic Tuning Problem Tri–state driver D_in En D_in fast path D_in → D_out when enabled (minimize delay) Dynamic Tuning for High–Performance Custom Digital Design – small margin if D_in switches, larger margin if En switches break–before–make on P_drv and N_drv (constraints) Chandu Visweswariah D_out Of 77:8 Patterns Device models Structure of a Dynamic Tuner Circuit schematic Netlist Optimization problem description Simulator and gradient calculator Function and gradient values Dynamic Tuning for High–Performance Custom Digital Design Y Done? N Nonlinear optimizer Back–annotate Chandu Visweswariah Clobber transistor widths with new widths from non– linear optimizer Of 77:9 Circuit simulation Mini–outline 1. Introduction and motivation 2. – SPICE–like simulation – piecewise–approximate simulation 3. Time–domain gradient computation 4. Nonlinear optimization 5. Case study – JiffyTune: algorithms and implementation – human interface – application to PowerPC microprocessor design – methodology impact and implications – historical perspective Dynamic Tuning for High–Performance Custom Digital Design 6. Conclusions, future work and pointers to the reading list Chandu Visweswariah Of 77:10 Simulation incremental in time analytic device models (gradients and continuity required) 1% numerical accuracy limited to about 10K transistors (n1.5) in size complexity, (n) in simulation interval SPICE–like simulators Piecewise approximate simulators Of 77:11 examples: SPECS, ACES, PowerMill modeling approximations, typically piecewise approximate i–v characteristics, constant (grounded) capacitances event–driven special step a priori to create device models from SPICE models 5% accuracy, 2 orders of magnitude speedup over SPICE 100K transistors easily handled (n) in size complexity, (n) in simulation interval Dynamic Tuning for High–Performance Custom Digital Design Tuning accuracy is only as good as underlying simulator and device models! Chandu Visweswariah SPICE–like Simulation Equation formulation MNA, STA, TLA Integration for all time TR, BE, BD2, Gear Linearization until convergence Newton iterations Device model evaluation and solution Sparse LU factorization device model evaluation time is significant Dynamic Tuning for High–Performance Custom Digital Design Algebraic, non– linear ODEs Algebraic, non– linear equations Of 77:12 Algebraic, linear equations Both device modeling and analysis must be attacked to get reasonable speedups! Chandu Visweswariah v PIECEWISE APPROXIMATE SPECS, ACES, PowerMill i Run time CIRCUIT SPICE i = f(v) Piecewise Approximate Circuit Simulation SWITCH COSMOS, PARCH Dynamic Tuning for High–Performance Custom Digital Design Of 77:13 discretization in i–v characteristics allows event–driven simulation and variable accuracy event–driven simulation leads to latency/multi–rate exploitation creative integration methods employed to avoid matrix equations Chandu Visweswariah Accuracy Mini–outline 1. Introduction and motivation Time–domain gradient computation 2. Circuit simulation 3. – introduction to gradient computation – direct and adjoint methods – extension to nonlinear circuits and time–domain 4. Nonlinear optimization 5. Case study – JiffyTune: algorithms and implementation – human interface – application to PowerPC microprocessor design – methodology impact and implications – historical perspective Dynamic Tuning for High–Performance Custom Digital Design 6. Conclusions, future work and pointers to the reading list Chandu Visweswariah Of 77:14 signal Time–domain Gradient Computation delay width sensitivity lim width0 Dynamic Tuning for High–Performance Custom Digital Design delay width Of 77:15 time delay we would like to determine sensitivities not by finite differences or by perturbation, but by incremental methods; perturbation methods are inefficient and numerically ill–behaved performance measures like delay, slew, power, noise, area are sensitivity functions tunable parameters like transistor widths, interconnect sizes are sensitivity parameters in the direct method, the sensitivity of any number of functions with respect to one parameter can be computed efficiently in the adjoint method, the sensitivity of one function with respect to any number of parameters can be computed efficiently both methods are typically implemented as part of circuit simulator Chandu Visweswariah t 1 C i W eff t I ds W eff 2 C i W eff PW I ds W eff PW i t 2 C i W eff C i W eff PW Incremental Sensitivity and Chain Ruling i t 1 t 1 C i area W eff C i perimeter W eff C i area W eff PW C i perimeter PW W eff delay t 1 t 2 t 1 t delay 2 PW PW PW t I ds W eff 1 I ds W eff PW Of 77:16 number of ‘‘electrical” gradient computations can be very large example: problem with 10 delays, 10 slews and 1 power function, with 100 tunable transistors number of gradient evaluations (conservatively) = 40 iterations x 41 functions x 100 tunable transistors x 6 electrical parameters per transistor 1 Million! dependent (ratio–ed) transistor widths can be accommodated similarly by chain ruling Dynamic Tuning for High–Performance Custom Digital Design Incremental sensitivity computation has to be extremely efficient! Chandu Visweswariah Chandu Visweswariah v – . C ^ i I p v^ V p +– v^ + +– ^ R i i R p ^ v^ i R Ri p + ^ i C v ^ ^ p . i C dv C v p dt – – Sensitivity Analysis by the Direct Method + i R v iR v R i i R p p p vV +– V v p p iI i I p p + i v – C i C dv dt dv dv C i C ( ) p dt p dt p Key: directly differentiate BCRs Dynamic Tuning for High–Performance Custom Digital Design Of 77:17 V5 + – Direct Method: Procedure R1 +– i R1 R2 replace elements by directly differentiating BCRs to get sensitivity circuit reuse LU factors of original solution R3 R4 v R 4 , so p R 1 any number of functions, only one parameter Example: R1 R2 problem is to find R 1 Dynamic Tuning for High–Performance Custom Digital Design 4 solve for v^ R ; that is the required answer! R3 replace V 5 by a short; add a voltage source in series with R 1; Chandu Visweswariah + v^ R – 4 R4 Of 77:18 ^ ^ i i (v, p) di i dv i v dp p dp ^ or i G v^ I x G eq Linear model : i G eq v I eq v Direct Method: Extension to Nonlinear Circuits i i solution I eq v solution in the sensitivity circuit, ^ ^ i G solution v^ I x Dynamic Tuning for High–Performance Custom Digital Design extra source LU factors can be re–used think of it as the sensitivity of the linearized equivalent circuit at the DC solution Chandu Visweswariah Of 77:19 sensitivity circuit time nominal circuit Direct Method: Extension to Time–domain series of linear(ized) equivalents Problems LU factors can be re–used only if the Jacobians are identical Jacobians are identical provided time points are identical time points selected, however, only with consideration to solution of nominal circuit, hence may not be suitable to the sensitivity circuit must either use conservative time–steps or interpolate Jacobians event–driven simulators may have an advantage if event times of original and sensitivity circuit are inherently identical ^ Dynamic Tuning for High–Performance Custom Digital Design can find ‘‘full” transient sensitivity, like v^(t) or i(t) Chandu Visweswariah Of 77:20 Sensitivity Analysis by the Adjoint Method Matrix formulation of adjoint method Ax b A x A x b p p p f f Given a scalar function f (x), [ ] T[x] p x p f Postulate A T [ ] x f TA[x] T[b – A x] p p p p Then re–use LU factors of original solution; LU factors of A T are U T and L T any number of parameters, only one function . ^. Of 77:21 the adjoint of a dynamical system x Ax u is x A Tx^ u^ Dynamic Tuning for High–Performance Custom Digital Design electrical formulation based on Tellegen’s theorem Chandu Visweswariah Tellegen’s Theorem branches v bi v n ifrom– v nito i current of branches incident at each node 0 1, 0 Ai b 0 A topological incidence matrix v b A Tv n i ^ ^ i i ^ ^ same topology ^ ^ ^ ^ ^ Dynamic Tuning for High–Performance Custom Digital Design i vb ib vbT ib [ATvn]T ib vnT A ib 0 i ib vb ibT vb ibT AT vn [A ib]T vn 0 i Tellegen’s theorem Chandu Visweswariah KCL KVL Of 77:22 instantaneous power of branch 0 Conservation of energy v b Ti b 0 nodes Adjoint Sensitivity via Tellegen’s Theorem i ^ ^ bi i i i ^ i ^ v^b,i b i ^ Key relation! i ib vb 0 i (ib ib ) vb 0 i i b i b v^ b ) 0 i i Application of Tellegen’s theorem to a perturbed circuit v b, i b ^ ib i v ) i bi v b v b, i b i b i bi vb i (v i (vb i Motivation ^ Vs ^ others Dynamic Tuning for High–Performance Custom Digital Design Of 77:23 and then into f k p k then we can pick up (in the limit) all the required sensitivities fp k I vI iI iV vV typical term if we can manipulate the above equation into Chandu Visweswariah Adjoint Sensitivity Computation i I I ^ ^ v R i R R R i R R i R Typical term (v I i I I v^ I) i I I; Consider a circuit with only resistors and current sources: for any current source: ^ for any resistor: v R i R R; Is ^ Is ^ Rs ^ . f v I. v I vI iI I vI iR iR R By choosing v^ R R i R, we get typical term i R i R R. total circuit: ^ f v I and ^ f iR iR R given a scalar performance function f (v I), f choose iI f v^ I I pick off the required sensitivities: Thus Dynamic Tuning for High–Performance Custom Digital Design any number of parameters can be accommodated at once! Chandu Visweswariah Of 77:24 I1 R1 R3 R4 ^ + v^ I1 – ^ R1 R2 Adjoint Method: An Example R2 ^ ^ choose I 2 1 ^ v R4 0, R 2 v ^ R4 i R3i R3, R 3 R3 R4 v ^ R4 i R4i R4 R 4 v v R4 R4 0 and 0 R 3 R 4 v R4 ^ v I1 I 1 v ^ R4 i R2i R2, R 2 i R1i R1R 1 i R2i R2R 2 i R3i R3R 3 i R4i R4R 4 I 1v^ I1 v I2 v R4 v ^ R4 i R1i R1, R 1 v note that R4 0, R 1 can re–use LU factors of nominal solution Dynamic Tuning for High–Performance Custom Digital Design All parameters handled at once! Chandu Visweswariah ^ I2 Of 77:25 + v2 v^ 2 i2 i1 Adjoint Sensitivity: Controlled Sources i1 – ^ ^ v 1 v 2 i 2 i 2 ^ Current–controlled current source + v1 0 – ^ ^ v 1 i 1 i 1 v 1 i 1 i 1 v^ 1 v 2 i 2 v^ 2 ( i 1 i 1) Typical term ^ ^ i1 + v^ 2 – i2 0 ^ Choose i 2 0 (current source) KEY TERM! v^ 1 – v^ 2 (voltage–controlled voltage source) v^ 1 v^ 2 + – final typical term is v^ 2 i 1 Dynamic Tuning for High–Performance Custom Digital Design control is reversed! Chandu Visweswariah Of 77:26 ^ Vs ^ Rs ^ Cs + dv dv ^ C C v C v^ dt C dt C KEY TERM! iC vC vI iI iV vV iR iR R ?? Adjoint Sensitivity: Extension to Time–domain Recall I ^ Typical term for Cs v C i C C – Dynamic Tuning for High–Performance Custom Digital Design Of 77:27 these have go although a v C(t 0) term would be handy tf tf tf tf ^ ^. . t C Typical term dt v i dt – C v^ v f C v v dt– v^ v dt C C C C C C C C t0 t0 t0 t0 t0 tf tf ^ ^. . C (i C v ) dt v C v^ (t ) v (t ) C v^ (t ) v (t ) v^ v dt C C C C C C 0 C 0 C C f f t0 t0 Chandu Visweswariah ^. Of 77:28 Adjoint Sensitivity: Extension to Time–domain, Cont’d. ^ to get rid of the unwanted terms, i C Cv C and v^ C(t f) 0 Solution: run time backwards!! k t, usually t 0 t f t make use of the fact that Tellegen’s theorem holds for any choice of time point in the two circuits dv^ dv^ (t) ^ ^. then i C() C C C C Cv C(t) and choose dt d v^ C(t f) v^ C( t 0) as an initial condition with value 0 Dynamic Tuning for High–Performance Custom Digital Design tf . C so finally, typical term C v^ (t ) v (t ) v^ v dt C 0 C 0 C C t0 required sensitivity becomes a convolution integral; in fact, all sensitivities become convolutions Chandu Visweswariah tf t0 (i V v^ V) dt ^ k p k Adjoint Sensitivity: Function Tricks ^ (–v I i I) dt Vs the left hand side of the general relation tf Is t0 ^ must be chosen to be f by good choices of i I and v^ V V k pk t start 1 dt t end sampling of v^ I (t t 1): choose i I (t t 1) so that t end i v I| tt1 k p k power: choose v^ V 0 so that t start Dynamic Tuning for High–Performance Custom Digital Design remember to map ts to s Elmore delay is easily expressed as an integral, too Chandu Visweswariah Of 77:29 cross Adjoint Sensitivity: Crossing Delays Crossing delay sensitivity v cross v N tt v^ N| ttcross dt cross dp . v N| tt cross v^ N dt cross . – vN dp tt cross t cross Dynamic Tuning for High–Performance Custom Digital Design t v v dv dt cross cross 0 d v (t, p) tt N N cross t p dp N dp dp tt cross tt cross vN v cross Chandu Visweswariah Of 77:30 Sensitivity Computation: Wrap–up Direct method directly differentiate BCRs any number of functions, one parameter can re–use LU factors (modulo time–point mismatch) original sources removed; a new source is introduced in conjunction with the single sensitivity parameter ‘‘full” transient sensitivity, e.g., v N(t)W eff can be computed Adjoint method Dynamic Tuning for High–Performance Custom Digital Design based on the ‘‘adjoint” or transposed set of circuit equations; most intuitively approached via Tellegen’s theorem one function, any number of parameters control reversed; time run backwards can re–use LU factors (modulo time–point mismatch) sources removed; a new source is introduced and manipulated to express the single sensitivity function as a convolution sensitivity results are convolutions, hence an overhead ‘‘full” transient sensitivity harder to compute Chandu Visweswariah Of 77:31 Mini–outline 1. Introduction and motivation 2. Circuit simulation 3. Time–domain gradient computation Dynamic Tuning for High–Performance Custom Digital Design 4. Nonlinear optimization – terminology – basics of linear programming – basics of nonlinear programming – LANCELOT and MINOS – minimax optimization – key messages 5. Case study – JiffyTune: algorithms and implementation – human interface – application to PowerPC microprocessor design – methodology impact and implications – historical perspective 6. Conclusions, future work and pointers to the reading list Chandu Visweswariah Of 77:32 Terminology i 1, 2, , n i l 1, l 2, , l m i 1, 2,..., l min f (x) s.t. c i(x) 0 c i(x) 0 li xi u i objective function (maximize/minimize) equality constraints inequality constraints ( and ) c i(x) 0 c i(x) s i 0 s.t. s i 0 c i(x) s i 0 s.t. s i 0 – often handled by introducing a slack variable s i c i(x) 0 – note that s i is a new variable of the problem simple bounds (either l i or u i can be unbounded) Dynamic Tuning for High–Performance Custom Digital Design constrained vs. unconstrained optimization Chandu Visweswariah Of 77:33 Terminology, Cont’d. i 1, 2, , n i l 1, l 2,..., l m min f (x) s.t. c i(x) 0 i 1, 2,..., l c i(x) 0 li xi u i solution vectors x 1, x 2, , x k, , x * feasibility (feasible points, feasible regions) activities are constraints that are ‘‘tight” if f (x) or any of c i(x) is a nonlinear function of x, we have a nonlinear problem linear programming: 1M+ variables is large but tractable nonlinear programming: 1K+ variables is large but tractable Key message: use nonlinear programming to solve nonlinear problems (all circuit tuning problems are nonlinear)! Dynamic Tuning for High–Performance Custom Digital Design Consider good solutions to bad models vs. less good solutions to better (possibly nonlinear) models! Chandu Visweswariah Of 77:34 Classical Theory Unconstrained optima to minimize f (x), f 2 f – Jacobian = x f 0 is a necessary condition x i* 2f 2 f – Hessian = xx 0 is a sufficient condition x i* x j* all eigenvalues of the Hessian must be non–negative think of the second condition as ensuring ‘‘positive curvature” the second condition avoids saddle points and maxima Constrained optima: Lagrangian formulation i1 ici(x) l the minimizer of f (x) subject to equality constrains 0, i 1, 2,..., l is a stationary point of the Lagrangian c i(x) (x, ) f (x) c i(x l ixci(x*) 0 i1 Dynamic Tuning for High–Performance Custom Digital Design – dual feasibility: x f (x *) applying the above criteria, we get *) 0 i – primal feasibility: Chandu Visweswariah Of 77:35 Kuhn–Tucker Conditions for Equality Constraints min f (x) s.t. c(x) 0, f, c 2 consider the set of all feasible directions, i.e., d such that c(x d) – c(x) d Tc 0 optimality implies that there does not exist even a single feasible descent direction, i.e., d Tf 0 we know that if d Tc 0 then d is in the null space of c d Z t for any t is therefore the set of all feasible directions, where Z is a basis for the null space of c (or the columns of Z are a basis for that null space or the columns of Z span the null space) hence t TZ Tf 0 t or Z Tf 0 is the first–order condition Dynamic Tuning for High–Performance Custom Digital Design Of 77:36 just as f (x d) f (x) d f(x) 12 d 2 f(x) is a scalar quadratic model, f (x d) f (x) d Tf 12 d T 2f d is a quadratic model for f (x) using a quadratic model for f (x) and c(x) and restricting ourselves to feasible d, we derive the second–order condition Z T 2f Z 0 Chandu Visweswariah Of 77:37 Kuhn–Tucker Conditions for Inequality Constraints min f (x) s.t. c(x) 0 optimality now implies that there is no further descent direction d for which c(x d) – c(x) d T c 0 d Tf 0 f is a (negative) linear combination of the gradient of the (active) constraints the Lagrange multipliers corresponding to inactive constraints are zero first–order condition: Z Tf 0 where Z is a basis for the null space of the Jacobian of the active constraints (which have positive Lagrange multipliers) Dynamic Tuning for High–Performance Custom Digital Design Z T 2f Z 0 second order condition: Chandu Visweswariah 2 3 Whirlwind Tour of Linear Programming Active set methods (simplex) 4 min f (x) s.t. c i(x) 0 descent direction 1 Dynamic Tuning for High–Performance Custom Digital Design there is always a solution at a point where at least n constraints intersect in n–space (2 in the figure) strategy: move from one vertex to an adjacent vertex while seeking a decrease in the objective function theoretical worst–case complexity is exponential, but works well in practice Chandu Visweswariah Of 77:38 Whirlwind Tour of Linear Programming, Cont’d. Penalty function methods: interior point methods log(xi) for the simple bounds start feasible always stay feasible use ‘‘barrier functions” in merit function to stay within simple bounds 1 example: (x) f (x) i x i 0 ( is a penalty parameter) polynomial complexity less likely to suffer from degeneracy problems Penalty function methods: exterior point methods feasible is guaranteed only at the solution Aside for nonlinear programming log ci(x) for ci(x) 0 Of 77:39 can use ‘‘barrier” function for inequalities in interior method merit 1 functions such as (x) f (x) i Dynamic Tuning for High–Performance Custom Digital Design i can use penalty term for equalities in exterior method merit 1 function such as (x) f (x) c i2(x) Chandu Visweswariah Nonlinear Programming Derivative (D) vs. Derivative Free (DF) Methods never use DF if D is possible – DF popular since easy to understand – DF popular since easy to implement most implementations of DF: – useful only if # variables < 20 – not robust – not efficient (especially for circuit applications) – no convergence theory – however, sometimes the only usable method recent developments – DF algorithms with convergence theory and better robustness have been developed... – ... but they are applicable only to small problems – ... and are still inefficient (but parallelizable) Dynamic Tuning for High–Performance Custom Digital Design Key message: never use DF unless you have no choice! Chandu Visweswariah Of 77:40 Nonlinear Programming Some general comments Key message: tremendous progress in nonlinear optimization in the last 15 years! global optimality is almost impossible to achieve or detect on large–scale problems the best we can hope for is – problem is convex (in which case we can achieve and detect a global optimum) – we converge to a local minimum – improvement is obtained! exploit structure where possible – example: sparsity or group partial separability line–search vs. trust–region approaches Dynamic Tuning for High–Performance Custom Digital Design Of 77:41 Key message: requires sophisticated user or optimization specialist to best exploit nonlinear optimization... using the software as a black box is not ideal! Chandu Visweswariah Unconstrained Nonlinear Optimization Steepest descent (SD) ‘‘low storage” solution, simple and efficient per iteration unfortunately, slow linear– or even non–convergence used early in the optimization to capture global behavior Safeguarded Newton’s method and variations (NM) (safeguarded) step in direction that minimizes quadratic model second–order convergence more expensive used late in the optimization for fast asymptotic convergence Alternatives modified Newton’s method to guarantee positive definiteness of Hessian of quadratic model and hence descent Of 77:42 quasi–Newton (low rank updates to the Hessian) (converts SD to NM) coordinate descent (BAD) and its generalization, conjugate directions (GOOD) Dynamic Tuning for High–Performance Custom Digital Design Levenberg–Marquardt Chandu Visweswariah Unconstrained Nonlinear Optimization Conjugate gradients Steepest descent Newton’s method Dynamic Tuning for High–Performance Custom Digital Design Of 77:43 Method of choice for large problems: conjugate gradient (CG) iterations with appropriate choice of pre–conditioner Chandu Visweswariah MINOS Stanford, ca. 1980 min F(x) c Tx d Ty Dynamic Tuning for High–Performance Custom Digital Design augmentation objective function Lagrangian of nonlinear part of constraints ~ s.t. f(x) A 1y b 1 linearized constraints A 2x A 3y b 2 linear constraints lx x ux simple bounds ly y uy where ~ where f(x) f (x) J k(x–x k) (linearization) nonlinear linear s.t. f (x) A 1y b 1 nonlinear constraints A 2x A 3y b 2 linear constraints lx x ux simple bounds ly y uy is remapped to ~ ~ ~ min F(x) c Tx d Ty kT[f (x) – f(x)] 1 [f (x) – f(x)] T[f (x) – f(x)] 2 Chandu Visweswariah Of 77:44 MINOS, Cont’d. Dynamic Tuning for High–Performance Custom Digital Design large–scale, general–purpose, nonlinear optimization package intuition: – separate linear and nonlinear parts – use active set methods for the linear part (LP–like) – use augmented Lagrangian for the nonlinear part – minimize quadratic model of the merit function subject to linear model of constraints (QP) – line–search methods for minimizing the quadratic model in practice, very effective, especially if – most constraints are linear or near linear – real number of degrees of freedom at the solution is not large Chandu Visweswariah Of 77:45 LANCELOT large–scale, general–purpose, nonlinear optimization package efficient, especially for nonlinear constraints exploits structure via group partial separability Major Iteration update l or m value(s) formulate augmented Lagrangian and iteratively: Minor Iteration trust–region approach simple bounds handled explicitly and well minimize merit function Dynamic Tuning for High–Performance Custom Digital Design Of 77:46 test for convergence many options: – l l 2 trust regions (box and sphere) – various preconditioners and initializations available – approximate/accurate bounded quadratic problem (BQP) solver has successfully handled 20,000 variable problems with 20,000 nonlinear constraints; has been tested on 15,000 test cases representing 1,000 unique problems Chandu Visweswariah i Lagrangian i min f (x 1, x 2) 0 x1 a 0 x2 b x2 Of 77:47 Quadratic model penalty function to weight feasibility i ci(x) 21 ci(x)2 How LANCELOT Works (x, , ) f (x) composite function objective for unconstrained function –f minimization x1 v w b Dynamic Tuning for High–Performance Custom Digital Design xk u Trust region Simple bounds a Chandu Visweswariah Minimax Optimization x i x* x z the original problem is min [ max { f i(x) } ], i 1, 2, , m x, z f 2(x) xk not directly differentiable remap to min (z) s.t. z f i(x) i f 1(x) Special considerations i* 1, so initialize i 1m i initialize z to max f i(x) directly after the first function evaluation optimality condition: i Dynamic Tuning for High–Performance Custom Digital Design can handle minimax problems in other ways Chandu Visweswariah Of 77:48 Andy and Chandu’s Top–10 Key Messages! Time to wake up! Dynamic Tuning for High–Performance Custom Digital Design Of 77:49 1. Replace good solutions to bad models by potentially less good solutions to better models! 2. Use nonlinear optimization on nonlinear problems! 3. Swapping objective function and constraints does not a new ICCAD paper make! 4. Tremendous progress has been made in nonlinear optimization in the last 15 years; do not use outdated methods! 5. Never use derivative–free methods unless you have no choice! 6. In circuit tuning, CPU time is dominated by simulation and gradient computation; must leverage optimizer to reduce number of iterations! 7. Efficient sensitivity computation is essential to any high–performance circuit tuner! 8. Carefully specify all aspects of the problem you really want to solve (forces careful understanding of the circuit and problem being posed)! 9. ‘‘Black box” optimization under–utilizes capabilities of a package (and often leads to greatly sub–optimal solutions)! 10.Any simulation–based optimization is inherently noisy! Chandu Visweswariah Mini–outline 1. Introduction and motivation 2. Circuit simulation 3. Time–domain gradient computation Case study 4. Nonlinear optimization 5. – JiffyTune: algorithms and implementation simulation gradient computation nonlinear optimization – human interface – application to PowerPC microprocessor design – methodology impact and implications – historical perspective Dynamic Tuning for High–Performance Custom Digital Design 6. Conclusions, future work and pointers to the reading list Chandu Visweswariah Of 77:50 Case Study: JiffyTune dynamic circuit optimization tool called JiffyTune has been developed at IBM over the last 16 months – circuit simulated by SPECS (piecewise approximate, event–driven, fast circuit simulation program) – takes advantage of both direct and adjoint transient sensitivity computation – uses LANCELOT (augmented Lagrangian, trust–region–based, large–scale, general–purpose nonlinear optimization package) has successfully tuned circuits with up to 500 tunable transistors has been used by about 50 designers in about 2,700 tuning runs in many different technologies and sites of IBM human interface and methodology issues carefully taken into account during development Dynamic Tuning for High–Performance Custom Digital Design was extensively used on domino circuits of a custom, high–performance, PowerPC microprocessor Chandu Visweswariah Of 77:51 JiffyTune: Features at a Glance Dynamic Tuning for High–Performance Custom Digital Design accommodates delay, slew, area and power functions supports minimax optimization any combination of weighted constraints and objective function(s) permitted all three types of constraints (equality, less–than and greater–than) can be specified ratio–ing of transistors allowed similar instances can be grouped for regular layout automatic failure–recovery allows optimizer to take aggressive steps two manufacturability modes implemented – post–tuning information mode at various process corners – simultaneous tuning at multiple process corners easy–to–use human interface for problem specification as well as back–annotation of results specialized stopping criteria and considerations for noise Chandu Visweswariah Of 77:52 Circuit, models, optimization criteria Chandu Visweswariah User interface Problem specification LANCELOT JiffyTune administration chain ruling ad nauseam X new X old Of 77:53 User interface Back annotation JiffyTune: Software Overview g, J f, f SPECS JiffyTune ‘‘Engine” Dynamic Tuning for High–Performance Custom Digital Design Optimized circuit iL vL iL vL Of 77:54 SPECS: Piecewise Approximate Circuit Simulation Modeling piecewise constant i–v characteristics ( 1% error) linearized junction and diffusion capacitances ( 1% error) gate capacitances linearized and grounded ( 5% error) Dynamic Tuning for High–Performance Custom Digital Design tree/link formulation of circuit equations event–driven simulation to exploit latency and multi–rate behavior incremental processing of KCL and KVL equations whenever a device moves from one segment of its i–v table to the next relative timing error inversely proportional to square of signal slope Analysis Chandu Visweswariah Upside SPECS Summary speed: 70x (6,000 events per second on a workstation) capacity: 10x (0.25M transistors) accuracy: greater than 1s of delays with 5% relative timing accuracy; worst case is 10% used in 12 IBM sites for – timing verification – functional verification – tuning – power estimation of digital and memory circuits Downside Of 77:55 constant–grounded–capacitance–assumption causes inaccuracy stiff circuits a problem Dynamic Tuning for High–Performance Custom Digital Design no inductors Chandu Visweswariah i Sensitivity Analysis in SPECS ^ G v i i (v, p) ^ ^ ^ di i dv i or i Gv^ I x v dp p dp sensitivity or adjoint circuit consists of mostly disconnected, grounded capacitors little ‘‘spurts” of charge sharing occur instantaneously at the times of the events of the original nominal circuit event information from nominal simulation is saved results of sensitivity or adjoint circuit yield required sensitivities; convolution is required in the case of adjoint sensitivity both direct and adjoint methods implemented Dynamic Tuning for High–Performance Custom Digital Design sensitivity or adjoint circuit repeatedly solved with appropriate excitation Chandu Visweswariah v^ Of 77:56 Sensitivity Analysis in SPECS, Cont’d. Functions any crossing time (delay and slew are expressed as the difference of two crossing times) power in any voltage source over any sub–interval of time transient node voltage (direct method only) Parameters transistor widths (chain ruling done automatically) resistor and capacitance values Implementation notes Dynamic Tuning for High–Performance Custom Digital Design convolution in the adjoint method is carried out in an event–driven fashion automatic heuristic to choose between direct and adjoint method (presently checks if # parameters > 3 * # functions) can force direct or adjoint method Chandu Visweswariah Of 77:57 Efficiency of Sensitivity Analysis in SPECS 144 MOS transistors, 27 ns of simulation 36 functions, 104 parameters (more internally) nominal simulation: 2 s on Risc/System 6000 mod 590 sensitivity run time: 10.9 s overhead per adjoint sensitivity circuit simulation = 15% overhead per gradient computation = 0.14% Example 1: PowerPC Branch Scan and Select Circuit Dynamic Tuning for High–Performance Custom Digital Design 6,166 MOS transistors, 770 resistors, 66 ns of simulation 99 functions, 6,166 parameters (more internally) nominal simulation: 3 mins 5 s sensitivity run time: 20 mins 5 s overhead per adjoint sensitivity circuit simulation = 6.5% overhead per gradient computation = 0.001% !!!! Logic panel module IOmux Chandu Visweswariah Of 77:58 v N R –0.98795 –1.98795 0 Chandu Visweswariah 4 6 + – R 8 SPECS Analytic 10 Of 77:59 Time(ns) C vN Sensitivity Results on Simple Circuit 2 Dynamic Tuning for High–Performance Custom Digital Design delay 6.20 6.00 5.80 5.60 5.40 5.20 5.00 4.80 4.60 4.40 4.20 4.00 3.80 5.00 15.00 Delay Sensitivity SENS DATA Results 10.00 Dynamic Tuning for High–Performance Custom Digital Design delay sensitivity plotted as tangents to delay–width curve Chandu Visweswariah 20.00 width Of 77:60 LANCELOT Optimization: Special Issues Formulation minimax optimization (and initialization of Lagrange multipliers) implemented as described earlier CUTE testing environment has allowed us to integrate MINOS from Stanford as an alternative optimizer Problems due to noise any simulation–based optimization is inherently noisy unable to recover from poor initializations; optimizer often wastes time going nowhere does not stop gracefully or predictably Special considerations for noisy optimization tolerance for feasibility and line search break points Of 77:61 consideration of small step beneath which further progress unlikely initial trust region radius initial Hessian approximation for secant methods Dynamic Tuning for High–Performance Custom Digital Design stopping criteria Chandu Visweswariah Manufacturability Modes in JiffyTune general goal: make parametric manufacturability a push–button, transparent, easy–to–use part of tuning ‘‘Information” mode perform nominal tuning as a postprocess, simulate the tuned circuit at each process corner summarize the distribution of circuit performance across process corners Simultaneous tuning at multiple process corners simultaneously tune at all process corners; no surprises after tuning is completed replicate and modify tuning criteria for all process corners i – constraint: convert d(x) t to d i(x) t i x x x i, j i – objective function: convert min d(x) to min[max d i(x)] i x j – minimax: convert min[max d j(x)] j to min[max d ij(x)] i, j Dynamic Tuning for High–Performance Custom Digital Design run time overhead is roughly n where n number of corners Chandu Visweswariah Of 77:62 u Chandu Visweswariah Two–step Updates min f (x) s.t. c(x) u 0, u0 [c(x) u] 2 2 f (x) [c(x) u] x k1, u k1 minimize along this line Trust region x k, u k Dynamic Tuning for High–Performance Custom Digital Design x Of 77:63 Intuition Two–step Updates, Cont’d. if a variable appears in the merit function in a known functional form, a second update can be applied to it without incurring the cost of a new function/gradient evaluation under mild assumptions, the second step can be outside the trust–region; convergence is still guaranteed two–step updates lead to faster convergence, better optimization efficiency Minimax problems i i[z f i(x) u i] 1 2 i [z fi(x) ui]2 can be applied to the z variable in minimax problems; min [ max f i(x) ] leads to x z i Dynamic Tuning for High–Performance Custom Digital Design is a quadratic in z and is therefore amenable to two–step updates Chandu Visweswariah Of 77:64 JiffyTune: The Human Interface good tools are useless without a good human interface JiffyTune engine is file–driven and hence is not amenable to quick and productive design Features of Cadence interface Dynamic Tuning for High–Performance Custom Digital Design developed by actual circuit designer all functionality provided from schematic diagram – familiar environment ⇒ no steep learning curve – visual and interactive – complexities of tool hidden (but accessible) enforces clear expression of otherwise tacit circuit requirements – JiffyTune is greedy! adds significant additional functionality like grouping & ratio–ing overcomes the drudgery of a file–driven interface configurable to different site/project environments Chandu Visweswariah Of 77:65 JiffyTune Interface: Main Form 480ps Dynamic Tuning for High–Performance Custom Digital Design rise time <= 400ps 1400ps CLKG CLKM Chandu Visweswariah Of 77:66 JiffyTune Interface: Parameters Form Dynamic Tuning for High–Performance Custom Digital Design tunable/untunable transistors upper/lower bounds grouping (tracking) of instances individual settings support for hierarchy Parameters Chandu Visweswariah Of 77:67 JiffyTune Interface: Measurements Form rise time <= 400ps 1400ps CLKG CLKM 480ps delay rise/fall time area minimax function Dynamic Tuning for High–Performance Custom Digital Design Measurements (optimization targets) Chandu Visweswariah Of 77:68 JiffyTune Interface: Back Annotation Back annotation of results Dynamic Tuning for High–Performance Custom Digital Design same constraints, 2.5x load (0.6 pF → 1.5 pF) new (suggested) transistor widths and waveform data back–annotated on schematic Chandu Visweswariah Of 77:69 JiffyTune: Methodology Impact JiffyTune was used to tune120 circuits of a custom dynamic–logic microprocessor and was found to be a key productivity tool measurements, functions, weights, tunability, simple bounds, etc., are stored as attributes of schematic they are indeed part of the intellectual exercise of designing a good circuit thus design re–use is encouraged – when device models change – for different loading and arrival time situations – when design specifications change – during design remap forces careful and explicit specification of circuit requirements: the issue becomes how to correctly specify the optimization problem rather than solving the optimization problem itself! can create a power–delay or area–delay tradeoff curve by rerunning optimizer Dynamic Tuning for High–Performance Custom Digital Design Emphasis shifts to: (a) starting with a good topology and a working circuit, (b) posing the problem well Chandu Visweswariah Of 77:70 Historical Perspective: DELIGHT.SPICE W. Nye, D. C. Riley, A. S. Vincentelli, A. L. Tits, 1988 accommodates semi–infinite constraints hard and soft constraints; the latter have ‘‘good” and ‘‘bad” values from which the quality of the solution is judged minimax PHASE–I–II–III optimization by method of feasible directions (always produces feasible design, monotonically decreases merit function while handling semi–infinite constraints) implements sensitivity in SPICE Weaknesses Dynamic Tuning for High–Performance Custom Digital Design sloooooooooooow, tailored to small analog circuits hard to use Chandu Visweswariah Of 77:71 Historical Perspective: TAILOR D. Marple, 1989 operates on the layout constraint graph of a cell no separate post–layout closure needed uses constraint graph with corner–stitching data structure; variable weight for each transistor arc constructs a delay graph based on a single time–constant approximation (static) then works to minimize cell’s pitch in one dimension uses augmented Lagrangian with penalty term alternates steepest descent and steepest ascent to find saddle point of Lagrangian Weaknesses Dynamic Tuning for High–Performance Custom Digital Design used on small cells only (50 transistors) no hierarchy crude delay model Chandu Visweswariah Of 77:72 Historical Perspective: Interconnect Sizing J. Cong, K.–S. Leung, 1994 N. Menezes, S. Pullela, L. T. Pileggi, 1995 concurrent gate and interconnect sizing; flurry of interest in this topic since Elmore delay approximation for gates moment matching models for wires formulated into a convex optimization problem delay models differentiated analytically to provide gradients sequential quadratic programming (SQP) applied – quadratic fit about last solution point – minimize quadratic to get new solution vector routability enhanced by simple congestion weight factors Weaknesses Dynamic Tuning for High–Performance Custom Digital Design inaccurate gate delay model Chandu Visweswariah Of 77:73 Mini–outline 1. Introduction and motivation 2. Circuit simulation 3. Time–domain gradient computation 4. Nonlinear optimization Dynamic Tuning for High–Performance Custom Digital Design Conclusions, future work and pointers to the reading list – JiffyTune: algorithms and implementation – human interface – application to PowerPC microprocessor design – methodology impact and implications – historical perspective 5. Case study 6. Chandu Visweswariah Of 77:74 Chandu Visweswariah Conclusions dynamic tuning is possible and meaningful 1,000 tunable transistors well within reach; perhaps 10,000 tunable transistors is not far out of reach reasonable compromises on simulation accuracy (piecewise approximate circuit simulation) are both necessary and practical gradients necessary to carry out tuning adjoint method can be used to compute large number of gradients efficiently sophisticated nonlinear optimization algorithms essential extensions to manufacturability based on a design–of–experiments paradigm straightforward having a tuning capability not only improves design time and circuit quality, it has methodology impact too Dynamic Tuning for High–Performance Custom Digital Design Of 77:75 Future Work Dynamic Tuning for High–Performance Custom Digital Design accommodation of semi–infinite constraints, e.g., delay should be less than a target for a range of power supply voltages parallel processing for sensitivity computation Hessian computation automatic generation of problems from a static tuner; automatic generation of stimulus patterns and problem constraints exploitation of structure, e.g., group partial separability formulation as a mixed integer–continuous problem for discrete transforms formulation improvements to speedup large optimization runs Chandu Visweswariah Of 77:76 Reading List Dynamic Tuning for High–Performance Custom Digital Design classified into – circuit optimization in general – SPICE–like simulation – piecewise approximate simulation – gradient computation – nonlinear optimization bibliography has been annotated NEOS home page (take the state–of–the–art optimization software for a test run): http://www.mcs.anl.gov/home/otc/Server/neos.html HAPPY READING! Chandu Visweswariah Of 77:77