
Optimization Challenges in Transistor Sizing
Chandu Visweswariah
IBM Thomas J. Watson Research Center
Yorktown Heights, NY
Acknowledgments
The entire EinsTuner team at IBM, especially
Andy Conn of the Applied Mathematics
department at IBM Research
Two acts focusing on optimization
• Act I: what we have done with existing
nonlinear optimization techniques
– what is EinsTuner?
– how does it work?
– what is its impact?
• Act II: what we would love to do and
currently cannot
– numerical noise
– capacity
– robustness, scaling, weighting, degeneracy
– recursive minimax support
– mixed integer/continuous problems
Act I: What is EinsTuner?
• A static, formal, transistor sizing tool geared
towards custom high-performance digital
circuits
– static: based on static transistor-level timing
analysis, so implicitly takes into account all
paths through the logic
– custom: timing based on real time-domain
simulation of all possible transitions through
each clump of strongly-connected transistors
– formal: guaranteed optimal
• “local” according to engineers
• “global” according to mathematicians
– transistor-level, therefore inherently flat
– see DAC ’99, ICCAD ’99, SIAM J. Opt 9/99
Features of EinsTuner
• Extensive (sweeping and fine-grained)
grouping, ratio-ing and no-touch commands
• 25K transistor capacity (60K variables)
• Allows delay, area, β-ratio, input loading, slew and transistor width constraints
• Allows minimization of delay, area or linear
combination thereof
• Easy-to-use; full GUI support; good fit into
(semi-custom and custom) methodologies
• Timer benefits such as accuracy for custom
circuits, pattern-matching, state analysis
EinsTuner: formal static optimizer

[Figure: optimization loop. The nonlinear optimizer (LANCELOT) passes transistor and wire sizes to the static transistor-level timer (EinsTLT), which embeds the time-domain simulator (SPECS) and returns function and gradient values.]
Components of EinsTuner
1. Read netlist; create timing graph (EinsTLT)
2. Formulate pruned optimization problem
3. Feed problem to nonlinear optimizer (LANCELOT)
4. Solve optimization problem, calling the simulator for delays/slews and gradients thereof
5. Fast simulation and incremental sensitivity computation (SPECS)
6. Obtain converged solution
7. Snap-to-grid; back-annotate; re-time

(a toy numeric analogue of this loop is sketched below)
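To make the loop concrete, here is a minimal sketch, not EinsTuner itself: a three-stage inverter chain with a textbook RC delay model standing in for the SPECS simulator, and scipy's SLSQP standing in for LANCELOT. All device constants and the area budget are invented for illustration.

```python
# Toy analogue of the tuning loop: size a 3-stage inverter chain.
# RC delay model stands in for SPECS; SLSQP stands in for LANCELOT.
# All constants are illustrative, not from EinsTuner.
import numpy as np
from scipy.optimize import minimize

R, CIN, CPAR, CLOAD = 1.0, 1.0, 0.5, 64.0  # unit-width R, input/parasitic caps, load

def stage_delays(w):
    # stage i drives the input cap of stage i+1 (or the output load)
    # plus its own parasitic; resistance scales as 1/width
    loads = np.append(CIN * w[1:], CLOAD) + CPAR * w
    return (R / w) * loads

def total_delay(w):
    return stage_delays(w).sum()

res = minimize(total_delay, x0=np.ones(3), method='SLSQP',
               bounds=[(1.0, 100.0)] * 3,              # Wmin <= w_i <= Wmax
               constraints=[{'type': 'ineq',           # sum of widths <= area budget
                             'fun': lambda w: 120.0 - w.sum()}])
print(np.round(res.x, 2), round(float(total_delay(res.x)), 3))
```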
Static optimization formulation

[Figure: timing graph with arrival times $AT_1 \ldots AT_8$; arcs $d_{15}, d_{25}$ into node 5, $d_{36}, d_{46}$ into node 6, $d_{57}, d_{67}$ into node 7, and $d_{58}, d_{68}$ into node 8.]

$\min\,[\max(AT_7, AT_8)]$
s.t. $AT_7 = \max[AT_5 + d_{57},\; AT_6 + d_{67}]$
s.t. $AT_8 = \max[AT_5 + d_{58},\; AT_6 + d_{68}]$
s.t. $AT_5 = \max[AT_1 + d_{15},\; AT_2 + d_{25}]$
s.t. $AT_6 = \max[AT_3 + d_{36},\; AT_4 + d_{46}]$
Digression: minimax optimization

[Figure: curves $f_1(x)$ and $f_2(x)$; the upper envelope $z$ attains its minimum $z^*$ where the curves cross, and iterates proceed from $x^0$ to $x^*$.]

$\min_x\,[\max\{f_1(x),\, f_2(x)\}] \;\equiv\; \min_{x,z}\; z$ s.t. $z \ge f_1(x)$, $z \ge f_2(x)$
Remapped problem

$\min[\max(AT_7, AT_8)] \;\equiv\; \min z$ s.t. $z \ge AT_7$, $z \ge AT_8$

Each max-equality becomes a pair of inequalities:
$AT_7 = \max[AT_5 + d_{57},\, AT_6 + d_{67}]$ → $AT_7 \ge AT_5 + d_{57}$, $AT_7 \ge AT_6 + d_{67}$
$AT_8 = \max[AT_5 + d_{58},\, AT_6 + d_{68}]$ → $AT_8 \ge AT_5 + d_{58}$, $AT_8 \ge AT_6 + d_{68}$
$AT_5 = \max[AT_1 + d_{15},\, AT_2 + d_{25}]$ → $AT_5 \ge AT_1 + d_{15}$, $AT_5 \ge AT_2 + d_{25}$
$AT_6 = \max[AT_3 + d_{36},\, AT_4 + d_{46}]$ → $AT_6 \ge AT_3 + d_{36}$, $AT_6 \ge AT_4 + d_{46}$

[Figure: the same eight-node timing graph, annotated with these inequality pairs.]
Springs and planks analogy

[Figure: the timing graph rendered as planks (the arrival times $AT_1 \ldots AT_8$ and $z$) connected by springs of lengths $d_{15}, d_{25}, d_{36}, d_{46}, d_{57}, d_{58}, d_{67}, d_{68}$.]
Springs and planks analogy

[Figure sequence: successive snapshots of the spring-and-plank system settling; as $z$ is pushed down, planks come to rest on taut springs, and at the solution a chain of taut springs (the critical path) connects $z$ to the inputs.]
Algorithm demonstration: inv3
Algorithm demonstration: 1b ALU
Constraint generation

[Figure: a timing arc from node $i$ to node $j$ through a gate with device widths $w_n, w_p$ driving output load $cout_j$.]

$AT_j^r \ge AT_i^f + d_{ij}^r(w_n, w_p, cout_j, slewin^f)$
$AT_j^f \ge AT_i^r + d_{ij}^f(w_n, w_p, cout_j, slewin^r)$
$slew_j^r \ge s_{ij}^r(w_n, w_p, cout_j, slewin^f)$
$slew_j^f \ge s_{ij}^f(w_n, w_p, cout_j, slewin^r)$
Statement of the problem

$\min z$
s.t. $z \ge AT_i - RAT_i$ for all POs
s.t. $AT_j \ge AT_i + d_{ij}(W, S)$ for each timing arc
s.t. $s_j \ge s_{ij}(W, S)$ for each timing arc
s.t. $\sum W \le$ area target
s.t. $pincap_i(W) \le target_i$ for all PIs
s.t. $\beta_{\min} \le \beta_i(W) \le \beta_{\max}$ for all gates
s.t. $W_{\min} \le W_i \le W_{\max}$ for all FETs
s.t. $S_i \le S_{\max}^{internal}$ for all internal nets
s.t. $S_i \le S_{\max}^{PO}$ for all POs
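With fixed arc delays, the formulation above collapses to a linear program in the arrival times and $z$. Here is a minimal numeric instance on the eight-node example graph from earlier (delay values invented; the slew, area, β and width constraints are omitted).

```python
# LP instance of the block-based formulation: variables AT_1..AT_8 and z.
# Arc delays are fixed, invented numbers; in the real problem each d_ij is
# a simulated, width-dependent function.
import numpy as np
from scipy.optimize import linprog

arcs = {(1, 5): 2.0, (2, 5): 1.0, (3, 6): 3.0, (4, 6): 1.5,
        (5, 7): 1.0, (5, 8): 2.0, (6, 7): 2.5, (6, 8): 0.5}
outputs, n = [7, 8], 8                   # z is the last variable (index n)

A_ub, b_ub = [], []
for (i, j), d in arcs.items():           # AT_j >= AT_i + d_ij,
    row = np.zeros(n + 1)
    row[i - 1], row[j - 1] = 1.0, -1.0   # rewritten as AT_i - AT_j <= -d_ij
    A_ub.append(row); b_ub.append(-d)
for o in outputs:                        # z >= AT_o, i.e. AT_o - z <= 0
    row = np.zeros(n + 1)
    row[o - 1], row[n] = 1.0, -1.0
    A_ub.append(row); b_ub.append(0.0)

c = np.zeros(n + 1); c[n] = 1.0          # minimize z
bounds = [(0.0, 0.0)] * 4 + [(None, None)] * 5   # primary inputs arrive at t = 0
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
print(res.x[-1])                         # worst output arrival time z = 5.5
```

Note that the LP pins down $z$ but leaves the interior arrival times anywhere feasible; this is exactly the "timing correctness" issue revisited under recursive minimax in Act II.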
LANCELOT
• State-of-the-art general-purpose
nonlinear optimizer
• Trust-region method
• Exploitation of group partial separability;
augmented Lagrangian merit function
• Handles general linear/nonlinear equalities/
inequalities and objective functions
• Simple bounds handled by projections
• Several choices of preconditioners,
exact/inexact BQP solution, etc.
LANCELOT algorithms

[Figure: contours of $f(x_1, x_2)$ under simple bounds, with a trust region around the iterate $x_k$; the projected steepest-descent path (through points $u$, $v$, $w$) bends along the active bounds.]

$\min f(x_1, x_2)$ s.t. $0 \le x_1 \le a$, $0 \le x_2 \le b$
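A didactic fragment of the bound handling (assuming nothing about LANCELOT's internals): one steepest-descent step is clipped to the trust radius and then projected onto the box, so simple bounds never need general constraint machinery.

```python
# Projected trust-region step for min f(x) with simple bounds lo <= x <= hi:
# step along -grad, clip to the trust radius, project onto the box.
# A didactic fragment only, not LANCELOT's actual algorithm.
import numpy as np

def projected_tr_step(x, grad, radius, lo, hi):
    step = -grad
    nrm = np.linalg.norm(step)
    if nrm > radius:                     # stay inside the trust region
        step *= radius / nrm
    return np.clip(x + step, lo, hi)     # projection enforces the bounds

x_next = projected_tr_step(np.array([0.9, 0.1]),
                           grad=np.array([2.0, -1.0]),
                           radius=0.5, lo=0.0, hi=1.0)
print(x_next)
```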
Demonstration of degeneracy

[Figure pair: springs-and-planks snapshots in which several springs are simultaneously taut; multiple equally-critical paths support $z$ at once. Degeneracy!]
Why do we need pruning?

Pruning: an example

[Figure: a four-stage chain of inverting gates, nodes 1 → 2 → 3 → 4.]

Before pruning:
$\min z$
s.t. $z \ge AT_4^r$
s.t. $z \ge AT_4^f$
s.t. $AT_4^r \ge AT_3^f + d_{34}^r$
s.t. $AT_4^f \ge AT_3^r + d_{34}^f$
s.t. $AT_3^r \ge AT_2^f + d_{23}^r$
s.t. $AT_3^f \ge AT_2^r + d_{23}^f$
s.t. $AT_2^r \ge AT_1^f + d_{12}^r$
s.t. $AT_2^f \ge AT_1^r + d_{12}^f$

After pruning (the intermediate arrival-time variables are eliminated; see the sketch below):
$\min z$
s.t. $z \ge AT_1^f + d_{12}^r + d_{23}^f + d_{34}^r$
s.t. $z \ge AT_1^r + d_{12}^f + d_{23}^r + d_{34}^f$
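A minimal sketch of the chain collapsing above, on a single-phase graph (no rise/fall split); the real pruner must track both phases, fanout, and slew dependence.

```python
# Series-chain pruning: an internal node with exactly one fanin arc and one
# fanout arc is eliminated by composing the two delays, removing its AT
# variable and constraints. Parallel arcs merge by max (longest path).
def prune_chains(arcs):
    # arcs: dict mapping (src, dst) -> delay
    changed = True
    while changed:
        changed = False
        fanin, fanout = {}, {}
        for (i, j) in arcs:
            fanin.setdefault(j, []).append(i)
            fanout.setdefault(i, []).append(j)
        for node in list(fanin):
            if len(fanin[node]) == 1 and len(fanout.get(node, [])) == 1:
                i, j = fanin[node][0], fanout[node][0]
                d = arcs.pop((i, node)) + arcs.pop((node, j))
                arcs[(i, j)] = max(d, arcs.get((i, j), float('-inf')))
                changed = True
                break                    # degrees changed; rebuild the maps
    return arcs

chain = {(1, 2): 1.0, (2, 3): 2.0, (3, 4): 0.5}
print(prune_chains(chain))               # {(1, 4): 3.5}
```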
Block-based vs. path-based timing

[Figure: inputs 1, 2, 3 feed node 4, which feeds outputs 5 and 6.]

Block-based (one constraint per arc):
$AT_4 \ge AT_1 + d_{14}$
$AT_4 \ge AT_2 + d_{24}$
$AT_4 \ge AT_3 + d_{34}$
$AT_5 \ge AT_4 + d_{45}$
$AT_6 \ge AT_4 + d_{46}$

Path-based (one constraint per path; see the count below):
$AT_5 \ge AT_1 + d_{14} + d_{45}$
$AT_5 \ge AT_2 + d_{24} + d_{45}$
$AT_5 \ge AT_3 + d_{34} + d_{45}$
$AT_6 \ge AT_1 + d_{14} + d_{46}$
$AT_6 \ge AT_2 + d_{24} + d_{46}$
$AT_6 \ge AT_3 + d_{34} + d_{46}$
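The count argument can be made concrete. A toy tally, assuming a width-w, depth-d mesh of complete bipartite stages (a worst case for reconvergent fanout):

```python
# One constraint per arc (block-based) vs. one per source-to-sink path
# (path-based) for a width-w, depth-d mesh of complete bipartite stages.
def constraint_counts(width, depth):
    arcs = width * width * (depth - 1)   # arcs between consecutive levels
    paths = width ** depth               # pick one node per level
    return arcs, paths

for depth in (2, 4, 8):
    print(depth, constraint_counts(3, depth))
# depth 8: 63 arc constraints vs. 6561 path constraints
```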
Detailed pruning example

[Figure sequence: a 16-node timing graph plus source and sink, pruned step by step. Edge/node counts per step: 26/16(+2) → 20/10 → 17/7 → 16/6 → 15/5 → 14/4 → 13/3 → 13/2. Serial nodes are folded into their neighbors (e.g., node 7 merges into labels "1,7", "2,7", "3,7"; nodes 10, 13, 16 collapse into "10,13,16") until only composite source-to-sink arcs remain.]
Pruning vs. no pruning

[Figure: results comparing tuning with and without pruning.]
No time to mention...
• Simultaneous switching
• Tuning of transmission-gate MUXes
• Adjoint sensitivity computation
• The role of pattern matching
• Strength calculations/symmetrized stacks
• State analysis
• Detection of impossible problems
• Latches
• Uncertainty-aware tuning
EinsTuner impact
• Better circuits
– 15%+ on previously hand-tuned circuits!
• Improved productivity in conjunction with
semi-custom design flow (see Northrop et
al, DAC ’01)
• Better understanding of tradeoffs
– lifts designers’ thinking to a higher level
• Currently being applied to the third generation of z-series (S/390) and PowerPC microprocessors
Act II: If only...

[Figure: a small region, "what we are able to do," inside a much larger region, "what we could do with a robust, fast, high-capacity, noise-immune, degeneracy-immune, easy-to-use, non-finicky nonlinear optimizer"; the gap between the two is a rich source of research problems.]
Open research problems
• Immunity to numerical noise

[Figure: simulated voltage-vs.-time waveform showing small numerical jitter riding on the transition.]
Numerical noise
• Need a formal basis for optimization in the
presence of inevitable numerical noise
• Formal methods for measuring noise level
• Tie all tolerances to a multiple of noise level
– tolerances for bounds
– tolerances for updating multipliers, etc.
– stopping criteria
• Currently our optimization is a race against
noise and the conclusion is not pleasing!
• All optimization decisions noise-based
Numerical noise
• Inject various types of noise (correlated and uncorrelated) with different amplitudes into the function and gradient evaluations of analytic optimization problems
• Easy experimentation thus possible to test theories of noise-immune optimization
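A minimal version of that experiment, using scipy's analytic Rosenbrock test function and BFGS (the talk's actual framework is not public; this is an illustration only):

```python
# Inject uncorrelated noise of chosen amplitude into the function and
# gradient of an analytic problem and watch how far a quasi-Newton method
# gets before noise swamps the true gradient signal.
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

def noisy(amplitude, rng=np.random.default_rng(0)):
    def f(x):
        return rosen(x) + amplitude * rng.standard_normal()
    def g(x):
        return rosen_der(x) + amplitude * rng.standard_normal(x.size)
    return f, g

for amp in (0.0, 1e-8, 1e-4, 1e-1):
    f, g = noisy(amp)
    res = minimize(f, x0=np.zeros(4), jac=g, method='BFGS')
    print(f"amplitude {amp:g}: true f = {rosen(res.x):.3e}, {res.nfev} evals")
```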
Capacity
• Today, our biggest problem has about
30,000 transistors, 70,000 variables and
70,000 constraints
• Runs for as long as a week!
• We go to great lengths to reduce problem
size
– in the tuner code
– in the pre-processor
– in our methodology
What higher capacity would permit
• Tuning of larger macros
• Early mode considerations
• Manufacturability
– replicate constraints and variables at each
process corner
– rich possibilities in choice of objective function
• Explicit (circuit) noise constraints
• Benefits of explicit grouping constraints in
post-optimality analysis
Digression: noise constraints

[Figure: noise waveform $v(t)$ crossing the noise margin $NM_L$ between $t_1$ and $t_2$ within the window $[t_s, t_e]$; the shaded area above $NM_L$ must be zero.]

$v(x, t) \le NM_L$ for all $t \in [t_1, t_2]$
$\Longrightarrow \int_{t_s}^{t_e} \max\{v(x, t) - NM_L,\, 0\}\, dt \le 0$ (area = 0)
• Combine Harmony and EinsTuner
– essential for dynamic circuits (see TCAD 6/00)
– replace rules-of-thumb by actual noise
constraints for static circuits
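Numerically, the area form turns an infinite family of pointwise constraints into one scalar. A minimal sketch with an invented glitch waveform and margin:

```python
# Single-number form of the noise constraint: the area of v(t) above the
# noise margin NML over the window must be zero. Waveform is invented.
import numpy as np

def noise_area(v, t, nml):
    dt = t[1] - t[0]
    return float(np.sum(np.maximum(v - nml, 0.0)) * dt)  # rectangle rule

t = np.linspace(0.0, 1.0, 201)
v = 0.45 * np.exp(-((t - 0.5) / 0.1) ** 2)   # glitch peaking at 0.45 V
print(noise_area(v, t, nml=0.40))            # > 0, so the constraint is violated
```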
Robustness
• Scaling
– requires a unit change in any variable to have roughly the same order of magnitude of impact
– need non-uniform trust-region; would subsume
2-step updating [Conn et al, SIAM J. Opt. 9/99]
– very difficult in general
• Weighting
– essential in any large-scale optimizer
– need problem-size-dependent weights
• Sensitivity to choices should be minimized
– finicky behavior not helpful in engineering
application
Ease of use
• Nonlinear optimizers should stop assuming
they are the “master” program
• There is no standard API for nonlinear
optimization packages
• Aspects such as error/message handling
and memory management are important;
FORTRAN does not help
• Not easy to help the optimizer with domain-specific knowledge
– e.g., availability of information about Lagrange
multipliers in recursive minimax problems
Mixed integer/continuous problems
• Assignment of low threshold voltage
devices
• Swapping pin order in multi-input gates
(combinatorial problem)
• Working from a discrete gate library
• Discrete choice of taper ratios
• In general, the type of re-structuring that a
logic synthesis program may consider
Recursive minimax support

$\min z$
s.t. $z \ge AT_i + d_{iz}(x),\quad i = 1, 2, \ldots, I$
s.t. $AT_i \ge AT_j + d_{ji}(x),\quad j = 1, 2, \ldots, J$
s.t. $AT_j \ge AT_k + d_{kj}(x),\quad k = 1, 2, \ldots, K$
...

• Need a way to force "timing correctness"
– arrival times and slews at lowest feasible value
– one of them tight at every stage
– product of slacks = 0?
Degeneracy and redundancy
• Nonlinear optimizers easy to mislead
• Most engineering problems have plenty of
inherent degeneracy and redundancy
• Can optimizers be made less sensitive to
degeneracy?
• Example:
$\min f(x)$ s.t. $c(x) \le 0$, $c(x) \le 0$
is degenerate because stationarity,
$\nabla_x f(x) = -\nabla_x c(x)\,[\lambda_1 + \lambda_2]$,
determines only the sum of the multipliers.
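A concrete instance (invented numbers) makes the multiplier ambiguity visible:

```latex
% f(x) = x^2 with the constraint c(x) = 1 - x <= 0 stated twice
\begin{align*}
  &\min_x\; x^2 \quad \text{s.t.}\quad 1 - x \le 0, \qquad 1 - x \le 0 \\
  &\text{At } x^* = 1:\quad
     \nabla f(x^*) = -\nabla c(x^*)\,(\lambda_1 + \lambda_2)
     \;\Longrightarrow\; 2 = \lambda_1 + \lambda_2,
     \quad \lambda_1, \lambda_2 \ge 0.
\end{align*}
```

Any split (2, 0), (0, 2) or (1, 1) satisfies the KKT conditions, so the multipliers are non-unique and the active-constraint Jacobian is rank deficient; this is precisely what misleads the optimizer.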
Practical considerations
• Detection of impossible problems
– bound problems
– obviously impossible constraints
– not-so-obviously impossible constraints that
need a little function evaluation to detect
• Post-optimality analysis
– final values of Lagrange multipliers can be
used to give the designer guidance on
obtaining further improvement
– may need to re-solve for multipliers at end
– information needs to be collected and
presented to designer in nice form
– formulation of problem may change to take
advantage of this analysis
Conclusion (Act I)
• Can robustly carry out static circuit tuning
• Consistent and high-quality results
• Improvement in designers’ productivity
• Moves designers’ thinking to a higher level
• Presently impacting third generation of
z-series (S/390) and PowerPC
microprocessors
Conclusion (Act II)
• Open problems in nonlinear optimization
whose solution will have a real impact
– numerical noise
– capacity
– robustness, scaling, weighting
– ease of use, common API
– discrete variables
– support for recursive minimax problems
– degeneracy
– post-optimality analysis
– detection of impossible problems