
Post-Silicon Bug Diagnosis
with Inconsistent Executions
Valeria Bertacco
[email protected]
Computer Science and Engineering
University of Michigan – Ann Arbor
Errors come in many flavors…
• Functional bugs
• Electrical failures
• Transistor faults
Headlines from the press: “All Phenoms feature infamous L3 cache errata” (2007); “Another day, another microprocessor delay” (2007); “Intel’s Sandy Bridge Glitch: 7 Things You Need to Know” (2011)
…and they always hurt
Design Validation
Pre-Silicon → Post-Silicon → Product
Post-silicon: debug prototypes before shipment
+ Fast prototypes
+ High coverage
+ Test full system
+ Find deep bugs
- Poor observability
- Slow off-chip transfer
- Noisy
- Intermittent bugs
Post-silicon bugs
• Intermittent post-silicon bugs are the most challenging
• The same test does not expose the bug in every run
• Each run exhibits different behaviors
• Difficult to debug!
OUR GOAL: locate intermittent bugs
[Figure: one post-silicon test program produces many different results across runs]
BPS: “Bug Positioning System”
• Localize failures
  – in time (cycle) and space (signals)
• Tolerate non-repeatable executions
  – statistical approach
• Scalable, adaptable to many HW subsystems
Signatures
• Goal: summarize signal values
• Encodings (Hamming, CRC, etc.)
  – Large hardware
  – Small change in input → large change in output
• Counting schemes (time@1, toggles)
[Figure: signal A over two consecutive windows; time@1 = 2 in the first window, time@1 = 1 in the second]
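To make the counting schemes concrete, here is a minimal Python sketch of windowed time@1 and toggle signatures. The function names and the 4-cycle window are my own illustration, not from the talk.

    def time_at_1(trace, window_size):
        """time@1: cycles spent at logic 1 within each fixed-size window."""
        return [sum(trace[i:i + window_size])
                for i in range(0, len(trace), window_size)]

    def toggles(trace, window_size):
        """Transitions (0->1 or 1->0) within each window; transitions
        across a window boundary are ignored in this sketch."""
        return [sum(a != b for a, b in zip(trace[i:i + window_size],
                                           trace[i + 1:i + window_size]))
                for i in range(0, len(trace), window_size)]

    # Like "signal A" on the slide: two windows with time@1 = 2, then 1
    signal_a = [0, 1, 1, 0, 0, 0, 1, 0]
    print(time_at_1(signal_a, 4))  # [2, 1]
    print(toggles(signal_a, 4))    # [2, 2]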
Statistical approach
• Traditional debugging compares a failing testcase against a single passing testcase
• Statistical debugging instead asks: does the failing run match the distribution of signature values over many passing testcases?
• The same test can yield different results, so signatures form a distribution rather than a single value
• Signature value = time@1 / window size
[Figure: histogram of signature values (0 to 1) across passing testcases, against which a failing testcase is compared]
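A sketch of how the per-window signature distribution could be built from many passing runs, assuming the slide’s normalized signature value of time@1 / window size; the function names are illustrative.

    from collections import Counter

    def signature_values(trace, window_size):
        """Normalized signature per window: time@1 / window size, in [0, 1]."""
        return [sum(trace[i:i + window_size]) / window_size
                for i in range(0, len(trace), window_size)]

    def window_distribution(passing_traces, window_size, window):
        """Empirical distribution of one window's signature across passing runs."""
        values = [signature_values(t, window_size)[window] for t in passing_traces]
        total = len(values)
        return {v: n / total for v, n in sorted(Counter(values).items())}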
Good vs bad signatures
• Characterize populations of signatures
• Seek statistical separation between noise and bug
[Figure: two signature-value distributions. With encodings (CRC, Hamming distance, MISR, etc.), passing testcases scatter across the whole range. With counting schemes (time@1, time@0, toggle), passing testcases cluster and failing testcases stand apart]
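The contrast can be shown in a few lines: a single-bit difference between two runs sends a CRC to an unrelated value, while a counting signature moves by at most one. This toy example is mine, not from the talk.

    import random
    import zlib

    random.seed(0)
    base = [random.randint(0, 1) for _ in range(64)]
    noisy = base.copy()
    noisy[10] ^= 1  # one bit of run-to-run noise

    def to_bytes(bits):
        return int("".join(map(str, bits)), 2).to_bytes(8, "big")

    # Encoding signature: tiny input change, unrelated output
    print(zlib.crc32(to_bytes(base)), zlib.crc32(to_bytes(noisy)))

    # Counting signature: tiny input change, tiny output change
    print(sum(base), sum(noisy))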
Signature hardware
• Measure time@1
• Use custom hardware or reuse existing debug infrastructure
• Custom HW: 11 KB for 100 signals × 100 windows at 9-bit precision
  → 1.35 mm² in a 65 nm library
  → 0.4% of OpenSPARC
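The slide’s storage figure checks out: 100 signals × 100 windows × 9 bits = 90,000 bits ≈ 11 KB.

    signals, windows, precision_bits = 100, 100, 9
    total_bits = signals * windows * precision_bits  # 90,000 bits
    print(total_bits / 8 / 1024)                     # ~11 KB, as on the slide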
BPS: “Bug Positioning System”
1. Hardware logging
2. Software post-analysis
Bug band model
• Passing band: µ ± 2σ of a signal’s signature values across passing runs, per window
• Failing band: the same statistic across failing runs
• Bug band: the windows where the failing band separates from the passing band
[Figure: signature value vs. window index for one signal from the MEM stage of a 5-stage pipeline processor; the failing band leaves the passing band shortly after the bug occurs, at which point the bug is detected]
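A minimal sketch of the bug-band computation, assuming the bands are per-window µ ± 2σ over each population of runs as labeled on the slide; function names are mine.

    import statistics

    def band(runs):
        """Per-window (mu - 2*sigma, mu + 2*sigma) across a population of
        runs; each run is a list of per-window signature values (needs
        at least two runs for a standard deviation)."""
        out = []
        for window in zip(*runs):
            mu = statistics.mean(window)
            sigma = statistics.stdev(window)
            out.append((mu - 2 * sigma, mu + 2 * sigma))
        return out

    def bug_band(passing_runs, failing_runs):
        """Windows where the failing band no longer overlaps the passing band."""
        suspect = []
        pairs = zip(band(passing_runs), band(failing_runs))
        for w, ((p_lo, p_hi), (f_lo, f_hi)) in enumerate(pairs):
            if f_hi < p_lo or f_lo > p_hi:
                suspect.append(w)
        return suspect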
SW post-analysis
[Figure: matrices of signatures (signals × windows) from passing and failing runs are compared via the bug band model to select suspect signals and windows]
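Continuing the sketch above, the post-analysis could sweep every monitored signal and report the ones whose bug bands open earliest. This reuses band/bug_band from the previous block; the top_k default of 75 merely echoes the average candidate count reported in the conclusions.

    def localize(passing, failing, top_k=75):
        """passing, failing: dict of signal name -> population of runs.
        Returns up to top_k (signal, first suspect window) pairs,
        earliest divergence first."""
        suspects = []
        for signal in failing:
            windows = bug_band(passing[signal], failing[signal])
            if windows:
                suspects.append((windows[0], signal))
        suspects.sort()
        return [(signal, w) for w, signal in suspects[:top_k]]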
Experimental setup
• 10 testcases, each run with 10 random seeds: variable memory delay, crossbar random traffic
• Monitored 41,744 top-level control signals
• 100 passing runs, 1,000 buggy runs
• 10 injected bugs, e.g., a functional bug in the PCX, an electrical error in the Xbar
• Flow: testcases → BPS HW → BPS SW → detected signals and detection latency
Testcases and signal localization
• Testcases: blimp_rand, fp_addsub, fp_muldiv, isa2_basic, isa3_asr_pr, isa3_window, ldst_sync, mpgen_smc, n2_lsu_asi, tlu_rand
• Bugs: EXU elect, MMU combo, MCU combo, Xbar combo, PCX fxn, PCX atm SA, MMU fxn, BR fxn, Xbar elect, PCX gnt SA
• Legend: √+ exact signal localized; √ bug found; n.b. no bug exposed; f.p. false positive; f.n. false negative
[Table: localization outcome per testcase–bug pair; most cells are √ or √+, with scattered f.p., f.n., and n.b. entries]
• Callouts: three bugs on noisy signals show wider effects; some signals are not excited by the floating-point benchmarks; bugs at observable points are easier to catch
Time to detect bug
[Figure: ∆ time from bug injection to detection (cycles) per bug (EXU elect, MMU combo, MCU combo, XBar combo, PCX fxn, PCX atm SA, MMU fxn, BR fxn, XBar elect, PCX gnt SA); bars range up to ~6,000 cycles; average 1,273 cycles]
Conclusions
• BPS automatically localizes bugs in time and space
• Leverages a statistical approach to tolerate noise
• Effective for a variety of bugs: functional, electrical, and manufacturing
• On average: 1,273 cycles to detection, 75 candidate signals