Slide 1: Post-Silicon Bug Diagnosis with Inconsistent Executions
Valeria Bertacco ([email protected])
Computer Science and Engineering, University of Michigan – Ann Arbor

Slide 2: Errors come in many flavors… and they always hurt
• Functional bugs ("All Phenoms feature infamous L3 cache errata", 2007; "Another day, another microprocessor delay", 2007)
• Electrical failures
• Transistor faults ("Intel's Sandy Bridge Glitch: 7 Things You Need to Know", 2011)

Slide 3: Design Validation
Pre-silicon → Post-silicon → Product. Post-silicon validation debugs prototypes before shipment.
+ Fast prototypes
+ High coverage
+ Tests the full system
+ Finds deep bugs
- Poor observability
- Slow off-chip transfer
- Noisy
- Intermittent bugs

Slide 4: Post-silicon bugs
• Intermittent post-silicon bugs are the most challenging
• The same test does not expose the bug in every run
• Each run exhibits different behaviors
OUR GOAL: locate intermittent bugs, the ones most difficult to debug!
[Figure: the same post-silicon test (assembly snippet) yields many different results]

Slide 5: BPS: "Bug Positioning System"
• Localizes failures in time (cycle) and space (signals)
• Tolerates non-repeatable executions through a statistical approach
• Scalable and adaptable to many HW subsystems

Slide 6: Signatures
• Goal: summarize a signal's values over a time window
• Encodings (Hamming, CRC, etc.): large hardware, and a small change in the input causes a large change in the output
• Counting schemes (time@1, toggle counts; a software sketch follows Slide 12)
[Figure: signal A over two windows, with time@1 = 2 in the first window and time@1 = 1 in the second]

Slide 7: Statistical approach
• Traditional debugging compares one passing testcase against one failing testcase; statistical debugging asks whether the distributions of signature values match
• The same test can yield different results, so each signal is characterized by a distribution of signature values
• Signature value = time@1 / window size
[Figure: distributions of signature values for passing and failing testcases]

Slide 8: Good vs. bad signatures
• Characterize populations of signatures
• A good signature provides statistical separation between noise and the bug
• Encoding signatures (CRC, Hamming distance, MISR, etc.) leave the passing and failing populations overlapping; counting signatures (time@1, time@0, toggles) separate them
[Figure: overlapping vs. well-separated distributions of passing and failing signature values]

Slide 9: Signature hardware
• Measures time@1
• Uses custom hardware or reuses the existing debug infrastructure
• Custom HW: 11 KB for 100 signals x 100 windows at 9-bit precision → 1.35 mm² in a 65 nm library → 0.4% of OpenSPARC

Slide 10: BPS: "Bug Positioning System"
1. Hardware logging
2. Software post-analysis

Slide 11: Bug band model
• Passing runs define a passing band (µ ± 2σ) for each signal and window; failing runs define a failing band
• Where the failing band leaves the passing band lies the bug band: the bug occurs at its start and is detected once the bands separate
[Figure: signature value per window for one signal from the MEM stage of a 5-stage pipeline processor, showing the passing band, failing band, bug band, bug occurrence, and bug detection]

Slide 12: SW post-analysis
• Organize the logged signatures by signal and by window
• Compare the passing and failing populations to extract the bug band, i.e., the candidate signals and windows (sketched below)
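To make the counting-scheme signatures concrete, here is a minimal software model of what the logging hardware accumulates per window. The window size, the example trace, and the helper name time_at_1_signatures are illustrative assumptions of this sketch; in BPS the counters are accumulated by on-chip hardware.

```python
# Minimal software model of a counting-scheme signature (time@1 per window).
# The window size and the example trace are illustrative assumptions; in BPS
# the counters are accumulated by the on-chip logging hardware.

def time_at_1_signatures(trace, window_size):
    """trace: 0/1 samples of one signal, one per cycle.
    Returns one normalized signature (time@1 / window length) per window."""
    signatures = []
    for start in range(0, len(trace), window_size):
        window = trace[start:start + window_size]
        signatures.append(sum(window) / len(window))  # fraction of cycles at 1
    return signatures

# Example: a 12-cycle trace of a hypothetical signal, split into 4-cycle windows.
trace_a = [0, 1, 1, 0,  1, 0, 0, 0,  1, 1, 1, 0]
print(time_at_1_signatures(trace_a, 4))  # [0.5, 0.25, 0.75]
```

The normalized value (time@1 divided by the window length) is the quantity whose distribution the statistical analysis compares across runs.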
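The bug-band post-analysis can be sketched under similar assumptions: for each monitored signal and window, the passing runs define a µ ± 2σ band, and a failing run is flagged wherever its signature falls outside that band. The data layout, the signal name, and the ranking by earliest window are assumptions of this simplified sketch, not the exact BPS implementation.

```python
# Sketch of the bug-band post-analysis: per (signal, window), passing runs
# define a mu +/- 2*sigma band; a failing run's signature outside that band
# marks a candidate bug location/time. Data layout and ranking are assumptions.
import statistics

def passing_band(passing_sigs):
    """passing_sigs: signature values for one (signal, window) across passing runs.
    Returns the (low, high) bounds of the mu +/- 2*sigma passing band."""
    mu = statistics.mean(passing_sigs)
    sigma = statistics.pstdev(passing_sigs)
    return mu - 2 * sigma, mu + 2 * sigma

def localize(passing, failing):
    """passing: {signal: [[signature per window] per passing run]}
    failing:  {signal: [signature per window]} for one failing run.
    Returns (signal, window) pairs whose failing signature leaves the
    passing band, earliest window first."""
    candidates = []
    for signal, runs in passing.items():
        n_windows = len(runs[0])
        for w in range(n_windows):
            low, high = passing_band([run[w] for run in runs])
            if not (low <= failing[signal][w] <= high):
                candidates.append((w, signal))
    candidates.sort()
    return [(signal, w) for w, signal in candidates]

# Toy example with one hypothetical signal: passing runs hover near 0.5,
# while the failing run drops to 0.1 in window 2.
passing = {"mem_valid": [[0.50, 0.48, 0.52], [0.49, 0.51, 0.50], [0.51, 0.50, 0.49]]}
failing = {"mem_valid": [0.50, 0.49, 0.10]}
print(localize(passing, failing))  # [('mem_valid', 2)]
```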
Slide 13: Experimental setup
• 10 random seeds: variable memory delay, random crossbar traffic
• 41,744 top-level control signals monitored
• 100 passing runs and 1,000 buggy runs across 10 testcases
• BPS HW logging and BPS SW analysis report the detected signals and the detection latency
• 10 injected bugs, e.g., a functional bug in the PCX and an electrical error in the crossbar (Xbar)

Slide 14: Testcases
• Testcases: blimp_rand, fp_addsub, fp_muldiv, isa2_basic, isa3_asr_pr, isa3_window, ldst_sync, mpgen_smc, n2_lsu_asi, tlu_rand
• Bugs: PCX gnt SA, PCX fxn, PCX atm SA, MMU fxn, BR fxn, Xbar elect, EXU elect, MMU combo, MCU combo, Xbar combo
• Legend: √ = bug found, √+ = exact signal localized, f.p. = false positive, f.n. = false negative, n.b. = no bug
• Slide annotations: noisy signals; wider effects are easier to catch; signal not excited by the benchmarks; floating observable point
[Table: per-testcase, per-bug signal localization results; most entries are √ or √+, with a handful of false positives and false negatives]

Slide 15: Time to detect bug
• ∆ time from bug injection to detection, in cycles, for each bug
• 1,273 cycles on average
[Figure: bar chart of detection latency per bug, plus the average]

Slide 16: Conclusions
• BPS automatically localizes a bug's time and location
• It leverages a statistical approach to tolerate noise
• It is effective for a variety of bugs: functional, electrical, and manufacturing
• On average, bugs are detected within 1,273 cycles and narrowed to 75 signals