[Figure: design size (transistors, 10,000 to 1,000,000,000) and simulation speed (sym. cycles/sec) versus year, 1980 to 2005. Labeled designs: i286, i386, i486, Pentium, PentiumPro, PentiumII, K7, PentiumIV, Prescott, Crusoe, PowerIV, Nvidia NV2, Riva TNT2, GeforceII, GeForceIV, GeForceFX, Radeon97. © EETimes 03/18/2004]

[Figure: Pre-silicon logic bugs per generation. Bars: Pentium, Pentium Pro, Pentium 4, Next ?. Values shown: 800, 2240, 7855, 25000. Source: Tom Schubert, Intel, DAC 2003]

[Figure: Logic Gates, Simulation Vectors and Engineer Years for 1995, 2001 and 2007]

[Figure: Number of spins in IC Designs. Percent of designs (0% to 60%) versus spin count (1 to 5), for 2000 and 2002. Annotation: 71% of SoC re-spins are due to logic bugs]

Sources given for the two figures above: Synopsys; data from SoC designs; G. Spirakis, keynote address at DATE 2004.

[Figure: Temperature variation (°C) results in hot spots. Heat flux (W/cm2) results in Vcc variation]

[Figure: Random Dopant Fluctuations. Mean number of dopant atoms versus technology node (nm)]

[Figure: Soft Error Rate (FIT/chip, 1.0E-07 to 1.0E+04) versus technology generation, from 600nm (1992) through 350nm, 250nm, 180nm, 130nm, 100nm and 70nm to 50nm (2011). Series: SRAM; latch 6 FO4s; logic 6 FO4s. P. Shivkumar et al., DSN 2002]

SER per chip of logic circuits
• Nine orders of magnitude increase from 600 nm to 50 nm
• Dominant source of soft errors after 50 nm

[Figure: processor pipeline with checker. Stages: IF, ID, REN, REG, SCHEDULER, EX/MEM, CHK, CT. Labels: "speculative instructions", "in-order with inputs and outputs". Source: Todd Austin, Univ. of Michigan]

[Figure: Thread 1 and Thread 2 with a flag and shared data, running on an SMT processor with checker processors. Label: "Error?"]

Two semaphores sem1, sem2 are initialized to 0

    Thread 1: ... SEMV(sem1); SEMP(sem2); …
    Thread 2: ... SEMV(sem2); SEMP(sem1); …

What’s the correct semantic in checker processors?

[Figure: timeline of SEMV and SEMP in Thread 1 and Thread 2, forming a barrier]

    Thread 1: ... if (flag == 0) i = i + 1; data = i; …
    Thread 2: ... key = data; result = foo(key); …

[Figure: timeline for Thread-1 and Thread-2. Events: bne (resolved with correct prediction), add (completed with error), st (buffered in LSQ), ld (forwarded from LSQ). Label: "Valid?"]
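The two-semaphore exchange shown earlier (each thread does SEMV on its own semaphore, then SEMP on the other's) behaves as a two-thread barrier: neither thread can pass its SEMP until the other has executed its SEMV. The following is a minimal software sketch of that behavior, not the hardware mechanism from the slides. It assumes SEMV maps to `release()` and SEMP to `acquire()` on Python's `threading.Semaphore`; the function name `run_barrier_demo` and the logged strings are hypothetical.

```python
import threading


def run_barrier_demo():
    """Run the two-semaphore exchange once and return the observed event order."""
    # Both semaphores start at 0, as stated on the slide.
    sem1 = threading.Semaphore(0)
    sem2 = threading.Semaphore(0)
    events = []
    events_lock = threading.Lock()

    def log(message):
        # Serialize appends so the recorded order is a real global order.
        with events_lock:
            events.append(message)

    def thread1():
        log("t1 before")
        sem1.release()  # SEMV(sem1): signal that thread 1 has arrived
        sem2.acquire()  # SEMP(sem2): wait until thread 2 has arrived
        log("t1 after")

    def thread2():
        log("t2 before")
        sem2.release()  # SEMV(sem2): signal that thread 2 has arrived
        sem1.acquire()  # SEMP(sem1): wait until thread 1 has arrived
        log("t2 after")

    workers = [threading.Thread(target=thread1), threading.Thread(target=thread2)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return events


if __name__ == "__main__":
    print(run_barrier_demo())
```

Whatever the scheduling, both "before" entries are recorded ahead of both "after" entries. A checker that replays each thread in isolation would have to preserve this cross-thread ordering, which is presumably what the slide's question about checker semantics is pointing at.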
[Figure: Hardware Synchronization Unit. Blocks: SMT Processor, dispatch, per-thread retired instructions, Runtime Monitoring Hardware, DIVA checker processors, Register File, Memory, Architected State]

[Figure: Normalized Execution Time (0.9 to 1.2) on SPLASH-2 Benchmarks: FFT, LU, CHOLESKY, BARNES, FMM, WATER-NSQUARED. Series: Runtime Validation Configuration; Fault Rate = 1/100; Fault Rate = 1/1K; Fault Rate = 1/10K; Fault Rate = 1/100K]

                      local properties    distributed properties
  large state space   RV                  ???
  small state space   RV, MC              MC

[Table: results by processor count. Column labels on the slide: with Assumptions, without Assumptions, Checking Assumptions]

  Processor Count
   4      0.08     0.51        9.87
   6      0.33     8.94      143.57
   8      0.61    29.06      348.95
  10      0.8     68.86     1606.61
  12      5.13   141.33     8008.13
  14      3.07   183.51   145989.05
  16     13.2    656.41    TIME-OUT

[Table: results by unit count. Column labels on the slide: Safety/A, Safety, Check A, Liveness/A, Liveness]

  Unit #
   4      2.03     4.45     27.11      67.19     385.72
   6      8.16    14.84    105.47     294.94   85428.15
   8     50.12    74.56    448.99    1840.63   TIME-OUT
  10     66.09    57.02    448.32    2007.45   TIME-OUT
  12    133.82    82.88    723.48    4212.76   TIME-OUT
  14    285.83   120.39   1110.78    9053.72   TIME-OUT
  16    408.03   158.98   1423.31   13810.5    TIME-OUT

  {design D: usual HDL description of design}
  while {checker C: monitor property at runtime}
  else {recovery R: recovery procedure}
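The design / checker / recovery construct can be illustrated with a small software analogy. This sketch assumes the construct means "use the design's result while the checker's property holds, otherwise fall back to the recovery procedure"; the slides do not spell out the exact semantics, and every name below (`run_with_checker`, `faulty_adder`, `adder_checker`, `simple_adder`) is hypothetical.

```python
def run_with_checker(design, checker, recovery, inputs):
    """Evaluate `design`; if `checker` rejects the result, use `recovery` instead.

    Returns (result, recovered), where `recovered` says whether the
    recovery path was taken.
    """
    result = design(*inputs)
    if checker(inputs, result):
        return result, False
    return recovery(*inputs), True


def faulty_adder(a, b):
    # Hypothetical "design D": an adder with an injected bug that flips
    # the low bit of the sum whenever the first operand is 3.
    total = a + b
    return total ^ 1 if a == 3 else total


def adder_checker(inputs, result):
    # Hypothetical "checker C": monitors the property result - b == a.
    a, b = inputs
    return result - b == a


def simple_adder(a, b):
    # Hypothetical "recovery R": a simple, trusted recomputation.
    return a + b


if __name__ == "__main__":
    print(run_with_checker(faulty_adder, adder_checker, simple_adder, (1, 2)))
    print(run_with_checker(faulty_adder, adder_checker, simple_adder, (3, 4)))
```

The point of the split is that the checker and the recovery path can be much simpler than the design itself, so they are easier to verify; the design only has to be right most of the time for the fallback to stay cheap.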