RoboBees + Aladdin + HELIX Approximate Accelerator Architectures Gu-Yeon Wei School of Engineering and Applied Sciences Harvard University CMOS scaling is running out Technological Fallow Period 2 Power wall Lost performance Power wall multicore 3 Two obvious paths to higher performance Parallel processing Specialized processing 4 Today’s talk • RoboBees: A Convergence of Body, Brain and Colony • Aladdin: A pre-RTL accelerator modeling tool • HELIX-UP: Approximate parallelization trades accuracy for speed Future architectures for approximate accelerator systems 5 Our Inspiration 6 The problem • 30% of world’s food supply pollinated by honeybees • Big problem if they all disappear… 7 7 Build micro-sized robotic bees! [K. Ma et al., Science 2013] [Takeoff Video: http://goo.gl/RTkeSF] 8 9 [R. Wood et al., SciAm 2013] 10 The Brain: Sensing Control Actuation Power actuator(s) Passive components Brain IC Sensors Power IC Control actuator(s) Battery Magnetic components 11 A 50mg budget No off-chip components • Fully-integrated 4:1 voltage regulator – Cascade two 2:1 switched-capacitor converters – Brain SoC operates directly off of ~3.7V battery • Self clocking via voltage-tracking oscillator – No external reference clock/oscillator – Frequency tracks operating conditions *A calibrated relaxation oscillator provides 10MHz +/- 1.2% frequency for “absolute” timing to flap wings 12 Power/clock scaffolding test chip 3mm 1mm TSMC 40G [X. Zhang et al., CICC 2013] [T. Tong et al., CICC 2013] SRAM1 Switched-Capacitor Integrated Voltage Regulator Cortex-M0 DCO BIST SRAM0 13 How to “sense” the world? 14 on-bee sensors 2 6 6 6 6 6 6 4 ocelli magnetometer accelerometer gyroscope ✓x ✓y ✓z ˙x ✓ ˙y ✓ ˙z ✓ 32 76 76 76 76 76 76 54 x y z ẋ ẏ ż 3 7 7 7 7 7 7 5 laser rangefinder sonar (audio) antenna (wind) pressure optical flow Sensor Type Mass Power BW Precision Remarks Ocelli1 light 22 mg < 5 mW 100 kHz 2 deg/s Gyroscope MEMS 40 mg 20 mW 200 Hz 1 deg/sec Accelerometer MEMS 40 mg 20 mW 200 Hz 0.2 milli-g Sonar audio 14 mg .4 mW 100 Hz low wings as emitters? Magnetometer magnetic 16 mg 1 mw 1 kHz 1 deg magnetic interference? Pressure MEMS 30 mg 1 mW 200 Hz 1m Optical flow2 CMOS >100 mg 10-20mW 40 Hz NA Airspeed1 PC-MEMS 5 mg 25 mW 100 Hz 10 cm/s Laser rangefinder CMOS 5 mg 1.5 mW 40 Hz ~1 cm vibration-sensitive computation-heavy needs camera Insect flight relies on optical flow 16 optical flow sensor integration Sensor: 4x32 pixels [P. Duhamel et al., Trans. Mechatronics 2013] 17 altitude control w/ optical flow [P. Duhamel et al., Trans. Mechatronics 2013] 18 Optimize SoC power-performance via hardware acceleration Bio-inspired brain / electronic nervous system 19 We built a BrainSoC test-chip prototype [X. Zhang et al., Symp. VLSI Circuit 2015] 20 Why accelerators? Accelerators Performance Performance Accelerators General Purpose Processors General Purpose Processors Power Power Cortex-M0 Accel Core Accel Core Accel Core Accel Core 21 E.g.: Apple A8 SoC has cores, memories, GPU + lots of accelerators Out-of-Core Accelerators Out-of-Core Accelerators Maltiel Consulting estimates Our estimates [www.anandtech.com/show/8562/chipworks-a8] 22 today’s SoC CPU GPU/ DSP CPU Buses Mem InterAcc Acc Acc face Acc Acc Acc Acc Acc Acc 23 future accelerator-centric architectures Big Cores Small Cores Shared Resources Memory Interface GPU/DSP Sea of Fine-Grained Accelerators How to decompose an application to accelerators? How to rapidly design lots of accelerators? How to design and manage the shared resources? Flexibility Design Cost Programmability 24 Aladdin: A pre-RTL accelerator model Shared Memory/Interconnect Models (rest of SoC) Unmodified C-Code Aladdin Accelerator Design Parameters (e.g., # FU, mem. BW) Accelerator Specific Datapath “Accelerator Simulator” Design Accelerator-Rich SoC Fabrics and Memory Systems Flexibility Programmability [Y. Shao et al., ISCA 2014] Power/Area Private L1/ Scratchpad Performance “Design Assistant” Understand Algorithmic-HW Design Space before RTL Design Cost Explore design space of accelerators Big Cores GPU/DS P Small Cores Shared Resources Memory Interface Sea of Fine-Grained Accelerators • Rapidly evaluate large design space • Model impact of SoC system, e.g., shared resource contention. 26 Aladdin is a pre-RTL modeling tool C Code Acc Design Parameters Dynamic Data Dependence Graph (DDDG) Performance Activity Power/Area 27 Aladdin is a pre-RTL modeling tool Optimization Phase C Code Optimistic IR Initial DDDG Idealistic DDDG Performance Acc Design Parameters Program Constrained DDDG Resource Constrained DDDG Activity Power/Area Models Power/Area Realization Phase 28 Example of C to accelerator design C Code: For (i=0; i<N; ++i) c[i] = a[i] + b[i]; 0. r0=0 //i = 0 1. r4=load (r0 + r1) //load a[i] 2. r5=load (r0 + r2) //load b[i] 3. r6=r4 + r5 4. store(r0 + r3, r6) //store c[i] 5. r0=r0 + 1 //++i 6. r4=load(r0 + r1) //load a[i] 7. r5=load(r0 + r2) //load b[i] 8. r6=r4 + r5 9. store(r0 + r3, r6) //store c[i] 10. r0 = r0 + 1 //++i … 29 Computation depends on parameters optimization phase realization phase Resource Activity Idealistic DDDG 0. i=0 1. ld a 5.i++ 2. ld b 6. ld a 7. ld b 0. i=0 15. i++ 10. i++ 11. ld a 12. ld b 16. ld a 17. ld b 1. ld a 2. ld b 3. + 8. + 13. + 18. + 3. + 4. st c 9. st c 14. st c 19. st c 4. st c MEM MEM + MEM + 5.i++ 6. ld a Accelerator design parameters: • Memory BW <= 2 • Adder(s) = 1 7. ld b MEM MEM + 8. + MEM 9. st c Cycle Computation depends on parameters optimization phase realization phase Resource Activity Idealistic DDDG 0. i=0 1. ld a 5.i++ 2. ld b 6. ld a 15. i++ 10. i++ 7. ld b 0. i=0 11. ld a 12. ld b 16. ld a 17. ld b 1. ld a + 5.i++ 2. ld b 6. ld a 7. ld b MEM MEM MEM MEM 3. + 8. + 13. + 18. + 3. + 8. + + + 4. st c 9. st c 14. st c 19. st c 4. st c 9. st c MEM MEM 11. ld a Accelerator design parameters: • Memory BW <= 4 • Adder(s) = 2 12. ld b 16. ld a + + 15. i++ 10. i++ 17. ld b MEM MEM MEM MEM 13. + 18. + + + 14. st c 19. st c MEM MEM Cycle Get power-performance design points Power Accelerator Design Parameters: • Memory BW <= 4 • Adder(s) = 2 Accelerator Design Parameters: • Memory BW <= 2 • Adder(s) = 1 Cycles 32 Easily generate full design space Power Designer chooses best design from Pareto frontier Cycles 33 Aladdin is accurate… 34 …. and fast Hand-Coded RTL C-to-RTL Programming Effort High Medium RTL Generation Designer Dependent 37 mins Aladdin N/A RTL Simulation 5 mins RTL Synthesis 45 mins Time to Solution per Design 87 mins 1 min Time to Solution (36 Designs) 52 hours 7 min 35 Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC MARSx86 Big Cores ... XIOSim Small Cores … Shared Cacti/Orion2 Resources GPGPUGPU Sim Memory DRAMSim2 Interface Sea of Fine-Grained Accelerators 36 Automatic parallelization is the other path Parallel processing Specialized processing 37 HELIX is an automatic parallelization framework to enhance TLP in CMPs • Parallelize loops by distributing loop iterations across multiple CMP cores – Like DOACROSS parallelization • Heavily relies on code analysis (DDG) and various transformations – No speculation (at this time) • Synchronization via memory – Performance sensitive to core-to-core communication [S. Campanoni et al., CGO 2012] 38 Convert loop into a form that facilitates HELIX parallelization wait(1) prologue seq. segment 1 signal(1) original loop code parallel code wait(2) loop body seq. segment 2 signal(2) parallel code wait(3) seq. segment 3 signal(3) • Sequential segments (ss) contain loop-carried data dependences – ss in next iteration may require data from prior iteration(s) – Synchronization via wait/signal instructions 39 Parallel execution of loop iterations (i) across adjacent cores within CMP communication latency! • Shortening ss improves parallelism • Effectiveness sensitive to communication latency 40 Slow cache coherence in today’s processors impedes HELIX parallelism 41 Simulated results for 16 core CMP: Substantial speedup with Ring Cache Ring Cache (RC) = Fast core-to-core data sharing [S. Campanoni et al., ISCA 2014] 42 What if we relax strict adherence to program semantics? • Inspired by approximate computing • Tolerable distortion is workload-dependent • E.g., bzip2 has 2 outputs Compressed file (100%) Statistics (0 ~ 100%) 43 E.g., break sequential bottlenecks • A code region executed sequentially Thread 1 Thread 2 Thread 3 Inst 1 Inst 2 Inst 3 Inst 4 Inst 1 Inst 2 Inst 1 Inst 2 Dep Inst 3 Inst 4 Inst 3 Inst 4 44 Several opportunities to trade off accuracy for performance/parallelism • Sequential bottleneck – Sequential execution of a code region • Communication bottleneck – Movement of data between threads • Data locality bottleneck – Locality lost due to parallelization No output distortion Baseline performance Max output distortion Max performance 45 HELIX-UP unblocks parallelism with small output distortion (user-defined fn) [S. Campanoni, et al., CGO 2015] Nehalem: 6 cores 2 threads per core 46 Runtime adaption of knobs offers even more performance 256.bzip2 HELIX (%) [S. Campanoni, et al., CGO 2015] 47 Many examples of approximate computing hardware proposed in recent years • Simplify hardware/logic to trade accuracy for power, delay, and area – – – – Approximate logic to “check” computation 1 Imprecise arithmetic for CORDIC 2 Approximate adders for DSP 3 ABACUS: Approximate logic synthesis 4 original exact design description 1[M. Fig. 2. AST modified AST approximate design description Overall methodology of ABACUS. 3 Choudhury et al. DATE 2008] 2[J. Huang & J Lach DAC 2011] [V. Gupta et al. TCAD 2013] Nepal et al. DATE 2014] 4[K. variants through transformations to the AST; and (iii) finally 1. Data massive be a goo requirem point ar truncati signific number 48 metic o Look beyond homogeneous parallelism Programmability SIMD/ SSE In Core GPU Energy Efficiency Approximate Composable Accelerators Out of Core AESDEC H.264 Fixed Function 49 To wrap up… UP modified AST ABACUS approximate design description 1. Data Type S massive data, tru be a good way to requirements for point arithmetic truncation in two 50