HELIX, Aladdin, and RoboBees to build approximate accelerator architectures.

RoboBees + Aladdin + HELIX
 Approximate Accelerator
Architectures
Gu-Yeon Wei
School of Engineering and Applied Sciences
Harvard University
CMOS scaling is running out
Technological
Fallow Period
2
Power wall  Lost performance
Power wall  multicore
3
Two obvious paths to higher performance
Parallel processing
Specialized processing
4
Today’s talk
• RoboBees: A Convergence of Body, Brain and
Colony
• Aladdin: A pre-RTL accelerator modeling tool
• HELIX-UP: Approximate parallelization trades
accuracy for speed
Future architectures for
approximate accelerator systems
5
Our Inspiration
6
The problem
• 30% of world’s food supply pollinated by
honeybees
• Big problem if they all disappear…
7
7
Build micro-sized robotic bees!
[K. Ma et al., Science 2013] [Takeoff Video: http://goo.gl/RTkeSF]
8
9
[R. Wood et al., SciAm 2013]
10
The Brain: Sensing  Control  Actuation
Power actuator(s)
Passive components
Brain IC
Sensors
Power IC
Control actuator(s)
Battery
Magnetic components
11
A 50mg budget  No off-chip components
• Fully-integrated 4:1 voltage regulator
– Cascade two 2:1 switched-capacitor converters
– Brain SoC operates directly off of ~3.7V battery
• Self clocking via voltage-tracking oscillator
– No external reference clock/oscillator
– Frequency tracks operating conditions
*A calibrated relaxation oscillator provides 10MHz +/- 1.2%
frequency for “absolute” timing to flap wings
12
Power/clock scaffolding test chip
3mm
1mm
TSMC 40G
[X. Zhang et al., CICC 2013]
[T. Tong et al., CICC 2013]
SRAM1
Switched-Capacitor
Integrated Voltage Regulator
Cortex-M0
DCO
BIST
SRAM0
13
How to “sense” the world?
14
on-bee sensors
2
6
6
6
6
6
6
4
ocelli
magnetometer
accelerometer
gyroscope
✓x
✓y
✓z
˙x
✓
˙y
✓
˙z
✓
32
76
76
76
76
76
76
54
x
y
z
ẋ
ẏ
ż
3
7
7
7
7
7
7
5
laser
rangefinder
sonar
(audio)
antenna (wind)
pressure
optical flow
Sensor
Type
Mass
Power
BW
Precision
Remarks
Ocelli1
light
22 mg
< 5 mW
100 kHz
2 deg/s
Gyroscope
MEMS
40 mg
20 mW
200 Hz
1 deg/sec
Accelerometer
MEMS
40 mg
20 mW
200 Hz
0.2 milli-g
Sonar
audio
14 mg
.4 mW
100 Hz
low
wings as emitters?
Magnetometer
magnetic
16 mg
1 mw
1 kHz
1 deg
magnetic interference?
Pressure
MEMS
30 mg
1 mW
200 Hz
1m
Optical flow2
CMOS
>100 mg
10-20mW
40 Hz
NA
Airspeed1
PC-MEMS
5 mg
25 mW
100 Hz
10 cm/s
Laser rangefinder
CMOS
5 mg
1.5 mW
40 Hz
~1 cm
vibration-sensitive
computation-heavy
needs camera
Insect flight relies on optical flow
16
optical flow sensor integration
Sensor: 4x32 pixels
[P. Duhamel et al., Trans. Mechatronics 2013]
17
altitude control w/ optical flow
[P. Duhamel et al., Trans. Mechatronics 2013]
18
Optimize SoC power-performance via
hardware acceleration
Bio-inspired brain / electronic nervous system
19
We built a BrainSoC test-chip prototype
[X. Zhang et al., Symp. VLSI Circuit 2015]
20
Why accelerators?
Accelerators
Performance
Performance
Accelerators
General
Purpose
Processors
General
Purpose
Processors
Power
Power
Cortex-M0
Accel
Core
Accel
Core
Accel
Core
Accel
Core
21
E.g.: Apple A8 SoC has cores,
memories, GPU + lots of accelerators
Out-of-Core
Accelerators
Out-of-Core
Accelerators
Maltiel Consulting
estimates
Our estimates
[www.anandtech.com/show/8562/chipworks-a8]
22
today’s SoC
CPU
GPU/
DSP
CPU
Buses
Mem
InterAcc Acc Acc face
Acc Acc Acc
Acc Acc Acc
23
future accelerator-centric architectures
Big
Cores
Small
Cores
Shared Resources
Memory
Interface
GPU/DSP
Sea of Fine-Grained
Accelerators
How to decompose an application to accelerators?
How to rapidly design lots of accelerators?
How to design and manage the shared resources?
Flexibility
Design Cost
Programmability
24
Aladdin: A pre-RTL accelerator model
Shared Memory/Interconnect
Models (rest of SoC)
Unmodified
C-Code
Aladdin
Accelerator Design
Parameters
(e.g., # FU, mem. BW)
Accelerator
Specific
Datapath
“Accelerator Simulator”
Design Accelerator-Rich SoC
Fabrics and Memory Systems
Flexibility
Programmability
[Y. Shao et al., ISCA 2014]
Power/Area
Private L1/
Scratchpad
Performance
“Design Assistant”
Understand Algorithmic-HW
Design Space before RTL
Design Cost
Explore design space of accelerators
Big Cores
GPU/DS
P
Small
Cores
Shared Resources
Memory
Interface
Sea of Fine-Grained
Accelerators
• Rapidly evaluate large design space
• Model impact of SoC system, e.g.,
shared resource contention.
26
Aladdin is a pre-RTL modeling tool
C Code
Acc Design
Parameters
Dynamic Data
Dependence Graph
(DDDG)
Performance
Activity
Power/Area
27
Aladdin is a pre-RTL modeling tool
Optimization Phase
C Code
Optimistic
IR
Initial
DDDG
Idealistic
DDDG
Performance
Acc Design
Parameters
Program
Constrained
DDDG
Resource
Constrained
DDDG
Activity
Power/Area
Models
Power/Area
Realization Phase
28
Example of C to accelerator design
C Code:
For (i=0; i<N; ++i)
c[i] = a[i] + b[i];
0. r0=0 //i = 0
1. r4=load (r0 + r1) //load a[i]
2. r5=load (r0 + r2) //load b[i]
3. r6=r4 + r5
4. store(r0 + r3, r6) //store c[i]
5. r0=r0 + 1 //++i
6. r4=load(r0 + r1) //load a[i]
7. r5=load(r0 + r2) //load b[i]
8. r6=r4 + r5
9. store(r0 + r3, r6) //store c[i]
10. r0 = r0 + 1 //++i
…
29
Computation depends on parameters
optimization phase
realization phase
Resource Activity
Idealistic DDDG
0. i=0
1. ld a
5.i++
2. ld b
6. ld a
7. ld b
0. i=0
15. i++
10. i++
11. ld a 12. ld b 16. ld a 17. ld b
1. ld a
2. ld b
3. +
8. +
13. +
18. +
3. +
4. st c
9. st c
14. st c
19. st c
4. st c
MEM MEM
+
MEM
+
5.i++
6. ld a
Accelerator design parameters:
• Memory BW <= 2
• Adder(s) = 1
7. ld b
MEM MEM
+
8. +
MEM
9. st c
Cycle
Computation depends on parameters
optimization phase
realization phase
Resource Activity
Idealistic DDDG
0. i=0
1. ld a
5.i++
2. ld b
6. ld a
15. i++
10. i++
7. ld b
0. i=0
11. ld a 12. ld b 16. ld a 17. ld b
1. ld a
+
5.i++
2. ld b
6. ld a
7. ld b
MEM MEM MEM MEM
3. +
8. +
13. +
18. +
3. +
8. +
+
+
4. st c
9. st c
14. st c
19. st c
4. st c
9. st c
MEM
MEM
11. ld a
Accelerator design parameters:
• Memory BW <= 4
• Adder(s) = 2
12. ld b
16. ld a
+
+
15. i++
10. i++
17. ld b
MEM MEM MEM MEM
13. +
18. +
+
+
14. st c
19. st c
MEM
MEM
Cycle
Get power-performance design points
Power
Accelerator Design Parameters:
• Memory BW <= 4
• Adder(s) = 2
Accelerator Design Parameters:
• Memory BW <= 2
• Adder(s) = 1
Cycles
32
Easily generate full design space
Power
Designer chooses best design
from Pareto frontier
Cycles
33
Aladdin is accurate…
34
…. and fast
Hand-Coded RTL
C-to-RTL
Programming Effort
High
Medium
RTL Generation
Designer
Dependent
37 mins
Aladdin
N/A
RTL Simulation
5 mins
RTL Synthesis
45 mins
Time to Solution per
Design
87 mins
1 min
Time to Solution
(36 Designs)
52 hours
7 min
35
Aladdin enables pre-RTL simulation of
accelerators with the rest of the SoC
MARSx86
Big
Cores
...
XIOSim
Small
Cores
…
Shared
Cacti/Orion2
Resources
GPGPUGPU
Sim
Memory
DRAMSim2
Interface
Sea of Fine-Grained
Accelerators
36
Automatic parallelization is the other path
Parallel processing
Specialized processing
37
HELIX is an automatic parallelization
framework to enhance TLP in CMPs
• Parallelize loops by distributing loop iterations
across multiple CMP cores
– Like DOACROSS parallelization
• Heavily relies on code analysis (DDG) and
various transformations
– No speculation (at this time)
• Synchronization via memory
– Performance sensitive to core-to-core
communication
[S. Campanoni et al., CGO 2012]
38
Convert loop into a form that facilitates
HELIX parallelization
wait(1)
prologue
seq. segment 1
signal(1)
original
loop
code
parallel code
wait(2)
loop
body
seq. segment 2
signal(2)
parallel code
wait(3)
seq. segment 3
signal(3)
• Sequential segments (ss) contain loop-carried data
dependences
– ss in next iteration may require data from prior iteration(s)
– Synchronization via wait/signal instructions
39
Parallel execution of loop iterations (i)
across adjacent cores within CMP
communication latency!
• Shortening ss improves parallelism
• Effectiveness sensitive to communication latency
40
Slow cache coherence in today’s
processors impedes HELIX parallelism
41
Simulated results for 16 core CMP:
Substantial speedup with Ring Cache
Ring Cache (RC) = Fast core-to-core data sharing
[S. Campanoni et al., ISCA 2014]
42
What if we relax strict adherence to
program semantics?
• Inspired by approximate computing
• Tolerable distortion is workload-dependent
• E.g., bzip2 has 2 outputs
Compressed file (100%)
Statistics (0 ~ 100%) 43
E.g., break sequential bottlenecks
• A code region executed sequentially
Thread 1
Thread 2
Thread 3
Inst 1
Inst 2
Inst 3
Inst 4
Inst 1
Inst 2
Inst 1
Inst 2
Dep
Inst 3
Inst 4
Inst 3
Inst 4
44
Several opportunities to trade off
accuracy for performance/parallelism
• Sequential bottleneck
– Sequential execution of a code region
• Communication bottleneck
– Movement of data between threads
• Data locality bottleneck
– Locality lost due to parallelization
No output distortion
Baseline performance
Max output distortion
Max performance
45
HELIX-UP unblocks parallelism
with small output distortion
(user-defined fn)
[S. Campanoni, et al., CGO 2015]
Nehalem: 6 cores
2 threads per core
46
Runtime adaption of knobs offers even
more performance
256.bzip2
HELIX
(%)
[S. Campanoni, et al., CGO 2015]
47
Many examples of approximate computing
hardware proposed in recent years
• Simplify hardware/logic to trade accuracy for power,
delay, and area
–
–
–
–
Approximate logic to “check” computation 1
Imprecise arithmetic for CORDIC 2
Approximate adders for DSP 3
ABACUS: Approximate logic synthesis 4
original exact
design description
1[M.
Fig. 2.
AST
modified AST
approximate
design description
Overall methodology
of ABACUS.
3
Choudhury et al. DATE 2008]
2[J. Huang & J Lach DAC 2011]
[V. Gupta et al. TCAD 2013]
Nepal et al. DATE 2014]
4[K.
variants through transformations to the AST; and (iii) finally
1. Data
massive
be a goo
requirem
point ar
truncati
signific
number
48
metic
o
Look beyond homogeneous parallelism
Programmability
SIMD/
SSE
In Core
GPU
Energy
Efficiency
Approximate
Composable
Accelerators
Out of Core
AESDEC
H.264
Fixed Function
49
To wrap up…
UP
modified AST
ABACUS
approximate
design description
1. Data Type S
massive data, tru
be a good way to
requirements for
point arithmetic
truncation in two
50