SP-CNN: A Scalable and Programmable CNN-based Accelerator
Dilan Manatunga
Dr. Hyesoon Kim
Dr. Saibal Mukhopadhyay
Motivation

Power is a first-order design constraint, especially for embedded
devices.

Certain applications are prevalent in embedded devices:
image processing, audio processing, context awareness, etc.

There is a trend of including more specialized, energy-efficient
accelerators (e.g., the 2013 Moto X).
Cellular Neural Networks

We investigated using Cellular Neural Networks (CNNs) as a
specialized accelerator.

Used for a variety of applications, particularly image processing.

Hardware implementations are known to consume very little power.
  Lee et al. (2008) implemented a CNN chip that consumed 84 mW at 130 nm.

Issue: the processing capabilities of current hardware Cellular Neural
Networks have not scaled with the growth in image sizes.
Outline

Cellular Neural Network (CNN) Background

Multiplexing

SP-CNN Architecture

Results
Neural Networks
Radial Basis Function Network, Self-Organizing Map, Adaptive Resonance
Theory, Pulse-Coupled Neural Network, Artificial Neural Network, Hopfield
Neural Network, Convolutional Neural Network, Recurrent Neural Network,
Spiking Neural Network

In this work, CNN refers to the Cellular Neural Network, not the
Convolutional Neural Network.
CNN Background

Introduced by Leon Chua and Lin Yang in 1988.

Characterized by a spatial arrangement of locally-coupled cells.
  The set of cells a cell is coupled with is known as the cell's neighborhood.

A typical Cellular Neural Network consists of an MxN array of cells,
each denoted C(i,j).
CNN Applications

Image Processing
  Edge detection, image segmentation, movement detection, etc.

Pattern Recognition

Associative Memory

Solving Partial Differential Equations

Pre-processing data for other neural networks

The CNN Universal Machine is Turing complete.
CNN Cell


Each cell is a dynamical system with an input, output, and a state that
evolves according to some prescribed laws.
Most CNNs follow the standard CNN state and output equations:
State equation:

$$\frac{dx_{ij}(t)}{dt} = -x_{ij}(t) + \sum_{C(k,l)\in N_r(i,j)} a_{kl}\, y_{kl}(t) + \sum_{C(k,l)\in N_r(i,j)} b_{kl}\, u_{kl} + z$$

Output equation:

$$y_{ij}(t) = \frac{1}{2}\big(\,|x_{ij}(t) + 1| - |x_{ij}(t) - 1|\,\big)$$
where:
  $N_r(i,j)$ is the neighborhood of the cell at location (i,j)
  x is the cell state
  y is the cell output
  u is the cell input
  z is the threshold value
  a and b are the feedback and feed-forward (control) weights
Digital CNN Cell

For our work, we focused on a digital CNN implementation. In these
implementations, the state equation simplifies to:
$$x_{ij}(t+1) = \sum_{C(k,l)\in N_r(i,j)} a_{kl}\, y_{kl}(t) + \sum_{C(k,l)\in N_r(i,j)} b_{kl}\, u_{kl} + z$$
[Cell diagram: the neighborhood outputs weighted by template A and the
neighborhood inputs weighted by template B are summed with the threshold z
to form the next state x_ij, from which the output y_ij is produced.]
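As an illustration only (not the paper's hardware), a minimal NumPy sketch of
this synchronous update, assuming a 3x3 neighborhood, the standard output
nonlinearity, and zero boundary values; all function and variable names here
are our own:

import numpy as np

def digital_cnn_step(x, u, A, B, z):
    # x: current cell states (M x N); u: cell inputs (M x N)
    # A: 3x3 feedback template (applied to outputs)
    # B: 3x3 control template (applied to inputs)
    # Standard CNN output nonlinearity: y = 0.5 * (|x + 1| - |x - 1|)
    y = 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))
    # Zero boundary condition: pad outputs and inputs with one ring of zeros.
    yp = np.pad(y, 1)
    up = np.pad(u, 1)
    # x(t+1) = sum a_kl * y_kl(t) + sum b_kl * u_kl + z over the 3x3 neighborhood.
    M, N = x.shape
    x_next = np.full((M, N), float(z))
    for di in range(3):
        for dj in range(3):
            x_next += A[di, dj] * yp[di:di + M, dj:dj + N]
            x_next += B[di, dj] * up[di:di + M, dj:dj + N]
    return x_next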
CNN Gene

The programmable unit of a CNN is known as the CNN gene.

The CNN gene specifies:
  the z threshold value of the CNN state equation
  the a and b coefficients of the CNN state equation

Furthermore, the programmer specifies the initial state and boundary
conditions for the CNN.
Hole Filling gene:

$$A = \begin{bmatrix} 0.0 & 1.0 & 0.0 \\ 1.0 & 2.0 & 1.0 \\ 0.0 & 1.0 & 0.0 \end{bmatrix}, \quad B = \begin{bmatrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 4.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{bmatrix}, \quad z = -1.0$$

Initial state: $x_{ij}(0) = 1$
Boundary condition: $y_{kl} = 0$, $u_{kl} = 0$
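As a usage sketch, the hole-filling gene above could be plugged into the
digital_cnn_step sketch from the Digital CNN Cell slide (load_binary_image is
a hypothetical helper; a +1 = black, -1 = white pixel convention is assumed):

import numpy as np

# Hole-filling gene, values taken from the gene definition above.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 0.0]])
B = np.array([[0.0, 0.0, 0.0],
              [0.0, 4.0, 0.0],
              [0.0, 0.0, 0.0]])
z = -1.0

u = load_binary_image()            # hypothetical helper returning a +1/-1 image
x = np.ones_like(u, dtype=float)   # initial state x_ij(0) = 1, per the gene

# Iterate the digital update until the output stops changing (as noted later,
# there is no a-priori bound on the convergence time).
y_prev = None
while True:
    x = digital_cnn_step(x, u, A, B, z)
    y = 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))
    if y_prev is not None and np.array_equal(y, y_prev):
        break
    y_prev = y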
Example Application

The following is an example of running the Hole Filling gene, an application
commonly used in character recognition, on an input image of the number eight.
Hardware CNNs


Different companies and research institutions have implemented CNNs in
hardware:

  Name                   Array Size            Technology Node
  VAE                    80x60                 130 nm
  Flak+                  4x4                   180 nm
  SCAMP                  21x21                 600 nm
  4-layer Gabor Filter   32x64                 250 nm
  ACE16k                 128x128               350 nm
  QCIF                   176x144               150 nm
  Kingetset+             128x128 (estimated)   240 nm

The array sizes of most implementations are typically small.
Issue: Image Dim vs CNN Array Dim

Generally, most applications assume the CNN array’s dimensions
match the input (image) dimensions.
But what if the CNN array is smaller?

1. Shrink the image
   Issue: the image-size to CNN-size ratio.
   E.g., a 1024x1024 image on a 128x128 CNN is a 64:1 ratio!

2. Emulate the ideal CNN through multiplexing
Ideal Multiplexing
[Diagram: the image is split into partitions 0-3; each partition's state is
transferred to the CNN array, advanced by one step (from T=0 to T=1), and
written back.]
With ideal multiplexing, we transfer and run each partition on the CNN array
for 1 CNN unit of time.

Heavily memory bound: the number of input/state data transfers is
proportional to the convergence time (in CNN time units).
[Timeline: for every single CNN time unit, each partition requires a read of
its input u and its state x(t), a short computation, and a write of x(t+1);
there is too little work per transfer to utilize memory-level parallelism
(MLP).]
What if we add another CNN array to take advantage
of independence between computations?
Key-Insight



Let's take advantage of the CNN's ability to still converge to the correct
solution in the presence of small errors.

So, instead of running a partition on the CNN array for 1 unit of time, we
run the partition for an interval T >> 1 unit of time.

This is the key insight behind our Scalable and Programmable CNN (SP-CNN)
architecture.
SP-CNN Multiplexing

With SP-CNN multiplexing, we run each partition on the CNN
array for a certain interval of time.
[Timeline: instead of reading a partition's input and state, computing for 1
CNN time unit, and writing back, SP-CNN computes for INTERVAL CNN time units
(INTERVAL = 4 in the example) before writing back; one pass over all
partitions is one iteration, so with four partitions cnn-time advances by
4 x INTERVAL per iteration.]
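A software-only sketch of this multiplexing loop, reusing digital_cnn_step
from the Digital CNN Cell slide (for now each partition is run with zero
boundaries, which the next slide refines; the partition size P and interval
defaults follow the Methodology slide, and all names are our own):

import numpy as np

def sp_cnn_run(u, A, B, z, x0, P=128, interval=128, max_iters=1000):
    # u: full input image; x0: full initial state; the physical CNN array is only P x P.
    M, N = u.shape
    state = x0.astype(float).copy()
    for _ in range(max_iters):
        next_state = state.copy()
        for r in range(0, M, P):                 # one pass over all partitions = one iteration
            for c in range(0, N, P):
                x_tile = state[r:r + P, c:c + P].copy()
                u_tile = u[r:r + P, c:c + P]
                for _ in range(interval):        # run INTERVAL CNN time units, not just 1
                    x_tile = digital_cnn_step(x_tile, u_tile, A, B, z)
                next_state[r:r + P, c:c + P] = x_tile   # one write-back per interval
        if np.array_equal(next_state, state):    # no partition changed -> converged
            return next_state
        state = next_state
    return state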
Boundary Conditions


When running a partition on the CNN array, what should the boundary
conditions be set to?

To ensure that information propagates between partitions, we set a
partition's boundary condition from the cells of its neighboring partitions.
[Figure: Partition 0's boundary ring is built from Partition 1's pixels where
the two partitions meet, and from the image's true boundary condition where
the partition touches the image edge.]
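A sketch of how a partition plus its 1-pixel boundary ring could be gathered
from global memory, replacing the zero-boundary simplification in the earlier
loop sketch: ring cells that fall inside the image take the neighboring
partitions' most recent values, and ring cells outside the image take the
true boundary condition (the helper name and the constant true-boundary value
are our assumptions):

import numpy as np

def load_partition_with_boundary(state, u, r0, c0, P, true_boundary=0.0):
    # Returns (P+2) x (P+2) tiles: the partition plus its boundary ring.
    M, N = state.shape
    x_tile = np.full((P + 2, P + 2), true_boundary)
    u_tile = np.full((P + 2, P + 2), true_boundary)
    # Clip the (P+2) x (P+2) window to the image and copy whatever overlaps;
    # ring cells left untouched keep the true boundary condition.
    rs, re = max(r0 - 1, 0), min(r0 + P + 1, M)
    cs, ce = max(c0 - 1, 0), min(c0 + P + 1, N)
    x_tile[rs - r0 + 1:re - r0 + 1, cs - c0 + 1:ce - c0 + 1] = state[rs:re, cs:ce]
    u_tile[rs - r0 + 1:re - r0 + 1, cs - c0 + 1:ce - c0 + 1] = u[rs:re, cs:ce]
    return x_tile, u_tile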
Boundary Conditions Example
[Figure: partitions P0-P3 are run on the CNN array in turn, each starting
from the state at T1 and using its neighbors' most recent data (plus the true
boundary condition) as its boundary; after all four partitions have each run
for one interval, the full state has advanced to T2 = T1 + 4*INTERVAL, and
the next pass advances it to T3 = T2 + 4*INTERVAL.]
Total Time and Virtual Time

To compare convergence of Ideal-CNN vs SP-CNN, we introduce
the concept of Virtual Time.
[Figure: the ideal CNN advances the whole image by one step per CNN time
unit, while SP-CNN (INTERVAL = 4) spends 4 x INTERVAL cnn-time units per
iteration; virtual time v(t) credits each completed iteration with INTERVAL
units of progress, so after two SP-CNN iterations v(t) = 8.]
Reduced Memory Transfers

If the virtual convergence time does not increase significantly in the SP-CNN
case, we can significantly reduce the number of memory transfers.
[Figure: under ideal multiplexing every partition is transferred once per
virtual time unit, i.e., T memory transfers per partition over a virtual
convergence time of T; under SP-CNN (INTERVAL = 4) each partition is
transferred once per iteration, i.e., T2/INTERVAL transfers over its virtual
convergence time T2.]
If T ≅ T2, then the number of memory transfers is reduced by a factor of
roughly INTERVAL.
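As a purely hypothetical illustration using the interval from the Methodology
slide (INTERVAL = 128): if the ideal case converged in T = 1280 virtual time
units and SP-CNN converged in a comparable T2 ≈ 1280, each partition would be
transferred about 1280 times under ideal multiplexing but only about
1280 / 128 = 10 times under SP-CNN.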
Increased MLP

By increasing computation time, we can take advantage of
parallel computation within the iteration.
[Timeline with multiple CNN-P units (CNN0, CNN1, CNN2): while one unit
computes on its partition for INTERVAL = 4 time units, the next unit's reads
of its partition's input and state can proceed in parallel; increased
memory-level parallelism due to the lengthier computation.]
SP-CNN Architecture
Host Processor: sends work to the SP-CNN.

Scheduler: assigns partitions to the CNN-P units.

Global memory: stores the input and the CNN state.

CNN-P: a CNN processing unit; processes a partition.
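A toy sketch of the scheduler's role, handing the partitions of one iteration
out to the available CNN-P units (the threading model and all names are our
own illustration, not the hardware design):

from concurrent.futures import ThreadPoolExecutor

def schedule_iteration(partitions, num_cnn_p, process_partition):
    # Within an iteration the partitions are independent, so several CNN-P
    # units can work on them concurrently; this is the source of the
    # memory-level parallelism discussed earlier.
    with ThreadPoolExecutor(max_workers=num_cnn_p) as pool:
        list(pool.map(process_partition, partitions))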
Methodology



6 benchmarks with 10 input images. The test images are 1024x1024 (~720p) or
2048x2048 (~1440p).

  Name     Mechanism
  Ideal    CNN array size equal to the image size
  SP-CNN   Proposed architecture

  Application           Type
  Corner Detection      Local*
  Edge Detection        Local*
  Connected Component   Global
  Hole Filling          Global
  Rotation Detector     Global
  Shadow Creator        Global

*Local applications are essentially simple convolution operations that only
depend on neighborhood values.
Simulators

Functional simulator specifications:
  CNN array size = 128x128
  Interval time = 128 CNN time units

Timing simulator (DRAMSim2 for memory):
  CNN timing parameters based on the VAE architecture* (200 MHz @ 130 nm)
  1 processing element per 32 cells
  Each CNN-P unit has 64 MSHR entries (based on NVIDIA Fermi)
  2 GB DRAM with DDR3 timing parameters

*VAE is an existing hardware CNN chip implementation [Lee et al. 2008]
Virtual Convergence Results
[Bar chart: virtual convergence time (CNN time units, 0-4000) for Ideal and
SP-CNN at 1024x1024 and 2048x2048 across Crn-Detect, Edge-Detect, Conn-Comp,
Hole-Fill, Rot-Detect, and Shadow.]
The virtual convergence time of SP-CNN is comparable to the ideal case for
both image sizes.
Timing Results for 1024x1024
[Bar chart: execution time (ms, 0-128) for CNN-P = 1, 2, 4, and 8 across the
six benchmarks, with the 30 FPS and 60 FPS boundaries marked.]

We meet 60 FPS for most applications when CNN-P = 1.

We meet 60 FPS for all applications when CNN-P = 8.
Timing Results for 2048x2048
[Bar chart: execution time (ms, 0-128, with off-scale bars of 375 ms and
195 ms) for CNN-P = 1, 2, 4, and 8 across the six benchmarks, with the 30 FPS
and 60 FPS boundaries marked.]

We meet 30 FPS for most applications when CNN-P = 8.

60 FPS is difficult to meet even with CNN-P = 8.
Future Work

Try more computationally-intensive CNN applications
  Video processing, face recognition, pattern recognition, etc.

FPGA implementation of SP-CNN
  Will allow us to perform better performance and power comparisons against
  CPUs, GPUs, etc.

Further SP-CNN optimizations
  Determine optimal interval values
  Boundary condition propagation time
  Determine whether a CNN gene is suitable for SP-CNN
Conclusion


Our proposed SP-CNN architecture brings scalability to small hardware CNN
arrays.

SP-CNN shows performance comparable to the ideal case in terms of virtual
convergence time.
  Energy consumption is around 10 to 30 mJ.

SP-CNN can meet the 30 FPS and 60 FPS standards for most benchmarks; this can
be further improved with better scaling of the CNN technology.
Questions?
Digital CNN Operation
$$x_{ij}(t+1) = \sum_{C(k,l)\in N_r(i,j)} a_{kl}\, y_{kl}(t) + \sum_{C(k,l)\in N_r(i,j)} b_{kl}\, u_{kl} + z$$

with the hole-filling templates

$$A = \begin{bmatrix} 0.0 & 1.0 & 0.0 \\ 1.0 & 2.0 & 1.0 \\ 0.0 & 1.0 & 0.0 \end{bmatrix}, \quad B = \begin{bmatrix} 0.0 & 0.0 & 0.0 \\ 0.0 & 4.0 & 0.0 \\ 0.0 & 0.0 & 0.0 \end{bmatrix}, \quad z = -1.0$$

[Animated example: the update is evaluated for one cell at a time from its
3x3 neighborhood of outputs y(t) and inputs u.]

The basis of the CNN operation is that we eventually
converge to the correct solution.
However, there are no guarantees on the convergence time
and different inputs can converge at different rates.
Output Error of Running the Hole Filling Gene on a CNN for
Various 1024x1024 Test Images
[Plot: output error (0-100%) versus CNN time units (roughly 1-800) for the
test images capitalA, circle, eight, filledSquare, lowerA, nine, rect, seven,
vertLine, and zero; different images converge at noticeably different rates.]
Slow Propagation

Updated information in boundary conditions is seen at the beginning
of the next iteration
[Diagram: with slow propagation, partition 0 computes y(1) from y(0), but its
updated output only becomes part of partition 1's boundary at the start of
the next iteration.]
The SP-CNN algorithm can be viewed below:
while change do
    change = false
    for p in partitions do
        loadStateAndInputToCNN(p)
        for n = 0; n < interval; n++, t++ do
            change |= parallel computation of state
            parallel computation of output
        end for
        saveToNextStateFromCNN(p)
    end for
    swapStateAndNextStatePointers()
    iter += 1, vt += interval
end while

Fast propagation

Updated information in boundary conditions is seen at the beginning
of the next interval
[Diagram: with fast propagation, partition 0's updated output y(1) is already
visible in partition 1's boundary when partition 1 runs later in the same
iteration.]

Using fast propagation, the partition ordering can improve the program
convergence time.

  For example, if data is propagated through the cells from right to left,
  then we can converge faster by moving through partitions in reverse
  column-major order.

Partition ordering only matters if fast propagation is used.
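A small illustrative helper for the partition visiting order (our own sketch;
"reverse column-major" here means walking the columns of partitions from
right to left):

def partition_order(num_rows, num_cols, reverse_column_major=False):
    # Default: row-major order over the grid of partitions.
    # Reverse column-major visits the right-most column of partitions first,
    # which helps when information flows right-to-left under fast propagation.
    if reverse_column_major:
        return [(r, c) for c in reversed(range(num_cols)) for r in range(num_rows)]
    return [(r, c) for r in range(num_rows) for c in range(num_cols)]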
[Bar chart: average speedup in total convergence time from fast propagation
(1024x1024 and 2048x2048), y-axis 0 to 1.5, across Crn-Detect, Edge-Detect,
Conn-Comp, Hole-Fill, Rot-Detect, and Shadow.]
For Connected Component, Hole Filling, and Rotation Detector,
fast propagation can provide an average speedup from 15% to
30%.
Shadow Creator shows no benefit since the information
propagation in that benchmark is from right-to-left.
[Bar chart: average speedup in total convergence time when using a reverse
row-major partition order (1024x1024 and 2048x2048), y-axis 0 to 2.]
With Shadow Creator, using reverse row-major order provides average speedups
of 13% and 30%, respectively, for the two image sizes.

Virtual Convergence Time

Originally: virtualConvTime = n * T, where n is the number of iterations and
T is the interval length.

Now:

$$virtualConvTime = \sum_{i=1}^{n} \min\Big(T,\; \max_{p \in Partitions} convTime(p, i)\Big)$$

where convTime(p, i) is the convergence time of partition p during iteration i.
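The refined definition maps directly onto a small helper (illustrative only;
conv_times[i][p] is assumed to hold convTime(p, i)):

def virtual_conv_time(conv_times, T):
    # Sum, over iterations, of min(T, slowest partition's convergence time).
    return sum(min(T, max(per_partition)) for per_partition in conv_times)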
[Bar chart: average and maximum percent error (0-100%) of the naïve No-Share
and Share mechanisms across the six benchmarks.]
Both naïve mechanisms show errors of around 20% on average, with maximum
errors as high as 90% for global applications.

Local-type applications only show error under the No-Share mechanism.

SP-CNN converged to the correct solution for all benchmarks and tests.
[Bar chart: average execution-time speedup (0-6x) for CNN-P = 2, 4, and 8 at
1024x1024 and 2048x2048 across the six benchmarks.]
For the global applications, memory bandwidth contention
decreases the ability to achieve linear scaling.
Shadow shows very poor scaling since many of its partitions converge quickly,
so the CNN occupation time is low.
[Bar chart: average execution-time speedup (0-2x) from prefetching with 1, 2,
4, and 8 CNN-P units across the six benchmarks.]
Prefetching only provides some benefit when the number of CNN-P units is 1.

When the number of CNN-P units scales up, the memory bandwidth contention
becomes too large, and prefetching can actually cause slowdowns.

We also evaluated our SP-CNN mechanisms against CPU and GPU implementations
of the benchmarks.

  For two benchmarks, Hole Filling and Rotation Detector, we could not easily
  develop a CPU/GPU algorithm, so we collect their timing results by
  emulating the CNN's operation on the corresponding platform.
  Name     Model                Power                              Frequency   Technology
  CPU      Intel Core i5-3550   77 W (TDP)                         3.3 GHz     22 nm
  GPU      NVIDIA K1 (mobile)   1.5 W                              900 MHz     28 nm
  SP-CNN                        0.73 mW per PE, 35.6 uW per node   200 MHz     45 nm
[Bar chart: execution time (ms, 0-96) for CPU, GPU, SP-CNN (CNN-P = 1 and 4),
and Ideal-Mul (CNN-P = 1 and 4) across the six benchmarks, with the 30 FPS
and 60 FPS boundaries marked; the CPU/GPU bars for the emulated Hole-Fill*
and Rot-Detect* are far off-scale (CPU roughly 2923 ms and 3004 ms, GPU
roughly 1006 ms and 1038 ms).]
For simple global applications like Connected Component and Shadow, the
CPU/GPU versions perform better than the SP-CNN versions.
However, emulation of a CNN is prohibitively slow, and for those applications
SP-CNN provides much better performance.
[Bar chart: energy (mJ, 0-60) for CPU, GPU, and SP-CNN (CNN-P = 1 and 4)
across the six benchmarks.]
For Connected Component and Shadow, execution time dominates over the
low power consumption of SP-CNN, causing overall energy to be larger than
the GPU case.
Hole-Filling and Rotation Detector do show cases where the SP-CNN can
perform complex tasks at a low energy cost.
Optimization: Early-Finish


Originally, each partition ran on the CNN for the entire interval time T.

We saw cases where a partition converged on the CNN before the interval
finished; the rest of the execution essentially does nothing.

So we introduced Early-Finish, where a partition runs on the CNN until it
either converges or the interval time T is reached.
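A sketch of the Early-Finish inner loop, reusing digital_cnn_step from the
earlier slides (tile and boundary handling simplified; the names are our own):

import numpy as np

def run_partition_early_finish(x_tile, u_tile, A, B, z, interval):
    # Run at most `interval` steps, but stop as soon as the state stops changing.
    for _ in range(interval):
        x_next = digital_cnn_step(x_tile, u_tile, A, B, z)
        if np.array_equal(x_next, x_tile):
            break                       # converged before the interval ended
        x_tile = x_next
    return x_tile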