SP-CNN: A Scalable and Programmable CNN-based Accelerator
Dilan Manatunga, Dr. Hyesoon Kim, Dr. Saibal Mukhopadhyay

Motivation
• Power is a first-order design constraint, especially for embedded devices.
• Certain applications are prevalent in embedded devices: image processing, audio processing, context awareness, etc.
• There is a trend of including more specialized, energy-efficient accelerators (e.g., the Moto X, 2013).

Cellular Neural Networks
• We investigated using Cellular Neural Networks (CNNs) as a specialized accelerator.
• CNNs are used for a variety of applications, particularly image processing.
• Hardware implementations are known to consume very little power; Lee et al. (2008) implemented a CNN chip that consumed 84 mW at 130 nm.
• Issue: the processing capabilities of current hardware Cellular Neural Networks have not scaled with the growth in image sizes.

Outline
• Cellular Neural Network (CNN) Background
• Multiplexing
• SP-CNN Architecture
• Results

Neural Networks
• Among the many neural network families (Radial Basis Function Network, Self-Organizing Map, Adaptive Resonance Theory, Pulse-Coupled Neural Network, Artificial Neural Network, Hopfield Neural Network, Convolutional Neural Network, Recurrent Neural Network, Spiking Neural Network), our focus is the Cellular Neural Network (CNN).

CNN Background
• Introduced by Leon Chua and Lin Yang in 1988.
• Characterized by a spatial arrangement of locally-coupled cells; the set of cells a cell is coupled with is known as the cell's neighborhood.
• A typical Cellular Neural Network consists of an MxN array of cells C(i,j).

CNN Applications
• Image processing: edge detection, image segmentation, movement detection, etc.
• Pattern recognition
• Associative memory
• Solving partial differential equations
• Pre-processing data for other neural networks
• The CNN Universal Machine is Turing complete.

CNN Cell
• Each cell is a dynamical system with an input, an output, and a state that evolves according to some prescribed laws. Most CNNs follow the standard CNN state and output equations:

  dx_{ij}(t)/dt = -x_{ij}(t) + \sum_{C(k,l) \in N_r(i,j)} a_{kl} \, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl} \, u_{kl} + z

  y_{ij}(t) = \frac{1}{2} ( |x_{ij}(t) + 1| - |x_{ij}(t) - 1| )

• N_r(i,j) is the neighborhood of the cell at location (i,j); x is the cell state, y is the cell output, u is the cell input, z is the threshold value, and a and b are the feedback and feed-forward weights.

Digital CNN Cell
• For our work, we focused on digital CNN implementations. In these implementations, the state equation simplifies to:

  x_{ij}(t+1) = \sum_{C(k,l) \in N_r(i,j)} a_{kl} \, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl} \, u_{kl} + z

• [Block diagram: neighborhood outputs weighted by A and neighborhood inputs weighted by B are summed with the threshold z to produce the state x_ij, which in turn produces the output y_ij.]

CNN Gene
• The programmable unit of a CNN is known as the CNN gene. The CNN gene specifies the threshold value z and the a and b coefficients of the CNN state equation.
• Furthermore, the programmer specifies the initial state and the boundary conditions for the CNN.
• Hole-Filling gene:

        | 0.0 1.0 0.0 |        | 0.0 0.0 0.0 |
    A = | 1.0 2.0 1.0 | ,  B = | 0.0 4.0 0.0 | ,  z = -1.0
        | 0.0 1.0 0.0 |        | 0.0 0.0 0.0 |

    Initial state: x_{ij}(0) = 1
    Boundary condition: y_{kl} = 0, u_{kl} = 0

Example Application
• The following is an example of running the Hole-Filling gene, an operation commonly used in character recognition, on an input image of the number eight.

Hardware CNNs
• Different companies and research institutions have implemented CNNs in hardware:

  Name                     Array Size   Technology Node
  VAE                      80x60        130 nm
  Flak+                    4x4          180 nm
  SCAMP                    21x21        600 nm
  4-layer Gabor Filter     32x64        250 nm
  ACE16k                   128x128      350 nm
  QCIF                     176x144      150 nm
  Kingetset+ (est. size)   128x128      240 nm

• The array sizes for most implementations are typically small.
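To make the digital CNN state update and the gene format above concrete, here is a minimal Python/NumPy sketch (not from the slides; the function names and the small test input are illustrative). It applies one synchronous update of the Hole-Filling gene, assuming the standard saturating output y = 0.5(|x+1| - |x-1|), the zero boundary condition given above, and the usual bipolar (+1/-1) image encoding.

```python
import numpy as np

# Hole-Filling gene from the slides: feedback template A, feed-forward template B, threshold z.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 0.0]])
B = np.array([[0.0, 0.0, 0.0],
              [0.0, 4.0, 0.0],
              [0.0, 0.0, 0.0]])
z = -1.0

def output(x):
    """Standard CNN output nonlinearity: y = 0.5 * (|x + 1| - |x - 1|), i.e. x clipped to [-1, 1]."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def digital_cnn_step(x, u, A, B, z):
    """One synchronous digital CNN update over the 3x3 neighborhood:
    x_ij(t+1) = sum_kl a_kl * y_kl(t) + sum_kl b_kl * u_kl + z,
    with the zero boundary condition (y = u = 0 outside the array)."""
    y = output(x)
    yp = np.pad(y, 1)              # boundary condition: y_kl = 0 outside the array
    up = np.pad(u, 1)              # boundary condition: u_kl = 0 outside the array
    x_next = np.full_like(x, z)
    for di in range(3):
        for dj in range(3):
            # Neighbor at offset (di-1, dj-1) for every cell, weighted by the template entry.
            x_next += A[di, dj] * yp[di:di + x.shape[0], dj:dj + x.shape[1]]
            x_next += B[di, dj] * up[di:di + u.shape[0], dj:dj + u.shape[1]]
    return x_next

# Hole-Filling setup from the slides: initial state x_ij(0) = 1; u is the bipolar input image.
u = -np.ones((8, 8)); u[2:6, 2:6] = 1.0; u[3:5, 3:5] = -1.0   # hypothetical hollow-square input
x = np.ones_like(u)
x = digital_cnn_step(x, u, A, B, z)
print(output(x))
```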
Issue: Image Dimensions vs. CNN Array Dimensions
• Generally, most applications assume the CNN array's dimensions match the input (image) dimensions. But what if the CNN array is smaller?
• Option 1: Shrink the image. Issue: the image-size to CNN-size ratio. For example, a 1024x1024 image on a 128x128 CNN is a 64:1 ratio!
• Option 2: Emulate the ideal CNN through multiplexing.

Ideal Multiplexing
• With ideal multiplexing, we transfer and run each partition on the CNN array for 1 CNN unit of time.
• [Figure: partitions 0-3 are each loaded onto the CNN array in turn, advancing the state from T to T+1.]
• Heavily memory bound: the number of input/state data transfers is proportional to the convergence time.
• [Timeline: read P0's input u and state x(0), compute for 1 CNN time unit, write P0's x(1), then read P1's input, and so on.]
• There is too little work to utilize memory-level parallelism (MLP). What if we add another CNN array to take advantage of the independence between computations?

Key Insight
• Let's take advantage of the CNN's ability to still converge to the correct solution in the presence of small errors.
• So, instead of running a partition on the CNN array for 1 unit of time, run the partition for an interval T >> 1 unit of time.
• This is the key insight behind our Scalable and Programmable CNN (SP-CNN) architecture.

SP-CNN Multiplexing
• With SP-CNN multiplexing, we run each partition on the CNN array for a certain interval of time.
• [Timeline: instead of one CNN time unit of computation per transfer, each partition is read once, computed for INTERVAL (e.g., 4) CNN time units, and written back; one pass over all partitions forms one iteration.]

Boundary Conditions
• When running a partition on the CNN array, what should we set the boundary conditions to?
• To ensure that information propagates between partitions, we set the boundary condition for a given partition to the data of its neighboring partitions' cells. Where a partition lies on the image edge, the true boundary condition for the image is used.
• [Figure legend: partition 0 pixels, partition 1 pixels, partition 0 boundary pixels made of partition 1 data, and partition 0 boundary pixels that are also part of the true image boundary.]

Boundary Conditions Example
• [Figure: partitions P0-P3 at time T1 are run on the CNN array one after another, each for INTERVAL time units, producing the next state at T2 = T1 + 4*INTERVAL; the process then repeats toward T3 = T2 + 4*INTERVAL, with the true boundary condition applied at the image edges.]

Total Time and Virtual Time
• To compare the convergence of the Ideal-CNN against SP-CNN, we introduce the concept of virtual time.
• [Figure: for the Ideal-CNN, virtual time equals elapsed CNN time; for SP-CNN with INTERVAL=4, each iteration over all partitions advances virtual time by one interval, so v(t)=4 after the first iteration and v(t)=8 after the second.]

Reduced Memory Transfers
• If the virtual convergence time does not significantly increase in the SP-CNN case, we can significantly reduce the number of memory transfers.
• [Figure: the Ideal-CNN performs T memory transfers per partition to reach virtual convergence time T, while SP-CNN with INTERVAL=4 performs T2/INTERVAL transfers to reach virtual convergence time T2.]
• If T ≅ T2, then memory transfers are reduced by a factor of about INTERVAL.

Increased MLP
• By increasing the computation time, we can take advantage of parallel computation within an iteration.
• [Timeline: with multiple CNN arrays (CNN0, CNN1, CNN2), one array's INTERVAL-long computation on partition P0 overlaps with the reads of partitions P1 and P2 for the other arrays.]
• Memory-level parallelism increases because of the lengthier computation.
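As a concrete illustration of the boundary-condition scheme above (not from the slides; the helper name and the partition indexing are assumptions), the sketch below extracts one partition from the global state together with a one-cell-wide border: where a neighboring partition exists, the border is filled with that neighbor's current data, and where the partition touches the image edge, the true boundary value is used.

```python
import numpy as np

def partition_with_border(state, pi, pj, P, true_boundary=0.0):
    """Return the P x P partition at partition coordinates (pi, pj), padded with a
    one-cell border taken from neighboring partitions' data, or from the true
    boundary condition where the partition lies on the image edge."""
    H, W = state.shape
    r0, c0 = pi * P, pj * P                        # top-left corner of the partition
    padded = np.full((P + 2, P + 2), true_boundary)
    # Copy the partition plus whatever neighbor cells actually exist inside the image.
    src_r0, src_r1 = max(r0 - 1, 0), min(r0 + P + 1, H)
    src_c0, src_c1 = max(c0 - 1, 0), min(c0 + P + 1, W)
    dst_r0, dst_c0 = src_r0 - (r0 - 1), src_c0 - (c0 - 1)
    padded[dst_r0:dst_r0 + (src_r1 - src_r0),
           dst_c0:dst_c0 + (src_c1 - src_c0)] = state[src_r0:src_r1, src_c0:src_c1]
    return padded

# Example: a 1024x1024 global state split into 128x128 partitions (as in the evaluation).
state = np.zeros((1024, 1024))
tile = partition_with_border(state, pi=0, pj=1, P=128)   # top border uses the true boundary
assert tile.shape == (130, 130)
```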
SP-CNN Architecture
• Host Processor: sends work to the SP-CNN.
• Scheduler: assigns partitions to the CNN-P units.
• Global memory: stores the input and the CNN state.
• CNN-P: a CNN processing unit; processes one partition at a time.

Methodology
• 6 benchmarks with 10 input images. The test images are of size 1024x1024 (~720p) or 2048x2048 (~1440p).

  Name     Mechanism
  Ideal    CNN array size equal to the image size
  SP-CNN   Proposed architecture

  Application          Type
  Corner Detection     Local*
  Edge Detection       Local*
  Connected Component  Global
  Hole Filling         Global
  Rotation Detector    Global
  Shadow Creator       Global

• *Local applications are essentially simple convolution operations that only depend on neighborhood values.

Simulators
• Functional simulator. Specifications: CNN array size = 128x128, interval time = 128 CNN time units.
• Timing simulator (DRAMSim2 for memory):
  • CNN timing parameters based on the VAE architecture* (200 MHz @ 130 nm), with 1 processing element per 32 cells.
  • Each CNN-P unit has 64 MSHR entries (based on the NVIDIA Fermi).
  • 2 GB DRAM with DDR3 timing parameters.
• *VAE is an existing hardware CNN chip implementation [Lee et al. 2008].

Virtual Convergence Results
• [Chart: virtual convergence time (CNN time units) for Ideal and SP-CNN at 1024x1024 and 2048x2048 across Crn-Detect, Edge-Detect, Conn-Comp, Hole-Fill, Rot-Detect, and Shadow.]
• The virtual convergence time of SP-CNN is comparable to the ideal case for both image dimensions.

Timing Results for 1024x1024
• [Chart: execution time (ms) for CNN-P = 1, 2, 4, 8 against the 30 FPS and 60 FPS boundaries.]
• We meet 60 FPS for most applications when CNN-P = 1, and for all applications when CNN-P = 8.

Timing Results for 2048x2048
• [Chart: execution time (ms) for CNN-P = 1, 2, 4, 8 against the 30 FPS and 60 FPS boundaries; two bars exceed the axis at 375 ms and 195 ms.]
• We meet 30 FPS for most applications when CNN-P = 8; 60 FPS is difficult to meet even with CNN-P = 8.

Future Work
• Try more computationally intensive CNN applications: video processing, face recognition, pattern recognition, etc.
• FPGA implementation of SP-CNN, which will allow us to perform better performance and power comparisons against CPUs, GPUs, etc.
• Further SP-CNN optimizations: determine optimal interval values, boundary-condition propagation time, and whether a given CNN gene is suitable for SP-CNN.

Conclusion
• Our proposed SP-CNN architecture brings scalability to small hardware CNN arrays.
• SP-CNN shows performance comparable to the ideal case in terms of virtual convergence time.
• Energy consumption is around 10 to 30 mJ.
• SP-CNN can meet the 30 FPS and 60 FPS standards for most benchmarks; this can be further improved with better scaling of the CNN technology.

Questions?

Digital CNN Operation
• The digital CNN state update

  x_{ij}(t+1) = \sum_{C(k,l) \in N_r(i,j)} a_{kl} \, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} b_{kl} \, u_{kl} + z

  is stepped through here with the Hole-Filling gene (A, B, and z = -1.0 as given earlier) on example cell values (e.g., y(t) = 1.0, u = -1.0).
• The basis of CNN operation is that we eventually converge to the correct solution. However, there are no guarantees on the convergence time, and different inputs can converge at different rates.
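Because convergence time is input-dependent, a functional model typically just iterates the digital update until the output stops changing. Below is a minimal sketch of such a convergence loop (not the simulator used in the paper; the function names, the stopping rule, and the iteration cap are assumptions), using SciPy's convolve2d with flipped templates so the templates are applied as correlations and the zero boundary condition is respected.

```python
import numpy as np
from scipy.signal import convolve2d

def clip(x):
    return np.clip(x, -1.0, 1.0)   # equivalent to 0.5 * (|x + 1| - |x - 1|)

def corr(img, K):
    """Apply a CNN template as a correlation with zero boundary values."""
    return convolve2d(img, K[::-1, ::-1], mode="same", boundary="fill", fillvalue=0.0)

def run_to_convergence(u, A, B, z, x0, max_steps=10000):
    """Iterate the digital CNN update on the full array until the output stops
    changing; return the final output and the convergence time in CNN time units."""
    x = x0.copy()
    Bu = corr(u, B)                # input term is constant across time steps
    y_prev = clip(x)
    for t in range(1, max_steps + 1):
        x = corr(y_prev, A) + Bu + z
        y = clip(x)
        if np.array_equal(y, y_prev):
            return y, t
        y_prev = y
    return y_prev, max_steps

# Example: the Hole-Filling gene on a hypothetical bipolar test image.
A = np.array([[0., 1., 0.], [1., 2., 1.], [0., 1., 0.]])
B = np.array([[0., 0., 0.], [0., 4., 0.], [0., 0., 0.]])
u = -np.ones((64, 64)); u[16:48, 16:48] = 1.0; u[24:40, 24:40] = -1.0
y_final, t_conv = run_to_convergence(u, A, B, z=-1.0, x0=np.ones_like(u))
```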
Output Error of Running the Hole-Filling Gene for Various 1024x1024 Test Images
• [Chart: percent output error vs. CNN time for ten test images (capitalA, circle, eight, filledSquare, lowerA, nine, rect, seven, vertLine, zero); different inputs converge at different rates.]

Slow Propagation
• Updated information in the boundary conditions is seen at the beginning of the next iteration.
• [Figure: partition 0's new output only reaches partition 1's boundary when partition 1 is loaded in the following iteration.]

The SP-CNN algorithm can be viewed below (change is initialized to true):

  while change do
      change = false
      for p in partitions do
          loadStateAndInputToCNN(p)
          for n = 0; n < interval; n++, t++ do
              change |= parallel computation of state
              parallel computation of output
          end for
          saveToNextStateFromCNN(p)
      end for
      swapStateAndNextStatePointers()
      iter += 1, vt += interval
  end while

Fast Propagation
• Updated information in the boundary conditions is seen at the beginning of the next interval.
• [Figure: partition 0's new output is visible to partition 1 within the same iteration, as soon as partition 1 is loaded.]
• Using fast propagation, the partition order can improve the program's convergence time. For example, if data is passed through the cells from right to left, then we can converge faster by moving through the partitions in reverse column-major order.
• Partition ordering only matters if fast propagation is used.

• [Chart: average speedup in total convergence time from fast propagation for 1024x1024 and 2048x2048 images across the six benchmarks.]
• For Connected Component, Hole Filling, and Rotation Detector, fast propagation provides an average speedup of 15% to 30%. Shadow Creator shows no benefit since the information propagation in that benchmark is from right to left.

• [Chart: average speedup in total convergence time for Shadow Creator when using reverse row-major partition order, for 1024x1024 and 2048x2048 images.]
• With Shadow Creator, using reverse row-major order provides average speedups of 13% and 30% for the 1024x1024 and 2048x2048 images, respectively.

Virtual Convergence Time
• Originally: virtualConvTime = n \cdot T, where n is the number of iterations and T is the interval.
• Now:

  virtualConvTime = \sum_{i=1}^{n} \min( T, \max_{p \in Partitions} convTime(p, i) )

  where convTime(p, i) is the convergence time for partition p during iteration i.

• [Chart: average and maximum percent error for the naïve No-Share and Share mechanisms across the six benchmarks.]
• Both naïve mechanisms show errors of 20% on average, with maximum errors as high as 90% for the global applications. Local-type applications only show error with the No-Share mechanism. SP-CNN converged correctly to the solution for all benchmarks and tests.

• [Chart: average execution-time speedup for CNN-P = 2, 4, and 8 for the 1024x1024 and 2048x2048 images.]
• For the global applications, memory bandwidth contention limits the ability to achieve linear scaling. Shadow shows very poor scaling since many of its partitions converge quickly, so the CNN occupation time is low.

• [Chart: average execution-time speedup from prefetching as the number of CNN-Ps scales from 1 to 8.]
• Prefetching only provides some benefit when the number of CNN-Ps is 1. When the number of CNN-Ps scales up, the memory bandwidth contention becomes too large, and prefetching can actually cause slowdowns.
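The sketch below is a runnable rendering of the SP-CNN pseudocode above (a sketch under assumptions, not the authors' implementation): it reuses the digital_cnn_step and partition_with_border helpers from the earlier sketches, emulates the CNN array in software, and adds a max_iters safety cap that the pseudocode does not have. Because every partition's boundary is taken from the global state at the start of the iteration, this corresponds to the slow-propagation variant.

```python
import numpy as np
# Reuses digital_cnn_step() and partition_with_border() from the earlier sketches.

def sp_cnn(u, x0, A, B, z, P=128, interval=128, true_boundary=0.0, max_iters=1000):
    """SP-CNN multiplexing loop: each iteration runs every P x P partition on the
    (emulated) CNN array for `interval` CNN time units, with the partition border
    fixed to neighboring partitions' data from the current global state.
    Returns the final output and the virtual convergence time."""
    H, W = u.shape
    state, next_state = x0.copy(), x0.copy()
    vt, it, change = 0, 0, True
    while change and it < max_iters:
        change = False
        for pi in range(H // P):
            for pj in range(W // P):
                # loadStateAndInputToCNN(p): tile plus one-cell border from neighbors.
                xs = partition_with_border(state, pi, pj, P, true_boundary)
                us = partition_with_border(u,     pi, pj, P, true_boundary)
                for _ in range(interval):
                    x_new = digital_cnn_step(xs, us, A, B, z)      # parallel state update
                    if not np.array_equal(x_new[1:-1, 1:-1], xs[1:-1, 1:-1]):
                        change = True
                    xs[1:-1, 1:-1] = x_new[1:-1, 1:-1]             # border stays fixed
                # saveToNextStateFromCNN(p)
                next_state[pi * P:(pi + 1) * P, pj * P:(pj + 1) * P] = xs[1:-1, 1:-1]
        state, next_state = next_state, state                      # swap state pointers
        it, vt = it + 1, vt + interval
    return 0.5 * (np.abs(state + 1.0) - np.abs(state - 1.0)), vt
```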
We also evaluated our SP-CNN mechanisms against CPU and GPU implementations of the benchmarks. For two benchmarks, Hole Filling and Rotation Detector, we could not easily develop a CPU/GPU algorithm, so we collected their timing results by emulating the CNN's operation on the corresponding platform.

  Name     Model                Power                              Frequency   Technology
  CPU      Intel Core i5-3550   77 W (TDP)                         3.3 GHz     22 nm
  GPU      NVIDIA K1 mobile     1.5 W                              900 MHz     28 nm
  SP-CNN   --                   0.73 mW per PE, 35.6 uW per node   200 MHz     45 nm

• [Chart: execution time (ms) for CPU, GPU, SP-CNN (CNN-P = 1 and 4), and Ideal-Mul (CNN-P = 1 and 4) against the 30 FPS and 60 FPS boundaries; the emulated Hole-Fill* and Rot-Detect* bars are off scale for the CPU (~2923 ms and ~3004 ms) and the GPU (~1006 ms and ~1038 ms).]
• For simple global applications like Connected Component and Shadow, the CPU/GPU versions perform better than the SP-CNN versions. However, emulation of a CNN is prohibitively slow, and for those applications SP-CNN provides much better performance.

• [Chart: energy (mJ) for CPU, GPU, SP-CNN (CNN-P = 1), and SP-CNN (CNN-P = 4) across the six benchmarks.]
• For Connected Component and Shadow, execution time dominates over the low power consumption of SP-CNN, causing the overall energy to be larger than in the GPU case. Hole Filling and Rotation Detector do show cases where SP-CNN can perform complex tasks at a low energy cost.

Optimization: Early-Finish
• Originally, each partition ran on the CNN for the entire interval time T. We saw cases where a partition converged on the CNN before the interval finished, so the rest of the execution essentially did nothing.
• So we introduced Early-Finish, where a partition runs on the CNN until it either converges or the interval time T is reached.
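A short sketch of how Early-Finish could plug into the per-partition inner loop of the sp_cnn sketch above (a hypothetical helper; the slides do not give an implementation): the interval loop simply stops as soon as the partition's state stops changing.

```python
import numpy as np
# Builds on digital_cnn_step() from the earlier sketch.

def run_partition_early_finish(xs, us, A, B, z, interval):
    """Run one partition on the (emulated) CNN for at most `interval` CNN time units,
    stopping early once its state stops changing (the Early-Finish optimization)."""
    changed = False
    for t in range(interval):
        x_new = digital_cnn_step(xs, us, A, B, z)
        if np.array_equal(x_new[1:-1, 1:-1], xs[1:-1, 1:-1]):
            return xs, changed, t              # converged before the interval ended
        xs[1:-1, 1:-1] = x_new[1:-1, 1:-1]
        changed = True
    return xs, changed, interval
```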