http://w3.ibm.com/ibm/presentations

Cell Broadband Engine
Compiler Mediated Performance of a Cell Processor

Alexandre Eichenberger, Kathryn O’Brien, Kevin O’Brien, Peng Wu, Tong Chen, Peter Oden, Daniel Prener, Janice Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, Yaoqing Gao

www.research.ibm.com/cellcompiler/compiler.htm
Haifa Compiler and Architecture Seminar - 2005
© 2002 IBM Corporation
Presented by Kevin O’Brien (slide footer: "Compiler mediated performance for a Cell processor")

Background
- Prototype compiler
- Some functionality shipped in the Alphaworks Cell xlc compiler
- Based on and integrated with the product code base
- Collaboration with the compiler development group
- Many research issues still in progress

Cell Broadband Engine
- Multiprocessor on a chip
  - 241M transistors, 235 mm2
  - 200 GFlops (SP) @ 3.2GHz
  - 200 GB/s internal bus @ 3.2GHz
- Power Processor Element (PPE): general purpose, running full-fledged OSs
- Synergistic Processor Element (SPE): optimized for compute density
Cell Broadband Engine Overview
[Figure: block diagram: 8 SPEs and the PPE (with L1/L2) on the Element Interconnect Bus (96 Bytes/cycle), with connections to external memory and external IO; port widths of 16 Bytes (one dir) per SPE, 128 Bytes (one dir) on the wide ports, 8 Bytes (per dir) elsewhere]
- Heterogeneous, multi-core engine
  - 1 multi-threaded Power processor
  - up to 8 compute-intensive-ISA engines
- Local memories
  - fast access to 256KB local memories
  - globally coherent DMA to transfer data
- Pervasive SIMD
  - PPE has VMX
  - SPEs are SIMD-only engines
- High bandwidth
  - fast internal bus (200GB/s)
  - dual XDR™ controller (25.6GB/s)
  - two configurable interfaces (76.8GB/s)
  - numbers based on a 3.2GHz clock rate

Compiler support for the full spectrum of Cell programming expertise
[Figure: ladder from "highest performance with programmer control" to "highest productivity with fully automatic compiler technology", along the parallelization and SIMD-programs axes]
- Parallelization: hand-tuned for multiple ISAs → automatic tuning for each ISA → explicit parallelization with local memories → shared memory abstraction through single-source compilation → automatic parallelization
- SIMD programs: explicit SIMD coding → SIMD/alignment directives → automatic simdization

Outline
- Part 1: Automatic SPE tuning (automatic tuning for each ISA)
- Part 2: Parallelization and memory abstraction (explicit parallelization with local memories; shared memory, single-program abstraction; automatic parallelization)
- Part 3: Automatic simdization (explicit SIMD coding; SIMD/alignment directives; automatic simdization)
Cell Architecture Overview
[Figure: Cell processor: PPE (with L1/L2) and 8 SPEs on the Element Interconnect Bus (96 Bytes/cycle), with links to external memory and IO]
- PPE executes traditional code
- SPE optimized for high compute throughput
  - small area footprint
  - lower hardware complexity
- SPE tradeoff: peak performance requires compiler assistance

Architecture: Relevant SPE Features
[Figure: SPE block diagram: even pipe (floating/fixed point) and odd pipe (branch, memory, permute); dual-issue instruction logic; instruction buffer (3.5 x 32 instr.); register file (128 x 16-byte registers); local store (256 KByte, single ported); globally-coherent DMA; local-store port labels: branch 1,2; branch hint 1,2; instr. fetch 2; DMA request 3]
- SIMD-only functional units
  - 16-byte register/memory accesses
  - accesses must be parallel & properly aligned
- Dual-issue for instructions
  - full dependence check in hardware
- Simplified branch architecture
  - no hardware branch predictor
  - compiler-managed hints/predication
- Single-ported local memory
  - aligned accesses only
  - contentions alleviated by the compiler
Feature #1: Functional Units are SIMD Only
- All transfers are 16 bytes wide, including the register file and memory
- How to handle scalar code?

Single Instruction Multiple Data (SIMD)
- Meant to process multiple "b[i]+c[i]" data per operation
[Figure: memory streams b0..b10 and c0..c10 partitioned at 16-byte boundaries; registers R1 = (b0,b1,b2,b3) and R2 = (c0,c1,c2,c3); VADD produces R3 = (b0+c0, b1+c1, b2+c2, b3+c3)]

Scalar Code on SIMD Functional Units
- Example: a[2] = b[1] + c[3]
- LOAD b[1] brings in the whole aligned quadword (b0,b1,b2,b3); LOAD c[3] brings in (c0,c1,c2,c3)
- Problem #1: memory alignment defines the data location in the register
- Problem #2: adding the values as loaded yields the wrong result, (b0+c0, b1+c1, b2+c2, b3+c3), not b[1]+c[3]
- Problem #3: a vector STORE a[2] clobbers the neighboring values a[0], a[1], and a[3]
Scalar Load Handling
- Use a read-rotate sequence
  - LOAD b[1] yields (b0,b1,b2,b3); ROTATE by the byte offset of &b[1] yields (b1,b2,b3,b0)
- Overhead (1 op, in blue): one quad-word byte rotate
- Outcome: the desired scalar value is always in the first slot of the register; this addresses Problems 1 & 2

Scalar Store Handling
- Use a read-modify-write sequence
  - LOAD a[2]; compute the result r3; CWD &a[2] forms an insertion mask for &a[2]; SHUFFLE merges the new value into the loaded quadword; STORE a[2]
- Overhead (1 to 3 ops, in blue): one shuffle byte, one mask formation (may be reused), one load (may be reused)
- Outcome: the SIMD store does not clobber memory; this addresses Problem 3

Optimizations for Scalar on SIMD
- The significant overhead of scalar loads/stores can be lowered
- For vectorizable code
  - generate SIMD code directly to fully utilize the SIMD units
  - done by expert programmers or compilers (see the simdization part)
- For scalar variables
  - allocate scalar variables in the first slot, by themselves
  - eliminates the rotate when loading: the data is guaranteed to be in the first slot (Problems 1 & 2)
  - eliminates the read-modify-write when storing: the other data in the 16-byte line is garbage (Problem 3)
  - can also deal with the alignment of scalar data
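The read-rotate and read-modify-write sequences can be mimicked in plain C. This is a minimal sketch, assuming a 4 x 32-bit element view of the 16-byte quadword; `memcpy` stands in for the aligned quadword load/store, and the function names are illustrative, not the compiler's.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 16-byte SPE register modeled as 4 x 32-bit slots. */
typedef struct { int32_t s[4]; } quad;

/* Aligned quadword load: fetches the 16-byte line containing a[idx]. */
static quad qload(const int32_t *a, int idx) {
    quad q;
    memcpy(q.s, a + (idx & ~3), sizeof q);
    return q;
}

/* Read-rotate scalar load: rotate the quadword so that the desired
   element always lands in slot 0 (addresses Problems 1 & 2). */
static int32_t scalar_load(const int32_t *a, int idx) {
    quad q = qload(a, idx);
    quad r;
    for (int i = 0; i < 4; i++)
        r.s[i] = q.s[(i + (idx & 3)) & 3];   /* quadword rotate */
    return r.s[0];
}

/* Read-modify-write scalar store: load the line, insert the new value
   (the CWD mask + SHUFFLE on the SPE), store the full quadword back,
   so neighboring values are not clobbered (Problem 3). */
static void scalar_store(int32_t *a, int idx, int32_t v) {
    quad q = qload(a, idx);
    q.s[idx & 3] = v;
    memcpy(a + (idx & ~3), q.s, sizeof q);
}
```

With scalar variables allocated alone in the first slot, `scalar_load` degenerates to a plain load and `scalar_store` to a plain store, which is exactly the optimization the slide describes.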
Feature #2: Software-Assisted Instruction Issue
- The SPE can dual-issue parallel instructions
  - full dependence check in hardware
  - but the code layout constrains dual issuing
- Alleviate the constraints by making the scheduler aware of the code layout

Alleviating Issue Restriction
- Scheduling finds the best possible schedule
  - the dependence graph is modified to account for the latency of false dependences
- Bundling ensures the code layout restrictions are met
  - keep track of the even/odd code layout at all times
  - swap parallel ops when needed
  - insert (even or odd) nops when needed
- Engineering issues
  - each function must start at a known even/odd code layout boundary
  - one cannot add any instructions after the last scheduling phase, as that would impact the code layout and thus the dual-issuing constraints

Feature #3: Single-Ported Local Memory
- The local store is single ported: denser hardware
- Asymmetric port: 16 bytes for load/store ops, 128 bytes for IFETCH/DMA
- Static priority: DMA > MEM > IFETCH
- If we are not careful, we may starve for instructions
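The bundler's nop insertion can be sketched with a toy model. Assumption (simplified from the real SPE fetch rules): an even-pipe op must occupy an even slot and an odd-pipe op an odd slot for a pair to dual-issue; on a parity mismatch the bundler pads with a nop. `bundle` is a hypothetical name.

```c
#include <assert.h>

#define EVEN 0  /* even pipe: fixed/floating point      */
#define ODD  1  /* odd pipe: branch, memory, permute    */
#define NOP  2  /* filler op inserted by the bundler    */

/* Lay out ops so that each op's pipe matches its slot parity,
   inserting nops on a mismatch; returns the number of slots used. */
static int bundle(const int *pipes, int n, int *layout, int cap) {
    int slot = 0;
    for (int i = 0; i < n && slot < cap; i++) {
        if ((slot & 1) != pipes[i])      /* parity mismatch: pad */
            layout[slot++] = NOP;
        if (slot < cap)
            layout[slot++] = pipes[i];
    }
    return slot;
}
```

Because an inserted nop shifts the parity of everything that follows, adding any instruction after bundling invalidates the layout; this is the engineering constraint the slide mentions.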
Instruction Starvation Situation
[Figure: dual-issue instruction logic fed by two instruction buffers of FP/MEM op pairs]
- There are 2 instruction buffers, holding up to 64 ops along the fall-through path
- When the first buffer is half-empty, a refill can be initiated
- When the MEM port is continuously used, starvation occurs (no ops left in the buffers)

Instruction Starvation Prevention
- The SPE has an explicit IFETCH op, which initiates an instruction fetch
- The refill must be initiated before it is too late to hide the refill latency
- The scheduler monitors the starvation situation
  - when the MEM port is continuously used, it inserts an IFETCH op within the critical (red) window
  - the IFETCH op is not needed if the memory port is idle for one or more cycles within that window
- Compiler design: the scheduler must keep track of the code layout
Engineering Issues for Dual-Issue & Starvation Prevention
- Initially, the scheduling and bundling phases were separate
  - Code (not sched) → Sched (find the best schedule, using latency, issue, & resource constraints) → Code (sched, not bundled) → Bundle (satisfy the dual-issue & instruction-starvation constraints by adding nops) → Code (sched & bundled)
- Problem: the bundler adds an IFETCH to prevent starvation; a better schedule could have been found if the scheduler had known that, but the schedule is already "finalized"
- Solution: integrate scheduling and bundling tightly, on a cycle-per-cycle basis, so a single phase both finds the best schedule and satisfies the dual-issue & instruction-starvation constraints

Feature #4: Software-Assisted Branch Architecture
- Branch architecture
  - no hardware branch predictor, but
  - compare/select ops for predication
  - a software-managed branch hint, one hint active at a time
- Lowering overhead by
  - predicating small if-then-else sequences
  - hinting predictably taken branches
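Predication with compare/select can be illustrated on a small if-then-else. A sketch: both arms are computed and a select picks the result; the C conditional expression below marks where the compiler would emit a select op rather than a branch.

```c
#include <assert.h>

/* Branchy form: a data-dependent branch on every iteration. */
static int sum_branchy(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > 0) s += a[i];
        else          s -= a[i];
    }
    return s;
}

/* If-converted form: compute both sides, then compare/select. */
static int sum_selected(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        int t = s + a[i];              /* then-value */
        int e = s - a[i];              /* else-value */
        s = (a[i] > 0) ? t : e;        /* compare + select, no taken branch */
    }
    return s;
}
```

The predicated form does a little more arithmetic per iteration, but on a machine with no branch predictor it avoids a mispredict penalty on every data-dependent branch.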
Hinting Branches & Instruction Starvation Prevention
[Figure: instruction buffers plus a HINT buffer; "HINT br, target" fetches ops from the branch target ahead of "BRANCH if true target"; refill latency and the IFETCH window are shown]
- The SPE provides a HINT operation
  - it fetches the branch target into the HINT buffer
  - no penalty for correctly predicted branches
  - it needs a minimum of 15 cycles and 8 intervening ops before the branch
  - the compiler inserts hints when beneficial
- Impact on instruction starvation: after a correctly hinted branch, the IFETCH window is smaller

SPE Optimization Results
[Figure: bar chart of relative reductions in execution time: Original, +Bundle, +Branch Hint, +Ifetch, over benchmarks including Saxpy, MatMult, Convolution, VLD, LU, IDEA, FFT, Linpack, Huffman, and their average]
- Single-SPE performance, optimized, simdized code
- Average relative execution time drops from 1.00 (original) to 0.78 (all optimizations)
Outline
- Part 1: Automatic SPE tuning
- Part 2: Parallelization and memory abstraction
- Part 3: Automatic simdization

Cell Broadband Engine
- Multiprocessor on a chip
  - Power Processor Element (PPE): general purpose, running full-fledged OSs
  - Synergistic Processor Element (SPE): optimized for compute density
- Compiler support for parallelizing across the heterogeneous processing elements

Single Source Compiler
- The user prepares an application as a collection of one or more source files containing user directives targeting the Cell architecture
- The compiler, guided by these directives, takes care of the code partitioning between the PPE and the SPEs
- The compiler takes care of data transfers
  - it identifies accesses in SPE functions that refer to data in system memory locations
  - it uses a software cache or static buffers to manage the transfer of this data to/from the SPE local stores
- The compiler takes care of code size
  - we are exploring extending code partitioning to automatic partitioning based on functionality rather than size
Using OpenMP to partition/parallelize across Cell
- A single-source program contains C or Fortran with user directives or pragmas
- The compiler 'outlines' all code within the pragmas into separate functions compiled for the SPE
- It replaces the 'outlined' code with a call to the parallel runtime, and compiles this code for the PPE
- The master thread continues to execute on the PPE and participates in workshare constructs
- The PPE runtime places outlined functions on a work queue, with information about the number of iterations to execute, or 'chunk' size, for each SPE
- It creates up to 8 SPE threads that pull work items (outlined parallel functions) from the queue and execute them on the SPEs
- The PPE may wait for SPE completion, or proceed with other PPE statement execution

Cloning for heterogeneous processors
[Figure: whole-program call graph; flow-graph and call-graph nodes and edges, split into a "compile for PPE" and a "compile for SPE" region]
- The outlined function becomes a new node in the call graph
- In pass 2 of TPO, using the whole-program call graph, the outlined function is cloned, then specialized to create a PPE and an SPE version
- All called functions must also be cloned
- SPE call sites are modified to call the SPE versions of the cloned subroutines
- The partitioning pass creates SPE and PPE partitions and invokes the lower-level optimizer for machine-specific optimization

"Single Source" Compiler, using OpenMP
- Source:
    Foo1();
    #pragma omp parallel for
    for (i = 0; i < 10000; i++)
      A[i] = x * B[i];
    Foo2();
- Outline the omp region:
    void ol$1() {
      for (i = 0; i < 10000; i++)
        A[i] = x * B[i];
    }
- Compile for PPE:
    Foo1();
    Call omp_rte_init;
    Call omp_rte_do_par;
    Foo2();
- Clone for PPE:
    void OL$1_PPE() {
      for (i = LB; i < UB; i++)
        A[i] = x * B[i];
    }
- Clone for SPE:
    void OL$1_SPE() {
      for (k = LB; k < UB; k += 100) {
        DMA 100 B elements into B´
        for (j = 0; j < 100; j++)
          A´[j] = cache_lookup(x) * B´[j];
        DMA 100 A elements out of A´
      }
    }

Cell Memory & DMA Architecture
[Figure: main memory, QofS/L3, and IO devices (external) attached to the PPE (L1/L2) and to SPEs #1..#8; each SPE has an SPU, an MMU, MFC registers, and a local store that is aliased into the global address space]
- Local stores are mapped in the global address space
- The PPE can access/DMA memory; translation is done by the MMU (with TLBs), which also sets access rights
- An SPE can initiate DMAs to any global address, including the local stores of other SPEs
- Note: all elements may be masters; there are no designated slaves
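The OL$1_SPE clone above can be mimicked in plain C. A minimal sketch under stated assumptions: `memcpy` stands in for the DMA get/put of 100-element chunks, and the shared scalar x is passed directly instead of going through cache_lookup.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 100  /* elements staged per DMA, as on the slide */

/* Strip-mined SPE work function for: A[i] = x * B[i], i in [lb, ub). */
static void ol1_spe(double *A, const double *B, double x, int lb, int ub) {
    double Bl[CHUNK], Al[CHUNK];              /* local-store buffers */
    for (int k = lb; k < ub; k += CHUNK) {
        int len = (ub - k < CHUNK) ? ub - k : CHUNK;
        memcpy(Bl, B + k, len * sizeof *Bl);  /* "DMA B elements in"  */
        for (int j = 0; j < len; j++)
            Al[j] = x * Bl[j];
        memcpy(A + k, Al, len * sizeof *Al);  /* "DMA A elements out" */
    }
}
```

Each SPE thread would be handed its own [lb, ub) chunk of the original 10000-iteration loop by the PPE runtime's work queue.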
Competing for the SPE Local Store
- The local store is fast, but support is needed when it is full
[Figure: SPE local store shared among code, regular data, and irregular data]
- Provided compiler support:
  - SPE code is too large: the compiler partitions the code, and a partition manager pulls in code as needed
  - data with regular accesses is too large: the compiler stages data in & out using static buffering; latencies can be hidden by using double buffering
  - data with irregular accesses is present (e.g. indirection, runtime pointers, ...): use a software-cache approach to pull the data in & out (a last-resort solution)

How can we run programs that have lots of data?
- The data won't all fit in the local store
- Solution: use system memory; it's large and virtual
- But the SPU doesn't have load/store access to it
- The compiler can automatically manage the transfer of data between system memory and the local store
- We call this data partitioning

Data Partitioning
- Single-source assumption: all data lives in system memory
- In a naïve implementation, every load and store requires a DMA operation
  - too costly (~300 cycles per load or store)
  - MP will require locking on every reference
- What can be done to make this acceptable?
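The double buffering mentioned above can be sketched as follows. Assumptions: `memcpy` again stands in for DMA, and since memcpy is synchronous the sketch only illustrates the buffer rotation; on the SPE the get for tile t+1 would be issued asynchronously and waited on just before that tile is used.

```c
#include <assert.h>
#include <string.h>

#define TILE 64

/* Double-buffered staging of dst[i] = f * src[i]; n % TILE == 0 assumed.
   While tile t is computed out of one buffer, tile t+1 has already been
   fetched into the other. */
static void scale(double *dst, const double *src, double f, int n) {
    double buf[2][TILE];
    int nt = n / TILE;
    if (nt == 0) return;
    memcpy(buf[0], src, sizeof buf[0]);             /* prefetch tile 0 */
    for (int t = 0; t < nt; t++) {
        if (t + 1 < nt)                             /* fetch next tile */
            memcpy(buf[(t + 1) & 1], src + (t + 1) * TILE, sizeof buf[0]);
        double *cur = buf[t & 1];                   /* compute current */
        for (int j = 0; j < TILE; j++)
            cur[j] *= f;
        memcpy(dst + t * TILE, cur, sizeof buf[0]); /* put results out */
    }
}
```

With asynchronous DMA, the transfer of tile t+1 overlaps the compute on tile t, hiding the ~300-cycle latency that a per-access DMA would pay every time.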
Reducing the Overhead of Accesses to System Memory
- Rather than executing a DMA transaction for every variable access, we would like to bring data over in larger pieces and keep it in the local store for some portion of its lifetime
- Several techniques can accomplish this
  - Prefetching: for predictable references, and for accumulating writes
  - Software cache: on demand, but leverages spatial and temporal locality; requires additional inline instructions, so it is essentially a fallback strategy
- We can apply tiling to increase reuse for both of these

Software Cache for Irregular Accesses
- Data with irregular accesses (typically) cannot reside permanently in the SPE's local memory, and thus resides in global memory
- When such data is accessed, we must translate the global address into a local store address, and pull the data in/out when it is not already present
- Use a software cache, managed by the SPEs, in the local store
  - generate DMA requests to transfer data to/from global memory
  - use a 4-way set-associative cache to naturally use the SIMD units of the SPE

Software Cache Overview
- A portion of the local store is set aside to hold the cache
- The cache is made up of two arrays
  - Tag array: contains the comparands for the address lookups and pointers to the data lines; it also contains "dirty bits" for MP support
  - Data array: contains the data lines
[Figure: tag array indexed by set, each set holding 4 tags, 4 data pointers, and dirty bits; the data array holds the corresponding lines]
- Geometry
  - currently 4-way set associative
  - line size is 128 bytes, and there are 128 lines in each set
  - these parameters can be changed
  - total size, tags (16K) + data (64K), is 80K

Software Cache Lookup
- The cache lookup code is inserted inline at each read or write reference to a cacheable variable
- This code is optimized along with all the other instructions
- The branch to the miss handler is treated as a regular instruction throughout optimization, and expanded only at the very end of compilation
- The miss handler uses a tailored register convention, so that it has no impact on the hit path
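The lookup itself can be mimicked in C with the geometry above (4 ways, 128-byte lines, 128 sets). A sketch: the SPE's SIMD tag compare becomes a 4-iteration loop, a byte array stands in for global memory with `memcpy` as the DMA, and the real miss handler's victim policy and write-back of dirty lines are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE 128   /* bytes per cache line  */
#define SETS 128   /* lines in each way     */
#define WAYS 4     /* 4-way set associative */

static uint32_t tagarr[SETS][WAYS];        /* tag array (comparands) */
static uint8_t  lines[SETS][WAYS][LINE];   /* data array             */
static uint8_t  valid[SETS][WAYS];
static uint8_t  global_mem[1 << 20];       /* toy system memory      */

/* Translate a global address into a local-store (cache) address,
   "DMAing" the line in on a miss. */
static uint8_t *cache_lookup(uint32_t addr) {
    uint32_t off = addr & (LINE - 1);
    uint32_t set = (addr / LINE) & (SETS - 1);
    uint32_t tag = addr / (LINE * SETS);
    for (int w = 0; w < WAYS; w++)         /* one SIMD compare on the SPE */
        if (valid[set][w] && tagarr[set][w] == tag)
            return &lines[set][w][off];    /* hit path                    */
    int w = 0;                             /* miss: naive victim choice   */
    tagarr[set][w] = tag;
    valid[set][w] = 1;
    memcpy(lines[set][w], global_mem + (addr & ~(uint32_t)(LINE - 1)), LINE);
    return &lines[set][w][off];
}
```

The tag/set/offset split is exactly what the inline sequence computes with its AND/shift operations before the 4-way compare.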
- Inline lookup sequence (register-level pseudo-assembly, one copy):
    20: LI      vr847=0x10203
    20: VLR     *vr846=vr847
    20: VSHUFB  vr848=gr664,gr664,vr846
    20: VLR     *vr845=vr848
    20: VAND    vr844=gr664,vr842
    20: VAND    vr849=vr843,vr845
    20: VLQ     vr850=.L_tagaddr(relative,0)
    20: A       gr851=vr844,vr850
    20: VLQ     vr852=.L_tagarray(gr851,0)
    20: VLQ     vr853=.L_tagarray(gr851,16)
    20: VANDC   vr854=gr664,vr843
    20: VCEQW   vr855=vr849,vr852
    20: VGBB    vr856=vr855
    20: VCNTLZ  vr857=vr856
    20: VQBR    vr858=vr853,vr857,gr1
    20: MISS    *vr858,.L_tagarray=gr664,vr856,vr858
    20: A       gr859=vr854,vr858
    20: VLR     *gr799=gr859
    20: VLQ     vr800=c[]0(gr799,0)
    20: LR      *vr663=vr800

Software Cache Access
[Figure: to translate a global address a1, a subset of the address bits selects the set; the set's tags are extracted & loaded; the address is splatted and compared ('=') against the 4 tags with a SIMD comparison, a hit when successful; the matching data pointer is located and the line offset added to compute the local address]
- Hit latency: ~20 extra cycles

Single Source Compiler Results
[Figure: bar chart of speedup with SPEs for Swim, Mgrid, and some of their kernels (such as psinv, resid, calc1/2/3), comparing softcache-only against optimized data transfer]
- Baseline: execution on one single PPE

Some performance results for specOMP
[Figure: bar chart of speedup over one PU for apsi, ammp, applu, art, equake, mgrid, swim, wupwise, at several SPU counts]
- Speedup with softcache (softcache only; no optimization of data transfer, no SIMD)

Outline
- Part 1: Automatic SPE tuning
- Part 2: Parallelization and memory abstraction
- Part 3: Automatic simdization
A Successful Simdizer
- Extracts parallelism
  - at the loop level: e.g. "for (i=0; i<256; i++) a[i] = ..."
  - at the basic-block level: a[i+0] = ...; a[i+1] = ...; a[i+2] = ...; a[i+3] = ...; a[i+4] = ...
  - for entire short loops: e.g. "for (i=0; i<8; i++) a[i] = ..."
- Satisfies constraints
  - alignment constraints (e.g. vload b[1] and vload b[5], combined by a vpermute to extract b1..b4 across a 16-byte boundary)
  - data size conversion (e.g. load SHORTs, unpack, add, store as INTs)
  - multiple targets: GENERIC, VMX, SPE
  - non-stride-one accesses

Traditional Execution of a Loop
- Sequential execution of "for (i=0; i<64; i++) a[i+2] = b[i+1] + c[i+3]"
- Each iteration: load b[i+1], load c[i+3], add, store a[i+2]
[Figure: the a, b, and c memory streams; grey marks the first iteration]

SIMD Alignment Problem
- SIMD execution of "for (i=0; i<64; i++) a[i+2] = b[i+1] + c[i+3]"
- vload b[1] puts b[1] at offset 4 of the register; vload c[3] puts c[3] at offset 12
- vadd then computes (b0+c0, b1+c1, b2+c2, b3+c3): this is not b[1]+c[3], because the alignments do not match
Loop-Level Simdization, Naïve Way
- SIMD execution of "for (i=0; i<64; i++) a[i+2] = b[i+1] + c[i+3]"
[Figure: the b register stream is stream-shifted left and the c register stream is stream-shifted right so both start at offset 0; the shifted streams are added, producing (b1+c3, b2+c4, b3+c5, b4+c6), ..., which is stored to the a stream]

Solving the Alignment Problem
- Data reorganization graph
  - start from the original expression graph, with an alignment label on each load/store: vload b[i+1] at offset 4, vload c[i+3] at offset 12, vstore a[i+2] at offset 8
  - conflict at the vadd: the reaching alignments are not all identical
- Resolve alignment conflicts by adding "stream-shift" aligning operations
  - stream-shift(4,0) after vload b[i+1] and stream-shift(12,0) after vload c[i+3], so the vadd operates at offset 0
  - stream-shift(0,8) before the vstore a[i+2], which requires offset 8
  - this sets the alignment of all streams to zero [VAST]

Loop-Level Simdization, Optimized Way
- SIMD execution of "for (i=0; i<64; i++) a[i+2] = b[i+1] + c[i+3]"
[Figure: the b and c register streams are stream-shifted directly to the store alignment (offset 8), added, and stored to the a stream]
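The stream-shifts can be emulated in C. A sketch, assuming 4 x 32-bit elements per vector: a stream-shift is realized as a 4-wide window over two consecutive aligned vectors (a permute on the SPE), and the naive policy (shift both inputs to offset 0) is applied to the slide's loop; scalar stores are used here so the partial first and last stores stay out of the way.

```c
#include <assert.h>
#include <string.h>

#define VL 4

typedef struct { int v[VL]; } vec;

/* Aligned vector load of a[base .. base+3]; base % VL == 0. */
static vec vload4(const int *a, int base) {
    vec r;
    memcpy(r.v, a + base, sizeof r);
    return r;
}

/* Core of a stream-shift: the 4-wide window starting at slot s (0..4)
   of two consecutive aligned vectors. */
static vec window(vec lo, vec hi, int s) {
    int both[2 * VL];
    memcpy(both, lo.v, sizeof lo.v);
    memcpy(both + VL, hi.v, sizeof hi.v);
    vec r;
    for (int i = 0; i < VL; i++) r.v[i] = both[s + i];
    return r;
}

/* Naive-policy simdization of: for (i=0;i<64;i++) a[i+2]=b[i+1]+c[i+3];
   b and c must have at least 68 readable elements. */
static void simd_loop(int *a, const int *b, const int *c) {
    for (int i = 0; i < 64; i += VL) {
        vec bs = window(vload4(b, i), vload4(b, i + VL), 1); /* shift(4,0)  */
        vec cs = window(vload4(c, i), vload4(c, i + VL), 3); /* shift(12,0) */
        for (int k = 0; k < VL; k++)
            a[i + 2 + k] = bs.v[k] + cs.v[k];                /* vadd+store  */
    }
}
```

Each `window` call corresponds to one shuffle of two already-loaded vectors, which is why minimizing the number of stream-shifts (the eager and lazy policies that follow) directly reduces the permute count per iteration.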
Optimized Solving of the Alignment Problem, Eager Policy

Data Reorganization Graph: eagerly align to the store alignment. stream-shift(4,8) after vload b[i+1] (offset 4) and stream-shift(12,8) after vload c[i+3] (offset 12) bring both inputs of the vadd directly to offset 8, the alignment of vstore a[i+2]. Setting streams to the store alignment reduces the stream-shift count from 3 to 2 [PLDI04, CGO05].

Optimized Solving of the Alignment Problem, Lazy Policy

Data Reorganization Graph (slightly different "b[i+1] + c[i+1]" example): lazily align to the store alignment. vload b[i+1] and vload c[i+1] are both at offset 4, so the vadd can operate unshifted; a single stream-shift(4,8) then moves the sum to offset 8 for vstore a[i+2]. Lazily setting streams to the store alignment reduces the stream-shift count from 3 to 1 [PLDI04, CGO05].

SIMD Code Generation

SIMD code is generated from a valid data reorganization graph. Code generation for a simdized loop:

  <simdized prologue>
  for (i=0; i<100; i+=4)
      <simdized loop body>
  <simdized epilogue>

The simdized prologue is needed if the store is misaligned; the simdized epilogue depends on the store alignment and the trip count.

Code Generation for Stream-Shift

When you need a stream of vectors starting at address b[1]:
b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 ...
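The lazy policy is legal because lane-wise arithmetic commutes with the shift: shifting both operands and then adding gives the same stream as adding first and applying a single shift to the sum. A scalar sketch (the `shiftpair` helper is illustrative, not an SPU intrinsic) checks this:

```c
#include <string.h>

/* Concatenate two aligned 4-int vectors and take 4 elements starting at
 * element offset `off`: a scalar stand-in for a stream-shift. */
static void shiftpair(const int v0[4], const int v1[4], int off, int out[4])
{
    int tmp[8];
    memcpy(tmp, v0, 4 * sizeof(int));
    memcpy(tmp + 4, v1, 4 * sizeof(int));
    memcpy(out, tmp + off, 4 * sizeof(int));
}

/* Returns 1 if the eager result (shift each operand, then add) matches
 * the lazy result (add unshifted, shift the sum once). */
int lazy_matches_eager(void)
{
    int b[8], c[8], i;
    for (i = 0; i < 8; i++) { b[i] = 2 * i + 1; c[i] = 5 * i + 3; }

    int sb[4], sc[4], eager[4];
    shiftpair(b, b + 4, 1, sb);               /* shift the b stream */
    shiftpair(c, c + 4, 1, sc);               /* shift the c stream */
    for (i = 0; i < 4; i++) eager[i] = sb[i] + sc[i];

    int s0[4], s1[4], lazy[4];
    for (i = 0; i < 4; i++) s0[i] = b[i] + c[i];          /* add first */
    for (i = 0; i < 4; i++) s1[i] = b[i + 4] + c[i + 4];
    shiftpair(s0, s1, 1, lazy);               /* one shift on the sum */

    for (i = 0; i < 4; i++)
        if (eager[i] != lazy[i]) return 0;
    return 1;
}
```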
[Figure: aligned loads b[1], b[5], b[9], ... each fetch a full 16-byte vector (b0..b3, b4..b7, b8..b11); shuffling each adjacent pair produces the shifted stream b1..b4, b5..b8, b9..b12 at offset 0. This is streamshift(4,0).]

Code Generation for Partial Store

  for (i=0; i<100; i++) a[i+2] = b[i+1] + c[i+3];

[Figure: the first computed vector (*, *, b1+c3, b2+c4) covers only a2 and a3 of the 16-byte line a0..a3, so the first store is partial: load a[2]'s line, shuffle the old values a0, a1 together with the new values, and store the merged line back. Subsequent stores (store a[6], ...) are full-width.]

Code Generation for Loops (Multiple Statements)

  for (i=0; i<n; i++) {
      a[i] = ...;
      b[i+1] = ...;
      c[i+3] = ...;
  }

Implicit loop skewing (steady state):
      a[i+4] = ...
      b[i+4] = ...
      c[i+4] = ...

[Figure: the value streams a[i], b[i+1], and c[i+3] start at different offsets; skewing aligns them so the steady state writes a[i+4], b[i+4], and c[i+4] together, with a simdized loop prologue and epilogue covering the skewed ends.]
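The read-modify-write for a misaligned first store can be sketched in scalar C. `partial_store` below is an illustrative stand-in for the load/select (spu_sel)/store sequence, not an SPU intrinsic:

```c
/* Scalar model of a partial vector store: load the 16-byte target line,
 * keep the first `keep` old elements, overwrite the rest with the newly
 * computed values, and write the merged line back. */
void partial_store(int line[4], const int computed[4], int keep)
{
    for (int i = keep; i < 4; i++)
        line[i] = computed[i];
}
```

For a[i+2] = b[i+1] + c[i+3] with the store at byte offset 8, the first line a0..a3 is merged with keep = 2, preserving a0 and a1.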
(The loop prologue, steady state, and epilogue are all simdized.)

Machine-Independent Code After Simdization

Pseudo code after simdization (no software pipelining on adjacent loads):
- all operations are on vectors
- loads/stores are normalized to truncated addresses
- splice, shiftpairl, and shiftpairr are pseudo data reorganization operations, to be mapped to spu_shuffle or spu_sel

  i = 0;
  a[i] = splice(a[i], shiftpairr(b[i-4], b[i], 4) + shiftpairl(c[i-4], c[i], 4), 8);
  do {
      a[i+4] = shiftpairr(b[i], b[i+4], 4) + shiftpairl(c[i], c[i+4], 4);
      i = i + 4;
  } while (i < 95);
  a[i+4] = splice(shiftpairr(b[i], b[i+4], 4) + shiftpairl(c[i], c[i+4], 4), a[i+4], 8);

Machine-Dependent Intrinsic Code After Simdization

After software pipelining, loop normalization, and address truncation. spu_shuffle, spu_sel, spu_add, and spu_maskb are SPU intrinsics; <0x08090a0b, ...> is a vector literal.

  a[0] = spu_sel(a[0],
                 spu_add(spu_shuffle(b[-4], b[0], <0x08090a0b,0x0c0d0e0f,0x10111213,0x14151617>),
                         spu_shuffle(c[-4], c[0], <0x0c0d0e0f,0x10111213,0x14151617,0x18191a1b>)),
                 spu_maskb(15));
  oldSPCopy0 = c[0];
  oldSPCopy1 = b[0];
  i = 0;
  do {
      tc = c[i*4+4];
      tb = b[i*4+4];
      a[i*4+4] = spu_add(spu_shuffle(oldSPCopy1, tb, <0x08090a0b,0x0c0d0e0f,0x10111213,0x14151617>),
                         spu_shuffle(oldSPCopy0, tc, <0x0c0d0e0f,0x10111213,0x14151617,0x18191a1b>));
      oldSPCopy0 = tc;
      oldSPCopy1 = tb;
      i = i + 1;
  } while (i < 24);
  a[100] = spu_sel(spu_add(spu_shuffle(b[96], b[100], <0x08090a0b,0x0c0d0e0f,0x10111213,0x14151617>),
                           spu_shuffle(c[96], c[100], <0x0c0d0e0f,0x10111213,0x14151617,0x18191a1b>)),
                   a[100], spu_maskb(15));
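The pseudo operations can be given a scalar semantics and the whole simdized loop checked against the original scalar loop. This is a re-derivation under stated assumptions (element-granularity offsets, padded arrays standing in for the b[i-4]/c[i-4] prologue accesses), not the compiler's exact output:

```c
#include <string.h>

/* Scalar models of the pseudo data reorganization ops: shiftpair takes
 * 4 elements starting at element offset `off` from the concatenation of
 * two aligned vectors; splice keeps the first `cut` elements of x and
 * the rest from y. */
static void shiftpair(const int *v0, const int *v1, int off, int out[4])
{
    int tmp[8];
    memcpy(tmp, v0, 4 * sizeof(int));
    memcpy(tmp + 4, v1, 4 * sizeof(int));
    memcpy(out, tmp + off, 4 * sizeof(int));
}

static void splice(const int x[4], const int y[4], int cut, int out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (i < cut) ? x[i] : y[i];
}

/* Simdize a[i+2] = b[i+1] + c[i+3], i = 0..99, and compare against the
 * scalar loop. Returns 1 on a perfect match. */
int simdized_matches_scalar(void)
{
    static int bbuf[120], cbuf[120], a_s[112], a_v[112];
    int *b = bbuf + 4, *c = cbuf + 4;   /* pad so b[-4], c[-4] exist */
    for (int j = -4; j < 112; j++) { b[j] = 3 * j + 1; c[j] = 7 * j + 2; }
    for (int j = 0; j < 112; j++)  { a_s[j] = -1; a_v[j] = -1; }

    for (int i = 0; i < 100; i++) a_s[i + 2] = b[i + 1] + c[i + 3];

    /* each aligned vector a[4k..4k+3] needs b[4k-1..4k+2] + c[4k+1..4k+4] */
    for (int k = 0; k <= 25; k++) {
        int vb[4], vc[4], sum[4], res[4];
        shiftpair(&b[4 * k - 4], &b[4 * k], 3, vb);
        shiftpair(&c[4 * k], &c[4 * k + 4], 1, vc);
        for (int i = 0; i < 4; i++) sum[i] = vb[i] + vc[i];
        if (k == 0)       splice(&a_v[0], sum, 2, res);    /* partial head */
        else if (k == 25) splice(sum, &a_v[100], 2, res);  /* partial tail */
        else              memcpy(res, sum, sizeof res);
        memcpy(&a_v[4 * k], res, sizeof res);
    }
    for (int j = 0; j < 112; j++)
        if (a_s[j] != a_v[j]) return 0;
    return 1;
}
```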
Outline
- Simdization example
- Integrated simdization framework

A Successful Simdizer: Extract Parallelism, Satisfy Constraints

A successful simdizer must both extract SIMD parallelism and satisfy the hardware constraints:
- Extract parallelism: at the loop level (for (i=0; i<256; i++) a[i] = ...), at the basic-block level (a[i+0] = ...; a[i+1] = ...; a[i+2] = ...; a[i+3] = ...), and from entire short loops (for (i=0; i<8; i++) a[i] = ...).
- Satisfy constraints: alignment constraints (vload b[1] and vload b[5] feed a vpermute to realign b1 b2 b3 b4 across 16-byte boundaries), data size conversion (SHORT data must be unpacked into two INT vectors before the add and store), non stride-one accesses, and multiple targets (GENERIC, VMX, SPE).

Multiple Sources of SIMD Parallelism

Loop level: SIMD for a single statement across consecutive iterations. Successful at:
- efficiently handling misaligned data
- pattern recognition (reduction, linear recursion)
- leveraging the loop transformations in most compilers
- amortizing overhead (versioning, alignment handling) by employing cost models
[Bik et al, IJPP 2002] [VAST compiler, 2004] [Eichenberger et al, PLDI 2004] [Wu et al, CGO 2005] [Naishlos, GCC Developer's Summit 2004]

Multiple Sources of SIMD Parallelism (cont.)

Basic-block level: SIMD across multiple isomorphic operations. Successful at:
- handling unrolled loops (unrolled manually or by the compiler)
- extracting SIMD parallelism within structs, e.g.
    a[i].x = ...
    a[i].y = ...
    a[i].z = ...

- extracting SIMD parallelism within a statement, e.g. the reduction
    s += a(i)*b(i) + a(i+1)*b(i+1) + a(i+2)*b(i+2) + a(i+3)*b(i+3) + a(i+4)*b(i+4)
[Larsen et al, PLDI 2000] [Shin et al, PACT 2002]

Multiple Sources of SIMD Parallelism (cont.)

Short-loop level: SIMD across entire loop iterations. This effectively collapses the innermost loop, so we can now extract SIMD parallelism at the next loop level. E.g. FIR:

  for (k=0; k<248; k++)
      for (i=0; i<8; i++)
          res[k] += in[k+i] * coef[k+i];

Multiple SIMD Hardware Constraints

Alignment in SIMD units matters. When the alignments of the inputs do not match, the data must be realigned: e.g. vload b[1] and vload b[5] each fetch a full 16-byte-aligned vector (b0..b3, b4..b7), and a vpermute produces the realigned b1 b2 b3 b4 to add with the aligned c0 c1 c2 c3.

Multiple SIMD Hardware Constraints (cont.)

Size of data in SIMD registers matters.
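The within-statement reduction above can be sketched with explicit lanes: keep four partial sums, one per SIMD lane, and combine them once after the loop. This is a hypothetical scalar model of the technique, not VMX or SPE code:

```c
/* Dot-product reduction with 4 SIMD-style lanes: each lane accumulates
 * every 4th product; the lanes are summed once at the end. n is assumed
 * to be a multiple of 4. */
int dot4(const int *a, const int *b, int n)
{
    int lane[4] = {0, 0, 0, 0};
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < 4; j++)
            lane[j] += a[i + j] * b[i + j];
    return lane[0] + lane[1] + lane[2] + lane[3];
}
```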
A 16-byte vector of short converts into two vectors of int: e.g. when converting from short to integer, we must issue 2x the integer SIMD operations (each half of the short vector is unpacked into a full vector of int before the add and store).

Multiple SIMD Hardware Constraints (cont.)

The hardware supports 16-byte contiguous accesses only, so non stride-one accesses need extra data reorganization. E.g. for

  for (i=0; i<n; i++) ... = a[i] + b[2i+1];

a non stride-one load requires packing: load b[1] and load b[5] fetch b0..b3 and b4..b7, and a pack keeps only b1 b3 b5 b7. Likewise, a non stride-one store requires unpacking.

Multiple SIMD Hardware Constraints (cont.)

Different platforms have varying SIMD support: e.g. VMX and SPE have SIMD permute instructions; the SPE has no memory page faults, while VMX does. Multiple targets: GENERIC, VMX, SPU.

Extract Parallelism, Satisfy Constraints: How to Support the Cross Product of All These?
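The packing needed for the b[2i+1] access can be modeled in scalar code; `pack_odd` is an illustrative stand-in for the shuffle that discards the even elements:

```c
/* Two aligned vector loads cover b[2i..2i+7]; packing keeps only the
 * odd-indexed elements, yielding the gathered b[2i+1], b[2i+3], ... */
void pack_odd(const int v0[4], const int v1[4], int out[4])
{
    out[0] = v0[1];
    out[1] = v0[3];
    out[2] = v1[1];
    out[3] = v1[3];
}
```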
[Figure: the extraction dimension (loop level, basic-block level, entire short loop) crossed with the constraint dimension (alignment constraints, data size conversion, non stride-one, multiple targets: GENERIC, VMX, SPE).]

Key Abstraction: Virtual SIMD Vector

A virtual SIMD vector:
- has arbitrary length
- has no alignment constraints
Use the virtual vector as the representation to abstract away all of the hardware complexity. Then progressively "de-virtualize" the virtual vectors until each vector in the loop satisfies all hardware constraints (GENERIC, VMX, SPU, BG/L), or revert vectors back to scalars (if there is too much overhead).

Integrated Simdization Framework

Characteristics: modular, global, integrated; hides complexity.
- Extraction of SIMD parallelism: basic-block level aggregation, short-loop level aggregation, loop-level aggregation, idiom recognition.
- Supporting analyses and optimizations: integrated alignment analysis, global pointer analysis, data layout optimization, constant propagation, dependence elimination.
- Virtual vectors: arbitrary length, arbitrary alignment.
- Transition:
alignment devirtualization (alignment handling) and length devirtualization (short length handling) lower virtual vectors to physical vectors (16 bytes long, machine alignment), followed by machine-specific SIMD code generation.

SPE Simdization Results

[Figure: speedup factors of automatic simdization vs. scalar code on a single SPE (optimized) for Linpack, Swim-l, FIR, Autcor, Dot Product, Checksum, Alpha Blending, Saxpy, Mat Mult, and the average; the numeric values were lost in extraction.]

Conclusions

The Cell Broadband Engine architecture offers heterogeneous parallelism and a dense compute architecture. We present the application writer with a wide range of tools, from support to extract maximum performance to support to achieve maximum productivity with fully automatic tools. We have shown respectable speedups using automatic tuning, simdization, and support for a shared-memory abstraction.

Questions

For additional info: www.research.ibm.com/cellcompiler/compiler.htm

First Example: Basic-Block & Loop-Level Aggregation

Original loop:

  for (i=0; i<256; i++) {
      a[i].x = ...
      a[i].y = ...
      a[i].z = ...
      b[i+1] = ...
  }

Value streams: a0.x a0.y a0.z | a1.x a1.y a1.z | a2.x a2.y a2.z | a3.x a3.y ... and b1 b2 b3 b4 (4 iterations shown here for the purpose of illustration).

Phase 1: Basic-Block Level Aggregation
[Figure: the value streams a0.x a0.y a0.z a1.x ... and b1 b2 b3 b4 are packed per iteration into (a0.x, a0.y, a0.z), (a1.x, a1.y, a1.z), ...]

Basic-block level aggregation: pack a[i].x, a[i].y, a[i].z into a vector of 3 elements; pack regardless of alignment.

Phase 2: Loop-Level Aggregation

BB-aggregated loop:

  for (i=0; i<256; i++) {
      (a[i].x,y,z) = ...
      b[i+1] = ...
  }

Pack each statement with itself across consecutive iterations; the final vector lengths must be a multiple of 16 bytes. A scalar "b[i]" and a vector "(a[i].x,y,z)" are treated alike; pack regardless of alignment.

Phase 3: Alignment Devirtualization

Loop-aggregated:

  for (i=0; i<256; i+=4) {
      (a[i].x,...,a[i+3].z) = ...
      (b[i+1],...,b[i+4]) = ...
  }

Alignment: shift the misaligned streams; skew the computations so that the loop computes (b[i+4],...,b[i+7]). (Arrays, e.g. &a[0], &b[0], ..., are assumed here to be 16-byte aligned.)
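The skewing in phase 3 (and the peeling in phase 4) can be checked in scalar code: writing b[1..3] in a prologue, b[i+4..i+7] in an aligned steady state, and b[256] in an epilogue must produce exactly the same stores as the original b[i+1] loop. A sketch under the slide's assumptions (256 iterations, 16-byte-aligned b):

```c
/* Fill b[i+1] for i = 0..255 the skewed way: a peeled prologue, an
 * aligned steady state that writes 4 elements per trip, and a peeled
 * epilogue. Returns 1 if it matches the original loop element by element. */
int skewed_matches_original(void)
{
    static int orig[260], skew[260];

    for (int i = 0; i < 256; i++) orig[i + 1] = i * i;          /* original */

    for (int j = 1; j <= 3; j++) skew[j] = (j - 1) * (j - 1);   /* prologue */
    for (int i = 0; i < 252; i += 4)                            /* steady   */
        for (int j = 0; j < 4; j++)
            skew[i + 4 + j] = (i + 3 + j) * (i + 3 + j);
    skew[256] = 255 * 255;                                      /* epilogue */

    for (int j = 0; j < 260; j++)
        if (orig[j] != skew[j]) return 0;
    return 1;
}
```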
Phase 4: Length Devirtualization

Aligned loop:

  (b[1],...,b[3]) = ...
  for (i=0; i<252; i+=4) {
      (a[i].x,...,a[i+3].z) = ...
      (b[i+4],...,b[i+7]) = ...
  }
  (a[252].x,...,a[255].z) = ...
  b[256] = ...

Length: break the virtual vectors into 16-byte chunks.

Second Example: Data-Size Conversion and Misalignment

Original loop:

  for (i=0; i<256; i++)
      a[i] += (int) b[i+1];

load b[i+1] (short computations, 2 bytes), convert, add with load a[i] (integer computations, 4 bytes), store a[i].

Phase 1: Loop-Level Aggregation

vload b[i+1] (8 shorts), convert, vadd with vload a[i] (8 integers), vstore a[i]. Pack each statement with itself across consecutive iterations. Virtual vectors have a uniform number of elements, even when a vector of 8 integers = 32 bytes of data while a vector of 8 shorts = 16 bytes of data.

Phase 2: Alignment Devirtualization

Original loop:

  for (i=0; i<256; i++)
      a[i] += (int) b[i+1];
vload b[i+1], streamshift (align b[i+1]), convert (8 shorts), vadd with vload a[i], vstore a[i] (8 integers). Alignment: shift the misaligned streams; this is easy to do as we are still dealing with long vectors.

Phase 3: Length Devirtualization

vload b[i+1] (8 shorts), streamshift, unpack into 4 + 4 integers, then vadd with vload a[i] and vadd with vload a[i+4], followed by vstore a[i] and vstore a[i+4]. Length: 8 shorts fit into one 16-byte register; 8 integers do not, so the integer registers and their associated instructions must be replicated.
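The doubled integer operations in phase 3 can be sketched directly: one virtual vector of 8 shorts is unpacked into two physical vectors of 4 ints, and the add and store are each issued twice. A scalar model with illustrative helper names, not SPU intrinsics:

```c
/* One virtual-vector step of a[i] += (int) b[i+1] after length
 * devirtualization: unpack 8 shorts into two 4-int halves, then issue
 * two adds and two stores (covering a[0..3] and a[4..7]). */
void add8_devirtualized(int a[8], const short b8[8])
{
    int lo[4], hi[4];
    for (int i = 0; i < 4; i++) { lo[i] = b8[i]; hi[i] = b8[i + 4]; } /* unpack x2 */
    for (int i = 0; i < 4; i++) a[i]     += lo[i];   /* vadd/vstore #1 */
    for (int i = 0; i < 4; i++) a[i + 4] += hi[i];   /* vadd/vstore #2 */
}
```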