Multi-Platform Auto-Vectorization
Dorit Nuzman, IBM
Richard Henderson, Red Hat
IBM Labs in Haifa

Talk Layout
- Vectorization for SIMD
- Alignment Example
- Vectorization in GCC
- Vector Abstractions
- Abstractions for Alignment
- Multi-platform Evaluation
- Related Work & Conclusion

Vectorization
- SIMD (Single Instruction Multiple Data) model
  - Communications, Video, Gaming
  - MMX/SSE, Altivec
- Programming for Vector Platforms
  - Fortran90:
        a[0:N] = b[0:N] + c[0:N];
  - Intrinsics:
        vector float vb = vec_load (0, ptr_b);
        vector float vc = vec_load (0, ptr_c);
        vector float va = vec_add (vb, vc);
        vec_store (va, 0, ptr_a);
  - Autovectorization: the compiler automatically transforms serial code into vector code.

What is vectorization
- Data elements packed into vectors
- Vector length: Vectorization Factor (VF); here VF = 4
- No Data Dependences
- SIMD Architectural Capabilities
[Figure: four scalar operations OP(a), OP(b), OP(c), OP(d) become one vector operation VOP(a, b, c, d) on vector register VR1 (lanes 0 to 3). Vector registers VR1 to VR5. Data in memory: a b c d e f g h i j k l m n o p.]

Limitations of SIMD Architectures: Unaligned memory access
[Figure: data in memory a b c d | e f g h | i j k l | m n o p at byte offsets 0, 16, 32, 48. Aligned loads give VR1 = a b c d and VR2 = e f g h. The scalar operations OP(c), OP(d), OP(e), OP(f) need VOP(c, d, e, f), which starts at offset 8, so VR3 = c d e f has to be built from V1 and V2.]
Ways to build V3 from V1 and V2:
- vec-shift-left v1, v2, 2
- vec-permute v1, v2, {2,3,4,5}
- extql (v1, addr), v2; extqh (v2, addr); vec-or (v1, v2) (Alpha)
- alnv.ps v1,v2,addr (MIPS64)
- load-left, load-right (MIPS MDMX)

GCC
- Free Software Foundation
- Multi-platform
- Who's involved
  - Volunteers
  - Linux distributors (Red Hat, SUSE…)
  - Code Sourcery, AdaCore…
  - IBM, HP, Intel, Apple…

GCC Passes
- Front-ends: C, C++, Fortran, Ada, …
  - parse trees, lowered to GIMPLE
- Middle-end: GIMPLE trees
  - SSA optimizations: CCP, PRE, DCE, CSE, DSE, loop opts, forward prop, copy prop, VRP
  - Loop optimizations: invariant motion, unswitching, linear transform, if-conversion, vectorization, unrolling
- Back-end: RTL, driven by the machine description (mips port, i386 port, rs6000 port, …)
  - Sibling call optimizations
  - Common subexpression elimination
  - Loop optimizations
  - Data flow analysis
  - Instruction combination
  - Instruction scheduling
  - Register allocation and reloading
  - Instruction scheduling (repeated)
  - Branch shortening
  - assembly
[Diagram annotations linking the machine description to the vectorizer: Abstractions, Vector Size.]

Vector Abstractions: Why Needed
    s = 0;
    for (i=0; i<N; i++) {
      s = s + a[i] * b[i];
    }
- Represent high-level idioms that otherwise can't be vectorized
  - reduction
  - special idioms (sad, subtract-and-saturate, dot-product)
- Express vector operations in GIMPLE
  - "reduc-plus", extract, shuffle, …
- API for targets to convey availability and cost of a functionality
[Figure: partial sums s1, s2, s3, s4 accumulated in the lanes of one vector register.]

    optab/type   char   short   int   v16char   v8short   v4int
    add          f1     f2      f3    f4        f5        f6
    reduc-plus                        f7        f8

Vector Abstractions: Considerations
- Generality vs. applicability
  - General enough to cover all uses
  - Minimize increase of operation-codes
  - Not generally supported (example: permute)
  [Figure: permute selecting elements 0 to 7 from a b c d e f g h held in V1 and V2.]
- Compound vs. building blocks
  - Increase of operation-codes (subtract-and-saturate, dot-product)
  - Complicated "black-box" operations
  - Increase ways to represent same functionality
  - Improved direct support of a high-level idiom over basic functionalities
- GCC conventions
  - naming, existing operation-codes, default values…
- Performance
  - Translates to most efficient code

Vector Abstractions: Abstractions for alignment
- Implicit Realignment: misaligned_ref (ptr, mis)
  - movdqu (MMX/SSE)
  - load-left, load-right (MIPS MDMX)
- Explicit Realignment: align_ref (ptr); realign_load (v1, v2, RT), where RT is the Realignment Token
  - vec-shift-left v1, v2, 2
  - vec-permute v1, v2, {2,3,4,5} (Altivec)
  - extql (v1, addr), v2; extqh (v2, addr); vec-or (v1, v2) (Alpha)
  - alnv.ps v1,v2,addr (MIPS64)
[Figure: data in memory a b c d | e f g h | i j k l | m n o p at byte offsets 0, 16, 32, 48; V1 = a b c d, V2 = e f g h, result V3 = c d e f starting at offset 8.]

Handling Alignment
Scalar loop:
    for (i=0; i<N; i++){
      x = a[i];
      b[i] = x;
    }
Explicit realignment, two aligned loads per iteration:
    addra_0 = &a[0]; addrb = &b[0];
    vector vx, vx1, vx2;
    addra_i = addra_0;
    LOOP:
      vx1 = align_ref (addra_i);
      vx2 = align_ref (addra_i+15);
      vx = realign_load (vx1, vx2, addra_i);
      indirect_ref (addrb) = vx;
      addra_i += 16; addrb += 16;
Explicit realignment, reusing the previous aligned load:
    addra_0 = &a[0]; addrb = &b[0];
    vector vx, vx1, vx2;
    vx1 = align_ref (addra_0);
    addra_i = addra_0 + 15;
    LOOP:
      vx2 = align_ref (addra_i);
      vx = realign_load (vx1, vx2, addra_i);
      indirect_ref (addrb) = vx;
      addra_i += 16; addrb += 16;
      vx1 = vx2;

Handling Alignment (continued)
Implicit realignment:
    addra_0 = &a[0]; addrb = &b[0];
    vector vx;
    addra_i = addra_0;
    LOOP:
      vx = misaligned_ref (addra_i,0);
      indirect_ref (addrb) = vx;
      addra_i += 16; addrb += 16;
Explicit realignment with the realignment token computed once, before the loop:
    addra_0 = &a[0]; addrb = &b[0];
    vector vx, vx1, vx2;
    vx1 = align_ref (addra_0);
    RT = target_get_RT (addra_0);
    addra_i = addra_0 + 15;
    LOOP:
      vx2 = align_ref (addra_i);
      vx = realign_load (vx1, vx2, RT);
      indirect_ref (addrb) = vx;
      addra_i += 16; addrb += 16;
      vx1 = vx2;

GIMPLE Vector Abstractions
- Alignment: misaligned_ref, align_ref, realign_load, target_get_RT
- Conditional operations: (cond) ? x : y
- Reduction: reduc_plus
- Type Conversions: unpack_high, unpack_low, pack_mod, pack_sat
- Special patterns: dot_prod, sad, sub_sat, widen_mult, widen_sum
- Strided-Accesses: extract_odd, extract_even, interleave_high, interleave_low

Multi-Platform Evaluation
- IBM PowerPC970, Altivec (VS = 16)
- Intel Pentium4, SSE2 (VS = 16)
- AMD Athlon64, SSE2 (VS = 16)
- Intel Itanium2 (VS = 8)
- MIPS64, paired-single-fp (VS = 8)
- Alpha (VS = 8)

Multi-Platform Evaluation: Vectorization Speedup Factors - Aligned
[Chart: speedup factor per benchmark for each platform (powerpc, pentium, athlon, itanium, alpha, mips). Benchmarks: blas.sdot_fp, blas.saxpy_fp, blas.dscal_fp, vecmax_fp, checksum_s, chromakey_u, vecmax_s, vecsum_u, vecmax_u. Bar values are not recoverable from the extracted text.]

Multi-Platform Evaluation: Vectorization Speedup Factors - Unaligned
[Chart: same benchmarks and platforms as the aligned chart. Bar values are not recoverable from the extracted text.]

Related Work
- Vectorizing compilers available for a specific architecture
  - XL (Eichenberger, Wu): Altivec
  - icc (Bik): MMX/SSE
  - CoSy (Krall): VIS
  - SUIF (Larsen, Amarasinghe; Shin, Chame, Hall): Altivec
- Vectorizing compilers available for multiple SIMD targets
  - source-to-source compilers
  - Vienna MAP: 2-way, domain-specific patterns. BG +
  - SWARP: source-to-source, multimedia patterns. Trimedia +
- This Work
  - In a robust industrial-strength compiler
  - Experimental results on several different SIMD platforms

Concluding Remarks
- SIMD
  - Hardware limitations
  - Unique Hardware mechanisms
  - Diverse nature
- Multi-platform vectorizer
  - Bridge gap across different SIMD targets
  - Efficiently support each individual platform
  - Identify proper abstractions
- Developing the vectorizer in the GCC platform
  - Collaborative investment of different vendors/developers
  - Open, available
  - http://gcc.gnu.org/projects/tree-ssa/vectorization.html

The End
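
Supplementary sketch for "Vector Abstractions: Why Needed". The dot-product loop on that slide carries a dependence on s, so vectorizing it means keeping VF partial sums (the s1..s4 of the figure) and combining them once after the loop, which is the role the slides give to reduc-plus. The plain C below only emulates that shape with arrays. The names dot_vectorized and reduc_plus_emulated are hypothetical, int elements are an assumption made to keep the arithmetic exact, and this is not the code GCC generates.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* vectorization factor used on the slides */

/* Serial loop from the slide. */
int dot_scalar(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s = s + a[i] * b[i];
    return s;
}

/* Hypothetical stand-in for reduc_plus: add up the lanes of one vector. */
static int reduc_plus_emulated(const int v[VF])
{
    int s = 0;
    for (int lane = 0; lane < VF; lane++)
        s += v[lane];
    return s;
}

/* Vectorized shape: VF partial sums in the loop, one reduction after it. */
int dot_vectorized(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int vs[VF] = {0, 0, 0, 0};          /* partial sums s1..s4 */
    size_t i = 0;
    for (; i + VF <= n; i += VF)        /* each trip models one vector multiply-add */
        for (int lane = 0; lane < VF; lane++)
            vs[lane] += a[i + lane] * b[i + lane];
    int s = reduc_plus_emulated(vs);    /* combine the lanes once */
    for (; i < n; i++)                  /* scalar epilogue for leftover elements */
        s += a[i] * b[i];
    return s;
}
```

Both functions return the same value for integer inputs. With floating-point data the vectorized form adds in a different order, so the results can differ in the last bits.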
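
Supplementary sketch for "Handling Alignment". The last variant on those slides loads one aligned vector before the loop, computes the realignment token once, and then issues a single aligned load per iteration, reusing the previous one. The C below emulates that loop on bytes with VS = 16. The type vec_t and the function copy_realigned are hypothetical, and the token encoding is an assumption: here the token is the misalignment in bytes, and a token of 0 selects the second vector, because in the reuse scheme an aligned address leaves vx1 one block behind. Real targets define their own token.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VS 16  /* vector size in bytes */

typedef struct { uint8_t b[VS]; } vec_t;  /* hypothetical vector type */

/* Emulates align_ref: load VS bytes from addr rounded down to a VS boundary. */
static vec_t align_ref(const uint8_t *addr)
{
    vec_t v;
    uintptr_t base = (uintptr_t)addr & ~(uintptr_t)(VS - 1);
    memcpy(v.b, (const uint8_t *)base, VS);
    return v;
}

/* Emulates target_get_RT: here the token is simply the byte misalignment. */
static unsigned target_get_RT(const uint8_t *addr)
{
    return (unsigned)((uintptr_t)addr & (VS - 1));
}

/* Emulates realign_load: take VS bytes out of the concatenation v1|v2.
   Assumption: a token of 0 means "use v2", which the reuse scheme needs. */
static vec_t realign_load(vec_t v1, vec_t v2, unsigned rt)
{
    assert(rt < VS);
    vec_t r;
    unsigned start = (rt == 0) ? VS : rt;
    for (unsigned i = 0; i < VS; i++) {
        unsigned j = start + i;
        r.b[i] = (j < VS) ? v1.b[j] : v2.b[j - VS];
    }
    return r;
}

/* Copies nvec vectors from a possibly misaligned src to dst.
   The caller must make sure every aligned block touched by src is readable. */
void copy_realigned(uint8_t *dst, const uint8_t *src, size_t nvec)
{
    vec_t vx1 = align_ref(src);
    unsigned rt = target_get_RT(src);
    const uint8_t *addra_i = src + (VS - 1);
    for (size_t k = 0; k < nvec; k++) {
        vec_t vx2 = align_ref(addra_i);         /* one aligned load per iteration */
        vec_t vx = realign_load(vx1, vx2, rt);
        memcpy(dst, vx.b, VS);                  /* stands in for indirect_ref (addrb) = vx */
        addra_i += VS;
        dst += VS;
        vx1 = vx2;                              /* reuse the load in the next iteration */
    }
}
```

The token is loop-invariant because the address advances by exactly VS each iteration, which is why the slide hoists target_get_RT out of the loop.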
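
Supplementary sketch for the Strided-Accesses entry of "GIMPLE Vector Abstractions". The slides only name extract_even and extract_odd, so the semantics used here are an assumption: each operation takes two vectors and returns the even-indexed (or odd-indexed) elements of their concatenation. Under that assumption, a stride-2 access such as splitting interleaved pairs can be emulated in plain C as follows. The function deinterleave and the array-based vector emulation are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* elements per vector */

/* Assumed semantics: even-indexed elements of the concatenation v1|v2. */
static void extract_even(const int v1[VF], const int v2[VF], int out[VF])
{
    for (int i = 0; i < VF / 2; i++) {
        out[i] = v1[2 * i];
        out[VF / 2 + i] = v2[2 * i];
    }
}

/* Assumed semantics: odd-indexed elements of the concatenation v1|v2. */
static void extract_odd(const int v1[VF], const int v2[VF], int out[VF])
{
    for (int i = 0; i < VF / 2; i++) {
        out[i] = v1[2 * i + 1];
        out[VF / 2 + i] = v2[2 * i + 1];
    }
}

/* Splits n interleaved pairs (2*n ints in `in`) into two arrays of n ints.
   n must be a multiple of VF; a real vectorizer would add an epilogue. */
void deinterleave(const int *in, int *even, int *odd, size_t n)
{
    assert(n % VF == 0);
    for (size_t i = 0; i < n; i += VF) {
        const int *v1 = in + 2 * i;       /* first vector of the pair */
        const int *v2 = in + 2 * i + VF;  /* second vector of the pair */
        extract_even(v1, v2, even + i);
        extract_odd(v1, v2, odd + i);
    }
}
```

Two vector loads feed two extract operations, so each input vector is loaded once even though the source loop reads with stride 2.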