Efficient SIMD Code Generation for Runtime Alignment and Length Conversion
Peng Wu, Alexandre Eichenberger (IBM T.J. Watson Research Center), Amy Wang (IBM Toronto Laboratory)
CGO 2005

Overview
- Background on SIMD code generation
- Contribution #1: efficient alignment handling
- Contribution #2: support for data size conversion
- Measurements

Objectives
- Single Instruction, Multiple Data (SIMD):
  - may yield 2x-16x speedup at moderate hardware cost
  - available on most platforms: x86, PowerPC 970, CELL, BlueGene/L
- Target applications:
  - data intensive: multimedia, DSP, games, numerical
  - rich in data types: chars, shorts, ints, floats, ...
  - rich in data alignments, often known only at runtime
- Goal: produce SIMD code with low overhead for the type of code found in the target applications

Sequential Execution of a Loop
- for (i=0; i<100; i++) a[i+2] = b[i+1] + c[i+3];
- [Figure: memory streams for b, c, and a; the first original loop iteration (i=0, in yellow) performs LOAD b[1], LOAD c[3], ADD, STORE a[2].]

SIMD Load/Store Preserves Memory Alignment
- A SIMD load accesses only the 16-byte chunk of 16-byte-aligned data that encloses the requested address*
- [Figure: VLOAD b[1] with &b[1] = 0x1004 returns b0 b1 b2 b3 -- the requested element lands at byte offset 4 in the register.]

* AltiVec/VMX and others; SSE supports unaligned access, but less efficiently
SIMD Load/Store Preserves Memory Alignment (cont.)
- [Figure: same VLOAD b[1] example; the first original loop iteration (i=0) is highlighted in yellow, the first SIMD loop iteration in grey.]

Erroneous SIMD Execution
- for (i=0; i<100; i++) a[i+2] = b[i+1] + c[i+3];
- [Figure: VLOAD b[1] and VLOAD c[3] each truncate to their enclosing 16-byte boundary, so VADD computes b0+c0, b1+c1, b2+c2, b3+c3.]
- b[1] and c[3] are not aligned => wrong results: we got b[1]+c[1] where we wanted b[1]+c[3]

Correct SIMD Execution (Zero-Shift) [VAST 04]
- for (i=0; i<100; i++) a[i+2] = b[i+1] + c[i+3];
- [Figure: VLOAD & SHIFT moves the b[1] stream left by 4 bytes and the c[3] stream left by 12 bytes so both start at register offset 0; after VADD, a VSTORE & SHIFT moves the result right by 8 bytes to match the alignment of a[2].]

How to Align Data in Registers
- Load 4 values from the misaligned address b[1]:
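The truncating-load behavior above can be simulated in a few lines of Python. This is a sketch of the semantics at element granularity (4 ints per 16-byte register), not the actual VMX instructions; the helper names are ours.

```python
VEC = 4  # 32-bit elements per 16-byte SIMD register

def vload(mem, i):
    """Truncating load, as VMX lvx behaves: ignore the misaligned part of
    the address and return the enclosing 16-byte-aligned chunk."""
    aligned = (i // VEC) * VEC
    return mem[aligned:aligned + VEC]

# Streams b and c, both 16-byte aligned at index 0.
b = [10 * i for i in range(16)]   # b[i] = 10*i
c = [100 * i for i in range(16)]  # c[i] = 100*i

# Naive SIMD version of: for i in range(4): a[i+2] = b[i+1] + c[i+3]
vb = vload(b, 1)  # wanted b[1..4], got b[0..3] (offset truncated away)
vc = vload(c, 3)  # wanted c[3..6], got c[0..3]
va = [x + y for x, y in zip(vb, vc)]

# First lane holds b[0]+c[0], not the desired b[1]+c[3].
print(va[0])        # 0
print(b[1] + c[3])  # desired: 10 + 300 = 310
```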
- [Figure: VLOAD b[1] returns the chunk at offset 4; VLOAD b[5] returns the next chunk; a VPERMUTE* combines the two registers to produce b1 b2 b3 b4 at offset 0.]
- LOAD/STORE & SHIFT operations are expensive => we want to minimize them

* AltiVec/VMX and most other SIMD ISAs have support for shuffling the data of 2 registers

Better SIMD Execution (Lazy-Shift) [Eichen 04]
- for (i=0; i<100; i++) a[i+2] = b[i+1] + c[i+3];
- [Figure: the b[1] and c[3] streams are each shifted by only 4 bytes, just enough to match the 8-byte offset of the store a[2]; the VSTORE then needs no shift at all.]

Contribution #1: Handling Runtime Alignment Efficiently
- Runtime alignment is frequent:
  - algorithmic characteristics: e.g. "a[i-1] + a[i] + a[i+1]"
  - coding style: pointers, libraries
- Previous work does not support runtime alignment with a minimized number of shifts (e.g. the lazy-shift policy)
  - reason: a code generation issue
- We propose an additional phase that "normalizes" the computation to eliminate this issue

Problem with Runtime Alignment
- Shift c[3] left by 4 bytes: VLOAD c[3] and VLOAD c[3+4], then VPERM-LEFT -- shift in data from the NEXT vector
- Shift b[1] right by 4 bytes: VLOAD b[1-4] and VLOAD b[1], then VPERM-RIGHT -- shift in data from the PREV vector
- Different code sequences are needed for shifting LEFT and RIGHT; with runtime alignment, we do not know which direction we are shifting.
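The two-load-plus-permute realignment, and the reference-point trick described on the next slides, can be sketched in Python. This models elements rather than bytes, with helper names of our own invention; `load_at_offset` shows how computing the address of register element 0 makes every shift a left shift, so one code sequence covers both directions.

```python
VEC = 4  # 32-bit elements per 16-byte SIMD register

def vload(mem, i):
    """Truncating load: return the 16-byte-aligned chunk enclosing index i."""
    a = (i // VEC) * VEC
    return mem[a:a + VEC]

def vperm(v1, v2, start):
    """Extract VEC consecutive elements from the concatenation v1|v2,
    starting at element index 'start' (models an AltiVec vperm shuffle)."""
    return (v1 + v2)[start:start + VEC]

def load_at_offset(mem, addr, to):
    """Load a register whose element at offset 'to' is mem[addr].

    Normalized form: compute the address of register element 0 first; the
    residual offset is then always in [0, VEC), i.e. always a LEFT shift,
    regardless of whether the stream moves left or right.  (A real
    implementation prepends data before the stream; here we simply use
    addresses well inside the array.)"""
    start = addr - to            # reference point of register element 0
    base = (start // VEC) * VEC  # enclosing chunk boundary
    return vperm(vload(mem, base), vload(mem, base + VEC), start - base)

b = list(range(100))

# A left shift (offset 3 -> 1) and a right shift (offset 1 -> 2)
# go through the exact same code path:
print(load_at_offset(b, 7, 1))  # [6, 7, 8, 9]
print(load_at_offset(b, 5, 2))  # [3, 4, 5, 6]
```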
=> Hence the compiler cannot generate this code statically.

Insight: Convert Arbitrary Shifts into Left-Shifts
- Instead of shifting b[1] RIGHT by 4 bytes, we may shift b[-1] LEFT by 12 bytes
- [Figure: the right shift uses VLOAD b[1-4], VLOAD b[1], and VPERM-RIGHT; the left shift uses VLOAD b[-1], VLOAD b[-1+4], and VPERM-LEFT; both produce b-1 b0 b1 b2.]
- By changing the reference point, i.e. by pre-pending 8 bytes, the shift RIGHT becomes a shift LEFT.

Approach
- "Normalize" each shift into a shift LEFT:
  - prepend the input stream to transform every shift into a shift left
  - the prepend amount is mechanically derived from the alignment
  - for runtime alignment, the prepend amount is only known at runtime
  - details and proofs are in the paper

Contribution #2: Handling Data Conversion Efficiently
- Data conversion is frequent:
  - multimedia code often stores small data types but operates on larger ones
  - data conversion often occurs jointly with misalignment
- Previous work does not support data conversion combined with misalignment with a minimized number of shifts
  - reason: a code generation issue
- We extend our approach to handle such cases

Data Size Conversion
- What happens when converting between data of different sizes?
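A Python sketch of the packing step and its effect on offsets (again modeling elements, not bytes; `vpack` is our element-level stand-in, keeping only the low 16 bits of each int and ignoring sign extension and saturation, as the slides do):

```python
def vpack(v_lo, v_hi):
    """Pack two vectors of 4 ints into one vector of 8 shorts: each 32-bit
    lane narrows to 16 bits, so two full registers become one."""
    return [x & 0xFFFF for x in v_lo + v_hi]

ints_0 = [0, 1, 2, 3]  # b[0..3]: one 16-byte vector of ints
ints_1 = [4, 5, 6, 7]  # b[4..7]: a second 16-byte vector of ints
shorts = vpack(ints_0, ints_1)  # b[0..7] in a single 16-byte vector
print(shorts)  # [0, 1, 2, 3, 4, 5, 6, 7]

# Byte offsets scale with the conversion: b[3] sits at byte offset
# 3 * 4 = 12 among the ints, but at 3 * 2 = 6 once packed into shorts.
int_offset = 3 * 4    # 12
short_offset = 3 * 2  # 6
```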
- [Figure: a VPACK packs 2 vectors of 4 integers (b0..b7) into 1 vector of 8 shorts -- the packing part of the int => short conversion.*]
- To fully utilize the SIMD bandwidth, we must compute over 2 vectors of 4 integers to produce 1 vector of 8 shorts

* The logical part of the conversion (e.g. sign extension) has no SIMD impact and is ignored here.

Alignment of Data is Impacted by Conversion
- b[3] resides at byte offset 12 in the integer vectors
- As b[3] gets packed into shorts, this becomes offset 12/2 = 6

Alignment of Data is Impacted by Conversion (cont.)
- VLOAD b[3] truncates at 16 bytes, thus its byte offset is < 16; as the data gets packed into shorts, the effective offset becomes < 8
- VSTORE a[3] still truncates at 16 bytes
- Because of conversions, we must keep track of truncation factors as data flows through the graph

Approach
- Additional steps in the presence of data conversion:
  - propagate offsets through data conversions (vpack/vunpack)
  - keep track of truncation factors
  - some constraints on the truncation factors must be satisfied
  - more details in the paper

Measurements
- Target: Mac G5, PowerPC 970 with VMX (AltiVec)
- Compiler: IBM's XL compiler; comparisons performed at identical levels of optimization
- Metric: speedup factor (SIMD vs. scalar)
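Before the numbers, the pieces above can be combined into a hypothetical end-to-end sketch (ours, not the paper's generated code) of one SIMD step of a loop like a[i] = (short)(b[i+1] + c[i+3]): realign both misaligned streams, add two int vectors' worth of data, and pack the sums into one vector of shorts.

```python
VEC = 4  # 32-bit elements per 16-byte SIMD register

def vload(mem, i):
    """Truncating load: return the aligned chunk enclosing index i."""
    a = (i // VEC) * VEC
    return mem[a:a + VEC]

def vperm(v1, v2, start):
    """Extract VEC elements from v1|v2 starting at 'start' (vperm model)."""
    return (v1 + v2)[start:start + VEC]

def load_at_offset(mem, addr, to):
    """Normalized left-shift load: element at offset 'to' is mem[addr]."""
    start = addr - to
    base = (start // VEC) * VEC
    return vperm(vload(mem, base), vload(mem, base + VEC), start - base)

def vpack(v_lo, v_hi):
    """Pack two int vectors into one vector of 8 shorts (low 16 bits)."""
    return [x & 0xFFFF for x in v_lo + v_hi]

b = [10 * i for i in range(64)]
c = [100 * i for i in range(64)]

# One SIMD step producing 8 shorts: a[4..11] = b[5..12] + c[7..14].
# Two int-vector additions feed one pack, as required for full bandwidth.
sums = []
for k in (0, VEC):
    vb = load_at_offset(b, 5 + k, 0)  # realigned b[5+k .. 8+k]
    vc = load_at_offset(c, 7 + k, 0)  # realigned c[7+k .. 10+k]
    sums.append([x + y for x, y in zip(vb, vc)])
va = vpack(sums[0], sums[1])
print(va)  # [750, 860, 970, 1080, 1190, 1300, 1410, 1520]
```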
Measurements for Alignment Handling
- Synthetic benchmark (average over 50 loops) with random alignments for 2 loads / 1 store
- [Chart: speedup factors for zero-shift vs. lazy-shift with runtime (RT) and compile-time (CT) alignment; peak speedups of roughly 4.0 for int (4 per vector), 6.1 for short (8 per vector), and 12.4 for char (16 per vector). The RT columns are our contribution; the CT-only optimized alignment handling is from [Eichen 04].]

Measurements for Kernels
- With/without the proposed runtime alignment handling and conversion support
- [Chart: speedups for zero-shift (prior work), lazy-shift, conversion + zero-shift, and conversion + lazy-shift on numerical.saxpy, numerical.swim, tcp/ip.checksum, and video.alphablending; ranging from 0.71 (a slowdown under prior work) up to 8.25 with conversion + lazy-shift.]

Measurements for Benchmarks
- [Chart: speedup factors across full benchmarks, ranging from 1.02 to 2.16.]

Conclusions
- Realistic benchmarks have runtime alignment and data size conversion
- We propose a technique that supports both in an integrated fashion, fully compatible with an optimized number of shifts
- Demonstrated good speedups:
  - 1.1x - 6.1x for kernels
  - up to 2.2x for benchmarks
  - more integration with the full compiler is needed for better performance