IBM Research

Efficient SIMD Code Generation for Runtime Alignment & Length Conversion

Peng Wu, Alexandre Eichenberger
IBM T.J. Watson Research Center

Amy Wang
IBM Toronto Laboratory

CGO 2005
Overview

- Background on SIMD code generation
- Contribution #1: efficient alignment handling
- Contribution #2: support for data size conversion
- Measurements
Objectives

- Single Instruction Multiple Data (SIMD):
  - may yield 2x to 16x speedup
  - moderate hardware cost
  - available on most platforms: x86, PowerPC 970, CELL, BlueGene/L
- Target applications:
  - data intensive: multimedia, DSP, games, numerical
  - rich in data types: chars, shorts, ints, floats, ...
  - rich in data alignments, often known only at runtime
- Goal:
  - produce SIMD code with low overhead for the type of code found in target apps
Sequential Execution of a Loop

- for (i=0; i<100; i++) a[i+2] = b[i+1] + c[i+3];

[Figure: memory streams of arrays b, c, and a; each scalar iteration performs LOAD b[i+1], LOAD c[i+3], ADD, STORE a[i+2]; the first original (i=0) loop iteration, i.e. LOAD b[1], LOAD c[3], STORE a[2], is highlighted in yellow.]
SIMD Load/Store Preserve Memory Alignment

- Access only a 16-byte chunk of 16-byte aligned data*

[Figure: VLOAD b[1] with &b[1] = 0x1004 loads the 16-byte chunk at the boundary 0x1000, so the register holds b0..b3 with b[1] at byte offset 4.]

* AltiVec/VMX and others; SSE supports unaligned access, but less efficiently
SIMD Load/Store Preserve Memory Alignment (cont.)

- Access only a 16-byte chunk of 16-byte aligned data*

[Figure: same VLOAD b[1]; the first original (i=0) loop iteration is highlighted in yellow, the first SIMD loop iteration in grey.]

* AltiVec/VMX and others; SSE supports unaligned access, but less efficiently
Erroneous SIMD Execution

- for (i=0; i<100; i++) a[i+2] = b[i+1] + c[i+3];

[Figure: VLOAD b[1] and VLOAD c[3] each return their enclosing 16-byte aligned chunks (b0..b3 and c0..c3), so VADD computes b0+c0, b1+c1, b2+c2, b3+c3.]

b[1] and c[3] are not aligned => wrong results*

* got b[1]+c[1], wanted b[1]+c[3]
Correct SIMD Execution (Zero-Shift) [VAST 04]

- for (i=0; i<100; i++) a[i+2] = b[i+1] + c[i+3];

[Figure: VLOAD & SHIFT b[1] (shift by 4 bytes) and VLOAD & SHIFT c[3] (shift by 12 bytes) bring both streams to register offset 0; VADD then produces b1+c3, b2+c4, b3+c5, b4+c6, ...; VSTORE & SHIFT a[2] shifts the result by 8 bytes back to a's alignment before storing.]
How to Align Data in Registers

- Load 4 values from misaligned address b[1]

[Figure: VLOAD b[1] yields b0..b3 (data at offset 4), VLOAD b[5] yields b4..b7; VPERMUTE* combines the two registers into b1..b4 at offset 0.]

LOAD/STORE & SHIFT are expensive => want to minimize them

* AltiVec/VMX and most other SIMD ISAs have support for shuffling data of 2 registers
Better SIMD Execution (Lazy-Shift) [Eichen 04]

- for (i=0; i<100; i++) a[i+2] = b[i+1] + c[i+3];

[Figure: instead of shifting every stream to register offset 0, b[1] and c[3] are each shifted by only 4 bytes, directly to the store offset of a[2] (byte offset 8); VADD then feeds a plain VSTORE a[2] with no extra shift.]
Contribution #1: Handling Runtime Alignment Efficiently

- Runtime alignment is frequent:
  - algorithmic characteristics: e.g. "a[i-1] + a[i] + a[i+1]"
  - coding style: pointers, libraries
- Previous work does not support:
  - runtime alignment with a minimized number of shifts (e.g. the lazy-shift policy)
  - reason: a code generation issue
- We propose:
  - an additional phase that "normalizes" the computation to eliminate this issue
Problem with Runtime Alignment

- Shift c[3] LEFT by 4 bytes: VLOAD c[3] (c0..c3), VLOAD c[3+4] (c4..c7), VPERM-LEFT => c1 c2 c3 c4. Shifts in data from the NEXT vector.
- Shift b[1] RIGHT by 4 bytes: VLOAD b[1-4] (b-4..b-1), VLOAD b[1] (b0..b3), VPERM-RIGHT => b-1 b0 b1 b2. Shifts in data from the PREV vector.

Different code sequences for shift LEFT & RIGHT. With runtime alignment, we don't know which direction we are shifting.
=> Hence the compiler cannot generate this code statically.
Insight: Convert Arbitrary Shifts into Left-Shifts

- Instead of shifting b[1] RIGHT by 4 bytes (VLOAD b[1-4], VLOAD b[1], VPERM-RIGHT => b-1 b0 b1 b2, data from the PREV vector),
- we may shift b[-1] to the LEFT by 12 bytes (VLOAD b[-1], VLOAD b[-1+4], VPERM-LEFT => b-1 b0 b1 b2, data from the NEXT vector).

By changing the reference point, i.e. by pre-pending 8 bytes, the shift RIGHT becomes a shift LEFT.
Approach

- "Normalize" each shift into a shift LEFT:
  - prepend the input stream to transform the shift into a shift left
  - the prepend amount is mechanically derived from the alignment
  - for runtime alignment, the prepend amount is only known at runtime
  - details and proofs are in the paper
Contribution #2: Handling Data Conversion Efficiently

- Data conversion is frequent:
  - multimedia apps often store data in small data types but operate on larger ones
  - data conversion often occurs jointly with misalignment
- Previous work does not support:
  - data conversion and misalignment with a minimized number of shifts
  - reason: a code generation issue
- We extend our approach to handle such cases
Data Size Conversion

- What happens when converting between data of different sizes?

[Figure: VPACK performs the packing part of the int => short conversion*: 2 vectors of integers (b0..b3 and b4..b7) become 1 vector of 8 shorts (b0..b7).]

- To fully utilize the SIMD bandwidth, we must compute over:
  - 2 vectors of 4 integers
  - 1 vector of 8 shorts

* The logical part of the conversion (e.g. sign extension) has no SIMD impact and is ignored here
Alignment of Data is Impacted by Conversion

[Figure: b[3] resides at byte offset 12 in the integer vectors; after VPACK packs the integers into shorts, b[3] ends up at byte offset 12/2 = 6.]
Alignment of Data is Impacted by Conversion (cont.)

[Figure: VLOAD b[3] truncates the address at 16 bytes, thus offset < 16; after the data gets packed into shorts, the corresponding offset is < 8. The VSTORE a[3] still truncates at 16 bytes.]

Because of conversions, we must keep track of the truncation factors as data flows through the graph.
Approach

- Additional steps in the presence of data conversion:
  - propagate offsets through data conversions (vpack/vunpack)
  - keep track of truncation factors
  - satisfy some constraints on the truncation factors
  - more details in the paper
Measurements

- Target:
  - Mac G5 with a PowerPC 970 with VMX (AltiVec)
- Compiler:
  - IBM's XL compiler
  - comparisons performed at identical levels of optimization
- Metric:
  - speedup factor (SIMD vs. scalar)
Measurements for Alignment Handling

- Synthetic benchmark (average over 50 loops) with random alignments for 2 loads / 1 store

[Chart: speedup factors of zero-shift vs. lazy-shift under runtime (RT) and compile-time (CT) alignment, for int (4 per vector), short (8 per vector), and char (16 per vector); labeled speedups include 4.0, 6.1, and 12.4. Our contribution here is the RT cases; optimized alignment handling [Eichen 04] was compile-time only.]
Measurements for Kernels

- With/without the proposed runtime alignment handling and conversion support

[Chart: speedup factors for numerical.saxpy, numerical.swim, tcp/ip.checksum, and video.alphablending under four schemes: zero-shift (prior work), lazy-shift, conversion + zero-shift, and conversion + lazy-shift; values range from 0.71 to 8.25.]
Measurements for Benchmarks

[Chart: speedup factors across benchmarks, ranging from 1.02 to 2.16.]
Conclusions

- Realistic benchmarks have:
  - runtime alignment
  - data size conversion
- We propose a technique that supports both:
  - in an integrated fashion
  - fully compatible with an optimized number of shifts
- Demonstrated good speedups:
  - 1.1x to 6.1x for kernels
  - up to 2.2x for benchmarks
  - more integration with the full compiler is needed for better performance