Presentation

Multi-Platform Auto-Vectorization
Dorit Nuzman, IBM
Richard Henderson, RedHat
IBM Labs in Haifa
Multi-Platform Auto-Vectorization - Talk Layout
Vectorization for SIMD
Alignment Example
Vectorization in GCC
Vector Abstractions
Abstractions for Alignment
Multi-platform Evaluation
Related Work & Conclusion
Vectorization
SIMD (Single Instruction Multiple Data) model
Communications, Video, Gaming
MMX/SSE, Altivec
Programming for Vector Platforms
Fortran90
a[0:N] = b[0:N] + c[0:N];
Intrinsics
vector float vb = vec_load (0, ptr_b);
vector float vc = vec_load (0, ptr_c);
vector float va = vec_add (vb, vc);
vec_store (va, 0, ptr_a);
Autovectorization: the compiler automatically transforms serial code into vector code.
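To make the contrast between the programming styles concrete, here is a minimal sketch in C of the same a = b + c computation, first as the serial loop an auto-vectorizer would start from, then vectorized by hand. It uses GCC's generic vector extension (the `vector_size` attribute) as a stand-in for the platform intrinsics on the slide. The function names and the requirement that `n` be a multiple of 4 are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four packed floats (16 bytes), using GCC's generic vector extension. */
typedef float v4sf __attribute__((vector_size(16)));

/* Serial version: the form an auto-vectorizer takes as input. */
void add_scalar(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Hand-vectorized version: load, add, store, four elements per step.
   Assumes n is a multiple of 4; memcpy keeps the accesses valid for
   any alignment of a, b and c. */
void add_vector(float *a, const float *b, const float *c, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        v4sf vb, vc, va;
        memcpy(&vb, b + i, sizeof vb);   /* vector load  */
        memcpy(&vc, c + i, sizeof vc);   /* vector load  */
        va = vb + vc;                    /* vector add   */
        memcpy(a + i, &va, sizeof va);   /* vector store */
    }
}
```

Both functions produce the same array; the point of autovectorization is that the programmer writes only `add_scalar` and the compiler derives something like `add_vector`.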
What is vectorization
Data elements packed into vectors
Vector length: Vectorization Factor (VF), here VF = 4
No Data Dependences
SIMD Architectural Capabilities
[Figure: data in memory (a b c d e f g h i j k l m n o p) is packed into vector registers VR1 to VR5, four elements per register (VR1 = a b c d). Vectorization replaces the scalar operations OP(a), OP(b), OP(c), OP(d) with one vector operation VOP(a, b, c, d) on VR1.]
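The transformation pictured on this slide can be sketched in plain C. The sketch below models one vector operation as a call that processes VF = 4 packed elements, and adds a scalar epilogue for leftover elements. The epilogue, the function names and the particular `op` are assumptions for illustration and are not taken from the slide.

```c
#include <stddef.h>

#define VF 4  /* vectorization factor: elements per vector register */

/* Hypothetical scalar operation standing in for OP on the slide. */
static int op(int x) { return 2 * x + 1; }

/* Models one vector operation VOP: applies op to VF packed elements. */
static void vop(int *dst, const int *src)
{
    for (int lane = 0; lane < VF; lane++)
        dst[lane] = op(src[lane]);
}

/* Vectorized loop shape: VF elements per iteration, then a scalar
   epilogue for the n % VF leftover elements. Returns the number of
   vector iterations executed. */
size_t apply_vectorized(int *dst, const int *src, size_t n)
{
    size_t i = 0, vector_iters = 0;
    for (; i + VF <= n; i += VF) {
        vop(dst + i, src + i);
        vector_iters++;
    }
    for (; i < n; i++)
        dst[i] = op(src[i]);
    return vector_iters;
}
```

The rewrite is only legal when the iterations are independent, which is the "No Data Dependences" condition on the slide.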
Limitations of SIMD Architectures:
Unaligned memory access
[Figure: data in memory (a b c d e f g h i j k l m n o p). Aligned loads give VR1 = a b c d and VR2 = e f g h, but the operations OP(c), OP(d), OP(e), OP(f) need VOP(c, d, e, f), so VR3 = c d e f has to be produced from the two aligned vectors V1 and V2.]
Platform mechanisms for obtaining V3:
vec-shift-left v1, v2, 2
vec-permute v1, v2, {2,3,4,5}
extql (v1, addr), extqh (v2, addr), vec-or (v1, v2) (Alpha)
alnv.ps v1,v2,addr (MIPS64)
load-left, load-right (MIPS MDMX)
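What the shift and permute mechanisms on this slide compute can be modelled in a few lines of C. This is a sketch of the data movement only, with four elements per vector as in the figure; the function name and the element type are illustrative assumptions, and no real instruction is being emulated bit for bit.

```c
#include <assert.h>

#define VS 4  /* elements per vector in this model */

/* Models an explicit realignment step: the result takes elements
   shift..VS-1 of v1 followed by elements 0..shift-1 of v2. */
void realign(char *v3, const char *v1, const char *v2, int shift)
{
    assert(shift >= 0 && shift < VS);
    for (int i = 0; i < VS; i++)
        v3[i] = (i + shift < VS) ? v1[i + shift] : v2[i + shift - VS];
}
```

With v1 = a b c d, v2 = e f g h and a shift of 2, the result is c d e f, which matches the permute mask {2,3,4,5} applied to the concatenation of v1 and v2.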
GCC
Free Software Foundation
Multi-platform
Who’s involved
Volunteers
Linux distributors (RedHat, Suse…)
Code Sourcery, AdaCore…
IBM, HP, Intel, Apple…
GCC Passes
front-ends: C front-end, C++ front-end, Fortran front-end, Ada front-end, …
  parse trees
middle-end
  GIMPLE trees
  SSA optimizations: CCP, PRE, DCE, CSE, DSE, forward prop, copy prop, VRP, loop opts
  loop optimizations: invariant motion, unswitching, linear transform, If-conversion, vectorization, unrolling
back-end
  RTL
  Sibling call optimizations
  Common subexpression elimination
  Loop optimizations
  Data flow analysis
  Instruction combination
  Instruction scheduling
  Register allocation and reloading
  Instruction scheduling (repeated)
  Branch shortening
  assembly
machine description (rs6000 port, i386 port, mips port, …): GIMPLE Abstractions, Vector Size
Vector Abstractions: Why Needed
Represent high-level idioms that otherwise can’t be vectorized
  reduction
  special idioms (sad, subtract-and-saturate, dot-product)
s = 0;
for (i=0; i<N; i++) {
  s = s + a[i] * b[i];
}
Express vector operations in GIMPLE
  “reduc-plus”
  extract, shuffle,…
[Figure: vector of partial sums s1, s2, s3, s4]
API for targets to convey availability and cost of a functionality
optab/type   char  short  int  v16char  v8short  v4int
add          f1    f2     f3   f4       f5       f6
reduc-plus                     f7       f8
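The reduction idiom on this slide can be sketched as follows: the vectorized loop keeps VF partial sums (s1 to s4 in the figure) in one vector, and a single reduc-plus at the end folds them into the scalar result. The sketch models vectors as small arrays; the function names and the restriction to `n` divisible by VF are assumptions for illustration. Integers are used so that reordering the additions cannot change the result.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4

/* Models reduc-plus: sums the VF elements of one vector into a scalar. */
static int reduc_plus(const int *v)
{
    int s = 0;
    for (int lane = 0; lane < VF; lane++)
        s += v[lane];
    return s;
}

/* The serial loop from the slide. */
int dot_scalar(const int *a, const int *b, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s = s + a[i] * b[i];
    return s;
}

/* Vectorized shape: VF partial sums updated per iteration, reduced once
   after the loop. Assumes n is a multiple of VF. */
int dot_vectorized(const int *a, const int *b, size_t n)
{
    assert(n % VF == 0);
    int vs[VF] = {0, 0, 0, 0};             /* s1, s2, s3, s4 */
    for (size_t i = 0; i < n; i += VF)
        for (int lane = 0; lane < VF; lane++)
            vs[lane] += a[i + lane] * b[i + lane];
    return reduc_plus(vs);
}
```

Without an operation like reduc-plus in the intermediate representation, the loop-carried dependence on `s` would block vectorization of this loop.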
Vector Abstractions: Considerations
Generality vs. applicability
  General enough to cover all uses
  Minimize increase of operation-codes
  Not generally supported: permute
  [Figure: permute selecting elements from vectors V1 and V2 (a b c d e f g h, positions 0 to 7)]
Compound vs. building blocks
  Increase of operation-codes (subtract-and-saturate, dot-product)
  Complicated “black-box” operations
  Increase ways to represent same functionality
  Improved direct support of a high-level idiom over basic functionalities
GCC conventions
  naming, existing operation-codes, default values…
Performance
  Translates to most efficient code
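The compound versus building-blocks trade-off can be illustrated with subtract-and-saturate on unsigned bytes. The sketch below shows, per element, what a single compound operation would compute and one possible way to obtain the same result from simpler operations (max, then an ordinary subtract). The decomposition and the function names are illustrative assumptions, not something the slides prescribe.

```c
#include <stdint.h>

/* Compound form: what a single sub_sat operation computes per element. */
uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (a > b) ? (uint8_t)(a - b) : 0;
}

/* Building-block form: the same result from max followed by a plain
   subtract, for a target without a saturating subtract. */
uint8_t sub_sat_from_blocks_u8(uint8_t a, uint8_t b)
{
    uint8_t m = (a > b) ? a : b;   /* max (a, b) */
    return (uint8_t)(m - b);       /* cannot wrap, since m >= b */
}
```

A compound operation-code maps directly onto targets that have the instruction, at the price of one more code that every pass must understand; the building-block form needs no new code but leaves it to the back-end to recognize the idiom.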
Vector Abstractions: Abstractions for alignment
[Figure: data in memory (a b c d e f g h i j k l m n o p); the aligned vectors V1 = a b c d and V2 = e f g h are combined into the misaligned vector V3.]
Implicit Realignment
  misaligned_ref (ptr, mis)
  movdqu (MMX/SSE)
  load-left, load-right (MIPS MDMX)
Explicit Realignment
  align_ref (ptr)
  realign_load (v1, v2, RT)
  Realignment Token (RT)
  vec-permute v1, v2, {2,3,4,5} (Altivec)
  extql (v1, addr), extqh (v2, addr), vec-or (v1, v2) (Alpha)
  alnv.ps v1,v2,addr (MIPS64)
  vec-shift-left v1, v2, 2
Handling Alignment
for (i=0; i<N; i++){
x = a[i];
b[i] = x;
}
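/* implicit realignment */
addra_0 = &a[0];
addrb = &b[0];
vector vx;
addra_i = addra_0;
LOOP:
vx = misaligned_ref (addra_i, 0);
indirect_ref (addrb) = vx;
addra_i += 16; addrb += 16;
/* explicit realignment, two aligned loads per iteration */
addra_0 = &a[0];
addrb = &b[0];
vector vx, vx1, vx2;
addra_i = addra_0;
LOOP:
vx1 = align_ref (addra_i);
vx2 = align_ref (addra_i+15);
vx = realign_load (vx1, vx2, addra_i);
indirect_ref (addrb) = vx;
addra_i += 16; addrb += 16;
/* explicit realignment, reusing vx2 as the next vx1 */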
addra_0 = &a[0];
addrb = &b[0];
vector vx, vx1, vx2;
vx1 = align_ref (addra_0);
addra_i = addra_0 + 15;
LOOP:
vx2 = align_ref (addra_i);
vx = realign_load (vx1, vx2, addra_i);
indirect_ref (addrb) = vx;
addra_i += 16; addrb += 16; vx1 = vx2;
Handling Alignment
addra_0 = &a[0];
addrb = &b[0];
vector vx;
addra_i = addra_0;
LOOP:
vx = misaligned_ref (addra_i,0);
indirect_ref (addrb) = vx;
addra_i += 16; addrb += 16;
for (i=0; i<N; i++){
x = a[i];
b[i] = x;
}
addra_0 = &a[0];
addrb = &b[0];
vector vx, vx1, vx2;
vx1 = align_ref (addra_0);
RT = target_get_RT (addra_0);
addra_i = addra_0 + 15;
LOOP:
vx2 = align_ref (addra_i);
vx = realign_load (vx1, vx2, RT);
indirect_ref (addrb) = vx;
addra_i += 16; addrb += 16; vx1 = vx2;
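The explicit realignment loop with a realignment token can be simulated in portable C. The sketch below models memory as a byte buffer indexed by offset, so that `align_ref` can ignore the low bits of the offset the way an aligned vector load ignores the low address bits. The bodies of `align_ref`, `target_get_RT` and `realign_load` are a simple model assumed for illustration (here the token is just the misalignment in bytes); on a real target each would map to platform instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VS 16  /* vector size in bytes */

typedef struct { unsigned char b[VS]; } vec;

/* align_ref: load the aligned VS-byte block containing offset off. */
static vec align_ref(const unsigned char *mem, size_t off)
{
    vec v;
    memcpy(v.b, mem + (off & ~(size_t)(VS - 1)), VS);
    return v;
}

/* target_get_RT: in this model the token is the start misalignment. */
static unsigned target_get_RT(size_t off) { return (unsigned)(off % VS); }

/* realign_load: bytes rt..VS-1 of v1 followed by bytes 0..rt-1 of v2. */
static vec realign_load(vec v1, vec v2, unsigned rt)
{
    vec v;
    memcpy(v.b, v1.b + rt, VS - rt);
    memcpy(v.b + (VS - rt), v2.b, rt);
    return v;
}

/* Copies nvec vectors from mem + off to dst with one aligned load per
   iteration. mem must be readable up to the end of the last aligned
   block touched. This model assumes a misaligned start (rt != 0): with
   rt == 0 the carried-over vx1 would lag one block behind. */
void copy_realigned(unsigned char *dst, const unsigned char *mem,
                    size_t off, size_t nvec)
{
    unsigned rt = target_get_RT(off);
    assert(rt != 0);
    vec vx1 = align_ref(mem, off);
    size_t addra_i = off + VS - 1;
    for (size_t i = 0; i < nvec; i++) {
        vec vx2 = align_ref(mem, addra_i);
        vec vx = realign_load(vx1, vx2, rt);
        memcpy(dst + i * VS, vx.b, VS);     /* indirect_ref (addrb) = vx */
        addra_i += VS;
        vx1 = vx2;                          /* reuse the second load     */
    }
}
```

Computing the token once before the loop and carrying `vx2` over as the next `vx1` leaves a single aligned load and one realignment in the loop body, instead of two loads per iteration.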
GIMPLE Vector Abstractions
Alignment:
misaligned_ref, align_ref
realign_load, target_get_RT
Conditional operations:
(cond) ? x : y
Reduction:
reduc_plus
Type Conversions
unpack_high, unpack_low
pack_mod, pack_sat
Special patterns:
dot_prod, sad
sub_sat
widen_mult, widen_sum
Strided-Accesses:
extract_odd, extract_even
interleave_high, interleave_low
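As an illustration of the strided-access abstractions, the sketch below models extract_even and extract_odd on vectors of VF = 4 elements and uses them for a stride-2 loop, `sum[i] = p[2*i] + p[2*i+1]`. Two contiguous vector loads are de-interleaved into a vector of even-indexed and a vector of odd-indexed elements. The element numbering (index 0 is the first element in memory) and the function names are assumptions of this model.

```c
#define VF 4

/* extract_even: elements 0, 2, 4, 6 of the concatenation v1|v2. */
void extract_even(int *out, const int *v1, const int *v2)
{
    for (int i = 0; i < VF; i++) {
        int k = 2 * i;
        out[i] = (k < VF) ? v1[k] : v2[k - VF];
    }
}

/* extract_odd: elements 1, 3, 5, 7 of the concatenation v1|v2. */
void extract_odd(int *out, const int *v1, const int *v2)
{
    for (int i = 0; i < VF; i++) {
        int k = 2 * i + 1;
        out[i] = (k < VF) ? v1[k] : v2[k - VF];
    }
}

/* One vector iteration of  sum[i] = p[2*i] + p[2*i+1]  for VF results:
   p[0..3] and p[4..7] play the role of two contiguous vector loads. */
void add_pairs(int *sum, const int *p)
{
    int even[VF], odd[VF];
    extract_even(even, p, p + VF);
    extract_odd(odd, p, p + VF);
    for (int i = 0; i < VF; i++)
        sum[i] = even[i] + odd[i];
}
```

interleave_high and interleave_low perform the inverse data movement, which is what a strided store needs.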
Multi-Platform Evaluation
IBM PowerPC970, Altivec (VS = 16)
Intel Pentium4, SSE2 (VS = 16)
AMD Athlon64, SSE2 (VS = 16)
Intel Itanium2 (VS = 8)
MIPS64, paired-single-fp (VS = 8)
Alpha (VS = 8)
Multi-Platform Evaluation
Vectorization Speedup Factors - Aligned
[Figure: speedup factor per benchmark and platform. Benchmarks: blas.sdot_fp, blas.saxpy_fp, blas.dscal_fp, vecmax_fp, checksum_s, chromakey_u, vecmax_s, vecsum_u, vecmax_u. Platforms: powerpc, pentium, athlon, itanium, alpha, mips.]
Multi-Platform Evaluation
Vectorization Speedup Factors - Unaligned
[Figure: speedup factor per benchmark and platform. Benchmarks: blas.sdot_fp, blas.saxpy_fp, blas.dscal_fp, vecmax_fp, checksum_s, chromakey_u, vecmax_s, vecsum_u, vecmax_u. Platforms: powerpc, pentium, athlon, itanium, alpha, mips.]
Related Work
Vectorizing compilers available for a specific architecture
XL (Eichenberger, Wu). Altivec
icc (Bik). MMX/SSE
CoSy (Krall). VIS
SUIF (Larsen, Amarasinghe; Shin, Chame, Hall). Altivec
Vectorizing compilers available for multiple SIMD targets
source-to-source compilers
Vienna MAP, 2-way, domain-specific patterns. BG +
SWARP. source-to-source, multimedia patterns. Trimedia +
This Work:
In a robust industrial-strength compiler
Experimental results on several different SIMD platforms
Concluding Remarks
SIMD
Hardware limitations
Unique Hardware mechanisms
Diverse nature
Multi-platform vectorizer
Bridge gap across different SIMD targets
Efficiently support each individual platform
Identify proper abstractions
Developing the vectorizer in the GCC platform
Collaborative investment of different vendors/developers
Open, available
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
The End