Presentation

Autovectorization in GCC
Dorit Naishlos
[email protected]
IBM Labs in Haifa
IBM Labs in Haifa
Vectorization in GCC - Talk Layout
Background: GCC
HRL and GCC
Vectorization
Background
The GCC Vectorizer
Developing a vectorizer in GCC
Status & Results
Concluding Remarks
2
IBM Labs in Haifa
GCC
Open Source
Multi-platform
Download from gcc.gnu.org
How does it work
cvs
mailing list: [email protected]
steering committee, maintainers
Who’s involved
Volunteers
Linux distributors
Apple, IBM – HRL (Haifa Research Lab)
3
IBM Labs in Haifa
GCC 4.0
GCC Passes
front-end
parse trees
generic trees
gimple trees
middle-end
generic trees
back-end
RTL
4
misc opts
loop optimizations
into SSA
loop opts
SSA optimizations
vectorization
Out of SSA
loop opts
gimple trees
machine
generic trees
description
misc opts
IBM Labs in Haifa
GCC Passes
front-end
parse trees
middle-end
generic trees
Fortran 95 front-end
IPO
CP
Aliasing
Data layout
Vectorization
Loop unrolling
Scheduler
Modulo Scheduling
back-end
RTL
5
Power4
machine
description
The Haifa GCC team:
Leehod Baruch
Revital Eres
Olga Golovanevsky
Mustafa Hagog
Razya Ladelsky
Victor Leikehman
Dorit Naishlos
Mircea Namolaru
Ira Rosen
Ayal Zaks
IBM Labs in Haifa
Vectorization in GCC - Talk Layout
Background: GCC
HRL and GCC
Vectorization
Background
The GCC Vectorizer
Developing a vectorizer in GCC
Status & Results
Concluding Remarks
6
IBM Labs in Haifa
Programming for Vector Machines
Proliferation of SIMD (Single Instruction Multiple Data) model
MMX/SSE, Altivec
Communications, Video, Gaming
Fortran90
a[0:N] = b[0:N] + c[0:N];
Intrinsics
vector float vb = vec_load (0, ptr_b);
vector float vc = vec_load (0, ptr_c);
vector float va = vec_add (vb, vc);
vec_store (va, 0, ptr_a);
Autovectorization: Automatically transform serial code to vector code
by the compiler.
7
IBM Labs in Haifa
What is vectorization
VF = 4
0
1
2
3
VR1 a b c d
OP(a)
VR2
OP(b)
VR3
VR4
VR5
Vector Registers
VOP( a, VR1
b, c, d )
OP(c)
OP(d)
vectorization
Data elements packed into vectors
Vector length
Vectorization Factor (VF)
Data in Memory:
a b c d e f g h i j k l m n o p
8
Vector operation
IBM Labs in Haifa
Vectorization
original serial loop:
for(i=0; i<N; i++){
a[i] = a[i] + b[i];
}
vectorization
loop in vector notation:
for (i=0; i<N; i+=VF){
a[i:i+VF] = a[i:i+VF] + b[i:i+VF];
}
Loop based vectorization
No dependences between iterations
9
loop in vector notation:
for (i=0; i<(N-N%VF); i+=VF){
a[i:i+VF] = a[i:i+VF] + b[i:i+VF];
} vectorized loop
for ( ; i < N; i++) {
a[i] = a[i] + b[i];
} epilog loop
IBM Labs in Haifa
Loop Dependence Tests
for (i=0; i<N; i++){
D[i] = A[i] + Y
A[i+1] = B[i] + X
}
for (i=0; i<N; i++){
B[i] = A[i] + Y
A[i+1] = B[i] + X
}
for (i=0; i<N; i++)
for (j=0; j<N; j++)
A[i+1][j] = A[i][j] + X
10
IBM Labs in Haifa
Basic vectorizer 01.01.2004
known loop bound
arrays and pointers
1D aligned arrays
unaligned accesses
force alignment
invariants
conditional code
idiom recognition
saturation
mainline
11
Vectorizer Skeleton
for (i=0; i<N; i++){
a[i] = b[i] + c[i];
get candidate loops
nesting, entry/exit, countable
}
memory references
li r9,4
li r2,0
mtctr r9
access pattern analysis
data dependences
alignment analysis
scalar dependences
vectorizable operations
data-types, VF, target support
vectorize loop
L2:
lvx v0,r2,r30
lvx v1,r2,r29
vaddfp v0,v0,v1
stvx v0,r2,r0
addi r2,r2,16
bdnz L2
IBM Labs in Haifa
VR1 a b c d
Alignment support in a multi-platform compiler
General (new trees: realign_load)
Efficient (new target hooks: mask_for_load)
Hide low-level details
OP(c)
VR2 e f g h
OP(d)
Alignment
0
1
2
3
VR3 c d e f
VR4
VR5
VOP( c, VR3
d, e, f )
OP(e)
OP(f)
Vector Registers
(VR1,VR2)
mask
VR3
(0,0,1,1,1,1,0,0)
pack (VR1,VR2),mask
VOP(VR3)
Data in Memory:
a b c d e f g h i j k l m n o p
misalign = -2
12
vload (mem)
IBM Labs in Haifa
Handling Alignment
Alignment analysis
Transformations to force alignment
loop versioning
loop peeling
Efficient misalignment support
vector
==q;q,Milind
vector
*vp_q1
*vp_q2
= &q[VF-1];
Aart *vp_q
J.C. Bik,
Girkar,
Paul M. Grey,
vector Ximmin
*vp_p = p;
Tian. Automatic intra-register
int indxvectorization
= 0, vector vx,
vx1,
vx;for
thevx2;
intel architecture.
LOOP:International Journal of Parallel
vp_q1[indx];
vx1 =Programming,
April 2002.
vx2 = vp_q2[indx];
vx = vp_q[indx];
mask = target_hook_mask_for_load (q);
vx = realign_load(vx1, vx2, mask)
vp_p[indx] = vx;
indx++;
13
for (i=0; i<N; i++){
x = q[i]; //misalign(q) = unknown
p[i] = x; //misalign(p) = unknown
-2
}
Peeling for p[i] and versioning:
Loop
access p[i]):
versioning:
for (i =peeling
0; i < 2;(for
i++){
if x(q(i
for
aligned)
0; i < 2; i++){
{
=is=
q[i];
for
x
= (i=0;
q[i];
p[i]
= x; i<N; i++){
x == q[i];
x; //misalign(q) = 0
} p[i]
= x; //misalign(p)
= -2
}if (qp[i]
is aligned)
{
for}for
(i =
2;3;
i <i<N;
N; i++){
(i =
i++){
}else
{ q[i];
x=
q[i];
//misalign(q)
==
unknown
0
x=
//misalign(q)
forp[i]
(i=0;
i<N;
i++){ = =
00
p[i]
==
x; x;
//misalign(p)
//misalign(p)
} } x = q[i]; //misalign(q) = unknown
p[i]{ = x; //misalign(p) = -2
}else
}for (i = 3; i<N; i++){
} x = q[i]; //misalign(q) = unknown
p[i] = x; //misalign(p) = 0
}
}
IBM Labs in Haifa
Preliminary Results
Pixel Blending Application
- small dataset: 16x improvement
- tiled large dataset: 7x improvement
- large dataset with display: 3x improvement
for (i = 0; i < sampleCount; i++) {
output[i] = ( (input1[i] * )>>8 + (input2[i] * ( -1))>>8 );
}
SPEC gzip – 9% improvement
for (n = 0; n < SIZE; n++) {
m = head[n];
head[n] = (unsigned short)(m >= WSIZE ? m-WSIZE : 0);
}
Kernels:
14
lvx v0,r3,r2
vsubuhs v0,v0,v1
stvx v0,r3,r2
addi r2,r2,16
bdnz L2
IBM Labs in Haifa
Performance improvement (aligned accesses)
float
int
short
char
a[i]=b[i]
15
a[i+ ]=b[i- ] a[i]=b[i]+
a[i]=b[i]&c[i] a[i]=b[i]+c[i]
IBM Labs in Haifa
Performance improvement (unaligned accesses)
float
int
short
char
a[i]=b[i]
16
a[i+ ]=b[i- ] a[i]=b[i]+
a[i]=b[i]&c[i] a[i]=b[i]+c[i]
IBM Labs in Haifa
Vectorizer Status
In the main GCC development trunk
Will be part of the GCC 4.0 release
New development branch (autovect-branch)
Vectorizer Developers:
Dorit Naishlos
Olga Golovanevsky
Ira Rosen
Leehod Baruch
Keith Besaw (IBM US)
Devang Patel (Apple)
17
IBM Labs in Haifa
Concluding Remarks
GCC
HRL and GCC
Evolving - new SSA framework
GCC vectorizer
Developing a generic vectorizer
working with an open source community
in a multi-platform compiler
Open
GCC 4.0
“project management” ?
culture
multi-platform
we are part of the GCC community
world wide collaboration
world wide exposure
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
18
IBM Labs in Haifa
The End
19
IBM Labs in Haifa
Future vectorization Features
1. Reduction
2. Multiple data types
3. Non-consecutive data-accesses
Internal Representation
limited (scalar operations)
high level, machine independent
Low-level, architecture-dependent details
multiplatform
20
IBM Labs in Haifa
1. Reduction
Cross iteration dependence
Prolog and epilog
Partial sums
s1,s2,s3,s4
0 0 0 0
0 1 2 3
4 6 8 10
28
21
+
0 1 2 3
+
4 5 6 7
s = 0;
for (i=0; i<N; i++) {
s += a[i] * b[i];
}
loop:
s_1 = phi (0, s_2)
i_1 = phi (0, i_1)
xa_1 = a[i_1]
xb_1 = b[i_1]
tmp_1 = xa * xb
s_2 = s_1 + tmp_1
i_2 = i_1 + 1
if (i_2 < N) goto loop
IBM Labs in Haifa
2. Mixed data types
short b[N];
int a[N];
for (i=0; i<N; i++)
a[i] = (int) b[i];
Unpack
22
IBM Labs in Haifa
3. Non-consecutive access patterns
0
1
2
3
A[i], i={0,5,10,15,…}; access_fn(i) = (0,+,5)
VR1 a b c d
OP(a)
VR2 e f g h
OP(f)
VR3 i j k l
VR4 m n o p
VR5 a f k p
VOP( a, VR5
f, k, p )
OP(k)
OP(p)
(VR1,…,VR4)
mask
VR5
(1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1)
pack (VR1,…,VR4),mask
VOP(VR5)
Data in Memory:
a b c d e f g h i j k l m n o p
23
vload (mem)