Autovectorization in GCC
Dorit Naishlos ([email protected])
IBM Labs in Haifa

Vectorization in GCC - Talk Layout
- Background: GCC; HRL and GCC; vectorization background
- The GCC vectorizer: developing a vectorizer in GCC; status & results
- Concluding remarks

GCC
- Open source, multi-platform; download from gcc.gnu.org
- How does it work: CVS, mailing list ([email protected]), steering committee, maintainers
- Who's involved: volunteers, Linux distributors, Apple, IBM - HRL (Haifa Research Lab)

GCC 4.0 Passes
(Figure: the front-end produces parse trees, which are lowered to GENERIC and then GIMPLE trees; the middle-end runs misc optimizations, goes into SSA, performs SSA optimizations including loop optimizations (loop opts, vectorization, loop opts), goes out of SSA, and runs further misc optimizations on GIMPLE/GENERIC trees; the back-end lowers to RTL, driven by the machine description.)

GCC Passes
(Figure: the same pipeline, highlighting areas of HRL work: the Fortran 95 front-end, IPO, CP, aliasing, data layout, vectorization, loop unrolling, the scheduler, modulo scheduling, and the Power4 machine description.)
The Haifa GCC team: Leehod Baruch, Revital Eres, Olga Golovanevsky, Mustafa Hagog, Razya Ladelsky, Victor Leikehman, Dorit Naishlos, Mircea Namolaru, Ira Rosen, Ayal Zaks

Vectorization in GCC - Talk Layout
- Background: GCC; HRL and GCC; vectorization background
- The GCC vectorizer: developing a vectorizer in GCC; status & results
- Concluding remarks

Programming for Vector Machines
- Proliferation of the SIMD (Single Instruction Multiple Data) model: MMX/SSE, AltiVec; communications, video, gaming
- Fortran 90 array notation:
    a[0:N] = b[0:N] + c[0:N];
- Intrinsics:
    vector float vb = vec_load (0, ptr_b);
    vector float vc = vec_load (0, ptr_c);
    vector float va = vec_add (vb, vc);
    vec_store (va, 0, ptr_a);
- Autovectorization: the compiler automatically transforms serial code into vector code.
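To make the autovectorization scenario concrete, here is a minimal sketch of the serial C form the compiler starts from, together with the flags that enable the GCC 4.0 autovectorizer on an AltiVec target (the file name, array size, and alignment attributes are illustrative assumptions):

    /* vecadd.c - a plain serial loop the autovectorizer can transform.
       Compile for PowerPC/AltiVec with, e.g.:
         gcc -O2 -ftree-vectorize -maltivec -c vecadd.c                  */
    #define N 1024
    float a[N] __attribute__ ((aligned (16)));  /* aligned, so no realignment needed */
    float b[N] __attribute__ ((aligned (16)));
    float c[N] __attribute__ ((aligned (16)));

    void vecadd (void)
    {
      int i;
      for (i = 0; i < N; i++)      /* countable loop, no cross-iteration dependences */
        a[i] = b[i] + c[i];        /* candidate for a 4-wide vaddfp                   */
    }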
What is vectorization
- Data elements are packed into vectors; the vector length is the Vectorization Factor (VF), here VF = 4.
(Figure: with data in memory a b c d e f g h i j k l m n o p, the four scalar operations OP(a), OP(b), OP(c), OP(d) are replaced by one vector operation VOP(a, b, c, d) on vector register VR1 holding a b c d.)

Vectorization
original serial loop:
    for (i=0; i<N; i++){
      a[i] = a[i] + b[i];
    }
loop in vector notation:
    for (i=0; i<N; i+=VF){
      a[i:i+VF] = a[i:i+VF] + b[i:i+VF];
    }
vectorized loop with epilog:
    for (i=0; i<(N-N%VF); i+=VF){
      a[i:i+VF] = a[i:i+VF] + b[i:i+VF];
    }
    for ( ; i < N; i++){     /* epilog loop */
      a[i] = a[i] + b[i];
    }
- Loop-based vectorization; no dependences between iterations.

Loop Dependence Tests
    for (i=0; i<N; i++){
      D[i] = A[i] + Y;
      A[i+1] = B[i] + X;
    }

    for (i=0; i<N; i++){
      B[i] = A[i] + Y;
      A[i+1] = B[i] + X;
    }

    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        A[i+1][j] = A[i][j] + X;

Basic vectorizer (01.01.2004)
- Initially: known loop bound, arrays and pointers, 1D aligned arrays
- Followed by: unaligned accesses, force alignment, invariants, conditional code, idiom recognition (saturation)
- Merged to mainline

Vectorizer Skeleton
Example loop:
    for (i=0; i<N; i++){
      a[i] = b[i] + c[i];
    }
Phases:
- get candidate loops: nesting, entry/exit, countable
- memory references: access pattern analysis, data dependences, alignment analysis
- scalar dependences
- vectorizable operations: data-types, VF, target support
- vectorize loop
Generated AltiVec code:
    li r9,4
    li r2,0
    mtctr r9
    L2: lvx v0,r2,r30
        lvx v1,r2,r29
        vaddfp v0,v0,v1
        stvx v0,r2,r0
        addi r2,r2,16
        bdnz L2

Alignment
(Figure: with data in memory a b c d e f g h i j k l m n o p, the access starts at c, so misalign = -2. Two aligned vector loads bring VR1 = (a,b,c,d) and VR2 = (e,f,g,h); with the mask (0,0,1,1,1,1,0,0) they are packed into VR3 = (c,d,e,f), and the scalar operations OP(c), OP(d), OP(e), OP(f) become one VOP(VR3).)
Alignment support in a multi-platform compiler:
- General (new trees: realign_load)
- Efficient (new target hooks: mask_for_load)
- Hide low-level details

Handling Alignment
- Alignment analysis
- Transformations to force alignment: loop versioning, loop peeling
- Efficient misalignment support

Example (misalign(q) unknown, misalign(p) = -2):
    for (i=0; i<N; i++){
      x = q[i];      // misalign(q) = unknown
      p[i] = x;      // misalign(p) = -2
    }

Peeling (for access p[i]):
    for (i = 0; i < 2; i++){
      x = q[i];
      p[i] = x;
    }
    for (i = 2; i < N; i++){
      x = q[i];      // misalign(q) = unknown
      p[i] = x;      // misalign(p) = 0
    }

Versioning (for access q[i]):
    if (q is aligned) {
      for (i = 0; i < N; i++){
        x = q[i];    // misalign(q) = 0
        p[i] = x;    // misalign(p) = -2
      }
    } else {
      for (i = 0; i < N; i++){
        x = q[i];    // misalign(q) = unknown
        p[i] = x;    // misalign(p) = -2
      }
    }

Peeling (for p[i]) and versioning (for q[i]):
    for (i = 0; i < 2; i++){
      x = q[i];
      p[i] = x;
    }
    if (q is aligned) {
      for (i = 2; i < N; i++){
        x = q[i];    // misalign(q) = 0
        p[i] = x;    // misalign(p) = 0
      }
    } else {
      for (i = 2; i < N; i++){
        x = q[i];    // misalign(q) = unknown
        p[i] = x;    // misalign(p) = 0
      }
    }

Efficient misalignment support (realign_load):
    vector *vp_q = q, *vp_q1 = q, *vp_q2 = &q[VF-1];
    vector *vp_p = p;
    int indx = 0;
    vector vx, vx1, vx2;
    mask = target_hook_mask_for_load (q);
    LOOP:
      vx1 = vp_q1[indx];
      vx2 = vp_q2[indx];
      vx  = realign_load (vx1, vx2, mask);   /* replaces the unaligned vx = vp_q[indx]; */
      vp_p[indx] = vx;
      indx++;

Reference: Aart J.C. Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian. Automatic intra-register vectorization for the Intel architecture. International Journal of Parallel Programming, April 2002.
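The realign_load scheme can be mimicked by hand with AltiVec intrinsics. The sketch below is an illustrative stand-in, not the vectorizer's actual output; the function name and the assumptions that p is 16-byte aligned and n is a multiple of 4 are mine. vec_lvsl plays the role of the mask_for_load hook and vec_perm the role of realign_load.

    #include <altivec.h>

    /* Copy n floats from a possibly misaligned q to a 16-byte aligned p:
       two aligned loads plus a permute reconstruct each unaligned vector. */
    void copy_realign (float *p, const float *q, int n)
    {
      vector unsigned char mask = vec_lvsl (0, q);   /* permute mask from q's misalignment */
      int i;
      for (i = 0; i < n; i += 4)
        {
          vector float vx1 = vec_ld (0, &q[i]);          /* aligned load "from below" */
          vector float vx2 = vec_ld (15, &q[i]);         /* aligned load "from above" */
          vector float vx  = vec_perm (vx1, vx2, mask);  /* realigned q[i..i+3]       */
          vec_st (vx, 0, &p[i]);                         /* aligned store to p        */
        }
    }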
Preliminary Results
Pixel Blending Application:
- small dataset: 16x improvement
- tiled large dataset: 7x improvement
- large dataset with display: 3x improvement
    for (i = 0; i < sampleCount; i++) {
      output[i] = ((input1[i] * blend) >> 8) + ((input2[i] * (256 - blend)) >> 8);
    }
SPEC gzip: 9% improvement
    for (n = 0; n < SIZE; n++) {
      m = head[n];
      head[n] = (unsigned short)(m >= WSIZE ? m - WSIZE : 0);
    }
Vectorized inner loop (saturating subtract):
    lvx v0,r3,r2
    vsubuhs v0,v0,v1
    stvx v0,r3,r2
    addi r2,r2,16
    bdnz L2
Kernels: see the following charts.

Performance improvement (aligned accesses)
(Chart: speedups for the kernels a[i]=b[i], a[i+k]=b[i-k], a[i]=b[i]+k, a[i]=b[i]&c[i], and a[i]=b[i]+c[i], each for float, int, short, and char data types.)

Performance improvement (unaligned accesses)
(Chart: the same kernels and data types, with unaligned accesses.)

Vectorizer Status
- In the main GCC development trunk; will be part of the GCC 4.0 release
- New development branch (autovect-branch)
- Vectorizer developers: Dorit Naishlos, Olga Golovanevsky, Ira Rosen, Leehod Baruch, Keith Besaw (IBM US), Devang Patel (Apple)

Concluding Remarks
- GCC: evolving (new SSA framework), GCC 4.0, open culture, multi-platform, "project management"?
- HRL and GCC: we are part of the GCC community; world-wide collaboration; world-wide exposure
- GCC vectorizer: developing a generic vectorizer, working with an open source community, in a multi-platform compiler
- http://gcc.gnu.org/projects/tree-ssa/vectorization.html

The End

Future Vectorization Features
1. Reduction
2. Multiple data types
3. Non-consecutive data accesses
Internal representation: limited (scalar operations); high level and machine independent, while these features need low-level, architecture-dependent details in a multi-platform compiler.

1. Reduction
- Cross-iteration dependence
- Prolog and epilog
- Partial sums s1, s2, s3, s4
(Figure: the partial sums start at (0,0,0,0); adding (0,1,2,3) and then (4,5,6,7) gives (4,6,8,10); the final reduction yields 28.)
    s = 0;
    for (i=0; i<N; i++) {
      s += a[i] * b[i];
    }
SSA form:
    loop:
      s_1 = phi (0, s_2)
      i_1 = phi (0, i_2)
      xa_1 = a[i_1]
      xb_1 = b[i_1]
      tmp_1 = xa_1 * xb_1
      s_2 = s_1 + tmp_1
      i_2 = i_1 + 1
      if (i_2 < N) goto loop

2. Mixed data types
    short b[N];
    int a[N];
    for (i=0; i<N; i++)
      a[i] = (int) b[i];
- Unpack: widen the short elements to ints.

3. Non-consecutive access patterns
- A[i], i = {0, 5, 10, 15, ...}; access_fn(i) = (0, +, 5)
(Figure: with data in memory a b c d e f g h i j k l m n o p, four vector loads fill VR1..VR4; the mask (1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1) packs the strided elements a, f, k, p into VR5, and the scalar operations OP(a), OP(f), OP(k), OP(p) become one VOP(VR5).)
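As a scalar illustration of the partial-sums scheme from the reduction slide above (the function name and the fixed VF of 4 are assumptions), the loop below keeps four independent accumulators, mimicking one vector of partial sums, and reduces them in an epilog:

    /* Dot product with four partial sums; assumes n is a multiple of 4. */
    int dot (const int *a, const int *b, int n)
    {
      int s0 = 0, s1 = 0, s2 = 0, s3 = 0;    /* prolog: partial sums = (0,0,0,0) */
      int i;
      for (i = 0; i < n; i += 4)
        {
          s0 += a[i]     * b[i];
          s1 += a[i + 1] * b[i + 1];
          s2 += a[i + 2] * b[i + 2];
          s3 += a[i + 3] * b[i + 3];
        }
      return s0 + s1 + s2 + s3;              /* epilog: reduce the partial sums */
    }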
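For the mixed data types example, the widening that the slide calls "unpack" can be written by hand with AltiVec intrinsics roughly as below. This is a sketch, not the vectorizer's output; it assumes big-endian AltiVec, both arrays 16-byte aligned, and n a multiple of 8.

    #include <altivec.h>

    /* Widen 8 shorts per iteration into two vectors of 4 ints each. */
    void widen (int *a, const short *b, int n)
    {
      int i;
      for (i = 0; i < n; i += 8)
        {
          vector signed short vb = vec_ld (0, &b[i]);  /* 8 shorts                     */
          vector signed int v0 = vec_unpackh (vb);     /* elements 0..3, sign-extended */
          vector signed int v1 = vec_unpackl (vb);     /* elements 4..7, sign-extended */
          vec_st (v0, 0,  &a[i]);
          vec_st (v1, 16, &a[i]);                      /* next 16 bytes = 4 ints       */
        }
    }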