IBM Toronto Lab

Auto-SIMDization Challenges
Amy Wang, Peng Zhao, Peng Wu, Alexandre Eichenberger
IBM Toronto Laboratory / IBM T.J. Watson Research Center
October 17, 2005
© 2005 IBM Corporation

Objective

What are the new challenges in SIMD code generation that are specific to VMX? Due to lack of time, this talk covers:
- Scalar prologue/epilogue code generation (80% of the talk)
- Loop distribution (10% of the talk)
- Mixed-mode SIMDization and the future tuning plan (10% of the talk)

Background: Hardware-Imposed Misalignment Problem

More details in the Cell tutorial on Tuesday afternoon.

Single Instruction Multiple Data (SIMD) Computation

Process multiple "b[i] + c[i]" elements per operation: four consecutive 4-byte elements of b are loaded into one 16-byte register (R1), the matching four elements of c into another (R2), and a single add produces all four sums b0+c0 .. b3+c3 in a third register (R3). All vector loads and stores operate on 16-byte boundaries.

[Figure: registers R1 = (b0..b3), R2 = (c0..c3), R3 = (b0+c0 .. b3+c3), with the b and c arrays shown in memory partitioned at 16-byte boundaries.]

Code Generation for Loops (Multiple Statements)

    for (i=0; i<n; i++) {
        a[i]   = ...;
        b[i+1] = ...;
        c[i+3] = ...;
    }

The three stores have different alignments, so the compiler applies implicit loop skewing: in the steady state, one vector iteration computes a[i+4], b[i+4], and c[i+4]. The generated code therefore consists of a loop prologue, a loop steady state, and a loop epilogue, all simdized.

[Figure: arrays a[i], b[i+1], c[i+3] (elements 0..103) laid out against 16-byte boundaries, marking which elements fall into the simdized prologue, steady state, and epilogue.]

Code Generation for Partial Store - Vector Prologue/Epilogue

    for (i=0; i<100; i++)
        a[i+3] = b[i+1] + c[i+2];

The first store touches only the last slot of the 16-byte quad word a[0..3], so the vector prologue must load a[0..3], shuffle the new value b1+c2 into the a3 slot, and store the merged quad word back; from a[4..7] onward the stores write full quad words.

[Figure: the quad word a0 a1 a2 a3 before and after the load/shuffle/store sequence, against 16-byte boundaries.]
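As a plain-C sketch, the partial first store amounts to a load-shuffle-store of the whole quad word (the function name and explicit 4-lane layout here are illustrative; real code would use a 16-byte VMX load, a shuffle/select, and a 16-byte store):

```c
#include <string.h>

/* Scalar model of the vector prologue's partial store for
 * a[i+3] = b[i+1] + c[i+2]: the first result lands in lane 3 of the
 * quad word a[0..3], so lanes 0..2 must be preserved.  The whole
 * 16-byte quad is read, one lane is replaced, and the whole quad is
 * written back. */
void partial_first_store(float a[4], float new_a3) {
    float quad[4];
    memcpy(quad, a, sizeof quad);   /* load the full quad word        */
    quad[3] = new_a3;               /* shuffle new value into lane 3  */
    memcpy(a, quad, sizeof quad);   /* store the full quad word back  */
}
```

Because the store writes lanes 0..2 back as well, an update to a[0] made by another thread between the load and the store is silently lost, which is the multi-threading hazard examined next.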
This load-shuffle-store sequence can be complicated by multi-threading and page-faulting issues.

Multi-Threading Issue

    for (i=0; i<100; i++)
        a[i+3] = b[i+1] + c[i+2];

Thread 1 performs the load of a0-a3 and shuffles b1+c2 into the a3 slot. If thread 2 updates a0 while thread 1 is still shuffling, thread 1's full quad-word store then writes back the stale a0. WRONG!

[Figure: the quad word a0..a3 during the load/shuffle/store sequence, showing thread 2's update of a0 being lost.]

Solution - Scalar Prologue/Epilogue After Late-SIMDization

The source loop:

    void ptest() {
        int i;
        for (i = 0; i < UB; i++)
            pout0[i+K1] = pin0[i+K1] + pin1[i+K1];
    }

Scalar prologue (TPO listing):

    20 | void ptest() { int K1;
    22 |   if (!1) goto lab_26;
           @ubCondTest0 = (K1 & 3) * -4 + 16;
           @CIV0 = 0;
           do { /* id=2 guarded */ /* ~27 */ /* region = 0 */ /* bump-normalized */
             if ((unsigned) (@CIV0 * 4) >= @ubCondTest0) goto lab_30;
             pout0[]0[K1 + @CIV0] = pin0[]0[K1 + @CIV0] + pin1[]0[K1 + @CIV0];
           lab_30: /* DIR LATCH */
             @CIV0 = @CIV0 + 1;
           } while (@CIV0 < 4); /* ~27 */
           @CIV0 = 0;
         lab_26:

SIMD body:

           if (!1) goto lab_25;
           @CIV0 = 0;
           do { /* id=1 guarded */ /* ~3 */ /* region = 8 */
    23 |     @V.pout0[]02[K1 + (@CIV0 + 4)] = @V.pin0[]01[K1 + (@CIV0 + 4)] + @V.pin1[]00[K1 + (@CIV0 + 4)];
    22 |     /* DIR LATCH */
             @CIV0 = @CIV0 + 4;
           } while (@CIV0 < 24); /* ~3 */
           @mainLoopFinalCiv0 = (unsigned) @CIV0;
         lab_25:

Scalar epilogue:

           if (!1) goto lab_28;
           @ubCondTest1 = (unsigned) ((K1 & 3) * -4 + 16);
           @CIV0 = 0;
           do { /* id=3 guarded */ /* ~29 */ /* region = 0 */ /* bump-normalized */
             if ((unsigned) (@CIV0 * 4) < @ubCondTest1) goto lab_31;
             pout0[]0[K1 + (24 + @CIV0)] = pin0[]0[K1 + (24 + @CIV0)] + pin1[]0[K1 + (24 + @CIV0)];
           lab_31: /* DIR LATCH */
             @CIV0 = @CIV0 + 1;
           } while (@CIV0 < 8); /* ~29 */
           @CIV0 = 32;
           ...

Scalar Prologue/Epilogue Problems

- One loop becomes three loops: scalar prologue, SIMD body, scalar epilogue.
- The scalar P/E loops contain one "if" statement per peeled statement.
- If even one statement is misaligned, every statement must be peeled.
- Scalar P/E does not benefit from SIMD computation, whereas vector P/E does.

The Million-Buck Questions

- When scalar p/e must be generated, what is the threshold for the loop upper bound?
- What is the performance difference between vector p/e and scalar p/e?

Experiments

We wrote a gentest script with the following parameters:

    ./gentest -s snum -l lnum -[c/r] ratio -n ub

    -s : number of store statements in the loop
    -l : number of loads per statement
    -[c/r] : compile-time or runtime misalignment, where 0 <= ratio <= 1 specifies the fraction of statements that are aligned (i.e., known at compile time to be aligned on a quad-word boundary)
    -n : upper bound of the loop (a compile-time constant)

Since we are only interested in the overhead introduced by p/e, load references are relatively aligned with store references (no shifts inside the body). Each statement performs an addition on float (4-byte) data. Each generated testcase is compiled at -O3 -qhot -qenablevmx -qarch=ppc970 and run on an AIX PPC970 machine (c2blade24). Each testcase is run 3 times with the average timing recorded, and 10 variants of the same parameters are generated.

Results: Compile-Time Misalignment Study for Scalar P/E

[Chart: time in seconds (0 to 0.5) vs. ratio of aligned statements in the loop (0, 0.25, 0.5, 0.75), comparing "disable SIMD, ub 12" against "SIMD with scalar P/E, ub 12".]

With the lowest functional upper bound of 12, and under any degree of compile-time misalignment, it is always good to simdize. Tobey is able to fully unroll the scalar p/e loops and fold away all the if conditions. (Good job!)
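Stripped of its guards and labels, the three-loop structure being measured can be modeled in plain C roughly as follows (a sketch, not the actual TPO output: array parameters are added to make it self-contained, and the peel count models @ubCondTest0 = 16 - 4*(K1 & 3) from the listing):

```c
/* Readable model of scalar prologue / SIMD body / scalar epilogue for
 * pout0[i+K1] = pin0[i+K1] + pin1[i+K1] over 4-byte floats.  The
 * prologue peels 4 - (K1 & 3) elements, reaching the next 16-byte
 * boundary; the body then models one 4-wide VMX add per step. */
void ptest_model(float *pout0, const float *pin0, const float *pin1,
                 int K1, int UB) {
    int peel = 4 - (K1 & 3);          /* elements until 16B alignment */
    if (peel > UB) peel = UB;
    int i = 0;
    for (; i < peel; i++)             /* scalar prologue */
        pout0[K1 + i] = pin0[K1 + i] + pin1[K1 + i];
    for (; i + 4 <= UB; i += 4)       /* SIMD body: 4 lanes per step */
        for (int j = 0; j < 4; j++)
            pout0[K1 + i + j] = pin0[K1 + i + j] + pin1[K1 + i + j];
    for (; i < UB; i++)               /* scalar epilogue */
        pout0[K1 + i] = pin0[K1 + i] + pin1[K1 + i];
}
```

When K1 is a compile-time constant, peel is a constant and the fully unrolled prologue/epilogue guards fold away; with a runtime K1 they cannot, which is consistent with the higher upper-bound threshold observed for runtime misalignment.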
Results: Runtime Misalignment Study for Scalar P/E

[Chart: time in seconds (0 to 0.8) vs. ratio of aligned statements in the loop (0, 0.25, 0.5, 0.75), comparing "disable SIMD" and "SIMD with scalar P/E" at ub 12 and ub 16.]

When the aligned ratio is below 0.25 at ub 12 (i.e., the misaligned ratio is greater than 0.75), the scalar p/e overhead is so large that it is not good to simdize. However, if we raise the ub to 16, it is always good to simdize regardless of the degree of misalignment! Tobey is still able to fully unroll the scalar p/e loops, but cannot fold away "if"s with runtime conditions.

Answer to the First Question

When scalar p/e must be generated, the loop upper-bound threshold is 12 for compile-time misalignment and 16 for runtime misalignment.

Results: Vector P/E vs. Scalar P/E (Compile-Time Misalignment)

[Chart: time in seconds (0 to 0.35) vs. ratio of aligned statements in the loop, comparing "vector P/E, ub 12" against "scalar P/E, ub 12".]

In the presence of only compile-time misalignment, vector p/e is always better than scalar p/e. Possible improvement: since every statement is peeled, statements for which a full quad word has been peeled could still be executed with vector instructions.

Results: Vector P/E vs. Scalar P/E (Runtime Misalignment)

[Chart: time in seconds (0 to 0.9) vs. ratio of aligned statements in the loop, comparing "vector P/E, ub 12" against "scalar P/E, ub 12".]

With a high runtime misalignment ratio, vector p/e suffers tremendously because it must generate select masks from a runtime variable. It is better to use scalar p/e when the misalignment ratio is greater than 0.25!

Answer to the Second Question

Vector p/e is always better when there is only compile-time misalignment.
When the runtime misalignment ratio is greater than 0.25, scalar p/e proves to be better.

Motivating Example

Not all computations are simdizable:
- dependence cycles
- non-stride-one memory accesses
- unsupported operations and data types

A simplified example from GSM.encoder, a speech-compression application:

    for (i = 0; i < N; i++) {
      1:  d[i+1] = d[i] + (rp[i] * u[i]);   /* linear recurrence: not simdizable */
      2:  t[i]   = u[i] + (rp[i] * d[i]);   /* fully simdizable */
    }

Current Approach: Loop Distribution

Distribute the simdizable and the non- or partially-simdizable statements into separate loops:

    for (i = 0; i < N; i++) {
      1:  d[i+1] = d[i] + rp[i] * u[i];
    }
    for (i = 0; i < N; i++) {
      2:  t[i] = u[i] + rp[i] * d[i];
    }

Then simdize the loops that contain only simdizable statements:

    for (i = 0; i < N; i++) {
      1:  d[i+1] = d[i] + rp[i] * u[i];
    }
    for (i = 0; i < N; i += 4) {
      2:  t[i:i+3] = u[i:i+3] + rp[i:i+3] * d[i:i+3];
    }

Problems with Loop Distribution

Loop distribution increases the reuse distances of memory references: in the original loop, the distance between statement 1's accesses and statement 2's reuse of them is O(1); after distribution it becomes O(N). In addition, only one unit is fully utilized in each loop: the SIMD unit is idle in the first loop and the scalar unit is idle in the second.

Preliminary Results

The prototyped mixed-mode SIMDization has shown a 2x speedup on the SPEC95 FP benchmark swim; with loop distribution, the speedup is only 1.5x.

Conclusion and Future Tuning Plan

- Further improve scalar p/e code generation. We are currently identifying cases where statement re-execution is allowed.
  Allowing re-execution would let us fold away more if conditions.
- More experiments to determine the upper-bound threshold for different data types.
- Enable mixed-mode SIMDization.
- Integrate the SIMDization framework better into TPO, e.g. with predictive commoning.

Acknowledgement

This work would not have been possible without technical contributions from the following individuals:
- Roch Archambault, IBM Toronto
- Raul Silvera, IBM Toronto
- Yaoqing Gao, IBM Toronto
- Gang Ren, IBM Watson