AutoSIMD compiler optimization for z/OS XL C/C++ programs
Transform code automatically for efficient data processing

Anna Thomas ([email protected])
Staff Software Developer
IBM

02 October 2015

Learn about the AutoSIMD optimization introduced in the z/OS V2R2 XL C/C++ compiler. AutoSIMD transforms user source code into SIMD code, which runs on the new SIMD unit in z13 hardware for faster data processing. This optimization is useful for C and C++ analytics workloads, which, when compiled with the V2R2 compiler, can generate SIMD code for efficient execution.

Recent technological advancements have focused on enabling higher performance for scientific and analytical workloads, which are inherently computation intensive. Single Instruction Multiple Data (SIMD) processing is one such enhancement for increased parallelism, and it requires both hardware and compiler support. With the addition of the SIMD processing unit in the new z13 processor, you have the hardware support required for processing SIMD code. The compiler support is provided through Vector Programming (built-in functions), which was added in the z/OS XL C/C++ V2R1M1 compiler. The AutoSIMD compiler optimization automatically transforms scalar code into SIMD code. It was first implemented in the z/OS V2R2 XL C/C++ compiler.

This article focuses on three advantages of the AutoSIMD compiler feature:

• The effort involved in generating efficient code for the SIMD hardware is transferred from the application developer to the compiler. You do not need to rewrite your applications using Vector Programming to exploit the SIMD instruction set.
• The feature is equipped with knowledge of the z/Architecture and takes advantage of the SIMD instructions where they are best suited, generating efficient vector code in conjunction with scalar code.
• The AutoSIMD optimization is strategically placed among other compiler optimizations to maximize synergy between transformations and deliver the best possible transformation sequence for the application code.

AutoSIMD

Simdization transforms code from a scalar form (a single operation taking a single set of operands) to a vector form (a single operation taking multiple sets of operands). In other words, simdization follows the Single Instruction Multiple Data (SIMD) model. The AutoSIMD optimization performs automatic simdization of loops and code blocks. It runs after other optimizations that can potentially expose new opportunities for simdization.

The AutoSIMD optimization contains three major phases:

• The safety analysis phase, which identifies whether the transformation is safe to perform
• The profitability analysis phase, which evaluates whether the SIMD code being generated is better than the original scalar code
• The SIMD code generation phase, which generates the vector code in place of the scalar code

The first phase is the safety analysis phase. It studies code properties such as data types and loop properties, and decides whether the loops or code blocks are viable candidates for simdization.

The second phase, the profitability analysis phase, studies the cost versus the benefit of generating equivalent vector code for a given piece of scalar code. It takes into account factors such as the z/Architecture and the operations needed to set up the vector code. Not all SIMD transformations are beneficial compared to the equivalent scalar code; you'll see one such scenario in Example 4.

The third phase is the code generation phase. It uses information from the profitability analysis and generates the SIMD code sequence in place of the scalar code sequence. The generated code produces exactly the same result as the scalar code.
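To make the safety analysis concrete, the following sketch contrasts a loop that is a viable simdization candidate with one that is not. This is illustrative code only, not taken from the examples later in this article; the function names are hypothetical. Example 2 shows how the compiler deals with a real loop that carries a dependence.

/* Illustrative sketch: two loops as the safety analysis might classify them. */

/* Viable candidate: each iteration reads and writes independent elements,
   so several iterations can be packed into one vector operation.           */
void scale(unsigned int *a, const unsigned int *b, unsigned int x, unsigned int n) {
   unsigned int i;
   for (i = 0; i < n; i++) {
      a[i] = a[i] + x * b[i];
   }
}

/* Not a viable candidate: a[i] depends on a[i-1], which was computed in the
   previous iteration (a loop-carried dependence), so the iterations cannot
   simply be executed in parallel as vector lanes.                           */
void prefix_sum(unsigned int *a, unsigned int n) {
   unsigned int i;
   for (i = 1; i < n; i++) {
      a[i] = a[i] + a[i-1];
   }
}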
The AutoSIMD optimization is turned on by default when applications are compiled with the z/OS V2R2 XL C/C++ compiler at ARCH(11), with the HOT option in effect, FLOAT(AFP(NOVOLATILE)) set, and TARGET(V2R2). The optimization can be controlled with the AUTOSIMD | NOAUTOSIMD suboption of the VECTOR option. For more details on the AUTOSIMD suboption, refer to the z/OS V2R2 XL C/C++ Compiler User Guide.

Code examples

This article provides a few examples, each with source code and a snippet of the final pseudo-assembly generated when the AutoSIMD optimization is in effect. The pseudo-assembly contains the vector instructions relevant to the specific source code; use the LIST option to see the complete listing. The vector programming equivalent of the source code is also included for all examples. Vector programming built-ins are explained in detail in the XL C/C++ V2R1M1 Compiler Programming Guide.

Note: All of the loop examples in this article use a loop whose upper bound is not known at compile time. The loop is unrolled to pack multiple scalar statements into a single vector statement, which results in a 2x or 4x reduction in the number of iterations.

Example 1. AutoSIMD on a simple loop

In this loop example, each iteration scales an element of array 'b' by the factor 'x', adds the corresponding element of array 'a', and stores the result back into array 'a'.

Listing 1. Source code

unsigned int i, n, x;
unsigned int *a, *b;

for ( i = 0 ; i < n ; i++ ) {
   a[i] = a[i] + x*b[i];
}

Listing 2. Vector programming equivalent for Listing 1

unsigned int i, n, x;
unsigned int *a, *b;
vector unsigned int temp0, temp1, temp2, storetemp;

temp0 = vec_splats(x);
for ( i = 0 ; i < n ; i+=4 ) {
   temp1 = vec_xlw4(0, ((char *)b + 4*i));
   temp2 = vec_xlw4(0, ((char *)a + 4*i));
   storetemp = temp1 * temp0 + temp2;
   vec_xstw4(storetemp, 0, ((char *)a + 4*i));
}

Pseudo-assembly

Table 1 shows the pseudo-assembly of the source code when compiled with AutoSIMD versus without AutoSIMD.

Table 1. Pseudo-assembly of Listing 1 with versus without AutoSIMD

Pseudo-assembly with AutoSIMD:
...
VLVG  v0,r3,0,2
VREP  v0,v0,0,2
Loop_Label:
VL    v2,@V.(b{unsigned int})0(r4,r2,0)
VL    v4,@V.(a{unsigned int})1(r4,r1,0)
VML   v2,v2,v0,b'0010'
VA    v2,v2,v4,b'0010'
VST   v2,@V.(a{unsigned int})1(r4,r1,0)
LA    r4,#AMNESIA(,r4,16)              //16 bytes added to i
...

Pseudo-assembly without AutoSIMD:
...
LR    r0,r3
MS    r0,(b{unsigned int})(r4,r2,0)
AL    r0,(a{unsigned int})(r4,r1,0)
ST    r0,(a{unsigned int})(r4,r1,0)
LA    r4,#AMNESIA(,r4,4)               //4 bytes added to i
...

Although the pseudo-assembly with AutoSIMD contains more instructions statically, the amount of data processed in each iteration is 4x that of the scalar version.
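The vector programming equivalent in Listing 2 assumes, for simplicity, that 'n' is a multiple of 4. A common way to handle an arbitrary 'n' is to follow the vector loop with a scalar remainder loop. The sketch below shows that approach using the same built-ins as Listing 2; it is illustrative only (the helper name scale_add is hypothetical) and is not compiler-generated code.

/* Illustrative sketch: vectorized main loop plus a scalar remainder loop
   for the leftover iterations when n is not a multiple of 4.             */
void scale_add(unsigned int *a, unsigned int *b, unsigned int x, unsigned int n) {
   unsigned int i;
   vector unsigned int temp0, temp1, temp2, storetemp;

   temp0 = vec_splats(x);
   /* main loop: four elements per iteration */
   for (i = 0; i + 4 <= n; i += 4) {
      temp1 = vec_xlw4(0, ((char *)b + 4*i));
      temp2 = vec_xlw4(0, ((char *)a + 4*i));
      storetemp = temp1 * temp0 + temp2;
      vec_xstw4(storetemp, 0, ((char *)a + 4*i));
   }
   /* remainder loop: at most three scalar iterations */
   for ( ; i < n; i++) {
      a[i] = a[i] + x * b[i];
   }
}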
Example 2. AutoSIMD on loops with dependencies

The example in Listing 3 is a loop with dependencies. It shows how the placement of the AutoSIMD optimization facilitates the simdization of the loop. The loop updates arrays 'a' and 'd' only when the computed value is greater than the current array value.

Listing 3. Source code

signed int *a, *b_opt, *c_opt, *b, *c, *d, *mc_opt, *ma, *mc;
unsigned int maxvalue, x, y, k, n;

for (k = 1; k < n; k++) {
   a[k] = b_opt[k-1] + c_opt[k-1];
   if ((maxvalue = b[k-1] + c[k-1]) > a[k])
      a[k] = maxvalue;
   if (a[k] < -x)
      a[k] = -x;

   d[k] = d[k-1] + mc_opt[k-1];        //loop carried dependence between d[k] and d[k-1]
   if ((maxvalue = ma[k-1] + mc[k-1]) > d[k])
      d[k] = maxvalue;
   if (d[k] < -y)
      d[k] = -y;
}

Array 'd' has a loop-carried dependence on itself. During the safety analysis phase, AutoSIMD deems the loop an unsafe candidate for simdization unless this dependence is removed. However, the loop distribution optimization, which runs before the AutoSIMD optimization, distributes the statements that calculate array 'd' into a separate loop. This allows AutoSIMD to simdize the loop that calculates array 'a'.

The source code in Listing 3 can be separated into two loops so that the first loop, which has no loop-carried dependencies, can be rewritten using the vector built-ins. Note that the single source loop is split into two loops here for clarity.

Listing 4. Vector programming equivalent for Listing 3

signed int *a, *b_opt, *c_opt, *b, *c, *d, *mc_opt, *ma, *mc;
unsigned int maxvalue, x, y, k, n;
vector signed int splatX, temp0, temp1, temp2, temp3, maxTemp1, maxTemp2;

splatX = vec_splats((signed int)x);

//loop 1
for (k = 0; k < n; k+=4) {
   temp0 = vec_xlw4(0, ((char *)c_opt + k * 4));
   temp1 = vec_xlw4(0, ((char *)b_opt + k * 4));
   temp2 = vec_xlw4(0, ((char *)c + k * 4));
   temp3 = vec_xlw4(0, ((char *)b + k * 4));
   maxTemp1 = vec_max(temp3 + temp2, temp1 + temp0);
   maxTemp2 = vec_max(maxTemp1, -splatX);
   vec_xstw4(maxTemp2, 0, ((char *)a + (k + 1) * 4));
}

//loop 2
for (k = 1; k < n; k++) {
   d[k] = d[k-1] + mc_opt[k-1];
   if ((maxvalue = ma[k-1] + mc[k-1]) > d[k])
      d[k] = maxvalue;
   if (d[k] < -y)
      d[k] = -y;
}
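The reason the 'if' statements that update array 'a' can be simdized at all is that each one is really a maximum operation, which is why they map to vec_max in Listing 4 and to the VMX instruction in the pseudo-assembly below. The following scalar sketch shows that branch-free rewrite; it is illustrative only (the helper names update_a and max_int are hypothetical, and 'x' is treated as a signed value for simplicity).

/* Illustrative sketch: branch-free scalar form of the statements that update a[k]. */
static signed int max_int(signed int p, signed int q) {
   return p > q ? p : q;
}

void update_a(signed int *a, const signed int *b, const signed int *c,
              const signed int *b_opt, const signed int *c_opt,
              signed int x, unsigned int n) {
   unsigned int k;
   for (k = 1; k < n; k++) {
      /* equivalent to:
         a[k] = b_opt[k-1] + c_opt[k-1];
         if ((maxvalue = b[k-1] + c[k-1]) > a[k]) a[k] = maxvalue;
         if (a[k] < -x) a[k] = -x;                                 */
      signed int v = max_int(b_opt[k-1] + c_opt[k-1], b[k-1] + c[k-1]);
      a[k] = max_int(v, -x);
   }
}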
Pseudo-assembly

Loop distribution splits the source loop into two loops, and AutoSIMD simdizes loop 1, which updates array 'a'. For brevity, only the statements of loop 1 are shown in the pseudo-assembly in Table 2. The complete listing produced with the LIST option shows the unrolled loop 1 with the vector instructions scheduled to avoid dependency stalls. Note that other compiler optimizations optimize loop 2, which updates array 'd'.

Table 2. Pseudo-assembly for Listing 3 with versus without AutoSIMD

Pseudo-assembly with AutoSIMD:
...
VLVG  v0,r10,0,2
VREP  v0,v0,0,2                        //vec_splats(x)
VLC   v0,v0,2                          // -x
Loop1_Label:
VL    v2,@V.(c_opt{int})0(r15,r2,0)
VL    v4,@V.(b_opt{int})1(r15,r3,0)
VL    v1,@V.(c{int})2(r15,r5,0)
VL    v3,@V.(b{int})3(r15,r6,0)
VA    v2,v2,v4,b'0010'                 //b_opt[k-1] + c_opt[k-1]
VA    v4,v1,v3,b'0010'                 //b[k-1] + c[k-1]
VMX   v2,v2,v4,b'0010'                 //max comparison
VMX   v2,v0,v2,b'0010'                 //max comparison
VST   v2,@V.(a{int})4(r15,r1,4)
LA    r15,#AMNESIA(,r15,16)
...                                    (the instructions updating array 'd' are in the second loop)

Pseudo-assembly without AutoSIMD:
...
Loop_Label:
L     r7,#SPILL4(,r13,296)
SLLK  r8,r0,1
L     r11,#SPILL5(,r13,300)
ST    r8,#SPILL10(,r13,320)
L     r8,#SPILL2(,r13,288)
L     r10,(c_opt{int})3.(r14,r2,4)
L     r6,(b{int})(r14,r7,0)
L     r7,(b_opt{int})(r14,r3,0)
L     r9,(c{int})5.(r14,r11,4)
AL    r6,(c{int})(r14,r11,0)           //b[k-1] + c[k-1]
AL    r7,(c_opt{int})(r14,r2,0)        //b_opt[] + c_opt[]
L     r11,#SPILL1(,r13,284)
CR    r6,r7
LOCRL r6,r7                            //performing max comparison
L     r7,#SPILL5(,r13,300)
CR    r8,r6
LOCRL r8,r6                            //performing max comparison
L     r6,#SPILL6(,r13,304)
AL    r4,(c{int})(r14,r7,0)
ST    r8,(a{int})0.(r14,r1,4)
...                                    (instructions for the statements updating 'd')
LA    r14,#AMNESIA(,r14,4)

Example 3. AutoSIMD on code blocks outside loops

This example shows the modification of an array of doubles. The vector facility for z/Architecture provides support for Binary Floating Point (BFP) operations.

Listing 5. Source code

double a[2], b[2], c[2], d[2];   //global variables

void update() {
   c[0] = c[0] * a[0] + b[0];
   c[1] = c[1] * a[1] + b[1];
   /* ... some other computations ... */
}

Listing 6. Vector programming equivalent for Listing 5

double a[2], b[2], c[2], d[2];   //global variables

void update() {
   vector double temp0, temp1, temp2, storetemp;
   temp0 = vec_xld2(0, (char *)b);
   temp1 = vec_xld2(0, (char *)a);
   temp2 = vec_xld2(0, (char *)c);
   storetemp = temp2 * temp1 + temp0;
   vec_xstd2(storetemp, 0, (char *)c);
   /* ... some other computations ... */
}

Pseudo-assembly

AutoSIMD generates vector instructions for the source code block and uses the vector fused multiply-add instruction VFMA. In Table 3, the corresponding scalar code is generated twice for the two doubles, where MADB is the scalar counterpart of VFMA.

Table 3. Pseudo-assembly for Listing 5 with versus without AutoSIMD

Pseudo-assembly of the function with AutoSIMD:
...
VL    v0,a(r15,r14,0)
VL    v2,b(r2,r14,0)
VL    v4,c(,r1,0)
VFMA  v0,v4,v0,v2,b'0011',b'0000'      //c[] * a[] + b[]
VST   v0,c(,r1,0)
...

Pseudo-assembly of the function without AutoSIMD:
...
LD    f0,a[]0.off0(r15,r14,0)
LD    f2,b[]0.off0(r2,r14,0)
LD    f4,a[]0.off8(r15,r14,8)
MADB  f2,f0,c[]0.off0(r3,r14,0)        //c[0]*a[0]+b[0]
LD    f0,b[]0.off8(r2,r14,8)
STD   f2,c[]0.off0(r3,r14,0)
MADB  f0,f4,c[]0.off8(r3,r14,8)        //c[1]*a[1]+b[1]
STD   f0,c[]0.off8(r3,r14,8)
...
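The code block in Listing 5 works on exactly two doubles, which fill one vector register. The same multiply-add pattern extends naturally to longer arrays, as the sketch below shows. It is a hypothetical helper (the name update_arrays is not from the article), it uses only the built-ins already shown (vec_xld2 and vec_xstd2), and it assumes the element count is even. The compiler can implement the multiply-add with the VFMA instruction, as in Table 3.

/* Illustrative sketch: the multiply-add pattern of Listing 5 applied to
   arrays of doubles, two elements (one vector register) per iteration.
   Assumes n is even.                                                    */
void update_arrays(double *c, double *a, double *b, unsigned int n) {
   unsigned int i;
   vector double va, vb, vc;

   for (i = 0; i < n; i += 2) {
      vb = vec_xld2(0, ((char *)b + 8*i));
      va = vec_xld2(0, ((char *)a + 8*i));
      vc = vec_xld2(0, ((char *)c + 8*i));
      vc = vc * va + vb;                      /* c[i] = c[i]*a[i] + b[i] for two elements */
      vec_xstd2(vc, 0, ((char *)c + 8*i));
   }
}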
Example 4. Non-profitable SIMD situations are identified and never simdized

Listing 7 is an example where simdization is not beneficial because a better-performing scalar hardware instruction is available. The AutoSIMD profitability analysis phase identifies this situation and avoids simdizing the loop.

Listing 7. Source code

unsigned long long *a, *b;
unsigned int i, n;

for ( i = 0 ; i < n ; i++ ) {
   a[i] = b[i];
}

Listing 8. Pseudo-assembly when AutoSIMD is in effect for Listing 7

@1L5     DS    0H
         MVC   (a{unsigned long long})(256,r6,0),(b{unsigned long long})(r7,0)
         LA    r6,(a{unsigned long long})(,r6,256)
         LA    r7,(b{unsigned long long})(,r7,256)   //adding 256 bytes instead of 16 bytes
         BRCT  r0,@1L5

When AutoSIMD is in effect, this code is left in its scalar form, which uses the MVC instruction to move 256 bytes from source array 'b' to destination array 'a'. If the code is simdized using vector programming, the generated code uses 16-byte vector loads and stores, compared to 256 bytes moved per iteration with MVC. When the number of iterations is large, this becomes a long sequence of dependent vector loads and stores, which can cause a performance degradation. Hence, AutoSIMD avoids generating the code sequence shown in Listing 9.

Listing 9. Vector programming equivalent for Listing 7

unsigned long long *a, *b;
unsigned int i, n;
vector double temp0;

for (i = 0; i < n; i+=2) {
   temp0 = vec_xld2(0, ((char *)b + 8*i));
   vec_xstd2(temp0, 0, ((char *)a + 8*i));   // store a vector of two 8-byte elements into 'a'
}

Listing 10. Pseudo-assembly of the vector programming equivalent in Listing 9

@1L40    DS    0H
         VL    v0,@V.(b{unsigned long long})0(r6,r2,0)
         VST   v0,@V.(a{unsigned long long})1(r6,r1,0)
         LA    r6,#AMNESIA(,r6,16)   //adding 16 bytes for the copy at the next iteration
         BRCT  r8,@1L40

Conclusion

With the advent of the SIMD unit in the new z13 processor, increased data parallelism is available to existing analytics applications. This article introduced the AutoSIMD compiler optimization in the z/OS V2R2 XL C/C++ compiler, which automatically leverages SIMD opportunities in existing applications. The optimization safely transforms scalar code to vector code after considering the profitability of the transformation. In combination with other compiler optimizations, AutoSIMD aims to generate efficient object code for improved execution time.

Resources

Learn

• Read the z/OS V2R2 XL C/C++ Compiler User Guide.
• Read the XL C/C++ V2R1M1 Compiler Programming Guide.
• Improve your skills. Check the Rational training and certification catalog, which includes many types of courses on a wide range of topics.

Get products and technologies

• Evaluate IBM software in the way that suits you best: download it for a trial, try it online, or use it in a cloud environment.

Discuss

• Get connected with your peers and keep up on the latest information in the community.

About the author

Anna Thomas

Anna Thomas received her MASc degree from the University of British Columbia in 2013. Anna's research focused on developing software techniques, based on static analysis, for hardware error resilience. After graduation, Anna joined IBM. She works in the Canada Lab on compiler optimizations for the XLC compiler on z and Power systems. Her interests include program analysis, language representation, and computer architecture.

© Copyright IBM Corporation 2015 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)