AutoSIMD compiler optimization for z/OS XL C/C++ programs
Transform code automatically for efficient data processing

Anna Thomas ([email protected])
Staff Software Developer
IBM

02 October 2015

Learn about the AutoSIMD optimization introduced in the z/OS V2R2 XL C/C++ compiler. AutoSIMD transforms user source code into SIMD code, which runs on the new SIMD unit in z13 hardware for faster data processing. This optimization is useful for C and C++ analytics workloads, which, when compiled with the V2R2 compiler, can generate SIMD code for efficient execution.

Recent technological advancements have focused on enabling higher performance for scientific and analytical workloads, which are inherently computation intensive. Single Instruction Multiple Data (SIMD) processing is one such enhancement for increased parallelism, and it requires both hardware and compiler support. With the addition of the SIMD processing unit in the new z13 processor, you have the hardware support required for processing SIMD code. The compiler support is provided through Vector Programming (built-in functions), which was added in the z/OS XL C/C++ V2R1M1 compiler. The AutoSIMD compiler optimization automatically transforms scalar code into SIMD code. It was first implemented in the z/OS V2R2 XL C/C++ compiler.

This article focuses on three advantages of the AutoSIMD compiler feature:

• The effort involved in generating efficient code for the SIMD hardware is transferred from the application developer to the compiler. You do not need to rewrite your applications using Vector Programming to exploit the SIMD instruction set.
• The feature is equipped with knowledge of the z/Architecture and takes advantage of the SIMD instructions where they are best suited, generating efficient vector code in conjunction with scalar code.
• The AutoSIMD optimization is strategically placed among other compiler optimizations to maximize synergy between transformations and deliver the best possible transformation sequence for the application code.

AutoSIMD

Simdization transforms code from a scalar form (a single operation taking a single set of operands) to a vector form (a single operation taking multiple sets of operands). In other words, simdization follows the Single Instruction Multiple Data (SIMD) model. The AutoSIMD optimization performs automatic simdization of loops and code blocks. It runs after other optimizations that can potentially expose new opportunities for simdization.

The AutoSIMD optimization contains three major phases:

• The safety analysis phase, which identifies whether the transformation is safe to perform
• The profitability analysis phase, which evaluates whether the SIMD code being generated is better than the original scalar code
• The SIMD code generation phase, which generates the vector code in place of the scalar code

The first phase is the safety analysis phase. It studies code properties such as data types and loop properties, and decides whether the loops or code blocks are viable candidates for simdization.

The second phase, the profitability analysis phase, studies the cost versus the benefit of generating equivalent vector code for a given piece of scalar code. It takes into account factors such as the z/Architecture and the operations needed to set up the vector code. Not all SIMD transformations are beneficial compared to the equivalent scalar code; you'll see one such scenario in Example 4.

The third phase is the code generation phase. It uses information from the profitability analysis and generates the SIMD code sequence in place of the scalar code sequence. The generated code produces exactly the same result as the scalar code.
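To make the safety analysis concrete, the following sketch contrasts a loop that is a viable simdization candidate with one that is not. This is illustrative code only, not taken from the examples later in this article; the function names are hypothetical. Example 2 shows how the compiler deals with a real loop that carries a dependence.

/* Illustrative sketch: two loops as the safety analysis might classify them. */

/* Viable candidate: each iteration reads and writes independent elements,
   so several iterations can be packed into one vector operation.           */
void scale(unsigned int *a, const unsigned int *b, unsigned int x, unsigned int n) {
   unsigned int i;
   for (i = 0; i < n; i++) {
      a[i] = a[i] + x * b[i];
   }
}

/* Not a viable candidate: a[i] depends on a[i-1], which was computed in the
   previous iteration (a loop-carried dependence), so the iterations cannot
   simply be executed in parallel as vector lanes.                           */
void prefix_sum(unsigned int *a, unsigned int n) {
   unsigned int i;
   for (i = 1; i < n; i++) {
      a[i] = a[i] + a[i-1];
   }
}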
The AutoSIMD optimization is turned on by default when applications are compiled with the z/OS V2R2 XL C/C++ compiler at ARCH(11), with the HOT option in effect, FLOAT(AFP(NOVOLATILE)) set, and TARGET(V2R2). The optimization can be controlled with the AUTOSIMD | NOAUTOSIMD suboption of the VECTOR option. For more details on the AUTOSIMD suboption, refer to the z/OS V2R2 XL C/C++ Compiler User Guide.

Code examples

This article provides a few examples, each with source code and a snippet of the final pseudo-assembly generated when the AutoSIMD optimization is in effect. The pseudo-assembly contains the vector instructions relevant to the specific source code; use the LIST option to see the complete listing. The vector programming equivalent of the source code is also included for all examples. Vector programming built-ins are explained in detail in the XL C/C++ V2R1M1 Compiler Programming Guide.

Note: All of the loop examples in this article use a loop whose upper bound is not known at compile time. The loop is unrolled to pack multiple scalar statements into a single vector statement, which results in a 2x or 4x reduction in the number of iterations.

Example 1. AutoSIMD on a simple loop

In this loop example, each iteration scales an element of array 'b' by the factor 'x', adds the corresponding element of array 'a', and stores the result back into array 'a'.

Listing 1. Source code

unsigned int i, n, x;
unsigned int *a, *b;

for ( i = 0 ; i < n ; i++ ) {
   a[i] = a[i] + x*b[i];
}

Listing 2. Vector programming equivalent for Listing 1

unsigned int i, n, x;
unsigned int *a, *b;
vector unsigned int temp0, temp1, temp2, storetemp;

temp0 = vec_splats(x);
for ( i = 0 ; i < n ; i+=4 ) {
   temp1 = vec_xlw4(0, ((char *)b + 4*i));
   temp2 = vec_xlw4(0, ((char *)a + 4*i));
   storetemp = temp1 * temp0 + temp2;
   vec_xstw4(storetemp, 0, ((char *)a + 4*i));
}

Pseudo-assembly

Table 1 shows the pseudo-assembly of the source code when compiled with AutoSIMD versus without AutoSIMD.

Table 1. Pseudo-assembly of Listing 1 with versus without AutoSIMD

Pseudo-assembly with AutoSIMD:
...
VLVG  v0,r3,0,2
VREP  v0,v0,0,2
Loop_Label:
VL    v2,@V.(b{unsigned int})0(r4,r2,0)
VL    v4,@V.(a{unsigned int})1(r4,r1,0)
VML   v2,v2,v0,b'0010'
VA    v2,v2,v4,b'0010'
VST   v2,@V.(a{unsigned int})1(r4,r1,0)
LA    r4,#AMNESIA(,r4,16)              //16 bytes added to i
...

Pseudo-assembly without AutoSIMD:
...
LR    r0,r3
MS    r0,(b{unsigned int})(r4,r2,0)
AL    r0,(a{unsigned int})(r4,r1,0)
ST    r0,(a{unsigned int})(r4,r1,0)
LA    r4,#AMNESIA(,r4,4)               //4 bytes added to i
...

Although the pseudo-assembly with AutoSIMD contains more instructions statically, the amount of data processed in each iteration is 4x that of the scalar version.
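The vector programming equivalent in Listing 2 assumes, for simplicity, that 'n' is a multiple of 4. A common way to handle an arbitrary 'n' is to follow the vector loop with a scalar remainder loop. The sketch below shows that approach using the same built-ins as Listing 2; it is illustrative only (the helper name scale_add is hypothetical) and is not compiler-generated code.

/* Illustrative sketch: vectorized main loop plus a scalar remainder loop
   for the leftover iterations when n is not a multiple of 4.             */
void scale_add(unsigned int *a, unsigned int *b, unsigned int x, unsigned int n) {
   unsigned int i;
   vector unsigned int temp0, temp1, temp2, storetemp;

   temp0 = vec_splats(x);
   /* main loop: four elements per iteration */
   for (i = 0; i + 4 <= n; i += 4) {
      temp1 = vec_xlw4(0, ((char *)b + 4*i));
      temp2 = vec_xlw4(0, ((char *)a + 4*i));
      storetemp = temp1 * temp0 + temp2;
      vec_xstw4(storetemp, 0, ((char *)a + 4*i));
   }
   /* remainder loop: at most three scalar iterations */
   for ( ; i < n; i++) {
      a[i] = a[i] + x * b[i];
   }
}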
Example 2. AutoSIMD on loops with dependencies

The example in Listing 3 is a loop with dependencies. It shows how the placement of the AutoSIMD optimization facilitates the simdization of the loop. The loop updates arrays 'a' and 'd' only when the computed value is greater than the current array value.

Listing 3. Source code

signed int *a, *b_opt, *c_opt, *b, *c, *d, *mc_opt, *ma, *mc;
unsigned int maxvalue, x, y, k, n;

for (k = 1; k < n; k++) {
   a[k] = b_opt[k-1] + c_opt[k-1];
   if ((maxvalue = b[k-1] + c[k-1]) > a[k])
      a[k] = maxvalue;
   if (a[k] < -x)
      a[k] = -x;

   d[k] = d[k-1] + mc_opt[k-1];        //loop carried dependence between d[k] and d[k-1]
   if ((maxvalue = ma[k-1] + mc[k-1]) > d[k])
      d[k] = maxvalue;
   if (d[k] < -y)
      d[k] = -y;
}

Array 'd' has a loop-carried dependence on itself. During the safety analysis phase, AutoSIMD deems the loop an unsafe candidate for simdization unless this dependence is removed. However, the loop distribution optimization, which runs before the AutoSIMD optimization, distributes the statements that calculate array 'd' into a separate loop. This allows AutoSIMD to simdize the loop that calculates array 'a'.

The source code in Listing 3 can be separated into two loops so that the first loop, which has no loop-carried dependencies, can be rewritten using the vector built-ins. Note that the single source loop is split into two loops here for clarity.

Listing 4. Vector programming equivalent for Listing 3

signed int *a, *b_opt, *c_opt, *b, *c, *d, *mc_opt, *ma, *mc;
unsigned int maxvalue, x, y, k, n;
vector signed int splatX, temp0, temp1, temp2, temp3, maxTemp1, maxTemp2;

splatX = vec_splats((signed int)x);

//loop 1
for (k = 0; k < n; k+=4) {
   temp0 = vec_xlw4(0, ((char *)c_opt + k * 4));
   temp1 = vec_xlw4(0, ((char *)b_opt + k * 4));
   temp2 = vec_xlw4(0, ((char *)c + k * 4));
   temp3 = vec_xlw4(0, ((char *)b + k * 4));
   maxTemp1 = vec_max(temp3 + temp2, temp1 + temp0);
   maxTemp2 = vec_max(maxTemp1, -splatX);
   vec_xstw4(maxTemp2, 0, ((char *)a + (k + 1) * 4));
}

//loop 2
for (k = 1; k < n; k++) {
   d[k] = d[k-1] + mc_opt[k-1];
   if ((maxvalue = ma[k-1] + mc[k-1]) > d[k])
      d[k] = maxvalue;
   if (d[k] < -y)
      d[k] = -y;
}
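The reason the 'if' statements that update array 'a' can be simdized at all is that each one is really a maximum operation, which is why they map to vec_max in Listing 4 and to the VMX instruction in the pseudo-assembly below. The following scalar sketch shows that branch-free rewrite; it is illustrative only (the helper names update_a and max_int are hypothetical, and 'x' is treated as a signed value for simplicity).

/* Illustrative sketch: branch-free scalar form of the statements that update a[k]. */
static signed int max_int(signed int p, signed int q) {
   return p > q ? p : q;
}

void update_a(signed int *a, const signed int *b, const signed int *c,
              const signed int *b_opt, const signed int *c_opt,
              signed int x, unsigned int n) {
   unsigned int k;
   for (k = 1; k < n; k++) {
      /* equivalent to:
         a[k] = b_opt[k-1] + c_opt[k-1];
         if ((maxvalue = b[k-1] + c[k-1]) > a[k]) a[k] = maxvalue;
         if (a[k] < -x) a[k] = -x;                                 */
      signed int v = max_int(b_opt[k-1] + c_opt[k-1], b[k-1] + c[k-1]);
      a[k] = max_int(v, -x);
   }
}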
Pseudo-assembly

Loop distribution splits the source loop into two loops, and AutoSIMD simdizes loop 1, which updates array 'a'. For brevity, only the statements of loop 1 are shown in the pseudo-assembly in Table 2. The complete listing produced with the LIST option shows the unrolled loop 1 with the vector instructions scheduled to avoid dependency stalls. Note that other compiler optimizations optimize loop 2, which updates array 'd'.

Table 2. Pseudo-assembly for Listing 3 with versus without AutoSIMD

Pseudo-assembly with AutoSIMD:
...
VLVG  v0,r10,0,2
VREP  v0,v0,0,2                        //vec_splats(x)
VLC   v0,v0,2                          // -x
Loop1_Label:
VL    v2,@V.(c_opt{int})0(r15,r2,0)
VL    v4,@V.(b_opt{int})1(r15,r3,0)
VL    v1,@V.(c{int})2(r15,r5,0)
VL    v3,@V.(b{int})3(r15,r6,0)
VA    v2,v2,v4,b'0010'                 //b_opt[k-1] + c_opt[k-1]
VA    v4,v1,v3,b'0010'                 //b[k-1] + c[k-1]
VMX   v2,v2,v4,b'0010'                 //max comparison
VMX   v2,v0,v2,b'0010'                 //max comparison
VST   v2,@V.(a{int})4(r15,r1,4)
LA    r15,#AMNESIA(,r15,16)
...                                    (the instructions updating array 'd' are in the second loop)

Pseudo-assembly without AutoSIMD:
...
Loop_Label:
L     r7,#SPILL4(,r13,296)
SLLK  r8,r0,1
L     r11,#SPILL5(,r13,300)
ST    r8,#SPILL10(,r13,320)
L     r8,#SPILL2(,r13,288)
L     r10,(c_opt{int})3.(r14,r2,4)
L     r6,(b{int})(r14,r7,0)
L     r7,(b_opt{int})(r14,r3,0)
L     r9,(c{int})5.(r14,r11,4)
AL    r6,(c{int})(r14,r11,0)           //b[k-1] + c[k-1]
AL    r7,(c_opt{int})(r14,r2,0)        //b_opt[] + c_opt[]
L     r11,#SPILL1(,r13,284)
CR    r6,r7
LOCRL r6,r7                            //performing max comparison
L     r7,#SPILL5(,r13,300)
CR    r8,r6
LOCRL r8,r6                            //performing max comparison
L     r6,#SPILL6(,r13,304)
AL    r4,(c{int})(r14,r7,0)
ST    r8,(a{int})0.(r14,r1,4)
...                                    (instructions for the statements updating 'd')
LA    r14,#AMNESIA(,r14,4)

Example 3. AutoSIMD on code blocks outside loops

This example shows the modification of an array of doubles. The vector facility for z/Architecture provides support for Binary Floating Point (BFP) operations.

Listing 5. Source code

double a[2], b[2], c[2], d[2];   //global variables

void update() {
   c[0] = c[0] * a[0] + b[0];
   c[1] = c[1] * a[1] + b[1];
   /* ... some other computations ... */
}

Listing 6. Vector programming equivalent for Listing 5

double a[2], b[2], c[2], d[2];   //global variables

void update() {
   vector double temp0, temp1, temp2, storetemp;
   temp0 = vec_xld2(0, (char *)b);
   temp1 = vec_xld2(0, (char *)a);
   temp2 = vec_xld2(0, (char *)c);
   storetemp = temp2 * temp1 + temp0;
   vec_xstd2(storetemp, 0, (char *)c);
   /* ... some other computations ... */
}

Pseudo-assembly

AutoSIMD generates vector instructions for the source code block and uses the vector fused multiply-add instruction VFMA. In Table 3, the corresponding scalar code is generated twice for the two doubles, where MADB is the scalar counterpart of VFMA.

Table 3. Pseudo-assembly for Listing 5 with versus without AutoSIMD

Pseudo-assembly of the function with AutoSIMD:
...
VL    v0,a(r15,r14,0)
VL    v2,b(r2,r14,0)
VL    v4,c(,r1,0)
VFMA  v0,v4,v0,v2,b'0011',b'0000'      //c[] * a[] + b[]
VST   v0,c(,r1,0)
...

Pseudo-assembly of the function without AutoSIMD:
...
LD    f0,a[]0.off0(r15,r14,0)
LD    f2,b[]0.off0(r2,r14,0)
LD    f4,a[]0.off8(r15,r14,8)
MADB  f2,f0,c[]0.off0(r3,r14,0)        //c[0]*a[0]+b[0]
LD    f0,b[]0.off8(r2,r14,8)
STD   f2,c[]0.off0(r3,r14,0)
MADB  f0,f4,c[]0.off8(r3,r14,8)        //c[1]*a[1]+b[1]
STD   f0,c[]0.off8(r3,r14,8)
...
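The code block in Listing 5 works on exactly two doubles, which fill one vector register. The same multiply-add pattern extends naturally to longer arrays, as the sketch below shows. It is a hypothetical helper (the name update_arrays is not from the article), it uses only the built-ins already shown (vec_xld2 and vec_xstd2), and it assumes the element count is even. The compiler can implement the multiply-add with the VFMA instruction, as in Table 3.

/* Illustrative sketch: the multiply-add pattern of Listing 5 applied to
   arrays of doubles, two elements (one vector register) per iteration.
   Assumes n is even.                                                    */
void update_arrays(double *c, double *a, double *b, unsigned int n) {
   unsigned int i;
   vector double va, vb, vc;

   for (i = 0; i < n; i += 2) {
      vb = vec_xld2(0, ((char *)b + 8*i));
      va = vec_xld2(0, ((char *)a + 8*i));
      vc = vec_xld2(0, ((char *)c + 8*i));
      vc = vc * va + vb;                      /* c[i] = c[i]*a[i] + b[i] for two elements */
      vec_xstd2(vc, 0, ((char *)c + 8*i));
   }
}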
Example 4. Non-profitable SIMD situations are identified and never simdized

Listing 7 is an example where simdization is not beneficial because a better-performing scalar hardware instruction is available. The AutoSIMD profitability analysis phase identifies this situation and avoids simdizing the loop.

Listing 7. Source code

unsigned long long *a, *b;
unsigned int i, n;

for ( i = 0 ; i < n ; i++ ) {
   a[i] = b[i];
}

Listing 8. Pseudo-assembly when AutoSIMD is in effect for Listing 7

@1L5     DS    0H
         MVC   (a{unsigned long long})(256,r6,0),(b{unsigned long long})(r7,0)
         LA    r6,(a{unsigned long long})(,r6,256)
         LA    r7,(b{unsigned long long})(,r7,256)   //adding 256 bytes instead of 16 bytes
         BRCT  r0,@1L5

When AutoSIMD is in effect, this code is left in its scalar form, which uses the MVC instruction to move 256 bytes from source array 'b' to destination array 'a'. If the code is simdized using vector programming, the generated code uses 16-byte vector loads and stores, compared to 256 bytes moved per iteration with MVC. When the number of iterations is large, this becomes a long sequence of dependent vector loads and stores, which can cause a performance degradation. Hence, AutoSIMD avoids generating the code sequence shown in Listing 9.

Listing 9. Vector programming equivalent for Listing 7

unsigned long long *a, *b;
unsigned int i, n;
vector double temp0;

for (i = 0; i < n; i+=2) {
   temp0 = vec_xld2(0, ((char *)b + 8*i));
   vec_xstd2(temp0, 0, ((char *)a + 8*i));   // store a vector of two 8-byte elements into 'a'
}

Listing 10. Pseudo-assembly of the vector programming equivalent in Listing 9

@1L40    DS    0H
         VL    v0,@V.(b{unsigned long long})0(r6,r2,0)
         VST   v0,@V.(a{unsigned long long})1(r6,r1,0)
         LA    r6,#AMNESIA(,r6,16)   //adding 16 bytes for the copy at the next iteration
         BRCT  r8,@1L40

Conclusion

With the advent of the SIMD unit in the new z13 processor, increased data parallelism is available to existing analytics applications. This article introduced the AutoSIMD compiler optimization in the z/OS V2R2 XL C/C++ compiler, which automatically leverages SIMD opportunities in existing applications. The optimization safely transforms scalar code to vector code after considering the profitability of the transformation. In combination with other compiler optimizations, AutoSIMD aims to generate efficient object code for improved execution time.

Resources

Learn

• Read the z/OS V2R2 XL C/C++ Compiler User Guide.
• Read the XL C/C++ V2R1M1 Compiler Programming Guide.
• Improve your skills. Check the Rational training and certification catalog, which includes many types of courses on a wide range of topics.

Get products and technologies

• Evaluate IBM software in the way that suits you best: download it for a trial, try it online, or use it in a cloud environment.

Discuss

• Get connected with your peers and keep up on the latest information in the community.

About the author

Anna Thomas

Anna Thomas received her MASc degree from the University of British Columbia in 2013. Anna's research focused on developing software techniques, based on static analysis, for hardware error resilience. After graduation, Anna joined IBM. She works in the Canada Lab on compiler optimizations for the XLC compiler on z and Power systems. Her interests include program analysis, language representation, and computer architecture.

© Copyright IBM Corporation 2015 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)