Use the IBM z13 SIMD unit and the IBM z/OS XL C/C++ compiler to add parallelism to your C/C++ programs
Process data efficiently with vector data types and operations

Rajan Bhakta
Technical Architect, z/OS XL C/C++ compilers
IBM
14 July 2015

The IBM z13 hardware provides a new SIMD unit. This article describes how to use the IBM z/OS XL C/C++ compiler to take advantage of the new processor and exploit the enhanced parallelism it offers. It also provides an overview of the new data types, the operations that can be performed on those data types, and the built-in functions that make vector programming easier.

Parallelism in C and C++ languages

The latest C and C++ language standards provide threading support and atomics, but even with this support, parallelism is still not exploited as deeply as it could be. For example, neither standard prescribes a way to write single instruction/multiple data (SIMD) style code. The IBM z/OS XL C/C++ V2.1.1 compiler adds support for SIMD data-level parallelism through new data types that fit into the existing C and C++ type system, existing operators that work on the new types in a natural and intuitive way, and a number of new built-in functions (BIFs) that exploit SIMD functionality at the hardware level.

Writing efficient code that applies a series of identical or similar transformations to a large amount of data is a common task. As long as the data has no dependencies during the transformations, it is usually a prime candidate for parallelism. Encapsulating the data in a form that can be manipulated efficiently, transformed effectively, and processed quickly requires a strong connection between the hardware and the software. Without hardware support, the software overhead often results in poorly performing code. Without software support, the code is usually very hard to maintain and has poor readability.
The new IBM z13 processors provide the hardware for increased parallelism, and the IBM z/OS XL C/C++ V2.1.1 compiler provides the software to exploit that hardware, allowing efficient, effective, and fast data processing in fields such as business analytics.

SIMD principles

Large data sets (the multiple data part of SIMD) that require a series of operations lend themselves well to data types that encapsulate chunks of the data and allow the operations to take effect on whole chunks. The IBM z13 SIMD unit handles 16-byte chunks of data in various element sizes. It supports instructions (the single instruction part of SIMD) that operate on 16 one-byte elements, 8 two-byte elements, 4 four-byte elements, or 2 eight-byte elements at once (again, the multiple data part of SIMD). There are also instructions that operate on a single 16-byte element, but since that does not really fit the SIMD model, this article does not discuss them.

New data types

The element sizes supported by the SIMD unit map naturally to some of the basic C and C++ data types, so you can chunk the data into groups of ordinary C and C++ types of various sizes. This chunking is represented by a new family of data types called the vector data types. The data types are the same in C and C++, which allows interoperability between the two languages.

The vector data types share a common form: they all syntactically begin with the vector or __vector keyword, followed by the element type. The vector keyword is enabled by the VECTOR compiler option. This article uses the vector keyword for simplicity, though __vector is valid wherever vector is.
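To make the 16-byte chunking concrete, here is a minimal sketch that processes an array one 16-byte vector at a time. It assumes GCC or Clang and uses their generic vector_size extension as a portable stand-in for the z/OS XL C vector types, so it compiles on any platform; on z/OS XL C with the VECTOR option you would declare vector unsigned int instead of the hypothetical v4u32 typedef. The ELEMENTS macro and the add_to_all function are also hypothetical names; the static assertions check the 16/8/4/2 element counts described above.

```c
#include <stddef.h>
#include <string.h>

/* Portable stand-ins (GCC/Clang vector_size extension) for the z/OS XL C
   vector types; on z/OS you would write, e.g., "vector unsigned int". */
typedef unsigned char      v16u8 __attribute__((vector_size(16)));
typedef unsigned short     v8u16 __attribute__((vector_size(16)));
typedef unsigned int       v4u32 __attribute__((vector_size(16)));
typedef unsigned long long v2u64 __attribute__((vector_size(16)));

/* Hypothetical helper: number of etype elements in one vector of type vtype. */
#define ELEMENTS(vtype, etype) (sizeof(vtype) / sizeof(etype))

/* One 16-byte chunk holds 16, 8, 4, or 2 elements depending on element size. */
_Static_assert(ELEMENTS(v16u8, unsigned char) == 16, "16 one-byte elements");
_Static_assert(ELEMENTS(v8u16, unsigned short) == 8, "8 two-byte elements");
_Static_assert(ELEMENTS(v4u32, unsigned int) == 4, "4 four-byte elements");
_Static_assert(ELEMENTS(v2u64, unsigned long long) == 2, "2 eight-byte elements");

/* Adds delta to every element of data[0..n-1], one 16-byte chunk at a time,
   then finishes any leftover elements with ordinary scalar code. */
void add_to_all(unsigned int *data, size_t n, unsigned int delta)
{
    const size_t step = ELEMENTS(v4u32, unsigned int);
    const v4u32 d = {delta, delta, delta, delta};  /* same value in every element */
    size_t i = 0;

    for (; i + step <= n; i += step) {
        v4u32 chunk;
        memcpy(&chunk, data + i, sizeof chunk);  /* load without assuming alignment */
        chunk += d;                              /* one operation on 4 elements */
        memcpy(data + i, &chunk, sizeof chunk);  /* store the chunk back */
    }
    for (; i < n; i++)                           /* scalar remainder */
        data[i] += delta;
}
```

The scalar remainder loop handles array lengths that are not a multiple of the element count, and the memcpy loads and stores avoid assuming that the scalar array is aligned like a vector.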
The element types supported by the vector family are the normal integral types (char through long long int) and the double type, along with boolean versions of each of the integral types, as shown in Table 1.

Table 1. Vector data types

Type                          Interpretation of content  Range of values
vector unsigned char          16 unsigned char           0..255
vector signed char            16 signed char             -128..127
vector bool char              16 unsigned char           0 (FALSE), 255 (TRUE)
vector unsigned short [int]   8 unsigned short           0..65535
vector signed short [int]     8 signed short             -32768..32767
vector bool short [int]       8 unsigned short           0 (FALSE), 65535 (TRUE)
vector unsigned int           4 unsigned int             0..2^32-1
vector signed int             4 signed int               -2^31..2^31-1
vector bool int               4 unsigned int             0 (FALSE), 2^32-1 (TRUE)
vector unsigned long long     2 unsigned long long       0..2^64-1
vector signed long long       2 signed long long         -2^63..2^63-1
vector bool long long         2 unsigned long long       0 (FALSE), 2^64-1 (TRUE)
vector double                 2 double                   IEEE-754 double-precision (64-bit) floating-point values

In the short rows, the trailing int keyword is optional (for example, vector signed short and vector signed short int are the same type).

Note that the long long types are available because the VECTOR option implies LANGLVL(LONGLONG), and the double element type is available only in IEEE mode, which is selected by specifying the FLOAT(IEEE) option. The data types in the vector family are all aligned on an 8-byte boundary.

Literals of the vector types are written similarly to C99 compound literals, with the vector type as the cast part and the element values in a brace (or parenthesized) list. The same brace or parenthesized element list can also be used to initialize variables of the vector types.

Operations on vectors

Having the vector data types is not very useful without being able to perform transformations or operations on the data.
Using many of the normal C and C++ operators as if the vector data types were basic arithmetic types gives a concise and readable way to write SIMD algorithms.

Arithmetic operations

Common arithmetic operations such as addition, subtraction, increment, decrement, and negation are supported. Like most vector operations, they act element by element, following the SIMD model. For example, the following code adds two vectors and displays the four elements of result, each with the value 55, on stdout:

#include <stdio.h>

int main(void)
{
    vector unsigned int a = {1, 2, 3, 4};
    vector unsigned int b = {54, 53, 52, 51};
    vector unsigned int result = a + b;   /* element-by-element addition */

    printf("Result: %vld\n", result);
    return result[2];                     /* element selection with [] */
}

To compile this code in USS, use the following command:

xlc -qvector -qarch=11 getResult.c -o getResult

The example above also uses the [] vector element selection operator. It allows access to individual vector elements for both reading and writing, and acts just like the array index operator. Note that because no vector double types are used, the -qfloat=ieee option does not need to be specified.

Other operations

Along with the arithmetic and logical operators, the usual set of data type operators is available, such as sizeof, __alignof__, address of (&), typeof, casts, and assignment. A new vec_step operator, which gives the number of elements in a vector, is also available; it can be used, for example, to help iterate over arrays of the underlying scalar type.

For comparison purposes, the relational operators are also supported. In conditionals, a relational operator yields a single 1 or 0 value computed from the relation between all the elements of one operand and the corresponding elements of the other, rather than one result per element.
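The difference between per-element comparison results and a single 1 or 0 can be sketched portably. This minimal sketch again assumes GCC or Clang and their vector_size extension (a stand-in, not the z/OS XL C syntax): comparing two vectors there yields a mask whose elements are all ones (TRUE) or 0 (FALSE), which mirrors the vector bool ranges in Table 1. The hypothetical helpers lt_mask, all_lt, and any_lt show one way to collapse that mask with an AND or an OR reduction. On z/OS XL C, element-by-element results and the reductions come from the compare built-ins and the ALL and ANY predicates, such as vec_any_lt.

```c
typedef double    v2f64 __attribute__((vector_size(16)));
typedef long long v2i64 __attribute__((vector_size(16)));

/* Element-by-element compare: each mask element is all ones (-1) where
   a[i] < b[i] and 0 elsewhere. The cast gives the mask a fixed type. */
static v2i64 lt_mask(v2f64 a, v2f64 b)
{
    return (v2i64)(a < b);
}

/* Hypothetical ALL-style predicate: 1 only if every element of a is less
   than the corresponding element of b (AND reduction of the mask). */
int all_lt(v2f64 a, v2f64 b)
{
    v2i64 m = lt_mask(a, b);
    return (m[0] & m[1]) != 0;
}

/* Hypothetical ANY-style predicate: 1 if at least one element of a is less
   than the corresponding element of b (OR reduction of the mask). */
int any_lt(v2f64 a, v2f64 b)
{
    v2i64 m = lt_mask(a, b);
    return (m[0] | m[1]) != 0;
}
```

With only two elements the reductions are written out directly; a loop over the mask elements would generalize them to vectors with more elements.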
For information on how to get element-by-element results and an OR-reduced result, see the "Built-in functions" section.

The default argument promotion rules that C and C++ apply to scalar types do not take effect on vector types. This means that the operands of a vector operation might have to be more strictly related than those of the equivalent scalar operation. For example, the bitwise left shift operator requires as its right operand either the same vector type as the left operand, ignoring signedness (to shift element by element), or an unsigned long (to shift every element by the same number of bits). It also does not accept vector bool types, since the only values those elements can hold are 0 and the maximum element value.

Built-in functions

For the most efficient access to machine instructions, and for operations that do not lend themselves to normal C and C++ operators, the IBM z/OS XL C/C++ compiler adds a number of built-in functions (BIFs). These vector BIFs are prototyped in builtins.h, like most of the other BIFs, and are active only when the VECTOR option is specified.

The available BIFs include specialized arithmetic operations (not available with traditional C and C++ operators), compare, range compare, find element, gather/scatter, mask generation, copy until zero, load and store, specialized logical operations, merge, pack/unpack, replicate, rotate and shift, rounding and conversion, test, and the ANY and ALL predicates. These BIFs usually correspond to the machine SIMD instructions, which allows efficient, high-performing code. For example, in the following program, the call to vec_any_lt causes a VFCH (VECTOR FP COMPARE HIGH) instruction to be generated, as the pseudo-assembly listing shows.
#include <builtins.h>
#include <math.h>
#include <stdio.h>

double getMinimumValue(vector double *list, int numberOfArrayElements)
{
    vector double currentMinimum = { INFINITY, INFINITY };  // start both elements at +infinity
    for (int i = 0; i < numberOfArrayElements; i++) {
        if (vec_any_lt(list[i], currentMinimum)) {
            // At least one element of list[i] is smaller than the current
            // minimum: replicate the smaller of its two elements into both
            // positions of currentMinimum
            if (list[i][0] < list[i][1]) {
                currentMinimum = vec_splat(list[i], 0);
            } else {
                currentMinimum = vec_splat(list[i], 1);
            }
        }
    }
    return currentMinimum[0];
}

#define NUM_ELEMENTS 3

int main(void)
{
    vector double array[NUM_ELEMENTS] = { { 5.5, 6.6 }, { 7.7, -3.14 }, { 2.73, 0.0 } };
    double minValue = getMinimumValue(array, NUM_ELEMENTS);
    printf("Minimum Value: %e\n", minValue);
    return 55;
}

To compile this code in USS, use the following command:

xlc -qvector -qarch=11 -qfloat=ieee -qlist -qlanglvl=extc99 getMinimum.c -o getMinimum

The BIFs often translate directly into the associated machine instruction, but the compiler can still optimize them further (perhaps even removing the instruction) if it finds a better way. Although the example above is simple, it shows array indexing combined with vector element indexing, both for initialization and for selecting elements of a vector. It also shows vector BIFs in use, and vector types used as function parameters and arguments.

Conclusion

The IBM z13 SIMD unit provides a rich and important mechanism for processing data faster and for writing code that exploits data-level parallelism. With the new data types, operations, and BIFs in C and C++, writing programs that exploit the new unit is easy and intuitive.
Combined with the other features of the IBM z/OS XL C/C++ compiler, such as architecture sections, inline assembly, and high optimization levels, they allow programs to be written that fully exploit the new processor, providing a large boost to data analytics and processing.

For more information on vector programming, read the "IBM z/OS XL C/C++ V2R1 Programming Guide" (SC14-7315-01).

About the author

Rajan Bhakta has six years of development experience on IBM XL C. He is currently the ISO C Standards representative for Canada and the C representative for IBM in INCITS. He is also the Technical Architect for z/OS XL C/C++.

© Copyright IBM Corporation 2015 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)