Programming models for the next generation of GPGPU architectures
Benedict R. Gaster
February, 2011

MOTIVATION

2 | Programming models for next generation GPGPU | February, 2011 | Public

OPENCL™ PROGRAM STRUCTURE

Host C/C++ code runs on the CPU (platform and runtime APIs); OpenCL™ C device code runs on the DEVICE.

HELLO WORLD - OPENCL™ C SOURCE

__constant char hw[] = "Hello World\n";

__kernel void hello(__global char * out)
{
    size_t tid = get_global_id(0);
    out[tid] = hw[tid];
}

• This is a separate source file (or string)
• Cannot directly access host data
• Compiled at runtime

HELLO WORLD - HOST PROGRAM

// create the OpenCL context on a GPU device
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory object
memobjs[0] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_char) * strlen("Hello World"), NULL, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "hello", NULL);

// set the argument values
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);

// set work-item dimensions
global_work_size[0] = strlen("Hello World");

//
execute the kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size,
                             NULL, 0, NULL, NULL);

// read the output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[0], CL_TRUE, 0,
                          strlen("Hello World") * sizeof(cl_char), dst,
                          0, NULL, NULL);

HELLO WORLD - HOST PROGRAM (ANNOTATED)

The same host program, step by step:
• Define the platform and queues: clCreateContextFromType, clGetContextInfo, clCreateCommandQueue
• Define memory objects: clCreateBuffer
• Create the program: clCreateProgramWithSource
• Build the program: clBuildProgram
• Create and set up the kernel: clCreateKernel, clSetKernelArg
• Execute the kernel: clEnqueueNDRangeKernel
• Read results on the host: clEnqueueReadBuffer

WHAT CAN WE LEARN

LEARN FROM CURRENT GENERATION ARCHITECTURE

COMMON USE CASES

In OpenCL™ we generally see:
– Pick a single
device (often GPU or CL_DEVICE_TYPE_DEFAULT)
– All "kernels" in the cl_program object are used by the application

In CUDA the default for runtime mode is:
– Pick a single device (always a GPU)
– All "kernels" in scope are exported to the host application for the specific translation unit, i.e. calling kernels is syntactic and behaves similarly to static linkage

A LOOK INTO THE FUTURE

NEXT GENERATION GPGPU PROGRAM STRUCTURE

A single C++0x source targets both the CPU (C++0x) and the DEVICE (C++0x).

HELLO WORLD - C++0X SOURCE

char hw[] = "Hello World\n";

void __attribute__(gpu) hello(Index<1> index, char * out)
{
    size_t id = index.getX();
    out[id] = hw[id];
}

int main(void)
{
    char output[100];
    parallelFor(Range<1>(length(hw)),
                [output] (Index<1> index) {
                    hello(index, output);
                });
}

• A single program
• Can directly access data on host and device
• Compiled offline

WHAT QUESTIONS STILL NEED TO BE ANSWERED

How close can the CPU and GPU really be?
– How does it affect the models of today: GPU implies throughput computing; CPU implies local latency hiding in branchy code.

What effect does it have on the kind of applications one can run on these Fusion systems?

How does this all fit with managed languages?

Is C++0x enough on its own?
What about languages like Haskell or other high-level models?

QUESTIONS

Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. used with permission by Khronos. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2011 Advanced Micro Devices, Inc. All rights reserved.