Heterogeneous Computing: OpenCL™ and the ATI Radeon™ HD 5870 (“Evergreen”) Architecture
Advanced Micro Devices 1
OpenCL™ (and Microsoft® DirectCompute)
With OpenCL™/DirectCompute (DC) you can…
Leverage CPUs and GPUs to accelerate parallel computation
Enable dramatic speedups for computationally intensive applications
Write accelerated, portable code across different devices and architectures
With AMD’s implementations you can…
Leverage AMD’s CPUs and GPUs to accelerate parallel computation
OpenCL™: public release for multi-core CPUs and AMD GPUs, December 2009
Microsoft DirectX® 11 CS (Compute Shader): public release for AMD GPUs, October 2009, as part of the Microsoft Windows® 7 launch 2
The Heterogeneous Computing Software Ecosystem
Increase ease of application development
End-user Applications
Advanced Optimizations & Load Balancing: load balance across CPUs and GPUs; leverage Fusion performance advantages
High Level Frameworks
Middleware/Libraries: Video, Imaging, Math/Sciences, Physics
Tools: HLL compilers, Debuggers, Profilers
OpenCL™ & DirectCompute
Hardware & Drivers: Fusion, Discrete CPUs/GPUs
Drive new features into industry standards 3
OpenCL™ Overview 4
OpenCL™ Platform Model
A host connected to one or more OpenCL™ devices
An OpenCL™ device is a collection of one or more compute units (cores)
A compute unit is composed of one or more processing elements
Processing elements execute code as SIMD or SPMD 5
OpenCL™ Execution Model
Kernel: basic unit of executable code – similar to a C function; data-parallel or task-parallel
Program: collection of kernels and other functions; analogous to a dynamic library
Applications queue kernel execution instances
Queued in-order
Executed in-order or out-of-order 6
Expressing Data-Parallelism in OpenCL™
Define an N-dimensional computation domain (N = 1, 2 or 3)
Each independent element of execution in the N-D domain is called a work-item
The N-D domain defines the total number of work-items that execute in parallel
E.g., process a 1024 x 1024 image: global problem dimensions 1024 x 1024 = 1 kernel execution per pixel: 1,048,576 total kernel executions
Scalar:
void scalar_mul(int n, const float *a, const float *b, float *result)
{
  int i;
  for (i = 0; i < n; i++)
    result[i] = a[i] * b[i];
}
Data-parallel:
kernel void dp_mul(global const float *a, global const float *b, global float *result)
{
  int id = get_global_id(0);
  result[id] = a[id] * b[id];
}
// execute dp_mul over “n” work-items 7
Expressing Data-Parallelism in OpenCL™
Kernels executed across a global domain of work-items
Global dimensions define the range of computation
One work-item per computation, executed in parallel
Work-items are grouped in local workgroups
Local dimensions define the size of the workgroups
Executed together on one device
Share local memory and synchronization
Caveats
Global work-items must be independent: no global synchronization
Synchronization can be done within a workgroup 8
OpenCL™ Memory Model
Private Memory: per work-item
Local Memory: at least 32 kB, split into blocks, each available to any work-item in a given work-group
Global/Constant Memory: not synchronized
Host Memory: on the CPU
(Diagram: each work-item has its own private memory; work-items in a workgroup share local memory; workgroups on the compute device share global/constant memory; host memory sits on the host)
Memory management is explicit: you must move data from host -> global -> local and back 9
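The address space qualifiers in OpenCL™ C map directly onto this hierarchy. As a minimal illustration (a sketch added here, not part of the original deck; the kernel and argument names are made up), a work-group partial reduction that stages data in local memory and synchronizes with a barrier:

// Each work-group copies its slice of the input from global to local memory,
// reduces it, and writes one partial sum back to global memory.
// Assumes the local work size is a power of two and divides the global size.
__kernel void partial_sum(__global const float *in,
                          __global float *group_sums,
                          __local  float *scratch)   // sized by the host to the local work size
{
  int lid = get_local_id(0);
  scratch[lid] = in[get_global_id(0)];      // global -> local, one element per work-item
  barrier(CLK_LOCAL_MEM_FENCE);             // every work-item in the group sees the staged data
  for (int stride = (int)get_local_size(0) / 2; stride > 0; stride >>= 1) {
    if (lid < stride)
      scratch[lid] += scratch[lid + stride];
    barrier(CLK_LOCAL_MEM_FENCE);           // synchronization is per work-group only
  }
  if (lid == 0)
    group_sums[get_group_id(0)] = scratch[0];   // one result per work-group
}

The host sizes the __local argument with clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL). There is no way to synchronize across work-groups from inside a kernel, which is why only per-group partial sums are produced here.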
Compilation Model
OpenCL™ uses a dynamic/runtime compilation model (like OpenGL®):
1. The code is compiled to an Intermediate Representation (IR) – usually an assembly-like or virtual machine representation – known as offline compilation
2. The IR is compiled to machine code for execution – this step is much shorter – known as online compilation
In dynamic compilation, step 1 is usually done only once, and the IR is stored. The application loads the IR and performs step 2 during the application’s runtime (hence the term…) 10
OpenCL™ Framework
A context holds programs, kernels, memory objects and command queues
Programs: kernel source, e.g.
__kernel void add(global const float *a,
                  global const float *b,
                  global float *c)
{
  int id = get_global_id(0);
  c[id] = a[id] + b[id];
}
built into per-device binaries (add CPU program binary, add GPU program binary)
Kernels: e.g. dp_mul and add, each with its argument values (arg[0] value, arg[1] value, arg[2] value)
Memory Objects: Images, Buffers
Command Queues: In-Order Queue, Out-of-Order Queue, attached to a device (e.g. the GPU) 11
Setup
1. Get the device(s)
2. Create a context
3. Create command queue(s)
cl_uint num_devices_returned;
cl_device_id devices[2];
err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &devices[0], &num_devices_returned);
err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &devices[1], &num_devices_returned);
cl_context context;
context = clCreateContext(0, 2, devices, NULL, NULL, &err);
cl_command_queue queue_gpu, queue_cpu;
queue_gpu = clCreateCommandQueue(context, devices[0], 0, &err);
queue_cpu = clCreateCommandQueue(context, devices[1], 0, &err); 12
Choosing Devices
A system may have several devices—which is best?
The “best” device is algorithm- and hardware-dependent
Query device info with: clGetDeviceInfo(device, param_name, *value)
Number of compute units: CL_DEVICE_MAX_COMPUTE_UNITS
Clock frequency: CL_DEVICE_MAX_CLOCK_FREQUENCY
Memory size: CL_DEVICE_GLOBAL_MEM_SIZE
Extensions (double precision, atomics, etc.)
Pick the best device for your algorithm 13
OpenCL™ Synchronization: Queues & Events
OpenCL™ defines a command queue
Created on a single device
Within the scope of a context
Two types of queues
In-order queue: commands are executed in the order they are issued
Out-of-order queue: command execution depends only on completion of its event wait list
Multiple queues can be created on the same device
(Diagram: Context 1 with queues Q1,1–Q1,4 and Context 2 with queues Q2,1–Q2,2, in-order (IOQ) and out-of-order (OOQ), on Devices 1–3, each queue holding commands C1–C4) 14
OpenCL™ Synchronization: Queues & Events (2)
Commands are enqueued to a specific queue
Kernel execution
Memory operations
Events
Each command can be created with an associated event
Each command’s execution can be made dependent on a list of pre-created events
Commands can be dependent on events created in other queues/contexts
In the example above:
C3 from Q1,1 depends on C1 & C2 from Q1,2
C1 from Q1,4 depends on C2 from Q1,2
In Q1,4, C3 depends on C2 15
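As a concrete host-side sketch of such a dependency (added for illustration; kernel_1, kernel_2 and global_work_size are placeholders, while queue_gpu and queue_cpu are the queues created on the Setup slide), a command on one queue can be made to wait for an event recorded on another queue of the same context:

cl_event k1_done;
// Kernel 1 goes to the GPU queue; its completion is captured in k1_done.
err = clEnqueueNDRangeKernel(queue_gpu, kernel_1, 1, NULL,
                             global_work_size, NULL, 0, NULL, &k1_done);
// Kernel 2 goes to the CPU queue with k1_done in its wait list, so it cannot
// start until kernel_1 has finished, even though the two queues are independent.
err = clEnqueueNDRangeKernel(queue_cpu, kernel_2, 1, NULL,
                             global_work_size, NULL, 1, &k1_done, NULL);
clReleaseEvent(k1_done);   // release the host’s reference once it is no longer needed

Without the wait list, the two queues run independently and Kernel 2 may start before Kernel 1’s results are ready, which is exactly the hazard the next slide illustrates.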
Synchronization: Two Devices/Queues
Kernel 2 issued with no associated wait: starts before the results from Kernel 1 are ready
Kernel 2 waits for an event from Kernel 1: does not start until the results from Kernel 1 are ready
(Diagram: CPU/GPU timelines showing Kernel 2 overlapping Kernel 1 in the first case and starting only after Kernel 1 in the second) 16
OpenCL™ C Language
Derived from ISO C99
No standard C99 headers, function pointers, recursion, variable-length arrays, or bit fields
Additions to the language for parallelism
Work-items and workgroups
Vector types
Synchronization
Address space qualifiers
Optimized image access
Built-in functions 17
Data Types
Scalar data types: char, uchar, short, ushort, int, uint, long, ulong, bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage)
Image types: image2d_t, image3d_t, sampler_t
Vector data types
Vector lengths of 2, 4, 8, and 16
Aligned at vector length
int4 vi0 = (int4) -7;                    // -7 -7 -7 -7
int4 vi1 = (int4)(0, 1, 2, 3);           //  0  1  2  3
Vector operations and built-ins:
vi0.lo = vi1.hi;                         // vi0 is now  2  3 -7 -7
int8 v8 = (int8)(vi0, vi1.s01, vi1.odd); //  2  3 -7 -7  0  1  1  3
18
Built-in Functions
Math functions
IEEE 754 compatible rounding behavior for single-precision floating point
IEEE 754 compliant behavior for double-precision floating point
Defines maximum error of math functions as ULP values
Workgroup functions: barrier, mem_fence, async_work_group_copy, wait_group_events
Integer functions: abs, abs_diff, add_sat, hadd, rhadd, clz, mad_hi, mad_sat, max, min, mul_hi, rotate, sub_sat, upsample 19
Built-in Functions (2)
Image functions: read_image[f | i | ui], write_image[f | i | ui], get_image_[width | height | depth]
Common, geometric and relational functions
Vector data load and store functions, e.g. vload_half, vstore_half, vload_halfn, vstore_halfn, ... 20
OpenCL™ Basic Source Example 21
Vector Addition - Kernel
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
  int gid = get_global_id(0);
  c[gid] = a[gid] + b[gid];
}
Spec guide: __kernel: Section 6.7.1; __global: Section 6.5.1; get_global_id(): Section 6.11.1; data types: Section 6.1 22
Vector Addition - Host API (1)
// enumerate platforms
cl_uint nPlatforms;
cl_int err;
err = clGetPlatformIDs(0, NULL, &nPlatforms);
cl_platform_id *platforms = malloc(nPlatforms * sizeof(cl_platform_id));
// get list of all platforms
err = clGetPlatformIDs(nPlatforms, platforms, NULL);
cl_context_properties p[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties) platforms[0], 0 };
Spec guide: platforms and platform creation: Section 4.1 23
Vector Addition - Host API (2)
// create the OpenCL context on all devices
cl_context context = clCreateContextFromType(p, CL_DEVICE_TYPE_ALL, NULL, NULL, NULL);
// get the list of all devices associated with the context
size_t cb;
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
cl_device_id *devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
Spec guide: contexts and context creation: Section 4.3 24
Vector Addition - Host API (3)
// create a command-queue
cl_command_queue cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);
// allocate the buffer memory objects
cl_mem memobjs[3];
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float)*n, NULL, NULL);
Spec guide: command queues: Section 5.1; creating buffer objects: Section 5.2.1 25
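The walkthrough leaves out error handling to keep the slides readable. In real host code every cl* call returns or reports a status that should be checked; as a sketch (the CHECK macro is hypothetical, not part of the deck), the same buffer creation with its error code examined:

#include <stdio.h>
#include <stdlib.h>

#define CHECK(status, what)                                          \
  do {                                                               \
    if ((status) != CL_SUCCESS) {                                    \
      fprintf(stderr, "%s failed (%d)\n", (what), (int)(status));    \
      exit(EXIT_FAILURE);                                            \
    }                                                                \
  } while (0)

cl_int status;
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float) * n, srcA, &status);   // last argument receives the error code
CHECK(status, "clCreateBuffer(srcA)");

The same pattern applies to the queue, program, kernel, and enqueue calls that follow.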
Vector Addition - Host API (4)
cl_program program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);
cl_int err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "vec_add", NULL);
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);
Spec guide: creating program objects: Section 5.4.1; building program executables: Section 5.4.2; creating kernel objects: Section 5.5.1; setting kernel arguments: Section 5.5.2 26
Vector Addition - Host API (5)
// set work-item dimensions
size_t global_work_size[1];
global_work_size[0] = n;
// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
// read output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, n*sizeof(cl_float), dst, 0, NULL, NULL);
Spec guide: executing kernels: Section 5.6; reading, writing, and copying buffer objects: Section 5.2.2 27
OpenCL™ Fluid Simulation in Bullet Physics 28
Bullet Physics
Games physics simulation (http://bulletphysics.com)
Open source engine (Zlib license)
Rigid body dynamics, e.g. ragdolls, destruction, and vehicles
Soft body dynamics: cloth, rope, and deformable volumes
Current version is 2.76 and is CPU only
Future versions of Bullet will add a turbo injection, with the help of OpenCL™ 29
Open Physics – OpenCL™ and Bullet
Rigid body dynamics – Erwin Coumans and the Sony team are developing an accelerated version; 2D/3D demos already exist, and work on productization is under way
Soft body dynamics – AMD is developing DirectCompute/OpenCL™ cloth acceleration
Fluid simulation – does not currently exist in Bullet; AMD is developing an OpenCL™/DirectCompute SPH implementation
DRM Finite Element method – Pixelux is working on Bullet integration 30
Fluids and particle systems – the basics
Simple and highly parallel, so they map well to the GPU
Particles store position, mass, velocity, age, density, etc.
Particles are moved by time stepping: Euler or Leapfrog integration, dv_i/dt = a_i
Acceleration a_i has contributions from gravity, the pressure gradient, and viscosity
(Each particle i carries mass m_i, position (x_i, y_i, z_i), velocity v_i and acceleration a_i) 31
Particle-Particle Interaction
For correct simulation of fluids, inter-particle forces are required
A naïve implementation leads to a complexity of O(n²)
[Muller03] reduces this to linear complexity by introducing a cutoff distance k 32
Smoothed Particle Hydrodynamics 33
SPH – Algorithm
Build a spatial grid on the particles to allow fast neighbor finding
For each particle: find neighbors
For each particle: compute density and pressure
For each particle: compute acceleration
For each particle: integrate
On the GPU, each particle is worked on in parallel 34
SPH – Reducing O(n²) to O(n) with spatial hashing
Fill particles into a grid with spacing 2 * interaction distance
Search for potential neighbors in adjacent cells only
Map cells [i,j,k] into a 1D array via a hash function h(i,j,k) [teschner03]
Implementation: the infinite 3D grid is mapped to a finite grid – 10 * 10 * 10 in the unit cube – (x,y,z) maps to (x%11, y%11, z%11)
Finding neighbors is fast
Requires a fast GPU sort – today a radix sort; in future, other sorts – work is still needed to study sorting algorithms on the GPU 35
SPH – Next Generation
Reverted to a pure streaming implementation
Data is always read coherently
The pipeline is automatically generated using a sequence description file
Memory bandwidth is predicted automatically
Still in early development; running, but the simulation is still unstable – this is a maths issue, so:
The problem is memory bound, and so 256,000 particles @ 60 Hz is theoretically doable!
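To make the hashing step concrete, here is a kernel sketch of the cell-index computation described on the spatial-hashing slide above (added for illustration, not Bullet’s actual code; the names, the float4 layout, and the assumption of non-negative positions, as in the unit-cube example, are all mine):

// Each particle computes the index of the grid cell that contains it. Sorting the
// (cellIdx, particleIdx) pairs by cellIdx (the slides use a radix sort) groups the
// particles of a cell together, so neighbors are found by scanning only adjacent cells.
__kernel void hash_particles(__global const float4 *pos,   // particle positions
                             __global uint *cellIdx,       // sort key: cell of each particle
                             __global uint *particleIdx,   // sort value: particle index
                             const float cellSize,         // = 2 * interaction distance
                             const uint4 gridDim)          // finite grid dimensions
{
  uint i = get_global_id(0);
  float4 p = pos[i];
  // quantize the position to integer cell coordinates and wrap into the finite grid
  uint cx = (uint)floor(p.x / cellSize) % gridDim.x;
  uint cy = (uint)floor(p.y / cellSize) % gridDim.y;
  uint cz = (uint)floor(p.z / cellSize) % gridDim.z;
  cellIdx[i]     = (cz * gridDim.y + cy) * gridDim.x + cx;  // flatten [i,j,k] to 1D
  particleIdx[i] = i;
}

After the sort, one more pass can record where each cell’s run of particles begins; the density, pressure and force passes then visit only the particles in the surrounding cells.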
36 SPH – Rendering No ‘good’ solution to date Screen-space Curvature Flow [Malladi at el, 1995, Van der Laan at el, 2009] Evolve surface along normal direction with speed determined by mean surface curvature Flow moves surface only in Z Needs a lot of particles to look good! Almost all post rendering processing (geometry shader is used to render screen point sprites), so: Why not use OpenCL™? – OpenCL C Geometry and Fragment Shaders, with fixed function rasterization 37 SPH – Early prototype 38 OpenCL™ Soft Body Simulation 39 Particle-based soft body simulation Mass/spring system: Large collection of masses or particles Connect using springs Layout and properties of springs changes properties of soft body General spring types: Structural springs Shearing springs Bending springs 40 Simulating a cloth Subset of full range of soft bodies. Large number of particles lends itself to parallel processing. Force from a given link must be applied to both particles Requires batching to correctly update particles Basic stages of simulation: Original layout Current layout: Apply forces to masses Compute forces as stretch and compute velocities from rest length Compute new positions based on velocities 41 Apply simulation stages in batches Links cannot be solved in parallel Multiple links share nodes Dependencies in the interactions Batch the links based on their inteconnections: Graph colouring Complicated link structure = many batches Compute the velocity: Over each batch in turn Over multiple cycles to allow solution to converge Update positions, faces, normals etc based on computed velocities in a single pass 42 Adding wind to the simulation Compute normals on a per-face basis Number of faces can be optimised to improve performance We use a face per node triangle in the obvious fashion Node normals and areas computed from surrounding faces Compute wind force interaction based on velocity and density of the wind medium Wind velocity Relative velocity Node velocity Scale by medium density 43 The “Übercloth” Multi-pass multi-phase algorithm Large number of overall passes While hardware can run multiple kernels simultaneously, each wavefront executes too briefly for scheduler to gain from this Multiple cloths require all passes to be computed for each Separate kernel issues can end up being small Merge cloths into single simulated cloth object Inefficient on the CPU due to caching Highly efficient on the GPU: – Higher occupancy – Massively parallel 44 Reducing dispatch calls Even the giant multi-cloth dispatch involves a large number of kernel dispatches Given a scene with 78k particles, 468k links Over 350 kernel dispatches per frame More than half of the execution time performing kernel dispatches Suggestion: Extend the batching to sort links by interconnections Pairs of links connected by a node can safely be executed together, but would normally be batched separately Fuse kernel to compute both – small saving in memory traffic, halve dispatches for link solver 45 Cloth 46 The ATI Radeon™ HD 5870 (“Evergreen”) architecture 47 ATI Radeon™ HD 5870 GPU - Superior Consumer Hardware AT I R a d e o n ™ H D 5 8 7 0 G P U Process/Transistors 40nm/2.15B Stream Processors 1600 Peak SP Flop Rate 2.72 Teraflops Peak DP Flop Rate 544 GFLOPS Core Clock 850 MHz Max Resolution 3x2560x1600 Memory Type GDDR5 4.8Gbps Frame Buffer 1 GB Max Bandwidth 153 GB/s Max/Idle Board Power 188W/27 W *Subject to monitor resolutions (Max ASIC Resolution 6x2560x1600 @60Hz) . **Price on newegg.com as low as $394 as of March 20, 2010. 
“Most Advanced” technologies: internal calculations at time of launch showed that the processor used in the ATI Radeon HD 5800 series can achieve 2.72 TeraFLOPS, more than any other known microprocessor as of September 23, 2009. FLOPS is not necessarily an indicator of leading performance in every application as AMD GPUs are designed and built to excel specifically in applications involving massively parallel calculations. 48 ATI Eyefinity – Enhanced Productivity ATI Eyefinity technology can support up to 6 displays using a single enabled ATI Radeon™ graphics card - the number of displays may vary by board design and you should confirm exact specifications with the applicable manufacturer before purchase. ATI Eyefinity technology works with games that support non-standard aspect ratios, which is required for panning across multiple displays. To enable more than two displays, additional panels with native DisplayPort™ connectors, and/or certified DisplayPort™ adapters to convert your monitor’s native input to your cards DisplayPort™ or Mini-DisplayPort™ connector(s), are required. 49 ATI Radeon™ HD 5870 GPU Features ATI Radeon™ HD 4870 2 ATI Radeon™ HD 5870 2 Difference 263 mm 334 mm 1.27x Transistors 956 million 2.15 billion 2.25x Memory Bandwidth 115 GB/sec 153 GB/sec 1.33x L2-L1 Rd Bandwidth 512 bytes/clk 512 bytes/clk 1x L1 Bandwidth 640 bytes/clk 1280 bytes/clk 2x Vector GPR 2.62 Mbytes 5.24 MByte 2x LDS Memory 160 kb 640kb 4x 640 byte/clk 2560 bytes/clk 4x Concurrent Threads 15872 31744 2x Shader (ALU units) 800 1600 2x Idle 90 W 27 W 0.3x Max 160 W 188 W 1.17x Area LDS Bandwidth Board Power 20 SIMD engines (MPMD) Each with 16 Stream Cores (SC) – Each with 5 Processing Elements (PE) (1600 Total) – Each PE IEEE 754 –2008 precision capable denorm support, fma, flags, round modes – Fast integer support Each with local data share memory – 32 kb shared low latency memory – 32 banks with hardware conflict management – 32 integer atomic units 80 Read Address Probes 4 addresses per SIMD engine (32 -128 bits data) 4 filter or convert logic per SIMD Global Memory access 32 SC access read/write/integer atomic/clock Relaxed Unordered Memory Consistency Model On chip 64 kb global data share 153 GB/sec GDDR5 memory interface 50 GDDR5 Memory Controller Architecture 2nd Generation GDDR5 Implementation Optimized for Speed and Area GPU High Bandwidth Direct Connect Clients Multi Engine Client Hub Connect EDC (Error Detection Code) CRC Checks on Data Transfers for Improved Reliability at High Clock Speeds GDDR5 Memory Clock temperature compensation Enables Speeds Approaching 5 Gbps Fast GDDR5 Link Retraining Allows Voltage & Clock Switching on the Fly without Glitches 51 Tera Scale 2 Architecture - ATI Radeon™ HD 5870 GPU Double the processing power of previous generation. 
Up to: 2.72 Teraflops 27.2 Giga Pixels/sec Partially decoupled architecture Work units produced by rasterizer passed into UTDP UTDP executes the control flow program Units of work called clauses pushed to SIMD engines and memory controllers Relies on threading to keep SIMD units busy 52 ATI Radeon™ HD 5870 GPU - ISA: Control, memory and ALU programs 13 TEX: ADDR(368) CNT(2) 08 ALU_PUSH_BEFORE: ADDR(62) CNT(2) 15 x: SETGT_INT R0.x, R2.x, R4.x 16 x: PREDNE_INT ____, R0.x, 0.0f UPDATE_PRED 22 VFETCH R0.x___, R0.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 23 VFETCH R1.x___, R0.z, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) UPDATE_EXEC_MASK 09 JUMP POP_CNT(1) ADDR(18) 10 ALU: ADDR(64) CNT(9) 17 x: SUB_INT R5.x, R3.x, R2.x y: LSHL ____, R3.x, (0x00000002, 2.802596929e-45f).x z: LSHL ____, R2.x, (0x00000002, 2.802596929e-45f).x t: MOV R8.x, 0.0f 18 x: SUB_INT R6.x, PV17.y, PV17.z y: MOV R8.y, 0.0f z: MOV R8.z, 0.0f w: MOV R8.w, 0.0f 14 ALU_BREAK: ADDR(85) CNT(88) VEC_120 11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17) 12 ALU: ADDR(73) CNT(12) 19 x: LSHL ____, R5.x, (0x00000002, 2.802596929e-45f).x w: ADD_INT ____, R6.x, R1.x VEC_120 t: ADD_INT R7.x, R1.x, (0x00000008, 1.121038771e-44f).y 24 x: MOV y: ASHR z: ASHR w: ASHR t: ASHR ____, R0.x T0.y, R0.x, (0x00000018, 3.363116314e-44f).x ____, R0.x, (0x00000008, 1.121038771e-44f).y ____, R0.x, (0x00000010, 2.242077543e-44f).z T0.z, R1.x, (0x00000018, 3.363116314e-44f).x 15 ALU: ADDR(173) CNT(1) 40 x: MOV R1.x, R7.x 16 ENDLOOP i0 PASS_JUMP_ADDR(12) 17 POP (1) ADDR(18) 18 ALU: ADDR(174) CNT(22) KCACHE0(CB1:0-15) 41 x: MOV T0.x, KC0[2].x z: ADD_INT ____, R2.x, R3.x w: MOV T0.w, KC0[2].x t: LSHL R4.x, R3.x, (0x00000002, 2.802596929e-45f).x 53 ATI Radeon™ HD 5870 GPU - Compute Full hardware DirectCompute 11 Implementation Full hardware OpenCL™ 1.0 Implementation IEEE754-2008 compliant precision for SP and DP Memory Load Operations 80 4-16 byte gather loads from L1 per clk 8 64 byte loads from L2 per clk Grouped Cache reads with invalidate Memory Store Operations Coalesces and combine writes Scatter up to 32 64 bit values per clock Global Memory R/W Cache with Atomics Relaxed Global Memory Consistency Model 54 ATI Radeon™ HD 5870 GPU - Compute (2) Functionality targeting OpenCL™ 1.0 and beyond 32-bit atomic operations Flexible 32kB Local Data Shares Flexible 64kB Global Data Share Global semaphores/barrier synchronization Append/consume buffers Ordered Append buffers 55 Scaling the ATI Radeon™ HD 5000 Series GPU Architecture ATI Radeon™ HD 5870 “Cypress” Area 2 ATI Radeon™ HD 5770 “Juniper” 2 ATI Radeon™ HD 5670 “Redwood” 2 ATI Radeon™ HD 5450 “Cedar” 2 334 mm 166 mm 104 mm 59 mm Transistors 2.15 billion 1.04 billion 0.627 billion 0.292 billion Memory Bandwidth 153 GB/sec 76 GB/sec 64 GB/sec 13 GB/sec SIMD Engines 20 10 5 2 Vector width 64 64 64 32 1600 800 400 80 256-bits 128-bits 128-bits 64-bits TEX 80 40 20 8 ROPs 32 16 16 4 L2-L1 Rd Bandwidth 512 bytes/clk 512 bytes/clk 512 bytes/clk 512 bytes/clk L1 Bandwidth 1280 bytes/clk 640 bytes/clk 320 bytes/clk 128 bytes/clk 5.24 MByte 2.62 MByte 1.31 MByte 0.524 MByte 640kB 320 kB 160 kB 64 kB 2560 bytes/clk 1280 bytes/clk 640 bytes/clk 128 bytes/clk 31744 15872 < 15872 << 15872 Shader (ALU units) Memory width Vector GPR LDS Memory LDS Bandwidth Concurrent Threads 56 Compute Aspects of ATI Radeon™ HD 5870 GPU • Stream Cores • Local Data Share (LDS) • SIMD Engine • Load / Store / Atomic Data Access • Dispatch / Indirect Dispatch • Global Data Share (GDS) 57 Stream Core with Processing Elements (PE) Each Stream Core Unit 
includes: • 4 PE 4 Independent SP or Integer Ops 2 DP add or dependant SP pairs 1 DP fma or mult or SP dp4 • • • • 1 Special Function PE • 1 SP or Integer Operation • SP or DP Transcendental Ops • Operand Prep logic • General Purpose Registers • Data forwarding and predication logic 58 Processing Element (PE) Precision Improvements FMA (Fused Multiply Add) IEEE 754-2008 precise with all round modes, proper handling of Nan/Inf/Zero and full de-normal support in hardware for SP and DP MULADD instruction Without truncation, enabling a MULieee followed ADDieee to be combined with round and normalization after both multiplication and subsequent addition. IEEE Rounding Modes Round to nearest even, Round toward +Infinity, Round toward –Infinity, Round toward zero. Supported under program control anywhere in the shader. Double and single precision modes are controlled separately. Applies to all slots in a VLIW. 59 Processing Element (PE) Precision Improvements (2) DeDe-normal Programmable Mode control For SP and DP independently. Separate control for input flush to zero and underflow flush to zero. FP Conversion Ops Between 16 bit, 32 bit, and 64 bit floats with full IEEE 754 precision. Hardware Exceptions Detection For floating point numbers with software recording and reporting mechanism. Inexact, Underflow, Overflow, division by zero, de-normal, invalid operation 64 bit Transcendental Approximation Hardware based double precision approximation for reciprocal, reciprocal square root and square root 60 Processing Element (PE) Improved IPC – Co-issue of dependant Ops in “ONE VLIW” instruction – full IEEE intermediate rounding & normalization – Dot4 (A=A*B + C*D + E*F + G*H), – Dual Dot2 ( A= A*B + C*D; F = G*h + I*J) – Dual Dependant Multiplies (A = A * B * C ; F = G * H * I;) – Dual Dependant Adds (A = B + C + D; E = F + G + H;) – Dependant Muladd (A= A*B+C + D*E; F = G*H + I + J*K) – 24 bit integer – MUL, MULADD (4 – co-issue) – Heavy use for Integer thread group address calculation 61 Processing Element (PE) New Integer Ops • • • • • • • 32b operand count bits 64b operand count bits Insert Bit field Extract Bit Field Find first Bit (high, low, signed high) Reverse bits Extended Integer Math • Integer Add with carry • Integer Subtract with borrow • 1 bit pre-fix sum on 64b mask. 
(useful for compaction) • Shader Accessible 64 bit counter • Uniform indexing of constants 62 Processing Element (PE) Special Ops – Conversion Ops – FP32 to FP64 and FP64 to FP32 (w/IEEE conversion rules) – FP32 to FP16 and FP 16 to FP32 (w/IEEE conversion rules) – FP32 to Int/UInt and Uint/Int to FP32 – Very Fast 8 bit Sum of Absolute Differences (SAD) – 4x1 SAD per lane, with 4x4 8 bit SAD in one VLIW – Used for video encoding, computer vision – Will be exposed via OpenCL™ extension – Video Ops – 8 bit packed to float and float to 8 bit packed conversion Ops – 4 8 bit pixel average (bilinear interpolation with programmable round) – Arbitrary Byte or Bit extraction from 64 bits 63 Local Data Share (LDS) Share Data between Work Items of a Work Group designed to increase performance High Bandwidth access per SIMD Engine (1024b/clk) – Peak is double external R/W bandwidth (512b/clk) Full coalesce R/W/Atomic with optimization for broadcast reads access Low Latency Access per SIMD Engine 0 latency direct reads (Conflict free or Broadcast) 1 VLIW instruction latency for LDS indirect Op All bank conflicts are hardware detected and serialized as necessary with fast support for broadcast reads 64 Local Data Share (LDS) (2) Hardware allocation of LDS space per thread group dispatch Base and size stored with wavefront for private access Hardware range check - Out of bound access attempts will return 0 and kill writes 32 – byte, ubyte, short, ushort reads/writes per clk (reads are sign extended) Per lane 32 bit Read, Read2, Write, Write2 Per lane 32 bit Atomic: add, sub, rsub, inc, dec, min, max, and, or, xor, mskor, exchange, exchange2, compare _swap(int), compare_swap(float w/Nan/Inf) Return pre-Op value to Stream Core 4 primary Processing Elements 65 Local Data Share (LDS) 66 SIMD Engine SIMD Engine can process Wavefronts from multiple kernels concurrently Thread divergence within a Wavefront is enabled with Lane Masking and Branching Enabling each Thread in a Wavefront to traverse a unique program execution path Full hardware barrier support for up to 8 Work Groups per SIMD Engine (for thread data sharing) Each Stream Core receives up to the following per VLIW instruction issue 5 unique ALU Ops - or - 4 unique ALU Ops with a LDS Op (Up to 3 operands per thread) 67 SIMD Engine (2) LDS and Global Memory access for byte, ubyte, short, ushort reads/writes supported at 32bit dword rates Private Loads and read only texture reads via Read Cache Unordered shared consistent loads/stores/atomics via R/W Cache Wavefront length of 64 threads where each thread executes a 5 way VLIW Instruction each issue ¼ Wavelength (16 threads) on each clock of 4 clocks (T0-15, T16-31, T32-47, T48-T63) 68 Global Data Share (GDS) Low Latency Access to a global data shared memory between all threads in a kernel ~25 clks latency Issued in parallel to math similar to texture, return to GPR Full coalesce R/W/Atomic and optimized for broadcast reads 8 threads post request per clock for full set of DS Operations In parallel to global memory access Driver allocation and initialized space per dispatch (base & size stored) 64kb can be used for multiple small kernels concurrently Driver stores of state to memory post kernel Hardware range check on each access, out of bounds returns 0 and kills writes Full compliment of DS operations Can be used for low latency global reductions 69 Global Data Share (GDS) 70 DirectCompute features on ATI Radeon™ HD 5870 GPU Feature Cs 4.0 Cs 5.0 Benefits Thread Group Dispatch 2D 3D Assisted Domain 
control (Grid) and easy address generation w/synch primitives per block for easy communication between thread Indirect Group Dispatch N/A 3D Useful for conditional Execution and enables dispatch of GPU en-queued work without CPU intervention Thread Group Limits 768 1024 Larger groups to enable natural algorithm mapping. More Parallel threads in a group Group Shared Memory 16 kb 32 kb Cooperative memory for sharing and reuse. Increase in effective bandwidth Shared Memory access Owners write share read model Full 32 kb read/write accesses Enabled full use of shared memory as data amplification and new algorithm implementation Atomic operations N/A Local and Global Memory Enables data binning for stats and powerful set of data structure creation and controls, misuses possible Append/Consume Buffers N/A 8 Supported Indirect producer consumer queue systems, list and stacks. Unordered Access View 1 to CS 8 Supported CS/PS Scatter Writes and Buffer management for compute Group Sync Instructions 71 AMD OpenCL™ Tool Support 72 Compiler, runtime and debugger LLVM-based compiler chain Runtimes for both x86 CPUs and ATI Radeon™ HD 4000 Series GPUs and ATI Radeon™ HD 5000 Series GPUs Debugging support allows GDB step-through debugging on the CPU GPU-side debugging being worked on 73 ATI Stream Profiler Offers access to information from the hardware performance counters Integrated into visual studio Documentation will be improved in future to allow third party profiler development 74 Stream Kernel Analyzer View IL and ISA code for a range of architectures Gives statistics on min/max number of cycles; number of ALU, fetch and write instructions, throughput etc. 75 Industry Standard Programming Interfaces Industry Standard APIs • Stream processing has evolved to a point where proprietary solutions are not helping to drive broader technology acceptance • As hardware and software evolves, the key to making it accessible in a unified way is to use standards. • Forces competition and lower cost/power solutions with regular improvements • OpenCL™ and DirectCompute are the Open APIs to access GPU compute. Industry standards help ease crosscrossplatform development and further mainstream use 76 Summary OpenCL™: an open standard for heterogeneous compute Working on applications for OpenCL™, in particular Open Physics ATI Radeon™ HD 5870 GPU: a modern compute--focused GPU architecture compute Tool support: debugging and performance analysis 77 References Links to AMD’s OpenCL™ tutorial and some samples http://developer.amd.com/gpu/ATIStreamSDK/pages/Publications.aspx AMD Developers Central : Samples, Tools, Downloads, White papers etc. http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx Khronos OpenCL: Specification, Introduction Slides, and Quick Reference www.khronos.org/opencl/ DirectCompute Introduction http://msdn.microsoft.com/en-us/directx/bb896684.aspx 78 Acknowledgements Cloth example produced in collaboration with: Justin Hensley Abe Wiley Jason Yang SPH by: Saif Ali Alan Heirich Ben Gaster 79 Disclaimer and Attribution DISCLAIMER The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. 
IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, AMD Opteron, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, Windows Vista, and DirectX™ are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners. 80