
Heterogeneous Computing
OpenCL™ and the ATI Radeon™ HD 5870 (“Evergreen”)
Architecture
Advanced Micro Devices
1
OpenCL™ (and Microsoft® DirectCompute)
With OpenCL™/DirectCompute (DC) you can…
Leverage CPUs and GPUs to accelerate parallel computation
Enable dramatic speedups for computationally intensive applications
Write accelerated portable code across different devices and architectures
With AMD’s implementations you can…
Leverage AMD's CPUs and GPUs to accelerate parallel computation
OpenCL™: public release for multi-core CPUs and AMD GPUs, December 2009
Microsoft DirectX® 11 Compute Shader: public release for AMD GPUs, October 2009, as part of the Microsoft Windows® 7 launch
2
The Heterogeneous Computing Software Ecosystem
End-user Applications
Advanced Optimizations & Load Balancing
Load balance across CPUs and GPUs; leverage Fusion performance advantages
High Level Frameworks
Middleware/Libraries: Video, Imaging, Math/Sciences, Physics
Tools: HLL compilers, Debuggers, Profilers
OpenCL™ & DirectCompute
Hardware & Drivers: Fusion, Discrete CPUs/GPUs
Goals: increase ease of application development; drive new features into industry standards
3
OpenCL™ Overview
4
OpenCL™ Platform Model
A host connected to one or more OpenCL™ devices
An OpenCL™ device is
A collection of one or more compute units (cores)
A compute unit is composed of one or more processing elements
Processing elements execute code as SIMD or SPMD
5
OpenCL™ Execution Model
Kernel
Basic unit of executable code - similar to a C function
Data-parallel or task-parallel
Program
Collection of kernels and other functions
Analogous to a dynamic library
Applications queue kernel execution instances
Queued in-order
Executed in-order or out-of-order
6
Expressing Data-Parallelism in OpenCL™
Define N-dimensional computation domain (N = 1, 2 or 3)
Each independent element of execution in N-D domain is called a work-item
The N-D domain defines the total number of work-items that execute in parallel
E.g., process a 1024 x 1024 image: global problem dimensions of 1024 x 1024 = one kernel execution per pixel: 1,048,576 total kernel executions
Scalar

void scalar_mul(int n,
                const float *a,
                const float *b,
                float *result)
{
    int i;
    for (i = 0; i < n; i++)
        result[i] = a[i] * b[i];
}

Data-parallel

kernel void dp_mul(global const float *a,
                   global const float *b,
                   global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}
// execute dp_mul over "n" work-items
7
Expressing Data-Parallelism in OpenCL™
Kernels executed across a global domain of work-items
Global dimensions define the range of computation
One work-item per computation, executed in parallel
Work-items are grouped in local workgroups
Local dimensions define the size of the workgroups
Executed together on one device
Share local memory and synchronization
Caveats
Global work-items must be independent: no global synchronization
Synchronization can be done within a workgroup
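A minimal sketch of what "synchronization within a workgroup" looks like in practice (kernel and buffer names are illustrative, not from this deck): each work-item stages a value into local memory, and a barrier makes those writes visible to the rest of its work-group before they are consumed. There is no equivalent barrier across the global domain.

// Hypothetical kernel: reverse the elements within each work-group.
__kernel void reverse_in_group(__global const float *in,
                               __global float *out,
                               __local  float *tmp)
{
    size_t lid  = get_local_id(0);
    size_t lsz  = get_local_size(0);
    size_t base = get_group_id(0) * lsz;

    tmp[lid] = in[base + lid];        // stage into local memory
    barrier(CLK_LOCAL_MEM_FENCE);     // writes now visible to the whole work-group
    out[base + lid] = tmp[lsz - 1 - lid];
}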
8
OpenCL™ Memory Model
Private Memory
Per work-item
Local Memory
At least 32kB, split into blocks, each available to any work-item in a given work-group
Global/Constant Memory
Visible to all work-items on the compute device; not synchronized
Host Memory
On the CPU
Memory management is explicit:
You must move data from host -> global -> local and back
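As a hedged illustration of the explicit host -> global -> local path (buffer and kernel names are made up for this sketch): the host copies data into a global buffer with clEnqueueWriteBuffer, the kernel stages a work-group's tile into local memory, and the host reads the result back with clEnqueueReadBuffer.

// Host side (context, queue and buffers assumed to exist):
//   clEnqueueWriteBuffer(queue, buf_in,  CL_TRUE, 0, bytes, host_src, 0, NULL, NULL);
//   ... enqueue the kernel ...
//   clEnqueueReadBuffer (queue, buf_out, CL_TRUE, 0, bytes, host_dst, 0, NULL, NULL);

// Kernel side: copy one work-group's tile from global into local memory, then use it.
__kernel void scale_tile(__global const float *in,
                         __global float *out,
                         __local  float *tile)
{
    size_t lid  = get_local_id(0);
    size_t lsz  = get_local_size(0);
    size_t base = get_group_id(0) * lsz;

    event_t e = async_work_group_copy(tile, in + base, lsz, 0);
    wait_group_events(1, &e);           // tile is now in local memory
    out[base + lid] = 2.0f * tile[lid]; // operate on the local copy
}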
9
Compilation Model
OpenCL™ uses a dynamic/runtime compilation model (like OpenGL®):
1. The code is compiled to an Intermediate Representation (IR)
– Usually an assembly-level or virtual machine representation
– Known as offline compilation
2. The IR is compiled to machine code for execution.
– This step is much shorter.
– It is known as online compilation.
In dynamic compilation, step 1 is usually done only once, and the IR is stored.
The application loads the IR and performs step 2 during the application's runtime (hence the term…)
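A hedged host-side sketch of the two steps (single device, minimal error handling; variable names are illustrative): build from source once, query the device binary that results from step 1, and on later runs recreate the program from that stored binary so only the short step 2 remains.

// Step 1 (done once): compile from source and fetch the resulting binary.
cl_int err;
cl_program prog = clCreateProgramWithSource(context, 1, &src, NULL, &err);
err = clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

size_t bin_size;
clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size), &bin_size, NULL);
unsigned char *bin = malloc(bin_size);
clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
// ... save 'bin' to disk ...

// Step 2 (every run): reload the stored binary and finish compilation online.
cl_program prog2 = clCreateProgramWithBinary(context, 1, &device, &bin_size,
                                             (const unsigned char **)&bin, NULL, &err);
err = clBuildProgram(prog2, 1, &device, NULL, NULL, NULL);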
10
OpenCL™ Framework
An OpenCL™ context holds:
Programs – e.g. dp_mul, add – each compiled into per-device binaries (a CPU program binary and a GPU program binary)
Kernels – e.g. dp_mul, add – each with its argument values (arg[0], arg[1], arg[2])
Memory Objects – Images and Buffers
Command Queues – In-Order and Out-of-Order queues, each attached to a device (e.g. the GPU)

Example kernel source:

__kernel void
add(global const float *a,
    global const float *b,
    global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] + b[id];
}
11
Setup
1. Get the device(s)
2. Create a context
3. Create command queue(s)
cl_uint num_devices_returned;
cl_device_id devices[2];
err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1,
                     &devices[0], &num_devices_returned);
err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1,
                     &devices[1], &num_devices_returned);

cl_context context;
context = clCreateContext(0, 2, devices, NULL, NULL, &err);

cl_command_queue queue_gpu, queue_cpu;
queue_gpu = clCreateCommandQueue(context, devices[0], 0, &err);
queue_cpu = clCreateCommandQueue(context, devices[1], 0, &err);
12
Choosing Devices
A system may have several devices—which is best?
The “best” device is algorithm- and hardware-dependent
Query device info with:
clGetDeviceInfo(device, param_name, *value)
Number of compute units
CL_DEVICE_MAX_COMPUTE_UNITS
Clock frequency
CL_DEVICE_MAX_CLOCK_FREQUENCY
Memory size
CL_DEVICE_GLOBAL_MEM_SIZE
Extensions
(double precision, atomics, etc.)
Pick the best device for your algorithm
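A minimal query sketch, assuming a valid cl_device_id named device (the selection policy at the end is only an example):

cl_uint  units, mhz;
cl_ulong mem;
char     exts[1024];

clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,   sizeof(units), &units, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(mhz),   &mhz,   NULL);
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,     sizeof(mem),   &mem,   NULL);
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,          sizeof(exts),  exts,   NULL);
// e.g. prefer the device with the most compute units, or the one whose extension
// string contains cl_khr_fp64 if the algorithm needs double precision.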
13
OpenCL™ Synchronization: Queues & Events
OpenCL™ defines a command queue
Created on a single device
Within the scope of a context
Two types of queues
In-order queue: commands are executed in the order they were issued
Out-of-order queue: a command executes as soon as the events in its wait list have completed
(Diagram: Context 1 holds queues Q1,1 and Q1,2 on Device 1 (in-order), Q1,3 (in-order) and Q1,4 (out-of-order) on Device 2; Context 2 holds Q2,1 on Device 1 (out-of-order) and Q2,2 on Device 3 (in-order); commands C1–C4 sit in the queues.)
Multiple queues can be created on the same device
14
OpenCL™ Synchronization: Queues & Events (2)
Commands are enqueued to a specific queue
Kernel execution
Memory operations
Events
Each command can be created with an associated event
Each command's execution can be made dependent on a list of pre-created events
Commands can depend on events created in other queues/contexts
In the example (same queues as the previous slide):
C3 from Q1,1 depends on C1 & C2 from Q1,2
C1 from Q1,4 depends on C2 from Q1,2
In Q1,4, C3 depends on C2
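A hedged host-side sketch of the first dependency above, expressed with the event API (the handles q11, q12, kernel_c1..c3 and the global size array gsize are illustrative names):

cl_event ev[2];
// C1 and C2 enqueued to Q1,2, each returning an event.
err = clEnqueueNDRangeKernel(q12, kernel_c1, 1, NULL, gsize, NULL, 0, NULL, &ev[0]);
err = clEnqueueNDRangeKernel(q12, kernel_c2, 1, NULL, gsize, NULL, 0, NULL, &ev[1]);
// C3, enqueued to Q1,1, lists both events and may not start until they complete.
err = clEnqueueNDRangeKernel(q11, kernel_c3, 1, NULL, gsize, NULL, 2, ev, NULL);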
15
Synchronization: Two Devices/Queues
Without an event dependency: Kernel 2 is issued with no associated wait, so it starts before the results from Kernel 1 are ready.
With an event dependency: Kernel 2 waits for an event from Kernel 1 and does not start until the results from Kernel 1 are ready.
(Diagram: CPU/GPU timelines showing Kernel 1 and Kernel 2 in both cases.)
16
OpenCL™ C Language
Derived from ISO C99
No standard C99 headers, function pointers, recursion, variable-length arrays, or bit fields
Additions to the language for parallelism
Work-items and workgroups
Vector types
Synchronization
Address space qualifiers
Optimized image access
Built-in functions
17
Data Types
Scalar data types
char, uchar, short, ushort, int, uint, long, ulong,
bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage)
Image types
image2d_t, image3d_t, sampler_t
Vector data types
Vector length of 2, 4, 8, and 16
Aligned at vector length
int4 vi0 = (int4) -7;              // vi0 = (-7, -7, -7, -7)
int4 vi1 = (int4)(0, 1, 2, 3);     // vi1 = ( 0,  1,  2,  3)

Vector operations and built-ins
vi0.lo = vi1.hi;                            // vi0 = (2, 3, -7, -7)
int8 v8 = (int8)(vi0, vi1.s01, vi1.odd);    // v8 = (2, 3, -7, -7, 0, 1, 1, 3)
18
Built-in Functions
Math Functions
IEEE 754 compatible rounding behavior for single precision floating-point
IEEE 754 compliant behavior for double precision floating-point
Defines maximum error of math functions as ULP values
Workgroup Functions
barrier, mem_fence, async_work_group_copy, wait_group_events
Integer functions
abs, abs_diff, add_sat, hadd, rhadd, clz, mad_hi, mad_sat, max,
min, mul_hi, rotate, sub_sat, upsample
19
Built-in Functions (2)
Image functions
read_image[f | i | ui]
write_image[f | i | ui]
get_image_[width | height | depth]
Common, Geometric and Relational Functions
Vector Data Load and Store Functions
e.g. vload_half, vstore_half, vload_halfn, vstore_halfn, ...
20
OpenCL™ Basic Source Example
21
Vector Addition - Kernel
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global       float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

Spec Guide
__kernel: Section 6.7.1
__global: Section 6.5.1
get_global_id(): Section 6.11.1
Data types: Section 6.1
22
Vector Addition - Host API (1)
// Enumerate platforms
cl_uint nPlatforms;
err = clGetPlatformIDs(0, NULL, &nPlatforms);
cl_platform_id *platforms = malloc(nPlatforms * sizeof(cl_platform_id));

// get list of all platforms
err = clGetPlatformIDs(nPlatforms, platforms, NULL);
cl_context_properties p[3] =
    { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0], 0 };

Spec Guide
Platforms and platform creation: Section 4.1
23
Vector Addition - Host API (1)
// create the OpenCL context on all devices
cl_context context = clCreateContextFromType(
    p, CL_DEVICE_TYPE_ALL, NULL, NULL, NULL);

// get the list of all devices associated with the context
size_t cb;
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
cl_device_id *devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

Spec Guide
Contexts and context creation: Section 4.3
24
Vector Addition - Host API (2)
// create a command-queue
cl_command_queue cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
cl_mem memobjs[3];
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

Spec Guide
Command queues: Section 5.1
Creating buffer objects: Section 5.2.1
25
Vector Addition - Host API (3)
cl_program program = clCreateProgramWithSource(
    context, 1, &program_source, NULL, NULL);
cl_int err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

cl_kernel kernel = clCreateKernel(program, "vec_add", NULL);

err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

Spec Guide
Creating program objects: Section 5.4.1
Building program executables: Section 5.4.2
Creating kernel objects: Section 5.5.1
Setting kernel arguments: Section 5.5.2
26
Vector Addition - Host API (4)
// set work-item dimensions
size_t global_work_size[1];
global_work_size[0] = n;

// execute kernel
err = clEnqueueNDRangeKernel(
    cmd_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

// read output array
err = clEnqueueReadBuffer(
    cmd_queue, memobjs[2], CL_TRUE, 0, n*sizeof(cl_float), dst, 0, NULL, NULL);

Spec Guide
Executing kernels: Section 5.6
Reading, writing, and copying buffer objects: Section 5.2.2
27
OpenCL™ Fluid Simulation in
Bullet Physics
28
Bullet Physics
Game physics simulation (http://bulletphysics.com)
Open source engine (Zlib license)
Rigid body dynamics, e.g.
– Ragdolls, destruction, and vehicles
Soft body dynamics
– Cloth, rope, and deformable volumes
Current version is 2.76 and is CPU only
Future versions of Bullet will add a turbo injection, with the help of OpenCL™
29
Open Physics – OpenCL™ and Bullet
Rigid body dynamics
– Erwin Coumans and Sony team developing accelerated version
– 2D/3D demos already exist; work on productization is ongoing
Soft body dynamics
– AMD developing DirectCompute/OpenCL Cloth acceleration
Fluid simulation
– Does not currently exist in Bullet
– AMD developing OpenCL/DirectCompute SPH implementation
DRM Finite Element method
– Pixelux working on Bullet integration
30
Fluids and particle systems – the basics
Simple and highly parallel, thus they map well to the GPU
Particles store position, mass, velocity, age, density, etc.
Particles are moved by time stepping (Euler or Leapfrog integration):
dv_i/dt = a_i
– Acceleration a_i has contributions from gravity, the pressure gradient, and viscosity
(Figure: a particle with mass m_i, position (x_i, y_i, z_i), velocity v_i and acceleration a_i.)
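A hedged OpenCL C sketch of the time stepping (simple Euler with a precomputed acceleration buffer; buffer names and layout are illustrative):

__kernel void integrate(__global float4 *pos,
                        __global float4 *vel,
                        __global const float4 *accel,
                        const float dt)
{
    int i = get_global_id(0);
    vel[i] += accel[i] * dt;   // v_i += a_i * dt
    pos[i] += vel[i]   * dt;   // x_i += v_i * dt
}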
31
Particle-Particle Interaction
For correct simulation of fluids, inter-particle forces are required
Naïve implementation leads to a complexity of O(n²)
[Muller03] reduces this to linear complexity by introducing a cutoff distance k
32
Smoothed Particle Hydrodynamics
33
SPH – Algorithm
Build spatial grid over the particles
Allows fast neighbor finding
For each particle: find neighbors
For each particle: compute density and pressure
For each particle: compute acceleration
For each particle: integrate
On the GPU each particle is worked on in parallel
34
SPH – Reducing O(n²) to O(n) with spatial hashing
Fill particles into a grid with spacing 2 × interaction distance
Search potential neighbors in adjacent cells only
Map cells [i,j,k] into a 1D array via a hash function h(i,j,k) [teschner03] (sketched below)
Implementation:
Infinite 3D grid is mapped to a finite grid
– 10 × 10 × 10 in the unit cube
– (x,y,z) maps to (x%11, y%11, z%11)
Finding neighbors is fast
Requires a fast GPU sort
– Today: radix sort
– Future: other sorts; work is still needed to study sorting algorithms on the GPU
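A hedged sketch of a [teschner03]-style hash function (the three primes are the ones from that paper; the table size is illustrative):

// Map a 3D cell index (i, j, k) to a 1D hash-table slot.
inline uint hash_cell(int i, int j, int k, uint table_size)
{
    const uint p1 = 73856093u, p2 = 19349663u, p3 = 83492791u;
    return (((uint)i * p1) ^ ((uint)j * p2) ^ ((uint)k * p3)) % table_size;
}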
35
SPH – Next Generation
Reverted to a pure streaming implementation
Data is always read coherently
Pipeline is automatically generated
Using a sequence description file
Memory bandwidth is predicted automatically
Still in early development; running, but:
The simulation is still unstable – this is a maths issue
The problem is memory bound, so 256,000 particles @ 60Hz is theoretically doable!
36
SPH – Rendering
No ‘good’ solution to date
Screen-space Curvature Flow
[Malladi et al., 1995; van der Laan et al., 2009]
Evolve surface along normal direction with speed determined by mean surface curvature
Flow moves surface only in Z
Needs a lot of particles to look good!
Almost all of the work is post-rendering processing (a geometry shader is used to render screen-space point sprites), so:
Why not use OpenCL™?
– OpenCL C geometry and fragment shaders, with fixed-function rasterization
37
SPH – Early prototype
38
OpenCL™ Soft Body Simulation
39
Particle-based soft body simulation
Mass/spring system:
Large collection of masses or particles
Connected using springs
The layout and properties of the springs change the properties of the soft body
General spring types:
Structural springs
Shearing springs
Bending springs
40
Simulating a cloth
Subset of full range of soft bodies.
Large number of particles lends itself to parallel processing.
Force from a given link must be applied to both particles
Requires batching to correctly update particles
Basic stages of simulation (from the original layout to the current layout):
Apply forces to masses
Compute forces from the stretch relative to rest length, and compute velocities
Compute new positions based on velocities
41
Apply simulation stages in batches
Links cannot be solved in parallel
Multiple links share nodes
Dependencies in the interactions
Batch the links based on their interconnections:
Graph colouring
Complicated link structure = many batches
Compute the velocity:
Over each batch in turn
Over multiple cycles to allow solution to converge
Update positions, faces, normals etc based on computed velocities in a single pass
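A hedged sketch of one link-solver pass (the data layout and stiffness model are illustrative, not the deck's implementation): each launch processes a single batch, so no two links in flight share a node and the position updates do not race.

// One launch per batch: 'links' holds the (nodeA, nodeB) pairs of this batch only.
__kernel void solve_links(__global float4 *pos,
                          __global const int2 *links,
                          __global const float *rest_length,
                          const float stiffness)
{
    int l = get_global_id(0);
    int a = links[l].x, b = links[l].y;

    float4 d    = pos[b] - pos[a];
    float  len  = length(d.xyz);
    float  corr = stiffness * (len - rest_length[l]) / (2.0f * len);

    pos[a] += d * corr;   // safe: within a batch no other link touches node a or b
    pos[b] -= d * corr;
}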
42
Adding wind to the simulation
Compute normals on a per-face basis
Number of faces can be optimised to improve performance
We use a face per node triangle in the obvious fashion
Node normals and areas computed from surrounding faces
Compute the wind force interaction based on the velocity and density of the wind medium
Relative velocity = wind velocity - node velocity
Scale the resulting force by the medium density
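A hedged per-node sketch of the wind interaction just described (node normals and areas are assumed precomputed into normal.xyz and normal.w; the drag model itself is illustrative):

__kernel void apply_wind(__global float4 *vel,
                         __global const float4 *normal,  // xyz = node normal, w = node area
                         const float4 wind_velocity,
                         const float  medium_density,
                         const float  dt)
{
    int i = get_global_id(0);
    float4 rel = wind_velocity - vel[i];                  // relative velocity
    float  f   = medium_density * normal[i].w
               * dot(rel.xyz, normal[i].xyz);             // scale by density and projected area
    vel[i] += (float4)(normal[i].xyz * f, 0.0f) * dt;
}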
43
The “Übercloth”
Multi-pass multi-phase algorithm
Large number of overall passes
While hardware can run multiple kernels simultaneously, each wavefront executes too briefly for the scheduler to gain from this
Multiple cloths require all passes to be computed for each
Separate kernel issues can end up being small
Merge cloths into single simulated cloth object
Inefficient on the CPU due to caching
Highly efficient on the GPU:
– Higher occupancy
– Massively parallel
44
Reducing dispatch calls
Even the giant multi-cloth dispatch involves a large number of kernel dispatches
Given a scene with 78k particles, 468k links
Over 350 kernel dispatches per frame
More than half of the execution time is spent performing kernel dispatches
Suggestion:
Extend the batching to sort links by interconnections
Pairs of links connected by a node can safely be executed together, but would normally be batched separately
Fuse kernels to compute both: a small saving in memory traffic, and the dispatches for the link solver are halved
45
Cloth
46
The ATI Radeon™ HD 5870 (“Evergreen”) architecture
47
ATI Radeon™ HD 5870 GPU - Superior Consumer Hardware
ATI Radeon™ HD 5870 GPU
Process/Transistors: 40 nm / 2.15B
Stream Processors: 1600
Peak SP Flop Rate: 2.72 Teraflops
Peak DP Flop Rate: 544 GFLOPS
Core Clock: 850 MHz
Max Resolution: 3x2560x1600*
Memory Type: GDDR5 4.8 Gbps
Frame Buffer: 1 GB
Max Bandwidth: 153 GB/s
Max/Idle Board Power: 188 W / 27 W
*Subject to monitor resolutions (Max ASIC Resolution 6x2560x1600 @60Hz) . **Price on newegg.com as low as $394 as of March 20, 2010. “Most Advanced” technologies:
internal calculations at time of launch showed that the processor used in the ATI Radeon HD 5800 series can achieve 2.72 TeraFLOPS, more than any other known
microprocessor as of September 23, 2009. FLOPS is not necessarily an indicator of leading performance in every application as AMD GPUs are designed and built to excel
specifically in applications involving massively parallel calculations.
48
ATI Eyefinity – Enhanced Productivity
ATI Eyefinity technology can support up to 6 displays using a single enabled ATI Radeon™ graphics card - the number of displays may vary by board design and you should
confirm exact specifications with the applicable manufacturer before purchase. ATI Eyefinity technology works with games that support non-standard aspect ratios, which is
required for panning across multiple displays. To enable more than two displays, additional panels with native DisplayPort™ connectors, and/or certified DisplayPort™ adapters
to convert your monitor’s native input to your cards DisplayPort™ or Mini-DisplayPort™ connector(s), are required.
49
ATI Radeon™ HD 5870 GPU Features
                      ATI Radeon™ HD 4870   ATI Radeon™ HD 5870   Difference
Area                  263 mm²               334 mm²               1.27x
Transistors           956 million           2.15 billion          2.25x
Memory Bandwidth      115 GB/sec            153 GB/sec            1.33x
L2–L1 Rd Bandwidth    512 bytes/clk         512 bytes/clk         1x
L1 Bandwidth          640 bytes/clk         1280 bytes/clk        2x
Vector GPR            2.62 MByte            5.24 MByte            2x
LDS Memory            160 kB                640 kB                4x
LDS Bandwidth         640 bytes/clk         2560 bytes/clk        4x
Concurrent Threads    15872                 31744                 2x
Shader (ALU units)    800                   1600                  2x
Board Power (Idle)    90 W                  27 W                  0.3x
Board Power (Max)     160 W                 188 W                 1.17x
20 SIMD engines (MPMD)
Each with 16 Stream Cores (SC)
– Each with 5 Processing Elements (PE) (1600 total)
– Each PE is IEEE 754-2008 precision capable: denorm support, fma, flags, round modes
– Fast integer support
Each with local data share memory
– 32 kB shared low-latency memory
– 32 banks with hardware conflict management
– 32 integer atomic units
80 read address probes
– 4 addresses per SIMD engine (32–128 bits of data)
– 4 filter or convert logic units per SIMD
Global memory access
– 32 SC read/write/integer-atomic accesses per clock
– Relaxed unordered memory consistency model
On-chip 64 kB global data share
153 GB/sec GDDR5 memory interface
50
GDDR5 Memory Controller Architecture
2nd Generation GDDR5 Implementation
Optimized for Speed and Area
GPU High Bandwidth Direct Connect Clients
Multi Engine Client Hub Connect
EDC (Error Detection Code)
CRC Checks on Data Transfers for Improved
Reliability at High Clock Speeds
GDDR5 Memory Clock temperature compensation
Enables Speeds Approaching 5 Gbps
Fast GDDR5 Link Retraining
Allows Voltage & Clock Switching on the Fly
without Glitches
51
Tera Scale 2 Architecture - ATI Radeon™ HD 5870 GPU
Double the processing power of previous
generation. Up to:
2.72 Teraflops
27.2 Giga Pixels/sec
Partially decoupled architecture
Work units produced by rasterizer
passed into UTDP
UTDP executes the control flow
program
Units of work called clauses pushed
to SIMD engines and memory
controllers
Relies on threading to keep SIMD units busy
52
ATI Radeon™ HD 5870 GPU - ISA: Control, memory and ALU programs
08 ALU_PUSH_BEFORE: ADDR(62) CNT(2)
     15  x: SETGT_INT   R0.x, R2.x, R4.x
     16  x: PREDNE_INT  ____, R0.x, 0.0f   UPDATE_EXEC_MASK UPDATE_PRED
09 JUMP POP_CNT(1) ADDR(18)
10 ALU: ADDR(64) CNT(9)
     17  x: SUB_INT  R5.x, R3.x, R2.x
         y: LSHL     ____, R3.x, (0x00000002, 2.802596929e-45f).x
         z: LSHL     ____, R2.x, (0x00000002, 2.802596929e-45f).x
         t: MOV      R8.x, 0.0f
     18  x: SUB_INT  R6.x, PV17.y, PV17.z
         y: MOV      R8.y, 0.0f
         z: MOV      R8.z, 0.0f
         w: MOV      R8.w, 0.0f
11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17)
12 ALU: ADDR(73) CNT(12)
     19  x: LSHL     ____, R5.x, (0x00000002, 2.802596929e-45f).x
         w: ADD_INT  ____, R6.x, R1.x   VEC_120
         t: ADD_INT  R7.x, R1.x, (0x00000008, 1.121038771e-44f).y
13 TEX: ADDR(368) CNT(2)
     22  VFETCH R0.x___, R0.y, fc156 MEGA(4)   FETCH_TYPE(NO_INDEX_OFFSET)
     23  VFETCH R1.x___, R0.z, fc156 MEGA(4)   FETCH_TYPE(NO_INDEX_OFFSET)
14 ALU_BREAK: ADDR(85) CNT(88) VEC_120
     24  x: MOV      ____, R0.x
         y: ASHR     T0.y, R0.x, (0x00000018, 3.363116314e-44f).x
         z: ASHR     ____, R0.x, (0x00000008, 1.121038771e-44f).y
         w: ASHR     ____, R0.x, (0x00000010, 2.242077543e-44f).z
         t: ASHR     T0.z, R1.x, (0x00000018, 3.363116314e-44f).x
15 ALU: ADDR(173) CNT(1)
     40  x: MOV      R1.x, R7.x
16 ENDLOOP i0 PASS_JUMP_ADDR(12)
17 POP (1) ADDR(18)
18 ALU: ADDR(174) CNT(22) KCACHE0(CB1:0-15)
     41  x: MOV      T0.x, KC0[2].x
         z: ADD_INT  ____, R2.x, R3.x
         w: MOV      T0.w, KC0[2].x
         t: LSHL     R4.x, R3.x, (0x00000002, 2.802596929e-45f).x
53
ATI Radeon™ HD 5870 GPU - Compute
Full hardware DirectCompute 11 Implementation
Full hardware OpenCL™ 1.0 Implementation
IEEE754-2008 compliant precision for SP and DP
Memory Load Operations
80 × 4–16 byte gather loads from L1 per clock
8 × 64 byte loads from L2 per clock
Grouped cache reads with invalidate
Memory Store Operations
Coalesce and combine writes
Scatter up to 32 × 64-bit values per clock
Global Memory R/W Cache with Atomics
Relaxed Global Memory Consistency Model
54
ATI Radeon™ HD 5870 GPU - Compute (2)
Functionality targeting OpenCL™ 1.0 and beyond
32-bit atomic operations
Flexible 32 kB Local Data Shares
Flexible 64 kB Global Data Share
Global semaphores / barrier synchronization
Append/consume buffers
Ordered Append buffers
55
Scaling the ATI Radeon™ HD 5000 Series GPU Architecture
                     ATI Radeon™     ATI Radeon™     ATI Radeon™     ATI Radeon™
                     HD 5870         HD 5770         HD 5670         HD 5450
                     "Cypress"       "Juniper"       "Redwood"       "Cedar"
Area                 334 mm²         166 mm²         104 mm²         59 mm²
Transistors          2.15 billion    1.04 billion    0.627 billion   0.292 billion
Memory Bandwidth     153 GB/sec      76 GB/sec       64 GB/sec       13 GB/sec
SIMD Engines         20              10              5               2
Vector width         64              64              64              32
Shader (ALU units)   1600            800             400             80
Memory width         256-bit         128-bit         128-bit         64-bit
TEX                  80              40              20              8
ROPs                 32              16              16              4
L2–L1 Rd Bandwidth   512 bytes/clk   512 bytes/clk   512 bytes/clk   512 bytes/clk
L1 Bandwidth         1280 bytes/clk  640 bytes/clk   320 bytes/clk   128 bytes/clk
Vector GPR           5.24 MByte      2.62 MByte      1.31 MByte      0.524 MByte
LDS Memory           640 kB          320 kB          160 kB          64 kB
LDS Bandwidth        2560 bytes/clk  1280 bytes/clk  640 bytes/clk   128 bytes/clk
Concurrent Threads   31744           15872           < 15872         << 15872
56
Compute Aspects of ATI Radeon™ HD 5870 GPU
• Stream Cores
• Local Data Share (LDS)
• SIMD Engine
• Load / Store / Atomic Data Access
• Dispatch / Indirect Dispatch
• Global Data Share (GDS)
57
Stream Core with Processing Elements (PE)
Each Stream Core Unit includes:
• 4 PEs
  – 4 independent SP or integer ops
  – 2 DP adds or dependent SP pairs
  – 1 DP fma or mult, or SP dp4
• 1 Special Function PE
  – 1 SP or integer operation
  – SP or DP transcendental ops
• Operand prep logic
• General purpose registers
• Data forwarding and predication logic
58
Processing Element (PE) Precision Improvements
FMA (Fused Multiply Add)
IEEE 754-2008 precise with all round modes, proper handling of NaN/Inf/Zero, and full
de-normal support in hardware for SP and DP
MULADD instruction
Without truncation, enabling an IEEE MUL followed by an IEEE ADD to be combined, with rounding
and normalization after both the multiplication and the subsequent addition.
IEEE Rounding Modes
Round to nearest even, Round toward +Infinity, Round toward –Infinity, Round
toward zero.
Supported under program control anywhere in the shader. Double and single
precision modes are controlled separately. Applies to all slots in a VLIW.
59
Processing Element (PE) Precision Improvements (2)
De-normal Programmable Mode control
For SP and DP independently.
Separate control for input flush to zero and underflow flush to zero.
FP Conversion Ops
Between 16 bit, 32 bit, and 64 bit floats with full IEEE 754 precision.
Hardware Exceptions Detection
For floating point numbers with software recording and reporting mechanism.
Inexact, Underflow, Overflow, division by zero, de-normal, invalid operation
64 bit Transcendental Approximation
Hardware based double precision approximation for reciprocal, reciprocal square
root and square root
60
Processing Element (PE) Improved IPC
– Co-issue of dependent ops in ONE VLIW instruction
  – with full IEEE intermediate rounding & normalization
  – Dot4: A = A*B + C*D + E*F + G*H
  – Dual Dot2: A = A*B + C*D; F = G*H + I*J
  – Dual dependent multiplies: A = A*B*C; F = G*H*I
  – Dual dependent adds: A = B+C+D; E = F+G+H
  – Dependent muladd: A = A*B+C + D*E; F = G*H+I + J*K
– 24-bit integer
  – MUL, MULADD (4 co-issue)
  – Heavy use for integer thread group address calculation
61
Processing Element (PE) New Integer Ops
• 32-bit operand count bits
• 64-bit operand count bits
• Insert bit field
• Extract bit field
• Find first bit (high, low, signed high)
• Reverse bits
• Extended integer math
  • Integer add with carry
  • Integer subtract with borrow
• 1-bit prefix sum on a 64-bit mask (useful for compaction)
• Shader-accessible 64-bit counter
• Uniform indexing of constants
62
Processing Element (PE) Special Ops
– Conversion ops
  – FP32 to FP64 and FP64 to FP32 (with IEEE conversion rules)
  – FP32 to FP16 and FP16 to FP32 (with IEEE conversion rules)
  – FP32 to Int/UInt and UInt/Int to FP32
– Very fast 8-bit Sum of Absolute Differences (SAD)
  – 4x1 SAD per lane, with 4x4 8-bit SAD in one VLIW
  – Used for video encoding, computer vision
  – Will be exposed via an OpenCL™ extension
– Video ops
  – 8-bit packed to float and float to 8-bit packed conversion ops
  – 4× 8-bit pixel average (bilinear interpolation with programmable rounding)
  – Arbitrary byte or bit extraction from 64 bits
63
Local Data Share (LDS)
Shares data between work-items of a work-group; designed to increase performance
High Bandwidth access per SIMD Engine (1024b/clk) – Peak is double external R/W
bandwidth (512b/clk)
Full coalesce R/W/Atomic with optimization for broadcast reads access
Low Latency Access per SIMD Engine
0 latency direct reads (Conflict free or Broadcast)
1 VLIW instruction latency for LDS indirect Op
All bank conflicts are hardware detected and serialized as necessary with fast support for
broadcast reads
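A hedged kernel sketch of the usage pattern these LDS figures are about, a per-work-group reduction staged through local memory (assumes a power-of-two work-group size; names are illustrative):

__kernel void group_sum(__global const float *in,
                        __global float *partial,
                        __local  float *scratch)
{
    size_t lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}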
64
Local Data Share (LDS) (2)
Hardware allocation of LDS space per thread group dispatch
Base and size stored with wavefront for private access
Hardware range check - Out of bound access attempts will return 0 and kill writes
32 byte/ubyte/short/ushort reads/writes per clock (reads are sign-extended)
Per-lane 32-bit Read, Read2, Write, Write2
Per-lane 32-bit Atomic: add, sub, rsub, inc, dec, min, max, and, or, xor, mskor, exchange,
exchange2, compare_swap(int), compare_swap(float with NaN/Inf)
Returns the pre-op value to the Stream Core's 4 primary Processing Elements
65
Local Data Share (LDS)
66
SIMD Engine
SIMD Engine can process Wavefronts from multiple kernels concurrently
Thread divergence within a Wavefront is enabled with Lane Masking and Branching
Enabling each Thread in a Wavefront to traverse a unique program execution path
Full hardware barrier support for up to 8 Work Groups per SIMD Engine (for thread data
sharing)
Each Stream Core receives up to the following per VLIW instruction issue
5 unique ALU Ops - or - 4 unique ALU Ops with a LDS Op (Up to 3 operands per thread)
67
SIMD Engine (2)
LDS and Global Memory access for byte, ubyte, short, and ushort reads/writes supported at 32-bit dword rates
Private loads and read-only texture reads via the Read Cache
Unordered shared consistent loads/stores/atomics via the R/W Cache
Wavefront length of 64 threads, where each thread executes a 5-way VLIW instruction per issue
¼ wavefront (16 threads) on each clock of 4 clocks (T0-15, T16-31, T32-47, T48-63)
68
Global Data Share (GDS)
Low Latency Access to a global data shared memory between all threads in a kernel
~25 clks latency
Issued in parallel to math similar to texture, return to GPR
Full coalesce R/W/Atomic and optimized for broadcast reads
8 threads post request per clock for full set of DS Operations
In parallel to global memory access
Driver-allocated and initialized space per dispatch (base & size stored)
64 kB can be used for multiple small kernels concurrently
Driver stores state to memory after the kernel completes
Hardware range check on each access; out-of-bounds reads return 0 and writes are killed
Full complement of DS operations
Can be used for low latency global reductions
69
Global Data Share (GDS)
70
DirectCompute features on ATI Radeon™ HD 5870 GPU
Thread Group Dispatch (CS 4.0: 2D; CS 5.0: 3D)
  Assisted domain control (grid) and easy address generation, with sync primitives per block for easy communication between threads
Indirect Group Dispatch (CS 4.0: N/A; CS 5.0: 3D)
  Useful for conditional execution; enables dispatch of GPU-enqueued work without CPU intervention
Thread Group Limits (CS 4.0: 768; CS 5.0: 1024)
  Larger groups enable natural algorithm mapping; more parallel threads in a group
Group Shared Memory (CS 4.0: 16 kB; CS 5.0: 32 kB)
  Cooperative memory for sharing and reuse; increase in effective bandwidth
Shared Memory Access (CS 4.0: owner-write/shared-read model; CS 5.0: full 32 kB read/write access)
  Enables full use of shared memory for data amplification and new algorithm implementations
Atomic Operations (CS 4.0: N/A; CS 5.0: local and global memory)
  Enables data binning for stats and a powerful set of data structure creation and controls (misuse possible)
Append/Consume Buffers (CS 4.0: N/A; CS 5.0: 8 supported)
  Indirect producer/consumer queue systems, lists and stacks
Unordered Access Views (CS 4.0: 1, CS only; CS 5.0: 8 supported, CS/PS)
  Scatter writes and buffer management for compute
Group Sync Instructions
71
AMD OpenCL™ Tool Support
72
Compiler, runtime and debugger
LLVM-based compiler chain
Runtimes for both x86 CPUs and ATI Radeon™ HD 4000 Series GPUs and ATI
Radeon™ HD 5000 Series GPUs
Debugging support allows GDB step-through debugging on the CPU
GPU-side debugging being worked on
73
ATI Stream Profiler
Offers access to information from the hardware performance counters
Integrated into Visual Studio
Documentation will be improved in the future to allow third-party profiler development
74
Stream Kernel Analyzer
View IL and ISA code for a range of architectures
Gives statistics on min/max number of cycles; number of ALU, fetch and write instructions; throughput, etc.
75
Industry Standard Programming Interfaces
Industry Standard APIs
• Stream processing has evolved to a point where proprietary solutions are not helping to drive broader technology acceptance
• As hardware and software evolve, the key to making them accessible in a unified way is to use standards
• Standards force competition, and lower cost/power solutions with regular improvements
• OpenCL™ and DirectCompute are the open APIs to access GPU compute
Industry standards help ease cross-platform development and further mainstream use
76
Summary
OpenCL™: an open standard for heterogeneous compute
Working on applications for OpenCL™, in particular Open Physics
ATI Radeon™ HD 5870 GPU: a modern compute-focused GPU architecture
Tool support: debugging and performance analysis
77
References
Links to AMD’s OpenCL™ tutorial and some samples
http://developer.amd.com/gpu/ATIStreamSDK/pages/Publications.aspx
AMD Developers Central : Samples, Tools, Downloads, White papers etc.
http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx
Khronos OpenCL: Specification, Introduction Slides, and Quick Reference
www.khronos.org/opencl/
DirectCompute Introduction
http://msdn.microsoft.com/en-us/directx/bb896684.aspx
78
Acknowledgements
Cloth example produced in collaboration with:
Justin Hensley
Abe Wiley
Jason Yang
SPH by:
Saif Ali
Alan Heirich
Ben Gaster
79
Disclaimer and Attribution
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES,
ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO
ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN
IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, AMD Opteron, Radeon, and combinations thereof are trademarks
of Advanced Micro Devices, Inc. Microsoft, Windows, Windows Vista, and DirectX™ are registered trademarks of Microsoft Corporation in the United States and/or other
jurisdictions. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be
trademarks of their respective owners.
80