AFDS 2011 Keynote: “Heterogeneous Parallelism at Microsoft”

Herb Sutter

Welcome to the jungle
The free lunch is so over

1975-2005: Put a computer on every desk, in every home, in every pocket.
2005-2011: Put a parallel supercomputer on every desk, in every home, in every pocket.
2011-201x: Put a heterogeneous supercomputer on every desk, in every home, in every pocket.
[Diagram: AMD processor evolution in the Xbox 360 & mainstream computers: 80x86, Athlon, and Phenom II CPUs plus discrete GPUs converging into the Fusion APU, with CPU and GPU sharing memory]
[Diagram: Cloud + GPU, Microsoft Azure cloud computing: cloud (IaaS/HaaS) nodes adding (GP)GPU processors alongside CPUs and memory]
[Diagram, built up across slides: one machine = multicore CPU + (GP)GPU + cloud (IaaS/HaaS), each tier with its own memory, and the programming model for each tier:
    Multicore CPU    → ISO C++ / C++0x + C++ PPL
    (GP)GPU          → DirectCompute → C++ AMP (Accelerated Massive Parallelism)
    Cloud IaaS/HaaS  → ?]
Convert this (serial loop nest)

    void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                     int M, int N, int W )
    {
        for (int y = 0; y < M; y++)
            for (int x = 0; x < N; x++) {
                float sum = 0;
                for (int i = 0; i < W; i++)
                    sum += A[y*W + i] * B[i*N + x];
                C[y*N + x] = sum;
            }
    }
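The serial version is ordinary, compilable C++; a minimal restatement with a comment on the row-major layout it assumes:

```cpp
#include <vector>
using std::vector;

// Serial matrix multiply as on the slide: C (M x N) = A (M x W) * B (W x N).
// All three matrices live in flat row-major storage, so element (y, x) of an
// N-wide matrix sits at offset y*N + x.
void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                 int M, int N, int W )
{
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++) {
            float sum = 0;
            for (int i = 0; i < W; i++)
                sum += A[y*W + i] * B[i*N + x];  // row y of A times column x of B
            C[y*N + x] = sum;
        }
}
```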
… to this (parallel loop, CPU or GPU)

    void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                     int M, int N, int W )
    {
        array_view<const float,2> a(M,W,A), b(W,N,B);
        array_view<writeonly<float>,2> c(M,N,C);
        parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
            float sum = 0;
            for (int i = 0; i < a.x; i++)
                sum += a(idx.y, i) * b(i, idx.x);
            c[idx] = sum;
        } );
    }
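C++ AMP is not needed to see the shape of the transformation. A plain standard-C++ sketch of the same idea (the names `index2` and `grid_for_each` are illustrative stand-ins, not AMP’s API): each (y, x) point of the output grid gets its own invocation of the loop body, here scheduled one thread per row rather than onto a GPU.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Illustrative stand-in for AMP's index<2>: one point of a 2D iteration space.
struct index2 { int y, x; };

// Invoke body once per (y, x) point of an M x N grid, one thread per row.
// AMP's parallel_for_each plays the same role, but dispatches to the GPU.
void grid_for_each(int M, int N, const std::function<void(index2)>& body)
{
    std::vector<std::thread> rows;
    for (int y = 0; y < M; y++)
        rows.emplace_back([&, y] {
            for (int x = 0; x < N; x++)
                body(index2{y, x});
        });
    for (auto& t : rows) t.join();
}
```

Each grid point writes a distinct output element, so the row threads need no synchronization; that independence is exactly what lets the same loop body run serially, across cores, or on a GPU.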
EVOLUTION OF HETEROGENEOUS COMPUTING

(Chart axis: Architecture Maturity & Programmer Accessibility, from Poor to Excellent)

Proprietary Drivers Era (2002-2008): Graphics & proprietary driver-based APIs; CUDA™, Brook+, etc.
 “Adventurous” programmers
 Exploit early programmable “shader cores” in the GPU
 Make your program look like “graphics” to the GPU

Standards Drivers Era (2009-2011): OpenCL™, DirectCompute driver-based APIs
 Expert programmers
 C and C++ subsets
 Compute-centric APIs, data types
 Multiple address spaces with explicit data movement
 Specialized work-queue-based structures
 Kernel mode dispatch

Architected Era (2012-2020): Fusion™ System Architecture, GPU as peer processor
 Mainstream programmers
 Full C++
 GPU as a co-processor
 Unified coherent address space
 Task parallel runtimes
 Nested data parallel programs
 User mode dispatch
 Pre-emption and context switching

17 | The Programmer’s Guide to the APU Galaxy | June 2011
Single-core to multi-core

Multicore CPU: ISO C++0x + PPL (Parallel Patterns Library, VS2010)

What should the parallel-loop syntax look like?

    forall( x, y )
    forall( z; w; v )
    forall( k, l, m, n )
    ...?

Answer: the λ (lambda): 1 language feature for multicore, and for STL, functors, callbacks, events, ...

    parallel_for_each(
        items.begin(), items.end(),
        [=]( Item e )
        {
            … your code here …
        } );
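The same single language feature already works with today’s serial standard algorithms; a small sketch (the function name `sum_of_squares` is illustrative) using `std::for_each`, no PPL required:

```cpp
#include <algorithm>
#include <vector>

// The lambda is the one language feature that lets "your code here" travel
// into any algorithm: serial std::for_each today, parallel_for_each tomorrow.
int sum_of_squares(const std::vector<int>& items)
{
    int total = 0;
    std::for_each(items.begin(), items.end(),
        [&](int e)          // [&] captures 'total' by reference
        {
            total += e * e;
        });
    return total;
}
```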
Multi-core to hetero-core

(GP)GPU: ISO C++0x + C++ AMP (Accelerated Massive Parallelism)

Answer: restrict: 1 language feature for heterogeneous cores

    parallel_for_each(
        items.grid,
        [=](index<2> i) restrict(direct3d)
        {
            … your code here …
        } );
 Problem: Some cores don’t support the entire C++ language.
 Solution: General restriction qualifiers enable expressing language subsets
   within the language. Direct3D math functions in the box.

Example:

    double sin( double );                        // 1a: general code
    double sin( double ) restrict(direct3d);     // 1b: specific code
    double cos( double ) restrict(cpu,direct3d); // 2: same code for either

    parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
        …
        sin( data.angle ); // ok, chooses overload based on context
        cos( data.angle ); // ok
        …
    });

 Initially supported restriction qualifiers:
    restrict(cpu): the implicit default.
    restrict(direct3d): can execute on any DX11 device via DirectCompute.
      Restrictions follow limitations of the DX11 device model
      (e.g., no function pointers, virtual calls, goto).

 Potential future directions:
    restrict(pure): declare and enforce that a function has no side effects.
      Great to be able to state declaratively for parallelism.
    General facility for language subsets, not just about compute targets.
 Problem: Memory may be flat, nonuniform, incoherent, and/or disjoint.
 Solution: Portable view that works like an N-dimensional “iterator range.”
 Future-proof: No explicit .copy()/.sync(). As needed by each actual device.

Example:

    void MatrixMult( float* C, const vector<float>& A,
                     const vector<float>& B, int M, int N, int W )
    {
        array_view<const float,2> a(M,W,A), b(W,N,B); // 2D views over C++ std::vectors
        array_view<writeonly<float>,2> c(M,N,C);      // 2D view over C array
        parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
            …
        } );
    }
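The core idea behind array_view, an N-dimensional “iterator range” over flat storage, can be sketched in portable C++. The toy `view2d` below is an assumption for illustration, not AMP’s array_view: the real type additionally tracks which device currently holds the data and copies or synchronizes on demand.

```cpp
#include <vector>

// Toy 2D view over flat row-major storage (illustrative only; AMP's
// array_view also manages per-device data placement and copying).
template <typename T>
struct view2d {
    T* data;
    int rows, cols;
    view2d(int r, int c, T* d) : data(d), rows(r), cols(c) {}
    view2d(int r, int c, std::vector<T>& v) : data(v.data()), rows(r), cols(c) {}
    // 2D indexing maps (y, x) onto the flat row-major buffer.
    T& operator()(int y, int x) { return data[y*cols + x]; }
};
```

Because the view does not own the storage, the same raw array or std::vector can be presented with whatever shape the kernel wants, which is what lets the MatrixMult example above treat three flat buffers as matrices.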
Bring CPU debugging experience to the GPU
[Chart: # cores, not counting SIMD; a few big OoO (out-of-order) CPU cores, more small InO (in-order) CPU cores, many GPU cores, and elastic cloud capacity]

Welcome to the jungle
The free lunch is so over
C++ PPL: 9:45am
C++ AMP: 2:00pm, Room 406
Herb Sutter