Herb Sutter
Welcome to the Jungle
The free lunch is so over

1975-2005: Put a computer on every desk, in every home, in every pocket.
2005-2011: Put a parallel supercomputer on every desk, in every home, in every pocket.
2011-201x: Put a heterogeneous supercomputer on every desk, in every home, in every pocket.

[Diagrams: the mainstream machine over time. Memory plus processors: AMD 80x86, Athlon, Phenom II; then GPUs and the Fusion APU (Xbox 360 and mainstream computers); then the cloud (Microsoft Azure cloud computing, IaaS/HaaS). The software stack tracks the hardware: ISO C++0x and the PPL for the multicore CPU, and C++ AMP (Accelerated Massive Parallelism) over DirectCompute for the (GP)GPU.]

Convert this (serial loop nest):

void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                 int M, int N, int W )
{
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++) {
            float sum = 0;
            for (int i = 0; i < W; i++)
                sum += A[y*W + i] * B[i*N + x];
            C[y*N + x] = sum;
        }
}

…to this (parallel loop, CPU or GPU):

void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                 int M, int N, int W )
{
    array_view<const float,2> a(M,W,A), b(W,N,B);
    array_view<writeonly<float>,2> c(M,N,C);
    parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
        float sum = 0;
        for (int i = 0; i < a.x; i++)
            sum += a(idx.y, i) * b(i, idx.x);
        c[idx] = sum;
    } );
}

EVOLUTION OF HETEROGENEOUS COMPUTING
(Fusion™ System Architecture: GPU architecture maturity and programmer accessibility climb from poor to excellent)
The GPU as a peer processor:

Proprietary Drivers Era (2002-2008): Graphics and proprietary driver-based APIs (CUDA™, Brook+, etc.). "Adventurous" programmers. Exploit early programmable "shader cores" in the GPU; make your program look like "graphics" to the GPU.

Standards Drivers Era (2009-2011): OpenCL™ and DirectCompute driver-based APIs. Expert programmers. C and C++ subsets; compute-centric APIs and data types. Multiple address spaces with explicit data movement; specialized work-queue-based structures; kernel-mode dispatch.

Architected Era (2012-2020): Mainstream programmers. Full C++, with the GPU as a co-processor. Unified coherent address space; task-parallel runtimes; nested data-parallel programs; user-mode dispatch; pre-emption and context switching.

17 | The Programmer's Guide to the APU Galaxy | June 2011

Single-core to multi-core: ISO C++0x + PPL (Parallel Patterns Library, VS2010).

Not forall( x, y ), forall( z; w; v ), forall( k, l, m, n ), ...? Instead, the λ:

parallel_for_each( items.begin(), items.end(),
    [=]( Item e ) {
        … your code here …
    } );

1 language feature for multicore, and for STL, functors, callbacks, events, ...

Multi-core to hetero-core: ISO C++0x + C++ AMP (Accelerated Massive Parallelism). The new piece is restrict:

parallel_for_each( items.grid,
    [=](index<2> i) restrict(direct3d) {
        … your code here …
    } );

1 language feature for heterogeneous cores.

Problem: Some cores don't support the entire C++ language.
Solution: General restriction qualifiers enable expressing language subsets within the language. Direct3D math functions in the box.
Example:

double sin( double );                      // 1a: general code
double sin( double ) restrict(direct3d);   // 1b: specific code
double cos( double ) restrict(direct3d);   // 2: same code for either

parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
    …
    sin( data.angle );   // ok, chooses overload based on context
    cos( data.angle );   // ok
    …
});

Initially supported restriction qualifiers:
restrict(cpu): The implicit default.
restrict(direct3d): Can execute on any DX11 device via DirectCompute. Restrictions follow the limitations of the DX11 device model (e.g., no function pointers, virtual calls, or goto).

Potential future directions:
restrict(pure): Declare and enforce that a function has no side effects. Great to be able to state declaratively for parallelism.
A general facility for language subsets, not just about compute targets.

Problem: Memory may be flat, nonuniform, incoherent, and/or disjoint.
Solution: A portable view that works like an N-dimensional "iterator range."
Future-proof: No explicit .copy()/.sync(); data moves as needed by each actual device.

Example:

void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                 int M, int N, int W )
{
    array_view<const float,2> a(M,W,A), b(W,N,B);   // 2D views over C++ std::vectors
    array_view<writeonly<float>,2> c(M,N,C);        // 2D view over C array
    parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
        …
    } );
}

Bring the CPU debugging experience to the GPU.

[Chart: "Welcome to the jungle. The free lunch is so over." Number of cores (not counting SIMD) across core types: out-of-order (OoO) CPU, in-order (InO) CPU, GPU, and cloud.]

C++ PPL: 9:45am
C++ AMP: 2:00pm, Room 406

Herb Sutter