VIRTUALIZATION IN THE AGE OF HETEROGENEOUS MACHINES
David F. Bacon
Keynote, The ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '11), March 9, 2011

WHAT HETEROGENEOUS MACHINES?
It's The Multicore Era!

THE PATH OF LEAST IMAGINATION
"For years, computer architects have been saying that a big new idea in computing was needed. Indeed, as transistors have continued to shrink, rather than continuing to innovate, computer designers have simply adopted a so-called 'multicore' approach, where multiple processors are added as more chip real estate became available."
Source: Remapping Computer Circuitry to Avert Impending Bottlenecks, February 28, 2011

MEANWHILE...
GPU, FPGA, Tilera 64, Cell BE, IBM WSP

WHAT IS PERFORMANCE?
Peak Performance
Source: Brodtkorb et al., The State-of-the-Art in Heterogeneous Computing, 2010

PERFORMANCE, TAKE 2
Actual Performance (Random Number Generation)
Source: Thomas et al., A comparison of CPUs, GPUs, FPGAs and massively parallel processor arrays for random number generation, 2009

WHAT'S THE BEST USE OF TRANSISTORS?
Homogeneous vs. Heterogeneous
Keep Transistors Busy vs. Turn Transistors Off

Should We Virtualize Heterogeneous Architectures? How? At the Language or System Level? Or Both? What is "Virtualization" Anyway?

THE ONLY 2 IDEAS IN COMPUTER SCIENCE
• Hashing: f(x)
• Indirection

SYSTEM VS. LANGUAGE VMS
A system VM virtualizes the environment (supervisor-mode ISA, device drivers) while holding the user-mode ISA constant: many x86 guests on one x86 host. A language VM virtualizes the processor itself, so the same program runs across varying user-mode ISAs (x86, ARM, POWER).
• System VMs: Virtualize Environment, Hold ISA Constant
• Language VMs: Virtualize Processor, Vary ISA

OVERLAP: SYNERGY AND CONFLICT
Both kinds of VM end up managing memory, threads, and pieces of the user- and supervisor-mode ISA.

DEVICE OR PROCESSOR?
GPU? FPGA?

THE "ACCELERATOR" MODEL
The main program runs on the CPU and offloads kernels to devices: a force calculation to the GPU, an FFT to the FPGA.

THE REALITY

Language VMs for Heterogeneous Systems

LIQUID METAL
Joshua Auerbach, Perry Cheng, Rodric Rabbah, David F. Bacon, Steven Fink, Sunil Shukla, Christophe Dubach, Yu Zhang

HETEROGENEOUS PROGRAMMING
Today each device has its own toolchain: CPU compilers take Java/C++/Python to a binary; GPU compilers take CUDA/OpenCL to a binary; node compilers take C + intrinsics to a binary; synthesis tools take VHDL/Verilog/SystemC to a bitfile (targets: CPU, GPU, WSP, FPGA).

THE LIQUID METAL PROGRAMMING LANGUAGE
Lime: one source language and one compiler with four backends (CPU, GPU, Node, Verilog) producing bytecode, binaries, and bitfiles for CPU, GPU, WSP, and FPGA.

THE ARTIFACT STORE & EXCLUSION
The Lime compiler places every generated artifact (bytecode, binaries, bitfiles) into an artifact store; at run time, the system picks, for each part of the program, an artifact that matches a device actually present.
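The compile-once, choose-at-run-time flow above can be sketched in plain Java. All names here (`ArtifactStore`, `Device`, `select`, and the FPGA-first preference order) are invented for illustration; this is not the Liquid Metal API, just the shape of the idea:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the compiler deposits one artifact per backend
// into a store; the runtime selects an artifact matching a device that
// is actually present, falling back toward plain CPU bytecode.
public class ArtifactStore {
    public enum Device { CPU, GPU, WSP, FPGA }

    private final Map<Device, String> artifacts = new EnumMap<>(Device.class);

    public void add(Device d, String artifact) {
        artifacts.put(d, artifact);
    }

    // Assumed policy (not Lime's): prefer the most specialized device available.
    public String select(Set<Device> available) {
        for (Device d : new Device[]{Device.FPGA, Device.WSP, Device.GPU, Device.CPU}) {
            if (available.contains(d) && artifacts.containsKey(d)) {
                return artifacts.get(d);
            }
        }
        throw new IllegalStateException("no runnable artifact for available devices");
    }
}
```

If only a CPU is present, the store hands back the bytecode; plug in an FPGA board with a matching bitfile in the store, and the same call returns the bitfile instead.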
HETEROGENEOUS EXECUTION OF LIME
The Lime Virtual Machine (LmVM) loads artifacts from the artifact store and runs them in place: bytecode on the CPU, binaries on the GPU and WSP, bitfiles on the FPGA, with a "Bus" connecting the devices.

THE LIME LANGUAGE: VIRTUALIZING HETEROGENEOUS COMPUTATION

LIME: JAVA IS (ALMOST) A SUBSET
  % javac MyClass.java
  % java MyClass
  % mv MyClass.java MyClass.lime
  % limec MyClass.lime
  % java MyClass

INCREMENTALLY USE LIME FEATURES

LIME LANGUAGE OVERVIEW
• Core Features: Immutable Types, Bounded Types, Bounded Arrays, Primitive Supertypes, Programmable Primitives, Stream Programming, Collective Operations
• Kinds of parallelism exposed: Bit-Level, Data, Pipeline
• Supporting Features: Reifiable Generics; Ranges and Bounded "for"; User-defined operators; Typedefs; Local type inference; Tuples; Graph Construction; Isolation Enforcement; Closed World Support; Rate Matching; Messaging

VIRTUALIZATION COMPARISON
Physical resources and their virtual counterparts on each platform:
• CPU physical: Core, RAM, byte/word/float, Compare&Swap — CPU virtual: Thread, Heap, byte/float/int/double, synchronized, tasks, Map, Reduce
• FPGA physical: LUTs, CLBs, Slices, Block RAM, Routing Network, DMA/SerDes, DSP Multipliers — FPGA virtual: bit, dfloat, User Primitives, Bounded Arrays, Streams, integer<N>

STREAMING COMPUTATION: PIPELINE PARALLELISM

TASK CREATION (STATELESS)
  var uppercaser = task work;
with the worker method:
  local char work(char c) { return toUpper(c); }
The task expression wraps a worker method into a primitive filter: the parameter and return types become the input and output port types (here, streams of char), and the isolation keyword `local` guarantees the worker touches no shared state.

PIPELINES
  var pipeline = task worker1 => task worker2 => task worker3;
The `=>` operator makes port-to-stream connections (graph construction), composing primitive filters such as worker1 (char to char), worker2 (int[[5]] to int), and worker3 into a compound filter.

SOURCES AND SINKS
  source char reader(…) { … }
  sink char writer(…) { … }
Sources and sinks connect streams to the heap, the file system (e.g., /tmp/mydata), and devices (e.g., /dev/tty).

HELLO WORLD, STREAMING STYLE
  public static void main(string[[]] args) {
    char[[]] msg = { 'H','E','L','L','O',',',' ',
                     'W','O','R','L','D','!','\n' };
    var hello = msg.source(1)
      => task Character.toLowerCase(char)
      => task System.out.print(char);
    hello.finish();
  }

DEMO: HELLO WORLD; THE LIME/ECLIPSE ENVIRONMENT

STATEFUL TASKS
  var averager = task Averager().avg;
The instance variables are local state held inside the primitive filter:
  double total;
  long count;
  double avg(double x) { total += x; return total/++count; }

RATE MATCHING
  var matchedpipe = task AddStuff().work1 => # => task work2;
The `#` operator inserts a rate matcher (a buffer) between stages whose rates differ; here work1 produces two ints per input and work2 consumes one at a time (rate 1:2):
  int x;
  int[[2]] work1(int i) { int r = i+x; x = i/2; return { r, i }; }
  int work2(int i) { return i*3; }
The result is a compound filter with rate 1:2.
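As a rough plain-Java analogue of the stateless and stateful tasks above (the names `PipelineSketch` and `runPipeline` are invented; this sketch only mimics the dataflow sequentially on one CPU, whereas Lime compiles the `task`/`=>` graph to any of its backends):

```java
// Plain-Java sketch of Lime-style tasks: a stateless worker applied
// element by element, and a stateful filter carrying local state.
public class PipelineSketch {

    // Stateless worker, like Lime's: var uppercaser = task work;
    static char work(char c) {
        return Character.toUpperCase(c);
    }

    // Stateful task, like the deck's Averager: total/count are the
    // filter's local state, isolated from everything else.
    public static class Averager {
        double total;
        long count;

        public double avg(double x) {
            total += x;
            return total / ++count;
        }
    }

    // Driving a stream of chars through the stateless stage, the way
    // `=>` would feed one port's output into the next stage's input.
    public static String runPipeline(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            out.append(work(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(runPipeline("Hello, World!"));
        Averager a = new Averager();
        System.out.println(a.avg(1.0));
        System.out.println(a.avg(3.0));
    }
}
```

The point of the Lime version is that nothing in this structure pins it to a CPU: because the worker is `local` and state is confined to the filter, the same graph can be laid down as pipeline stages in hardware.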
DEMO: N-BODY SIMULATION, CPU VS. GPU
9x speedup (9.26 GFLOPS) on a laptop.

VIRTUALIZATION OF DATA MOVEMENT
(figure: making a single `=>` connection work across real devices took 3 months)

WHAT'S SO HARD?
• Projects routinely spend 60-80% of their time on data transfer
• Verilog is hard, but driving devices and buses is harder
• Motherboard chips and BIOS settings: 4x slowdown each
• Signal transitions are considered an "interface" spec
• "IP" is extremely sensitive to configurations, and often broken when its configuration mismatches the original
• Device drivers are also highly sensitive: often specialized to one I/O style, and they fall off a performance cliff otherwise
• Hardware design culture accepts this as the cost of doing business (C.O.D.B.)

COLLECTIVE OPERATIONS: DATA PARALLELISM

ARRAY PARALLELISM
  float[ ] a = ...;
  float[ ] b = ...;
  float[ ] c = a @+ b;
Arrays are Indexable<int,float>: each has an index domain (here 0 :: 2). The collective operator `@+` first checks that the two domains match (0 :: 2 == 0 :: 2), then a collector applies `+` elementwise into a temporary (via Collectable<int,float>), which becomes c:
  index   a      b      temp/c
  0       1.2    0.0    1.2
  1       0.0    99.0   99.0
  2       3.14   -3.0   0.14
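The elementwise `@+` above can be sketched in plain Java: a domain check followed by a collector loop. The name `atPlus` is invented, and the loop is sequential where Lime's collective is free to run data-parallel:

```java
import java.util.Arrays;

// Sketch of Lime's collective `a @+ b`: `@` lifts the scalar `+`
// over all elements of two arrays with matching index domains.
public class ElementwiseAdd {

    static float[] atPlus(float[] a, float[] b) {
        // Domains must match: 0 :: a.length-1  ==  0 :: b.length-1
        if (a.length != b.length) {
            throw new IllegalArgumentException("domain mismatch");
        }
        float[] c = new float[a.length];        // the collector's temp
        for (int i = 0; i < a.length; i++) {    // sequential stand-in for the
            c[i] = a[i] + b[i];                 // data-parallel collective
        }
        return c;
    }

    public static void main(String[] args) {
        float[] a = {1.2f, 0.0f, 3.14f};
        float[] b = {0.0f, 99.0f, -3.0f};
        System.out.println(Arrays.toString(atPlus(a, b)));
    }
}
```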
MANY THINGS ARE COLLECTABLE
  var a = new Matrix<3,quaternion>(100,200,100);
  var b = new Matrix<3,quaternion>(100,200,100);
  var c = a @+ b;                              // domain ( 0::99, 0::199, 0::99 )

  Map<string,List<int>> a = ...;               // domain { "foo", "bar" }
  Map<string,List<int>> b = ...;
  Map<string,List<int>> c = a @ append(b);

  string foo = Character.toLowerCase(@ "FOO"); // domain 0 :: 2

COLLECTIVE OPERATIONS IN ACTION
  package lime.util.synthesizable;
  public class FixedHashMap<K extends Value, V extends Value,
                            BUCKETS extends ordinal<BUCKETS>,
                            LINKS extends ordinal<LINKS>>
      extends AbstractMap<K,V> {
    protected final nodes = new Node<K,V>[BUCKETS][LINKS];
With BUCKETS = 3 and LINKS = 2, nodes is a 3x2 array of key/value Nodes.

GET OPERATION, STEP 1: SELECT ROW
  public local V get(K key) {
    Node[LINKS] row = nodes[hash(key)];
    boolean[LINKS] selections = row @ compareKey(key);
    V[LINKS] vals = row @ getValueOrDefault(selections);
    return | ! vals;
  }
Hashing the key selects one row of LINKS nodes.

STEP 2: COMPARE ALL KEYS
`row @ compareKey(key)` compares the key against every node in the row at once, yielding one boolean per link (e.g., { true, false }).
STEP 3: GET VALUES/ZEROS
`row @ getValueOrDefault(selections)` reads each node's value where its selection bit is true and the default (zero) value where it is false, yielding V[LINKS] vals (e.g., { V, 0 }).
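The four steps of get() can be sketched in plain Java with int keys and values, 0 standing in for the default/zero value (so nonzero keys are assumed). `HashGetSketch` and its fields are invented names; in Lime the per-link work is data-parallel on the FPGA, while here it is a loop:

```java
// Sketch of the FixedHashMap get() walk-through:
// step 1 select row, step 2 compare all keys,
// step 3 value-or-zero, step 4 OR-reduce.
public class HashGetSketch {
    static final int BUCKETS = 3, LINKS = 2;
    int[][] keys = new int[BUCKETS][LINKS];   // 0 marks an empty slot
    int[][] vals = new int[BUCKETS][LINKS];

    int hash(int key) {
        return Math.floorMod(key, BUCKETS);
    }

    public void put(int key, int val) {
        int b = hash(key);
        for (int i = 0; i < LINKS; i++) {
            if (keys[b][i] == 0) { keys[b][i] = key; vals[b][i] = val; return; }
        }
        throw new IllegalStateException("row full");          // fixed-size map
    }

    public int get(int key) {
        int b = hash(key);
        int[] row = keys[b];                                  // step 1: select row
        boolean[] selections = new boolean[LINKS];
        for (int i = 0; i < LINKS; i++) {
            selections[i] = (row[i] == key);                  // step 2: compare all keys
        }
        int result = 0;
        for (int i = 0; i < LINKS; i++) {
            int v = selections[i] ? vals[b][i] : 0;           // step 3: value or zero
            result |= v;                                      // step 4: OR-reduce
        }
        return result;                                        // sound because at most
    }                                                         // one selection matches
}
```

The OR-reduce trick works because keys are unique within a row, so at most one masked value is nonzero and the reduction recovers exactly it (or 0 on a miss).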
STEP 4: OR-REDUCE FOR RESULT
`| ! vals` OR-reduces the masked values; since at most one selection matched, the result is either the matching value V or the default 0.

VIRTUALIZATION OF DATA PARALLELISM
The `@` (collective/map) and `!` (reduction) operators express data parallelism without committing to any backend.

CURRENT RESULTS

HOW DO WE EVALUATE PERFORMANCE?
• Speedup for naïve users: how much faster than Java?
• Slowdown for expert users: how much slower than hand-tuned low-level code?
• Our methodology: write, tune, and compare 4 versions of each benchmark: Java, Lime, OpenCL, Verilog
• Doesn't address flops/watt, flops/watt/$, or productivity

EXPERT VS. NAÏVE SPEEDUP: END-TO-END (JAVA BASELINE)
(chart: Expert 6.67x, 136x, 1.06x, 1.0x vs. Lime 1.06x, 6.10x, 66x, 0.04x across GPU and FPGA benchmarks)

EXPERT VS. NAÏVE SPEEDUP: KERNEL TIME (JAVA BASELINE)
(chart: Expert 37x, 193x, 10x, 150x vs. Lime 10x, 26x, 126x, 0.02x across GPU and FPGA benchmarks)

Should We Virtualize Heterogeneous Architectures? Yes We Can!

"USER" VS. "SUPERVISOR"
ACCELERATOR VS. FIRST-CLASS DEVICE

ANOTHER UNSOLVED PROBLEM

SUMMARY
• Heterogeneity is Here to Stay
• Multicore Isn't Dead - Just Less Important
• HLLs Can Insulate the Programmer from Low-Level Details, but not from fundamental computational structure
• System-Level Virtualization is an Open (Hard) Problem

REAL-TIME PHOTOMOSAIC

Questions?