
VIRTUALIZATION
IN THE AGE OF
HETEROGENEOUS
MACHINES
David F. Bacon
The ACM SIGPLAN/SIGOPS International Conference on
Virtual Execution Environments (VEE ’11)
Keynote - March 9, 2011
WHAT HETEROGENEOUS MACHINES?
It’s
The Multicore Era!
THE PATH OF LEAST IMAGINATION
For years, computer architects have been saying
that a big new idea in computing was needed.
Indeed, as transistors have continued to shrink,
rather than continuing to innovate, computer
designers have simply adopted a so-called
“multicore” approach, where multiple processors
are added as more chip real estate becomes available.
Source: Remapping Computer Circuitry to Avert Impending Bottlenecks, February 28, 2011
MEANWHILE...
GPU
FPGA
Tilera 64
Cell BE
IBM WSP
WHAT IS PERFORMANCE?
Peak Performance
Source: Brodtkorb et al, The State-of-the-Art in Heterogeneous Computing, 2010
PERFORMANCE, TAKE 2
Actual Performance
(Random Number Generation)
Source: Thomas et al, A comparison of CPUs, GPUs, FPGAs and massively parallel processor arrays for random number generation, 2009
WHAT’S THE BEST USE OF TRANSISTORS?
Homogeneous
vs.
Heterogeneous
Keep Transistors
Busy
vs.
Turn Transistors
Off
Should We Virtualize
Heterogeneous
Architectures?
How?
At the Language
or System Level?
Or Both?
What is
“Virtualization”
Anyway?
THE ONLY 2 IDEAS IN COMPUTER SCIENCE
Hashing
Indirection
f(x)
SYSTEM VS. LANGUAGE VMS
Supervisor-Mode
ISA
Device Drivers
x86
Supervisor-Mode
ISA
Device Drivers
x86
Supervisor-Mode
ISA
Device Drivers
x86
User-Mode
ISA
ARM
User-Mode
ISA
POWER
User-Mode
ISA
x86
SYSTEM VS. LANGUAGE VMS
• Virtualize Environment
• Virtualize Processor
• Hold ISA Constant
• Vary ISA
OVERLAP: SYNERGY AND CONFLICT
Memory
Threads
User-Mode
ISA
Supervisor-Mode
ISA
DEVICE OR PROCESSOR?
GPU
FPGA
THE “ACCELERATOR” MODEL
Force Calc
Main
Program
FFT
GPU
CPU
FPGA
THE REALITY
Language VMs
for Heterogeneous
Systems
Liquid Metal
Joshua Auerbach
Perry Cheng
Rodric Rabbah
David F. Bacon
Steven Fink
Sunil Shukla
Christophe Dubach
Yu Zhang
HETEROGENEOUS PROGRAMMING
Node Compiler
Synthesis
C+Intrinsics
VHDL
Verilog
SystemC
GPU Compiler
Cuda
OpenCL
CPU Compiler
Java
C++
Python
binary
binary
binary
bitfile
CPU
GPU
WSP
FPGA
THE LIQUID METAL PROGRAMMING LANGUAGE
Lime
CPU Backend
GPU Backend
Node Backend
Verilog Backend
Lime Compiler
bytecode
binary
binary
bitfile
CPU
GPU
WSP
FPGA
THE ARTIFACT STORE & EXCLUSION
Lime
Lime Compiler
bytecode
binary
bitfile
bitfile
bitfile
Artifact Store
CPU
GPU
WSP
FPGA
HETEROGENEOUS EXECUTION OF LIME
Lime
Lime Compiler
bytecode
binary
bitfile
bitfile
bitfile
Artifact Store
Lime Virtual Machine (LmVM)
CPU
bytecode
GPU
“Bus”
WSP
FPGA
bitfile
THE LIME LANGUAGE:
VIRTUALIZING HETEROGENEOUS
COMPUTATION
LIME: JAVA IS (ALMOST) A SUBSET
% javac MyClass.java
% java MyClass
% mv MyClass.java MyClass.lime
% limec MyClass.lime
% java MyClass
INCREMENTALLY USE LIME FEATURES
LIME LANGUAGE OVERVIEW
Core Features:
• Programmable Primitives (bit-level parallelism)
• Collective Operations (data parallelism)
• Stream Programming (pipeline parallelism)
Supporting Features:
• Bit-level: Immutable Types, Bounded Types, Bounded Arrays, Primitive Supertypes
• Data parallel: Reifiable Generics, Ranges, Bounded “for”, User-defined operators, Typedefs, Local type inference, Tuples
• Pipeline: Graph Construction, Isolation Enforcement, Closed World Support, Rate Matching, Messaging
VIRTUALIZATION COMPARISON
CPU — Physical → Virtual:
• Core → Thread, tasks, Map
• RAM → Heap
• byte, word, float, double → byte, int, float, dfloat
• Compare&Swap → synchronized
FPGA — Physical → Virtual:
• LUTs → bit
• CLBs → User Primitives
• Slices, DSP Multipliers → integer<N>
• Block RAM → Bounded Arrays
• Routing Network, DMA, SerDes → Streams, Reduce
STREAMING COMPUTATION
PIPELINE PARALLELISM
TASK CREATION (STATELESS)

var uppercaser = task work;   // task creation: makes a primitive filter
                              // with a char stream port on each side

local char work(char c) {     // “local” isolation keyword; worker method
    return toUpper(c);
}
PIPELINES

Graph Construction:

var pipeline = task worker1 => task worker2 => task worker3;

// each => makes a port-to-stream connection between adjacent tasks;
// the streams between the stages carry char, char, int[[5]], and int
// values, and the whole chain is a compound filter
SOURCES AND SINKS

// a source filter (reader) produces a char stream, a worker filter
// transforms it, and a sink filter (writer) consumes it; sources and
// sinks can be backed by the Heap, the File System (/tmp/mydata), or
// devices (/dev/tty)
HELLO WORLD, STREAMING STYLE
public static void main(string[[]] args) {
char[[]] msg = { 'H','E','L','L','O',',',' ',
'W','O','R','L','D','!','\n' };
var hello = msg.source(1) =>
task Character.toLowerCase(char) =>
task System.out.print(char);
hello.finish();
}
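The streaming hello-world above can be sketched in plain Java with the Streams API. This is only an analogy, not Lime: the `HelloStream` class and `run` method are hypothetical names, and `collect` stands in for the printing sink.

```java
import java.util.stream.Collectors;

public class HelloStream {
    // source (stream of chars) => lowercase worker => sink (collect/print)
    static String run(String msg) {
        return msg.chars()                        // source: one char at a time
                  .map(Character::toLowerCase)    // worker stage
                  .mapToObj(c -> String.valueOf((char) c))
                  .collect(Collectors.joining()); // sink
    }

    public static void main(String[] args) {
        System.out.print(run("HELLO, WORLD!\n"));
    }
}
```

The key difference is that Lime's `=>` builds an explicit task graph whose stages can be relocated to a GPU or FPGA, whereas a Java stream pipeline stays on the CPU.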
DEMO
HELLO WORLD
LIME/ECLIPSE ENVIRONMENT
STATEFUL TASKS

var averager = task Averager().avg;   // primitive filter with local state

// instance variables (local state)
double total;
long count;

// double-in, double-out worker method
double avg(double x) {
    total += x;
    return total/++count;
}
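The same stateful averager, written as an ordinary Java object (a sketch: Lime's `task Averager().avg` wraps an instance like this so `avg` is applied to each stream element in order):

```java
public class Averager {
    // instance variables: the task's local state
    private double total;
    private long count;

    // worker: returns the running mean after each element
    public double avg(double x) {
        total += x;
        return total / ++count;
    }

    public static void main(String[] args) {
        Averager a = new Averager();
        System.out.println(a.avg(1.0)); // 1.0
        System.out.println(a.avg(3.0)); // 2.0
    }
}
```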
RATE MATCHING

var matchedpipe = task AddStuff().work1 => # => task work2;

// ‘#’ is a rate matcher: work1 produces int[[2]] while work2 consumes
// int, so the matcher buffers and unpacks at a 1:2 rate; the whole
// chain is a compound filter of rate 1:2

int x;   // local state of AddStuff

int[[2]] work1(int i) {
    int r = i+x;
    x = i/2;
    return { r, i };
}

int work2(int i) {
    return i*3;
}
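In plain Java, `flatMap` can play the role of the `#` rate matcher: it unpacks each two-element output of `work1` into individual elements before `work2` consumes them one at a time. A sketch (the `RateMatch` class is a hypothetical name, and a Java array stands in for Lime's `int[[2]]`):

```java
import java.util.List;
import java.util.stream.Stream;

public class RateMatch {
    static int x = 0;                       // AddStuff's local state

    static int[] work1(int i) {             // emits two ints per input (1:2)
        int r = i + x;
        x = i / 2;
        return new int[] { r, i };
    }

    static int work2(int i) { return i * 3; }

    public static void main(String[] args) {
        List<Integer> out = Stream.of(1, 2, 3)
            .map(RateMatch::work1)
            .flatMap(p -> Stream.of(p[0], p[1]))  // the '#' rate matcher
            .map(RateMatch::work2)
            .toList();
        System.out.println(out);
    }
}
```

Lime's compiler checks and matches the rates statically; here the buffering is implicit in the stream pipeline.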
DEMO
N-BODY SIMULATION: CPU VS. GPU
9X SPEEDUP (9.26 GFLOPS) ON LAPTOP
VIRTUALIZATION OF DATA MOVEMENT

// [slide: the => connection + photo of the physical hardware × 3 months]
WHAT’S SO HARD?
• Projects routinely spend 60-80% of time on data xfer
• Verilog is hard, but driving devices and buses is harder
• Motherboard chips, BIOS settings: 4x slowdown each
• Signal transitions are considered an “interface” spec
• “IP” extremely sensitive to configurations
• and often broken when it mismatches the original
• Device drivers also highly sensitive
• Often specialized to one I/O style; fall off cliff otherwise
• Hardware design culture accepts this as C.O.D.B. (cost of doing business)
COLLECTIVE OPERATIONS
DATA PARALLELISM
ARRAY PARALLELISM

float[ ] a = ...;
float[ ] b = ...;
float[ ] c = a @+ b;

// a and b are Indexable<int,float>; @+ first checks that their domains
// match (domain( ) == 0 :: 2 for both), then applies + at each index and
// assembles c through a Collectable<int,float> collector:
//   a = { 1.2, 0.0, 3.14 }, b = { 0.0, 99.0, -3.0 }  →  c = { 1.2, 99.0, 0.14 }
MANY THINGS ARE COLLECTABLE

var a = new Matrix<3,quaternion>(100,200,100);  // domain ( 0::99, 0::199, 0::99 )
var b = new Matrix<3,quaternion>(100,200,100);
…
var c = a @+ b;

Map<string,List<int>> a = ...                   // domain { “foo”, “bar” }
Map<string,List<int>> b = ...
Map<string,List<int>> c = a @ append(b);

string foo = Character.toLowerCase(@ “FOO”);    // domain 0 :: 2
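The map case can be sketched in plain Java: the collective applies `append` per key over the maps' common domain. The `MapAppend` class and `atAppend` method are hypothetical names, and the sketch assumes (as the domain check requires) that both maps have the same key set.

```java
import java.util.*;

public class MapAppend {
    static Map<String, List<Integer>> atAppend(Map<String, List<Integer>> a,
                                               Map<String, List<Integer>> b) {
        if (!a.keySet().equals(b.keySet()))     // domain equality check
            throw new IllegalArgumentException("domains differ");
        Map<String, List<Integer>> c = new HashMap<>();
        for (String k : a.keySet()) {           // per-key collective apply
            List<Integer> merged = new ArrayList<>(a.get(k));
            merged.addAll(b.get(k));            // append b's list to a's
            c.put(k, merged);
        }
        return c;
    }

    public static void main(String[] args) {
        var a = Map.of("foo", List.of(1), "bar", List.of(2));
        var b = Map.of("foo", List.of(3), "bar", List.of(4));
        System.out.println(atAppend(a, b));
    }
}
```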
COLLECTIVE OPERATIONS IN ACTION

package lime.util.synthesizable;

public class FixedHashMap<K extends Value, V extends Value,
        BUCKETS extends ordinal<BUCKETS>, LINKS extends ordinal<LINKS>>
    extends AbstractMap<K,V>
{
    protected final nodes = new Node<K,V>[BUCKETS][LINKS];

// nodes: BUCKETS = 3 rows of LINKS = 2 Node<K,V> entries each
GET OPERATION, STEP 1: SELECT ROW

public local V get(K key) {
    Node[LINKS] row = nodes[hash(key)];
    boolean[LINKS] selections = row @ compareKey(key);
    V[LINKS] vals = row @ getValueOrDefault(selections);
    return | ! vals;
}

// hash(key) picks one of the BUCKETS = 3 rows; row holds its LINKS = 2 nodes
STEP 2: COMPARE ALL KEYS

// row @ compareKey(key) applies compareKey to every node in the row at
// once, yielding selections = { true, false }: only the first node's
// key matches
STEP 3: GET VALUES/ZEROS

// row @ getValueOrDefault(selections) yields each selected node's value
// and the default (0) elsewhere: vals = { V, 0 }
STEP 4: OR-REDUCE FOR RESULT

// | ! vals OR-reduces vals, so the single matching value (or 0) comes
// out as the result
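The four steps of `get` — select row, compare all keys, take value-or-zero, OR-reduce — can be sketched in plain Java. This is an analogy only: `FixedMapGet` is a hypothetical name, `V` is specialized to `int` with 0 as the default/zero value, and explicit loops stand in for the `@` collective and `| !` reduction.

```java
import java.util.stream.IntStream;

public class FixedMapGet {
    static final int BUCKETS = 3, LINKS = 2;
    int[][] keys = new int[BUCKETS][LINKS];
    int[][] vals = new int[BUCKETS][LINKS];

    int get(int key) {
        int row = Math.floorMod(key, BUCKETS);       // step 1: hash selects a row
        boolean[] sel = new boolean[LINKS];
        for (int i = 0; i < LINKS; i++)              // step 2: compare all keys
            sel[i] = keys[row][i] == key;
        int[] v = new int[LINKS];
        for (int i = 0; i < LINKS; i++)              // step 3: value if selected, else 0
            v[i] = sel[i] ? vals[row][i] : 0;
        return IntStream.of(v)                       // step 4: OR-reduce (| ! vals)
                        .reduce(0, (x, y) -> x | y);
    }

    public static void main(String[] args) {
        FixedMapGet m = new FixedMapGet();
        m.keys[1][0] = 7; m.vals[1][0] = 42;         // 7 hashes to row 1
        System.out.println(m.get(7));                // prints 42
    }
}
```

On an FPGA, steps 2–4 have no loop at all: the LINKS comparisons and the OR tree are laid out as parallel combinational logic, which is why the collective formulation matters.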
VIRTUALIZATION OF DATA PARALLELISM
@ !
CURRENT RESULTS
HOW DO WE EVALUATE PERFORMANCE?
• Speedup for Naïve Users
• How much faster than Java?
• Slowdown for Expert Users?
• How much slower than hand-tuned low-level code?
• Our methodology:
• Write/tune/compare 4 versions of each benchmark:
• Java, Lime, OpenCL, Verilog
• Doesn’t address flops/watt, flops/watt/$, productivity
EXPERT VS NAÏVE SPEEDUP: END-TO-END (JAVA BASELINE)

// [bar chart: Expert vs Lime end-to-end speedup over Java on GPU and
// FPGA; values shown include 1.0x, 1.06x, 6.10x, 6.67x, 66x, 136x, and
// 0.04x]
EXPERT VS NAÏVE SPEEDUP: KERNEL TIME (JAVA BASELINE)

// [bar chart: Expert vs Lime kernel-time speedup over Java on GPU and
// FPGA; values shown include 10x, 26x, 37x, 126x, 150x, 193x, and 0.02x]
Should We Virtualize
Heterogeneous
Architectures?
Yes We Can!
“USER” VS. “SUPERVISOR”
ACCELERATOR VS. FIRST-CLASS DEVICE
ANOTHER UNSOLVED PROBLEM
SUMMARY
• Heterogeneity is Here to Stay
• Multicore Isn’t Dead - Just Less Important
• HLLs Can Insulate Programmer from Low-Level Details
• But not fundamental computational structure
• System-Level Virtualization is an Open (Hard) Problem
REAL-TIME PHOTOMOSAIC
Questions?