
Functioning Hardware
from
Functional Specifications
Stephen A. Edwards
Martha A. Kim
Richard Townsend
Kuangya Zhai
Lianne Lairmore
Columbia University
IBM PL Day, November 18, 2014
Where’s my 10 GHz processor?
[Figure: Intel CPU Trends. Sutter, The Free Lunch is Over, DDJ 2005. Data: Intel, Wikipedia, K. Olukotun]
The Cray-2: Immersed in Fluorinert
1985; ECL; 150 kW
Where’s all that power going?
What can we do about it?
Dally: Calculation Cheap; Communication Costly
Energy on a 20 mm die (Dally, 64-bit operations):
- 64b FPU: 0.1 mm², 50 pJ/op, 1.5 GHz
- 64b 1 mm channel: 25 pJ/word
- 10 mm channel: 250 pJ, 4 cycles
- 64b off-chip channel: 1 nJ/word

“Chips are power limited and most power is spent moving data.”
Performance = Parallelism
Efficiency = Locality
Bill Dally’s 2009 DAC Keynote, The End of Denial Architecture
Parallelism for Performance; Locality for Efficiency
Dally: “Single-thread processors are in denial about these two facts.”
We need different programming paradigms and different architectures on which to run them.
The Future is Wires and Memory
How best to use all those
transistors?
Dark Silicon
Taylor and Swanson’s Conservation Cores
[Figure: C-core generation. The CFG of a loop kernel (basic blocks BB0, BB1, BB2) becomes a custom datapath (adders, a multiplier, loads, stores, a <N? comparison) plus an inter-BB state machine; the stylized Verilog (.V) then goes through Synopsys IC Compiler for place-and-route and clock-tree synthesis.]
Code to stylized Verilog and through a CAD flow: 0.01 mm² in 45 nm TSMC; runs at 1.4 GHz.
Custom datapaths and controllers for loop kernels; uses the existing memory hierarchy.
Swanson, Taylor, et al. Conservation Cores. ASPLOS 2010.
Bacon et al.’s Liquid Metal
[Fig. 2: Block-level diagram of DES and a Lime code snippet]
JITting Lime (Java-like, side-effect-free, streaming) to FPGAs
Huang, Hormati, Bacon, and Rabbah, Liquid Metal, ECOOP 2008.
What are we doing about it?
Functional Programs to FPGAs
Why Functional Specifications?
- Referential transparency/side-effect freedom make formal reasoning about programs vastly easier
- Inherently concurrent and race-free (thank Church and Rosser). If you want races and deadlocks, you need to add constructs.
- Immutable data structures make it vastly easier to reason about memory in the presence of concurrency
Why FPGAs?
- We do not know the structure of future memory systems: homogeneous or heterogeneous? levels of hierarchy? communication mechanisms?
- We do not know the architecture of future multi-cores: programmable in assembly/C? single- or multi-threaded?
Use FPGAs as a surrogate. Ultimately too flexible, but
representative of the long-term solution.
The Practical Question
How do we synthesize hardware from pure functional languages for FPGAs?
Control and datapath are easy; the memory system is interesting.
The Type System: Algebraic Data Types
Types are primitive (Boolean, Integer, etc.) or other ADTs:
type ::= Type                              -- Primitive type
       | Constr Type∗ | · · · | Constr Type∗  -- Tagged union

Examples:

data Intlist = Nil               -- Linked list of integers
             | Cons Int Intlist

data Bintree = Leaf Int          -- Binary tree of integers
             | Branch Bintree Bintree

data Expr = Literal Int          -- Arithmetic expression
          | Var String
          | Binop Expr Op Expr

data Op = Add | Sub | Mult | Div
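To see these ADTs in action, here is a small evaluator for Expr (a hypothetical helper, not from the talk; it re-declares Expr and Op so the snippet stands alone):

```haskell
-- Hypothetical evaluator illustrating pattern matching over the ADTs above.
data Expr = Literal Int | Var String | Binop Expr Op Expr
data Op = Add | Sub | Mult | Div

-- Environment mapping variable names to values.
eval :: [(String, Int)] -> Expr -> Int
eval _   (Literal n)    = n
eval env (Var x)        = maybe (error ("unbound " ++ x)) id (lookup x env)
eval env (Binop l op r) = apply op (eval env l) (eval env r)
  where
    apply Add  = (+)
    apply Sub  = (-)
    apply Mult = (*)
    apply Div  = div
```

For example, `eval [("x", 6)] (Binop (Var "x") Mult (Literal 7))` evaluates to 42.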
Example: Huffman Decoder in Haskell
data HTree = Branch HTree HTree | Leaf Char
decode :: HTree → [Bool] → [Char]
decode table str = bit str table
  where
    bit (False:xs) (Branch l _) = bit xs l        -- 0: left
    bit (True:xs)  (Branch _ r) = bit xs r        -- 1: right
    bit x          (Leaf c)     = c : bit x table -- leaf
    bit []         _            = []              -- done

Encoding the types in memory:
- Huffman tree nodes (19 bits): Branch = tag 0 + two 9-bit pointers; Leaf = tag 1 + 8-bit char
- Boolean input stream (14 bits): Cons = tag 1 + 1-bit Boolean + 12-bit pointer; Nil = tag 0
- Character output stream (19 bits): Cons = tag 1 + 8-bit char + 10-bit pointer; Nil = tag 0
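As a sanity check, the decoder can be exercised on a tiny tree (a hypothetical example encoding 'a' = 0, 'b' = 10, 'c' = 11):

```haskell
data HTree = Branch HTree HTree | Leaf Char

decode :: HTree -> [Bool] -> [Char]
decode table str = bit str table
  where
    bit (False:xs) (Branch l _) = bit xs l        -- 0: left
    bit (True:xs)  (Branch _ r) = bit xs r        -- 1: right
    bit x          (Leaf c)     = c : bit x table -- leaf: emit, restart at root
    bit []         _            = []              -- done

-- Hypothetical tree: 'a' = 0, 'b' = 10, 'c' = 11
table :: HTree
table = Branch (Leaf 'a') (Branch (Leaf 'b') (Leaf 'c'))
```

Here `decode table [False, True,False, True,True]` yields "abc".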
Memory Optimizations
- Split memories: separate Input, Output, and HTree memories
- Use streams: the input and output memories become In and Out FIFOs alongside the HTree memory
- Unroll for locality
- Speculate
Hardware Synthesis:
Semantics-preserving steps to
a low-level dialect
Removing Recursion: The Fib Example
fib n = case n of
  1 → 1
  2 → 1
  n → fib (n−1) + fib (n−2)
Transform to Continuation-Passing Style
fibk n k = case n of
  1 → k 1
  2 → k 1
  n → fibk (n−1) (λ n1 →
        fibk (n−2) (λ n2 →
          k (n1 + n2)))

fib n = fibk n (λ x → x)
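The CPS version can be checked against the direct version in ordinary Haskell (a sketch with type signatures added; the wrapper is renamed fibCPS here only so both definitions fit in one file):

```haskell
-- Direct-style reference, as on the slide.
fib :: Int -> Int
fib n = case n of
  1 -> 1
  2 -> 1
  _ -> fib (n - 1) + fib (n - 2)

-- Continuation-passing style: every call is a tail call; the continuation k
-- says what to do with the result.
fibk :: Int -> (Int -> Int) -> Int
fibk n k = case n of
  1 -> k 1
  2 -> k 1
  _ -> fibk (n - 1) (\n1 ->
         fibk (n - 2) (\n2 ->
           k (n1 + n2)))

fibCPS :: Int -> Int
fibCPS n = fibk n (\x -> x)
```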
Name Lambda Expressions (Lambda Lifting)
fibk n k = case n of
  1 → k 1
  2 → k 1
  n → fibk (n−1) (k1 n k)

k1 n k n1 = fibk (n−2) (k2 n1 k)
k2 n1 k n2 = k (n1 + n2)
k0 x       = x

fib n = fibk n k0
Represent Continuations with a Type
data Cont = K0 | K1 Int Cont | K2 Int Cont

fibk n k = case (n, k) of
  (1, k) → kk k 1
  (2, k) → kk k 1
  (n, k) → fibk (n−1) (K1 n k)

kk k a = case (k, a) of
  ((K1 n k),  n1) → fibk (n−2) (K2 n1 k)
  ((K2 n1 k), n2) → kk k (n1 + n2)
  (K0,        x)  → x

fib n = fibk n K0
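With type signatures added, the defunctionalized version runs as ordinary Haskell; Cont is now an explicit stack of pending work (primes added to avoid shadowing, otherwise the slide's code):

```haskell
data Cont = K0 | K1 Int Cont | K2 Int Cont

-- fibk descends, pushing a K1 frame per recursive call.
fibk :: Int -> Cont -> Int
fibk n k = case (n, k) of
  (1, k')  -> kk k' 1
  (2, k')  -> kk k' 1
  (n', k') -> fibk (n' - 1) (K1 n' k')

-- kk interprets the topmost continuation frame.
kk :: Cont -> Int -> Int
kk k a = case (k, a) of
  (K1 n k',  n1) -> fibk (n - 2) (K2 n1 k')
  (K2 n1 k', n2) -> kk k' (n1 + n2)
  (K0,       x)  -> x

fib :: Int -> Int
fib n = fibk n K0
```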
Merge Functions
data Cont = K0 | K1 Int Cont | K2 Int Cont
data Call = Fibk Int Cont | KK Cont Int
fibk z = case z of
  (Fibk 1 k)        → fibk (KK k 1)
  (Fibk 2 k)        → fibk (KK k 1)
  (Fibk n k)        → fibk (Fibk (n−1) (K1 n k))
  (KK (K1 n k) n1)  → fibk (Fibk (n−2) (K2 n1 k))
  (KK (K2 n1 k) n2) → fibk (KK k (n1 + n2))
  (KK K0 x)         → x

fib n = fibk (Fibk n K0)
Add Explicit Memory Operations
load :: CRef → Cont
store :: Cont → CRef

data Cont = K0 | K1 Int CRef | K2 Int CRef
data Call = Fibk Int CRef | KK Cont Int

fibk z = case z of
  (Fibk 1 k)        → fibk (KK (load k) 1)
  (Fibk 2 k)        → fibk (KK (load k) 1)
  (Fibk n k)        → fibk (Fibk (n−1) (store (K1 n k)))
  (KK (K1 n k) n1)  → fibk (Fibk (n−2) (store (K2 n1 k)))
  (KK (K2 n1 k) n2) → fibk (KK (load k) (n1 + n2))
  (KK K0 x)         → x

fib n = fibk (Fibk n (store K0))
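On the slide, load and store are abstract hardware operations. As a software model only (an assumption for illustration, not the hardware implementation), memory can be an association list threaded through the recursion:

```haskell
type CRef = Int
data Cont = K0 | K1 Int CRef | K2 Int CRef
data Call = Fibk Int CRef | KK Cont Int

-- Hypothetical memory model: stored cells plus the next free address.
type Mem = ([(CRef, Cont)], CRef)

store :: Cont -> Mem -> (CRef, Mem)
store c (m, n) = (n, ((n, c) : m, n + 1))

load :: CRef -> Mem -> Cont
load r (m, _) = maybe (error "dangling CRef") id (lookup r m)

-- Same state machine as the slide, with the memory passed explicitly.
fibk :: Call -> Mem -> Int
fibk z mem = case z of
  Fibk 1 k        -> fibk (KK (load k mem) 1) mem
  Fibk 2 k        -> fibk (KK (load k mem) 1) mem
  Fibk n k        -> let (r, mem') = store (K1 n k) mem
                     in  fibk (Fibk (n - 1) r) mem'
  KK (K1 n k) n1  -> let (r, mem') = store (K2 n1 k) mem
                     in  fibk (Fibk (n - 2) r) mem'
  KK (K2 n1 k) n2 -> fibk (KK (load k mem) (n1 + n2)) mem
  KK K0 x         -> x

fib :: Int -> Int
fib n = let (r, mem) = store K0 ([], 0)
        in  fibk (Fibk n r) mem
```

This model never frees cells; real hardware would manage the continuation memory, which is exactly what makes the memory system interesting.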
Syntax-Directed Translation to Hardware
[Figure: block diagram of the translated hardware. fib wraps fibk; Call and Cont values flow between them; store writes Cont values to a continuation memory (ports labeled we, di, a, do), yielding CRefs; load reads them back; the K0 case produces the result.]
Duplication Can Increase Parallelism
fib 0 = 0
fib 1 = 1
fib n = fib (n−1) + fib (n−2)
After duplicating functions:

fib 0 = 0
fib 1 = 1
fib n = fib′ (n−1) + fib′′ (n−2)

fib′ 0 = 0
fib′ 1 = 1
fib′ n = fib′ (n−1) + fib′ (n−2)

fib′′ 0 = 0
fib′′ 1 = 1
fib′′ n = fib′′ (n−1) + fib′′ (n−2)

Here, fib′ and fib′′ may run in parallel.
Unrolling Recursive Data Structures
Original Huffman tree type:
data HTree = Branch HTree HTree | Leaf Char
Unrolled Huffman tree type:
data HTree   = Branch   HTree′  HTree′  | Leaf   Char
data HTree′  = Branch′  HTree′′ HTree′′ | Leaf′  Char
data HTree′′ = Branch′′ HTree   HTree   | Leaf′′ Char
Increases locality: larger data blocks.
A type-aware cache line
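The unrolling can be sketched as three mutually recursive conversions from a plain tree, one per level (a hypothetical helper using ASCII primes; `chars` is added only to check that the fringe survives the conversion):

```haskell
-- Plain source type (hypothetical) and the unrolled types from the slide.
data Tree    = B Tree Tree | L Char
data HTree   = Branch   HTree'  HTree'  | Leaf   Char
data HTree'  = Branch'  HTree'' HTree'' | Leaf'  Char
data HTree'' = Branch'' HTree   HTree   | Leaf'' Char

-- Every three levels of the source tree become one larger block.
unroll :: Tree -> HTree
unroll (L c)   = Leaf c
unroll (B l r) = Branch (unroll' l) (unroll' r)

unroll' :: Tree -> HTree'
unroll' (L c)   = Leaf' c
unroll' (B l r) = Branch' (unroll'' l) (unroll'' r)

unroll'' :: Tree -> HTree''
unroll'' (L c)   = Leaf'' c
unroll'' (B l r) = Branch'' (unroll l) (unroll r)

-- Fringe of an unrolled tree, to confirm nothing is lost.
chars :: HTree -> [Char]
chars (Leaf c)     = [c]
chars (Branch l r) = chars' l ++ chars' r

chars' :: HTree' -> [Char]
chars' (Leaf' c)     = [c]
chars' (Branch' l r) = chars'' l ++ chars'' r

chars'' :: HTree'' -> [Char]
chars'' (Leaf'' c)     = [c]
chars'' (Branch'' l r) = chars l ++ chars r
```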
- Dark Silicon is the future: faster transistors, but most must remain off
- Custom accelerators are the future; many approaches (e.g., Conservation Cores: code to stylized Verilog and through a CAD flow; 0.01 mm² in 45 nm TSMC; runs at 1.4 GHz)
- Our project: a pure functional language to FPGAs. Today, C et al. compile via gcc et al. to x86 et al.; in the future, higher-level languages should compile via a functional IR to future ISAs and FPGAs, raising abstraction over time while staying close to hardware reality.
- Algebraic data types in hardware: compact, type-directed encodings (19-bit Huffman tree nodes, 14-bit input stream cells, 19-bit output stream cells)
- Memory optimizations: split memories, use streams, unroll for locality, speculate
- Removing recursion; syntax-directed translation to hardware