Functioning Hardware from Functional Specifications
Stephen A. Edwards, Martha A. Kim, Richard Townsend, Kuangya Zhai, Lianne Lairmore
Columbia University
IBM PL Day, November 18, 2014

Where's my 10 GHz processor?
[Figure: Intel CPU trends. Sutter, "The Free Lunch is Over," DDJ 2005. Data: Intel, Wikipedia, K. Olukotun]

The Cray-2: Immersed in Fluorinert
1985. ECL. 150 kW.
Where's all that power going? What can we do about it?

Dally: Calculation Cheap; Communication Costly
- 64-bit FPU: 0.1 mm², 50 pJ/op, 1.5 GHz
- 64-bit 1 mm on-chip channel: 25 pJ/word
- 64-bit off-chip channel: 1 nJ/word
[Figure: on a 20 mm die, a 10 mm wire costs 250 pJ and 4 cycles]
"Chips are power limited and most power is spent moving data.
Performance = Parallelism. Efficiency = Locality."
Bill Dally's 2009 DAC keynote, "The End of Denial Architecture"

Parallelism for Performance; Locality for Efficiency
Dally: "Single-thread processors are in denial about these two facts."
We need different programming paradigms and different architectures on which to run them.

The Future is Wires and Memory
How best to use all those transistors?
- Dark silicon is the future: faster transistors, but most must remain off.
- Custom accelerators are the future; there are many approaches.

Taylor and Swanson's Conservation Cores
C-core generation: compile a loop kernel's CFG (basic blocks BB0, BB1, BB2 plus an inter-basic-block state machine) to stylized Verilog (.v), then through a standard CAD flow (Synopsys IC Compiler, P&R, CTS) to a custom datapath of loads, stores, adders, and comparators.
Custom datapaths and controllers for loop kernels; uses the existing memory hierarchy.
0.01 mm² in 45 nm TSMC; runs at 1.4 GHz.
Swanson, Taylor, et al., "Conservation Cores," ASPLOS 2010.

Bacon et al.'s Liquid Metal
JITting Lime (a Java-like, side-effect-free, streaming language) to FPGAs.
[Fig. 2: Block-level diagram of DES and a Lime code snippet]
Huang, Hormati, Bacon, and Rabbah, "Liquid Metal," ECOOP 2008.

What are we doing about it? Functional Programs to FPGAs

Why Functional Specifications?
- Referential transparency/side-effect freedom make formal reasoning about programs vastly easier.
- Inherently concurrent and race-free (thank Church and Rosser). If you want races and deadlocks, you need to add constructs.
- Immutable data structures make it vastly easier to reason about memory in the presence of concurrency.

Why FPGAs?
- We do not know the structure of future memory systems: homogeneous or heterogeneous? How many levels of hierarchy? Which communication mechanisms?
- We do not know the architecture of future multi-cores: programmable in assembly/C? Single- or multi-threaded?
Use FPGAs as a surrogate. They are ultimately too flexible, but representative of the long-term solution.

The Practical Question
How do we synthesize hardware from pure functional languages for FPGAs?
Control and datapath are easy; the memory system is interesting.

The Type System: Algebraic Data Types
Types are primitive (Boolean, Integer, etc.) or other ADTs:

type ::= Type                               -- Primitive type
       | Constr Type* | ... | Constr Type*  -- Tagged union

Examples:

data Intlist = Nil                 -- Linked list of integers
             | Cons Int Intlist

data Bintree = Leaf Int            -- Binary tree of integers
             | Branch Bintree Bintree

data Expr = Literal Int            -- Arithmetic expression
          | Var String
          | Binop Expr Op Expr

data Op = Add | Sub | Mult | Div

Example: Huffman Decoder in Haskell

data HTree = Branch HTree HTree | Leaf Char

decode :: HTree -> [Bool] -> [Char]
decode table str = bit str table
  where bit (False:xs) (Branch l _) = bit xs l         -- 0: left
        bit (True:xs)  (Branch _ r) = bit xs r         -- 1: right
        bit x          (Leaf c)     = c : bit x table  -- leaf
        bit []         _            = []               -- done

Encoding the Types
Huffman tree nodes (19 bits):
  Branch: 0 | 9-bit pointer | 9-bit pointer
  Leaf:   1 | 8-bit char
Boolean input stream (14 bits):
  Cons: 1 | 1-bit Bool | 12-bit pointer
  Nil:  0
Character output stream (19 bits):
  Cons: 1 | 8-bit char | 10-bit pointer
  Nil:  0

Memory Optimizations
- Split memories: give the input stream, the output stream, and the HTree their own memories.
- Use streams: replace the input and output memories with In and Out FIFOs.
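The decoder above is ordinary executable Haskell. A minimal self-contained sketch exercises it on a small hand-built tree; the particular tree and bit string are illustrative, not from the talk:

```haskell
-- The slides' Huffman decoder, exercised on a small hand-built tree.
data HTree = Branch HTree HTree | Leaf Char

decode :: HTree -> [Bool] -> [Char]
decode table str = bit str table
  where bit (False:xs) (Branch l _) = bit xs l          -- 0: go left
        bit (True:xs)  (Branch _ r) = bit xs r          -- 1: go right
        bit x          (Leaf c)     = c : bit x table   -- leaf: emit, restart at root
        bit []         _            = []                -- out of bits: done

-- Illustrative code: 'a' = 0, 'b' = 10, 'c' = 11
table :: HTree
table = Branch (Leaf 'a') (Branch (Leaf 'b') (Leaf 'c'))

main :: IO ()
main = putStrLn (decode table [False, True, False, True, True])  -- prints "abc"
```

Note that every constructor application here is one of the encoded memory words above: each `Branch` is a 19-bit node with two pointers, and the input list is a chain of 14-bit `Cons` cells.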
- Unroll for locality: pack more of the HTree into each memory word.
- Speculate: add a second HTree memory and fetch ahead.
[Diagram, built up in stages: In FIFO, Out FIFO, and HTree memories in each configuration]

Hardware Synthesis
Semantics-preserving steps to a low-level dialect.

Removing Recursion: The Fib Example

fib n = case n of
  1 -> 1
  2 -> 1
  n -> fib (n-1) + fib (n-2)

Transform to Continuation-Passing Style

fibk n k = case n of
  1 -> k 1
  2 -> k 1
  n -> fibk (n-1) (\n1 ->
         fibk (n-2) (\n2 ->
           k (n1 + n2)))

fib n = fibk n (\x -> x)

Name Lambda Expressions (Lambda Lifting)

fibk n k = case n of
  1 -> k 1
  2 -> k 1
  n -> fibk (n-1) (k1 n k)

k1 n  k n1 = fibk (n-2) (k2 n1 k)
k2 n1 k n2 = k (n1 + n2)
k0 x       = x

fib n = fibk n k0

Represent Continuations with a Type

data Cont = K0 | K1 Int Cont | K2 Int Cont

fibk n k = case (n, k) of
  (1, k) -> kk k 1
  (2, k) -> kk k 1
  (n, k) -> fibk (n-1) (K1 n k)

kk k a = case (k, a) of
  (K1 n  k, n1) -> fibk (n-2) (K2 n1 k)
  (K2 n1 k, n2) -> kk k (n1 + n2)
  (K0,      x ) -> x

fib n = fibk n K0

Merge Functions

data Cont = K0 | K1 Int Cont | K2 Int Cont
data Call = Fibk Int Cont | KK Cont Int

fibk z = case z of
  Fibk 1 k        -> fibk (KK k 1)
  Fibk 2 k        -> fibk (KK k 1)
  Fibk n k        -> fibk (Fibk (n-1) (K1 n k))
  KK (K1 n k)  n1 -> fibk (Fibk (n-2) (K2 n1 k))
  KK (K2 n1 k) n2 -> fibk (KK k (n1 + n2))
  KK K0 x         -> x

fib n = fibk (Fibk n K0)

Add Explicit Memory Operations

load  :: CRef -> Cont
store :: Cont -> CRef

data Cont = K0 | K1 Int CRef | K2 Int CRef
data Call = Fibk Int CRef | KK Cont Int

fibk z = case z of
  Fibk 1 k        -> fibk (KK (load k) 1)
  Fibk 2 k        -> fibk (KK (load k) 1)
  Fibk n k        -> fibk (Fibk (n-1) (store (K1 n k)))
  KK (K1 n k)  n1 -> fibk (Fibk (n-2) (store (K2 n1 k)))
  KK (K2 n1 k) n2 -> fibk (KK (load k) (n1 + n2))
  KK K0 x         -> x

fib n = fibk (Fibk n (store K0))

Syntax-Directed Translation to Hardware
[Diagram: n enters fib, which issues a Call to the fibk block; store and load connect Cont values to CRef pointers in a memory (we, di, a, do ports); the final Cont carries the result]

Duplication Can Increase Parallelism

fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

After duplicating functions:

fib 0 = 0
fib 1 = 1
fib n = fib' (n-1) + fib'' (n-2)

fib' 0 = 0
fib' 1 = 1
fib' n = fib' (n-1) + fib' (n-2)

fib'' 0 = 0
fib'' 1 = 1
fib'' n = fib'' (n-1) + fib'' (n-2)

Here, fib' and fib'' may run in parallel.

Unrolling Recursive Data Structures

Original Huffman tree type:

data HTree = Branch HTree HTree | Leaf Char

Unrolled Huffman tree type:

data HTree   = Branch   HTree'  HTree'  | Leaf   Char
data HTree'  = Branch'  HTree'' HTree'' | Leaf'  Char
data HTree'' = Branch'' HTree   HTree   | Leaf'' Char

Increases locality: larger data blocks. A type-aware cache line.

Our Project: A Pure Functional Language to FPGAs
[Figure: a spectrum from more abstraction to more hardware reality over time: future languages feeding gcc et al. feeding x86 et al. and future ISAs; higher-level languages feeding a functional IR feeding FPGAs today]

Conclusions
- Dark silicon is the future: faster transistors; most must remain off.
- Custom accelerators are the future; many approaches (e.g., Conservation Cores: code to stylized Verilog and through a CAD flow).
- Algebraic data types in hardware: encoding the types as tagged words with pointers.
- Memory optimizations: split memories, use streams, unroll for locality, speculate.
- Removing recursion: CPS, defunctionalized continuations, explicit memory operations, syntax-directed translation to hardware.
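Every step of the recursion-removal pipeline stays executable Haskell, so it can be checked against the naive definition. A minimal self-contained sketch of the merged, defunctionalized fib (the stage just before explicit memory operations):

```haskell
-- Defunctionalized, merged fib: continuations are data (Cont),
-- and every call goes through one function over a Call value.
data Cont = K0 | K1 Int Cont | K2 Int Cont
data Call = Fibk Int Cont | KK Cont Int

fibk :: Call -> Int
fibk z = case z of
  Fibk 1 k        -> fibk (KK k 1)                     -- base case: resume k with 1
  Fibk 2 k        -> fibk (KK k 1)
  Fibk n k        -> fibk (Fibk (n - 1) (K1 n k))      -- first recursive call
  KK (K1 n k)  n1 -> fibk (Fibk (n - 2) (K2 n1 k))     -- second recursive call
  KK (K2 n1 k) n2 -> fibk (KK k (n1 + n2))             -- combine and resume
  KK K0 x         -> x                                  -- initial continuation: done

fib :: Int -> Int
fib n = fibk (Fibk n K0)

main :: IO ()
main = print (map fib [1 .. 10])  -- prints [1,1,2,3,5,8,13,21,34,55]
```

Because `fibk` is a single first-order, tail-recursive function over two algebraic types, it maps directly onto the state machine and memory of the syntax-directed translation above.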
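The unrolled tree types are also ordinary Haskell, and a function over them splits the same three ways. A small sketch; the `leaves` functions and the example tree are illustrative, not from the talk:

```haskell
-- Unrolled Huffman-tree types: the type cycles every three levels,
-- so a three-level block can be packed into one wide memory word.
data HTree   = Branch   HTree'  HTree'  | Leaf   Char
data HTree'  = Branch'  HTree'' HTree'' | Leaf'  Char
data HTree'' = Branch'' HTree   HTree   | Leaf'' Char

-- A function over the unrolled type is itself unrolled three ways.
leaves :: HTree -> Int
leaves (Leaf _)     = 1
leaves (Branch l r) = leaves' l + leaves' r

leaves' :: HTree' -> Int
leaves' (Leaf' _)     = 1
leaves' (Branch' l r) = leaves'' l + leaves'' r

leaves'' :: HTree'' -> Int
leaves'' (Leaf'' _)     = 1
leaves'' (Branch'' l r) = leaves l + leaves r

main :: IO ()
main = print (leaves (Branch (Leaf' 'a') (Branch' (Leaf'' 'b') (Leaf'' 'c'))))
```

Only the boundary constructors `Branch''`/`Leaf''` point back to plain `HTree`, which is what makes each unrolled block a self-contained, type-aware cache line.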