A Stall‐Free Real‐Time Garbage Collector for g FPGAs David F. Bacon, Perry Cheng, Sunil Shukla http://www.research.ibm.com/liquidmetal/ Motivation: Liquid Metal Language Lime Toolchain n Lime Toolchain = IDE + Compiler + Debugger Platform = Hardware + Runtime CPU GPU FPGA 2 What is Garbage Collection? Mutator Heap Register g Stack 3 What is Garbage Collection? Mutator Heap Register g Stack 4 Memory Management: CPU vs. FPGA CPU Static 1950 STW 10s 1960 Concurrent 1s 1970 1980 Generational G i l 10s most<100ms 1990 Metronome 10ms 500us 2000 Static FPGA 2010 RTGC 0 cycles PLDI’12: “And Then There Were None: A Stall-Free Real-Time Garbage Collector for Reconfigurable Hardware” - D. Bacon, P. Cheng, S. Shukla Heap ‐ features • Homogeneous objects — — 2 pointer fields 2 pointer fields Programmable number of data fields • Minimum mutator utilization (MMU) of 100% • Validated analytical bounds on the WCET • Implemented in Verilog 6 Heap ‐ Block Diagram Data/Pointer Memory Read/write Address Write data Read data Allocation Engine Addr Alloc’d Alloc GC Mark Engine GC busy Roots Heap Occupancy Sweep Engine GC 7 Marking Mutator Heap Register g Stack 8 Mark Engine Mark Map Mutator Pointer Mem A A Mark Queue B B Write Barrier Mark Engine: Write Barriers Mutator Heap Register g Stack 10 Mark Engine: Write Barriers Mark Map Mutator Pointer Memory A A Mark Queue B B Write Barrier Evaluation 1. Static memory manager (“Malloc”) 2. Stop-the-world collector (“STW”) 3. Concurrent collector (“RTGC”) Data/Pointer Memory M Read/write Address Write data Read data Allocation Engine Addr Alloc’d Alloc GC Mark Engine GC busy Roots Heap Occupancy Sweep Engine GC 12 Evaluation roadmap • GC (Space, Frequency) • Application + GC (Time, Energy) Application + GC (Time Energy) 13 GC: Area Available slices: 51,840 51 840 14 GC: Memory (BRAM) Available BRAM: 648 (1.45 MB) 15 GC: Frequency 16 Evaluation Roadmap • GC (Space, Time, Frequnecy) • Application + GC (Time, Energy) Application + GC (Time Energy) 17 Applications Binary Tree Doubly ended queue (deque) • Operation: insert, delete, traverse • Two link lists growing from opposite ends • Bursty operation • mutation rate = 0.13 pointer writes/cycle • Avg. mutation rate = 0.02 pointer writes/cycle • Avg. allocation rate (α) = 0.07 objects/cycles bj / l • Avg. allocation rate (α) = 0 009 objects/cycle 0.009 objects/cycle 18 App+GC: Time (cycles) Bi Binary Tree T Deque 19 App+GC: Time (ms), Energy (mJ) Binary Tree D Deque Time (s) = Time (M cycles) / Clock_Frequency Clock Frequency (MHz) Energy (J) = Power (W) * Time (s) 20 Conclusion • First design & implementation of a GC on FPGA • First GC to entirely eliminate mutator interference by the collector • Big leap in the field of HLS 21 THANK YOU http://www.research.ibm.com/liquidmetal/ 22