slides

A Stall‐Free Real‐Time Garbage Collector for g
FPGAs
David F. Bacon, Perry Cheng, Sunil Shukla
http://www.research.ibm.com/liquidmetal/
Motivation: Liquid Metal
Language
Lime Toolchain
n
Lime
Toolchain =
IDE +
Compiler +
Debugger
Platform =
Hardware +
Runtime
CPU
GPU
FPGA
2
What is Garbage Collection?
Mutator
Heap
Register
g
Stack
3
What is Garbage Collection?
Mutator
Heap
Register
g
Stack
4
Memory Management: CPU vs. FPGA
CPU
Static
1950
STW
10s
1960
Concurrent
1s
1970
1980
Generational
G
i
l
10s
most<100ms
1990
Metronome
10ms
500us
2000
Static
FPGA
2010
RTGC
0 cycles
PLDI’12: “And Then There Were None: A Stall-Free Real-Time Garbage Collector for Reconfigurable Hardware”
- D. Bacon, P. Cheng, S. Shukla
Heap ‐ features
• Homogeneous objects
—
—
2 pointer fields
2
pointer fields
Programmable number of data fields
• Minimum mutator utilization (MMU) of 100%
• Validated analytical bounds on the WCET
• Implemented in Verilog
6
Heap ‐ Block Diagram
Data/Pointer
Memory
Read/write Address
Write data
Read data
Allocation
Engine
Addr Alloc’d
Alloc
GC
Mark
Engine
GC busy
Roots
Heap Occupancy
Sweep
Engine
GC
7
Marking
Mutator
Heap
Register
g
Stack
8
Mark Engine
Mark Map
Mutator
Pointer Mem
A
A
Mark Queue
B
B
Write Barrier
Mark Engine: Write Barriers
Mutator
Heap
Register
g
Stack
10
Mark Engine: Write Barriers
Mark Map
Mutator
Pointer Memory
A
A
Mark Queue
B
B
Write Barrier
Evaluation
1. Static memory manager (“Malloc”)
2. Stop-the-world collector (“STW”)
3. Concurrent collector (“RTGC”)
Data/Pointer
Memory
M
Read/write Address
Write data
Read data
Allocation
Engine
Addr Alloc’d
Alloc
GC
Mark
Engine
GC busy
Roots
Heap Occupancy
Sweep
Engine
GC
12
Evaluation roadmap
• GC (Space, Frequency)
• Application + GC (Time, Energy)
Application + GC (Time Energy)
13
GC: Area
Available slices: 51,840
51 840
14
GC: Memory (BRAM)
Available BRAM: 648 (1.45 MB)
15
GC: Frequency
16
Evaluation Roadmap
• GC (Space, Time, Frequnecy)
• Application + GC (Time, Energy)
Application + GC (Time Energy)
17
Applications
Binary Tree
Doubly ended queue (deque)
• Operation: insert, delete, traverse
• Two link lists growing from opposite ends
• Bursty operation
• mutation rate = 0.13 pointer writes/cycle
• Avg. mutation rate = 0.02 pointer writes/cycle
• Avg. allocation rate (α) = 0.07 objects/cycles
bj
/ l
• Avg. allocation rate (α) = 0 009 objects/cycle
0.009 objects/cycle
18
App+GC: Time (cycles) Bi
Binary Tree
T
Deque
19
App+GC: Time (ms), Energy (mJ)
Binary Tree
D
Deque
Time (s) = Time (M cycles) / Clock_Frequency
Clock Frequency (MHz)
Energy (J) = Power (W) * Time (s)
20
Conclusion
• First design & implementation of a GC on FPGA
• First GC to entirely eliminate mutator interference by the collector
• Big leap in the field of HLS
21
THANK YOU
http://www.research.ibm.com/liquidmetal/
22