slides

Java without the Coffee Breaks:
A Nonintrusive Multiprocessor
Garbage Collector
David F. Bacon
IBM T.J. Watson Research Center
Joint work with C.R. Attanasio, Han Lee,
V.T. Rajan, and Steve Smith
ACM Conference on Programming Language Design and Implemenentation
Snowbird, Utah
June 20, 2001
Overview
• Two types of Garbage Collection
– Tracing [McCarthy 1960]
– Reference Counting [Collins 1960]
• Contributions of this work:
– Improved concurrent reference counting algorithm
– Practical cycle collection algorithm
• Both stop-the-world and concurrent
– Implementation in collector w/very low pause times
Features of the Recycler
•
•
•
•


Concurrent
Multiprocessor
Reference-counted
Cycle collecting
Low-latency: under 10 ms
High-performance: 90% of stop-the-world
Why do it?
• RC has good locality properties
–
–
–
–
Will tracing collection scale to multi-GB heaps?
Effect of rising memory latency
RC is easily decoupled from mutator
Promises lower synchronization costs
• Java makes good GC a commercial requirement
– Makes sense to have “genetic diversity”
• People said it couldn’t be done
Why is it hard?
• Must defer reference counting stack
– Complicates algorithm
• Reference counts are shared variables
– Synchronization required
• Write barriers for RC more expensive
– Affect two objects
• RC doesn’t handle cycles
– Other systems use backup tracing collector
Outline
Introduction
Motivation
System Overview
• Concurrent Reference Counting
• Cycle Collection
• Measurements
• Conclusions
System Overview
• Producer/Consumer System
Mutator
Mutator
Emit inc/dec
Allocate
Emit inc/dec
Allocate
Collector
Free Memory
Process inc/dec
Collect Cycles
• Similar to Deutsch-Bobrow DRC
– As implemented by DeTreville on SRC Firefly
Implementation
• Implemented in Jalapeño JVM at IBM TJW
• All of VM, JIT, and GC are written in Java
– Extended with unsafe primitives
– Multiple GC implementations
– GC is machine-independent except for barriers
• Targets
– IBM RS/6000 multiprocessors (optimized)
– Linux/Intel (unoptimized; optimized coming)
Outline
Introduction
Motivation
System Overview
Concurrent Reference Counting (no cycles)
• Cycle Collection
• Measurements
• Conclusions
Concurrent Reference Counting
• Time divided into epochs
– All CPUs must participate before epoch advances
• Write barrier on heap updates
– inc/dec operations placed in buffer
– Objects allocated with RC=1, dec enqueued
– Decrements processed one epoch behind increments
• Stack references are deferred
– Snapshot stacks at epoch boundary
– First increment; decrement at next epoch
• Simpler invariant than Deutsch-Bobrow; no ZCT required
Ragged Barriers
CPU 1
Mutator
CPU 2
Mutator
CPU 3
Collector
Time
Epoch 7
Epoch 8
Epoch 9
Outline
Introduction
Motivation
System Overview
Concurrent Reference Counting
Cycle Collection
• Measurements
• Conclusions
Synchronous Cycle Collection
• Class loader identifies acyclic classes
– Arrays, String, and such are marked green
• Two key observations:
– Most reference counts are 1
– Garbage cycles created by decrement to non-0
• Use those objects as starting points
– DFS-based algorithm subtracts internal RC’s
– If resulting count is 0, collect cyclic garbage
• Based on algorithm by Lins, but O(n) instead of O(n2)
Root Buffer
Live
Data
12
12
012
1
102
012
012
1. Process Decrements and Accumulate Roots
2. Mark Gray: Subtract Internal Reference Counts
3. Scan: Restore Live, Mark Dead White
4. Collect White
Concurrent Cycle Collection
• Based on synchronous algorithm
– But second reference count field is used
• Relies on stability property of garbage
– If no mutation, synchronous algorithm will work
– Detect when mutation occurs and avoid collecting
• When cycle is found
– Place in a buffer, wait for next epoch
– Inc/dec stream will provide necessary information
Detecting Concurrent Mutation
• Delta test
– Detects local changes
– Observes recolorings of buffered cycle objects
• Sigma test
– Detects non-local changes
– Checks sum of RC’s = number of pointers
• For details, see ECOOP’01 paper
Outline
Introduction
Motivation
System Overview
Concurrent Reference Counting
Cycle Collection
Measurements
• Conclusions
Measurements
• Compared to parallel mark&sweep collector
– High-performance, throughput-oriented
– Stop-the-world
• Benchmarks:
– SPEC, SPECjbb, Jalapeño compiler
– Synthetic cycle garbage generator (ggauss)
• In MP mode, 1 more CPU than threads
App. Speed vs. Parallel
Mark&Sweep
1.4
Multiprocessing
Uniprocessing
Relative Speed
1.2
1
0.8
0.6
0.4
0.2
ss
au
gg
la
pe
no
b
ja
sp
ec
jb
ck
ja
trt
ud
ga
pe
m
m
io
c
va
ja
db
je
ss
ra
yt
ra
ce
co
m
pr
es
s
0
m
ga
ud
c
io
va
m
trt
ja
c
sp k
ec
jb
b
ja
la
pe
n
gg o
au
ss
pe
ja
db
1000000
ce
10000000
ra
s
ss
es
je
pr
yt
m
ra
co
Max. Pause Time (us)
Pause Time vs. Parallel Mark&Sweep
RC Pause
M&S Pause
100000
10000
1000
100
Cycle Collection: Buffering Roots
100%
Roots
70%
60%
50%
Unbuffered
Freed
Repeat
Acyclic
40%
30%
20%
ga c
ud
io
m
trt
ja
s p ck
ec
ja jbb
la
pe
g g no
au
ss
va
m
pe
ja
db
j
ra e s s
yt
ra
ce
m
pr
es
s
10%
0%
co
Potential Roots
90%
80%
db
ce
ec
jb
b
ja
la
pe
no
gg
au
ss
sp
ja
ck
m
trt
va
m
c*
pe
ga
ud
io
ja
ra
*
s
ss
es
je
pr
yt
m
ra
co
Refs. Traced (millions)
Reference Tracing vs. Mark&Sweep
180
160
RC
140
M&S
120
100
80
60
40
20
0
Related Work
• Dijkstra et al [1976], Steele [1975,1976], Lamport
[1976]
• DeTreville [1990]
• Doligez, Leroy, Gonthier [1993,1994]
• Domani, Kolodner, Petrank [2000]
• Huelsbergen et al [1993,1999]
• Martínez et al [1990], Lins [1992]
• Plakal & Fischer [2001], Levanoni & Petrank [2001]
Conclusions
• Recycler sets new benchmark for concurrent GC
– 2.6 ms max. pause time for general-purpose programs
– End-to-end execution times comparable
• RC can perform very well in concurrent system
– No synchronization required in common case
– Standard RC problems can be overcome
• Concurrent cycle collection works
– Viable alternative to backup mark & sweep
– More study needed to reduce memory tracing costs
More Information
• The Recycler (including this presentation)
• Cycle Collection Algorithm/Proof (ECOOP’01)
– www.research.ibm.com/people/d/dfb
• Jalapeño
– www.research.ibm.com/jalapeno
• Jalapeño BoF Thursday 7pm
Addendum
• Algorithm Animations
– Reference counting across multiple CPUs
– Successful acyclic cyclic collection
– Aborted acyclic cycle collection
• Additional Measurements
– Collection cost breakdown
– Objects freed by cycle collection
– Allocation and mutation rates
CPU 1
CPU 2
Collector
CPU
Root Buffer
Cycle Buffer
Live
Data
12
1
2
1
01
1
01201
1
2
1021
21
0201
1. Process Decrements
2. Mark Gray
3. Scan
4. Collect White
011201
5. Await next epoch
6. If still orange, GC
7. If changed, restore
Root Buffer
Cycle Buffer
Live
Data
1
121
1
1201
2
211
32
312302
1. Process Decrements
2. Mark Gray
3. Scan
4. Collect White
5. Calculate external in-degree
1
12010
6. Await next epoch
7. Compute in-degree sum
8. If 0, GC/decrement neighbors
9. If non-0, restore
Collection Cost Breakdown
100%
90%
Free Cycles
Validate
Collect
Scan
Mark
Purge
Increment
Decrement
70%
60%
50%
40%
30%
20%
10%
m
trt
ja
c
sp k
ec
j
ja bb
la
pe
n
gg o
au
ss
io
ud
c
pe
ga
va
db
ja
m
yt
ra
ce
ss
je
ra
m
pr
es
s
0%
co
Collection Time
80%
m
la
au
ss
no
b
ck
jb
ja
pe
gg
ja
io
m
trt
ud
c
db
ce
va
ec
ga
sp
pe
ja
ra
s
ss
es
je
pr
yt
m
ra
co
Objects Freed
Objects Freed by Cycle Collection
40%
35%
30%
25%
20%
15%
10%
5%
0%
m
gg
ss
no
au
pe
K Dec/s
la
350
b
ck
K Inc/s
jb
ja
400
ja
io
m
trt
ud
c
db
ce
va
ec
ga
sp
pe
ja
ra
s
ss
es
je
pr
yt
m
ra
co
K Mutations/sec
Mutation Rate (Barrier Frequency)
450
300
250
200
150
100
50
0
m
ga
ud
c
io
va
m
trt
ja
ck
sp
ec
jb
b
ja
la
pe
no
gg
au
ss
pe
ja
ce
db
ra
s
ss
es
je
pr
yt
m
ra
co
K Allocations/sec
Allocation Rate
200
180
160
140
120
100
80
60
40
20
0