Java without the Coffee Breaks: A Nonintrusive Multiprocessor Garbage Collector David F. Bacon IBM T.J. Watson Research Center Joint work with C.R. Attanasio, Han Lee, V.T. Rajan, and Steve Smith ACM Conference on Programming Language Design and Implemenentation Snowbird, Utah June 20, 2001 Overview • Two types of Garbage Collection – Tracing [McCarthy 1960] – Reference Counting [Collins 1960] • Contributions of this work: – Improved concurrent reference counting algorithm – Practical cycle collection algorithm • Both stop-the-world and concurrent – Implementation in collector w/very low pause times Features of the Recycler • • • • Concurrent Multiprocessor Reference-counted Cycle collecting Low-latency: under 10 ms High-performance: 90% of stop-the-world Why do it? • RC has good locality properties – – – – Will tracing collection scale to multi-GB heaps? Effect of rising memory latency RC is easily decoupled from mutator Promises lower synchronization costs • Java makes good GC a commercial requirement – Makes sense to have “genetic diversity” • People said it couldn’t be done Why is it hard? • Must defer reference counting stack – Complicates algorithm • Reference counts are shared variables – Synchronization required • Write barriers for RC more expensive – Affect two objects • RC doesn’t handle cycles – Other systems use backup tracing collector Outline Introduction Motivation System Overview • Concurrent Reference Counting • Cycle Collection • Measurements • Conclusions System Overview • Producer/Consumer System Mutator Mutator Emit inc/dec Allocate Emit inc/dec Allocate Collector Free Memory Process inc/dec Collect Cycles • Similar to Deutsch-Bobrow DRC – As implemented by DeTreville on SRC Firefly Implementation • Implemented in Jalapeño JVM at IBM TJW • All of VM, JIT, and GC are written in Java – Extended with unsafe primitives – Multiple GC implementations – GC is machine-independent except for barriers • Targets – IBM RS/6000 multiprocessors (optimized) – Linux/Intel (unoptimized; optimized coming) Outline Introduction Motivation System Overview Concurrent Reference Counting (no cycles) • Cycle Collection • Measurements • Conclusions Concurrent Reference Counting • Time divided into epochs – All CPUs must participate before epoch advances • Write barrier on heap updates – inc/dec operations placed in buffer – Objects allocated with RC=1, dec enqueued – Decrements processed one epoch behind increments • Stack references are deferred – Snapshot stacks at epoch boundary – First increment; decrement at next epoch • Simpler invariant than Deutsch-Bobrow; no ZCT required Ragged Barriers CPU 1 Mutator CPU 2 Mutator CPU 3 Collector Time Epoch 7 Epoch 8 Epoch 9 Outline Introduction Motivation System Overview Concurrent Reference Counting Cycle Collection • Measurements • Conclusions Synchronous Cycle Collection • Class loader identifies acyclic classes – Arrays, String, and such are marked green • Two key observations: – Most reference counts are 1 – Garbage cycles created by decrement to non-0 • Use those objects as starting points – DFS-based algorithm subtracts internal RC’s – If resulting count is 0, collect cyclic garbage • Based on algorithm by Lins, but O(n) instead of O(n2) Root Buffer Live Data 12 12 012 1 102 012 012 1. Process Decrements and Accumulate Roots 2. Mark Gray: Subtract Internal Reference Counts 3. Scan: Restore Live, Mark Dead White 4. Collect White Concurrent Cycle Collection • Based on synchronous algorithm – But second reference count field is used • Relies on stability property of garbage – If no mutation, synchronous algorithm will work – Detect when mutation occurs and avoid collecting • When cycle is found – Place in a buffer, wait for next epoch – Inc/dec stream will provide necessary information Detecting Concurrent Mutation • Delta test – Detects local changes – Observes recolorings of buffered cycle objects • Sigma test – Detects non-local changes – Checks sum of RC’s = number of pointers • For details, see ECOOP’01 paper Outline Introduction Motivation System Overview Concurrent Reference Counting Cycle Collection Measurements • Conclusions Measurements • Compared to parallel mark&sweep collector – High-performance, throughput-oriented – Stop-the-world • Benchmarks: – SPEC, SPECjbb, Jalapeño compiler – Synthetic cycle garbage generator (ggauss) • In MP mode, 1 more CPU than threads App. Speed vs. Parallel Mark&Sweep 1.4 Multiprocessing Uniprocessing Relative Speed 1.2 1 0.8 0.6 0.4 0.2 ss au gg la pe no b ja sp ec jb ck ja trt ud ga pe m m io c va ja db je ss ra yt ra ce co m pr es s 0 m ga ud c io va m trt ja c sp k ec jb b ja la pe n gg o au ss pe ja db 1000000 ce 10000000 ra s ss es je pr yt m ra co Max. Pause Time (us) Pause Time vs. Parallel Mark&Sweep RC Pause M&S Pause 100000 10000 1000 100 Cycle Collection: Buffering Roots 100% Roots 70% 60% 50% Unbuffered Freed Repeat Acyclic 40% 30% 20% ga c ud io m trt ja s p ck ec ja jbb la pe g g no au ss va m pe ja db j ra e s s yt ra ce m pr es s 10% 0% co Potential Roots 90% 80% db ce ec jb b ja la pe no gg au ss sp ja ck m trt va m c* pe ga ud io ja ra * s ss es je pr yt m ra co Refs. Traced (millions) Reference Tracing vs. Mark&Sweep 180 160 RC 140 M&S 120 100 80 60 40 20 0 Related Work • Dijkstra et al [1976], Steele [1975,1976], Lamport [1976] • DeTreville [1990] • Doligez, Leroy, Gonthier [1993,1994] • Domani, Kolodner, Petrank [2000] • Huelsbergen et al [1993,1999] • Martínez et al [1990], Lins [1992] • Plakal & Fischer [2001], Levanoni & Petrank [2001] Conclusions • Recycler sets new benchmark for concurrent GC – 2.6 ms max. pause time for general-purpose programs – End-to-end execution times comparable • RC can perform very well in concurrent system – No synchronization required in common case – Standard RC problems can be overcome • Concurrent cycle collection works – Viable alternative to backup mark & sweep – More study needed to reduce memory tracing costs More Information • The Recycler (including this presentation) • Cycle Collection Algorithm/Proof (ECOOP’01) – www.research.ibm.com/people/d/dfb • Jalapeño – www.research.ibm.com/jalapeno • Jalapeño BoF Thursday 7pm Addendum • Algorithm Animations – Reference counting across multiple CPUs – Successful acyclic cyclic collection – Aborted acyclic cycle collection • Additional Measurements – Collection cost breakdown – Objects freed by cycle collection – Allocation and mutation rates CPU 1 CPU 2 Collector CPU Root Buffer Cycle Buffer Live Data 12 1 2 1 01 1 01201 1 2 1021 21 0201 1. Process Decrements 2. Mark Gray 3. Scan 4. Collect White 011201 5. Await next epoch 6. If still orange, GC 7. If changed, restore Root Buffer Cycle Buffer Live Data 1 121 1 1201 2 211 32 312302 1. Process Decrements 2. Mark Gray 3. Scan 4. Collect White 5. Calculate external in-degree 1 12010 6. Await next epoch 7. Compute in-degree sum 8. If 0, GC/decrement neighbors 9. If non-0, restore Collection Cost Breakdown 100% 90% Free Cycles Validate Collect Scan Mark Purge Increment Decrement 70% 60% 50% 40% 30% 20% 10% m trt ja c sp k ec j ja bb la pe n gg o au ss io ud c pe ga va db ja m yt ra ce ss je ra m pr es s 0% co Collection Time 80% m la au ss no b ck jb ja pe gg ja io m trt ud c db ce va ec ga sp pe ja ra s ss es je pr yt m ra co Objects Freed Objects Freed by Cycle Collection 40% 35% 30% 25% 20% 15% 10% 5% 0% m gg ss no au pe K Dec/s la 350 b ck K Inc/s jb ja 400 ja io m trt ud c db ce va ec ga sp pe ja ra s ss es je pr yt m ra co K Mutations/sec Mutation Rate (Barrier Frequency) 450 300 250 200 150 100 50 0 m ga ud c io va m trt ja ck sp ec jb b ja la pe no gg au ss pe ja ce db ra s ss es je pr yt m ra co K Allocations/sec Allocation Rate 200 180 160 140 120 100 80 60 40 20 0