The Metronome: A Hard Real-time Garbage Collector
David F. Bacon, Perry Cheng, David Grove, V.T. Rajan, Martin Vechev
IBM T.J. Watson Research Center

Real Time: The Final Frontier
• Real-time systems growing in importance
  – Becoming more complex
  – BMW CEO: "80% of innovation in SW"
• Programmers left behind
  – Still using assembler and C
  – Lack productivity advantages of Java
  – Often must co-exist w/ large, non-real-time Java application
• Result
  – Complexity
  – Low reliability or very high validation cost

Scope
• Time-sensitive applications
  – Traditional "hard" real-time systems
  – Soft real-time systems (games, media, ...)
  – OS components: device drivers, fault handlers, ...
• Uniprocessor
  – Multiprocessors rare in real-time systems
  – Complication: collector must be finely interleaved
  – Simplification: memory model easier to program
    • No truly concurrent operations
    • Sequentially consistent

Metronome Project Goals
• Make GC feasible for hard real-time systems
• Provide a simple application interface
• Develop technology that is efficient:
  – Throughput and space comparable to stop-the-world
• BREAK THE MILLISECOND BARRIER
  – While providing even CPU utilization

Where's the Beef?
• Maximum pause time < 4 ms
• Utilization > 50% ± 2%
• Memory requirement < 2 × max live

Controlling Time: Scheduling Collection

The Baker Fallacy
• "A real-time list processing system is one in which the time required by the elementary list processing operations ... is bounded by a small constant" [Baker '78]

Why Doesn't It Work?
• Operations not smooth
• Constant not so small
• Traps are not smooth

Scheduling: Work vs. Time
• Baker approach is work-based
  – On barriers/allocations do a little GC work
  – For constant k, utilization may drop to 1/k
• Metronome is time-based
  – Tick (user), Tock (GC)
  – Utilization is guaranteed to be tick/(tick+tock)
• Minimum Mutator Utilization (MMU)
  – Issue: can we keep up with the mutator?
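The tick/tock guarantee above can be made concrete with a minimal sketch, assuming an idealized schedule that strictly alternates a mutator quantum (tick) and a collector quantum (tock). The class and method names are illustrative, not from the Metronome implementation:

```java
// Sketch of time-based GC scheduling: the mutator runs for `tick`
// seconds, then the collector runs for `tock` seconds, repeating.
// Utilization over whole periods is tick / (tick + tock), which is
// what time-based scheduling guarantees (unlike work-based, where
// utilization can collapse during allocation bursts).
public class TimeBasedSchedule {
    final double tick; // mutator quantum (seconds)
    final double tock; // collector quantum (seconds)

    TimeBasedSchedule(double tick, double tock) {
        this.tick = tick;
        this.tock = tock;
    }

    // Guaranteed long-run mutator utilization.
    double utilization() {
        return tick / (tick + tock);
    }

    // Mutator CPU time received in a window of length w that starts
    // at the beginning of a mutator quantum -- the quantity used to
    // reason about MMU for windows aligned to the schedule.
    double mutatorTime(double w) {
        double period = tick + tock;
        double full = Math.floor(w / period);
        double rest = w - full * period;
        return full * tick + Math.min(rest, tick);
    }

    public static void main(String[] args) {
        // Example: 10 ms mutator quantum, 10 ms collector quantum
        TimeBasedSchedule s = new TimeBasedSchedule(0.010, 0.010);
        System.out.println(s.utilization()); // prints 0.5
        System.out.println(s.mutatorTime(0.015));
    }
}
```

With equal 10 ms quanta this gives the 50% utilization target from the "Where's the Beef?" slide; shrinking tock relative to tick trades collector progress for mutator utilization.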
Pause Time Distribution: javac
[Figure: histograms of pause times for javac. Under work-based scheduling, pauses form a long tail past the 12 ms mark, reaching ~35 ms; under time-based scheduling, pauses cluster tightly below 12 ms.]

Utilization vs. Time: javac
[Figure: mutator CPU utilization over a ~30 s run. Time-based scheduling holds utilization steady near the target; work-based scheduling shows deep drops (to ~0.45 and below) during collection.]

Space Usage: javac
[Figure: memory in use over time under time-based and work-based scheduling, plotted against maximum live data and the GC trigger.]

Parameterization
[Diagram: the tuner relates user-visible parameters to mutator and collector characteristics.]
• User-visible: Δt (real-time interval), s (maximum used memory), u (CPU utilization)
• Mutator: a*(ΔGC) (allocation rate), m (maximum live memory)
• Collector: R (collection rate)

Controlling Space: Layout and Compaction

Two Basic Organizations
• Semi-space copying
  – Concurrently copies between spaces
  – Cost of copying, consistency
  – Fragmentation 2×, but bounded
• Mark-sweep (non-copying)
  – Fragmentation avoidance, coalescing
  – Fragmentation unbounded

Segregated Free-list Architecture
• Blocks in geometrically spaced size classes (e.g. 12, 16, 24, ...)
• Suffers from 5 (!) kinds of fragmentation
  – Internal and page-internal: bound (e.g. 1/8)
  – External: break up large objects (arraylets)
  – Page external: defragment
  – Size external: eat it (up to 1 page/size class)

Selective Defragmentation
• When do we move objects?
  – When there is fragmentation
• Usually: program exhibits locality of size (λ)
  – Dead objects are re-used quickly
  – Fragmentation is low
• But: defragment either when
  – Dead objects are not re-used for n GC cycles
  – Free pages fall below the limit for performing a GC
• In practice: we move 2–3% of live data
  – Major improvement over a copying collector
  – Very resilient to adversary programs

Metronome Adapts
• Semi-space copying
  – Concurrently copies between spaces
  – Fragmentation 2×, but bounded
• Mark-sweep (non-copying)
  – Fragmentation avoidance, coalescing
  – Fragmentation unbounded
• The Metronome
  – Mark-sweep with selective defragmentation
  – Best of both, adaptively adjusts

Metronome: The Next Generation

Time to Get Serious
• Significant interest
  – But Jikes RVM is not product-level
• Pause time/MMU meets a wide class of needs
  – But utilization during GC is an issue
• Next-generation implementation
  – In IBM's J9 product (no commitment expressed...)
  – Higher performance
    • Generational
  – Native threads (challenging)
  – Binary distribution to groups at Berkeley, Salzburg

Generational RTGC
[Diagram: allocation at rate a into a nursery; survivors promoted to the heap at reduced rate a′.]
• Allocation rate is the single biggest factor
  – Nursery reduces the heap allocation rate
• Our STW GC: collects 64K in 4 ms (ARM)
  – So, collect a small nursery synchronously
• Survival rate usually < 15% (with tricks)
  – But predictability is harder (interface/methodology issue)

Conclusions
• The Metronome provides true real-time GC
  – First collector to do so without major sacrifice
    • Short pauses (4 ms)
    • High MMU during collection (50%)
    • Low memory consumption (2× max live)
• Critical features
  – Time-based scheduling
  – Hybrid, mostly non-copying approach
  – Integration w/ compiler

Future Work
• Two main goals:
  – Reduce pause time and memory requirement
  – Increase predictability
• Pause time:
  – Expect sub-millisecond using current techniques
  – For 10s of microseconds, need an interrupt-based approach
• Predictability
  – Studying parameterization of the collector
  – Good research area

Backup Slides

Components
of the Metronome
• Incremental mark-sweep collector
  – Mark phase fixes stale pointers
• Selective incremental defragmentation
  – Moves < 2% of traced objects
• Time-based scheduling
• Segregated free-list allocator
  – Geometric size progression limits internal fragmentation

Support Technologies
• Read barrier: to-space invariant [Brooks]
  – New techniques with only 4% overhead
• Write barrier: snapshot-at-the-beginning [Yuasa]
• Arraylets: bound fragmentation, large object ops

Limiting Internal Fragmentation
• Choose page size P and block sizes sk such that
  – sk = sk-1(1+ρ)
  – smax = Pρ
• Then
  – Internal and page-internal fragmentation < ρ
• Example:
  – P = 16KB, ρ = 1/8, smax = 2KB
  – Internal and page-internal fragmentation < 1/8

Write Barrier: Snapshot-at-start
• Problem: mutator changes the object graph
• Solution: write barrier prevents lost objects
• Logically, the collector takes an atomic snapshot
  – Objects live at the snapshot will not be collected
  – Write barrier saves overwritten pointers [Yuasa]
  – Write buffer must be drained periodically

Read Barrier: To-space Invariant
• Problem: collector moves objects (defragmentation)
  – and the mutator is finely interleaved
• Solution: read barrier ensures consistency
  – Each object contains a forwarding pointer [Brooks]
  – Read barrier unconditionally forwards all pointers
  – Mutator never sees old versions of objects
[Figure: before defragmentation, X, Y, Z point to A in from-space; after, the forwarding pointer redirects all accesses to the copy A′ in to-space.]

Incremental Mark-Sweep
• Mark/sweep finely interleaved with the mutator
• Write barrier ensures no lost objects
  – Must treat objects in the write buffer as roots
• Read barrier ensures consistency
  – Marker always traces the correct object
• With barriers, interleaving is simple

Pointer Fixup During Mark
• When can a moved object be freed?
  – When there are no more pointers to it
• Mark phase updates pointers
  – Redirects forwarded pointers as it marks them
• An object moved in collection n can be freed:
  – At the end of the mark phase of collection n+1

Work-based Scheduling
• Trigger the collector to collect CW bytes
  – Whenever the mutator allocates QW bytes
[Figure: MMU vs. window size and space vs. time for smooth, uneven, and high allocation rates under work-based scheduling.]

Time-based Scheduling
• Trigger the collector to run for CT seconds
  – Whenever the mutator runs for QT seconds
[Figure: MMU vs. window size and space vs. time for smooth, uneven, and high allocation rates under time-based scheduling.]

Fragmentation: ρ = 1/8 vs ρ = 1/2
[Figure: breakdown of internal, page-internal, external, recently dead, and live memory for the SPECjvm98 benchmarks (compress, db, jack, javac, jess, mpegaudio, mtrt).]

Read Barrier Optimizations
• Barrier variants: when to redirect
  – Lazy: easier for the collector (no fixup)
  – Eager: better for performance (loop over a[i])
• Standard optimizations: CSE, code motion
• Problem: pointers can be null
  – Augment the read barrier for GetField(x, offset):
      tmp = x[offset];
      return tmp == null ? null : tmp[redirect];
  – Optimize by null-check combining, sinking

Read Barrier Results
• Conventional wisdom: read barriers too slow
  – Previous studies: 20–40% overhead [Zorn, Nielsen]
[Figure: lazy vs. eager read barrier overhead (%) across the SPECjvm98 benchmarks and their geometric mean; overhead stays far below prior 20–40% reports.]

Arraylets
• Large arrays create problems
  – Fragment memory space
  – Cannot be moved in a short, bounded time
• Solution: break large arrays into arraylets
  – Access via indirection; move one arraylet at a time
• Optimizations
  – Type-dependent code optimized for the contiguous case
  – Opportunistic contiguous allocation
[Diagram: array A split into arraylets A1, A2, A3.]

Locality of Size: λ
• Measures reuse of freed memory by subsequent allocation
• Normalized: 0 ≤ λ ≤ 1
• Segregated by size
• λ = Σi min(fi / f, ai / a)
[Figure: allocated vs. freed bytes over time; under perfect reuse, freed bytes are matched by allocations in the same size classes.]

λ in the Real-time GC Context
• Mark-sweep (non-copying): assumes λ = 1
• Semi-space copying: assumes λ = 0
• The Metronome: adapts as λ varies

λ in Practice
[Figure: λ and % defragmented for javac, db, jack, mtrt, jess, and fragger.]

Triggering Collection
[Figure: free pages over time across alternating mark-sweep (MS 1, MS 2, MS 3) and defragmentation (DF 1, DF 2) phases; collection is triggered when free pages fall below a threshold.]

Factors in Space Consumption
[Figure: space consumption across the same MS/DF phases as a function of λ and the collection rate R.]

MMU: javac
[Figure: minimum mutator utilization vs. window size Δt (log scale, 10^-2 s to 10^1 s) for time-based vs. work-based scheduling.]
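The λ formula on the Locality of Size slide can be computed directly from per-size-class allocation and free volumes. A minimal sketch (class name and array layout are illustrative; the Metronome gathers these statistics inside the allocator):

```java
// Locality of size: lambda = sum_i min(f_i / f, a_i / a), where
// a_i and f_i are bytes allocated and freed in size class i over an
// interval, and a and f are the totals. lambda = 1 means freed memory
// is fully reused in the same size classes (mark-sweep is enough);
// lambda = 0 means no reuse (the copying-collector assumption).
public class LocalityOfSize {
    // allocated[i], freed[i]: bytes per size class over one interval
    static double lambda(double[] allocated, double[] freed) {
        double a = 0, f = 0;
        for (double x : allocated) a += x;
        for (double x : freed) f += x;
        if (a == 0 || f == 0) return 0;
        double sum = 0;
        for (int i = 0; i < allocated.length; i++) {
            sum += Math.min(freed[i] / f, allocated[i] / a);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Perfect reuse: allocation profile matches the freed profile
        System.out.println(lambda(new double[]{10, 30}, new double[]{10, 30})); // 1.0
        // No reuse: frees and allocations in disjoint size classes
        System.out.println(lambda(new double[]{10, 0}, new double[]{0, 10}));   // 0.0
    }
}
```

Intermediate values are where the Metronome's adaptivity matters: it defragments only the size classes whose freed pages are not being reabsorbed by new allocation.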