Dynamic Selection of Application-Specific Garbage Collectors Sunil V. Soman Chandra Krintz University of California, Santa Barbara David F. Bacon IBM T.J. Watson Research Center Background VMs/managed runtimes Next generation internet computing Automatic memory management for memory safety High performance, multi-application servers GC performance impacts overall performance So GC must perform well ISMM '04 2 GC Research Focus on general-purpose GC One GC for all apps No single GC provides best performance in all cases Across applications Even for the same application given different memory constraints ISMM '04 3 Motivation SS MS GMS GSS CMS SPECjbb2000 10^6/Throughput 360 310 260 210 30 25 20 110 15 2 3 4 5 6 7 8 9 SS MS GMS GSS CMS 35 160 1 _209_db 40 Execution Time (sec) 410 10 11 12 Heap Size Relative to Min 1 2 3 4 5 6 7 8 9 10 11 Heap Size Relative to Min switch point ISMM '04 4 12 Motivation Other researchers have reported similar results [Attanasio ’01, Fitzgerald ’00] Spread between best & worst at least 15% Generational not always better Employ a VM instance built with the optimal GC Goal of our work: employ multiple GCs in the same system and switch between them ISMM '04 5 Framework Implemented in the JikesRVM Uses the Java Memory management Toolkit (JMTk) Can switch between diverse collectors Extended to enable multiple GC support Easily extensible GC System = Allocator + Collector ISMM '04 6 Framework Implementation StopTheWorld StopTheWorld VM_Processor (Plan currentPlan;) Plan {SS_Plan,MS_Plan,...} Plan (Selected at build time) SS_Plan MS_Plan CopyMS_Plan Generational GenMS_Plan GenCopy_Plan VM_Processor Most code is shared across GCs 44MB vs 42MB (average across GCs) ISMM '04 7 Framework Implementation HIGHER Nursery Mark-Sweep Alloc Unmap Example MS -> GSS MS -> GMS High Semispace Low Semispace LOWER Large Object Space GC Data Structures Immortal ISMM '04 8 HIGHER Framework Implementation Nursery Mark-Sweep Alloc No GC Do not unmap Example MS -> GSS MS -> GMS High Semispace Low Semispace LOWER Large Object Space GC Data Structures Immortal ISMM '04 9 HIGHER Framework Implementation Nursery Mark-Sweep High Semispace Low Semispace LOWER Large Object Space GC Data Structures Immortal Mem mapped on demand Unused memory unmapped Need MS & copying states Use object header Bit stealing Worst-case cost = 1 GC No need to perform GC in many cases ISMM '04 10 Making Use of GC Switching Dynamic switching guided by Cross-input offline behavior Annotation Methodology 2.4 GHz/1G single proc x86 (hyperthreading) JikesRVM v2.2.0, linux kernel 2.4.18 SpecJVM98, SpecJBB, Jolden, Javagrande Pseudo-adaptive system ISMM '04 11 Annotation-guided GC (AnnotGC) Annotation Hints about optimization opportunities Widely used to guide compiler optimization Different benchmarks & inputs examined offline Cross-input results combined Select “optimal” GC and annotate program with it Best GC for a range of heap sizes JVM at program load time Switch to annotated GC Ignore annotation if GC-Switching not avail. AnnotGC Implementation Observations Only 0 or 1 switch points per benchmark Switch point relative to min heap does not vary much across inputs Annotate min heap size & switch point 4 byte annotation in Java class file Encoded in a compact form 40 MB min assumed if not specified ISMM '04 13 AnnotGC Results SS MS GMS GSS CMS SPECjbb2000 10^6/Throughput 360 310 _209_db 260 210 40 35 30 25 160 20 110 15 1 2 3 4 5 6 7 8 9 SS MS GMS GSS CMS 45 Execution Time (sec) 410 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 Heap Size Relative to Min Heap Size Relative to Min ISMM '04 14 AnnotGC Results 360 10^6/Throughput _209_db SS MS GMS GC CMS GC Annot SPECjbb2000 310 260 210 160 40 35 30 25 20 15 110 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 Heap Size Relative to Min Heap Size Relative to Min SS MS GMS GSS CMS GC Annot 45 Execution Time (sec) 410 Degradation over best: 4% Improvement over worst: 26% Improvement over GenMS: 7% ISMM '04 15 AnnotGC Results Average Difference Between Best & Worst GC Systems Benchmarks Degradation over Best Improvement over Worst compress 6.28% (443 ms) 3.53% (279 ms) jess 2.82% (85 ms) 56.17% (5767 ms) db 2.88% (532 ms) 12.47% (3028 ms) javac 5.64% (392 ms) 24.12% (2944 ms) mpegaudio 3.54% (214 ms) 3.21% (209ms) mtrt 4.51% (270 ms) 42.29% (5170 ms) jack 3.22% (147 ms) 32.70% (2787 ms) JavaGrande 3.97% (2511 ms) 17.71% (15500 ms) SPECjbb 2.22% (3.17*106/tput) 27.95% (82.6*106/tput) MST 4.07% (30 ms) 48.42% (1001 ms) Voronoi 9.20% (144 ms) 31.78% (1063 ms) Average 4.38% 26.22% Average (without MS) 3.36% 24.13% Extending GC Switching Switch automatically Performance hit: lost optimization opportunity No offline profiling Employ execution characteristics Inlining of allocation sites Write barriers Solution: Aggressive specialization guarded by Method invalidation On-stack replacement ISMM '04 17 Investigating Auto Switching Exploit general performance characteristics Best performing GCs across heap sizes & programs Deciding when to switch Default GC: MS Heap size & heap residency threshold If residency > 60%, switch to GSS if heap size > 90MB else use GMS Different collectors for startup & steadystate ISMM '04 18 Preliminary Implementation Method invalidation - for future invocations On-stack replacement - for currently executing methods Deferred compilation & guarded inlining [Fink’04] Unconditional OsrPoints Our extension OsrInfoPoints for state info Without inserting checks in application code ISMM '04 19 AutoSwitch Results Improvement over worst: 17% Degradation over best: 15% Overhead (Annot: 26%) (Annot: 4%) OSR - negligible Recompilation - depends on where switch occurs Lost optimization opportunities all variables live at every GC point OsrInfoPoints are “pinned” down DCE, load/store elim, code motion, reg allocation ISMM '04 20 Related Work Application-specific GC studies Profile-direction GC selection [Fitzgerald’00] Comparison of GCs [Attanasio’01] Comparing gen mark-sweep & copying [Zorn’90] Switching/swapping Coupling compaction with copying [Sansom’92] Hot-swapping mark-sweep/mark-compact [Printezis’01] BEA Weblogic JRockit [BEA Workshop’03] ISMM '04 21 Conclusion Choice of best GC is app-dependent Novel framework Switch between diverse collectors Enables annotation driven & automatic switching Significantly reduce impact of choosing the "wrong" collector AnnotGC: 26%, degradation: 4%, GMS: 7% Enabled by aggressive specialization ISMM '04 22 Future Work Improving AutoSwitch Improved OSR Cuts AutoSwitch overhead by half Better heuristics for automatic switching Paper available upon request Decide when to switch & which GC to switch to High-performance application-specific VMs Guided by available resources Application characteristics Self-modifying ISMM '04 23 App-specific Garbage Collectors Front Loaders Recyclers Rear Loaders Side Loaders All products are trademarks of Heil Environmental Industries, Ltd. 2002-2003 Extra slides ISMM '04 25 Preliminary Implementation Aggressive specialization Method Invalidation Inline allocation sites Insert write barriers only when necessary Invalidated method replaced by stub On-stack Replacement (OSR) Potentially at every GC point ISMM '04 26 On-stack replacement Insert OsrInfoPoints at each GC point Keep track of state Call sites, allocation sites, prologs, backedges, explicit yieldpoints, exceptions All local/stack variables kept alive Lazy OSR foo old return address trigger OSR bar osr_helper { ... } b. OSR triggered lazily by external event ISMM '04 27 GC Performance Evaluation (2) Comparison of parallel garbage collectors [5] Results Mark-sweep, copying, Gen MS, Gen copy, hybrid (copying + MS) Hybrid has lowest heap residency Gen copy handles fragmentation better Minor collections for Gen copy faster Mark-sweep has fewer GCs No one collector best across applications ISMM '04 28 AnnotGC Results SS MS GMS GSS CMS GC Annot SPECjbb2000 10^6/Throughput 360 310 _209_db 260 210 40 35 30 25 160 20 110 15 1 2 3 4 5 6 7 8 9 SS MS GMS GSS CMS GC Annot 45 Execution Time (sec) 410 10 11 12 1 2 3 4 5 6 7 8 9 10 11 Heap Size Relative to Min Heap Size Relative to Min Degradation over best: 4% Improvement over worst: 26% Improvement over GenMS: 7% ISMM '04 29 12 GC Performance Evaluation Comparing generational mark-sweep & copying [66] Mark-sweep Copying collection Slower free-list allocator Requires less heap space: 20% less on average Fastest allocator Copying overhead Copy reserve space required Generational copying not clearly superior ISMM '04 30 GC Performance Evaluation (2) Comparison of parallel garbage collectors [5] Results Mark-sweep, copying, Gen MS, Gen copy, hybrid (copying + MS) Hybrid has lowest heap residency Gen copy handles fragmentation better Minor collections for Gen copy faster Mark-sweep has fewer GCs No one collector best across applications ISMM '04 31 GC Performance Evaluation (3) Fitzgerald et al [32] compared GCs across 20 benchmarks Every collector best at least once Spread between best & worst at least 15% Generational not always better Non-generational may outperform by 15-20% Recommend “profile-directed” selection Null collector, non-gen copying, Gen copy with various WB implementations The “best” GC chosen from different pre-compiled binaries based on profiles It is difficult or impossible to replace GCs at runtime ISMM '04 32 ISMM '04 33