Dynamic Selection of
Application-Specific Garbage
Collectors
Sunil V. Soman
Chandra Krintz
University of California, Santa Barbara
David F. Bacon
IBM T.J. Watson Research Center
Background
VMs/managed runtimes
Next generation internet computing
Automatic memory management for memory
safety
High performance, multi-application servers
GC performance impacts overall
performance
So GC must perform well
ISMM '04
2
GC Research
Focus on general-purpose GC
One GC for all apps
No single GC provides best
performance in all cases
Across applications
Even for the same application given
different memory constraints
ISMM '04
3
Motivation
SS
MS
GMS
GSS
CMS
SPECjbb2000
10^6/Throughput
360
310
260
210
30
25
20
110
15
2
3
4
5
6
7
8
9
SS
MS
GMS
GSS
CMS
35
160
1
_209_db
40
Execution Time (sec)
410
10 11 12
Heap Size Relative to Min
1
2
3
4
5
6
7
8
9
10 11
Heap Size Relative to Min
switch point
ISMM '04
4
12
Motivation
Other researchers have reported similar
results [Attanasio ’01, Fitzgerald ’00]
Spread between best & worst at least 15%
Generational not always better
Employ a VM instance built with the
optimal GC
Goal of our work: employ multiple
GCs in the same system and switch
between them
ISMM '04
5
Framework
Implemented in the JikesRVM
Uses the Java Memory management Toolkit
(JMTk)
Can switch between diverse collectors
Extended to enable multiple GC support
Easily extensible
GC System = Allocator + Collector
ISMM '04
6
Framework Implementation
StopTheWorld
StopTheWorld
VM_Processor
(Plan currentPlan;)
Plan
{SS_Plan,MS_Plan,...}
Plan
(Selected at build time)
SS_Plan
MS_Plan
CopyMS_Plan
Generational
GenMS_Plan
GenCopy_Plan
VM_Processor
Most code is shared across GCs
44MB vs 42MB (average across GCs)
ISMM '04
7
Framework Implementation
HIGHER
Nursery
Mark-Sweep
Alloc
Unmap
Example
MS -> GSS
MS -> GMS
High Semispace
Low Semispace
LOWER
Large Object Space
GC Data Structures
Immortal
ISMM '04
8
HIGHER
Framework Implementation
Nursery
Mark-Sweep
Alloc
No GC
Do not unmap
Example
MS -> GSS
MS -> GMS
High Semispace
Low Semispace
LOWER
Large Object Space
GC Data Structures
Immortal
ISMM '04
9
HIGHER
Framework Implementation
Nursery
Mark-Sweep
High Semispace
Low Semispace
LOWER
Large Object Space
GC Data Structures
Immortal
Mem mapped on demand
Unused memory unmapped
Need MS & copying states
Use object header
Bit stealing
Worst-case cost = 1 GC
No need to perform GC in
many cases
ISMM '04
10
Making Use of GC Switching
Dynamic switching guided by
Cross-input offline behavior
Annotation
Methodology
2.4 GHz/1G single proc x86 (hyperthreading)
JikesRVM v2.2.0, linux kernel 2.4.18
SpecJVM98, SpecJBB, Jolden, Javagrande
Pseudo-adaptive system
ISMM '04
11
Annotation-guided GC (AnnotGC)
Annotation
Hints about optimization opportunities
Widely used to guide compiler optimization
Different benchmarks & inputs examined offline
Cross-input results combined
Select “optimal” GC and annotate program with it
Best GC for a range of heap sizes
JVM at program load time
Switch to annotated GC
Ignore annotation if GC-Switching not avail.
AnnotGC Implementation
Observations
Only 0 or 1 switch points per benchmark
Switch point relative to min heap does not
vary much across inputs
Annotate min heap size & switch point
4 byte annotation in Java class file
Encoded in a compact form
40 MB min assumed if not specified
ISMM '04
13
AnnotGC Results
SS
MS
GMS
GSS
CMS
SPECjbb2000
10^6/Throughput
360
310
_209_db
260
210
40
35
30
25
160
20
110
15
1
2
3
4
5
6
7
8
9
SS
MS
GMS
GSS
CMS
45
Execution Time (sec)
410
10 11 12
1
2
3
4
5
6
7
8
9
10 11 12
Heap Size Relative to Min
Heap Size Relative to Min
ISMM '04
14
AnnotGC Results
360
10^6/Throughput
_209_db
SS
MS
GMS
GC
CMS
GC Annot
SPECjbb2000
310
260
210
160
40
35
30
25
20
15
110
1
2
3
4
5
6
7
8
9
10 11 12
1
2
3
4
5
6
7
8
9
10 11 12
Heap Size Relative to Min
Heap Size Relative to Min
SS
MS
GMS
GSS
CMS
GC Annot
45
Execution Time (sec)
410
Degradation over best: 4%
Improvement over worst: 26%
Improvement over GenMS: 7%
ISMM '04
15
AnnotGC Results
Average Difference Between Best & Worst GC Systems
Benchmarks
Degradation over Best
Improvement over
Worst
compress
6.28% (443 ms)
3.53% (279 ms)
jess
2.82% (85 ms)
56.17% (5767 ms)
db
2.88% (532 ms)
12.47% (3028 ms)
javac
5.64% (392 ms)
24.12% (2944 ms)
mpegaudio
3.54% (214 ms)
3.21% (209ms)
mtrt
4.51% (270 ms)
42.29% (5170 ms)
jack
3.22% (147 ms)
32.70% (2787 ms)
JavaGrande
3.97% (2511 ms)
17.71% (15500 ms)
SPECjbb
2.22% (3.17*106/tput)
27.95% (82.6*106/tput)
MST
4.07% (30 ms)
48.42% (1001 ms)
Voronoi
9.20% (144 ms)
31.78% (1063 ms)
Average
4.38%
26.22%
Average (without MS)
3.36%
24.13%
Extending GC Switching
Switch automatically
Performance hit: lost optimization opportunity
No offline profiling
Employ execution characteristics
Inlining of allocation sites
Write barriers
Solution: Aggressive specialization guarded by
Method invalidation
On-stack replacement
ISMM '04
17
Investigating Auto Switching
Exploit general performance characteristics
Best performing GCs across heap sizes & programs
Deciding when to switch
Default GC: MS
Heap size & heap residency threshold
If residency > 60%,
switch to GSS if heap size > 90MB
else use GMS
Different collectors for startup & steadystate
ISMM '04
18
Preliminary Implementation
Method invalidation - for future invocations
On-stack replacement - for currently executing
methods
Deferred compilation & guarded inlining [Fink’04]
Unconditional OsrPoints
Our extension
OsrInfoPoints for state info
Without inserting checks in application code
ISMM '04
19
AutoSwitch Results
Improvement over worst: 17%
Degradation over best: 15%
Overhead
(Annot: 26%)
(Annot: 4%)
OSR - negligible
Recompilation - depends on where switch occurs
Lost optimization opportunities
all variables live at every GC point
OsrInfoPoints are “pinned” down
DCE, load/store elim, code motion, reg allocation
ISMM '04
20
Related Work
Application-specific GC studies
Profile-direction GC selection [Fitzgerald’00]
Comparison of GCs [Attanasio’01]
Comparing gen mark-sweep & copying [Zorn’90]
Switching/swapping
Coupling compaction with copying [Sansom’92]
Hot-swapping mark-sweep/mark-compact
[Printezis’01]
BEA Weblogic JRockit [BEA Workshop’03]
ISMM '04
21
Conclusion
Choice of best GC is app-dependent
Novel framework
Switch between diverse collectors
Enables annotation driven & automatic
switching
Significantly reduce impact of choosing
the "wrong" collector
AnnotGC: 26%, degradation: 4%, GMS: 7%
Enabled by aggressive specialization
ISMM '04
22
Future Work
Improving AutoSwitch
Improved OSR
Cuts AutoSwitch overhead by half
Better heuristics for automatic switching
Paper available upon request
Decide when to switch & which GC to switch to
High-performance application-specific VMs
Guided by available resources
Application characteristics
Self-modifying
ISMM '04
23
App-specific Garbage Collectors
Front Loaders
Recyclers
Rear Loaders
Side Loaders
All products are trademarks of Heil Environmental Industries, Ltd. 2002-2003
Extra slides
ISMM '04
25
Preliminary Implementation
Aggressive specialization
Method Invalidation
Inline allocation sites
Insert write barriers only when necessary
Invalidated method replaced by stub
On-stack Replacement (OSR)
Potentially at every GC point
ISMM '04
26
On-stack replacement
Insert OsrInfoPoints at each GC point
Keep track of state
Call sites, allocation sites, prologs, backedges,
explicit yieldpoints, exceptions
All local/stack variables kept alive
Lazy OSR
foo
old return address
trigger OSR
bar
osr_helper {
...
}
b. OSR triggered lazily by external event
ISMM '04
27
GC Performance Evaluation (2)
Comparison of parallel garbage collectors [5]
Results
Mark-sweep, copying, Gen MS, Gen copy, hybrid
(copying + MS)
Hybrid has lowest heap residency
Gen copy handles fragmentation better
Minor collections for Gen copy faster
Mark-sweep has fewer GCs
No one collector best across applications
ISMM '04
28
AnnotGC Results
SS
MS
GMS
GSS
CMS
GC Annot
SPECjbb2000
10^6/Throughput
360
310
_209_db
260
210
40
35
30
25
160
20
110
15
1
2
3
4
5
6
7
8
9
SS
MS
GMS
GSS
CMS
GC Annot
45
Execution Time (sec)
410
10 11 12
1
2
3
4
5
6
7
8
9
10 11
Heap Size Relative to Min
Heap Size Relative to Min
Degradation over best: 4%
Improvement over worst: 26%
Improvement over GenMS: 7%
ISMM '04
29
12
GC Performance Evaluation
Comparing generational mark-sweep & copying [66]
Mark-sweep
Copying collection
Slower free-list allocator
Requires less heap space: 20% less on average
Fastest allocator
Copying overhead
Copy reserve space required
Generational copying not clearly superior
ISMM '04
30
GC Performance Evaluation (2)
Comparison of parallel garbage collectors [5]
Results
Mark-sweep, copying, Gen MS, Gen copy, hybrid
(copying + MS)
Hybrid has lowest heap residency
Gen copy handles fragmentation better
Minor collections for Gen copy faster
Mark-sweep has fewer GCs
No one collector best across applications
ISMM '04
31
GC Performance Evaluation (3)
Fitzgerald et al [32] compared GCs across 20
benchmarks
Every collector best at least once
Spread between best & worst at least 15%
Generational not always better
Non-generational may outperform by 15-20%
Recommend “profile-directed” selection
Null collector, non-gen copying, Gen copy with various
WB implementations
The “best” GC chosen from different pre-compiled binaries
based on profiles
It is difficult or impossible to replace GCs at runtime
ISMM '04
32
ISMM '04
33