slides

Dynamic Selection of
Application-Specific Garbage
Collectors
Sunil V. Soman
Chandra Krintz
University of California, Santa Barbara
David F. Bacon
IBM T.J. Watson Research Center
Background

VMs/managed runtimes
Next generation internet computing
 Automatic memory management for memory
safety
 High performance, multi-application servers


GC performance impacts overall
performance
 So GC must perform well
ISMM '04
2
GC Research

Focus on general-purpose GC


One GC for all apps
No single GC provides best
performance in all cases
Across applications
 Even for the same application given
different memory constraints

ISMM '04
3
Motivation
SS
MS
GMS
GSS
CMS
SPECjbb2000
10^6/Throughput
360
310
260
210
30
25
20
110
15
2
3
4
5
6
7
8
9
SS
MS
GMS
GSS
CMS
35
160
1
_209_db
40
Execution Time (sec)
410
10 11 12
Heap Size Relative to Min
1
2
3
4
5
6
7
8
9
10 11
Heap Size Relative to Min
switch point
ISMM '04
4
12
Motivation

Other researchers have reported similar
results [Attanasio ’01, Fitzgerald ’00]
Spread between best & worst at least 15%
 Generational not always better

Employ a VM instance built with the
optimal GC
 Goal of our work: employ multiple
GCs in the same system and switch
between them

ISMM '04
5
Framework

Implemented in the JikesRVM

Uses the Java Memory management Toolkit
(JMTk)


Can switch between diverse collectors


Extended to enable multiple GC support
Easily extensible
GC System = Allocator + Collector
ISMM '04
6
Framework Implementation
StopTheWorld
StopTheWorld
VM_Processor
(Plan currentPlan;)
Plan
{SS_Plan,MS_Plan,...}
Plan
(Selected at build time)
SS_Plan
MS_Plan
CopyMS_Plan
Generational
GenMS_Plan
GenCopy_Plan
VM_Processor

Most code is shared across GCs

44MB vs 42MB (average across GCs)
ISMM '04
7
Framework Implementation
HIGHER

Nursery
Mark-Sweep
Alloc
Unmap
Example
MS -> GSS
 MS -> GMS

High Semispace
Low Semispace
LOWER
Large Object Space
GC Data Structures
Immortal
ISMM '04
8
HIGHER
Framework Implementation
Nursery
Mark-Sweep

Alloc
No GC
Do not unmap
Example
MS -> GSS
 MS -> GMS

High Semispace
Low Semispace
LOWER
Large Object Space
GC Data Structures
Immortal
ISMM '04
9
HIGHER
Framework Implementation
Nursery


Mark-Sweep

High Semispace
Low Semispace
LOWER
Large Object Space 
GC Data Structures
Immortal
Mem mapped on demand
Unused memory unmapped
Need MS & copying states
Use object header
 Bit stealing

Worst-case cost = 1 GC

No need to perform GC in
many cases
ISMM '04
10
Making Use of GC Switching

Dynamic switching guided by
Cross-input offline behavior
 Annotation


Methodology
2.4 GHz/1G single proc x86 (hyperthreading)
 JikesRVM v2.2.0, linux kernel 2.4.18

SpecJVM98, SpecJBB, Jolden, Javagrande
 Pseudo-adaptive system

ISMM '04
11
Annotation-guided GC (AnnotGC)

Annotation

Hints about optimization opportunities
 Widely used to guide compiler optimization
Different benchmarks & inputs examined offline





Cross-input results combined
Select “optimal” GC and annotate program with it
Best GC for a range of heap sizes
JVM at program load time
 Switch to annotated GC
 Ignore annotation if GC-Switching not avail.
AnnotGC Implementation

Observations
Only 0 or 1 switch points per benchmark
 Switch point relative to min heap does not
vary much across inputs


Annotate min heap size & switch point
4 byte annotation in Java class file
 Encoded in a compact form
 40 MB min assumed if not specified

ISMM '04
13
AnnotGC Results
SS
MS
GMS
GSS
CMS
SPECjbb2000
10^6/Throughput
360
310
_209_db
260
210
40
35
30
25
160
20
110
15
1
2
3
4
5
6
7
8
9
SS
MS
GMS
GSS
CMS
45
Execution Time (sec)
410
10 11 12
1
2
3
4
5
6
7
8
9
10 11 12
Heap Size Relative to Min
Heap Size Relative to Min
ISMM '04
14
AnnotGC Results
360
10^6/Throughput
_209_db
SS
MS
GMS
GC
CMS
GC Annot
SPECjbb2000
310
260
210
160
40
35
30
25
20
15
110
1
2
3
4
5
6
7
8
9
10 11 12
1
2
3
4


5
6
7
8
9
10 11 12
Heap Size Relative to Min
Heap Size Relative to Min

SS
MS
GMS
GSS
CMS
GC Annot
45
Execution Time (sec)
410
Degradation over best: 4%
Improvement over worst: 26%
Improvement over GenMS: 7%
ISMM '04
15
AnnotGC Results
Average Difference Between Best & Worst GC Systems
Benchmarks
Degradation over Best
Improvement over
Worst
compress
6.28% (443 ms)
3.53% (279 ms)
jess
2.82% (85 ms)
56.17% (5767 ms)
db
2.88% (532 ms)
12.47% (3028 ms)
javac
5.64% (392 ms)
24.12% (2944 ms)
mpegaudio
3.54% (214 ms)
3.21% (209ms)
mtrt
4.51% (270 ms)
42.29% (5170 ms)
jack
3.22% (147 ms)
32.70% (2787 ms)
JavaGrande
3.97% (2511 ms)
17.71% (15500 ms)
SPECjbb
2.22% (3.17*106/tput)
27.95% (82.6*106/tput)
MST
4.07% (30 ms)
48.42% (1001 ms)
Voronoi
9.20% (144 ms)
31.78% (1063 ms)
Average
4.38%
26.22%
Average (without MS)
3.36%
24.13%
Extending GC Switching

Switch automatically



Performance hit: lost optimization opportunity



No offline profiling
Employ execution characteristics
Inlining of allocation sites
Write barriers
Solution: Aggressive specialization guarded by


Method invalidation
On-stack replacement
ISMM '04
17
Investigating Auto Switching

Exploit general performance characteristics


Best performing GCs across heap sizes & programs
Deciding when to switch
 Default GC: MS
 Heap size & heap residency threshold
 If residency > 60%,
switch to GSS if heap size > 90MB
 else use GMS


Different collectors for startup & steadystate
ISMM '04
18
Preliminary Implementation


Method invalidation - for future invocations
On-stack replacement - for currently executing
methods



Deferred compilation & guarded inlining [Fink’04]
Unconditional OsrPoints
Our extension

OsrInfoPoints for state info

Without inserting checks in application code
ISMM '04
19
AutoSwitch Results



Improvement over worst: 17%
Degradation over best: 15%
Overhead



(Annot: 26%)
(Annot: 4%)
OSR - negligible
Recompilation - depends on where switch occurs
Lost optimization opportunities
 all variables live at every GC point
 OsrInfoPoints are “pinned” down
 DCE, load/store elim, code motion, reg allocation
ISMM '04
20
Related Work

Application-specific GC studies




Profile-direction GC selection [Fitzgerald’00]
Comparison of GCs [Attanasio’01]
Comparing gen mark-sweep & copying [Zorn’90]
Switching/swapping



Coupling compaction with copying [Sansom’92]
Hot-swapping mark-sweep/mark-compact
[Printezis’01]
BEA Weblogic JRockit [BEA Workshop’03]
ISMM '04
21
Conclusion
Choice of best GC is app-dependent
 Novel framework

Switch between diverse collectors
 Enables annotation driven & automatic
switching


Significantly reduce impact of choosing
the "wrong" collector
AnnotGC: 26%, degradation: 4%, GMS: 7%
 Enabled by aggressive specialization

ISMM '04
22
Future Work

Improving AutoSwitch


Improved OSR
Cuts AutoSwitch overhead by half


Better heuristics for automatic switching


Paper available upon request
Decide when to switch & which GC to switch to
High-performance application-specific VMs



Guided by available resources
Application characteristics
Self-modifying
ISMM '04
23
App-specific Garbage Collectors
Front Loaders
Recyclers
Rear Loaders
Side Loaders
All products are trademarks of Heil Environmental Industries, Ltd. 2002-2003
Extra slides
ISMM '04
25
Preliminary Implementation

Aggressive specialization



Method Invalidation


Inline allocation sites
Insert write barriers only when necessary
Invalidated method replaced by stub
On-stack Replacement (OSR)

Potentially at every GC point
ISMM '04
26
On-stack replacement

Insert OsrInfoPoints at each GC point




Keep track of state
Call sites, allocation sites, prologs, backedges,
explicit yieldpoints, exceptions
All local/stack variables kept alive
Lazy OSR
foo
old return address
trigger OSR
bar
osr_helper {
...
}
b. OSR triggered lazily by external event
ISMM '04
27
GC Performance Evaluation (2)

Comparison of parallel garbage collectors [5]


Results





Mark-sweep, copying, Gen MS, Gen copy, hybrid
(copying + MS)
Hybrid has lowest heap residency
Gen copy handles fragmentation better
Minor collections for Gen copy faster
Mark-sweep has fewer GCs
No one collector best across applications
ISMM '04
28
AnnotGC Results
SS
MS
GMS
GSS
CMS
GC Annot
SPECjbb2000
10^6/Throughput
360
310
_209_db
260
210
40
35
30
25
160
20
110
15
1
2
3
4
5
6
7
8
9
SS
MS
GMS
GSS
CMS
GC Annot
45
Execution Time (sec)
410
10 11 12
1
2
3
4
5
6
7
8
9
10 11
Heap Size Relative to Min
Heap Size Relative to Min
Degradation over best: 4%
Improvement over worst: 26%
Improvement over GenMS: 7%



ISMM '04
29
12
GC Performance Evaluation


Comparing generational mark-sweep & copying [66]
Mark-sweep



Copying collection




Slower free-list allocator
Requires less heap space: 20% less on average
Fastest allocator
Copying overhead
Copy reserve space required
Generational copying not clearly superior
ISMM '04
30
GC Performance Evaluation (2)

Comparison of parallel garbage collectors [5]


Results





Mark-sweep, copying, Gen MS, Gen copy, hybrid
(copying + MS)
Hybrid has lowest heap residency
Gen copy handles fragmentation better
Minor collections for Gen copy faster
Mark-sweep has fewer GCs
No one collector best across applications
ISMM '04
31
GC Performance Evaluation (3)

Fitzgerald et al [32] compared GCs across 20
benchmarks




Every collector best at least once
Spread between best & worst at least 15%
Generational not always better


Non-generational may outperform by 15-20%
Recommend “profile-directed” selection


Null collector, non-gen copying, Gen copy with various
WB implementations
The “best” GC chosen from different pre-compiled binaries
based on profiles
It is difficult or impossible to replace GCs at runtime
ISMM '04
32
ISMM '04
33