
Efficient Execution of
Compressed Programs
Charles Lefurgy
http://www.eecs.umich.edu/compress
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor
The problem
• Microprocessor die cost
– Low cost is critical for high-volume, low-margin embedded systems
– Control cost by reducing area and increasing yield
• Increasing amount of on-chip memory
– Memory is 40-80% of die area [ARM, MCore]
– In control-oriented embedded systems, much of this is program memory
• How can program memory be reduced without sacrificing performance?
[Figure: embedded microprocessor die with CPU, RAM, I/O, and ROM holding the program]
Solution
• Code compression
– Reduce compiled code size
– Compress at compile-time
– Decompress at run-time
• Implementation
– Hardware or software?
– Code size?
– Execution speed?
[Figure: two embedded systems — one with the original program in ROM, one with the compressed program in ROM]
Research contributions
• HW decompression [MICRO-32, MICRO-30, CASES-98]
– Dictionary compression method
– Analysis of IBM CodePack algorithm
– HW decompression increases performance
• SW decompression [HPCA-6, CASES-99]
– Near native code performance for media applications
– Software-managed cache
– Hybrid program optimization (new profiling method)
– Memoization optimization
Outline
Compression overview and metrics
HW decompression overview
SW decompression
• Compression algorithms
– Dictionary
– CodePack
• Hardware support
• Performance study
• Optimizations
– Hybrid programs
– Memoization
Why are programs compressible?
• Ijpeg benchmark (MIPS gcc 2.7.2.3 -O2)
– 49,566 static instructions
– 13,491 unique instructions
– 1% of unique instructions cover 29% of static instructions
[Chart: occurrence count (log scale, 1 to 1,000) for each of the 13,491 unique instruction bit patterns; the most frequent instruction, jalr $31,$2, appears 810 times]
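For concreteness, a minimal sketch of how such a unique-pattern count can be taken over a raw .text section; this sort-based approach is illustrative only, not the thesis' actual measurement tooling.

    #include <stdint.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    /* Count unique 32-bit bit patterns among n static instructions.
       Sorts text[] in place so duplicates become adjacent. */
    size_t count_unique(uint32_t *text, size_t n)
    {
        qsort(text, n, sizeof *text, cmp);
        size_t unique = n ? 1 : 0;
        for (size_t i = 1; i < n; i++)
            if (text[i] != text[i - 1])
                unique++;
        return unique;
    }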
Evaluation metrics
• Size
– compression ratio = compressed size / original size
– Smaller is better: e.g., compressing a 100 KB program to 60 KB gives a 60% ratio
[Chart: compressed program size as a percentage of the original program]
• Decode efficiency
– Measure program execution time
Previous results
Who       Instruction Set  Compression Ratio  Comment
Thumb     ARM              70%                16-bit instruction subset of 32-bit ISA
MIPS-16   MIPS             60%                16-bit instruction subset of 32-bit ISA
CodePack  PowerPC          60%                Cache line compression
Wolfe     MIPS             73%                Cache line, Huffman
Lekatsas  MIPS, x86        50%, 80%           Cache line, stream division, Huffman
Araujo    MIPS             43%                Cache line, op. factorization, Huffman
Liao      TMS320C25        82%                Procedure abstraction
Kirovski  SPARC            60%                Procedure compression
Ernst     SPARC            20%                Interpreted wire code

• Many compression methods
– Compression unit: instruction, cache line, or procedure
• Difficult to compare previous results
– Studies use different instruction sets and benchmarks
• Many studies do not measure execution time
• Wire code
– Small size: a goal for embedded systems
– Problem: no random access, must decompress entire program at once
Hardware decompression
CodePack
• Overview
– IBM
– PowerPC instruction set
– 60% compression ratio, ±10% performance [IBM]
• performance gain due to prefetching
• Implementation
– Binary executables are compressed after compilation
– Decompression during instruction cache miss
• Instruction cache holds native code
• Decompress two cache lines at a time (16 insns)
– PowerPC core is unaware of compression
CodePack encoding
Each 32-bit PowerPC instruction word is split into its upper 16 bits (bits 0-15) and lower 16 bits (bits 16-31), and each half is encoded separately. The leading bits are the tag, the x bits are a dictionary index, and tag 111 is an escape followed by the 16 raw bits. The lower half's 00 tag encodes zero with no index bits.

Upper 16 bits                 Lower 16 bits
Entries  Encoding             Entries  Encoding
8        00 xxx               1        00 (encodes zero)
32       01 xxxxx             16       01 xxxx
64       100 xxxxxx           32       100 xxxxx
128      101 xxxxxxx          128      101 xxxxxxx
256      110 xxxxxxxx         256      110 xxxxxxxx
escape   111 + 16 raw bits    escape   111 + 16 raw bits
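To make the encoding concrete, here is a minimal decoding sketch for the upper half, assuming a hypothetical get_bits() bit reader and dictionary arrays sized as in the table; the lower half is analogous, with the 00 tag returning zero.

    #include <stdint.h>

    /* Hypothetical bit reader: returns the next n bits of the stream. */
    extern uint32_t get_bits(int n);

    /* Upper-half dictionaries, sized per the table above (assumed layout). */
    extern const uint16_t up00[8], up01[32], up100[64], up101[128], up110[256];

    /* Decode the upper 16 bits of one instruction from the compressed stream. */
    uint16_t decode_upper_half(void)
    {
        if (get_bits(1) == 0)                         /* 2-bit tags: 00, 01   */
            return get_bits(1) == 0 ? up00[get_bits(3)]   /* 00 + 3-bit index */
                                    : up01[get_bits(5)];  /* 01 + 5-bit index */
        switch (get_bits(2)) {                        /* 3-bit tags: 100..111 */
        case 0:  return up100[get_bits(6)];           /* 100 + 6-bit index    */
        case 1:  return up101[get_bits(7)];           /* 101 + 7-bit index    */
        case 2:  return up110[get_bits(8)];           /* 110 + 8-bit index    */
        default: return (uint16_t)get_bits(16);       /* 111 escape: raw bits */
        }
    }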
CodePack system
• CodePack is part of the memory system
– After L1 instruction cache
• Dictionaries
– Contain 16-bit upper and lower halves of instructions
• Index table
– Maps instruction address to compressed code address
[Figure: instruction memory hierarchy — the processor and L1 instruction cache sit above the CodePack decompressor; the dictionaries, index table, and compressed code reside in main memory]
CodePack decompression
[Figure: decompression flow on an L1 I-cache miss]
1. Fetch index: the middle bits (6-25) of the miss address select an entry in the index table (in main memory), which gives the byte-aligned address of the compression block.
2. Fetch compressed instructions: read the block's compressed bytes (16 instructions) from main memory.
3. Decompress: each compressed instruction carries a high tag and index and a low tag and index; the high and low dictionaries supply the high and low 16 bits of the native instruction.
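A sketch of the whole miss path, reusing decode_upper_half() from the previous sketch and assuming the helper names shown; a compression block is 16 instructions, i.e., 64 bytes of native code, hence the shift by 6.

    #include <stdint.h>

    extern const uint32_t index_table[];    /* block number -> byte offset   */
    extern const uint8_t *compressed_base;  /* compressed code in main memory */
    extern void bits_seek(const uint8_t *p);    /* position the bit reader   */
    extern uint16_t decode_upper_half(void);    /* from the previous sketch  */
    extern uint16_t decode_lower_half(void);

    /* Decompress the 16-instruction block containing miss_addr into out[]. */
    void codepack_miss(uint32_t miss_addr, uint32_t out[16])
    {
        uint32_t block = (miss_addr >> 6) & 0xFFFFF;     /* address bits 6-25 */
        bits_seek(compressed_base + index_table[block]); /* byte-aligned start */
        for (int i = 0; i < 16; i++) {
            uint16_t hi = decode_upper_half();
            uint16_t lo = decode_lower_half();
            out[i] = ((uint32_t)hi << 16) | lo;          /* rebuild native insn */
        }
    }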
Instruction cache miss latency
• Native code uses critical word first
• Compressed code must be fetched sequentially
[Timing diagram, t = 0 to 30 cycles, one cycle per tick:
a) Native code: L1 cache miss, then instructions from main memory with the critical instruction word first.
b) Compressed code: L1 cache miss, then index from main memory, codes from main memory, and decompression cycles; the critical word arrives late.
c) Compressed code + optimizations: index from the index cache, codes from main memory, and two decompressors (A and B) overlapping decompression of the remaining lines.]
Comparison of optimizations
• Index cache provides largest benefit
• Optimizations
– index cache: 64 lines, 4 indices/line, fully assoc.
– 2nd decoder
• Speedup over native code: 0.97 to 1.05
• Speedup over CodePack: 1.17 to 1.25
[Chart: speedup over native code (0 to 1.2) on cc1, go, perl, and vortex for CodePack, index cache, 2nd decoder, and both optimizations]
Hardware decompression conclusions
• Performance can be improved at modest cost
– Remove decompression overhead
• index lookup
• dictionary lookup
– Better memory bus utilization
• Compression can speed up execution
– Compressed code requires fewer main memory accesses
– CodePack includes simple prefetching
• Systems that benefit most from compression
– Narrow memory bus
– Slow memory
Software decompression
• Previous work
– Whole program compression [Taunton91]
• Saved disk space
• No memory savings
– Procedure compression [Kirovski97]
• Requires large decompression memory
• Fragmentation of decompression memory
• Slow
• My work
– Decompression unit: 1 or 2 cache-lines
– High performance focus
How does code compression work?
• What is compressed?
– Individual instructions
• When is decompression performed?
– During I-cache miss
• How is decompression implemented?
– I-cache miss invokes exception handler
• What is decompressed?
– 1 or 2 cache lines
• Where are decompressed instructions stored?
– I-cache is the decompression buffer
[Figure: system-on-a-chip with processor, I-cache (SRAM) holding decompressed instructions, and D-cache (SRAM) for program and decompressor data; main memory (eDRAM) holds the compressed program (compressed code, dictionary, index table), native code, the decompressor, and program data]
Dictionary compression algorithm
• Goal: fast decompression
• Dictionary contains unique instructions
• Replace program instructions with short index
[Figure: the original .text segment stores each 32-bit instruction (e.g., lw r2,r3 and three copies of lw r15,r3); the compressed .text segment stores a 16-bit index per instruction (5, 30, 30, 30) into a .dictionary segment holding each unique instruction once]
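A minimal offline sketch of the compressor under the scheme above: it keeps each unique instruction once and emits one 16-bit index per original instruction (the 13,491 unique instructions in ijpeg fit easily in 16 bits). The linear search is illustrative; a real tool would hash.

    #include <stdint.h>
    #include <stddef.h>

    /* Build the dictionary of unique instructions; emit one index per insn.
       Returns the dictionary size (assumed to stay below 65,536 entries). */
    size_t dict_compress(const uint32_t *text, size_t n,
                         uint32_t *dict, uint16_t *indices)
    {
        size_t dict_size = 0;
        for (size_t i = 0; i < n; i++) {
            size_t j;
            for (j = 0; j < dict_size; j++)    /* linear search: offline only */
                if (dict[j] == text[i])
                    break;
            if (j == dict_size)
                dict[dict_size++] = text[i];   /* first occurrence: new entry */
            indices[i] = (uint16_t)j;          /* 32-bit insn -> 16-bit index */
        }
        return dict_size;
    }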
Decompression
• Algorithm
1. I-cache miss invokes decompressor (exception handler)
2. Fetch index
3. Fetch dictionary word
4. Place instruction in I-cache (special instruction)
• Write directly into I-cache
• Decompressed instructions only exist in I-cache
[Figure: on an I-cache miss (1), the processor runs the decompressor, which (2) fetches index 5 from the index stream in memory through the D-cache, (3) fetches the dictionary word (add r1,r2,r3), and (4) places the instruction in the I-cache]
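A sketch of the exception handler, assuming a hypothetical store_icache() intrinsic wrapping the store-instruction instruction and the simulated 32-byte I-cache lines (eight 4-byte instructions):

    #include <stdint.h>

    extern const uint16_t indices[];     /* compressed .text: one index/insn */
    extern const uint32_t dictionary[];  /* unique instructions              */

    /* Hypothetical intrinsic for the store-instruction instruction:
       writes one decoded word directly into the I-cache. */
    extern void store_icache(uint32_t addr, uint32_t insn);

    enum { LINE_WORDS = 8 };             /* 32B line / 4B instructions       */

    /* I-cache miss exception handler: refill one line from compressed code. */
    void dict_miss_handler(uint32_t miss_addr)
    {
        uint32_t line  = miss_addr & ~(uint32_t)(LINE_WORDS * 4 - 1);
        uint32_t first = line / 4;       /* index of the line's first insn   */
        for (uint32_t i = 0; i < LINE_WORDS; i++)
            store_icache(line + 4 * i, dictionary[indices[first + i]]);
    }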
Hardware support
• Decompression exception
– Raise exception on I-cache miss
– Exception not raised for native code sections (allows hybrid programs)
– Similar to Informing Memory [Horowitz98]
• Store-instruction instruction
– MAJC, MIPS R10K
Two software decompressors
• Dictionary
– Faster
– Less compression
• CodePack
– A software version of IBM’s CodePack hardware
– Slower
– More compression
                         Dictionary      CodePack
Codewords (indices)      Fixed-length    Variable-length
Decompress granularity   1 cache line    2 cache lines
Static instructions      43              174
Dynamic instructions     43              1042-1062
Decompression overhead   73-105 cycles   1235-1266 cycles
Compression ratio
• compression ratio = compressed size / original size
– CodePack: 55% - 63%
– Dictionary: 65% - 82%
[Chart: compression ratio (0%-100%) per benchmark — cc1, ghostscript, go, ijpeg, mpeg2enc, pegwit, perl, vortex — for Dictionary and CodePack]
Simulation environment
• SimpleScalar
• Pipeline: 5 stage, in-order
• I-cache: 4KB, 32B lines, 2-way
• D-cache: 8KB, 16B lines, 2-way
• Memory: embedded DRAM
– 10 cycle latency
– bus width = 1 cache line
– 10x denser than SRAM caches
• Performance: slowdown = 1 / speedup (1 = native code)
• Area results include:
– Main memory to hold program (compressed bytes, tables)
– I-cache
– Memory for decompressor optimizations (memoization, native code)
Performance
• CodePack: very high overhead
• Reduce overhead by reducing cache misses
Ghostscript slowdown (1.00 = native code) by I-cache size:

I-cache size   CodePack   Dictionary   Native
4KB            19.19      2.66         1.00
16KB           3.17       1.43         1.00
64KB           1.26       1.04         1.00
Cache miss
• Control slowdown by optimizing I-cache miss ratio
• Small change in miss ratio → large performance impact
[Chart: slowdown relative to native code (0-40) vs. I-cache miss ratio (0%-8%) for CodePack and Dictionary at 4KB, 16KB, and 64KB I-caches]
Two optimizations
• Hybrid programs (static)
– Both compressed and native code
• Memoization (dynamic)
– Cache recent decompressions in main memory
• Both can be applied to any compression algorithm
[Figure: program memory layouts — Original: native; Compressed: compressed; Memoization: compressed + memo; Hybrid: compressed + native; Combined: compressed + memo + native]
Hybrid programs
• Selective compression
– Only compress some procedures
– Trade size for speed
– Avoid decompression overhead
• Profile methods
– Count dynamic instructions
• Example: ARM/Thumb
• Use when compressed code has more instructions
• Reduce number of executed instructions
– Count cache misses
• Example: CodePack
• Use when compressed code has longer cache miss latency
• Reduce cache miss latency
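A sketch of how a cache-miss profile could drive selective compression; the procedure record, field names, and greedy heuristic are illustrative assumptions, not the thesis' actual tool.

    #include <stdint.h>
    #include <stddef.h>

    /* One procedure's profile data (illustrative layout). */
    typedef struct {
        uint32_t native_size;      /* bytes if left native   */
        uint32_t compressed_size;  /* bytes if compressed    */
        uint32_t miss_count;       /* profiled I-cache misses */
        int      keep_native;
    } proc_t;

    /* Keep the worst cache-miss offenders native until the extra size
       budget is spent; assumes p[] is sorted by descending miss_count. */
    void select_native(proc_t *p, size_t n, uint32_t size_budget)
    {
        for (size_t i = 0; i < n; i++) {
            uint32_t cost = p[i].native_size - p[i].compressed_size;
            if (cost <= size_budget) {          /* trade size for speed */
                p[i].keep_native = 1;
                size_budget -= cost;
            }
        }
    }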
Cache miss profiling
• Cache miss profile reduces overhead by 50%
• Loop-oriented benchmarks benefit most
– Profiles are different in loop regions
[Chart: Pegwit (encryption) — slowdown relative to native code (1.00-1.12) vs. compression ratio (60%-100%) for CodePack hybrids built from a dynamic-instruction profile vs. a cache-miss profile]
Code placement
[Figure: code placement. Original code: procedures A, B, C, D laid out in memory. Whole compression: the dictionary and compressed a b c d sit in memory; decompression (into the L1 cache only) reproduces A, B, C, D in the same order. Selective compression: native B and D sit next to compressed a and c; after decompression the layout order differs from the original (B, D, A, C)!]
CodePack vs. Dictionary
• More compression can mean better performance
– CodePack produces smaller code than Dictionary compression
– Even with some native code mixed in, CodePack stays smaller
– At equal size, CodePack is faster because it can afford more native code
[Chart: Ghostscript — slowdown relative to native code (0.5-4.0) vs. compression ratio (60%-100%) for CodePack and Dictionary hybrids built with cache-miss profiles]
Memoization
• Reserve main memory for caching decompressed insns.
– Use high density DRAM to store more than I-cache in less area
– Manage as a cache: data and tags
• Algorithm
– Decompressor checks memo table before decompressing
– On hit, copy instructions into I-cache (no decompression)
– On miss, decompress into I-cache and update memo table
[Figure: without memoization, a miss address goes straight to the decompressor to produce a cache line; with memoization, the address is first checked against the memo table's tags (and native code), and only a tag mismatch invokes the decompressor]
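A minimal sketch of the memo-table check, assuming a direct-mapped table in reserved main memory (512 lines of 32B ≈ the 16KB table used below) and the helper names shown:

    #include <stdint.h>

    enum { LINE_WORDS = 8, MEMO_LINES = 512 };   /* 512 x 32B = 16KB table */

    /* Direct-mapped memo table living in reserved main memory (assumed). */
    static uint32_t memo_tag[MEMO_LINES];
    static uint32_t memo_data[MEMO_LINES][LINE_WORDS];

    extern void decompress_line(uint32_t addr, uint32_t out[LINE_WORDS]);
    extern void fill_icache(uint32_t addr, const uint32_t w[LINE_WORDS]);

    void miss_with_memo(uint32_t miss_addr)
    {
        uint32_t line = miss_addr / (LINE_WORDS * 4);
        uint32_t slot = line % MEMO_LINES;
        if (memo_tag[slot] != line) {                    /* memo miss:      */
            decompress_line(miss_addr, memo_data[slot]); /* decompress and  */
            memo_tag[slot] = line;                       /* memoize         */
        }
        fill_icache(miss_addr, memo_data[slot]);  /* hit: copy, no decompression */
    }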
Memoization results
• 16KB memoization table
• CodePack: large improvement
• Dictionary is already fast: small improvement
[Charts: slowdown per benchmark (cc1, ghostscript, go, ijpeg, mpeg2enc, pegwit, perl, vortex) with and without the 16KB memo table — CodePack (scale 0-30) and Dictionary (scale 0-5)]
Combined
• Use memoization on hybrid programs
– Keep area constant. Partition memory for best solution.
• Combined solution is often the best
[Chart: CodePack with a 32KB decompression buffer — slowdown (0-14) per benchmark (cc1, ghostscript, go, ijpeg, mpeg2enc, pegwit, perl, vortex) for 0KB, 4KB, 8KB, and 16KB memo tables, plus a 32KB memo (not hybrid)]
Combined
• Adding DRAM is better than larger SRAM cache
[Chart: Ghostscript using CodePack — slowdown (0-20) vs. compression ratio (60%-100%) at constant total area (9KB to 288KB per curve), pairing I-caches of 4KB-64KB with memo tables of 0KB-288KB; native shown for reference]
Conclusions
• High-performance SW decompression possible
– Dictionary faster than CodePack, but 5-25% compression ratio difference
– Hardware support
• I-cache miss exception
• Store-instruction instruction
• Tune performance
– Cache size
– Hybrid programs, memoization
• Hybrid programs
– Use cache miss profile for loop-oriented benchmarks
• Memoization
– No profile required, but has start-up latency
– Effective on large working sets
Future work
• Compiler optimization for compression
– Improve compression ratio without affecting performance
• Unify selective compression and code placement
– Reduce I-cache misses to improve performance
• Energy consumption
– Increasing run-time may use too much energy
– Improving bus utilization saves energy
• Dynamic code generation: Crusoe, Dynamo
– Memory management, code stitching
• Compression for high-performance
– Lower L2 miss rate
Web page
http://www.eecs.umich.edu/compress