
Evaluation of a High Performance
Code Compression Method
Charles Lefurgy, Eva Piccininni,
and Trevor Mudge
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor
MICRO-32
November 16-18, 1999
Motivation
• Problem: embedded code size
– Constraints: cost, area, and power
– Fit program in on-chip memory
– Compilers vs. hand-coded assembly: code bloat, portability, development costs
• Solution: code compression
– Reduce compiled code size
– Take advantage of instruction repetition
– Systems use cheaper processors with smaller on-chip memories
• Implementation
– Code size?
– Execution speed?
[Figure: embedded systems (CPU, RAM, I/O, ROM): the original program vs. the compressed program in on-chip ROM]
2
CodePack
• Overview
– IBM
– PowerPC instruction set
– First system with instruction stream compression
– 60% compression ratio, ±10% performance [IBM]
• Performance gain is due to prefetching
• Implementation
– Binary executables are compressed after compilation
– Compression dictionaries tuned to application
– Decompression occurs on L1 cache miss
• L1 caches hold decompressed data
• Decompress 2 cache lines at a time (16 insns)
– PowerPC core is unaware of compression
3
CodePack encoding
• 32-bit insn is split into 2 16-bit words
• Each 16-bit word compressed separately
Encoding for upper 16 bits:
  00  + 3-bit index  (8 entries)
  01  + 5-bit index  (32 entries)
  100 + 6-bit index  (64 entries)
  101 + 7-bit index  (128 entries)
  110 + 8-bit index  (256 entries)
  111 + 16 raw bits  (escape)

Encoding for lower 16 bits:
  00  (no index bits; encodes zero)
  01  + 4-bit index  (16 entries)
  100 + 5-bit index  (32 entries)
  101 + 7-bit index  (128 entries)
  110 + 8-bit index  (256 entries)
  111 + 16 raw bits  (escape)

Each codeword is a tag (2 or 3 bits) followed by a dictionary index; the escape tag (111) is instead followed by the 16 raw bits (see the decoder sketch below).
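To make the encoding concrete, here is a minimal decoder sketch in C for the upper-half codewords. The bit-stream reader `get_bits()` is a hypothetical helper, and laying the five index spaces of the high dictionary out back to back in one array is an assumption for illustration, not necessarily CodePack's hardware organization.

```c
#include <stdint.h>

/* Hypothetical bit-stream reader: consumes n bits from the compressed
 * stream, MSB first, and returns them right-aligned. */
extern uint32_t get_bits(int n);

/* High dictionary, with the five index spaces from the table above
 * laid out back to back (8 + 32 + 64 + 128 + 256 = 488 entries).
 * This layout is an assumption for illustration. */
extern uint16_t high_dict[488];

/* Decode one upper 16-bit half-word. */
uint16_t decode_upper(void)
{
    uint32_t tag = get_bits(2);
    switch (tag) {
    case 0:                     /* 00: 3-bit index */
        return high_dict[get_bits(3)];
    case 1:                     /* 01: 5-bit index */
        return high_dict[8 + get_bits(5)];
    default:                    /* 1x: read the third tag bit */
        tag = (tag << 1) | get_bits(1);
        switch (tag) {
        case 4: return high_dict[40  + get_bits(6)];  /* 100: 6-bit index */
        case 5: return high_dict[104 + get_bits(7)];  /* 101: 7-bit index */
        case 6: return high_dict[232 + get_bits(8)];  /* 110: 8-bit index */
        default: return (uint16_t)get_bits(16);       /* 111: escape, raw bits */
        }
    }
}
```

A lower-half decoder would differ only in the index widths, with tag 00 returning zero and consuming no index bits.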
4
CodePack decompression
[Figure: decompression flow. On an L1 I-cache miss, a field of the miss address (the figure splits the 32-bit address at bits 5/6 and 25/26) selects an entry in the index table in main memory, yielding the byte-aligned block address of the compressed bytes in main memory. A compression block (16 instructions) of compressed instructions is fetched; each compressed instruction carries a hi tag + hi index and a low tag + low index, which select entries in the high and low dictionaries to form the high 16 bits and low 16 bits of the native instruction.]
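The flow in the figure can be summarized in C. The names, the index-table entry format, and the exact address bit field are assumptions for illustration; `decode_upper()` and `decode_lower()` are the per-half decoders sketched on the previous slide.

```c
#include <stdint.h>

#define BLOCK_INSNS 16                  /* compression block = 16 insns */

extern uint32_t index_table[];          /* index table in main memory */
extern const uint8_t *compressed_base;  /* compressed bytes in main memory */

extern void     bitstream_init(const uint8_t *p);  /* hypothetical helper */
extern uint16_t decode_upper(void);
extern uint16_t decode_lower(void);

/* Refill one compression block on an L1 I-cache miss. */
void refill_block(uint32_t miss_addr, uint32_t insns[BLOCK_INSNS])
{
    /* The middle field of the miss address selects the index-table
     * entry, which holds the byte-aligned address of the compressed
     * block.  The field extraction below is an assumption. */
    uint32_t entry = (miss_addr >> 6) & 0xFFFFF;
    uint32_t block_addr = index_table[entry];

    bitstream_init(compressed_base + block_addr);
    for (int i = 0; i < BLOCK_INSNS; i++) {
        uint16_t hi = decode_upper();   /* hi tag + index -> high dictionary */
        uint16_t lo = decode_lower();   /* low tag + index -> low dictionary */
        insns[i] = ((uint32_t)hi << 16) | lo;   /* native instruction */
    }
}
```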
5
Compression ratio
• compression ratio = compressed size / original size (smaller is better)
• Average: 62%
[Figure: compression ratio for each benchmark, 0-100%; average 62%]
6
CodePack programs
• Compressed executable
– 17%-25% raw bits: not compressed! (includes escape bits; compiler optimizations might help)
– 5% index table
– 2KB dictionary (fixed cost)
– 1% pad bits
[Figure: breakdown of the compressed go executable: indices 51%, tags 25%, raw bits 14%, index table 5%, escape 3%, dictionary 1%, pad 1%]
7
I-cache miss timing
• Native code uses critical word first
• Compressed code must be fetched sequentially
• Example shows miss to 5th instruction in cache line
– 32-bit insns, 64-bit bus
[Timing figure, t = 0 to 30 cycles, 1-cycle units, comparing:
a) Native code: L1 cache miss, then instructions fetched from main memory, critical instruction word first
b) Compressed code: L1 cache miss, index fetched from main memory, codes fetched from main memory and decompressed
c) Compressed code + optimizations: index fetched from the index cache (A), codes decompressed by 2 decompressors (B)]
8
Baseline results
• CodePack causes up to 18% performance loss
– SimpleScalar
– 4-issue, out-of-order
– 16 KB caches
– Main memory: 10 cycle latency, 2 cycle rate
[Figure: instructions per cycle (0.0-1.8) for native vs. CodePack on cc1, go, perl, vortex]
9
Optimization A: Index cache
• Remove the index table access with a cache (sketched below)
– A cache hit removes the main memory access for the index
– optimized: 64 lines, fully assoc., 4 indices/line (<15% miss ratio)
• Within 8% of native code
– perfect: an infinite sized index cache
• Within 5% of native code
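A behavioral sketch of such an index cache, assuming the 64-line, 4-indices-per-line, fully associative configuration above. The round-robin replacement policy and the entry format are assumptions; the point is that a hit avoids the main memory access for the index.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINES            64   /* fully associative */
#define ENTRIES_PER_LINE 4    /* 4 index-table entries per line */

struct icache_line {
    bool     valid;
    uint32_t tag;                        /* entry number / ENTRIES_PER_LINE */
    uint32_t entry[ENTRIES_PER_LINE];    /* cached block addresses */
};

static struct icache_line lines[LINES];
static unsigned victim;                  /* round-robin replacement (assumed) */

extern uint32_t index_table[];           /* backing table in main memory */

/* Return an index-table entry, skipping main memory on a hit. */
uint32_t lookup_index(uint32_t entry_num)
{
    uint32_t tag = entry_num / ENTRIES_PER_LINE;
    uint32_t off = entry_num % ENTRIES_PER_LINE;

    for (unsigned i = 0; i < LINES; i++)          /* associative search */
        if (lines[i].valid && lines[i].tag == tag)
            return lines[i].entry[off];           /* hit: no memory access */

    /* Miss: fetch the whole line of 4 entries from main memory. */
    struct icache_line *l = &lines[victim];
    victim = (victim + 1) % LINES;
    l->valid = true;
    l->tag   = tag;
    for (unsigned j = 0; j < ENTRIES_PER_LINE; j++)
        l->entry[j] = index_table[tag * ENTRIES_PER_LINE + j];
    return l->entry[off];
}
```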
[Figure: speedup over native code (0.0-1.2) for CodePack, optimized, and perfect index cache on cc1, go, perl, vortex]
10
Optimization B: More decoders
• Codeword tags enable fast extraction of codewords
– Enables parallel decoding (see the sketch below)
• Try adding more decoders for faster decompression
• 2 decoders: performance within 13% of native code
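Why the tags make a second decoder cheap, in a sketch: a codeword's total length is fixed by its tag alone, so the start of the next codeword can be computed without any dictionary lookup. The helper `peek_bits()` is hypothetical, and the lengths follow the upper-half encoding shown earlier.

```c
#include <stdint.h>

/* Hypothetical helper: read n bits starting at bit position pos
 * without consuming them. */
extern uint32_t peek_bits(const uint8_t *buf, uint32_t pos, int n);

/* Total bit length of the codeword starting at pos (upper-half encoding). */
static int codeword_len(const uint8_t *buf, uint32_t pos)
{
    switch (peek_bits(buf, pos, 3)) {   /* at most 3 tag bits matter */
    case 0: case 1: return 2 + 3;       /* 00 + 3-bit index  */
    case 2: case 3: return 2 + 5;       /* 01 + 5-bit index  */
    case 4:         return 3 + 6;       /* 100 + 6-bit index */
    case 5:         return 3 + 7;       /* 101 + 7-bit index */
    case 6:         return 3 + 8;       /* 110 + 8-bit index */
    default:        return 3 + 16;      /* 111 + 16 raw bits */
    }
}

/* Decoder 0 starts at pos; decoder 1 can start here in the same cycle,
 * so two codewords are expanded per cycle instead of one. */
uint32_t second_decoder_start(const uint8_t *buf, uint32_t pos)
{
    return pos + codeword_len(buf, pos);
}
```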
[Figure: speedup over native code (0.0-1.0) for baseline CodePack and for 2, 3, and 16 insn/cycle decoders on cc1, go, perl, vortex]
11
Comparison of optimizations
• Index cache provides largest benefit
• Optimizations
– index cache: 64 lines, 4 indices/line, fully assoc.
– 2nd decoder
• Speedup over native code: 0.97 to 1.05
• Speedup over CodePack: 1.17 to 1.25
[Figure: speedup over native code (0-1.2) for CodePack, index cache, 2nd decoder, and both optimizations on cc1, go, perl, vortex]
12
Cache effects
• Cache size controls the slowdown of baseline CodePack
• Optimizations do well on small caches: 1.14 speedup
[Figure: go benchmark: speedup over native code (0.0-1.4) for CodePack and optimized at 1KB, 4KB, 16KB, and 64KB cache sizes]
13
Memory latency
• Optimized CodePack performs better with slow memories
– Fewer memory accesses than native code
[Figure: go benchmark: speedup over native code (0.0-1.2) for CodePack and optimized at 0.5x, 1x, 2x, 4x, and 8x memory latency]
14
Memory width
• CodePack provides speedup for small buses
• Optimizations help performance degrade gracefully as bus size increases
[Figure: go benchmark: speedup over native code (0.0-1.2) for CodePack and optimized with 16-, 32-, 64-, and 128-bit buses]
15
Conclusions
• CodePack works for instruction sets other than PowerPC
• Performance can be improved at modest cost
– Remove decompression overhead: index lookup, dictionary lookup
• Compression can speed up execution
– Compressed code requires fewer main memory accesses
– CodePack includes simple prefetching
• Systems that benefit most from compression
– Narrow buses
– Slow memories
• Workstations might benefit from compression
– Fewer L2 misses
– Less disk access
16
Web page
http://www.eecs.umich.edu/~tnm/compress
17