Evaluation of a High Performance Code Compression Method

Charles Lefurgy, Eva Piccininni, and Trevor Mudge
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor

MICRO-32, November 16-18, 1999


Motivation

• Problem: embedded code size
  – Constraints: cost, area, and power
  – Fit the program in on-chip memory
  – Compilers vs. hand-coded assembly
    • Portability
    • Development costs
    • Code bloat
• Solution: code compression
  – Reduce compiled code size
  – Take advantage of instruction repetition
  – Systems can use cheaper processors with smaller on-chip memories
• Implementation questions
  – Code size?
  – Execution speed?

[Figures: an embedded system (CPU, RAM, I/O, ROM) holding the original program, and the same system holding the compressed program]


CodePack

• Overview
  – IBM PowerPC instruction set
  – First system with instruction-stream compression
  – 60% compression ratio, ±10% performance [IBM]
    • Performance gain is due to prefetching
• Implementation
  – Binary executables are compressed after compilation
  – Compression dictionaries are tuned to the application
  – Decompression occurs on an L1 cache miss
    • L1 caches hold decompressed data
    • 2 cache lines (16 insns) are decompressed at a time
  – The PowerPC core is unaware of compression


CodePack encoding

• A 32-bit insn is split into 2 16-bit words
• Each 16-bit word is compressed separately

  Encoding for upper 16 bits       Encoding for lower 16 bits
  Entries  Code                    Entries  Code
      8    00xxx                       1    00 (encodes zero)
     32    01xxxxx                    16    01xxxx
     64    100xxxxxx                  32    100xxxxx
    128    101xxxxxxx                128    101xxxxxxx
    256    110xxxxxxxx              256    110xxxxxxxx
  escape   111 + 16 raw bits      escape   111 + 16 raw bits

  (Each code is a tag followed by a dictionary index; the 111 tag is an
  escape followed by the 16 raw bits.)


CodePack decompression

[Figure: on an L1 I-cache miss, the miss address selects an entry in the index table (in main memory), which gives the byte-aligned address of a compression block (16 instructions) of compressed bytes in main memory. Each compressed instruction is a hi tag/index and low tag/index pair; the decompressor looks up the high and low dictionaries to produce the high and low 16 bits of the native instruction.]
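The tagged encoding above can be sketched in code. This is a toy Python illustration, not IBM's actual tool: it assumes a dictionary of 16-bit half-words listed in decreasing frequency order, and uses the tag widths from the upper-half table (frequent half-words get 5-bit codes; half-words outside the dictionary escape to 19 bits).

```python
# Sketch of CodePack-style encoding for the upper 16 bits of an
# instruction; the (entries, tag, index bits) classes follow the
# upper-half table above. The dictionary below is hypothetical.

UPPER_CLASSES = [
    (8,   "00",  3),   # ranks 0-7:    2-bit tag + 3-bit index
    (32,  "01",  5),   # ranks 8-39:   2-bit tag + 5-bit index
    (64,  "100", 6),   # ranks 40-103: 3-bit tag + 6-bit index
    (128, "101", 7),   # ranks 104-231
    (256, "110", 8),   # ranks 232-487
]

def encode_upper(halfword, dictionary):
    """Return the code for one 16-bit upper half as a bit string."""
    try:
        rank = dictionary.index(halfword)   # dictionary is frequency-ranked
    except ValueError:
        rank = None
    if rank is not None:
        base = 0
        for entries, tag, index_bits in UPPER_CLASSES:
            if rank < base + entries:
                return tag + format(rank - base, "0%db" % index_bits)
            base += entries
    # Escape: 3-bit tag '111' followed by the 16 raw bits
    return "111" + format(halfword, "016b")

dictionary = [0x3C00, 0x8001, 0x4800]        # hypothetical ranks 0, 1, 2
print(encode_upper(0x8001, dictionary))      # short code: '00001'
print(len(encode_upper(0x1234, dictionary))) # escaped: 19 bits
```

This shows why raw (escaped) bits dominate the uncompressed residue in the pie chart: every half-word missing from the dictionary costs 19 bits instead of 16.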
Compression ratio

• compression ratio = compressed size / original size
• Average: 62%

[Chart: compression ratio for each benchmark]


CodePack programs

• Compressed executable (go)
  – 17%-25% raw bits: not compressed!
    • Includes escape bits
    • Compiler optimizations might help
  – 5% index table
  – 2KB dictionary (fixed cost)
  – 1% pad bits

[Pie chart, go: Indices 51%, Tags 25%, Raw bits 14%, Index table 5%, Escape 3%, Dictionary 1%, Pad 1%]


I-cache miss timing

• Native code uses critical word first
• Compressed code must be fetched sequentially
• Example shows a miss to the 5th instruction in a cache line
  – 32-bit insns, 64-bit bus

[Timing diagram: a) native code — instructions fetched from main memory, critical word first; b) compressed code — index fetched from main memory, then codes fetched and decompressed; c) compressed code + optimizations — index from an index cache, codes decompressed by 2 decompressors]


Baseline results

• CodePack causes up to 18% performance loss
  – SimpleScalar, 4-issue, out-of-order
  – 16 KB caches
  – Main memory: 10-cycle latency, 2-cycle rate

[Chart: instructions per cycle, native vs. CodePack, for cc1, go, perl, vortex]


Optimization A: Index cache

• Remove the index table access with a cache
  – A cache hit removes the main memory access for the index
  – optimized: 64 lines, fully assoc., 4 indices/line (<15% miss ratio)
    • Within 8% of native code
  – perfect: an infinite-sized index cache
    • Within 5% of native code

[Chart: speedup over native code for CodePack, optimized, and perfect, on cc1, go, perl, vortex]
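The index-cache idea can be sketched as follows. The geometry (64 lines, 4 indices per line, fully associative) comes from the slide; the LRU replacement policy and the sequential miss stream are assumptions for illustration.

```python
from collections import OrderedDict

# Sketch of Optimization A: cache index-table entries so that an L1
# miss usually avoids the extra main-memory access for the index.

class IndexCache:
    def __init__(self, lines=64, indices_per_line=4):
        self.lines = lines
        self.per_line = indices_per_line
        self.cache = OrderedDict()    # line tag -> present, in LRU order
        self.hits = self.misses = 0

    def lookup(self, block_number):
        """block_number: which compression block's index is needed."""
        tag = block_number // self.per_line   # 4 indices share one line
        if tag in self.cache:
            self.cache.move_to_end(tag)       # refresh LRU position
            self.hits += 1
            return True                       # no memory access needed
        self.misses += 1                      # one memory access for index
        self.cache[tag] = True
        if len(self.cache) > self.lines:
            self.cache.popitem(last=False)    # evict least recently used
        return False

# Sequentially numbered misses hit on 3 of every 4 lookups, because
# 4 neighboring indices share a cache line:
ic = IndexCache()
for block in range(256):
    ic.lookup(block)
print(ic.hits, ic.misses)   # 192 hits, 64 misses
```

With spatial locality like this, most index lookups never touch main memory, which is where the slide's "within 8% of native" figure comes from.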
Optimization B: More decoders

• Codeword tags enable fast extraction of codewords
  – Enables parallel decoding
• Try adding more decoders for faster decompression
• 2 decoders: performance within 13% of native code

[Chart: speedup over native code for CodePack and 2, 3, and 16 insn/cycle decoders, on cc1, go, perl, vortex]


Comparison of optimizations

• The index cache provides the largest benefit
• Optimizations
  – index cache: 64 lines, 4 indices/line, fully assoc.
  – 2nd decoder
• Speedup over native code: 0.97 to 1.05
• Speedup over CodePack: 1.17 to 1.25

[Chart: speedup over native code for CodePack, index cache, 2nd decoder, and both optimizations, on cc1, go, perl, vortex]


Cache effects

• Cache size controls the normal CodePack slowdown
• Optimizations do well on small caches: 1.14 speedup

[Chart, go benchmark: speedup over native code for CodePack and optimized at 1KB, 4KB, 16KB, and 64KB caches]


Memory latency

• Optimized CodePack performs better with slow memories
  – Fewer memory accesses than native code

[Chart, go benchmark: speedup over native code for CodePack and optimized at 0.5x, 1x, 2x, 4x, and 8x memory latency]


Memory width

• CodePack provides a speedup for small buses
• Optimizations help performance degrade gracefully as bus size increases

[Chart, go benchmark: speedup over native code for CodePack and optimized with 16-, 32-, 64-, and 128-bit buses]


Conclusions

• CodePack works for instruction sets other than PowerPC
• Performance can be improved at modest cost
  – Remove decompression overhead: index lookup, dictionary lookup
• Compression can speed up execution
  – Compressed code requires fewer main memory accesses
  – CodePack includes simple prefetching
• Systems that benefit most from compression
  – Narrow buses
  – Slow memories
• Workstations might also benefit from compression
  – Fewer L2 misses
  – Less disk access


Web page

http://www.eecs.umich.edu/~tnm/compress
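The narrow-bus and slow-memory conclusions can be illustrated with a back-of-envelope refill model. The 62% ratio and the 10-cycle-latency, 2-cycle-rate memory come from earlier slides; the 16-cycle decompression overhead and the single-access refill model are assumptions for illustration, not measured CodePack numbers.

```python
import math

# Rough model of one cache-line-refill's cost: with a narrow bus,
# moving 62%-sized compressed code saves enough transfer cycles to
# pay for decompression; wide buses erase the advantage.

BLOCK_BITS = 16 * 32       # compression block: 16 32-bit instructions
RATIO = 0.62               # average compression ratio
LATENCY, RATE = 10, 2      # main memory: 10-cycle latency, 2-cycle rate
DECOMPRESS = 16            # assumed fixed decompression cycles per block

def refill_cycles(bus_bits, compressed):
    bits = BLOCK_BITS * (RATIO if compressed else 1.0)
    transfers = math.ceil(bits / bus_bits)
    cycles = LATENCY + transfers * RATE
    return cycles + (DECOMPRESS if compressed else 0)

for bus in (16, 32, 64, 128):
    native = refill_cycles(bus, compressed=False)
    packed = refill_cycles(bus, compressed=True)
    print(f"{bus:>3}-bit bus: speedup {native / packed:.2f}")
# Only the 16-bit bus shows a speedup in this simple model.
```

Even this crude model reproduces the trend on the memory-width slide: compression wins when transfer cycles dominate the refill, and loses when a wide bus makes the fixed decompression overhead the bottleneck.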