Efficient Execution of Compressed Programs

Charles Lefurgy
http://www.eecs.umich.edu/compress
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor


The problem

• Microprocessor die cost
  – Low cost is critical for high-volume, low-margin embedded systems
  – Control cost by reducing area and increasing yield
• Increasing amount of on-chip memory
  – Memory is 40-80% of die area [ARM, MCore]
  – In control-oriented embedded systems, much of this is program memory
• How can program memory be reduced without sacrificing performance?

[Figure: embedded microprocessor die containing CPU, RAM, I/O, and a program ROM]


Solution

• Code compression
  – Reduce compiled code size
  – Compress at compile time
  – Decompress at run time
• Implementation questions
  – Hardware or software?
  – Code size?
  – Execution speed?

[Figure: the program ROM shrinks when the embedded system stores the compressed program instead of the original program]


Research contributions

• HW decompression [MICRO-32, MICRO-30, CASES-98]
  – Dictionary compression method
  – Analysis of IBM CodePack algorithm
  – HW decompression increases performance
• SW decompression [HPCA-6, CASES-99]
  – Near-native code performance for media applications
  – Software-managed cache
  – Hybrid program optimization (new profiling method)
  – Memoization optimization


Outline

• Compression overview and metrics
• HW decompression overview
• SW decompression
  – Compression algorithms: Dictionary and CodePack
  – Hardware support
  – Performance study
  – Optimizations: hybrid programs and memoization


Why are programs compressible?

• Ijpeg benchmark (MIPS, gcc 2.7.2.3 -O2)
  – 49,566 static instructions
  – 13,491 unique instructions
  – 1% of unique instructions cover 29% of static instructions

[Figure: frequency of each of the 13,491 unique instruction bit patterns, log scale; the most frequent instruction, jalr $31,$2, occurs 810 times]
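As an aside, statistics like the ones above are straightforward to reproduce. Below is a minimal C sketch that counts total and unique 32-bit instruction words, assuming the benchmark's .text section has been dumped to a raw binary file; the file name and the open-addressing hash set are illustrative choices, not part of the original toolchain.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 20)   /* power of two, well above 13,491 uniques */

int main(int argc, char **argv)
{
    /* Hypothetical input: .text dumped as raw 32-bit words (e.g. via objcopy). */
    FILE *f = fopen(argc > 1 ? argv[1] : "ijpeg.text.bin", "rb");
    if (!f) { perror("fopen"); return 1; }

    uint32_t *slot = calloc(TABLE_SIZE, sizeof *slot);
    uint8_t  *used = calloc(TABLE_SIZE, 1);
    if (!slot || !used) return 1;

    uint32_t insn;
    unsigned long total = 0, unique = 0;
    while (fread(&insn, sizeof insn, 1, f) == 1) {
        total++;
        /* Open-addressing hash set keyed on the instruction bit pattern. */
        uint32_t h = (insn * 2654435761u) & (TABLE_SIZE - 1);
        while (used[h] && slot[h] != insn)
            h = (h + 1) & (TABLE_SIZE - 1);
        if (!used[h]) { used[h] = 1; slot[h] = insn; unique++; }
    }
    printf("%lu static instructions, %lu unique bit patterns\n", total, unique);
    free(slot); free(used); fclose(f);
    return 0;
}
```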
Evaluation metrics

• Size
  – compression ratio = compressed size / original size (smaller is better)
• Decode efficiency
  – Measure program execution time


Previous results

Who       Instruction set  Compression ratio  Comment
Thumb     ARM              70%                16-bit instruction subset of 32-bit ISA
MIPS-16   MIPS             60%                16-bit instruction subset of 32-bit ISA
CodePack  PowerPC          60%                Cache line compression
Wolfe     MIPS             73%                Cache line, Huffman
Lekatsas  MIPS, x86        50%, 80%           Cache line, stream division, Huffman
Araujo    MIPS              43%               Cache line, operand factorization, Huffman
Liao      TMS320C25        82%                Procedure abstraction
Kirovski  SPARC            60%                Procedure compression
Ernst     SPARC            20%                Interpreted wire code

• Many compression methods
  – Compression unit: instruction, cache line, or procedure
• Difficult to compare previous results
  – Studies use different instruction sets and benchmarks
• Many studies do not measure execution time
• Wire code
  – Small size: a goal for embedded systems
  – Problem: no random access; the entire program must be decompressed at once


Hardware decompression: CodePack

• Overview
  – IBM, PowerPC instruction set
  – 60% compression ratio, ±10% performance [IBM]
    • Performance gain is due to prefetching
• Implementation
  – Binary executables are compressed after compilation
  – Decompression occurs during an instruction cache miss
    • The instruction cache holds native code
    • Two cache lines (16 instructions) are decompressed at a time
  – The PowerPC core is unaware of compression


CodePack encoding

• A 32-bit PowerPC instruction word is split into its upper 16 bits (bits 0-15) and lower 16 bits (bits 16-31); each half is encoded separately as a tag plus a dictionary index, or an escape tag followed by raw bits:

Tag  Upper-half encoding              Lower-half encoding
00   8 entries, 3-bit index           1 entry: encodes zero (no index bits)
01   32 entries, 5-bit index          16 entries, 4-bit index
100  64 entries, 6-bit index          32 entries, 5-bit index
101  128 entries, 7-bit index         128 entries, 7-bit index
110  256 entries, 8-bit index         256 entries, 8-bit index
111  escape: 16 raw bits follow       escape: 16 raw bits follow


CodePack system

• CodePack is part of the memory system
  – It sits after the L1 instruction cache
• Dictionaries
  – Contain the 16-bit upper and lower halves of instructions
• Index table
  – Maps instruction addresses to compressed-code addresses

[Figure: instruction memory hierarchy, with the processor and instruction cache in front of the CodePack decompressor, which reads the index table, compressed code, and dictionaries from main memory]


CodePack decompression

• Fetch index: the L1 I-cache miss address indexes the index table (in main memory), yielding the byte-aligned address of the compression block
• Fetch compressed instructions: read the compressed bytes of the block (16 instructions) from main memory
• Decompress: for each compressed instruction, the high tag and index select the high 16 bits from the high dictionary, and the low tag and index select the low 16 bits from the low dictionary, reconstructing the native instruction (a software rendering of this decode step is sketched below)
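The per-instruction decode step has a direct software analogue. Below is a minimal C sketch of decoding one 16-bit half under the tag assignments in the encoding table above; the get_bits bitstream helper, the flat dictionary layout, and all names are illustrative assumptions, not IBM's implementation.

```c
#include <stdint.h>

/* Hypothetical helper: return the next n bits of the compressed stream,
 * MSB first; get_bits(0) returns 0 without consuming anything. */
extern uint32_t get_bits(int n);

/* Decode one 16-bit half.  The dictionary is assumed to be a flat array in
 * which the tag groups (00, 01, 100, 101, 110) occupy consecutive regions.
 * is_high selects the index widths, which differ between the two halves. */
uint16_t decode_half(const uint16_t *dict, int is_high)
{
    int w00  = is_high ? 3 : 0;   /* tag 00: 8 entries high, 1 (zero) low */
    int w01  = is_high ? 5 : 4;   /* tag 01: 32 high, 16 low */
    int w100 = is_high ? 6 : 5;   /* tag 100: 64 high, 32 low */

    /* Group base offsets follow from the group sizes. */
    int base01  = 1 << w00;
    int base100 = base01 + (1 << w01);
    int base101 = base100 + (1 << w100);
    int base110 = base101 + 128;  /* tag 101: 7-bit index in both halves */

    switch (get_bits(2)) {
    case 0:  return dict[get_bits(w00)];            /* 00 */
    case 1:  return dict[base01 + get_bits(w01)];   /* 01 */
    case 2:  return get_bits(1) == 0                /* 10x */
                 ? dict[base100 + get_bits(w100)]   /* 100 */
                 : dict[base101 + get_bits(7)];     /* 101 */
    default: return get_bits(1) == 0                /* 11x */
                 ? dict[base110 + get_bits(8)]      /* 110 */
                 : (uint16_t)get_bits(16);          /* 111: raw bits */
    }
}

/* Reassemble one native 32-bit PowerPC instruction from the two halves. */
uint32_t decode_insn(const uint16_t *hi_dict, const uint16_t *lo_dict)
{
    uint32_t hi = decode_half(hi_dict, 1);
    return (hi << 16) | decode_half(lo_dict, 0);
}
```

The hardware can decode the two halves with parallel dictionary lookups; the serial sketch above only illustrates the encoding.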
Instruction cache miss latency

• Native code uses critical-word-first fetch
• Compressed code must be fetched sequentially
  a) Native code: cache miss, then fetch instructions from main memory (critical word first)
  b) Compressed code: cache miss, then fetch the index from main memory, fetch the codes from main memory, and decompress
  c) Compressed code + optimizations: cache miss, then fetch the index from an index cache, fetch the codes from main memory, and run two decompressors in parallel

[Figure: cycle-by-cycle timelines of the three cases, showing L1 cache miss, index fetch, instruction fetch, decompression cycles, and when the critical instruction word arrives]


Comparison of optimizations

• The index cache provides the largest benefit
• Optimizations
  – Index cache: 64 lines, 4 indices/line, fully associative
  – 2nd decoder
• Speedup over native code: 0.97 to 1.05
• Speedup over CodePack: 1.17 to 1.25

[Figure: speedup over native code on cc1, go, perl, and vortex for CodePack alone, index cache, 2nd decoder, and both optimizations]


Hardware decompression conclusions

• Performance can be improved at modest cost
  – Remove decompression overhead: index lookup, dictionary lookup
  – Better memory bus utilization
• Compression can speed up execution
  – Compressed code requires fewer main-memory accesses
  – CodePack includes simple prefetching
• Systems that benefit most from compression
  – Narrow memory bus
  – Slow memory


Software decompression

• Previous work
  – Whole-program compression [Taunton91]
    • Saved disk space
    • No memory savings
  – Procedure compression [Kirovski97]
    • Requires a large decompression memory
    • Fragments the decompression memory
    • Slow
• My work
  – Decompression unit: 1 or 2 cache lines
  – Focus on high performance


How does code compression work?

• What is compressed? Individual instructions
• When is decompression performed? During an I-cache miss
• How is decompression implemented? The I-cache miss invokes an exception handler
• What is decompressed? 1 or 2 cache lines
• Where are decompressed instructions stored? The I-cache is the decompression buffer

[Figure: system-on-a-chip with processor, I-cache (SRAM) holding the decompressed instructions, and D-cache (SRAM); main memory (eDRAM) holds the decompressor program and its data, the dictionary-compressed program, the index table, native code, and program data]


Dictionary compression algorithm

• Goal: fast decompression
• The dictionary contains the program's unique instructions
• Each 32-bit program instruction is replaced with a short 16-bit index
• Example: in the original program, lw r2,r3 and lw r15,r3 are 32-bit instructions in the .text segment; in the compressed program, a .dictionary segment holds each unique instruction once, and the .text segment holds only 16-bit indices (5 for lw r2,r3, 30 for lw r15,r3)


Decompression

• Algorithm (sketched in C below)
  1. An I-cache miss invokes the decompressor (exception handler)
  2. Fetch the index
  3. Fetch the dictionary word
  4. Place the instruction in the I-cache (special instruction)
• Instructions are written directly into the I-cache
• Decompressed instructions exist only in the I-cache

[Figure: the decompressor reads the indices and dictionary from main memory through the D-cache and writes the decoded instruction, e.g. add r1,r2,r3, into the I-cache]
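To make the four steps concrete, here is a minimal C sketch of such a handler. It assumes a 32-byte line of eight 4-byte instructions, a store_icache() wrapper for the store-instruction primitive discussed under hardware support below, and an illustrative TEXT_BASE load address; these names are assumptions for the sketch, not the thesis implementation.

```c
#include <stdint.h>

#define LINE_INSNS 8                 /* 32-byte I-cache line of 4-byte instructions */
#define TEXT_BASE  0x00400000u       /* illustrative load address of .text */

extern const uint32_t dictionary[];  /* the program's unique 32-bit instructions */
extern const uint16_t indices[];     /* one 16-bit index per original instruction */

/* Hypothetical wrapper around the "store-instruction" primitive: writes one
 * decoded instruction directly into the I-cache at its native address. */
extern void store_icache(uint32_t native_addr, uint32_t insn);

/* Step 1: the decompression exception vectors here on an I-cache miss. */
void dictionary_miss_handler(uint32_t miss_addr)
{
    uint32_t line_addr = miss_addr & ~(uint32_t)(LINE_INSNS * 4 - 1);
    uint32_t first     = (line_addr - TEXT_BASE) / 4;   /* instruction number */

    for (int i = 0; i < LINE_INSNS; i++) {
        uint16_t idx  = indices[first + i];     /* step 2: fetch index */
        uint32_t insn = dictionary[idx];        /* step 3: fetch dictionary word */
        store_icache(line_addr + 4 * i, insn);  /* step 4: fill the I-cache */
    }
    /* On return from the exception, the fetch retries and hits in the I-cache. */
}
```

Note that the index and dictionary reads travel through the D-cache, so the decompression overhead also depends on the data-side memory system.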
Hardware support

• Decompression exception
  – Raised on an I-cache miss
  – Not raised in native code sections (allows hybrid programs)
  – Similar to informing memory operations [Horowitz98]
• Store-instruction instruction
  – Precedents: MAJC, MIPS R10K


Two software decompressors

• Dictionary: faster, less compression
• CodePack: a software version of IBM's CodePack hardware; slower, more compression

                            Dictionary      CodePack
Codewords (indices)         Fixed-length    Variable-length
Decompression granularity   1 cache line    2 cache lines
Static instructions         43              174
Dynamic instructions        43              1042-1062
Decompression overhead      73-105 cycles   1235-1266 cycles


Compression ratio

• compression ratio = compressed size / original size
  – CodePack: 55%-63%
  – Dictionary: 65%-82%

[Figure: compression ratios of Dictionary and CodePack on cc1, ghostscript, go, ijpeg, mpeg2enc, pegwit, perl, and vortex]


Simulation environment

• SimpleScalar
• Pipeline: 5-stage, in-order
• I-cache: 4KB, 32B lines, 2-way
• D-cache: 8KB, 16B lines, 2-way
• Memory: embedded DRAM
  – 10-cycle latency
  – Bus width = 1 cache line
  – 10x denser than SRAM caches
• Performance metric: slowdown = 1 / speedup (1.00 = native code)
• Area results include:
  – Main memory holding the program (compressed bytes, tables)
  – I-cache
  – Memory for decompressor optimizations (memoization, native code)


Performance

• CodePack: very high overhead
• Reduce overhead by reducing cache misses
• Ghostscript slowdown by I-cache size:

I-cache size   CodePack   Dictionary   Native
4KB            19.19      2.66         1.00
16KB           3.17       1.43         1.00
64KB           1.26       1.04         1.00


Cache miss

• Control slowdown by optimizing the I-cache miss ratio
• A small change in miss ratio has a large performance impact

[Figure: slowdown relative to native code vs. I-cache miss ratio (0-8%) for CodePack and Dictionary with 4KB, 16KB, and 64KB caches]


Two optimizations

• Hybrid programs (static)
  – Both compressed and native code
• Memoization (dynamic)
  – Cache recent decompressions in main memory
• Both can be applied to any compression algorithm

[Figure: program memory layouts. Original: native. Compressed: compressed. Memoization: compressed + memo. Hybrid: compressed + native. Combined: compressed + native + memo]


Hybrid programs

• Selective compression
  – Compress only some procedures
  – Trade size for speed
  – Avoid decompression overhead
• Profile methods
  – Count dynamic instructions
    • Example: ARM/Thumb
    • Use when compressed code executes more instructions
    • Reduces the number of executed instructions
  – Count cache misses
    • Example: CodePack
    • Use when compressed code has a longer cache-miss latency
    • Reduces cache-miss latency


Cache miss profiling

• A cache-miss profile reduces overhead by 50%
• Loop-oriented benchmarks benefit most
  – The two profiles differ in loop regions

[Figure: Pegwit (encryption), slowdown relative to native code (1.00-1.12) vs. compression ratio (60%-100%), comparing CodePack with dynamic-instruction profiling against cache-miss profiling]


Code placement

• Original code: procedures A, B, C, D laid out in memory
• Whole compression: compressed code a, b, c, d plus the dictionary; the decompressed region (in the L1 cache only) keeps the same order A, B, C, D
• Selective compression: compressed code a, c plus the dictionary, with B and D in a native region; the decompressed image has a different order!


CodePack vs. Dictionary

• More compression may yield better performance
  – CodePack produces smaller code than Dictionary compression
  – Even with some native code mixed in, CodePack remains smaller
  – At the same total size, CodePack is faster because it can afford more native code

[Figure: Ghostscript, slowdown relative to native code vs. compression ratio (60%-100%) for CodePack and Dictionary, both using cache-miss profiles]


Memoization

• Reserve main memory for caching decompressed instructions
  – High-density DRAM stores more than the I-cache in less area
  – Managed as a cache: data and tags
• Algorithm (sketched below)
  – The decompressor checks the memo table before decompressing
  – On a hit, copy the instructions into the I-cache (no decompression)
  – On a miss, decompress into the I-cache and update the memo table

[Figure: without memoization, every miss address goes to the decompressor; with memoization, a tag match copies an already-decompressed cache line instead]
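Here is a minimal sketch of that check, reusing the names from the dictionary handler sketched earlier; the direct-mapped organization and the 512-line size (16KB of 32-byte lines) are illustrative assumptions.

```c
#include <stdint.h>
/* LINE_INSNS, TEXT_BASE, dictionary, indices, and store_icache
 * are as declared in the previous sketch. */

#define MEMO_LINES 512   /* e.g. a 16KB table of 32-byte lines, direct-mapped */

/* Memo table kept in main memory (DRAM): tags plus decompressed lines.
 * Tags are assumed to be initialized to an invalid address at startup. */
static uint32_t memo_tag[MEMO_LINES];
static uint32_t memo_data[MEMO_LINES][LINE_INSNS];

void memoized_miss_handler(uint32_t miss_addr)
{
    uint32_t line_addr = miss_addr & ~(uint32_t)(LINE_INSNS * 4 - 1);
    uint32_t set = (line_addr / (LINE_INSNS * 4)) % MEMO_LINES;

    if (memo_tag[set] == line_addr) {
        /* Hit: copy already-decompressed instructions; skip decompression. */
        for (int i = 0; i < LINE_INSNS; i++)
            store_icache(line_addr + 4 * i, memo_data[set][i]);
        return;
    }
    /* Miss: decompress as usual and remember the result for next time. */
    uint32_t first = (line_addr - TEXT_BASE) / 4;
    for (int i = 0; i < LINE_INSNS; i++) {
        uint32_t insn = dictionary[indices[first + i]];
        memo_data[set][i] = insn;
        store_icache(line_addr + 4 * i, insn);
    }
    memo_tag[set] = line_addr;
}
```

On a hit, the handler performs only loads and I-cache stores, which is why memoization helps the expensive CodePack decode far more than the cheap Dictionary decode, as the results below show.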
Memoization results

• 16KB memoization table
• CodePack: large improvement
• Dictionary is already fast: small improvement

[Figure: per-benchmark slowdowns with and without the memo table, for CodePack and Dictionary]


Combined

• Use memoization on hybrid programs
  – Keep area constant; partition the memory for the best solution
• The combined solution is often the best

[Figure: per-benchmark slowdowns for CodePack with a 32KB decompression buffer partitioned as 0KB, 4KB, 8KB, or 16KB of memo, and as 32KB memo (not hybrid)]


Combined (continued)

• Adding DRAM is better than a larger SRAM cache

[Figure: Ghostscript using CodePack, slowdown vs. compression ratio for 4KB to 64KB I-caches paired with memo tables from 0KB to 288KB; configurations are grouped by total area from 9KB to 288KB, with native code shown for reference]


Conclusions

• High-performance SW decompression is possible
  – Dictionary is faster than CodePack, but its compression ratio is 5-25% worse
  – Hardware support needed:
    • I-cache miss exception
    • Store-instruction instruction
• Tuning performance
  – Cache size
  – Hybrid programs, memoization
• Hybrid programs
  – Use a cache-miss profile for loop-oriented benchmarks
• Memoization
  – No profile required, but has a start-up latency
  – Effective on large working sets


Future work

• Compiler optimization for compression
  – Improve compression ratio without affecting performance
• Unify selective compression and code placement
  – Reduce I-cache misses to improve performance
• Energy consumption
  – Increased run time may cost too much energy
  – Improved bus utilization saves energy
• Dynamic code generation: Crusoe, Dynamo
  – Memory management, code stitching
• Compression for high performance
  – Lower L2 miss rate


Web page

http://www.eecs.umich.edu/compress