CGO 2016 Re-constructing High-Level Information for Language-specific Binary Re-optimization Toshihiko Koju IBM Research - Tokyo Reid Copeland IBM Canada Motohiro Kawahito IBM Research - Tokyo Moriyoshi Ohara IBM Research - Tokyo 1 © 2016 IBM Corporation Summary Language-specific binary optimizer can achieve competitive performance relative to the state-of-the-art source compiler. source code latest compiler re-compiled binary Achieved +40.1% by our binary optimizer vs. +55.2% by the latest source compiler. earlier compiler original binary binary optimizer optimized binary language-specific contextual info. 2 © 2016 IBM Corporation Outline ! Motivation & Goal. ! Our Approach. – Four Major Optimizations. ! Performance Evaluation. ! Conclusions. 3 © 2016 IBM Corporation Compilation Technologies are becoming Increasingly Important ! Recent processors are equipped with specialized HW units. – e.g. SIMD, FPGA, GPU, DFP (Decimal Floating-Point) single-thread perf. of CPU ! Re-compilation with the latest compilation technologies is typically required to exploit them. downlevel processor with specialized HW units Specialized HW units complement diminishing single-thread performance improvements. w/o specialized HW units CPU generation latest processor DFP SIMD FPGA re-compilation existing application 4 latest compiler GPU re-compiled application © 2016 IBM Corporation Compiler Migration is Often Challenging for Large Enterprise Customers ! Adapting to changes in the following can be a significant undertaking: – language, compiler option, build environment, runtime environment changes in language behavior, compiler options, build environment, runtime environment downlevel processor compiler migration re-compilation existing application 5 latest compiler latest processor DFP SIMD FPGA GPU re-compiled application © 2016 IBM Corporation Ease Migration by Binary Optimization ! We have developed a production-level binary optimizer for mainframe COBOL applications. – IBM Automatic Binary Optimizer for z/OS. ! The binary optimizer utilizes the same optimization engine of the latest source code COBOL compiler. changes in Source code re-compilation: compiler migration source code Binary re-optimization: existing binary 6 latest compiler language behavior, compiler options, build environment, runtime environment re-compiled binary Provide strong compatibility with the current environment. binary optimizer optimized binary © 2016 IBM Corporation Traditional Approach does not Work ! Naive translation from machine instructions to IR (intermediate Representation) degraded the performance. – This is typically for binary translators based on general compiler engines. 7 Relative Performance higher is better original binary 180.0% 160.0% 140.0% 120.0% 100.0% 80.0% 60.0% 40.0% 20.0% 0.0% naive translation w/o optimizations naive translation with optimizations 155.2% 100.0% source code re-compilation 90.0% 62.3% cob42.opt binOpt.O0 binOpt.O1 cob5.dev © 2016 IBM Corporation Our Goal ! Achieve competitive performance relative to the state-of-theart source code compiler. 8 Relative Performance higher is better original binary 180.0% 160.0% 140.0% 120.0% 100.0% 80.0% 60.0% 40.0% 20.0% 0.0% naive translation w/o optimizations naive translation with optimizations 155.2% 100.0% source code re-compilation fill in the gap 90.0% our results 62.3% cob42.opt binOpt.O0 binOpt.O1 cob5.dev © 2016 IBM Corporation Outline ! Motivation & Goal. ! Our Approach. – Four Major Optimizations. ! Performance Evaluation. ! Conclusions. 9 © 2016 IBM Corporation We Need High-level Information to Optimize Binaries ! Compiler optimizations require "HLI" (high-level information). – e.g. variables, data types, scopes, constants, library semantics ! Only limited HLI is provided by the naive translation. – Resulting in limited optimization opportunities. poor HLI (high-level information): • variables • data types • scopes • constants • library semantics existing binary naive IR generator IR optimization engine optimized binary limited optimization opportunities 10 © 2016 IBM Corporation Our Approach: Language-Specific Binary Optimization ! General binary optimizers typically do not assume source languages which results in better applicability. – e.g. Dynamo [Bala00], DynamoRIO [Bruening03], DBT86/RASP [Hertzberg11] ! Our binary optimizer assumes the source language to reconstruct HLI by using the "contextual information". – e.g. data structure, register usage, instruction pattern General binary optimizer: general binary No assumption on source languages. binary optimizer optimized binary poor HLI Our binary optimizer: COBOL binary 11 limited optimizations Assuming the source language. binary optimizer contextual information rich HLI optimized binary sophisticated optimizations © 2016 IBM Corporation Re-construction of the Key HLI machine instructions Variable location abstract interpretation: • data structure • register usage EFA (Effective Address) inspection of usage: • location • instruction pattern data type 12 constant value determination of routine: • EFA • data structure use-define analysis: • EFA • data structure liveness analysis: • location • data structure inspection of literal pool: • EFA • data structure runtime routine scope generation of aliasing information: • location • scope aliasing analyze parameter: • semantics of routine • parameter information inspection of instruction pattern: • variable • instruction pattern original operation © 2016 IBM Corporation Abstract Interpretation: Analysis for EFA ! Take advantage of knowledge about data structures and register usages of COBOL. Machine instructions: COBOL Data structures & register usage: (a) Load R2,(R9+0x200) R9 (b) Load R3,(R2+0x100) control block (c) Load R4,(R2+0x200) R13 (d) Load R5,(R2+R4) heap stack Abstract interpretation (a) EFA=control block+off_heap R2=&heap (b) EFA=heap+0x100 R3=unknown (c) EFA=heap+0x200 R4=unknown (d) EFA=heap+unknown R5=unknown R12 label table 13 literal pool Without the contextual information, we cannot tell EFA which is a basis of many other HLI. © 2016 IBM Corporation Four Major Optimizations ! There are four significant optimizations which are newly introduced to the latest source code compiler. 1. 2. 3. 4. Type Reduction of Decimal Type. Strength Reduction of Edited Moves. Skipping Truncations of Binary Numbers. Inlining Efficient Version of Routines. 14 naive translation naive translation w/o optimizations with optimizations 180.0% 160.0% four optimizations 140.0% 120.0% 100.0% 90.0% 100.0% 80.0% 62.3% 60.0% 40.0% 20.0% 0.0% cob42.opt binOpt.O0 binOpt.O1 155.2% Relative Performance higher is better original binary source code re-compilation cob5.dev © 2016 IBM Corporation Optimization 1: Type Reduction of Decimal Type (1/2) ! Convert decimal types to decimal floating-point (DFP) types. – Computation on decimal type: memory-to-memory – Computation on DFP type: register-to-register Representation of integer number 87: binary integer type " 87 zoned decimal type " 0xF8F7 packed decimal type " 0x087F Business applications perform a lot of decimal computations. HLI about variables is necessary to enable the type reduction. Simple Example: COBOL: ZonedDecimal A, B; A=A+B 15 Original instruction: ZonedToPacked ByteOR ZonedToPacked ByteOR PackedAdd PackedToZoned (R13+272, 5 bytes), (R2+0, 8 bytes) (R13+276), 0xf (R13+280, 5 bytes), (R2+8, 8 bytes) (R13+284),0xf (R13+272, 5bytes), (R13+280, 5 bytes) (R2+0, 8 bytes), (R13+272, 5 bytes) © 2016 IBM Corporation Optimization 1: Type Reduction of Decimal Type (2/2) Original instruction: ZonedToPacked ByteOR ZonedToPacked ByteOR PackedAdd PackedToZoned (R13+272, 5 bytes), (R2+0, 8 bytes) (R13+276), 0xf (R13+280, 5 bytes), (R2+8, 8 bytes) (R13+284),0xf (R13+272, 5bytes), (R13+280, 5 bytes) (R2+0, 8 bytes), (R13+272, 5 bytes) EFA information use-def analysis Type Reduction liveness analysis Use-Def Analysis: *D=DEF, U=Use ZonedToPacked D ByteOR UD ZonedToPacked D ByteOR UD PackedAdd UD U PackedToZoned U D Variables: TMP1 A 16 TMP2 Optimized instruction: ZonedToDFP FPR1,*(A) ZonedToDFP FPR2,*(B) DFPAdd FPR1,FPR2 DFPToZoned FPR1,*(A) U U B ZonedToPacked PackedSetSign ZonedToPacked PackedSetSign PackedAdd PackedToZoned TMP1, A TMP1, plus TMP2, B TMP2, plus TMP1, TMP2 TMP1, A Re-constructed HLI about variables © 2016 IBM Corporation Optimization 2: Strength Reduction of Edited Moves ! Generate a specialized instruction sequence for a string format operation in COBOL. Simple Example: COBOL: * FOO: string of the format 9999. MOVE 123 TO FOO. " FOO=0123 Original instruction: EDIT control-bytes,numeric EDIT is a millicode instruction which interprets each byte of the control-bytes. Re-construct HLI about the original operation from control-bytes. 17 Optimized instruction: ConvToZoned TMP, numeric CopyZoned FOO, TMP © 2016 IBM Corporation Optimization 3: Skipping Truncations of Binary Numbers ! Generate guard code to skip unnecessary truncation of binary integer numbers. Simple Example: COBOL: * FOO: binary number with 4 digits. FOO=FOO+1000. Original code: TMP=FOO+1000 FOO=TMP%10000 Divide instruction is unconditionally executed to suppress overflows. Overflow is rare. Re-construct HLI about the original operation: • Quot is not in use. • Dividend is a constant of Pow10. 18 Optimized code: TMP=FOO+1000 if abs(TMP) >= 10000 TMP=TMP%10000 FOO=TMP © 2016 IBM Corporation Optimization 4: Inlining Efficient Version of Routines ! Replace a runtime routine with the equivalent code which exploits a more efficient algorithm and newer instructions. Simple Example: COBOL: * FOO, BAR: large packed decimal numbers (e.g. 18 digits). FOO=FOO/BAR Original instruction: CALL LargeDivide Runtime path length is more than 100 instructions. Re-construct HLI about the original operation: • Semantics of the runtime routine. • Parameter information. 19 Exploit new DFP instructions. Optimized instruction: PackedToDFP FPR1,FOO PackedToDFP FPR2,BAR DFPDivide FPR1,FPR2 DFPRound FPR1 DFPToPacked FPR1,FOO © 2016 IBM Corporation Outline ! Motivation & Goal. ! Our Approach. – Four Major Optimizations. ! Performance Evaluation. ! Conclusions. 20 © 2016 IBM Corporation Experimental Environment ! Machine: – 5.5 GHz zEC12 running z/OS. ! Benchmark: – IBM Internal benchmark suite for the development of the COBOL compilers. ! Binary optimizer: – Development version of IBM Automatic Binary Optimizer (2015). ! Original source code compiler: – IBM Enterprise COBOL V4R2 (2009). ! Latest source code compiler: – Development version of IBM Enterprise COBOL V5 (2015). Decimal Benchmarks cevalnd accoun*ng computa*ons. amort6 mortgage payment comparisons and amor*za*on computa*ons (external FP). djpcom85 record processing. 21 seqpr upda*ng sequen*al file records. ixseq upda*ng indexed VSAM file records. Non-‐Decimal Benchmarks telco telecom system. invupd upda*ng sales master records. amort2 mortgage payment comparisons and amor*za*on computa*ons (long FP). atm06 automa*c teller machine transac*ons. © 2016 IBM Corporation Performance Impact of Each Optimization a. b. c. d. e. f. O0 (no optimization) O1 (without HLI) b. + Skipping truncation c. + Type reduction d. + Optimization of Edited move e. + inlining routines (full) " " " " " " -37.7% -10.0% -1.4% +25.4% +36.7% +40.1% Improved performance from -10.0% to +40.1% by re-constructing HLI. ! The baseline (100%) is the performance of the original binaries. – Original binaries are compiled with the highest optimization-level of the old compiler. decimal benchmarks non-decimal benchmarks 22 Rela?ve$Performance$ higher is better 300.0%$ a.$ 250.0%$ b.$ 200.0%$ c.$ 150.0%$ d.$ 100.0%$ e.$ 50.0%$ f.$ g.$ 0.0%$ cevalnd$ amort6$ djpcom85$ seqpr$ ixseq$ telco$ invupd$ amort2$ © 2016 IBM Corporation atm06$ geo.$mean$ Performance Comparisons with the Re-compilation f. e. + inlining routines (full) g. latest source code compiler " " +40.1% +55.2% Good competitiveness of the binary optimizer relative to the state-of-theart source code compiler. ! The baseline (100%) is the performance of the original binaries. – Original binaries are compiled with the highest optimization-level of the old compiler. decimal benchmarks non-decimal benchmarks 23 Rela?ve$Performance$ higher is better 300.0%$ a.$ 250.0%$ b.$ 200.0%$ c.$ 150.0%$ d.$ 100.0%$ e.$ 50.0%$ f.$ g.$ 0.0%$ cevalnd$ amort6$ djpcom85$ seqpr$ ixseq$ telco$ invupd$ amort2$ © 2016 IBM Corporation atm06$ geo.$mean$ Performance Comparisons with the Re-compilation f. e. + inlining routines (full) g. latest source code compiler " " +40.1% +55.2% Can get even closer by: • Enabling loop optimizations. • Extend range of analysis of runtime routines. ! The baseline (100%) is the performance of the original binaries. – Original binaries are compiled with the highest optimization-level of the old compiler. decimal benchmarks non-decimal benchmarks 24 Rela?ve$Performance$ higher is better 300.0%$ a.$ 250.0%$ b.$ 200.0%$ c.$ 150.0%$ d.$ 100.0%$ e.$ 50.0%$ f.$ g.$ 0.0%$ cevalnd$ amort6$ djpcom85$ seqpr$ ixseq$ telco$ invupd$ amort2$ © 2016 IBM Corporation atm06$ geo.$mean$ Outline ! Motivation & Goal. ! Our Approach. – Four Major Optimizations. ! Performance Evaluation. ! Conclusions. 25 © 2016 IBM Corporation Conclusion Language-specific binary optimizer can achieve competitive performance relative to the state-of-the-art source compiler. source code latest compiler re-compiled binary Achieved +40.1% by our binary optimizer vs. +55.2% by the latest source compiler. earlier compiler original binary binary optimizer optimized binary language-specific contextual info. 26 © 2016 IBM Corporation Questions? 27 © 2016 IBM Corporation