Re-constructing High-Level Information for Language-Specific Binary Re-optimization

CGO 2016
Re-constructing High-Level Information
for Language-specific Binary Re-optimization
Toshihiko Koju
IBM Research - Tokyo
Reid Copeland
IBM Canada
Motohiro Kawahito
IBM Research - Tokyo
Moriyoshi Ohara
IBM Research - Tokyo
1
© 2016 IBM Corporation
Summary
Language-specific binary optimizer can achieve competitive
performance relative to the state-of-the-art source compiler.
source code
latest
compiler
re-compiled
binary
Achieved +40.1% by our
binary optimizer vs. +55.2%
by the latest source compiler.
earlier
compiler
original
binary
binary
optimizer
optimized
binary
language-specific
contextual info.
2
© 2016 IBM Corporation
Outline
! Motivation & Goal.
! Our Approach.
– Four Major Optimizations.
! Performance Evaluation.
! Conclusions.
3
© 2016 IBM Corporation
Compilation Technologies are becoming
Increasingly Important
! Recent processors are equipped with specialized HW units.
– e.g. SIMD, FPGA, GPU, DFP (Decimal Floating-Point)
single-thread
perf. of CPU
! Re-compilation with the latest compilation technologies is
typically required to exploit them.
downlevel processor
with specialized
HW units
Specialized HW units complement
diminishing single-thread
performance improvements.
w/o specialized HW units
CPU generation
latest
processor
DFP
SIMD
FPGA
re-compilation
existing application
4
latest
compiler
GPU
re-compiled
application
© 2016 IBM Corporation
Compiler Migration is Often Challenging
for Large Enterprise Customers
! Adapting to changes in the following can be a significant
undertaking:
– language, compiler option, build environment, runtime environment
changes in
language behavior,
compiler options,
build environment,
runtime environment
downlevel processor
compiler
migration
re-compilation
existing application
5
latest
compiler
latest
processor
DFP
SIMD
FPGA
GPU
re-compiled
application
© 2016 IBM Corporation
Ease Migration by Binary Optimization
! We have developed a production-level binary optimizer for
mainframe COBOL applications.
– IBM Automatic Binary Optimizer for z/OS.
! The binary optimizer utilizes the same optimization engine of
the latest source code COBOL compiler.
changes in
Source code re-compilation:
compiler
migration
source code
Binary re-optimization:
existing binary
6
latest compiler
language behavior,
compiler options,
build environment,
runtime environment
re-compiled binary
Provide strong compatibility
with the current environment.
binary optimizer
optimized binary
© 2016 IBM Corporation
Traditional Approach does not Work
! Naive translation from machine instructions to IR (intermediate
Representation) degraded the performance.
– This is typically for binary translators based on general compiler engines.
7
Relative Performance
higher is better
original
binary
180.0%
160.0%
140.0%
120.0%
100.0%
80.0%
60.0%
40.0%
20.0%
0.0%
naive translation
w/o optimizations
naive translation
with optimizations
155.2%
100.0%
source code
re-compilation
90.0%
62.3%
cob42.opt binOpt.O0 binOpt.O1
cob5.dev
© 2016 IBM Corporation
Our Goal
! Achieve competitive performance relative to the state-of-theart source code compiler.
8
Relative Performance
higher is better
original
binary
180.0%
160.0%
140.0%
120.0%
100.0%
80.0%
60.0%
40.0%
20.0%
0.0%
naive translation
w/o optimizations
naive translation
with optimizations
155.2%
100.0%
source code
re-compilation
fill in the gap
90.0%
our results
62.3%
cob42.opt binOpt.O0 binOpt.O1
cob5.dev
© 2016 IBM Corporation
Outline
! Motivation & Goal.
! Our Approach.
– Four Major Optimizations.
! Performance Evaluation.
! Conclusions.
9
© 2016 IBM Corporation
We Need High-level Information to Optimize Binaries
! Compiler optimizations require "HLI" (high-level information).
– e.g. variables, data types, scopes, constants, library semantics
! Only limited HLI is provided by the naive translation.
– Resulting in limited optimization opportunities.
poor HLI (high-level information):
•  variables
•  data types
•  scopes
•  constants
•  library semantics
existing
binary
naive IR
generator
IR
optimization
engine
optimized
binary
limited optimization
opportunities
10
© 2016 IBM Corporation
Our Approach: Language-Specific Binary Optimization
! General binary optimizers typically do not assume source
languages which results in better applicability.
– e.g. Dynamo [Bala00], DynamoRIO [Bruening03], DBT86/RASP [Hertzberg11]
! Our binary optimizer assumes the source language to reconstruct HLI by using the "contextual information".
– e.g. data structure, register usage, instruction pattern
General binary optimizer:
general binary
No assumption on source languages.
binary optimizer
optimized binary
poor HLI
Our binary optimizer:
COBOL binary
11
limited optimizations
Assuming the source language.
binary optimizer
contextual
information
rich HLI
optimized binary
sophisticated
optimizations
© 2016 IBM Corporation
Re-construction of the Key HLI
machine
instructions
Variable
location
abstract interpretation:
•  data structure
•  register usage
EFA
(Effective
Address)
inspection of usage:
•  location
•  instruction pattern
data type
12
constant
value
determination of routine:
•  EFA
•  data structure
use-define analysis:
•  EFA
•  data structure
liveness analysis:
•  location
•  data structure
inspection of literal pool:
•  EFA
•  data structure
runtime
routine
scope
generation of
aliasing information:
•  location
•  scope
aliasing
analyze parameter:
•  semantics of routine
•  parameter information
inspection of
instruction pattern:
•  variable
•  instruction pattern
original
operation
© 2016 IBM Corporation
Abstract Interpretation: Analysis for EFA
! Take advantage of knowledge about data structures and
register usages of COBOL.
Machine
instructions:
COBOL Data structures
& register usage:
(a) Load R2,(R9+0x200)
R9
(b) Load R3,(R2+0x100)
control block
(c) Load R4,(R2+0x200)
R13
(d) Load R5,(R2+R4)
heap
stack
Abstract
interpretation
(a) EFA=control block+off_heap
R2=&heap
(b) EFA=heap+0x100
R3=unknown
(c) EFA=heap+0x200
R4=unknown
(d) EFA=heap+unknown
R5=unknown
R12
label table
13
literal
pool
Without the contextual information, we cannot
tell EFA which is a basis of many other HLI.
© 2016 IBM Corporation
Four Major Optimizations
! There are four significant optimizations which are newly
introduced to the latest source code compiler.
1. 
2. 
3. 
4. 
Type Reduction of Decimal Type.
Strength Reduction of Edited Moves.
Skipping Truncations of Binary Numbers.
Inlining Efficient Version of Routines.
14
naive translation naive translation
w/o optimizations with optimizations
180.0%
160.0%
four
optimizations
140.0%
120.0%
100.0%
90.0%
100.0%
80.0%
62.3%
60.0%
40.0%
20.0%
0.0%
cob42.opt binOpt.O0 binOpt.O1
155.2%
Relative Performance
higher is better
original
binary
source code
re-compilation
cob5.dev
© 2016 IBM Corporation
Optimization 1: Type Reduction of Decimal Type (1/2)
! Convert decimal types to decimal floating-point (DFP) types.
– Computation on decimal type: memory-to-memory
– Computation on DFP type: register-to-register
Representation of integer number 87:
binary integer type " 87
zoned decimal type " 0xF8F7
packed decimal type " 0x087F
Business applications perform a lot of decimal computations.
HLI about variables is necessary
to enable the type reduction.
Simple Example:
COBOL:
ZonedDecimal A, B;
A=A+B
15
Original instruction:
ZonedToPacked
ByteOR
ZonedToPacked
ByteOR
PackedAdd
PackedToZoned
(R13+272, 5 bytes), (R2+0, 8 bytes)
(R13+276), 0xf
(R13+280, 5 bytes), (R2+8, 8 bytes)
(R13+284),0xf
(R13+272, 5bytes), (R13+280, 5 bytes)
(R2+0, 8 bytes), (R13+272, 5 bytes)
© 2016 IBM Corporation
Optimization 1: Type Reduction of Decimal Type (2/2)
Original instruction:
ZonedToPacked
ByteOR
ZonedToPacked
ByteOR
PackedAdd
PackedToZoned
(R13+272, 5 bytes), (R2+0, 8 bytes)
(R13+276), 0xf
(R13+280, 5 bytes), (R2+8, 8 bytes)
(R13+284),0xf
(R13+272, 5bytes), (R13+280, 5 bytes)
(R2+0, 8 bytes), (R13+272, 5 bytes)
EFA information
use-def analysis
Type Reduction
liveness analysis
Use-Def Analysis: *D=DEF, U=Use
ZonedToPacked
D
ByteOR
UD
ZonedToPacked
D
ByteOR
UD
PackedAdd
UD
U
PackedToZoned
U
D
Variables: TMP1
A
16
TMP2
Optimized instruction:
ZonedToDFP FPR1,*(A)
ZonedToDFP FPR2,*(B)
DFPAdd
FPR1,FPR2
DFPToZoned FPR1,*(A)
U
U
B
ZonedToPacked
PackedSetSign
ZonedToPacked
PackedSetSign
PackedAdd
PackedToZoned
TMP1, A
TMP1, plus
TMP2, B
TMP2, plus
TMP1, TMP2
TMP1, A
Re-constructed HLI
about variables
© 2016 IBM Corporation
Optimization 2: Strength Reduction of Edited Moves
! Generate a specialized instruction sequence for a string
format operation in COBOL.
Simple Example:
COBOL:
* FOO: string of the format 9999.
MOVE 123 TO FOO. " FOO=0123
Original instruction:
EDIT control-bytes,numeric
EDIT is a millicode instruction which
interprets each byte of the control-bytes.
Re-construct HLI about the original
operation from control-bytes.
17
Optimized instruction:
ConvToZoned TMP, numeric
CopyZoned
FOO, TMP
© 2016 IBM Corporation
Optimization 3: Skipping Truncations of Binary Numbers
! Generate guard code to skip unnecessary truncation of
binary integer numbers.
Simple Example:
COBOL:
* FOO: binary number with 4 digits.
FOO=FOO+1000.
Original code:
TMP=FOO+1000
FOO=TMP%10000
Divide instruction is unconditionally
executed to suppress overflows.
Overflow is rare.
Re-construct HLI about the original operation:
•  Quot is not in use.
•  Dividend is a constant of Pow10.
18
Optimized code:
TMP=FOO+1000
if abs(TMP) >= 10000
TMP=TMP%10000
FOO=TMP
© 2016 IBM Corporation
Optimization 4: Inlining Efficient Version of Routines
! Replace a runtime routine with the equivalent code which
exploits a more efficient algorithm and newer instructions.
Simple Example:
COBOL:
* FOO, BAR: large packed decimal numbers (e.g. 18 digits).
FOO=FOO/BAR
Original instruction:
CALL LargeDivide
Runtime path length is more than
100 instructions.
Re-construct HLI about the original operation:
•  Semantics of the runtime routine.
•  Parameter information.
19
Exploit new DFP
instructions.
Optimized instruction:
PackedToDFP FPR1,FOO
PackedToDFP FPR2,BAR
DFPDivide
FPR1,FPR2
DFPRound
FPR1
DFPToPacked FPR1,FOO
© 2016 IBM Corporation
Outline
! Motivation & Goal.
! Our Approach.
– Four Major Optimizations.
! Performance Evaluation.
! Conclusions.
20
© 2016 IBM Corporation
Experimental Environment
! Machine:
–  5.5 GHz zEC12 running z/OS.
! Benchmark:
–  IBM Internal benchmark suite for the development of the COBOL compilers.
! Binary optimizer:
–  Development version of IBM Automatic Binary Optimizer (2015).
! Original source code compiler:
–  IBM Enterprise COBOL V4R2 (2009).
! Latest source code compiler:
–  Development version of IBM Enterprise COBOL V5 (2015).
Decimal Benchmarks
cevalnd
accoun*ng computa*ons.
amort6
mortgage payment comparisons and amor*za*on computa*ons (external FP).
djpcom85 record processing.
21
seqpr
upda*ng sequen*al file records.
ixseq
upda*ng indexed VSAM file records.
Non-­‐Decimal Benchmarks
telco
telecom system.
invupd
upda*ng sales master records.
amort2
mortgage payment comparisons and amor*za*on computa*ons (long FP).
atm06
automa*c teller machine transac*ons.
© 2016 IBM Corporation
Performance Impact of Each Optimization
a. 
b. 
c. 
d. 
e. 
f. 
O0 (no optimization)
O1 (without HLI)
b. + Skipping truncation
c. + Type reduction
d. + Optimization of Edited move
e. + inlining routines (full)
"
"
"
"
"
"
-37.7%
-10.0%
-1.4%
+25.4%
+36.7%
+40.1%
Improved performance
from -10.0% to +40.1%
by re-constructing HLI.
!  The baseline (100%) is the performance of the original binaries.
–  Original binaries are compiled with the highest optimization-level of the old compiler.
decimal benchmarks
non-decimal benchmarks
22
Rela?ve$Performance$
higher is better
300.0%$
a.$
250.0%$
b.$
200.0%$
c.$
150.0%$
d.$
100.0%$
e.$
50.0%$
f.$
g.$
0.0%$
cevalnd$
amort6$ djpcom85$
seqpr$
ixseq$
telco$
invupd$
amort2$
© 2016 IBM Corporation
atm06$ geo.$mean$
Performance Comparisons with the Re-compilation
f.  e. + inlining routines (full)
g.  latest source code compiler
"
"
+40.1%
+55.2%
Good competitiveness of
the binary optimizer
relative to the state-of-theart source code compiler.
!  The baseline (100%) is the performance of the original binaries.
–  Original binaries are compiled with the highest optimization-level of the old compiler.
decimal benchmarks
non-decimal benchmarks
23
Rela?ve$Performance$
higher is better
300.0%$
a.$
250.0%$
b.$
200.0%$
c.$
150.0%$
d.$
100.0%$
e.$
50.0%$
f.$
g.$
0.0%$
cevalnd$
amort6$ djpcom85$
seqpr$
ixseq$
telco$
invupd$
amort2$
© 2016 IBM Corporation
atm06$ geo.$mean$
Performance Comparisons with the Re-compilation
f.  e. + inlining routines (full)
g.  latest source code compiler
"
"
+40.1%
+55.2%
Can get even closer by:
•  Enabling loop
optimizations.
•  Extend range of analysis
of runtime routines.
!  The baseline (100%) is the performance of the original binaries.
–  Original binaries are compiled with the highest optimization-level of the old compiler.
decimal benchmarks
non-decimal benchmarks
24
Rela?ve$Performance$
higher is better
300.0%$
a.$
250.0%$
b.$
200.0%$
c.$
150.0%$
d.$
100.0%$
e.$
50.0%$
f.$
g.$
0.0%$
cevalnd$
amort6$ djpcom85$
seqpr$
ixseq$
telco$
invupd$
amort2$
© 2016 IBM Corporation
atm06$ geo.$mean$
Outline
! Motivation & Goal.
! Our Approach.
– Four Major Optimizations.
! Performance Evaluation.
! Conclusions.
25
© 2016 IBM Corporation
Conclusion
Language-specific binary optimizer can achieve competitive
performance relative to the state-of-the-art source compiler.
source code
latest
compiler
re-compiled
binary
Achieved +40.1% by our
binary optimizer vs. +55.2%
by the latest source compiler.
earlier
compiler
original
binary
binary
optimizer
optimized
binary
language-specific
contextual info.
26
© 2016 IBM Corporation
Questions?
27
© 2016 IBM Corporation