Presentation

Programmers Talk to Compilers
A Language to Describe the Optimization Search Space
Albert Cohen
A LCHEMY Group, INRIA Futurs, France
Scalable On-Chip Parallel Computing
Massive parallelism on a chip
Structure and nature of the hardware exposed to the software...
... need to be considered for correctness and/or performance
Towards adaptive programs (multi-version, continuous optimization)
SW/HW negociation, from load balancing to algorithm selection
IBM Workshop on Compilation and Architecture
General-purpose applications need choice for scalable performance
2
Context
Context
Physically distributed, layered and heterogeneous resources
Scalable On-Chip Parallel Computing
Massive parallelism on a chip
Structure and nature of the hardware exposed to the software...
... need to be considered for correctness and/or performance
Towards adaptive programs (multi-version, continuous optimization)
SW/HW negociation, from load balancing to algorithm selection
Impact on
on
Impact
Programming models
Optimizing compilers
Component models
Run-time systems
IBM Workshop on Compilation and Architecture
General-purpose applications need choice for scalable performance
2
Context
Context
Physically distributed, layered and heterogeneous resources
T HE X -L ANGUAGE
A T OOL FOR E XPERT P ROGRAMMERS TO D RIVE
P ROGRAM O PTIMIZATION WHILE M AINTAINING
H IGH P RODUCTIVITY AND P ORTABILITY
Goals of the X-Language
1. Compact representation of multiple program versions
→ Derive multiple or multi-version programs from a single source
→ Generate code at run-time if necessary
2. Explicit multiple optimization strategies
→ Rely on predefined transformation primitives
→ Declare high-level optimization goals rather than explicit transformations
3. Implement and apply custom optimizations
→ Custom transformations can be implemented by expert programmers
→ Derive decision trees automatically from abstract descriptions
4. Bring together individual transformations and actual performance measurements
→ Implement local/layered learning/search strategies
→ Couple with hardware counters, sampling mechanisms and phase detection
Key Design Ideas
Build on top of advanced multistage programming technology
Manipuate code expressions
code c = ‘{ bar(42); ‘}
Splice code into code
‘{ foo(‘%(c)); ‘}
// foo(bar(42));
Cross-stage persistence
int x = 42; code c = ‘{ foo(bar(x)); ‘}
5
run(c);
IBM Workshop on Compilation and Architecture
Generate and run code
Key Design Ideas
Build on top of advanced multistage programming technology
Manipuate code expressions
code c = ‘{ bar(42); ‘}
Splice code into code
‘{ foo(‘%(c)); ‘}
// foo(bar(42));
run(c);
Cross-stage persistence
int x = 42; code c = ‘{ foo(bar(x)); ‘}
Provide some form of reflection that do not alter observable semantics
(Assuming transformation legality)
Use code annotations: #pragma xlang
#pragma xlang transformation [ scope_name ] node_name_regexp
[ parameters ] [ additional_names ]
IBM Workshop on Compilation and Architecture
Generate and run code
5
Example: #pragma xlang unroll loop1 4
Features of the Language
Transformations primitives
Loop transformations
→ unrolling, strip-mining, distribution, fusion, coalescing, interchange, skewing, reindexing,
hoisting, shifting, scalar promotion, privatization
Interprocedural transformations
6
IBM Workshop on Compilation and Architecture
→ inlining, cloning, partial evaluation, slicing
Features of the Language
Transformations primitives
Loop transformations
→ unrolling, strip-mining, distribution, fusion, coalescing, interchange, skewing, reindexing,
hoisting, shifting, scalar promotion, privatization
Interprocedural transformations
Composition of code generators (multi-stage evaluation with splicing)
Sequence of annotation pragmas
Procedural abstraction (build custom transformations from primitives)
Control the application and parameters of each transformation
6
Compound transformations
IBM Workshop on Compilation and Architecture
→ inlining, cloning, partial evaluation, slicing
Features of the Language
Transformations primitives
Loop transformations
→ unrolling, strip-mining, distribution, fusion, coalescing, interchange, skewing, reindexing,
hoisting, shifting, scalar promotion, privatization
Interprocedural transformations
Composition of code generators (multi-stage evaluation with splicing)
Sequence of annotation pragmas
Procedural abstraction (build custom transformations from primitives)
Control the application and parameters of each transformation
Static analyses (crude scalar data-flow information right now)
6
Compound transformations
IBM Workshop on Compilation and Architecture
→ inlining, cloning, partial evaluation, slicing
Features of the Language
Transformations primitives
Loop transformations
→ unrolling, strip-mining, distribution, fusion, coalescing, interchange, skewing, reindexing,
hoisting, shifting, scalar promotion, privatization
Interprocedural transformations
Composition of code generators (multi-stage evaluation with splicing)
Sequence of annotation pragmas
Procedural abstraction (build custom transformations from primitives)
Control the application and parameters of each transformation
Static analyses (crude scalar data-flow information right now)
Dynamic analyses (only time measurement right now)
6
Compound transformations
IBM Workshop on Compilation and Architecture
→ inlining, cloning, partial evaluation, slicing
Example: Transformation Sequences
Each transformation regererates annotations for the next transformation
−→
#pragma xlang name loop1
for (ii=m; ii+4<n; ii+=4) {
#pragma xlang name loop1_2
for (i=ii; i<ii+4; i++)
a[i] = b[i];
#pragma xlang name loop1_3
}
for (i=ii; i<n i++)
a[i] = b[i];
#pragma xlang unroll loop1_2
7
#pragma xlang stripmine loop1 4 loop1_2 loop1_3
#pragma xlang unroll loop1_2
IBM Workshop on Compilation and Architecture
#pragma xlang name loop1
for (i=m; i<n; i++)
a[i] = b[i];
Example: Transformation Sequences
#pragma xlang unroll loop1_2
−→
#pragma xlang name loop1
for (ii=m; ii+4<n; ii+=4) {
#pragma xlang name loop1_2
i = ii;
a[i] = b[i];
i = ii+1;
a[i] = b[i];
i = ii+2;
a[i] = b[i];
i = ii+3;
a[i] = b[i];
}
#pragma xlang name loop1_3
for (i=ii; i<n i++)
a[i] = b[i];
8
#pragma xlang name loop1
for (ii=m; ii+4<n; ii+=4) {
#pragma xlang name loop1_2
for (i=ii; i<ii+4; i++)
a[i] = b[i];
#pragma xlang name loop1_3
}
for (i=ii; i<n i++)
a[i] = b[i];
IBM Workshop on Compilation and Architecture
Each transformation regererates annotations for the next transformation
// drive search/learning strategy from this evaluation
}
9
for (u=1; u<8; u++) {
code c = ‘{
#pragma xlang name loop1
for (i=m; i<n; i++)
a[i] = b[i];
#pragma xlang stripmine loop1 u loop1_2 loop1_3
‘}
run(c, &elapsed_time);
IBM Workshop on Compilation and Architecture
Example: Evaluating Multiple Versions
Implementation
Multistage evaluation
Currently: syntactic hacks with strings and printf
Plan: MetaOCaml with offshoring (implicitely heterogeneous multistage program generation)
Plan: Stratego
Compound transformations
Currently: limited by naming scheme and programmer’s understanding
Plan: go for goal-driven inference of transformation strategies
10
Currently: syntactic hacks with patterns and rewrite rules
IBM Workshop on Compilation and Architecture
Transformation primitives
// Simplified transformation sequence for IA64
// (excluding search engine, pipelining, prefetch and page copying)
#pragma
#pragma
#pragma
#pragma
#pragma
#pragma
#pragma
#pragma
#pragma
#pragma
#pragma
#pragma
xlang
xlang
xlang
xlang
xlang
xlang
xlang
xlang
xlang
xlang
xlang
xlang
transform
transform
transform
transform
transform
transform
transform
transform
transform
transform
transform
transform
stripmine iloop NU NUloop
stripmine jloop MU MUloop
interchange kloop MUloop
interchange jloop NUloop
interchange kloop NUloop
fullunroll NUloop
fullunroll MUloop
scalarize_in b in kloop
scalarize_in a in kloop
scalarize_in&out c in kloop
hoist kloop.loads before kloop
hoist kloop.stores after kloop
11
#pragma xlang name iloop
for (i=0; i<NB; i++)
#pragma xlang name jloop
for (j=0; j<NB; j++)
#pragma xlang name kloop
for (k=0; k<NB; k++) {
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
IBM Workshop on Compilation and Architecture
Full Example: Matrix Product in ATLAS
// ...
}
#pragma xlang name kloop.stores
{ c[i+0][j+0] = c_0_0; c[i+0][j+1] = c_0_1;
c[i+0][j+2] = c_0_2; c[i+0][j+3] = c_0_3; }
}}
// Remainder code
12
#pragma xlang name iloop
for (i=0; i<NB; i++) {
#pragma xlang name jloop
for (j=0; j<NB; j+=4) {
#pragma xlang name kloop.loads
{ c_0_0 = c[i+0][j+0]; c_0_1 = c[i+0][j+1];
c_0_2 = c[i+0][j+2]; c_0_3 = c[i+0][j+3]; }
#pragma xlang name kloop
for (k=0; k<NB; k++) {
{ a_0 = a[i+0][k]; a_1 = a[i+0][k];
a_2 = a[i+0][k]; a_3 = a[i+0][k]; }
{ b_0 = b[k][j+0]; b_1 = b[k][j+1];
b_2 = b[k][j+2]; b_3 = b[k][j+3]; }
{ c_0_0=c_0_0+a_0*b_0; c_0_1=c_0_1+a_1*b_1;
c_0_2=c_0_2+a_2*b_2; c_0_3=c_0_3+a_3*b_3; }
IBM Workshop on Compilation and Architecture
Full Example: Matrix Product in ATLAS
13
IBM Workshop on Compilation and Architecture
Preliminary Results
Main Limitations
1. Hard to understand and keep track of transformations effects
→ Build and manage long sequences of transformations
→ Convince the expert programmer that it saves him time
14
→ General kind of program construction
→ Algorithm selection
IBM Workshop on Compilation and Architecture
2. Define custom transformations, beyond combination of existing primitive ones
OVERCOMING THESE L IMITATIONS
C OMPLEX T RANSFORMATION
S EQUENCES
Managing Complex Transformation Sequences
SpecFP 2000 on Alpha 21264C (EV68)
Speed-ups w.r.t. HP F90 and C compiler with KAP preprocessor
Base SPEC: -arch ev6 -fast -O5
Peak SPEC
Manual Optimizations
SWIM
1.00
1.61
GALGEL
1.04
1.39
WUPWISE
1.20
2.90
APPLU
1.47
2.18
APSI
1.07
1.23
FACEREC
1.04
1.42
AMMP
1.18
1.40
EQUAKE
2.65
3.22
MESA
1.12
1.17
MGRID
1.59
1.45
FMA3D
1.32
1.09
ART
1.22
1.07
Managing Complex Transformation Sequences
A 1 : -3 2 s
P e e lin g
S h iftin g
P e e lin g
S h iftin g
P e e lin g
S h iftin g
A rra y F w d
S u b s tit u tio n
F u s io n
A 2 : -3 0 s
A 3 : -1 5 s
P a d d in g
T ilin g
SWIM (base 204s)
In te rc h a n g e
S tr ip - M in in g
F u s io n
F u s io n
F u s io n
S c a la r
P r o m o tio n
L 1 -L 2
F F i su s s i i o o n n
A 3 : -2 4 s
In s tr u c tio n
S p litin g
S h iftin g
L 1
F is s io n
S t r ip - M in in g
S h iftin g
L 2
A rra y C o p y
P r o p a g a tio n
S c a la r
P r o m o tio n
F F i su s s i i o o n n
A 5 : -6 s
F u s io n
S tr ip - M in in g
F u s io n
H o is tin g
U n r o ll
a n d J a m
R e g is te r
P r o m o tio n
F u s io n
GALGEL (base 171s)
A 1 : -3 s
A : -2 9 s
G : -1 1 s
F u s io n
F u ll
U n r o llin g
In s tr u c tio n
S p littin g
F lo a t. p o in t
R e o r d e r in g
B 1 : -4 s
P r iv a t iz a tio n
F is s io n
S c a la r
P r o m o tio n
P r iv a t iz a tio n
APPLU (base 214s)
P e e lin g
A 3 : -1 s
F u s io n
B 1 : -3 1 s
B 2 : -1 8 s
F is s io n
A 2 : -2 s
F u ll
U n r o llin g
G : -2 7 s
P r iv a t iz a t io n
D a ta
L a y o u t
P r iv a t iz a t io n
B 2 : -1 s
In te rc h a n g e
F is s io n
F is s io n
S o ftw a re
P ip e lin in g
S o ftw a re
P ip e lin in g
B 3 : -1 s
S o ftw a re
P ip e lin in g
C 1 : -1 1 s
P r iv a t iz a t io n
P r iv a t iz a t io n
F is s io n
In te rc h a n g e
APSI (base 378s)
IBM Workshop on Compilation and Architecture
A 1 : -1 4 s
S tr ip - M in in g
A 4 : -5 s
17
A 2 : + 2 4 s
Polyhedral Representation for Complex Sequences
Program and transformations represented as matrices
Each transformation amounts to minor coefficient changes w.r.t. the previous state
No code duplication before the final code generation step
P ip L ib
P o ly L ib
O R C
P re O p t
L N O
W O P T
R V I2
C G
B in a r y
W H IR L
w 2 p
C o n v e r s io n
W R a P
U R U K
tra n s fo rm e d
W R a P
T r a n s fo r m a tio n s
W H IR L
C L o o G
s o u rc e c o d e
C o n v e r s io n
W L o o G
IBM Workshop on Compilation and Architecture
P o ly h e d r a M a n a g e m e n t
L ib r a r y
18
S o u rc e
c o d e
P a r a m e tr ic in te g e r lin e a r
p r o g r a m m in g s o lv e r
Polyhedral Representation for Complex Sequences
Example: optimizing SWIM2000 for IA32
# Move and shift calc3 backwards
shift(enclose(C3L1),{’1’,’0’,’0’})
(...)
shift(C3L17,{’1’})
motion(enclose(C3L1),BLOOP)
(...)
motion(C3L17,BLOOP)
# Peel and shift to enable fusion
peel(enclose(C3L1,2),’3’)
peel(enclose(C3L1_2,2),’N-3’)
(...)
shift(enclose(C1L1_2_1_2_1),{’0’,’1’,’1’})
shift(enclose(C2L1_2_1_2_1),{’0’,’2’,’2’})
# Double fusion of the three nests
motion(enclose(C2L1_2_1_2_1),TARGET_2_1_2_1)
motion(enclose(C1L1_2_1_2_1),C2L1_2_1_2_1)
motion(enclose(C3L1_2_1_2_1),C1L1_2_1_2_1)
# Register blocking and unrolling (factor 2)
time_stripmine(enclose(C3L1_2_1_2_1,2),2,2)
time_stripmine(enclose(C3L1_2_1_2_1,1),4,2)
interchange(enclose(C3L1_2_1_2_1,2))
fullunroll(enclose(C3L1_2_1_2_1_2_1_2_1,2))
fullunroll(enclose(C3L1_2_1_2_1_2_1_2_1,1))
IBM Workshop on Compilation and Architecture
polydeps_save(BLOOP)
19
polydeps_check(BLOOP)
Managing Complex Transformation Sequences
Example: optimizing SWIM 2000
38% speed-up on AMD Athlon 64 3400+ at 2.2GHz
(Three main procedures inlined by Open64)
112 statements and 215 array references in the polyhedral representation
Exact dependence analysis
Nesting depth: 3 original, 5 after tiling
441 matrices for all pairs of references at all depths
12s analysis time
Code generation
2267 lines of code (≈ 1500 after scalar optimization)
URGenT (improvement on CLooG, regeneration of Open64’s WHIRL): 4s
Back-end compilation: 22s
20
Automatic Transformation
Transformation
Automatic
421 lines of code in the source program
IBM Workshop on Compilation and Architecture
Main SCoP
OVERCOMING THESE L IMITATIONS
P ROGAM C ONSTRUCTION
Program Construction
Example: sorting network for even-odd merge sort of size 16
a_2 = t[0] < t[8] ? t[0] : t[8];
b_3 = t[0] < t[8] ? t[8] : t[0];
a_4 = t[4] < t[12] ? t[4] : t[12];
b_5 = t[4] < t[12] ? t[12] : t[4];
a_6 = a_2 < a_4 ? a_2 : a_4;
b_7 = a_2 < a_4 ? a_4 : a_2;
a_8 = b_3 < b_5 ? b_3 : b_5;
b_9 = b_3 < b_5 ? b_5 : b_3;
a_10 = b_7 < a_8 ? b_7 : a_8;
b_11 = b_7 < a_8 ? a_8 : b_7;
// ...
t[11]
t[12]
t[13]
t[14]
=
=
=
=
b_95
b_95
b_87
b_87
<
<
<
<
a_112
a_112
b_113
b_113
?
?
?
?
b_95 : a_112;
a_112 : b_95;
b_87 : b_113;
b_113 : b_87;
Program Construction
Define safe progam combinators/generators: MetaOCaml + offshoring to C
Offers run-time program generation... at a high overhead cost
let rec merge l cont = match l with
[] -> cont []
| [x] -> cont [x]
| [x; y] -> compare [x] [y] cont
| _ ->
merge (even l) (fun e -> merge (odd l) (fun o -> (compare (List.tl e) o (fun l -> cont ((List.hd e) :: l)))))
let sort n =
let rec sort_rec l cont = match l with
[] -> cont []
| [x] -> cont [x]
| _ ->
sort_rec (even l) (fun e -> sort_rec (odd l) (fun o -> (merge (e @ o) cont))) in
.< fun t -> .~(sort_rec (make_list .<t>. n) (make_array .<t>.)) >.
23
let compare a b cont =
let rec compare_rec a b l cont = match (a, b) with
([], []) -> cont l
| ([], t) | (t, []) -> cont (l @ t)
| (x :: c, y :: d) ->
.<
let a = if (.~x < .~y) then .~x else .~y in
let b = if (.~x < .~y) then .~y else .~x
in .~(compare_rec c d (l @ [.<a>.; .<b>.]) cont)
>.
in compare_rec a b [] cont
IBM Workshop on Compilation and Architecture
But is far from natural for application experts
Conclusion: Future Optimizing Compilers
→ Fully automatic framework for abstraction-penalty removal
→ Machine learning and rule-based system for architecture-aware optimizations
→ Let application experts tell what is important
Tools for safe and efficient metaprogramming
IBM Workshop on Compilation and Architecture
Tightly coupled off-line and on-line optimization
Machine learning compilers
24
Research Directions
Directions
Research
Compilers must do tedious things in a predictable manner...
... but should not try to be smart
→ Aggressive off-line analysis and narrowing of the optimization search-space
→ Low-overhead just-in-time/run-time transformations and code generation
Complement intermediate representations with program generators
Progresses
Progresses
→ Expose algebraic properties of the search space
→ Support global and complex transformation sequences
SPIRAL
T HANK YOU