Programmers Talk to Compilers A Language to Describe the Optimization Search Space Albert Cohen A LCHEMY Group, INRIA Futurs, France Scalable On-Chip Parallel Computing Massive parallelism on a chip Structure and nature of the hardware exposed to the software... ... need to be considered for correctness and/or performance Towards adaptive programs (multi-version, continuous optimization) SW/HW negociation, from load balancing to algorithm selection IBM Workshop on Compilation and Architecture General-purpose applications need choice for scalable performance 2 Context Context Physically distributed, layered and heterogeneous resources Scalable On-Chip Parallel Computing Massive parallelism on a chip Structure and nature of the hardware exposed to the software... ... need to be considered for correctness and/or performance Towards adaptive programs (multi-version, continuous optimization) SW/HW negociation, from load balancing to algorithm selection Impact on on Impact Programming models Optimizing compilers Component models Run-time systems IBM Workshop on Compilation and Architecture General-purpose applications need choice for scalable performance 2 Context Context Physically distributed, layered and heterogeneous resources T HE X -L ANGUAGE A T OOL FOR E XPERT P ROGRAMMERS TO D RIVE P ROGRAM O PTIMIZATION WHILE M AINTAINING H IGH P RODUCTIVITY AND P ORTABILITY Goals of the X-Language 1. Compact representation of multiple program versions → Derive multiple or multi-version programs from a single source → Generate code at run-time if necessary 2. Explicit multiple optimization strategies → Rely on predefined transformation primitives → Declare high-level optimization goals rather than explicit transformations 3. Implement and apply custom optimizations → Custom transformations can be implemented by expert programmers → Derive decision trees automatically from abstract descriptions 4. Bring together individual transformations and actual performance measurements → Implement local/layered learning/search strategies → Couple with hardware counters, sampling mechanisms and phase detection Key Design Ideas Build on top of advanced multistage programming technology Manipuate code expressions code c = ‘{ bar(42); ‘} Splice code into code ‘{ foo(‘%(c)); ‘} // foo(bar(42)); Cross-stage persistence int x = 42; code c = ‘{ foo(bar(x)); ‘} 5 run(c); IBM Workshop on Compilation and Architecture Generate and run code Key Design Ideas Build on top of advanced multistage programming technology Manipuate code expressions code c = ‘{ bar(42); ‘} Splice code into code ‘{ foo(‘%(c)); ‘} // foo(bar(42)); run(c); Cross-stage persistence int x = 42; code c = ‘{ foo(bar(x)); ‘} Provide some form of reflection that do not alter observable semantics (Assuming transformation legality) Use code annotations: #pragma xlang #pragma xlang transformation [ scope_name ] node_name_regexp [ parameters ] [ additional_names ] IBM Workshop on Compilation and Architecture Generate and run code 5 Example: #pragma xlang unroll loop1 4 Features of the Language Transformations primitives Loop transformations → unrolling, strip-mining, distribution, fusion, coalescing, interchange, skewing, reindexing, hoisting, shifting, scalar promotion, privatization Interprocedural transformations 6 IBM Workshop on Compilation and Architecture → inlining, cloning, partial evaluation, slicing Features of the Language Transformations primitives Loop transformations → unrolling, strip-mining, distribution, fusion, coalescing, interchange, skewing, reindexing, hoisting, shifting, scalar promotion, privatization Interprocedural transformations Composition of code generators (multi-stage evaluation with splicing) Sequence of annotation pragmas Procedural abstraction (build custom transformations from primitives) Control the application and parameters of each transformation 6 Compound transformations IBM Workshop on Compilation and Architecture → inlining, cloning, partial evaluation, slicing Features of the Language Transformations primitives Loop transformations → unrolling, strip-mining, distribution, fusion, coalescing, interchange, skewing, reindexing, hoisting, shifting, scalar promotion, privatization Interprocedural transformations Composition of code generators (multi-stage evaluation with splicing) Sequence of annotation pragmas Procedural abstraction (build custom transformations from primitives) Control the application and parameters of each transformation Static analyses (crude scalar data-flow information right now) 6 Compound transformations IBM Workshop on Compilation and Architecture → inlining, cloning, partial evaluation, slicing Features of the Language Transformations primitives Loop transformations → unrolling, strip-mining, distribution, fusion, coalescing, interchange, skewing, reindexing, hoisting, shifting, scalar promotion, privatization Interprocedural transformations Composition of code generators (multi-stage evaluation with splicing) Sequence of annotation pragmas Procedural abstraction (build custom transformations from primitives) Control the application and parameters of each transformation Static analyses (crude scalar data-flow information right now) Dynamic analyses (only time measurement right now) 6 Compound transformations IBM Workshop on Compilation and Architecture → inlining, cloning, partial evaluation, slicing Example: Transformation Sequences Each transformation regererates annotations for the next transformation −→ #pragma xlang name loop1 for (ii=m; ii+4<n; ii+=4) { #pragma xlang name loop1_2 for (i=ii; i<ii+4; i++) a[i] = b[i]; #pragma xlang name loop1_3 } for (i=ii; i<n i++) a[i] = b[i]; #pragma xlang unroll loop1_2 7 #pragma xlang stripmine loop1 4 loop1_2 loop1_3 #pragma xlang unroll loop1_2 IBM Workshop on Compilation and Architecture #pragma xlang name loop1 for (i=m; i<n; i++) a[i] = b[i]; Example: Transformation Sequences #pragma xlang unroll loop1_2 −→ #pragma xlang name loop1 for (ii=m; ii+4<n; ii+=4) { #pragma xlang name loop1_2 i = ii; a[i] = b[i]; i = ii+1; a[i] = b[i]; i = ii+2; a[i] = b[i]; i = ii+3; a[i] = b[i]; } #pragma xlang name loop1_3 for (i=ii; i<n i++) a[i] = b[i]; 8 #pragma xlang name loop1 for (ii=m; ii+4<n; ii+=4) { #pragma xlang name loop1_2 for (i=ii; i<ii+4; i++) a[i] = b[i]; #pragma xlang name loop1_3 } for (i=ii; i<n i++) a[i] = b[i]; IBM Workshop on Compilation and Architecture Each transformation regererates annotations for the next transformation // drive search/learning strategy from this evaluation } 9 for (u=1; u<8; u++) { code c = ‘{ #pragma xlang name loop1 for (i=m; i<n; i++) a[i] = b[i]; #pragma xlang stripmine loop1 u loop1_2 loop1_3 ‘} run(c, &elapsed_time); IBM Workshop on Compilation and Architecture Example: Evaluating Multiple Versions Implementation Multistage evaluation Currently: syntactic hacks with strings and printf Plan: MetaOCaml with offshoring (implicitely heterogeneous multistage program generation) Plan: Stratego Compound transformations Currently: limited by naming scheme and programmer’s understanding Plan: go for goal-driven inference of transformation strategies 10 Currently: syntactic hacks with patterns and rewrite rules IBM Workshop on Compilation and Architecture Transformation primitives // Simplified transformation sequence for IA64 // (excluding search engine, pipelining, prefetch and page copying) #pragma #pragma #pragma #pragma #pragma #pragma #pragma #pragma #pragma #pragma #pragma #pragma xlang xlang xlang xlang xlang xlang xlang xlang xlang xlang xlang xlang transform transform transform transform transform transform transform transform transform transform transform transform stripmine iloop NU NUloop stripmine jloop MU MUloop interchange kloop MUloop interchange jloop NUloop interchange kloop NUloop fullunroll NUloop fullunroll MUloop scalarize_in b in kloop scalarize_in a in kloop scalarize_in&out c in kloop hoist kloop.loads before kloop hoist kloop.stores after kloop 11 #pragma xlang name iloop for (i=0; i<NB; i++) #pragma xlang name jloop for (j=0; j<NB; j++) #pragma xlang name kloop for (k=0; k<NB; k++) { c[i][j] = c[i][j] + a[i][k] * b[k][j]; } IBM Workshop on Compilation and Architecture Full Example: Matrix Product in ATLAS // ... } #pragma xlang name kloop.stores { c[i+0][j+0] = c_0_0; c[i+0][j+1] = c_0_1; c[i+0][j+2] = c_0_2; c[i+0][j+3] = c_0_3; } }} // Remainder code 12 #pragma xlang name iloop for (i=0; i<NB; i++) { #pragma xlang name jloop for (j=0; j<NB; j+=4) { #pragma xlang name kloop.loads { c_0_0 = c[i+0][j+0]; c_0_1 = c[i+0][j+1]; c_0_2 = c[i+0][j+2]; c_0_3 = c[i+0][j+3]; } #pragma xlang name kloop for (k=0; k<NB; k++) { { a_0 = a[i+0][k]; a_1 = a[i+0][k]; a_2 = a[i+0][k]; a_3 = a[i+0][k]; } { b_0 = b[k][j+0]; b_1 = b[k][j+1]; b_2 = b[k][j+2]; b_3 = b[k][j+3]; } { c_0_0=c_0_0+a_0*b_0; c_0_1=c_0_1+a_1*b_1; c_0_2=c_0_2+a_2*b_2; c_0_3=c_0_3+a_3*b_3; } IBM Workshop on Compilation and Architecture Full Example: Matrix Product in ATLAS 13 IBM Workshop on Compilation and Architecture Preliminary Results Main Limitations 1. Hard to understand and keep track of transformations effects → Build and manage long sequences of transformations → Convince the expert programmer that it saves him time 14 → General kind of program construction → Algorithm selection IBM Workshop on Compilation and Architecture 2. Define custom transformations, beyond combination of existing primitive ones OVERCOMING THESE L IMITATIONS C OMPLEX T RANSFORMATION S EQUENCES Managing Complex Transformation Sequences SpecFP 2000 on Alpha 21264C (EV68) Speed-ups w.r.t. HP F90 and C compiler with KAP preprocessor Base SPEC: -arch ev6 -fast -O5 Peak SPEC Manual Optimizations SWIM 1.00 1.61 GALGEL 1.04 1.39 WUPWISE 1.20 2.90 APPLU 1.47 2.18 APSI 1.07 1.23 FACEREC 1.04 1.42 AMMP 1.18 1.40 EQUAKE 2.65 3.22 MESA 1.12 1.17 MGRID 1.59 1.45 FMA3D 1.32 1.09 ART 1.22 1.07 Managing Complex Transformation Sequences A 1 : -3 2 s P e e lin g S h iftin g P e e lin g S h iftin g P e e lin g S h iftin g A rra y F w d S u b s tit u tio n F u s io n A 2 : -3 0 s A 3 : -1 5 s P a d d in g T ilin g SWIM (base 204s) In te rc h a n g e S tr ip - M in in g F u s io n F u s io n F u s io n S c a la r P r o m o tio n L 1 -L 2 F F i su s s i i o o n n A 3 : -2 4 s In s tr u c tio n S p litin g S h iftin g L 1 F is s io n S t r ip - M in in g S h iftin g L 2 A rra y C o p y P r o p a g a tio n S c a la r P r o m o tio n F F i su s s i i o o n n A 5 : -6 s F u s io n S tr ip - M in in g F u s io n H o is tin g U n r o ll a n d J a m R e g is te r P r o m o tio n F u s io n GALGEL (base 171s) A 1 : -3 s A : -2 9 s G : -1 1 s F u s io n F u ll U n r o llin g In s tr u c tio n S p littin g F lo a t. p o in t R e o r d e r in g B 1 : -4 s P r iv a t iz a tio n F is s io n S c a la r P r o m o tio n P r iv a t iz a tio n APPLU (base 214s) P e e lin g A 3 : -1 s F u s io n B 1 : -3 1 s B 2 : -1 8 s F is s io n A 2 : -2 s F u ll U n r o llin g G : -2 7 s P r iv a t iz a t io n D a ta L a y o u t P r iv a t iz a t io n B 2 : -1 s In te rc h a n g e F is s io n F is s io n S o ftw a re P ip e lin in g S o ftw a re P ip e lin in g B 3 : -1 s S o ftw a re P ip e lin in g C 1 : -1 1 s P r iv a t iz a t io n P r iv a t iz a t io n F is s io n In te rc h a n g e APSI (base 378s) IBM Workshop on Compilation and Architecture A 1 : -1 4 s S tr ip - M in in g A 4 : -5 s 17 A 2 : + 2 4 s Polyhedral Representation for Complex Sequences Program and transformations represented as matrices Each transformation amounts to minor coefficient changes w.r.t. the previous state No code duplication before the final code generation step P ip L ib P o ly L ib O R C P re O p t L N O W O P T R V I2 C G B in a r y W H IR L w 2 p C o n v e r s io n W R a P U R U K tra n s fo rm e d W R a P T r a n s fo r m a tio n s W H IR L C L o o G s o u rc e c o d e C o n v e r s io n W L o o G IBM Workshop on Compilation and Architecture P o ly h e d r a M a n a g e m e n t L ib r a r y 18 S o u rc e c o d e P a r a m e tr ic in te g e r lin e a r p r o g r a m m in g s o lv e r Polyhedral Representation for Complex Sequences Example: optimizing SWIM2000 for IA32 # Move and shift calc3 backwards shift(enclose(C3L1),{’1’,’0’,’0’}) (...) shift(C3L17,{’1’}) motion(enclose(C3L1),BLOOP) (...) motion(C3L17,BLOOP) # Peel and shift to enable fusion peel(enclose(C3L1,2),’3’) peel(enclose(C3L1_2,2),’N-3’) (...) shift(enclose(C1L1_2_1_2_1),{’0’,’1’,’1’}) shift(enclose(C2L1_2_1_2_1),{’0’,’2’,’2’}) # Double fusion of the three nests motion(enclose(C2L1_2_1_2_1),TARGET_2_1_2_1) motion(enclose(C1L1_2_1_2_1),C2L1_2_1_2_1) motion(enclose(C3L1_2_1_2_1),C1L1_2_1_2_1) # Register blocking and unrolling (factor 2) time_stripmine(enclose(C3L1_2_1_2_1,2),2,2) time_stripmine(enclose(C3L1_2_1_2_1,1),4,2) interchange(enclose(C3L1_2_1_2_1,2)) fullunroll(enclose(C3L1_2_1_2_1_2_1_2_1,2)) fullunroll(enclose(C3L1_2_1_2_1_2_1_2_1,1)) IBM Workshop on Compilation and Architecture polydeps_save(BLOOP) 19 polydeps_check(BLOOP) Managing Complex Transformation Sequences Example: optimizing SWIM 2000 38% speed-up on AMD Athlon 64 3400+ at 2.2GHz (Three main procedures inlined by Open64) 112 statements and 215 array references in the polyhedral representation Exact dependence analysis Nesting depth: 3 original, 5 after tiling 441 matrices for all pairs of references at all depths 12s analysis time Code generation 2267 lines of code (≈ 1500 after scalar optimization) URGenT (improvement on CLooG, regeneration of Open64’s WHIRL): 4s Back-end compilation: 22s 20 Automatic Transformation Transformation Automatic 421 lines of code in the source program IBM Workshop on Compilation and Architecture Main SCoP OVERCOMING THESE L IMITATIONS P ROGAM C ONSTRUCTION Program Construction Example: sorting network for even-odd merge sort of size 16 a_2 = t[0] < t[8] ? t[0] : t[8]; b_3 = t[0] < t[8] ? t[8] : t[0]; a_4 = t[4] < t[12] ? t[4] : t[12]; b_5 = t[4] < t[12] ? t[12] : t[4]; a_6 = a_2 < a_4 ? a_2 : a_4; b_7 = a_2 < a_4 ? a_4 : a_2; a_8 = b_3 < b_5 ? b_3 : b_5; b_9 = b_3 < b_5 ? b_5 : b_3; a_10 = b_7 < a_8 ? b_7 : a_8; b_11 = b_7 < a_8 ? a_8 : b_7; // ... t[11] t[12] t[13] t[14] = = = = b_95 b_95 b_87 b_87 < < < < a_112 a_112 b_113 b_113 ? ? ? ? b_95 : a_112; a_112 : b_95; b_87 : b_113; b_113 : b_87; Program Construction Define safe progam combinators/generators: MetaOCaml + offshoring to C Offers run-time program generation... at a high overhead cost let rec merge l cont = match l with [] -> cont [] | [x] -> cont [x] | [x; y] -> compare [x] [y] cont | _ -> merge (even l) (fun e -> merge (odd l) (fun o -> (compare (List.tl e) o (fun l -> cont ((List.hd e) :: l))))) let sort n = let rec sort_rec l cont = match l with [] -> cont [] | [x] -> cont [x] | _ -> sort_rec (even l) (fun e -> sort_rec (odd l) (fun o -> (merge (e @ o) cont))) in .< fun t -> .~(sort_rec (make_list .<t>. n) (make_array .<t>.)) >. 23 let compare a b cont = let rec compare_rec a b l cont = match (a, b) with ([], []) -> cont l | ([], t) | (t, []) -> cont (l @ t) | (x :: c, y :: d) -> .< let a = if (.~x < .~y) then .~x else .~y in let b = if (.~x < .~y) then .~y else .~x in .~(compare_rec c d (l @ [.<a>.; .<b>.]) cont) >. in compare_rec a b [] cont IBM Workshop on Compilation and Architecture But is far from natural for application experts Conclusion: Future Optimizing Compilers → Fully automatic framework for abstraction-penalty removal → Machine learning and rule-based system for architecture-aware optimizations → Let application experts tell what is important Tools for safe and efficient metaprogramming IBM Workshop on Compilation and Architecture Tightly coupled off-line and on-line optimization Machine learning compilers 24 Research Directions Directions Research Compilers must do tedious things in a predictable manner... ... but should not try to be smart → Aggressive off-line analysis and narrowing of the optimization search-space → Low-overhead just-in-time/run-time transformations and code generation Complement intermediate representations with program generators Progresses Progresses → Expose algebraic properties of the search space → Support global and complex transformation sequences SPIRAL T HANK YOU