IBM HRL Data Layout Optimizations in GCC Olga Golovanevsky Razya Ladelsky IBM Haifa Research Lab, Israel 4/23/2007 © 2007 IBM Corporation IBM HRL Agenda General idea In scope of GCC Structure reorganization optimizations – Transformations – Decision making mechanism – Type–escape analysis Matrix reorganization optimizations – Matrix flattening – Matrix transposing – Combining: flattening + transposing Current Status Experimental results 2 4/23/2007 © 2007 IBM Corporation IBM HRL General Idea General idea: to adapt the layout of a data structure to its access patterns Principle: if data elements are accessed close in time, they should be allocated close in memory to gain better cache utilization Optimizations - profile-based/static: – matrix flattening, matrix transposing – structure peeling, splitting and reordering Flow: – analyze data accesses – change allocations, deallocations and access sites For structure: – fields are manipulated For matrix: – dimensions are manipulated 3 4/23/2007 © 2007 IBM Corporation IBM HRL In scope of GCC Presented: – matrix flattening/transposing: • gcc summit 2006 – struct-reorg: • • gcc summit 2005 by M.Hagog and C.Tice gcc summit 2007 Based on tree-ssa IR Use ipa infrastructure Jan Hubicka/Suse-Novell Require whole program – -fwhole-program, -combine Potential for LTO Developed on: – matrix optimizations – ipa-branch – struct-reorg – struct-reorg-branch Scheduled for gcc4.3 mainline 4 4/23/2007 © 2007 IBM Corporation IBM HRL Structure peeling typedef struct {int a; float b;} str; str *arr = (str *) malloc(N*sizeof(str)); foo1 () foo2 () { { for (i=0; i<N; i++) for (i=0; i<N; i++) arr[i].a = …; } … = arr[i].b; } a b a b a b a b a b a b a b a b 5 4/23/2007 © 2007 IBM Corporation IBM HRL Structure peeling typedef {int{int a;} str1; typedefstruct struct a; float b;} str; typedef struct {float b;} str2; str1 *arr1== (str (str1*)*) malloc(N*sizeof(str)); malloc(N*sizeof(str1)); str *arr str2 *arr2 = (str2 *) malloc(N*sizeof(str2)); foo1 () foo2 () { { for (i=0; i<N; i++) for (i=0; i<N; i++) arr[i].a ==…; arr1[i].a …; } arr[i].b; … = arr2[i].b; } a a b a a b a a b a a b a bb ba bb ba bb ba bb b 6 4/23/2007 © 2007 IBM Corporation IBM HRL Structure Splitting typeset struct node_struct { typedef struct node_struct { struct node_struct *parent; Node_0 * Node_0_ptr; Expression left; Node_1 * Node_1_ptr; Operator * bin_op; Node_2 * Node_2_ptr; Expression right; } Node; } Node; Node typedef struct { Node Node Node Node Expression Node_0 Node_0 Node_0 Node_0 Node_0 Node_1 Node_1 Node_1 Node_1 Node_1 Node_2 Node_2 Node_2 Node_2 Node_2 right; } Node_0; typedef struct { Expression Operator * left; bin_op; } Node_1; typedef struct { struct node_struct *parent; } Node_2; 7 4/23/2007 © 2007 IBM Corporation IBM HRL Structure Reordering typedef struct Operator_struct { typedef struct Operator_struct { n_int precedence; OpType type; const n_char *name; OpMode mode; OpMode n_char symbol; OpType type; n_char symbol; n_int mode; precedence; struct Operator_struct *lower; struct Operator_struct * higher; const char* struct Operator_struct * lower; struct Operator_struct *higher; } Operator; name; } Operator; Less aggressive optimization Does not change allocation/access sites, only type definition 8 4/23/2007 © 2007 IBM Corporation IBM HRL Decision Making Mechanism profile-based; make use of CFG Fields Reference Graph: – vertices – field accesses – edges • • follow CFG weighted with amount of data accessed between two field accesses Rebuild Fields Reference Graph into Close Proximity Graph – vertices – fields – Edges • • 9 potentially all vertices are interconnected average amount of data accessed between two fields 4/23/2007 © 2007 IBM Corporation IBM HRL Decision Making Mechanism struct str {int f1; int f2; int f3}; foo () bb1 } 10 (100#,10) 100 (10#,25) acc(str->f1) { … acc(str->f1); mem(10); for (i=0; i<N; i++) { mem(5); acc(str->f2); mem(10); acc(str->f1) mem(7); } mem(8); acc (str->f3); … (99#,12) (209#,11) (10.15) f1 f2 mem(10) (10#,15) (90#,18) (90#,18) 10 mem(5) acc(str->f2) acc(str->f2) 90(100#,10) mem(10) (99#,12) acc(str->f1) mem(7) bb3 bb2 10 (10#,15) mem(8) 100 (10#,15) f3 CPG str1:{f1,f2} srt2:{f3} 100 acc(str->f3) acc(str->f3) cfg FRG 4/23/2007 © 2007 IBM Corporation IBM HRL Type escape analysis structure escapes when: – – – – – pointer to structure is parameter/return type of global function structure is global variable one of the fields is escaped cast to pointer to structure operations other than plus, minus or times are applied to pointer to structure ipa-type-escape – – – – 11 ipa pass, by Kenneth Zadeck initially written on gimple recently updated to tree-ssa by default with –O2 4/23/2007 © 2007 IBM Corporation IBM HRL Future plans Support transformation for substructure Analyze/reorg more that one structure at a time: – consider the hierarchy of structures as a whole unite – mixed data structures: arrays within structures Spec2000/2006/mcf benchmark • new type of transformation Analyze more data, tune decision make algorithm Type escape analysis • • 12 more extensions with tree-ssa check real world applications with big portion of the code is in libraries 4/23/2007 © 2007 IBM Corporation IBM HRL Agenda General idea In scope of GCC Structure reorganization optimizations – Transformations – Decision making mechanism – Type–escape analysis Matrix reorganization optimizations – Matrix flattening – Matrix transposing – Combining: flattening + transposing Current status Experimental results 13 4/23/2007 © 2007 IBM Corporation IBM HRL Matrix Flattening Before flattening: A A[0] A[0][0] A[0][1] A[0][2] A[0][3] Access A[ i ][ j ] : A[1] A[1][0] A[1][1] A[1][2] A[1][3] A[2] A[2][0] A[2][1] A[2][2] A[2][3] 1. base = A + i * size(ptr) 2. A [ i ] [ j ] = *(*base + j * size(el)) After flattening: A’ A[0][0] A[0][1] A[0][2] A[0][3] A[1][0] A[1][1] A[1][2] A[1][3] A[2][0] A[2][1] A[2][2] A[2][3] One memory allocation Access A[ i ][ j ] : 14 A [ i ] [ j ] = *(A’ + (i * 4 + j) * size(el)) 4/23/2007 © 2007 IBM Corporation IBM HRL Flattening - the benefit What is the benefit? – Each matrix access involves less indirections -> less memory references – The rows are laid one after the other -> contiguous space allocation -> increase cache locality 15 4/23/2007 © 2007 IBM Corporation IBM HRL 2.1 Matrix flattening: partial flattening Flattening dim0 and dim1 flattened dim2 dim0 i 16 foo (a[i][j]) dim1 j 4/23/2007 © 2007 IBM Corporation IBM HRL Matrix Transposing for (i=0; i<N; i++) for (j=0; j<M; j++) access to A[ ji ][ ij ] A[0] A[0][0] A[0][0] A[0][1] A[0][1] A[0][2] A[0][2] A[0][3] A[0][3] A[1] A[1][0] A[1][0] A[1][1] A[1][1] A[1][2] A[1][2] A[1][3] A[1][3] A[2] A[2][0] A[2][0] A[2][1] A[2][1] A[2][2] A[2][2] A[2][3] A[2][3] The elements of the columns rows are accessed sequentially A[0][1] A[1][1] A[2][1] Iterated according to the columns rows A[0][2] A[1][2] A[2][2] Matrix transposing: Interchanging the dimensions 17 A[0][0] A[1][0] A[2][0] 4/23/2007 A[0][3] A[1][3] A[2][3] © 2007 IBM Corporation IBM HRL Matrix Transposing: decision making - example for (i=0; i<N; i++) -> row major matrix for (j=0; j<M; j++) acc1: access to A[ i ][ j ] for (i=0; i<N; i++) -> column major matrix for (j=0; j<M; j++) acc2: 18 access to A[ j ][ i ] 4/23/2007 © 2007 IBM Corporation IBM HRL Matrix Transposing: decision making Multi dimensional -> all possible permutations for allocation – How to decide the best permutation? – We consider: • Which dimensions are iterated in the innermost loops • Profiling information: how often the accesses were executed – The hotness of dimension determines how close in time were accesses to the elements of this dimension, normalized by the hotness of access execution. 19 4/23/2007 © 2007 IBM Corporation IBM HRL Combining Flattening and Transposing Original matrix : dim0 dim1 Flattened matrix Flattening mechanism Transposed matrix : Virtual mapping (Choose permutation) 20 dim0 -> dim1 dim1 -> dim0 Flattened transposed matrix Flattening mechanism 4/23/2007 © 2007 IBM Corporation IBM HRL Combining Flattening and Transposing transposing + flattening -> flatten the transposed matrix A[3,4] => A[4,3] A[0][0] A[0][1] A[0][2] A[0][3] A[1][0] A[1][1] A[1][2] A[1][3] A[2][0] A[2][1] A[2][2] A[2][3] 21 A[0][0] A[1][0] A[2][0] => A[0][1] A[1][1] A[2][1] A[0][2] A[1][2] A[2][2] A[0][3] A[1][3] A[2][3] 4/23/2007 © 2007 IBM Corporation IBM HRL status Matrix flattening and transposing: – Initially developed on the ipa branch. – To be included in GCC4.3; patch submitted recently – Enabled by -fipa-mreorg, -fdump-ipa-mreorg for dump – Transposing enabled when profiling information is available Struct reorg: – Initially developed on struct reorg branch – To be included in GCC4.3 – Enabled by -fipa-reorg-structs Thanks! Mustafa Hagog, Revital Eres, Daniel Berlin, Kenneth Zadeck, Zdenek Dvorak, Jan Hubicka, Sebastian Pop 22 4/23/2007 © 2007 IBM Corporation IBM HRL power5/linux Performance Results 180% Original 160% Matrix Flattening 140% Flattening+Transposing 120% Struct Reorg 100% 80% 60% 40% 20% 0% power970/linux art equake Original 600% Matix Flattening Flattening+Transposing 500% Struct Reorg Matrix+Struct Opt 400% 300% 200% 100% 0% art 23 equake 4/23/2007 © 2007 IBM Corporation IBM HRL Questions? 24 4/23/2007 © 2007 IBM Corporation