Inthreads: Code Generation and Microarchitecture
Alex Gontmakher, Gregory Shklover, Assaf Schuster, Avi Mendelson
1/2/2005

Outline
- Inthreads introduction
- Inthreads code generation
- Microarchitecture

Motivation
- Programs contain low-level parallelism
  - It would be nice to be able to express it
  - But: it is too complex for superscalar execution
  - But: it is too low-level to be handled by conventional threads
- Solution: an extremely lightweight threading mechanism

Inthreads Architecture
- Processor resources are shared between threads – just like SMT
- Architectural state, including registers, is shared between threads – unlike SMT!
  - Lightweight thread context
  - OS-transparent
- Built-in synchronization
- Integration with the processor hardware → more efficient communication & synchronization

Inthreads Architecture: Thread Manipulation
- inth.start tid, addr – create a new thread
  - Only a fixed number of threads is supported
  - No scheduling is necessary for threads
- inth.halt – terminate the current thread
- inth.kill tid – terminate the given thread

Inthreads Architecture: Synchronization
- Condition registers C1, C2, …, Cn: binary semaphores
  - inth.set Ci – sets condition Ci; releases one waiting thread
  - inth.clr Ci – clears Ci
  - inth.wait Ci – waits for Ci to be set

Inthreads Programming Model
- Threads share the registers cooperatively
  - Thread-private variables use different registers in each thread
  - Shared variables must be allocated to the same register in all threads
- No race conditions allowed
  - Accesses must be protected by synchronization
  - Speculative races are still possible → compiler analysis?
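A minimal sketch of the inth.set / inth.clr / inth.wait semantics above, simulated with ordinary Python threads. This is one plausible reading of the slides (set makes the condition true and wakes one waiter; the condition stays set until an explicit clr); the class and method names are mine, not part of the architecture.

```python
import threading

class ConditionRegister:
    """Software stand-in for one Inthreads condition register Ci
    (a sketch; real condition registers live in hardware)."""

    def __init__(self):
        self._cv = threading.Condition()
        self._flag = False

    def set(self):             # inth.set Ci
        with self._cv:
            self._flag = True
            self._cv.notify()  # release one waiting thread

    def clr(self):             # inth.clr Ci
        with self._cv:
            self._flag = False

    def wait(self):            # inth.wait Ci
        with self._cv:
            while not self._flag:
                self._cv.wait()
```

With this reading, a wait issued after a set returns immediately, so the set/wait pair behaves like a one-shot handoff until some thread issues clr.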
- No function calls while threads are active

Inthreads Code Generation: Compilation Flow
- Inthreads-C: C with explicit parallelization
- [Figure: the programmer (or an automatic parallelizer) writes Inthreads-C sources (.c); the Inthreads-C compiler translates them into object files (.o).]

Inthreads-C Extension

    INTH_CLEAR(1);           /* initialize condition variable #1 */
    INTH_START(start2);      /* start the thread at label "start2" */
    #pragma inthread (1 of 2)
    {
        b1 = …; c1 = …;
        a1 = min(b1, c1);    /* do half of the job */
        INTH_WAIT(1);        /* wait till the second thread terminates */
        ret = min(a1, a2);   /* calculate the final value */
    }
    return ret;

    start2:
    #pragma inthread (2 of 2)
    {
        b2 = …; c2 = …;
        a2 = min(b2, c2);    /* do half of the job */
        INTH_SET(1);         /* notify the first thread */
        INTH_HALT();         /* terminate this thread */
    }

- min() must be a macro (function calls are not allowed while threads are active)

Inthreads-C Extension: Correctness Requirements
For the example above:
- b1, c1, b2, c2: private values – must be allocated to different registers
- a2: shared value – must be allocated to the same register in both threads
- Special semantics: the value of a2 must be transferred correctly
  - INTH_WAIT is an Acquire for a2
  - INTH_SET is a Release for a2

Optimization Correctness
- Problem: basic compiler assumptions are broken – execution doesn't follow a single path at any given moment:

    Thread 1:             Thread 2:
    i1: A = 5             j1: A = 10
    i2: INTH_WAIT(1)      j2: INTH_SET(1)
    i3: A == ?

- → Classic optimizations may produce incorrect code:
  - Common subexpression elimination
  - Loop-invariant motion
  - Dead code elimination

Internal Representation: Concurrent Control Flow Graph
- [Figure: a CFG in which INTH_START forks into two "do job" paths; one path executes INTH_SET and ends in INTH_HALT, the other executes INTH_WAIT and continues.]
- INTH_START is represented by a branch instruction with a 100% probability on each edge
- INTH_SET instructions are connected to the corresponding INTH_WAIT instructions by synchronization edges
- INTH_HALT is represented by a "dead end" block

Concurrent Data Flow
- Concurrent data flow is propagated through the synchronization edges (which connect each INTH_SET to the corresponding INTH_WAIT).
- [Figure:

    Thread 1:            Thread 2:
    A = 5                …
    INTH_SET(2)          INTH_WAIT(2)
    …                    A = A + 1
    INTH_WAIT(3)         INTH_SET(3)
    B = A + 2            …

  At INTH_SET(2), A (== 5 at that point) is marked as "shared through this INTH_SET()", meaning it is live but has an "unknown" value afterwards.]

Register Allocation
- The problem: make sure that concurrently-live values are allocated to different registers at each execution point.
- Method 1: manual register file partitioning
  - Allocate a private subset of registers to each thread
  - Allocate a subset of registers for shared variables
  - Use the partitioning during register assignment
  - → Not user-friendly, not portable, sub-optimal
- Method 2: automatic register allocation
  - Uses a concurrent interference graph
  - Uniform approach guided by spill-cost heuristics

Register Allocation: Graph Coloring Algorithm
- Additional conflict edges are added to the interference graph to represent concurrently-live values.
- Each node now has parallel and sequential conflicts.
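The concurrent interference graph can be sketched as follows. The variable names and conflict edges are hypothetical, and a simple greedy coloring stands in for the compiler's real (Chaitin-style) allocator; the point is only that parallel (cross-thread) edges are merged with sequential ones before coloring.

```python
# Sketch: register allocation over a concurrent interference graph.

def build_interference(seq_conflicts, par_conflicts):
    """Merge sequential (within-thread) and parallel (cross-thread,
    concurrently-live) conflict edges into one undirected graph."""
    graph = {}
    for edges in (seq_conflicts, par_conflicts):
        for a, b in edges:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

def greedy_color(graph):
    """Assign each node the lowest register number unused by its
    already-colored neighbors (highest-degree nodes first)."""
    regs = {}
    for node in sorted(graph, key=lambda n: -len(graph[n])):
        taken = {regs[n] for n in graph[node] if n in regs}
        r = 0
        while r in taken:
            r += 1
        regs[node] = r
    return regs

# Hypothetical example: private values of two threads conflict only
# through parallel edges; the shared value "a2" conflicts in both threads.
seq = [("b1", "c1"), ("b2", "c2"), ("a2", "b1")]
par = [("b1", "b2"), ("b1", "c2"), ("c1", "b2"), ("c1", "c2"), ("a2", "c1")]
allocation = greedy_color(build_interference(seq, par))
```

Any proper coloring of the merged graph gives an assignment in which no two concurrently-live values share a register.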
- Conservative: the exact execution ordering is not calculated.
- [Figure: example interference graph over nodes A, B, C, D, E, F.]

A Running Example (from SPEC2K MCF)

    for( ; arc < stop_arcs; arc += nr_group ) {
        if( arc->ident > BASIC ) {
            red_cost = bea_compute_red_cost( arc );
            if( bea_is_dual_infeasible( arc, red_cost ) ) {
                basket_size++;
                perm[basket_size]->a = arc;
                perm[basket_size]->cost = red_cost;
                perm[basket_size]->abs_cost = ABS(red_cost);
            }
        }
    }

Example Continued: Parallelization
The loop is split across worker threads, while a main thread serializes the basket_size updates:

    /* Worker thread 1 of 3 (workers 2 and 3 are analogous,
       using conditions 3/4 and 5/6 respectively) */
    for( ; arc1 < stop_arcs; arc1 += nr_group*3 ) {
        if( arc1->ident > BASIC ) {
            red_cost1 = bea_compute_red_cost( arc1 );
            res1 = bea_is_dual_infeasible( arc1, red_cost1 );
            INTH_SET(1);
            INTH_WAIT(2);    /* receive destination from the main thread */
            if( res1 ) {
                perm[dest1]->a = arc1;
                perm[dest1]->cost = red_cost1;
                perm[dest1]->abs_cost = ABS(red_cost1);
            }
        } else {
            res1 = 0;
            INTH_SET(1);
            INTH_WAIT(2);
        }
    }

    /* Main thread */
    for (; ; ) {
        INTH_WAIT(1); basket_size += res1; dest1 = basket_size; INTH_SET(2);
        INTH_WAIT(3); basket_size += res2; dest2 = basket_size; INTH_SET(4);
        INTH_WAIT(5); basket_size += res3; dest3 = basket_size; INTH_SET(6);
    }

Inthreads Implementation
- Basic processor structure
- Instruction execution ordering
- Handling speculation

Basic Processor Structure
- SMT + extensions:
  - SAME: concurrent fetch
  - SAME: multiple ROBs
  - DIFF: shared architectural registers
  - DIFF: shared renaming hardware
- [Figure: a Fetch → Decode → OOO Execution Core → Commit pipeline connected to a Thread Control Unit. The pipeline passes the unit speculation info (branches and waits), waiting instructions, and condition info; the unit sends back thread start/kill signals.]

Instruction Execution Order
- Transfer of values between registers:

    Thread 1:         Thread 2:
    MOVI R1, 10       MOVI R2, R1      ← race condition!
  (forbidden)

- The synchronized version:

    Thread 1:         Thread 2:
    MOVI R1, 10
    COND.SET 5        WAIT 5
                      MOVI R2, R1

- In-order renaming ensures the correct execution order.

Instruction Execution Order
- Transfer of values between memory instructions:

    Thread 1:         Thread 2:
    STORE [100]
    COND.SET 5        WAIT 5
                      LOAD [100]

- Instructions are inserted into the MOB in order
- The MOB will not reorder a load and a store whose addresses conflict
- ⇒ In the absence of speculation, the OOO core executes register-shared threads correctly!

Speculation – the Problem with Inthreads
- No Inthreads:

    BEQ R1,R0,…
    MOVI R4,10
    …
    MOVE R3,R4

  The MOVE is placed after the BEQ in the ROB → it will not be committed before the BEQ, and it will be squashed by the BEQ on a misprediction.

- Synchronized:

    Thread 1:         Thread 2:
    BEQ R1,R0,…
    SET 5             WAIT 5
                      MOVE R3,R4

  The MOVE is in a different ROB than the BEQ → it can potentially be committed before the BEQ. A special mechanism is needed to know when the BEQ is speculative and when the BEQ is squashed.

- Unsynchronized:

    MOVE R3,R4

  This is a race for access to R4 – forbidden in correct programs, but it can still happen due to speculation.

Handling Speculation: Synch
- Execute condition sets only when they are non-speculative.
- [Figure: a per-thread table of unresolved branches, e.g. [100] beq R1,R2; [105] j 405888; [117] bne R5,R0, with executed and mispredicted branches marked and a First Unresolved Branch pointer for each thread T0 … TN. Thread-control instructions (e.g. T0: [095] SET 4, T0: [108] SET 5) carry timestamps assigned during decode.]

Handling Speculation: Synch cont'd
- Synchronization is a barrier for speculation
  - Performance hit: 0% speedup on the running example
  - Can be worse than non-threaded execution
  - Need to issue condition sets speculatively
- Extended definition of speculation:
  - A WAIT is speculative if the SET which provided its condition is speculative
  - Any instruction is speculative if a preceding WAIT in its ROB is speculative or a preceding branch is as yet unresolved

Speculative Synchronization
- [Figure: speculation propagating along a synchronization chain – BEQ and BNE precede SET 5; WAIT 5 and another BEQ precede SET 6 in a second thread; WAIT 6 follows in a third.]

Handling Speculation: Synch cont'd
- For each SET/WAIT instruction, for each inthread, keep the timestamp of the latest branch that affects this instruction.
- Computing the timestamps affecting a SET – the MAX of:
  - the timestamps of all the active WAITs of the same inthread
  - the latest timestamp of all the unresolved branches in the same inthread
- Computing the timestamps affecting a WAIT – the MAX of:
  - the same as for a SET
  - the timestamps of the SET providing the condition

Handling Speculation: Synch cont'd
- [Figure: a worked example. Each thread-control instruction carries the per-thread timestamps of the branches affecting it:

    T0: SET 2     103  ---  ---  ---
    T3: WAIT 2    103  ---  ---  098
    T3: SET 3     103  ---  ---  098
    T1: WAIT 0    ---  ---  ---  ---
    T2: WAIT 3    103  ---  114  098

  Bit vectors track the committed conditions, the computed conditions, and the squashed WAITs; mispredicted branches and the first/last unresolved branch of each thread are tracked as well.]

Handling Speculation: cont'd
- Assuming no unsynchronized races:
  - Register values are transferred speculatively between inthreads
  - 63% speedup on the running example
  - Ensuring that there are no unsynchronized races → needs compiler support
- Assuming unsynchronized races:
  - Transferring registers non-speculatively → performance hit: only 10% speedup
  - A recovery mechanism in the Thread Control Unit → expensive

Conclusion
- The basic Inthreads microarchitecture is similar to SMT
- Speculative execution needs special recovery mechanisms for inter-thread dependencies
- Inthreads compilation:
  - Register allocation problems unique to Inthreads
  - Accurate code analysis could allow for a simpler implementation

Questions?

BACKUP foils

Optimizations Correctness
- Some optimizations are not affected:
  - Cross jumping, block reordering, …
- Some optimizations just need the corrected (concurrent) data flow:
  - Dead code elimination, constant propagation, loop-invariant motion, …
- Some optimizations require algorithmic fixes:
  - Global common subexpression elimination, register allocation, …
- A formal approach – Delay Set Analysis.
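The speculation-timestamp rules from the "Handling Speculation: Synch" slides can be sketched as a small computation. The data shapes are my own assumption: each WAIT, branch set, or SET record is a dict mapping an inthread id to the latest affecting branch timestamp.

```python
# Sketch of the per-inthread timestamp bookkeeping for SET/WAIT
# instructions (data representation is hypothetical).

def merge(a, b):
    """Per-inthread MAX of two timestamp vectors (dicts: thread -> ts)."""
    return {t: max(a.get(t, 0), b.get(t, 0)) for t in set(a) | set(b)}

def set_timestamps(active_waits, unresolved_branches):
    """Timestamps affecting a SET: the MAX over the timestamps of all
    active WAITs of the inthread and its latest unresolved branches."""
    ts = dict(unresolved_branches)
    for w in active_waits:
        ts = merge(ts, w)
    return ts

def wait_timestamps(active_waits, unresolved_branches, providing_set):
    """Timestamps affecting a WAIT: the same as for a SET, plus the
    timestamps of the SET that provided the condition."""
    return merge(set_timestamps(active_waits, unresolved_branches),
                 providing_set)
```

A WAIT thus inherits speculation both from branches in its own thread and, transitively, from the thread that executed the providing SET.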
  (Delay Set Analysis is still under investigation…)

Conservative Concurrent Conflict Analysis
- [Figure:

    Thread 1:         Thread 2:
    INTH_START
    A = …             INTH_WAIT(3)
    INTH_SET(2)       C = …
    B = …             INTH_WAIT(3)
    INTH_SET(3)       D = …

  The conservative interference graph connects A, B, C and D pairwise; a precise analysis of the synchronization ordering leaves fewer conflict edges.]

Register Allocation
- Possible register allocation solutions:
  - GCC's old separate global + local allocation
    - Static register file partitioning
  - GCC's new (Chaitin-style) algorithm
    - Automatic partitioning (global, uniform approach)
  - Optimal allocator (based on an integer linear solver)
    - Still under investigation

Future Work
- Our compiler sees the code of all the threads. Can we do better?
- Automatic parallelization.
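As a closing illustration, the two-thread min() example from the "Inthreads-C Extension" slide can be mimicked with an ordinary Python thread. This is only a behavioral sketch: the shared register a2 and condition #1 are emulated with a dict entry and a threading.Event, and the function name is mine.

```python
import threading

def parallel_min4(b1, c1, b2, c2):
    """Compute min of four values with the work split between two
    threads, mirroring the Inthreads-C example (a sketch; on real
    Inthreads hardware both threads would share one register file)."""
    done = threading.Event()          # plays the role of condition #1
    shared = {}                       # a2 lives here (shared "register")

    def thread2():                    # the code started at label start2
        shared["a2"] = min(b2, c2)    # do half of the job
        done.set()                    # INTH_SET(1): notify thread 1
                                      # (returning = INTH_HALT)

    t = threading.Thread(target=thread2)
    t.start()                         # INTH_START(start2)
    a1 = min(b1, c1)                  # do the other half of the job
    done.wait()                       # INTH_WAIT(1)
    t.join()
    return min(a1, shared["a2"])      # calculate the final value
```

The Event both wakes the first thread and orders the write to shared["a2"] before the final read, which is exactly the Acquire/Release role the slides assign to INTH_WAIT and INTH_SET.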