Inthreads:
Code Generation and Microarchitecture

Alex Gontmakher
Gregory Shklover
Assaf Schuster
Avi Mendelson

1/2/2005
Outline

- Inthreads introduction
- Inthreads code generation
- Microarchitecture
Motivation

- Programs contain low-level parallelism
  - It would be nice to be able to express it
  - But: too complex for superscalar execution
  - But: too low-level to be done by threads
- Solution: an extremely lightweight threading mechanism
Inthreads architecture

- Processor resources are shared between threads - just like SMT
- Architectural state, including registers, is shared between threads - unlike SMT!
  - Lightweight thread context
  - OS-transparent
- Built-in synchronization
  - Integration with the processor hardware → more efficient communication & synchronization
Inthreads architecture: thread manipulation

- inth.start tid, addr - creates a new thread
  - Only a fixed number of threads is supported
  - No scheduling is necessary for threads
- inth.halt - terminates the current thread
- inth.kill tid - terminates the given thread
Inthreads architecture: synchronization

- Condition registers C1, C2, ..., Cn: binary semaphores
  - inth.set Ci - sets condition Ci; releases one waiting thread
  - inth.clr Ci - clears Ci
  - inth.wait Ci - waits for Ci to be set
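The binary-semaphore semantics above can be modeled as a small sequential sketch (the function names are hypothetical; the real mechanism is a hardware instruction-set extension, not a library):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sequential model of the condition registers C1..Cn.
 * Each Ci is a binary semaphore: inth.set releases one waiting
 * thread, so the model lets one waiter consume the condition. */
enum { NCOND = 8 };
static bool cond_reg[NCOND];

static void inth_set(int ci) { cond_reg[ci] = true; }   /* inth.set Ci */
static void inth_clr(int ci) { cond_reg[ci] = false; }  /* inth.clr Ci */

/* inth.wait Ci: returns true if a waiter would be released now
 * (consuming the condition); false means the thread would block. */
static bool inth_wait_try(int ci) {
    if (!cond_reg[ci]) return false;
    cond_reg[ci] = false;  /* only one waiter is released per set */
    return true;
}
```

The consuming-wait behavior is an assumption drawn from "releases one waiting thread"; a hardware implementation would block the waiting thread's pipeline instead of polling.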
Inthreads programming model

- Threads share the registers cooperatively
  - Thread-private variables use different registers in each thread
  - Shared variables must be allocated to the same register in all threads
- No race conditions allowed
  - Accesses must be protected by synchronization
  - Speculative races are still possible → compiler analysis?
- No function calls while threads are active
Inthreads Code Generation: compilation flow

- Inthreads-C: C with explicit parallelization

  [Diagram: a .c program - written by the programmer or produced by the parallelizer - is expressed in Inthreads-C, which the Inthreads-C compiler compiles to a .o object]
Inthreads-C Extension

    INTH_CLEAR(1);              /* initialize condition variable #1 */
    INTH_START(start2);         /* start the thread at label "start2" */
    #pragma inthread (1 of 2)
    {
        b1 = ...; c1 = ...;     /* do half of the job */
        a1 = min(b1, c1);
        INTH_WAIT(1);           /* wait till the second thread terminates */
        ret = min(a1, a2);      /* calculate the final value;
                                   min must be a macro */
    }
    return ret;

    start2:
    #pragma inthread (2 of 2)
    {
        b2 = ...; c2 = ...;     /* do half of the job */
        a2 = min(b2, c2);
        INTH_SET(1);            /* notify the first thread */
        INTH_HALT();            /* terminate this thread */
    }
Inthreads-C Extension: correctness requirements

    INTH_CLEAR(1);
    INTH_START(start2);
    #pragma reg_range (1 of 2)
    {
        b1 = ...; c1 = ...;
        a1 = min(b1, c1);
        INTH_WAIT(1);
        ret = min(a1, a2);
    }
    return ret;

    start2:
    #pragma reg_range (2 of 2)
    {
        b2 = ...; c2 = ...;
        a2 = min(b2, c2);
        INTH_SET(1);
        INTH_HALT();
    }

- b1, c1, b2, c2: private values - must be allocated to different registers
- a2: shared value - must be allocated to the same register in both threads
- Special semantics: the value of a2 must be transferred correctly
  - INTH_WAIT is an Acquire for a2
  - INTH_SET is a Release for a2
Optimization Correctness

- Problem: basic compiler assumptions are broken
  - Execution doesn't follow a single path at any given moment...

        Thread 1:              Thread 2:
        i1: A = 5              j1: A = 10
        i2: INTH_WAIT(1)       j2: INTH_SET(1)
        i3: A == ?

- → Classic optimizations may produce incorrect code:
  - Common Subexpression Elimination
  - Loop Invariant Motion
  - Dead Code Elimination
Internal Representation

- (Concurrent) Control Flow Graph
  - INTH_START is represented by a branch instruction with 100% probability for each edge
  - INTH_SET instructions are connected to the corresponding INTH_WAIT instructions by synchronization edges
  - INTH_HALT is represented by a "dead end" block

  [Diagram: INTH_START forks into two "do job" blocks; one ends in INTH_SET followed by INTH_HALT, the other in INTH_WAIT and continues]
Concurrent Data Flow

- Concurrent data flow is propagated through synchronization edges (which connect each inth_set to the corresponding inth_wait).

      Thread 1:                Thread 2:
      A = 5                    ...
      INTH_SET(2)              INTH_WAIT(2)
      ...                      A = A + 1
      INTH_WAIT(3)             INTH_SET(3)
      B = A + 2                ...

- At INTH_SET(2), A is marked as "shared through this INTH_SET()", meaning:
  - live
  - "unknown" value
- So the fact "A == 5" does not survive past INTH_WAIT(3)
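The marking rule above can be sketched as a constant-propagation lattice in which crossing a synchronization edge demotes a live value to "unknown" (types and names here are hypothetical illustrations, not the compiler's actual representation):

```c
#include <assert.h>

/* Sketch: a value reaching an INTH_WAIT through a synchronization
 * edge stays live but its value becomes unknown, since the other
 * thread may have changed it. */
typedef enum { UNDEF, CONST_VAL, UNKNOWN } dfkind;
typedef struct { dfkind kind; int k; } dfval;

/* Meet of two facts arriving at a join point. */
static dfval meet(dfval a, dfval b) {
    if (a.kind == UNDEF) return b;
    if (b.kind == UNDEF) return a;
    if (a.kind == CONST_VAL && b.kind == CONST_VAL && a.k == b.k) return a;
    return (dfval){ UNKNOWN, 0 };
}

/* Crossing a synchronization edge: the value is marked as "shared
 * through this INTH_SET()" - live, but with an unknown value. */
static dfval cross_sync_edge(dfval a) {
    if (a.kind == UNDEF) return a;   /* not live: nothing to share */
    return (dfval){ UNKNOWN, 0 };    /* live, value unknown */
}
```

In the slide's example, A = 5 followed by INTH_SET(2) means "A == 5" cannot be assumed after the later INTH_WAIT(3), exactly because cross_sync_edge discards the constant.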
Register Allocation

- The problem: make sure concurrently-live values are allocated to different registers at each execution point
- Method 1: manual register file partitioning
  - Allocate a private subset of registers to each thread
  - Allocate a subset of registers for shared variables
  - Use the allocation during register assignment
  - → Not user-friendly, not portable, sub-optimal
- Method 2: automatic register allocation
  - Uses a concurrent interference graph
  - Uniform approach guided by spill-cost heuristics
Register Allocation

- Graph coloring algorithm
  - Additional conflict edges are added to the interference graph to represent concurrently-live values; each node now has parallel and sequential conflicts
  - Conservative - the exact execution ordering is not calculated

  [Interference graph diagram over nodes A, B, C, D, E, F]
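Once parallel conflict edges are added alongside the sequential ones, any standard coloring pass keeps concurrently-live values in different registers. A minimal greedy sketch (not the compiler's actual Chaitin-style allocator; the data layout is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Interference graph: an edge means two values may not share a
 * register, whether the conflict is sequential (live ranges overlap
 * in one thread) or parallel (live at the same time in different
 * inthreads). */
enum { NVALS = 6 };   /* e.g. the slide's nodes A..F */
static bool conflict[NVALS][NVALS];

static void add_conflict(int a, int b) {  /* sequential or parallel edge */
    conflict[a][b] = conflict[b][a] = true;
}

/* Greedy coloring: reg[v] is the register assigned to value v,
 * or -1 if the value would have to be spilled. */
static void color(int nregs, int reg[NVALS]) {
    for (int v = 0; v < NVALS; v++) {
        reg[v] = -1;
        for (int r = 0; r < nregs && reg[v] < 0; r++) {
            bool ok = true;
            for (int u = 0; u < v; u++)
                if (conflict[v][u] && reg[u] == r) { ok = false; break; }
            if (ok) reg[v] = r;
        }
    }
}
```

The point of the uniform approach is visible here: the coloring loop never distinguishes parallel from sequential edges, so spill-cost heuristics can guide both kinds of conflict at once.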
A Running Example (from SPEC2K MCF)

    for( ; arc < stop_arcs; arc += nr_group ) {
        if( arc->ident > BASIC ) {
            red_cost = bea_compute_red_cost( arc );
            if( bea_is_dual_infeasible( arc, red_cost ) ) {
                basket_size++;
                perm[basket_size]->a = arc;
                perm[basket_size]->cost = red_cost;
                perm[basket_size]->abs_cost = ABS(red_cost);
            }
        }
    }
Example continued: parallelization

[Slide shows the loop split across inthreads: arc1 advances by nr_group*3, results res1..res3 and destinations dest1..dest3 are exchanged with the thread that accumulates basket_size, synchronized through conditions 1-6; the original code columns are interleaved in this extraction:]

    for (; ; ) {
        for( ; arc1 < stop_arcs; arc1 += nr_group*3 ) {
            INTH_WAIT(1);
            if( arc1->ident > BASIC ) {
                red_cost1 = bea_compute_red_cost( arc1 );
                basket_size += res1;
                res1 = bea_is_dual_infeasible( arc1, red_cost1 );
                dest1 = basket_size;
                INTH_SET(1);
                INTH_SET(2);
                /* receive destination from the main thread */
                INTH_WAIT(2);
                if( res1 ) {
                    INTH_WAIT(3);
                    perm[dest1]->a = arc1;
                    basket_size += res2;
                    perm[dest1]->cost = red_cost1;
                    dest2 = basket_size;
                    perm[dest1]->abs_cost = ABS(red_cost1);
                    INTH_SET(4);
                }
            }
            else {
                INTH_WAIT(5);
                res1 = 0;
                basket_size += res3;
                INTH_SET(1);
                dest3 = basket_size;
                INTH_WAIT(2);
                INTH_SET(6);
            }
        }
    }
Inthreads Implementation

- Basic processor structure
- Instruction execution ordering
- Handling speculation
Basic Processor Structure

- SMT + extensions:
  - SAME: concurrent fetch
  - SAME: multiple ROBs
  - DIFF: shared architectural registers
  - DIFF: shared renaming hardware

  [Pipeline diagram: Fetch → Decode → OOO Execution Core → Commit; the Thread Control Unit exchanges speculation info (branches and waits), waiting instructions, condition info, and thread start/kill signals with the pipeline]
Instruction Execution Order

- Transfer of values between registers

      MOVI R1, 10    -?->    MOVI R2, R1

  - Race condition! (forbidden)

      MOVI R1, 10
      COND.SET 5
      WAIT 5
      MOVI R2, R1

- In-order renaming ensures the correct execution order
Instruction Execution Order

- Transfer of values between memory instructions

      STORE [100]
      COND.SET 5
      WAIT 5
      LOAD [100]

- Instructions are inserted into the MOB in order
- The MOB will not reorder the load and the store if their addresses conflict
- ⇒ In the absence of speculation, OOO execution runs register-shared threads correctly!
Speculation - the Problem

  No Inthreads:

      BEQ R1,R0,...
      MOVI R4,10
      ...
      MOVE R3,R4

  The MOVE is placed after the BEQ in the ROB → it will not be committed before the BEQ, and will be squashed by the BEQ.

  With Inthreads, synchronized:

      BEQ R1,R0,...
      SET 5
                     WAIT 5
                     MOVE R3,R4

  The MOVE is in a different ROB than the BEQ → it can potentially be committed before the BEQ. A special mechanism is needed to know when the BEQ is speculative and when the BEQ is squashed.

  With Inthreads, unsynchronized:

      BEQ R1,R0,...
                     MOVE R3,R4

  This is a race for access to R4 - forbidden in correct programs, but it can still happen due to speculation.
Handling Speculation: Synchronization

- Execute condition sets only when non-speculative

  [Diagram: per-thread (T0..TN) lists of unresolved branches with timestamps assigned during decode - e.g. [100] beq R1,R2; [117] bne R5,R0; [105] j 405888 - tracking executed and mispredicted branches and the first unresolved branch of each thread; thread-control instructions such as "T0: [095] SET 4" and "T0: [108] SET 5" are checked against the first unresolved branch of their thread]
Handling Speculation: Synchronization, cont'd

- Synchronization is a barrier for speculation
- Performance hit:
  - 0% speedup on the running example
  - Can be worse than non-threaded execution
- → Need to issue cond.sets speculatively
- Extended speculation definition:
  - A WAIT is speculative if the SET which provided its condition is speculative
  - Any instruction is speculative if a preceding WAIT in its ROB is speculative or a preceding branch is yet unresolved
Speculative Synchronization

  [Diagram: branch chains BEQ → BNE → SET 5 and BEQ → SET 6, with synchronization edges from SET 5 to WAIT 5 and from SET 6 to WAIT 6 in the other thread]
Handling Speculation: Synchronization, cont'd

- For each SET/WAIT instruction, for each inthread: keep the timestamp of the latest branch that affects this instruction
- Computing the timestamps affecting a SET:
  - MAX of: the latest branch timestamps of all the active WAITs of the same inthread, and the latest timestamp of all the unresolved branches in the same inthread
- Computing the timestamps affecting a WAIT:
  - MAX of: the same as for a SET, and the timestamp of the SET providing the condition
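The MAX rule above can be sketched directly (the data layout is a hypothetical simplification - the hardware keeps one timestamp per inthread, and larger timestamps mean later branches):

```c
#include <assert.h>

static int imax(int a, int b) { return a > b ? a : b; }

/* Timestamp affecting a SET: max over the latest branch timestamps of
 * the inthread's active WAITs and of its own unresolved branches. */
static int set_ts(const int active_wait_ts[], int nwaits,
                  int latest_branch_ts) {
    int t = latest_branch_ts;
    for (int i = 0; i < nwaits; i++)
        t = imax(t, active_wait_ts[i]);
    return t;
}

/* Timestamp affecting a WAIT: same as for a SET, plus the timestamp
 * of the SET that provided the condition. */
static int wait_ts(const int active_wait_ts[], int nwaits,
                   int latest_branch_ts, int providing_set_ts) {
    return imax(set_ts(active_wait_ts, nwaits, latest_branch_ts),
                providing_set_ts);
}
```

The WAIT case is where speculation propagates across threads: a WAIT inherits the speculation of the SET that released it, exactly as the extended definition on the previous slide requires.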
Handling Speculation: Synchronization, cont'd

  [Example table: thread-control instructions T0: SET 2 (timestamps 103/---/---/---), T3: WAIT 2 (103/---/---/098), T3: SET 3 (103/---/---/098), T1: WAIT 0 (---/---/---/---), T2: WAIT 3 (103/---/114/098), shown against the mispredicted and unresolved branches, the first and last unresolved branch per thread, committed conditions 1 1 0 0, computed conditions 0 1 0 0, and squashed WAITs 0 1 0 1]
Handling Speculation, cont'd

- Assuming no unsynchronized races:
  - Register values are transferred speculatively between inthreads
  - 63% speedup on the running example
- Assuming unsynchronized races:
  - Transferring registers non-speculatively → performance hit: only 10% speedup
  - Recovery mechanism as in the Thread Control Unit → expensive
  - Ensuring no unsynchronized races → needs compiler support
Conclusion

- The basic Inthreads microarchitecture is similar to SMT
- Speculative execution needs special recovery mechanisms for inter-thread dependencies
- Inthreads compilation:
  - Register allocation problems unique to Inthreads
  - Accurate code analysis could allow for a simpler implementation
Questions?
BACKUP foils
Optimizations Correctness

- Some optimizations are not affected
  - Cross jumping, block reordering, ...
- Some optimizations just need the corrected (concurrent) data flow
  - Dead code elimination, constant propagation, loop invariant motion, ...
- Some optimizations require algorithmic fixes
  - Global common subexpression elimination, register allocation, ...
- A formal approach - Delay Set Analysis; still under investigation
Conservative Concurrent Conflict Analysis

      Thread 1:            Thread 2:
      INTH_START
      A = ...              INTH_WAIT(3)
      INTH_SET(2)          C = ...
      B = ...              INTH_WAIT(3)
      INTH_SET(3)          D = ...

  Conservative: [graph with conflict edges among all of A, B, C, D]
  Precise analysis: [graph with only the conflicts that the synchronization permits among A, B, C, D]
Register Allocation

- Possible register allocation solutions:
  - GCC's old separate global + local allocation
    - Static register file partitioning
  - GCC's new (Chaitin-style) algorithm
    - Automatic partitioning (global, uniform approach)
  - Optimal allocator (based on an integer linear solver)
    - Still under investigation
Future work

- Our compiler sees the code of all the threads. Can we do better?
- Automatic parallelization.