Exploiting Parallelism with Dependence-Aware Scheduling

Auto-SIMDization Challenges

Amy Wang, Peng Zhao (IBM Toronto Laboratory)
Peng Wu, Alexandre Eichenberger (IBM T.J. Watson Research Center)

October 17, 2005
© 2005 IBM Corporation
Objective:
What are the new challenges in SIMD code generation that are specific to VMX?
(Due to lack of time…)
Scalar Prologue/Epilogue Code Generation (80% of the talk)
Loop Distribution (10% of the talk)
Mixed-Mode SIMDization
Future Tuning Plan (10% of the talk)
Background:
The hardware-imposed misalignment problem
More details in the CELL tutorial Tuesday afternoon
Single Instruction Multiple Data (SIMD) Computation
Process multiple "b[i]+c[i]" additions per operation.

[Figure: memory streams b0…b10 and c0…c10 laid out against 16-byte boundaries; registers R1 = (b0, b1, b2, b3) and R2 = (c0, c1, c2, c3) are loaded from 16-byte-aligned locations, and a single vector add produces R3 = (b0+c0, b1+c1, b2+c2, b3+c3).]
Code Generation for Loops (Multiple Statements)
for (i=0; i<n; i++) {
  a[i]   = …;
  b[i+1] = …;
  c[i+3] = …;
}

Implicit loop skewing (steady state): in the simdized steady state, each iteration updates a[i+4], b[i+4], and c[i+4].

[Figure: for each statement, the iteration spaces a0…a103, b0…b103, and c0…c103 are partitioned into a simdized loop prologue, a simdized loop steady state, and a simdized loop epilogue; the partition points differ per statement because of their different alignments.]
Code Generation for Partial Store – Vector Prologue/Epilogue

for (i=0; i<100; i++) a[i+3] = b[i+1] + c[i+2];

[Figure: a[3] is not on a 16-byte boundary, so the first vector store covers the quad word a0…a3 with only the a3 slot computed. The generated code loads the quad word holding a0…a3, shuffles the computed value b1+c2 into the a3 slot, and stores the merged quad word back; subsequent stores (a4…a7, …) are full aligned stores.]

This can be complicated by multi-threading and page-faulting issues.
Multi-threading Issue

for (i=0; i<100; i++) a[i+3] = b[i+1] + c[i+2];

[Figure: Thread 1 loads the quad word a0…a3 and shuffles b1+c2 into the a3 slot. While thread 1 is shuffling, thread 2 updates a0. Thread 1 then stores its merged quad word, clobbering thread 2's update with the stale a0 value: WRONG!]
Solution – Scalar Prologue/Epilogue

Original source:

void ptest() {
  int i;
  for (i = 0; i < UB; i++) {
    pout0[i+K1] = pin0[i+K1] + pin1[i+K1];
  }
}

After Late-SIMDization – Scalar Prologue:

20 | void ptest()
   {
     int K1;
22 | if (!1) goto lab_4;
     @CIV0 = 0;
     if (!1) goto lab_26;
     @ubCondTest0 = (K1 & 3) * -4 + 16;
     @CIV0 = 0;
     do { /* id=2 guarded */ /* ~27 */
       /* region = 0 */
       /* bump-normalized */
       if ((unsigned) (@CIV0 * 4) >= @ubCondTest0) goto lab_30;
       pout0[]0[K1 + @CIV0] = pin0[]0[K1 + @CIV0] + pin1[]0[K1 + @CIV0];
     lab_30:
       /* DIR LATCH */
       @CIV0 = @CIV0 + 1;
     } while (@CIV0 < 4);
     /* ~27 */
     @CIV0 = 0;
   lab_26:
SIMD Body
if (!1) goto lab_25;
@CIV0 = 0;
do { /* id=1 guarded */ /* ~3 */
/* region = 8 */
23 |
@V.pout0[]02[K1 + (@CIV0 + 4)] = @V.pin0[]01[K1 + (@CIV0 + 4)] + @V.pin1[]00[K1 + (@CIV0 + 4)];
22 |
/* DIR LATCH */
@CIV0 = @CIV0 + 4;
} while (@CIV0 < 24);
/* ~3 */
@mainLoopFinalCiv0 = (unsigned) @CIV0;
lab_25:
Scalar Epilogue
if (!1) goto lab_28;
@ubCondTest1 = (unsigned) ((K1 & 3) * -4 + 16);
@CIV0 = 0;
do { /* id=3 guarded */ /* ~29 */
/* region = 0 */
/* bump-normalized */
if ((unsigned) (@CIV0 * 4) < @ubCondTest1) goto lab_31;
pout0[]0[K1 + (24 + @CIV0)] = pin0[]0[K1 + (24 + @CIV0)] + pin1[]0[K1 + (24 + @CIV0)];
lab_31:
/* DIR LATCH */
@CIV0 = @CIV0 + 1;
} while (@CIV0 < 8);
/* ~29 */
@CIV0 = 32;
…
Scalar Prologue/Epilogue Problems
One loop becomes three loops: scalar prologue, SIMD body, scalar epilogue.
The scalar P/E loops contain one "if" statement per peeled statement.
If even one statement is misaligned, every statement must be peeled.
Scalar P/E does not benefit from SIMD computation, whereas vector P/E does.
The million-dollar questions…
When there is a need to generate scalar P/E, what is the threshold for the loop upper bound?
What is the performance difference between vector P/E and scalar P/E?
Experiments
Wrote a gentest script with the following parameters:
./gentest -s snum -l lnum -[c/r] ratio -n ub
-s : number of store statements in the loop
-l : number of loads per statement
-[c/r] : compile-time or runtime misalignment, where 0 <= ratio <= 1 specifies the fraction of statements that are aligned (i.e. known at compile time to be aligned on a quad-word boundary)
-n : upper bound of the loop (a compile-time constant)
Since we are only interested in the overhead introduced by P/E, load references are relatively aligned with store references (no shifts inside the body).
Uses the addition operation.
Assumes a data type of float (i.e. 4 bytes).
Each generated testcase is compiled at -O3 -qhot -qenablevmx -arch=ppc970 and run on an AIX ppc970 machine (c2blade24).
Each testcase is run 3 times, with the average timing recorded.
10 variants of the same parameters are generated.
Results

[Chart: "Compile Time Misalignment Study for Scalar P/E" – time (seconds, 0 to 0.5) versus ratio of aligned statements in the loop (0, 0.25, 0.5, 0.75), comparing "disable SIMD, ub 12" against "SIMD with scalar P/E, ub 12".]

With the lowest functional ub of 12, and in the presence of varying degrees of compile-time misalignment, it is always good to simdize!
Tobey is able to fully unroll the scalar P/E loops and fold away all the if conditions. (Good job!)
Results

[Chart: "Runtime Misalignment Study for Scalar P/E" – time (seconds, 0 to 0.8) versus ratio of aligned statements in the loop (0, 0.25, 0.5, 0.75), comparing "disable SIMD" and "SIMD with scalar P/E" at both ub 12 and ub 16.]

When the aligned ratio is below 0.25 (i.e. the misaligned ratio is greater than 0.75) at ub 12, scalar P/E introduces so much overhead that it is not good to simdize. However, if we raise the ub to 16, it is always good to simdize regardless of the degree of misalignment!
Tobey is still able to fully unroll the scalar P/E loops, but cannot fold away "if"s with runtime conditions.
Answer to the first question
When there is a need to generate scalar P/E, what is the threshold for the loop upper bound? Compile time: 12. Runtime: 16.
Results

[Chart: "Vector P/E vs Scalar P/E (compile-time misalignment)" – time (seconds, 0 to 0.35) versus ratio of aligned statements in the loop (0, 0.25, 0.5, 0.75), comparing vector P/E and scalar P/E at ub 12.]

In the presence of only compile-time misalignment, vector P/E is always better than scalar P/E.
Improvement: since every statement is peeled, statements for which a full quad word has been peeled may still be done using vector instructions.
Results

[Chart: "Vector P/E vs Scalar P/E (runtime misalignment)" – time (seconds, 0 to 0.9) versus ratio of aligned statements in the loop (0, 0.25, 0.5, 0.75), comparing vector P/E and scalar P/E at ub 12.]

In the presence of a high runtime misalignment ratio, vector P/E suffers tremendously when it must generate select masks from a runtime variable.
It is better to do scalar P/E when the misaligned ratio is greater than 0.25!
Answer to the second question
What is the performance difference between vector P/E and scalar P/E?
Vector P/E is always better when there is only compile-time misalignment. When the runtime misalignment ratio is greater than 0.25, scalar P/E proves to be better.
Motivating Example

Not all computations are simdizable:
Dependence cycles
Non-stride-one memory accesses
Unsupported operations and data types

A simplified example from GSM.encoder, a speech-compression application:

for (i = 0; i < N; i++) {
1:  d[i+1] = d[i] + (rp[i] * u[i]);   /* linear recurrence – not simdizable */
2:  t[i]   = u[i] + (rp[i] * d[i]);   /* fully simdizable */
}
Current Approach: Loop Distribution

Distribute the simdizable and non-/partially-simdizable statements into separate loops (after loop distribution):

for (i = 0; i < N; i++) {
1:  d[i+1] = d[i] + rp[i] * u[i];
}
for (i = 0; i < N; i++) {
2:  t[i]   = u[i] + rp[i] * d[i];
}

Simdize the loops containing only simdizable statements (after simdization):

for (i = 0; i < N; i++) {
1:  d[i+1] = d[i] + rp[i] * u[i];
}
for (i = 0; i < N; i += 4) {
2:  t[i:i+3] = u[i:i+3] + rp[i:i+3] * d[i:i+3];
}
Problems with Loop Distribution

Increases the reuse distances of memory references.

Original loop (reuse distance O(1)):

for (i = 0; i < N; i++) {
1:  d[i+1] = d[i] + (rp[i] * u[i]);
2:  t[i]   = u[i] + (rp[i] * d[i]);
}

After loop distribution (reuse distance O(N)), only one unit is fully utilized in each loop:

for (i = 0; i < N; i++) {
1:  d[i+1] = d[i] + (rp[i] * u[i]);   /* SIMD unit idle */
}
for (i = 0; i < N; i++) {
2:  t[i]   = u[i] + (rp[i] * d[i]);   /* scalar unit idle */
}
Preliminary results
The prototyped mixed-mode SIMDization has shown a 2x speedup on the SPEC95 FP benchmark swim. With loop distribution, the speedup is only 1.5x.
Conclusion and Future Tuning Plan
Further improvement of scalar P/E code generation:
Currently identifying cases where statement re-execution is allowed; this will let us fold away more if conditions.
More experiments to determine the upper-bound threshold for different data types.
Enable mixed-mode SIMDization.
Better integration of the SIMDization framework into TPO
(e.g. predictive commoning)
Acknowledgement
This work would not be possible without the technical contribution
from the following individuals.
Roch Archambault/Toronto/IBM
Raul Silvera/Toronto/IBM
Yaoqing Gao/Toronto/IBM
Gang Ren/Watson/IBM