http://w3.ibm.com/ibm/presentations
Cell Broadband Engine
Compiler Mediated Performance of a Cell Processor
Alexandre Eichenberger, Kathryn O’Brien, Kevin O’Brien, Peng Wu, Tong Chen, Peter Oden, Daniel Prener, Janice Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, Yaoqing Gao
www.research.ibm.com/cellcompiler/compiler.htm
Haifa Compiler and Architecture Seminar - 2005
© 2002 IBM Corporation
Background
Prototype compiler
Some functionality shipped in Alphaworks Cell xlc compiler
Based on and integrated with the product code base
Collaboration with compiler development group
Many research issues still in progress
Compiler mediated performance for a Cell processor
Kevin O’Brien
Cell Broadband Engine
Multiprocessor on a chip
241M transistors, 235 mm²
200 GFlops (SP) @3.2GHz
200 GB/s bus (internal) @ 3.2GHz
Power Proc. Element (PPE)
general purpose
running full-fledged OSs
Synergistic Proc. Element (SPE)
optimized for compute density
Cell Broadband Engine Overview
[Diagram: PPE (with L1/L2) and eight SPEs with local memories on the Element Interconnect Bus (96 Bytes/cycle), with links to external memory and external I/O (8, 16, and 128 Bytes per direction)]
Heterogeneous, multi-core engine
  1 multi-threaded power processor
  up to 8 compute-intensive-ISA engines
Local memories
  fast access to 256KB local memories
  globally coherent DMA to transfer data
Pervasive SIMD
  PPE has VMX
  SPEs are SIMD-only engines
High bandwidth
  fast internal bus (200GB/s)
  dual XDR™ controller (25.6GB/s)
  two configurable interfaces (76.8GB/s)
  numbers based on 3.2GHz clock rate

Compiler support
for the full spectrum of Cell programming expertise
[Chart: programming models, spanning from highest performance with programmer control to highest productivity with fully automatic compiler technology]
  Automatic tuning for each ISA (applies at every level)
  SIMD axis: explicit SIMD coding → SIMD/alignment directives → automatic simdization
  Parallelization axis: hand-tuned for multiple ISAs → explicit parallelization with local memories → shared memory abstraction through single source compilation → automatic parallelization

Outline
Part 1: Automatic SPE tuning
  automatic tuning for each ISA
Part 2: Parallelization and memory abstraction
  explicit parallelization with local memories
  shared memory, single program abstraction
  automatic parallelization
Part 3: Automatic simdization
  explicit SIMD coding
  SIMD/alignment directives
  automatic simdization

Cell Architecture Overview
[Diagram: Cell processor — PPE and eight SPEs on the Element Interconnect Bus (96 Bytes/cycle)]
PPE executes traditional code
SPE optimized for
  high compute throughput
  small area footprint
SPE tradeoff
  lower hardware complexity
  peak performance requires compiler assistance

Architecture: Relevant SPE Features
[Diagram: SPE block — (1) dual-issue instruction logic issuing to an even pipe (floating/fixed point) and an odd pipe (branch, memory, permute); register file (128 x 16-byte registers); instruction buffer (3.5 x 32 instr); (2) local store (256 KByte, single ported); (3) globally coherent DMA. Local-store traffic: branch 1,2; branch hint 1,2; instr. fetch 2; DMA request 3]
Synergistic Processing Element (SPE)
SIMD-only functional units
  16-byte register/memory accesses
  must be parallel & properly aligned
Dual-issue for instructions
  full dependence check in hardware
Simplified branch architecture
  no hardware branch predictor
  compiler-managed hint/predication
Single-ported local memory
  aligned accesses only
  contentions alleviated by compiler

Feature #1: Functional Units are SIMD Only
[Diagram: SPE block, as before]
Functional units are SIMD only
  all transfers are 16 bytes wide,
  including register file and memory
How to handle scalar code?

Single Instruction Multiple Data (SIMD)
Meant to process multiple “b[i]+c[i]” data per operation
[Figure: memory streams b and c on 16-byte boundaries; registers R1 = (b0,b1,b2,b3) and R2 = (c0,c1,c2,c3); VADD yields R3 = (b0+c0, b1+c1, b2+c2, b3+c3)]
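The four-at-a-time operation above can be sketched in plain C. This is an illustrative model of one 16-byte VADD per loop step; the vec4 type and helper functions are stand-ins, not SPE intrinsics, and n is assumed to be a multiple of 4:

```c
#include <assert.h>

/* Illustrative model (not SPE intrinsics): a vec4 stands in for one
 * 16-byte SIMD register holding four 32-bit ints. */
typedef struct { int v[4]; } vec4;

static vec4 vload(const int *p)          /* p assumed 16-byte aligned */
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = p[i];
    return r;
}

static vec4 vadd(vec4 a, vec4 b)         /* one VADD: four adds at once */
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
    return r;
}

static void vstore(int *p, vec4 r)
{
    for (int i = 0; i < 4; i++) p[i] = r.v[i];
}

/* a[i] = b[i] + c[i], four elements per "register" operation */
void simd_add(int *a, const int *b, const int *c, int n)
{
    for (int i = 0; i < n; i += 4)
        vstore(&a[i], vadd(vload(&b[i]), vload(&c[i])));
}
```

The point is only the shape: one operation consumes and produces whole 16-byte quantities, which is what makes the scalar cases on the following slides awkward.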

Scalar Code on SIMD Functional Units
Example: a[2] = b[1] + c[3]
[Figure: LOAD b[1] fetches the aligned line (b0,b1,b2,b3) into r1; LOAD c[3] fetches (c0,c1,c2,c3) into r2; ADD produces (b0+c0, b1+c1, b2+c2, b3+c3) in r3; STORE a[2] writes the full 16-byte line back over a0..a3]
Problem #1: memory alignment defines data location in register
Problem #2: adding aligned values yields wrong result
Problem #3: vector store clobbers neighboring values

Scalar Load Handling
Use read-rotate sequence
[Figure: LOAD b[1] fetches the aligned line (b0,b1,b2,b3) into r1; ROTATE &b[1] produces r1’ = (b1,b2,b3,b0)]
Overhead (1 op, in blue)
  one quad-word byte rotate
Outcome
  desired scalar value always in the first slot of the register
  this addresses Problems 1 & 2
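The read-rotate sequence can be modeled in portable C. This is a sketch only: the rotate is expressed as an index computation rather than the SPE's quadword byte rotate, and the array is assumed to hold 32-bit elements:

```c
#include <assert.h>
#include <stdint.h>

/* Model of a scalar load on a 16-byte-only memory port: LOAD the
 * enclosing aligned quadword, then ROTATE so the wanted element lands
 * in the first ("preferred") slot. */
typedef struct { int32_t v[4]; } vreg;

int32_t scalar_load(const int32_t *base, int idx)
{
    /* LOAD: fetch the aligned 16-byte line containing base[idx] */
    const int32_t *line =
        (const int32_t *)((uintptr_t)&base[idx] & ~(uintptr_t)15);
    vreg r1;
    for (int i = 0; i < 4; i++) r1.v[i] = line[i];

    /* ROTATE &base[idx]: rotate by the in-line slot so the scalar
     * ends up in slot 0 */
    int slot = (int)(((uintptr_t)&base[idx] >> 2) & 3);
    vreg r1p;
    for (int i = 0; i < 4; i++) r1p.v[i] = r1.v[(i + slot) & 3];

    return r1p.v[0];   /* scalar is always read from the first slot */
}
```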

Scalar Store Handling
Use read-modify-write sequence
[Figure: LOAD a[2] fetches the aligned line (a0,a1,a2,a3); CWD &a[2] forms the insertion mask for &a[2]; SHUFFLE merges the computed b1+c3 into the a[2] slot of r3; STORE a[2] writes back (a0,a1,b1+c3,a3)]
Overhead (1 to 3 ops, in blue)
  one byte shuffle, one mask formation (may reuse), one load (may reuse)
Outcome
  SIMD store does not clobber memory (this addresses Problem 3)
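The read-modify-write sequence can likewise be sketched in portable C; memcpy stands in for the 16-byte load/store, and a direct slot assignment stands in for the CWD mask plus SHUFFLE:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of a scalar store when only full 16-byte stores exist: load
 * the enclosing line, merge the scalar in (CWD + SHUFFLE on the SPE),
 * and store the whole line back. */
void scalar_store(int32_t *base, int idx, int32_t value)
{
    int32_t *line = (int32_t *)((uintptr_t)&base[idx] & ~(uintptr_t)15);
    int32_t r[4];
    memcpy(r, line, 16);                       /* LOAD the enclosing line */
    int slot = (int)(((uintptr_t)&base[idx] >> 2) & 3);
    r[slot] = value;                           /* SHUFFLE under the mask  */
    memcpy(line, r, 16);                       /* STORE the full line     */
}
```

The test below checks the property the slide cares about: the targeted element changes and its neighbors in the same 16-byte line do not.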

Optimizations for Scalar on SIMD
Significant overhead for scalar load/store can be lowered
For vectorizable code
  generate SIMD code directly to fully utilize SIMD units
  done by expert programmers or compilers (see SIMD presentation)
For scalar variables
  allocate scalar variables in the first slot, by themselves: [ i | * | * | * ]
  eliminates the need for a rotate when loading
    data is guaranteed to be in the first slot (Problems 1 & 2)
  eliminates the need for a read-modify-write when storing
    other data in the 16-byte line is garbage (Problem 3)
  can also deal with the alignment of scalar data

Feature #2: Software-Assisted Instruction Issue
[Diagram: SPE block, as before]
Dual-issue for instructions
  can dual-issue parallel instructions
  full dependence check in hardware
  code layout constrains dual issuing
Alleviate constraints by making the scheduler aware of the code layout issue

Alleviating Issue Restriction
Scheduling finds the best possible schedule
  dependence graph modified to account for latency of false dependences
Bundling enforces the code layout restrictions
  keep track of even/odd code layout at all times
  swap parallel ops when needed
  insert (even or odd) nops when needed
Engineering issues
  each function must start at a known even/odd code layout boundary
  no instructions can be added after the last scheduling phase, as that would change the code layout and thus the dual-issuing constraints
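A minimal sketch of the nop-insertion part of bundling, under a toy model where each scheduled op only records which pipe it needs and dual issue requires the even-pipe op at an even slot (swapping of parallel ops is omitted; the opcode names are illustrative, not real SPE opcodes):

```c
#include <assert.h>

/* Toy bundler: walk the schedule keeping the even/odd slot parity, and
 * pad with a nop whenever an op would land in a slot whose parity does
 * not match the pipe it needs. */
enum op { EVEN, ODD, NOP_EVEN, NOP_ODD };

/* Lay out n scheduled ops into out[]; returns the emitted count
 * (out must have room for up to 2*n entries). */
int bundle(const enum op *sched, int n, enum op *out)
{
    int pos = 0;                          /* even/odd layout tracker */
    for (int i = 0; i < n; i++) {
        int want_even = (sched[i] == EVEN);
        if ((pos & 1) == 0 && !want_even)
            out[pos++] = NOP_EVEN;        /* pad: odd op must land odd  */
        else if ((pos & 1) == 1 && want_even)
            out[pos++] = NOP_ODD;         /* pad: even op must land even */
        out[pos++] = sched[i];
    }
    return pos;
}
```

A well-interleaved schedule (even, odd, even, odd, ...) needs no padding, which is exactly why the slide argues the scheduler should be aware of layout.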

Feature #3: Single-Ported Local Memory
[Diagram: SPE block with the asymmetric local store port — 16 bytes for load/store ops, 128 bytes for IFETCH/DMA]
Local store is single ported
  denser hardware
  16 bytes for load/store ops
  128 bytes for IFETCH/DMA
  static priority: DMA > MEM > IFETCH
If we are not careful, we may starve for instructions

Instruction Starvation Situation
[Figure: dual-issue instruction logic fed by two instruction buffers while a long stream of FP/MEM op pairs drains them]
There are 2 instruction buffers
  up to 64 ops along the fall-through path
First buffer is half-empty
  can initiate refill (refill is initiated after a buffer is half empty)
When the MEM port is continuously used
  starvation occurs (no ops left in the buffers)

Instruction Starvation Prevention
SPE has an explicit IFETCH op
  which initiates an instruction fetch
Scheduler monitors the starvation situation
  when the MEM port is continuously used,
  it inserts an IFETCH op within the red window,
  i.e. before it is too late to hide the refill latency
  the IFETCH op is not needed if the memory port is idle for one or more cycles within the red window
Compiler design
  scheduler must keep track of code layout
[Figure: instruction buffers draining on a stream of FP/MEM pairs; the red window marks the last point at which a refill can still hide the IFETCH latency]

Engineering Issues for Dual-Issue & Starvation Prevention
Initially, the scheduling and bundling phases were separate
  Sched: find the best schedule, using latencies, issue, & resource constraints
  Bundle: satisfy dual-issue & instruction-starvation constraints by adding nops
  [Flow: Code (not sched) → Sched → Code (not bundled) → Bundle → Code (sched & bundled)]
Problem: the bundler adds an IFETCH to prevent starvation.
A better schedule could have been found had the scheduler known that, but the schedule is already “finalized”.

Engineering Issues for Dual-Issue & Starvation Prevention
We integrate scheduling and bundling tightly, on a cycle-by-cycle basis
  find the best schedule, using latencies, issue, & resource constraints
  satisfy dual-issue & instruction-starvation constraints by adding nops
  [Flow: Code (not sched) → integrated Sched + Bundle → Code (sched & bundled)]

Feature #4: Software-Assisted Branch Architecture
[Diagram: SPE block, as before]
Branch architecture
  no hardware branch predictor, but
  compare/select ops for predication
  software-managed branch hint
  one hint active at a time
Lowering overhead by
  predicating small if-then-else
  hinting predictably taken branches

Hinting Branches & Instruction Starvation Prevention
SPE provides a HINT operation
  fetches the branch target into the HINT buffer
  no penalty for correctly predicted branches
  needs a minimum of 15 cycles and 8 intervening ops between HINT br,target and BRANCH if true
  compiler inserts hints when beneficial
Impact on instruction starvation
  after a correctly hinted branch, the IFETCH window is smaller
[Figure: HINT br,target fetches ops from the branch target into the HINT buffer while the instruction buffers drain; the IFETCH window and refill latency are marked]

SPE Optimization Results
Relative reductions in execution time
single SPE performance, optimized, simdized code
[Chart: execution time for Original, +Bundle, +Branch Hint, +Ifetch across kernels including Saxpy, MatMult, Convolution, VLD, LU, IDEA, FFT, Linpack, and Huffman; average 0.78 relative to the original 1.00]

Outline
Part 1: Automatic SPE tuning
  automatic tuning for each ISA
Part 2: Parallelization and memory abstraction
  explicit parallelization with local memories
  shared memory, single program abstraction
  automatic parallelization
Part 3: Automatic simdization
  explicit SIMD coding
  SIMD/alignment directives
  automatic simdization

Cell Broadband Engine
Multiprocessor on a chip
  Power Proc. Element (PPE): general purpose, running full-fledged OSs
  Synergistic Proc. Element (SPE): optimized for compute density
Compiler support for parallelizing across the heterogeneous processing elements

Single Source Compiler
  the user prepares an application as a collection of one or more source files containing user directives targeting the Cell architecture
  the compiler, guided by these directives, takes care of the code partitioning between PPE and SPE
  the compiler takes care of data transfers:
    identifies accesses in SPE functions that refer to data in system memory locations
    uses a software cache or static buffers to manage the transfer of this data to/from SPE local stores
  the compiler takes care of code size
  explore extending code partitioning to Single Source, i.e. automatic partitioning based on functionality rather than size

Using OpenMP to partition/parallelize across Cell
  A single source program contains C or Fortran with user directives or pragmas
  The compiler ‘outlines’ all code within the pragmas into separate functions compiled for the SPE
  It replaces the ‘outlined’ code with a call to the parallel runtime and compiles this code for the PPE
  The master thread continues to execute on the PPE and participates in workshare constructs
PPE Runtime
  places outlined functions on a work queue containing information about the number of iterations to execute, or ‘chunk’ size, for each SPE
  creates up to 8 SPE threads to pull work items (outlined parallel functions) from the queue and execute them on the SPEs
  may wait for SPE completion, or proceed with other PPE statement execution
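The source construct being outlined is ordinary OpenMP. A minimal compilable example of the kind of loop the slides show (the names A, B, and x come from the slides; any conforming compiler accepts it, and without OpenMP the pragma is simply ignored and the loop runs serially):

```c
#include <assert.h>

/* The kind of workshare loop the single-source compiler outlines into
 * an OL$1-style function for the SPEs. */
#define N 10000

int A[N], B[N];
int x = 3;

void scaled_copy(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = x * B[i];
}
```

Under the scheme described above, the loop body becomes an outlined function, the PPE replaces it with runtime calls, and the runtime hands iteration chunks to up to 8 SPE threads.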

Cloning for heterogeneous processors
  The outlined function becomes a new node in the call graph
  In pass 2 of TPO, using the whole-program call graph, the outlined function is cloned, then specialized to create a PPE and an SPE version
  All called functions must also be cloned
  SPE call sites are modified to call the SPE versions of cloned subroutines
  The partitioning pass creates SPE and PPE partitions and invokes the lower-level optimizer for machine-specific optimization
[Figure: call graph and flow graph split into a “compile for PPE” partition and a “compile for SPE” partition; key: flow-graph node/edge, call-graph node/edge]

“Single Source” Compiler, using OpenMP
[Figure: the transformation, step by step]
Original source:
  Foo1();
  #pragma omp parallel for
  for (i=0; i<10000; i++)
    A[i] = x * B[i];
  Foo2();
Outline the omp region:
  void OL$1() {
    for (i=0; i<10000; i++)
      A[i] = x * B[i];
  }
Compile for PPE:
  Foo1();
  Call omp-rte-init;
  Call omp_rte_do_par;
  Foo2();
Clone for PPE:
  void OL$1_PPE() {
    for (i=LB; i<UB; i++)
      A[i] = x * B[i];
  }
Clone for SPE:
  void OL$1_SPE() {
    for (k=LB; k<UB; k++) {
      DMA 100 B elements into B´
      for (j=0; j<100; j++)
        A´[j] = cache_lookup(x) * B´[j];
      DMA 100 A elements out of A´
    }
  }

Cell Memory & DMA Architecture
[Diagram: PPE (L1/L2, TLBs) and SPEs #1–#8 (SPU, local store, MMU, MFC regs) connected to main memory*, QoS*/L3*, and IO devices; local stores are aliased/DMA-mapped into the global address space (* external)]
PPE
  can access/DMA memory
  sets access rights
SPE can initiate DMAs
  to any global addresses, including the local stores of others
  translation done by the MMU
Local stores are mapped in the global address space
Note
  all elements may be masters; there are no designated slaves

Competing for the SPE Local Store
Local store is fast, but needs support when full.
[Diagram: local store holding code, regular data, and irregular data]
Provided compiler support:
SPE code too large
  compiler partitions code
  partition manager pulls in code as needed
Data with regular accesses is too large
  compiler stages data in & out using static buffering
  can hide latencies by using double buffering
Data with irregular accesses is present
  e.g. indirection, runtime pointers...
  use a software cache approach to pull the data in & out (last-resort solution)
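The double-buffering idea can be sketched as follows; memcpy stands in for the asynchronous DMA get/put (in a real SPE program the "DMA" of the next tile would overlap the compute on the current one), and CHUNK plus the requirement that n be a multiple of it are illustrative simplifications:

```c
#include <string.h>

/* Double-buffering sketch: while one local buffer is being computed on,
 * the other is being filled with the next tile. */
#define CHUNK 128

void staged_scale(int *dst, const int *src, int n, int factor)
{
    int buf[2][CHUNK];
    int cur = 0;
    memcpy(buf[cur], src, CHUNK * sizeof(int));          /* prime buffer 0 */
    for (int base = 0; base < n; base += CHUNK) {
        int nxt = cur ^ 1;
        if (base + CHUNK < n)                            /* "DMA" next tile */
            memcpy(buf[nxt], src + base + CHUNK, CHUNK * sizeof(int));
        for (int j = 0; j < CHUNK; j++)                  /* compute on cur  */
            buf[cur][j] *= factor;
        memcpy(dst + base, buf[cur], CHUNK * sizeof(int)); /* "DMA" out     */
        cur = nxt;
    }
}
```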

How can we run programs that have lots of data?
Data won’t all fit in Local Store
Solution: use System Memory; it’s large and virtual
But the SPU doesn’t have Load/Store access to it
The compiler can automatically manage the transfer of data between System Memory and Local Store
We call this Data Partitioning

Data Partitioning
Single Source assumption: all data lives in System Memory
In a naïve implementation, every load and store requires a DMA operation
  too costly (~300 cycles per load or store)
  MP will require locking on every reference
What can be done to make this acceptable?

Reducing the Overhead of Accesses to System Memory
Rather than executing a DMA transaction for every variable access, we would like to bring data over in larger pieces and keep it in Local Store for some portion of its lifetime
There are several techniques that can accomplish this
  prefetching predictable references, and accumulating writes
  software cache
    on demand, but leverages spatial and temporal locality
    requires additional instructions inline, so it is essentially a fallback strategy
We can apply tiling to increase reuse for both of these

Software Cache for Irregular Accesses
Data with irregular accesses
  cannot (typically) reside permanently in the SPE’s local memory
  thus resides in global memory
  when accessed, we
    must translate the global address into a local store address
    must pull the data in/out when it’s not already present
Use a software cache
  managed by the SPEs in the local store
  generates DMA requests to transfer data to/from global memory
  uses a 4-way set-associative cache to naturally use the SIMD units of the SPE

Software Cache Overview
A portion of Local Store is set aside to hold the cache
The cache is made up of two arrays
  Tag array: contains the comparands for the address lookups and pointers to the data lines; also contains “dirty bits” for MP support
  Data array: contains the data lines
Geometry
  currently 4-way set associative
  line size is 128 bytes, with 128 sets of 4 lines each
  these parameters can be changed
Total size: Tags (16K) + Data (64K) = 80K
[Figure: tag array (set tags, data ptrs., dirty bits) alongside the data array of lines]
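A shrunken, scalar sketch of such a 4-way set-associative software cache. Sizes here are illustrative (fewer sets than the slide's geometry), the backing "global memory" is just an array, misses fill by memcpy rather than DMA, and a real SPE version compares the four tags with one SIMD compare and would pick a victim way instead of always refilling way 0:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE 128   /* line size, as on the slide */
#define SETS 16    /* shrunk for illustration    */
#define WAYS 4

static uint8_t global_mem[SETS * LINE * 8];   /* stand-in system memory */

static struct {
    uint32_t tag[WAYS];          /* comparands for the address lookup */
    int      valid[WAYS];
    uint8_t  data[WAYS][LINE];   /* data lines */
} cache[SETS];

static int total_misses;

/* Translate a "global" address to a pointer into the cached copy. */
uint8_t *cache_lookup(uint32_t addr)
{
    uint32_t line_addr = addr / LINE;
    uint32_t set = line_addr % SETS;
    uint32_t tag = line_addr / SETS;
    for (int w = 0; w < WAYS; w++)            /* the four-tag compare */
        if (cache[set].valid[w] && cache[set].tag[w] == tag)
            return &cache[set].data[w][addr % LINE];      /* hit */
    /* miss: fill way 0 ("DMA" the line in) */
    total_misses++;
    cache[set].tag[0] = tag;
    cache[set].valid[0] = 1;
    memcpy(cache[set].data[0], &global_mem[line_addr * LINE], LINE);
    return &cache[set].data[0][addr % LINE];
}
```

Spatial locality shows up directly: a second access to the same 128-byte line is a hit and costs only the lookup.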

Software Cache Lookup
The cache lookup code is inserted inline at each read or write reference to a cacheable variable
This code is optimized along with all the other instructions
The branch to the miss handler is treated as a regular instruction throughout optimization, and expanded only at the very end of compilation
The miss handler uses a tailored register convention, so that it has no impact on the hit path
[Listing: the inline lookup sequence — form the set index (LI, VSHUFB, VAND), load the set’s tags (VLQ of .L_tagarray), compare (VCEQW), locate the matching way (VGBB, VCNTLZ), branch to MISS if none, then load through the data pointer]

Software Cache Access
Translate a global address (addr a1):
  a subset of addr is used by the tags to extract & load the set
  splat addr and compare ‘=’ against the four set tags in one SIMD comparison
  hit when successful (e.g. mask 00 00 FF 00); locate the data ptr.
  compute the local address from the data ptr. and the addr offset
hit latency: ~20 extra cycles
[Figure: tag array (set tags, data ptrs., dirty bits) and data array, with the splatted address compared against the four tags of the selected set]

Single Source Compiler Results
Results for Swim, Mgrid, & some of their kernels
baseline: execution on one single PPE
[Chart: speedup with SPEs, with optimization vs. softcache only, for kernels including psinv, resid, mgrid, the calc kernels, swim, and rprj]

Some performance results for specOMP
[Chart: speedup over one PPE with softcache (softcache only; no optimization of data transfer, no SIMD) for apsi, ammp, applu, art, equake, mgrid, swim, and wupwise at several SPU counts]

Outline
Part 1: Automatic SPE tuning
  automatic tuning for each ISA
Part 2: Parallelization and memory abstraction
  explicit parallelization with local memories
  shared memory, single program abstraction
  automatic parallelization
Part 3: Automatic simdization
  explicit SIMD coding
  SIMD/alignment directives
  automatic simdization

Successful Simdizer
Extract parallelism
  loop level: for (i=0; i<256; i++) a[i] = ...
  basic-block level: a[i+0] = ...; a[i+1] = ...; a[i+2] = ...; a[i+3] = ...
  entire short loop: for (i=0; i<8; i++) a[i] = ...
Satisfy constraints
  alignment constraints (e.g. vload b[1], vload b[5], vpermute to form b1 b2 b3 b4 across 16-byte boundaries)
  data size conversion (e.g. load, unpack, add, store across SHORT and INT streams)
  multiple targets (generic, VMX, SPE)
  non stride-one accesses

Traditional Execution of a Loop
Sequential execution of “for (i=0; i<64; i++) a[i+2] = b[i+1] + c[i+3]”
[Figure: memory streams b, c, and a (grey is the 1st iteration): load b[1], load c[3], add, store a[2]]

SIMD Alignment Problem
SIMD execution of “for (i=0; i<64; i++) a[i+2] = b[i+1] + c[i+3]”
[Figure: vload b[1] yields (b0,b1,b2,b3) at offset 4; vload c[3] yields (c0,c1,c2,c3) at offset 12; vadd produces (b0+c0, b1+c1, b2+c2, b3+c3)]
this is not b[1]+c[3], ... because the alignments do not match

Loop-Level Simdization, Naïve Way
SIMD execution of “for (i=0; i<64; i++) a[i+2] = b[i+1] + c[i+3]”
[Figure: the b register stream is stream-shifted left to start at b1 and the c register stream is stream-shifted right to start at c3; the shifted streams are added four at a time (b1+c3, b2+c4, ...) and stored to the a stream starting at a2]

Solving the Alignment Problem
Data Reorganization Graph
  original graph, with the alignment of each load/store labeled
  resolve alignment conflicts
[Graph: vload b[i+1] (offset 4) and vload c[i+3] (offset 12) feed vadd; vstore a[i+2] is at offset 8; conflict: reaching alignments are not all identical]

Solving the Alignment Problem
Data Reorganization Graph
  resolve alignment conflicts by adding “stream-shift” aligning operations
  set the alignment of the streams to zero [VAST]
[Graph: stream-shift(4,0) after vload b[i+1] and stream-shift(12,0) after vload c[i+3]; vadd at offset 0; stream-shift(0,8) before vstore a[i+2] at offset 8]
http://w3.ibm.com/ibm/presentations
Cell Broadband Engine
Loop-Level Simdization, Optimized Way
SIMD execution of “for(i=0;i<64;i+) a[i+2] = b[i+1] + c[i+3]”
[Figure: memory and register streams across 16-byte boundaries. The b stream (offset 4) is stream-shifted right to the store offset 8, giving registers (b-1,b0,b1,b2), (b3,b4,b5,b6), (b7,b8,b9,b10); the c stream (offset 12) is stream-shifted left to offset 8, giving (c1,c2,c3,c4), (c5,c6,c7,c8), (c9,c10,c11,c12); the vector adds then produce sums already at the store alignment, e.g. b1+c3 and b2+c4 land directly in the quadword stored to a0…a3.]
Optimized Solving of the Alignment Problem, Eager Policy
Data Reorganization Graph
eagerly align to store alignment
[Graph: vload b[i+1] (offset 4) → stream-shift(4,8) and vload c[i+3] (offset 12) → stream-shift(12,8) feed the vadd at offset 8, which feeds vstore a[i+2] directly.]
set streams to the store alignment; reduces the stream-shift count from 3 to 2 [PLDI04, CGO05]
Optimized Solving of the Alignment Problem, Lazy Policy
Data Reorganization Graph (slightly different “b[i+1] + c[i+1]” example)
lazily align to store alignment
[Graph: vload b[i+1] and vload c[i+1] (both offset 4) feed the vadd directly; a single stream-shift(4,8) then feeds vstore a[i+2] (offset 8).]
lazily set streams to the store alignment; reduces the stream-shift count from 3 to 1 [PLDI04, CGO05]
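A scalar model may make the lazy schedule concrete. The helper below is illustrative (name and lane conventions are mine): both operands are added while still at their shared offset 4, and one stream-shift(4,8), modeled by taking the last lane of the previous sum quad and the first three lanes of the current one, moves the result to the store alignment.

```c
#include <assert.h>

/* Scalar model of one steady-state step of the lazy schedule for
   a[i+2] = b[i+1] + c[i+1], with i a multiple of 4.  The vadd happens
   at the shared input offset 4; a single stream-shift(4,8) on the
   sums then realigns the stream to the store offset 8. */
static void lazy_step(int *a, const int *b, const int *c, int i) {
    int prev[4], cur[4];
    for (int j = 0; j < 4; j++) {
        prev[j] = b[i - 4 + j] + c[i - 4 + j];  /* previous sum quad */
        cur[j]  = b[i + j]     + c[i + j];      /* current sum quad  */
    }
    /* stream-shift(4,8): shift the sum stream right by 4 bytes */
    a[i]     = prev[3];
    a[i + 1] = cur[0];
    a[i + 2] = cur[1];
    a[i + 3] = cur[2];
}
```

Only one data-reorganization operation is executed per stored quadword, versus two (one per input) under the eager policy.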
SIMD Code Generation
SIMD code is generated from a valid data reorganization graph.
Code generation for a simdized loop:
  <simdized prologue>
  for(i=0; i<100; i+=4)
    <simdized loop body>
  <simdized epilogue>
simdized prologue: emitted if the store is misaligned
simdized steady state: the loop itself
simdized epilogue: depends on the store alignment and the trip count
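A scalar sketch of that shape (illustrative; the 4-lane inner loop stands in for one SIMD operation): peel until the store pointer is 16-byte aligned, run the 4-wide steady state, then finish the remainder.

```c
#include <assert.h>
#include <stdint.h>

/* Prologue / steady state / epilogue structure of a simdized loop,
   with scalar stand-ins for the vector body (hypothetical example). */
static void copy_structured(int *a, const int *b, int n) {
    int i = 0;
    /* prologue: until the store pointer &a[i] is 16-byte aligned */
    while (i < n && ((uintptr_t)&a[i] & 15) != 0) { a[i] = b[i]; i++; }
    /* steady state: one whole 16-byte vector per iteration */
    for (; i + 4 <= n; i += 4)
        for (int j = 0; j < 4; j++) a[i + j] = b[i + j];
    /* epilogue: leftovers, driven by alignment and trip count */
    for (; i < n; i++) a[i] = b[i];
}
```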
Code Generation for Stream-Shift
When you need a stream of vectors starting at address b[1]
[Figure: streamshift(4,0) over b0 b1 … b12 laid out across 16-byte boundaries. Three aligned loads — load b[1] (quad b0–b3, offset 4), load b[5] (quad b4–b7), and load b[9] (quad b8–b11) — are combined by shuffles, each shuffle of two consecutive quads yielding one register of the shifted stream: (b1,b2,b3,b4), (b5,b6,b7,b8), … at offset 0.]
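In scalar terms, each shuffle in the figure combines two consecutive quadwords into one register of the shifted stream; a sketch (the helper name is mine):

```c
#include <assert.h>

/* Scalar model of the shuffle realizing streamshift(4,0): take the
   last three 4-byte lanes of one quad and the first lane of the next. */
static void shuffle_shift4(const int *lo, const int *hi, int *out) {
    out[0] = lo[1];
    out[1] = lo[2];
    out[2] = lo[3];
    out[3] = hi[0];
}
```

Each further register of the shifted stream reuses the previous high quad as its new low quad, so the steady state needs only one new aligned load plus one shuffle per result, as the figure suggests.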
Code Generation for Partial Store
for (i=0; i<100; i++) a[i+2] = b[i+1] + c[i+3];
[Figure: the first store of the simdized loop is partial. The quadword a0–a3 is loaded (load a[2], truncated), a shuffle merges the preserved a0, a1 with the computed b1+c3, b2+c4, and the merged quadword is stored back (store a[2]); the following stores (store a[6], …) write full quadwords at the store offset 8.]
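A scalar model of that first, partial store (illustrative; the function is mine): the target quad is loaded, only lanes 2 and 3 are replaced by computed values, and the whole quad goes back to memory.

```c
#include <assert.h>

/* Read-modify-write of the misaligned first store of
   a[i+2] = b[i+1] + c[i+3] (scalar model of load/shuffle/store). */
static void first_partial_store(int *a, const int *b, const int *c) {
    int quad[4];
    for (int j = 0; j < 4; j++) quad[j] = a[j];  /* load a[0..3]     */
    quad[2] = b[1] + c[3];                       /* computed lanes   */
    quad[3] = b[2] + c[4];
    for (int j = 0; j < 4; j++) a[j] = quad[j];  /* store whole quad */
}
```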
Code Generation for Loops (Multiple Statements)
for (i=0; i<n; i++) {
  a[i] = …;
  b[i+1] = …;
  c[i+3] = …;
}

Implicit loop skewing (steady state):
  a[i+4] = …
  b[i+4] = …
  c[i+4] = …

[Figure: the value streams a0…a103, b0…b103, and c0…c103 for the statements a[i]=, b[i+1]=, and c[i+3]= are each split into a simdized loop prologue, a simdized loop steady state, and a simdized loop epilogue; skewing lets the steady state advance all three statements by the same four elements per iteration.]
Machine-Independent Code After Simdization
Pseudo code after simdization (no software pipelining of adjacent loads)
all operations are on vectors
loads/stores are normalized to truncated (16-byte aligned) addresses
splice, shiftpairl, and shiftpairr are pseudo data-reorganization operations to be mapped to spu_shuffle or spu_sel

i = 0;
a[i] = splice(a[i], shiftpairr(b[i - 4], b[i], 4) + shiftpairl(c[i - 4], c[i], 4), 8);
do {
  a[i + 4] = shiftpairr(b[i], b[i + 4], 4) + shiftpairl(c[i], c[i + 4], 4);
  i = i + 4;
} while (i < 95);
a[i + 4] = splice(shiftpairr(b[i], b[i + 4], 4) + shiftpairl(c[i], c[i + 4], 4), a[i + 4], 8);
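One way to pin down semantics for these pseudo-ops is the scalar model below. It is a sketch, not the compiler's definitions: the lane conventions of shiftpairr/shiftpairl/splice and the choice of which quadwords feed the c-stream shifts are my assumptions, chosen so that the eager schedule checks out against the scalar loop (the operand indices on the c stream differ slightly from the pseudo code above).

```c
#include <assert.h>

#define VL 4  /* four 4-byte elements per 16-byte vector */

/* shiftpairr(x,y,s): shift the pair (x,y) right by s bytes; keep the
   16 bytes landing in y's slot: last s bytes of x, first 16-s of y. */
static void shiftpairr(const int *x, const int *y, int s, int *out) {
    int k = s / 4;
    for (int j = 0; j < VL; j++)
        out[j] = (j < k) ? x[VL - k + j] : y[j - k];
}

/* shiftpairl(x,y,s): shift the pair left by s bytes; keep the 16
   bytes landing in x's slot: x minus its first s bytes, then the
   first s bytes of y. */
static void shiftpairl(const int *x, const int *y, int s, int *out) {
    int k = s / 4;
    for (int j = 0; j < VL; j++)
        out[j] = (j < VL - k) ? x[j + k] : y[j - (VL - k)];
}

/* splice(x,y,s): first s bytes from x, remaining 16-s bytes from y. */
static void splice(const int *x, const int *y, int s, int *out) {
    int k = s / 4;
    for (int j = 0; j < VL; j++)
        out[j] = (j < k) ? x[j] : y[j];
}

/* Simdized a[i+2] = b[i+1] + c[i+3] for 0 <= i < 100, all streams
   eagerly realigned to the store offset 8.  b and c must be readable
   one quad before/after the used range, as a real prologue/epilogue
   would also require. */
static void simdized(int *a, const int *b, const int *c) {
    int sb[VL], sc[VL], sum[VL];
    /* prologue: partial first store, preserves a[0..1] */
    shiftpairr(b - VL, b, 4, sb);
    shiftpairl(c, c + VL, 4, sc);
    for (int j = 0; j < VL; j++) sum[j] = sb[j] + sc[j];
    splice(a, sum, 8, a);
    /* steady state: full stores a[4..7] ... a[96..99] */
    for (int i = 0; i <= 92; i += 4) {
        shiftpairr(b + i, b + i + 4, 4, sb);
        shiftpairl(c + i + 4, c + i + 8, 4, sc);
        for (int j = 0; j < VL; j++) a[i + 4 + j] = sb[j] + sc[j];
    }
    /* epilogue: partial last store, preserves a[102..103] */
    shiftpairr(b + 96, b + 100, 4, sb);
    shiftpairl(c + 100, c + 104, 4, sc);
    for (int j = 0; j < VL; j++) sum[j] = sb[j] + sc[j];
    splice(sum, a + 100, 8, a + 100);
}

/* Check the simdized loop against the scalar original. */
static int verify(void) {
    int A[104], Bs[112], Cs[112];
    int *b = Bs + 4, *c = Cs;
    for (int j = 0; j < 112; j++) { Bs[j] = 3 * j + 1; Cs[j] = 7 * j; }
    for (int j = 0; j < 104; j++) A[j] = -999;
    simdized(A, b, c);
    for (int i = 0; i < 100; i++)
        if (A[i + 2] != b[i + 1] + c[i + 3]) return 0;
    return A[0] == -999 && A[1] == -999
        && A[102] == -999 && A[103] == -999;
}
```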
Machine-Dependent Intrinsic Codes after Simdization
after software pipelining, loop normalization, and address truncation
spu_shuffle, spu_sel, spu_add, spu_maskb are SPU intrinsics
<0x08090a0b, …> is a vector literal

a[0] = spu_sel(a[0], spu_add(spu_shuffle(b[-4], b[0], <0x08090a0b,0x0c0d0e0f,0x10111213,0x14151617>),
                             spu_shuffle(c[-4], c[0], <0x0c0d0e0f,0x10111213,0x14151617,0x18191a1b>)),
               spu_maskb(15));
oldSPCopy0 = c[0];
oldSPCopy1 = b[0];
i = 0;
do {
  tc = c[i*4+4];
  tb = b[i*4+4];
  a[i*4+4] = spu_add(spu_shuffle(oldSPCopy1, tb, <0x08090a0b,0x0c0d0e0f,0x10111213,0x14151617>),
                     spu_shuffle(oldSPCopy0, tc, <0x0c0d0e0f,0x10111213,0x14151617,0x18191a1b>));
  oldSPCopy0 = tc;
  oldSPCopy1 = tb;
  i = i + 1;
} while (i < 24);
a[100] = spu_sel(spu_add(spu_shuffle(b[96], b[100], <0x08090a0b,0x0c0d0e0f,0x10111213,0x14151617>),
                         spu_shuffle(c[96], c[100], <0x0c0d0e0f,0x10111213,0x14151617,0x18191a1b>)),
                 a[100], spu_maskb(15));
Outline
Simdization example
Integrated simdization framework
A Successful Simdizer
Extract Parallelism
  loop level: for (i=0; i<256; i++) a[i] = …
  basic-block level: a[i+0] = …; a[i+1] = …; a[i+2] = …; a[i+3] = …
  entire short loop: for (i=0; i<8; i++) a[i] = …
Satisfy Constraints
  alignment constraints: vload b[1] and vload b[5] combined by a vpermute into (b1,b2,b3,b4)
  data size conversion: a SHORT stream unpacked into INT 1 and INT 2 streams
  non stride-one accesses
  multiple targets: GENERIC, VMX, SPE
Multiple Sources of SIMD Parallelism
Loop level: SIMD for a single statement across consecutive iterations
  for (i=0; i<256; i++) a[i] = …
successful at:
  efficiently handling misaligned data
  pattern recognition (reduction, linear recursion)
  leveraging the loop transformations present in most compilers
  amortizing overhead (versioning, alignment handling) and employing cost models
[Bik et al, IJPP 2002] [VAST compiler, 2004] [Eichenberger et al, PLDI 2004] [Wu et al, CGO 2005] [Naishlos, GCC Developer’s Summit 2004]
Multiple Sources of SIMD Parallelism (cont.)
Basic-block level: SIMD across multiple isomorphic operations
  a[i+0] = …; a[i+1] = …; a[i+2] = …; a[i+3] = …
successful at:
  handling unrolled loops (manually or by the compiler)
  extracting SIMD parallelism within structs, e.g. a[i].x = …; a[i].y = …; a[i].z = …
  extracting SIMD parallelism within a statement, e.g.
    s += a(i)*b(i) + a(i+1)*b(i+1) + a(i+2)*b(i+2) + a(i+3)*b(i+3) + a(i+4)*b(i+4)
[Larsen et al, PLDI 2000] [Shin et al, PACT 2002]
Multiple Sources of SIMD Parallelism (cont.)
Short-loop level: SIMD across entire loop iterations
  for (i=0; i<8; i++) a[i] = …
effectively collapses the innermost loop, so SIMD can then be extracted at the next loop level
e.g. FIR:
  for (k=0; k<248; k++)
    for (i=0; i<8; i++)
      res[k] += in[k+i] * coef[k+i];
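A scalar sketch of the collapsed FIR (illustrative; the 8-wide "vector" ops are modeled by fixed-trip lane loops, which on a 16-byte machine would map to two SIMD operations each):

```c
#include <assert.h>

/* FIR with the 8-iteration inner loop collapsed: per output k, one
   8-wide multiply followed by a reduction into res[k]. */
static void fir8(float *res, const float *in, const float *coef, int n) {
    for (int k = 0; k < n; k++) {
        float prod[8];
        for (int j = 0; j < 8; j++)                /* collapsed inner loop */
            prod[j] = in[k + j] * coef[k + j];
        float s = 0.0f;
        for (int j = 0; j < 8; j++) s += prod[j];  /* reduction */
        res[k] += s;
    }
}
```

With the inner loop gone, the outer k loop carries the remaining SIMD parallelism.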
Multiple SIMD Hardware Constraints
Alignment in SIMD units matters: when the alignments of the inputs do not match, the data must be realigned.
[Figure: b0…b9 laid out across 16-byte boundaries; vload b[1] fetches the quad (b0,b1,b2,b3) and vload b[5] fetches (b4,b5,b6,b7); a vpermute of the two quads produces the realigned register (b1,b2,b3,b4), which can then be added to an aligned stream such as (c0,c1,c2,c3).]
Multiple SIMD Hardware Constraints (cont.)
Size of data in SIMD registers matters.
[Figure: a 16-byte vector holds 8 shorts (2 bytes each) but only 4 ints (4 bytes each), so one vector of short corresponds to 2 vectors of int; the SHORT stream is unpacked into INT 1 and INT 2 streams before the integer add and store.]
E.g. when converting from short to integer, we must issue 2x the integer SIMD operations.
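In scalar terms (a sketch; the helper name is mine): one 8-short vector unpacks into two 4-int vectors, so the integer side of the computation is issued twice.

```c
#include <assert.h>

/* Unpack one 8-short (16-byte) vector into two 4-int (16-byte)
   vectors; every subsequent int-width SIMD op must be doubled. */
static void unpack_short8(const short *v8, int *lo, int *hi) {
    for (int j = 0; j < 4; j++) {
        lo[j] = v8[j];        /* INT 1 */
        hi[j] = v8[4 + j];    /* INT 2 */
    }
}
```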
Multiple SIMD Hardware Constraints (cont.)
Hardware supports 16-byte contiguous accesses only.
  for(i=0; i<n; i++) … = a[i] + b[2i+1]
Non stride-one loads require packing: contiguous loads of (b0,b1,b2,b3) and (b4,b5,b6,b7) are packed into (b1,b3,b5,b7).
Non stride-one stores require unpacking.
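A scalar model of the pack for b[2i+1] (illustrative; the helper is mine): two contiguous 16-byte loads, then a shuffle that keeps only the odd lanes.

```c
#include <assert.h>

/* Pack the odd-indexed elements of two contiguous quads (b0..b3,
   b4..b7) into one vector (b1, b3, b5, b7). */
static void pack_odd(const int *lo, const int *hi, int *out) {
    out[0] = lo[1];
    out[1] = lo[3];
    out[2] = hi[1];
    out[3] = hi[3];
}
```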
Multiple SIMD Hardware Constraints (cont.)
Different platforms have varying SIMD support
e.g. VMX / SPE have SIMD permute instructions
e.g. the SPE has no memory page faults; VMX does
multiple targets: GENERIC, VMX, SPU
How to support the cross product of all these?
[Figure: the sources of SIMD parallelism (loop level, basic-block level, entire short loop) crossed with the hardware constraints (alignment, data size conversion, non stride-one access, and multiple targets: GENERIC, VMX, SPE).]
Key Abstraction: Virtual SIMD Vector
Virtual SIMD Vector
  has arbitrary length
  has no alignment constraints
use the virtual vector as the representation, abstracting away all the hardware complexity
Progressive “de-virtualization” of the virtual vector:
  until each vector in the loop satisfies all hardware constraints
  or revert vectors back to scalars (if too much overhead)
[Figure: extraction of SIMD parallelism (loop level, basic-block level, entire short loop) produces virtual vectors, which are then devirtualized against the hardware constraints (alignment, data size conversion, multiple targets: GENERIC, VMX, SPU, BG/L).]
Integrated Simdization Framework
Characteristics: modular, global, integrated; hides complexity.
Supporting analyses: Alignment Analysis, Global Pointer Analysis, Idiom Recognition, Data Layout Optimization, Constant Propagation, Dependence Elimination.
Pipeline:
  SIMD Extraction: basic-block level, short-loop level, and loop-level aggregation
  Virtual Vectors: arbitrary length, arbitrary alignment
  Transition: alignment devirtualization (alignment handling), length devirtualization (short length handling)
  Physical Vectors: 16 bytes long, machine alignment
  SIMD Codegen: machine specific
SPE Simdization Results
Speedup factors
[Chart: speedup on a single SPE, optimized automatic simdization vs. scalar code, for Linpack, Swim-l, FIR, Autcor, Dot Product, Checksum, Alpha Blending, Saxpy, Mat Mult, and their average.]
Conclusions
Cell Broadband Engine architecture
heterogeneous parallelism
dense compute architecture
Presents the application writer with a wide range of tools
  from support for extracting maximum performance
  to support for achieving maximum productivity with automatic tools
Shown respectable speedups
  using automatic tuning, simdization, and support for a shared-memory abstraction
Questions
For additional info:
www.research.ibm.com/cellcompiler/compiler.htm
First Example: Basic-Block & Loop Level Aggregation
Original loop:
  for (i=0; i<256; i++) {
    a[i].x = …
    a[i].y = …
    a[i].z = …
    b[i+1] = …
  }
Value streams (4 iterations shown for the purpose of illustration):
  a0.x a0.y a0.z | a1.x a1.y a1.z | a2.x a2.y a2.z | a3.x a3.y …
  b1 | b2 | b3 | b4
  (1st iter | 2nd iter | 3rd iter | 4th iter)
Phase 1: Basic-Block Level Aggregation
Original loop:
  for (i=0; i<256; i++) {
    a[i].x = …
    a[i].y = …
    a[i].z = …
    b[i+1] = …
  }
Basic-block level aggregation:
  pack a[i].x, a[i].y, a[i].z into a vector of 3 elements
  pack regardless of alignment
Value streams after packing:
  (a0.x a0.y a0.z) (a1.x a1.y a1.z) (a2.x a2.y a2.z) (a3.x a3.y …)
  b1 b2 b3 b4
Phase 2: Loop-Level Aggregation
BB-aggregated loop:
  for (i=0; i<256; i++) {
    (a[i].x,y,z) = …
    b[i+1] = …
  }
Loop-level aggregation:
  pack each statement with itself across consecutive iterations
  final vector lengths must be a multiple of 16 bytes
  scalar “b[i]” and vector “(a[i].x,y,z)” are treated alike
  pack regardless of alignment
Value streams after packing with self:
  (a0.x a0.y a0.z a1.x a1.y a1.z a2.x a2.y a2.z a3.x a3.y a3.z)
  (b1 b2 b3 b4)
Phase 3: Alignment Devirtualization
Loop-aggregated loop:
  for (i=0; i<256; i+=4) {
    (a[i].x,…,a[i+3].z) = …
    (b[i+1],…,b[i+4]) = …
  }
Alignment devirtualization*:
  shift misaligned streams
  skew the computations so that the loop computes (b[i+4],…,b[i+7])
[Figure: the a stream (a0.x…a3.z) is already aligned; the misaligned b stream (b1 b2 b3 b4) is shifted to the aligned access (b4 b5 b6 b7).]
* Arrays (e.g. &a[0], &b[0], …) are assumed here to be 16-byte aligned.
Phase 4: Length Devirtualization
Aligned loop:
  (b[1]…b[3]) = …
  for (i=0; i<252; i+=4) {
    (a[i].x,…,a[i+3].z) = …
    (b[i+4],…,b[i+7]) = …
  }
  (a[252].x,…,a[255].z) = …
  b[256] = …
Length devirtualization:
  break virtual vectors into 16-byte chunks
[Figure: the long vector (a0.x…a3.z) is split into 16-byte chunks; (b4 b5 b6 b7) already fits in one.]
Second Example: Data-Size Conversion and Misalignment
Original loop:
  for (i=0; i<256; i++) {
    a[i] += (int) b[i+1];
  }
[Expression tree: load b[i+1] (short, 2 bytes) → convert → add with load a[i] (integer, 4 bytes) → store a[i]; short computations below the convert, integer computations above it.]
Phase 1: Loop-Level Aggregation
Original loop:
  for (i=0; i<256; i++) {
    a[i] += (int) b[i+1];
  }
Loop-level aggregation:
  pack each statement with itself across consecutive iterations
  virtual vectors have a uniform number of elements, even when
    a vector of 8 integers = 32 bytes of data
    a vector of 8 shorts = 16 bytes of data
[Expression tree: vload b[i+1] (8 shorts) → convert → vadd with vload a[i] (8 integers) → vstore a[i].]
Phase 2: Alignment Devirtualization
Original loop:
  for (i=0; i<256; i++) {
    a[i] += (int) b[i+1];
  }
Alignment devirtualization:
  shift misaligned streams
  easy to do, as we are still dealing with long vectors
[Expression tree: vload b[i+1] (8 shorts) → streamshift to align b[i+1] → convert → vadd with vload a[i] (8 integers) → vstore a[i].]
Phase 3: Length Devirtualization
Original loop:
  for (i=0; i<256; i++) {
    a[i] += (int) b[i+1];
  }
Length devirtualization:
  8 shorts fit into one 16-byte register
  8 integers do not fit; the integer registers and their associated instructions must be replicated
[Expression tree: vload b[i+1] (8 shorts) → streamshift → unpack into 2x4 integers; vload a[i] and vload a[i+4] → two vadds → vstore a[i] and vstore a[i+4].]