Cell Broadband Engine - Enabling Density Computing for Data-Rich Environments

Michael Gschwind
Bruce D’Amora
Alexandre Eichenberger
© 2006 IBM Corporation
Cell History
ƒ IBM, SCEI/Sony, Toshiba Alliance formed in 2000
ƒ Design Center opened in March 2001
ƒ Based in Austin, Texas
ƒ Hardware designed in parallel with software
ƒ February 7, 2005: First external technical disclosures
ƒ August 15, 2005: First external architecture disclosures
ƒ August 25, 2005: Cell Launch - Architecture released
Acknowledgements
ƒ Cell is the result of a partnership between SCEI/Sony,
Toshiba, and IBM
ƒ Cell represents the work of more than 400 people
starting in 2000 and a design investment of about
$400M
More about Cell Broadband Engine
ƒ http://www.research.ibm.com/cell
ƒ Online resources
–Specification
–Documentation
–Open Source and Proprietary Tools
–Operating System
–Platform Simulator
Agenda
ƒ 8:00 – 9:00 Motivation and Architecture
ƒ 9:00 – 9:30 Heterogeneous Application Model
ƒ 9:30 – 10:30 Compilation and Auto-Parallelization
ƒ 10:30 – 11:00 BREAK
ƒ 11:00 – 12:00 Programming Models & Applications
Cell Broadband Engine: Architecture
Michael Gschwind
Computer Architecture at the turn of the millennium
ƒ “The end of architecture” was being proclaimed
– Frequency scaling as performance driver
– State of the art microprocessors
•multiple instruction issue
•out of order architecture
•register renaming
•deep pipelines
ƒ Little or no focus on compilers
– Questions about the need for compiler research
ƒ Academic papers focused on microarchitecture tweaks
The age of frequency scaling
[Figure: scaled MOSFET device with gate, n+ source, n+ drain, wiring and p substrate; dimensions and voltage divided by α, substrate doping α*NA]
SCALING:
Voltage: V/α
Oxide: tox/α
Wire width: W/α
Gate width: L/α
Diffusion: xd/α
Substrate: α * NA
RESULTS:
Higher Density: ~α^2
Higher Speed: ~α
Power/ckt: ~1/α^2
Power Density: ~Constant
Source: Dennard et al., JSSC 1974.
[Figure: performance (bips) versus total FO4 per stage, from 37 down to 7; deeper pipeline toward smaller FO4]
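The scaling results listed on the "age of frequency scaling" slide can be checked with a short derivation. This is a sketch under two assumptions that are not stated on the slide itself: switching power per circuit is modeled as P = C V^2 f, and capacitance scales as 1/α along with the device dimensions.

```latex
\begin{align*}
C' &= \frac{C}{\alpha}, \qquad V' = \frac{V}{\alpha}, \qquad f' = \alpha f \\
P'_{\mathrm{ckt}} &= C' \, V'^{2} \, f'
  = \frac{C}{\alpha} \cdot \frac{V^{2}}{\alpha^{2}} \cdot \alpha f
  = \frac{P_{\mathrm{ckt}}}{\alpha^{2}} \\
D' &= \alpha^{2} D \qquad \text{(circuits per unit area)} \\
P'_{\mathrm{ckt}} \, D' &= \frac{P_{\mathrm{ckt}}}{\alpha^{2}} \cdot \alpha^{2} D
  = P_{\mathrm{ckt}} \, D \qquad \text{(power density unchanged)}
\end{align*}
```

Under these assumptions, power density stays constant only while the supply voltage is scaled by 1/α together with the dimensions.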
Frequency scaling
ƒ A trusted standby
– Frequency as the tide that raises all boats
– Increased performance across all applications
– Kept the industry going for a decade
ƒ Massive investment to keep going
– Equipment cost
– Power increase
– Manufacturing variability
– New materials
… but reality was closing in!
[Figure: bips and bips^3/W relative to optimal FO4, together with active power and leakage power on a logarithmic scale, versus total FO4 per stage from 37 down to 7]
Power crisis
ƒ The power crisis is not “natural”
– Created by deviating from ideal scaling theory
– Vdd and Vt not scaled by α
• additional performance with increased voltage
ƒ Laws of physics
– Pchip = ntransistors * Ptransistor
ƒ Marginal performance gain per transistor low
– significant power increase
– decreased power/performance efficiency
[Figure: Vdd, Vt and Tox (Å) versus gate length Lgate (µm) on logarithmic axes, compared with classic scaling]
The power inefficiency of deep pipelining
[Figure: bips and bips^3/W relative to optimal FO4 versus total FO4 per stage, from 37 down to 7 (deeper pipeline toward smaller FO4); the power-performance optimal and performance optimal design points are marked]
Source: Srinivasan et al., MICRO 2002
ƒ Deep pipelining increases number of latches and switching rate
⇒ power increases with at least f^2
ƒ Latch insertion delay limits gains of pipelining
Hitting the memory wall
ƒ Latency gap
– Memory speedup lags behind processor speedup
– limits ILP
[Figure: normalized MFLOPS and memory latency, 1990 to 2003]
Source: McKee, Computing Frontiers 2004
ƒ Chip I/O bandwidth gap
– Less bandwidth per MIPS
ƒ Latency gap as application bandwidth gap
– usable bandwidth = (no. of inflight requests × avg. request size) / roundtrip latency
– Typically (much) less than chip I/O bandwidth
Source: Burger et al., ISCA 1996
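The usable-bandwidth relation on the memory wall slide (requests in flight times average request size, divided by round-trip latency) can be sketched in C. The function and parameter names below are illustrative and do not come from the deck.

```c
#include <assert.h>

/* Usable bandwidth in bytes per second:
 * (requests in flight * average request size) / round-trip latency.
 * All names here are illustrative assumptions. */
double usable_bandwidth(unsigned inflight_requests,
                        double avg_request_bytes,
                        double roundtrip_latency_s)
{
    assert(roundtrip_latency_s > 0.0);
    return (double)inflight_requests * avg_request_bytes / roundtrip_latency_s;
}

/* Smallest number of in-flight requests that sustains a target bandwidth,
 * obtained by solving the same relation for the request count. */
unsigned requests_needed(double target_bytes_per_s,
                         double avg_request_bytes,
                         double roundtrip_latency_s)
{
    assert(avg_request_bytes > 0.0);
    double n = target_bytes_per_s * roundtrip_latency_s / avg_request_bytes;
    unsigned whole = (unsigned)n;
    return (n > (double)whole) ? whole + 1 : whole;   /* round up */
}
```

With hypothetical figures of a single 128-byte request in flight and a 160 ns round trip, the relation gives 0.8 GB/s. Larger requests or more concurrent requests raise that figure proportionally.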
Our Y2K challenge
ƒ 10× performance of desktop systems
ƒ 1 TeraFlop / second with a four-node configuration
ƒ 1 Byte bandwidth per 1 Flop
– “golden rule for balanced supercomputer design”
ƒ scalable design across a range of design points
ƒ mass-produced and low cost
Cell Design Goals
ƒ Provide the platform for the future of computing
– 10× performance of desktop systems shipping in 2005
ƒ Computing density as main challenge
– Dramatically increase performance per X
• X = Area, Power, Volume, Cost,…
ƒ Single core designs offer diminishing returns on
investment
– In power, area, design complexity and verification cost
Necessity as the mother of invention
ƒ Increase power-performance efficiency
–Simple designs are more efficient in terms of power and area
ƒ Increase memory subsystem efficiency
–Increasing data transaction size
–Increase number of concurrently outstanding transactions
⇒ Exploit larger fraction of chip I/O bandwidth
ƒ Use CMOS density scaling
–Exploit density instead of frequency scaling to deliver increased
aggregate performance
ƒ Use compilers to extract parallelism from application
–Exploit application parallelism to translate aggregate
performance to application performance
Exploit application-level parallelism
ƒ Data-level parallelism
– SIMD parallelism improves performance with little overhead
ƒ Thread-level parallelism
– Exploit application threads with multi-core design approach
ƒ Memory-level parallelism
– Improve memory access efficiency by increasing number of parallel memory transactions
ƒ Compute-transfer parallelism
– Transfer data in parallel to execution by exploiting application knowledge
Concept phase ideas
[Diagram: concept-phase ideas around the Chip Multiprocessor: large architected register set; high frequency and low power; reverse Vdd scaling for low power; modular design; design reuse; control cost of coherency; efficient use of memory interface]
Cell Architecture
ƒ Heterogeneous multicore system architecture
– Power Processor Element for control tasks
– Synergistic Processor Elements for data-intensive processing
ƒ Synergistic Processor Element (SPE) consists of
– Synergistic Processor Unit (SPU)
– Synergistic Memory Flow Control (SMF)
[Diagram: eight SPEs, each with an SPU (SXU and LS) and an SMF, attached to the EIB (up to 96B/cycle); the PPE (PPU with PXU, L1 and L2; 64-bit Power Architecture with VMX); the MIC to Dual XDR™ and the BIC to FlexIO™; link labels 16B/cycle, 16B/cycle (2x) and 32B/cycle]
Shifting the Balance of Power with Cell Broadband Engine
ƒ Data processor instead of control system
– Control-centric code stable over time
– Big growth in data processing needs
• Modeling
• Games
• Digital media
• Scientific applications
ƒ Today’s architectures are built on a 40 year old data model
– Efficiency as defined in 1964
– Big overhead per data operation
– Parallelism added as an after-thought
Powering Cell – the Synergistic Processor Unit
[Diagram: SPU organization; labels: Fetch (64B), ILB, Issue / Branch (2 instructions), VRF, VFPU, VFXU, PERM, LSU, Local Store (Single Port SRAM), SMF; path widths 16B, 16B x 2, 16B x 3 x 2 and 128B]
Source: Gschwind et al., Hot Chips 17, 2005
Density Computing in SPEs
ƒ Today, execution units only fraction of core area and power
– Bigger fraction goes to other functions
• Address translation and privilege levels
• Instruction reordering
• Register renaming
• Cache hierarchy
ƒ Cell changes this ratio to increase performance per area and power
– Architectural focus on data processing
• Wide datapaths
• More and wide architectural registers
• Data privatization and single level processor-local store
• All code executes in a single (user) privilege level
• Static scheduling
Streamlined Architecture
ƒ Architectural focus on simplicity
– Aids achievable operating frequency
– Optimize circuits for common performance case
– Compiler aids in layering traditional hardware functions
– Leverage 20 years of architecture research
ƒ Focus on statically scheduled data parallelism
– Focus on data parallel instructions
• No separate scalar execution units
• Scalar operations mapped onto data parallel dataflow
– Exploit wide data paths
• Data processing
• Instruction fetch
– Address impediments to static scheduling
• Large register set
• Reduce latencies by eliminating non-essential functionality
SPE Highlights
ƒ User-mode architecture
– No page translation within SPU
ƒ SIMD dataflow
– Broad set of operations (8, 16, 32, 64 Bit)
– Graphics SP-Float
– IEEE DP-Float
ƒ DMA block transfer
– using Power Architecture memory translation
ƒ 256KB Local Store
– Combined I & D
ƒ RISC organization
– 32 bit fixed instructions
– Load/store architecture
– Unified register file
[Die photo: SPE floorplan, 14.5mm2 (90nm SOI); labeled blocks: LS, GPR, FWD, FXU EVN, FXU ODD, SFP, DP, CONTROL, CHANNEL, DMA, SMM, SBI, ATO, RTB, BEB]
Source: Kahle, Spring Processor Forum 2005
Synergistic Processing
[Diagram: interdependent SPU design decisions; labels: scalar layering, shorter pipeline, instruction bundling, high frequency, compute density, simpler μarch, large register file, DLP, static scheduling, wide data paths, local store, determ. latency, single port, static prediction, ILP opt., data parallel select, large basic blocks, sequential fetch]
Efficient data-sharing between scalar and SIMD processing
ƒ Legacy architectures separate scalar and SIMD
processing
– Data sharing between SIMD and scalar processing units
expensive
• Transfer penalty between register files
– Defeats data-parallel performance improvement in many
scenarios
ƒ Unified register file facilitates data sharing for
efficient exploitation of data parallelism
– Allow exploitation of data parallelism without data transfer
penalty
• Data-parallelism always an improvement
Compiling for the Cell Broadband Engine
ƒ The lesson of “RISC computing”
– Architecture provides fast, streamlined primitives to compiler
– Compiler uses primitives to implement higher-level idioms
– If the compiler can’t target it ⇒ do not include in architecture
ƒ Compiler focus throughout project
– Prototype compiler soon after first proposal
– Cell compiler team has made significant advances in
• Automatic SIMD code generation
• Automatic parallelization
• Data privatization
[Diagram labels: Raw Hardware Performance, Programmability, Cell Design, Programmer Productivity]
SPU Pipeline
SPU PIPELINE FRONT END
IF1 IF2 IF3 IF4 IF5 IB1 IB2 ID1 ID2 ID3 IS1 IS2
SPU PIPELINE BACK END
Branch Instruction: RF1 RF2
Permute Instruction: EX1 EX2 EX3 EX4 WB
Load/Store Instruction: EX1 EX2 EX3 EX4 EX5 EX6 WB
Fixed Point Instruction: EX1 EX2 WB
Floating Point Instruction: EX1 EX2 EX3 EX4 EX5 EX6 WB
Legend:
IF = Instruction Fetch
IB = Instruction Buffer
ID = Instruction Decode
IS = Instruction Issue
RF = Register File Access
EX = Execution
WB = Write Back
SPE Block Diagram
[Block diagram: Floating-Point Unit, Fixed-Point Unit, Permute Unit, Load-Store Unit, Branch Unit, Channel Unit; Result Forwarding and Staging; Register File; Instruction Issue Unit / Instruction Line Buffer; Local Store (256kB) Single Port SRAM with 128B Read and 128B Write; DMA Unit; On-Chip Coherent Bus; path widths shown: 8, 16, 64 and 128 Byte/Cycle]
SPU Communication: Channel Architecture
ƒ SPU uses “channels” to communicate with
environment (incl. SMF)
– Access to special purpose registers
• Processor status
– Communication channels
• Requests to SMF
• SMF status
• Mailboxes
SPU Channel Features
ƒ Communication using channels numbered 0 - 127
– Implementation dependent
– Unidirectional
– Have capacity
• Burst without SPU execution stop if capacity available
– Channel operations are blocking
• On write if full
• On read if empty
SPU Channel Access
ƒ rdch RT, ca: RT <= channel(ca)
– Read data word from channel
ƒ rdchcnt RT, ca: RT <= channel capacity(ca)
– Determine channel capacity
ƒ wrch ca, RT: channel(ca) <= RT
– Write data word to channel
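The channel behavior described on the two channel slides (unidirectional, finite capacity, blocking on write if full and on read if empty) can be modeled in portable C. This is a software sketch, not SPU code: the type and function names are hypothetical, the queue depth is an arbitrary illustration, and a stall is modeled by returning 0 instead of blocking. The count functions assume that rdchcnt reports free entries for a write channel and queued entries for a read channel.

```c
#include <assert.h>
#include <stdint.h>

#define CH_MAX_DEPTH 8   /* illustrative bound, not an architected value */

/* Software model of one unidirectional channel with finite capacity. */
typedef struct {
    uint32_t slot[CH_MAX_DEPTH];
    unsigned depth;   /* capacity of this channel */
    unsigned used;    /* entries currently queued */
    unsigned head;    /* index of the oldest entry */
} channel_t;

void ch_init(channel_t *ch, unsigned depth)
{
    assert(depth >= 1 && depth <= CH_MAX_DEPTH);
    ch->depth = depth;
    ch->used = 0;
    ch->head = 0;
}

/* Count as seen on a write channel: writes that would not stall. */
unsigned ch_write_count(const channel_t *ch) { return ch->depth - ch->used; }

/* Count as seen on a read channel: reads that would not stall. */
unsigned ch_read_count(const channel_t *ch) { return ch->used; }

/* Models wrch: returns 1 on success, 0 where the SPU would stall (full). */
int ch_try_write(channel_t *ch, uint32_t value)
{
    if (ch->used == ch->depth)
        return 0;
    ch->slot[(ch->head + ch->used) % ch->depth] = value;
    ch->used++;
    return 1;
}

/* Models rdch: returns 1 on success, 0 where the SPU would stall (empty). */
int ch_try_read(channel_t *ch, uint32_t *value)
{
    if (ch->used == 0)
        return 0;
    *value = ch->slot[ch->head];
    ch->head = (ch->head + 1) % ch->depth;
    ch->used--;
    return 1;
}
```

Checking the count before writing is how a program would burst several writes without stopping SPU execution, as the "Have capacity" bullet suggests.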
Synergistic Memory Flow Control
ƒ SMF implements memory management and mapping
ƒ SMF operates in parallel to SPU
– Independent compute and transfer
– Command interface from SPU
• DMA queue decouples SMF & SPU
• MMIO-interface for remote nodes
ƒ Block transfer between system memory and local store
ƒ SPE programs reference system memory using user-level effective address space
– Ease of data sharing
– Local store to local store transfers
– Protection
[Diagram: SPE with SPU (SXU, LS) and SMF; SMF blocks: DMA Queue, DMA Engine, Atomic Facility, MMU, RMT, Bus I/F Control, MMIO; connections: Data Bus, Snoop Bus, Control Bus, Translate Ld/St, MMIO]
Synergistic Memory Flow Control
ƒ SMF implements memory management and mapping
– DMA for data transfer
– MMU for page translation
– AF for coherent data
– BIF for access to Element Interconnect Bus
– RMT for resource management
ƒ SMF operates in parallel to SPU
– Independent compute and transfer
– Channel command interface from SPU
• DMA queue decouples SMF & SPU
• SPU can synchronize with SMF
• MMIO-based interface for remote nodes
System-wide Virtual Memory Architecture
ƒ SMF MMU follows Power Architecture™ Virtual Memory architecture
– Two-level translation
• Segmentation and paging
– PPE and SPEs share memory map
– SPE programs reference system memory using user-level effective address space
• Ease of data sharing
• Local store to local store transfers
• Protection
ƒ Exceptions delivered to PPE
– SLB miss
– Page fault
Data Transfer with SMF DMAC
ƒ DMA Unit implements block data transfer
– transfer specifies system memory and local store address
– system memory address is effective address
– translated to physical address by SMF MMU
– variable block size from 1B to 16KB
ƒ DMA transfers
– LS ↔ system memory
– LS ↔ LS
– LS ↔ I/O Transfers
ƒ DMA list command
– “SMF program” transfers data in parallel to computation
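The block size range for DMA transfers (1B to 16KB) can be turned into a small validity check. The sketch below assumes the usual rule that a single transfer is 1, 2, 4 or 8 bytes, or a multiple of 16 bytes up to 16KB. That rule and the helper names are assumptions for illustration and should be checked against the architecture documentation.

```c
#include <assert.h>
#include <stddef.h>

#define DMA_MAX_BYTES (16u * 1024u)   /* largest single transfer: 16KB */

/* Returns 1 if `size` is a legal single-transfer size under the assumed
 * rule: 1, 2, 4 or 8 bytes, or a multiple of 16 bytes up to 16KB. */
int dma_size_ok(size_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return 1;
    return size >= 16 && size <= DMA_MAX_BYTES && size % 16 == 0;
}

/* Number of transfers needed to move `total` bytes in chunks of at most
 * 16KB. Assumes `total` is a multiple of 16 so every chunk is legal. */
size_t dma_chunks(size_t total)
{
    assert(total % 16 == 0);
    return (total + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}
```

A buffer larger than 16KB therefore needs several requests, or a DMA list command as named on the slide.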
Synergistic Memory Flow Control
ƒ Bus Interface
– Up to 16 outstanding DMA requests
– Requests up to 16KByte
– Token-based Bus Access Management
ƒ Element Interconnect Bus
– Four 16 byte data rings
– Multiple concurrent transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
– 200+ GByte/s @ 3.2+ GHz
[Diagram: EIB connecting PPE, SPE0 to SPE7, MIC, IOIF1 and BIF/IOIF0 through 16B links, with a central data arbiter (Data Arb)]
Source: Clark et al., Hot Chips 17, 2005
Memory efficiency as key to application performance
ƒ Greatest challenge in translating peak into app performance
– Peak Flops useless without way to feed data
ƒ Cache miss provides too little data too late
– Inefficient for streaming / bulk data processing
– Initiates transfer when transfer results are already needed
– Application-controlled data fetch avoids not-on-time data delivery
ƒ SMF is a better way to look at an old problem
– Fetch data blocks based on algorithm
– Blocks reflect application data structures
Traditional memory usage
ƒ Long latency memory access operations exposed
ƒ Cannot overlap 500+ cycles with computation
ƒ Memory access latency severely impacts application performance
[Timeline figure; legend: computation, mem protocol, mem idle, mem contention]
Exploiting memory-level parallelism
ƒ Reduce performance impact of memory accesses with concurrent
access
ƒ Carefully scheduled memory accesses in numeric code
ƒ Out-of-order execution increases chance to discover more
concurrent accesses
– Overlapping 500 cycle latency with computation using OoO illusory
ƒ Bandwidth limited by queue size and roundtrip latency
Exploiting MLP with Chip Multiprocessing
ƒ More threads ⇒ more memory level parallelism
– overlap accesses from multiple cores
A new form of parallelism: CTP
ƒ Compute-transfer parallelism
– Concurrent execution of compute and transfer increases
efficiency
• Avoid costly and unnecessary serialization
– Application thread has two threads of control
• SPU ⇒ computation thread
• SMF ⇒ transfer thread
ƒ Optimize memory access and data transfer at the
application level
– Exploit programmer and application knowledge about
data access patterns
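Compute-transfer parallelism is commonly expressed as double buffering: the transfer of the next block is started before the current block is processed. The sketch below shows only the control structure in portable C. `start_transfer` is a hypothetical stand-in that copies synchronously, whereas on Cell the SMF would run the transfer concurrently and the SPU would wait on its completion before using the buffer. The block size and all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 256   /* elements per block; illustrative */

/* Stand-in for an asynchronous block transfer into local store. Here it is
 * a plain copy; on Cell the SMF would perform it in parallel to the SPU. */
static void start_transfer(float *local, const float *remote, size_t n)
{
    memcpy(local, remote, n * sizeof(float));
}

/* Sums `total` floats from `remote` using two alternating local buffers. */
float sum_double_buffered(const float *remote, size_t total)
{
    float buf[2][BLOCK];
    size_t nblocks = (total + BLOCK - 1) / BLOCK;
    float sum = 0.0f;
    int cur = 0;

    if (nblocks == 0)
        return 0.0f;

    /* Prefetch the first block. */
    start_transfer(buf[0], remote, total < BLOCK ? total : BLOCK);

    for (size_t b = 0; b < nblocks; b++) {
        size_t n = (b + 1 == nblocks) ? total - b * BLOCK : BLOCK;

        /* Start fetching block b+1 before computing on block b. */
        if (b + 1 < nblocks) {
            size_t next_n = (b + 2 == nblocks) ? total - (b + 1) * BLOCK : BLOCK;
            start_transfer(buf[cur ^ 1], remote + (b + 1) * BLOCK, next_n);
        }

        /* A real implementation would wait here for buf[cur] to arrive. */
        for (size_t i = 0; i < n; i++)
            sum += buf[cur][i];

        cur ^= 1;   /* swap the roles of the two buffers */
    }
    return sum;
}
```

The application decides what to fetch and when, which is the use of programmer and application knowledge about access patterns that the slide refers to.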
Memory access in Cell with MLP and CTP
ƒ Super-linear performance gains observed
– decouple data fetch and use
Synergistic Memory Management
[Diagram: interdependent memory-management design decisions; labels: timely data delivery, decouple fetch and access, high throughput, block fetch, storage density, explicit data transfer, reduced coherence cost, application control, local store, low latency access, parallel compute & transfer, deeper fetch queue, many outstanding transfers, eliminate cache miss logic, single port, multi core, improved code gen, shared I&D, sequential fetch]
Cell BE Processor Components
Power Processor Element (PPE)
ƒ Industry-standard 64-bit IBM Power Architecture™ processor
– PowerPC AS 2.0.2
ƒ 2-Way Hardware Multithreaded
ƒ L1: 32KB I; 32KB D
ƒ L2: 512KB
ƒ Coherent load/store
ƒ VMX
ƒ 3.2+ GHz
ƒ Realtime Control
– Locking L2 Cache & TLB
– Bandwidth Reservation
ƒ Custom Designed
– for high frequency, area and power efficiency
[Diagram, captioned "In the Beginning – the Power Architecture™ Processor": Power Core (PPE), L2 Cache, NCU]
Cell BE Processor Components
Element Interconnect Bus
ƒ data ring for internal communication
– Four 16 byte data rings, supporting multiple transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
– 200+ GByte/s @ 3.2+ GHz
[Diagram adds: Element Interconnect Bus (96 Byte/Cycle, 200+GB/sec @ 3.2+GHz) below Power Core (PPE), L2 Cache and NCU]
Cell BE Processor Components
SPE provides computational performance
ƒ Dual issue 32-bit SIMD architecture
ƒ Dedicated resources
– 128-entry 128-bit VRF
– 256KB Local Store
ƒ Each SMF can be dynamically configured to protect resources
ƒ Dedicated DMA engine
– Up to 16 outstanding requests
[Diagram adds: eight SPEs around the Element Interconnect Bus, each with SPU, Local Store, AUC and MFC]
Cell BE Processor Components
SMF provides memory management & mapping
ƒ SPE Local Store aliased into system memory map
ƒ SMF controls SPE DMA accesses
– Implements page translation and protection
• Using Power Architecture™ system memory map
• System memory map compatible with Power Architecture™ Virtual Memory architecture
– S/W controllable from PPE MMIO
ƒ DMA 1,2,4,8,16,128 Byte ⇒ 16Kbyte transfers for memory and I/O access
Cell BE Processor Components
I/O provides high bandwidth
ƒ Dual XDR™ controller
– 25.6GB/s @ 3.2Gbps
ƒ Two configurable interfaces
– 76.8GB/s @ 6.4Gbps
– Configurable number of Bytes
– Coherent or I/O Mode
ƒ Supports multiple system configurations
[Diagram adds: MIC to DRAM (25 GB/s); IOIF0 to Southbridge I/O (5 GB/s); BIF or IOIF1 to Interconnect (20 GB/s)]
Cell BE Processor Components
IIC: Internal Interrupt Controller
ƒ Handles SPE Interrupts
ƒ Handles External Interrupts
– From Coherent Interconnect
– From IOIF0 or IOIF1
ƒ Interrupt Priority Level Control
ƒ Duplicated for each PPE hardware thread
[Diagram adds: IIC]
Cell BE Processor Components
IOT implements I/O Bus Master Translation
ƒ Translates bus address to system address
ƒ Two Level translation
– I/O Segments: 256 MB
– I/O Pages: 4KB, 64KB, 1MB, 16MB
ƒ I/O Device Identifier / page for LPAR
ƒ IOST and IOPT Cache
– hardware/software managed
[Diagram adds: IOT]
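The two-level I/O translation named on the IOT slide (256 MB segments, then pages of 4KB, 64KB, 1MB or 16MB) implies a three-way split of a bus address. The sketch below shows only that arithmetic. The struct, the function name and the assumption that the segment number is simply the address divided by 256 MB are illustrative; the actual IOST and IOPT entry formats are not described in the deck.

```c
#include <assert.h>
#include <stdint.h>

#define IO_SEGMENT_SHIFT 28u   /* 2^28 bytes = 256 MB per I/O segment */

typedef struct {
    uint64_t segment;   /* which 256 MB segment (first-level index) */
    uint64_t page;      /* page number within the segment (second level) */
    uint64_t offset;    /* byte offset within the page */
} io_addr_parts;

/* page_shift selects the page size: 12 (4KB), 16 (64KB), 20 (1MB), 24 (16MB). */
io_addr_parts io_split(uint64_t bus_addr, unsigned page_shift)
{
    assert(page_shift == 12 || page_shift == 16 ||
           page_shift == 20 || page_shift == 24);

    io_addr_parts p;
    uint64_t in_segment = bus_addr & ((UINT64_C(1) << IO_SEGMENT_SHIFT) - 1);

    p.segment = bus_addr >> IO_SEGMENT_SHIFT;
    p.page    = in_segment >> page_shift;
    p.offset  = in_segment & ((UINT64_C(1) << page_shift) - 1);
    return p;
}
```

For example, with 4KB pages the address 0x12345678 splits into segment 0x1, page 0x2345 and offset 0x678.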
Cell BE Processor Components
Token Manager provides Bandwidth Reservation for shared resources
ƒ Optionally used for RT tasks or LPAR
ƒ Multiple Resource Allocation Groups
ƒ Generates access tokens at configurable rate for each allocation group
– 1 per each memory bank (16 total)
– 2 for each IOIF (4 total)
ƒ Requestors assigned RAG ID by OS/hypervisor
– Each SPE
– PPE L2 / NCU
– IOIF 0 Bus Master
– IOIF 1 Bus Master
ƒ Priority order for using another RAG's unused tokens
ƒ Resource overcommit warning interrupt
[Diagram adds: TKM]
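The token manager's scheme of per-group token generation, with a priority order for using another group's unused tokens, resembles a set of token buckets. The following is a simplified software model under that assumption, not a description of the hardware. All names, the per-tick granularity and the accumulation limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One resource allocation group (RAG) in a simplified token-bucket model. */
typedef struct {
    unsigned tokens;   /* tokens currently available */
    unsigned rate;     /* tokens generated per tick (configurable) */
    unsigned limit;    /* cap on accumulated tokens (model assumption) */
} rag_t;

/* Generate tokens for one group for one tick. */
void rag_tick(rag_t *g)
{
    g->tokens += g->rate;
    if (g->tokens > g->limit)
        g->tokens = g->limit;
}

/* Acquire one access token for group `own`. If that group has none, take an
 * unused token from another group, scanning in array order as the priority
 * order. Returns the index of the group that paid, or -1 if none could. */
int rag_acquire(rag_t *groups, size_t ngroups, size_t own)
{
    assert(own < ngroups);
    if (groups[own].tokens > 0) {
        groups[own].tokens--;
        return (int)own;
    }
    for (size_t i = 0; i < ngroups; i++) {
        if (i != own && groups[i].tokens > 0) {
            groups[i].tokens--;
            return (int)i;
        }
    }
    return -1;   /* the requestor would have to wait */
}
```

A requestor whose group has a zero generation rate can still make progress while other groups leave tokens unused, which is the behavior the priority-order bullet describes.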
Cell BE Implementation Characteristics
ƒ 241M transistors
ƒ 235mm2
ƒ Design operates across wide frequency range
– Optimize for power & yield
ƒ > 200 GFlops (SP) @3.2GHz
ƒ > 20 GFlops (DP) @3.2GHz
ƒ Up to 25.6 GB/s memory bandwidth
ƒ Up to 75 GB/s I/O bandwidth
ƒ 100+ simultaneous bus transactions
– 16+8 entry DMA queue per SPE
[Figure: Frequency Increase vs. Power Consumption (relative), plotted against voltage]
Source: Kahle, Spring Processor Forum 2005
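The single-precision peak figure can be reproduced with simple arithmetic. The sketch assumes eight SPEs, four 32-bit lanes per 128-bit register, and a fused multiply-add counted as two floating-point operations per lane per cycle. These factors are assumptions chosen to be consistent with the 128-bit register file described earlier, not figures taken from this slide.

```c
#include <assert.h>

/* Peak rate in GFlops = units * SIMD lanes * flops per lane per cycle * GHz.
 * Parameter names are illustrative. */
double peak_gflops(unsigned units,
                   unsigned simd_lanes,
                   unsigned flops_per_lane_per_cycle,
                   double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return (double)units * simd_lanes * flops_per_lane_per_cycle * clock_ghz;
}
```

With the assumed factors, `peak_gflops(8, 4, 2, 3.2)` evaluates to about 204.8, in line with the "> 200 GFlops (SP) @3.2GHz" entry.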
Cell BE Applications
Michael Gschwind
Compiling and linking an integrated executable
Loading and execution of a program
[Diagram: system memory holds the PPE image (.text with pu_main, .data) and the embedded SPE images spu0 and spu1 (each with .text containing spu0_main or spu1_main, and .data); an SPE local store receives .text (spu0_main, printf, …) and .data; chip blocks shown: eight SPEs (SPU with SXU and LS, plus SMF), EIB, PPE (PPU with PXU, L1, L2; 32B/cycle and 16B/cycle links)]
PPE code shown in the diagram:
pu_main:
…
spu_create_thread(0, spu0, spu0_main);
…
spu_create_thread(1, spu1, spu1_main);
…
Steps:
1. PPE image loads and executes;
2. PPE initiates SMF transfer;
3. SMF data transfer;
4. start SPU at specified address;
5. SMF starts SPU execution
SPE thread creation
ƒ PPE transfers mini-loader and parameters
–256b program
ƒ Mini-loader transfers thread memory image
–Embedded in Power Architecture executable
–Provided to spe_create_thread() call
ƒ SPE side program load more efficient
–SPE has access to more queue entries
–Parallel loading on 8 SPEs
–Direct channel access vs. MMIO access
Heterogeneous Multi-threading and OS management
ƒ Heterogeneous Multi-Threading Model
–PPE Threads
–SPE Threads
–SPE DMA EA = PPE Process EA Space
–OS supports Create & Destroy SPE tasks
–Atomic Update Primitives used for Mutex
–SPE Context Fully Managed
• OS assignment of SPE threads
• Programmer directed using affinity mask
[Diagram: Application Source & Libraries produce PPE object files and SPE object files; the Cell Broadband Engine-aware OS (Linux) with its SPE Virtualization / Scheduling Layer maps PPE threads and SPE threads onto the physical PPE (T1, T2) and eight physical SPEs]
Linux on Cell BE
ƒ All software in STIDC written on Linux OS
– Started with Linux 2.4 PPC64 on Cell Simulator
• SPEs exposed as I/O Devices (function offload model)
• SPE DMA required pre-pinned memory
• Inflexible programming model
ƒ Moved to 2.6.3
– Added heterogeneous thread model – via system call – moved to SPUFS in 2.6.12
• SPE thread API created (similar to pthreads library)
• User mode direct and indirect SPE access models
• Full pre-emptive SPE context management
• spe_ptrace() added for gdb support
• spe_schedule() for thread to physical SPE assignment
– currently FIFO – run to completion
– SPE threads share address space with parent PPE process (through DMA)
• Demand paging for SPE accesses
• Shared hardware page table with PPE
– SPE Error, Event and Signal handling directed to parent PPE thread
– SPE elf objects wrapped into PPE shared objects with extended gld
• SPE-side mini-loader
– madvise() extended for L2 cache and TLB locking/preloading (realtime feature)
– All patches for Cell in architecture dependent layer (subtree of PPC64)
• Except for a few shameless hacks - being removed in 2.6.12
ƒ Publishing Initial Cell BE Patches for 2.6.12
SPE data transfer (from SPE or PPE)
ƒ Embedded in program
– At compile time
– At runtime
ƒ Direct store to SPU LS
– Using memory map alias when SPU LS mapped into memory map
ƒ Mailbox
– Channel in Synergistic Processor Architecture
– MMIO in IBM Power Architecture™ core
ƒ Externally initiated transfer
– Using SMF block transfer capabilities
– From PPE or remote SPE
ƒ SPU-initiated transfer
– Based on an address provided using one of these four methods
– Based on address computed from data obtained by these five methods
Data sharing between PPE and SPE
ƒ SPE addresses system memory via SMF copy-in/copy-out using effective address
–System memory pointers can be shared between PPE and SPE
ƒ PPE can access SPE local store using SMF or using memory accesses
–PPE enqueues SMF requests via memory mapped I/O
–Aliasing of SPE LS gives PPE addressability as system address
•PPE address = LS base address + offset within local store
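The aliasing rule above can be sketched in portable C. This is a minimal illustration, not code from the Cell SDK: `ls_alias_ea` and `ls_base_ea` are hypothetical names, and the 256 KB local store size is the figure given elsewhere in this deck.

```c
#include <assert.h>
#include <stdint.h>

#define LS_SIZE 0x40000u  /* 256 KB local store per SPE, as stated in this deck */

/* Effective address at which the PPE sees byte `ls_offset` of an SPE's
 * local store. `ls_base_ea` is the (hypothetical) base address at which
 * the OS mapped that local store into the PPE process. */
static uint64_t ls_alias_ea(uint64_t ls_base_ea, uint32_t ls_offset)
{
    assert(ls_offset < LS_SIZE);   /* offset must fall inside the local store */
    return ls_base_ea + ls_offset;
}
```

With a mapping base of 0x10000000, local store offset 0x80 would appear to the PPE at 0x10000080.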
Access to data using common effective addresses
/* a: Power Architecture™ effective (virtual) address */
float ppe_sum_all(float *a)
{
  float sum = 0.0f;
  for (int i = 0; i < MAX; i++)
    sum += a[i];
  return sum;
}
/* a: the same effective address, used here as the DMA source */
float spe_sum_all(float *a)
{
  /* local work buffer in SPE local store */
  float local_a[MAX] __attribute__ ((aligned (128)));
  float sum = 0.0f;
  /* SMF copy: LS target address, EA source address, size (max 16KB), tag */
  mfc_get(&local_a[0], &a[0], sizeof(float)*MAX, 31, 0, 0);
  /* set tags for status request */
  mfc_write_tag_mask(1u<<31);
  /* wait for request to complete */
  mfc_read_tag_status_all();
  /* perform algorithm on local copy */
  for (int i = 0; i < MAX; i++)
    sum += local_a[i];
  return sum;
}
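Because a single transfer is limited to 16 KB, a larger array has to be fetched in pieces. The following is a minimal portable sketch of that chunking loop. `sim_dma_get` and `chunked_get` are hypothetical names, and `memcpy` stands in for the actual SMF transfer; alignment requirements and tag handling are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DMA_MAX_BYTES 16384u  /* per-transfer limit noted on the slide (16 KB) */

/* Stand-in for one DMA get; on hardware this would be an SMF transfer. */
static void sim_dma_get(void *ls, const void *ea, size_t size)
{
    assert(size <= DMA_MAX_BYTES);
    memcpy(ls, ea, size);
}

/* Copy nbytes from ea to ls in chunks of at most DMA_MAX_BYTES.
 * Returns the number of transfers issued. */
static unsigned chunked_get(void *ls, const void *ea, size_t nbytes)
{
    unsigned transfers = 0;
    char *dst = ls;
    const char *src = ea;

    while (nbytes > 0) {
        size_t chunk = nbytes < DMA_MAX_BYTES ? nbytes : DMA_MAX_BYTES;
        sim_dma_get(dst, src, chunk);
        dst += chunk;
        src += chunk;
        nbytes -= chunk;
        transfers++;
    }
    return transfers;
}
```

On real hardware the chunks could all be issued under one tag and waited for together, so the loop adds queue entries rather than round trips.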
Mailbox communication
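ƒ SPE
/* wait for a word from the PPE, then echo it back */
unsigned int mbox = spu_read_in_mbox();
spu_write_out_mbox(mbox);
ƒ PPE
/* wait for a free entry in the SPE inbound mailbox */
while (spe_stat_in_mbox(speid) == 0);
spe_write_in_mbox(speid,data);
/* wait for the reply before reading the outbound mailbox */
while (spe_stat_out_mbox(speid) == 0);
unsigned int rmbox = spe_read_out_mbox(speid);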
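The polling pattern above depends on the mailbox being a small FIFO that can be full or empty. Here is a minimal portable model of such a queue, assuming the 4-entry depth listed for the inbound mailbox elsewhere in this deck. The `sim_mbox` type and its functions are hypothetical and are not part of any SPE library.

```c
#include <assert.h>
#include <stdint.h>

#define MBOX_DEPTH 4u  /* "4 deep FIFO" as listed in this deck */

typedef struct {
    uint32_t slot[MBOX_DEPTH];
    unsigned head;   /* index of the oldest entry */
    unsigned count;  /* number of queued entries */
} sim_mbox;

/* Free entries: what a writer polls before sending. */
static unsigned sim_mbox_space(const sim_mbox *m)
{
    assert(m->count <= MBOX_DEPTH);
    return MBOX_DEPTH - m->count;
}

/* Non-blocking write: returns 1 on success, 0 if the mailbox is full. */
static int sim_mbox_write(sim_mbox *m, uint32_t data)
{
    if (m->count == MBOX_DEPTH)
        return 0;
    m->slot[(m->head + m->count) % MBOX_DEPTH] = data;
    m->count++;
    return 1;
}

/* Non-blocking read: returns 1 and stores the oldest entry, 0 if empty. */
static int sim_mbox_read(sim_mbox *m, uint32_t *out)
{
    if (m->count == 0)
        return 0;
    *out = m->slot[m->head];
    m->head = (m->head + 1) % MBOX_DEPTH;
    m->count--;
    return 1;
}
```

A writer that ignores `sim_mbox_space` simply gets a 0 return here; the point of the model is only to show why the sender checks status before writing.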
Atomic updates and semaphores
ƒ Atomic updates
atomic_set((atomic_ea_t)(ptrAtomicData),0xffffffff);
atomic_set, atomic_add, atomic_sub,
atomic_dec_and_test,...
ƒ Mutex lock/unlock
mutex_lock (cond_mutex_ea);
cond_signal (cond_ea);
mutex_unlock (cond_mutex_ea);
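Locks: atomic_set(val)
[Figure: SPE block diagram. SPU: SXU, LS. SMF: DMA Queue, DMA Engine, Atomic Facility, MMU, RMT, Bus I/F Control, MMIO.]
/* 128-byte-aligned line address and word offset of v within that line */
ea64.ull = ALIGN128_EA(v);
offset = OFFSET128_EA(v, u32);
do {
  /* load the line into LS and set a reservation */
  MFC_DMA(buf, ea64, size, tagid, MFC_GETLLAR_CMD);
  spu_readch (MFC_RdAtomicStat);
  ret_val = buf[offset];
  buf[offset] = val;
  /* conditional store; retry if the reservation was lost */
  MFC_DMA(buf, ea64, size, tagid, MFC_PUTLLC_CMD);
  status = spu_readch(MFC_RdAtomicStat);
} while (status != 0);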
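The retry loop above has the same shape as a compare-and-swap loop on a conventional processor. As a rough portable analogue, the sketch below uses C11 `<stdatomic.h>` in place of the getllar/putllc pair. It is not how the SPE implements the operation, and `atomic_set_sketch` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomically store `val` into *word and return the previous value.
 * The compare-and-swap retry loop plays the role of the
 * load-with-reservation / conditional-store pair. */
static uint32_t atomic_set_sketch(_Atomic uint32_t *word, uint32_t val)
{
    uint32_t old = atomic_load(word);  /* snapshot, like the reserved load */

    /* On failure, `old` is refreshed with the current value and we retry,
     * like looping while the conditional store reports a lost reservation. */
    while (!atomic_compare_exchange_weak(word, &old, val)) {
        /* retry */
    }
    return old;
}
```

The main difference is granularity: the SPE reservation covers a whole 128-byte line, while the C11 operation covers a single word.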
Cell Real Address Memory Map
ƒ Local storage of each SPE aliased in system memory map
– Direct (uncacheable) access by PPE
– Used for LS ↔ LS transfer
– Access control via page table
ƒ QoS memory is pinned system memory
– bandwidth and latency guarantee
– managed by O/S
ƒ I/O devices external to BE
– defined by system and I/O architecture
System management of Cell BE resources
ƒ Cell BE implements full set of Power Architecture™
virtualization and dynamic partitioning
–Support of partition configuration state
•Logical Partition ID etc.
ƒ Full state management by PPE
–Access via memory mapped I/O registers
–Grouped by privilege level
•Access to MMIO facilities controlled by page access control
Per SPE Resources (PPE Side)
ƒ Problem State
–4K Physical Page Boundary
–8 Entry MFC Command Queue Interface
–DMA Command and Queue Status
–DMA Tag Status Query Mask
–DMA Tag Status
–32 bit Mailbox Status and Data from SPU
–32 bit Mailbox Status and Data to SPU
•4 deep FIFO
–Signal Notification 1
–Signal Notification 2
–SPU Run Control
–SPU Next Program Counter
–SPU Execution Status
–4K Physical Page Boundary
–Optionally Mapped 256K Local Store
ƒ Privileged 1 State (OS)
–4K Physical Page Boundary
–SPU Privileged Control
–SPU Channel Counter Initialize
–SPU Channel Data Initialize
–SPU Signal Notification Control
–SPU Decrementer Status & Control
–MFC DMA Control
–MFC Context Save / Restore Registers
–SLB Management Registers
–4K Physical Page Boundary
–Optionally Mapped 256K Local Store
ƒ Privileged 2 State (OS or Hypervisor)
–4K Physical Page Boundary
–SPU Master Run Control
–SPU ID
–SPU ECC Control
–SPU ECC Status
–SPU ECC Address
–SPU 32 bit PU Interrupt Mailbox
–MFC Interrupt Mask
–MFC Interrupt Status
–MFC DMA Privileged Control
–MFC Command Error Register
–MFC Command Translation Fault Register
–MFC SDR (PT Anchor)
–MFC ACCR (Address Compare)
–MFC DSSR (DSI Status)
–MFC DAR (DSI Address)
–MFC LPID (logical partition ID)
–MFC TLB Management Registers
Per SPE Resources (SPU Side)
ƒ SPU Direct Access Resources
–128 - 128 bit GPRs
–External Event Status (Channel 0)
•Decrementer Event
•Tag Status Update Event
•DMA Queue Vacancy Event
•SPU Incoming Mailbox Event
•Signal 1 Notification Event
•Signal 2 Notification Event
•Reservation Lost Event
–External Event Mask (Channel 1)
–External Event Acknowledgement (Channel 2)
–Signal Notification 1 (Channel 3)
–Signal Notification 2 (Channel 4)
–Set Decrementer Count (Channel 7)
–Read Decrementer Count (Channel 8)
–16 Entry MFC Command Queue Interface (Channels 16-21)
–DMA Tag Group Query Mask (Channel 22)
–Request Tag Status Update (Channel 23)
•Immediate
•Conditional - ALL
•Conditional - ANY
–Read DMA Tag Group Status (Channel 24)
–DMA List Stall and Notify Tag Status (Channel 25)
–DMA List Stall and Notify Tag Acknowledgement (Channel 26)
–Lock Line Command Status (Channel 27)
–Outgoing Mailbox to PU (Channel 28)
–Incoming Mailbox from PU (Channel 29)
–Outgoing Interrupt Mailbox to PU (Channel 30)
ƒ SPU Indirect Access Resources (via EA Addressed DMA)
–System Memory
–Memory Mapped I/O
–This SPU Local Store
–Other SPU Local Store
–Other SPU Signal Registers
–Atomic Update (Cacheable Memory)
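The tag group query mask and the Immediate / ALL / ANY update conditions listed above combine as simple bit tests. The sketch below models that logic in portable C. The type and function names are hypothetical, and `completed` is an assumed bitmask with one bit per tag group that has no outstanding commands.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { TAG_UPDATE_IMMEDIATE, TAG_UPDATE_ANY, TAG_UPDATE_ALL } tag_update_mode;

/* Mask bit for one of the 32 tag groups (5-bit tag ID). */
static uint32_t tag_bit(unsigned tag)
{
    assert(tag < 32);
    return (uint32_t)1 << tag;
}

/* Returns 1 if a tag status request would be satisfied right now. */
static int tag_status_ready(uint32_t completed, uint32_t query_mask,
                            tag_update_mode mode)
{
    uint32_t hit = completed & query_mask;

    switch (mode) {
    case TAG_UPDATE_IMMEDIATE: return 1;                 /* never waits */
    case TAG_UPDATE_ANY:       return hit != 0;          /* at least one group done */
    case TAG_UPDATE_ALL:       return hit == query_mask; /* every queried group done */
    }
    return 0;
}
```

Using an unsigned shift in `tag_bit` matters for tag 31, where a signed `1 << 31` would overflow.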
Memory Flow Controller Commands
DMA Commands
Put - Transfer from Local Store to EA space
Puts - Transfer and Start SPU execution
Putr - Put Result - (Arch. Scarf into L2)
Putl - Put using DMA List in Local Store
Putrl - Put Result using DMA List in LS (Arch)
Get - Transfer from EA Space to Local Store
Gets - Transfer and Start SPU execution
Getl - Get using DMA List in Local Store
Sndsig - Send Signal to SPU
Command Modifiers: <f,b>
f: Embedded Tag Specific Fence
Command will not start until all previous commands in same tag group have completed
b: Embedded Tag Specific Barrier
Command and all subsequent commands in same tag group will not start until previous commands in same tag group have completed
SL1 Cache Management Commands
sdcrt - Data cache region touch (DMA Get hint)
sdcrtst - Data cache region touch for store (DMA Put hint)
sdcrz - Data cache region zero
sdcrs - Data cache region store
sdcrf - Data cache region flush
Command Parameters
LSA - Local Store Address (32 bit)
EA - Effective Address (32 or 64 bit)
TS - Transfer Size (16 bytes to 16K bytes)
LS - DMA List Size (8 bytes to 16 K bytes)
TG - Tag Group(5 bit)
CL - Cache Management / Bandwidth Class
Synchronization Commands
Lockline (Atomic Update) Commands:
getllar - DMA 128 bytes from EA to LS and set Reservation
putllc - Conditionally DMA 128 bytes from LS to EA
putlluc - Unconditionally DMA 128 bytes from LS to EA
barrier - all previous commands complete before subsequent commands are started
mfcsync - Results of all previous commands in Tag group are remotely visible
mfceieio - Results of all preceding Puts commands in same group visible with respect to succeeding Get commands
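The fence and barrier command modifiers described on this slide can be modeled as a start-eligibility check over a command queue. The sketch below is a simplified interpretation of the two rules as worded on the slide, not a description of the hardware's queue logic. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { MOD_NONE, MOD_FENCE, MOD_BARRIER } cmd_modifier;

typedef struct {
    unsigned tag;      /* tag group, 0..31 */
    cmd_modifier mod;  /* modifier attached to this command */
    int done;          /* nonzero once the command has completed */
} sim_cmd;

/* 1 if every command before index `limit` in tag group `tag` has completed. */
static int earlier_done(const sim_cmd *q, size_t limit, unsigned tag)
{
    for (size_t k = 0; k < limit; k++)
        if (q[k].tag == tag && !q[k].done)
            return 0;
    return 1;
}

/* May command `i` start? Queue order is issue order. */
static int may_start(const sim_cmd *q, size_t i)
{
    assert(q != NULL);

    /* Fence or barrier on the command itself: wait for earlier same-tag commands. */
    if (q[i].mod != MOD_NONE && !earlier_done(q, i, q[i].tag))
        return 0;

    /* An earlier barrier in the same tag group also holds back later commands. */
    for (size_t j = 0; j < i; j++)
        if (q[j].tag == q[i].tag && q[j].mod == MOD_BARRIER &&
            !earlier_done(q, j, q[j].tag))
            return 0;

    return 1;
}
```

The difference between the two modifiers shows up only for commands issued after the modified one: a fence leaves them free to start, a barrier does not.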
Raising the bar with parallelism…
ƒ Data-parallelism and static ILP in both core types
– Results in low overhead per operation
ƒ Multithreaded programming is key to great Cell BE performance
– Exploit application parallelism with 9 cores
– Regardless of whether code exploits DLP & ILP
– Challenge regardless of homogeneity/heterogeneity
ƒ Leverage parallelism between data processing and data transfer
– A new level of parallelism exploiting bulk data transfer
– Simultaneous processing on SPUs and data transfer on SMFs
– Offers superlinear gains beyond MIPS-scaling
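Overlapping processing with data transfer is commonly done with double buffering: while one buffer is being processed, the next chunk is fetched into the other. The following portable sketch shows only the control flow. `sim_dma_start` is a hypothetical stand-in that copies immediately with `memcpy`, whereas a real SMF transfer would proceed concurrently with the computation; `CHUNK` is an arbitrary illustrative size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 1024  /* floats per buffer; an arbitrary illustrative size */

/* Stand-in for starting an asynchronous get into local store. */
static void sim_dma_start(float *ls, const float *ea, size_t n)
{
    memcpy(ls, ea, n * sizeof(float));
}

/* Sum n floats, fetching chunk k+1 while chunk k is being summed. */
static float sum_double_buffered(const float *ea, size_t n)
{
    static float buf[2][CHUNK];   /* two local work buffers */
    float sum = 0.0f;
    size_t done = 0;
    size_t len = n < CHUNK ? n : CHUNK;
    int cur = 0;

    assert(ea != NULL || n == 0);
    if (n == 0)
        return 0.0f;

    sim_dma_start(buf[cur], ea, len);          /* prime the first buffer */
    while (done < n) {
        size_t next_off = done + len;
        size_t next_len = 0;

        if (next_off < n) {                    /* start fetching the next chunk */
            next_len = (n - next_off) < CHUNK ? (n - next_off) : CHUNK;
            sim_dma_start(buf[cur ^ 1], ea + next_off, next_len);
        }
        /* On hardware: wait here for the transfer into buf[cur] to finish. */
        for (size_t i = 0; i < len; i++)
            sum += buf[cur][i];

        done = next_off;
        len = next_len;
        cur ^= 1;                              /* swap buffers */
    }
    return sum;
}
```

On hardware each buffer would typically use its own tag group, so the wait for one buffer does not stall on the transfer that fills the other.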
… while understanding the trade-offs
ƒ Uniprocessor efficiency is actually low
– Gelsinger’s Law captures historic (performance) efficiency
• 1.4x performance for 2x transistors
– Marginal uni-processor (performance) efficiency is 40% (or lower!)
• And power efficiency is even worse
ƒ The “true” bar is marginal uniprocessor efficiency
– A multiprocessor “only” has to beat a uniprocessor to be the better solution
– Many low-hanging fruit to be picked in multithreading applications
• Embarrassing application parallelism which has not been exploited
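To make the "bar" concrete, one can compare the 1.4x gain from doubling a uniprocessor's transistors with the gain from spending the same transistors on a second core. The sketch below uses Amdahl's law for the multiprocessor side. That model, the assumption of two identical cores, and the function names are additions for illustration and do not come from the slides.

```c
#include <assert.h>

#define UNI_GAIN_PER_DOUBLING 1.4  /* from the slide: 1.4x performance for 2x transistors */

/* Amdahl's law (an added assumption): speedup on `cores` identical cores
 * when fraction `p` of the work can run in parallel. */
static double amdahl_speedup(double p, unsigned cores)
{
    assert(p >= 0.0 && p <= 1.0 && cores >= 1);
    return 1.0 / ((1.0 - p) + p / (double)cores);
}

/* 1 if a second core beats a uniprocessor built from twice the transistors. */
static int second_core_wins(double p)
{
    return amdahl_speedup(p, 2) > UNI_GAIN_PER_DOUBLING;
}
```

Under these assumptions the break-even point is a parallel fraction of 4/7, roughly 57%: an application that parallelizes better than that gains more from the second core than from the larger uniprocessor.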
System configurations
ƒ Game console systems
ƒ Blades
ƒ HDTV
ƒ Home media servers
ƒ Supercomputers
[Figure: Cell design configurations. One Cell BE processor with XDR™ memory and IOIF0/IOIF1; two Cell BE processors, each with XDR™ memory and IOIF, linked by BIF; four Cell BE processors, each with XDR™ memory and IOIF, linked through a BIF switch.]
Cell: a Synergistic System Architecture
ƒ Cell is not a collection of different processors, but a synergistic whole
– Operation paradigms, data formats and semantics consistent
– Share address translation and memory protection model
ƒ SPE optimized for efficient data processing
– SPEs share Cell system functions provided by Power Architecture
– SMF implements interface to memory
• Copy in/copy out to local storage
ƒ Power Architecture provides system functions
– Virtualization
– Address translation and protection
– External exception handling
ƒ EIB integrates system as data transport hub
© Copyright International Business Machines Corporation 2006.
All Rights Reserved. Printed in the United States June 2006.
The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.
IBM
IBM Logo
Power Architecture
Other company, product and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document are
NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result
in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change
IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity
under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific
environments, and is presented as an illustration. The results obtained in other operating environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied
upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable
for damages arising directly or indirectly from any use of the information contained in this document.