Cell Broadband Engine
Cell BE – enabling density computing for data-rich environments
Michael Gschwind, Bruce D’Amora, Alexandre Eichenberger
Cell Broadband Engine - enabling density computing for data-rich environments © 2006 IBM Corporation

Cell History
IBM, SCEI/Sony, Toshiba Alliance formed in 2000
Design Center opened in March 2001
Based in Austin, Texas
Hardware designed in parallel with software
February 7, 2005: First external technical disclosures
August 15, 2005: First external architecture disclosures
August 25, 2005: Cell Launch - Architecture released

Acknowledgements
Cell is the result of a partnership between SCEI/Sony, Toshiba, and IBM
Cell represents the work of more than 400 people starting in 2000 and a design investment of about $400M

More about Cell Broadband Engine
http://www.research.ibm.com/cell
Online resources
– Specification
– Documentation
– Open Source and Proprietary Tools
– Operating System
– Platform Simulator

Agenda
8:00 – 9:00     Motivation and Architecture
9:00 – 9:30     Heterogeneous Application Model
9:30 – 10:30    Compilation and Auto-Parallelization
10:30 – 11:00   BREAK
11:00 – 12:00   Programming Models & Applications

Cell Broadband Engine – Architecture
Michael Gschwind

Computer Architecture at the turn of the millennium
“The end of architecture” was being proclaimed
– Frequency scaling as performance driver
State of the art microprocessors
• multiple instruction issue
• out of order architecture
• register renaming
• deep pipelines
Little or no focus on compilers
– Questions about the need for compiler research
Academic papers focused on microarchitecture tweaks

The age of frequency scaling
[Figure: cross-section of a scaled device (gate, n+ source, n+ drain, wiring, p substrate)]
SCALING:
  Voltage:     V/α
  Oxide:       tox/α
  Wire width:  W/α
  Gate width:  L/α
  Diffusion:   xd/α
  Substrate:   α * NA
RESULTS:
  Higher Density:  ~α²
  Higher Speed:    ~α
  Power/ckt:       ~1/α²
  Power Density:   ~Constant
Source: Dennard et al., JSSC 1974.
[Figure: performance (bips) vs. total FO4 per stage, from 37 down to 7; the pipeline gets deeper as FO4 per stage decreases]

Frequency scaling
A trusted standby
– Frequency as the tide that raises all boats
– Increased performance across all applications
– Kept the industry going for a decade
Massive investment to keep going
– Equipment cost
– Power increase
– Manufacturing variability
– New materials

… but reality was closing in!
[Figure: bips and bips^3/W relative to optimal FO4 vs. total FO4 per stage (37 down to 7); active power and leakage power curves]

Power crisis
The power crisis is not “natural”
– Created by deviating from ideal scaling theory
– Vdd and Vt not scaled by α
  • additional performance with increased voltage
Laws of physics
– P_chip = n_transistors * P_transistor
Marginal performance gain per transistor low
– significant power increase
– decreased power/performance efficiency
[Figure: Tox (Å), Vdd and Vt compared with classic scaling, plotted against gate length Lgate (µm)]

The power inefficiency of deep pipelining
[Figure: bips and bips^3/W relative to optimal FO4 vs. total FO4 per stage (37 down to 7), marking the power-performance optimal and performance optimal points; the pipeline gets deeper as FO4 per stage decreases]
Source: Srinivasan et al., MICRO 2002
Deep pipelining increases number of latches and switching rate
⇒ power increases with at least f²
Latch insertion delay limits gains of pipelining

Hitting the memory wall
Latency gap
– Memory speedup lags behind processor speedup
– limits ILP
[Figure: normalized MFLOPS and memory latency, 1990 to 2003. Source: McKee, Computing Frontiers 2004]
Chip I/O bandwidth gap
– Less bandwidth per MIPS
Latency gap as application bandwidth gap
– usable bandwidth = (avg. request size × no. of inflight requests) / roundtrip latency
– Typically (much) less than chip I/O bandwidth
Source: Burger et al., ISCA 1996

Our Y2K challenge
10× performance of desktop systems
1 TeraFlop / second with a four-node configuration
1 Byte bandwidth per 1 Flop
– “golden rule for balanced supercomputer design”
scalable design across a range of design points
mass-produced and low cost

Cell Design Goals
Provide the platform for the future of computing
– 10× performance of desktop systems shipping in 2005
Computing density as main challenge
– Dramatically increase performance per X
  • X = Area, Power, Volume, Cost,…
Single core designs offer diminishing returns on investment
– In power, area, design complexity and verification cost

Necessity as the mother of invention
Increase power-performance efficiency
– Simple designs are more efficient in terms of power and area
Increase memory subsystem efficiency
– Increasing data transaction size
– Increase number of concurrently outstanding transactions
⇒ Exploit larger fraction of chip I/O bandwidth
Use CMOS density scaling
– Exploit density instead of frequency scaling to deliver increased aggregate performance
Use compilers to extract parallelism from application
– Exploit application parallelism to translate aggregate performance to application performance

Exploit application-level parallelism
Data-level parallelism
– SIMD parallelism improves performance with little overhead
Thread-level parallelism
– Exploit application threads with multi-core design approach
Memory-level parallelism
– Improve memory access efficiency by increasing number of parallel memory transactions
Compute-transfer parallelism
– Transfer data in parallel to execution by exploiting application knowledge

Concept phase ideas
[Figure: concept map of the following ideas]
– large architected register set
– high frequency and low power!
– reverse Vdd scaling for low power
– modular design
– design reuse
– Chip Multiprocessor
– control cost of coherency
– efficient use of memory interface

Cell Architecture
Heterogeneous multicore system architecture
– Power Processor Element for control tasks
– Synergistic Processor Elements for data-intensive processing
Synergistic Processor Element (SPE) consists of
– Synergistic Processor Unit (SPU)
– Synergistic Memory Flow Control (SMF)
64-bit Power Architecture with VMX
[Figure: block diagram with eight SPEs (each SPU with SXU and LS, plus SMF) and the PPE (PPU with PXU and L1, plus L2) attached to the EIB (up to 96B/cycle); MIC to Dual XDR™ and BIC to FlexIO™; links labeled 16B/cycle, 16B/cycle (2x) and 32B/cycle]

Shifting the Balance of Power with Cell Broadband Engine
Data processor instead of control system
– Control-centric code stable over time
– Big growth in data processing needs
  • Modeling
  • Games
  • Digital media
  • Scientific applications
Today’s architectures are built on a 40 year old data model
– Efficiency as defined in 1964
– Big overhead per data operation
– Parallelism added as an after-thought

Powering Cell – the Synergistic
Processor Unit
[Figure: SPU organization with 64B fetch into the ILB, issue/branch of 2 instructions, single-port SRAM local store with 128B path to the SMF, VRF with 16B x 2 and 16B x 3 x 2 ports, and the VFPU, VFXU, PERM and LSU units]
Source: Gschwind et al., Hot Chips 17, 2005

Density Computing in SPEs
Today, execution units only fraction of core area and power
– Bigger fraction goes to other functions
  • Address translation and privilege levels
  • Instruction reordering
  • Register renaming
  • Cache hierarchy
Cell changes this ratio to increase performance per area and power
– Architectural focus on data processing
  • Wide datapaths
  • More and wide architectural registers
  • Data privatization and single level processor-local store
  • All code executes in a single (user) privilege level
  • Static scheduling

Streamlined Architecture
Architectural focus on simplicity
– Aids achievable operating frequency
– Optimize circuits for common performance case
– Compiler aids in layering traditional hardware functions
– Leverage 20 years of architecture research
Focus on statically scheduled data parallelism
– Focus on data parallel instructions
  • No separate scalar execution units
  • Scalar operations mapped onto data parallel dataflow
– Exploit wide data paths
  • Data processing
  • Instruction fetch
– Address impediments to static scheduling
  • Large register set
  • Reduce latencies by eliminating non-essential functionality

SPE Highlights
User-mode architecture
– No page translation within SPU
SIMD dataflow
– Broad set of operations (8, 16, 32, 64 Bit)
– Graphics SP-Float
– IEEE DP-Float
DMA block transfer
– using Power Architecture memory translation
256KB Local
Store
– Combined I & D
RISC organization
– 32 bit fixed instructions
– Load/store architecture
– Unified register file
14.5mm² (90nm SOI)
[Figure: SPE die photo with labeled units (FWD, FXU ODD, FXU EVN, GPR, DMA, LS, CHANNEL, SBI, SMM, BEB, DP, SFP, CONTROL, ATO, RTB)]
Source: Kahle, Spring Processor Forum 2005

Synergistic Processing
[Figure: dependency diagram linking the following design concepts]
– scalar layering
– shorter pipeline
– instruction bundling
– high frequency
– compute density
– simpler μarch
– large register file
– DLP
– static scheduling
– wide data paths
– local store
– determ. latency
– single port
– static prediction
– ILP opt.
– data parallel select
– large basic blocks
– sequential fetch

Efficient data-sharing between scalar and SIMD processing
Legacy architectures separate scalar and SIMD processing
– Data sharing between SIMD and scalar processing units expensive
  • Transfer penalty between register files
– Defeats data-parallel performance improvement in many scenarios
Unified register file facilitates data sharing for efficient exploitation of data parallelism
– Allow exploitation of data parallelism without data transfer penalty
  • Data-parallelism always an improvement

Compiling for the Cell Broadband Engine
The lesson of “RISC computing”
– Architecture provides fast, streamlined primitives to compiler
– Compiler uses primitives to implement higher-level idioms
– If the compiler can’t target it ⇒ do not include in architecture
Compiler focus throughout project
– Prototype compiler soon after first proposal
– Cell compiler team has made significant advances in
  • Automatic SIMD code generation
  • Automatic parallelization
  • Data privatization
[Figure: Cell design positioned between raw hardware performance, programmability and programmer productivity]

SPU Pipeline
SPU PIPELINE FRONT END
  IF1 IF2 IF3 IF4 IF5 IB1 IB2 ID1 ID2 ID3 IS1 IS2
SPU PIPELINE BACK END
  Branch Instruction           RF1 RF2
  Permute Instruction          EX1 EX2 EX3 EX4 WB
  Load/Store Instruction       EX1 EX2 EX3 EX4 EX5 EX6 WB
  Fixed Point Instruction      EX1 EX2 WB
  Floating Point Instruction   EX1 EX2 EX3 EX4 EX5 EX6 WB
Legend:
  IF  Instruction Fetch
  IB  Instruction Buffer
  ID  Instruction Decode
  IS  Instruction Issue
  RF  Register File Access
  EX  Execution
  WB  Write Back

SPE Block Diagram
[Figure: Floating-Point Unit, Fixed-Point Unit, Permute Unit, Load-Store Unit, Branch Unit and Channel Unit around Result Forwarding and Staging and the Register File; Instruction Issue Unit / Instruction Line Buffer; Local Store (256kB), single-port SRAM, with 128B read and 128B write paths; DMA Unit attached to the On-Chip Coherent Bus; path widths of 8, 16, 64 and 128 Byte/Cycle]

SPU Communication: Channel Architecture
SPU uses “channels” to communicate with environment (incl.
SMF)
– Access to special purpose registers
  • Processor status
– Communication channels
  • Requests to SMF
  • SMF status
  • Mailboxes

SPU Channel Features
Communication using channels numbered 0 - 127
– Implementation dependent
– Unidirectional
– Have capacity
  • Burst without SPU execution stop if capacity available
– Channel operations are blocking
  • On write if full
  • On read if empty

SPU Channel Access
  rdch RT, ca       RT <= channel(ca)            Read data word from channel
  rdchcnt RT, ca    RT <= channel capacity(ca)   Determine channel capacity
  wrch ca, RT       channel(ca) <= RT            Write data word to channel

Synergistic Memory Flow Control
SMF implements memory management and mapping
SMF operates in parallel to SPU
– Independent compute and transfer
– Command interface from SPU
  • DMA queue decouples SMF & SPU
  • MMIO-interface for remote nodes
Block transfer between system memory and local store
SPE programs reference system memory using user-level effective address space
– Ease of data sharing
– Local store to local store transfers
– Protection
[Figure: SPE with SPU (SXU, LS) and SMF; the SMF contains the DMA Queue, DMA Engine, Atomic Facility, MMU, RMT, Bus I/F Control and MMIO, connected to the Data Bus, Snoop Bus and Control Bus, with Translate, Ld/St and MMIO paths]

Synergistic Memory Flow Control
SMF implements memory management and mapping
– DMA for data transfer
– MMU for page translation
– AF for coherent data
– BIF for access to Element Interconnect Bus
– RMT for resource management
SMF operates in parallel to SPU
– Independent compute and transfer
Channel command interface from SPU
  • DMA queue decouples SMF & SPU
  • SPU can synchronize with SMF
  • MMIO-based interface for remote nodes
[Figure: SPE with SPU (SXU, LS) and SMF; the SMF contains the DMA Queue, DMA Engine, Atomic Facility, MMU, RMT, Bus I/F Control and MMIO, connected to the Data Bus, Snoop Bus and Control Bus, with Translate, Ld/St and MMIO paths]

System-wide Virtual Memory Architecture
SMF MMU follows Power Architecture™ Virtual Memory architecture
– Two-level translation
  • Segmentation and paging
– PPE and SPEs share memory map
– SPE programs reference system memory using user-level effective address space
  • Ease of data sharing
  • Local store to local store transfers
  • Protection
Exceptions delivered to PPE
– SLB miss
– Page fault

Data Transfer with SMF DMAC
DMA Unit implements block data transfer
– transfer specifies system memory and local store address
– system memory address is effective address
– translated to physical address by SMF MMU
– variable block size from 1B to 16KB
DMA transfers
– LS ↔ system memory
– LS ↔ LS
– LS ↔ I/O Transfers
DMA list command
– “SMF program” transfers data in parallel to computation

Synergistic Memory Flow Control
Bus Interface
– Up to 16 outstanding DMA requests
– Requests up to 16KByte
– Token-based Bus Access Management
Element Interconnect Bus
– Four 16 byte data rings
– Multiple concurrent transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
– 200+ GByte/s @
3.2+ GHz
[Figure: Element Interconnect Bus topology with the PPE, SPE0 to SPE7, MIC, IOIF1 and BIF/IOIF0 attached through 16B links around a central data arbiter]
Source: Clark et al., Hot Chips 17, 2005

Memory efficiency as key to application performance
Greatest challenge in translating peak into app performance
– Peak Flops useless without way to feed data
Cache miss provides too little data too late
– Inefficient for streaming / bulk data processing
– Initiates transfer when transfer results are already needed
– Application-controlled data fetch avoids not-on-time data delivery
SMF is a better way to look at an old problem
– Fetch data blocks based on algorithm
– Blocks reflect application data structures

Traditional memory usage
Long latency memory access operations exposed
Cannot overlap 500+ cycles with computation
Memory access latency severely impacts application performance
[Figure: timeline with legend: computation, mem protocol, mem idle, mem contention]

Exploiting memory-level parallelism
Reduce performance impact of memory accesses with concurrent access
Carefully scheduled memory accesses in numeric code
Out-of-order execution increases chance to discover more concurrent accesses
– Overlapping a 500 cycle latency with computation using OoO is illusory
Bandwidth limited by queue size and roundtrip latency

Exploiting MLP with Chip Multiprocessing
More threads ⇒ more memory level parallelism
– overlap accesses from multiple cores

A new form of parallelism: CTP
Compute-transfer parallelism
– Concurrent execution of compute and transfer increases efficiency
  • Avoid costly and unnecessary serialization
– Application thread has two threads of control
  • SPU ⇒ computation thread
  • SMF ⇒ transfer thread
Optimize memory access and data transfer at the application level
– Exploit programmer and application knowledge about data access patterns

Memory access in Cell with MLP and CTP
Super-linear performance gains observed
– decouple data fetch and use

Synergistic Memory Management
[Figure: dependency diagram linking the following design concepts]
– timely data delivery
– decouple fetch and access
– high throughput
– block fetch
– storage density
– explicit data transfer
– reduced coherence cost
– application control
– local store
– low latency access
– parallel compute & transfer
– deeper fetch queue
– many outstanding transfers
– eliminate cache miss logic
– single port
– multi core
– improved code gen
– shared I&D
– sequential fetch

Cell BE Processor Components
In the Beginning – the Power Architecture™ Processor
Power Processor Element (PPE)
Industry-standard 64-bit IBM Power Architecture™ processor
– PowerPC AS 2.0.2
2-Way Hardware Multithreaded
L1 : 32KB I ; 32KB D
L2 : 512KB
Coherent load/store
VMX
3.2+ GHz
Realtime Control
– Locking L2 Cache & TLB
– Bandwidth Reservation
Custom Designed
– for high frequency, area and power efficiency
[Figure: Power Core (PPE) with NCU and L2 Cache]

Cell BE Processor Components
Element Interconnect Bus data ring for
internal communication
– Four 16 byte data rings, supporting multiple transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
– 200+ GByte/s @ 3.2+ GHz
[Figure: Power Core (PPE) with NCU and L2 Cache attached to the Element Interconnect Bus; 96 Byte/Cycle, 200+GB/sec @ 3.2+GHz]

Cell BE Processor Components
SPE provides computational performance
Dual issue 32-bit SIMD architecture
Dedicated resources
– 128-entry 128-bit VRF
– 256KB Local Store
Each SMF can be dynamically configured to protect resources
Dedicated DMA engine
– Up to 16 outstanding requests
[Figure: eight SPEs (each with SPU, AUC, MFC and Local Store) and the Power Core (PPE) with NCU and L2 Cache on the Element Interconnect Bus; 96 Byte/Cycle, 200+GB/sec @ 3.2+GHz]

Cell BE Processor Components
SMF provides memory management & mapping
SPE Local Store aliased into system memory map
SMF controls SPE DMA accesses
– Implements page translation and protection
  • Using Power Architecture™ system memory map
  • System memory map compatible with Power Architecture™ Virtual Memory architecture
– S/W controllable from PPE MMIO
DMA 1,2,4,8,16,128 Byte ⇒ 16Kbyte transfers for memory and I/O access
[Figure: eight SPEs (each with SPU, AUC, MFC and Local Store) and the Power Core (PPE) with NCU and L2 Cache on the Element Interconnect Bus; 96 Byte/Cycle, 200+GB/sec @ 3.2+GHz]

Cell BE Processor Components
[Figure: eight SPEs, the Power Core (PPE) with NCU and L2 Cache, MIC, IOIF0 and IOIF1 on the Element Interconnect Bus (96 Byte/Cycle, 200+GB/sec @ 3.2+GHz); MIC to DRAM at 25 GB/s, IOIF0 to Southbridge I/O at 5 GB/s, BIF or IOIF1 at 20 GB/s]
I/O provides high bandwidth
Dual XDR™ controller
– 25.6GB/s @ 3.2Gbps
Two configurable interfaces
– 76.8GB/s @ 6.4Gbps
– Configurable number of Bytes
– Coherent or I/O Mode Interconnect
Supports multiple system configurations

Cell BE Processor Components
IIC – Internal Interrupt Controller
Handles SPE Interrupts
Handles External Interrupts
– From Coherent Interconnect
– From IOIF0 or IOIF1
Interrupt Priority Level Control
Duplicated for each PPE hardware thread
[Figure: as on the previous slide, with the IIC added]

Cell BE Processor Components
[Figure: as on the previous slide, with the IOT added]
IOT implements I/O Bus Master Translation
Translates bus address
to system address
Two Level translation
– I/O Segments: 256 MB
– I/O Pages: 4KB, 64KB, 1MB, 16MB
I/O Device Identifier / page for LPAR
IOST and IOPT Cache
– hardware/software managed

Cell BE Processor Components
Token Manager provides Bandwidth Reservation for shared resources
Optionally used for RT tasks or LPAR
Multiple Resource Allocation Groups
Generates access tokens at configurable rate for each allocation group
– 1 per each memory bank (16 total)
– 2 for each IOIF (4 total)
Requestors assigned RAG ID by OS/hypervisor
– Each SPE
– PPE L2 / NCU
– IOIF 0 Bus Master
– IOIF 1 Bus Master
Priority order for using another RAG's unused tokens
Resource overcommit warning interrupt
[Figure: eight SPEs, the Power Core (PPE) with NCU and L2 Cache, MIC, IOIF0, IOIF1, IIC, IOT and TKM on the Element Interconnect Bus (96 Byte/Cycle, 200+GB/sec @ 3.2+GHz); MIC to DRAM at 25 GB/s, IOIF0 to Southbridge I/O at 5 GB/s, BIF or IOIF1 at 20 GB/s]

Cell BE Implementation Characteristics
Frequency Increase vs.
Power Consumption
[Figure: relative frequency increase and power consumption plotted against voltage]
241M transistors
235mm²
Design operates across wide frequency range
– Optimize for power & yield
> 200 GFlops (SP) @3.2GHz
> 20 GFlops (DP) @3.2GHz
Up to 25.6 GB/s memory bandwidth
Up to 75 GB/s I/O bandwidth
100+ simultaneous bus transactions
– 16+8 entry DMA queue per SPE
Source: Kahle, Spring Processor Forum 2005

[Figure: Cell Broadband Engine]

Cell BE Applications
Michael Gschwind

Compiling and linking an integrated executable

Loading and execution of a program
[Figure: system memory image next to the Cell block diagram (eight SPEs with SPU/SXU, LS and SMF; EIB; PPE with PPU, PXU, L1 and L2), annotated with steps 1 to 5]
System memory image shown in the figure:
  .text:
    pu_main:
      …
      spu_create_thread(0, spu0, spu0_main);
      …
      spu_create_thread(1, spu1, spu1_main);
      …
    printf:
      …
  .data:
    …
  spu0:
    .text:
      spu0_main:
        …
    .data:
      …
  spu1:
    .text:
      spu1_main:
        …
    .data:
      …
SPE local store contents shown in the figure:
  .text:
    spu0_main:
      …
  .data:
    …
Steps:
  1. PPE image loads and executes
  2. PPE initiates SMF transfer
  3. SMF data transfer
  4. start SPU at specified address
  5. SMF starts SPU execution

SPE thread creation
PPE transfers mini-loader and parameters
– 256b program
Mini-loader transfers thread memory image
– Embedded in Power Architecture executable
– Provided to spe_create_thread() call
SPE side program load more efficient
– SPE has access to more queue entries
– Parallel loading on 8 SPEs
– Direct
channel access vs. MMIO access

Heterogeneous Multi-threading and OS management
Heterogeneous Multi-Threading Model
– PPE Threads
– SPE Threads
– SPE DMA EA = PPE Process EA Space
– OS supports Create & Destroy SPE tasks
– Atomic Update Primitives used for Mutex
– SPE Context Fully Managed
  • OS assignment of SPE threads
  • Programmer directed using affinity mask
[Figure: Application Source & Libraries are built into PPE object files and SPE object files; the Cell Broadband Engine-aware OS (Linux) with its SPE Virtualization / Scheduling Layer maps PPE threads (T1, T2) onto the physical PPE and SPE threads onto the eight physical SPEs]

Linux on Cell BE
All software in STIDC written on Linux OS
– Started with Linux 2.4 PPC64 on Cell Simulator
  • SPEs exposed as I/O Devices (function offload model)
  • SPE DMA required pre-pinned memory
  • Inflexible programming model
Moved to 2.6.3
– Added heterogeneous thread model – via system call – moved to SPUFS in 2.6.12
  • SPE thread API created (similar to pthreads library)
  • User mode direct and indirect SPE access models
  • Full pre-emptive SPE context management
  • spe_ptrace() added for gdb support
  • spe_schedule() for thread to physical SPE assignment – currently FIFO – run to completion
– SPE threads share address space with parent PPE process (through DMA)
  • Demand paging for SPE accesses
  • Shared hardware page table with PPE
– SPE Error, Event and Signal handling directed to parent PPE thread
– SPE elf objects wrapped into PPE shared objects with extended gld
  • SPE-side mini-loader
– madvise() extended for L2 cache and TLB locking/preloading (realtime feature)
– All patches for Cell in architecture dependent layer (subtree of PPC64)
  • Except for a few shameless hacks - being removed in 2.6.12
Publishing Initial Cell BE Patches
for 2.6.12

SPE data transfer (from SPE or PPE)
Embedded in program
– At compile time
– At runtime
Direct store to SPU LS
– Using memory map alias when SPU LS mapped into memory map
Mailbox
– Channel in Synergistic Processor Architecture
– MMIO in IBM Power Architecture™ core
Externally initiated transfer
– Using SMF block transfer capabilities
– From PPE or remote SPE
SPU-initiated transfer
– Based on an address provided using one of these four methods
– Based on address computed from data obtained by these five methods

Data sharing between PPE and SPE
SPE addresses system memory via SMF copy-in/copy-out using effective address
– System memory pointers can be shared between PPE and SPE
PPE can access SPE local store using SMF or using memory accesses
– PPE enqueues SMF requests via memory mapped I/O
– Aliasing of SPE LS gives PPE addressability as system address
  • Add LS base address of local store to use in PPE

Access to data using common effective addresses

    /* a is a Power Architecture(TM) effective (virtual) address */
    float ppe_sum_all(float *a)
    {
        float sum = 0.0f;
        for (int i = 0; i < MAX; i++)    /* i < MAX: the array holds MAX elements */
            sum += a[i];
        return sum;
    }

    float spe_sum_all(float *a)
    {
        /* local work buffer in SPE local store */
        float local_a[MAX] __attribute__ ((aligned (128)));
        float sum = 0.0f;

        /* SMF copy: LS target address, EA source address, size (max 16KB), tag */
        mfc_get(&local_a[0], &a[0], sizeof(float)*MAX, 31, 0, 0);
        mfc_write_tag_mask(1u<<31);      /* set tags for status request */
        mfc_read_tag_status_all();       /* wait for request to complete */

        /* perform algorithm on local copy */
        for (int i = 0; i < MAX; i++)
            sum += local_a[i];
        return sum;
    }
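The spe_sum_all() example moves the whole array with one mfc_get, which only works while sizeof(float)*MAX stays within the 16 KB limit of a single transfer. A minimal sketch of how a larger array could be brought through the local buffer in pieces follows. It is an illustration under stated assumptions, not code from the slides: sim_mfc_get and sum_chunked are hypothetical names, and sim_mfc_get is a host-side stand-in built on memcpy so the sketch compiles with any C compiler. The real intrinsic is asynchronous and needs the tag-mask and tag-status calls shown on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DMA_BYTES 16384u  /* largest single transfer named on the slides (16 KB) */

/* Hypothetical stand-in for mfc_get(): copies from "system memory" (ea)
 * into the "local store" buffer (ls) and completes immediately. */
void sim_mfc_get(void *ls, const void *ea, size_t bytes)
{
    assert(bytes > 0 && bytes <= MAX_DMA_BYTES);
    memcpy(ls, ea, bytes);
}

/* Sum n floats located at ea, pulling them through a 16 KB work buffer.
 * *transfers receives the number of simulated DMA requests issued. */
float sum_chunked(const float *ea, size_t n, unsigned *transfers)
{
    static float local_a[MAX_DMA_BYTES / sizeof(float)];
    const size_t chunk = MAX_DMA_BYTES / sizeof(float);
    float sum = 0.0f;

    *transfers = 0;
    for (size_t done = 0; done < n; done += chunk) {
        /* last chunk may be shorter than a full buffer */
        size_t count = (n - done < chunk) ? (n - done) : chunk;

        sim_mfc_get(local_a, ea + done, count * sizeof(float));
        ++*transfers;

        /* perform algorithm on local copy */
        for (size_t i = 0; i < count; i++)
            sum += local_a[i];
    }
    return sum;
}
```

Calling sum_chunked(a, 10000, &n) on an array of 10000 floats issues three transfers (4096, 4096 and 1808 elements). On real hardware the same loop structure is the starting point for double buffering, where the next chunk is fetched while the current one is summed.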

Mailbox communication
SPE:
    unsigned int mbox = spu_read_in_mbox();    /* read message sent by the PPE */
    spu_write_out_mbox(mbox);                  /* send it back to the PPE */
PPE:
    while (spe_stat_in_mbox(speid) == 0)       /* wait for space in the SPE inbound mailbox */
        ;
    spe_write_in_mbox(speid,data);
    unsigned int rmbox = spe_read_out_mbox(speid);

Atomic updates and semaphores
Atomic updates
    atomic_set((atomic_ea_t)(ptrAtomicData),0xffffffff);
    atomic_set, atomic_add, atomic_sub, atomic_dec_and_test,...
Mutex lock/unlock
    mutex_lock (cond_mutex_ea);
    cond_signal (cond_ea);
    mutex_unlock (cond_mutex_ea);

Locks: atomic_set(val)
    ea64.ull = ALIGN128_EA(v);
    offset = OFFSET128_EA(v, u32);
    do {
        /* get lock line and set reservation */
        MFC_DMA(buf, ea64, size, tagid, MFC_GETLLAR_CMD);
        spu_readch (MFC_RdAtomicStat);
        ret_val = buf[offset];
        buf[offset] = val;
        /* conditional put: succeeds only if the reservation is still held */
        MFC_DMA(buf, ea64, size, tagid, MFC_PUTLLC_CMD);
        status = spu_readch(MFC_RdAtomicStat);
    } while (status != 0);    /* retry until the conditional put succeeds */
[Figure: SPE block diagram showing the SPU (SXU, LS) and the SMF (DMA queue, DMA engine, atomic facility, MMU, RMT, bus I/F control, MMIO).]

Cell Real Address Memory Map
Local storage of each SPE aliased in system memory map
– Direct (uncacheable) access by PPE
– Used for LS-to-LS transfer
– Access control via page table
QoS memory is pinned system memory
– Bandwidth and latency guarantee
– Managed by O/S
I/O devices external to BE
– Defined by system and I/O architecture

System management of Cell BE resources
Cell BE implements full set of Power Architecture™ virtualization and dynamic partitioning
– Support of partition
configuration state
  • Logical Partition ID etc.
Full state management by PPE
– Access via memory mapped I/O registers
– Grouped by privilege level
  • Access control to MMIO facilities controlled by page access control

Per SPE Resources (PPE Side)
Problem State
(4K physical page boundary)
– 8 Entry MFC Command Queue Interface
– DMA Command and Queue Status
– DMA Tag Status Query Mask
– DMA Tag Status
– 32 bit Mailbox Status and Data from SPU
– 32 bit Mailbox Status and Data to SPU (4 deep FIFO)
– Signal Notification 1
– Signal Notification 2
– SPU Run Control
– SPU Next Program Counter
– SPU Execution Status
(4K physical page boundary)
– Optionally Mapped 256K Local Store
Privileged 1 State (OS)
(4K physical page boundary)
– SPU Privileged Control
– SPU Channel Counter Initialize
– SPU Channel Data Initialize
– SPU Signal Notification Control
– SPU Decrementer Status & Control
– MFC DMA Control
– MFC Context Save / Restore Registers
– SLB Management Registers
(4K physical page boundary)
– Optionally Mapped 256K Local Store
Privileged 2 State (OS or Hypervisor)
(4K physical page boundary)
– SPU Master Run Control
– SPU ID
– SPU ECC Control
– SPU ECC Status
– SPU ECC Address
– SPU 32 bit PU Interrupt Mailbox
– MFC Interrupt Mask
– MFC Interrupt Status
– MFC DMA Privileged Control
– MFC Command Error Register
– MFC Command Translation Fault Register
– MFC SDR (PT Anchor)
– MFC ACCR (Address Compare)
– MFC DSSR (DSI Status)
– MFC DAR (DSI Address)
– MFC LPID (logical partition ID)
– MFC TLB Management Registers

Per SPE Resources (SPU Side)
SPU Direct Access Resources
– 128 - 128 bit GPRs
– External Event Status (Channel 0)
  • Decrementer Event
  • Tag Status Update Event
  • DMA Queue Vacancy Event
  • SPU Incoming Mailbox Event
  • Signal 1 Notification Event
  • Signal 2 Notification Event
  • Reservation Lost Event
– External Event Mask
(Channel 1)
– External Event Acknowledgement (Channel 2)
– Signal Notification 1 (Channel 3)
– Signal Notification 2 (Channel 4)
– Set Decrementer Count (Channel 7)
– Read Decrementer Count (Channel 8)
– 16 Entry MFC Command Queue Interface (Channels 16-21)
– DMA Tag Group Query Mask (Channel 22)
– Request Tag Status Update (Channel 23)
  • Immediate
  • Conditional - ALL
  • Conditional - ANY
– Read DMA Tag Group Status (Channel 24)
– DMA List Stall and Notify Tag Status (Channel 25)
– DMA List Stall and Notify Tag Acknowledgement (Channel 26)
– Lock Line Command Status (Channel 27)
– Outgoing Mailbox to PU (Channel 28)
– Incoming Mailbox from PU (Channel 29)
– Outgoing Interrupt Mailbox to PU (Channel 30)
SPU Indirect Access Resources (via EA Addressed DMA)
– System Memory
– Memory Mapped I/O
– This SPU Local Store
– Other SPU Local Store
– Other SPU Signal Registers
– Atomic Update (Cacheable Memory)

Memory Flow Controller Commands
DMA Commands
Put - Transfer from Local Store to EA space
Puts - Transfer and Start SPU execution
Putr - Put Result - (Arch.
Scarf into L2)
Putl - Put using DMA List in Local Store
Putrl - Put Result using DMA List in LS (Arch)
Get - Transfer from EA Space to Local Store
Gets - Transfer and Start SPU execution
Getl - Get using DMA List in Local Store
Sndsig - Send Signal to SPU
Command Modifiers: <f,b>
f: Embedded Tag Specific Fence. Command will not start until all previous commands in same tag group have completed
b: Embedded Tag Specific Barrier. Command and all subsequent commands in same tag group will not start until previous commands in same tag group have completed
SL1 Cache Management Commands
sdcrt - Data cache region touch (DMA Get hint)
sdcrtst - Data cache region touch for store (DMA Put hint)
sdcrz - Data cache region zero
sdcrs - Data cache region store
sdcrf - Data cache region flush
Command Parameters
LSA - Local Store Address (32 bit)
EA - Effective Address (32 or 64 bit)
TS - Transfer Size (16 bytes to 16K bytes)
LS - DMA List Size (8 bytes to 16K bytes)
TG - Tag Group (5 bit)
CL - Cache Management / Bandwidth Class
Synchronization Commands
Lockline (Atomic Update) Commands:
getllar - DMA 128 bytes from EA to LS and set Reservation
putllc - Conditionally DMA 128 bytes from LS to EA
putlluc - Unconditionally DMA 128 bytes from LS to EA
barrier - all previous commands complete before subsequent commands are started
mfcsync - Results of all previous commands in Tag group are remotely visible
mfceieio - Results of all preceding Puts commands in same group visible with respect to succeeding Get commands

Raising the bar with parallelism…
Data-parallelism and static ILP in both core types
– Results in low overhead per operation
Multithreaded programming is key to great Cell BE performance
– Exploit application parallelism with 9 cores
– Regardless of whether code exploits DLP & ILP
– Challenge regardless of homogeneity/heterogeneity
Leverage parallelism
between data processing and data transfer
– A new level of parallelism exploiting bulk data transfer
– Simultaneous processing on SPUs and data transfer on SMFs
– Offers superlinear gains beyond MIPS-scaling

… while understanding the trade-offs
Uniprocessor efficiency is actually low
– Gelsinger’s Law captures historic (performance) efficiency
  • 1.4x performance for 2x transistors
– Marginal uniprocessor (performance) efficiency is 40% (or lower!)
  • And power efficiency is even worse
The “true” bar is marginal uniprocessor efficiency
– A multiprocessor “only” has to beat a uniprocessor to be the better solution
– Much low-hanging fruit to be picked in multithreading applications
  • Embarrassing application parallelism which has not been exploited

System configurations
[Figure: Cell design system configurations built from Cell BE processors with XDR™ memory, connected through IOIF and BIF interfaces (including a BIF switch). Labelled target systems: game console systems, blades, HDTV, home media servers, supercomputers.]

Cell: a Synergistic System Architecture
Cell is not a collection of different processors, but a synergistic whole
– Operation paradigms, data formats and semantics consistent
– Share address translation and memory protection model
SPE optimized for efficient data processing
– SPEs share Cell system functions provided by Power Architecture
– SMF implements interface to memory
  • Copy in/copy out to local storage
Power Architecture provides system functions
– Virtualization
– Address translation and
protection
– External exception handling
EIB integrates system as data transport hub

© Copyright International Business Machines Corporation 2006. All Rights Reserved. Printed in the United States June 2006.

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both: IBM, IBM Logo, Power Architecture. Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.