C66x CorePac: Achieving High Performance

Agenda
1. CorePac Architecture
2. Single Instruction Multiple Data (SIMD)
3. Memory Access
4. Pipeline Concept

CorePac Architecture

C66x CorePac

[Block diagram: Level 1 Program Memory (L1P), single-cycle cache/RAM, with a 256-bit instruction fetch; DSP core with functional units .M, .L, .S, and .D on each side and register files A [32] and B [32]; Level 1 Data Memory (L1D), single-cycle cache/RAM, on 64-bit data paths; Level 2 Memory (L2), program/data cache/RAM; memory controller.]

CorePac includes:
• DSP core
  o Two register files (A and B, 32 registers each)
  o Four functional units per register side
• L1P memory (cache/RAM)
• L1D memory (cache/RAM)
• L2 memory (cache/RAM)

C66x DSP Core

[Diagram: memory; register file A (A0 .. A31) with units .D1, .S1, .M1, .L1; register file B (B0 .. B31) with units .D2, .S2, .M2, .L2; MACs; controller/decoder.]

• Four functional units per side:
  o Multiplier (.M)
  o ALU (.L)
  o Data (.D)
  o Control (.S)
• These independent functional units enable efficient execution of parallel specialized instructions:
  o Multiplier (.M1 and .M2) and ALU (.L1 and .L2) provide MAC (multiply-accumulate) operations.
  o Data (.D) provides data input/output.
  o Control (.S) provides control functions (loop, branch, call).
• Each DSP core dispatches up to eight parallel instructions each cycle.
• All instructions are conditional, which enables efficient pipelining.
• The optimizing C compiler generates efficient target code.

C66x DSP Core Cross-Path

Any 64-bit pair of registers from A can be one of the inputs to a B functional unit, and vice versa.

[Diagram: register file A (A0 .. A31) and register file B (B0 .. B31), each connected to the .D, .S, .M, and .L units on its own side, with cross paths between the two sides.]

[Tables: Partial List of .D Instructions; Partial List of .L Instructions; Partial List of .M Instructions; Partial List of .S Instructions]

Single Instruction Multiple Data (SIMD)
C66x SIMD Instructions: Examples

• ADDDP - Add Two Double-Precision Floating-Point Values
• DADD2 - 4-Way SIMD Addition, Packed Signed 16-bit
  - Performs 4 additions of two sets of 4 16-bit numbers packed into 64-bit registers.
  - The 4 results are rounded to 4 packed 16-bit values.
  - unit = .L1, .L2, .S1, .S2
• FMPYDP - Fast Double-Precision Floating-Point Multiply
• QMPY32 - 4-Way SIMD Multiply, Packed Signed 32-bit
  - Performs 4 multiplications of two sets of 4 32-bit numbers packed into 128-bit registers.
  - The 4 results are packed 32-bit values.
  - unit = .M1 or .M2

C66x SIMD Instruction: CMATMPY

Many applications use complex matrix arithmetic.

• CMATMPY - 1x2 Complex Vector Multiplied by 2x2 Complex Matrix
  - Results in a 1x2 signed complex vector.
  - All values are 16-bit (16-bit real / 16-bit imaginary).
  - unit = .M1 or .M2
• How many real multiplications does this provide, given that each complex multiplication takes 4 real multiplications?
  - 4 complex multiplications (4 real multiplications each) = 16 multiplications per .M unit
  - Two .M units (16 multiplications each) = 32 multiplications per cycle
  - Core cycles per second: 1.25 G
  - Total multiplications per second per core = 32 x 1.25 G = 40 G multiplications
  - 8 cores = 320 G multiplications per second

The question is whether data can be fed to the functional units fast enough.

Feeding the Functional Units

There are two challenges:
• How to provide enough data from memory to the core
  - Access to L1 memory is wide (2 x 64 bit) and fast (0 wait states).
  - Multiple mechanisms are used to efficiently transfer new data to L1 from L2 and external memory.
• How to get values in and out of the functional units
  - The hardware pipeline enables execution of instructions every cycle.
  - Efficient instruction scheduling maximizes functional unit throughput.
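The CMATMPY throughput arithmetic, and the question of whether the functional units can be fed fast enough, can be sketched in a few lines of Python. This is only an illustrative model: the constants are the figures quoted on the slides, the function names are made up for this sketch, and the operand-traffic estimate assumes (pessimistically) that every CMATMPY operand is loaded fresh from L1D each cycle, which real code avoids by reusing values already held in registers.

```python
# Figures quoted on the CMATMPY and "Feeding the Functional Units" slides.
# All names below are illustrative, not part of any TI tool or API.
COMPLEX_MULTS_PER_CMATMPY = 4     # complex vector times 2x2 complex matrix
REAL_MULTS_PER_COMPLEX_MULT = 4   # (a + jb)(c + jd) needs ac, bd, ad, bc
M_UNITS_PER_CORE = 2              # .M1 and .M2
CORE_CLOCK_HZ = 1_250_000_000     # 1.25 G core cycles per second
CORES = 8

BITS_PER_COMPLEX = 32             # 16-bit real + 16-bit imaginary
VECTOR_ELEMENTS = 2               # two-element complex vector
MATRIX_ELEMENTS = 4               # 2x2 complex matrix
L1D_LOAD_BITS_PER_CYCLE = 2 * 64  # two 64-bit accesses per cycle


def real_mults_per_cycle_per_core():
    """Real multiplications issued per cycle by both .M units."""
    per_unit = COMPLEX_MULTS_PER_CMATMPY * REAL_MULTS_PER_COMPLEX_MULT
    return per_unit * M_UNITS_PER_CORE


def real_mults_per_second(cores=1):
    """Peak real multiplications per second for the given core count."""
    return real_mults_per_cycle_per_core() * CORE_CLOCK_HZ * cores


def operand_bits_per_cycle_per_core():
    """Operand bits consumed per cycle if nothing is reused (assumption)."""
    per_unit = (VECTOR_ELEMENTS + MATRIX_ELEMENTS) * BITS_PER_COMPLEX
    return per_unit * M_UNITS_PER_CORE


if __name__ == "__main__":
    print("mults/cycle/core :", real_mults_per_cycle_per_core())
    print("mults/s, 1 core  :", real_mults_per_second())
    print("mults/s, 8 cores :", real_mults_per_second(CORES))
    print("operand bits/cycle:", operand_bits_per_cycle_per_core())
    print("L1D load bits/cycle:", L1D_LOAD_BITS_PER_CYCLE)
```

Under the no-reuse assumption, the two .M units would consume 384 operand bits per cycle against 128 bits of L1D load width, which suggests why register reuse and careful instruction scheduling matter for sustaining the peak rate.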
Memory Access

Internal Buses

[Diagram: the program counter (PC) drives a 32-bit program address bus and the fetch stage receives 256-bit program data; on the data side, paths T1 and T2 each have a 32-bit data address bus and a 32/64-bit data bus serving register files A and B; the buses connect the core to the L1 memories, to L2 and external memory, and to peripherals.]

Load/store width by device family:
  C62x                 Dual 32-bit load/store
  C67x                 Dual 64-bit load / 32-bit store
  C64x, C674x, C66x    Dual 64-bit load/store

Pipeline Concept

Non-Pipelined vs. Pipelined CPU

  Clock cycle      1   2   3   4   5   6   7   8   9
  Non-pipelined    F1  D1  E1  F2  D2  E2  F3  D3  E3
  Pipelined        F1  D1  E1
                       F2  D2  E2
                           F3  D3  E3

The pipeline is full from cycle 3 onward.

  Stage  Pipeline Function
  F      Fetch     • Generate program fetch address
                   • Read opcode
  D      Decode    • Route opcode to functional units
                   • Decode instructions
  E      Execute   • Execute instructions

Now look at the C66x pipeline.

Program Fetch Phases

  Phase  Description
  PG     Generate fetch address
  PS     Send address to memory
  PW     Wait for data ready
  PR     Read opcode

[Diagram: C66x core, memory, and functional units, with the PG, PS, PW, and PR phases marked between the core and memory.]

Pipeline Phases - Review

  Cycle     1   2   3   4   5   6   7   8   9   10  11
  Instr 1   PG  PS  PW  PR  D   E
  Instr 2       PG  PS  PW  PR  D   E
  Instr 3           PG  PS  PW  PR  D   E
  Instr 4               PG  PS  PW  PR  D   E
  Instr 5                   PG  PS  PW  PR  D   E
  Instr 6                       PG  PS  PW  PR  D   E

  (PG, PS, PW, PR = program fetch; D = decode; E = execute)

Single-cycle performance is not affected by adding three program fetch phases. That is, there is still an execute every cycle.

How about decode? Is it only one cycle?

Decode Phases

  Phase  Description
  DP     Dispatch: intelligently routes the instruction to a functional unit
  DC     Decode: the instruction is decoded at the functional unit

[Diagram: C66x core, memory, and functional units, with the PG, PS, PW, PR, DP, and DC phases marked.]

Pipeline Phases

  Cycle     1   2   3   4   5   6   7   8   9   10  11  12  13
  Instr 1   PG  PS  PW  PR  DP  DC  E1
  Instr 2       PG  PS  PW  PR  DP  DC  E1
  Instr 3           PG  PS  PW  PR  DP  DC  E1
  Instr 4               PG  PS  PW  PR  DP  DC  E1
  Instr 5                   PG  PS  PW  PR  DP  DC  E1
  Instr 6                       PG  PS  PW  PR  DP  DC  E1
  Instr 7                           PG  PS  PW  PR  DP  DC  E1

  (PG, PS, PW, PR = program fetch; DP, DC = decode; E1 = execute)

The pipeline is full at cycle 7.

How many cycles does it take to execute an instruction?

Instruction Delays

All C66x instructions require only one cycle to execute, but some results are delayed.
  Description                                      Instruction Example                        Delay
  Single cycle                                     All instructions except those listed below  0
  Integer multiplication and new floating point    MPY, FMPYSP                                1
  Legacy floating-point multiplication             MPYSP                                      2
  Load                                             LDW                                        4
  Branch                                           B                                          5

Software Pipeline Example

Dot product: a typical DSP MAC operation.

       LDH
    || LDH
       MPY
       ADD

How many cycles would it take to perform this loop five times? (Disregard delay slots.)

5 x 3 = 15 cycles

Non-Pipelined Code

  Cycle  .D1  .D2  .M1  .M2  .L1  .L2  .S1  .S2
  1      ldh  ldh
  2                mpy
  3                          add
  4      ldh  ldh
  5                mpy
  6                          add
  7      ldh  ldh
  8                mpy
  9                          add

Pipelining Code

  Cycle  .D1  .D2  .M1  .M2  .L1  .L2  .S1  .S2
  1      ldh  ldh
  2      ldh  ldh  mpy
  3      ldh  ldh  mpy       add
  4      ldh  ldh  mpy       add
  5      ldh  ldh  mpy       add
  6                mpy       add
  7                          add

Pipelining these instructions took about half the cycles: 7 instead of 15 for five iterations.

Software Pipeline Support

• The compiler schedules instructions efficiently.
• DSP algorithms are typically loop intensive.
• Generally speaking, servicing of interrupts is not allowed in the middle of the loop because fixed timing is essential.
• The C66x hardware SPLOOP enables servicing of interrupts in the middle of loops.

NOTE: For more information on SPLOOP, refer to Chapter 8 of the C66x CPU and Instruction Set Reference Guide.

For More Information

• For more information, refer to the C66x CPU and Instruction Set Reference Guide.
• For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website.
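As a closing worked example, the cycle counts from the Software Pipeline Example slides can be reproduced with a small Python model. It is a sketch under the same simplification the slides use (delay slots are disregarded, so each of the three steps takes one cycle); the function and variable names are hypothetical and do not correspond to any TI tool.

```python
# Each loop iteration has three one-cycle steps (delay slots disregarded,
# as on the slides). The step names mirror the slide mnemonics.
STAGES = ["ldh || ldh", "mpy", "add"]


def non_pipelined_cycles(iterations, stages=STAGES):
    """Every iteration runs all of its steps before the next one starts."""
    return iterations * len(stages)


def pipelined_schedule(iterations, stages=STAGES):
    """Return one list per cycle naming the steps issued in that cycle.

    Iteration i issues step s in cycle i + s, so a new iteration can
    start every cycle once the previous one has moved to its next step.
    """
    total_cycles = iterations + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        issued = []
        for step_index, name in enumerate(stages):
            iteration = cycle - step_index
            if 0 <= iteration < iterations:
                issued.append(name)
        schedule.append(issued)
    return schedule


if __name__ == "__main__":
    print("non-pipelined:", non_pipelined_cycles(5), "cycles")
    for number, issued in enumerate(pipelined_schedule(5), start=1):
        print(f"cycle {number}: {', '.join(issued)}")
```

For five iterations the model gives 15 cycles without pipelining and a seven-cycle schedule with it, matching the two tables. The same formula (iterations + stages - 1) also covers the three-instruction F/D/E figure: 3 x 3 = 9 cycles non-pipelined versus 3 + 3 - 1 = 5 pipelined.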