C66x CorePac: Achieving High Performance
Agenda
1. CorePac Architecture
2. Single Instruction Multiple Data (SIMD)
3. Memory Access
4. Pipeline Concept
CorePac Architecture
C66x CorePac
[Block diagram: the DSP core (functional units .M, .L, .S, .D on each side; register files A and B with 32 registers each) connects through the memory controller to Level 1 Program memory (L1P, single-cycle cache/RAM, 256-bit instruction fetch), Level 1 Data memory (L1D, single-cycle cache/RAM, 64-bit data access), and Level 2 memory (L2, program/data cache/RAM).]
C66x CorePac
CorePac includes:
• DSP Core
• Two register files (A and B), 32 registers each
• Four functional units per register side
• L1P memory (Cache/RAM)
• L1D memory (Cache/RAM)
• L2 memory (Cache/RAM)
C66x DSP Core
[Diagram: register files A (A0..A31) and B (B0..B31); functional units .D1/.D2, .S1/.S2, .M1/.M2, .L1/.L2 (the .M and .L units provide the MACs); the controller/decoder; and the connection to memory.]
• Four functional units per side:
  o Multiplier (.M)
  o ALU (.L)
  o Data (.D)
  o Control (.S)
• These independent functional units enable efficient execution of parallel specialized instructions:
  o Multiplier (.M1 and .M2) and ALU (.L1 and .L2) provide MAC (multiply-accumulate) operations.
  o Data (.D) provides data input/output.
  o Control (.S) provides control functions (loop, branch, call).
• Each DSP core dispatches up to eight parallel instructions each cycle.
• All instructions are conditional, which enables efficient pipelining.
• The optimized C compiler generates efficient target code.
C66x DSP Core Cross-Path
Any 64-bit pair of registers from A can be one of the inputs to a B functional unit, and vice versa.
[Diagram: Register File A (A0..A31) and Register File B (B0..B31), each feeding the .D, .S, .M, and .L units on its own side, with cross paths A and B linking the two sides.]
Partial List of .D Instructions
Partial List of .L Instructions
Partial List of .M Instructions
Partial List of .S Instructions
Single Instruction Multiple Data (SIMD)
C66x SIMD Instructions: Examples
• ADDDP – Add Two Double-Precision Floating-Point Values
• DADD2 – 4-Way SIMD Addition, Packed Signed 16-bit
  – Performs 4 additions of two sets of 4 16-bit numbers packed into 64-bit registers.
  – The 4 results are rounded to 4 packed 16-bit values.
  – unit = .L1, .L2, .S1, .S2
• FMPYDP – Fast Double-Precision Floating-Point Multiply
• QMPY32 – 4-Way SIMD Multiply, Packed Signed 32-bit
  – Performs 4 multiplications of two sets of 4 32-bit numbers packed into 128-bit registers.
  – The 4 results are packed 32-bit values.
  – unit = .M1 or .M2
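The packed-lane idea behind DADD2 can be illustrated with a small host-side model. This is only a sketch in Python, not the instruction itself or a compiler intrinsic: the names `pack4x16`, `unpack4x16`, and `dadd2_model` are hypothetical, lane 0 is assumed to sit in the low 16 bits, and overflow in a lane is assumed to wrap around. Consult the instruction set reference guide for the exact behavior.

```python
MASK16 = 0xFFFF


def pack4x16(values):
    """Pack four signed 16-bit integers into one 64-bit word (lane 0 in the low bits)."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & MASK16) << (16 * lane)
    return word


def unpack4x16(word):
    """Split a 64-bit word back into four signed 16-bit integers."""
    lanes = []
    for lane in range(4):
        raw = (word >> (16 * lane)) & MASK16
        lanes.append(raw - 0x10000 if raw & 0x8000 else raw)
    return lanes


def dadd2_model(a, b):
    """Add the four 16-bit lanes of a and b independently; no carry crosses a lane."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        # Assumption: a lane that overflows wraps around within its 16 bits.
        lane_sum = ((a >> shift) + (b >> shift)) & MASK16
        result |= lane_sum << shift
    return result
```

For example, adding the packed lanes `[1, -2, 300, 32767]` and `[10, -20, -300, 1]` gives `[11, -22, 0, -32768]` in this model; the last lane shows the assumed wrap-around. The point of the sketch is that one 64-bit operation replaces four separate 16-bit additions.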
C66x SIMD Instruction: CMATMPY
Many applications use complex matrix arithmetic.
• CMATMPY – 1x2 Complex Vector Multiply 2x2 Complex Matrix
  – Results in 1x2 signed complex vector.
  – All values are 16-bit (16-bit real/16-bit imaginary).
  – unit = .M1 or .M2
• How many real multiplications per second does this provide?
  – One CMATMPY performs 4 complex multiplications (4 real multiplications each) = 16 multiplications
  – Two .M units (16 multiplications each) = 32 multiplications per cycle
  – Core cycles per second: 1.25 G
  – Total multiplications per second = 32 x 1.25 G = 40 G multiplications per core
  – 8 cores = 320 G multiplications per second
The question is whether we can feed the functional units data fast enough.
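The multiplication count above can be checked with a few lines of arithmetic. The sketch below is a hedged Python model: `cmatmpy_reference` is a hypothetical floating-point reference for multiplying a two-element complex vector by a 2x2 complex matrix (it ignores the 16-bit fixed-point format), and the constants are taken from the figures on this slide.

```python
REAL_MULTS_PER_COMPLEX_MULT = 4    # (a + jb)(c + jd) needs ac, bd, ad, bc
COMPLEX_MULTS_PER_CMATMPY = 4      # two vector elements times two matrix columns
M_UNITS_PER_CORE = 2               # .M1 and .M2
CORE_CLOCK_HZ = 1_250_000_000      # 1.25 G cycles per second
CORES = 8


def cmatmpy_reference(vector, matrix):
    """Multiply a 2-element complex row vector by a 2x2 complex matrix."""
    return [
        vector[0] * matrix[0][col] + vector[1] * matrix[1][col]
        for col in range(2)
    ]


def real_mults_per_cycle():
    """Real multiplications per cycle when both .M units issue a CMATMPY."""
    per_unit = COMPLEX_MULTS_PER_CMATMPY * REAL_MULTS_PER_COMPLEX_MULT
    return per_unit * M_UNITS_PER_CORE


def real_mults_per_second(cores=1):
    """Peak real multiplications per second for the given number of cores."""
    return real_mults_per_cycle() * CORE_CLOCK_HZ * cores
```

With these constants, `real_mults_per_cycle()` is 32, `real_mults_per_second()` is 40 G, and `real_mults_per_second(CORES)` is 320 G. These are peak figures: they assume both .M units issue a CMATMPY on every cycle.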
Feeding the Functional Units
There are two challenges:
• How to provide enough data from memory to the core
  – Access to L1 memory is wide (2 x 64 bit) and fast (0 wait state).
  – Multiple mechanisms are used to efficiently transfer new data to L1 from L2 and external memory.
• How to get values in and out of the functional units
  – Hardware pipeline enables execution of instructions every cycle.
  – Efficient instruction scheduling maximizes functional unit throughput.
Memory Access
Internal Buses
[Diagram: internal buses connecting the core (PC, fetch, A and B register files) to the L1 memories, L2 and external memory, and peripherals.]

Bus                  Width
Program Address      x32
Program Data         x256
Data Address - T1    x32
Data Data - T1       x32/64
Data Address - T2    x32
Data Data - T2       x32/64

• C62x: Dual 32-Bit Load/Store
• C67x: Dual 64-Bit Load / 32-Bit Store
• C64x, C674x, C66x: Dual 64-Bit Load/Store
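The load/store widths above translate directly into peak data bandwidth between the core and L1. The following Python sketch does that arithmetic. The table is copied from the family comparison above, the function names are hypothetical, and the 1.25 GHz clock in the example is the core frequency quoted earlier in this deck rather than a property of every device.

```python
# Bits moved per access on each of the two data paths (T1 and T2).
DATA_PATH_BITS = {
    "C62x": {"load": 32, "store": 32},
    "C67x": {"load": 64, "store": 32},
    "C64x": {"load": 64, "store": 64},
    "C674x": {"load": 64, "store": 64},
    "C66x": {"load": 64, "store": 64},
}
DATA_PATHS = 2


def peak_bytes_per_cycle(family, direction):
    """Peak bytes per cycle when both data paths are busy in one direction."""
    return DATA_PATHS * DATA_PATH_BITS[family][direction] // 8


def peak_bytes_per_second(family, direction, clock_hz):
    """Peak bytes per second at the given core clock."""
    return peak_bytes_per_cycle(family, direction) * clock_hz
```

For a C66x core, `peak_bytes_per_cycle("C66x", "load")` is 16 bytes, so at 1.25 GHz the two data paths can deliver at most 20 GB per second from zero-wait-state L1 memory. This is an upper bound and assumes both paths are used on every cycle.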
Pipeline Concept
Non-Pipelined vs. Pipelined CPU
CPU Type        Clock Cycles
                1   2   3   4   5   6   7   8   9
Non-Pipelined   F1  D1  E1  F2  D2  E2  F3  D3  E3
Pipelined       F1  D1  E1
                    F2  D2  E2
                        F3  D3  E3
                (pipeline full from cycle 3)

Stage  Pipeline Function
F      Fetch: generate program fetch address; read opcode
D      Decode: route opcode to functional units; decode instructions
E      Execute: execute instructions
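The cycle counts in the comparison above follow from two simple formulas, sketched here in Python. The function names are hypothetical, and the model assumes one cycle per stage with no stalls.

```python
def non_pipelined_cycles(instructions, stages):
    """Each instruction runs through every stage before the next one starts."""
    return instructions * stages


def pipelined_cycles(instructions, stages):
    """The first instruction fills the pipeline; each later one adds one cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1
```

For the three instructions and three stages (F, D, E) shown above, the non-pipelined CPU needs 3 x 3 = 9 cycles and the pipelined CPU needs 3 + 3 - 1 = 5. Once the pipeline is full, one instruction completes every cycle regardless of how many stages there are.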
Now look at the C66x pipeline.
Program Fetch Phases
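Phase  Description
PG     Generate fetch address
PS     Send address to memory
PW     Wait for data ready
PR     Read opcode

[Diagram: the C66x core generates the address (PG) and sends it to memory (PS); memory access completes during PW; the opcode is read back into the core (PR) on its way to the functional units.]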
Pipeline Phases - Review
Program fetch: PG, PS, PW, PR. Decode: D. Execute: E.

Cycle          1   2   3   4   5   6   7   8   9   10  11
Instruction 1  PG  PS  PW  PR  D   E
Instruction 2      PG  PS  PW  PR  D   E
Instruction 3          PG  PS  PW  PR  D   E
Instruction 4              PG  PS  PW  PR  D   E
Instruction 5                  PG  PS  PW  PR  D   E
Instruction 6                      PG  PS  PW  PR  D   E

Single-cycle performance is not affected by adding three program fetch phases. That is, there is still an execute every cycle.
How about decode? Is it only one cycle?
Decode Phases
Decode Phase  Description
DP            Intelligently routes instruction to functional unit (dispatch)
DC            Instruction decoded at functional unit (decode)

[Diagram: after the fetch phases (PG, PS, PW, PR), the core dispatches the instruction (DP) and the functional units decode it (DC).]
Pipeline Phases
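Program fetch: PG, PS, PW, PR. Decode: DP, DC. Execute: E1.

Cycle          1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1  PG  PS  PW  PR  DP  DC  E1
Instruction 2      PG  PS  PW  PR  DP  DC  E1
Instruction 3          PG  PS  PW  PR  DP  DC  E1
Instruction 4              PG  PS  PW  PR  DP  DC  E1
Instruction 5                  PG  PS  PW  PR  DP  DC  E1
Instruction 6                      PG  PS  PW  PR  DP  DC  E1
Instruction 7                          PG  PS  PW  PR  DP  DC  E1

Pipeline full (from cycle 7)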
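The staircase chart above can be generated programmatically, which makes it easy to confirm that an instruction reaches E1 on every cycle once the pipeline is full. The Python sketch below uses the phase names from this deck. The helper names are hypothetical, and the model assumes no stalls.

```python
C66X_PHASES = ["PG", "PS", "PW", "PR", "DP", "DC", "E1"]


def pipeline_chart(instructions, phases=C66X_PHASES):
    """Return one row per instruction; row[c] is the phase in cycle c + 1, or ''."""
    total_cycles = len(phases) + instructions - 1
    chart = []
    for index in range(instructions):
        row = [""] * total_cycles
        for offset, phase in enumerate(phases):
            # Instruction `index` enters the pipeline one cycle after its predecessor.
            row[index + offset] = phase
        chart.append(row)
    return chart


def occupancy(chart, phase="E1"):
    """Count, for every cycle, how many instructions are in the given phase."""
    cycles = len(chart[0])
    return [sum(1 for row in chart if row[c] == phase) for c in range(cycles)]


def format_chart(chart):
    """Render the chart as text, one instruction per line."""
    lines = (" ".join(f"{cell:<2}" for cell in row).rstrip() for row in chart)
    return "\n".join(lines)
```

Calling `occupancy(pipeline_chart(7))` yields six cycles with no instruction in E1 while the pipeline fills, followed by exactly one instruction in E1 on each of the remaining seven cycles.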
How many cycles does it take to execute an instruction?
Instruction Delays
All C66x instructions require only one cycle to execute, but some results are delayed.

Description                                         Instruction Example               Delay
Single cycle                                        All instructions except those below   0
Integer multiplication and new floating point       MPY, FMPYSP                       1
Legacy floating point multiplication                MPYSP                             2
Load                                                LDW                               4
Branch                                              B                                 5
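A delay of N means the result cannot be used until N cycles after the cycle that follows the instruction's execute cycle. The Python sketch below turns that rule into a small calculation. It uses only the MPY, LDW, and B delays listed above, plus ADD as an assumed example of a single-cycle instruction. The function names are hypothetical.

```python
# Delay (in cycles) before a result can be used by a later instruction.
DELAY_SLOTS = {
    "ADD": 0,  # assumed example of a single-cycle instruction
    "MPY": 1,
    "LDW": 4,
    "B": 5,
}


def result_ready_cycle(mnemonic, issue_cycle):
    """First cycle in which a following instruction can use the result."""
    return issue_cycle + DELAY_SLOTS[mnemonic] + 1


def chain_ready_cycle(mnemonics, first_issue_cycle=1):
    """Cycle in which the last result of a strictly dependent chain is usable."""
    cycle = first_issue_cycle
    for mnemonic in mnemonics:
        # Each instruction is issued as soon as its input becomes available.
        cycle = result_ready_cycle(mnemonic, cycle)
    return cycle
```

For a dependent load, multiply, add sequence starting in cycle 1, the LDW result is usable in cycle 6, the MPY result in cycle 8, and the ADD result in cycle 9. The delays leave idle cycles between dependent instructions, and software pipelining fills them with work from other loop iterations.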
Software Pipeline Example
Dot product: a typical DSP MAC operation.

      LDH
||    LDH
      MPY
      ADD

How many cycles would it take to perform this loop five times? (Disregard delay slots.)
______________ cycles
Software Pipeline Example
Dot product: a typical DSP MAC operation.

      LDH
||    LDH
      MPY
      ADD

How many cycles would it take to perform this loop five times? (Disregard delay slots.)
5 x 3 = 15 cycles
Non-Pipelined Code
Cycle  .D1  .D2  .M1  .M2  .L1  .L2  .S1  .S2
1      ldh  ldh
2                mpy
3                          add
4      ldh  ldh
5                mpy
6                          add
7      ldh  ldh
8                mpy
9                          add
Pipelining Code
Cycle  .D1  .D2  .M1  .M2  .L1  .L2  .S1  .S2
1      ldh  ldh
2      ldh  ldh  mpy
3      ldh  ldh  mpy       add
4      ldh  ldh  mpy       add
5      ldh  ldh  mpy       add
6                mpy       add
7                          add

Pipelining these instructions took about half the cycles: five iterations finish in 7 cycles instead of 15.
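The overlap in the pipelined schedule can be simulated on a host to confirm that it computes the same dot product in fewer cycles. The Python sketch below is a simplified model, not generated DSP code. It treats load, multiply, and add as one-cycle stages and disregards delay slots, as the example above does. The function name is hypothetical.

```python
def dot_product_software_pipelined(a, b):
    """Return (dot product, cycles) with load, multiply and add overlapped."""
    n = len(a)
    loaded = None    # operand pair fetched in the previous cycle (ldh, ldh)
    product = None   # product computed in the previous cycle (mpy)
    total = 0
    cycles = 0
    i = 0
    while i < n or loaded is not None or product is not None:
        cycles += 1
        # Every stage consumes what the previous stage produced one cycle
        # earlier, so the stages are evaluated from last to first.
        if product is not None:
            total += product                      # add
        product = loaded[0] * loaded[1] if loaded is not None else None  # mpy
        if i < n:
            loaded = (a[i], b[i])                 # two parallel loads
            i += 1
        else:
            loaded = None
    return total, cycles
```

For five element pairs the model returns the correct sum after 7 cycles, compared with 5 x 3 = 15 cycles when each iteration runs to completion before the next begins. In general the pipelined loop takes n + 2 cycles in this model: two cycles to fill the stages, then one result per cycle.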
Software Pipeline Support
• The compiler is smart enough to schedule instructions efficiently.
• DSP algorithms are typically loop intensive.
• Generally speaking, servicing of interrupts is not allowed in the middle of the loop because fixed timing is essential.
• The C66x hardware SPLOOP enables servicing of interrupts in the middle of loops.
NOTE: For more information on SPLOOP, refer to Chapter 8 of the C66x CPU and Instruction Set Reference Guide.
For More Information
• For more information, refer to the C66x CPU
and Instruction Set Reference Guide.
• For questions regarding topics covered in this
training, visit the support forums at the
TI E2E Community website.