TriCore™ DSP Optimization Guide Part 2: Routines

U s e r G u i d e , v 1 . 6 . 4 , J a n u a ry 2 0 0 3
TriCoreTM
32-bit Unified Processor
DSP Optimization Guide
Part 2: Routines
IP Cores
N e v e r
s t o p
t h i n k i n g .
Edition 2003-01
Published by Infineon Technologies AG,
St.-Martin-Strasse 53,
D-81541 München, Germany
© Infineon Technologies AG 2003.
All Rights Reserved.
Attention please!
The information herein is given to describe certain components and shall not be considered as warranted
characteristics.
Terms of delivery and rights to technical change reserved.
We hereby disclaim any and all warranties, including but not limited to warranties of non-infringement, regarding
circuits, descriptions and charts stated herein.
Infineon Technologies is an approved CECC manufacturer.
Information
For further information on technology, delivery terms and conditions and prices please contact your nearest
Infineon Technologies Office in Germany or our Infineon Technologies Representatives worldwide
(see www.infineon.com).
Warnings
Due to technical requirements components may contain dangerous substances. For information on the types in
question please contact your nearest Infineon Technologies Office.
Infineon Technologies Components may only be used in life-support devices or systems with the express written
approval of Infineon Technologies, if a failure of such components can reasonably be expected to cause the failure
of that life-support device or system, or to affect the safety or effectiveness of that device or system. Life support
devices or systems are intended to be implanted in the human body, or to support and/or maintain and sustain
and/or protect human life. If they fail, it is reasonable to assume that the health of the user or other persons may
be endangered.
U s e r G u i d e , v 1 . 6 . 4 , J a n u a ry 2 0 0 3
T r i C o r e TM
32-b it Un ified Pro c essor
DSP Optimization Guid e
Part 2: Routines
DSP Optimization Guide
Revision History:
2003-01
Previous Version:
v1.6.3
Page
Subjects (major changes since last revision)
All
Revised following internal review
v1.6.4
We Listen to Your Comments
Is there any information within this document that you feel is wrong, unclear or missing?
Your feedback will help us to continuously improve the quality of this document.
Please send your comments (referencing this document) to:
[email protected]
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Table of Contents
Page
1
1.1
1.1.1
1.1.2
1.2
1.2.1
1.2.2
1.2.3
1.2.4
1.2.5
1.2.6
1.3
1.3.1
1.3.2
1.3.3
1.3.4
1.3.5
1.3.6
1.3.7
1.4
1.4.1
1.4.1.1
1.4.1.2
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Information Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Number of Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Arithmetic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Routine Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Pseudo Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Pipe Resource Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Assembly Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Register Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
How to Test a DSP Routine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
The Golden Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Transcendental . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Measuring Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
How to Count Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Counting Cycles for a Routine without Loops . . . . . . . . . . . . . . . . . . 23
Counting Cycles for a Routine with Loops . . . . . . . . . . . . . . . . . . . . . 25
2
2.1
Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Complex Wave Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3
3.1
3.2
3.3
3.4
3.5
3.6
3.7
Transcendental Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Square Root (by Newton-Raphson) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Square Root (Taylor) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Inverse (y=1/x) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Natural Logarithm (y= ln(x)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Exponential (y=ex) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Sine (y =sin(x)), range [-Pi/2, Pi/2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Sine (y =sin(x)), range [-Pi, Pi) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
32
33
34
35
36
37
39
4
4.1
4.2
4.3
4.4
4.5
4.6
4.7
Scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16-bit signed Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
32-bit signed Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
32-bit signed Multiplication (Result on 64-bit) . . . . . . . . . . . . . . . . . . . . . .
‘C’ Integer Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16-bit Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
32-bit Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2nd Order Difference Equation (16-bit) . . . . . . . . . . . . . . . . . . . . . . . . . . .
41
42
43
43
44
45
46
47
User Guide
5
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Table of Contents
Page
4.8
4.9
4.10
4.11
4.12
2nd Order Difference Equation (32-bit) . . . . . . . . . . . . . . . . . . . . . . . . . . .
Complex Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Complex Multiplication (Packed) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Complex Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Complex Update (Packed) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
48
49
51
52
54
5
5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8
5.9
5.10
Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vector Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vector Pre-emphasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vector Square Difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vector Complex Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vector Complex Multiplication (Packed) . . . . . . . . . . . . . . . . . . . . . . . . . .
Vector Complex Multiplication (Unrolled) . . . . . . . . . . . . . . . . . . . . . . . . . .
Color Space Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vector Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vector Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
56
57
58
59
60
62
64
66
68
71
73
6
6.1
6.2
6.3
6.4
6.5
6.6
6.7
6.8
6.9
6.10
6.11
6.12
6.13
6.14
6.15
6.16
6.17
6.18
6.19
Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Dot Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Magnitude Square . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Vector Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
First Order FIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Second Order FIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
FIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Block FIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Auto-Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Complex FIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
First Order IIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Second Order IIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
BIQUAD 4 Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
N-stage BIQUAD 4 Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
N-stage BIQUAD 5 Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Lattice Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Leaky LMS (Update Only) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Delayed LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Delayed LMS – 32-bit Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Delayed LMS – Complex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7
7.1
7.2
7.3
7.4
Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Real Butterfly – DIT – Radix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Real Butterfly – DIF – Radix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Complex Butterfly – DIT – Radix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Complex Butterfly – DIT – Radix 2 – with shift . . . . . . . . . . . . . . . . . . . .
User Guide
6
113
116
118
120
123
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Table of Contents
Page
7.5
Complex Butterfly – DIF – Radix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8
8.1
8.2
8.2.1
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
TriBoard Project Cycles Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Steps to Run the Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
User Guide
7
129
129
129
129
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1
Introduction
This second part of the TriCore DSP Optimization guide contains short routines which,
from the machine’s perspective, offer a high degree of optimization.
Note: Machine (or assembly) perspective means instead of the algorithm perspective.
There is always more gain to be made at the algorithm level, than at the machine
(assembly implementation) level.
The time-pressures for a ‘real-world’ project situation coupled with basic DSP features,
will often add cycles to a project routine, but there are ways in which some routines can
be further optimized. Aside from the ingenuity of the DSP programmer themselves, these
include for example:
• Using more unrolling and/or software pipelining (although there is then the potential
drawback of making the code less readable and increasing the code size).
• Using a memory model where coefficients and data are not separated in 2 memory
spaces, allowing many algorithms to take advantage of interleaving data and
coefficients.
The chapters of this second part of the TriCore DSP Optimization guide are divided into
the following function types:
• Generator
• Transcendental Functions
• Scalars
• Vectors
• Filters
• Transforms
The following table identifies the characteristics of these routine types:
Generator
Input
Output
Processing
single
1 series of n elements Iterative processing
Transcendental single
single
Iterative processing
Scalar
single
single
simple equation
Vector 1)
1 series of n elements 1 series of n elements Identical to all n
elements
Filter
1 series of n elements single
Transform
multiple
multiple
1 series of n elements 1 series of n elements Not always
identical to n
elements
1)
Includes a matrix operation
User Guide
8
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Most routines are implemented as memory to memory, but some routines can be:
• Register to register
• Memory to memory with initialization of pointers
• Full context switching (identical to a library call)
It is very simple to use one type of model or another.
In this document, each routine is presented with a short summary and all routines of the
same type are grouped in an Information Table (a summary of the routines critical
characteristics), at the start of each chapter.
1.1
Information Table
Each section begins with an information table which gives:
•
•
•
•
Number of cycles
Code size
Optimization techniques (used in the assembly code)
Arithmetic method used
1.1.1
Number of Cycles
The number of cycles is counted as follows:
–
–
–
–
–
–
Software Pipelining
Loop Unrolling
Packed Operation
Load / Store Scheduling
Data Memory Interleaving
Packed Load / Store
Software Pipelining
Software pipelining means starting an equation before the previous equation has
finished. This is achieved with knowledge of the TriCore pipelining rules.
Example: Implementation of a C square difference
int sum = 0;
for (i=0; i<N; i++)
{
a = X[i];
b = Y[I];
c = a – b;
sum = sum + c*c;
}
User Guide
9
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
A naïve implementation would be:
mov
sumloop:
d0,#0
ld.w
d1,[Xptr+]4
ld.w
d2,[Yptr+]4
sub
d3,d1,d2
madd
d0,d0,d3,d3
LC,sumloop
sumAddr,d0
loop
st.w
;
;
;
;
;
;
prolog
loop
(1)
(2)
(3)
(4,5,6)
; ld X0
; ld Y0
;
;
; epilog
This implementation is very expensive in terms of cycles, because the loop begins with
two LS (Load/Store) instructions followed by one IP (Integer Processing) instruction, and
one MAC 32*32.
The number of cycles can easily be decreased by using a different instruction order:
• IP followed by LS, MAC 32*32 followed by LS
mov
ld.w
ld.w
sumloop:
d0,#0
d1,[Xptr+]4
d2,[Yptr+]4
sub
ld.w
madd
ld.w
LC,sumloop
sumAddr,d0
loop
st.w
;
;
;
;
d3,d1,d2
;
d1,[Xptr+]4 ;
d0,d0,d3,d3 ;
d2,[Yptr+]4 ;
prolog
ld X0
ld Y0
loop
(1)
||
(2)
||
;
; ld X1 for next pass
;
; ld Y1 for next pass
; epilog
Now the number of cycles is reduced to 2 cycles per loop, compared with the previous
6 cycles per loop. This can be described in C as:
int
a =
b =
for
sum = 0;
X[0];
Y[0];
(i=1; i<N; i++)
{
c = a – b;
a = X[i];
sum += c*c;
b = Y[I];
}
User Guide
10
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Loop Unrolling
The equation is written twice or more, inside a loop. This technique is usually used with
software pipelining.
Example: Implementation of a C array sum
z =0;
for (i=0; i<N; i++) z += X[i];
In TriCore assembly language this becomes:
mov d0,#0
vsumloop:
loop
st.w
ld.w
add
LC,vsumloop
Zaddr,d0
d1,[Xptr+]4
d0,d0,d1
;
;
;
;
prolog
loop
(1)
; ld X0
(2)
; z
; epilog
Here the Load and Add operations are performed in 2 cycles for a single element of the
array. This can be improved by computing the addition of two elements at a time, in the
loop:
mov
d0,#0
ld.w
d1,[Xptr+]4
vsumloop:
add
ld.w
add
ld.w
loop
LC,vsumloop
st.w
Zaddr,d0
d0,d0,d1
d1,[Xptr+]4
d0,d0,d1
d1,[Xptr+]4
;
;
;
;
;
;
;
prolog
ld X0
loop
(1)
||
(2)
||
;
;
;
;
z
ld X1
z
ld X2
; epilog
Adding two elements now takes only 2 cycles instead of 4. This can also be written in C:
z = 0;
for (i=0; i<N/2; i++)
{
z+= X[2*i];
z+= Z[2*i+1];
}
User Guide
11
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Packed Operation
With packed operation, two different data are packed in the same register.
Example: Load of two 16-bits values in a register
ld.w ssX,[Xaddr]
; ld X0 X1
Load / Store Scheduling
In Load/Store scheduling, the Load and Store instructions are reorganized to reduce the
number of cycles.
Example: Transform routine
X0
Æ
X0’
X1
Æ
X1’
This transform takes 6 cycles:
ld.w
mul
sub
st.w
ld.w
mul
sub
st.w
d0,[Xptr]
d0,d0,#5
d0,d0,#1
[Xptr+]4,d0
d0,[Xptr]
d0,d0,#5
d0,d0,#1
[Xptr+]4,d0
;
;
;
;
;
;
;
;
(1)
(2)
(3)
||
(4)
(5)
(6)
||
A cycle is saved with Load and Store scheduling:
ld.w
mul
ld.w
sub
st.w
mul
sub
st.w
User Guide
d0,[Xptr]
d0,d0,#5
d1,[Xptr]+4
d0,d0,#1
[Xptr+]4,d0
d1,d1,#5
d1,d1,#1
[Xptr+]4,d1
;
;
;
;
;
;
;
;
(1)
(2)
||
(3)
||
(4)
(5)
||
12
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Data Memory Interleaving
Here, data are mixed with other types of data in memory.
Example:
Xi0 at 0xD0000000
Xi1 at 0xD0000002
Xr0 at 0xD0000004
Xr1 at 0xD0000006
instead of:
Xi0 at 0xD0000000
Xr0 at 0xD0000002
Xi1 at 0xD0000004
Xr1 at 0xD0000006
Packed Load / Store
With Packed Load/Store at least two data are loaded or stored in the same instruction.
Example: Load two 32-bits values
Instead of:
ld.w
ld.w
d0,[Xptr+]4
d1,[Xptr+]4
; ld X0
; ld X1
It can be written as:
ld.d
1.1.2
e0,[Xptr+]8
; ld X0 X1
Arithmetic Methods
• Saturation
– at least one instruction is used with saturation.
• Rounding
– at least one instruction is used with rounding.
User Guide
13
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1.2
Routine Organization
A typical routine page in this document is divided into the following different sections:
Title
5.4
Equation
Equation
Zn=(Xn-Yn)2
Vector Square Difference
n=0..N-1
Pseudo code:
for (n=0; n<N; n++)
{
sTmp = xX[n]-sY[n];
sZ[n] = sTmp*sTmp;
}
Pseudo
Code
Pipe Resource
Table
IP=2 (1 sub, 1 mul)
LD/ST=3 (read sX, read sY, write sZ)
Assembly code:
lea
LC,(N/2 - 1)
ld.d
ssssXY, [Xptr+]8
sqdloop:
subs.h ssTmp,ssX,ssY
ld.d
ssssXY, [Xptr+]8
mulr.h ssZ,ssTmp,ssTmp ul,#1
st.w
[Zptr+]4, ssZ
Loop LC, sqdloop
Assembly
Code
;(1)
;(2)
;get loop number
;X0 X1 Y0 Y1
;(1)
;X1 - Y1 || X0 Y0
;||
;X2 X3 Y2 Y3
;(2,3) ;(X1-Y1)^2 || (X0-Y0)^2
Memory Organization
Memory
Organization
Number of
Cycles
Xaddr
sX0
Xaddr +2
sX1
Xaddr +4
sY0
Xaddr +6
sY1
Xaddr +8
sX2
Xaddr +10
etc...
Example
N=160 244 cycles
Æ
Register Diagram
Register
Diagram
Instruction d0
d1
d3 / d2
Load/Store
y1 y0 x1 x0
ld x0x1y0y1
The topics shown in the figure above are described in the following sub-sections:
1.2.1
Equation
The Equation section gives the generic equation of the algorithm. When two (or more)
variables are written back, there are several equations.
• n the variable is an array or a vector
• r the value is real
• i the value is imaginary
User Guide
14
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1.2.2
Pseudo Code
Pseudo Code provides an exact description of the equation(s). The ‘Pseudo C code’ has
several advantages compared to C code:
• There is no strict syntax requirement.
The classic example is typecasting in C, which sometimes gives unreadable code.
• It may use ‘non C’ constructs. It is very difficult for example, to express 64-bit
quantities (not a standard) and circular addressing in C.
The Pseudo code also determines the number of IP (Integer Processing) and LS (Load/
Store) instructions, as well as the memory bandwidth (2, 16-bit values do not give the
same bandwidth as 2, 64-bit values). This is important to remember, as the Optimization
can not perform better than implementing the minimum number of operations.
Pseudo Code Implementation
Pseudo code implementation is used to make the link between the description of
specifications (in Pseudo code) and the implementation (in Assembly code). It explains
how the specification Pseudo-code lines are actually computed in the Assembly code,
and is very useful when Packed Techniques and Software Pipelining are used.
1.2.3
Pipe Resource Table
IP = 1 (1 MUL)
Integer
Processing
LD / ST = 2 (READ lX, write lZ)
Number of IP,
MAC Instructions
Load and Store
Number of LD / ST
Instructions
The Pipe Resource table acts as a ‘sanity check’, which can help in implementing the
routine. For example, there is no need to optimize 2 instructions on the IP side of a
routine if the bottleneck is due to 5 instructions on the LS side.
Note: As the pipe resource table is based on the Pseudo code, it will not show the
number of IP and LS instructions inside the routine’s loop, because this is
dependent on all of the Optimization techniques used/applied.
User Guide
15
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1.2.4
Assembly Code
The Assembly code given in this document is the actual TriCore Assembly code.
Where ‘N’ appears in the loop counter, it refers to the number of points given by the
Equation or Pseudo code sections. Variable names (rather than registers) are used for
easy readability. It should be noted that the pointers are not defined as these are
generally declared in a global file.
Cycles are indicated in the Assembly code comments: (1), (2), (3). The total number of
cycles is summarized in a table following the code. The time taken to enter and leave a
loop is only indicated in the Information table.
1.2.5
Register Diagram
The Register diagram is an aid to visualizing the TriCore pipeline model. It also acts as
a working sheet to optimize the algorithm.
The Register diagram is made up of the following fields:
• Instruction
– Contains processing instructions (add, sub, madd, shifts, logic operation)
• d0…d7
– The first eight registers are shown. The value in the register is the value at the end
of the instruction. This can be confusing when a register is being loaded from
memory at the same time as it is used in the instruction. The value is loaded after
the calculation.
• Load / Store
– Indicates when data is being read from memory or being stored back to memory.
• Bold borders
– Indicates which instructions are in the loop.
User Guide
16
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1.2.6
Notation
Certain names and abbreviations are used to make the routines easier to understand.
This notation is very useful when attempting to optimize a routine.
Abbreviation
Description
s
Short (16-bit value)
ss
Two short values are in a 32-bit register
ssss
Four short values are in a 64-bit register
l
Long (32-bit value)
ll
Long-long (2 long in a 64-bit register)
ll
Long-long (64-bit value)
||
Instruction executed in parallel
(1)
Cycle number
Xptr
Pointer for X values
Yptr
Pointer for Y values
Kptr
Pointer for K values
Zptr
Pointer for Z values
Vptr
Pointer for V values
Wptr
Pointer for W values
XBptr
Pointer for X values used for circular addressing (ex:a2/a3)
Xaddr
Address of X values
Kaddr
Address of K values
Yaddr
Address of Y values
Zaddr
Address of Z values
LC
Address register (usually a5) used as a loop counter
User Guide
17
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1.3
How to Test a DSP Routine
All of the routines described in this optimization guide are implemented in TriCore
assembly code. The TriCore assembly code becomes, in effect, the reference code. The
question that is then raised is how do we know it works? In other words, how do you test
an optimized DSP routine? These questions are addressed in this section.
1.3.1
The Golden Models
For every type of routine there is a different input and output format. This, together with
the way in which they use the memory, means that there is a specific way to test them.
The essential idea is to see the routine as a ‘black box’:
M
D
DSP ROUTINE
The ‘Golden model’ is used as the reference and is written in C.
When the data type is integer it gives a bit-exact result, which can be used to perform a
direct comparison with the TriCore assembly implementation results.
When the data type is Floating-Point, the comparison cannot be exact. However the
approximation of the result is sufficient to verify the TriCore implementation.
The best test process should have 3 steps:
• Compute the DSP routine and the Golden model on the same data buffer.
• Store the two results.
• Compute the difference between the Golden model and the TriCore assembly for
each value, and keep the greatest difference. If this value is less than the maximum
allowed error P, the test succeeds. If it exceeds the maximum allowed error then the
implementation is either wrong or not accurate enough.
• If more precision is required, the comparison should be performed on a spreadsheet.
P
DSP Routine
written in TriCore
assembly (‘asm’)
Yes
< Accuracy ?
DATA
No
Golden Model
Test processes will be different for each kind of DSP algorithm, because of the different
kind of input / output formats and different memory implementation.
User Guide
18
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
To help the programmer testing the routines, a project using Tasking EDE can be
created. The project space should contain 6 projects, one for each type of routine:
• Generators
• Scalars
• Filters
• Transcendental
• Vectors
• Transforms
1.3.2
Generators
There are 5 assembly files (*.src) organized in 5 directories. The 5 generators are called
by the C program (‘generators.c’) and are compared against the Golden models.
Only the complex wave generation is described in detail, as the other 4 generators are
either directly written in C or are derived from the complex wave.
TriCore asm
DISPLAY
Yes
Same Shape ?
DATA
Golden Model (C)
1.3.3
No
DISPLAY
Transcendental
There are 6 transcendental assembly files (*.src), organized in 6 directories.
The two sine functions described in the manual are in the same file (‘testsin.src’), which
also contains 8 sine versions representing different precision’s. All functions are called
by the C program ‘trancendental.c’, and are compared against the Golden models.
TriCore asm f(x)
C: f -1(x)
P
Yes
< Accuracy ?
DATA
Golden Model (C)
f(x)
No
C: f -1(x)
Additionally, the inverse of each function is computed. This is easily achieved since a
transcendental function such as y=f(x) will always have a corresponding f-1(x).
The principle advantage is that it is very fast to see mistakes. This is because the output
(f-1(f(x)) should be the same as the input.
For each routine there is an associated spreadsheet comparing precision.
User Guide
19
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1.3.4
Scalars
There are 12 scalar routines, organised as 11 scalar assembly files (8.src) in 8
directories. Two, 32-bit signed multiplications are in the same file.
All functions are called by the C program (‘scalars.c’) and are compared against the
Golden models.
P
TriCore asm
Yes
< Accuracy ?
DATA
No
Golden Model (C)
In this instance the required accuracy is generally bit-exact, as Scalar routines are very
easy to model in 32-bit integer C.
1.3.5
Vectors
There are 10 vector assembly files (*.src), each in its own directory. All functions are
called by the C program ‘vectors.c’, and are compared against the Golden models.
Inside a loop
P
TriCore asm
Yes
< Accuracy ?
DATA
No
Golden Model (C)
In this instance the required accuracy is generally bit-exact, as these routines are easy
to model in 32-bit integer C.
User Guide
20
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1.3.6
Filters
There are 19 filter assembly files (*.src), each in its own directory. The C program
‘filters.c’, calls all functions. There are no Golden models. Instead, the Tasking data
analysis window ‘scope function’, can be used.
Inside a loop
Oscilloscope
TriCore asm
O
O
O
DATA
The routine is usually just the kernel of a filter, so it will be inside a loop and will require
careful implementation in memory.
1.3.7
Transforms
There are 5 Transform assembly files (*.src), each in its own directory. All 5 routines are
called by the C program ‘transform.c’, and compared against the Golden model.
TriCore asm
DISPLAY
Yes
Same Shape ?
DATA
Golden Model (C)
User Guide
DISPLAY
21
No
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1.4
Measuring Cycles
This section describes how to measure the number of cycles for a specific routine. As
the test is dependent on the routine type (transcendentals, scalars, vectors, etc.), this
section offers a universal method of testing.
Note: Some routines with loops require modifications. These are explained in this
section.
A test should not change the code or the pointers. This means that it is important to be
careful with the data memory mapping (When there is not enough space in the DMU, the
test loop number should be decreased).
1.4.1
How to Count Cycles
The test code is a small program with the routine included. This code is called by an
assembly function, CYCLE_COUNT().
The CYCLE_COUNT() function executes the code that starts in 0xD4000000 and
returns an integer value, the number of cycles. The timer counter is read before the call,
subtracted from the value after executing the test code, and then multiplied by two
(because the timer counter is on the FPI clock, its speed is half of the CPU one). In the
cycle test project this function is mapped in PMU, just after the test code.
Assembly code:
.sect "program.code"
CONST.A
.macro reg,addr
movh.a reg,#((addr) + 0x8000 >> 16)
lea reg,[reg]((((addr) + 0x8000) & 0xffff) - 0x8000)
.endm
CYCLE_COUNT:
CONST.A
a10,0xd4000000
ld.w
d9,0xf0000310
calli
a10
ld.w
d10,0xf0000310
sub
d2,d10,d9
sh
d2,#1
ret
User Guide
;load program location address
;load sys timer counter value before call
;call the function
;load sys timer counter value after call
;compute the difference
22
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1.4.1.1
Counting Cycles for a Routine without Loops
The test program consists of a loop performed 1000 times with:
• 50 nop32
• The routine
• 50 add d0,d0,d0
To run the test:
1. Run an loop without the routine, execute and write down the cycle number N.
2. Include the routine inside the loop, execute and write down the cycle number M.
3. The number of cycles is (M-N)/1000.
The code is located in PMU (Rider-D: 0xD4000000 to 0xD4007fff), and the data should
be in DMU (Rider-D: 0xD0000000 to 0xD0007fff).
Tasking macro operation .dup is used here for more clarity.
1. Run an empty Test Loop
Assembly code:
;####### DATA ################################################
.sect "test_cycles.data"
;####### CODE ################################################
.sect "test_cycles.code"
lea a5,999
. align 8
testloop:
.dup 50
nop32
.endm
;--- include routine here
;--- end routine
.dup 50
add
d0,d0,d0
.endm
loop
a5,testloop
ret
User Guide
23
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
2. Include the Routine in the Loop, and Data Memory
Assembly code:
;####### DATA #################################################
.sect "test_cycles.data"
Xaddr: .half 0x1111
Kaddr: .half 0x2222
Zaddr: .half 0xDEAD
;####### CODE #################################################
.sect "test_cycles.code"
NUM .set 50
lea
a5,999
.align 8
superloop:
.dup NUM
nop32
.endm
;--- include routine here
ld.q
d1,Xaddr
ld.q
d2,Kaddr
mulr.q
d15,d1 u,d2 u, #1
st.q
Zaddr,d15
;--- end routine
.dup NUM
add
d0,d0,d0
.endm
loop
a5,superloop
ret
Note: Because of the addressing mode used in the routine, an initialization of pointers is
usually required. This initialization should take place outside the loop (this should
not be counted in the routine’s number of cycles).
User Guide
24
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
1.4.1.2
Counting Cycles for a Routine with Loops
Loops need to be aligned on an 8-byte boundary in order to be executed in the exact
number of cycles predicted. Therefore to test the routine with a loop, alignment has to
be carried out first, by including some nop16 and nop32 before the test loop.
The ld.d and st.d instructions also need special care. The pointer of the data loaded or
stored needs to be aligned on an 8-byte boundary. Data memory and pointers’
sometimes have to be changed to avoid a misalignment.
The test program consists of a jump loop, run 1000 times with:
•
•
•
•
50 nop32
The pointer initialization
The routine
50 add d0,d0,d0
To Run the Test:
1. Include the routine in the test loop and align data and loops.
2. Run a loop with the pointer initialization but without the routine (by commenting it out),
execute and write down the cycle number N.
3. Remove the comments around the routine, execute and write down the cycle number
M.
4. The number of cycles is (M-N)/1000.
Assembly code:
mov
d15,#999
testloop:
.dup 50
nop32
.endm
;--- include routine + init here --;--- end routine + init here ------.dup 50
add
.endm
jned
ret
d0,d0,d0
d15,#0, testloop
Note: In this example the data register d15 is used for the jump. If this register is used in
the routine, another one should be used (define macros area has to be checked).
User Guide
25
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Examples: Test of a routine with loop
Assembly code:
;####### CODE ##############################################
.sect "test_cycles.code"
;-----------------------------------------------------------;vector complex multiplication
(5cycles in the loop)
;-----------------------------------------------------------.define
sYr
"d1"
.define
sYi
"d3"
.define
ssX
"d4"
.define
ssK
"d5"
.define
Xptr
"a2"
.define
Kptr
"a3"
.define
Yptr
"a4"
.define
LC
"a5"
.define
N
"8"
mov
nop32
d15,#999
nop16
; testloop counter
; nops here to align label nmloop on a 8 bytes
; boundary
;
testloop2:
.dup 50
nop32
.endm
;--- include routine + init here --lea
Xptr,buffin_vcplxmul1
lea
Kptr,buffin_vcplxmul2
lea
Yptr,buffout_vcplxmul
lea
movh
mov
nmloop:
LC,(N-1)
d6,#0
d7,#0
ld.w
ssK,[Kptr+]4
ld.w
ssX,[Xptr+]4
msubadm.h e0,e6,ssK,ssX ul,#1
mulm.h
e2,ssK,ssX lu,#1
st.h
[Yptr+]2,sYr
st.h
[Yptr+]2,sYi
loop
LC,nmloop
;--- end routine + init here -------
User Guide
26
; Xptr
; Kptr
; Yptr
; (1)
; (2)
; (3)
; get loop number
; clear 3rd source
; clear 3rd source
;
;
;
;
;
;
;
;
;
;
;
;
(1)
(2)
(3)
(4)
||
(5)
load k
load x
yr = xr*kr - xi*ki
yi = xr*ki + xi*kr
store yr
store yi
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
.dup 50
add
.endm
jned
ret
d0,d0,d0
d15,#0,testloop2
Problems Related to Cycle Count
• ld.d and st.d:
– Data pointer should be aligned on an 8-byte boundary.
The memory mapping needs changes if the pointer is misaligned in the loop.
• Loop alignment:
– Label should be aligned on an 8-byte boundary in PMU.
Use nop32 and nop16 before the test loop to align the label.
• PMU cache (in rider-D: 0xD4000000 to 0xD4007FFF):
– Because of the internal scratchpad, an address in the PMU cache could add one
cycle.
Note: Warning for Rider-D
• Two loops inside each other will add more cycles, so this should be avoided
• Use 32-bit opcode inside a loop to maintain alignment
User Guide
27
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
2
Generator
A Generator can be regarded as a moving vector (one or two dimensions), stored in
memory, in a buffer. This window in memory is then displayed on a screen such as an
oscilloscope.
2.1
Complex Wave Generation
Equation:
X = X*Kr - Y*Ki
Y = X*Ki + Y*Kr
Pseudo code:
for (n=0; n<N; n++)
{
sX = sX*sKr-sY*sKi;
sY = sX*sKi+sY*sKr;
}
IP= 4 (2 mul, 1 msub, 1 madd) LD/ST= 1 in packed format (write sX || sY)
Assembly code:
lea
ld.w
ld.w
ldloop:
loop
st.w
Cycles
User Guide
LC,(n-1)
k,[Kptr]
xy,startvect
; (1)
; (2)
; (3)
mulr.h
temp,xy,k ll,#1
st.w
[OUTptr+]4,xy
maddsur.h
xy,temp,xy,k uu,#1
LC,ldloop
[OUTptr+]4,xy
; (1)
; y’ = y*b || x’ = x*b
; ||
; st x1,y1 (next loop)
; (2,3) ; y’ += x*k || x’-= y*k
N = 100
28
Æ
; ||
; load loop counter
; ld rotation vector
; ld start vector
; st last x,y
305
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d3
d5
d4
kikr
ld kikr
xy
mulr.h
temp,xy,k ll,#1
Load / Store
ld x y
y=x*ki || x=x*kr
maddsur.h
xy,temp,xy,k uu,#1
y=y+y*kr || x=y-y*ki st x y
st x y
User Guide
29
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
3
Transcendental Functions
Please note the following points on Transcendental functions:
• They are commonly used in domains other than DSP.
• They have more acute arithmetic problems (as opposed to signal processing
problems).
• The same functions require different precision levels (application/programmer
dependent)
• They do not easily lend themselves to multi-MAC operations, since they are inherently
iterative (no parallelism) and they produce a single result.
• They can be implemented as a table look-up (space) or opposed to a series expansion
(time).
• They can be implemented as a combination of space and time (partial look-up table
and partial computation). This illustrates the first law of algorithms, that the space-time
continuum is constant.
The table which follows summarises the Optimization techniques and Arithmetic
methods that are applicable to the different types of Transcendental Functions.
User Guide
30
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Name
Cycles
Code
Size 1)
Optimization
Techniques
Software Pipelining
Loop Unrolling
Packed Operation
Load/Store Scheduling
Data Memory Interleaving
Packed Load/Store
Saturation
Rounding
Transcendental Functions Summary Table
Square Root
(Newton-Raphson)
44
72
-
-
-
-
-
-
9
9
Square Root
(Taylor)
26
82
-
-
-
9
-
9
9
9
Inverse
24
42
-
-
-
-
-
-
9
9
Natural Logarithm
16
46
-
-
-
-
-
-
-
9
Exponential
22
40
-
-
-
-
-
-
-
9
Sine [-PI/2,PI/2)
38
48
-
-
-
-
-
-
9
-
Sine [-PI,PI)
42
68
-
-
-
-
-
-
9
-
1)
Arithmetic
Methods
Code Size is in Bytes
User Guide
31
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
3.1
Square Root (by Newton-Raphson)
Equation:
Input: [0.25,1) in 1Q15 (X should be normalized to 0.25.. 1)
Output: 2Q14
Y 0 = 1.1033 – 0.6666* X
Y n+1 = Yn * (1.5 – (X*2)* Yn2)
where n = 0,1,2
Pseudo code:
// The loop calculates 1/(2*z)
sK = 2* sX;
sY = 1.1033 – 0.6666*sX;
for (n = 0; n<3; n++) sY = sY*(1.5- sK*sY*sY);
sZ = sY*(2*sX-1) + sY;
IP= 4 (3 mul, 1 sub)
LD/ST= 2 (load sK, store sY)
Assembly code:
movh
lea
movh
ld.q
movh
movh
mov
msubr.q
d0,#0x469c
a4,2
d1,#0x5553
sX,[a3+]
d9,#0x8000
d8,#0x1000
sK,sX
sY,d0,sX u,d1u,#0
;
;
;
;
;
;
;
;
(1)
||
(2)
||
(3)
(4)
(5)
(6)
;
;
;
;
;
;
;
;
1.1033 in 2q14
initialization of counter
0.6666 in 1q15
load sX in 1q15
d9 = -1 in1q15
d8 = 0.5 in3q13
sK in 2q14(1q15->2q14->*2)
Y02q14
;
;
;
;
;
;
;
;
(1,2)
;Y0^21q15
(3)
;result in2q14
(4,5)
;0.5-sK*Y0^23q13
(6)
;result in 2q14
(7)
;Y in 3q13
(8,9)
Y0+Y0[0.5-sK*sY0^2]3q13
(10)
;Y1 in 2q14
;
;
;
;
2*sX-1 in 1q15
(2*sX-1)/(2*sqrt(sX))
(2*sX-1)/(2*sqrt(sX))+1/
(2*sqrt(sX)) = sqrt(sX)
sqrloop:
mulr.q
sh
msubrs.q
shas
sh
maddrs.q
tmp,sY u,sY u,#1
tmp,tmp,#-1
tmp,d8,tmp u,sK u,#1
tmp,tmp,#1
d1,sY,#-1
sY,d1,sY u,tmp u,#1
shas
a4,sqrloop
sY,sY,#1
loop
adds
mulr.q
adds
d6,sK,d9
d0,sY u,d6u,#1
d3,sY,d0
User Guide
; (7)
; (8,9)
; (10)
32
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
3.2
Square Root (Taylor)
Equations:
Input: [0.5,1) in 2Q14 (X should be normalized to 0.5.. 1)
Output: 1Q15
x= (y – 1)/2
y^0.5 = 1 + x – 0.5*(x^2) + 0.5*(x^3) – 0.625*(x^4) + 0.875*(x^5)
Pseudo code:
sX[0] = (sY – 1)/2;
for(n=1;n<5;n++)
for(n=0;n<3;n++)
sX[n] = sX[0] * sX[n-1];
eY += sX[2*n] * sK[2*n] + sX[2*n+1] * sK[2*n+1];
IP= 1+2 (1 mul), (1 mul, 1 add)
LD/ST= 1 + 2 (st sX), (ld sX,sK)
Assembly code:
lea
lea
ld.q
sh
addih
st.q
movh
st.q
mov
iloop:
a3,xsqrtvalue
a4,3
d1,[a3]
d1,d1,#-1
d1,d1,#0xc000
[a3+]2,d1
d0,#0x8000
[a3+]2,d0
d2,d1
;
;
;
;
;
;
;
;
;
(1)
(2)
(3)
(4)
(5)
||
(6)
||
(7)
loop
mulr.q
nop
st.q
a4,iloop
d2,d1u,d2u,#1 ; (1,2)
; ||
[a3+]2,d2
; ||
lea
mov
ld.w
mov
ld.d
a2,ksqrtvalue
d0,#0
d2,[a2+]4
d1,#0
e4,[-a3]8
;
;
;
;
;
(8)
(9)
||
(10)
||
maddms.h
ld.d
maddms.h
ld.w
maddms.h
st.h
e0,e0,d5,d2ul,#1
e2,[a2+]8
e0,e0,d4,d2ul,#1
d5,[-a3]4
e0,e0,d5,d3ul,#1
YAddr,d1
;
;
;
;
;
;
(11,12)
||
(13,14)
||
(15,16)
||
;load loop counter
;x
;y/2
;x1=y/2-0.5
;x0
;computation of x2,x3,x4,x5
;initialization of y(lower)
;k5k4
;initialization of y(upper)
;x5x4x3x2
;y=y+x4k4+x5k5
;k3k2k1k0
;y=y+x2k2+x3k3
;x1x0
;y=y+x0k0+x1k1
ksqrtvalue:.half 0xb000,0x7000,0xc000,0x4000,0x7fff,0x8000 ;k5k4k3k2k1k0
xsqrtvalue:.half 0x7000,0x0000,0x0000,0x0000,0x0000,0x0000 ;x1x0x2x3x4x5
User Guide
33
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
3.3
Inverse (y=1/x)
Equations:
Input: [+0.5..+1) in 2Q14.
Output: (+1..+2] in 2Q14.
Yk+1 = 2*Yk(1 – (X/2)*Yk)
X, Y are 16-bit values
Pseudo code:
sY = 1.457;
for(i=0;i<3;i++) sY = 2*sY*(1 - (sX/2)*sY)
IP= 4 (2 shift, 1 mul, 1 msub)
LD/ST= 0
Assembly code:
lea
lea
a2,coef_inv
a4,0x02
ld.q
sh
movh
ld.q
scond:
d0,[a3+]
d0,d0,#-1
d3,#0x2000
d2,[a2]
msubrs.q
sh
mulr.q
loop
shas
a4,scond
coef_inv: .half
User Guide
;
;
;
;
;
;
;
(1)
(2)
number of iteration on the series
(3)
; load x
2q14
(4)
; x/2
2q14
(5)
; 1 in
3q13
||
; y[0]= 1.457 in2q14
d4,d3,d0u,d2u,#1 ; (1,2)
d4,d4,#1
; (3)
d4,d4u,d2u,#1
; (4,5)
; temp =
d2,d4,#2
; (6)
0x5d3f
; temp = 1 - (x/2)*y 3q13
; temp = 1 - (x/2)*y 2q14
y*(1 - (x/2)*y)3q13
; y = temp 2q14
;1.457
34
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
3.4
Natural Logarithm (y= ln(x))
Equations:
Y = K1*(X-1)+K2*(X-1)2+K3*(X-1)3+K4*(X-1)4+K5*(X-1)5
Y, X, K are 16-bit values.
Input: [+1..+2) in 2Q14.
Ouput: in 1Q15
Pseudo code:
for(i=0;i<4;i++) sY *= sX + sK[n]
IP= 1 (1 madd)
LD/ST= 1 (1 ld sK)
Assembly code:
lea
a2,coef_log
lea
ld.q
movh
ld.q
sub
ld.q
sh
ilop:
a3,0x03
d4,[a4+]2
d5,#0x4000
d2,[a2+]2
d4,d4,d5
d3,[a2+]2
d4,d4,#1
loop
mulr.q
;
;
;
;
;
;
;
;
;
(1)
load the address of the first coeff k5
(2)
;initialize the counter
(3)
;load the number to log in 2q14
(4)
;12q14
||
;load k5
(5)
;z = x - 1
||
;load k4
(6)
;result in 1q15
maddr.q
d2,d3,d2u,d4u,#1
ld.q
d3,[a2+]2
a3,ilop
; ||
d6,d2u,d4u,#1
; (7,8)
;(1,2)
;
;||
;
;(((k5*z+k4)*z+k3)z+k2)z+k1
;((((k5*z+k4)z+k3)z+k2)z+k1)z
coef_log: .half 0x0404,0xeef8,0x2491,0xc149,0x7fe3 ;in 1Q15
User Guide
35
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Exponential (y=ex)
3.5
Equations:
Y = K1*X+K2*X2+K3*X3+K4*X4+K5*X5+K6*X6+K7*X7
Y, X, K are 16-bit values
Input: [0..1) in 1Q15
Output: in 3Q13
Pseudo code:
for(i=0;i<6;i++) sY *= sX + sK[n]
IP= 1 (1 madd)
LD/ST= 1 (1 ld sK)
Assembly code:
lea
lea
ld.q
a2,coef_exp
a3,0x05
d4,[a4+]2
; (1)
; (2)
; (4)
ld.q
ld.q
ilop:
d2,[a2+]2
d3,[a2+]2
; (5)
; (6)
loop
mulr.q
addih
sh
;load address of the first coeff k7
;initialize the counter
;load the number we would like the exp
;in 1q15
;load k7 2q14
;load k6 2q14
maddr.q
ld.q
a3,ilop
d2,d3,d2u,d4u,#1 ; (1,2) ;2q14
d3,[a2+]2
; ||
;load next coef 2q14
; ||
; (((((k7*z+k6)z+k5)z+k4)z+k3)z+k2)z+k1
d6,d2u,d4u,#1
; (7,8)
; ((((((k7*z+k6)z+k5)z+k4)z+k3)z+k2)z+k1)z
; 2q14
d6,d6,#0x4001
; (9) ;add 1 to result
d6,#-1
; (10) ;result in 3Q13
coef_exp:.half 0x0003,0x0016,0x0088,0x02aa,0x0aaa,0x2000,0x4000;in Q14
User Guide
36
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
3.6
Sine (y =sin(x)), range [-Pi/2, Pi/2)
Equations:
Sin(x)=k1*x+k2*x3+k3*x5+k4*x7+k5*x9+…
Input: [-1,1) in 1Q15
Output: [-1,1) in 1Q31
This can also be written as:
Sin(x)=((((k5*x2+k4)*x2+k4)*x2+k3)*x2+k2)*x2+k1)*x
This series is valid for x∈[-Pi/2, Pi/2], so the input between [-1,+1) is scaled to the range
[-Pi/2, Pi/2) and gives a result between –1 and 1.
0x7fff
x
0x0000
Y=Sine(x)
0x8000
routines_sec3_6a.eps
Pseudo code:
for(i=0;i<5;i++) lY *= lX + lK[n]
IP= 1 (1 madd)
User Guide
LD/ST= 1 (1 ld sK)
37
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Assembly code:
lea
lea
ld.q
1Q15
ld.w
a2,coef_sin
a3,0x04
d4,[a4+]2
; (1)
; (2)
; (3)
d2,LX1
mul.q
mul.q
ld.w
ld.w
lloop:
d4,d4,d2,#1
d1,d4,d4,#1
d2,[a2+]4
d8,[a2+]4
;
;
;
;
;
;
loop
mul.q
shas
LX1:
madds.q
sh
ld.w
a3,lloop
d6,d2,d4,#1
d6,d6,#1
;load the address of the first coeff
;initialize the counter
;load number we would like the sine
(4)
;load the factor of norm
(PI/2) 2Q30
(5,6,7) ;x = x*a 2q30
(8,9)
;z = x*x, 3q29
||
;load k5 1q31
(10)
;load k4 1q31
d2,d8,d2,d1,#1
d2,d2,#2
d8,[a2+]4
; ||
; (11,12)
; (13)
; (1,2,3) ;give the result in 3q29
; (4)
;1q31
; ||
;1q31
;(((k5*z+k4)*z+k3)z+k2)z+k1
;((((k5*z+k4)*z+k3)z+k2)z+k1)x 2q30
;1q31
.word 0x6487ED51;PI/2 in Q30
coef_sin:.word
0xffffffca,0x000005c7,0xfffe5fe6,0x00444444,0xfaaaaaab,0x20000000;1Q31
and 3Q29
User Guide
38
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
3.7
Sine (y =sin(x)), range [-Pi, Pi)
Equations:
Sin(x)=k1*x+k2*x3+k3*x5+k4*x7+k5*x9+…
Input: [-1,1) in 1Q15
Output: [-1,1) in 1Q31
By using the previous sine, we can get the result for the range [-Pi, Pi] with:
• Sin(Pi-x) = Sin(x)
0x4000
0x3fff
Change of variable:
x
If 0x4000<x<0xbfff f Then x’=(0x7fff-x)<<1
0x7fff
0x0000
If 0xc000<x<0x3ff Then x’=x<<1
0x8000
Y=Sine(x’)
0xbfff
0xc000
routine_sec3_7a.eps
Pseudo code:
for(i=0;i<5;i++) lY *= lX + lK[n]
IP= 1 (1 madd)
User Guide
LD/ST= 1 (1 ld sK)
39
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Assembly code:
lea
lea
ld.q
a2,coef_sin0
a3,0x04
d4,[a4+]2
;change of variable
movh
d9,#0x8000
xor.t
d2,d4:31,d4:30
jz
d2,lab
add
d4,d9,d4
rsub
d4,d4,#0
lab:
sh d4,#1
ld.w
d2,LX1
mul.q
mul.q
ld.w
ld.w
llloop:
d4,d4,d2,#1
d1,d4,d4,#1
d2,[a2+]4
d8,[a2+]4
madds.q
;
;
;
;
load address of first coeff
initialize the counter
load number we would like
the sine 1Q15
;
;
;
;
;
;
(4)
(5)
(6,7)
(8)
(9)
(10)
;
;
;
;
;
;
;
-1
0x4000<x<0xbfff ?
if not, go to lab
else x = x-1
x = -(x-1) = 1-x
x’=x<<1
end change of variable
;
;
;
;
;
;
(11)
load the factor of norm (PI/2)
(12,13,14) ; x = x*a 2q30
(15,16)
; z = x*x, 3q29
||
; load k51q31
(17)
; load k41q31
d2,d8,d2,d1,#1
loop
mul.q
sh
d2,d2,#2
ld.w
d8,[a2+]4
a3,llloop
d6,d2,d4,#1
shas
d6,d6,#1
LX1:
; (1)
; (2)
; (3)
;
;
;
;
;
;
;
;
;
;
;
2Q30
(1,2,3)
give the result in 3q29
1q31
1q31
(((k5*z+k4)*z+k3)z+k2)z+k1
(4)
||
||
(18,19)
((((k5*z+k4)*z+k3)z+k2)z+k1)x 2q30
(20)
; 1q31
.word 0x6487ED51;PI/2 in Q30
coef_sin0:.word
0xffffffca,0x000005c7,0xfffe5fe6,0x00444444,0xfaaaaaab,0x20000000;1Q31
and 3Q29
User Guide
40
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Cycles Code Optimization Techniques
Size
Packed
Operation
Load/Store
Scheduling
Data Memory
Interleaving
Packed
Load/Store
Saturation
Rounding
16-bit signed
Multiplication
3,4
16
-
-
-
-
-
-
-
9
32-bit signed
Multiplication
32-bit result
5
16
-
-
-
-
-
-
-
-
32-bit signed
Multiplication
64-bit result
5
16
-
-
-
-
-
-
-
-
“C” integer
multiplication
5
16
-
-
-
-
-
-
-
-
16-bit update
5
20
-
-
-
-
-
-
9
9
32-bit update
6,5
20
-
-
-
-
-
-
9
-
2nd order diff.
4
equation (16-bit)
24
-
-
9
-
-
9
-
-
2nd order diff.
4
equation (32-bit)
24
-
-
9
-
-
9
-
-
Complex
multiplication
5
20
-
-
-
-
-
-
-
-
Complex
multiplication
(packed)
5
14
-
-
9
-
-
9
9
9
Complex update 7
24
-
-
-
-
-
-
-
-
Complex update 6
(packed)
18
-
-
9
-
-
9
9
9
Bytes
Name
Loop
Unrolling
Scalars
Software
Pipelining
4
User Guide
41
Arithmetic
Methods
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.1
16-bit signed Multiplication
Pseudo code:
sZ = sX*sK;
IP= 1 (1 mul)
LD/ST= 3 (read lX, read lK, write lZ)
Assembly code:
;result = 16-bit
ld.q
sX, Xaddr
ld.q
sK, Kaddr
mul.q
lZ,sX u,sK u,#1
st.q
Zaddr,sZ
;
;
;
;
(1)
(2)
(3)
||
;
;
; left justified
; store 16-bit upper
;same with result = 32-bit
ld.q
sX, Xaddr
ld.q
sK, Kaddr
mul.q
lZ,sX u,sK u,#1
st.w
Zaddr,lZ
;
;
;
;
(1)
(2)
(3)
||
;
;
; left justified
; store 32-bit
(1)
(2)
(3,4)
||
;
;
; left justified
; store 16-bit upper
;same with result = rounded 16-bit
ld.q
sX, Xaddr
;
ld.q
sK, Kaddr
;
mulr.q
sZ,sX u,sK u,#1
;
st.q
Zaddr,sZ
;
User Guide
42
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.2
32-bit signed Multiplication
Pseudo code:
lZ = lX*lK;
IP= 1 (1 mul)
LD/ST= 3 (read lX, read lK, write lZ)
Assembly code:
; lX,lK,lZ 32-bit signed
ld.w
lX, Xaddr
ld.w
lK, Kaddr
mul.q
lZ,lX,lK,#1
st.w
Zaddr,lZ
4.3
;
;
;
;
(1)
(2)
(3,4,5)
||
;
;
;
;
32-bit signed Multiplication (Result on 64-bit)
Pseudo code:
llZ = lX*lK; all signed
IP= 1 (1 mul)
LD/ST= 3 (read lX, read lK, write llZ)
Assembly code:
; lX,lK 32-bit signed, llZ 64-bit signed
ld.w
lX, Xaddr
; (1)
ld.w
lK, Kaddr
; (2)
mul.q
llZ,lX,lK,#1
; (3,4,5)
st.d
Zaddr,llZ
; ||
User Guide
43
;
;
;
; same number of cycles
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.4
‘C’ Integer Multiplication
Pseudo code:
This multiplication takes the lower part of the result (MUL), whereas previous
multiplications take the upper part (MUL.Q).
lZ = lX*lK;
IP= 1 (1 mul)
LD/ST= 3 (read lX, read lK, write lZ)
Assembly code:
; lX,lK,lZ 32-bit signed
ld.w
lX, Xaddr
ld.w
lK, Kaddr
mul
lZ,lX,lK
st.w
Zaddr,lZ
;
;
;
;
(1)
(2)
(3,4,5)
||
;
;
;
;
; lX,lK,lZ 32-bit unsigned
ld.w
lX, Xaddr
; (1)
;
ld.w
lK, Kaddr
; (2)
;
mul
lZ,lX,lK
; (3,4,5) ;
; unsigned multiplication gives the same result
; as a signed multiplication. That is why there is no MUL.U instruction.
st.w
Zaddr,lZ
; ||
;
; lX,lK,lZ 32-bit signed (saturated result)
ld.w
lX, Xaddr
; (1)
;
ld.w
lK, Kaddr
; (2)
;
muls
lZ,lX,lK
; (3,4,5) ;
st.w
Zaddr,lZ
; ||
;
; lX,lK,lZ 32-bit unsigned (saturated result)
ld.w
lX, Xaddr
; (1)
;
ld.w
lK, Kaddr
; (2)
;
muls.u
lZ,lX,lK
; (3,4,5) ;
st.w
Zaddr,lZ
; ||
;
User Guide
44
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.5
16-bit Update
Pseudo code:
sZ = sY + sX*sK;
IP = 1 (1 madd)
LD/ST= 4
(read sX, read sK, read sY, write sZ)
Assembly code:
;result = 32-bit
ld.q
sX,Xaddr
ld.q
sK,Kaddr
ld.q
sY,Yaddr
madds.q
lZ,sY,sX u,sK u,#1
st.q
Zaddr,sZ
;
;
;
;
;
(1)
(2)
(3)
(4,5)
||
;
;
;
;
;
;same with result = rounded 16-bit
ld.q
sX,Xaddr
;
ld.q
sK,Kaddr
;
ld.q
sY,Yaddr
;
maddrs.q sZ,sY,sX u,sK u,#1
;
st.q
Zaddr,sZ
;
(1)
(2)
(3)
(4,5)
||
;
;
;
;
;
The pseudo code sZ = sY + sX*sK would have been written sZ = sY + (long) (sX*sK),
since the result of a 16*16-bit multiplication is a 32-bit result in hardware.
sZ = sY + (long) (sX*sK) is actually equivalent to sZ = sY + (short) (sX*sK). The proof
can be seen in the following figure:
C
16
0000
C
16
P
X
R
16
32
P
not necessary
+
R
routines_sec4_5a.eps
Since the lower part of C is only zeros, it will not change the lower part of P. We can
therefore say that the result of the 16*16-bit multiplication is a short value. The pseudo
code is:
sZ = sY + sX*sK
User Guide
45
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.6
32-bit Update
Pseudo code:
lY += lX*lK;
IP = 1 (1 madd)
LD/ST= 4
(read lX, read lK, read lY, write lY)
Assembly code:
;32*32-bit multiplication, result = 32-bit
ld.w
lX,Xaddr
; (1)
ld.w
lK,Kaddr
; (2)
ld.w
lY,Yaddr
; (3)
madds.q
lY,lY,lX,lK,#1
; (4,5,6)
st.w
Yaddr,lY
; ||
;
;
;
;
;
A 32-bit update can not only be executed with a 32*32-bit multiplication, but also with a
16*32-bit multiplication and a 16*16-bit multiplication.
16
16
16
32
32
32
32
X
32
+
32
routines_sec4_6a.eps
;16*32-bit multiplication, result = 32-bit
ld.q
sX,Xaddr
; (1)
ld.w
lK,Kaddr
; (2)
ld.w
lY,Yaddr
; (3)
madds.q
lY,lY,lK,sX u,#1
; (4,5)
st.w
Yaddr,lY
; ||
;
;
;
;
;
;16*16-bit multiplication, result
ld.q
sX,Xaddr
ld.w
lK,Kaddr
ld.w
lY,Yaddr
madds.q
lY,lY,sK u,sX u,#1
st.w
Yaddr,lY
;
;
;
;
;
User Guide
=
;
;
;
;
;
32-bit
(1)
(2)
(3)
(4,5)
||
46
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.7
2nd Order Difference Equation (16-bit)
Pseudo code:
sY = sX - 2*sX1 + sX2;
IP= 3
(1 add, 1 sub, 1mul)
LD/ST= 4
(read sX, read sX1, read sX2, write sY)
Optimization note:
sXX is 32-bit register holding the 2 16-bit variables sX, sX1
Assembly code:
ld.w
sh
sub
ld.h
add
st.h
sXX, [Xptr]
sX1,sXX,#-15
sY,sXX,sX1
sX2, [Xptr]+4
sY,sY,sX2
Yaddr,sY
;
;
;
;
;
;
(1)
(2)
(3)
||
(4)
||
;
;
;
;
;
;
X1 || X
0000 || X1*2
Y = X - 2*X1
Y = X - 2*X1 + X2
Memory organization:
[Xptr] -->
User Guide
Xaddr
sX
Xaddr + 2
sX1
Xaddr + 4
sX2
47
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.8
2nd Order Difference Equation (32-bit)
Pseudo code:
lY = lX - 2*lX1 + lX2;
IP= 3
(1 add, 1 sub, 1mul)
LD/ST= 4
(read lX, read lX1, read lX2, write lY)
Assembly code:
ld.d
sh
sub
ld.w
add
st.w
lX1/lX, [Xptr]
lX1,lX1,#1
lY,lXX,lX1
lX2, [Xptr]+8
lY,lY,lX2
Yaddr,lY
;
;
;
;
;
;
(1)
(2)
(3)
||
(4)
||
;
;
;
;
;
;
lX1 || lX
lX1*2
lX - 2*lX1
lX - 2*lX1 + lX2
Memory organization:
[Xptr] -->
User Guide
Xaddr
lX
Xaddr + 4
lX1
Xaddr + 8
lX2
48
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.9
Complex Multiplication
Equations:
Yr = Xr*Kr - Xi*Ki
Yi = Xr*Ki + Xi*Kr
Pseudo code:
sYr = sXr*sKr – sXi*sKi;
sYi = sXr*sKi + sXi*sKr;
IP= 4
(2 mul, 1 madd, 1msub)
LD/ST= 6
(read sXr, sXi, sKr, sKi, write sYr, sYi)
is equivalent to (packed format) = 3 (read
sXr||sXi, sKr||sKi, write sYr||sYi)
Assembly code:
mov
ld.w
mov
ld.w
d6,#0
ssK,[Kptr+]4
d7,#0
ssX,[Xptr+]4
;
;
;
;
(1)
||
(2)
||
;
; Ki Kr
;
; Xi Xr
msubadm.h
mulm.h
st.h
st.h
e0,e6,ssK,ssX ul,#1
e2,ssK,ssX lu,#1
[Yptr+]2,sYr
[Yptr+]2,sYi
;
;
;
;
(3)
(4)
||
(5)
;
;
;
;
Yr = Xr*Kr - Xi*Ki
Yi = Xr*Ki + Xi*Kr
store Yr
store Yi
Note: TriCore MULM.H does not have a direct subtraction, only addition. A
MADDSUM.H instruction is used to get the subtraction, with the 3rd source register
set to 0 (e6). The 2 results are not packed in one register, and so two stores are
required.
User Guide
49
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1/ d0
d3/ d2
d5
d4
d7/ d6 Load /
Store
0
kikr
ld krki
0
xixr
msubadm.h
yr=xr*kr-xi*ki
e0,e6,ssK,ssX ul,#1
mulm.h
e2,ssK,ssX lu,#1
User Guide
ld xrxi
st yr
yi=xr*ki+xi*kr
50
st yi
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.10
Complex Multiplication (Packed)
Equations:
Yr = Xr*Kr - Xi*Ki
Yi = Xr*Ki + Xi*Kr
Pseudo code:
sYr = sXr*sKr – sXi*sKi;
sYi = sXr*sKi – sXi*sKr;
IP= 4
(2 mul, 1 madd, 1 msub)
LD/ST= 6
(read sXr, sXi, sKr, sKi, write sYr, sYi)
is equivalent to (packed format) = 3 (read
sXr||sXi, sKr||sKi, write sYr||sYi)
Pseudo code implementation:
sYi = sXr*sKi; sYr = sXr*sKr;
sYi += sXi*sKr; sYr -= sXi*sKi;
Assembly code:
ld.w
ld.w
mulr.h
maddsurs.h
st.w
ssK,[Kptr+]4
;
ssX,[Xptr+]4
;
ssY,ssK,ssX ll,#1
;
ssY,ssY,ssK,ssX uu, #1 ;
[Yptr+],ssY
;
(1)
(2)
(3)
(4,5)
||
;
;
;
;
;
Ki Kr
Xi Xr
Yi =Xr*Ki || Yr = Xr*Kr
Yi +=Xi*Kr || Yr -= Xi*Ki
store Yi Yr
Note: In this example we save one store because the computed results are packed in
one register. Rounding is used to pack them.
Register diagram:
Instruction
d0
d5
d4
ki kr
ld krki
xi xr
mulr.h
ssY,ssK,ssX ll,#1
yi = xr*ki ||yr = xr*kr
maddsurs.h
ssY,ssY,ssK,ssX uu, #1
yi += xi*kr ||yr -= xi*ki
User Guide
51
Load / Store
ld xrxi
st yiyr
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.11
Complex Update
Equations:
Zr = Yr + Xr*Kr - Xi*Ki
Zi = Yi + Xr*Ki + Xi*Kr
Pseudo code:
sZr = sYr + sXr*sKr - sXi*sKi;
sZi = sYi + sXr*sKi + sXi*sKr;
IP= 4 (3 madd, 1 msub)
LD/ST= 8
(read sXr, sXi, sKr, sKi, sYr, sYi, write
sZr, sZi) equivalent to (packed format)= 4
(read sXr|| sXi, sKr || sKi, sYr || sYi,
write sZr || sZi)
Assembly code:
movh
ld.w
ld.w
ld.h
msubadm.h
ld.h
st.h
maddm.h
st.h
d2,#0
ssK,[Kptr+]4
ssX,[Xptr+]4
sYr,[Yptr+]2
e0,e2,ssK,ssX ul, #1
sYi,[Yptr+]2
[Zptr+]2,sZr
e0,e2,ssK,ssX lu,#1
[Zptr+]2,sZi
;
;
;
;
;
;
;
;
;
(1)
||
(2)
(3)
(4,5)
||
||
(6,7)
||
;
;
;
;
;
;
;
;
;
Ki Kr
Xi Xr
Yr
Zr = Yr + Xr*Kr - Xi*Ki
Yi
store Zr
Zi = Yi + Xr*Ki + Xi*Kr
store Zi
The MSUBADM.H instruction is used because the memory organization is imaginary ||
real. The equation is zr = yr – (xi*ki - xr*kr), which is equivalent to zr = yr + xr*kr - xi*ki.
User Guide
52
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1/ d0
d2
d3
0
d5
d4
kikr
ld kikr
xixr
msubadm.h
e0,e2,ssK,ssX ul,#1
zr = yr + xr*kr - xi*ki
Load / Store
ld xixr
yr
ld yr
yi
ld yi
st zr
maddm.h
e0,e2,ssK,ssX lu,#1
zi = yi + xr*ki + xi*kr
st zi
User Guide
53
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
4.12
Complex Update (Packed)
Equations:
Zr = Yr + Xr*Kr - Xi*Ki
Zi = Yi + Xr*Ki + Xi*Kr
Pseudo code:
sZr = sYr + sXr*sKr - sXi*sKi;
sZi = sYi + sXr*sKi + sXi*sKr;
IP= 4 (3 madd, 1 msub)
LD/ST= 8
(read sXr, sXi, sKr, sKi, sYr, sYi, write
sZr, sZi) equivalent to (packed format)= 4
(read sXr|| sXi, sKr || sKi, sYr || sYi,
write sZr || sZi)
Pseudo code implementation:
sZi = sYi + sXr * sKi; sZr = sYr + sXr * sKr;
sZi += sXi * sKr; sZr -= sXi * sKi;
Assembly code:
ld.w
ld.w
ld.w
maddrs.h
maddsurs.h
st.w
ssK,[Kptr+]4
; (1)
ssX,[Xptr+]4
; (2)
ssY,[Yptr+]4
; (3)
ssZ,ssY,ssK,ssX ll,#1 ; (4)
ssZ,ssZ,ssK,ssX uu,#1 ; (5,6)
[Zptr+]4,ssZ
; ||
;
;
;
;
;
;
Ki Kr
Xi Xr
Yi Yr
Zi = Yi+Xr*ki || Zr = Yr+Xr*Kr
Zi +=Xi*kr || Zr -= Xi*Ki
store Zi Zr
Note: Rounding is used to pack the 2 results in 1 register.They can be stored in one
instruction.
User Guide
54
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
d2
d4
xixr
maddrs.h
ssZ,ssY,ssK,ssX ll,#1
zi=yi+xr*ki ||zr=yr+xr*kr
maddsurs.h
ssZ,ssZ,ssK,ssX uu,#1
zi=zi+xi*kr || zr=zr-xi*ki
zizr
User Guide
d5
Load / Store
kikr
ld kikr
ld xixr
yiyr
ld yiyr
yiyr
ld yiyr
st zizr
55
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Data memory interleaving
Packed Load/store
Saturation
Rounding
Code
Size 1)
Optimization
Techniques
Vector sum
(3*N/4 +2)+ 3 32
9 9
9
9
-
9
9
-
Vector multiplication
(3*N/4 +2) +3 32
9 9
9
9
-
9
-
-
Vector pre-emphasis (3*N/4 +2) +3 40
9 9
9
9
-
9
9
9
Vector square
difference
(3*N/2 +2) +2 24
9 -
9
9
9
9
9
9
Vector complex
multiplication
(5*N +2) +3
28
-
-
9
-
-
9
-
-
Vector complex
multiplication
(packed)
(4*N +2) +3
24
-
-
9
-
-
9
9
9
Vector complex
(2*N +2) +4
multiplication (unroll)
42
9 9
9
-
9
9
9
9
Color space
conversion
11
64
9 9
9
-
9
9
-
-
Vector scaling
(2*N/2 +2) +2 28, 20
-
-
-
-
-
9
9
9
-
-
9
-
-
9
-
-
Vector normalization (2*N/2 +2) + 54
(2*N/2 +2) +7
1)
Loop Unrolling
Cycles
Software Pipelining
Name
Load/store Scheduling
Vectors
Packed Operation
5
Arithmetic
Methods
Code Size is in Bytes
User Guide
56
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
5.1
Vector Sum
Equation:
Zn = Vn + Wn
n = 0..N-1
Pseudo code:
for (n=0; n<N; n++) sZ[n] = sV[n] + sW[n];
IP= 1 (1 add)
LD/ST= 3 (read sV, read sW, write sZ)
Assembly code:
lea
ld.w
ld.d
vadloop:
loop
LC,(N/4–1)
ssV0,[Vptr+]4
ssssW,[Wptr+]8
adds.h
ld.d
adds.h
ld.d
st.d
LC,vadloop
Example
ssZ0,ssV0,ssW0
ssssV,[Vptr+]8
ssZ1,ssV1,ssW1
ssssW,[Wptr+]8
[Zptr+]8,ssssZ
; (1)
; (2)
; (3)
; get loop number
; V0 V1
; W0 W1 W2 W3
;
;
;
;
;
;
;
;
;
;
(1)
||
(2)
||
(3)
V1+W1
V2 V3
V3+W3
W4 W5
store
||
V4
||
W6
Z0
V0+W0
V5
V2+W2
W7
Z1 Z2 Z3
N = 64 Î 53 cycles
Register diagram:
Instruction
d1 / d0
d5 / d4
v1 v0 _ _
d7/ d6
Load / Store
ld v0v1
w3 w2 w1 w0 ld w0w1w2w3
adds.h
ssZ1,ssV2,ssW1
d1= ------------v5 v4 v3 v2
d0= v1+w1 || v0+w0
adds.h
ssZ2,ssV1,ssW2
d1= v3+w3 || v2+w2
w7 w6 w5 w4 ld w4w5w6w7
d1= z3 z2
d0= z1 z0
User Guide
ld v2v3v4v5
st z0z1z2z3
57
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
5.2
Vector Multiplication
Equation:
Zn = Vn * Wn
n = 0..N-1
Pseudo code:
for (n=0; n<N; n++)
sZ[n] = sV[n] * sW[n];
IP= 1 (1 mul)
LD/ST= 3 (read sV, read sW, write sZ)
Assembly code:
lea
ld.w
ld.d
LC,(N/4 - 1)
ssV0,[Vptr+]4
ssssW,[Wptr+]8
; (1)
; (2)
; (3)
; get loop number
; V0 V1
; W0 W1 W2 W3
vect2loop:
loop
mulr.h ssZ0,ssV0,ssW0 ul,#1
ld.d
ssssV,[Vptr+]8
mulr.h ssZ1,ssV1,ssW1 ul,#1
ld.d
ssssW,[Wptr+]8
st.d
[Zptr+]8,ssssZ
LC,vect2loop
Example
;
;
;
;
;
(1)
||
(2,3)
||
||
;
;
;
;
;
V1*W1
V2 V3
V3*W3
W4 W5
store
||
V4
||
W6
Z0
V0*sW0
V5
V2*W2
W7
Z1 Z2 Z3
N = 64 Î 53 cycles
Register diagram:
Instruction
d1/ d0
d5 / d4
v1 v0 _ _
d7/ d6
Load / Store
ld v0v1
w3 w2 w1 w0 ld w0w1w2w3
mulr.h
_ _ z1 z0
ssZ0,ssV0,ssW0 ul,#1
v5 v4 v3 v2
mulr.h
z3 z2 z1 z0
ssZ1,ssV1,ssW1 ul,#1
w7 w6 w5 w4 ld w4w5w6w7
z3 z2 z1 z0
User Guide
ld v2v3v4v5
st z0z1z2z3
58
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
5.3
Vector Pre-emphasis
Equation:
Zn = Vn + Xn*K
n = 0..N-1
K = -28180
Pseudo code:
for (n=0; n<N; n++)
sZ[n] = sV[n] + sX[n]*sK;
IP= 1 (1 madd)
LD/ST= 3 (read sX, read sV, write sZ)
Assembly code:
lea
mov.u
ld.w
addih
ld.d
LC,(N/4 –1)
d6,#0x91ec
ssX0,[Xptr+]4
d6,d6,#0x91ec
ssssV,[Vptr+]8
preloop:
maddrs.h
ld.d
maddrs.h
loop
ld.d
st.d
LC,preloop
Example
;
;
;
;
;
(1)
(2)
||
(3)
||
;get loop number
;K=-28180
;X0 X1
;K || K
;V0 V1 V2 V3
ssZ0,ssV0,ssX0,d6 ul,#1 ;(1); Z1=V1+X1*K ||Z0=V0+X0*K
ssssX,[Xptr+]8
; ||
;X2 X3 X4 X5
ssZ1,ssV1,ssX1,d6 ul,#1 ;(2,3)
;Z3=V3+X3*K ||Z2=V2+X2*K
ssssV,[Vptr+]8
; ||
;V4 V5 V6 V7
[Zptr+]8,ssssZ
; ||
;store Z0 Z1 Z2 Z3
N = 160 Î 125 cycles
Register diagram:
Instruction
d1/ d0
d3 / d2
d5 / d4
x1x0 _ _
d6
Load / Store
k
ld x0x1
v3 v2 v1 v0 k||k
maddrs.h
ssZ0,ssV0,ssX0,d6 ul,#1
_ _ z1 z0 x5 x4 x3 x2
maddrs.h
ssZ1,ssV1,ssX1,d6 ul,#1
z3 z2 z1
z0
ld v0v1v2v3
ld x2x3x4x5
v7 v6 v5 v4
ld v4v5v6v7
st z0z1z2z3
User Guide
59
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
5.4
Vector Square Difference
Equation:
Zn = (Xn-Yn)2
n = 0..N-1
Pseudo code:
for (n=0; n<N; n++)
{
sTmp = sX[n]-sY[n];
sZ[n] = sTmp*sTmp;
}
IP= 2 (1 sub, 1 mul)
LD/ST= 3 (read sX, read sY, write sZ)
Assembly code:
lea
ld.d
sqdloop:
LC, (N/2 - 1)
ssssXY, [Xptr+]8
subs.h
ld.d
mulr.h
; (1) ; get loop number
; (2) ; X0 X1 Y0 Y1
ssTmp,ssX,ssY
; (1) ; X1 – Y1 || X0 – Y0
ssssXY, [Xptr+]8 ; || ; X2 X3 Y2 Y3
ssZ, ssTmp,ssTmp ul,#1 ; (2,3)
; (X1–Y1)^2 || (X0–Y0)^2
[Zptr+]4,ssZ
; || ; store Z0 Z1
st.w
loop LC, sqdloop
Memory organization:
Xaddr
sX0
Xaddr + 2
sX1
Xaddr + 4
sY0
Xaddr + 6
sY1
Xaddr + 8
sX2
Xaddr + 10
etc…
Example
N = 160 Î 244 cycles
User Guide
60
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
subs.h
ssTmp,ssX,ssY
mulr.h ssZ,
ssTmp,ssTmp ul,#1
d1
d3 / d2
Load / Store
y1 y0 x1 x0
ld x0x1y0y1
x1–y1 || x0–y0 y3 y2 x3 x2
ld x2x3y2y3
(x1–y1)^2 ||
(x0–y0)^2
st z0z1
User Guide
61
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
5.5
Vector Complex Multiplication
Equations:
Yrn = Xrn*Krn - Xin*Kin
n = 0..N-1
Yin = Xrn*Kin + Xin*Krn
Pseudo code:
for (n=0; n<N; n++)
{
sYr[n] = sXr[n]*sKr[n]-sXi[n]*sKi[n];
sYi[n] = sXr[n]*sKi[n]+sXi[n]*sKr[n];
}
IP= 4
(2 mul, 1 msub, 1 madd)
LD/ST= 3 in packed format (read sXr || sXi,
read sKr|| sKi, write sYr || sYi)
Assembly code:
lea
mov
mov
nmloop:
loop
Example
User Guide
LC, (N – 1)
d6,#0
d7,#0
ld.w
ld.w
msubadm.h
mulm.h
st.h
st.h
LC,nmloop
ssK,[Kptr+]4
ssX,[Xptr+]4
e0,e6,ssK,ssX ul,#1
e2,ssK,ssX lu,#1
[Yptr+]2,sYr
[Yptr+]2,sYi
; (1)
; (2)
; (3)
; get loop number
; clear 3rd source
; clear 3rd source
;
;
;
;
;
;
;
;
;
;
;
;
(1)
(2)
(3)
(4)
||
(5)
load K
load X
Yr= Xr*Kr-Xi*Ki
Yi= Xr*Ki+Xi*Kr
store Yr
store Yi
N = 64 Î 325 cycles
62
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1 / d0 d3 / d2 d4
d5
d7/ d6
Load / Store
d6=0
d7=0
ki kr
xi xr
msubadm.h
e0,e6,ssK,ssX ul,#1
mulm.h
e2,ssK,ssX lu,#1
ld kikr
ld xixr
yr
yi
st yr
st yi
User Guide
63
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
5.6
Vector Complex Multiplication (Packed)
Equations:
Yrn = Xrn*Krn - Xin*Kin
n = 0..N-1
Yin = Xrn*Kin + Xin*Krn
Pseudo code:
for (n=0; n<N; n++)
{
sYr[n] = sXr[n]*sKr[n]-sXi[n]*sKi[n];
sYi[n] = sXr[n]*sKi[n]+sXi[n]*sKr[n];
}
IP= 4
(2 mul, 1 msub, 1 madd)
LD/ST= 3 in packed format (read sXr || sXi,
sKr|| sKi, write sYr || sYi)
Pseudo code implementation:
for (n=0; n<N; n++)
{
sYr[n] = sXr[n]*sKr[n]; sYi[n] = sXr[n]*sKi[n];
sYr[n] -= sXi[n]*sKi[n]; sYi[n] += sXi[n]*sKr[n];
}
Assembly code:
lea
ld.w
ld.w
nloop:
LC,(N – 1)
ssK,[Kptr+]4
ssX,[Xptr+]4
mulr.h
maddsurs.h
loop
ld.w
ld.w
st.w
LC,nloop
Example
User Guide
; (1)
; (2)
; (3)
; get loop number
; Ki Kr
; Xi Xr
ssY,ssK,ssX ll,#1
; (1)
; Yi =Xr*Ki || Yr = Xr*Kr
ssY,ssY,ssK,ssX uu,#1
; (2,3)
; Yi+=Xi*Kr || Yr -= Xi*Ki
ssK,[Kptr+]4
; ||
; Ki1 Kr1
ssX,[Xptr+]4
; ||
; Xi1 Xr1
[Yptr+]4,ssY
; (4)
; store Yi Yr
N = 64 Î 261 cycles
64
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
d5
d4
kikr
ld kikr
xixr
mulr.h
ssY,ssK,ssX ll,#1
ld xixr
yi=xr*ki || yr=xr*kr
maddsurs.h
yi=yi+xi*kr||yr=yr-xi*ki
ssY,ssY,ssK,ssX uu,#1
kikr
ld kikr
xixr
yiyr
User Guide
d7/ d6 Load / Store
ld xixr
st yiyr
65
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
5.7
Vector Complex Multiplication (Unrolled)
Equations:
Yrn = Xrn*Krn - Xin*Kin
n = 0..N-1
Yin = Xrn*Kin + Xin*Krn
Pseudo code:
for (n=0; n<N; n++)
{
sYr[n] = sXr[n]*sKr[n]-sXi[n]*sKi[n];
sYi[n] = sXr[n]*sKi[n]+sXi[n]*sKr[n];
}
IP= 4
(2 mul, 1 msub, 1 madd)
LD/ST= 3 in packed format (read sXr || sXi,
sKr|| sKi, write sYr || sYi)
Pseudo Code implementation:
for (n=0; n<N/2-1; n++)
{
sYr[2*n]
= sXr[2*n]*sKr[2*n]; sYi[2*n] = sXr[2*n]*sKi[2*n];
sYr[2*n]
-= sXi[2*n]*sKi[2*n]; sYi[2*n] += sXi[2*n]*sKr[2*n];
sYr[2*n+1] = sXr[2*n+1]*sKr[2*n+1];sYi[2*n+1] = sXr[2*n+1]*sKi[2*n+1];
sYr[2*n+1] -= sXi[2*n+1]*sKi[2*n+1];sYi[2*n+1] += sXi[2*n+1]*sKr[2*n+1];
}
Assembly code:
lea
ld.w
ld.d
cxloop:
LC,(N/2–1)
ssK0,[Kptr+]4
ssssX,[Xptr+]8
mulr.h
st.w
maddsurs.h
ld.d
mulr.h
st.w
maddsurs.h
loop
st.w
User Guide
;(1)
;(2)
;(3)
;get loop number
;Ki0 Kr0
;Xi1 Xr1 Xi0 Xr0
ssY0,ssK0,ssX0 ll,#1;(1) ;Yi0 =Xr*Ki||Yr0=Xr*Kr
[Yptr+]4,ssY1
;||
;store former Yi1 Yr1
ssY0,ssY0,ssK0,ssX0 uu,#1 ;(2)
;Yi0 +=Xi*sKr ||Yr0 -= Xi*Ki
ssssK,[Kptr+]8
;||
;Ki2 Kr2 Ki1 Kr1
ssY1,ssK1,ssX1 ll,#1;(3) ;Yi1 =Xr*Ki ||Yr1=Xr*Kr
[Yptr+]4,ssY0
;||
;store Yi0 Yr0
ssY1,ssY1,ssK1,ssX1 uu,#1 ;(4)
;Yi1 +=Xi*Kr ||Yr1 -= Xi*Ki
ssssX,[Xptr+]8
;||
;Xi3 Xr3 Xi2 Xr2
ld.d
LC,cxloop
[Yptr],ssY1 ; (4)
;store last Yi1 Yr1
66
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
d1
d5
d4
d7
d6
Load /
Store
ki0 kr0 ld k0
x x
xi1 xr1
xi0 xr0
mulr.h
ssY0,ssK0,ssX0 ll,#1
yi0=xr0*ki0 ||
yr0=xr0*kr0
ld x0 x1
st y1
maddsurs.h
yi0 +=xi0*kr0 ||
ssY0,ssY0,ssK0,ssX0uu,#1 yr0 -= xi0*ki0
ki2 kr2 ld k1 k2
ki1 ki1
mulr.h
ssY1,ssK1,ssX1 ll,#1
yi1=xr1*ki1 ||
yr1=xr1*kr1
st y0
maddsurs.h
ssY1,ssY1,ssK1,ssX1uu,#1
yi1+=xi1*kr1 || xi3 xr3
yr1 -= xi1*ki1 xi2 xr2
ld x2 x3
st y1
User Guide
67
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
5.8
Color Space Conversion
Equation:
X
A
Y
Cr
=
Cb
0.257
0.504
0.098
0.439
-0.368
-0.071
-0.148
-0.291
0.439
*
R’
K
R
0*
G
B
+
0.5
0.5
* 0 for UPF format, 16 for CCIR 601.
Xi = ∑ Aij * R’j + Kj
for i, j = 0..n
Pseudo code:
sY
= 0.257*sR + 0.504*sG + 0.098*sB;
sCr = 0.439*sR - 0.368*sG - 0.071*sB + 0.5;
sCb = -0.148*sR - 0.291*sG + 0.439*sB + 0.5;
RGB belongs to [0; +1[ ([0; 256[) so YCrCb will be in [0; +1[ ([0; 256[).
(Assembly code and Register diagram follow)
User Guide
68
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Assembly code:
.define
.define
.define
.define
.define
RGBptr "a0"
OFFSET128 "0x2000"
llY "e8"
llCr "e10"
llCb "e12"
lea
lea
lea
RGBptr,rgbvalue
Kptr,kmatvalue
a4,stvalue
mov
ld.d
mov.u
ld.d
d0,#0
e2,[RGBptr+]6
d1,#OFFSET128
e4,[Kptr+]6
;
;
;
;
mulm.h
madd.q
llY,d2,d4ul,#1
llY,llY,d3l,d5l,#1
; (3)
; llY = sR*sK00 + sG*sK01
; (4,5) ; llY = llY + sB*sK02
ld.d
maddm.h
madd.q
e4,[Kptr+]6
llCr,e0,d2,d4ul,#1
llCr,llCr,d3l,d5l,#1
; ||
; sK10,sK11,sK12
; (6)
; llCr=sR*sK10+sG*sK11+128
; (7,8) ; llCr=llCr+sB*sK12
ld.d
maddm.h
st.h
madd.q
st.h
st.h
e4,[Kptr+]6
llCb,e0,d2,d4ul,#1
[a4+]2,d9
llCb,llCb,d3l,d5l,#1
[a4+]2,d11
[a4+]2,d13
;
;
;
;
;
;
kmatvalue:.half
.half
.half
rgbvalue: .half
User Guide
0x20e6,0x4803,0x0c83
0x3831,0xd0e5,0xf6e9
0xed0e,0xdcc1,0x3831
0x4000,0x3000,0x2000
69
(1)
||
(2)
||
||
(9)
||
(10)
||
(11)
;
; sRsGsB
; in Q14 with 16-bits of sign
; sK00,sK01,sK02
;
;
;
;
;
;
sK20,sK21,sK22
llCb=sR*sK20+sG*sK21+128
store sY
llCb=llCb+sB*sK22
store sCr
store sCb
;
;
;
;
abcin Q15
defin Q15
ghiin Q15
RGB are [0;+1[ in Q14
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1/d0
d3/d2 d5/
d4
0
R’G’B’
offset128
Load/
Store
ld k00,
k01, k02
Y=R*k00
+G*k01
k10, Y = Y +
k11, B*k02
k12
ld k10,
k11, k12
Cr=R*k1
0+
G*k11+1
28
k20,
k21,
k22
maddm.h
llCb,e0,d2,d4ul,#1
Cr=Cr+B*
k12
Y
madd.q
llCb,llCb,d3l,d5l,#1
User Guide
d13/
d12
k00,
k01,
k02
maddm.h
llCr,e0,d2,d4ul,#1
madd.q
llCr,llCr,d3l,d5l,#1
d11/d10
ld R’G’B’
mulm.h
llY,d2,d4ul,#1
madd.q
llY,llY,d3l,d5l,#1
d9/d8
Cb=
st Y
R*k20+
G*k21+
128
Cr
70
ld k20,
k21, k22
Cb=
Cb+B*
k22
st Cr
Cb
st Cb
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
5.9
Vector Scaling
Equation:
Zn = (Xn >> 3) << 2
or Zn = (Xn >> shift1) << shift2
n = 0..N-1
Pseudo code:
for (n=0; n<N; n++)sZ[n] = (sX[n] >>3) << 2;
IP= 2 (2 shifts)
LD/ST= 2 (read sX, write sZ)
Assembly code:
mov
lea
mov
ld.w
isloop:
loop
Shift1,#-3
LC,(N/2 - 1)
Shift2,#2
ssX,[Xptr+]4
;
;
;
;
(1)
||
(2)
||
;
;
;
;
load 1st shift value
get loop number
load 2nd shift value
X0 X1
sha.h
ld.w
sha.h
t.w
LC,isloop
;
;
;
;
(1)
||
(2)
||
;
;
;
;
X0>>3, X1>>3
X2 X3
X0<<2, X1<<2
store Z0,Z1
d1,ssX,Shift1
ssX,[Xptr+]4
ssZ,d1,Shift2
[Zptr+]4,ssZ
Alternatively, the 2 shift values can be directly used
lea
LC,(N/2 - 1)
ld.w
ssX,[Xptr+]4
isloop:
sha.h
d1,ssX,#-3
ld.w
ssX,[Xptr+]4
sha.h
ssZ,d1,#2
st.w
[Zptr+]4,ssZ
loop
LC,isloop
Example
User Guide
; (1) ; get loop number
; (2) ; X0 X1
;
;
;
;
(1)
||
(2)
||
;
;
;
;
X0>>3, X1>>3
X2 X3
X0<<2, X1<<2
store Z0,Z1
N = 160 Î 164 cycles
71
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1/ d0
d2
d3
d4
Shift1
d7/ d6
Load /
Store
ld shift1
Shift2
ld shift2
x1x0
ld x0x1
sha.h d1,ssX,Shift1
x1>>3 || x0>>3 x3x2
ld x2x3
sha.h ssZ,d1,Shift2
x1<<2 || x0<<2
st z0z1
User Guide
72
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
5.10
Vector Normalization
Equations:
(1) minex = minimum (minex,exponent (Xn))
(2) Xn = Xn << minex
n = 0..N-1
n = 0..N-1
Pseudo code Equation (1):
sMin = 32;
for (n = 0; n<N; n++)
{
sZ = exponent(sX[n]);
if (sZ < sMin) sMin = sZ; else sMin = sMin;
}
IP= 2 (1 min, 1 count
leading sign)
LD/ST= 1 (read sX)
Pseudo code Equation (2):
for (n = 0; n<N; n++)
sX[n] = sX[n] << sMin;
IP= 1 (1 shift)
LD/ST= 2 (read sX, write sX)
Assembly code:
movh
lea
addi
ld.w
bexloop:
d4,#16
LC,(N/2 - 1)
d4,d4,#16
ssX,[Xptr+]4
cls.h
min.h
ssZ,ssX
d4,d4,ssZ
;
;
;
ssX,[Xptr+]4 ;
ld.w
loop
LC,bexloop
sh
d1,d4,#-16
extr.u
d3,d4,#0,#16
min.h
d3,d1,d3
ld.w
ssX,[X1ptr]
lea
LC, (N/2 - 1)
normloop:
sh.h
ssX,ssX,d3
st.w
[X1ptr+]4,ssX
ld.w
ssX,[X1ptr]
loop
LC,normloop
User Guide
;
;
;
;
;
;
;
;
;
(1)
||
(2)
||
;
;
;
;
d4 upper = max
get loop number
d4 lower = max
X0 X1
(1)
; Z1 =exp.(X1)||Z0 =exp.(X0)
(2)
Min1=min(Z1,Min1)|| Min0=min(Z0,Min0)
||
; X2 X3
(3)
(4)
(5)
(6)
(7)
; (1)
; ||
; (2)
73
;
;
;
;
;
d1 = Min1
d0 = Min0
Min = min(Min1,Min0)
X0 X1
get loop number
; X0 << Min || X1 << Min
; store normalized X0, X1
; load unormalized X2, X3
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Finding minimum exponent loop:
Example
N = 160 Î 169 cycles
Normalization loop:
Example
N = 160 Î 162 cycles
Total:
Example
User Guide
N = 160 Î 331 cycles
74
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6
Filters
If all the possible data types are considered, the list of Filter routines can be very large:
•
•
•
•
16-bit, 32-bit, mixed 16-bit/32-bit
Complex 16-bit, complex 32-bit, complex mixed 16-bit/32-bit
Result accumulated on 32bit or > 32bit
Saturated data type
Most common cases are covered, with the aim of providing a maximum of diversity.
The first 3 instances are not filter routines, but Vector to Scalar operations. These have
a lot in common with Filters since the result is accumulated. The only difference between
a FIR (Finite Impulse Response) filter and a Dot Product for example, is that a FIR is
likely to use Circular Addressing. It therefore makes sense to place these together, as
they will be optimized in the same way.
The next 5 are FIR filter routines, beginning with the more trivial (non-looped) cases and
ending with the complex FIR routine. This routine can be extremely complicated to
optimize, especially since TriCore can perform it in 2 cycles per complex tap. Between
these 2 extreme are the standard n-tap FIR (0.5 cycle per tap) and the Block FIR.
Auto-correlation appears next to the Block FIR, because from an implementation point
of view, these are nearly identical. They belong to a famous class of algorithms, known
as BLOCK algorithms. Those 2 examples should be sufficient to develop other routines.
The IIR filter routines logically come after the FIR. These routines all contain feedback
terms, which always makes the implementation more difficult than for FIR. The first 3 IIR
routines are non-looped cases (1, 2 and 4 coefficients). This is followed by the most
common N-stage biquad (in 2 types, of 4 or 5 coefficients) and ends with the Lattice IIR.
The Lattice IIR routine is always extremely complicated to optimize with pipeline MAC,
which is the case with TriCore.
The last group are LMS routines, or more precisely, adaptive FIR filters using the Least
Mean Square (LMS) method to update the coefficients. There are dozens of adaptive
methods and the delayed LMS is used because it is fast.
Leaky LMS and the 3 standard cases (16-bit, 32-bit & complex) have also been
addressed. In many ways the LMS can be seen as the standard algorithm by which to
judge the power of an architecture, and TriCore gives:
• 1 cycle per tap (16-bit coefficients)
• 1 cycle per tap (32-bit coefficients)
• 2 cycles per complex tap (16-bit coefficients)
These results are extremely good, and few DSPs can reach these figures.
User Guide
75
v1.6.4, 2003-01
Name
Cycles
Code Optimization Techniques
Size 1)
Software Pipelining
Loop Unrolling
Packed Operation
Load/Store Scheduling
Data Memory Interleaving
Packed Load/Store
Saturation
Rounding
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Dot product
(2*N/4)+6
34
9
-
9
-
-
9
9
-
Magnitude
square
(1*N/2)+5
16
9
-
9
-
-
9
-
-
Vector
quantization
(3*N/2)+6
38
-
-
9
-
-
9
-
-
First order FIR 4
14
-
-
-
-
-
9
-
-
Second order
FIR
5
22
-
-
-
-
-
9
-
-
FIR
(2*N/4 +2) +4
34
9
9
9
-
-
9
-
-
Block FIR
(4*(N/2) +2) +9
66
9
-
9
9
-
9
-
-
Autocorrelation
((11+3*(N/2- 1) + 70
2) *M/2+2)+5
9
-
9
9
-
9
-
-
Complex FIR
(2*N +2) +4
32
9
-
9
9
-
9
9
9
First order IIR
5
24
-
-
9
-
-
9
9
9
2nd order IIR
7
34
9
-
9
-
-
9
9
-
Biquad 4
coefficients
5
26
9
-
9
-
-
9
-
-
48
9
9
9
9
-
9
-
-
N-stage Biquad (3*(N-1) +2) +6
4 coefficients
User Guide
76
Arithmetic
Methods
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
74
9
9
9
9
-
9
9
-
Lattice filter
(4*(N-2) +2) +10 54
9
9
-
9
-
-
9
9
Leaky LMS
(update only)
(4*(N/4-1) +2) +9 70
9
9
9
9
-
9
9
9
Delayed LMS
(4*N/4 +2) +5
54
9
-
9
9
-
9
9
9
Delayed LMS – (4*(N/2 –1) +2) + 64
32-bit
8
coefficients
9
-
9
9
-
9
9
-
Delayed LMS – (4*(N-1) +2) +9
complex
9
9
9
9
-
9
9
9
N-stage Biquad (5*(N-1) +2) +9
5 coefficients
1)
60
Code Size is in Bytes
User Guide
77
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.1
Dot Product
Equation:
Z = ∑ (Vn * Wn)
n = 0..N-1
Pseudo code:
sZ = 0;
for (n = 0; n < N; n++) sZ += sV[n] * sW[n];
IP= 1 (1 madd)
LD/ST= 2 (read sV, read sW)
Assembly code:
lea
LC, (N/4 -1)
mov
d0,#0
ld.d
ssssV,[Vptr+]8
mov
d1,#0
ld.d
ssssW,[Wptr+]8
dotloop:
maddms.h
llZ,llZ,ssW0,ssV0
ld.d
ssssV,[Vptr+]8
maddms.h
llZ,llZ,ssW1,ssV1
ld.d
ssssW,[Wptr+]8
loop
LC,dotloop
st.h
[Zptr],sZ
Example
;
;
;
;
;
(1)
(2)
||
(3)
||
;
;
;
;
;
get loop number
Z =0 (lower)
dummy dummy V0 V1
Z =0 (upper)
W0 W1 W2 W3
ul,#1
; ||
ul,#1
; ||
;
;
;
;
(1) ; Z
V2 V3 V4
(2) ; Z
W4 W5 W6
+=V0*W0+V1*W1
V5
+=V2*W2+V3*W3
W7
; (4) ; store Z
N = 64 Î 38 cycles
Register diagram:
Instruction
d1 / d0
d3 / d2
z(=0)
z(=0)
maddms.h
llZ,llZ,lW0,lV0ul,#1
z+v0w0+v1w1
maddms.h
llZ,llZ,lW1,lV1ul,#1
z+v2w2+v3w3
User Guide
d5 / d4
Load / Store
v1v0 _ _
ld v0v1
w3w2w1w0
ld w0w1w2w3
v5v4v3v2
w7w6w5w4
78
ld v2v3v4v5
ld w4w5w6w7
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.2
Magnitude Square
Equation:
Z = Σ (Xrn2 + Xin2)
n = 0..N-1
Pseudo code:
sZ = 0;
for (n=0; n<N; n++)
sZ += (sXr[n]*sXr[n] + sXi[n]*sXi[n]);
IP= 2 (2 madd)
LD/ST= 2 (read sXr, read sXi)
Assembly code:
lea
LC, (N/2 - 1)
; (1)
ld.w
ssX, [Xptr+]4
; (2)
msqloop:
maddm.h
llZ,llZ,ssX,ssX ul,#1
ld.w
ssX, [Xptr+]4
; ||
loop
LC, msqloop
;
st.h
[Zaddr],sZ
; (3)
; get loop number
; load Xr, Xi
; (1); Z +=(Xr^2 + Xi^2)
; load next Xr,Xi
; store Z
Memory organization:
Xaddr
Xr0
Xaddr + 2
Xi0
Xaddr + 4
Xr1
Xaddr + 6
Xi1
Xaddr + 8
Xr2
Xaddr +10
etc..
Example
N = 64 Î 37 cycles
Register diagram:
Instruction
d1/ d0
d2
0
xi0 xr0
ld xr0 xi0
maddm.h
Z
llZ,llZ,ssX,ssX,#1
xi1 xr1
ld xr1 xi1
User Guide
d5 / d4
79
d7/ d6
Load / Store
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.3
Vector Quantization
Note: VALIDATED ON TC1 V1.3 SILICON.
Equation:
Z = Σ (Kn - Xn)2
n = 0..N-1
Pseudo code:
sZ = 0;
for (n=0; n<N; n++) sZ += (sK[n]-sX[n])^2;
IP= 2 (1 sub, 1 madd)
LD/ST= 2 (read sX, read sK)
Assembly code:
lea
mov
ld.w
mov
ld.w
LC,(N/2 - 1)
d0,#0
ssX,[Xptr+]4
d1,#0
ssK,[Kptr+]4
;(1)
;(2)
;||
;(3)
;||
quantloop:
subs.h
ld.w
maddm.h
loop
st.h
;
;
;
;
;
ssTp,ssK,ssX
;(1)
ssX,[Xptr+]4
;||
llZ,llZ,ssTp,ssTp ul,#1
;Z + =
ld.w
ssK,[Kptr+]4
;||
LC,quantloop
[Zaddr],d1
;(4)
Example
get loop number
Z = 0(lower)
sX0sX1
Z = 0 (upper)
K0 K1
; K1-X1|| K0-X0
; X2 X3
; (2,3)
((K0-X0)^2+(K1-X1)^2)
; K2 K3
; store sZ
N = 64 Î 102 cycles
Register diagram:
Instruction
d1 / d0
d3 / d2
z = 0(lower)
d5 / d4
x1 x0
z = 0 (upper)
subs.h ssTp,ssK,ssX
maddms.h
llZ,llZ,ssTp, ssTp ul, #1
User Guide
d7 / d6
load x0x1
k1 k0
k1-x1 ||
k0-x0
z += (k0-x0)^2
+ (k1-x1)^2)
x3 x2
load k0k1
load x2x3
k3 k2
80
Load / Store
load k2k3
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.4
First Order FIR
Note: Validated on TriCore Rider D board.
Equation:
Yt = K0*Xt + K1*Xt-1
Pseudo code:
sY = sK0*sX0 + sK1*sX1;
sX1 = sX0;
IP= 2 (1 mul, 1 madd)
LD/ST= 6 (read sX0, sX1, sK0, sK1, write
sY, write sX0
sX1)
Æ
Assembly code:
ld.w
ld.w
mulm.h
st.h
st.h
ssK,[Kptr]
ssX,[Xptr]
llY,ssK,ssX ul,#1
[Xptr]+2,ssX
[Yptr],d1
;
;
;
;
;
(1)
(2)
(3,4)
||
||
;
;
;
;
;
K0 K1
X0 X1
Y = K0*X0 + K1*X1
X0 --> X1
store Y
Memory organization:
Entering
Leaving
Xaddr
X0
X0
Xaddr + 2
X1
X0
Register diagram:
Instruction
d1 / d0
d4
d6
k1 k0
ld k1k0
x1 x0
mulm.h llY,ssK,ssX ul,#1
Load / Store
ld x1x0
y = k0*x + k1*x1
st y
User Guide
81
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.5
Second Order FIR
Note: Validated on TriCore Rider D board.
Equation:
Yt = K0*Xt+ K1*Xt-1 + K2*Xt-2
Pseudo code:
sY = sK0*sX0 + sK1*sX1 + sK2*sX2 ;
sX2 = sX1;
sX1 = sX0;
IP= 2 (1 mul, 2 madd)
LD/ST= 9 (read sX0, sX1, sX2, sK0, sK1,
sK2, write sY, write sX0
sX1, write sX1
sX2)
Æ
Æ
Assembly code:
ld.d
ld.d
mulm.h
st.w
madd.q
st.h
sssK,[Kptr]
sssX,[Xptr]
llY,d4,d6 ul,#1
[Xptr]+2,d6
llY,llY,d5l,d7l,#1
[Yptr],d1
;
;
;
;
;
;
(1)
(2)
(3)
||
(4,5)
||
;
;
;
;
;
;
K0 K1 K2
X X1 X2
Y = K0*X + K1*X1
X0 --> X1 X1 --> X2
Y = Y + K2*X2
store Y
Memory organization:
Entering
Leaving
Xaddr
X0
X0
Xaddr + 2
X1
X0
Xaddr + 4
X2
X1
User Guide
82
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1 / d0
d5 / d4
d7 / d6
_ k2 k1 k0
ld k0 k1 k2
_ x2 x1 x0
mulm.h llY,d4,d6ul,#1
y = k0*x0 + k1*x1
madd.q llY,llY,d5l,d7l,#1 y += k2*x2
User Guide
Load /
Store
ld x0 x1 x2
st x0 x1
st y
83
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.6
FIR
Equation:
Yt = ∑ (Xt-n * Kn)
n = 0..N-1
Pseudo code:
sY = 0;
for (n=0; n<N; n++) sY += circular(sX[n])*sK[n];
IP= 1 (1 madd)
LD/ST= 2 (read sX, read sK)
Assembly code:
mov.aa
CONST.A
a2,Xptr
a3,(2*N)<<16
; circular buffer initialization
lea
LC,(N/4 -1)
; (1)
; get loop number
mov
ld.w
mov
ld.d
d0,#0
ssK0,[Kptr+]4
d1,#0
ssssX,[a2/a3+c]8
;
;
;
;
;
;
;
;
sfirloop:
maddm.h
ld.d
maddm.h
ld.d
loop
LC,sfirloop
st.h
[Yptr],d1
Example
User Guide
(2)
||
(3)
||
Y = 0 (lower)
K0 K1
Y = 0 (upper)
X0 X1 X2 X3
llY,llY,ssX0,ssK0 ul,#1 ; (1) ;
ssssK,[Kptr+]8
; || ;
llY,llY,ssX1,ssK1 ul,#1 ; (2) ;
ssssX,[a2/a3+c]8
; || ;
; (4)
Y +=X0*K0+X1*K1
K2 K3 K4 K5
Y +=X2*K2+X3*K3
X4 X5 X6 X7
; store Y
N = 20 Î 16 cycles
84
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1 / d0
d3 / d2
d5 / d4
Load / Store
k1k0 _ _
ld k0k1
x3x2 x1x0
maddm.h
y +=x0*k0+x1*k1
llY,llY,ssX0,ssK0 ul,#1
maddm.h
y +=x2*k2+x3*k3 x7x6 x5x4
llY,llY,ssX1,ssK1 ul,#1
User Guide
85
ld x0x1x2x3
k5k4 k3k2
ld k2k3k4k5
ld x4x5x6x7
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.7
Block FIR
Note: Validated on TriCore Rider D board.
Equation:
Ym = ∑ Xm-n * Kn
n = 0..N-1, m = 0..M-1
Pseudo code:
for (m = 0; m <M/4; m+4)
{
sY[m] = 0;
for (n = 0; n<N; n++)
sY[m]
+= sX[m+n]*sK[n];
sY[m+1] += sX[m+n+1]*sK[n];
sY[m+2] += sX[m+n+2]*sK[n];
sY[m+3] += sX[m+n+3]*sK[n];
}
IP= 4 (4 madd)
LD/ST=5 (read sXm, sXm+1, sXm+2, sXm+3, sK)
Assembly code:
mov.aa
CONST.A
mov.aa
CONST.A
a2,Xptr
a3,(2*N)<<16
a14,Xptr
a15,(2*N)<<16
;circular buffer address initialization
lea
LC,(N/2 - 1)
; (1)
; get loop number
mul
mul
ld.q
mul
ld.w
mul
ld.d
llY0,d13,#0
llY2,d13,#0
d13,[a14/a15+c]2
llY1,d13,#0
ssK,[Kptr+]4
llY3,d13,#0
ssssX,[a2/a3+c]4
;
;
;
;
;
;
;
;
;
;
;
;
;
;
y0 = 0
y2 = 0
dummy load
y1 = 0
k1k0
y3 = 0
x3x2x1x0
;
;
;
;
;
;
;
(1)
(2)
||
(3)
||
(4)
||
;
;
;
;
store
store
store
store
;circular buffer address initialization
(2)
(3)
||
(4)
||
(5)
||
blkfir:
loop
maddm.h
maddm.h
ld.d
maddm.h
ld.d
maddm.h
ld.w
LC,blkfir
st.h
st.h
st.h
st.h
[Yptr+]2,d1
[Yptr+]2,d3
[Yptr+]2,d5
[Yptr+]2,d7
User Guide
llY0,llY0,d8,ssK ul,#1
llY2,llY2,d9,ssK ul,#1
ssssX1,[a14/a15+c]4
llY1,llY1,d10,ssK ul,#1
ssssX,[a2/a3+c]4
llY3,llY3,d11,ssK ul,#1
ssK,[Kptr+]4
;
;
;
;
(6)
(7)
(8)
(9)
86
;
;
;
;
;
;
;
y0+=x0*k0+x1*k1
y2+=x2*k0+x3*k1
x4x3x2x1
y1+=x1*k0+x2*k1
x5x4x3x2
y3+=x3*k0+x4*k1
k3k2
y0
y2
y1
y3
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Example
N = 20 Î 51 cycles
Register diagram:
Instruction
d1/d0 d3/d2 d5/d4 d7/d6 d9/d8 d11/d10 d12
Load /
Store
y0 = 0
y2 = 0
y1 = 0
k1k0 ld k1k0
y3 = 0 x3x2x
1x0
maddm.h
llY0,llY0,d8,ssK
ul,#1
y0+=
x0*k0+
x1*k1
maddm.h
llY2,llY2,d9,ssK
ul,#1
y2+=
x2*k0
+x3*k1
maddm.h
llY1,llY1,d10,ssK
ul,#1
x4x3x2x1
y1+=
x1*k0
+x2*k1
maddm.h
llY3,llY3,d11,ssK
ul,#1
x5x4x
3x2
y3+=
x3*k0
+x4*k1
y0
ld
x4x3x2x1
ld
x5x4x3x2
k3k2 ld k3k2
st y0
y2
st y2
y1
st y1
y3
User Guide
ld
x3x2x1x0
87
st y3
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.8
Auto-Correlation
Note: Validated on TriCore Rider D board.
Equation:
Zm = ∑ Xn*Xn+m
n = 0..N-1 ;
m=0..M-1;
Pseudo code:
for (m=0; m<M; m++)
{
lZ[m] = 0;
for (n=0; n<N; n++) lZ += sX[n]*sX[n+m];
}
IP= 1 (1 madd)
LD/ST= 3 (read sX, s Xn+m, write lZ)
Assembly code:
mov.aa
a8,Xaddress
;
mov.aa
a9,Xaddress
;
add.a
a8,#2
;
lea
LCe,(M/2 - 1)
;
mov.aa
Xoptr,a8
;
extautoloop:
mov
d0,#0
;
lea
Xptr,Xaddress ;
mov
d1,#0
;
lea
LCi,(N/2 - 2) ;
mov
d2,#0
;
ld.w
ssX1,[Xptr+]4 ;
mov
d3,#0
;
ld.w
ssX3,[Xeptr+]4 ;
(1)
(2)
(3)
(4)
(5)
;
;
;
;
;
odd values
even values
adjust pointer position
get external loop number
odd pointer init
(1)
||
(2)
||
(3)
||
(4)
||
;
;
;
;
;
;
;
;
z_even = 0
even values
z_even = 0
get internal loop number
z_odd = 0
x1 x0 (even values)
z_odd = 0
x1 x0
maddm.h llZe,llZe,ssX1,ssX3 ul,#1
; (5)
; z_even=z_even+x0*x0+x1*x1
ld.w
ssX2,[Xoptr+]4 ; ||
; x2 x1
intautoloop:
maddm.h
ld.w
ld.w
maddm.h
ld.w
loop LCi,intautoloop
User Guide
llZo,llZo,ssX1,ssX2
; (1,2) ;
ssX1,[Xptr+]4
;
ssX3,[Xeptr+]4
;
llZe,llZe,ssX1,ssX3
; (3)
;
ssX2,[Xoptr+]4
;
88
ul,#1
z_odd +=x0*x1+x1*x2
|| ; x3 x2
|| ; x3 x2
ul,#1
z_even +=x2*x2+x3*x3
|| ; x4 x3
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
loop
maddm.h llZo,llZo,ssX1,ssX2 ul,#1
; (6)
;
st.w
[Zptr+]4,d1
; ||
;
add.a
a9,#4
; (7)
;
st.w
[Zptr+]4,d3
; (8)
;
mov.aa Xeptr,a9
; (9)
;
add.a
a8,#4
; (10)
;
mov.aa Xoptr,a8
; (11)
;
LCe,extautoloop
Example
store z_even
adjust pointer position
store z_odd
even pointer init
adjust pointer position
odd pointer init
N=160, M = 10 Î 1247 cycles
Register diagram:
Instruction
d1 / d0
d3 / d2
d5
d4
d6
Load /
Store
z_even = 0
(lower)
z_even = 0
(upper)
z_odd = 0
(lower)
x1x0
z_odd = 0
(upper)
maddm.h
llZe,llZe,lX1,lX3 ul,#1
x1x0
z_even +=
x0*x0+x1*x1
maddm.h
llZo,llZo,lX1,lX2 ul,#1
ld x1x0
x2x1
z_odd +=
x0*x1+x1*x2
ld x2x1
x3x2
ld x3x2
x3x2
maddm.h
llZe,llZe,lX1,lX3 ul,#1
maddm.h
llZo,llZo,lX1,lX2 ul,#1
z_even +=
x2*x2+x3*x3
x4x3
z_odd +=
x0*x1+x1*x2
ld x1x0
ld x3x2
ld x4x3
st z_even
st z_odd
User Guide
89
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.9
Complex FIR
Note: Validated on TriCore Rider D board.
Equations:
Yrt = ∑ (Xrt-n*Krn – Xit-n*Kin)
n = 0..N-1
Yit = ∑ (Xit-n*Krn + Xrt-n*Kin)
n = 0..N-1
Pseudo code:
sYr
sYi
for
{
sYr
sYi
}
= 0;
= 0;
(n = 0; n<N; n++)
+= circular(sXr)*sKr - circular(sXi)*sKi;
+= circular(sXr)*sKi + circular(sXi)*sKr;
IP= 4 (3 madd, 1 msub)
LD/ST (in packed format) = 2
(read sXi || sXr, sKr || sKi)
Pseudo code implementation:
sYr
sYi
for
{
sYi
sYi
}
= 0;
= 0;
(n = 0; n< N; n++)
+= sXr*sKi; sYr -= sXi*sKi;
+= sXi*sKr; sYr += sXr*sKr;
Assembly code:
lea
mov
ld.w
ld.q
cxloop:
LC,(N - 1)
ssY,#0
ssX,[Xptr+]4
ssK,[Kptr+]2
maddsurs.h
ld.w
maddrs.h
loop
st.w
ld.w
LC,cxloop
[Yptr],ssY
;
;
;
;
(1)
(2)
||
(3)
;
;
;
;
get loop number
y = 0
xi0 xr0
ki0
ssY,ssY,ssX,ssK uu,#1
; (1)
; yi += xr0*ki0 ||
ssK,[Kptr+]4
;||
; ki1 kr0
ssY,ssY,ssX,ssK ll,#1
; (2)
; yi += xi0*kr0 ||
ssX,[a2/a3+c]4
;||
; xi1 xr1
; (4)
yr -= xi0*ki0
yr += xr0*kr0
; store yiyr
Note: Warning! This algorithm only works when the coefficients are organised in the
reverse order {imag,real} compared to data {real,imag} .
User Guide
90
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Memory organization:
Xaddr
sXr0
Kaddr
sKi0
Xaddr + 2
sXi0
Kaddr + 2
sKr0
Xaddr + 4
sXr1
Kaddr + 4
sKi1
Xaddr + 6
sXi1
Kaddr + 6
sKr1
Xaddr + 8
sXr2
Kaddr + 8
sKi2
Xaddr + 10
etc..
Kaddr + 10
etc..
Example
N = 20 Î 46 cycles
Register diagram:
Instruction
d0
d4
y=0
xi0 xr0
maddsurs.h
ssY,ssY,ssX,ssK uu,#1
yi+=xr*ki
|| yr-=xi*ki
maddrs.h
ssY,ssY,ssX,ssK ll,#1
yi+=xi*kr
|| yr+=xr*kr
User Guide
xi1 xr1
91
d5
Load / Store
ld xi0xr0
ki0
ld ki0
ki1 kr0
ld ki1kr0
ld xi1xr1
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.10
First Order IIR
Note: Validated on TriCore Rider D board.
Equation:
Yt = B*Yt-1 + K*Xt
where:
B= (1- K)
Pseudo code:
sY0 = sY1 – sY1*sK + sX0*sK;
sY1 = sY0;
IP= 2 (1 madd, 1 msub)
LD/ST= 4
(read sX, sY1, sK, write sY0
Æ
sY1)
Assembly code:
;sX and sY are in different registers
ld.q
sK,[Kptr]
;
ld.q
sX,[Xptr+]2
;
ld.q
sY,[Yptr]
;
msubs.q
sY,sY,sK u,sY u,#1
;
madds.q
sY,sY,sK u,sX u,#1
;
st.q
[Yptr],sY
;
(1)
(2)
(3)
(4)
(5,6)
||
;
;
;
;
;
;
;alternatively if sX0 and sY1 are in the same
ld.q
sK,[Kptr]
; (1)
;
ld.w
ssXY,[Xptr+]2
; (2)
;
msubrs.h
sY,ssXY,ssXY,sK ul,#1
; (3)
;
maddrs.h
sY,sY,ssXY,sK uu,#1
; (4,5) ;
st.q
[Yptr],sY
; ||
;
K
X0
Y1
Y0 = Y1-Y1*K
Y0 += X0*K
store Y0
register
K
X0 || Y1
Y0 = Y1- Y1*K
Y0 += X0*K
store Y0
Note: Operates with dual MAC instruction but only 1 result is used.
User Guide
92
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
D4
d5
Load / Store
k0 0
ld k
x0 0
ld x0
y0 0
ld y1
msubs.q sY,sY,sKu , sY u,#1
y1 = y0*(1-k)
madds.q sY,sY,sK u,sX u,#1
y1 += x0*k
Instruction
d0
st y0
D3
x0||y1
msubrs.h sY,ssXY,ssXY,sK
uu,#1
Load / Store
k0 0
ld k
ld x0 y1
y0 = y1-y1*k
maddrs.h sY,sY,ssXY,sK uu,#1 y0 += x0*k
User Guide
d5
93
st y
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.11
Second Order IIR
Note: Validated on TriCore Rider D board.
Equation:
Yt = K0*Xt + K1*Xt-1 + K2*Xt-2 + B1*Yt-1 + B2*Yt-2
where:
B1, B2 are negative
Pseudo code:
sY0 = sK0*sX0 + sK1*sX1 + sK2*sX2 + sB1*sY1 + sB2*sY2;
sY2 = sY1;
sY1 = sY0;
IP=5 (1 mul, 4 madd)
LD/ST= 12 (read sX0, sX1, sX2, sK0, sK1,
sK2, sB1, sB2, sY1, sY2, write sY0 -->
sY1, write sY1 --> sY2)
Assembly code:
ld.d
ld.w
ld.d
ssssK,[Kptr]
ssB,[Bptr]
ssssX,[Xptr+]2
; (1)
; (2)
; (3)
; K2 K1 K0
; B1 B0
; X2 X1 X0
mulm.h
ld.w
madds.q
ld.d
maddm.h
st.h
e0,d2,d6 ul,#1
ssY,[a2/a3+c]2
e0,e0,ssK l,ssX l,#1
ssssX,[Xptr+]2
e0,e0,ssB,ssY ul,#1
[a2/a3+c]0,d1
;
;
;
;
;
;
;
;
;
;
;
;
User Guide
(4)
||
(5)
||
(6,7)
||
94
Y0 = K1*X1 + K0*X0
Y2 Y1
Y0 += K2*X2
X3 X2 X1
Y0 += B1*Y1 + B2*Y2
store Y0
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1 / d0
d3 / d2 d5 / d4 d7 / d6 d8
Load / Store
k2k1k0
ld k0k1k2
b1b0
ld b0b1
x2x1x0
mulm.h
e0,d2,d6ul,#1
y2 = k1*x1 +
k0*x0
y2y1
madds.q
y2 = y + x*k2
e0,e0,ssK l,ssX l,#1
x3x2x1
maddm.h
y2 = y2 +
e0,e0,ssB,ssYul,#1 b1*y1 +
b2*y2
User Guide
ld x0x1x2
ld y1y2
ld x1x2x3
st y0
95
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.12
BIQUAD 4 Coefficients
Note: Validated on TriCore Rider D board.
Equations:
Wt = Xt + B1*Wt-1 + B2*Wt-2
Yt = Wt + K1*Wt-1 + K2*Wt-2;
Pseudo code:
sW0
sY0
sW2
sW1
=
=
=
=
sX0 + sB1*sW1 + sB2*sW2;
sW0 + sK1*sW1 + sK2*sW2;
sW1;
sW0;
IP=4 (4 madd)
LD/ST= 10 (read sX0, read sW1, sW2, read
sK1, sK2, read sB1, sB2, write sY0, write
sW0 --> sW1, write sW1 --> sW2)
Assembly code:
ld.h
mov
ld.w
ld.w
maddm.h
ld.w
maddm.h
st.h
st.h
User Guide
d0,[Xptr]
d1,#0
ssB,[BKptr+]4
ssW,[a14/a15+c]2
llW,llY,ssW,ssB ul,#1
ssK,[BKptr+]4
llY,llW,ssW,ssK ul,#1
[a14/a15+c]0,d7
[Yptr],d1
;
;
;
;
;
;
;
;
;
(1)
||
||
(3)
||
(4,5)
||
||
96
;x0-->d0
; y0 = 0
; b2 b1
; w2 w1
; w0=x0+w1*b1+w2*b2
; k2 k1
; y0=w0+w1*k1+w2*k2
; w0--> w1
; store y0
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1 / d0
d2
x = 0 (upper)
x = 0 (lower)
d5
d7 / d6
b2b1
ld w2w1
k2k1
maddm.h
y0 = w0+w1*k1
llY,llW,ssW,ssK ul,#1 +w2*k2
w0 =
ld k2k1
x+w1*b1+
w2*b2
w1w0
y0
Load /
Store
ld b2b1
w2w1
maddm.h
llW,llY,ssW,ssB ul,#1
User Guide
d4
st w0
st y0
97
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.13
N-stage BIQUAD 4 Coefficients
Note: Validated on TriCore Rider D board.
Equations:
W0,n = Y0,n-1 + B1,n*W1,n + B2,n*W2,n
Y0,n = W0,n + K1,n*W1,n + K2,n*W2,n
n = 0..N-1
Pseudo code:
for (n = 0; n< N; n++)
{
sW0 = sY0 + sB1[n]*sW1[n] + sB2[n]*sW2[n];
sY0 = sW0 + sK1[n]*sW1[n] + sK2[n]*sW2[n];
sW2[n] = sW1[n];
sW1[n] = sW0;
}
IP=4 (4 madd)
LD/ST= 8 (read sW1, sW2, sK1, sK2, sB1,
sB2, write sW0 --> sW1, write sW1 --> sW2)
Note: sY0 can be kept in a register and does not need to be written back and re-read
from memory between stages.
Assembly code:
ld.h
mov
ld.w
;mov
ld.w
maddm.h
ld.w
lea
bq4loop:
loop
maddm.h
st.h
st.h
Example
User Guide
d0,[Xptr]
d1,#0
ssB,[BKptr+]4
d0,#0
ssW,[a14/a15+c]2
llW,llY,ssW,ssB ul,#1
ssK,[BKptr+]4
LC,(N - 2)
;
;
;
;
;
;
;
;
(2)
||
(3)
||
(4)
||
(1)
maddm.h llY,llW,ssW,ssK ul,#1
st.h
[a14/a15+c]0,d7
; ||
ld.w
ssW,[a14/a15+c]2
; ||
maddm.h llW,llY,ssW,ssB ul,#1
LC,bq4loop
llY,llW,ssW,ssK ul,#1
; (5,6)
[a14/a15+c]0,d7
; ||
[Yptr+]2,d1
; ||
;
;
;
;
;
;
;
;
x0-->d0
y0 = 0
b2 b1
y0 = 0
w2 w1
w0=y0+w1*b1+w2*b2
k2 k1
get loop number
;
;
;
;
(1,2) ; y0=w0+w1*k1+w2*k2
store w0
w1 w0
(3)
; w0=y0+w1*b1+w2*b2
; y0=w0+w1*k1+w2*k2
; store w0
; store y0
N = 13 Î 44 cycles
98
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1 / d0
d2
d4
x = 0 (upper)
d5
d7 / d6
Load /
Store
b2b1
ld b2b1
x = 0 (lower) w2w1
maddm.h
llW,llY,ssW,ssB ul,#1
ld w2w1
k2k1
maddm.h
y0=
llY,llW,ssW,ssK ul,#1 w0+w1*k1+
w2*k2
w0=
ld k2k1
x0+w1*b1+
w2*b2
w0
st w0
w1w0
ld w1w0
maddm.h
llW,llY,ssW,ssB ul,#1
w0=
y0+w1*b1+
w2*b2
maddm.h
y0=
llY,llW,ssW,ssK ul,#1 w0+w1*k1+
w2*k2
w0
y0
User Guide
st w0
st y0
99
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.14
N-stage BIQUAD 5 Coefficients
Note: Validated on TriCore Rider D board.
Equations:
W0,n = Y0,n-1 + B1,n*W1,n + B2,n*W2,n
Y0,n = K0,n*W0,n + K1,n*W1,n + K2,n*W2n
Pseudo code:
for (n = 0; n<N; n++)
{
sW0 = sY0 + sB1[n]*sW1[n] + sB2[n]*sW2[n];
sY0 = sK0[n]*sW0 + sK1[n]*sW1[n] + sK2[n]*sW2[n];
sW2[n] = sW1[n];
sW1[n] = sW0;
}
IP= 5 (5 madd)
LD/ST= 9 (read sW1, sW2, read sK0, sK1,
sK2, read sB1, write sW0 --> sW1, write sW1
--> sW2)
Note: sY0 can be kept in a register and does not need to be written back and re-read
from memory between stages.
Assembly code:
ld.h
mov
ld.w
mov
ld.w
maddm.h
ld.d
dextr
lea
bq5loop:
loop
mulm.h
madds.q
st.h
st.h
User Guide
d0,[Xptr]
d1,#0
ssB,[BKptr+]4
d0,#0
ssW,[a14/a15+c]2
llW,llY,ssW,ssB ul,#1
ssssK,[BKptr]
d8,d7,d6,#16
LC,(N - 2)
;
;
;
;
;
;
;
;
;
mulm.h
llY,ssW,ssK2 ul,#1
;
st.q
[a14/a15+c]0,d8
;
madds.q
llY,llY,d8u,ssK u,#1 ;
ld.w
d2,[a14/a15+c]2
;
maddm.h
llW,llY,ssW,ssB ul,#1 ;
dextr
d8,d7,d6,#16
;
LC,bq5loop
llY,ssW,ssK2 ul,#1
;
llY,llY,d8u,ssK u,#1
;
[a14/a15+c]0,d8
;
[Yptr+]2,d1
;
100
(2)
||
(3)
||
(4,5)
||
(6)
(1)
;
;
;
;
;
;
;
;
;
x0-->d0
y0 = 0
b2 b1
y0 = 0
w2 w1
w0=y0+w1*b1+w2*b2
k2 k1 k0 0
extract w0
get loop number
(1)
||
(2)
||
(3,4)
(5)
;
;
;
;
;
;
y0=w1*k1+w2*k2
store w0
y0=y0+k0*w0
w1 w0
w0=y0+w1*b1+w2*b2
extract w0
(7)
(8,9)
||
||
;
;
;
;
y0=w1*k1+w2*k2
y0=y0+k0*w0
store w0
store y0
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Example
N = 13 Î 71 cycles
Register diagram:
Instruction
d1 / d0
D2
y=0
(lower)
y=0
(upper)
d3
d5 /
d4
d7 / d6
b2b1
ld b2b1
W2w1
ld w2w1
maddm.h
llW,llY,ssW,ssB
ul,#1
k2k1k w0=y0+w1*
0
b1+w2*b2
dextr d8,d7,d6,#16
mulm.h
llY,ssW,ssK2 ul,#1
d8 Load /
Store
ld k2k1k0
w0
y0=w1*k1+
w2*k2
madds.q
y0+=k0*w0
llY,llY,d8u,ssK u,#1
w0 st w0
W1w0
ld w1w0
maddm.h
llW,llY,ssW,ssB
ul,#1
w0=y0+w1*
b1+w2*b2
dextr
d8,d7,d6,#16
mulm.h
llY,ssW,ssK2 ul,#1
w0
y0=w1*k1+
w2*k2
madds.q
y0+=k0*w0
llY,llY,d8u,ssK u,#1
w0 st w0
y0
User Guide
st y0
101
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.15
Lattice Filter
Equations:
Zn = Zn-1 – Xn*Kn-1
n = 1..N-1
Tn-1 = Xn + Zn*Kn-1
Pseudo code:
for (n = 1; n<N; n++)
{
sZ[n] = sZ[n-1] – sX[n]*sK[n-1];
sT[n-1] = sX[n] + sZ[n]*sK[n-1];
}
IP = 2 (1 msub, 1 madd)
LD/ST= 5 (read sX, read sK, read sZ, write
sZ, write sT)
Assembly code:
lea
ld.q
ld.q
ld.q
msubrs.h
ld.q
ld.q
msubrs.h
latloop:
loop
maddrs.h
st.q
st.q
Example
User Guide
LC,(N - 3)
sX,[Xptr+]2
sK,[Kptr+]2
sY,[Yaddr]
sZ,sY,sX,sK ul,#1
sX,[Xptr+]2
sK,[Kptr+]2
sZ,sZ,sX,sK ul,#1
;
;
;
;
;
;
;
;
(1)
(2)
(3)
(4)
(5,6)
||
||
(7,8)
;
;
;
;
;
;
;
;
get loop number
x9
k10
y (input)
z9 = y - x9*k10
x8
k9
z8 = z9 - x8*k9
maddrs.h
ld.q
ld.q
msubrs.h
st.q
LC,latloop
sT,sX,sZ,sK
[Xptr]-4,sT
[Xptr]-2,sZ
sT,sX,sZ,sK ul,#1
sX,[Xptr+]2
sK,[Kptr+]2
sZ,sZ,sX,sK ul,#1
[Xptr]-6,sT
;
;
;
;
;
(1,2)
||
||
(3,4)
||
;
;
;
;
;
t9 = x8 + z8*k9
x7
k8
z7 = z8 - x7*k8
store t9
ul,#1
; (9,10) ; t1 = x0 + z0*k1
; ||
; store t1
; ||
; store z0 (result)
N = 10 Î 44 cycles
102
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
d1
d4
d5
d6
x9
Load /
Store
ld x9
k10
ld k10
Y
msubrs.h
z9 = y - x9*k10
sZ,sY,sX,sK ul,#1
ld y
ld x8
ld x9
msubrs.h
z8 = z9 - x8*k9
sZ,sZ,sX,sK ul,#1
maddrs.h
sT,sX,sZ,sK ul,#1
t9 = x8 + z8*k9
x8
ld x8
k9
msubrs.h
z7 = z8 – x7*k8
sZ,sZ,sX,sK ul,#1
maddrs.h
sT,sX,sZ,sK ul,#1
User Guide
ld k9
st t10
t1 = x0 + z0*k1
st t0
t1
st t1
103
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.16
Leaky LMS (Update Only)
Note: Validated on TriCore Rider D board.
Equation:
Kt,n = Kt-1,n * B + Xt-n*u*Errt-1
n = 0..N-1
Pseudo code:
for (n = 0; n<N; n++)
sK[n] = sK[n]*sB + circular(sX[n])*sErr;
IP = 2 (1 mul, 1 madd)
LD/ST= 3 (read sX, read sK, write sK)
Assembly code:
lea
ld.h
ld.h
ld.w
mulr.h
ld.d
llmsloop:
LC,(N/4 - 2)
sB,[Baddr]
sErr,[Eaddr]
ssK0,[Kptr+]4
ssKK0,ssK0,sB ll,#1
ssssX,[a14/a15+c]8
;
;
;
;
;
;
maddrs.h
loop
ssKK0,ssKK0,ssX0,sErr
;
;
ld.d
ssssK,[Kptr+]8
;
mulr.h
ssKK1,ssK1,sB ll,#1 ;
;
st.w
[Kptr]-12,ssKK0
;
maddrs.h ssKK1,ssKK1,ssX1,sErr
;
ld.d
ssssX,[a14/a15+c]8 ;
mulr.h
ssKK1,ssK1,sB ll,#1 ;
;
st.w
[Kptr]-8,ssK1
;
LC,llmsloop
maddrs.h
ssKK0,ssKK0,ssX0,sErr ll,#1
ld.w
mulr.h
ssK1,[Kptr+]4
ssKK1,ssK1,sB ll,#1
maddrs.h
ssKK1,ssKK1,ssX1,sErr ll,#1
st.d
[Kptr]-8,e0
User Guide
104
;
;
;
;
;
;
;
;
;
(1)
(2)
(3)
(4)
(5)
||
;
;
;
;
;
;
get loop number
beta
error
K1 K0
K1=K1*B|| K0 =K0*B
X3 X2 X1 X0
ll,#1
(1)
;
||
;
||
;
(2)
;
||
;
||
;
ll,#1 ;
||
;
||
;
(4)
;
||
;
||
;
K1+=X1*Err
K0+=X0*Err
K5 K4 K3 K2
K3= K3*B
K2= K2*B
store K0 K1
(3) ; K3+=X3*Err
K2+=X2*Err
X7 X6 X5 X4
K5=K5*B
K4=K4*B
store K2 K3
epilog
(6)
;
||
;
||
;
(7)
;
||
;
(8,9) ;
||
;
||
;
K(n-2)+=X(n-2)*Err
K(n-3)+=X(n-3)*Err
Kn K(n-1)
Kn=Kn*B
K(n-1)= K(n-1)*B
Kn +=Xn*Err
K(n-1)+=X(n-1)*Err
Kn K(n-1) K(n-2) K(n-3)
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Example
N = 20 Î 27 cycles
Register representation:
X input
sK1
Constant input
Result of the instruction
sK0
sB
sK1*sB
sK0*sB
Register diagram:
Instruction
d0
d1
d3
d2
k1k0
__ __
mulr.h
ssKK0,ssK0,sB ll,#1
k1= k1*B ||
k0= k0*B
k5 k4
k3 k2
mulr.h
ssKK1,ssK1,sB ll,#1
k3= k3*B ||
k2= k2*B
maddrs.h
ssKK1,ssKK1,ssX1, sErr ll,#1
k3+=x3*Err||
k2+=x2*Err
k5= k5*B || k4=
k4*B
Load /
Store
ld k0k1
x3 x2 ld
x1 x0 x0x1x2x3
maddrs.h
k1+=x1*Err||k0
ssKK0,ssKK0,ssX0, sErr ll,#1 +=x0*Err
mulr.h
ssKK0,ssK0,sB ll,#1
d5
d4
ld
k2k3k4k5
st k0k1
x7 x6 ld
x5 x4 x4x5x6x7
st k3k2
maddrs.h
ssKK0,ssKK0,ssX0, sErr ll,#1
mulr.h
ssKK1,ssK1,sB ll,#1
maddrs.h
ssKK1,ssKK1,ssX1, sErr ll,#1
User Guide
105
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.17
Delayed LMS
Note: Validated on TriCore Rider D board.
Equations:
Yt = ∑ Xt-n*Kt-1,n
n = 0..N-1
Kt,n = Kt-1,n + Xt-n*u*errt-1
n = 0..N-1
Pseudo code:
sY = 0;
for (n = 0; n<N; n++)
{
sY += circular(sX[n])*sK[n];
sK[n] += circular(sX[n])*sErr;
}
IP = 2 (2 madd)
LD/ST = 3 (read sX, read sK, write sK)
Assembly code:
lea
LC,(N/4 - 1)
ld.h
sErr,[Eaddr]
mov
d0,#0
ld.w
ssK1,[Kptr+]4
mov
d1,#0
ld.d
ssssX,[a14/a15+c]8
dlmsloop:
maddm.h llY,llY,ssX0,ssK0 ul,#1
st.d
[Kptr]-16,e6
maddrs.h d6,ssK0,ssX0,sErr ll,#1
ld.d
maddm.h
ld.d
maddrs.h
loop
st.d
st.h
ssssK,[Kptr+]8
llY,llY,ssX1,ssK1 ul,#1
ssssX,[a14/a15+c]8
d7,ssK1,ssX1,sErr ll,#1
LC,dlmsloop
[Kptr]-16,e6; ||
[Yptr+]2,d1; (5)
;
;
;
;
;
;
(1)
(2)
(3)
||
(4)
||
;
;
;
;
;
;
get loop number
Err
Y = 0 (lower)
K0 K1
Y = 0 (upper)
X0 X1 X2 X3
;
;
;
;
;
;
;
;
;
;
(1) ; Y += X0*K0+X1*K1
|| ; store (next loop)
K0 K1 K2 K3
(2)
K0 +=X*Err|| K1 +=X1*Err
|| ; K2 K3 K4 K5
(3) ; Y += sX2*K2+X3*K3
|| ; X4 X5 X6 X7
(4)
K2 +=X2*Err||K3 +=X3*Err
; store last 4 K
; store Y
Note: Warning! Dummy store on first iteration of the loop.
Example
User Guide
N = 20 Î 27 cycles
106
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register representation:
X input
sX1
Error input
Result of the instruction
sX0
sErr
sX1*sErr
sX0*sErr
Register diagram:
Instruction
d1 / d0
d3
d2
y = 0 (d1)
d5
d4
d7
d6
Load /
Store
k1k0
----
y = 0 (d0)
ld k0k1
x3x2
x1x0
ld
x0x1x2x3
maddm.h
y += x0*k0 + x1*k1
llY,llY,ssX0,ssK0 ul,#1
k3k2
k1k0
maddrs.h
d6,ssK0,ssX0,d8ul,#1
k5k4
k3k2
dummy
store
----------k1 += x1*err
|| k0 += x0*err
maddm.h
y + = x2*k2 + x3*k3 x7x6
llY,llY,ssX1,ssK1 ul,#1
x5x4
maddrs.h
d7,ssK1,ssX1,d8ul,#1
ld
x4x5x6x7
k3 += x3*err
|| k2 += x2*err
k1 += x1*err
|| k0 += x0*err
next iteration
User Guide
ld
k2k3k4k5
st
k0k1k2k3
107
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.18
Delayed LMS – 32-bit Coefficients
Note: Validated on TriCore Rider D board.
Equations:
Yt = ∑ Xt-n*Kt-1,n
n = 0..N-1
Kt,n = Kt-1,n + Xt-n*u*errt-1
n = 0..N-1
Pseudo code:
sY = 0;
for (n = 0; n< N; n++)
{
sY += circular(sX[n])* up(lK[n]);
lK[n] += circular(sX[n])* sErr;
}
IP = 2 (2 madd)
LD/ST= 3 (read sX, read lK, write lK)
Assembly code:
lea
LC,(N/2 - 2)
;(1) ; get loop number
ld.h
sErr,[Eaddr]
;(2) ; err
mov
d0,#0
;(3) ; y = 0 (lower)
ld.d
llK,[Kptr+]8
;||
; k0k1
mov
d1,#0
;(4) ; y = 0 (upper)
ld.w
ssX,[a14/a15+c]4
;||
; x0x1
madds.h
llY,llY,ssX,lK0 ul,#1
;(5) ; y +=x0*k0
dlms32loop:
madds.h llY,llY,ssX,lK1 uu,#1 ;(1) ; y += x1*k1
madds.h e6,llK,ssX,sErr ll,#1 ;(2,3)
;k0=k0+x0*err||k1=k1+x1*err
ld.d
llK,[Kptr+]8
;||
; k3k2
ld.w
ssX,[a14/a15+c]4
;||
; x2x3
madds.h llY,llY,ssX,lK0 ul,#1 ;(4) ; y=y+x2*k2
st.d
[Kptr]-16,llK
;||
; store k0k1
loop
LC,dlms32loop
madds.h
llY,llY,ssX,lK1 uu,#1
;(6) ; y=y+xn*kn
madds.h
e6,llK,ssX,sErr ll,#1
;(7,8); kn=kn+xn*err
;||
; k(n-1)=k(n-1)+x(n-1)*err
st.h
[Yptr]2,d1
;||
; store Y
st.d
[Kptr]-8,e6
;||
; store k(n-1)kn
Example
User Guide
N = 20 Î 46 cycles
108
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d1 / d0
d3 / d2
y=0
(lower)
y=0
(upper)
madds.h
llY,llY,ssX,lK0 ul,#1
y += x0*k0
madds.h
llY,llY,ssX,lK1 uu,#1
y += x1*k1
d5 / d4
d7/ d6
k1k0
ld k1k0
x1x0
ld x1x0
madds.h
e6,llK,ssX,sErr ll,#1
k3k2
k1 += x1*err ||
k0 += x0*err
x3x2
madds.h
llY,llY,ssX,lK0 ul,#1
y += x2*k2
madds.h
llY,llY,ssX,lK1 uu,#1
y=y+xn*kn
ld k3k2
ld x3x2
k5k4
madds.h
e6,llK,ssX,sErr ll,#1
User Guide
Load / Store
st k1k0
kn=kn+xn*err
|| k(n-1)=k(n-1)
+ x(n-1)*err
109
st y
st k(n-1)kn
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
6.19
Delayed LMS – Complex
Note: Validated on TriCore Rider D board.
Equations:
Yrt = ∑ Xrt-n*Krt-1,n- Xit-n*Kit-1,n
Yit = ∑ Xrt-n*Kit-1,n + Xit-n*Krt-1,n
Krt,n = Krt-1,n + Xrt-n*u*Er t-1 - Xit-n*u*Ei t-1
Kit,n = Kit-1,n + Xrt-n*u*Ei t-1 + Xit-n*u*Er t-1
n = 0..N-1
Pseudo code:
sYr = 0;
sYi = 0;
for (n = 0; n<N; n++)
{
sYr += circular(sXr[n]*sKr[n] – circular(sXi[n])*sKi[n];
sYi += circular(sXi[n]*sKr[n] + circular(sXr[n])*sKi[n];
sKr[n] += circular(sXr[n])*sEr - circular(sXi[n])*sEi;
sKi[n] += circular(sXr[n])*sEi + circular(sXi[n])*sEr;
}
IP=8 (6 madd, 2 msub)
User Guide
LD/ST=6 (read sXr, read sXi, read sKr, read
sKi, write sKr, write sKi)
110
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Assembly code:
lea
mov
ld.w
ld.w
ld.w
LC,(N - 2)
ssY,#0
ssE,[Eaddr]
ssX,[Xptr+]4
ssK,[Kptr+]4
;
;
;
;
;
(1)
(2)
||
(3)
(4)
maddsurs.h ssY,ssY,ssX,ssK uu,#1
; (5)
maddsurs.h ssK0,ssK,ssX,ssE uu,#1
; (6)
dlmscpxloop:
maddrs.h
ssY,ssY,ssX,ssK ll,#1
; (1)
; ||
ld.w
ssK,[Kptr+]4
; ||
maddrs.h
ssK0,ssK0,ssX,ssE ll,#1
; (2)
; ||
ld.w
ssX,[a14/a15+c]4 ; ||
maddsurs.h ssY,ssY,ssX,ssK uu,#1
; (3)
st.w
[Kptr]-8,ssK0
; ||
maddsurs.h ssK0,ssK,ssX,ssE uu,#1
; (4)
loop
LC,dlmscpxloop
maddrs.h
maddrs.h
ssY,ssY,ssX,ssK ll,#1
ssK0,ssK0,ssX,ssE ll,#1
st.w
st.w
[Yptr+]4,ssY
[Kptr]-4,ssK0
Example
User Guide
;
;
;
;
;
;get loop number
;yi = 0 || yr = 0
;Ei || Er
;xixr0
;ki0kr0
;yi+=xr0*ki0 || yr-=xi0*ki0
;ki0+=xr0*ei||kr0-=xi0*ei
;yi+=xi0*kr0
;yr+=xr0*kr0
;ki1kr1
;ki0+=xi0*er
;kr0+=xr0*er
;xixr1
;yi+=xr1*ki1 || yr-=xi1*ki1
;store ki0kr0
;ki1+=xr1*ei || kr1-=xi1*ei
(7) ;yi+=xin*krn || yr+=xrn*krn
(8,9)
kin+=xin*er || krn+=xrn*er
|| ;store yryi
|| ;store kin krn
N = 20 Î 87 cycles
111
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
d1
d4
d5
yi = 0 || yr = 0
xixr0
d6
Load /
Store
EiEr
ld EiEr
ld xixr0
ki0kr0
ld ki0kr0
ki1kr1
ld ki1kr1
maddsurs.h
yi+=xr0*ki0 ||
ssY,ssY,ssX,ssK uu,#1 yr-=xi0*ki0
maddsurs.h
ssK1,ssK,ssX,ssE
uu,#1
maddrs.h
ssY,ssY,ssX,ssK ll,#1
ki0+=xr0*ei
||kr0-=xi0*ei
yi+=xi0*kr0 ||
yr+=xr0*kr0
maddrs.h
ssK1,ssK,ssX,ssE ll,#1
ki0+=xi0*er
xixr1
||kr0+=xr0*er
maddsurs.h
yi+=xr1*ki1 ||
ssY,ssY,ssX,ssK uu,#1 yr-=xi1*ki1
maddsurs.h
ssK1,ssK,ssX,ssE
uu,#1
maddrs.h
ssY,ssY,ssX,ssK ll,#1
maddrs.h
ssK1,ssK,ssX,ssE ll,#1
ki0kr0
st ki0kr0
yryi
st yryi
ki1+=xr1*ei
||kr1-=xi1*ei
yi+=xin*krn ||
yr+=xrn*krn
kin+=xin*er ||
krn+=xrn*er
kinkrn
User Guide
ld xixr1
112
st kinkrn
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
7
Transforms
The DSP functions classed as transforms are much more complete than simple kernel
functions. They cover complete applications and can be sub-divided into a series of
kernel routines. Taking the example of FFT, this is sub-divided into:
•
•
•
•
•
•
•
Initialisation
Bit reverse
Pass loop (most external loop)
Group loop (medium loop)
Butterfly loop (inner loop)
Squaring of results
Normalisation, etc.
Note: Each loop has a different importance. For a 256 pt FFT the pass loop will be
carried out 8 times, whereas the butterfly loop in radix 2 will be done 8*128 times.
The butterfly can be divided into various types:
•
•
•
•
•
•
Radix 2 or radix 4 or radix 8 or radix N
Real or complex data
Real or complex coefficients
16-bit or 32-bit data
16-bit or 32-bit coefficients
Decimation In Time (DIT) or Decimation In Frequency (DIF) butterfly
The butterfly can be implemented in several ways:
• With or without shift
(shift is generally necessary above 32 points and application dependent)
• Use of block floating point
• Use of packed data (2 points at a time)
• Data incremented in power of 2, coefficients constants
• Data incremented, coefficients incremented
• Data incremented, coefficients incremented in power of 2
• Degenerated (case of first 3 passes in radix 8)
As demonstrated, the number of variables is large.
The most common cases of inner loops of radix 2 butterflies for FFTs are:
•
•
•
•
•
Real butterfly – DIT – radix 2
Real butterfly – DIF – radix 2
Complex butterfly – DIT – radix 2
Complex butterfly – DIT – radix 2 – with shift
Complex butterfly – DIF – radix 2
User Guide
113
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Real butterfly – DIT – radix 2
A simple implementation, 2 points at a time. The coefficients and data are incremented
by 2 points.
Conditions:
• Works for pass 1 to pass N-1; does not work for pass 0 (due to packing of data)
• Number of coefficients must be doubled (duplicated table)
Real butterfly – DIF – radix 2
As above, a simple implementation, 2 points at a time. The coefficients and data are
incremented by 2 points.
Conditions:
• Works for pass 0 to pass N-2; does not work for pass N-1 (due to packing of data)
• Number of coefficients must be doubled (duplicated table).
Complex butterfly – DIT – radix 2
A straightforward implementation, 2 points at a time. The coefficients are incremented
by an index value.
Conditions:
• Works for pass 1 to pass N-1; does not work for pass 0 (due to packing of data)
Complex butterfly – DIT – radix 2 – with shift
Same implementation as above, with shift.
Conditions:
• Works for pass 1 to pass N-1; does not work for pass 0 (due to packing of data)
Complex butterfly – DIF – radix 2
A straightforward implementation, 2 points at a time. The coefficients are incremented
by an index value.
Conditions:
• Works for pass 0 to pass N-2; does not work for pass N-1 (due to packing of data)
User Guide
114
v1.6.4, 2003-01
Name
Cycles
Code Optimization
Size 1) Techniques
Software Pipelining
Loop Unrolling
Packed Operation
Load/Store Scheduling
Data Memory Interleaving
Packed Load/Store
Saturation
Rounding
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Real butterfly –
DIT2) radix 2
(5*N/2 +2) +3
32
9
-
9
9
-
9
-
9
Real butterfly –
DIF3) radix 2
(5*N/2 +2) +3
36
-
-
-
-
-
9
9
9
Complex butterfly
– DIT radix 2
(9*N/2 +2) +3
74
9
-
9
9
-
9
-
-
Complex butterfly (11*N/2 +2) +3
– DIT radix 2 – with
shift
82
9
-
9
9
-
9
-
-
Complex butterfly
– DIF radix 2
76
9
-
9
9
-
9
-
-
(9*N/2 +2) +3
1)
Code size is in Bytes
2)
DIT = Decimation In Time
3)
DIF = Decimation In Frequency
User Guide
115
Arithmetic
Methods
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
7.1
Real Butterfly – DIT – Radix 2
Equations:
Xn’ = Xn + Yn*Kn
Yn’ = Xn – Yn*Kn
n=0..N-1
Pseudo code:
for (n = 0; n<N; n++)
{
sXX[n] = sX[n] + sY[n]*sK[n];
sYY[n] = sX[n] - sY[n]*sK[n];
}
IP = 2 (1 madd, 1 msub)
LD/ST= 5 (read sX, read sK, read sY, write
sXX, write sYY)
Pseudo code implementation:
ssP1 = ssY1 * ssK1; ssP0 = ssY0 * ssK0;
ssXX1 = ssX1 + ssP1; ssXX0 = ssX0 + ssP0;
ssYY1 = ssX1 – ssP1; ssYY0 = ssX0 – ssP0;
Assembly code:
lea
LC,(N/2 – 1)
ld.w
ssY,[Yptr+]4
ld.w
ssK,[Kptr+]4
rditloop:
mulr.h
ssP,ssY,ssK ul,#1
ld.w
ssX,[Xptr+]4
ld.w
ssY,[Yptr+]4
add.h
ssXX,ssX,ssP
st.w
[Xptr]-4,ssXX
sub.h
ssYY,ssX,ssP
st.w
[Yptr]-8,ssYY
ld.w
ssK,[Kptr+]4
loop
LC,rditloop
User Guide
116
; (1)
; (2)
; (3)
;get loop number
;y1y0
;k1k0
;(1,2)
; ||
; ||
; (3)
; ||
; (4)
; ||
; (5)
;p1=y1*k1 || p0=y0*k0
;x1 x0
;y3 y2
;xx1=x1+p1 || xx0=x0+p0
;store xx1 xx0
;yy1=x1-p1 || yy0=x0-p0
;store yy1 yy0
;k3 k2
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
d2
d3
d4 d6 d8 Load /
Store
y1
y0
ld y1y0
k1 ld k1k0
k0
mulr.h
ssP,ssY,ssKul,#1
p1=y1*k1 || x1
p0=y0*k0 x0
ld x1x0
y3
y2
add.h ssXX,ssX,ssP x’1=x1+p1 ||
x’0=x0+p0
sub.h ssYY,ssX,ssP
ld y3y2
st x’1x’0
y’1=x1-p1 ||
y’0=x0-p0
st y’1y’0
k3 ld k3k2
k2
User Guide
117
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
7.2
Real Butterfly – DIF – Radix 2
Equations:
Xn’ = Xn + Yn
Yn’ = Xn*Kn – Yn*Kn
n=0..N-1
Pseudo code:
for (n = 0; n<N; n++)
{
sXX[n] = sX[n] + sY[n];
sYY[n] = sX[n]*sK[n] - sY[n]*sK[n];
}
IP = 3
(1 add, 1 mul, 1 msub)
LD/ST= 5 (read sX, sK, sY, write sXX, sYY)
Pseudo code implementation (loop only):
ssXX1 = ssX1 + ssY1; ssXX0 = ssX0 + ssY0;
ssD1 = ssX1 – ssY1; ssD0 = ssX0 – ssY0;
ssYY1 = ssD1*ssK1;
ssYY0 = ssD0*ssK0;
Assembly code:
lea
LC,(N/2 - 1)
ld.w
ssY,[Yptr+]4
ld.w
ssX,[Xptr+]4
rdifloop:
add.h
ssXX,ssX,ssY
st.w
[Xptr]-4,ssXX
sub.h
ssD,ssX,ssY
ld.w
ssY,[Yptr+]4
ld.w
ssK,[Kptr+]4
mulr.h ssYY,ssD,ssK ul,#1
ld.w
ssX,[Xptr+]4
st.w
[Yptr]-8,ssYY
loop
LC,rdifloop
User Guide
; (1)
; (2)
; (3)
; get loop number
; y1y0
; x1x0
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
118
(1)
||
(2)
||
(3)
(4,5)
||
||
xx1=x1+y1 || xx0=x0+y0
store xx1 xx0
d1=x1-y1 || d0=x0-y0
y3 y2
k1 k0
yy1=d1*k1 || yy0=d0*k0
x3 x2
store yy1 yy0
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
d2
d3 / d2
d4 d6
d8 Load /
Store
y1
y0
ld y1y0
x1
x0
add.h
ssXX,ssX, ssY
ld x1x0
xx1=x1+y1||
xx0=x0+y0
st xx1 xx0
sub.h
ssD,ssX,ssY
d1=x1-y1 ||
d0=x0-y0
y3
y2
ld y3y2
k1 ld k1k0
k0
mulr.h
ssYY,ssD,ssK ul,#1
yy1=d1*k1||
yy0=d0*k0
yy1 yy0
User Guide
119
x3
x2
ld x3x2
st yy1 yy0
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
7.3
Complex Butterfly – DIT – Radix 2
Equations:
X’r = Xr + (Yr*Kr – Yi*Ki)
X’i = Xi + (Yi*Kr + Yr*Ki)
Y’r = Xr – (Yr*Kr - Yi*Ki)
Y’i = Xi – (Yi*Kr + Yr*Ki)
Pseudo code:
for (n = 0; n<N; n++)
{
sXXr[n] = sXr[n]
sXXi[n] = sXi[n]
sYYr[n] = sXr[n]
sYYi[n] = sXi[n]
}
+
+
–
–
sYr[n]*sKr[n]
sYi[n]*sKr[n]
sYr[n]*sKr[n]
sYi[n]*sKr[n]
+
+
-
sYi[n]*sKi[n];
sYr[n]*sKi[n];
sYi[n]*sKi[n];
sYr[n]*sKi[n];
LD/ST= 10 (read sXr, sXi, sYr, sYi, read
sKr, sKi, write sXXr, sXXi, sYYr, sYYi)
IP=8 (4 madd, 4 msub)
Pseudo code implementation (loop only):
sPi0
sPi0
sPi1
sPi1
sXXi0
sYYi0
sXXi1
sYYi1
=
=
=
=
=
=
=
=
sYr0
sPi0
sYr1
sPi1
sXi0
sXi0
sXi1
sXi1
User Guide
*
+
*
+
+
–
+
–
sKi0
sYi0 * sKr0
sKi1
sYi1 * sKr1
sPi0
sPi0
sPi1
sPi1
;
;
;
;
;
;
;
;
sPr0
sPr0
sPr1
sPr1
sXXr0
sYYr0
sXXr1
sYYr1
=
=
=
=
=
=
=
=
120
sYr0
sPr0
sYr1
sPr1
sXr0
sXr0
sXr1
sXr1
*
*
–
+
–
+
–
sKr0;
sYi0 * sKi0;
sKr1;
sYi1 * sKi1;
sPr0;
sPr0;
sPr1;
sPr1;
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Assembly code:
lea
LC,(N/2 - 1)
; (1) ;get loop number
lea
kindex,4
; (2) ;index is pass dependent
ld.d
e6,[Yptr+]8
; (3) ;y1y0
ld.w
ssK1,[Kptr]
; (4) ;kikr0
cditloop:
mulr.h
ssP1,ssK1,ssY1 ll,#1 ; (1) ;pi0=yr0*ki0 || pr0=yr0*kr0
add.a
Kptr,Kptr,kindex
; || ;
maddsur.h ssP1,ssP1,ssK1,ssY1 uu,#1
; (2)
; pi0+=kr0*yi0 || pr0-=yi0*ki0
ld.w
ssK2,[Kptr]
; || ;kikr1
mulr.h
ssP2,ssK2,ssY2 ll,#1 ; (3)
; pi1=yr1*ki1 || pr1=yr1*kr1
maddsur.h ssP2,ssP2,ssK2,ssY2 uu,#1 ; (4,5)
; pi1+=kr1*yi1 ||pr1-=yi1*ki1
ld.d
e4,[Xptr+]8
; || ; xixr1 xixr0
ld.d
e6,[Yptr+]8
; || ; yiyr3 yiyr2
add.h
ssXX1,ssX1,ssP1
; (6)
; xi’0=xi0+pi0 || xr’0=xr0+pr0
add.a
Kptr,Kptr,kindex
; || ;
sub.h
ssYY1,ssX1,ssP1
; (7)
; yi’0=xi0-pi0 || yr’0=xr0-pr0
ld.w
ssK1,[Kptr]
; || ; kikr2
add.h
ssXX2,ssX2,ssP2
; (8)
; xi’1=xi1+pi1 || xr’1=xr1+pr1
st.d
[Xptr]-8,e0
; || ; store x’1x’0
sub.h
ssYY2,ssX2,ssP2
; (9)
; yi’1=xi1-pi1 || yr’1=xr1-pr1
st.d
[Yptr]-16,e2
; || ; store yy1yy0
loop LC,cditloop
User Guide
121
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
d1
d2
d3
d5 /
d4
d7/
d6
d8
d9
d10
d11
yiyr1
yiyr0
kikr0
mulr.h
ssP1,ssK1,ssY1
ll,#1
pi0 ||
pr0
maddsur.h
ssP1,ssP1,ssK1,s
sY1 uu,#1
kikr1 pi0 ||
pr0
mulr.h
ssP2,ssK2,ssY2
ll,#1
pi1 ||
pr1
maddsur.h
ssP2,ssP2,ssK2,s
sY2 uu,#1
xixr1
xixr0
pi1 ||
pr1
yiyr3
yiyr2
add.h
xi’0 ||
ssXX1,ssX1,ssP1 xr’0
sub.h
ssYY1,ssX1,ssP1
yi’0 ||
yr’0
kikr2
add.h
x’1x’0 xi’1 ||
ssXX2,ssX2,ssP2
xr’1
sub.h
ssYY2,ssX2,ssP2
User Guide
yy1
yy0
yi’1 ||
yr’1
122
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
7.4
Complex Butterfly – DIT – Radix 2 – with shift
Equations:
X’r = Xr + (Yr*Kr – Yi*Ki)
X’i = Xi + (Yi*Kr + Yr*Ki)
Y’r = Xr – (Yr*Kr - Yi*Ki)
Y’i = Xi – (Yi*Kr + Yr*Ki)
Pseudo code:
for (n = 0; n<N; n++)
{
sXXr[n] = sXr[n]
sXXi[n] = sXi[n]
sYYr[n] = sXr[n]
sYYi[n] = sXi[n]
}
+
+
–
–
sYr[n]*sKr[n]
sYi[n]*sKr[n]
sYr[n]*sKr[n]
sYi[n]*sKr[n]
+
+
-
sYi[n]*sKi[n];
sYr[n]*sKi[n];
sYi[n]*sKi[n];
sYr[n]*sKi[n];
LD/ST= 10 (read sXr, sXi, sYr, sYi, read
sKr, sKi, write sXXr, sXXi, sYYr, sYYi)
IP=8 (4 madd, 4 msub)
Pseudo code implementation (loop only):
sPi0 = sYr0 * sKi0
sPi0 = sPi0 + sYi0 * sKr0
sPi1 = sYr1 * sKi1
sPi1 = sPi1 + sYi1 * sKr1
sXi0 >> 1
sXXi0 = sXi0 + sPi0
sYYi0 = sXi0 – sPi0
sXi1 >> 1
sXXi1 = sXi1 + sPi1
sYYi1 = sXi1 – sPi1
User Guide
;
;
;
;
;
;
;
;
;
;
sPr0 = sYr0 * sKr0;
sPr0 = sPr0 - sYi0 * sKi0;
sPr1 = sYr1 * sKr1;
sPr1 = sPr1 – sYi1 * sKi1;
sXr0 >> 1;
sXXr0 = sXr0 + sPr0;
sYYr0 = sXr0 – sPr0;
sXr1 >> 1;
sXXr1 = sXr1 + sPr1;
sYYr1 = sXr1 – sPr1;
123
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Assembly code:
lea
lea
ld.d
ld.w
LC,(N/2 - 1)
kindex,4
e6,[Yptr+]8
ssK1,[Kptr]
;
;
;
;
(1)
(2)
(3)
(4)
;
;
;
;
get loop number
index is pass dependent
y1y0
kikr0
cditsloop:
mulr.h
ssP1,ssK1,ssY1 ll,#0 ; (1)
; pi0=yr0*ki0 || pr0=yr0*kr0
add.a
Kptr,Kptr,kindex
; ||
;
maddsur.h ssP1,ssP1,ssK1,ssY1 uu,#0
; (2)
; pi0+=kr0*yi0 || pr0-=yi0*ki0
ld.w
ssK2,[Kptr]
; ||
; kikr1
mulr.h
ssP2,ssK2,ssY2 ll,#0 ; (3)
; pi1=yr1*ki1 || pr1=yr1*kr1
maddsur.h ssP2,ssP2,ssK2,ssY2 uu,#0
; (4,5)
; pi1+=kr1*yi1 || pr1-=yi1*ki1
ld.d
e4,[Xptr+]8
; ||
; xixr1 xixr0
sha.h
ssX1,ssX1,#-1
; (6) ; x0>>1
add.a
Kptr,Kptr,kindex
; ||
;
add.h
ssXX1,ssX1,ssP1
; (7)
; xi’0=xi0+pi0 || xr’0=xr0+pr0
ld.w
ssK1,[Kptr]
; ||
; kikr2
sub.h
ssYY1,ssX1,ssP1
; (8)
; yi’0=xi0-pi0 || yr’0=xr0-pr0
ld.d
e6,[Yptr+]8
; ||
; yiyr3 yiyr2
sha.h
ssX2,ssX2,#-1
; (9) ; x1>>1
add.h
ssXX2,ssX2,ssP2
; (10)
; xi’1=xi1+pi1 || xr’1=xr1+pr1
st.d
[Xptr]-8,e0
; ||
; store x’1x’0
sub.h
ssYY2,ssX2,ssP2
; (11)
; yi’1=xi1-pi1 || yr’1=xr1-pr1
st.d
[Yptr]-16,e2
; ||
; store yy1yy0
loop LC,cditsloop
User Guide
124
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
d1
d2
d3
d5 /
d4
d7 /
d6
d8
d9
d10 d11
yiyr1
yiyr0
kikr0
mulr.h
ssP1,ssK1,ssY1
ll,#1
pi0||
pr0
maddsur.h
ssP1,ssP1,ssK1,s
sY1 uu,#1
kikr1
mulr.h
ssP2,ssK2,ssY2
ll,#1
pi1||
pr1
maddsur.h
ssP2,ssP2,ssK2,s
sY2 uu,#1
xixr1
xixr0
sha.h
ssX1,ssX1,#-1
add.h
ssXX1,ssX1,ssP1
xi’0 ||
xr’0
User Guide
kikr2
yi’0||
yr’0
sha.h
ssX2,ssX2,#-1
sub.h
ssYY2,ssX2,ssP2
pi1||
pr1
x0>>1
sub.h
ssYY1,ssX1,ssP1
add.h
ssXX2,ssX2,ssP2
pi0||
pr0
yiyr3
yiyr2
x1>>1
x’1x’0 xi’1||
xr’1
yy1y yi’1||yr’1
y0
125
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
7.5
Complex Butterfly – DIF – Radix 2
Equations:
X’r = Xr + Yr
X’i = Xi + Yi
Y’r = (Xr – Yr)*Kr – (Xi – Yi)*Ki
Y’i = (Xi – Yi)*Kr + (Xr – Yr)*Ki
Pseudo code:
for (n = 0; n<N; n++)
{
sXXr[n] = sXr[n] + sYr[n];
sXXi[n] = sXi[n] + sYi[n];
sYYr[n] = (sXr[n] – sYr[n])*sKr[n] – (sXi[n] - sYi[n])*sKi[n];
sYYi[n] = (sXi[n] – sYi[n])*sKr[n] + (sXr[n] - sYr[n])*sKi[n];
}
IP=8 (2 add, 2 sub, 2 mul,
1 madd, 1 msub)
LD/ST= 10 (read sXr, sXi, sYr, sYi, read
sKr, sKi, write sXXr, sXXi, sYYr, sYYi)
Pseudo code implementation:
sXXi0
sXXi1
sDi0
sDi1
sYYi0
sYYi0
sYYi1
sYYi1
=
=
=
=
=
=
=
=
sXi0 + sYi0
sXi1 + sYi1
sXi0 - sYi0
sXi1 - sYi1
sDi0 * sKr0
sYYi0 + sDr0 * sKi0
sDi1 * sKr1
sYYi1 + sDr1 * sKi1
User Guide
;
;
;
;
;
;
;
;
sXXr0
sXXr1
sDr0
sDr1
sYYr0
sYYr0
sYYr1
sYYr1
=
=
=
=
=
=
=
=
126
sXr0 + sYr0;
sXr1 + sYr1;
sXr0 - sYr0;
sXr1 – sYr1;
sDr0 * sKr0;
sYYr0 – sDi0 * sKi0;
sDr1 * sKr1;
sYYr1 – sDi1 * sKi1;
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Assembly code:
lea LC,(N/2 - 1)
;
lea kindex,4
;
ld.d e6,[Yptr+]8
;
ld.d e4,[Xptr+]8
;
cdifloop:
add.h
ssXX1,ssX1,ssY1
;
add.a
Kptr,Kptr,kindex
;
add.h
ssXX2,ssX2,ssY2
;
ld.w
ssK1,[Kptr]
;
sub.h
ssD1,ssX1,ssY1
;
add.a
Kptr,Kptr,kindex
;
sub.h
ssD2,ssX2,ssY2
;
ld.w
ssK2,[Kptr]
;
mulr.h
ssYY1,ssD1,ssK1 ll,#1
127
v1.6.4, 2003-01
ld.d
e4,[Xptr+]8
;
maddsur.h ssYY2,ssYY2,ssD2,ssK2
User Guide
get loop number
index is pass dependent
y1y0
x1x0
;
;
ssYY2,ssD2,ssK2 ll,#1
ld.d
e6,[Yptr+]8
st.d
[Yptr]-16,e2
loop LC,cdifloop
;
;
;
;
; xi’0=xi0+yi0 || xr’0=xr0+yr0
;
; xi’1=xi1+yi1 || xr’1=xr1+yr1
; kikr0
; di0=xi0-yi0 || dr0=xr0-yr0
;
; di1=xi1-yi1 || dr1=xr1-yr1
; kikr1
; (5)
; yi’0=di0*kr0 || yr’0=dr0*kr0
|| ; store x’1x’0
uu,#1; (6)
; yi’0+=dr0*ki0||yr’0-=di0*ki0
; (7)
; yi’1=di1*kr1 || yr’1=dr1*kr1
|| ; xixr3 xixr2
uu,#1
;(8,9)
; yi’1+=dr1*ki1||yr’1-=di1*ki1
|| ; yiyr3 yiyr2
|| ; store yy1yy0
st.d
[Xptr]-8,e0
;
maddsur.h ssYY1,ssYY1,ssD1,ssK1
mulr.h
(1)
(2)
(2)
(3)
(1)
||
(2)
||
(3)
||
(4)
||
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
Register diagram:
Instruction
d0
d1
d2
d3
d5 /
d4
d7/
d6
d8
d9 d10 d11
yiyr1
yiyr0
xixr1
xixr0
add.h
ssXX1,ssX1,ssP1
xi’0||xr’0
add.h
ssXX2,ssX2,ssP2
sub.h
ssYY1,ssX1,ssP1
xi’1||xr’1
kikr0
di0||
dr0
sub.h
ssYY2,ssX2,ssP2
di1|
dr1
kikr1
mulr.h
x’1x’0
ssP1,ssK1,ssY1 ll,#1
yi0||
yr0
maddsur.h
ssP1,ssP1,ssK1,ssY
1 uu,#1
yi0||
yr0
xixr3
xixr2
mulr.h
ssP2,ssK2,ssY2 ll,#1
maddsur.h
ssP2,ssP2,ssK2,ssY
2 uu,#1
yi1||
yr1
yiyr3
yiyr2
yi1||
yr1
yy1
yy0
User Guide
128
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
8
Appendices
8.1
Tools
Several tools have been used to test the DSP routines shown in this Optimization guide:
• EDE Tasking 1.4 r1:
Used to build the projects. One project was created for each DSP algorithm, and one
project for the cycle count.
• CrossView debugger:
Used to run the routine. The output is the screen (stdout) or a file.
• A spreadsheet such as Excel:
Used to compare TriCore fixed-point precision (or approximation of the fixed-point
DSP algorithms) and the floating point results.
8.2
TriBoard Project Cycles Count
The project will send the number of cycles of the routine under test by the serial port. The
configuration used is indicated below:
Hardware
Software
TriBoard Rider-D w/MMU M3301.
Power supply 12V/2A
One Parallel cable female/female
One serial cable male/male
Tasking EDE 1.4 r1
Crossview debugger 1.1
Windows Hyperterminal
Excel
8.2.1
Steps to Run the Project
1. Connect the TriBoard with the computer
2. Launch the Hyperterminal
The settings should be:
Port used
: COM1 (This is the com port where the serial cable connects the board
to the computer)
Bits/Second
: 9600
Data bits
:8
Parity
: None
Stop bits
:1
Flow
: Hardware
User Guide
129
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
3. Connect the power supply. The board will respond with:
Hello World!
I’am the TriBoard with Rider-D
developed at
INFINEON Technologies AG in Munich
Department AI MC AE
St.-Martin-Str. 76
D-81541 Munich
Tel.:+49-89-234-0
Fax.:+49-89-234-81785
If you have questions to this board or to TriCore CPU,
see the manuals on the TriBoard CD.
Have fun working with me!
The CPU running at 160000000 Hz
The EBU running at 80000000 Hz
Checking On Board SDRAM: ok
running since: 00h:00min:02sec
4. Launch Tasking EDE and select the project space “Testing DSP”.
Open the project “cycles count”, add the routine (see 1.3.2) and compile it.
5. Open Crossview debugger 1.1 and load the file cycle count.abs in the board.
6. Run the code. A message is sent by the board:
f CPU = 160000000 Hz
f EBU = 80000000 Hz
K = 1
N = 10
-------------------------------number of cycle : 102018
--------------------------------
User Guide
130
v1.6.4, 2003-01
TriCore™ 32-bit Unified Processor
DSP Optimization Guide Part 2: Routines
9
Glossary
Reference
Definition
asm
Assembly code
CPU
Central Processing Unit
DIF
Decimation In Frequency
DIT
Decimation In Time
DSP
Digital Signal Processing
FFT
Fast Fourier Transformations
FIR
Finite Impulse Response
FPI
Flexible Peripheral Interface
IIR
Infinite Impulse Response
IP
Integer Processing
LDMST
Load Modify Store instruction
LMS
Least Mean Square
LS
Load Store
LSB
Least Significant Bit
LT
Less Than - compare instruction
MAC
Multiply and Accumulate
MSB
Most Significant Bit
nop
No Operation
PMU
Program Memory Unit
TC
Abbreviation for TriCore (TC1 or TriCore v2.0, for example)
User Guide
131
v1.6.4, 2003-01
((132))
Infineon goes for Business Excellence
“Business excellence means intelligent approaches and clearly
defined processes, which are both constantly under review and
ultimately lead to good operating results.
Better operating results and business excellence mean less
idleness and wastefulness for all of us, more professional
success, more accurate information, a better overview and,
thereby, less frustration and more satisfaction.”
Dr. Ulrich Schumacher
http://www.infineon.com
Published by Infineon Technologies AG