U s e r G u i d e , v 1 . 6 . 4 , J a n u a ry 2 0 0 3 TriCoreTM 32-bit Unified Processor DSP Optimization Guide Part 2: Routines IP Cores N e v e r s t o p t h i n k i n g . Edition 2003-01 Published by Infineon Technologies AG, St.-Martin-Strasse 53, D-81541 München, Germany © Infineon Technologies AG 2003. All Rights Reserved. Attention please! The information herein is given to describe certain components and shall not be considered as warranted characteristics. Terms of delivery and rights to technical change reserved. We hereby disclaim any and all warranties, including but not limited to warranties of non-infringement, regarding circuits, descriptions and charts stated herein. Infineon Technologies is an approved CECC manufacturer. Information For further information on technology, delivery terms and conditions and prices please contact your nearest Infineon Technologies Office in Germany or our Infineon Technologies Representatives worldwide (see www.infineon.com). Warnings Due to technical requirements components may contain dangerous substances. For information on the types in question please contact your nearest Infineon Technologies Office. Infineon Technologies Components may only be used in life-support devices or systems with the express written approval of Infineon Technologies, if a failure of such components can reasonably be expected to cause the failure of that life-support device or system, or to affect the safety or effectiveness of that device or system. Life support devices or systems are intended to be implanted in the human body, or to support and/or maintain and sustain and/or protect human life. If they fail, it is reasonable to assume that the health of the user or other persons may be endangered. U s e r G u i d e , v 1 . 6 . 4 , J a n u a ry 2 0 0 3 T r i C o r e TM 32-b it Un ified Pro c essor DSP Optimization Guid e Part 2: Routines DSP Optimization Guide Revision History: 2003-01 Previous Version: v1.6.3 Page Subjects (major changes since last revision) All Revised following internal review v1.6.4 We Listen to Your Comments Is there any information within this document that you feel is wrong, unclear or missing? Your feedback will help us to continuously improve the quality of this document. Please send your comments (referencing this document) to: [email protected] TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Table of Contents Page 1 1.1 1.1.1 1.1.2 1.2 1.2.1 1.2.2 1.2.3 1.2.4 1.2.5 1.2.6 1.3 1.3.1 1.3.2 1.3.3 1.3.4 1.3.5 1.3.6 1.3.7 1.4 1.4.1 1.4.1.1 1.4.1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Information Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Number of Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Arithmetic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Routine Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Pseudo Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Pipe Resource Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Assembly Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Register Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 How to Test a DSP Routine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 The Golden Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Transcendental . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Measuring Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 How to Count Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Counting Cycles for a Routine without Loops . . . . . . . . . . . . . . . . . . 23 Counting Cycles for a Routine with Loops . . . . . . . . . . . . . . . . . . . . . 25 2 2.1 Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Complex Wave Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 Transcendental Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Square Root (by Newton-Raphson) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Square Root (Taylor) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Inverse (y=1/x) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Natural Logarithm (y= ln(x)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Exponential (y=ex) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sine (y =sin(x)), range [-Pi/2, Pi/2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sine (y =sin(x)), range [-Pi, Pi) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 32 33 34 35 36 37 39 4 4.1 4.2 4.3 4.4 4.5 4.6 4.7 Scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-bit signed Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-bit signed Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-bit signed Multiplication (Result on 64-bit) . . . . . . . . . . . . . . . . . . . . . . ‘C’ Integer Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-bit Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-bit Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2nd Order Difference Equation (16-bit) . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 42 43 43 44 45 46 47 User Guide 5 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Table of Contents Page 4.8 4.9 4.10 4.11 4.12 2nd Order Difference Equation (32-bit) . . . . . . . . . . . . . . . . . . . . . . . . . . . Complex Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Complex Multiplication (Packed) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Complex Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Complex Update (Packed) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 49 51 52 54 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vector Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vector Pre-emphasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vector Square Difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vector Complex Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vector Complex Multiplication (Packed) . . . . . . . . . . . . . . . . . . . . . . . . . . Vector Complex Multiplication (Unrolled) . . . . . . . . . . . . . . . . . . . . . . . . . . Color Space Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vector Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vector Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 57 58 59 60 62 64 66 68 71 73 6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 6.10 6.11 6.12 6.13 6.14 6.15 6.16 6.17 6.18 6.19 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Dot Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Magnitude Square . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Vector Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 First Order FIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Second Order FIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 FIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Block FIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Auto-Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Complex FIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 First Order IIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Second Order IIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 BIQUAD 4 Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 N-stage BIQUAD 4 Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 N-stage BIQUAD 5 Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Lattice Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Leaky LMS (Update Only) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 Delayed LMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 Delayed LMS – 32-bit Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Delayed LMS – Complex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7 7.1 7.2 7.3 7.4 Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Real Butterfly – DIT – Radix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Real Butterfly – DIF – Radix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Complex Butterfly – DIT – Radix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Complex Butterfly – DIT – Radix 2 – with shift . . . . . . . . . . . . . . . . . . . . User Guide 6 113 116 118 120 123 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Table of Contents Page 7.5 Complex Butterfly – DIF – Radix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 8 8.1 8.2 8.2.1 Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . TriBoard Project Cycles Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Steps to Run the Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 User Guide 7 129 129 129 129 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1 Introduction This second part of the TriCore DSP Optimization guide contains short routines which, from the machine’s perspective, offer a high degree of optimization. Note: Machine (or assembly) perspective means instead of the algorithm perspective. There is always more gain to be made at the algorithm level, than at the machine (assembly implementation) level. The time-pressures for a ‘real-world’ project situation coupled with basic DSP features, will often add cycles to a project routine, but there are ways in which some routines can be further optimized. Aside from the ingenuity of the DSP programmer themselves, these include for example: • Using more unrolling and/or software pipelining (although there is then the potential drawback of making the code less readable and increasing the code size). • Using a memory model where coefficients and data are not separated in 2 memory spaces, allowing many algorithms to take advantage of interleaving data and coefficients. The chapters of this second part of the TriCore DSP Optimization guide are divided into the following function types: • Generator • Transcendental Functions • Scalars • Vectors • Filters • Transforms The following table identifies the characteristics of these routine types: Generator Input Output Processing single 1 series of n elements Iterative processing Transcendental single single Iterative processing Scalar single single simple equation Vector 1) 1 series of n elements 1 series of n elements Identical to all n elements Filter 1 series of n elements single Transform multiple multiple 1 series of n elements 1 series of n elements Not always identical to n elements 1) Includes a matrix operation User Guide 8 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Most routines are implemented as memory to memory, but some routines can be: • Register to register • Memory to memory with initialization of pointers • Full context switching (identical to a library call) It is very simple to use one type of model or another. In this document, each routine is presented with a short summary and all routines of the same type are grouped in an Information Table (a summary of the routines critical characteristics), at the start of each chapter. 1.1 Information Table Each section begins with an information table which gives: • • • • Number of cycles Code size Optimization techniques (used in the assembly code) Arithmetic method used 1.1.1 Number of Cycles The number of cycles is counted as follows: – – – – – – Software Pipelining Loop Unrolling Packed Operation Load / Store Scheduling Data Memory Interleaving Packed Load / Store Software Pipelining Software pipelining means starting an equation before the previous equation has finished. This is achieved with knowledge of the TriCore pipelining rules. Example: Implementation of a C square difference int sum = 0; for (i=0; i<N; i++) { a = X[i]; b = Y[I]; c = a – b; sum = sum + c*c; } User Guide 9 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines A naïve implementation would be: mov sumloop: d0,#0 ld.w d1,[Xptr+]4 ld.w d2,[Yptr+]4 sub d3,d1,d2 madd d0,d0,d3,d3 LC,sumloop sumAddr,d0 loop st.w ; ; ; ; ; ; prolog loop (1) (2) (3) (4,5,6) ; ld X0 ; ld Y0 ; ; ; epilog This implementation is very expensive in terms of cycles, because the loop begins with two LS (Load/Store) instructions followed by one IP (Integer Processing) instruction, and one MAC 32*32. The number of cycles can easily be decreased by using a different instruction order: • IP followed by LS, MAC 32*32 followed by LS mov ld.w ld.w sumloop: d0,#0 d1,[Xptr+]4 d2,[Yptr+]4 sub ld.w madd ld.w LC,sumloop sumAddr,d0 loop st.w ; ; ; ; d3,d1,d2 ; d1,[Xptr+]4 ; d0,d0,d3,d3 ; d2,[Yptr+]4 ; prolog ld X0 ld Y0 loop (1) || (2) || ; ; ld X1 for next pass ; ; ld Y1 for next pass ; epilog Now the number of cycles is reduced to 2 cycles per loop, compared with the previous 6 cycles per loop. This can be described in C as: int a = b = for sum = 0; X[0]; Y[0]; (i=1; i<N; i++) { c = a – b; a = X[i]; sum += c*c; b = Y[I]; } User Guide 10 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Loop Unrolling The equation is written twice or more, inside a loop. This technique is usually used with software pipelining. Example: Implementation of a C array sum z =0; for (i=0; i<N; i++) z += X[i]; In TriCore assembly language this becomes: mov d0,#0 vsumloop: loop st.w ld.w add LC,vsumloop Zaddr,d0 d1,[Xptr+]4 d0,d0,d1 ; ; ; ; prolog loop (1) ; ld X0 (2) ; z ; epilog Here the Load and Add operations are performed in 2 cycles for a single element of the array. This can be improved by computing the addition of two elements at a time, in the loop: mov d0,#0 ld.w d1,[Xptr+]4 vsumloop: add ld.w add ld.w loop LC,vsumloop st.w Zaddr,d0 d0,d0,d1 d1,[Xptr+]4 d0,d0,d1 d1,[Xptr+]4 ; ; ; ; ; ; ; prolog ld X0 loop (1) || (2) || ; ; ; ; z ld X1 z ld X2 ; epilog Adding two elements now takes only 2 cycles instead of 4. This can also be written in C: z = 0; for (i=0; i<N/2; i++) { z+= X[2*i]; z+= Z[2*i+1]; } User Guide 11 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Packed Operation With packed operation, two different data are packed in the same register. Example: Load of two 16-bits values in a register ld.w ssX,[Xaddr] ; ld X0 X1 Load / Store Scheduling In Load/Store scheduling, the Load and Store instructions are reorganized to reduce the number of cycles. Example: Transform routine X0 Æ X0’ X1 Æ X1’ This transform takes 6 cycles: ld.w mul sub st.w ld.w mul sub st.w d0,[Xptr] d0,d0,#5 d0,d0,#1 [Xptr+]4,d0 d0,[Xptr] d0,d0,#5 d0,d0,#1 [Xptr+]4,d0 ; ; ; ; ; ; ; ; (1) (2) (3) || (4) (5) (6) || A cycle is saved with Load and Store scheduling: ld.w mul ld.w sub st.w mul sub st.w User Guide d0,[Xptr] d0,d0,#5 d1,[Xptr]+4 d0,d0,#1 [Xptr+]4,d0 d1,d1,#5 d1,d1,#1 [Xptr+]4,d1 ; ; ; ; ; ; ; ; (1) (2) || (3) || (4) (5) || 12 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Data Memory Interleaving Here, data are mixed with other types of data in memory. Example: Xi0 at 0xD0000000 Xi1 at 0xD0000002 Xr0 at 0xD0000004 Xr1 at 0xD0000006 instead of: Xi0 at 0xD0000000 Xr0 at 0xD0000002 Xi1 at 0xD0000004 Xr1 at 0xD0000006 Packed Load / Store With Packed Load/Store at least two data are loaded or stored in the same instruction. Example: Load two 32-bits values Instead of: ld.w ld.w d0,[Xptr+]4 d1,[Xptr+]4 ; ld X0 ; ld X1 It can be written as: ld.d 1.1.2 e0,[Xptr+]8 ; ld X0 X1 Arithmetic Methods • Saturation – at least one instruction is used with saturation. • Rounding – at least one instruction is used with rounding. User Guide 13 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1.2 Routine Organization A typical routine page in this document is divided into the following different sections: Title 5.4 Equation Equation Zn=(Xn-Yn)2 Vector Square Difference n=0..N-1 Pseudo code: for (n=0; n<N; n++) { sTmp = xX[n]-sY[n]; sZ[n] = sTmp*sTmp; } Pseudo Code Pipe Resource Table IP=2 (1 sub, 1 mul) LD/ST=3 (read sX, read sY, write sZ) Assembly code: lea LC,(N/2 - 1) ld.d ssssXY, [Xptr+]8 sqdloop: subs.h ssTmp,ssX,ssY ld.d ssssXY, [Xptr+]8 mulr.h ssZ,ssTmp,ssTmp ul,#1 st.w [Zptr+]4, ssZ Loop LC, sqdloop Assembly Code ;(1) ;(2) ;get loop number ;X0 X1 Y0 Y1 ;(1) ;X1 - Y1 || X0 Y0 ;|| ;X2 X3 Y2 Y3 ;(2,3) ;(X1-Y1)^2 || (X0-Y0)^2 Memory Organization Memory Organization Number of Cycles Xaddr sX0 Xaddr +2 sX1 Xaddr +4 sY0 Xaddr +6 sY1 Xaddr +8 sX2 Xaddr +10 etc... Example N=160 244 cycles Æ Register Diagram Register Diagram Instruction d0 d1 d3 / d2 Load/Store y1 y0 x1 x0 ld x0x1y0y1 The topics shown in the figure above are described in the following sub-sections: 1.2.1 Equation The Equation section gives the generic equation of the algorithm. When two (or more) variables are written back, there are several equations. • n the variable is an array or a vector • r the value is real • i the value is imaginary User Guide 14 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1.2.2 Pseudo Code Pseudo Code provides an exact description of the equation(s). The ‘Pseudo C code’ has several advantages compared to C code: • There is no strict syntax requirement. The classic example is typecasting in C, which sometimes gives unreadable code. • It may use ‘non C’ constructs. It is very difficult for example, to express 64-bit quantities (not a standard) and circular addressing in C. The Pseudo code also determines the number of IP (Integer Processing) and LS (Load/ Store) instructions, as well as the memory bandwidth (2, 16-bit values do not give the same bandwidth as 2, 64-bit values). This is important to remember, as the Optimization can not perform better than implementing the minimum number of operations. Pseudo Code Implementation Pseudo code implementation is used to make the link between the description of specifications (in Pseudo code) and the implementation (in Assembly code). It explains how the specification Pseudo-code lines are actually computed in the Assembly code, and is very useful when Packed Techniques and Software Pipelining are used. 1.2.3 Pipe Resource Table IP = 1 (1 MUL) Integer Processing LD / ST = 2 (READ lX, write lZ) Number of IP, MAC Instructions Load and Store Number of LD / ST Instructions The Pipe Resource table acts as a ‘sanity check’, which can help in implementing the routine. For example, there is no need to optimize 2 instructions on the IP side of a routine if the bottleneck is due to 5 instructions on the LS side. Note: As the pipe resource table is based on the Pseudo code, it will not show the number of IP and LS instructions inside the routine’s loop, because this is dependent on all of the Optimization techniques used/applied. User Guide 15 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1.2.4 Assembly Code The Assembly code given in this document is the actual TriCore Assembly code. Where ‘N’ appears in the loop counter, it refers to the number of points given by the Equation or Pseudo code sections. Variable names (rather than registers) are used for easy readability. It should be noted that the pointers are not defined as these are generally declared in a global file. Cycles are indicated in the Assembly code comments: (1), (2), (3). The total number of cycles is summarized in a table following the code. The time taken to enter and leave a loop is only indicated in the Information table. 1.2.5 Register Diagram The Register diagram is an aid to visualizing the TriCore pipeline model. It also acts as a working sheet to optimize the algorithm. The Register diagram is made up of the following fields: • Instruction – Contains processing instructions (add, sub, madd, shifts, logic operation) • d0…d7 – The first eight registers are shown. The value in the register is the value at the end of the instruction. This can be confusing when a register is being loaded from memory at the same time as it is used in the instruction. The value is loaded after the calculation. • Load / Store – Indicates when data is being read from memory or being stored back to memory. • Bold borders – Indicates which instructions are in the loop. User Guide 16 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1.2.6 Notation Certain names and abbreviations are used to make the routines easier to understand. This notation is very useful when attempting to optimize a routine. Abbreviation Description s Short (16-bit value) ss Two short values are in a 32-bit register ssss Four short values are in a 64-bit register l Long (32-bit value) ll Long-long (2 long in a 64-bit register) ll Long-long (64-bit value) || Instruction executed in parallel (1) Cycle number Xptr Pointer for X values Yptr Pointer for Y values Kptr Pointer for K values Zptr Pointer for Z values Vptr Pointer for V values Wptr Pointer for W values XBptr Pointer for X values used for circular addressing (ex:a2/a3) Xaddr Address of X values Kaddr Address of K values Yaddr Address of Y values Zaddr Address of Z values LC Address register (usually a5) used as a loop counter User Guide 17 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1.3 How to Test a DSP Routine All of the routines described in this optimization guide are implemented in TriCore assembly code. The TriCore assembly code becomes, in effect, the reference code. The question that is then raised is how do we know it works? In other words, how do you test an optimized DSP routine? These questions are addressed in this section. 1.3.1 The Golden Models For every type of routine there is a different input and output format. This, together with the way in which they use the memory, means that there is a specific way to test them. The essential idea is to see the routine as a ‘black box’: M D DSP ROUTINE The ‘Golden model’ is used as the reference and is written in C. When the data type is integer it gives a bit-exact result, which can be used to perform a direct comparison with the TriCore assembly implementation results. When the data type is Floating-Point, the comparison cannot be exact. However the approximation of the result is sufficient to verify the TriCore implementation. The best test process should have 3 steps: • Compute the DSP routine and the Golden model on the same data buffer. • Store the two results. • Compute the difference between the Golden model and the TriCore assembly for each value, and keep the greatest difference. If this value is less than the maximum allowed error P, the test succeeds. If it exceeds the maximum allowed error then the implementation is either wrong or not accurate enough. • If more precision is required, the comparison should be performed on a spreadsheet. P DSP Routine written in TriCore assembly (‘asm’) Yes < Accuracy ? DATA No Golden Model Test processes will be different for each kind of DSP algorithm, because of the different kind of input / output formats and different memory implementation. User Guide 18 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines To help the programmer testing the routines, a project using Tasking EDE can be created. The project space should contain 6 projects, one for each type of routine: • Generators • Scalars • Filters • Transcendental • Vectors • Transforms 1.3.2 Generators There are 5 assembly files (*.src) organized in 5 directories. The 5 generators are called by the C program (‘generators.c’) and are compared against the Golden models. Only the complex wave generation is described in detail, as the other 4 generators are either directly written in C or are derived from the complex wave. TriCore asm DISPLAY Yes Same Shape ? DATA Golden Model (C) 1.3.3 No DISPLAY Transcendental There are 6 transcendental assembly files (*.src), organized in 6 directories. The two sine functions described in the manual are in the same file (‘testsin.src’), which also contains 8 sine versions representing different precision’s. All functions are called by the C program ‘trancendental.c’, and are compared against the Golden models. TriCore asm f(x) C: f -1(x) P Yes < Accuracy ? DATA Golden Model (C) f(x) No C: f -1(x) Additionally, the inverse of each function is computed. This is easily achieved since a transcendental function such as y=f(x) will always have a corresponding f-1(x). The principle advantage is that it is very fast to see mistakes. This is because the output (f-1(f(x)) should be the same as the input. For each routine there is an associated spreadsheet comparing precision. User Guide 19 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1.3.4 Scalars There are 12 scalar routines, organised as 11 scalar assembly files (8.src) in 8 directories. Two, 32-bit signed multiplications are in the same file. All functions are called by the C program (‘scalars.c’) and are compared against the Golden models. P TriCore asm Yes < Accuracy ? DATA No Golden Model (C) In this instance the required accuracy is generally bit-exact, as Scalar routines are very easy to model in 32-bit integer C. 1.3.5 Vectors There are 10 vector assembly files (*.src), each in its own directory. All functions are called by the C program ‘vectors.c’, and are compared against the Golden models. Inside a loop P TriCore asm Yes < Accuracy ? DATA No Golden Model (C) In this instance the required accuracy is generally bit-exact, as these routines are easy to model in 32-bit integer C. User Guide 20 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1.3.6 Filters There are 19 filter assembly files (*.src), each in its own directory. The C program ‘filters.c’, calls all functions. There are no Golden models. Instead, the Tasking data analysis window ‘scope function’, can be used. Inside a loop Oscilloscope TriCore asm O O O DATA The routine is usually just the kernel of a filter, so it will be inside a loop and will require careful implementation in memory. 1.3.7 Transforms There are 5 Transform assembly files (*.src), each in its own directory. All 5 routines are called by the C program ‘transform.c’, and compared against the Golden model. TriCore asm DISPLAY Yes Same Shape ? DATA Golden Model (C) User Guide DISPLAY 21 No v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1.4 Measuring Cycles This section describes how to measure the number of cycles for a specific routine. As the test is dependent on the routine type (transcendentals, scalars, vectors, etc.), this section offers a universal method of testing. Note: Some routines with loops require modifications. These are explained in this section. A test should not change the code or the pointers. This means that it is important to be careful with the data memory mapping (When there is not enough space in the DMU, the test loop number should be decreased). 1.4.1 How to Count Cycles The test code is a small program with the routine included. This code is called by an assembly function, CYCLE_COUNT(). The CYCLE_COUNT() function executes the code that starts in 0xD4000000 and returns an integer value, the number of cycles. The timer counter is read before the call, subtracted from the value after executing the test code, and then multiplied by two (because the timer counter is on the FPI clock, its speed is half of the CPU one). In the cycle test project this function is mapped in PMU, just after the test code. Assembly code: .sect "program.code" CONST.A .macro reg,addr movh.a reg,#((addr) + 0x8000 >> 16) lea reg,[reg]((((addr) + 0x8000) & 0xffff) - 0x8000) .endm CYCLE_COUNT: CONST.A a10,0xd4000000 ld.w d9,0xf0000310 calli a10 ld.w d10,0xf0000310 sub d2,d10,d9 sh d2,#1 ret User Guide ;load program location address ;load sys timer counter value before call ;call the function ;load sys timer counter value after call ;compute the difference 22 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1.4.1.1 Counting Cycles for a Routine without Loops The test program consists of a loop performed 1000 times with: • 50 nop32 • The routine • 50 add d0,d0,d0 To run the test: 1. Run an loop without the routine, execute and write down the cycle number N. 2. Include the routine inside the loop, execute and write down the cycle number M. 3. The number of cycles is (M-N)/1000. The code is located in PMU (Rider-D: 0xD4000000 to 0xD4007fff), and the data should be in DMU (Rider-D: 0xD0000000 to 0xD0007fff). Tasking macro operation .dup is used here for more clarity. 1. Run an empty Test Loop Assembly code: ;####### DATA ################################################ .sect "test_cycles.data" ;####### CODE ################################################ .sect "test_cycles.code" lea a5,999 . align 8 testloop: .dup 50 nop32 .endm ;--- include routine here ;--- end routine .dup 50 add d0,d0,d0 .endm loop a5,testloop ret User Guide 23 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 2. Include the Routine in the Loop, and Data Memory Assembly code: ;####### DATA ################################################# .sect "test_cycles.data" Xaddr: .half 0x1111 Kaddr: .half 0x2222 Zaddr: .half 0xDEAD ;####### CODE ################################################# .sect "test_cycles.code" NUM .set 50 lea a5,999 .align 8 superloop: .dup NUM nop32 .endm ;--- include routine here ld.q d1,Xaddr ld.q d2,Kaddr mulr.q d15,d1 u,d2 u, #1 st.q Zaddr,d15 ;--- end routine .dup NUM add d0,d0,d0 .endm loop a5,superloop ret Note: Because of the addressing mode used in the routine, an initialization of pointers is usually required. This initialization should take place outside the loop (this should not be counted in the routine’s number of cycles). User Guide 24 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 1.4.1.2 Counting Cycles for a Routine with Loops Loops need to be aligned on an 8-byte boundary in order to be executed in the exact number of cycles predicted. Therefore to test the routine with a loop, alignment has to be carried out first, by including some nop16 and nop32 before the test loop. The ld.d and st.d instructions also need special care. The pointer of the data loaded or stored needs to be aligned on an 8-byte boundary. Data memory and pointers’ sometimes have to be changed to avoid a misalignment. The test program consists of a jump loop, run 1000 times with: • • • • 50 nop32 The pointer initialization The routine 50 add d0,d0,d0 To Run the Test: 1. Include the routine in the test loop and align data and loops. 2. Run a loop with the pointer initialization but without the routine (by commenting it out), execute and write down the cycle number N. 3. Remove the comments around the routine, execute and write down the cycle number M. 4. The number of cycles is (M-N)/1000. Assembly code: mov d15,#999 testloop: .dup 50 nop32 .endm ;--- include routine + init here --;--- end routine + init here ------.dup 50 add .endm jned ret d0,d0,d0 d15,#0, testloop Note: In this example the data register d15 is used for the jump. If this register is used in the routine, another one should be used (define macros area has to be checked). User Guide 25 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Examples: Test of a routine with loop Assembly code: ;####### CODE ############################################## .sect "test_cycles.code" ;-----------------------------------------------------------;vector complex multiplication (5cycles in the loop) ;-----------------------------------------------------------.define sYr "d1" .define sYi "d3" .define ssX "d4" .define ssK "d5" .define Xptr "a2" .define Kptr "a3" .define Yptr "a4" .define LC "a5" .define N "8" mov nop32 d15,#999 nop16 ; testloop counter ; nops here to align label nmloop on a 8 bytes ; boundary ; testloop2: .dup 50 nop32 .endm ;--- include routine + init here --lea Xptr,buffin_vcplxmul1 lea Kptr,buffin_vcplxmul2 lea Yptr,buffout_vcplxmul lea movh mov nmloop: LC,(N-1) d6,#0 d7,#0 ld.w ssK,[Kptr+]4 ld.w ssX,[Xptr+]4 msubadm.h e0,e6,ssK,ssX ul,#1 mulm.h e2,ssK,ssX lu,#1 st.h [Yptr+]2,sYr st.h [Yptr+]2,sYi loop LC,nmloop ;--- end routine + init here ------- User Guide 26 ; Xptr ; Kptr ; Yptr ; (1) ; (2) ; (3) ; get loop number ; clear 3rd source ; clear 3rd source ; ; ; ; ; ; ; ; ; ; ; ; (1) (2) (3) (4) || (5) load k load x yr = xr*kr - xi*ki yi = xr*ki + xi*kr store yr store yi v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines .dup 50 add .endm jned ret d0,d0,d0 d15,#0,testloop2 Problems Related to Cycle Count • ld.d and st.d: – Data pointer should be aligned on an 8-byte boundary. The memory mapping needs changes if the pointer is misaligned in the loop. • Loop alignment: – Label should be aligned on an 8-byte boundary in PMU. Use nop32 and nop16 before the test loop to align the label. • PMU cache (in rider-D: 0xD4000000 to 0xD4007FFF): – Because of the internal scratchpad, an address in the PMU cache could add one cycle. Note: Warning for Rider-D • Two loops inside each other will add more cycles, so this should be avoided • Use 32-bit opcode inside a loop to maintain alignment User Guide 27 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 2 Generator A Generator can be regarded as a moving vector (one or two dimensions), stored in memory, in a buffer. This window in memory is then displayed on a screen such as an oscilloscope. 2.1 Complex Wave Generation Equation: X = X*Kr - Y*Ki Y = X*Ki + Y*Kr Pseudo code: for (n=0; n<N; n++) { sX = sX*sKr-sY*sKi; sY = sX*sKi+sY*sKr; } IP= 4 (2 mul, 1 msub, 1 madd) LD/ST= 1 in packed format (write sX || sY) Assembly code: lea ld.w ld.w ldloop: loop st.w Cycles User Guide LC,(n-1) k,[Kptr] xy,startvect ; (1) ; (2) ; (3) mulr.h temp,xy,k ll,#1 st.w [OUTptr+]4,xy maddsur.h xy,temp,xy,k uu,#1 LC,ldloop [OUTptr+]4,xy ; (1) ; y’ = y*b || x’ = x*b ; || ; st x1,y1 (next loop) ; (2,3) ; y’ += x*k || x’-= y*k N = 100 28 Æ ; || ; load loop counter ; ld rotation vector ; ld start vector ; st last x,y 305 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d3 d5 d4 kikr ld kikr xy mulr.h temp,xy,k ll,#1 Load / Store ld x y y=x*ki || x=x*kr maddsur.h xy,temp,xy,k uu,#1 y=y+y*kr || x=y-y*ki st x y st x y User Guide 29 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 3 Transcendental Functions Please note the following points on Transcendental functions: • They are commonly used in domains other than DSP. • They have more acute arithmetic problems (as opposed to signal processing problems). • The same functions require different precision levels (application/programmer dependent) • They do not easily lend themselves to multi-MAC operations, since they are inherently iterative (no parallelism) and they produce a single result. • They can be implemented as a table look-up (space) or opposed to a series expansion (time). • They can be implemented as a combination of space and time (partial look-up table and partial computation). This illustrates the first law of algorithms, that the space-time continuum is constant. The table which follows summarises the Optimization techniques and Arithmetic methods that are applicable to the different types of Transcendental Functions. User Guide 30 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Name Cycles Code Size 1) Optimization Techniques Software Pipelining Loop Unrolling Packed Operation Load/Store Scheduling Data Memory Interleaving Packed Load/Store Saturation Rounding Transcendental Functions Summary Table Square Root (Newton-Raphson) 44 72 - - - - - - 9 9 Square Root (Taylor) 26 82 - - - 9 - 9 9 9 Inverse 24 42 - - - - - - 9 9 Natural Logarithm 16 46 - - - - - - - 9 Exponential 22 40 - - - - - - - 9 Sine [-PI/2,PI/2) 38 48 - - - - - - 9 - Sine [-PI,PI) 42 68 - - - - - - 9 - 1) Arithmetic Methods Code Size is in Bytes User Guide 31 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 3.1 Square Root (by Newton-Raphson) Equation: Input: [0.25,1) in 1Q15 (X should be normalized to 0.25.. 1) Output: 2Q14 Y 0 = 1.1033 – 0.6666* X Y n+1 = Yn * (1.5 – (X*2)* Yn2) where n = 0,1,2 Pseudo code: // The loop calculates 1/(2*z) sK = 2* sX; sY = 1.1033 – 0.6666*sX; for (n = 0; n<3; n++) sY = sY*(1.5- sK*sY*sY); sZ = sY*(2*sX-1) + sY; IP= 4 (3 mul, 1 sub) LD/ST= 2 (load sK, store sY) Assembly code: movh lea movh ld.q movh movh mov msubr.q d0,#0x469c a4,2 d1,#0x5553 sX,[a3+] d9,#0x8000 d8,#0x1000 sK,sX sY,d0,sX u,d1u,#0 ; ; ; ; ; ; ; ; (1) || (2) || (3) (4) (5) (6) ; ; ; ; ; ; ; ; 1.1033 in 2q14 initialization of counter 0.6666 in 1q15 load sX in 1q15 d9 = -1 in1q15 d8 = 0.5 in3q13 sK in 2q14(1q15->2q14->*2) Y02q14 ; ; ; ; ; ; ; ; (1,2) ;Y0^21q15 (3) ;result in2q14 (4,5) ;0.5-sK*Y0^23q13 (6) ;result in 2q14 (7) ;Y in 3q13 (8,9) Y0+Y0[0.5-sK*sY0^2]3q13 (10) ;Y1 in 2q14 ; ; ; ; 2*sX-1 in 1q15 (2*sX-1)/(2*sqrt(sX)) (2*sX-1)/(2*sqrt(sX))+1/ (2*sqrt(sX)) = sqrt(sX) sqrloop: mulr.q sh msubrs.q shas sh maddrs.q tmp,sY u,sY u,#1 tmp,tmp,#-1 tmp,d8,tmp u,sK u,#1 tmp,tmp,#1 d1,sY,#-1 sY,d1,sY u,tmp u,#1 shas a4,sqrloop sY,sY,#1 loop adds mulr.q adds d6,sK,d9 d0,sY u,d6u,#1 d3,sY,d0 User Guide ; (7) ; (8,9) ; (10) 32 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 3.2 Square Root (Taylor) Equations: Input: [0.5,1) in 2Q14 (X should be normalized to 0.5.. 1) Output: 1Q15 x= (y – 1)/2 y^0.5 = 1 + x – 0.5*(x^2) + 0.5*(x^3) – 0.625*(x^4) + 0.875*(x^5) Pseudo code: sX[0] = (sY – 1)/2; for(n=1;n<5;n++) for(n=0;n<3;n++) sX[n] = sX[0] * sX[n-1]; eY += sX[2*n] * sK[2*n] + sX[2*n+1] * sK[2*n+1]; IP= 1+2 (1 mul), (1 mul, 1 add) LD/ST= 1 + 2 (st sX), (ld sX,sK) Assembly code: lea lea ld.q sh addih st.q movh st.q mov iloop: a3,xsqrtvalue a4,3 d1,[a3] d1,d1,#-1 d1,d1,#0xc000 [a3+]2,d1 d0,#0x8000 [a3+]2,d0 d2,d1 ; ; ; ; ; ; ; ; ; (1) (2) (3) (4) (5) || (6) || (7) loop mulr.q nop st.q a4,iloop d2,d1u,d2u,#1 ; (1,2) ; || [a3+]2,d2 ; || lea mov ld.w mov ld.d a2,ksqrtvalue d0,#0 d2,[a2+]4 d1,#0 e4,[-a3]8 ; ; ; ; ; (8) (9) || (10) || maddms.h ld.d maddms.h ld.w maddms.h st.h e0,e0,d5,d2ul,#1 e2,[a2+]8 e0,e0,d4,d2ul,#1 d5,[-a3]4 e0,e0,d5,d3ul,#1 YAddr,d1 ; ; ; ; ; ; (11,12) || (13,14) || (15,16) || ;load loop counter ;x ;y/2 ;x1=y/2-0.5 ;x0 ;computation of x2,x3,x4,x5 ;initialization of y(lower) ;k5k4 ;initialization of y(upper) ;x5x4x3x2 ;y=y+x4k4+x5k5 ;k3k2k1k0 ;y=y+x2k2+x3k3 ;x1x0 ;y=y+x0k0+x1k1 ksqrtvalue:.half 0xb000,0x7000,0xc000,0x4000,0x7fff,0x8000 ;k5k4k3k2k1k0 xsqrtvalue:.half 0x7000,0x0000,0x0000,0x0000,0x0000,0x0000 ;x1x0x2x3x4x5 User Guide 33 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 3.3 Inverse (y=1/x) Equations: Input: [+0.5..+1) in 2Q14. Output: (+1..+2] in 2Q14. Yk+1 = 2*Yk(1 – (X/2)*Yk) X, Y are 16-bit values Pseudo code: sY = 1.457; for(i=0;i<3;i++) sY = 2*sY*(1 - (sX/2)*sY) IP= 4 (2 shift, 1 mul, 1 msub) LD/ST= 0 Assembly code: lea lea a2,coef_inv a4,0x02 ld.q sh movh ld.q scond: d0,[a3+] d0,d0,#-1 d3,#0x2000 d2,[a2] msubrs.q sh mulr.q loop shas a4,scond coef_inv: .half User Guide ; ; ; ; ; ; ; (1) (2) number of iteration on the series (3) ; load x 2q14 (4) ; x/2 2q14 (5) ; 1 in 3q13 || ; y[0]= 1.457 in2q14 d4,d3,d0u,d2u,#1 ; (1,2) d4,d4,#1 ; (3) d4,d4u,d2u,#1 ; (4,5) ; temp = d2,d4,#2 ; (6) 0x5d3f ; temp = 1 - (x/2)*y 3q13 ; temp = 1 - (x/2)*y 2q14 y*(1 - (x/2)*y)3q13 ; y = temp 2q14 ;1.457 34 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 3.4 Natural Logarithm (y= ln(x)) Equations: Y = K1*(X-1)+K2*(X-1)2+K3*(X-1)3+K4*(X-1)4+K5*(X-1)5 Y, X, K are 16-bit values. Input: [+1..+2) in 2Q14. Ouput: in 1Q15 Pseudo code: for(i=0;i<4;i++) sY *= sX + sK[n] IP= 1 (1 madd) LD/ST= 1 (1 ld sK) Assembly code: lea a2,coef_log lea ld.q movh ld.q sub ld.q sh ilop: a3,0x03 d4,[a4+]2 d5,#0x4000 d2,[a2+]2 d4,d4,d5 d3,[a2+]2 d4,d4,#1 loop mulr.q ; ; ; ; ; ; ; ; ; (1) load the address of the first coeff k5 (2) ;initialize the counter (3) ;load the number to log in 2q14 (4) ;12q14 || ;load k5 (5) ;z = x - 1 || ;load k4 (6) ;result in 1q15 maddr.q d2,d3,d2u,d4u,#1 ld.q d3,[a2+]2 a3,ilop ; || d6,d2u,d4u,#1 ; (7,8) ;(1,2) ; ;|| ; ;(((k5*z+k4)*z+k3)z+k2)z+k1 ;((((k5*z+k4)z+k3)z+k2)z+k1)z coef_log: .half 0x0404,0xeef8,0x2491,0xc149,0x7fe3 ;in 1Q15 User Guide 35 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Exponential (y=ex) 3.5 Equations: Y = K1*X+K2*X2+K3*X3+K4*X4+K5*X5+K6*X6+K7*X7 Y, X, K are 16-bit values Input: [0..1) in 1Q15 Output: in 3Q13 Pseudo code: for(i=0;i<6;i++) sY *= sX + sK[n] IP= 1 (1 madd) LD/ST= 1 (1 ld sK) Assembly code: lea lea ld.q a2,coef_exp a3,0x05 d4,[a4+]2 ; (1) ; (2) ; (4) ld.q ld.q ilop: d2,[a2+]2 d3,[a2+]2 ; (5) ; (6) loop mulr.q addih sh ;load address of the first coeff k7 ;initialize the counter ;load the number we would like the exp ;in 1q15 ;load k7 2q14 ;load k6 2q14 maddr.q ld.q a3,ilop d2,d3,d2u,d4u,#1 ; (1,2) ;2q14 d3,[a2+]2 ; || ;load next coef 2q14 ; || ; (((((k7*z+k6)z+k5)z+k4)z+k3)z+k2)z+k1 d6,d2u,d4u,#1 ; (7,8) ; ((((((k7*z+k6)z+k5)z+k4)z+k3)z+k2)z+k1)z ; 2q14 d6,d6,#0x4001 ; (9) ;add 1 to result d6,#-1 ; (10) ;result in 3Q13 coef_exp:.half 0x0003,0x0016,0x0088,0x02aa,0x0aaa,0x2000,0x4000;in Q14 User Guide 36 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 3.6 Sine (y =sin(x)), range [-Pi/2, Pi/2) Equations: Sin(x)=k1*x+k2*x3+k3*x5+k4*x7+k5*x9+… Input: [-1,1) in 1Q15 Output: [-1,1) in 1Q31 This can also be written as: Sin(x)=((((k5*x2+k4)*x2+k4)*x2+k3)*x2+k2)*x2+k1)*x This series is valid for x∈[-Pi/2, Pi/2], so the input between [-1,+1) is scaled to the range [-Pi/2, Pi/2) and gives a result between –1 and 1. 0x7fff x 0x0000 Y=Sine(x) 0x8000 routines_sec3_6a.eps Pseudo code: for(i=0;i<5;i++) lY *= lX + lK[n] IP= 1 (1 madd) User Guide LD/ST= 1 (1 ld sK) 37 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Assembly code: lea lea ld.q 1Q15 ld.w a2,coef_sin a3,0x04 d4,[a4+]2 ; (1) ; (2) ; (3) d2,LX1 mul.q mul.q ld.w ld.w lloop: d4,d4,d2,#1 d1,d4,d4,#1 d2,[a2+]4 d8,[a2+]4 ; ; ; ; ; ; loop mul.q shas LX1: madds.q sh ld.w a3,lloop d6,d2,d4,#1 d6,d6,#1 ;load the address of the first coeff ;initialize the counter ;load number we would like the sine (4) ;load the factor of norm (PI/2) 2Q30 (5,6,7) ;x = x*a 2q30 (8,9) ;z = x*x, 3q29 || ;load k5 1q31 (10) ;load k4 1q31 d2,d8,d2,d1,#1 d2,d2,#2 d8,[a2+]4 ; || ; (11,12) ; (13) ; (1,2,3) ;give the result in 3q29 ; (4) ;1q31 ; || ;1q31 ;(((k5*z+k4)*z+k3)z+k2)z+k1 ;((((k5*z+k4)*z+k3)z+k2)z+k1)x 2q30 ;1q31 .word 0x6487ED51;PI/2 in Q30 coef_sin:.word 0xffffffca,0x000005c7,0xfffe5fe6,0x00444444,0xfaaaaaab,0x20000000;1Q31 and 3Q29 User Guide 38 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 3.7 Sine (y =sin(x)), range [-Pi, Pi) Equations: Sin(x)=k1*x+k2*x3+k3*x5+k4*x7+k5*x9+… Input: [-1,1) in 1Q15 Output: [-1,1) in 1Q31 By using the previous sine, we can get the result for the range [-Pi, Pi] with: • Sin(Pi-x) = Sin(x) 0x4000 0x3fff Change of variable: x If 0x4000<x<0xbfff f Then x’=(0x7fff-x)<<1 0x7fff 0x0000 If 0xc000<x<0x3ff Then x’=x<<1 0x8000 Y=Sine(x’) 0xbfff 0xc000 routine_sec3_7a.eps Pseudo code: for(i=0;i<5;i++) lY *= lX + lK[n] IP= 1 (1 madd) User Guide LD/ST= 1 (1 ld sK) 39 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Assembly code: lea lea ld.q a2,coef_sin0 a3,0x04 d4,[a4+]2 ;change of variable movh d9,#0x8000 xor.t d2,d4:31,d4:30 jz d2,lab add d4,d9,d4 rsub d4,d4,#0 lab: sh d4,#1 ld.w d2,LX1 mul.q mul.q ld.w ld.w llloop: d4,d4,d2,#1 d1,d4,d4,#1 d2,[a2+]4 d8,[a2+]4 madds.q ; ; ; ; load address of first coeff initialize the counter load number we would like the sine 1Q15 ; ; ; ; ; ; (4) (5) (6,7) (8) (9) (10) ; ; ; ; ; ; ; -1 0x4000<x<0xbfff ? if not, go to lab else x = x-1 x = -(x-1) = 1-x x’=x<<1 end change of variable ; ; ; ; ; ; (11) load the factor of norm (PI/2) (12,13,14) ; x = x*a 2q30 (15,16) ; z = x*x, 3q29 || ; load k51q31 (17) ; load k41q31 d2,d8,d2,d1,#1 loop mul.q sh d2,d2,#2 ld.w d8,[a2+]4 a3,llloop d6,d2,d4,#1 shas d6,d6,#1 LX1: ; (1) ; (2) ; (3) ; ; ; ; ; ; ; ; ; ; ; 2Q30 (1,2,3) give the result in 3q29 1q31 1q31 (((k5*z+k4)*z+k3)z+k2)z+k1 (4) || || (18,19) ((((k5*z+k4)*z+k3)z+k2)z+k1)x 2q30 (20) ; 1q31 .word 0x6487ED51;PI/2 in Q30 coef_sin0:.word 0xffffffca,0x000005c7,0xfffe5fe6,0x00444444,0xfaaaaaab,0x20000000;1Q31 and 3Q29 User Guide 40 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Cycles Code Optimization Techniques Size Packed Operation Load/Store Scheduling Data Memory Interleaving Packed Load/Store Saturation Rounding 16-bit signed Multiplication 3,4 16 - - - - - - - 9 32-bit signed Multiplication 32-bit result 5 16 - - - - - - - - 32-bit signed Multiplication 64-bit result 5 16 - - - - - - - - “C” integer multiplication 5 16 - - - - - - - - 16-bit update 5 20 - - - - - - 9 9 32-bit update 6,5 20 - - - - - - 9 - 2nd order diff. 4 equation (16-bit) 24 - - 9 - - 9 - - 2nd order diff. 4 equation (32-bit) 24 - - 9 - - 9 - - Complex multiplication 5 20 - - - - - - - - Complex multiplication (packed) 5 14 - - 9 - - 9 9 9 Complex update 7 24 - - - - - - - - Complex update 6 (packed) 18 - - 9 - - 9 9 9 Bytes Name Loop Unrolling Scalars Software Pipelining 4 User Guide 41 Arithmetic Methods v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.1 16-bit signed Multiplication Pseudo code: sZ = sX*sK; IP= 1 (1 mul) LD/ST= 3 (read lX, read lK, write lZ) Assembly code: ;result = 16-bit ld.q sX, Xaddr ld.q sK, Kaddr mul.q lZ,sX u,sK u,#1 st.q Zaddr,sZ ; ; ; ; (1) (2) (3) || ; ; ; left justified ; store 16-bit upper ;same with result = 32-bit ld.q sX, Xaddr ld.q sK, Kaddr mul.q lZ,sX u,sK u,#1 st.w Zaddr,lZ ; ; ; ; (1) (2) (3) || ; ; ; left justified ; store 32-bit (1) (2) (3,4) || ; ; ; left justified ; store 16-bit upper ;same with result = rounded 16-bit ld.q sX, Xaddr ; ld.q sK, Kaddr ; mulr.q sZ,sX u,sK u,#1 ; st.q Zaddr,sZ ; User Guide 42 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.2 32-bit signed Multiplication Pseudo code: lZ = lX*lK; IP= 1 (1 mul) LD/ST= 3 (read lX, read lK, write lZ) Assembly code: ; lX,lK,lZ 32-bit signed ld.w lX, Xaddr ld.w lK, Kaddr mul.q lZ,lX,lK,#1 st.w Zaddr,lZ 4.3 ; ; ; ; (1) (2) (3,4,5) || ; ; ; ; 32-bit signed Multiplication (Result on 64-bit) Pseudo code: llZ = lX*lK; all signed IP= 1 (1 mul) LD/ST= 3 (read lX, read lK, write llZ) Assembly code: ; lX,lK 32-bit signed, llZ 64-bit signed ld.w lX, Xaddr ; (1) ld.w lK, Kaddr ; (2) mul.q llZ,lX,lK,#1 ; (3,4,5) st.d Zaddr,llZ ; || User Guide 43 ; ; ; ; same number of cycles v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.4 ‘C’ Integer Multiplication Pseudo code: This multiplication takes the lower part of the result (MUL), whereas previous multiplications take the upper part (MUL.Q). lZ = lX*lK; IP= 1 (1 mul) LD/ST= 3 (read lX, read lK, write lZ) Assembly code: ; lX,lK,lZ 32-bit signed ld.w lX, Xaddr ld.w lK, Kaddr mul lZ,lX,lK st.w Zaddr,lZ ; ; ; ; (1) (2) (3,4,5) || ; ; ; ; ; lX,lK,lZ 32-bit unsigned ld.w lX, Xaddr ; (1) ; ld.w lK, Kaddr ; (2) ; mul lZ,lX,lK ; (3,4,5) ; ; unsigned multiplication gives the same result ; as a signed multiplication. That is why there is no MUL.U instruction. st.w Zaddr,lZ ; || ; ; lX,lK,lZ 32-bit signed (saturated result) ld.w lX, Xaddr ; (1) ; ld.w lK, Kaddr ; (2) ; muls lZ,lX,lK ; (3,4,5) ; st.w Zaddr,lZ ; || ; ; lX,lK,lZ 32-bit unsigned (saturated result) ld.w lX, Xaddr ; (1) ; ld.w lK, Kaddr ; (2) ; muls.u lZ,lX,lK ; (3,4,5) ; st.w Zaddr,lZ ; || ; User Guide 44 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.5 16-bit Update Pseudo code: sZ = sY + sX*sK; IP = 1 (1 madd) LD/ST= 4 (read sX, read sK, read sY, write sZ) Assembly code: ;result = 32-bit ld.q sX,Xaddr ld.q sK,Kaddr ld.q sY,Yaddr madds.q lZ,sY,sX u,sK u,#1 st.q Zaddr,sZ ; ; ; ; ; (1) (2) (3) (4,5) || ; ; ; ; ; ;same with result = rounded 16-bit ld.q sX,Xaddr ; ld.q sK,Kaddr ; ld.q sY,Yaddr ; maddrs.q sZ,sY,sX u,sK u,#1 ; st.q Zaddr,sZ ; (1) (2) (3) (4,5) || ; ; ; ; ; The pseudo code sZ = sY + sX*sK would have been written sZ = sY + (long) (sX*sK), since the result of a 16*16-bit multiplication is a 32-bit result in hardware. sZ = sY + (long) (sX*sK) is actually equivalent to sZ = sY + (short) (sX*sK). The proof can be seen in the following figure: C 16 0000 C 16 P X R 16 32 P not necessary + R routines_sec4_5a.eps Since the lower part of C is only zeros, it will not change the lower part of P. We can therefore say that the result of the 16*16-bit multiplication is a short value. The pseudo code is: sZ = sY + sX*sK User Guide 45 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.6 32-bit Update Pseudo code: lY += lX*lK; IP = 1 (1 madd) LD/ST= 4 (read lX, read lK, read lY, write lY) Assembly code: ;32*32-bit multiplication, result = 32-bit ld.w lX,Xaddr ; (1) ld.w lK,Kaddr ; (2) ld.w lY,Yaddr ; (3) madds.q lY,lY,lX,lK,#1 ; (4,5,6) st.w Yaddr,lY ; || ; ; ; ; ; A 32-bit update can not only be executed with a 32*32-bit multiplication, but also with a 16*32-bit multiplication and a 16*16-bit multiplication. 16 16 16 32 32 32 32 X 32 + 32 routines_sec4_6a.eps ;16*32-bit multiplication, result = 32-bit ld.q sX,Xaddr ; (1) ld.w lK,Kaddr ; (2) ld.w lY,Yaddr ; (3) madds.q lY,lY,lK,sX u,#1 ; (4,5) st.w Yaddr,lY ; || ; ; ; ; ; ;16*16-bit multiplication, result ld.q sX,Xaddr ld.w lK,Kaddr ld.w lY,Yaddr madds.q lY,lY,sK u,sX u,#1 st.w Yaddr,lY ; ; ; ; ; User Guide = ; ; ; ; ; 32-bit (1) (2) (3) (4,5) || 46 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.7 2nd Order Difference Equation (16-bit) Pseudo code: sY = sX - 2*sX1 + sX2; IP= 3 (1 add, 1 sub, 1mul) LD/ST= 4 (read sX, read sX1, read sX2, write sY) Optimization note: sXX is 32-bit register holding the 2 16-bit variables sX, sX1 Assembly code: ld.w sh sub ld.h add st.h sXX, [Xptr] sX1,sXX,#-15 sY,sXX,sX1 sX2, [Xptr]+4 sY,sY,sX2 Yaddr,sY ; ; ; ; ; ; (1) (2) (3) || (4) || ; ; ; ; ; ; X1 || X 0000 || X1*2 Y = X - 2*X1 Y = X - 2*X1 + X2 Memory organization: [Xptr] --> User Guide Xaddr sX Xaddr + 2 sX1 Xaddr + 4 sX2 47 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.8 2nd Order Difference Equation (32-bit) Pseudo code: lY = lX - 2*lX1 + lX2; IP= 3 (1 add, 1 sub, 1mul) LD/ST= 4 (read lX, read lX1, read lX2, write lY) Assembly code: ld.d sh sub ld.w add st.w lX1/lX, [Xptr] lX1,lX1,#1 lY,lXX,lX1 lX2, [Xptr]+8 lY,lY,lX2 Yaddr,lY ; ; ; ; ; ; (1) (2) (3) || (4) || ; ; ; ; ; ; lX1 || lX lX1*2 lX - 2*lX1 lX - 2*lX1 + lX2 Memory organization: [Xptr] --> User Guide Xaddr lX Xaddr + 4 lX1 Xaddr + 8 lX2 48 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.9 Complex Multiplication Equations: Yr = Xr*Kr - Xi*Ki Yi = Xr*Ki + Xi*Kr Pseudo code: sYr = sXr*sKr – sXi*sKi; sYi = sXr*sKi + sXi*sKr; IP= 4 (2 mul, 1 madd, 1msub) LD/ST= 6 (read sXr, sXi, sKr, sKi, write sYr, sYi) is equivalent to (packed format) = 3 (read sXr||sXi, sKr||sKi, write sYr||sYi) Assembly code: mov ld.w mov ld.w d6,#0 ssK,[Kptr+]4 d7,#0 ssX,[Xptr+]4 ; ; ; ; (1) || (2) || ; ; Ki Kr ; ; Xi Xr msubadm.h mulm.h st.h st.h e0,e6,ssK,ssX ul,#1 e2,ssK,ssX lu,#1 [Yptr+]2,sYr [Yptr+]2,sYi ; ; ; ; (3) (4) || (5) ; ; ; ; Yr = Xr*Kr - Xi*Ki Yi = Xr*Ki + Xi*Kr store Yr store Yi Note: TriCore MULM.H does not have a direct subtraction, only addition. A MADDSUM.H instruction is used to get the subtraction, with the 3rd source register set to 0 (e6). The 2 results are not packed in one register, and so two stores are required. User Guide 49 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1/ d0 d3/ d2 d5 d4 d7/ d6 Load / Store 0 kikr ld krki 0 xixr msubadm.h yr=xr*kr-xi*ki e0,e6,ssK,ssX ul,#1 mulm.h e2,ssK,ssX lu,#1 User Guide ld xrxi st yr yi=xr*ki+xi*kr 50 st yi v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.10 Complex Multiplication (Packed) Equations: Yr = Xr*Kr - Xi*Ki Yi = Xr*Ki + Xi*Kr Pseudo code: sYr = sXr*sKr – sXi*sKi; sYi = sXr*sKi – sXi*sKr; IP= 4 (2 mul, 1 madd, 1 msub) LD/ST= 6 (read sXr, sXi, sKr, sKi, write sYr, sYi) is equivalent to (packed format) = 3 (read sXr||sXi, sKr||sKi, write sYr||sYi) Pseudo code implementation: sYi = sXr*sKi; sYr = sXr*sKr; sYi += sXi*sKr; sYr -= sXi*sKi; Assembly code: ld.w ld.w mulr.h maddsurs.h st.w ssK,[Kptr+]4 ; ssX,[Xptr+]4 ; ssY,ssK,ssX ll,#1 ; ssY,ssY,ssK,ssX uu, #1 ; [Yptr+],ssY ; (1) (2) (3) (4,5) || ; ; ; ; ; Ki Kr Xi Xr Yi =Xr*Ki || Yr = Xr*Kr Yi +=Xi*Kr || Yr -= Xi*Ki store Yi Yr Note: In this example we save one store because the computed results are packed in one register. Rounding is used to pack them. Register diagram: Instruction d0 d5 d4 ki kr ld krki xi xr mulr.h ssY,ssK,ssX ll,#1 yi = xr*ki ||yr = xr*kr maddsurs.h ssY,ssY,ssK,ssX uu, #1 yi += xi*kr ||yr -= xi*ki User Guide 51 Load / Store ld xrxi st yiyr v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.11 Complex Update Equations: Zr = Yr + Xr*Kr - Xi*Ki Zi = Yi + Xr*Ki + Xi*Kr Pseudo code: sZr = sYr + sXr*sKr - sXi*sKi; sZi = sYi + sXr*sKi + sXi*sKr; IP= 4 (3 madd, 1 msub) LD/ST= 8 (read sXr, sXi, sKr, sKi, sYr, sYi, write sZr, sZi) equivalent to (packed format)= 4 (read sXr|| sXi, sKr || sKi, sYr || sYi, write sZr || sZi) Assembly code: movh ld.w ld.w ld.h msubadm.h ld.h st.h maddm.h st.h d2,#0 ssK,[Kptr+]4 ssX,[Xptr+]4 sYr,[Yptr+]2 e0,e2,ssK,ssX ul, #1 sYi,[Yptr+]2 [Zptr+]2,sZr e0,e2,ssK,ssX lu,#1 [Zptr+]2,sZi ; ; ; ; ; ; ; ; ; (1) || (2) (3) (4,5) || || (6,7) || ; ; ; ; ; ; ; ; ; Ki Kr Xi Xr Yr Zr = Yr + Xr*Kr - Xi*Ki Yi store Zr Zi = Yi + Xr*Ki + Xi*Kr store Zi The MSUBADM.H instruction is used because the memory organization is imaginary || real. The equation is zr = yr – (xi*ki - xr*kr), which is equivalent to zr = yr + xr*kr - xi*ki. User Guide 52 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1/ d0 d2 d3 0 d5 d4 kikr ld kikr xixr msubadm.h e0,e2,ssK,ssX ul,#1 zr = yr + xr*kr - xi*ki Load / Store ld xixr yr ld yr yi ld yi st zr maddm.h e0,e2,ssK,ssX lu,#1 zi = yi + xr*ki + xi*kr st zi User Guide 53 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 4.12 Complex Update (Packed) Equations: Zr = Yr + Xr*Kr - Xi*Ki Zi = Yi + Xr*Ki + Xi*Kr Pseudo code: sZr = sYr + sXr*sKr - sXi*sKi; sZi = sYi + sXr*sKi + sXi*sKr; IP= 4 (3 madd, 1 msub) LD/ST= 8 (read sXr, sXi, sKr, sKi, sYr, sYi, write sZr, sZi) equivalent to (packed format)= 4 (read sXr|| sXi, sKr || sKi, sYr || sYi, write sZr || sZi) Pseudo code implementation: sZi = sYi + sXr * sKi; sZr = sYr + sXr * sKr; sZi += sXi * sKr; sZr -= sXi * sKi; Assembly code: ld.w ld.w ld.w maddrs.h maddsurs.h st.w ssK,[Kptr+]4 ; (1) ssX,[Xptr+]4 ; (2) ssY,[Yptr+]4 ; (3) ssZ,ssY,ssK,ssX ll,#1 ; (4) ssZ,ssZ,ssK,ssX uu,#1 ; (5,6) [Zptr+]4,ssZ ; || ; ; ; ; ; ; Ki Kr Xi Xr Yi Yr Zi = Yi+Xr*ki || Zr = Yr+Xr*Kr Zi +=Xi*kr || Zr -= Xi*Ki store Zi Zr Note: Rounding is used to pack the 2 results in 1 register.They can be stored in one instruction. User Guide 54 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 d2 d4 xixr maddrs.h ssZ,ssY,ssK,ssX ll,#1 zi=yi+xr*ki ||zr=yr+xr*kr maddsurs.h ssZ,ssZ,ssK,ssX uu,#1 zi=zi+xi*kr || zr=zr-xi*ki zizr User Guide d5 Load / Store kikr ld kikr ld xixr yiyr ld yiyr yiyr ld yiyr st zizr 55 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Data memory interleaving Packed Load/store Saturation Rounding Code Size 1) Optimization Techniques Vector sum (3*N/4 +2)+ 3 32 9 9 9 9 - 9 9 - Vector multiplication (3*N/4 +2) +3 32 9 9 9 9 - 9 - - Vector pre-emphasis (3*N/4 +2) +3 40 9 9 9 9 - 9 9 9 Vector square difference (3*N/2 +2) +2 24 9 - 9 9 9 9 9 9 Vector complex multiplication (5*N +2) +3 28 - - 9 - - 9 - - Vector complex multiplication (packed) (4*N +2) +3 24 - - 9 - - 9 9 9 Vector complex (2*N +2) +4 multiplication (unroll) 42 9 9 9 - 9 9 9 9 Color space conversion 11 64 9 9 9 - 9 9 - - Vector scaling (2*N/2 +2) +2 28, 20 - - - - - 9 9 9 - - 9 - - 9 - - Vector normalization (2*N/2 +2) + 54 (2*N/2 +2) +7 1) Loop Unrolling Cycles Software Pipelining Name Load/store Scheduling Vectors Packed Operation 5 Arithmetic Methods Code Size is in Bytes User Guide 56 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 5.1 Vector Sum Equation: Zn = Vn + Wn n = 0..N-1 Pseudo code: for (n=0; n<N; n++) sZ[n] = sV[n] + sW[n]; IP= 1 (1 add) LD/ST= 3 (read sV, read sW, write sZ) Assembly code: lea ld.w ld.d vadloop: loop LC,(N/4–1) ssV0,[Vptr+]4 ssssW,[Wptr+]8 adds.h ld.d adds.h ld.d st.d LC,vadloop Example ssZ0,ssV0,ssW0 ssssV,[Vptr+]8 ssZ1,ssV1,ssW1 ssssW,[Wptr+]8 [Zptr+]8,ssssZ ; (1) ; (2) ; (3) ; get loop number ; V0 V1 ; W0 W1 W2 W3 ; ; ; ; ; ; ; ; ; ; (1) || (2) || (3) V1+W1 V2 V3 V3+W3 W4 W5 store || V4 || W6 Z0 V0+W0 V5 V2+W2 W7 Z1 Z2 Z3 N = 64 Î 53 cycles Register diagram: Instruction d1 / d0 d5 / d4 v1 v0 _ _ d7/ d6 Load / Store ld v0v1 w3 w2 w1 w0 ld w0w1w2w3 adds.h ssZ1,ssV2,ssW1 d1= ------------v5 v4 v3 v2 d0= v1+w1 || v0+w0 adds.h ssZ2,ssV1,ssW2 d1= v3+w3 || v2+w2 w7 w6 w5 w4 ld w4w5w6w7 d1= z3 z2 d0= z1 z0 User Guide ld v2v3v4v5 st z0z1z2z3 57 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 5.2 Vector Multiplication Equation: Zn = Vn * Wn n = 0..N-1 Pseudo code: for (n=0; n<N; n++) sZ[n] = sV[n] * sW[n]; IP= 1 (1 mul) LD/ST= 3 (read sV, read sW, write sZ) Assembly code: lea ld.w ld.d LC,(N/4 - 1) ssV0,[Vptr+]4 ssssW,[Wptr+]8 ; (1) ; (2) ; (3) ; get loop number ; V0 V1 ; W0 W1 W2 W3 vect2loop: loop mulr.h ssZ0,ssV0,ssW0 ul,#1 ld.d ssssV,[Vptr+]8 mulr.h ssZ1,ssV1,ssW1 ul,#1 ld.d ssssW,[Wptr+]8 st.d [Zptr+]8,ssssZ LC,vect2loop Example ; ; ; ; ; (1) || (2,3) || || ; ; ; ; ; V1*W1 V2 V3 V3*W3 W4 W5 store || V4 || W6 Z0 V0*sW0 V5 V2*W2 W7 Z1 Z2 Z3 N = 64 Î 53 cycles Register diagram: Instruction d1/ d0 d5 / d4 v1 v0 _ _ d7/ d6 Load / Store ld v0v1 w3 w2 w1 w0 ld w0w1w2w3 mulr.h _ _ z1 z0 ssZ0,ssV0,ssW0 ul,#1 v5 v4 v3 v2 mulr.h z3 z2 z1 z0 ssZ1,ssV1,ssW1 ul,#1 w7 w6 w5 w4 ld w4w5w6w7 z3 z2 z1 z0 User Guide ld v2v3v4v5 st z0z1z2z3 58 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 5.3 Vector Pre-emphasis Equation: Zn = Vn + Xn*K n = 0..N-1 K = -28180 Pseudo code: for (n=0; n<N; n++) sZ[n] = sV[n] + sX[n]*sK; IP= 1 (1 madd) LD/ST= 3 (read sX, read sV, write sZ) Assembly code: lea mov.u ld.w addih ld.d LC,(N/4 –1) d6,#0x91ec ssX0,[Xptr+]4 d6,d6,#0x91ec ssssV,[Vptr+]8 preloop: maddrs.h ld.d maddrs.h loop ld.d st.d LC,preloop Example ; ; ; ; ; (1) (2) || (3) || ;get loop number ;K=-28180 ;X0 X1 ;K || K ;V0 V1 V2 V3 ssZ0,ssV0,ssX0,d6 ul,#1 ;(1); Z1=V1+X1*K ||Z0=V0+X0*K ssssX,[Xptr+]8 ; || ;X2 X3 X4 X5 ssZ1,ssV1,ssX1,d6 ul,#1 ;(2,3) ;Z3=V3+X3*K ||Z2=V2+X2*K ssssV,[Vptr+]8 ; || ;V4 V5 V6 V7 [Zptr+]8,ssssZ ; || ;store Z0 Z1 Z2 Z3 N = 160 Î 125 cycles Register diagram: Instruction d1/ d0 d3 / d2 d5 / d4 x1x0 _ _ d6 Load / Store k ld x0x1 v3 v2 v1 v0 k||k maddrs.h ssZ0,ssV0,ssX0,d6 ul,#1 _ _ z1 z0 x5 x4 x3 x2 maddrs.h ssZ1,ssV1,ssX1,d6 ul,#1 z3 z2 z1 z0 ld v0v1v2v3 ld x2x3x4x5 v7 v6 v5 v4 ld v4v5v6v7 st z0z1z2z3 User Guide 59 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 5.4 Vector Square Difference Equation: Zn = (Xn-Yn)2 n = 0..N-1 Pseudo code: for (n=0; n<N; n++) { sTmp = sX[n]-sY[n]; sZ[n] = sTmp*sTmp; } IP= 2 (1 sub, 1 mul) LD/ST= 3 (read sX, read sY, write sZ) Assembly code: lea ld.d sqdloop: LC, (N/2 - 1) ssssXY, [Xptr+]8 subs.h ld.d mulr.h ; (1) ; get loop number ; (2) ; X0 X1 Y0 Y1 ssTmp,ssX,ssY ; (1) ; X1 – Y1 || X0 – Y0 ssssXY, [Xptr+]8 ; || ; X2 X3 Y2 Y3 ssZ, ssTmp,ssTmp ul,#1 ; (2,3) ; (X1–Y1)^2 || (X0–Y0)^2 [Zptr+]4,ssZ ; || ; store Z0 Z1 st.w loop LC, sqdloop Memory organization: Xaddr sX0 Xaddr + 2 sX1 Xaddr + 4 sY0 Xaddr + 6 sY1 Xaddr + 8 sX2 Xaddr + 10 etc… Example N = 160 Î 244 cycles User Guide 60 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 subs.h ssTmp,ssX,ssY mulr.h ssZ, ssTmp,ssTmp ul,#1 d1 d3 / d2 Load / Store y1 y0 x1 x0 ld x0x1y0y1 x1–y1 || x0–y0 y3 y2 x3 x2 ld x2x3y2y3 (x1–y1)^2 || (x0–y0)^2 st z0z1 User Guide 61 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 5.5 Vector Complex Multiplication Equations: Yrn = Xrn*Krn - Xin*Kin n = 0..N-1 Yin = Xrn*Kin + Xin*Krn Pseudo code: for (n=0; n<N; n++) { sYr[n] = sXr[n]*sKr[n]-sXi[n]*sKi[n]; sYi[n] = sXr[n]*sKi[n]+sXi[n]*sKr[n]; } IP= 4 (2 mul, 1 msub, 1 madd) LD/ST= 3 in packed format (read sXr || sXi, read sKr|| sKi, write sYr || sYi) Assembly code: lea mov mov nmloop: loop Example User Guide LC, (N – 1) d6,#0 d7,#0 ld.w ld.w msubadm.h mulm.h st.h st.h LC,nmloop ssK,[Kptr+]4 ssX,[Xptr+]4 e0,e6,ssK,ssX ul,#1 e2,ssK,ssX lu,#1 [Yptr+]2,sYr [Yptr+]2,sYi ; (1) ; (2) ; (3) ; get loop number ; clear 3rd source ; clear 3rd source ; ; ; ; ; ; ; ; ; ; ; ; (1) (2) (3) (4) || (5) load K load X Yr= Xr*Kr-Xi*Ki Yi= Xr*Ki+Xi*Kr store Yr store Yi N = 64 Î 325 cycles 62 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1 / d0 d3 / d2 d4 d5 d7/ d6 Load / Store d6=0 d7=0 ki kr xi xr msubadm.h e0,e6,ssK,ssX ul,#1 mulm.h e2,ssK,ssX lu,#1 ld kikr ld xixr yr yi st yr st yi User Guide 63 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 5.6 Vector Complex Multiplication (Packed) Equations: Yrn = Xrn*Krn - Xin*Kin n = 0..N-1 Yin = Xrn*Kin + Xin*Krn Pseudo code: for (n=0; n<N; n++) { sYr[n] = sXr[n]*sKr[n]-sXi[n]*sKi[n]; sYi[n] = sXr[n]*sKi[n]+sXi[n]*sKr[n]; } IP= 4 (2 mul, 1 msub, 1 madd) LD/ST= 3 in packed format (read sXr || sXi, sKr|| sKi, write sYr || sYi) Pseudo code implementation: for (n=0; n<N; n++) { sYr[n] = sXr[n]*sKr[n]; sYi[n] = sXr[n]*sKi[n]; sYr[n] -= sXi[n]*sKi[n]; sYi[n] += sXi[n]*sKr[n]; } Assembly code: lea ld.w ld.w nloop: LC,(N – 1) ssK,[Kptr+]4 ssX,[Xptr+]4 mulr.h maddsurs.h loop ld.w ld.w st.w LC,nloop Example User Guide ; (1) ; (2) ; (3) ; get loop number ; Ki Kr ; Xi Xr ssY,ssK,ssX ll,#1 ; (1) ; Yi =Xr*Ki || Yr = Xr*Kr ssY,ssY,ssK,ssX uu,#1 ; (2,3) ; Yi+=Xi*Kr || Yr -= Xi*Ki ssK,[Kptr+]4 ; || ; Ki1 Kr1 ssX,[Xptr+]4 ; || ; Xi1 Xr1 [Yptr+]4,ssY ; (4) ; store Yi Yr N = 64 Î 261 cycles 64 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 d5 d4 kikr ld kikr xixr mulr.h ssY,ssK,ssX ll,#1 ld xixr yi=xr*ki || yr=xr*kr maddsurs.h yi=yi+xi*kr||yr=yr-xi*ki ssY,ssY,ssK,ssX uu,#1 kikr ld kikr xixr yiyr User Guide d7/ d6 Load / Store ld xixr st yiyr 65 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 5.7 Vector Complex Multiplication (Unrolled) Equations: Yrn = Xrn*Krn - Xin*Kin n = 0..N-1 Yin = Xrn*Kin + Xin*Krn Pseudo code: for (n=0; n<N; n++) { sYr[n] = sXr[n]*sKr[n]-sXi[n]*sKi[n]; sYi[n] = sXr[n]*sKi[n]+sXi[n]*sKr[n]; } IP= 4 (2 mul, 1 msub, 1 madd) LD/ST= 3 in packed format (read sXr || sXi, sKr|| sKi, write sYr || sYi) Pseudo Code implementation: for (n=0; n<N/2-1; n++) { sYr[2*n] = sXr[2*n]*sKr[2*n]; sYi[2*n] = sXr[2*n]*sKi[2*n]; sYr[2*n] -= sXi[2*n]*sKi[2*n]; sYi[2*n] += sXi[2*n]*sKr[2*n]; sYr[2*n+1] = sXr[2*n+1]*sKr[2*n+1];sYi[2*n+1] = sXr[2*n+1]*sKi[2*n+1]; sYr[2*n+1] -= sXi[2*n+1]*sKi[2*n+1];sYi[2*n+1] += sXi[2*n+1]*sKr[2*n+1]; } Assembly code: lea ld.w ld.d cxloop: LC,(N/2–1) ssK0,[Kptr+]4 ssssX,[Xptr+]8 mulr.h st.w maddsurs.h ld.d mulr.h st.w maddsurs.h loop st.w User Guide ;(1) ;(2) ;(3) ;get loop number ;Ki0 Kr0 ;Xi1 Xr1 Xi0 Xr0 ssY0,ssK0,ssX0 ll,#1;(1) ;Yi0 =Xr*Ki||Yr0=Xr*Kr [Yptr+]4,ssY1 ;|| ;store former Yi1 Yr1 ssY0,ssY0,ssK0,ssX0 uu,#1 ;(2) ;Yi0 +=Xi*sKr ||Yr0 -= Xi*Ki ssssK,[Kptr+]8 ;|| ;Ki2 Kr2 Ki1 Kr1 ssY1,ssK1,ssX1 ll,#1;(3) ;Yi1 =Xr*Ki ||Yr1=Xr*Kr [Yptr+]4,ssY0 ;|| ;store Yi0 Yr0 ssY1,ssY1,ssK1,ssX1 uu,#1 ;(4) ;Yi1 +=Xi*Kr ||Yr1 -= Xi*Ki ssssX,[Xptr+]8 ;|| ;Xi3 Xr3 Xi2 Xr2 ld.d LC,cxloop [Yptr],ssY1 ; (4) ;store last Yi1 Yr1 66 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 d1 d5 d4 d7 d6 Load / Store ki0 kr0 ld k0 x x xi1 xr1 xi0 xr0 mulr.h ssY0,ssK0,ssX0 ll,#1 yi0=xr0*ki0 || yr0=xr0*kr0 ld x0 x1 st y1 maddsurs.h yi0 +=xi0*kr0 || ssY0,ssY0,ssK0,ssX0uu,#1 yr0 -= xi0*ki0 ki2 kr2 ld k1 k2 ki1 ki1 mulr.h ssY1,ssK1,ssX1 ll,#1 yi1=xr1*ki1 || yr1=xr1*kr1 st y0 maddsurs.h ssY1,ssY1,ssK1,ssX1uu,#1 yi1+=xi1*kr1 || xi3 xr3 yr1 -= xi1*ki1 xi2 xr2 ld x2 x3 st y1 User Guide 67 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 5.8 Color Space Conversion Equation: X A Y Cr = Cb 0.257 0.504 0.098 0.439 -0.368 -0.071 -0.148 -0.291 0.439 * R’ K R 0* G B + 0.5 0.5 * 0 for UPF format, 16 for CCIR 601. Xi = ∑ Aij * R’j + Kj for i, j = 0..n Pseudo code: sY = 0.257*sR + 0.504*sG + 0.098*sB; sCr = 0.439*sR - 0.368*sG - 0.071*sB + 0.5; sCb = -0.148*sR - 0.291*sG + 0.439*sB + 0.5; RGB belongs to [0; +1[ ([0; 256[) so YCrCb will be in [0; +1[ ([0; 256[). (Assembly code and Register diagram follow) User Guide 68 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Assembly code: .define .define .define .define .define RGBptr "a0" OFFSET128 "0x2000" llY "e8" llCr "e10" llCb "e12" lea lea lea RGBptr,rgbvalue Kptr,kmatvalue a4,stvalue mov ld.d mov.u ld.d d0,#0 e2,[RGBptr+]6 d1,#OFFSET128 e4,[Kptr+]6 ; ; ; ; mulm.h madd.q llY,d2,d4ul,#1 llY,llY,d3l,d5l,#1 ; (3) ; llY = sR*sK00 + sG*sK01 ; (4,5) ; llY = llY + sB*sK02 ld.d maddm.h madd.q e4,[Kptr+]6 llCr,e0,d2,d4ul,#1 llCr,llCr,d3l,d5l,#1 ; || ; sK10,sK11,sK12 ; (6) ; llCr=sR*sK10+sG*sK11+128 ; (7,8) ; llCr=llCr+sB*sK12 ld.d maddm.h st.h madd.q st.h st.h e4,[Kptr+]6 llCb,e0,d2,d4ul,#1 [a4+]2,d9 llCb,llCb,d3l,d5l,#1 [a4+]2,d11 [a4+]2,d13 ; ; ; ; ; ; kmatvalue:.half .half .half rgbvalue: .half User Guide 0x20e6,0x4803,0x0c83 0x3831,0xd0e5,0xf6e9 0xed0e,0xdcc1,0x3831 0x4000,0x3000,0x2000 69 (1) || (2) || || (9) || (10) || (11) ; ; sRsGsB ; in Q14 with 16-bits of sign ; sK00,sK01,sK02 ; ; ; ; ; ; sK20,sK21,sK22 llCb=sR*sK20+sG*sK21+128 store sY llCb=llCb+sB*sK22 store sCr store sCb ; ; ; ; abcin Q15 defin Q15 ghiin Q15 RGB are [0;+1[ in Q14 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1/d0 d3/d2 d5/ d4 0 R’G’B’ offset128 Load/ Store ld k00, k01, k02 Y=R*k00 +G*k01 k10, Y = Y + k11, B*k02 k12 ld k10, k11, k12 Cr=R*k1 0+ G*k11+1 28 k20, k21, k22 maddm.h llCb,e0,d2,d4ul,#1 Cr=Cr+B* k12 Y madd.q llCb,llCb,d3l,d5l,#1 User Guide d13/ d12 k00, k01, k02 maddm.h llCr,e0,d2,d4ul,#1 madd.q llCr,llCr,d3l,d5l,#1 d11/d10 ld R’G’B’ mulm.h llY,d2,d4ul,#1 madd.q llY,llY,d3l,d5l,#1 d9/d8 Cb= st Y R*k20+ G*k21+ 128 Cr 70 ld k20, k21, k22 Cb= Cb+B* k22 st Cr Cb st Cb v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 5.9 Vector Scaling Equation: Zn = (Xn >> 3) << 2 or Zn = (Xn >> shift1) << shift2 n = 0..N-1 Pseudo code: for (n=0; n<N; n++)sZ[n] = (sX[n] >>3) << 2; IP= 2 (2 shifts) LD/ST= 2 (read sX, write sZ) Assembly code: mov lea mov ld.w isloop: loop Shift1,#-3 LC,(N/2 - 1) Shift2,#2 ssX,[Xptr+]4 ; ; ; ; (1) || (2) || ; ; ; ; load 1st shift value get loop number load 2nd shift value X0 X1 sha.h ld.w sha.h t.w LC,isloop ; ; ; ; (1) || (2) || ; ; ; ; X0>>3, X1>>3 X2 X3 X0<<2, X1<<2 store Z0,Z1 d1,ssX,Shift1 ssX,[Xptr+]4 ssZ,d1,Shift2 [Zptr+]4,ssZ Alternatively, the 2 shift values can be directly used lea LC,(N/2 - 1) ld.w ssX,[Xptr+]4 isloop: sha.h d1,ssX,#-3 ld.w ssX,[Xptr+]4 sha.h ssZ,d1,#2 st.w [Zptr+]4,ssZ loop LC,isloop Example User Guide ; (1) ; get loop number ; (2) ; X0 X1 ; ; ; ; (1) || (2) || ; ; ; ; X0>>3, X1>>3 X2 X3 X0<<2, X1<<2 store Z0,Z1 N = 160 Î 164 cycles 71 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1/ d0 d2 d3 d4 Shift1 d7/ d6 Load / Store ld shift1 Shift2 ld shift2 x1x0 ld x0x1 sha.h d1,ssX,Shift1 x1>>3 || x0>>3 x3x2 ld x2x3 sha.h ssZ,d1,Shift2 x1<<2 || x0<<2 st z0z1 User Guide 72 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 5.10 Vector Normalization Equations: (1) minex = minimum (minex,exponent (Xn)) (2) Xn = Xn << minex n = 0..N-1 n = 0..N-1 Pseudo code Equation (1): sMin = 32; for (n = 0; n<N; n++) { sZ = exponent(sX[n]); if (sZ < sMin) sMin = sZ; else sMin = sMin; } IP= 2 (1 min, 1 count leading sign) LD/ST= 1 (read sX) Pseudo code Equation (2): for (n = 0; n<N; n++) sX[n] = sX[n] << sMin; IP= 1 (1 shift) LD/ST= 2 (read sX, write sX) Assembly code: movh lea addi ld.w bexloop: d4,#16 LC,(N/2 - 1) d4,d4,#16 ssX,[Xptr+]4 cls.h min.h ssZ,ssX d4,d4,ssZ ; ; ; ssX,[Xptr+]4 ; ld.w loop LC,bexloop sh d1,d4,#-16 extr.u d3,d4,#0,#16 min.h d3,d1,d3 ld.w ssX,[X1ptr] lea LC, (N/2 - 1) normloop: sh.h ssX,ssX,d3 st.w [X1ptr+]4,ssX ld.w ssX,[X1ptr] loop LC,normloop User Guide ; ; ; ; ; ; ; ; ; (1) || (2) || ; ; ; ; d4 upper = max get loop number d4 lower = max X0 X1 (1) ; Z1 =exp.(X1)||Z0 =exp.(X0) (2) Min1=min(Z1,Min1)|| Min0=min(Z0,Min0) || ; X2 X3 (3) (4) (5) (6) (7) ; (1) ; || ; (2) 73 ; ; ; ; ; d1 = Min1 d0 = Min0 Min = min(Min1,Min0) X0 X1 get loop number ; X0 << Min || X1 << Min ; store normalized X0, X1 ; load unormalized X2, X3 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Finding minimum exponent loop: Example N = 160 Î 169 cycles Normalization loop: Example N = 160 Î 162 cycles Total: Example User Guide N = 160 Î 331 cycles 74 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6 Filters If all the possible data types are considered, the list of Filter routines can be very large: • • • • 16-bit, 32-bit, mixed 16-bit/32-bit Complex 16-bit, complex 32-bit, complex mixed 16-bit/32-bit Result accumulated on 32bit or > 32bit Saturated data type Most common cases are covered, with the aim of providing a maximum of diversity. The first 3 instances are not filter routines, but Vector to Scalar operations. These have a lot in common with Filters since the result is accumulated. The only difference between a FIR (Finite Impulse Response) filter and a Dot Product for example, is that a FIR is likely to use Circular Addressing. It therefore makes sense to place these together, as they will be optimized in the same way. The next 5 are FIR filter routines, beginning with the more trivial (non-looped) cases and ending with the complex FIR routine. This routine can be extremely complicated to optimize, especially since TriCore can perform it in 2 cycles per complex tap. Between these 2 extreme are the standard n-tap FIR (0.5 cycle per tap) and the Block FIR. Auto-correlation appears next to the Block FIR, because from an implementation point of view, these are nearly identical. They belong to a famous class of algorithms, known as BLOCK algorithms. Those 2 examples should be sufficient to develop other routines. The IIR filter routines logically come after the FIR. These routines all contain feedback terms, which always makes the implementation more difficult than for FIR. The first 3 IIR routines are non-looped cases (1, 2 and 4 coefficients). This is followed by the most common N-stage biquad (in 2 types, of 4 or 5 coefficients) and ends with the Lattice IIR. The Lattice IIR routine is always extremely complicated to optimize with pipeline MAC, which is the case with TriCore. The last group are LMS routines, or more precisely, adaptive FIR filters using the Least Mean Square (LMS) method to update the coefficients. There are dozens of adaptive methods and the delayed LMS is used because it is fast. Leaky LMS and the 3 standard cases (16-bit, 32-bit & complex) have also been addressed. In many ways the LMS can be seen as the standard algorithm by which to judge the power of an architecture, and TriCore gives: • 1 cycle per tap (16-bit coefficients) • 1 cycle per tap (32-bit coefficients) • 2 cycles per complex tap (16-bit coefficients) These results are extremely good, and few DSPs can reach these figures. User Guide 75 v1.6.4, 2003-01 Name Cycles Code Optimization Techniques Size 1) Software Pipelining Loop Unrolling Packed Operation Load/Store Scheduling Data Memory Interleaving Packed Load/Store Saturation Rounding TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Dot product (2*N/4)+6 34 9 - 9 - - 9 9 - Magnitude square (1*N/2)+5 16 9 - 9 - - 9 - - Vector quantization (3*N/2)+6 38 - - 9 - - 9 - - First order FIR 4 14 - - - - - 9 - - Second order FIR 5 22 - - - - - 9 - - FIR (2*N/4 +2) +4 34 9 9 9 - - 9 - - Block FIR (4*(N/2) +2) +9 66 9 - 9 9 - 9 - - Autocorrelation ((11+3*(N/2- 1) + 70 2) *M/2+2)+5 9 - 9 9 - 9 - - Complex FIR (2*N +2) +4 32 9 - 9 9 - 9 9 9 First order IIR 5 24 - - 9 - - 9 9 9 2nd order IIR 7 34 9 - 9 - - 9 9 - Biquad 4 coefficients 5 26 9 - 9 - - 9 - - 48 9 9 9 9 - 9 - - N-stage Biquad (3*(N-1) +2) +6 4 coefficients User Guide 76 Arithmetic Methods v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 74 9 9 9 9 - 9 9 - Lattice filter (4*(N-2) +2) +10 54 9 9 - 9 - - 9 9 Leaky LMS (update only) (4*(N/4-1) +2) +9 70 9 9 9 9 - 9 9 9 Delayed LMS (4*N/4 +2) +5 54 9 - 9 9 - 9 9 9 Delayed LMS – (4*(N/2 –1) +2) + 64 32-bit 8 coefficients 9 - 9 9 - 9 9 - Delayed LMS – (4*(N-1) +2) +9 complex 9 9 9 9 - 9 9 9 N-stage Biquad (5*(N-1) +2) +9 5 coefficients 1) 60 Code Size is in Bytes User Guide 77 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.1 Dot Product Equation: Z = ∑ (Vn * Wn) n = 0..N-1 Pseudo code: sZ = 0; for (n = 0; n < N; n++) sZ += sV[n] * sW[n]; IP= 1 (1 madd) LD/ST= 2 (read sV, read sW) Assembly code: lea LC, (N/4 -1) mov d0,#0 ld.d ssssV,[Vptr+]8 mov d1,#0 ld.d ssssW,[Wptr+]8 dotloop: maddms.h llZ,llZ,ssW0,ssV0 ld.d ssssV,[Vptr+]8 maddms.h llZ,llZ,ssW1,ssV1 ld.d ssssW,[Wptr+]8 loop LC,dotloop st.h [Zptr],sZ Example ; ; ; ; ; (1) (2) || (3) || ; ; ; ; ; get loop number Z =0 (lower) dummy dummy V0 V1 Z =0 (upper) W0 W1 W2 W3 ul,#1 ; || ul,#1 ; || ; ; ; ; (1) ; Z V2 V3 V4 (2) ; Z W4 W5 W6 +=V0*W0+V1*W1 V5 +=V2*W2+V3*W3 W7 ; (4) ; store Z N = 64 Î 38 cycles Register diagram: Instruction d1 / d0 d3 / d2 z(=0) z(=0) maddms.h llZ,llZ,lW0,lV0ul,#1 z+v0w0+v1w1 maddms.h llZ,llZ,lW1,lV1ul,#1 z+v2w2+v3w3 User Guide d5 / d4 Load / Store v1v0 _ _ ld v0v1 w3w2w1w0 ld w0w1w2w3 v5v4v3v2 w7w6w5w4 78 ld v2v3v4v5 ld w4w5w6w7 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.2 Magnitude Square Equation: Z = Σ (Xrn2 + Xin2) n = 0..N-1 Pseudo code: sZ = 0; for (n=0; n<N; n++) sZ += (sXr[n]*sXr[n] + sXi[n]*sXi[n]); IP= 2 (2 madd) LD/ST= 2 (read sXr, read sXi) Assembly code: lea LC, (N/2 - 1) ; (1) ld.w ssX, [Xptr+]4 ; (2) msqloop: maddm.h llZ,llZ,ssX,ssX ul,#1 ld.w ssX, [Xptr+]4 ; || loop LC, msqloop ; st.h [Zaddr],sZ ; (3) ; get loop number ; load Xr, Xi ; (1); Z +=(Xr^2 + Xi^2) ; load next Xr,Xi ; store Z Memory organization: Xaddr Xr0 Xaddr + 2 Xi0 Xaddr + 4 Xr1 Xaddr + 6 Xi1 Xaddr + 8 Xr2 Xaddr +10 etc.. Example N = 64 Î 37 cycles Register diagram: Instruction d1/ d0 d2 0 xi0 xr0 ld xr0 xi0 maddm.h Z llZ,llZ,ssX,ssX,#1 xi1 xr1 ld xr1 xi1 User Guide d5 / d4 79 d7/ d6 Load / Store v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.3 Vector Quantization Note: VALIDATED ON TC1 V1.3 SILICON. Equation: Z = Σ (Kn - Xn)2 n = 0..N-1 Pseudo code: sZ = 0; for (n=0; n<N; n++) sZ += (sK[n]-sX[n])^2; IP= 2 (1 sub, 1 madd) LD/ST= 2 (read sX, read sK) Assembly code: lea mov ld.w mov ld.w LC,(N/2 - 1) d0,#0 ssX,[Xptr+]4 d1,#0 ssK,[Kptr+]4 ;(1) ;(2) ;|| ;(3) ;|| quantloop: subs.h ld.w maddm.h loop st.h ; ; ; ; ; ssTp,ssK,ssX ;(1) ssX,[Xptr+]4 ;|| llZ,llZ,ssTp,ssTp ul,#1 ;Z + = ld.w ssK,[Kptr+]4 ;|| LC,quantloop [Zaddr],d1 ;(4) Example get loop number Z = 0(lower) sX0sX1 Z = 0 (upper) K0 K1 ; K1-X1|| K0-X0 ; X2 X3 ; (2,3) ((K0-X0)^2+(K1-X1)^2) ; K2 K3 ; store sZ N = 64 Î 102 cycles Register diagram: Instruction d1 / d0 d3 / d2 z = 0(lower) d5 / d4 x1 x0 z = 0 (upper) subs.h ssTp,ssK,ssX maddms.h llZ,llZ,ssTp, ssTp ul, #1 User Guide d7 / d6 load x0x1 k1 k0 k1-x1 || k0-x0 z += (k0-x0)^2 + (k1-x1)^2) x3 x2 load k0k1 load x2x3 k3 k2 80 Load / Store load k2k3 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.4 First Order FIR Note: Validated on TriCore Rider D board. Equation: Yt = K0*Xt + K1*Xt-1 Pseudo code: sY = sK0*sX0 + sK1*sX1; sX1 = sX0; IP= 2 (1 mul, 1 madd) LD/ST= 6 (read sX0, sX1, sK0, sK1, write sY, write sX0 sX1) Æ Assembly code: ld.w ld.w mulm.h st.h st.h ssK,[Kptr] ssX,[Xptr] llY,ssK,ssX ul,#1 [Xptr]+2,ssX [Yptr],d1 ; ; ; ; ; (1) (2) (3,4) || || ; ; ; ; ; K0 K1 X0 X1 Y = K0*X0 + K1*X1 X0 --> X1 store Y Memory organization: Entering Leaving Xaddr X0 X0 Xaddr + 2 X1 X0 Register diagram: Instruction d1 / d0 d4 d6 k1 k0 ld k1k0 x1 x0 mulm.h llY,ssK,ssX ul,#1 Load / Store ld x1x0 y = k0*x + k1*x1 st y User Guide 81 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.5 Second Order FIR Note: Validated on TriCore Rider D board. Equation: Yt = K0*Xt+ K1*Xt-1 + K2*Xt-2 Pseudo code: sY = sK0*sX0 + sK1*sX1 + sK2*sX2 ; sX2 = sX1; sX1 = sX0; IP= 2 (1 mul, 2 madd) LD/ST= 9 (read sX0, sX1, sX2, sK0, sK1, sK2, write sY, write sX0 sX1, write sX1 sX2) Æ Æ Assembly code: ld.d ld.d mulm.h st.w madd.q st.h sssK,[Kptr] sssX,[Xptr] llY,d4,d6 ul,#1 [Xptr]+2,d6 llY,llY,d5l,d7l,#1 [Yptr],d1 ; ; ; ; ; ; (1) (2) (3) || (4,5) || ; ; ; ; ; ; K0 K1 K2 X X1 X2 Y = K0*X + K1*X1 X0 --> X1 X1 --> X2 Y = Y + K2*X2 store Y Memory organization: Entering Leaving Xaddr X0 X0 Xaddr + 2 X1 X0 Xaddr + 4 X2 X1 User Guide 82 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1 / d0 d5 / d4 d7 / d6 _ k2 k1 k0 ld k0 k1 k2 _ x2 x1 x0 mulm.h llY,d4,d6ul,#1 y = k0*x0 + k1*x1 madd.q llY,llY,d5l,d7l,#1 y += k2*x2 User Guide Load / Store ld x0 x1 x2 st x0 x1 st y 83 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.6 FIR Equation: Yt = ∑ (Xt-n * Kn) n = 0..N-1 Pseudo code: sY = 0; for (n=0; n<N; n++) sY += circular(sX[n])*sK[n]; IP= 1 (1 madd) LD/ST= 2 (read sX, read sK) Assembly code: mov.aa CONST.A a2,Xptr a3,(2*N)<<16 ; circular buffer initialization lea LC,(N/4 -1) ; (1) ; get loop number mov ld.w mov ld.d d0,#0 ssK0,[Kptr+]4 d1,#0 ssssX,[a2/a3+c]8 ; ; ; ; ; ; ; ; sfirloop: maddm.h ld.d maddm.h ld.d loop LC,sfirloop st.h [Yptr],d1 Example User Guide (2) || (3) || Y = 0 (lower) K0 K1 Y = 0 (upper) X0 X1 X2 X3 llY,llY,ssX0,ssK0 ul,#1 ; (1) ; ssssK,[Kptr+]8 ; || ; llY,llY,ssX1,ssK1 ul,#1 ; (2) ; ssssX,[a2/a3+c]8 ; || ; ; (4) Y +=X0*K0+X1*K1 K2 K3 K4 K5 Y +=X2*K2+X3*K3 X4 X5 X6 X7 ; store Y N = 20 Î 16 cycles 84 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1 / d0 d3 / d2 d5 / d4 Load / Store k1k0 _ _ ld k0k1 x3x2 x1x0 maddm.h y +=x0*k0+x1*k1 llY,llY,ssX0,ssK0 ul,#1 maddm.h y +=x2*k2+x3*k3 x7x6 x5x4 llY,llY,ssX1,ssK1 ul,#1 User Guide 85 ld x0x1x2x3 k5k4 k3k2 ld k2k3k4k5 ld x4x5x6x7 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.7 Block FIR Note: Validated on TriCore Rider D board. Equation: Ym = ∑ Xm-n * Kn n = 0..N-1, m = 0..M-1 Pseudo code: for (m = 0; m <M/4; m+4) { sY[m] = 0; for (n = 0; n<N; n++) sY[m] += sX[m+n]*sK[n]; sY[m+1] += sX[m+n+1]*sK[n]; sY[m+2] += sX[m+n+2]*sK[n]; sY[m+3] += sX[m+n+3]*sK[n]; } IP= 4 (4 madd) LD/ST=5 (read sXm, sXm+1, sXm+2, sXm+3, sK) Assembly code: mov.aa CONST.A mov.aa CONST.A a2,Xptr a3,(2*N)<<16 a14,Xptr a15,(2*N)<<16 ;circular buffer address initialization lea LC,(N/2 - 1) ; (1) ; get loop number mul mul ld.q mul ld.w mul ld.d llY0,d13,#0 llY2,d13,#0 d13,[a14/a15+c]2 llY1,d13,#0 ssK,[Kptr+]4 llY3,d13,#0 ssssX,[a2/a3+c]4 ; ; ; ; ; ; ; ; ; ; ; ; ; ; y0 = 0 y2 = 0 dummy load y1 = 0 k1k0 y3 = 0 x3x2x1x0 ; ; ; ; ; ; ; (1) (2) || (3) || (4) || ; ; ; ; store store store store ;circular buffer address initialization (2) (3) || (4) || (5) || blkfir: loop maddm.h maddm.h ld.d maddm.h ld.d maddm.h ld.w LC,blkfir st.h st.h st.h st.h [Yptr+]2,d1 [Yptr+]2,d3 [Yptr+]2,d5 [Yptr+]2,d7 User Guide llY0,llY0,d8,ssK ul,#1 llY2,llY2,d9,ssK ul,#1 ssssX1,[a14/a15+c]4 llY1,llY1,d10,ssK ul,#1 ssssX,[a2/a3+c]4 llY3,llY3,d11,ssK ul,#1 ssK,[Kptr+]4 ; ; ; ; (6) (7) (8) (9) 86 ; ; ; ; ; ; ; y0+=x0*k0+x1*k1 y2+=x2*k0+x3*k1 x4x3x2x1 y1+=x1*k0+x2*k1 x5x4x3x2 y3+=x3*k0+x4*k1 k3k2 y0 y2 y1 y3 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Example N = 20 Î 51 cycles Register diagram: Instruction d1/d0 d3/d2 d5/d4 d7/d6 d9/d8 d11/d10 d12 Load / Store y0 = 0 y2 = 0 y1 = 0 k1k0 ld k1k0 y3 = 0 x3x2x 1x0 maddm.h llY0,llY0,d8,ssK ul,#1 y0+= x0*k0+ x1*k1 maddm.h llY2,llY2,d9,ssK ul,#1 y2+= x2*k0 +x3*k1 maddm.h llY1,llY1,d10,ssK ul,#1 x4x3x2x1 y1+= x1*k0 +x2*k1 maddm.h llY3,llY3,d11,ssK ul,#1 x5x4x 3x2 y3+= x3*k0 +x4*k1 y0 ld x4x3x2x1 ld x5x4x3x2 k3k2 ld k3k2 st y0 y2 st y2 y1 st y1 y3 User Guide ld x3x2x1x0 87 st y3 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.8 Auto-Correlation Note: Validated on TriCore Rider D board. Equation: Zm = ∑ Xn*Xn+m n = 0..N-1 ; m=0..M-1; Pseudo code: for (m=0; m<M; m++) { lZ[m] = 0; for (n=0; n<N; n++) lZ += sX[n]*sX[n+m]; } IP= 1 (1 madd) LD/ST= 3 (read sX, s Xn+m, write lZ) Assembly code: mov.aa a8,Xaddress ; mov.aa a9,Xaddress ; add.a a8,#2 ; lea LCe,(M/2 - 1) ; mov.aa Xoptr,a8 ; extautoloop: mov d0,#0 ; lea Xptr,Xaddress ; mov d1,#0 ; lea LCi,(N/2 - 2) ; mov d2,#0 ; ld.w ssX1,[Xptr+]4 ; mov d3,#0 ; ld.w ssX3,[Xeptr+]4 ; (1) (2) (3) (4) (5) ; ; ; ; ; odd values even values adjust pointer position get external loop number odd pointer init (1) || (2) || (3) || (4) || ; ; ; ; ; ; ; ; z_even = 0 even values z_even = 0 get internal loop number z_odd = 0 x1 x0 (even values) z_odd = 0 x1 x0 maddm.h llZe,llZe,ssX1,ssX3 ul,#1 ; (5) ; z_even=z_even+x0*x0+x1*x1 ld.w ssX2,[Xoptr+]4 ; || ; x2 x1 intautoloop: maddm.h ld.w ld.w maddm.h ld.w loop LCi,intautoloop User Guide llZo,llZo,ssX1,ssX2 ; (1,2) ; ssX1,[Xptr+]4 ; ssX3,[Xeptr+]4 ; llZe,llZe,ssX1,ssX3 ; (3) ; ssX2,[Xoptr+]4 ; 88 ul,#1 z_odd +=x0*x1+x1*x2 || ; x3 x2 || ; x3 x2 ul,#1 z_even +=x2*x2+x3*x3 || ; x4 x3 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines loop maddm.h llZo,llZo,ssX1,ssX2 ul,#1 ; (6) ; st.w [Zptr+]4,d1 ; || ; add.a a9,#4 ; (7) ; st.w [Zptr+]4,d3 ; (8) ; mov.aa Xeptr,a9 ; (9) ; add.a a8,#4 ; (10) ; mov.aa Xoptr,a8 ; (11) ; LCe,extautoloop Example store z_even adjust pointer position store z_odd even pointer init adjust pointer position odd pointer init N=160, M = 10 Î 1247 cycles Register diagram: Instruction d1 / d0 d3 / d2 d5 d4 d6 Load / Store z_even = 0 (lower) z_even = 0 (upper) z_odd = 0 (lower) x1x0 z_odd = 0 (upper) maddm.h llZe,llZe,lX1,lX3 ul,#1 x1x0 z_even += x0*x0+x1*x1 maddm.h llZo,llZo,lX1,lX2 ul,#1 ld x1x0 x2x1 z_odd += x0*x1+x1*x2 ld x2x1 x3x2 ld x3x2 x3x2 maddm.h llZe,llZe,lX1,lX3 ul,#1 maddm.h llZo,llZo,lX1,lX2 ul,#1 z_even += x2*x2+x3*x3 x4x3 z_odd += x0*x1+x1*x2 ld x1x0 ld x3x2 ld x4x3 st z_even st z_odd User Guide 89 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.9 Complex FIR Note: Validated on TriCore Rider D board. Equations: Yrt = ∑ (Xrt-n*Krn – Xit-n*Kin) n = 0..N-1 Yit = ∑ (Xit-n*Krn + Xrt-n*Kin) n = 0..N-1 Pseudo code: sYr sYi for { sYr sYi } = 0; = 0; (n = 0; n<N; n++) += circular(sXr)*sKr - circular(sXi)*sKi; += circular(sXr)*sKi + circular(sXi)*sKr; IP= 4 (3 madd, 1 msub) LD/ST (in packed format) = 2 (read sXi || sXr, sKr || sKi) Pseudo code implementation: sYr sYi for { sYi sYi } = 0; = 0; (n = 0; n< N; n++) += sXr*sKi; sYr -= sXi*sKi; += sXi*sKr; sYr += sXr*sKr; Assembly code: lea mov ld.w ld.q cxloop: LC,(N - 1) ssY,#0 ssX,[Xptr+]4 ssK,[Kptr+]2 maddsurs.h ld.w maddrs.h loop st.w ld.w LC,cxloop [Yptr],ssY ; ; ; ; (1) (2) || (3) ; ; ; ; get loop number y = 0 xi0 xr0 ki0 ssY,ssY,ssX,ssK uu,#1 ; (1) ; yi += xr0*ki0 || ssK,[Kptr+]4 ;|| ; ki1 kr0 ssY,ssY,ssX,ssK ll,#1 ; (2) ; yi += xi0*kr0 || ssX,[a2/a3+c]4 ;|| ; xi1 xr1 ; (4) yr -= xi0*ki0 yr += xr0*kr0 ; store yiyr Note: Warning! This algorithm only works when the coefficients are organised in the reverse order {imag,real} compared to data {real,imag} . User Guide 90 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Memory organization: Xaddr sXr0 Kaddr sKi0 Xaddr + 2 sXi0 Kaddr + 2 sKr0 Xaddr + 4 sXr1 Kaddr + 4 sKi1 Xaddr + 6 sXi1 Kaddr + 6 sKr1 Xaddr + 8 sXr2 Kaddr + 8 sKi2 Xaddr + 10 etc.. Kaddr + 10 etc.. Example N = 20 Î 46 cycles Register diagram: Instruction d0 d4 y=0 xi0 xr0 maddsurs.h ssY,ssY,ssX,ssK uu,#1 yi+=xr*ki || yr-=xi*ki maddrs.h ssY,ssY,ssX,ssK ll,#1 yi+=xi*kr || yr+=xr*kr User Guide xi1 xr1 91 d5 Load / Store ld xi0xr0 ki0 ld ki0 ki1 kr0 ld ki1kr0 ld xi1xr1 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.10 First Order IIR Note: Validated on TriCore Rider D board. Equation: Yt = B*Yt-1 + K*Xt where: B= (1- K) Pseudo code: sY0 = sY1 – sY1*sK + sX0*sK; sY1 = sY0; IP= 2 (1 madd, 1 msub) LD/ST= 4 (read sX, sY1, sK, write sY0 Æ sY1) Assembly code: ;sX and sY are in different registers ld.q sK,[Kptr] ; ld.q sX,[Xptr+]2 ; ld.q sY,[Yptr] ; msubs.q sY,sY,sK u,sY u,#1 ; madds.q sY,sY,sK u,sX u,#1 ; st.q [Yptr],sY ; (1) (2) (3) (4) (5,6) || ; ; ; ; ; ; ;alternatively if sX0 and sY1 are in the same ld.q sK,[Kptr] ; (1) ; ld.w ssXY,[Xptr+]2 ; (2) ; msubrs.h sY,ssXY,ssXY,sK ul,#1 ; (3) ; maddrs.h sY,sY,ssXY,sK uu,#1 ; (4,5) ; st.q [Yptr],sY ; || ; K X0 Y1 Y0 = Y1-Y1*K Y0 += X0*K store Y0 register K X0 || Y1 Y0 = Y1- Y1*K Y0 += X0*K store Y0 Note: Operates with dual MAC instruction but only 1 result is used. User Guide 92 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 D4 d5 Load / Store k0 0 ld k x0 0 ld x0 y0 0 ld y1 msubs.q sY,sY,sKu , sY u,#1 y1 = y0*(1-k) madds.q sY,sY,sK u,sX u,#1 y1 += x0*k Instruction d0 st y0 D3 x0||y1 msubrs.h sY,ssXY,ssXY,sK uu,#1 Load / Store k0 0 ld k ld x0 y1 y0 = y1-y1*k maddrs.h sY,sY,ssXY,sK uu,#1 y0 += x0*k User Guide d5 93 st y v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.11 Second Order IIR Note: Validated on TriCore Rider D board. Equation: Yt = K0*Xt + K1*Xt-1 + K2*Xt-2 + B1*Yt-1 + B2*Yt-2 where: B1, B2 are negative Pseudo code: sY0 = sK0*sX0 + sK1*sX1 + sK2*sX2 + sB1*sY1 + sB2*sY2; sY2 = sY1; sY1 = sY0; IP=5 (1 mul, 4 madd) LD/ST= 12 (read sX0, sX1, sX2, sK0, sK1, sK2, sB1, sB2, sY1, sY2, write sY0 --> sY1, write sY1 --> sY2) Assembly code: ld.d ld.w ld.d ssssK,[Kptr] ssB,[Bptr] ssssX,[Xptr+]2 ; (1) ; (2) ; (3) ; K2 K1 K0 ; B1 B0 ; X2 X1 X0 mulm.h ld.w madds.q ld.d maddm.h st.h e0,d2,d6 ul,#1 ssY,[a2/a3+c]2 e0,e0,ssK l,ssX l,#1 ssssX,[Xptr+]2 e0,e0,ssB,ssY ul,#1 [a2/a3+c]0,d1 ; ; ; ; ; ; ; ; ; ; ; ; User Guide (4) || (5) || (6,7) || 94 Y0 = K1*X1 + K0*X0 Y2 Y1 Y0 += K2*X2 X3 X2 X1 Y0 += B1*Y1 + B2*Y2 store Y0 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1 / d0 d3 / d2 d5 / d4 d7 / d6 d8 Load / Store k2k1k0 ld k0k1k2 b1b0 ld b0b1 x2x1x0 mulm.h e0,d2,d6ul,#1 y2 = k1*x1 + k0*x0 y2y1 madds.q y2 = y + x*k2 e0,e0,ssK l,ssX l,#1 x3x2x1 maddm.h y2 = y2 + e0,e0,ssB,ssYul,#1 b1*y1 + b2*y2 User Guide ld x0x1x2 ld y1y2 ld x1x2x3 st y0 95 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.12 BIQUAD 4 Coefficients Note: Validated on TriCore Rider D board. Equations: Wt = Xt + B1*Wt-1 + B2*Wt-2 Yt = Wt + K1*Wt-1 + K2*Wt-2; Pseudo code: sW0 sY0 sW2 sW1 = = = = sX0 + sB1*sW1 + sB2*sW2; sW0 + sK1*sW1 + sK2*sW2; sW1; sW0; IP=4 (4 madd) LD/ST= 10 (read sX0, read sW1, sW2, read sK1, sK2, read sB1, sB2, write sY0, write sW0 --> sW1, write sW1 --> sW2) Assembly code: ld.h mov ld.w ld.w maddm.h ld.w maddm.h st.h st.h User Guide d0,[Xptr] d1,#0 ssB,[BKptr+]4 ssW,[a14/a15+c]2 llW,llY,ssW,ssB ul,#1 ssK,[BKptr+]4 llY,llW,ssW,ssK ul,#1 [a14/a15+c]0,d7 [Yptr],d1 ; ; ; ; ; ; ; ; ; (1) || || (3) || (4,5) || || 96 ;x0-->d0 ; y0 = 0 ; b2 b1 ; w2 w1 ; w0=x0+w1*b1+w2*b2 ; k2 k1 ; y0=w0+w1*k1+w2*k2 ; w0--> w1 ; store y0 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1 / d0 d2 x = 0 (upper) x = 0 (lower) d5 d7 / d6 b2b1 ld w2w1 k2k1 maddm.h y0 = w0+w1*k1 llY,llW,ssW,ssK ul,#1 +w2*k2 w0 = ld k2k1 x+w1*b1+ w2*b2 w1w0 y0 Load / Store ld b2b1 w2w1 maddm.h llW,llY,ssW,ssB ul,#1 User Guide d4 st w0 st y0 97 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.13 N-stage BIQUAD 4 Coefficients Note: Validated on TriCore Rider D board. Equations: W0,n = Y0,n-1 + B1,n*W1,n + B2,n*W2,n Y0,n = W0,n + K1,n*W1,n + K2,n*W2,n n = 0..N-1 Pseudo code: for (n = 0; n< N; n++) { sW0 = sY0 + sB1[n]*sW1[n] + sB2[n]*sW2[n]; sY0 = sW0 + sK1[n]*sW1[n] + sK2[n]*sW2[n]; sW2[n] = sW1[n]; sW1[n] = sW0; } IP=4 (4 madd) LD/ST= 8 (read sW1, sW2, sK1, sK2, sB1, sB2, write sW0 --> sW1, write sW1 --> sW2) Note: sY0 can be kept in a register and does not need to be written back and re-read from memory between stages. Assembly code: ld.h mov ld.w ;mov ld.w maddm.h ld.w lea bq4loop: loop maddm.h st.h st.h Example User Guide d0,[Xptr] d1,#0 ssB,[BKptr+]4 d0,#0 ssW,[a14/a15+c]2 llW,llY,ssW,ssB ul,#1 ssK,[BKptr+]4 LC,(N - 2) ; ; ; ; ; ; ; ; (2) || (3) || (4) || (1) maddm.h llY,llW,ssW,ssK ul,#1 st.h [a14/a15+c]0,d7 ; || ld.w ssW,[a14/a15+c]2 ; || maddm.h llW,llY,ssW,ssB ul,#1 LC,bq4loop llY,llW,ssW,ssK ul,#1 ; (5,6) [a14/a15+c]0,d7 ; || [Yptr+]2,d1 ; || ; ; ; ; ; ; ; ; x0-->d0 y0 = 0 b2 b1 y0 = 0 w2 w1 w0=y0+w1*b1+w2*b2 k2 k1 get loop number ; ; ; ; (1,2) ; y0=w0+w1*k1+w2*k2 store w0 w1 w0 (3) ; w0=y0+w1*b1+w2*b2 ; y0=w0+w1*k1+w2*k2 ; store w0 ; store y0 N = 13 Î 44 cycles 98 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1 / d0 d2 d4 x = 0 (upper) d5 d7 / d6 Load / Store b2b1 ld b2b1 x = 0 (lower) w2w1 maddm.h llW,llY,ssW,ssB ul,#1 ld w2w1 k2k1 maddm.h y0= llY,llW,ssW,ssK ul,#1 w0+w1*k1+ w2*k2 w0= ld k2k1 x0+w1*b1+ w2*b2 w0 st w0 w1w0 ld w1w0 maddm.h llW,llY,ssW,ssB ul,#1 w0= y0+w1*b1+ w2*b2 maddm.h y0= llY,llW,ssW,ssK ul,#1 w0+w1*k1+ w2*k2 w0 y0 User Guide st w0 st y0 99 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.14 N-stage BIQUAD 5 Coefficients Note: Validated on TriCore Rider D board. Equations: W0,n = Y0,n-1 + B1,n*W1,n + B2,n*W2,n Y0,n = K0,n*W0,n + K1,n*W1,n + K2,n*W2n Pseudo code: for (n = 0; n<N; n++) { sW0 = sY0 + sB1[n]*sW1[n] + sB2[n]*sW2[n]; sY0 = sK0[n]*sW0 + sK1[n]*sW1[n] + sK2[n]*sW2[n]; sW2[n] = sW1[n]; sW1[n] = sW0; } IP= 5 (5 madd) LD/ST= 9 (read sW1, sW2, read sK0, sK1, sK2, read sB1, write sW0 --> sW1, write sW1 --> sW2) Note: sY0 can be kept in a register and does not need to be written back and re-read from memory between stages. Assembly code: ld.h mov ld.w mov ld.w maddm.h ld.d dextr lea bq5loop: loop mulm.h madds.q st.h st.h User Guide d0,[Xptr] d1,#0 ssB,[BKptr+]4 d0,#0 ssW,[a14/a15+c]2 llW,llY,ssW,ssB ul,#1 ssssK,[BKptr] d8,d7,d6,#16 LC,(N - 2) ; ; ; ; ; ; ; ; ; mulm.h llY,ssW,ssK2 ul,#1 ; st.q [a14/a15+c]0,d8 ; madds.q llY,llY,d8u,ssK u,#1 ; ld.w d2,[a14/a15+c]2 ; maddm.h llW,llY,ssW,ssB ul,#1 ; dextr d8,d7,d6,#16 ; LC,bq5loop llY,ssW,ssK2 ul,#1 ; llY,llY,d8u,ssK u,#1 ; [a14/a15+c]0,d8 ; [Yptr+]2,d1 ; 100 (2) || (3) || (4,5) || (6) (1) ; ; ; ; ; ; ; ; ; x0-->d0 y0 = 0 b2 b1 y0 = 0 w2 w1 w0=y0+w1*b1+w2*b2 k2 k1 k0 0 extract w0 get loop number (1) || (2) || (3,4) (5) ; ; ; ; ; ; y0=w1*k1+w2*k2 store w0 y0=y0+k0*w0 w1 w0 w0=y0+w1*b1+w2*b2 extract w0 (7) (8,9) || || ; ; ; ; y0=w1*k1+w2*k2 y0=y0+k0*w0 store w0 store y0 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Example N = 13 Î 71 cycles Register diagram: Instruction d1 / d0 D2 y=0 (lower) y=0 (upper) d3 d5 / d4 d7 / d6 b2b1 ld b2b1 W2w1 ld w2w1 maddm.h llW,llY,ssW,ssB ul,#1 k2k1k w0=y0+w1* 0 b1+w2*b2 dextr d8,d7,d6,#16 mulm.h llY,ssW,ssK2 ul,#1 d8 Load / Store ld k2k1k0 w0 y0=w1*k1+ w2*k2 madds.q y0+=k0*w0 llY,llY,d8u,ssK u,#1 w0 st w0 W1w0 ld w1w0 maddm.h llW,llY,ssW,ssB ul,#1 w0=y0+w1* b1+w2*b2 dextr d8,d7,d6,#16 mulm.h llY,ssW,ssK2 ul,#1 w0 y0=w1*k1+ w2*k2 madds.q y0+=k0*w0 llY,llY,d8u,ssK u,#1 w0 st w0 y0 User Guide st y0 101 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.15 Lattice Filter Equations: Zn = Zn-1 – Xn*Kn-1 n = 1..N-1 Tn-1 = Xn + Zn*Kn-1 Pseudo code: for (n = 1; n<N; n++) { sZ[n] = sZ[n-1] – sX[n]*sK[n-1]; sT[n-1] = sX[n] + sZ[n]*sK[n-1]; } IP = 2 (1 msub, 1 madd) LD/ST= 5 (read sX, read sK, read sZ, write sZ, write sT) Assembly code: lea ld.q ld.q ld.q msubrs.h ld.q ld.q msubrs.h latloop: loop maddrs.h st.q st.q Example User Guide LC,(N - 3) sX,[Xptr+]2 sK,[Kptr+]2 sY,[Yaddr] sZ,sY,sX,sK ul,#1 sX,[Xptr+]2 sK,[Kptr+]2 sZ,sZ,sX,sK ul,#1 ; ; ; ; ; ; ; ; (1) (2) (3) (4) (5,6) || || (7,8) ; ; ; ; ; ; ; ; get loop number x9 k10 y (input) z9 = y - x9*k10 x8 k9 z8 = z9 - x8*k9 maddrs.h ld.q ld.q msubrs.h st.q LC,latloop sT,sX,sZ,sK [Xptr]-4,sT [Xptr]-2,sZ sT,sX,sZ,sK ul,#1 sX,[Xptr+]2 sK,[Kptr+]2 sZ,sZ,sX,sK ul,#1 [Xptr]-6,sT ; ; ; ; ; (1,2) || || (3,4) || ; ; ; ; ; t9 = x8 + z8*k9 x7 k8 z7 = z8 - x7*k8 store t9 ul,#1 ; (9,10) ; t1 = x0 + z0*k1 ; || ; store t1 ; || ; store z0 (result) N = 10 Î 44 cycles 102 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 d1 d4 d5 d6 x9 Load / Store ld x9 k10 ld k10 Y msubrs.h z9 = y - x9*k10 sZ,sY,sX,sK ul,#1 ld y ld x8 ld x9 msubrs.h z8 = z9 - x8*k9 sZ,sZ,sX,sK ul,#1 maddrs.h sT,sX,sZ,sK ul,#1 t9 = x8 + z8*k9 x8 ld x8 k9 msubrs.h z7 = z8 – x7*k8 sZ,sZ,sX,sK ul,#1 maddrs.h sT,sX,sZ,sK ul,#1 User Guide ld k9 st t10 t1 = x0 + z0*k1 st t0 t1 st t1 103 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.16 Leaky LMS (Update Only) Note: Validated on TriCore Rider D board. Equation: Kt,n = Kt-1,n * B + Xt-n*u*Errt-1 n = 0..N-1 Pseudo code: for (n = 0; n<N; n++) sK[n] = sK[n]*sB + circular(sX[n])*sErr; IP = 2 (1 mul, 1 madd) LD/ST= 3 (read sX, read sK, write sK) Assembly code: lea ld.h ld.h ld.w mulr.h ld.d llmsloop: LC,(N/4 - 2) sB,[Baddr] sErr,[Eaddr] ssK0,[Kptr+]4 ssKK0,ssK0,sB ll,#1 ssssX,[a14/a15+c]8 ; ; ; ; ; ; maddrs.h loop ssKK0,ssKK0,ssX0,sErr ; ; ld.d ssssK,[Kptr+]8 ; mulr.h ssKK1,ssK1,sB ll,#1 ; ; st.w [Kptr]-12,ssKK0 ; maddrs.h ssKK1,ssKK1,ssX1,sErr ; ld.d ssssX,[a14/a15+c]8 ; mulr.h ssKK1,ssK1,sB ll,#1 ; ; st.w [Kptr]-8,ssK1 ; LC,llmsloop maddrs.h ssKK0,ssKK0,ssX0,sErr ll,#1 ld.w mulr.h ssK1,[Kptr+]4 ssKK1,ssK1,sB ll,#1 maddrs.h ssKK1,ssKK1,ssX1,sErr ll,#1 st.d [Kptr]-8,e0 User Guide 104 ; ; ; ; ; ; ; ; ; (1) (2) (3) (4) (5) || ; ; ; ; ; ; get loop number beta error K1 K0 K1=K1*B|| K0 =K0*B X3 X2 X1 X0 ll,#1 (1) ; || ; || ; (2) ; || ; || ; ll,#1 ; || ; || ; (4) ; || ; || ; K1+=X1*Err K0+=X0*Err K5 K4 K3 K2 K3= K3*B K2= K2*B store K0 K1 (3) ; K3+=X3*Err K2+=X2*Err X7 X6 X5 X4 K5=K5*B K4=K4*B store K2 K3 epilog (6) ; || ; || ; (7) ; || ; (8,9) ; || ; || ; K(n-2)+=X(n-2)*Err K(n-3)+=X(n-3)*Err Kn K(n-1) Kn=Kn*B K(n-1)= K(n-1)*B Kn +=Xn*Err K(n-1)+=X(n-1)*Err Kn K(n-1) K(n-2) K(n-3) v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Example N = 20 Î 27 cycles Register representation: X input sK1 Constant input Result of the instruction sK0 sB sK1*sB sK0*sB Register diagram: Instruction d0 d1 d3 d2 k1k0 __ __ mulr.h ssKK0,ssK0,sB ll,#1 k1= k1*B || k0= k0*B k5 k4 k3 k2 mulr.h ssKK1,ssK1,sB ll,#1 k3= k3*B || k2= k2*B maddrs.h ssKK1,ssKK1,ssX1, sErr ll,#1 k3+=x3*Err|| k2+=x2*Err k5= k5*B || k4= k4*B Load / Store ld k0k1 x3 x2 ld x1 x0 x0x1x2x3 maddrs.h k1+=x1*Err||k0 ssKK0,ssKK0,ssX0, sErr ll,#1 +=x0*Err mulr.h ssKK0,ssK0,sB ll,#1 d5 d4 ld k2k3k4k5 st k0k1 x7 x6 ld x5 x4 x4x5x6x7 st k3k2 maddrs.h ssKK0,ssKK0,ssX0, sErr ll,#1 mulr.h ssKK1,ssK1,sB ll,#1 maddrs.h ssKK1,ssKK1,ssX1, sErr ll,#1 User Guide 105 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.17 Delayed LMS Note: Validated on TriCore Rider D board. Equations: Yt = ∑ Xt-n*Kt-1,n n = 0..N-1 Kt,n = Kt-1,n + Xt-n*u*errt-1 n = 0..N-1 Pseudo code: sY = 0; for (n = 0; n<N; n++) { sY += circular(sX[n])*sK[n]; sK[n] += circular(sX[n])*sErr; } IP = 2 (2 madd) LD/ST = 3 (read sX, read sK, write sK) Assembly code: lea LC,(N/4 - 1) ld.h sErr,[Eaddr] mov d0,#0 ld.w ssK1,[Kptr+]4 mov d1,#0 ld.d ssssX,[a14/a15+c]8 dlmsloop: maddm.h llY,llY,ssX0,ssK0 ul,#1 st.d [Kptr]-16,e6 maddrs.h d6,ssK0,ssX0,sErr ll,#1 ld.d maddm.h ld.d maddrs.h loop st.d st.h ssssK,[Kptr+]8 llY,llY,ssX1,ssK1 ul,#1 ssssX,[a14/a15+c]8 d7,ssK1,ssX1,sErr ll,#1 LC,dlmsloop [Kptr]-16,e6; || [Yptr+]2,d1; (5) ; ; ; ; ; ; (1) (2) (3) || (4) || ; ; ; ; ; ; get loop number Err Y = 0 (lower) K0 K1 Y = 0 (upper) X0 X1 X2 X3 ; ; ; ; ; ; ; ; ; ; (1) ; Y += X0*K0+X1*K1 || ; store (next loop) K0 K1 K2 K3 (2) K0 +=X*Err|| K1 +=X1*Err || ; K2 K3 K4 K5 (3) ; Y += sX2*K2+X3*K3 || ; X4 X5 X6 X7 (4) K2 +=X2*Err||K3 +=X3*Err ; store last 4 K ; store Y Note: Warning! Dummy store on first iteration of the loop. Example User Guide N = 20 Î 27 cycles 106 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register representation: X input sX1 Error input Result of the instruction sX0 sErr sX1*sErr sX0*sErr Register diagram: Instruction d1 / d0 d3 d2 y = 0 (d1) d5 d4 d7 d6 Load / Store k1k0 ---- y = 0 (d0) ld k0k1 x3x2 x1x0 ld x0x1x2x3 maddm.h y += x0*k0 + x1*k1 llY,llY,ssX0,ssK0 ul,#1 k3k2 k1k0 maddrs.h d6,ssK0,ssX0,d8ul,#1 k5k4 k3k2 dummy store ----------k1 += x1*err || k0 += x0*err maddm.h y + = x2*k2 + x3*k3 x7x6 llY,llY,ssX1,ssK1 ul,#1 x5x4 maddrs.h d7,ssK1,ssX1,d8ul,#1 ld x4x5x6x7 k3 += x3*err || k2 += x2*err k1 += x1*err || k0 += x0*err next iteration User Guide ld k2k3k4k5 st k0k1k2k3 107 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.18 Delayed LMS – 32-bit Coefficients Note: Validated on TriCore Rider D board. Equations: Yt = ∑ Xt-n*Kt-1,n n = 0..N-1 Kt,n = Kt-1,n + Xt-n*u*errt-1 n = 0..N-1 Pseudo code: sY = 0; for (n = 0; n< N; n++) { sY += circular(sX[n])* up(lK[n]); lK[n] += circular(sX[n])* sErr; } IP = 2 (2 madd) LD/ST= 3 (read sX, read lK, write lK) Assembly code: lea LC,(N/2 - 2) ;(1) ; get loop number ld.h sErr,[Eaddr] ;(2) ; err mov d0,#0 ;(3) ; y = 0 (lower) ld.d llK,[Kptr+]8 ;|| ; k0k1 mov d1,#0 ;(4) ; y = 0 (upper) ld.w ssX,[a14/a15+c]4 ;|| ; x0x1 madds.h llY,llY,ssX,lK0 ul,#1 ;(5) ; y +=x0*k0 dlms32loop: madds.h llY,llY,ssX,lK1 uu,#1 ;(1) ; y += x1*k1 madds.h e6,llK,ssX,sErr ll,#1 ;(2,3) ;k0=k0+x0*err||k1=k1+x1*err ld.d llK,[Kptr+]8 ;|| ; k3k2 ld.w ssX,[a14/a15+c]4 ;|| ; x2x3 madds.h llY,llY,ssX,lK0 ul,#1 ;(4) ; y=y+x2*k2 st.d [Kptr]-16,llK ;|| ; store k0k1 loop LC,dlms32loop madds.h llY,llY,ssX,lK1 uu,#1 ;(6) ; y=y+xn*kn madds.h e6,llK,ssX,sErr ll,#1 ;(7,8); kn=kn+xn*err ;|| ; k(n-1)=k(n-1)+x(n-1)*err st.h [Yptr]2,d1 ;|| ; store Y st.d [Kptr]-8,e6 ;|| ; store k(n-1)kn Example User Guide N = 20 Î 46 cycles 108 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d1 / d0 d3 / d2 y=0 (lower) y=0 (upper) madds.h llY,llY,ssX,lK0 ul,#1 y += x0*k0 madds.h llY,llY,ssX,lK1 uu,#1 y += x1*k1 d5 / d4 d7/ d6 k1k0 ld k1k0 x1x0 ld x1x0 madds.h e6,llK,ssX,sErr ll,#1 k3k2 k1 += x1*err || k0 += x0*err x3x2 madds.h llY,llY,ssX,lK0 ul,#1 y += x2*k2 madds.h llY,llY,ssX,lK1 uu,#1 y=y+xn*kn ld k3k2 ld x3x2 k5k4 madds.h e6,llK,ssX,sErr ll,#1 User Guide Load / Store st k1k0 kn=kn+xn*err || k(n-1)=k(n-1) + x(n-1)*err 109 st y st k(n-1)kn v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 6.19 Delayed LMS – Complex Note: Validated on TriCore Rider D board. Equations: Yrt = ∑ Xrt-n*Krt-1,n- Xit-n*Kit-1,n Yit = ∑ Xrt-n*Kit-1,n + Xit-n*Krt-1,n Krt,n = Krt-1,n + Xrt-n*u*Er t-1 - Xit-n*u*Ei t-1 Kit,n = Kit-1,n + Xrt-n*u*Ei t-1 + Xit-n*u*Er t-1 n = 0..N-1 Pseudo code: sYr = 0; sYi = 0; for (n = 0; n<N; n++) { sYr += circular(sXr[n]*sKr[n] – circular(sXi[n])*sKi[n]; sYi += circular(sXi[n]*sKr[n] + circular(sXr[n])*sKi[n]; sKr[n] += circular(sXr[n])*sEr - circular(sXi[n])*sEi; sKi[n] += circular(sXr[n])*sEi + circular(sXi[n])*sEr; } IP=8 (6 madd, 2 msub) User Guide LD/ST=6 (read sXr, read sXi, read sKr, read sKi, write sKr, write sKi) 110 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Assembly code: lea mov ld.w ld.w ld.w LC,(N - 2) ssY,#0 ssE,[Eaddr] ssX,[Xptr+]4 ssK,[Kptr+]4 ; ; ; ; ; (1) (2) || (3) (4) maddsurs.h ssY,ssY,ssX,ssK uu,#1 ; (5) maddsurs.h ssK0,ssK,ssX,ssE uu,#1 ; (6) dlmscpxloop: maddrs.h ssY,ssY,ssX,ssK ll,#1 ; (1) ; || ld.w ssK,[Kptr+]4 ; || maddrs.h ssK0,ssK0,ssX,ssE ll,#1 ; (2) ; || ld.w ssX,[a14/a15+c]4 ; || maddsurs.h ssY,ssY,ssX,ssK uu,#1 ; (3) st.w [Kptr]-8,ssK0 ; || maddsurs.h ssK0,ssK,ssX,ssE uu,#1 ; (4) loop LC,dlmscpxloop maddrs.h maddrs.h ssY,ssY,ssX,ssK ll,#1 ssK0,ssK0,ssX,ssE ll,#1 st.w st.w [Yptr+]4,ssY [Kptr]-4,ssK0 Example User Guide ; ; ; ; ; ;get loop number ;yi = 0 || yr = 0 ;Ei || Er ;xixr0 ;ki0kr0 ;yi+=xr0*ki0 || yr-=xi0*ki0 ;ki0+=xr0*ei||kr0-=xi0*ei ;yi+=xi0*kr0 ;yr+=xr0*kr0 ;ki1kr1 ;ki0+=xi0*er ;kr0+=xr0*er ;xixr1 ;yi+=xr1*ki1 || yr-=xi1*ki1 ;store ki0kr0 ;ki1+=xr1*ei || kr1-=xi1*ei (7) ;yi+=xin*krn || yr+=xrn*krn (8,9) kin+=xin*er || krn+=xrn*er || ;store yryi || ;store kin krn N = 20 Î 87 cycles 111 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 d1 d4 d5 yi = 0 || yr = 0 xixr0 d6 Load / Store EiEr ld EiEr ld xixr0 ki0kr0 ld ki0kr0 ki1kr1 ld ki1kr1 maddsurs.h yi+=xr0*ki0 || ssY,ssY,ssX,ssK uu,#1 yr-=xi0*ki0 maddsurs.h ssK1,ssK,ssX,ssE uu,#1 maddrs.h ssY,ssY,ssX,ssK ll,#1 ki0+=xr0*ei ||kr0-=xi0*ei yi+=xi0*kr0 || yr+=xr0*kr0 maddrs.h ssK1,ssK,ssX,ssE ll,#1 ki0+=xi0*er xixr1 ||kr0+=xr0*er maddsurs.h yi+=xr1*ki1 || ssY,ssY,ssX,ssK uu,#1 yr-=xi1*ki1 maddsurs.h ssK1,ssK,ssX,ssE uu,#1 maddrs.h ssY,ssY,ssX,ssK ll,#1 maddrs.h ssK1,ssK,ssX,ssE ll,#1 ki0kr0 st ki0kr0 yryi st yryi ki1+=xr1*ei ||kr1-=xi1*ei yi+=xin*krn || yr+=xrn*krn kin+=xin*er || krn+=xrn*er kinkrn User Guide ld xixr1 112 st kinkrn v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 7 Transforms The DSP functions classed as transforms are much more complete than simple kernel functions. They cover complete applications and can be sub-divided into a series of kernel routines. Taking the example of FFT, this is sub-divided into: • • • • • • • Initialisation Bit reverse Pass loop (most external loop) Group loop (medium loop) Butterfly loop (inner loop) Squaring of results Normalisation, etc. Note: Each loop has a different importance. For a 256 pt FFT the pass loop will be carried out 8 times, whereas the butterfly loop in radix 2 will be done 8*128 times. The butterfly can be divided into various types: • • • • • • Radix 2 or radix 4 or radix 8 or radix N Real or complex data Real or complex coefficients 16-bit or 32-bit data 16-bit or 32-bit coefficients Decimation In Time (DIT) or Decimation In Frequency (DIF) butterfly The butterfly can be implemented in several ways: • With or without shift (shift is generally necessary above 32 points and application dependent) • Use of block floating point • Use of packed data (2 points at a time) • Data incremented in power of 2, coefficients constants • Data incremented, coefficients incremented • Data incremented, coefficients incremented in power of 2 • Degenerated (case of first 3 passes in radix 8) As demonstrated, the number of variables is large. The most common cases of inner loops of radix 2 butterflies for FFTs are: • • • • • Real butterfly – DIT – radix 2 Real butterfly – DIF – radix 2 Complex butterfly – DIT – radix 2 Complex butterfly – DIT – radix 2 – with shift Complex butterfly – DIF – radix 2 User Guide 113 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Real butterfly – DIT – radix 2 A simple implementation, 2 points at a time. The coefficients and data are incremented by 2 points. Conditions: • Works for pass 1 to pass N-1; does not work for pass 0 (due to packing of data) • Number of coefficients must be doubled (duplicated table) Real butterfly – DIF – radix 2 As above, a simple implementation, 2 points at a time. The coefficients and data are incremented by 2 points. Conditions: • Works for pass 0 to pass N-2; does not work for pass N-1 (due to packing of data) • Number of coefficients must be doubled (duplicated table). Complex butterfly – DIT – radix 2 A straightforward implementation, 2 points at a time. The coefficients are incremented by an index value. Conditions: • Works for pass 1 to pass N-1; does not work for pass 0 (due to packing of data) Complex butterfly – DIT – radix 2 – with shift Same implementation as above, with shift. Conditions: • Works for pass 1 to pass N-1; does not work for pass 0 (due to packing of data) Complex butterfly – DIF – radix 2 A straightforward implementation, 2 points at a time. The coefficients are incremented by an index value. Conditions: • Works for pass 0 to pass N-2; does not work for pass N-1 (due to packing of data) User Guide 114 v1.6.4, 2003-01 Name Cycles Code Optimization Size 1) Techniques Software Pipelining Loop Unrolling Packed Operation Load/Store Scheduling Data Memory Interleaving Packed Load/Store Saturation Rounding TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Real butterfly – DIT2) radix 2 (5*N/2 +2) +3 32 9 - 9 9 - 9 - 9 Real butterfly – DIF3) radix 2 (5*N/2 +2) +3 36 - - - - - 9 9 9 Complex butterfly – DIT radix 2 (9*N/2 +2) +3 74 9 - 9 9 - 9 - - Complex butterfly (11*N/2 +2) +3 – DIT radix 2 – with shift 82 9 - 9 9 - 9 - - Complex butterfly – DIF radix 2 76 9 - 9 9 - 9 - - (9*N/2 +2) +3 1) Code size is in Bytes 2) DIT = Decimation In Time 3) DIF = Decimation In Frequency User Guide 115 Arithmetic Methods v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 7.1 Real Butterfly – DIT – Radix 2 Equations: Xn’ = Xn + Yn*Kn Yn’ = Xn – Yn*Kn n=0..N-1 Pseudo code: for (n = 0; n<N; n++) { sXX[n] = sX[n] + sY[n]*sK[n]; sYY[n] = sX[n] - sY[n]*sK[n]; } IP = 2 (1 madd, 1 msub) LD/ST= 5 (read sX, read sK, read sY, write sXX, write sYY) Pseudo code implementation: ssP1 = ssY1 * ssK1; ssP0 = ssY0 * ssK0; ssXX1 = ssX1 + ssP1; ssXX0 = ssX0 + ssP0; ssYY1 = ssX1 – ssP1; ssYY0 = ssX0 – ssP0; Assembly code: lea LC,(N/2 – 1) ld.w ssY,[Yptr+]4 ld.w ssK,[Kptr+]4 rditloop: mulr.h ssP,ssY,ssK ul,#1 ld.w ssX,[Xptr+]4 ld.w ssY,[Yptr+]4 add.h ssXX,ssX,ssP st.w [Xptr]-4,ssXX sub.h ssYY,ssX,ssP st.w [Yptr]-8,ssYY ld.w ssK,[Kptr+]4 loop LC,rditloop User Guide 116 ; (1) ; (2) ; (3) ;get loop number ;y1y0 ;k1k0 ;(1,2) ; || ; || ; (3) ; || ; (4) ; || ; (5) ;p1=y1*k1 || p0=y0*k0 ;x1 x0 ;y3 y2 ;xx1=x1+p1 || xx0=x0+p0 ;store xx1 xx0 ;yy1=x1-p1 || yy0=x0-p0 ;store yy1 yy0 ;k3 k2 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 d2 d3 d4 d6 d8 Load / Store y1 y0 ld y1y0 k1 ld k1k0 k0 mulr.h ssP,ssY,ssKul,#1 p1=y1*k1 || x1 p0=y0*k0 x0 ld x1x0 y3 y2 add.h ssXX,ssX,ssP x’1=x1+p1 || x’0=x0+p0 sub.h ssYY,ssX,ssP ld y3y2 st x’1x’0 y’1=x1-p1 || y’0=x0-p0 st y’1y’0 k3 ld k3k2 k2 User Guide 117 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 7.2 Real Butterfly – DIF – Radix 2 Equations: Xn’ = Xn + Yn Yn’ = Xn*Kn – Yn*Kn n=0..N-1 Pseudo code: for (n = 0; n<N; n++) { sXX[n] = sX[n] + sY[n]; sYY[n] = sX[n]*sK[n] - sY[n]*sK[n]; } IP = 3 (1 add, 1 mul, 1 msub) LD/ST= 5 (read sX, sK, sY, write sXX, sYY) Pseudo code implementation (loop only): ssXX1 = ssX1 + ssY1; ssXX0 = ssX0 + ssY0; ssD1 = ssX1 – ssY1; ssD0 = ssX0 – ssY0; ssYY1 = ssD1*ssK1; ssYY0 = ssD0*ssK0; Assembly code: lea LC,(N/2 - 1) ld.w ssY,[Yptr+]4 ld.w ssX,[Xptr+]4 rdifloop: add.h ssXX,ssX,ssY st.w [Xptr]-4,ssXX sub.h ssD,ssX,ssY ld.w ssY,[Yptr+]4 ld.w ssK,[Kptr+]4 mulr.h ssYY,ssD,ssK ul,#1 ld.w ssX,[Xptr+]4 st.w [Yptr]-8,ssYY loop LC,rdifloop User Guide ; (1) ; (2) ; (3) ; get loop number ; y1y0 ; x1x0 ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; 118 (1) || (2) || (3) (4,5) || || xx1=x1+y1 || xx0=x0+y0 store xx1 xx0 d1=x1-y1 || d0=x0-y0 y3 y2 k1 k0 yy1=d1*k1 || yy0=d0*k0 x3 x2 store yy1 yy0 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 d2 d3 / d2 d4 d6 d8 Load / Store y1 y0 ld y1y0 x1 x0 add.h ssXX,ssX, ssY ld x1x0 xx1=x1+y1|| xx0=x0+y0 st xx1 xx0 sub.h ssD,ssX,ssY d1=x1-y1 || d0=x0-y0 y3 y2 ld y3y2 k1 ld k1k0 k0 mulr.h ssYY,ssD,ssK ul,#1 yy1=d1*k1|| yy0=d0*k0 yy1 yy0 User Guide 119 x3 x2 ld x3x2 st yy1 yy0 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 7.3 Complex Butterfly – DIT – Radix 2 Equations: X’r = Xr + (Yr*Kr – Yi*Ki) X’i = Xi + (Yi*Kr + Yr*Ki) Y’r = Xr – (Yr*Kr - Yi*Ki) Y’i = Xi – (Yi*Kr + Yr*Ki) Pseudo code: for (n = 0; n<N; n++) { sXXr[n] = sXr[n] sXXi[n] = sXi[n] sYYr[n] = sXr[n] sYYi[n] = sXi[n] } + + – – sYr[n]*sKr[n] sYi[n]*sKr[n] sYr[n]*sKr[n] sYi[n]*sKr[n] + + - sYi[n]*sKi[n]; sYr[n]*sKi[n]; sYi[n]*sKi[n]; sYr[n]*sKi[n]; LD/ST= 10 (read sXr, sXi, sYr, sYi, read sKr, sKi, write sXXr, sXXi, sYYr, sYYi) IP=8 (4 madd, 4 msub) Pseudo code implementation (loop only): sPi0 sPi0 sPi1 sPi1 sXXi0 sYYi0 sXXi1 sYYi1 = = = = = = = = sYr0 sPi0 sYr1 sPi1 sXi0 sXi0 sXi1 sXi1 User Guide * + * + + – + – sKi0 sYi0 * sKr0 sKi1 sYi1 * sKr1 sPi0 sPi0 sPi1 sPi1 ; ; ; ; ; ; ; ; sPr0 sPr0 sPr1 sPr1 sXXr0 sYYr0 sXXr1 sYYr1 = = = = = = = = 120 sYr0 sPr0 sYr1 sPr1 sXr0 sXr0 sXr1 sXr1 * * – + – + – sKr0; sYi0 * sKi0; sKr1; sYi1 * sKi1; sPr0; sPr0; sPr1; sPr1; v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Assembly code: lea LC,(N/2 - 1) ; (1) ;get loop number lea kindex,4 ; (2) ;index is pass dependent ld.d e6,[Yptr+]8 ; (3) ;y1y0 ld.w ssK1,[Kptr] ; (4) ;kikr0 cditloop: mulr.h ssP1,ssK1,ssY1 ll,#1 ; (1) ;pi0=yr0*ki0 || pr0=yr0*kr0 add.a Kptr,Kptr,kindex ; || ; maddsur.h ssP1,ssP1,ssK1,ssY1 uu,#1 ; (2) ; pi0+=kr0*yi0 || pr0-=yi0*ki0 ld.w ssK2,[Kptr] ; || ;kikr1 mulr.h ssP2,ssK2,ssY2 ll,#1 ; (3) ; pi1=yr1*ki1 || pr1=yr1*kr1 maddsur.h ssP2,ssP2,ssK2,ssY2 uu,#1 ; (4,5) ; pi1+=kr1*yi1 ||pr1-=yi1*ki1 ld.d e4,[Xptr+]8 ; || ; xixr1 xixr0 ld.d e6,[Yptr+]8 ; || ; yiyr3 yiyr2 add.h ssXX1,ssX1,ssP1 ; (6) ; xi’0=xi0+pi0 || xr’0=xr0+pr0 add.a Kptr,Kptr,kindex ; || ; sub.h ssYY1,ssX1,ssP1 ; (7) ; yi’0=xi0-pi0 || yr’0=xr0-pr0 ld.w ssK1,[Kptr] ; || ; kikr2 add.h ssXX2,ssX2,ssP2 ; (8) ; xi’1=xi1+pi1 || xr’1=xr1+pr1 st.d [Xptr]-8,e0 ; || ; store x’1x’0 sub.h ssYY2,ssX2,ssP2 ; (9) ; yi’1=xi1-pi1 || yr’1=xr1-pr1 st.d [Yptr]-16,e2 ; || ; store yy1yy0 loop LC,cditloop User Guide 121 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 d1 d2 d3 d5 / d4 d7/ d6 d8 d9 d10 d11 yiyr1 yiyr0 kikr0 mulr.h ssP1,ssK1,ssY1 ll,#1 pi0 || pr0 maddsur.h ssP1,ssP1,ssK1,s sY1 uu,#1 kikr1 pi0 || pr0 mulr.h ssP2,ssK2,ssY2 ll,#1 pi1 || pr1 maddsur.h ssP2,ssP2,ssK2,s sY2 uu,#1 xixr1 xixr0 pi1 || pr1 yiyr3 yiyr2 add.h xi’0 || ssXX1,ssX1,ssP1 xr’0 sub.h ssYY1,ssX1,ssP1 yi’0 || yr’0 kikr2 add.h x’1x’0 xi’1 || ssXX2,ssX2,ssP2 xr’1 sub.h ssYY2,ssX2,ssP2 User Guide yy1 yy0 yi’1 || yr’1 122 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 7.4 Complex Butterfly – DIT – Radix 2 – with shift Equations: X’r = Xr + (Yr*Kr – Yi*Ki) X’i = Xi + (Yi*Kr + Yr*Ki) Y’r = Xr – (Yr*Kr - Yi*Ki) Y’i = Xi – (Yi*Kr + Yr*Ki) Pseudo code: for (n = 0; n<N; n++) { sXXr[n] = sXr[n] sXXi[n] = sXi[n] sYYr[n] = sXr[n] sYYi[n] = sXi[n] } + + – – sYr[n]*sKr[n] sYi[n]*sKr[n] sYr[n]*sKr[n] sYi[n]*sKr[n] + + - sYi[n]*sKi[n]; sYr[n]*sKi[n]; sYi[n]*sKi[n]; sYr[n]*sKi[n]; LD/ST= 10 (read sXr, sXi, sYr, sYi, read sKr, sKi, write sXXr, sXXi, sYYr, sYYi) IP=8 (4 madd, 4 msub) Pseudo code implementation (loop only): sPi0 = sYr0 * sKi0 sPi0 = sPi0 + sYi0 * sKr0 sPi1 = sYr1 * sKi1 sPi1 = sPi1 + sYi1 * sKr1 sXi0 >> 1 sXXi0 = sXi0 + sPi0 sYYi0 = sXi0 – sPi0 sXi1 >> 1 sXXi1 = sXi1 + sPi1 sYYi1 = sXi1 – sPi1 User Guide ; ; ; ; ; ; ; ; ; ; sPr0 = sYr0 * sKr0; sPr0 = sPr0 - sYi0 * sKi0; sPr1 = sYr1 * sKr1; sPr1 = sPr1 – sYi1 * sKi1; sXr0 >> 1; sXXr0 = sXr0 + sPr0; sYYr0 = sXr0 – sPr0; sXr1 >> 1; sXXr1 = sXr1 + sPr1; sYYr1 = sXr1 – sPr1; 123 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Assembly code: lea lea ld.d ld.w LC,(N/2 - 1) kindex,4 e6,[Yptr+]8 ssK1,[Kptr] ; ; ; ; (1) (2) (3) (4) ; ; ; ; get loop number index is pass dependent y1y0 kikr0 cditsloop: mulr.h ssP1,ssK1,ssY1 ll,#0 ; (1) ; pi0=yr0*ki0 || pr0=yr0*kr0 add.a Kptr,Kptr,kindex ; || ; maddsur.h ssP1,ssP1,ssK1,ssY1 uu,#0 ; (2) ; pi0+=kr0*yi0 || pr0-=yi0*ki0 ld.w ssK2,[Kptr] ; || ; kikr1 mulr.h ssP2,ssK2,ssY2 ll,#0 ; (3) ; pi1=yr1*ki1 || pr1=yr1*kr1 maddsur.h ssP2,ssP2,ssK2,ssY2 uu,#0 ; (4,5) ; pi1+=kr1*yi1 || pr1-=yi1*ki1 ld.d e4,[Xptr+]8 ; || ; xixr1 xixr0 sha.h ssX1,ssX1,#-1 ; (6) ; x0>>1 add.a Kptr,Kptr,kindex ; || ; add.h ssXX1,ssX1,ssP1 ; (7) ; xi’0=xi0+pi0 || xr’0=xr0+pr0 ld.w ssK1,[Kptr] ; || ; kikr2 sub.h ssYY1,ssX1,ssP1 ; (8) ; yi’0=xi0-pi0 || yr’0=xr0-pr0 ld.d e6,[Yptr+]8 ; || ; yiyr3 yiyr2 sha.h ssX2,ssX2,#-1 ; (9) ; x1>>1 add.h ssXX2,ssX2,ssP2 ; (10) ; xi’1=xi1+pi1 || xr’1=xr1+pr1 st.d [Xptr]-8,e0 ; || ; store x’1x’0 sub.h ssYY2,ssX2,ssP2 ; (11) ; yi’1=xi1-pi1 || yr’1=xr1-pr1 st.d [Yptr]-16,e2 ; || ; store yy1yy0 loop LC,cditsloop User Guide 124 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 d1 d2 d3 d5 / d4 d7 / d6 d8 d9 d10 d11 yiyr1 yiyr0 kikr0 mulr.h ssP1,ssK1,ssY1 ll,#1 pi0|| pr0 maddsur.h ssP1,ssP1,ssK1,s sY1 uu,#1 kikr1 mulr.h ssP2,ssK2,ssY2 ll,#1 pi1|| pr1 maddsur.h ssP2,ssP2,ssK2,s sY2 uu,#1 xixr1 xixr0 sha.h ssX1,ssX1,#-1 add.h ssXX1,ssX1,ssP1 xi’0 || xr’0 User Guide kikr2 yi’0|| yr’0 sha.h ssX2,ssX2,#-1 sub.h ssYY2,ssX2,ssP2 pi1|| pr1 x0>>1 sub.h ssYY1,ssX1,ssP1 add.h ssXX2,ssX2,ssP2 pi0|| pr0 yiyr3 yiyr2 x1>>1 x’1x’0 xi’1|| xr’1 yy1y yi’1||yr’1 y0 125 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 7.5 Complex Butterfly – DIF – Radix 2 Equations: X’r = Xr + Yr X’i = Xi + Yi Y’r = (Xr – Yr)*Kr – (Xi – Yi)*Ki Y’i = (Xi – Yi)*Kr + (Xr – Yr)*Ki Pseudo code: for (n = 0; n<N; n++) { sXXr[n] = sXr[n] + sYr[n]; sXXi[n] = sXi[n] + sYi[n]; sYYr[n] = (sXr[n] – sYr[n])*sKr[n] – (sXi[n] - sYi[n])*sKi[n]; sYYi[n] = (sXi[n] – sYi[n])*sKr[n] + (sXr[n] - sYr[n])*sKi[n]; } IP=8 (2 add, 2 sub, 2 mul, 1 madd, 1 msub) LD/ST= 10 (read sXr, sXi, sYr, sYi, read sKr, sKi, write sXXr, sXXi, sYYr, sYYi) Pseudo code implementation: sXXi0 sXXi1 sDi0 sDi1 sYYi0 sYYi0 sYYi1 sYYi1 = = = = = = = = sXi0 + sYi0 sXi1 + sYi1 sXi0 - sYi0 sXi1 - sYi1 sDi0 * sKr0 sYYi0 + sDr0 * sKi0 sDi1 * sKr1 sYYi1 + sDr1 * sKi1 User Guide ; ; ; ; ; ; ; ; sXXr0 sXXr1 sDr0 sDr1 sYYr0 sYYr0 sYYr1 sYYr1 = = = = = = = = 126 sXr0 + sYr0; sXr1 + sYr1; sXr0 - sYr0; sXr1 – sYr1; sDr0 * sKr0; sYYr0 – sDi0 * sKi0; sDr1 * sKr1; sYYr1 – sDi1 * sKi1; v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Assembly code: lea LC,(N/2 - 1) ; lea kindex,4 ; ld.d e6,[Yptr+]8 ; ld.d e4,[Xptr+]8 ; cdifloop: add.h ssXX1,ssX1,ssY1 ; add.a Kptr,Kptr,kindex ; add.h ssXX2,ssX2,ssY2 ; ld.w ssK1,[Kptr] ; sub.h ssD1,ssX1,ssY1 ; add.a Kptr,Kptr,kindex ; sub.h ssD2,ssX2,ssY2 ; ld.w ssK2,[Kptr] ; mulr.h ssYY1,ssD1,ssK1 ll,#1 127 v1.6.4, 2003-01 ld.d e4,[Xptr+]8 ; maddsur.h ssYY2,ssYY2,ssD2,ssK2 User Guide get loop number index is pass dependent y1y0 x1x0 ; ; ssYY2,ssD2,ssK2 ll,#1 ld.d e6,[Yptr+]8 st.d [Yptr]-16,e2 loop LC,cdifloop ; ; ; ; ; xi’0=xi0+yi0 || xr’0=xr0+yr0 ; ; xi’1=xi1+yi1 || xr’1=xr1+yr1 ; kikr0 ; di0=xi0-yi0 || dr0=xr0-yr0 ; ; di1=xi1-yi1 || dr1=xr1-yr1 ; kikr1 ; (5) ; yi’0=di0*kr0 || yr’0=dr0*kr0 || ; store x’1x’0 uu,#1; (6) ; yi’0+=dr0*ki0||yr’0-=di0*ki0 ; (7) ; yi’1=di1*kr1 || yr’1=dr1*kr1 || ; xixr3 xixr2 uu,#1 ;(8,9) ; yi’1+=dr1*ki1||yr’1-=di1*ki1 || ; yiyr3 yiyr2 || ; store yy1yy0 st.d [Xptr]-8,e0 ; maddsur.h ssYY1,ssYY1,ssD1,ssK1 mulr.h (1) (2) (2) (3) (1) || (2) || (3) || (4) || TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines Register diagram: Instruction d0 d1 d2 d3 d5 / d4 d7/ d6 d8 d9 d10 d11 yiyr1 yiyr0 xixr1 xixr0 add.h ssXX1,ssX1,ssP1 xi’0||xr’0 add.h ssXX2,ssX2,ssP2 sub.h ssYY1,ssX1,ssP1 xi’1||xr’1 kikr0 di0|| dr0 sub.h ssYY2,ssX2,ssP2 di1| dr1 kikr1 mulr.h x’1x’0 ssP1,ssK1,ssY1 ll,#1 yi0|| yr0 maddsur.h ssP1,ssP1,ssK1,ssY 1 uu,#1 yi0|| yr0 xixr3 xixr2 mulr.h ssP2,ssK2,ssY2 ll,#1 maddsur.h ssP2,ssP2,ssK2,ssY 2 uu,#1 yi1|| yr1 yiyr3 yiyr2 yi1|| yr1 yy1 yy0 User Guide 128 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 8 Appendices 8.1 Tools Several tools have been used to test the DSP routines shown in this Optimization guide: • EDE Tasking 1.4 r1: Used to build the projects. One project was created for each DSP algorithm, and one project for the cycle count. • CrossView debugger: Used to run the routine. The output is the screen (stdout) or a file. • A spreadsheet such as Excel: Used to compare TriCore fixed-point precision (or approximation of the fixed-point DSP algorithms) and the floating point results. 8.2 TriBoard Project Cycles Count The project will send the number of cycles of the routine under test by the serial port. The configuration used is indicated below: Hardware Software TriBoard Rider-D w/MMU M3301. Power supply 12V/2A One Parallel cable female/female One serial cable male/male Tasking EDE 1.4 r1 Crossview debugger 1.1 Windows Hyperterminal Excel 8.2.1 Steps to Run the Project 1. Connect the TriBoard with the computer 2. Launch the Hyperterminal The settings should be: Port used : COM1 (This is the com port where the serial cable connects the board to the computer) Bits/Second : 9600 Data bits :8 Parity : None Stop bits :1 Flow : Hardware User Guide 129 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 3. Connect the power supply. The board will respond with: Hello World! I’am the TriBoard with Rider-D developed at INFINEON Technologies AG in Munich Department AI MC AE St.-Martin-Str. 76 D-81541 Munich Tel.:+49-89-234-0 Fax.:+49-89-234-81785 If you have questions to this board or to TriCore CPU, see the manuals on the TriBoard CD. Have fun working with me! The CPU running at 160000000 Hz The EBU running at 80000000 Hz Checking On Board SDRAM: ok running since: 00h:00min:02sec 4. Launch Tasking EDE and select the project space “Testing DSP”. Open the project “cycles count”, add the routine (see 1.3.2) and compile it. 5. Open Crossview debugger 1.1 and load the file cycle count.abs in the board. 6. Run the code. A message is sent by the board: f CPU = 160000000 Hz f EBU = 80000000 Hz K = 1 N = 10 -------------------------------number of cycle : 102018 -------------------------------- User Guide 130 v1.6.4, 2003-01 TriCore™ 32-bit Unified Processor DSP Optimization Guide Part 2: Routines 9 Glossary Reference Definition asm Assembly code CPU Central Processing Unit DIF Decimation In Frequency DIT Decimation In Time DSP Digital Signal Processing FFT Fast Fourier Transformations FIR Finite Impulse Response FPI Flexible Peripheral Interface IIR Infinite Impulse Response IP Integer Processing LDMST Load Modify Store instruction LMS Least Mean Square LS Load Store LSB Least Significant Bit LT Less Than - compare instruction MAC Multiply and Accumulate MSB Most Significant Bit nop No Operation PMU Program Memory Unit TC Abbreviation for TriCore (TC1 or TriCore v2.0, for example) User Guide 131 v1.6.4, 2003-01 ((132)) Infineon goes for Business Excellence “Business excellence means intelligent approaches and clearly defined processes, which are both constantly under review and ultimately lead to good operating results. Better operating results and business excellence mean less idleness and wastefulness for all of us, more professional success, more accurate information, a better overview and, thereby, less frustration and more satisfaction.” Dr. Ulrich Schumacher http://www.infineon.com Published by Infineon Technologies AG