TriLib DSP Library User's Manual

U s e r ’s M a n u a l , V 1 . 2 , J a n . 2 0 0 1
TriLib
A D S P L i b r a r y f o r T r i C o r e TM
IP Cores
N e v e r
s t o p
t h i n k i n g .
Edition 2000-01
Published by Infineon Technologies AG,
St.-Martin-Strasse 53,
D-81541 München, Germany
© Infineon Technologies AG 2002.
All Rights Reserved.
Attention please!
The information herein is given to describe certain components and shall not be considered as warranted
characteristics.
Terms of delivery and rights to technical change reserved.
We hereby disclaim any and all warranties, including but not limited to warranties of non-infringement, regarding
circuits, descriptions and charts stated herein.
Infineon Technologies is an approved CECC manufacturer.
Information
For further information on technology, delivery terms and conditions and prices please contact your nearest
Infineon Technologies Office in Germany or our Infineon Technologies Representatives worldwide (see address
list).
Warnings
Due to technical requirements components may contain dangerous substances. For information on the types in
question please contact your nearest Infineon Technologies Office.
Infineon Technologies Components may only be used in life-support devices or systems with the express written
approval of Infineon Technologies, if a failure of such components can reasonably be expected to cause the failure
of that life-support device or system, or to affect the safety or effectiveness of that device or system. Life support
devices or systems are intended to be implanted in the human body, or to support and/or maintain and sustain
and/or protect human life. If they fail, it is reasonable to assume that the health of the user or other persons may
be endangered.
U s e r ’s M a n u a l , V 1 . 1 , S e p t . 2 0 0 0
TriLib
A D S P L i b r a r y f o r T r i C o r e TM
N e v e r
s t o p
t h i n k i n g .
TriLib
Revision History:
2000-01
Previous Version:
-
Page
V 1.2
V 1.1
Subjects (major changes since last revision)
New functions (Mathematical, Statistical, FFT)
Current Version
-
V 1.2
All the functions are ported to GNU Compiler
New functions
(Random number, Mixed Adaptive, Mixed FFT, Multirate FIR)
Page 407
Applications
GUI on the host side to provide the visual control for two embedded target
applications
Page 425
FAQs
Page 435
Appendix
Page 459
Glossary
We Listen to Your Comments
Any information within this document that you feel is wrong, unclear or missing at all?
Your feedback will help us to continuously improve the quality of this document.
Please send your proposal (including a reference to this document) to:
[email protected]
"Microcontrollers" Template
for Technical Documentation
1
1.1
1.2
1.3
1.4
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Introduction to TriLib, a DSP Library for TriCore . . . . . . . . . . . . . . . . . . . .
Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Future of the TriLib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Support Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15
15
15
16
16
2
2.1
2.2
2.3
2.4
Installation and Build . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
TriLib Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Installing TriLib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Building TriLib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Source Files List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17
17
18
18
19
3
3.1
3.2
3.3
3.4
3.5
DSP Library Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
TriLib Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Calling a DSP Library Function from C Code . . . . . . . . . . . . . . . . . . . . . .
Calling a DSP Library Function from Assembly Code . . . . . . . . . . . . . . . .
TriLib Example Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
TriLib Implementation - A Technical Note . . . . . . . . . . . . . . . . . . . . . . . . .
23
23
23
23
23
24
4
4.1
4.2
Function Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Complex Arithmetic Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Conjugate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Magnitude . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Vector Arithmetic Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
FIR Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
IIR Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Adaptive Digital Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Fast Fourier Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
TriCore Implementation Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
First Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Butterfly Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Method adapted in the TriLib FFT implementation . . . . . . . . . . . . . 254
Group Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Stage Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Post Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Important Note: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
Discrete Cosine Transform (DCT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Inverse Discrete Cosine Transform (IDCT) . . . . . . . . . . . . . . . . . . . . . . . 314
4.3
4.4
4.5
4.6
4.7
4.8
4.9
4.10
User’s Manual
5
V 1.1, 2000-01
"Microcontrollers" Template
for Technical Documentation
4.11
4.12
4.13
4.14
Multidimensional DCT (General Information) . . . . . . . . . . . . . . . . . . . . .
Mathematical Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Statistical Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
315
329
363
379
5
5.1
5.2
5.3
5.4
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Spectrum Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A simple example showing functioning of Spectrum Analyzer. . . . .
Sweep Oscillator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Equalizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Hardware Setup for Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
401
401
401
404
406
408
6
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
7
7.1
7.2
7.3
Frequently Asked Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
FIR Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Linear Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Frequency Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Numeric Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
IIR Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
419
419
420
421
422
424
425
8
8.1
8.2
8.3
8.4
8.5
8.6
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
File Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Coding Rules and Conventions for ’C’ and ’C++’ . . . . . . . . . . . . . . . . . . .
Coding Rules and Conventions for Assembly Language . . . . . . . . . . . .
Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Compiler Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
429
429
430
433
436
444
445
9
Glossary
User’s Manual
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
6
V 1.1, 2000-01
"Microcontrollers" Template
for Technical Documentation
Table 2-1
Table 2-2
Table 3-1
Table 3-2
Table 3-3
Table 3-4
Table 3-5
Table 3-6
Table 3-7
Table 4-1
Table 4-2
Table 4-3
Table 8-1
Table 8-2
Table 8-3
Table 8-4
Table 8-5
User’s Manual
Directory Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Source files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
TriLib Data Types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
FIR Filter Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Compiler Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Tasking Special Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
GHS Special Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Data Types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
DSPEXT CCD Assignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Argument Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Register Naming Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Complex Data Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Directory Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
Equal Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
Directives with the same functionality but different syntax. . . . . . . . . 446
Datatypes with DSPEXT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
Datatypes without DSPEXT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
7
V 1.1, 2000-01
"Microcontrollers" Template
for Technical Documentation
User’s Manual
8
V 1.1, 2000-01
Preface
This is the User Manual for TriLib-a DSP library for TriCore. TriCore is the first singlecore 32-bit microcontroller-DSP architecture optimized for real-time embedded systems.
The DSP core of TriCore is a fixed point one.
This manual describes the implementation of essential algorithms for general digital
signal processing applications on the TriCore DSP. Characteristics of TriLib and the
Installation and Build procedure are also described.
The source codes are C as well as C++ -callable and thus this library can be used as a
library of basic functions for developing bigger applications on TriCore. The library
serves as a user guide for TriCore programmers. It demonstrates how the processor’s
architecture can be exploited for achieving high performance. There are number of ways
to implement an algorithm. The algorithms have been implemented with the primary aim
of optimizing execution speed, i.e., minimize number of execution cycles.
The various functions and algorithms implemented and described about in the user
manual are:
•
•
•
–
–
–
•
–
–
•
•
•
Complex Arithmetic Functions
Vector Arithmetic Functions
Filters
FIR
IIR
Adaptive FIR
Transforms
FFT
DCT
Mathematical Functions
Matrix Operations
Statistical Functions
The user manual describes each function in detail under the following heads:
Signature:
This gives the function interface.
Inputs:
Lists the inputs to the function.
User’s Manual
-9
V 1.2, 2000-01
Outputs:
Lists the output of the function.
Return:
Gives the return value of the function if any.
Description:
Gives a brief note on the implementation, the size of the inputs and the outputs,
alignment requirements etc.
Pseudocode:
The implementation is expressed as a pseudocode using C conventions.
Techniques:
The techniques employed for optimization are listed here.
Assumptions:
Lists the assumptions made for an optimal implementation such as constraint on buffer
size. The input output formats are also given here.
Memory Note:
A detailed sketch showing how the arrays are stored in memory, the nature of the buffers
(linear/circular), the alignment requirements of the different buffers, the nature of the
arithmetic performed on them (packed, simple). The diagrams give a great insight into
the actual implementation.
Implementation Note:
Gives a very detailed note on the implementation. The codes in TriLib are optimized for
speed. An optimized code is not very easy to understand. The implementation note is
very helpful in overcoming this hurdle. For example, how techniques such as loop
unrolling (if employed) help in optimization is described in detail.
Further, the path of an Example calling program, the Cycle Count and Code Size are
given for each function.
User’s Manual
-10
V 1.2, 2000-01
Organization
Chapter 1, Introduction, gives a brief introduction of the TriLib and its features.
Chapter 2, Installation and Build, describes the TriLib content, how to install and build
the TriLib.
Chapter 3, DSP Library Notations, describes the DSP Library data types, arguments,
calling a function from the C code and the assembly code, and the implementation notes.
Chapter 4, Function Descriptions, describes the Complex arithmetic functions, Vector
arithmetic functions, FIR filters, IIR filters, Adaptive filters, Fast Fourier Transforms,
Discrete Cosine Transform, Mathematical functions, Matrix operations and Statistical
functions. Each function is described with its signature, inputs, outputs, return, brief
description, pseudocode, techniques used, assumptions made, memory note,
implementation details, example, cycle count and code size.
Chapter 5, Applications, describes the applications such as Spectrum Analyzer, Sweep
Oscillator and Equalizer using implemented TriLib functions.
Chapter 6, References, gives the list of related references.
Chapter 7, FAQs, gives Frequently Asked Questions about FIR, IIR and FFT.
Chapter 8, Appendix, gives the conventions for C and assembly code, file naming
conventions, directory structure and porting for the Tasking, GHS and GNU compilers.
Chapter 9, Glossary, gives a brief explanation of the terminology used in the TriLib user
manual in alphabetical order.
What’s new?
• New functions have been added
• All functions are now supported on GNU compiler also
• Three Applications showing the use of functions from TriLib are added
User’s Manual
-11
V 1.2, 2000-01
• A powerful GUI on the host side is added to provide visual control to the embedded
target application
• FAQs, Appendix and Glossary are added
• The GHS and Tasking compiler now have an extra implementation for C and C++
respectively thereby to give flexibility to the user to use anyone for their convenience
• TriLib Classes for the much bigger TriApp foundation classes called as TFC (TriCore
application foundation classes) to enable developers to scale up their signal
processing applications
Acknowledgements
The technical substance of this manual has been mainly developed by the Infineon’s
TriLib development team. These are designed, developed and tested over the hardware.
We in advance would like to acknowledge users for their feedback and suggestions to
improve this product. The development team would like to thank Dieter Stengl, Director
for CMD TO S/W for all his support and encouragement. Rakesh Verma, Technical
Manager, Wipro, for his support to the Wipro’s development team and co-ordination with
the Infineon team. Thomas Varghese, Arun Naik, Sreenivas, Mahesh for their valuable
contribution in giving the feedback on user manual and active participation in some of
the code reviews and also for their technical support. The team also recognizes the effort
of Savitha for her patience in meticulously preparing, typesetting and reviewing the User
Manual. We also would like to thank our marketing team for their comments and inputs.
Mark Nuchimowicz, Ramachandra, Rashmi, Preethi, Manoj, Ankur and Nagaraj
TriLib Development team - Infineon
Acronyms and Definitions
Acronyms and Definitions
Acronyms
Definitions
DCT
Discrete Cosine Transform
DFT
Discrete Fourier Transform
DIF
Decimation-In-Frequency
DIT
Decimation-In-Time
DLMS
Delayed Least Mean Square
DSP
Digital Signal Processing
User’s Manual
-12
V 1.2, 2000-01
Acronyms and Definitions
Acronyms
Definitions
TriLib
DSP Library functions for TriCore
FFT
Fast Fourier Transform
FIR
Finite Impulse Response
IIR
Infinite Impulse Response
Documentation/Symbol Conventions
The following is the list of documentation/symbol conventions used in this manual.
Documentation/Symbol Conventions
Documentation/
Symbol convention
Courier
Description
Pseudocode
(*)
Denotes Q format multiplication
Times-italic
File name
Pointer
Circular pointer
User’s Manual
-13
V 1.2, 2000-01
User’s Manual
-14
V 1.2, 2000-01
Introduction
1
Introduction
1.1
Introduction to TriLib, a DSP Library for TriCore
The TriLib, a DSP Library for TriCore is C-callable, hand-coded assembly, general
purpose signal processing routines. These routines are extensively used in real-time
applications where speed is critical.
The TriLib includes more than 60 commonly used DSP routines. The throughput of the
system using the TriLib routines is considerably better than those achieved using the
equivalent code written in ANSI C language. The TriLib significantly helps in
understanding the general purpose signal processing routines, its implementation on
TriCore. It also reduces the DSP application development time. The TriLib also provides
the source code. Few applications are also provided as part of TriLib to demonstrate the
usage of functions.
The routines are broadly classified into the following functional categories:
•
•
•
•
•
•
•
•
•
•
Complex Arithmetic
Vector Arithmetic
FIR Filters
IIR Filters
Adaptive Filters
Fast Fourier Transforms
Discrete Cosine Transform
Mathematical functions
Matrix operations
Statistical functions
1.2
•
•
•
•
•
•
•
•
•
Features
Covers the common DSP algorithms with Source codes
Hand-coded and optimized assembly modules
C/C++ callable functions on Tasking, GreenHills and GNU compilers
Multi platform support - Win 95, Win 98, Win NT
Bit-exact reference C codes for easy understanding and verification of the algorithms
Assembly implementation tested for bit exactness against model C codes
Workarounds implemented to take care of known Core errors
Examples to demonstrate the usage of functions
Example input test vectors and the output test data for verification
User’s Manual
1-15
V 1.2, 2000-01
Introduction
•
•
•
•
•
•
Comprehensive Users manual covering many aspects of implementation
Useful Applications built using the TriLib to demonstrate the product
Powerful User friendly GUI interface for applications built using TriLib
TriApp-TriLib application foundation class for extending the TriLib functionality
Supports the Object Oriented application development in C++ and Java
User helpful Demoshield based setup and registration program
1.3
Future of the TriLib
The planned future releases will have the following improvements.
• Expansion of the library by adding more number of functions in the domains such as
image processing, speech processing and the generic core routines of DSP.
• Upgrading the existing 16 bit functions to 32 bit
1.4
Support Information
Any suggestions for improvement, bug report if any, can be sent via e-mail to
[email protected]
Visit www.infineon.com for update on TriLib releases.
User’s Manual
1-16
V 1.2, 2000-01
Installation and Build
2
Installation and Build
2.1
TriLib Content
The following table depicts the TriLib content with its directory structure.
Table 2-1
Directory Structure
Directory
name
Contents
TriLib
Directories which has all the files related None
to the TriLib
source
Directories Tasking, GreenHills and
GNU
Tasking
Individual directories for each functional *.asm
category. Each directory has respective
assembly language implementation files
of the library functions
GreenHills
Individual directories for each functional *.tri
category. Each directory has respective
assembly language implementation files
of the library functions
GNU
Individual directories for each functional *.S
category. Each directory has respective
assembly language implementation files
of the library functions
include
Directories Tasking, GreenHills and
GNU and common include file for ’C’ of
all the three compilers
TriLib.h
Tasking
Include files for assembly routine
*.inc for assembly
GreenHills
Include files for assembly routine
*.h for assembly
GNU
Include files for assembly routine
*.h for assembly
docs
User Manual
Convention Manual
readme.txt
*.fm, *.pdf
*.doc
*.txt
examples
Directories Tasking and GreenHills
None
User’s Manual
Files
2-17
None
V 1.2, 2000-01
Installation and Build
Table 2-1
Directory Structure
Tasking
Individual directories for each functional *.c, *.cpp
category. Each directory has respective
example ‘c’ and ’cpp’ functions to depict
the usage of TriLib
GreenHills
Individual directories for each functional *.cpp, *.c
category. Each directory has respective
example ‘cpp’ and ’c’ functions to depict
the usage of TriLib
GNU
Individual directories for each functional *.c
category. Each directory has respective
example ‘c’ functions to depict the usage
of TriLib
refcode
Individual directories for each functional None
category. Each directory has respective
reference ‘C’ code of the corresponding
assembly implementation in source
directory, which works on Tasking
compiler
build
Build information
testvectors
Test vectors for the different functions in *.dat
their respective directories
2.2
*.pjt, *.bld
Installing TriLib
TriLib is distributed as a self extracting ZIP file. To install the TriLib on the system, unzip
the ZIP file and run setup. This will install all the files in the respective directories.
The directory structure is as given in “TriLib Content” on Page 17
2.3
Building TriLib
Include the TriLib.h into your project and also include the same into the files that need to
call the library function like:
#include “TriLib.h”
Set the include path in the tool that you are using for both the project as well as each of
the files you have included (it is observed that sometimes you get errors if it is not set in
the options of each individual files). Please refer the documentation of the Tasking,
GreenHills and GNU for more details.
User’s Manual
2-18
V 1.2, 2000-01
Installation and Build
In case of Tasking, the #define part for _TASKING selection box should be checked and
in case of GreenHills it should be typed manually as _GHS, otherwise it might give lot of
compiler errors.
In both the compilers the DSPEXT has to be defined in the project options for both the
assembly sources and the c files in the respective project options when the DSP
extension for respective compilers (Tasking and GreenHills) have to be used.
For without DSP extension don’t define DSPEXT for c compiler option. For assembler
option set DSPEXT=0. GNU compiler doesn’t support data types for DSP. So DSPEXT
need not be defined or undefined in case of GNU compiler.
If the .cpp file is to be used, in case of Tasking or GreenHills compiler, the macro
_cplusplus is to be defined under compiler options.
For setting the other CCD, such as H/W workarounds, use the assembler options.
Now include the respective source files for the required functionality into your project.
Refer the functionality table, Table 2-2
Build the system and start using the library.
2.4
Source Files List
Table 2-2
Source files
Tasking
GreenHills
GNU
CplxOp_16.tri
CplxOp_32.tri
CplxOp_16.S
CplxPhMag_16.S
CplxOp_32.S
CplxPhMag_32.S
VectOp_16.tri
VectOp1_16.tri
VectOp_16.S
VectOp1_16.S
Fir_16.tri
FirBlk_16.tri
Fir_4_16.tri
FirBlk_4_16.tri
Fir_16.S
FirBlk_16.S
Fir_4_16.S
FirBlk_4_16.S
Complex Arithmetic functions
CplxOp_16.asm
CplxOp_32.asm
Vector Arithmetic functions
VectOp_16.asm
FIR filters
Fir_16.asm
FirBlk_16.asm
Fir_4_16.asm
FirBlk_4_16.asm
User’s Manual
2-19
V 1.2, 2000-01
Installation and Build
Table 2-2
Source files
FirSym_16.asm
FirSymBlk_16.asm
FirSym_4_16.asm
FirSymBlk_4_16.asm
FirDec_16.asm
FirInter_16.asm
FirSym_16.tri
FirSymBlk_16.tri
FirSym_4_16.tri
FirSymBlk_4_16.tri
FirDec_16.tri
FirInter_16.tri
FirSym_16.S
FirSymBlk_16.S
FirSym_4_16.S
FirSymBlk_4_16.S
FirDec_16.S
FirInter_16.S
IirBiq_4_16.tri
IirBiqBlk_4_16.tri
IirBiq_5_16.tri
IirBiqBlk_5_16.tri
IirBiq_4_16.S
IirBiqBlk_4_16.S
IirBiq_5_16.S
IirBiqBlk_5_16.S
Dlms_4_16.tri
DlmsBlk_4_16.tri
CplxDlms_4_16.tri
CplxDlmsBlk_4_16.tri
Dlms_2_16x32.tri
DlmsBlk_2_16x32.tri
Dlms_4_16.S
DlmsBlk_4_16.S
CplxDlms_4_16.S
CplxDlmsBlk_4_16.S
Dlms_2_16x32.S
DlmsBlk_2_16x32.S
FFT_2_16.tri
FFT_2_32.tri
FFT_2_16X32.tri
FFT_2_16.S
FFT_2_32.S
FFT_2_16X32.S
DCT_2_8.tri
DCT_2_8.S
Sine_32.tri
Cos_32.tri
Arctan_32.tri
Sqrt_32.tri
Ln_32.tri
AntiLn_16.tri
Expn_16.tri
XpowY_32.tri
RandInit_16.tri
Rand_16.tri
Sine_32.S
Cos_32.S
Arctan_32.S
Sqrt_32.S
Ln_32.S
AntiLn_16.S
Expn_16.S
XpowY_32.S
RandInit_16.S
Rand_16.S
IIR filters
IirBiq_4_16.asm
IirBiqBlk_4_16.asm
IirBiq_5_16.asm
IirBiqBlk_5_16.asm
Adaptive filters
Dlms_4_16.asm
DlmsBlk_4_16.asm
CplxDlms_4_16.asm
CplxDlmsBlk_4_16.asm
Dlms_2_16x32.asm
DlmsBlk_2_16x32.asm
FFT
FFT_2_16.asm
FFT_2_32.asm
FFT_2_16X32.asm
DCT
DCT_2_8.asm
Mathematical Functions
Sine_32.asm
Cos_32.asm
Arctan_32.asm
Sqrt_32.asm
Ln_32.asm
AntiLn_16.asm
Expn_16.asm
XpowY_32.asm
RandInit_16.asm
Rand_16.asm
Matrix Functions
User’s Manual
2-20
V 1.2, 2000-01
Installation and Build
Table 2-2
Source files
MatAdd_16.asm
MatSub_16.asm
MatMult_16.asm
MatTrans_16.asm
MatAdd_16.tri
MatSub_16.tri
MatMult_16.tri
MatTrans_16.tri
MatAdd_16.S
MatSub_16.S
MatMult_16.S
MatTrans_16.S
ACorr_16.tri
Conv_16.tri
Avg_16.tri
ACorr_16.S
Conv_16.S
Avg_16.S
Statistical Functions
ACorr_16.asm
Conv_16.asm
Avg_16.asm
User’s Manual
2-21
V 1.2, 2000-01
Installation and Build
User’s Manual
2-22
V 1.2, 2000-01
DSP Library Notations
3
DSP Library Notations
3.1
TriLib Data Types
The TriLib handles the following fractional data types.
Table 3-1
TriLib Data Types
1Q15 (DataS) 1Q15 operand is represented by a short data type (frac16/_sfract) that
is predefined as DataS in TriLib.h header file.
1Q31 (DataL) 1Q31 operand is represented by a long data type (frac32/_fract) that is
predefined as DataL in TriLib.h header file.
CplxS
Complex data type contains the two 1Q15 data arranged in Re-Im
format.
CplxL
Complex data type contains the two 1Q31 data arranged in Re-Im
format.
3.2
Calling a DSP Library Function from C Code
After installing the TriLib, do the following to include a TriLib function in the source code.
1. Include the TriLib.h include file
2. Include the source file that contains required DSP function into the project along with
the other source files
3. Include TriConv.inc (Tasking) or TriConv.h (GreenHills)
4. Set the include paths to point the location of the TriLib.h
5. Set the Compiler Conditional Directives (CCDs) for selection of DSP extension
6. Set the Compiler Conditional Directives (CCDs) to generate code with workarounds
for the H/W bugs
7. Build the system
3.3
Calling a DSP Library Function from Assembly Code
The TriLib functions are written to be used from C. Calling the functions from assembly
language source code is possible as long as the calling function conforms to the TriCore
calling conventions. Refer TriCore Calling Conventions manual for more details.
3.4
TriLib Example Implementation
The examples of how to use the TriLib functions are implemented and are placed in
examples subdirectory. This subdirectory contains a subdirectory for set of functions.
User’s Manual
3-23
V 1.2, 2000-01
DSP Library Notations
3.5
TriLib Implementation - A Technical Note
3.5.1
Memory Issues
The TriLib is implemented with the known alignment constraints for the TriCore memory
addressing architecture. The following information gives the alignment and sizes
conditions in order to work properly.
Halfword alignment for ld.d and st.d is only allowed when the source or destination
address is located in on-chip memory. For external memory accesses over TriCore’s
peripherals bus, doubleword access must be word aligned (TriCore Architecture Manual
p.13).
The size and length of a circular buffer have the following restrictions (TriCore
Architecture Manual p.13).
• The start of the buffer start must be aligned to a 64-bit boundary.
• The length of the buffer must be a multiple of the data size, where the data size is
determined from the instruction being used to access the buffer.
Different alignment requirements for ld.d and st.d for internal and external memories
impose different alignment of data in functions that use those instructions. In some cases
(for example filter delay-buffer defined as circular-buffer) halfword aligned accesses to
the data is required which prohibit the location of such data structures in external
memory.
For example Fir_4_16() function, the delay-buffer of the filter is defined as circular-buffer.
In this case, when located in internal memory the buffer must have doubleword
alignment (circular-buffer). After each call to the function the start position of the delaybuffer is shifted (with circular update) by halfword. The delay-buffer cannot be located in
external memory because the load from the delay-buffer is executed by ld.d instruction
and word alignment is no more satisfied.
3.5.2
Optimization Approach
Extended parallelism of the processor architecture increases the speed of the algorithms
execution, but at the same time imposes some constraints on the size of Input-Buffers.
So for example Fir_4_16() FIR filter executes at maximal possible speed on the TriCore
but the size must be multiple of 4.
In the implementation of the algorithms following optimizations are applied:
• Packed arithmetic
User’s Manual
3-24
V 1.2, 2000-01
DSP Library Notations
• Mixed packed /simple arithmetic
• Simple arithmetic
From the point of view of size of the algorithm (Vector length, Filter length) two cases can
be identified:
• Constraint on the dimension of vector, order of filter
• Arbitrary size
Best performance can be achieved with some constrains on the size in which case fully
packed arithmetic is used in the kernel loop. Arbitrary size (not for all algorithms) can be
achieved by using
• Simple arithmetic in the kernel loop
• Mixed packed/simple arithmetic, partitioning of the algorithm size so that the kernel
loop uses packed arithmetic with conditional post processing to achieve arbitrary size
To achieve maximal performance and flexibility some functions have several
implementations optimized for specific target requirements.
Following implementations can be recognized:
•
•
•
•
On sample, optimized for single sample processing
On block, optimized for block processing
Best performance with restriction on size
Arbitrary size, trade-off between performance and flexibility
For example FIR filter is implemented as
Table 3-2
FIR Filter Implementations
Fir_16()
Sample processing, trade-off on performance, arbitrary size
Fir_4_16()
Sample processing, best performance, size restriction
FirBlk_16()
Block processing, trade-off on performance, arbitrary size
FirBlk_4_16() Block processing, best performance, size restriction
The SIMD instructions are exploited in the FFT by the special arrangement of the Real
and Imaginary parts of the complex number. The Real:Imag format is the conventional
method of storing the complex number x+jy. In this case two complex numbers x0+jy0
and x1+jy1 is arranged as x0x1 and j(y0y1).
User’s Manual
3-25
V 1.2, 2000-01
DSP Library Notations
3.5.3
Options in Library Configurations
Set of Conditional Compile Directives (CCD) on the C language level and assembly level
define the configuration of the TriLib.
3.5.3.1
Compiler
Compiler selection is based on two CCD
Table 3-3
Compiler Selection
_Tasking
CCD on the C level for selecting the Tasking compiler
_GHS
CCD on the Cpp level for selecting the GHS compiler
COR1
Hardware workaround for TriCore ver1.2
COR14
Hardware workaround for TriCore ver1.2
CPU5
Hardware workaround for TriCore ver1.3
In the current implementation of the TriLib this setting is only evaluated in TriLib.h header
file which is common to all the compilers.
All the library functions and examples have dedicated implementations for each compiler
and are not influenced by this setting.
3.5.3.2
DSP Extensions
To improve the DSP functionality on the C language level Tasking compiler supports
three additional special DSP specific intrinsic data types to perform fixed point arithmetic.
Refer to the tools documentation for details.
Table 3-4
Tasking Special Data Types
_sfract
16 bits: 1 sign bit + 15 mantissa bits
_fract
32 bits: 1 sign bit + 31 mantissa bits
_accum
64 bits: 1 sign bit + 17 integral bits + 46 mantissa bits
To efficiently implement a circular buffer in the C language additional qualifier _circ is
defined by Tasking. This can be used in conjunction with the other data types.
User’s Manual
3-26
V 1.2, 2000-01
DSP Library Notations
GHS compiler, extended support of DSP functionality is implemented by defining C++
classes.
Table 3-5
GHS Special Data Types
frac16
16 bits: 1 sign bit + 15 mantissa bits
frac32
32 bits: 1 sign bit + 31 mantissa bits
frac64
64 bits: 1 sign bit + 17 integral bits + 46 mantissa bits
Circular buffer pointer is implemented in GHS C++ compiler as a templatized class.
To make the library portable, TriLib function arguments use following predefined data
types.
Table 3-6
Data Types
DataS
16-bit operands
DataL
32-bit operands
cptrDataS
circular-pointer to DataS circular-buffer
cptrDataL
circular-pointer to DataL circular-buffer
Depending on the compiler used and the setting of _DSPEXT CCD following
assignments are used (implemented in TriLib.h)
Table 3-7
DSPEXT CCD Assignments
DSPEXT=FALSE
DSPEXT=TRUE
Tasking, GHS, GNU
Tasking
GHS
DataS
short
_sfract
frac16
DataL
int
_fract
frac32
CptrDataS
struct (TriLib.h)
_sfract _circ*
circptr<frac16>
DSPEXT CCD has effect on the C/C++ level as well on the assembly implementations
of the TriLib function.
3.5.4
Workarounds of known Behavioral Deviations
The instruction set of TriCore is defined in different syntax for the GreenHills and Tasking
Tool sets. There are different deviations in each of the compilers. Particularly the
GreenHills doesn’t support some instructions in its Multi 2000 ver 2.0 and also there are
behavioral changes in the ver 2.0.2. This can be potential risk in the development for
User’s Manual
3-27
V 1.2, 2000-01
DSP Library Notations
people to migrate from one compiler to other. To give some instances of the known
deviations.
Conditional move instruction (cmov,cmovn) is not supported in GHS ver 2.0 in this case
select (sel,seln) instructions has to be used.
The data memory addressing is bit complicated in GHS, there are special syntax to do
the same for instance syntaxes such as %sdaoff etc., are used. Refer the GHS
documentation for more details.
The jz has problems in GHS ver 2.0 so in order to workaround this, usage of jeq is
encouraged, The instruction jz works on GHS ver 2.0.2. The Sine/Cosine functions use
jz instruction and will run on ver 2.0.2.
3.5.5
Testing Methodology
The TriLib is tested on GHS, Tasking simulator and TriCore TC10GP TriBoard ver2.4.
The Hardware workarounds have to be enabled only if the TriLib is intended to run on
TC10GP (TriCore ver1.2, ver1.3) processor hardware.
User’s Manual
3-28
V 1.2, 2000-01
Function Descriptions
4
Function Descriptions
Each function is described with its signature, inputs, outputs, return, brief description,
pseudocode, techniques used, assumptions made, memory note, how it is implemented,
example, cycle count and code size.
Functions are classified into the following categories.
•
•
•
•
•
•
•
•
•
•
Complex Arithmetic functions
Vector functions
FIR filters
IIR filters
Adaptive filters
Fast Fourier Transforms
Discrete Cosine Transform
Mathematical functions
Matrix operations
Statistical functions
4.1
Conventions
4.1.1
Argument Conventions
The following conventions have been followed while describing the arguments for each
individual function.
Table 4-1
Argument Conventions
Argument
Convention
X,Y
Input data or input data vector
R
Output data
nX, nY, nR
The size of vectors X, Y, and R respectively. In functions
where nX = nY = nR, only nX has been used
H
Filter coefficient vector (filter routines only)
nH
The size of vector H. Usually not defined explicitly
DataS
Data type definition equating a short, a 16-bit value representing a
1Q15 number
DataL
Data type definition equating a long, a 32-bit value representing a 1Q31
number
DataD
Reserved for 64-bit value
User’s Manual
4-29
V 1.2, 2000-01
Function Descriptions
Table 4-1
Argument Conventions
Argument
Convention
cptrDataS
Circular pointer structure
CplxS
Data type definition equating a short, a 16-bit value representing a
1Q15 complex number
CplxL
Data type definition equating a long, a 32-bit value representing a 1Q31
complex number
aR
Pointer to Output-Buffer
4.1.2
Register Naming Conventions
The following register naming conventions have been followed.
Table 4-2
Register Naming Conventions
Argument
Convention
a
Address register or data type prefix
ca
Circular buffer address register pair
User’s Manual
4-30
V 1.2, 2000-01
Function Descriptions
4.2
Complex Arithmetic Functions
4.2.1
Complex Numbers
A complex number z is an ordered pair (x,y) of real numbers x and y, written as
z= (x,y)
where x is called the real part and y the imaginary part of z.
4.2.2
Complex Number Representation
A complex number can be represented in different ways, such as
Rectangular form
: C = R + iI
[4.1]
Trigonometric form
: C = M [ cos ( φ ) + j sin ( φ ) ]
[4.2]
Exponential form
: C = Me
iφ
[4.3]
Magnitude and angle form
: C = M ∠φ
[4.4]
In the complex functions implementation, the rectangular form is considered.
4.2.3
Complex Plane
The geometrical representation of complex numbers as points in the plane is of great
importance. Choose two perpendicular coordinate axis in the Cartesian coordinate
system. The horizontal x-axis is called the real axis, and the vertical y-axis is called the
imaginary axis. Plot a given complex number z=(x,y) = x + iy as the point P with
coordinates (x, y). The xy-plane in which the complex numbers are represented in this
way is called the Complex Plane. This is also called as the Argand diagram after the
French mathematician Jean Robert Argand.
User’s Manual
4-31
V 1.2, 2000-01
Function Descriptions
(Imaginary
Axis)
y
P
z = x + iy
(Real Axis)
O
x
Figure 4-1
4.2.4
The Complex Plane (Argand Diagram)
Complex Arithmetic
Addition
if z1 and z2 are two complex numbers given by z1 =x1+iy1 and z2 = x2 + iy2,
z1+z2 = (x1+iy1) + (x2 + iy2) = (x1+x2) + i(y1+y2)
[4.5]
Subtraction
if z1 and z2 are two complex numbers given by z1 =x1+iy1 and z2 = x2 + iy2,
z1-z2 = (x1-x2) + i(y1-y2)
[4.6]
Multiplication
if z1 and z2 are two complex numbers given by z1 =x1+iy1 and z2 = x2 + iy2,
z1.z2 = (x1+iy1).(x2 + iy2) = x1x2 + ix1y2 + iy1x2 + i2 y1y2
= (x1x2 - y1y2) + i(x1y2 + x2y1)
User’s Manual
[4.7]
4-32
V 1.2, 2000-01
Function Descriptions
Conjugate
The complex conjugate, z of a complex number z = x+iy is given by
z = x - iy
[4.8]
and is obtained by geometrically reflecting the point z in the real axis.
Magnitude
The magnitude of a complex number z=x+iy is given by
2
z =
x +y
2
[4.9]
Geometrically, |z| is the distance of the point z from the origin.
|z1-z2| is the distance between z1 and z2.
Phase
The phase of complex number z=x+iy is given by
phase = tan-1(y/x)
[4.10]
Shift
Shifting of a complex number is indicated by the shift value. Left shifting is done if the
shift value is positive and right shifting is done if shift value is negative.
Z r = x » abs ( shiftval ), if ( shiftval < 0 )
else ( x « shiftval )
Zi = y » abs ( shiftval ), if ( shiftval < 0 )
[4.11]
else ( y « shiftval )
User’s Manual
4-33
V 1.2, 2000-01
Function Descriptions
4.2.5
Complex Number Schematic
31
15
Real
0
Imaginary
Sign
Bit
Figure 4-2
16-bit Complex number representation
63
31
Real
0
Imaginary
Sign Bit
Figure 4-3
User’s Manual
32-bit Complex number representation
4-34
V 1.2, 2000-01
Function Descriptions
4.2.6
Complex Data Structure
Table 4-3
Complex Data Structure
Tasking
GHS
ANSI C/GNU
typedef struct
{
frac16 imag;
frac16 real;
} CplxS;
typedef struct
{
short imag;
short real;
} CplxS;
typedef struct
{
frac32 imag;
frac32 real;
} CplxL;
typedef struct
{
long imag;
long real;
} CplxL;
16 bit
typedef struct
{
_sfract imag;
_sfract real;
} CplxS;
32 bits
typedef struct
{
_fract imag;
_fract real;
} CplxL;
4.2.7
Descriptions
The following complex arithmetic functions for 16 bit and 32 bit are described.
•
•
•
•
•
•
•
Addition (with and without saturation)
Subtraction (with and without saturation)
Multiplication (with and without saturation)
Conjugate
Magnitude
Phase
Shift
User’s Manual
4-35
V 1.2, 2000-01
Function Descriptions
CplxAdd_16
Complex Number Addition for 16 bits
Signature
CplxS CplxAdd_16(CplxS X,
CplxS Y
);
Inputs
X
:
16 bit Complex input value
Y
:
16 bit Complex input value
Output
None
Return
The sum of two complex numbers as a 16 bit complex number
Description
This function computes the sum of two 16 bit complex
numbers. Wraps around the result in case of overflow.
The algorithm is as follows
Rr = xr + yr
[4.12]
Ri = xi + yi
Pseudo code
{
R.real = X.real + Y.real;
//add the real part
R.imag = X.imag + Y.imag;
//add the imaginary part
return R;
}
Techniques
None
Assumptions
• Input and output has a real and an imaginary part packed
as 16 bit data to make a 32 bit complex data
User’s Manual
4-36
V 1.2, 2000-01
Function Descriptions
CplxAdd_16
Complex Number Addition for 16 bits (cont’d)
Memory Note
31
15
Real
0
31
15
Real
Imaginary
0
Imaginary
+
+
31
15
Real
Figure 4-4
0
Imaginary
Complex Number addition for 16 bits
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
1+2
Code Size
6 bytes
User’s Manual
4-37
V 1.2, 2000-01
Function Descriptions
CplxAdds_16
Complex Number Addition for 16 bits with saturation
Signature
CplxS CplxAdds_16(CplxS X,
CplxS Y
);
Inputs
X
:
16 bit Complex input value
Y
:
16 bit Complex input value
Output
None
Return
The sum of two complex numbers as a 16 bit saturated
complex number
Description
This function computes the sum of two 16 bit complex
numbers. In case of overflow, this saturates the result to
0x7FFF for positive values and 0x8000 for negative values.
This is applicable for both real and imaginary part of the
complex number. The algorithm is as follows
Rr = xr + yr
[4.13]
Ri = xi + yi
Pseudo code
{
R.real = (frac16 sat)(X.real
//add the
R.imag = (frac16 sat)(X.imag
//add the
return R;
+ Y.real);
real part
+ Y.imag);
imaginary part
}
Techniques
None
Assumptions
• Input and output has a real and an imaginary part packed
as 16 bit data to make a 32 bit complex data
User’s Manual
4-38
V 1.2, 2000-01
Function Descriptions
CplxAdds_16
Complex Number Addition for 16 bits with saturation
(cont’d)
Memory Note
31
15
Real
0
31
15
Real
Imaginary
0
Imaginary
+
+
Sat
31
Sat
15
Real
Figure 4-5
0
Imaginary
Complex number addition for 16 bits with saturation
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
1+2
Code Size
6 bytes
User’s Manual
4-39
V 1.2, 2000-01
Function Descriptions
CplxSub_16
Complex Number Subtraction for 16 bits
Signature
CplxS CplxSub_16(CplxS X,
CplxS Y
);
Inputs
X
:
16 bit Complex input value
Y
:
16 bit Complex input value
Output
None
Return
The difference of two complex numbers as a 16 bit complex
number
Description
This function finds the difference of two 16 bit complex
numbers. Wraps around the result in case of underflow. The
algorithm is as follows.
Rr = xr – y r
[4.14]
Ri = x i – yi
Pseudo code
{
R.real = X.real - Y.real;
//subtract the real part
R.imag = X.imag - Y.imag;
//subtract the imaginary part
return R;
}
Techniques
None
Assumptions
• Input and output has a real and an imaginary part packed
as 16 bit data to make a 32 bit complex data
User’s Manual
4-40
V 1.2, 2000-01
Function Descriptions
CplxSub_16
Complex Number Subtraction for 16 bits (cont’d)
Memory Note
31
15
Real
0
31
15
Real
Imaginary
0
Imaginary
31
15
Real
Figure 4-6
0
Imaginary
Complex number subtraction for 16 bits
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
1+2
Code Size
6 bytes
User’s Manual
4-41
V 1.2, 2000-01
Function Descriptions
CplxSubs_16
Complex Number Subtraction for 16 bits with
saturation
Signature
CplxS CplxSubs_16(CplxS X,
CplxS Y
);
Inputs
X
:
16 bit Complex input value
Y
:
16 bit Complex input value
Output
None
Return
The difference of two complex numbers as a 16 bit saturated
complex number
Description
This function finds the difference of two 16 bit complex
numbers. In case of overflow, this saturates the result to
0x7FFF for positive values and 0x8000 for negative values.
The algorithm is as follows.
Rr = xr – y r
[4.15]
Ri = x i – yi
Pseudo code
{
R.real = (frac16 sat)(X.real - Y.real);
//subtract the real part
R.imag = (frac16 sat)(X.imag - Y.imag);
//subtract the imaginary part
return R;
}
Techniques
None
Assumptions
• Input and output has a real and an imaginary part packed
as 16 bit data to make a 32 bit complex data
User’s Manual
4-42
V 1.2, 2000-01
Function Descriptions
CplxSubs_16
Complex Number Subtraction for 16 bits with
saturation (cont’d)
Memory Note
31
15
Real
0
31
15
Real
Imaginary
0
Imaginary
-
Sat
31
Sat
15
Real
Figure 4-7
0
Imaginary
Complex number subtraction for 16 bits with saturation
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
1+2
Code Size
6 bytes
User’s Manual
4-43
V 1.2, 2000-01
Function Descriptions
CplxMul_16
Complex Number Multiplication for 16 bits
Signature
void CplxMul_16(CplxS X,
CplxS Y,
CplxL *R
);
Inputs
X
:
16 bit Complex input value
Y
:
16 bit Complex input value
Output
R
:
The pointer to the product of two
complex numbers as a 64 bit
complex number
Return
None
Description
This function computes the product of the two 16 bit complex
numbers. Wraps around the result in case of overflow.
The complex multiplication is computed as follows.
Rr = xr × yr – xi × yi
Ri = x i × yr + xr × yi
Pseudo code
{
R->real = X.real*Y.real - Y.imag*X.imag;
R->imag = X.real*Y.imag + Y.real*X.imag;
}
Techniques
None
Assumptions
• Input is in 1Q15 format
• Input and output has a real and an imaginary part packed
as 16 bit data in 1Q15 format to make a 32 bit complex
data
User’s Manual
4-44
V 1.2, 2000-01
Function Descriptions
CplxMul_16
Complex Number Multiplication for 16 bits (cont’d)
Memory Note
31
15
0
Real
31
15
Real
Imaginary
0
Imaginary
+
+
+
+
+
63
Imaginary
Real
Figure 4-8
0
31
Complex number multiplication for 16 bits
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
6+2
Code Size
30 bytes
User’s Manual
4-45
V 1.2, 2000-01
Function Descriptions
CplxMuls_16
Complex Number Multiplication for 16 bits with
Saturation
Signature
CplxS CplxMuls_16(CplxS X,
CplxS Y
);
Inputs
X
:
16 bit Complex input value
Y
:
16 bit Complex input value
Output
None
Return
The product of two complex numbers as a 32 bit saturated
complex number
Description
This function computes the product of the two 16 bit complex
numbers. In case of overflow, the result is saturated to
0x7FFF for positive overflow and 0x8000 for negative
underflow.
The complex multiplication is computed as follows.
Rr = xr × yr – xi × yi
Ri = x i × yr + xr × yi
Pseudo code
{
R0.real = (frac32 sat)(X.real*Y.real - Y.imag*X.imag);
R0.imag = (frac32 sat)(X.real*Y.imag + Y.real*X.imag);
R0.real = (rnd)R0.real;
//rounding
R0.imag = (rnd)R0.imag;
//rounding
R.real = (frac16 sat)R0.real;
//load lower 16 bits
R.imag = (frac16 sat)R0.imag;
//load lower 16 bits
return R;
}
Techniques
User’s Manual
None
4-46
V 1.2, 2000-01
Function Descriptions
CplxMuls_16
Complex Number Multiplication for 16 bits with
Saturation (cont’d)
Assumptions
• Inputs are in 1Q15 format
• Input and output has a real and an imaginary part packed
as 16 bit data in 1Q15 format to make a 32 bit complex
data
Memory Note
31
15
0
Real
31
15
Real
Imaginary
0
Imaginary
+
+
+
+
+
-
63
Real
Imaginary
Round
Round
Sat
Sat
31
15
Real
Figure 4-9
User’s Manual
0
31
0
Imaginary
Complex number multiplication for 16 bits with saturation
4-47
V 1.2, 2000-01
Function Descriptions
CplxMuls_16
Complex Number Multiplication for 16 bits with
Saturation (cont’d)
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
9+2
Code Size
34 bytes
User’s Manual
4-48
V 1.2, 2000-01
Function Descriptions
CplxConj_16
Complex Number Conjugate for 16 bits
Signature
CplxS CplxConj_16(CplxS X);
Inputs
X
Output
None
Return
The conjugate of the complex number as a 16 bit complex
number
Description
This function finds the conjugate of a 16 bit complex number.
Conjugate of a complex number is given by
:
16 bit Complex input value
R = ( x + iy ) = x – iy
[4.16]
Pseudo code
{
R.real = X.real;
R.imag = 0.0 - X.imag;
//negate the imaginary part
return R;
}
Techniques
None
Assumptions
• Input and output has a real and an imaginary part packed
as 16 bit data to make a 32 bit complex data
Memory Note
31
15
Real
0
Imaginary
Negate
31
15
Real
0
Imaginary
Figure 4-10 Complex number conjugate for 16 bits
User’s Manual
4-49
V 1.2, 2000-01
Function Descriptions
CplxConj_16
Complex Number Conjugate for 16 bits (cont’d)
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
3+2
Code Size
12 bytes
User’s Manual
4-50
V 1.2, 2000-01
Function Descriptions
CplxMag_16
Magnitude of a Complex Number for 16 bits
Signature
DataL CplxMag_16(CplxS X);
Inputs
X
Output
None
Return
Magnitude of the complex number as 32 bit integer or fract
Description
This function finds the magnitude of a complex number. The
algorithm is as follows
R =
:
2
x +y
2
16 bit Complex input value
[4.17]
Pseudo code
{
int indx;
frac32 sat tempX;
frac32 sat tempY;
frac32 sat temp;
frac32 sqrttab[15] = {0.999999999999, 0.7071067811865, 0.5,
0.3535533905933, 0.25, 0.1767766952966,
0.125, 0.08838834764832, 0.0625, 0.04419417382416,
0.03125, 0.02209708691208, 0.015625,
0.01104854345604, 0.0078125};
//Scale down the input by 2
X.real >>= 1;
X.imag >>= 1;
//Power = real^2 + imag^2
tempX = (X.real * X.real);
tempY = (X.imag * X.imag);
tempX += tempY;
User’s Manual
4-51
V 1.2, 2000-01
Function Descriptions
CplxMag_16
Magnitude of a Complex Number for 16 bits (cont’d)
if (tempX == 0)
{
return tempX;
}
//Mag = sqrt(power);
indx = exp1(tempX);//calculate the leading zero
tempX = norm(tempX,indx);
//normalise
tempY = tempX >> 1;//y = x/2
tempY -= 0.5;
//y = x/2 - 0.5
tempX = tempY + 0.9999999999999999;
//sqrt(x) = y + 1
temp = (tempY * tempY);
// y^2
tempX -= temp >> 1;//sqrt(x) = (y + 1) - 0.5*y^2
temp =(temp*tempY);//y^3
tempX += temp >> 1;//sqrt(x) = (y + 1) - 0.5*y^2 + 0.5*y^3
temp = (temp * tempY);
//y^4
tempX -= temp * 0.625;
//sqrt(x) = (y + 1) - 0.5*y^2 + 0.5*y^3 - 0.625*y^4
temp = (temp * tempY);
//y^5
tempX = tempX + (0.875 * temp);
//sqrt(x) = (y + 1) - 0.5*y^2 + 0.5*y^3
//
- 0.625*y^4 +0.875*y^5
temp = tempX << 15;
if (temp >= 0.5)
{
tempX >>= 16;
tempX <<= 16;
tempX += 0.0000305178125;
}
else
{
tempX >>=16;
tempX <<=16;
}
tempX = tempX * sqrttab[indx];
return tempX;
}
User’s Manual
4-52
V 1.2, 2000-01
Function Descriptions
CplxMag_16
Magnitude of a Complex Number for 16 bits (cont’d)
Techniques
None
Assumptions
None
Memory Note
None
Implementation
The real and imaginary parts of a complex number x+iy are
scaled down by two to avoid overflow.
The computation of power(x2+y2) is done by a dual MAC
instruction.
If the power is zero, then the whole computation is not done
to save cycles. Power(x2+y2) is normalized and the exponent
is used as the scale factor in the square root operation. The
square root is computed using the taylor approximation
series.
The taylor series for square root is as follows:
Let Z = x2+y2
R = (Z + 1)/2
2
3
4
sqrt ( Z ) = R + 1 – 0.5R + 0.5R – 0.625R – 0.875R
5
[4.18]
The final result sqrt(Z) is again rescaled using the scale factor
as index of the square root table to give the magnitude.
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplxMag.c
Cycle Count
7+2
7+42+2
Code Size
118 bytes
(Best)
(Worst)
140 bytes (Data)
User’s Manual
4-53
V 1.2, 2000-01
Function Descriptions
CplxPhase_16
Phase of a Complex Number for 16 bits
Signature
DataL CplxPhase_16 (CplxS X);
Inputs
X
Output
None
Return
The phase of the input complex number as a 32 bit integer or
fract
Description
This function computes the phase of a complex number. The
algorithm is as follows.
:
16 bit Complex input value
Phase = tan-1(y/x)
[4.19]
Pseudo code
{
int indx;
int flag;
frac32 sat tempX;
frac32 sat tempY;
frac32 sat temp;
//Scale down the input by 2
X.real >>= 1;
X.imag >>= 1;
//Power = real^2 + imag^2
//Taking absolute value of input complex number
if (X.real < 0)
{
tempX = -X.real;
}
else
{
tempX = X.real;
}
User’s Manual
4-54
V 1.2, 2000-01
Function Descriptions
CplxPhase_16
Phase of a Complex Number for 16 bits (cont’d)
if (X.imag < 0)
{
tempY = -X.imag;
}
else
{
tempY = X.imag;
}
//Phase = arctan(imag/real)
if (tempX <= tempY)
{
flag = 1;
temp = tempX/tempY;
}
else
{
flag = 0;
temp = tempY/tempX;
}
indx = exp1(temp); //calculate the leading zero
temp = norm(temp,indx);
//normalise
//Polynomial calculation
tempX = K5 * temp + K4;
tempX = tempX * temp + K3;
tempX = tempX * temp + K2;
tempX = tempX * temp + K1;
tempX = tempX * temp;
temp = tempX << 15;
User’s Manual
4-55
V 1.2, 2000-01
Function Descriptions
CplxPhase_16
Phase of a Complex Number for 16 bits (cont’d)
//if imag > real
if (flag == 1)
{
tempX = 0.5 - tempX;
}
//third quadrant X = X - 180 deg
if (X.real < 0 && X.imag < 0)
{
tempX = tempX - 0.9999999999999;
}
//second quadrant X = 180 - X deg
else if (X.real < 0 && X.imag >= 0)
{
tempX = 0.9999999999999 - tempX;
}
//fourth quadrant X = - X
else if (X.real >= 0 && X.imag < 0)
{
tempX = -tempX;
}
//Rounding
if (temp >= 0.5)
{
tempX >>= 16;
tempX <<= 16;
tempX += 0.0000305178125;
}
else
{
tempX >>=16;
tempX <<=16;
}
return tempX;
}
Techniques
None
Assumptions
None
Memory Note
None
User’s Manual
4-56
V 1.2, 2000-01
Function Descriptions
CplxPhase_16
Phase of a Complex Number for 16 bits (cont’d)
Implementation
The phase in a complex plane is the arctan(y/x), where y/x=z.
By Taylor series,
phase = tan-1(z) for Z<=1
[4.20]
or 0.5-tan-1(1/z) for z>1
[4.21]
If y ≤ x , the flag is set to indicate that Equation [4.20] to be
computed, otherwise Equation [4.21] is computed.
After calculating y/x, the results are normalized. Then the
arctan is calculated by using the Taylor approximation series
is a polynomial expansion. This is as follows:
2
arc tan ( z ) = 0.318253z + 0.003314z – 0.130908z
4
+ 0.068542z – 0.009159z
5
3
[4.22]
The final part of the processing extracts the sign of real and
imaginary part and branches to appropriate quadrant.
I quadrant :
phase = arctan(y/x) radian
II quadrant :
phase = π -arctan(y/x) radian
III quadrant:
phase = arctan(y/x)- π radian
IV quadrant:
phase = arctan(y/x) radian
The output of the function is given in radians and has to be
scaled. The output is as follows
+ π = 0x7fff or 0.99999999
- π = 0x8000 or -1.0
π /2 is approximately equal to 0.5
- π /2 is approximately equal to -0.5
Example
User’s Manual
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplxPh.c
4-57
V 1.2, 2000-01
Function Descriptions
CplxPhase_16
Phase of a Complex Number for 16 bits (cont’d)
Cycle Count
52+2
62+2
Code Size
180 bytes
(Best)
(Worst)
20 bytes (Data)
User’s Manual
4-58
V 1.2, 2000-01
Function Descriptions
CplxShift_16
Complex Number Shift for 16 bits
Signature
CplxS CplxShift_16(CplxS X,
int shiftVal
);
Inputs
X
:
16 bit Complex input value
shiftVal
:
shift value as a signed integer
Output
None
Return
Output value after the real and imaginary parts are shifted
Description
This function performs shifting of a 16 bit complex number
indicated by the shiftVal. Left shifting is done if the shiftVal is
positive and Right shifting is done if shiftVal is negative.
The algorithm is as follows.
R r = x r » abs ( shiftVal ), if ( shiftVal < 0 )
else ( xr « shiftVal )
Ri = x i » abs ( shiftVal ), if ( shiftVal < 0 )
[4.23]
else ( xi « shiftVal )
Pseudo code
{
real.real = X.real << shiftVal;
real.imag = X.imag << shiftVal;
return real;
}
Techniques
None
Assumptions
None
User’s Manual
4-59
V 1.2, 2000-01
Function Descriptions
CplxShift_16
Complex Number Shift for 16 bits (cont’d)
Memory Note
31
15
Real
0
Imaginary
Right shift if
-16<shift value< 0
Left shift if
0<shift value<16
31
....
Real
15
0..0
....
Imag
31
0
Sign
0..0
....
Real
15
Sign
....
0
Imag
Figure 4-11 Complex number shift for 16 bits
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
1+2
Code Size
6 bytes
User’s Manual
4-60
V 1.2, 2000-01
Function Descriptions
CplxAdd_32
Complex Number Addition for 32 bits
Signature
void CplxAdd_32(CplxL *X,
CplxL *Y,
CplxL *R
);
Inputs
X
:
32 bit Complex input value
Y
:
32 bit Complex input value
Output
R
:
The sum of two complex numbers
as a 32 bit complex number.
Return
None
Description
This function computes the sum of two 32 bit complex
numbers. Wraps around the result in case of overflow.
The algorithm is as follows
Rr = xr + yr
[4.24]
Ri = xi + yi
Pseudo code
{
R->real = X->real + Y->real;
R->imag = X->imag + Y->imag;
}
Techniques
None
Assumptions
• Inputs are in 1Q31 format
• Input and output has a real and an imaginary part packed
as 32 bit data in 1Q31 format to make a 64 bit complex
data
• Inputs are doubleword aligned
User’s Manual
4-61
V 1.2, 2000-01
Function Descriptions
CplxAdd_32
Complex Number Addition for 32 bits (cont’d)
Memory Note
63
31
Real
0
63
31
Real
Imaginary
0
Imaginary
+
+
63
31
Real
0
Imaginary
Figure 4-12 Complex number addition for 32 bits
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
4+2
Code Size
22 bytes
User’s Manual
4-62
V 1.2, 2000-01
Function Descriptions
CplxAdds_32
Complex Number Addition for 32 bits with
saturation
Signature
void CplxAdds_32(CplxL
*X,
CplxL
*Y,
CplxL_Sat *R
);
Inputs
X
:
32 bit Complex input value
Y
:
32 bit Complex input value
Output
R
:
The sum of two complex numbers
as a 32 bit saturated complex
number.
Return
None
Description
This function computes the sum of two 32 bit complex
numbers. In case of underflow, this saturates the result to
0x7FFFFFFF for positive values and 0x80000000 for negative
values.
Wraps around the result in case of overflow.
The algorithm is as follows
Rr = xr + yr
[4.25]
Ri = xi + yi
Pseudo code
{
R->real = (frac32 sat)(X->real + Y->real);
R->imag = (frac32 sat)(X->imag + Y->imag);
}
Techniques
None
Assumptions
• Inputs are in 1Q31 format
• Input and output has a real and an imaginary part packed
as 32 bit data in 1Q31 format to make a 64 bit complex
data
• Inputs are doubleword aligned
User’s Manual
4-63
V 1.2, 2000-01
Function Descriptions
CplxAdds_32
Complex Number Addition for 32 bits with saturation
(cont’d)
Memory Note
63
31
Real
0
63
31
Real
Imaginary
0
Imaginary
+
+
Sat
63
Sat
31
Real
0
Imaginary
Figure 4-13 Complex number addition for 32 bits with saturation
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
4+2
Code Size
22 bytes
User’s Manual
4-64
V 1.2, 2000-01
Function Descriptions
CplxSub_32
Complex Number Subtraction for 32 bits
Signature
void CplxSub_32(CplxL *X,
CplxL *Y,
CplxL *R
);
Inputs
X
:
32 bit Complex input value
Y
:
32 bit Complex input value
Output
R
:
The difference of two complex
numbers as a 32 bit complex
number
Return
None
Description
This function computes the difference of two 32 bit complex
numbers. Wraps around the result in case of overflow.
The algorithm is as follows.
Rr = xr – y r
[4.26]
Ri = xr – y i
Pseudo code
{
R->real = X->real - Y->real;
R->imag = X->imag - Y->imag;
}
Techniques
None
Assumptions
• Inputs are in 1Q31 format
• Input and output has a real and an imaginary part packed
as 32 bit data in 1Q31 format to make a 64 bit complex
data
• Inputs are doubleword aligned
User’s Manual
4-65
V 1.2, 2000-01
Function Descriptions
CplxSub_32
Complex Number Subtraction for 32 bits (cont’d)
Memory Note
63
31
Real
0
63
31
Real
Imaginary
0
Imaginary
63
31
Real
0
Imaginary
Figure 4-14 Complex number subtraction for 32 bits
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
4+2
Code Size
22 bytes
User’s Manual
4-66
V 1.2, 2000-01
Function Descriptions
CplxSubs_32
Complex Number Subtraction for 32 bits with
saturation
Signature
void CplxSubs_32(CplxL
*X,
CplxL
*Y,
CplxL_Sat *R
);
Inputs
X
:
32 bit Complex input value
Y
:
32 bit Complex input value
Output
R
:
The difference of two complex
numbers as a 32 bit saturated
complex number
Return
None
Description
This function computes the difference of two 32 bit complex
numbers. In case of underflow, this saturates the result to
0x7FFFFFFF for positive values and 0x80000000 for negative
values. The algorithm is as follows.
Rr = xr – y r
[4.27]
Ri = xr – y i
Pseudo code
{
R->real = (frac32 sat)(X->real - Y->real);
R->imag = (frac32 sat)(X->imag - Y->imag);
}
Techniques
None
Assumptions
• Inputs are in 1Q31 format
• Input and output has a real and an imaginary part packed
as 32 bit data in 1Q31 format to make a 64 bit complex
data
• Inputs are doubleword aligned
User’s Manual
4-67
V 1.2, 2000-01
Function Descriptions
CplxSubs_32
Complex Number Subtraction for 32 bits with
saturation (cont’d)
Memory Note
63
31
Real
0
63
31
Real
Imaginary
0
Imaginary
-
Sat
63
Sat
31
Real
0
Imaginary
Figure 4-15 Complex number subtraction for 32 bits with saturation
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
4+2
Code Size
22 bytes
User’s Manual
4-68
V 1.2, 2000-01
Function Descriptions
CplxMul_32
Complex Number Multiplication for 32 bits
Signature
void CplxMul_32(CplxL *X,
CplxL *Y,
CplxL *R
);
Inputs
X
:
32 bit Complex input value
Y
:
32 bit Complex input value
Output
R
:
The product of two complex
numbers as a 32 bit complex
number
Return
None
Description
This function computes the product of the two 32 bit complex
numbers. Wraps around the result in case of overflow.
The complex multiplication is computed as follows.
Rr = xr × yr – xi × yi
Ri = x i × yr + xr × yi
Pseudo code
{
frac64 real;
frac64 ima;
real = (frac64)((X->real * Y->real) - (X->imag * Y->imag));
//real part
ima = (frac64)((X->real * Y->imag) + (X->imag * Y->real));
//imaginary part
R->real = (frac32)real;
R->imag = (frac32)ima;
}
Techniques
None
Assumptions
• Inputs are in 1Q31 format
• Input and output has a real and an imaginary part packed
as 32 bit data in 1Q31 format to make a 64 bit complex
data
• Inputs are doubleword aligned
User’s Manual
4-69
V 1.2, 2000-01
Function Descriptions
CplxMul_32
Complex Number Multiplication for 32 bits (cont’d)
Memory Note
63
31
0
Real
63
31
Real
Imaginary
0
Imaginary
+
+
+
+
+
63
0
31
Imaginary
Real
Figure 4-16 Complex number multiplication for 32 bits
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
13+2
Code Size
38 bytes
User’s Manual
4-70
V 1.2, 2000-01
Function Descriptions
CplxMuls_32
Complex Number Multiplication for 32 bits with
Saturation
Signature
void CplxMuls_32(CplxL
*X,
CplxL
*Y,
CplxL_Sat *R
);
Inputs
X
:
32 bit Complex input value
Y
:
32 bit Complex input value
Output
R
:
The product of two complex
numbers as a 32 bit complex
number
Return
None
Description
This function computes the product of the two 32 bit complex
numbers. In case of overflow, the result is saturated to
0x7FFFFFFF for positive overflow and 0x80000000 for
negative underflow.
The complex multiplication is computed as follows.
Rr = xr × yr – xi × yi
Ri = x i × yr + xr × yi
Pseudo code
{
frac64
frac64
real;
ima;
real = (frac64)((X->real * Y->real) - (X->imag * Y->imag));
//real part
ima = (frac64)((X->real * Y->imag) + (X->imag * Y->real));
//imaginary part
R->real = (frac32 sat)real;
R->imag = (frac32 sat)ima;
}
Techniques
User’s Manual
None
4-71
V 1.2, 2000-01
Function Descriptions
CplxMuls_32
Complex Number Multiplication for 32 bits with
Saturation (cont’d)
Assumptions
• Inputs are in 1Q31 format
• Input and output has a real and an imaginary part packed
as 32 bit data in 1Q31 format to make a 64 bit complex
data
• Inputs are doubleword aligned
Memory Note
63
31
0
Real
63
31
Real
Imaginary
0
Imaginary
+
+
+
+
+
-
63
0
31
Imaginary
Real
Sat
Sat
32
16
Real
0
Imaginary
Figure 4-17 Complex number multiplication for 32 bits with saturation
User’s Manual
4-72
V 1.2, 2000-01
Function Descriptions
CplxMuls_32
Complex Number Multiplication for 32 bits with
Saturation (cont’d)
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
13+2
Code Size
38 bytes
User’s Manual
4-73
V 1.2, 2000-01
Function Descriptions
CplxConj_32
Complex Number Conjugate for 32 bits
Signature
void CplxConj_32(CplxL *X,
CplxL *R
);
Inputs
X
:
32 bit Complex input value
Output
R
:
The conjugate of the complex
number
Return
None
Description
This function finds the conjugate of a 32 bit complex number.
Conjugate of a complex number is given by
R = ( x + iy ) = x – iy
[4.28]
Pseudo code
{
R->imag = 0.0 - X->imag;
R->real = X->real;
}
Techniques
None
Assumptions
• Input is in 1Q31 format
• Input and output has a real and an imaginary part packed
as 32 bit data in 1Q31 format to make a 32 bit complex
data
• Inputs are doubleword aligned
User’s Manual
4-74
V 1.2, 2000-01
Function Descriptions
CplxConj_32
Complex Number Conjugate for 32 bits (cont’d)
Memory Note
63
31
Real
0
Imaginary
Negate
63
31
Real
0
Imaginary
Figure 4-18 Complex number conjugate for 32 bits
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
2+2
Code Size
14 bytes
User’s Manual
4-75
V 1.2, 2000-01
Function Descriptions
CplxMag_32
Magnitude of a Complex Number for 32 bits
Signature
DataL CplxMag_32(CplxL X);
Inputs
X
Output
None
Return
The magnitude of the complex number as a 32 bit integer or
fract
Description
This function finds the magnitude of a 32 bit complex number.
:
32 bit Complex input value
The algorithm is as follows
R =
2
x +y
2
[4.29]
Pseudo code
{
int indx;
frac32 sat
frac32 sat
frac32 sat
frac32 sat
tempX;
tempY;
temp;
sqrttab[15] = {0.999999999999, 0.7071067811865, 0.5,
0.3535533905933, 0.25, 0.1767766952966, 0.125,
0.08838834764832, 0.0625, 0.04419417382416,
0.03125, 0.02209708691208, 0.015625,
0.01104854345604, 0.0078125};
//Scale down the input by 2
X->real >>= 1;
X->imag >>= 1;
//Power = real^2 + imag^2
tempX = (X->real * X->real);
tempY = (X->imag * X->imag);
tempX += tempY;
//Mag = sqrt(power);
indx = exp1(tempX);//calculate the leading zero
tempX = norm(tempX,indx);
//normalise
tempY = tempX >> 1;//y = x/2
tempY -= 0.5;
//y = x/2 - 0.5
tempX = tempY + 0.9999999999999999;
//sqrt(x) = y + 1
User’s Manual
4-76
V 1.2, 2000-01
Function Descriptions
CplxMag_32
Magnitude of a Complex Number for 32 bits (cont’d)
temp = (tempY * tempY);
//y^2
tempX -= temp >> 1;//sqrt(x) = (y
temp= (temp*tempY);//y^3
tempX += temp >> 1;//sqrt(x) = (y
temp = (temp * tempY);
//y^4
tempX -= temp * 0.625;
//sqrt(x) = (y +
temp = (temp * tempY);
//y^5
tempX = tempX + (0.875 * temp);
//sqrt(x) = (y
//
tempX = tempX * sqrttab[indx];
return tempX;
+ 1) - 0.5*y^2
+ 1) - 0.5*y^2 + 0.5*y^3
1) - 0.5*y^2 + 0.5*y^3 - 0.625*y^4
+ 1) - 0.5*y^2 + 0.5*y^3
0.625*y^4 +0.875*y^5
}
Techniques
None
Assumptions
• Inputs are doubleword aligned
Memory Note
None
User’s Manual
4-77
V 1.2, 2000-01
Function Descriptions
CplxMag_32
Magnitude of a Complex Number for 32 bits (cont’d)
Implementation
The real and imaginary parts of a complex number x+iy are
scaled down by two to avoid overflow.
MAC is used to square the imaginary part and dual MAC is
used to square the real part. Add these to give the
power(x2+y2).
If the power is zero, then the whole computation is not done
to save cycles. Power(x2+y2) is normalized and the exponent
is used as the scale factor in the square root operation. The
square root is computed using the taylor approximation
series.
The taylor series for square root is as follows:
Let Z = x2+y2
R = (Z + 1)/2
2
3
4
sqrt ( Z ) = R + 1 – 0.5R + 0.5R – 0.625R – 0.875R
5
[4.30]
The final result sqrt(Z) is again rescaled using the scale factor
as index of the square root table to give the magnitude.
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplxMag.c
Cycle Count
52
62
Code Size
126 bytes
(Best)
(Worst)
140 bytes (Data)
User’s Manual
4-78
V 1.2, 2000-01
Function Descriptions
CplxPhase_32
Phase of a Complex Number for 32 bits
Signature
DataL CplxPhase_32(CplxL *X);
Inputs
X
Output
None
Return
The phase of a complex number as a 32 bit integer or fract
Description
This function computes the phase of a complex number. The
algorithm is as follows.
:
Phase = tan-1(y/x)
32 bit Complex input value
[4.31]
Pseudo code
{
int indx;
int flag;
frac32 sat tempX;
frac32 sat tempY;
frac32 sat temp;
//Scale down the input by 2
X->real >>= 1;
X->imag >>= 1;
//Power = real^2 + imag^2
if (X->real < 0)
{
tempX = -X->real;
}
else
{
tempX = X->real;
}
if (X->imag < 0)
{
tempY = -X->imag;
}
else
{
tempY = X->imag;
}
User’s Manual
4-79
V 1.2, 2000-01
Function Descriptions
CplxPhase_32
//Phase =
if (tempX
{
flag =
temp =
}
else
{
flag =
temp =
}
Phase of a Complex Number for 32 bits (cont’d)
arctan(imag/real)
<= tempY)
1;
tempX/tempY;
0;
tempY/tempX;
indx = exp1(temp); //calculate the leading zero
temp = norm(temp,indx);
//normalise
tempX = K5 * temp + K4;
tempX = tempX * temp + K3;
tempX = tempX * temp + K2;
tempX = tempX * temp + K1;
tempX = tempX * temp;
if (flag == 1)
{
tempX = 0.5 - tempX;
}
if (X->real < 0 && X->imag < 0)
{
tempX = tempX - 0.9999999999999;
}
else if (X->real < 0 && X->imag >= 0)
{
tempX = 0.9999999999999 - tempX;
}
else if (X->real >= 0 && X->imag < 0)
{
tempX = -tempX;
}
return tempX;
}
User’s Manual
4-80
V 1.2, 2000-01
Function Descriptions
CplxPhase_32
Phase of a Complex Number for 32 bits (cont’d)
Techniques
None
Assumptions
• Inputs are doubleword aligned
Memory Note
None
Implementation
The phase in a complex plane is the arctan(y/x), where y/x=z.
By Taylor series,
phase = tan-1(z) for Z<=1
[4.32]
or 0.5-tan-1(1/z) for z>1.
[4.33]
If y ≤ x , the flag is set to indicate that Equation [4.32] to be
computed, otherwise Equation [4.33] is computed.
After calculating y/x, the results are normalized. Then the
arctan is calculated by using the Taylor approximation series
is a polynomial expansion. This is as follows:
2
arc tan ( z ) = 0.318253z + 0.003314z – 0.130908z
4
+ 0.068542z – 0.009159z
5
3
[4.34]
The final part of the processing extracts the sign of real and
imaginary part and branches to appropriate quadrant.
I quadrant :
phase = arctan(y/x) radian
II quadrant :
phase = π -arctan(y/x) radian
III quadrant:
phase = arctan(y/x)- π radian
IV quadrant:
phase = arctan(y/x) radian
The output of the function is given in radians and has to be
scaled. The output is as follows
+ π = 0x7fffffff or 0.99999999
- π = 0x80000000 or -1.0
π /2 is approximately equal to 0.5
- π /2 is approximately equal to -0.5
User’s Manual
4-81
V 1.2, 2000-01
Function Descriptions
CplxPhase_32
Phase of a Complex Number for 32 bits (cont’d)
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplxPh.c
Cycle Count
7
7+44
Code Size
180 bytes
(Best)
(Worst)
20 bytes (Data)
User’s Manual
4-82
V 1.2, 2000-01
Function Descriptions
CplxShift_32
Complex Number Shift for 32 bits
Signature
void CplxShift_32(CplxL *X,
CplxL *R,
int
shiftVal
);
Inputs
X
:
32 bit Complex input value
shiftVal
:
shift value as a signed integer
Output
R
:
Output value after the real and
imaginary parts are shifted
Return
None
Description
This function performs shifting of a 32 bit complex number
indicated by the shiftVal. Left shifting is done if the shiftVal is
positive and Right shifting is done if shiftVal is negative.
The algorithm is as follows.
R r = x r » abs ( shiftVal ), if ( shiftVal < 0 )
else ( xr « shiftVal )
Ri = x i » abs ( shiftVal ), if ( shiftVal < 0 )
[4.35]
else ( xi « shiftVal )
Pseudo code
{
if (Y < 0)
{
R->real
R->imag
}
else if (Y
{
R->real
R->imag
}
else
{
R->real
R->imag
}
= X->real >> Y;
= X->imag >> Y;
> 0)
= X->real << Y;
= X->imag << Y;
= X->real;
= X->imag;
}
Techniques
User’s Manual
None
4-83
V 1.2, 2000-01
Function Descriptions
CplxShift_32
Complex Number Shift for 32 bits (cont’d)
Assumptions
• Inputs are doubleword aligned
Memory Note
63
31
Real
0
Imaginary
Right shift if
-32<shift value< 0
Left shift if
0<shift value<32
63
....
Real
31
0..0
....
Imag
63
0
Sign
0..0
....
Real
31
Sign
....
0
Imag
Figure 4-19 Complex number shift for 32 bits
Example
Trilib\Example\Tasking\CplxArith\expCplx.c, expCplx.cpp
Trilib\Example\GreenHills\CplxArith\expCplx.cpp,
expCplx.c
Trilib\Example\GNU\CplxArith\expCplx.c
Cycle Count
3+2
Code Size
18 bytes
User’s Manual
4-84
V 1.2, 2000-01
Function Descriptions
4.3
Vector Arithmetic Functions
A vector is a quantity that has both magnitude and direction. Many physical quantities
are vectors, e.g., force, velocity and momentum. In order to compare vectors and to
operate on them mathematically, it is necessary to have some reference system that
determines scale and direction, such as Cartesian coordinates. A vector is frequently
symbolized by its components with respect to the coordinate axis. The concept of a
vector can be extended to three or more dimensions.
4.3.1
Descriptions
The following vector arithmetic functions are described.
•
•
•
•
•
•
•
Vector addition with saturation
Vector subtraction with saturation
Vector Dot product
Maximum element by index
Minimum element by index
Maximum element by value
Minimum element by value
User’s Manual
4-85
V 1.2, 2000-01
Function Descriptions
VecAdd
Vector Operation - Addition of two vectors
Signature
int VecAdd(DataS
*X,
DataS
* Y,
DataS_Sat *R,
int
nX
);
Inputs
X
:
Pointer to first vector components
Y
:
Pointer to second vector
components
nX
:
Dimension of vector
Output
R
:
Pointer to the sum of two vectors
Return
None
Description
This function finds the sum of two vectors.
If x and y are two vectors given by x = [x0, x1,....xN-1]T and y =
[y0, y1,...,yN-1]T, their sum is given by
Ri = xi + yi (i = 0,1,..., N-1)
[4.36]
Pseudo code
{
int i;
for (i = 0;i < nX;i++)
{
R[i] = X[i] + Y[i];
//Add
}
}
Techniques
None
Assumptions
• The input vectors have the same dimension
User’s Manual
4-86
V 1.2, 2000-01
Function Descriptions
VecAdd
Vector Operation - Addition of two vectors (cont’d)
Memory Note
aX
X[0]
X[1]
X[2]
+
Y[0]
+
+
Y[2]
.
.
.
.
.
.
.
.
+
X[nX]
aR
aY
Y[1]
Y[nX]
R[0]
R[1]
R[2]
.
.
.
.
R[nX]
Figure 4-20 Vector Addition
User’s Manual
4-87
V 1.2, 2000-01
Function Descriptions
VecAdd
Vector Operation - Addition of two vectors (cont’d)
Implementation
The Vector Add function adds with saturation the peer
elements of two arrays and stores the result in the resultant
array. It uses the packed Load/Store instruction to load 4
words of data simultaneously. It adds the 4 elements in one
go and stores it into the result array. This is applicable for all
the arrays with sizes equal to the multiples of 4 words. In case
if the size is of odd or not the multiple of 4 words, it checks the
remaining elements and correspondingly takes respective
paths to execute the addition separately from the remaining
words which is left out.
Example
Trilib\Example\Tasking\Vectors\expVect.c, expVect.cpp
Trilib\Example\GreenHills\Vectors\expVect.cpp, expVect.c
Trilib\Example\GNU\Vectors\expVect.c
Cycle Count
Code Size
User’s Manual
nX
7 + 5 × ------- + 4 + 2
4
(Best)
nX
7 + 5 × ------- + 8 + 2
4
(Worst)
84 bytes
4-88
V 1.2, 2000-01
Function Descriptions
VecSub
Vector Operation - Difference of two vectors
Signature
int VecSub(DataS
*X,
DataS
*Y,
DataS_Sat *R,
int
nX
);
Inputs
X
:
Pointer to first vector components
Y
:
Pointer to second vector
components
nX
:
Dimension of vector
Output
R
:
Pointer to difference of two vectors
Return
None
Description
This function finds the difference of two vectors.
If x and y are two vectors given by x = [x0, x1,....xN-1]T and y =
[y0, y1,...,yN-1]T, their sum is given by
Ri = xi - yi (i = 0,1,..., N-1)
[4.37]
Pseudo code
{
int i;
for (i = 0;i < nX;i++)
{
R[i] = X[i] - Y[i];
//Subtract
}
}
Techniques
None
Assumptions
• The input vectors have the same dimension
User’s Manual
4-89
V 1.2, 2000-01
Function Descriptions
VecSub
Vector Operation - Difference of two vectors (cont’d)
Memory Note
aX
X[0]
-
X[1]
X[2]
Y[0]
-
Y[2]
.
.
.
.
.
.
.
.
-
X[nX]
aR
aY
Y[1]
Y[nX]
R[0]
R[1]
R[2]
.
.
.
.
R[nX]
Figure 4-21 Vector Subtraction
User’s Manual
4-90
V 1.2, 2000-01
Function Descriptions
VecSub
Vector Operation - Difference of two vectors (cont’d)
Implementation
The Vector Subtract function subtracts with saturation the X
array data by the corresponding peer element of Y array and
stores the result in the resultant array. It uses the packed
Load/Store instruction to load 4 words of data simultaneously.
It adds the 4 elements in one go and stores it into the result
array. This is applicable for all the arrays with sizes equal to
the multiples of 4 words. In case if the size is of odd or not the
multiple of 4 words, it checks the remaining elements and
correspondingly takes respective paths to execute the
subtraction separately from the remaining words which is left
out.
Example
Trilib\Example\Tasking\Vectors\expVect.c, expVect.cpp
Trilib\Example\GreenHills\Vectors\expVect.cpp, expVect.c
Trilib\Example\GNU\Vectors\expVect.c
Cycle Count
Code Size
User’s Manual
nX
7 + 5 × ------- + 4 + 2
4
(Best)
nX
7 + 5 × ------- + 8 + 2
4
(Worst)
84 bytes
4-91
V 1.2, 2000-01
Function Descriptions
VecDotPro
Signature
Vector Operation - Dot Product of two vectors
DataL VecDotPro(DataS *X,
DataS *Y,
int
nX
);
Inputs
X
:
Pointer to first vector components
Y
:
Pointer to second vector
components
nX
:
Dimension of vectors
Output
None
Return
Dot product of the two vectors (48-bit output value converted
to 32-bit with saturation)
Description
If x and y are two vectors of dimension N, their dot product is
given by
N–1
x⋅y =
∑ x i ⋅ yi =
x0 ⋅ y 0 + x 1 ⋅ y1 + … + xN – 1 ⋅ yN – 1
[4.38]
i=0
Pseudo code
{
int i;
frac64 product = 0;
for(i = 0;i < nX;i++)
{
product += (frac64) X[i](*)Y[i];
}
//calculating the dot product
return(frac32 sat)product;
//Format the result to 32-bit saturated value
}
Techniques
• Use of MAC instructions
• Intermediate results stored in a 64 bit register (16 guard
bits)
• Dot product output is converted to 16 bit with saturation
• Instruction ordering provided for zero overhead Load/Store
Assumptions
• The input vectors have the same dimension
User’s Manual
4-92
V 1.2, 2000-01
Function Descriptions
VecDotPro
Vector Operation - Dot Product of two vectors (cont’d)
Memory Note
aX
X[0]
X[1]
Y[1]
.
.
.
.
.
.
.
.
.
.
.
X[Size]
acc
.
=
X[0].Y[0]
aY
Y[0]
.
+
X[1].Y[1]
Y[Size]
X[Size].
Y[Size]
Figure 4-22 Dot product of two vectors
Implementation
The Vector Dot Product function multiplies and accumulates
the X array data by the corresponding peer element of Y
array. It uses the madd.q instruction to do the multiply and
accumulate the input data, the final result which is in 17Q47
format in a 64 bit register is converted to a 32 bit result and is
saturated.
Example
Trilib\Example\Tasking\Vectors\expVect.c, expVect.cpp
Trilib\Example\GreenHills\Vectors\expVect.cpp, expVect.c
Trilib\Example\GNU\Vectors\expVect.c
Cycle Count
5 + 2 × [ nX – 1 ] + 5
Code Size
52 bytes
User’s Manual
4-93
V 1.2, 2000-01
Function Descriptions
VecMaxIdx
Vector Operation - Maximum Element by Index of a
vector
Signature
int VecMaxIdx(DataS *X,
int
nX
);
Inputs
X
:
Pointer to the vector components
nX
:
Dimension of vector
Output
None
Return
The maximum element by index of the input vector
Description
This function calculates the maximum element by index of a
vector. The input vector components are 16 bit real values.
Pseudo code
{
frac16 element = -1.0;
int i;
for (i = 0;i <
{
if (element
{
element
}
}
i = 0;
while (element
{
i++;
}
nX;i++)
< X[i])
= X[i];
!= X[i])
return i;
}
Techniques
None
Assumptions
• Inputs are in 1Q15 format
User’s Manual
4-94
V 1.2, 2000-01
Function Descriptions
VecMaxIdx
Vector Operation - Maximum Element by Index of a
vector (cont’d)
Memory Note
aX
X[0]
X[1]
.
Yes
Max< x[0]
Max=X[0], index=i
No
.
.
Max < x[1]
Max=X[1], index=i
Max <
x[size]
Max=X[size], index=i
.
.
X[size]
Return index
Figure 4-23 Maximum element by index
User’s Manual
4-95
V 1.2, 2000-01
Function Descriptions
VecMaxIdx
Vector Operation - Maximum Element by Index of a
vector (cont’d)
Implementation
The Vector Maximum by Index function uses the max.h and
eq.h instructions to optimally find the maximum value in the
array. The max.h instruction checks the two 32 bit registers
and returns the bigger 2 words among them into another
register thereby does two comparison and movement of data
in one go. Similarly the eq.h checks if the value is equal
among the two registers, this is used here to find the greater
value between the two words of a same 32 bit register finally,
which is found to be in the maximum pair register after the
computation of maximum element. Since the max.h does two
comparisons, the loop count is reduced by half.
The final part of the function is to calculate the index of the
maximum element, this is done by initializing a index variable
and is kept on incrementing until the maximum element found
matches with one of the array’s element, odd array size is
separately taken care.
Example
Trilib\Example\Tasking\Vectors\expVect.c, expVect.cpp
Trilib\Example\GreenHills\Vectors\expVect.cpp, expVect.c
Trilib\Example\GNU\Vectors\expVect.c
Cycle Count
1
nX
4 + 2 × ------- + 1 + 3 +  2 × --- + 2

2
4
nX
nX
4 + 2 × ------- + 1 + 3 +  2 × ------- + 2

2
4
Code Size
User’s Manual
(Best)
(Worst)
92 bytes
4-96
V 1.2, 2000-01
Function Descriptions
VecMinIdx
Vector Operation - Minimum Element by index of a
vector
Signature
int VecMinIdx(DataS *X,
int
nX
);
Inputs
X
:
Pointer to vector components
nX
:
Dimension of vector
Output
None
Return
The minimum element by index of the input vector
Description
This function calculates the minimum element by index of a
vector. The input vector components are 16 bit real values
and are halfword aligned.
Pseudo code
{
frac16 element = 0.99999999999999;
int i;
for (i = 0;i <
{
if (element
{
element
}
}
i = 0;
while (element
{
i++;
}
nX;i++)
> X[i])
= X[i];
!= X[i])
return i;
}
Techniques
None
Assumptions
None
User’s Manual
4-97
V 1.2, 2000-01
Function Descriptions
VecMinIdx
Vector Operation - Minimum Element by index of a
vector (cont’d)
Memory Note
aX
X[0]
X[1]
.
Yes
Min>x[0]
Min=X[0], index=i
No
.
.
Min>x[1]
Min=X[1], index=i
Min>x[size]
Min=X[size], index=i
.
.
X[size]
Return index
Figure 4-24 Minimum element by index
User’s Manual
4-98
V 1.2, 2000-01
Function Descriptions
VecMinIdx
Vector Operation - Minimum Element by index of a
vector (cont’d)
Implementation
The Vector Minimum by Index function uses the min.h and
eq.h instructions to optimally find the minimum value in the
array. The min.h instruction checks the two 32 bit registers
and returns the smaller 2 words among them into another
register thereby does two comparison and movement of data
in one go. Similarly the eq.h checks if the value is equal
among the two registers, this is used here to find the smaller
value between the two words of a same 32 bit register finally,
which is found to be in the minimum pair register after the
computation of minimum element. Since the min.h does two
comparisons, the loop count is reduced by half.
The final part of the function is to calculate the index of the
minimum element, this is done by initializing a index variable
and is kept on incrementing until the minimum element found
matches with one of the array’s element, odd array size is
separately taken care.
Example
Trilib\Example\Tasking\Vectors\expVect.c, expVect.cpp
Trilib\Example\GreenHills\Vectors\expVect.cpp, expVect.c
Trilib\Example\GNU\Vectors\expVect.c
Cycle Count
1
nX
4 + 2 × ------- + 1 + 3 +  2 × --- + 2

2
4
nX
nX
4 + 2 × ------- + 1 + 3 +  2 × ------- + 2

2
4
Code Size
User’s Manual
(Best)
(Worst)
98 bytes
4-99
V 1.2, 2000-01
Function Descriptions
VecMaxVal
Vector Operation - Maximum Element by value of a
vector
Signature
int VecMaxVal(DataS *X,
int
nX
);
Inputs
X
:
Pointer to vector components
nX
:
Dimension of vector
Output
None
Return
The maximum element by value of the input vector
Description
This function calculates the maximum element by value of a
vector. The input vector components are 16 bit real values
and are halfword aligned.
Pseudo code
{
frac16 element = -1.0;
int i;
for (i = 0;i < nX ;i++)
{
if (element < X[i])
{
element = X[i];
}
}
return element;
}
Techniques
None
Assumptions
None
User’s Manual
4-100
V 1.2, 2000-01
Function Descriptions
VecMaxVal
Vector Operation - Maximum Element by value of a
vector (cont’d)
Memory Note
aX
X[0]
X[1]
.
Yes
Max<x[0]
Max=X[0]
No
.
.
Max<x[1]
Max=X[1]
Max<x[size]
Max=X[size]
.
.
X[size]
Return Max
Figure 4-25 Maximum element by value
User’s Manual
4-101
V 1.2, 2000-01
Function Descriptions
VecMaxVal
Vector Operation - Maximum Element by value of a
vector (cont’d)
Implementation
The Vector Maximum by value function uses the max.h and
eq.h instructions to optimally find the maximum value in the
array. The max.h instruction checks the two 32 bit registers
and returns the bigger 2 words among them into another
register thereby does two comparison and movement of data
in one go. Similarly the eq.h checks if the value is equal
among the two registers, this is used here to find the greater
value between the two words of a same 32 bit register finally,
which is found to be in the maximum pair register after the
computation of maximum element. Since the max.h does two
comparisons, the loop count is reduced by half. It returns the
maximum value among the two in the maximum element
register.
Example
Trilib\Example\Tasking\Vectors\expVect.c, expVect.cpp
Trilib\Example\GreenHills\Vectors\expVect.cpp, expVect.c
Trilib\Example\GNU\Vectors\expVect.c
Cycle Count
nX
3 + 2 × ------- + 1 + 5
4
nX
3 + 2 × ------- + 1 + 7
4
Code Size
User’s Manual
(Best)
(Worst)
56 bytes
4-102
V 1.2, 2000-01
Function Descriptions
VecMinVal
Vector Operation - Minimum Element by value of a
vector
Signature
int VecMinVal(DataS
int
);
*X,
nX
Inputs
X
:
Pointer to vector components
nX
:
Dimension of vector
Output
None
Return
The minimum element by value of the input vector
Description
This function calculates the minimum element by value of a
vector. The input vector components are 16 bit real values
and are halfword aligned.
Pseudo code
{
frac16 element = 0.999999999;
int i;
for (i = 0;i < nX;i++)
{
if (element > X[i])
{
element = X[i];
}
}
return element;
}
Techniques
None
Assumptions
None
User’s Manual
4-103
V 1.2, 2000-01
Function Descriptions
VecMinVal
Vector Operation - Minimum Element by value of a
vector (cont’d)
Memory Note
aX
X[0]
X[1]
.
Yes
Min>x[0]
Min=X[0]
No
.
.
Min>x[1]
Min=X[1]
Min>x[size]
Min=X[size]
.
.
X[size]
Return Min
Figure 4-26 Minimum element by value
User’s Manual
4-104
V 1.2, 2000-01
Function Descriptions
VecMinVal
Vector Operation - Minimum Element by value of a
vector (cont’d)
Implementation
The Vector Minimum by value function uses the min.h and
eq.h instructions to optimally find the minimum value in the
array. The min.h instruction checks the two 32 bit registers
and returns the smaller 2 words among them into another
register thereby does two comparison and movement of data
in one go. Similarly the eq.h checks if the value is equal
among the two registers, this is used here to find the smaller
value between the two words of a same 32 bit register finally,
which is found to be in the minimum pair register after the
computation of minimum element. Since the min.h does two
comparisons, the loop count is reduced by half. It returns the
minimum value among the two in the minimum element
register.
Example
Trilib\Example\Tasking\Vectors\expVect.c, expVect.cpp
Trilib\Example\GreenHills\Vectors\expVect.cpp, expVect.c
Trilib\Example\GNU\Vectors\expVect.c
Cycle Count
nX
3 + 2 × ------- + 1 + 5
4
nX
3 + 2 × ------- + 1 + 7
4
Code Size
User’s Manual
(Best)
(Worst)
56 bytes
4-105
V 1.2, 2000-01
Function Descriptions
4.4
FIR Filters
4.4.1
Normal FIR
The FIR (Finite Impulse Response) filter, as its name suggests, will always have a finite
duration of non-zero output values for given finite duration of non-zero input values. FIR
filters use only current and past input samples, and none of the filter’s previous output
samples, to obtain a current output sample value.
For causal FIR systems, the system function has only zeros (except for poles at z=0).
The FIR filter can be realized in transversal, cascade and lattice forms. The implemented
structure is of transversal type, which is realized by a tapped delay line. In case of FIR,
delay line stores the past input values. The input x(n) for the current calculation will
become x(n-1) for the next calculation. The output from each tap is summed to generate
the filter output. For a general nH tap FIR filter, the difference equation is
nH – 1
R( n) =
∑
Hi ⋅ X ( n – i )
[4.39]
i=0
where,
X(n)
:
the filter input for nth sample
R(n)
:
output of the filter for nth sample
Hi
:
filter coefficients
nH
:
filter order
The filter coefficients, which decide the scaling of current and past input samples stored
in the delay line, define the filter response.
The transfer function of the filter in Z-transform is
R[z]
H [ z ] = ------------ =
X[z]
nH – 1
∑
Hi ⋅ Z
–i
[4.40]
i=0
User’s Manual
4-106
V 1.2, 2000-01
Function Descriptions
Delay Line
X(n)
(Filter Input)
H0
X(n)
X
Z-1
H1
X(n-1)
Z-1
X
Z-1
X(n-nH+1)
H nH-1
X
+
R(n)
(Filter Output)
Figure 4-27 Block Diagram of the FIR Filter
4.4.1.1
Descriptions
The following Normal FIR filter functions are described.
•
•
•
•
Normal, Arbitrary number of coefficients, Sample processing
Normal, Arbitrary number of coefficients, Block processing
Normal, coefficients - multiple of 4, Sample processing
Normal, coefficients - multiple of 4, Block processing
User’s Manual
4-107
V 1.2, 2000-01
Function Descriptions
Fir_16
FIR Filter, Normal, Arbitrary number of coefficients,
Sample processing
Signature
DataS Fir_16(DataS
DataS
cptrDataS
);
Inputs
X
:
Real input value
H
:
Pointer to Coeff-Buffer of size nH
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
Output
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
Return
R
:
Output value of the filter (48-bit
value converted to 16-bit with
saturation)
Description
The implementation of FIR filter uses transversal structure
(direct form). A single input is processed at a time and output
for every sample is returned. The filter operates on 16-bit real
input, 16-bit coefficients and gives 16-bit real output. The
number of coefficients given by the user is arbitrary. Circular
buffer addressing mode is used for delay line. Coefficient
buffer is halfword aligned. Delay line buffer is doubleword
aligned.
User’s Manual
4-108
X,
*H,
*DLY
V 1.2, 2000-01
Function Descriptions
Fir_16
FIR Filter, Normal, Arbitrary number of coefficients,
Sample processing (cont’d)
Pseudo code
{
frac64 acc;
//Filter Result
int j,k=0;
frac16circ *aDLY = &DLY;
//ptr to Circ-ptr of Delay-Buffer
*DLY = X;
//Store input value in Delay-Buffer at
//the position of the oldest value
acc = 0.0;
if(nH%2 == 0)
//even coefficients
{
//’n’ in the comments refers current instant
//The index i,j of X(i),H(j)(in the comments) are valid
//for first loop iteration
//For each next loop i,j should be decremented and
//incremented by 2 respectively.
for(j=0; j<nH/2; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY+k)) +
(*(H+k+1))* (*(DLY+k+1)));
//acc += X(n)*H(0) + X(n-1)*H(1)
k=k+2;
}
}
else
//odd coefficients
{
//’n’ in the comments refers current instant
//The index i,j of X(i),H(j)(in the comments) are valid
//for first loop iteration.
//For each next loop i,j should be decremented and
//incremented by 1 respectively.
User’s Manual
4-109
V 1.2, 2000-01
Function Descriptions
Fir_16
FIR Filter, Normal, Arbitrary number of coefficients,
Sample processing (cont’d)
for(j=0; j<nH; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY+k)));
//acc += X(n)*H(0)
k++;
}
}
DLY--;
//Set DLY.index to the oldest value
//in Delay-Buffer
aDLY=&DLY;
//store updated delay
R = (frac16 sat)acc;
//Format the filter output from 48-bit
//to 16-bit saturated value
return R;
//Filter output returned
}
Techniques
• Loop unrolling, two taps/loop if coefficients are even, else
one tap/loop
• Use of packed data Load/Store
• Delay line implemented as circular buffer
• Use of dual MAC instruction for even coefficients and MAC
instruction for odd coefficients
• Intermediate results stored in 64 bit register (16 guard bits)
• Instruction ordering for zero overhead Load/Store
Assumptions
• Inputs, outputs, coefficients and delay line are in 1Q15
format
• Filter order nH is not explicitly sent as an argument, instead
it is sent through the argument DLY as a size of circ-DelayBuffer
User’s Manual
4-110
V 1.2, 2000-01
Function Descriptions
Fir_16
FIR Filter, Normal, Arbitrary number of coefficients,
Sample processing (cont’d)
Memory Note
Delay-Buffer
Coeff-Buffer
.
H0
.
aDLY caDLY
X(n-nH + 1)
H1
X
.
X(n)
.
X(n-1)
.
X(n-2)
.
.
.
aH
MAC (odd
number of
coefficients)
1Q15
.
HnH-1
1Q15
doubleword
aligned
Dual MAC halfword
aligned
(even
number of
coefficients)
Figure 4-28 Fir_16
User’s Manual
4-111
V 1.2, 2000-01
Function Descriptions
Fir_16
FIR Filter, Normal, Arbitrary number of coefficients,
Sample processing (cont’d)
Implementation
The FIR filter implemented structure is of transversal type,
which is realized by a tapped delay line.
The FIR filter routine processes one sample at a time and
returns the output of that sample. The input for which the
output is to be calculated is sent as an argument to the
function.
Implementation is different for even and odd coefficients.
Even number of coefficients:
TriCore’s load word instruction loads the two delay line values
and two coefficients in one cycle. Dual MAC instruction
performs a pair of multiplications and additions according to
the equation
acc = acc + X ( n ) ⋅ H 0 + X ( n – 1 ) ⋅ H 1
[4.41]
By using a dual MAC in the tap loop, the loop count is brought
down by a factor of two. Here two taps are used during a
single pass and loop is unrolled for efficient pointer update of
delay line. Thus loop is executed (nH/2-1) times.
Odd number of coefficients:
TriCore’s load halfword instruction loads one delay line value
and one coefficient in one cycle. MAC instruction performs
one multiplication and one addition according to the equation
acc = acc + X ( n ) ⋅ H 0
[4.42]
By using a MAC in the tap loop, the loop count remains nH.
Only one tap is used during a single pass and loop is unrolled
for efficient pointer update of delay line. Thus loop is executed
(nH-1) times.
User’s Manual
4-112
V 1.2, 2000-01
Function Descriptions
Fir_16
FIR Filter, Normal, Arbitrary number of coefficients,
Sample processing (cont’d)
The filter output R(n) is 16-bit saturated equivalent of acc
when the tap loop is executed fully.
For delay line, circular addressing mode is used which helps
in efficient delay update. The size of the circular Delay-Buffer
is equal to the filter order, i.e., the number of coefficients.
Circular buffer needs doubleword alignment. There is no
restriction on the number of coefficients.
Delay pointer in the memory note shows updated pointer after
tap loop is over. This points to the oldest value in the delaybuffer which is replaced by new input value.
Example
Trilib\Example\Tasking\Filters\FIR\expFir_16.c,
expFir_16.cpp
Trilib\Example\GreenHills\Filters\FIR\expFir_16.cpp,
expFir_16.c
Trilib\Example\GNU\Filters\FIR\expFir_16.c
Cycle Count
With DSP
Extensions
For even number of coefficients
Pre-kernel
:
Kernel
:
Post-kernel
:
10
nH
------- – 1 × 2 + 2
2
2+2
For odd number of coefficients
User’s Manual
Pre-kernel
:
8
Kernel
:
[ nH – 1 ] × 2 + 2
Post-kernel
:
2+2
4-113
V 1.2, 2000-01
Function Descriptions
Fir_16
FIR Filter, Normal, Arbitrary number of coefficients,
Sample processing (cont’d)
Without DSP
Extensions
For even number of coefficients
Pre-kernel
:
10
Kernel
:
same as With DSP Extensions
Post-kernel
:
3+2
For odd number of coefficients
Pre-kernel
Code Size
User’s Manual
:
8
Kernel
:
same as With DSP Extensions
Post-kernel
:
3+2
110 bytes
4-114
V 1.2, 2000-01
Function Descriptions
FirBlk_16
FIR Filter, Normal, Arbitrary number of coefficients,
Block processing
Signature
void FirBlk_16(DataS
DataS
cptrDataS
cptrDataS
int
);
Inputs
X
:
Pointer to Input-Buffer
R
:
Pointer to Output-Buffer
H
:
Circular pointer of Coeff-Buffer of
size nH
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
nX
:
Size of Input-Buffer
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
R(nX)
:
Output-Buffer
Outputs
*X,
*R,
H,
*DLY,
nX
Return
None
Description
The implementation of FIR filter uses transversal structure
(direct form). The block of inputs are processed at a time and
output for every sample is stored in the output array. The filter
operates on 16-bit real input, 16-bit coefficients and gives 16bit real output. The number of coefficients given by user is
arbitrary. Circular buffer addressing mode is used for
coefficients and delay line. Both coefficient buffer and delay
line buffer are doubleword aligned. The input buffer and the
output buffer are halfword aligned.
User’s Manual
4-115
V 1.2, 2000-01
Function Descriptions
FirBlk_16
FIR Filter, Normal, Arbitrary number of coefficients,
Block processing (cont’d)
Pseudo code
{
frac64 acc;
//Filter Result
int j,i,k;
frac16circ *aDLY=&DLY;
//ptr to Circ-ptr of Delay-Buffer
for(i=0; i<nX; i++)
{
*DLY = *X;
//Store input value in Delay-Buffer at
//the position of the oldest value
acc = 0.0;
if(nH%2 == 0)
{
// ’n’ in the comments refers current instant
//The index i,j of X(i),H(j)(in the comments) are
//valid for first loop iteration.
//For each next loop i,j should be decremented
//and incremented by 2 respectively.
for(j=0; j<nH/2; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY+k)) +
(*(H+k+1)) * (*(DLY+k+1)));
//acc += X(n)*H(0) + X(n-1)*H(1)
k=k+2;
}
}
else
{
// ’n’ in the comments refers current instant
//The index i,j of X(i),H(j)(in the comments) are
//valid for first loop iteration.
//For each next loop i,j should be decremented and
//incremented by 1 respectively.
User’s Manual
4-116
V 1.2, 2000-01
Function Descriptions
FirBlk_16
FIR Filter, Normal, Arbitrary number of coefficients,
Block processing (cont’d)
for(j=0; j<nH; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY+k)));
//acc += X(n)*H(0)
k=k+1;
}
}
DLY--;
//Set DLY.index to the oldest value
//in Delay-Buffer
aDLY=&DLY;
// store updated delay
*R++ = (frac16 sat)acc;
//Format the filter output from 48-bit
//to 16-bit saturated value
}//end of indata loop
}
Techniques
• Loop unrolling, two taps/loop if coefficients are even else
one tap/loop
• Use of packed data Load/Store
• Delay line implemented as circular buffer
• Coefficient buffer implemented as circular buffer
• Use of dual MAC instruction for even number of coefficients
and MAC instructions for odd number of coefficients
• Intermediate results stored in 64 bit register (16 guard bits)
• Instruction ordering for zero overhead Load/Store
Assumptions
• Inputs, outputs, coefficients and delay line are in 1Q15
format
• Filter order nH is not explicitly sent as an argument, instead
it is sent through the argument DLY as a size of circ-DelayBuffer
User’s Manual
4-117
V 1.2, 2000-01
Function Descriptions
FirBlk_16
FIR Filter, Normal, Arbitrary number of coefficients,
Block processing (cont’d)
Memory Note
Input-Buffer
X(0)
Output-Buffer
R(0)
aX
X(1)
R(1)
.
.
.
.
X(n)
R(n)
X(n+1)
R(n + 1)
Delay-Buffer
.
.
.
.
.
.
1Q15
X(n-nH+1) caDLY
halfword
aligned
aR
1Q15
aDLY
halfword
aligned
X(n)
X(n-1)
X(n-2)
Coeff-Buffer
.
H0
.
caH aH
H1
1Q15
.
doubleword
aligned
.
Dual MAC
(even
number of
coefficients)
MAC (odd
number of
coefficients)
.
HnH-1
1Q15
doubleword
aligned
Figure 4-29 FirBlk_16
User’s Manual
4-118
V 1.2, 2000-01
Function Descriptions
FirBlk_16
FIR Filter, Normal, Arbitrary number of coefficients,
Block processing (cont’d)
Implementation
This FIR filter routine processes a block of input values at a
time. The pointer to the input buffer is sent as an argument to
the function. The output is stored in output buffer, the starting
address of which is also sent as an argument to the function.
Implementation details are same as Fir_16, except that the
Coeff-Buffer is also circular and needs doubleword alignment.
The size of the Coeff-Buffer is equal to the filter order, i.e., the
number of coefficients. Because of circular addressing used
for Coeff-Buffer, at the end of the tap loop coeff-pointer
always points to H0, i.e., first coefficient which is needed for
next instant. An additional loop is needed to calculate the
output for every sample in the buffer. Hence, this loop is
repeated as many times as the size of the input buffer.
Example
Trilib\Example\Tasking\Filters\FIR\expFirBlk_16.c,
expFirBlk_16.cpp
Trilib\Example\GreenHills\Filters\FIR\expFirBlk_16.cpp,
expFirBlk_16.c
Trilib\Example\GNU\Filters\FIR\expFirBlk_16.c
Cycle Count
With DSP
Extensions
For even number of coefficients
Pre-loop
:
Loop
:
9


nH
nX ×  5 +  ------- – 1 × 2 + 1 + 3 
 2



+3
Post-loop
:
1+2
For odd number of coefficients
User’s Manual
Pre-loop
:
6
Loop
:
nX × { 5 + [ ( nH – 1 ) × 2 + 1 ] + 3 }
+3
Post-loop
:
1+2
4-119
V 1.2, 2000-01
Function Descriptions
FirBlk_16
FIR Filter, Normal, Arbitrary number of coefficients,
Block processing (cont’d)
Without DSP
Extensions
For even number of coefficients
Pre-loop
:
11
Loop
:
same as With DSP Extensions
Post-Loop
:
1+2
For odd number of coefficients
Pre-loop
Code Size
User’s Manual
:
8
Loop
:
same as With DSP Extensions
Post-loop
:
1+2
178 bytes
4-120
V 1.2, 2000-01
Function Descriptions
Fir_4_16
FIR Filter, Normal, Coefficients - multiple of four,
Sample processing
Signature
DataS Fir_4_16(DataS
DataS
cptrDataS
);
Inputs
X
:
Real input value
H
:
Pointer to Coeff-Buffer of size nH
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
Output
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
Return
R
:
Output value of the filter (48-bit
value converted to 16-bit with
saturation)
Description
The implementation of FIR filter uses transversal structure
(direct form). The single input is processed at a time and
output for every sample is returned. The filter operates on 16bit real input, 16-bit coefficients and gives 16-bit real output.
The number of coefficients given by the user is multiple of
four. Optimal implementation requires filter order to be
multiple of four. Circular buffer addressing mode is used for
delay line. Delay line buffer is doubleword aligned and it
should be in internal memory. Coefficient-Buffer should be
word aligned if it is in the external memory.
User’s Manual
4-121
X,
*H,
*DLY
V 1.2, 2000-01
Function Descriptions
Fir_4_16
FIR Filter, Normal, Coefficients - multiple of four,
Sample processing (cont’d)
Pseudo code
{
frac64 acc;
//Filter Result
int j,k;
frac16circ *aDLY=&DLY;
//ptr to Circ-ptr of Delay-Buffer
*DLY = X;
//Store input value in Delay-Buffer at
//the position of the oldest value
acc = 0.0;
//’n’ in the comments refers to current instant
//The index i,j of X(i),H(j)(in the comments) are valid
//for first loop iteration
//For each next loop i,j should be decremented and
//incremented by 4 respectively.
for(j=0; j<nH/4; j++)
{
acc = acc + (frac64)(*(H+k)*(*(DLY+k)) + (*(H+k+1)) * (*(DLY+k+1)));
//acc += X(n)*H(0) + X(n-1)*H(1)
acc = acc + (frac64)(*(H+k+2) * (*(DLY+k+2))+
(*(H+k+3)) * (*(DLY+k+3)));
//acc += X(n-2)*H(2) + X(n-3)*H(3)
k=k+4;
}
DLY--;
//Set DLY.index to the oldest value
//in Delay-Buffer
//store updated delay
aDLY=&DLY;
R = (frac16 sat)acc;
//Format the filter output from 48-bit
//to 16-bit saturated value
return R;
//Filter output returned
}
Techniques
User’s Manual
•
•
•
•
•
•
Loop unrolling, four taps/loop
Use of packed data Load/Store
Delay line implemented as circular buffer
Use of dual MAC instructions
Intermediate results stored in 64-bit register (16 guard bits)
Instruction ordering for zero overhead Load/Store
4-122
V 1.2, 2000-01
Function Descriptions
Fir_4_16
FIR Filter, Normal, Coefficients - multiple of four,
Sample processing (cont’d)
Assumptions
• Filter size must be multiple of 4 and minimum filter order is
eight
• Inputs, outputs, coefficients and delay line are in 1Q15
format
• Filter order nH is not explicitly sent as an argument, instead
it is sent through the argument DLY as a size of circ-DelayBuffer
• Delay-Buffer is in internal memory
Memory Note
Delay-Buffer
Coeff-Buffer
.
H0
.
aDLY
caDLY
X(n-nH + 1)
aH
H1
X
.
X(n)
.
X(n-1)
.
X(n-2)
.
.
Dual MAC
.
.
HnH-1
1Q15
1Q15
doubleword
aligned
(Must be in IntMem)
Figure 4-30 Fir_4_16
User’s Manual
4-123
V 1.2, 2000-01
Function Descriptions
Fir_4_16
FIR Filter, Normal, Coefficients - multiple of four,
Sample processing (cont’d)
Implementation
The FIR filter implemented structure is of transversal type,
which is realized by a tapped delay line.
The FIR filter routine processes one sample at a time and
returns the output of that sample. The input for which the
output is to be calculated is sent as an argument to the
function.
TriCore’s load doubleword instruction loads four delay line
values and four coefficients in one cycle. Each dual MAC
instruction performs a pair of multiplications and additions
according to the equation
acc = acc + X ( n ) ⋅ H 0 + X ( n – 1 ) ⋅ H 1
[4.43]
Thus by using two dual MACs in the tap loop, the loop count
is brought down by a factor of four. Here four taps are used
during a single pass and loop is unrolled for efficient pointer
update of delay line. Thus loop is executed (nH/4-1) times.
The filter output R(n) is 16-bit saturated equivalent of acc
when the tap loop is fully executed.
To support load doubleword instruction, coeff-buffer should
be word aligned if it is in the external memory and halfword
aligned if it is in the internal memory. For delay line, circular
addressing mode is used which helps in efficient delay
update. The size of the circular Delay buffer is equal to the
filter order, i.e., the number of coefficients. Circular buffer
needs doubleword alignment and to use load doubleword
instruction, size of the buffer should be multiple of eight bytes.
This implies that the coefficients should be multiple of four.
Delay pointer in the memory note shows updated pointer after
tap loop is over. This points to the oldest value in the DelayBuffer which is replaced by new input value.
Note: To Use load doubleword instruction for the delay line
the Delay-Buffer should be in internal memory only.
User’s Manual
4-124
V 1.2, 2000-01
Function Descriptions
Fir_4_16
FIR Filter, Normal, Coefficients - multiple of four,
Sample processing (cont’d)
Example
Trilib\Example\Tasking\Filters\FIR\expFir_4_16.c,
expFir_4_16.cpp
Trilib\Example\GreenHills\Filters\FIR\expFir_4_16.cpp,
expFir_4_16.c
Trilib\Example\GNU\Filters\FIR\expFir_4_16.c
Cycle Count
With DSP
Extensions
Pre-kernel
:
Kernel
:
7
nH
------- – 1 × 2 + 2
4
if nH > 8
nH
------- – 1 × 2 + 1
4
if nH = 8
Post-kernel
:
3+2
Without DSP
Extensions
Code Size
User’s Manual
Pre-kernel
:
7
Kernel
:
same as With DSP Extensions
Post-kernel
:
4+2
80 bytes
4-125
V 1.2, 2000-01
Function Descriptions
FirBlk_4_16
FIR Filter, Normal, Coefficients - multiple of four,
Block processing
Signature
void FirBlk_4_16(DataS
DataS
cptrDataS
cptrDataS
int
);
Inputs
X
:
Pointer to Input-Buffer
R
:
Pointer to Output-Buffer
H
:
Circular pointer of Coeff-Buffer of
size nH
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
nX
:
Size of Input-Buffer
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
R(nX)
:
Output-Buffer
Output
*X,
*R,
H,
*DLY,
nX
Return
None
Description
The implementation of FIR filter uses transversal structure
(direct form). The block of inputs are processed at a time and
output for every sample is stored in the output array. The filter
operates on 16-bit real input, 16-bit coefficients and gives 16bit real output. The number of coefficients given by user is
multiple of four. Optimal implementation requires filter order to
be multiple of four. Circular buffer addressing mode is used for
coefficients and delay line. Both coefficient buffer and delay
line buffer are doubleword aligned. Input and output buffer are
halfword aligned.
User’s Manual
4-126
V 1.2, 2000-01
Function Descriptions
FirBlk_4_16
FIR Filter, Normal, Coefficients - multiple of four,
Block processing (cont’d)
Pseudo code
{
frac64 acc;
//Filter Result
int j,i,k;
frac16circ *aDLY=&DLY;
//Ptr to Circ-ptr of Delay-Buffer
frac16circ *H;
//Circ-ptr of Coeff-Buffer
for(i=0; i<nX; i++)
{
*DLY = *X;
//Store input value in Delay-Buffer at
//the position of the oldest value
acc = 0.0;
//’n’ in the comments refers to current instant
//The index i,j of X(i),H(j)(in the comments) are
//valid for first loop iteration
//For each next loop i,j should be decremented
//and incremented by 4 resp.
for(j=0; j<nH/4; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY+k)) +
(*(H+k+1)) * (*(DLY+k+1)));
//acc += X(n)*H(0) + X(n-1)*H(1)
acc = acc + (frac64)(*(H+k+2) * (*(DLY+k+2)) +
(*(H+k+3)) * (*(DLY+k+3)));
//acc += X(n-2)*H(2) + X(n-3)*H(3)
k=k+4;
}
DLY--;
//Set DLY.index to the oldest value in Delay-Buffer
aDLY = &DLY;
//store updated delay
*R++ = (frac16 sat)acc;
//Format the filter output from 48-bit
//to 16-bit saturated value
}
}
User’s Manual
4-127
V 1.2, 2000-01
Function Descriptions
FirBlk_4_16
FIR Filter, Normal, Coefficients - multiple of four,
Block processing (cont’d)
Techniques
•
•
•
•
•
•
•
Assumptions
• Filter order is a multiple of four and minimum filter order is
eight
• Inputs, outputs, coefficients and delay line are in 1Q15
format
• Filter order nH is not explicitly sent as an argument, instead
it is sent through the argument DLY as a size of circ-DelayBuffer
• Delay-Buffer is in internal memory
User’s Manual
Loop unrolling, four taps/loop
Use of packed data Load/Store
Delay line implemented as circular buffer
Coefficient buffer implemented as circular buffer
Use of dual MAC instructions
Intermediate results stored in 64-bit register (16 guard bits)
Instruction ordering for zero overhead Load/Store
4-128
V 1.2, 2000-01
Function Descriptions
FirBlk_4_16
FIR Filter, Normal, Coefficients - multiple of four,
Block processing (cont’d)
Memory Note
Input-Buffer
X(0)
Output-Buffer
R(0)
aX
X(1)
R(1)
.
.
.
.
X(n)
R(n + 1)
.
.
.
R(n)
Delay-Buffer
X(n+1)
.
.
X(n-nH+1)
1Q15
X(n)
halfword
aligned
X(n-1)
aR
caDLY
.
aDLY
1Q15
halfword
aligned
Coeff-Buffer
X(n-2)
.
H0
.
caH
aH
H1
1Q15
.
Dual
MAC
doubleword
aligned
(Must be in IntMem)
.
.
HnH-1
1Q15
doubleword
aligned
Figure 4-31 Fir_Blk_4_16
User’s Manual
4-129
V 1.2, 2000-01
Function Descriptions
FirBlk_4_16
FIR Filter, Normal, Coefficients - multiple of four,
Block processing (cont’d)
Implementation
This FIR filter routine processes a block of input values at a
time. The pointer to the input buffer is sent as an argument to
the function. The output is stored in output buffer, the starting
address of which is also sent as an argument to the function.
Implementation details are same as Fir_4_16, except that the
Coeff-Buffer is also circular and needs doubleword alignment.
The size of the Coeff-Buffer is equal to the filter order, i.e., the
number of coefficients. Because of circular addressing used
for Coeff-Buffer, at the end of the tap loop coeff-pointer
always points to H0, i.e., first coefficient which is needed for
next instant. An additional loop is needed to calculate the
output for every sample in the buffer. Hence, this loop is
repeated as many times as the size of the input buffer.
Note: To Use load doubleword instruction for the delay line
the Delay-Buffer should be in internal memory only.
Example
Trilib\Example\Tasking\Filters\FIR\expFirBlk_4_16.c,
expFirBlk_4_16.cpp
Trilib\Example\GreenHills\Filters\FIR\expFirBlk_4_16.cpp,
expFirBlk_4_16.c
Trilib\Example\GNU\Filters\FIR\expFirBlk_4_16.c
Cycle Count
With DSP
Extensions
Pre-loop
:
Loop
:
5


nH
nX ×  5 + 2 ×  ------- – 1 + 1 + 4 
 4



+3
Post-loop
:
1+2
:
7
Without DSP
Extensions
Pre-loop
User’s Manual
4-130
V 1.2, 2000-01
Function Descriptions
FirBlk_4_16
Code Size
4.4.2
FIR Filter, Normal, Coefficients - multiple of four,
Block processing (cont’d)
Loop
:
same as With DSP Extensions
Post-loop
:
1+2
104 bytes
Symmetric FIR
FIR filters with symmetrical Finite Impulse Response are called Symmetrical FIR filters.
Such filters find use in signal processing applications such as speech processing where
linear phase response is required to avoid phase distortion.
4.4.2.1
Descriptions
The following Symmetric FIR filter functions are described.
•
•
•
•
Symmetric, Arbitrary number of coefficients, Sample processing
Symmetric, Arbitrary number of coefficients, Block processing
Symmetric, coefficients - multiple of 4, Sample processing
Symmetric, coefficients - multiple of 4, Block processing
User’s Manual
4-131
V 1.2, 2000-01
Function Descriptions
FirSym_16
FIR Filter, Symmetric, Arbitrary number of
coefficients, Sample processing
Signature
DataS FirSym_16(DataS
DataS
cptrDataS
);
Inputs
X
:
Real input value
H
:
Pointer to Coeff-Buffer of size nH/2
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
Output
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
Return
R
:
Output value of the filter (48-bit
value converted to 16-bit with
saturation)
Description
The implementation of FIR filter uses transversal structure
(direct form). A single input is processed at a time and output
for that sample is returned. The filter operates on 16-bit real
input, 16-bit coefficients and returns 16-bit real output. The
number of coefficients given by the user is arbitrary and half
of the filter order. Circular buffer addressing mode is used for
delay line. Delay line buffer is double word aligned. CoeffBuffer is halfword aligned. The Delay-Buffer is twice the size
of Coeff-Buffer.
User’s Manual
4-132
X,
*H,
*DLY
V 1.2, 2000-01
Function Descriptions
FirSym_16
FIR Filter, Symmetric, Arbitrary number of
coefficients, Sample processing (cont’d)
Pseudo code
{
frac64 acc;
//Filter Result
int j,k;
frac16circ *aDLY=&DLY1;
//ptr to Circ-ptr of Delay-Buffer
DLY2 = DLY1-1;
aDLY=&DLY2;
*DLY1 = X;
//Ptr to X(n-nH+1)
//store index to the oldest value for next instant
//Store input value in Delay-Buffer at
//the position of the oldest value for current instant
acc = 0.0;
//The index i,j,k of X1(i),X2(j),H(k)(in the comments)
//are valid for first loop iteration.
//For each next loop i,j,k should be decremented, incremented and
//incremented by 1 respectively.
//’n’ in the comments refers to current instant
for(j=0; j<nH/2; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY1+k)));
//acc += X1(n) * H(0)
acc = acc + (frac64)(*(H+k) * (*(DLY2-k)));
//acc += X2(n-nH+1) * H(0)
k=k+1;
}
DLY1=*aDLY;
//Set DLY.index to the oldest value
//in Delay-Buffer for next instant
R = (frac16 sat)acc;
//Format the filter output from 48-bit
//to 16-bit saturated value
return R;
//Filter output is returned
}
User’s Manual
4-133
V 1.2, 2000-01
Function Descriptions
FirSym_16
FIR Filter, Symmetric, Arbitrary number of
coefficients, Sample processing (cont’d)
Techniques
•
•
•
•
•
•
Assumptions
• Inputs, outputs, coefficients and delay line are in 1Q15
format
• Filter order nH is not explicitly sent as an argument, instead
it is sent through the argument DLY as a size of circ-DelayBuffer
Loop unrolling, two taps/loop
Use of packed data Load/Store
Delay line implemented as circular buffer
Use of MAC instructions
Intermediate results stored in 64-bit register (16 guard bits)
Instruction ordering for zero overhead Load/Store
Memory Note
Delay-Buffer
.
aDLY1
caDLY1
caDLY2
X(n-nH+2)
aDLY2
X(n-nH+1)
X(n)
MAC
X(n-1)
X
nH/2
.
Coeff-Buffer
.
H0
X(n-nH/2+1)
x(n-nH/2)
MAC
.
aH
H1
.
HnH/2 -1
1Q15
1Q15
doubleword
aligned
halfword
aligned
Figure 4-32 FirSym_16
User’s Manual
4-134
V 1.2, 2000-01
Function Descriptions
FirSym_16
FIR Filter, Symmetric, Arbitrary number of
coefficients, Sample processing (cont’d)
Implementation
The FIR filter implemented structure is of transversal type,
which is realized by a tapped delay line.
The FIR filter routine processes one sample at a time and
returns the output of that sample. The input for which the
output is to be calculated is sent as an argument to the
function.
TriCore’s load halfword instruction loads the one delay line
value and one coefficient in one cycle each. For delay line,
circular addressing mode is used. Two pointers are initialized
for circular delay line, one points to X(n), which is incremented
and the other points to X(n-nH+1), which is decremented to
access all the delay line values. Each pointer accesses nH/2
values.
In a symmetric FIR filter, X(n) and X(n-nH+1) get multiplied
with the same coefficient H0. This fact can be made use of to
reduce the number of loads for coefficients. So, for the first
pass in tap loop, one delay line pointer loads X(n) and the
other pointer loads X(n-nH+1) by using load halfword
instruction.
MAC instruction performs multiplication and addition. Two
MACs are used in the tap loop, which for the first pass perform
acc = acc + X ( n ) ⋅ H 0
acc = acc + X ( n – nH + 1 ) ⋅ H 0
[4.44]
Here two taps are used during a single pass and loop is
unrolled to save cycle. Thus loop is executed (nH/2-1) times.
The filter output R(n) is 16-bit saturated equivalent of acc
when the tap loop is fully executed.
As Delay-Buffer is circular, the delay line update is done
efficiently. The size of the circular Delay-Buffer is equal to the
filter order, i.e., twice the number of given coefficients.
Circular buffer needs doubleword alignment and to use load
halfword instruction, size of the buffer should be multiple of
two bytes. There is no restriction on the number of
coefficients.
User’s Manual
4-135
V 1.2, 2000-01
Function Descriptions
FirSym_16
FIR Filter, Symmetric, Arbitrary number of
coefficients, Sample processing (cont’d)
Delay pointers in the memory note show updated pointers for
the next iteration. caDLY1 points to the oldest value in the
Delay-Buffer which is replaced by new input value.
Example
Trilib\Example\Tasking\Filters\FIR\expFirSym_16.c,
expFirSym_16.cpp
Trilib\Example\GreenHills\Filters\FIR\expFirSym_16.cpp,
expFirSym_16.c
Trilib\Example\GNU\Filters\FIR\expFirSym_16.c
Cycle Count
With DSP
Extensions
Pre-kernel
:
9
Kernel
:
Post-kernel
:
4+2
:
9
nH
------- – 1 × 3 + 2
2
Without DSP
Extensions
Pre-kernel
Code Size
User’s Manual
Kernel
:
same as With DSP Extensions
Post-kernel
:
5+2
88 bytes
4-136
V 1.2, 2000-01
Function Descriptions
FirSymBlk_16
FIR Filter, Symmetric, Arbitrary number of
coefficients, Block processing
Signature
void FirSymBlk_16(DataS
DataS
DataS
cptrDataS
int
);
Inputs
X
Outputs
:
*X,
*R,
*H,
*DLY,
nX
Pointer to Input-Buffer of size nX
R
:
Pointer to Output-Buffer of size nX
H
:
Pointer to Coeff-Buffer of size nH/2
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
nX
:
Number of input samples
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
R(nX)
:
Output-Buffer
Return
None
Description
The implementation of FIR filter uses transversal structure
(direct form). A block of inputs are processed at a time and
output for every sample is stored in the output array. The filter
operates on 16-bit real input, 16-bit coefficients and gives 16bit real output. The number of coefficients given by the user is
arbitrary and half of the filter order. Circular buffer addressing
mode is used for delay line. Delay line buffer is doubleword
aligned. Coefficient, Input and output buffer are halfword
aligned. The Delay-Buffer is twice the size of Coeff-Buffer.
User’s Manual
4-137
V 1.2, 2000-01
Function Descriptions
FirSymBlk_16
FIR Filter, Symmetric, Arbitrary number of
coefficients, Block processing (cont’d)
Pseudo code
{
frac64 acc;
//Filter Result
int i,j,k;
frac16circ *aDLY=&DLY1;
//ptr to Circ-ptr of Delay-Buffer
frac16 *H0;
//Ptr to Coeff-Buffer
H0 = H;
DLY2 = DLY1-1;
aDLY = &DLY2;
*DLY1 = X;
//store coeff-buffer ptr
//Ptr to X(n-nH+1)
//store index to the oldest value of next instant
//Store input value in Delay-Buffer at
//the position of the oldest value of current instant
for(i=0; i<nX; i++)
{
acc = 0.0;
k=0;
//The index i,j,k of X1(i),X2(j),H(k)(in the comments)
//are valid for first loop iteration.
// For each next loop i,j,k should be decremented, incremented and
//incremented by 1 respectively.
//’n’ in the comments refers to current instant
for(j=0; j<nH/2; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY1+k)));
//acc += X1(n) * H(0)
acc = acc + (frac64)(*(H+k) * (*(DLY2-k)));
//acc += X2(n-nH+1) * H(0)
k=k+1;
}
DLY1 = *aDLY;
//Set DLY.index to the oldest value in Delay-Buffer
H = H0;
//initialize coeff-ptr
*R++ = (frac16 sat)acc;
//Format the filter output from 48-bit
//to 16-bit saturated value
}
}
User’s Manual
4-138
V 1.2, 2000-01
Function Descriptions
FirSymBlk_16
FIR Filter, Symmetric, Arbitrary number of
coefficients, Block processing (cont’d)
Techniques
•
•
•
•
•
•
Assumptions
• Inputs, outputs, coefficients and delay line are in 1Q15
format
• Filter order nH is not explicitly sent as an argument, instead
it is sent through the argument DLY as a size of circ-DelayBuffer
User’s Manual
Loop unrolling, two taps/loop
Use of packed data Load/Store
Delay line implemented as circular buffer
Use of MAC instructions
Intermediate results stored in 64-bit register (16 guard bits)
Instruction ordering for zero overhead Load/Store
4-139
V 1.2, 2000-01
Function Descriptions
FirSymBlk_16
FIR Filter, Symmetric, Arbitrary number of
coefficients, Block processing (cont’d)
Memory Note
Input-Buffer
X(0)
Output-Buffer
aX
R(0)
aR
X(1)
R(1)
.
.
.
.
X(n)
R(n)
X(n+1)
R(n + 1)
.
.
.
.
1Q15
1Q15
halfword
aligned
halfword
aligned
Delay-Buffer
.
aDLY1
caDLY1
X(n-nH+2)
caDLY2
X(n)
MAC
X(n-1)
nH/2
aDLY2
X(n-nH+1)
.
Coeff-Buffer
.
H0
X(n-nH/2+1)
X(n-nH/2)
MAC
.
aH
H1
.
HnH/2 -1
1Q15
1Q15
doubleword
aligned
halfword
aligned
Figure 4-33 FirSymBlk_16
User’s Manual
4-140
V 1.2, 2000-01
Function Descriptions
FirSymBlk_16
FIR Filter, Symmetric, Arbitrary number of
coefficients, Block processing (cont’d)
Implementation
This symmetric FIR filter routine processes a block of input
values at a time. The pointer to the input buffer is sent as an
argument to the function. The output is stored in output buffer,
the starting address of which is also sent as an argument to
the function.
Implementation details are same as FirSym_16, except that
the Coeff-Buffer pointer is stored for next iteration and an
additional loop is needed to calculate the output for every
sample in the buffer. Hence, this loop is repeated as many
times as the size of the input buffer.
Example
Trilib\Example\Tasking\Filters\FIR\expFirSymBlk_16.c,
expFirSymBlk_16.cpp
Trilib\Example\GreenHills\Filters\FIR
\expFirSymBlk_16.cpp, expFirSymBlk_16.c
Trilib\Example\GNU\Filters\FIR\expFirSymBlk_16.c
Cycle Count
Pre-loop
:
Loop
:
4


nH
nX ×  8 + 3 ×  ------- – 1 + 1 + 5 
 2



+3
Post-loop
Code Size
User’s Manual
:
0+2
112 bytes
4-141
V 1.2, 2000-01
Function Descriptions
FirSym_4_16
FIR Filter, Symmetric, Coefficients - multiple of four,
Sample processing
Signature
DataS FirSym_4_16(DataS
DataS
cptrDataS
);
Inputs
X
:
Real input value
H
:
Pointer to Coeff-Buffer of size nH/2
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
Output
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
Return
R
:
Output value of the filter (48-bit
value converted to 16-bit with
saturation)
Description
The implementation of FIR filter uses transversal structure
(direct form). A single input is processed at a time and output
for that sample is returned. The filter operates on 16-bit real
input, 16-bit coefficients and returns 16-bit real output. The
filter order should be a multiple of four. Therefore number of
coefficients given by the user should be even and half of the
filter order. Optimal implementation requires filter order to be
multiple of four. Circular buffer addressing mode is used for
delay line. Delay line buffer is double word aligned. Coefficient
buffer is halfword aligned. The Delay-Buffer is twice the size
of Coeff-Buffer.
User’s Manual
4-142
X,
*H,
*DLY
V 1.2, 2000-01
Function Descriptions
FirSym_4_16
FIR Filter, Symmetric, Coefficients - multiple of four,
Sample processing (cont’d)
Pseudo code
{
frac64 acc;
//Filter Result
int j,k;
frac16circ *aDLY=&DLY1;
//ptr to Circ-ptr of Delay-Buffer
DLY2 = DLY1-1;
aDLY=&DLY2;
//store index to the oldest value for next instant
DLY2 = DLY2-1;
//Ptr to X(n-nH+2)
*DLY1 = X;
//Store input value in Delay-Buffer at
//the position of the oldest value
acc = 0.0;
//The index i,j,k of X1(i),X2(j),H(k)(in the comments)
//are valid for first loop iteration.
//For each next loop i,j,k should be decremented,incremented and
//incremented by 2 resp.
//’n’ in the comments refers to current instant
for(j=0; j<nH/2; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY1+k)) +
(*(H+k+1)) * (*(DLY1+k+1)));
//acc += X1(n) * H(0) + X1(n-1) * H(1)
acc = acc + (frac64)(*(H+k) * (*(DLY2-k)) + (*(H+k+1)) *
(*(DLY2-k-1)));
//acc += X2(n-nH+1) * H(0) + X2(n-nH+2) * H(1) ||
k=k+2;
}
DLY1=*aDLY;
//Set DLY.index to the oldest value
//in Delay-Buffer for next instant
R = (frac16 sat)acc;
//Format the filter output from 48-bit
//to 16-bit saturated value
return R;
//Filter output is returned
}
User’s Manual
4-143
V 1.2, 2000-01
Function Descriptions
FirSym_4_16
FIR Filter, Symmetric, Coefficients - multiple of four,
Sample processing (cont’d)
Techniques
•
•
•
•
•
•
Assumptions
• Filter order is a multiple of four
• Inputs, outputs, coefficients and delay line are in 1Q15
format
• Filter order nH is not explicitly sent as an argument, instead
it is sent through the argument DLY as a size of circ-DelayBuffer
Loop unrolling, four taps/loop
Use of packed data Load/Store
Delay line implemented as circular buffer
Use of dual MAC instructions
Intermediate results stored in 64-bit register (16 guard bits)
Instruction ordering for zero overhead Load/Store
Memory Note
Delay-Buffer
aDLY2
caDLY2
aDLY1
caDLY1
.
X(n-nH+2)
X(n-nH+1)
X(n)
MAC
X(n-1)
X
nH
.
Coeff-Buffer
.
H0
X(n-nH/2+1)
x(n-nH/2)
MAC
.
aH
H1
.
HnH/2 -1
1Q15
1Q15
doubleword
aligned
halfword
aligned
Figure 4-34 FirSym_4_16
User’s Manual
4-144
V 1.2, 2000-01
Function Descriptions
FirSym_4_16
FIR Filter, Symmetric, Coefficients - multiple of four,
Sample processing (cont’d)
Implementation
The FIR filter implemented structure is of transversal type,
which is realized as a tapped delay line.
The FIR filter routine processes one sample at a time and
returns the output of that sample. The input for which the
output is to be calculated is sent as an argument to the
function.
TriCore’s load word instruction loads the two delay line values
and two coefficients in one cycle. For delay line, circular
addressing mode is used. Two pointers are initialized for
circular delay line, one points to X(n), which is incremented
and the other points to X(n-nH+2), which is decremented to
access all the delay line values. Each pointer accesses nH/2
values.
In a symmetric FIR filter, X(n) and X(n-nH+1) get multiplied
with the same coefficient H0. This fact can be made use of to
reduce the number of loads for coefficients. So, for the first
pass in tap loop, one delay line pointer loads X(n), X(n-1) and
the other pointer loads X(n-nH+1), X(n-nH+2) by using load
word instruction.
Dual MAC instruction performs a pair of multiplication and
additions. Two dual MACs are used in the tap loop, which for
the first pass perform
acc = acc + X ( n ) ⋅ H 0 + X ( n – 1 ) ⋅ H 1
acc = acc + X ( n – nH + 1 ) ⋅ H 0 + X ( n – nH + 2 ) ⋅ H 1
[4.45]
Here four taps are used during a single pass and loop is
unrolled to save cycle. Thus loop is executed (nH/4-1) times.
The filter output R(n) is 16-bit saturated equivalent of acc
when the tap loop is executed fully.
User’s Manual
4-145
V 1.2, 2000-01
Function Descriptions
FirSym_4_16
FIR Filter, Symmetric, Coefficients - multiple of four,
Sample processing (cont’d)
As Delay-Buffer is circular, the delay line update is done
efficiently. The size of the circular Delay-Buffer is equal to the
filter order, i.e., twice the number of given coefficients.
Circular buffer needs doubleword alignment and to use load
word instruction, size of the buffer should be multiple of four
bytes. The number of coefficients given should be even,
which means the filter order is a multiple of four.
Delay pointers in the memory note show updated pointers for
the next iteration. caDLY1 points to the oldest value in the
Delay-Buffer which is replaced by new input value.
Example
Trilib\Example\Tasking\Filters\FIR\expFirSym_4_16.c,
expFirSym_4_16.cpp
Trilib\Example\GreenHills\Filters\FIR\expFirSym_4_16.cpp
, expFirSym_4_16.c
Trilib\Example\GNU\Filters\FIR\expFirSym_4_16.c
Cycle Count
With DSP
Extensions
Pre-kernel
:
Kernel
:
10
nH
------- – 1 × 3 + 2
4
if nH > 8
nH
------- – 1 × 3 + 1
4
if nH = 8
Post-Kernel
:
4+2
Pre-kernel
:
10
Kernel
:
same as With DSP Extensions
Without DSP
Extensions
User’s Manual
4-146
V 1.2, 2000-01
Function Descriptions
FirSym_4_16
FIR Filter, Symmetric, Coefficients - multiple of four,
Sample processing (cont’d)
Post-kernel
Code Size
User’s Manual
:
5+2
92 bytes
4-147
V 1.2, 2000-01
Function Descriptions
FirSymBlk_4_16
FIR Filter, Symmetric, Coefficients - multiple of 4,
Block processing
Signature
void FirSymBlk_4_16(DataS
DataS
DataS
cptrDataS
int
);
Inputs
X
:
Pointer to Input-Buffer
R
:
Pointer to Output-Buffer
H
:
Pointer to Coeff-Buffer of size nH/2
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
nX
:
Size of Input-Buffer
DLY
:
Updated circular buffer with index
set to the oldest value of the filter
Delay-Buffer
R
:
Output-Buffer
Output
*X,
*R,
*H,
*DLY,
nX
Return
None
Description
The implementation of FIR filter uses transversal structure
(direct form). A block of inputs are processed at a time and
output for every sample is stored in the output array. The filter
operates on 16-bit real input, 16-bit coefficients and gives 16bit real output. The filter order should be a multiple of four.
Therefore the number of coefficients given by the user should
be even and half of the filter order. Optimal implementation
requires filter order to be multiple of four. Circular buffer
addressing mode is used for delay line. Delay line buffer is
doubleword aligned. Input, output and coefficient buffer are
halfword aligned. The Delay-Buffer is twice the size of CoeffBuffer.
User’s Manual
4-148
V 1.2, 2000-01
Function Descriptions
FirSymBlk_4_16
FIR Filter, Symmetric, Coefficients - multiple of 4,
Block processing (cont’d)
Pseudo code
{
frac64 acc;
//Filter Result
int i,j,k;
frac16circ *aDLY=&DLY1;
//ptr to Circ-ptr of Delay-Buffer
frac16 *H0;
//Ptr to Coeff-Buffer
H0 = H;
DLY2 = DLY1-1;
aDLY = &DLY2;
//store index to the oldest value for next instant
DLY2 = DLY2-1;
//Ptr to X(n-nH+2)
*DLY1 = X;
//Store input value in Delay-Buffer at
//the position of the oldest value
for(i=0; i<nX; i++)
{
acc = 0.0;
k=0;
//The index i,j,k of X1(i),X2(j),H(k)(in the comments)
//are valid for first loop iteration.
//For each next loop i,j,k should be decremented, incremented and
//incremented by 2 respectively.
//’n’ in the comments refers to current instant
for(j=0; j<nH/2; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY1+k)) +
(*(H+k+1)) * (*(DLY1+k+1)));
//acc += X1(n) * H(0) + X1(n-1) * H(1)
acc = acc + (frac64)(*(H+k) * (*(DLY2-k)) +
(*(H+k+1)) * (*(DLY2-k-1)));
//acc += X2(n-nH+1) * H(0) + X2(n-nH+2) * H(1) ||
k=k+2;
}
DLY1 = *aDLY;
//Set DLY.index to the oldest value in Delay-Buffer
H = H0;
*R++ = (frac16 sat)acc;
//Format the filter output from 48-bit
//to 16-bit saturated value
}
}
User’s Manual
4-149
V 1.2, 2000-01
Function Descriptions
FirSymBlk_4_16
FIR Filter, Symmetric, Coefficients - multiple of 4,
Block processing (cont’d)
Techniques
•
•
•
•
•
•
Assumptions
• Inputs, outputs, coefficients and delay line are in 1Q15
format
• Filter order nH is not explicitly sent as an argument, instead
it is sent through the argument DLY as a size of circ-DelayBuffer
User’s Manual
Loop unrolling, four taps/loop
Use of packed data Load/Store
Delay line implemented as circular buffer
Use of dual MAC instructions
Intermediate results stored in 64-bit register (16 guard bits)
Instruction ordering for zero overhead Load/Store
4-150
V 1.2, 2000-01
Function Descriptions
FirSymBlk_4_16
FIR Filter, Symmetric, Coefficients - multiple of 4,
Block processing (cont’d)
Memory Note
Input-Buffer
X(0)
Output-Buffer
aX
aR
R(0)
X(1)
R(1)
.
.
.
.
X(n)
R(n)
X(n+1)
R(n + 1)
.
.
.
.
1Q15
1Q15
halfword
aligned
aDLY2
aDLY1
halfword
aligned
Delay-Buffer
caDLY2
caDLY1
.
X(n-nH+2)
Dual
MAC
X(n-nH+1)
X(n)
X(n-1)
nH/2
.
Coeff-Buffer
.
H0
X(n-nH/2+1)
X(n-nH/2)
.
1Q15
doubleword
aligned
aH
H1
.
Dual
MAC
HnH/2 -1
1Q15
halfword
aligned
Figure 4-35 FirSymBlk_4_16
User’s Manual
4-151
V 1.2, 2000-01
Function Descriptions
FirSymBlk_4_16
FIR Filter, Symmetric, Coefficients - multiple of 4,
Block processing (cont’d)
Implementation
This symmetric FIR filter routine processes a block of input
values at a time. The pointer to the input buffer is sent as an
argument to the function. The output is stored in output buffer,
the starting address of which is also sent as an argument to
the function.
Implementation details are same as FirSym_4_16, except
that the Coeff-Buffer pointer is stored for next iteration and an
additional loop is needed to calculate the output for every
sample in the buffer. Hence, this loop is repeated as many
times as the size of the input buffer.
Example
Trilib\Example\Tasking\Filters\FIR\expFirSymBlk_4_16.c,
expFirSymBlk_4_16.cpp
Trilib\Example\GreenHills\Filters\FIR\
expFirSymBlk_4_16.cpp, expFirSymBlk_4_16.c
Trilib\Example\GNU\Filters\FIR\expFirSymBlk_4_16.c
Cycle Count
Pre-kernel
:
4
Kernel
:


nH
nX ×  9 + 3 ×  ------- – 1 + 1 + 5 
 4



+ 1+2
Post-kernel
Code Size
4.4.3
:
0+2
116 bytes
Multirate Filters
Discrete time systems with unequal sampling rates at various parts of the system are
called Multirate Systems. For sampling rate alterations, the basic sampling rate
alteration devices are invariably employed together with lowpass digital filters. Filters
having different sampling rates at input and output of filter are called Multirate Filters.
The two types of multirate filtering processes are Decimation filtering and Interpolation
filtering.
User’s Manual
4-152
V 1.2, 2000-01
Function Descriptions
4.4.3.1
Decimating Filters
Decimation is equivalent to down sampling a discrete-time signal. It is used to eliminate
redundant data, allowing more information to be stored, processed or transmitted in the
same amount of data.
Decimator or down sampler reduces the sampling rate by a factor of integer M.
X[n]=Xa(nT)
M
y[n]=Xa(nMT)
F’T=FT/M=1/T’
FT=1/T
Figure 4-36 Decimation/down Sampling Illustration
The sampling rate of a critically sampled discrete time signal with a spectrum occupying
the full Nyquist range cannot be reduced any further since such a reduction will introduce
aliasing. Hence the bandwidth of a critically sampled signal must first be reduced by
lowpass filtering before its sampling rate is reduced by a down sampler. The decimation
algorithm can be implemented using FIR or IIR filter structure. But generally, FIR is used.
The overall system comprising of a lowpass filter followed by a down sampler ahead of
a lowpass FIR filter is called decimator or decimating FIR. Such a filter would give an
output for every Mth input.
The decimating FIR filter is given by
N–1
y(m) =
∑
h ( K )x ( Mm – K )
[4.46]
K=0
V[n]
X[n]
H(Z)
M
y[n]
Figure 4-37 Decimation Filter Block Diagram
4.4.3.2
Interpolating FIR Filters
Interpolation increases the sample rate of a signal inserting zeros between the samples
of input data. In practice, the zero-valued samples inserted by the up sampler are
replaced with appropriate non-zero values using some type of interpolation process in
User’s Manual
4-153
V 1.2, 2000-01
Function Descriptions
order that the new higher rate sequence be useful. This interpolation can be done by
digital lowpass filtering.
X[n]=Xa(nT)
y[n]=Xa(n/LT)
L
F’T=FT.L=1/T’
FT=1/T
Figure 4-38 Interpolation/Down Sampling Illustration
The system comprising of up sampler followed by FIR lowpass filter which is used to
remove the unwanted images in the spectra of up sampled signal is called Interpolating
FIR filter.
Xin[n]
X[n]
L
H(Z)
y[n]
Figure 4-39 Interpolation Filter Block Diagram
The rate expander inserts If-1 zero valued samples after each input sample. The
resulting samples Xin[n] are lowpass filtered to produce output y(n), a smooth and anti
imaged version of Xin[n]. The transfer function of interpolator H(k) incorporates a gain of
1/If because the If-1 zeros inserted by the rate expander cause the energy of each input
to be spread over If output samples. The lowpass filter of interpolator uses a direct form
FIR filter structure for computational efficiency. Output of an FIR filter is given by
N–1
y[n] =
∑ h ( k )Xin [ n – k ]
[4.47]
k=0
where,
N-1
:
the number of filter coefficients (taps)
Xin[n-k]
:
the rate expanded version of the input X[n]
User’s Manual
4-154
V 1.2, 2000-01
Function Descriptions
X[n] is related to Xin[n-k] by
 X ( ( n – k ) ⁄ If )
X in [ n – k ] = 
for (n-k)=0, ± If ,±2If…
0

Otherwise
4.4.3.3
Description
The following Multirate FIR filters are described.
• Decimation FIR
• Interpolation FIR
User’s Manual
4-155
V 1.2, 2000-01
Function Descriptions
FirDec_16
Decimation FIR Filter
Signature
void FirDec_16(DataS
DataS
cptrDataS
cptrDataS
int
int
);
Inputs
X
:
Pointer to Input-Buffer
R
:
Pointer to Output-Buffer
H
:
Circular pointer of Coeff-Buffer of
size nH
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH
Without DSP Extension - Pointer to
Circ-Struct
(nH)
:
Transferred as a part of Circular
Pointer data type in a DLY
parameter
nX
:
Size of Input-Buffer
Df
:
Decimation length
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
R(nX)
:
Output-Buffer
Outputs
*X,
*R,
H,
*DLY,
nX,
Df
Return
None
Description
The implementation of Decimation FIR filter uses transversal
structure (direct form). A block of inputs are processed at a
time. The filter operates on 16-bit real input, 16-bit coefficients
and gives 16-bit real output. Number of coefficients is
arbitrary. If nX/Df is not an integer, the trailing samples are
lost. Circular buffer addressing mode is used for coefficients
and delay line. Both coefficient buffer and Delay-Buffer are
doubleword aligned. Input and output buffers are halfword
aligned.
User’s Manual
4-156
V 1.2, 2000-01
Function Descriptions
FirDec_16
Decimation FIR Filter
Pseudo code
{
frac64 acc;
//Filter result
int j,i,k;
frac16circ *adly=&DLY;
//Ptr to Circ-ptr of Delay-Buffer
//macro
macro FirDec EV_Coef, EV_Coef_Odd_Df
{
if EV_Coef==TRUE
{
//FIR filtering
for(i=0; i<nX; i++)
{
*DLY = *X++;
//Store input value in Delay-Buffer at
//the position of the oldest value
acc = 0.0;
// ’n’ in the comments refers current instant
//The index i,j of X(i),H(j)(in the comments) are
//valid for first loop iteration.
//For each next loop i,j should be decremented
//and incremented by 2 respectively.
for(j=0; j<nH/2; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY+k)) +
(*(H+k+1)) * (*(DLY+k+1)));
//acc += X(n)*H(0) + X(n-1)*H(1)
k=k+2;
}
DLY--;
//(Df-1) values loaded into delay buffer before next output
//calculation
if (EV_Coef_Odd_Df==TRUE)
{
for(i=0;i<(Df-1)/2;i++)
{
*DLY-- = *X++;
*DLY-- = *X++;
}
}
else
{
User’s Manual
4-157
V 1.2, 2000-01
Function Descriptions
FirDec_16
Decimation FIR Filter
for(i=0;i<Df-1;i++)
{
*DLY-- = *X++;
}
else
{
// ’n’ in the comments refers to current instant
//The index i,j of X(i),H(j)(in the comments) are
//valid for first loop iteration.
//For each next loop i,j should be decremented and
//incremented by 1 respectively.
for(j=0; j<nH; j++)
{
acc = acc + (frac64)(*(H+k) * (*(DLY+k)));
//acc += X(n)*H(0)
k=k+1;
}
DLY--;
//(Df-1) values loaded into delay buffer before next output
//calculation
for(i=0;i<Df-1;i++)
{
*DLY-- = *X++;
}
}
}//End of Macro
FirDec_16:
{
nR = nX/Df;
if (nH%2 == 0)
{
if (Df%2 != 0)
{
FirDec TRUE, TRUE;
}
FirDec TRUE, FALSE;
}
else
{
FirDec FALSE, FALSE;
}
}
}
User’s Manual
4-158
V 1.2, 2000-01
Function Descriptions
FirDec_16
Decimation FIR Filter
Techniques
• Loop unrolling, two taps/loop if coefficients are even else
one tap/loop
• Use of packed data Load/Store
• Delay line implemented as circular buffer
• Coefficient buffer implemented as circular buffer
• Intermediate results stored in 64-bit register
• Instruction ordering for zero overhead Load/Store
Assumptions
• Inputs, outputs, coefficients and delay line are in 1Q15
format
• Filter order nH is not explicitly sent as an argument, instead
it is sent through the argument DLY as a size of circ-DelayBuffer
User’s Manual
4-159
V 1.2, 2000-01
Function Descriptions
FirDec_16
Decimation FIR Filter
Memory Note
Input-Buffer
X(0)
Output-Buffer
R(0)
R(1)
.
.
.
.
.
R(nX/Df - 1)
aX
X(1)
.
.
X(n)
X(n+1)
.
X(nX)
Delay-Buffer
.
.
X(n-nH+1)
1Q15
X(n)
halfword
aligned
X(n-1)
caDLY
aR
1Q15
aDLY
halfword
aligned
X(n-2)
Coeff-Buffer
.
H0
.
caH
aH
H1
1Q15
.
.
doubleword
aligned
HIn-1
HIn
.
HnH-1
1Q15
doubleword
aligned
Figure 4-40 FirDec_16
User’s Manual
4-160
V 1.2, 2000-01
Function Descriptions
FirDec_16
Decimation FIR Filter
Implementation
Decimation FIR filter is implemented with Transversal
structure which is realized by a tapped delay line. This
Decimation FIR filter routine processes a block of input values
at a time. The pointer to the input buffer is sent as an
argument to the function. The output is stored in output buffer,
the starting address of which is also sent as an argument to
the function.
Both Coeff-Buffer and data buffer are circular and need
doubleword alignment. The size of Coeff-Buffer and DelayBuffer are equal to filter order, i.e., the number of coefficients.
The size of output buffer is nX/Df as there will be an output
only for every Dfth input. A macro is used for performing the
decimating FIR filtering. The macro is called with two
arguments, EV_Coef, EV_Coef_Odd_Df.
If the number of coefficients is even (EV_Coef = TRUE)
TriCore’s load word instruction loads the two delay line values
and two coefficients in one cycle. Dual MAC instruction
performs a pair of multiplications and additions according to
the equation
acc = acc + X ( n ) ⋅ H 0 + X ( n – 1 ) ⋅ H 1
[4.48]
By using a dual MAC in the tap loop, the loop count is brought
down by a factor of two. Here two taps are used during a
single pass and loop is unrolled for efficient pointer update of
delay line. Thus loop is executed (nH/2-1) times.
In case of odd number of coefficients TriCore’s load halfword
instruction loads one delay line value and one coefficient in
one cycle. MAC instruction performs one multiplication and
one addition according to the equation
acc = acc + X ( n ) ⋅ H 0
[4.49]
By using a MAC in the tap loop, the loop count remains nH.
Only one tap is used during a single pass and loop is unrolled
for efficient pointer update of delay line. Thus loop is executed
(nH-1) times.
For decimation, after each FIR output calculation the delay
line has to be updated by (Df-1) inputs for which output will not
be calculated.
User’s Manual
4-161
V 1.2, 2000-01
Function Descriptions
FirDec_16
Decimation FIR Filter
If the number of coefficients is even and Df is odd,
(EV_Coef_Odd_Df = TRUE) then the updation of delay line
can be done using TriCore’s load word instructions thereby
reducing the loop count for the decimation loop by a factor of
two else the load halfword instruction is used and the loop is
executed (Df-1) times.
Thus the implementation is most optimal for the case of even
coefficient and odd Df.
Example
Trilib\Example\Tasking\Filters\FIR\expFirDec_16.c,
expFirDec_16.cpp
Trilib\Example\GreenHills\Filters\FIR\expFirDec_16.cpp,
expFirDec_16.c
Trilib\Example\GNU\Filters\FIR\expFirDec_16.c
Cycle Count
For Macro FirDec
Mcall (TRUE,TRUE)
Pre-loop
:
Loop
:
3
nX
nH
------- × 5 +  ------- – 1 2 + 5
 2

Df
+ ( ( Df – 1 ) ⁄ 2 )3 + 3 ] + 2
Post-loop
:
2
Pre-loop
:
Loop
:
3
nX
nH
------- × 5 +  ------- – 1 2 + 5 + Df ( 2 )
 2

Df
Post-loop
:
2
:
2
Mcall
(TRUE,FALSE)
+3 ] + 2
Mcall
(TRUE,FALSE)
Pre-loop
User’s Manual
4-162
V 1.2, 2000-01
Function Descriptions
FirDec_16
Decimation FIR Filter
Loop
:
nX
------- × [ 5 + ( nH – 1 )2 + 5 + Df ( 2 )
Df
+3 ] + 2
Post-loop
:
2
where integer part of nX/Df is considered. The number of
cycles taken by the Loop should be reduced by nX/Df if either
the tap loop or the decimation loop gets executed only once.
If both get executed only once then the total reduction in
number of cycles taken by the loop is 2(nX/Df) for all the
cases.
For FirDec_16
With DSP
Extensions
Even nH and odd Df
31 + Mcall ( TRUE, TRUE ) + 2 + 2
Even nH and even Df
27 + Mcall ( TRUE, FALSE ) + 2 + 2
Odd nH
28 + Mcall ( FASLE, FALSE ) + 2 + 2
where Mcall (X,Y) is the number of cycles taken by the macro
when the arguments passed to it are X and Y.
User’s Manual
4-163
V 1.2, 2000-01
Function Descriptions
Without DSP
Extensions
Even nH and odd Df
33 + Mcall ( TRUE, TRUE ) + 2 + 2
Even nH and even Df
29 + Mcall ( TRUE, FALSE ) + 2 + 2
Odd nH
30 + Mcall ( FALSE, FALSE ) + 2 + 2
where Mcall (X,Y) is the number of cycles taken by the macro
when the arguments passed to it are X and Y.
Code Size
User’s Manual
308 bytes
4-164
V 1.2, 2000-01
Function Descriptions
FirInter_16
Interpolation FIR Filter
Signature
void FirInter_16(DataS
DataS
cptrDataS
cptrDataS
int
int
);
Inputs
X
:
Pointer to Input-Buffer
R
:
Pointer to Output-Buffer
H
:
Circular pointer of Coeff-Buffer of
size nH
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH
Without DSP Extension - Pointer to
Circ-Struct
(nH)
:
Transferred as a part of Circular
Pointer data type in a DLY
parameter
nX
:
Size of Input-Buffer
Outputs
*X,
*R,
H,
*DLY,
nX,
If
If
:
Interpolation length
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
R(nX)
:
Output-Buffer
Return
None
Description
The implementation of Interpolation FIR filter uses transversal
structure (direct form). The block of inputs are processed at a
time and output for every sample is stored in the output array.
The filter operates on 16-bit real input, 16-bit coefficients and
gives 16-bit real output. The number of coefficients given by
user are arbitrary, but nX/If must be an integer. Circular buffer
addressing mode is used for coefficients and delay line. Both
coefficient buffer and delay line buffer are doubleword
aligned. Input and output buffer are halfword aligned.
User’s Manual
4-165
V 1.2, 2000-01
Function Descriptions
FirInter_16
Interpolation FIR Filter (cont’d)
Pseudo code
{
frac64 acc;
//Filter result
int i,j,k,l;
frac16 circ*aDLY=DLY
//Ptr to Circ-Ptr of Delay-Buffer
if ((nH/If)%2 == 0)
{
for (i=0;i<nX;i++)
{
*DLY=*X
//store input value in Delay-Buffer at the
//position of the oldest value
acc = 0.0;
l = 0;
for (j=0;j<If;j++)
{
// ’n’ in the comments refers current instant
//The index i,j of X(i),H(j)(in the comments) are
//valid for first loop iteration.
//For each next loop i,j should be decremented and
//incremented by 1 respectively.
for (k=0;k<nH/2If;k++)
{
m = 0;
acc = acc + (frac64)(*(H+l+m)*(*DLY+k)) + (*(H+l+m+1)*
(*(DLY+k+1)));
//acc = X(n)*H(0)+X(n-1)*H(If)
m = m + If;
k = k + 2;
}//(nH/2If) loop
l++;
*R++ = (frac16 sat)acc;
//format the filter output from 48-bit to 16-bit
//saturated value
}//(If) loop
DLY--;
}//nX loop
}//If
else
{
User’s Manual
4-166
V 1.2, 2000-01
Function Descriptions
FirInter_16
Interpolation FIR Filter (cont’d)
for (i=0;i<nX;i++)
{
*DLY=*X
//store input value in Delay-Buffer at the
//position of the oldest value
acc = 0.0;
l = 0;
for (j=0;j<If;j++)
{
// ’n’ in the comments refers current instant
//The index i,j of X(i),H(j)(in the comments) are
//valid for first loop iteration.
//For each next loop i,j should be decremented and
//incremented by 1 respectively.
for (k=0;k<nH/If;k++)
{
m = 0;
acc = acc + (frac64)(*(H+l+m)*(*DLY+k))
//acc = X(n)*H(0)+X(n-1)*H(If)
m = m + If;
k = k + 1;
}//(nH/If) loop
l++;
*R++ = (frac16 sat)acc;
//format the filter output from 48-bit to 16-bit
//saturated value
}//(If) loop
DLY--;
}//nX loop
aDLY = DLY;
}//else loop
//store updated delay
}
Techniques
User’s Manual
• Loop unrolling, one tap/loop if (nH/If) is odd and two
taps/loop if even
• Use of packed data Load/Store
• Delay line implemented as circular buffer
• Coefficient buffer implemented as circular buffer
• Intermediate results stored in 64-bit register
• Instruction ordering for zero overhead Load/Store
4-167
V 1.2, 2000-01
Function Descriptions
FirInter_16
Interpolation FIR Filter (cont’d)
Assumptions
• Inputs, outputs, coefficients and delay line are in 1Q15
format
• Filter order nH is not explicitly sent as an argument, instead
it is sent through the argument DLY as a size of circ-DelayBuffer
• The size of circ-Delay-Buffer is nH/If and it should be
integer
User’s Manual
4-168
V 1.2, 2000-01
Function Descriptions
FirInter_16
Interpolation FIR Filter (cont’d)
Memory Note
Input-Buffer
X(0)
Output-Buffer
R(0)
aX
X(1)
R(1)
.
.
.
.
X(n)
Rf-1
Delay-Buffer
X(n+1)
Rf
.
.
.
.
.
X(n-nH+1)
aR
caDLY
.
aDLY
1Q15
X(n)
1Q15
halfword
aligned
X(n-1)
halfword
aligned
X(n-2)
Coeff-Buffer
.
H0
.
caH
aH
H1
1Q15
.
doubleword
aligned
.
Hf-1
Hf
.
HnH-1
1Q15
doubleword
aligned
Figure 4-41 FirInter_16
User’s Manual
4-169
V 1.2, 2000-01
Function Descriptions
FirInter_16
Interpolation FIR Filter (cont’d)
Implementation
Interpolation FIR filter implemented structure is transversal
type which is realized by a tapped delay line. This
interpolation FIR filter routine processes a block of input
values at a time. The pointer to the input buffer is sent as an
argument to the function. The output is stored in output buffer,
the starting address of which is also sent as an argument to
the function.
In Interpolation FIR both Coeff-Buffer and data-buffer are
circular and needs doubleword alignment. The size of CoeffBuffer is equal to filter order, i.e., the number of coefficients.
Implementation is different for even and odd coefficients.
Even number of coefficients:
TriCore’s load word instruction loads the two delay line values
and two coefficients in one cycle. Dual MAC instruction
performs a pair of multiplications and additions according to
the equation
acc = acc + X ( n ) ⋅ H 0 + X ( n – 1 ) ⋅ H If
[4.50]
By using a dual MAC in the tap loop, the loop count is brought
down by a factor of two. This tap loop which is innermost loop,
is executed (nX/2If-1) times. Delay pointer is incremented
once every cycle, so that successive data are multiplied.
Coefficient pointer after each product and accumulation is
incremented by If. This is done to make the routine efficient on
the multiplication by zero in data samples are avoided by
incrementing the coefficients pointer by If.
Odd number of coefficients:
TriCore’s load halfword instruction loads one delay line value
and one coefficients in one cycle. MAC instruction performs
one multiplication and one addition according to the equation
acc = acc + X ( n ) ⋅ H 0
User’s Manual
4-170
[4.51]
V 1.2, 2000-01
Function Descriptions
FirInter_16
Interpolation FIR Filter (cont’d)
This tap loop which is innermost loop turns (nX/If-1) times.
Delay pointer is incremented once every cycle, so that
successive data are multiplied. Coefficient pointer after each
product and accumulation is incremented by If. This is done to
make the routine efficient, as the multiplication by zeros in
data samples are avoided by incrementing the coefficients
pointer by If.
In data loop runs nX times. Delay pointer points to the oldest
data and coefficient pointer to beginning of Coeff-Buffer.
Interpolation loop runs If times. Delay pointer points to the
new data which is loaded and coefficient pointer points to one
more than what it has pointed during last iteration.
Example
Trilib\Example\Tasking\Filters\FIR\expFirInter_16.c,
expFirInter_16.cpp
Trilib\Example\GreenHills\Filters\FIR\expFirInter_16.cpp,
expFirInter_16.c
Trilib\Example\GNU\Filters\FIR\expFirInter_16.c
Cycle Count
With DSP
Extensions
For even number of coefficients


nH
12 + nX × 3 + If ×  11 +   ------------- – 1 × ( 5 ) + 1  + 2 + 2
2
×
If


+1+2+1+2
For odd number of coefficients


nH
7 + nX × 3 + If ×  9 +  ------- – 1 × ( 3 ) + 1  + 2 + 2
 If



+1+2+1+2
Without DSP
Extensions
For even number of coefficients


nH
14 + nX × 3 + If ×  11 +   ------------- – 1 × ( 5 ) + 1  + 2 + 2
2
×
If


+1+2+1+2
User’s Manual
4-171
V 1.2, 2000-01
Function Descriptions
FirInter_16
Interpolation FIR Filter (cont’d)
For odd number of coefficients


nH
9 + nX × 3 + If ×  9 +  ------- – 1 × ( 3 ) + 1  + 2 + 2
If


+1+2+1+2
Code Size
User’s Manual
142 bytes
4-172
V 1.2, 2000-01
Function Descriptions
4.5
IIR Filters
Infinite Impulse Response (IIR) filters have infinite duration of non-zero output values for
a given finite duration of non-zero impulse input. Infinite duration of output is due to the
feedback used in IIR filters.
Recursive structures of IIR filters make them computationally efficient but because of
feedback not all IIR structures are realizable (stable). The transfer function for the direct
form of the biquad (second order) IIR filter is given by
–1
–2
H0 + H1 ⋅ z + H2 ⋅ z
R[z]
H [ z ] = ------------ = -------------------------------------------------------------–1
–2
X[z]
1 – ( H3 ⋅ z ) – ( H 4 ⋅ z )
[4.52]
where H3, H4 correspond to the poles and H0, H1, H2 correspond to the zeroes of the
filter.
The equivalent difference equation is
R ( n ) = H0 ⋅ X ( n ) + H1 ⋅ X ( n – 1 ) + H2 ⋅ X ( n – 2 )
+ H3 ⋅ R ( n – 1 ) + H4 ⋅ R ( n – 2 )
[4.53]
where, X(n) is the nth input and R(n) is the corresponding output.
The direct form is not commonly used in IIR filter design. In the case of a linear shiftinvariant system, the overall input-output relationship of a cascade is independent of the
order in which systems are cascaded. This property suggests a second direct form
realization. Therefore, another form called Canonical form (also called direct form II)
which uses half the number of delay stages and thereby less memory, is used for the
implementation. All the IIR filters in this DSP Library have been implemented in this form.
User’s Manual
4-173
V 1.2, 2000-01
Function Descriptions
The block diagram for a biquad (second order) filter in canonical form is as follows.
X[n]
H0
W 1[n]
+
+
R[n]
Z-1
+
H3
H1
+
W 1[n-1]
Z-1
H4
H2
W 1[n-2]
Figure 4-42 Canonical Form (Direct Form II) Second-order Section
Equation [4.52] can be broken into two parts in terms of zeroes and poles of transfer
function as
W ( n ) = X ( n ) + H3 ⋅ W ( n – 1 ) + H4 ⋅ W ( n – 2 )
R ( n ) = H0 ⋅ W ( n ) + H1 ⋅ W ( n – 1 ) + H 2 ⋅ W ( n – 2 )
[4.54]
From the figure, it is clear that the first part of this equation corresponds to poles and the
second corresponds to zeros. All the implementations of IIR filters use this equation.
The term W(n), called as the delay line, refers to the intermediate values. Any higher
order IIR filter can be constructed by cascading several biquad stages together. A
cascaded realization of a fourth order system using direct form II realization of each
biquad subsystem would be as shown in the following diagram.
User’s Manual
4-174
V 1.2, 2000-01
Function Descriptions
X(n)
W 1(n) H0
+
+
W 2(n) H5
+
Z-1
+
H3
R(n)
Z-1
H1
+
+
H8
W 1(n-1)
H6
+
W 2(n-1)
Z-1
H4
+
Z-1
H9
H2
W 1(n-2)
H7
W 2(n-2)
Figure 4-43 Cascaded Biquad IIR Filter
A Comparison between FIR and IIR filters:
• IIR filters are computationally efficient than FIR filters i.e., IIR filters require less
memory and fewer instruction when compared to FIR to implement a specific transfer
function.
• The number of necessary multiplications are least in IIR while it is most in FIR.
• IIR filters are made up of poles and zeroes. The poles give IIR filter an ability to realize
transfer functions that FIR filters cannot do.
• IIR filters are not necessarily stable, because of their recursive nature it is designer’s
task to ensure stability, while FIR filters are guaranteed to be stable.
• IIR filters can simulate prototype analog filter while FIR filters cannot.
• Probability of overflow errors is quite high in IIR filters in comparison to FIR filters.
• FIR filters are linear phase as long as H(z) = H(z-1) but all stable, realizable IIR filters
are not linear phase except for the special cases where all poles of the transfer
function lie on the unit circle.
4.5.1
Descriptions
The following IIR filter functions are described.
•
•
•
•
Coefficients Coefficients Coefficients Coefficients -
User’s Manual
multiple of four, Sample processing
multiple of four, Block processing
multiple of five, Sample processing
multiple of five, Block processing
4-175
V 1.2, 2000-01
Function Descriptions
IirBiq_4_16
IIR Filter, Coefficients - multiple of four, Sample
processing
Signature
DataS IirBiq_4_16(DataS
DataS
DataS
int
);
Inputs
X
:
Real input value
H
:
Pointer to Coeff-Buffer
X,
*H,
*DLY,
nBiq
DLY
:
Pointer to Delay-Buffer
nBiq
:
Number of Biquads
Output
DLY[2*nBiq]
:
Updated delay line is an implicit
output - Wi(n) and Wi(n-1) are
stored as Wi(n-1) and Wi(n-2) for
next sample computation
Return
R
:
Output value of the filter (48-bit
output value converted to 16-bit
with saturation).
Description
The IIR filter is implemented as a cascade of direct form II
Biquads. If number of biquads is ’n’, the filter order is 2*n. A
single sample is processed at a time and output for that
sample is returned. The filter operates on 16-bit real input, 16bit real coefficients and returns 16-bit real output. The number
of inputs is arbitrary, while the number of coefficients is
4*(number of Biquads). Length of delay line is 2*(number of
Biquads). In internal memory Coeff-Buffer can be halfword/
word aligned but in external memory it has to be halfword and
not word aligned. This ensures that after the scale value is
read and the pointer incremented, the starting address of the
coefficients is word aligned. Delay-Buffer can be halfword
aligned in both internal and external memory.
User’s Manual
4-176
V 1.2, 2000-01
Function Descriptions
IirBiq_4_16
IIR Filter, Coefficients - multiple of four, Sample
processing (cont’d)
Pseudo code
{
frac16 *W;
frac64 W64;
frac64 acc;
int i,j;
InScale = *H;
//Ptr to Delay-Buffer
//Filter result
//InScale value is read
W =DLY;
H++;
//Ptr to Coefficients
acc =(frac64) (X * InScale);
//Input scaled by InScale and stored in 19Q45 format
//Biquad loop
//’n’ (in the comments) refers to the current instant
//Indices i and j of H(i) and W_j in the comments are valid only for
//the first iteration
//For subsequent iterations they have to be incremented by 4
//and 1 respectively
for(i=0;i<nBiq;i++)
{
//W64 in 19Q45
W64 = acc + ( *(H+2) * (*W) + *(H+3) * (*(W+1)) );
//W_1(n) = X(n) + H(3) * W_1(n-1) + H(4) * W_1(n-2)
//acc in 19Q45
acc = W64 +(frac64) ( (*H) * (*W)
+ (*(H+1)) * (*(W+1)) );
//acc = acc + H(1) * W_1(n-1) + H(2) * W_1(n-2)
*(W+1) = *W;
//Update the Delay line
*W =((_frac16 _sat)W64);
//Format the delay line value to 16-bit(1Q15)
//saturated and store the updated value in memory
W = W + 2;
H = H + 4;
//Ptr to W_2(n-1)
//Ptr to H(5)
}
R = (frac16 sat)acc;
//Format the Filter output to 16-bit (1Q15)
//saturated value
return R;
//Filter Output returned
}
User’s Manual
4-177
V 1.2, 2000-01
Function Descriptions
IirBiq_4_16
IIR Filter, Coefficients - multiple of four, Sample
processing (cont’d)
Techniques
• Use of packed data Load/Store
• Use of dual MAC instructions
• Intermediate results stored in a 64-bit register (16 guard
bits)
• Filter output converted to 16-bit with saturation
• Instruction ordering provided for zero overhead Load/Store
Assumptions
• Input and output are in 1Q15 format
• Coefficients are in 2Q14 format
Memory Note
Coeff-Buffer
aH
Delay-Buffer
aW
W 1(n-1)
W 1(n-2)
.
W k(n-1)
W k(n-2)
.
W nBiq (n-1)
W nBiq (n-2)
Dual
MAC-2
Dual
MAC-1
Inscale
H(1)
H(2)
H(3)
H(4)
.
.
H(4*nBiq)
1Q15
2Q14
1Q15
Figure 4-44 IirBiq_4_16
User’s Manual
4-178
V 1.2, 2000-01
Function Descriptions
IirBiq_4_16
IIR Filter, Coefficients - multiple of four, Sample
processing (cont’d)
Implementation
The IIR filter implemented as a cascade of biquads has two
delay elements per biquad and five coefficients per biquad. In
this implementation, the fifth coefficient which scales the
current delay line value of the biquad (H0) is taken to be one.
The input is scaled by a constant value, Inscale. Hence, only
four coefficients per biquad are considered. The kth biquad
uses the coefficients H(4k-3), H(4k-2), H(4k-1) and H(4k), k =
1,2,...nBiq.
This IIR filter routine processes one sample at a time and
returns the output for that sample. The input for which the
output is to be calculated is sent as an argument to the
function.
TriCore’s load doubleword instruction loads the four
coefficients used in a biquad in one cycle. Load word
instruction loads the corresponding two delay line values
(Wk(n-1),Wk(n-2)). A dual MAC instruction performs a pair of
multiplications and additions to generate the new delay line
value for that biquad in one cycle according to the equation
W k ( n ) = Rk – 1 ( n ) + H ( 4k – 1 ) × W k ( n – 1 )
+ H ( 4K ) × W k ( n – 2 )
[4.55]
where, R0(n) = X(n).
A second Dual MAC instruction uses this delay line value and
performs another pair of multiplication and additions to
generate the output for that biquad in one cycle according to
the equation
R k [ n ] = Wk ( n ) + H ( 4k – 3 ) × Wk ( n – 1 )
+ H ( 4K – 2 ) × W k ( n – 2 )
[4.56]
where, RnBiq(n) = R(n).
Wk(n) and Wk(n-1) of the current sample become Wk(n-1) and
Wk(n-2) for the next sample computation. The Delay line is
updated accordingly in memory.
User’s Manual
4-179
V 1.2, 2000-01
Function Descriptions
IirBiq_4_16
IIR Filter, Coefficients - multiple of four, Sample
processing (cont’d)
Hence a loop executed as many times as there are biquad
stages will generate the filter output, with each pass through
it yielding the output for that biquad stage.
Load doubleword instruction of TriCore requires word
alignment in external memory. If external memory is used,
since first value in the Coeff-Buffer is Inscale, followed by the
coefficients used in each biquad stage, the address of the
Coeff-Buffer should be halfword and not word aligned. That is,
it should be a multiple of two bytes but not a multiple of four
bytes. This ensures that once Inscale (16 bit value) is read
and pointer is incremented, the address at which the
coefficients begin would be a multiple of four bytes as required
by the load double word instruction.
Example
Trilib\Example\Tasking\Filters\IIR\expIirBiq_4_16.c,
expIirBiq_4_16.cpp
Trilib\Example\GreenHills\Filters\IIR\expIirBiq_4_16.cpp,
expIirBiq_4_16.c
Trilib\Example\GNU\Filters\IIR\expIirBiq_4_16.c
Cycle Count
With DSP
Extensions
Pre-kernel
:
5
Kernel
:
[ nBiq × 4 ] + 2 if nBiq > 1
Post-kernel
:
2+2
Pre-kernel
:
5
Kernel
:
same as With DSP Extensions
[ nBiq × 4 ] + 1 if nBiq = 1
Without DSP
Extensions
User’s Manual
4-180
V 1.2, 2000-01
Function Descriptions
IirBiq_4_16
IIR Filter, Coefficients - multiple of four, Sample
processing (cont’d)
Post-kernel
Code Size
User’s Manual
:
3+2
78 bytes
4-181
V 1.2, 2000-01
Function Descriptions
IirBiqBlk_4_16
IIR Filter, Coefficients - multiple of four, Block
processing
Signature
void IirBiqBlk_4_16(DataS
DataS
DataS
DataS
int
int
);
Inputs
X
:
Pointer to Input-Buffer
R
:
Pointer to Output-Buffer
H
:
Pointer to Coeff-Buffer
Output
*X,
*R,
*H,
*DLY,
nBiq,
nX
DLY
:
Pointer to Delay-Buffer
nBiq
:
Number of Biquads
nX
:
Size of Input-Buffer
DLY[nW]
:
Updated Delay-Buffer values
R[nX]
:
Output-Buffer
Return
None
Description
The IIR filter is implemented as a cascade of direct form II
Biquads. If number of biquads is ’n’, the filter order is 2*n. A
block of input is processed at a time and output for every
sample is stored in the output buffer. The filter operates on 16bit real input, 16-bit real coefficients and returns 16-bit real
output. The number of inputs is arbitrary, while the number of
coefficients is 4*(number of Biquads). Length of delay line is
2*(number of Biquads). Coeff-Buffer can be halfword/word
aligned in internal memory, but in external memory it should
be only halfword and not word aligned. This ensures that after
Inscale value is read, the coefficient array is word aligned.
Delay-Buffer can be halfword aligned in both internal and
external memory.
User’s Manual
4-182
V 1.2, 2000-01
Function Descriptions
IirBiqBlk_4_16
IIR Filter, Coefficients - multiple of four, Block
processing (cont’d)
Pseudo code
{
frac16 *W;
frac16 *H0;
frac16 *H;
frac64 W64;
frac64 acc;
int i,j;
InScale = *H0;
H0++;
//Ptr to Delay-Buffer
//Ptr to InScale
//H0+1 - Ptr to Coefficients
//Filter result
//InScale value is read
//Ptr to coefficients
// Loop for Input-Buffer
for(j=0;j<nX;j++)
{
W =DLY;
H=H0
acc =(frac64) (*(X+j) * InScale);
//X(n)scaled by InScale and stored in 19Q45 format
//Biquad loop
//’n’ refers to the current instant
//Indices i and j of H(i) and W_j in the comments are
//valid only for the first iteration. For subsequent iterations
//they have to be incremented by 4 and 1 respectively
for(i=0;i<nBiq;i++)
{
//W64 in 19Q45
W64 = acc + ( *(H+2) * (*W) + *(H+3) * (*(W+1)) );
//W_1(n) = X(n) + H(3) * W_1(n-1) + H(4) * W_1(n-2)
//acc in 19Q45
acc = W64 +(frac64) ( (*H) * (*W)
+ (*(H+1)) * (*(W+1)) );
//acc = W64 + H(1) * W_1(n-1) + H(2) * W_1(n-2)
*(W+1) = *W; //Update the Delay line
*W =((_frac16 _sat)W64);
//Format the delay line value to 16-bit(1Q15)
//saturated and store the updated value in memory
W = W + 2;
//Ptr to W_2(n-1)
H = H + 4;
//Ptr to H(5)
}
User’s Manual
4-183
V 1.2, 2000-01
Function Descriptions
IirBiqBlk_4_16
IIR Filter, Coefficients - multiple of four, Block
processing (cont’d)
(R+j) =((_frac16 _sat)acc);
//Format the Filter output to 16-bit (1Q15)
//saturated value and store in output buffer
}
}
Techniques
• Use of packed data Load/Store
• Use of dual MAC instructions
• Intermediate results stored in a 64-bit register (16 guard
bits)
• Filter output converted to 16-bit with saturation
• Instruction ordering provided for zero overhead Load/Store
Assumptions
• Input and output are in 1Q15 format
• Coefficients are in 2Q14 format
User’s Manual
4-184
V 1.2, 2000-01
Function Descriptions
IirBiqBlk_4_16
IIR Filter, Coefficients - multiple of four, Block
processing (cont’d)
Memory Note
aX
Input-Buffer
X(0)
X(1)
.
X(n)
X(n+1)
.
.
.
aR
Coeff-Buffer
1Q15
aH
Delay-Buffer
aW
Output-Buffer
R(0)
R(1)
.
R(n)
R(n+1)
.
.
.
W 1(n-1)
W 1(n-2)
.
W k(n-1)
W k(n-2)
.
W nBiq(n-1)
W nBiq(n-2)
Dual
MAC-2
Dual
MAC-1
1Q15
Inscale 1Q15
H(1)
H(2)
H(3)
2Q14
H(4)
.
.
H(4*nBiq)
1Q15
Figure 4-45 IirBiqBlk_4_16
User’s Manual
4-185
V 1.2, 2000-01
Function Descriptions
IirBiqBlk_4_16
IIR Filter, Coefficients - multiple of four, Block
processing (cont’d)
Implementation
This IIR filter routine processes a block of input values at a
time. The pointer to the input buffer is sent as an argument to
the function. The output is stored in output buffer, the starting
address of which is also sent as an argument to the function.
Implementation details are same as that of IirBiq_4_16. The
difference is than an additional loop is needed to calculate the
output for every sample in the buffer. Hence, this loop is
repeated as many times as the size of the input buffer.
Example
Trilib\Example\Tasking\Filters\IIR\expIirBiqBlk_4_16.c,
expIirBiqBlk_4_16.cpp
Trilib\Example\GreenHills\Filters\IIR
\expIirBiqBlk_4_16.cpp, expIirBiqBlk_4_16.c
Trilib\Example\GNU\Filters\IIR\expIirBiqBlk_4_16.c
Cycle Count
Pre-loop
:
Loop
:
Post-loop
:
Code Size
User’s Manual
1
nX × { 7 + [ nBiq × 4 ] + 4 } + 1 + 2
0+2
98 bytes
4-186
V 1.2, 2000-01
Function Descriptions
IirBiq_5_16
IIR Filter, Coefficients - multiple of five, Sample
processing
Signature
DataS IirBiq_5_16(DataS
DataS
DataS
int
);
Inputs
X
:
Real input value
H
:
Pointer to Coeff-Buffer
X,
*H,
*DLY,
nBiq
DLY
:
Pointer to Delay-Buffer
nBiq
:
Number of Biquads
Output
DLY[nW]
:
Updated delay line is an implicit
output - Wi(n) and Wi(n-1) are
stored as Wi(n-1) and Wi(n-2) for
next sample computation
Return
R
:
Output value of the filter(48-bit
output value converted to 16-bit
with saturation).
Description
The IIR filter is implemented as a cascade of direct form II
Biquads. If number of biquads is ’n’, the filter order is 2*n. A
single sample is processed at a time and output for that
sample is returned. The filter operates on 16-bit real input, 16bit real coefficients and returns 16-bit real output. The number
of inputs is arbitrary, while the number of coefficients is
5*(number of Biquads). Length of delay line is 2*(number of
Biquads). Coeff-Buffer and Delay-Buffer are halfword aligned
in both internal and external memory.
User’s Manual
4-187
V 1.2, 2000-01
Function Descriptions
IirBiq_5_16
IIR Filter, Coefficients - multiple of five, Sample
processing (cont’d)
Pseudo code
{
frac16 *W;
frac16 W16;
frac64 W64;
frac64 HW64;
frac64 acc;
int i,j;
//Ptr to Delay-Buffer
//Filter result
acc =(frac64) (X); //Input stored in 19Q45 format
//Biquad loop.
//’n’ refers to the current instant
//Indices i and j of H(i) and W_j in the comments are valid only
//for the first iteration. For subsequent iterations they
// have to be incremented by 5 and 1 respectively
//
for(i=0;i<nBiq;i++)
{
//W64 in 19Q45
W64 = acc + ( *(H+3) * (*W) + *(H+4) * (*(W+1)) );
//W_1(n) = acc + H(3) * W_1(n-1) + H(4) * W_1(n-2)
W16 = (frac16 sat)W64;
//Format the delay line value W_1(n) to 16 bit
//value with saturation
//HW64 in 19Q45
HW64 = (frac64)(W16 * (*H));
//HW64 = H(0) * W_1(n)
//acc in 19Q45
acc = HW64 +(frac64) (*(H+1) * (*W)
+ (*(H+2)) * (*(W+1)));
//acc = H(0) * W_1(n)+ H(1) * W_1(n-1) + H(2) * W_1(n-2)
*(W+1) = *W;
//update the delay line
*W = W16;
//update the delay line
W = W + 2;
//Ptr to W_2(n-1)
H = H + 4;
//Ptr to H(5)
}
R =(frac16 sat)acc);
//Format the Filter output to 16-bit (1Q15)
//saturated value
}
User’s Manual
4-188
V 1.2, 2000-01
Function Descriptions
IirBiq_5_16
IIR Filter, Coefficients - multiple of five, Sample
processing (cont’d)
Techniques
• Use of packed data Load/Store
• Use of dual MAC instructions
• Intermediate results stored in a 64-bit register (16 guard
bits)
• Filter output converted to 16-bit with saturation
• Instruction ordering provided for zero overhead Load/Store
Assumptions
• Inputs and outputs are in 1Q15 format
• Coefficients are in 2Q14 format
Memory Note
Coeff-Buffer
Delay-Buffer
aW
W 1(n-1)
W 1(n-2)
.
W k(n-1)
W k(n-2)
.
W nBiq (n-1)
W nBiq (n-2)
aH
Dual
MAC-2
Dual
MAC-1
H(0)
H(1)
H(2)
H(3)
H(4)
.
.
H(5*nBiq-1)
2Q14
1Q15
Figure 4-46 IirBiq_5_16
User’s Manual
4-189
V 1.2, 2000-01
Function Descriptions
IirBiq_5_16
IIR Filter, Coefficients - multiple of five, Sample
processing (cont’d)
Implementation
In this implementation, there are five coefficients per biquad.
The kth biquad uses the coefficients H(5k-5), H(5k-4), H(5k-3),
H(5k-2) and H(5k-1), k=1,2,.....nBiq.
To perform two multiplication in one cycle using dual MAC, the
values should be packed in one register. Hence, H(5k-4),
H(5k-3) and H(5k-2), H(5k-1) are loaded in one cycle each
using load word instruction. H(5k-5) is loaded separately
using load halfword instruction.
The first dual MAC instruction performs a pair of
multiplications and additions to generate the new delay line
value for that biquad in one cycle according to the equation
W k ( n ) = Rk – 1 ( n ) + H ( 5k – 2 ) × W k ( n – 1 )
[4.57]
+ H ( 5K – 1 ) × Wk ( n – 2 )
where, R0(n) = X(n).
This delay line value is multiplied by H(5k-5).
The second dual MAC uses the above result and performs
another pair of multiplication and additions to generate the
output for that biquad according to the equation
R k [ n ] = H ( 5k – 5 ) × W k ( n ) + H ( 5k – 4 ) × W k ( n – 1 )
+ H ( 5K – 3 ) × W k ( n – 2 )
[4.58]
where, RnBiq(n) = R(n).
Wk(n) and Wk(n-1) of the current sample become Wk(n-1) and
Wk(n-2) for the next sample computation. The Delay line is
updated accordingly in memory.
Hence a loop executed as many times as there are biquad
stages will generate the filter output, with each pass through
it yielding the output for that biquad stage.
User’s Manual
4-190
V 1.2, 2000-01
Function Descriptions
IirBiq_5_16
IIR Filter, Coefficients - multiple of five, Sample
processing (cont’d)
Example
Trilib\Example\Tasking\Filters\IIR\expIirBiq_5_16.c,
expIirBiq_5_16.cpp
Trilib\Example\GreenHills\Filters\IIR\expIirBiq_5_16.cpp,
expIirBiq_5_16.c
Trilib\Example\GNU\Filters\IIR\expIirBiq_5_16.c
Cycle Count
With DSP
Extensions
Pre-kernel
:
4
Kernel
:
[ nBiq × 7 ] + 2 if nBiq > 1
Post-kernel
:
2+2
Pre-kernel
:
4
Kernel
:
same as With DSP Extensions
Post-kernel
:
3+2
[ nBiq × 7 ] + 1 if nBiq = 1
Without DSP
Extensions
Code Size
User’s Manual
92 bytes
4-191
V 1.2, 2000-01
Function Descriptions
IirBiqBlk_5_16
IIR Filter, Coefficients - multiple of five, Block
processing
Signature
void IirBiqBlk_5_16(DataS
DataS
DataS
DataS
int
int
);
Inputs
X
:
Pointer to Input-Buffer
R
:
Pointer to Output-Buffer
H
:
Pointer to Coeff-Buffer
Output
*X,
*R,
*H,
*DLY,
nBiq,
nX
DLY
:
Pointer to Delay-Buffer
nBiq
:
Number of Biquads
nX
:
Size of Input-Buffer
DLY[nW]
:
Updated Delay-Buffer values
R[nX]
:
Output-Buffer
Return
None
Description
The IIR filter is implemented as a cascade of direct form II
Biquads. A block of input is processed at a time and output for
every sample is stored in the output buffer. The filter operates
on 16-bit real input, 16-bit real coefficients and returns 16-bit
real output. The number of inputs is arbitrary, while the
number of coefficients is 5*(number of Biquads). Length of
delay line is 2*(number of biquads). Both Coeff-Buffer and
Delay-Buffer are halfword aligned.
User’s Manual
4-192
V 1.2, 2000-01
Function Descriptions
IirBiqBlk_5_16
IIR Filter, Coefficients - multiple of five, Block
processing (cont’d)
Pseudo code
{
frac16 *W;
frac16 *H0;
frac16 W16;
frac64 W64;
frac64 HW64;
frac64 acc;
int i,j;
//Ptr to Delay-Buffer
//Ptr to Coeff-Buffer
//Filter result
//Loop for Input-Buffer
for(j=0;j<nX;j++)
{
W =DLY;
H=H0;
//Ptr to coefficients initialized
acc =(frac64) *(X+j);
//X(n) stored in 19Q45 format
//Biquad loop
//’n’ refers to the current instant
//Indices i and j of H(i) and W_j in the comments are valid
//only for the first iteration. For subsequent iterations
//they have to be incremented by 5 and 1 respectively
for(i=0;i<nBiq;i++)
{
//W64 in 19Q45
W64 = acc + ( *(H+3) * (*W) + (*(H+4)) * (*(W+1)) );
//W_1(n) = acc + H(3) * W_1(n-1) + H(4) * W_1(n-2)
W16 = (frac16 sat)W64;
//Format the delay line value W_1(n) to 16 bit
//value with saturation
//HW64 in 19Q45
HW64 = (frac64)(W16 * (*H));
// HW64 = H(0) * W_1(n)
//acc in 19Q45
acc = HW64 +(frac64) ( (*(H+1) * (*W)
+ (*(H+2)) * (*(W+1)) );
//acc = H(0) * W_1(n)+ H(1) * W_1(n-1) + H(2) * W_1(n-2)
*(W+1) = *W; //update the delay line
*W = W16;
//update the delay line
W = W + 2;
//Ptr to W_2(n-1)
H = H + 4;
//Ptr to H(5)
}
User’s Manual
4-193
V 1.2, 2000-01
Function Descriptions
IirBiqBlk_5_16
IIR Filter, Coefficients - multiple of five, Block
processing (cont’d)
*(R+j) =((_frac16 _sat)acc);
//Format the Filter output to 16-bit (1Q15)
//saturated value and store in output buffer
}
}
Techniques
• Use of packed data Load/Store.
• Use of dual MAC instructions.
• Intermediate results stored in a 64-bit register(16 guard
bits)
• Filter output converted to 16-bit with saturation
• Instruction ordering provided for zero overhead Load/Store
Assumptions
• Input and output are in 1Q15 format
• Coefficients are in 2Q14 format
User’s Manual
4-194
V 1.2, 2000-01
Function Descriptions
IirBiqBlk_5_16
IIR Filter, Coefficients - multiple of five, Block
processing (cont’d)
Memory Note
aX
Input-Buffer
X(0)
X(1)
.
X(n)
X(n+1)
.
.
.
Output-Buffer
aR
R(0)
R(1)
.
R(n)
R(n+1)
.
.
.
1Q15
Coeff-Buffer
Delay-Buffer
aW
W 1(n-1)
W 1(n-2)
.
W k(n-1)
W k(n-2)
.
W nBiq (n-1)
W nBiq (n-2)
aH
Dual Dual
MAC-2 MAC-1
1Q15
H(0)
H(1)
H(2)
H(3)
H(4)
.
.
H(5*nBiq-1)
2Q14
1Q15
Figure 4-47 IirBiqBlk_5_16
User’s Manual
4-195
V 1.2, 2000-01
Function Descriptions
IirBiqBlk_5_16
IIR Filter, Coefficients - multiple of five, Block
processing (cont’d)
Implementation
This IIR filter routine processes a block of input values at a
time. The pointer to the input buffer is sent as an argument to
the function. The output is stored in output buffer, the starting
address of which is also sent as an argument to the function.
Implementation details are same as that of IirBiq_5_16. The
difference is that an additional loop is needed to calculate the
output for every sample in the buffer. Hence, this loop is
repeated as many times as the size of the input buffer.
Example
Trilib\Example\Tasking\Filters\IIR\expIirBiqBlk_5_16.c,
expIirBiqBlk_5_16.cpp
Trilib\Example\GreenHills\Filters\IIR
\expIirBiqBlk_5_16.cpp, expIirBiqBlk_5_16.c
Trilib\Example\GNU\Filters\IIR\expIirBiqBlk_5_16.c
Cycle Count
Pre-loop
:
Loop
:
Post-loop
:
Code Size
User’s Manual
1
nX × { 6 + [ nBiq × 7 ] + 4 } + 1 + 2
0+2
112 bytes
4-196
V 1.2, 2000-01
Function Descriptions
4.6
Adaptive Digital Filters
An adaptive filter adapts to changes in its input signals automatically.
Conventional linear filters are those with fixed coefficients.These can extract signals
where the signal and noise occupy fixed and separate frequency bands. Adaptive filters
are useful when there is a spectral overlap between the signal and noise or if the band
occupied by the noise is unknown or varies with time. In an adaptive filter, the filter
characteristics are variable and they adapt to changes in signal characteristics. The
coefficients of these filters vary and cannot be specified in advance.
The self-adjusting nature of adaptive filters is largely used in applications like telephone
echo cancelling, radar signal processing, equalization of communication channels etc.
Adaptive filters with the LMS (Least Mean Square) algorithm are the most popular kind.
The basic concept of an LMS adaptive filter is as follows.
X(n)
FIR
(H0, H1, ... H nH-1)
R(n)
+
D(n)
LMS Algorithm
Figure 4-48 Adaptive filter with LMS algorithm
The filter part is an N-tap filter with coefficients H0, H1,..., HnH-1, whose input signal is
X(n) and output is R(n). The difference between the actual output R(n) and a desired
output D(n), gives an error signal
Err ( n ) = D ( n ) – R ( n )
User’s Manual
[4.59]
4-197
V 1.2, 2000-01
Function Descriptions
The algorithm uses the input signal X(n) and the error signal Err(n) to adjust the filter
coefficients H0, H1,..., HnH-1, such that the difference, Err(n) is minimized on a criterion.
The LMS algorithm uses the minimum mean square error criterion
min H0, H1,..., HnH-1 E(Err2(n))
[4.60]
Where E denotes statistical expectation.The algorithm of a delayed LMS adaptive filter
is mathematically expressed as follows.
R ( n ) = Hn – 1 ( 0 ) × X ( n ) + Hn – 1 ( 1 ) × X ( n – 1) + Hn – 2 ( 2 ) × X ( n – 2 ) + …
[4.61]
+ H n – 1 ( nH – 1 ) × X ( n – nH + 1 )
H n ( k ) = H n – 1 ( k ) + X ( n – k ) × µ × Err n – 1
[4.62]
Err n = D ( n ) – R ( n )
[4.63]
where µ >0 is a constant called step-size. Note that the filter coefficients are time
varying. Hn(i) denotes the value of the i-th coefficient at time n. The algorithm has three
stages.
1. The filter output R(n) is produced.
2. The error value from previous iteration is read and coefficients are updated.
3. The expected value is read, error is calculated and stored in memory.
Step-size µ controls the convergence of the filter coefficients to the optimal (or
stationary) state. The larger the µ value, faster the convergence of the adaptation. On
the other hand, a large value of µ also leads to a large variation of Hn(i) (a bad accuracy)
and thus a large variation of the output error (a large residual error). Therefore, the
choice of µ is always a trade-off between fast convergence and high accuracy. µ must
not be larger than a certain threshold. Otherwise, the LMS algorithm diverges.
4.6.1
Delayed LMS algorithm for an adaptive real FIR
Delayed LMS algorithm for an adaptive real FIR filter can be represented by the following
mathematical equation.
nH – 1
R(n) =
∑
Hn – 1 ( k ) × X ( n – k)
[4.64]
K=0
H n ( k ) = H n – 1 ( k ) + X ( n – k ) × U × Err n – 1
[4.65]
Err n = D ( n ) – R ( n )
[4.66]
User’s Manual
4-198
V 1.2, 2000-01
Function Descriptions
where,
R(n)
:
output sample of the filter at index n
X(n)
:
input sample of the filter at index n
D(n)
:
expected output sample of the filter at index n
Hn(0),Hn(1),..
:
filter coefficients at index n
nH
:
filter order (number of coefficients)
Errn
:
error value at index n which will be used to
update coefficients at index n+1
4.6.2
Delayed LMS algorithm for an adaptive Complex FIR
Delayed LMS algorithm for an adaptive Complex FIR filter can be represented by the
following mathematical equations.
nH – 1
Rr ( n ) =
∑
[ Hr n – 1 ( k ) × Xr ( n – k ) – Hin – 1 ( k ) × Xi ( n – k ) ]
[4.67]
[ Hr n – 1 ( k ) × Xi ( n – k ) + Hin – 1 ( k ) × Xr ( n – k ) ]
[4.68]
K=0
nH – 1
Ri ( n ) =
∑
K=0
Hr n ( k ) = Hr n – 1 ( k )
+ U × ( Xr ( n – k ) × Errr n – 1 – Xi ( n – k ) × Erri n – 1 )
Hin ( k ) = Hi n – 1 ( k )
+ U × ( Xr ( n – k ) × Erri n – 1 + Xi ( n – k ) × Errrn – 1 )
Errr n = Dr ( n ) – Rr ( n )
User’s Manual
[4.69]
[4.70]
[4.71]
4-199
V 1.2, 2000-01
Function Descriptions
Erri n = Di ( n ) – Ri ( n )
[4.72]
where,
Rr(n)
:
Real output sample of the filter at index n
Ri(n)
:
Imag output sample of the filter at index n
Xr(n)
:
Real input sample of the filter at index n
Xi(n)
:
Imag input sample of the filter at index n
Dr(n)
:
Real desired output sample of the filter at index n
Di(n)
:
Imag desired output sample of the filter at index n
Hrn(0),Hrn(1),..
:
filter coefficients (real) at index n
Hin(0),Hin(1),..
:
filter coefficients (imag) at index n
nH
:
filter order (number of coefficients)
Errn
:
error value at index n which will be used to
update coefficients at index n+1
4.6.3
Descriptions
The following are adaptive FIR filter functions with 16 bit input and 16 bit coefficients.
•
•
•
•
Real, Coefficients - multiple of four, Sample processing
Real, Coefficients - multiple of four, Block processing
Complex, Coefficients - multiple of four, Sample processing
Complex, Coefficients - multiple of four, Block processing
The following are mixed adaptive FIR filter functions with 16 bit input and 32 bit
coefficients.
• Real, Coefficients - multiple of two, Sample Processing
• Real, Coefficients - multiple of two, Block Processing
User’s Manual
4-200
V 1.2, 2000-01
Function Descriptions
Dlms_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Sample Processing
Signature
DataS Dlms_4_16(DataS
DataS
cptrDataS
DataS
DataS
DataS
);
Inputs
X
:
Real Input Value
H
:
Pointer to Coeff-Buffer
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
D
:
Real expected value
Err
:
Pointer to Error value
U
:
Step size
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
Output
X,
*H,
*DLY,
D,
*Err,
U
H(nH)
:
Modified Coeff-Buffer
Return
R
:
Output value of the filter (48-bit
output value converted to 16-bit
with saturation)
Description
Delayed LMS algorithm implemented for adaptive FIR filter,
FIR filter transversal structure (direct form), Single sample
processing, 16-bit fractional input, coefficients and output
data format, Optimal implementation, requires filter order to
be multiple of four.
User’s Manual
4-201
V 1.2, 2000-01
Function Descriptions
Dlms_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Sample Processing (cont’d)
Pseudo code
{
frac64 acc;
//filter result
frac16 circ *aDLY = &DLY;
//ptr to Circ-ptr of Delay-Buffer
int j;
//Error value multiplied by step size
uerr = (frac16 rnd)(*Err * U);
//store input value in Delay-Buffer at the position
//of the oldest value
*DLY = X;
acc = 0;
k = 0;
//tap loop
//The index i and j of H_n-1(i) and X(j) in the comments are valid only
//for the first iteration.For each next iteration it has to be
//incremented and decremented by 4 respectively.
for (j=0; j<nH/4; j++)
{
acc = acc + (frac64)[(*(H+k) * (*(DLY + k))
+(*(H+k+1)) * (*(DLY+k+1))];
//acc = acc + X(n)* H_n-1(0) + X(n-1) * H_n-1(1)
acc = acc + (frac64)[(*(H+k+2) * (*(DLY+k+2))+
(*(H+k+3)) * (*(DLY+k+3));
//acc = X(n-2) * (H_n-1(2) + X(n-3) * H_n-1(3)
//coefficient update
*(H+k) = (frac16 sat rnd)((*(H+k)) + uerr * (*(DLY+k)));
*(H+k+1) = (frac16 sat rnd)((*(H+k+1)) + uerr * (*(DLY+k+1)));
*(H+k+2) = (frac16 sat rnd))(*(H+k+2) + uerr * (*(DLY+k+2)));
*(H+k+3) = (frac16 sat rnd)((*(H+k+3)) + uerr * (*(DLY+k+3)));
k = k + 4;
}
//Set DLY.index to the oldest value in Delay-Buffer
DLY--;
aDLY = *DLY;
//format the filter output from 48-bit to 16-bit saturated value
R = (frac16 sat)acc;
//calculate error for the current output
*Err = D - R;
return R;
}
User’s Manual
4-202
V 1.2, 2000-01
Function Descriptions
Dlms_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Sample Processing (cont’d)
Techniques
•
•
•
•
•
•
Assumptions
• Filter size must be multiple of four
• Inputs, outputs, coefficients are in 1Q15 format
• Delay-Buffer is in Internal Memory
Loop unrolling, four taps/loop
Use of packed data Load/Store
Delay line implemented as circular-buffer
Use of dual MAC instructions
Intermediate result stored in 64-bit register (16 guard bits)
Instruction ordering for zero overhead Load/Store
Memory Note
Delay-Buffer
.
.
X
x(n-nH+1)
caDLY
aDLY
Coeff-Buffer
x(n)
Hn-1(0)
x(n-1)
Hn-1(1)
x(n-2)
.
.
.
Dual
MAC
aH
.
.
.
1Q15
.
doubleword
aligned
Hn-1(nH-1)
(Must be in IntMem)
1Q15
Figure 4-49 Dlms_4_16
User’s Manual
4-203
V 1.2, 2000-01
Function Descriptions
Dlms_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Sample Processing (cont’d)
Coefficient
Update
Delay-Buffer
.
caDLY
Hn(0)
aH
Hn(1)
.
x(n-nH+1)
Updated
Coefficient
aDLY
.
x(n)
.
x(n-1)
.
x(n-2)
.
.
.
Hn(nH-1)
.
Error Value
1Q15
doubleword
aligned
Errn-1
1Q15
Dual Mac
Errn = D - R
Figure 4-50 Dlms_4_16 Coefficient update
User’s Manual
4-204
V 1.2, 2000-01
Function Descriptions
Dlms_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Sample Processing (cont’d)
Implementation
LMS algorithm has been used to realize an adaptive FIR filter.
The implemented filter is a Delayed LMS adaptive filter. That
is, the updation of coefficients in the current instant is done
using the error in the previous output.
The FIR filter is implemented using transversal structure and
is realized as a tapped delay line.
This routine processes one sample at a time and returns
output of that sample. The input for which the output is to be
calculated is sent as an argument to the function.
TriCore’s load doubleword instruction loads four delay line
values and four coefficients in one cycle. Dual MAC
instruction performs a pair of multiplications and additions
according to the equation
acc = acc + X ( n – k ) ⋅ H n – 1 ( k ) + X ( n – ( k – 1 ) )
⋅ Hn – 1 ( k + 1 )
[4.73]
where, k=0,1,...., nH-1.
The coefficient is updated using error from the previous
output, i.e., errn-1. As Hn-1(0) and Hn-1(1) are packed in one
register, one dual MAC instruction can be used to update both
the coefficients in one cycle. TriCore provides a dual MAC
instruction which performs packed multiplication and addition
with rounding and saturation. Hence the two coefficients are
updated at a time and packed in one register according to the
equation
H n ( k ) = H n – 1 ( k ) + X ( n – k ) ⋅ Err n – 1
H n ( k + 1 ) = H n – 1 ( k + 1 ) + X ( n – ( k – 1 ) ) ⋅ Errn – 1
[4.74]
where, k=0,1,...,nH-1.
User’s Manual
4-205
V 1.2, 2000-01
Function Descriptions
Dlms_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Sample Processing (cont’d)
Thus by using four dual MAC operations, four coefficients are
used and updated on a single pass through the loop. This
brings down the loop count by a factor of four. For the sake of
optimization one set of four dual MACs are performed outside
the loop. Hence loop is unrolled. This implies it is executed
(nH/4-1) times. For delay line, circular addressing mode is
used which helps in efficient delay update. The size of the
circular delay buffer is equal to the filter order, i.e., the number
of coefficients. Circular buffer needs doubleword alignment
and to use load doubleword instruction, size of the buffer
should be multiple of eight bytes. This implies that the
coefficients should be multiple of four.
Note: To use load doubleword instruction for delay line, the
delay-buffer should be in internal memory only.
Example
Trilib\Example\Tasking\Filters\Adaptive\expDlms_4_16.c,
expDlms_4_16.cpp
Trilib\Example\GreenHills\Filters\Adaptive
\expDlms_4_16.cpp, expDlms_4_16.c
Trilib\Example\GNU\Filters\Adaptive\expDlms_4_16.c
Cycle Count
With DSP
Extensions
Pre-kernel
:
Kernel
:
12
nH
------- – 1 × 4 + 2
4
if TapLoopCount > 1
nH
------- – 1 × 4 + 1
4
if TapLoopCount = 1
Post-kernel
User’s Manual
:
4-206
4+2
V 1.2, 2000-01
Function Descriptions
Dlms_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Sample Processing (cont’d)
Without DSP
Extensions
Pre-kernel
Code Size
User’s Manual
:
12
Kernel
:
same as With DSP Extensions
Post-kernel
:
5+2
130 bytes
4-207
V 1.2, 2000-01
Function Descriptions
DlmsBlk_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Block Processing
Signature
void DlmsBlk_4_16(DataS
DataS
cptrDataS
cptrDataS
int
DataS
DataS
DataS
);
Inputs
X
:
Pointer to Input-Buffer
R
:
Pointer to Output-Buffer
H
:
With DSP Extension - circular
pointer of Coeff-Buffer of size nH
Without DSP Extension - circStruct. Whose members are base
address, size and index
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
D
:
Pointer to Desired-Output-Buffer
Err
:
Pointer to Error value
U
:
Step size
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
Output
*X,
*R,
H,
*DLY,
nX,
*D,
*Err,
U
H(nH)
:
Modified Coeff-Buffer
R(nX)
:
Output-Buffer
Return
None
Description
Delayed LMS algorithm implemented for adaptive FIR filter,
FIR filter transversal structure (direct form), Block processing,
16-bit fractional input, coefficients and output data format,
Optimal implementation, requires filter order to be multiple of
four.
User’s Manual
4-208
V 1.2, 2000-01
Function Descriptions
DlmsBlk_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Block Processing (cont’d)
Pseudo code
{
frac64 acc;
//filter result
frac16 circ *aDLY = &DLY;
//ptr to Circ-ptr of Delay-Buffer
int i, j;
//loop for input buffer
for (i=0; i<nX; i++)
{
//Error value multiplied by step size
uerr = (frac16 rnd)(*Err * U);
//store input value in Delay-Buffer at the position
//of the oldest value
*DLY = *X++;
acc = 0;
k = 0;
//tap loop
for (j=0; j<nH/4; j++)
{
acc = acc + (frac64)[(*(H+k) * (*(DLY + k))
+(*(H+k+1)) * (*(DLY+k+1))];
//acc = acc + X(n)* H_n-1(0) + X(n-1) * H_n-1(1)
acc = acc + (frac64)[(*(H+k+2) * (*(DLY+k+2))+
(*(H+k+3)) * (*(DLY+k+3));
//acc = X(n-2) * (H_n-1(2) + X(n-3) * H_n-1(3)
//coefficient update
*(H+k) = (frac16 sat rnd)((*(H+k)) + uerr * (*(DLY+k)));
*(H+k+1) = (frac16 sat rnd)((*(H+k+1)) + uerr * (*(DLY+k+1)));
*(H+k+2) = (frac16 sat rnd))(*(H+k+2) + uerr * (*(DLY+k+2)));
*(H+k+3) = (frac16 sat rnd)((*(H+k+3)) + uerr * (*(DLY+k+3)));
k = k + 4;
}
//Set DLY.index to the oldest value in Delay-Buffer
DLY--;
aDLY = *DLY;
//format the filter output from 48-bit to 16-bit saturated value
//and store to Output-Buffer
*R = (frac16 sat)acc;
//calculate error for the current output
*Err = *D++ - *R++;
}
}
User’s Manual
4-209
V 1.2, 2000-01
Function Descriptions
DlmsBlk_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Block Processing (cont’d)
Techniques
•
•
•
•
•
•
Assumptions
• Filter size is a multiple of four
• Inputs, outputs, coefficients are in 1Q15 format
• Delay-Buffer is in internal memory
Loop unrolling, four taps/loop
Use of packed data Load/Store
Delay line implemented as circular-buffer
Use of dual MAC instructions
Intermediate result stored in 64-bit register (16 guard bits)
Instruction ordering for zero overhead Load/Store
Memory Note
Input-Buffer
X(0)
Delay-Buffer
.
aX
X(1)
.
.
x(n-nH+1)
.
x(n)
Hn-1(0)
.
x(n-1)
Hn-1(1)
X(n)
x(n-2)
.
.
.
.
.
1Q15
1Q15
halfword
aligned
caDLY
aDLY
Dual
MAC
Coeff-Buffer
caH aH
.
.
.
.
doubleword
aligned
Hn-1(nH-1)
(Must be in IntMem)
1Q15
doubleword
aligned
Figure 4-51 DlmsBlk_4_16
User’s Manual
4-210
V 1.2, 2000-01
Function Descriptions
DlmsBlk_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Block Processing (cont’d)
Coefficient
Update
Delay-Buffer
.
.
x(n-nH+1)
caDLY
Updated
Coefficient
H n(0)
aH
H n(1)
aDLY
.
x(n)
.
x(n-1)
.
x(n-2)
.
.
.
.
H n(nH-1)
Error Value
1Q15
Err n-1
doubleword
aligned
doubleword
aligned
Dual Mac
Desired Output
Buffer
D(0)
1Q15
Output-Buffer
R(0)
aD
D(1)
R(1)
.
.
.
.
.
.
.
.
D(n)
R(n)
.
.
1Q15
1Q15
aR
Errn = D(n) - R(n)
Figure 4-52 DlmsBlk_4_16 Coefficient update
User’s Manual
4-211
V 1.2, 2000-01
Function Descriptions
DlmsBlk_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Block Processing (cont’d)
Implementation
This DLMS routine processes a block of input values at a time.
The pointer to the input buffer is sent as an argument to the
function. The output is stored in output buffer, the starting
address of which is also sent as an argument to the function.
Implementation details are same as Dlms_4_16, except that
the Coeff-Buffer is also circular and needs doubleword
alignment. The advantage of using circular buffer for
coefficients is efficient pointer update. In this implementation
while exiting the tap loop, the first two coefficients are already
loaded for the next input value. This helps in saving one cycle
in the next sample processing.
Example
Trilib\Example\Tasking\Filters\Adaptive
\expDlmsBlk_4_16.c, expDlmsBlk_4_16.cpp
Trilib\Example\GreenHills\Filters\Adaptive
\expDlmsBlk_4_16.cpp, expDlmsBlk_4_16.c
Trilib\Example\GNU\Filters\Adaptive
\expDlmsBlk_4_16.c
Cycle Count
With DSP
Extensions
Pre-loop
:
Loop
:
7


nH
nX ×  8 +  ------- – 1 × 4 + 6 
 4



+1+2
Post-loop
:
1+2
Pre-loop
:
8
Loop
:
same as With DSP Extensions
Without DSP
Extensions
User’s Manual
4-212
V 1.2, 2000-01
Function Descriptions
DlmsBlk_4_16
Adaptive FIR Filter, Coefficients - multiple of four,
Block Processing (cont’d)
Post-loop
Code Size
User’s Manual
:
1+2
166 bytes
4-213
V 1.2, 2000-01
Function Descriptions
CplxDlms_4_16
Adaptive Complex Filter, Coefficients - multiple of
four, Sample Processing
Signature
DataL CplxDlms_4_16(CplxS
X,
DataS
* H,
cptrDataS
*DLYr,
cptrDataS
*DLYi,
CplxS
D,
CplxS
*Err,
DataS
U
);
Inputs
Output
Return
User’s Manual
X
:
Complex input value
H
:
Pointer to Cplx-Coeff-Buffer
DLYr
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer
(Real)
Without DSP Extension - Pointer to
Circ-Struct
DLYi
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer
(Imag)
Without DSP Extension - Pointer to
Circ-Struct
D
:
Desired complex value
Err
:
Pointer to complex Error value
U
:
Step size
DLYr
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer (Real)
DLYi
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer (Imag)
H(nH*2)
:
Modified Coeff-Buffer (Real and
Imag)
R
:
Output value of the filter (48-bit
output value converted to 16-bit
with saturation)
4-214
V 1.2, 2000-01
Function Descriptions
CplxDlms_4_16
Adaptive Complex Filter, Coefficients - multiple of
four, Sample Processing (cont’d)
Description
Delayed LMS algorithm implemented for adaptive Complex
FIR filter, FIR filter transversal structure (direct form), Single
sample processing, 16-bit fractional input, coefficients and
output data format, Optimal implementation, requires filter
order to be multiple of four.
Pseudo code
{
frac64 accr,acci; //Filter result
int i,j,k;
frac16circ *aDLYr=&DLYr, *aDLYi=&DLYi;
//Ptr to circ-ptr of real and imaginary Delay-Buffer
//Error value multiplied by step size
uerrr = (frac16 rnd)(*Errr * U);
uerri = (frac16 rnd)(*Erri * U);
//Store input value in Delay-Buffer at the position of the
//oldest value
*DLYi = Xi
//Imag part of Input is stored in delay line(imag)
*DLYr = Xr
//Real part of Input is stored in delay line(real)
accr = 0.0;
acci = 0.0;
k=0;
//tap loop
for(j=0; j<nH/2; j++)
{
//Filter result
//Imag
acci += (frac64)(*(H+k) * (*(DLYi+k)) + (*(H+k+1) * (*(DLYi+k+1)));
//acci += Xi(n) * Hr_n-1(0) + Xi(n-1) * Hr_n-1(1)
acci -= (frac64)(*(H+k+2) * (*(DLYr+k)) + (*(H+k+3) * (*(DLYr+k+1)));
//acci += Xr(n) * Hi_n-1(0) + Xr(n-1) * Hi_n-1(1)
//Real
accr += (frac64)(*(H+k) * (*(DLYr+k)) + (*(H+k+1) * (*(DLYr+k+1)));
//accr += Xr(n) * Hr_n-1(0) + Xr(n-1) * Hr_n-1(1)
accr -= (frac64)(*(H+k+2) *(*(DLYi+k)) + (*(H+k+3) * (*(DLYi+k+1)));
//accr -= Xi(n) * Hi_n-1(0) + Xi(n-1) * Hi_n-1(1)
User’s Manual
4-215
V 1.2, 2000-01
Function Descriptions
CplxDlms_4_16
Adaptive Complex Filter, Coefficients - multiple of
four, Sample Processing (cont’d)
//Coefficient update
//Real_i
*(H+k) = (frac16 sat rnd)(*(H+k) + (uerrr * (*(DLYr+k)));
//Hr_n(0) = Hr_n-1(0) + Xr(n) * Errr_n-1
*(H+k) = (frac16 sat rnd)(*(H+k) - (uerri * (*(DLYi+k)));
//Hr_n(0) -= Xi(n) * Erri_n-1
//Real_i+1
*(H+k+1) = (frac16 sat rnd)(*(H+k+1) + (uerrr * (*DLYr+k+1)));
//Hr_n(1) = Hr_n-1(1) + Xr(n-1) * Errr_n-1
*(H+k+1) = (frac16 sat rnd)(*(H+k+1) - (uerri * (*(DLYi+k+1)));
//Hr_n(1) -= Xi(n-1) * Erri_n-1
//Imag_i
*(H+k+2) = (frac16 sat rnd)(*(H+k+2) + (uerri * (*(DLYr+k)));
//Hi_n(0) = Hi_n-1(0) + Xr(n) * Erri_n-1
*(H+k+2) = (frac16 sat rnd)(*(H+k+2) + (uerrr * (*(DLYi+k)));
//Hi_n(0) += Xi(n) * Errr_n-1
//Imag_i+1
*(H+k+3) = (frac16 sat rnd)(*(H+k+3) + (uerri * (*(DLYr+k+1)));
//Hi_n(1) = Hi_n-1(1) + Xr(n-1) * Erri_n-1
*(H+k+3) = (frac16 sat rnd)(*(H+k+3) + (uerrr * (*(DLYi+k+1)));
//Hi_n(1) += Xi(n-1) * Errr_n-1
k=k+4;
}
//Set DLYr.index and DLYi.index to the oldest value in Delay-Buffer
*DLYr--;
*DLYi--;
aDLYr = &DLYr;
aDLYi = &DLYi;
//Format the real and imaginary parts of the filter output from
//48-bit to 16-bit saturated values and pack them in the return
//register (Rr : Ri)
RLo = (frac16 sat)acci;
RHi = (frac16 sat)accr;
//Calculate error in current output
*Err = D - R;
}
}
User’s Manual
4-216
V 1.2, 2000-01
Function Descriptions
CplxDlms_4_16
Adaptive Complex Filter, Coefficients - multiple of
four, Sample Processing (cont’d)
Techniques
•
•
•
•
•
•
Assumptions
• Filter size is a multiple of four
• Inputs, outputs, coefficients are in 1Q15 format
User’s Manual
Loop unrolling, four taps/loop
Use of packed data Load/Store
Delay line implemented as circular-buffer
Use of dual MAC instructions
Intermediate result stored in 64-bit register (16 guard bits)
Instruction ordering for zero overhead Load/Store
4-217
V 1.2, 2000-01
Function Descriptions
CplxDlms_4_16
Adaptive Complex Filter, Coefficients - multiple of
four, Sample Processing (cont’d)
Memory Note
Delay-Buffer
(Real)
.
Delay-Buffer
(Imag)
.
Xi
Xr
.
.
Xr(n-nH+1)
caDLYr aDLYr
aDLYi
caDLYi
Xi(n-nH+1)
Xr(n)
Xi(n)
Xr(n-1)
Xi(n-1)
Xr(n-2)
Xi(n-2)
.
.
.
.
1Q15
1Q15
doubleword
aligned
doubleword
aligned
Hrn-1(0)
Dual
MAC
Imag 2
Hrn-1(1)
Dual
MAC
Real 1
Hin-1(0)
Dual
MAC
Real 2
Hin-1(1)
.
.
Dual
MAC
Imag 1
Hin-1(H-2)
Hin-1(H-1)
1Q15
Figure 4-53 CplxDlms_4_16
User’s Manual
4-218
V 1.2, 2000-01
Function Descriptions
CplxDlms_4_16
Adaptive Complex Filter, Coefficients - multiple of
four, Sample Processing (cont’d)
Delay-Buffer
(Real)
.
Delay-Buffer
(Imag)
.
Coefficient
Update
.
Xr(n-nH+1)
caDLYr
.
aDLYr
aDLYi
caDLYi
Xr(n)
Xi(n-nH+1)
Xi(n)
Dual Mac
Real
Xr(n-1)
Xr(n-2)
.
Xi(n-1)
Xi(n-2)
Dual Mac
Imag
.
.
Complex Error
Value
1Q15
Errin-1
doubleword
aligned
.
1Q15
doubleword
aligned
Errrn-1
Dual Mac
Imag
Dual Mac
Real
Updated CoeffBuffer
Errrn = Dr - Rr
Errin = Di - Ri
Hr n(0)
aH
Hr n(1)
Hi n(0)
Hi n(1)
.
.
Hi n(nH-2)
Hi n(nH-1)
1Q15
halfword
aligned
Figure 4-54 CplxDlms_4_16
User’s Manual
4-219
V 1.2, 2000-01
Function Descriptions
CplxDlms_4_16
Adaptive Complex Filter, Coefficients - multiple of
four, Sample Processing (cont’d)
Implementation
Delayed LMS has been implemented for realizing an adaptive
complex FIR filter. Circular addressing mode is used for
Delay-Buffer. As the filter is complex, two delay buffers are
initialized, one for real part of input and the other for imaginary
part of the input. The real and imaginary part of the input are
separated and they replace the oldest value in the
corresponding delay buffers.
To make use of the dual MAC feature of TriCore, coefficients
are arranged in a special way as shown in the memory note.
Real parts of a pair of coefficients are packed in a register
using load word instruction. The corresponding imaginary
parts are packed into another register.
A pair of real part of input and a pair of imaginary part of input
are also packed in two registers in one cycle each by using the
load word instruction.
The complex multiplication requires four multiplications (real real, imaginary - imaginary, real - imaginary and imaginaryreal). Four dual MACs are used which perform each of the
above multiplications for a pair of inputs at a time and
accumulate the result separately for real and imaginary parts.
Hence the loop is executed nH/2 times. Similarly coefficient
updation requires four more dual MACs with rounding and
saturation. Loop unrolling is done for efficient update of delay
line. Thus tap loop is executed (nH/2-1) times. The
accumulated real and imaginary parts of the result are
formatted to 16-bit saturated value and packed into the return
register.
Example
User’s Manual
Trilib\Example\Tasking\Filters\Adaptive
\expCplxDlms_4_16.c, expCplxDlms_4_16.cpp
Trilib\Example\GreenHills\Filters\Adaptive
\expCplxDlms_4_16.cpp, expCplxDlms_4_16.c
Trilib\Example\GNU\Filters\Adaptive
\expCplxDlms_4_16.c
4-220
V 1.2, 2000-01
Function Descriptions
CplxDlms_4_16
Adaptive Complex Filter, Coefficients - multiple of
four, Sample Processing (cont’d)
Cycle Count
With DSP
Extensions
Pre-kernel
:
14
Kernel
:
Post-kernel
:
13+2
Pre-kernel
:
3
Kernel
:
same as With DSP Extensions
Post-kernel
:
13+2

 nH- – 1 + 1 + 1 
 8 ×  -----

2


Without DSP
Extensions
Code Size
User’s Manual
206 bytes
4-221
V 1.2, 2000-01
Function Descriptions
CplxDlmsBlk_4_16 Adaptive Complex Filter, Coefficients - multiple of
four, Block Processing
Signature
void CplxDlmsBlk_4_16(CplxS
CplxS
DataS
cptrDataS
cptrDataS
int
CplxS
CplxS
DataS
);
Inputs
X
Output
User’s Manual
:
*X,
*R,
*H,
*DLYr,
*DLYi,
nX,
*D,
*Err,
U
Pointer to complex Input-Buffer
R
:
Pointer to complex Output-Buffer
H
:
Pointer to Cplx-Coeff-Buffer
DLYr
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer
(Real)
Without DSP Extension - Pointer to
Circ-Struct
DLYi
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer
(Imag)
Without DSP Extension - Pointer to
Circ-Struct
nX
:
Size of complex Input-Buffer
D
:
Pointer to complex DesiredOutput-Buffer
Err
:
Pointer to complex Error value
U
:
Step size
DLYr
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer (Real)
DLYi
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer (Imag)
4-222
V 1.2, 2000-01
Function Descriptions
CplxDlmsBlk_4_16 Adaptive Complex Filter, Coefficients - multiple of
four, Block Processing (cont’d)
Return
H(nH*2)
:
Modified Coeff-Buffer (Real and
Imag)
R(nX)
:
Complex Output-Buffer
None
Description
Delayed LMS algorithm implemented for adaptive Complex
FIR filter, FIR filter transversal structure (direct form), Block
processing, 16-bit fractional input, coefficients and output data
format, Optimal implementation, requires filter order to be
multiple of four.
Pseudo code
{
frac64 accr,acci; //Filter result
int i,j,k;
frac16circ *aDLYr=&DLYr, *aDLYi=&DLYi;
//Ptr to circ-ptr of real and imaginary Delay-Buffer
for(i=0; i<nX; i++)
{
//Error value multiplied by step size
uerrr = (frac16 rnd)(*Errr * U);
uerri = (frac16 rnd)(*Erri * U);
//Store input value in Delay-Buffer at the position of the
//oldest value
*DLYi = *X++;//Imag part of Input
*DLYr = *X++;//Real part of Input
accr = 0.0;
acci = 0.0;
k=0;
//tap loop
User’s Manual
4-223
V 1.2, 2000-01
Function Descriptions
CplxDlmsBlk_4_16 Adaptive Complex Filter, Coefficients - multiple of
four, Block Processing (cont’d)
for(j=0; j<nH/2; j++)
{
//Filter result
//Imag
acci += (frac64)(*(H+k) * (*(DLYi+k))
+ (*(H+k+1) * (*(DLYi+k+1)));
//acci += Xi(n) * Hr_n-1(0) + Xi(n-1) * Hr_n-1(1)
acci -= (frac64)(*(H+k+2) * (*(DLYr+k)) +
(*(H+k+3) * (*(DLYr+k+1)));
//acci += Xr(n) * Hi_n-1(0) + Xr(n-1) * Hi_n-1(1)
//Real
accr += (frac64)(*(H+k) * (*(DLYr+k))
+ (*(H+k+1) * (*(DLYr+k+1)));
//accr += Xr(n) * Hr_n-1(0) + Xr(n-1) * Hr_n-1(1)
accr -= (frac64)(*(H+k+2) * (*(DLYi+k))
+ (*(H+k+3) * (*(DLYi+k+1)));
//accr -= Xi(n) * Hi_n-1(0) + Xi(n-1) * Hi_n-1(1)
//Coefficient update
//Real_i
*(H+k) = (frac16 sat rnd)(*(H+k) + (uerrr * (*(DLYr+k)));
//Hr_n(0) = Hr_n-1(0) + Xr(n) * Errr_n-1
*(H+k) = (frac16 sat rnd)(*(H+k) - (uerri * (*(DLYi+k)));
//Hr_n(0) -= Xi(n) * Erri_n-1
//Real_i+1
*(H+k+1) = (frac16 sat rnd)(*(H+k+1) + (uerrr * (*DLYr+k+1)));
//Hr_n(1) = Hr_n-1(1) + Xr(n-1) * Errr_n-1
*(H+k+1) = (frac16 sat rnd)(*(H+k+1) - (uerri * (*(DLYi+k+1)));
//Hr_n(1) -= Xi(n-1) * Erri_n-1
//Imag_i
*(H+k+2) = (frac16 sat rnd)(*(H+k+2) + (uerri * (*(DLYr+k)));
//Hi_n(0) = Hi_n-1(0) + Xr(n) * Erri_n-1
*(H+k+2) = (frac16 sat rnd)(*(H+k+2) + (uerrr * (*(DLYi+k)));
//Hi_n(0) += Xi(n) * Errr_n-1
//Imag_i+1
*(H+k+3) = (frac16 sat rnd)(*(H+k+3) + (uerri * (*(DLYr+k+1)));
//Hi_n(1) = Hi_n-1(1) + Xr(n-1) * Erri_n-1
*(H+k+3) = (frac16 sat rnd)(*(H+k+3) + (uerrr * (*(DLYi+k+1)));
//Hi_n(1) += Xi(n-1) * Errr_n-1
k=k+4;
}
User’s Manual
4-224
V 1.2, 2000-01
Function Descriptions
CplxDlmsBlk_4_16 Adaptive Complex Filter, Coefficients - multiple of
four, Block Processing (cont’d)
//Set DLYr.index and DLYi.index to the oldest value in Delay-Buffer
*DLYr--;
*DLYi--;
aDLYr = &DLYr;
aDLYi = &DLYi;
//Format the real and imaginary parts of the filter output
//from 48 bit to 16-bit saturated values and store the
//result to Output-Buffer
*RLo = (frac16 sat)acci;
*RHi = (frac16 sat)accr;
R++;
//Calculate error in current output
*Err = *D++ - *R++;
}//end of indata loop
}//end of main
Techniques
•
•
•
•
•
•
Assumptions
• Filter size is a multiple of four
• Inputs, outputs, coefficients are in 1Q15 format
User’s Manual
Loop unrolling, two taps/loop
Use of packed data Load/Store
Delay line implemented as circular-buffer
Use of dual MAC instructions
Intermediate result stored in 64-bit register (16 guard bits)
Instruction ordering for zero overhead Load/Store
4-225
V 1.2, 2000-01
Function Descriptions
CplxDlmsBlk_4_16 Adaptive Complex Filter, Coefficients - multiple of
four, Block Processing (cont’d)
Memory Note
Delay-Buffer (Imag)
Delay-Buffer (Real)
.
.
.
.
caDLYr
Xr(n-nH+1)
aDLYr
aDLYi
caDLYi
Xi(n-nH+1)
Xr(n)
Xi(n)
Xr(n-1)
Xi(n-1)
Xr(n-2)
Xi(n-2)
.
.
.
.
1Q15
1Q15
doubleword
aligned
Hr n-1(0)
Dual MAC
Real 1
Input-Buffer
Xi(0)
doubleword
aligned
aX
Dual
MAC
Imag 2
Hr n-1(1)
Hi n-1(0)
Dual
MAC
Real 2
Hi n-1(1)
.
.
Xr(0)
Hi n-1(H-2)
Xi(1)
Hi n-1(H-1)
Xr(1)
.
Xi(n)
Dual
MAC
Imag 1
1Q15
halfword aligned
Xr(n)
.
1Q15
halfword aligned
Figure 4-55 CplxDlmsBlk_4_16
User’s Manual
4-226
V 1.2, 2000-01
Function Descriptions
CplxDlmsBlk_4_16 Adaptive Complex Filter, Coefficients - multiple of
four, Block Processing (cont’d)
Delay-Buffer (Real)
.
Coefficient
Update
Delay-Buffer (Imag)
.
.
Xr(n-nH+1)
.
caDLYr aDLYr
aDLYi
caDLYi
Xi(n-nH+1)
Xr(n)
Xr(n-1)
Xr(n-2)
Xi(n-1)
Xi(n-2)
Dual Mac
Imag
.
.
.
.
Complex Error
Value
1Q15
1Q15
Errin-1
doubleword
aligned
doubleword
aligned
Errrn-1
Dual Mac Real
Desired
Output Buffer
Di(0)
Xi(n)
Dual Mac
Real
aR
Dual Mac Imag
Updated
Coeff- Buffer
Output-Buffer
Ri(0)
aR
aH
Hrn(0)
Dr(0)
Rr(0)
Hrn(1)
Di(1)
Ri(1)
Hin(0)
Dr(1)
Rr(1)
Hin(1)
.
.
.
Di(n)
Ri(n)
.
Dr(n)
Rr(n)
Hin(nH-2)
.
.
Hin(nH-1)
1Q15
1Q15
halfword
aligned
halfword
aligned
1Q15
halfword
aligned
Errrn = Dr(n) - Rr(n)
Errin = Di(n) - Ri(n)
Figure 4-56 CplxDlmsBlk_4_16 Coefficient update
User’s Manual
4-227
V 1.2, 2000-01
Function Descriptions
CplxDlmsBlk_4_16 Adaptive Complex Filter, Coefficients - multiple of
four, Block Processing (cont’d)
Implementation
This DLMS routine processes a block of input values at a time.
The pointer to the input buffer is sent as an argument to the
function. The output is stored in output buffer, the starting
address of which is also sent as an argument to the function.
Implementation details are same as CplxDlms_4_16. An
additional loop is needed to calculate the output for every
sample in the buffer. Hence, this loop is repeated as many
times as the size of the input buffer.
Example
Trilib\Example\Tasking\Filters\Adaptive
\expCplxDlmsBlk_4_16.c, expCplxDlmsBlk_4_16.cpp
Trilib\Example\GreenHills\Filters\Adaptive
\expCplxDlmsBlk_4_16.cpp, expCplxDlmsBlk_4_16.c
Trilib\Example\GNU\Filters\Adaptive
\expCplxDlmsBlk_4_16.c
Cycle Count
With DSP
Extensions
Pre-loop
:
9
Loop
:


nH
nX ×  8 +  ------- – 1 × 8 + 16 
 2



+1+2
Post-loop
:
3+2
:
9
Without DSP
Extensions
Pre-loop
Code Size
User’s Manual
Loop
:
same as With DSP Extensions
Post-loop
:
3+2
252 bytes
4-228
V 1.2, 2000-01
Function Descriptions
Dlms_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Sample Processing
Signature
DataL Dlms_2_16x32(DataS
DataL
cptrDataS
DataL
DataL
DataL
);
Inputs
X
:
Real Input Value
H
:
Pointer to Coeff-Buffer
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
(nH)
:
Implicit filter order stored in Circ-Ptr
DLY
D
:
Real expected value
X,
*H,
*DLY,
D,
*Err,
U
Err
:
Pointer to Error value
U
:
Step size
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
H(nH)
:
Modified Coeff-Buffer
Return
R
:
Output value of the filter (32-bit
output)
Description
Delayed LMS algorithm implemented for mixed adaptive FIR
filter, FIR filter transversal structure (direct form), Single
sample processing, 16-bit fractional input, 32-bit coefficients
and output data format, Optimal implementation, requires
filter order to be multiple of two.
Outputs
User’s Manual
4-229
V 1.2, 2000-01
Function Descriptions
Dlms_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Sample Processing (cont’d)
Pseudo code
{
frac32 acc;
//filter result
frac16 circ *aDLY = &DLY;
//ptr to Circ-ptr of Delay-Buffer
int j;
//Error value multiplied by step size
uerr = (frac32)(*Err * U);
//store input value in Delay-Buffer at the position
//of the oldest value
*DLY = X;
acc = 0;
k = 0;
//tap loop
//The index i and j of H_n-1(i) and X(j) in the comments are valid only
//for the first iteration.For each next iteration it has to be
//incremented and decremented by 2 respectively.
for (j=0; j<nH/2; j++)
{
acc = acc + (frac32 sat)(*(H+k) * (*(DLY + k)));
//acc = acc + X(n)* H_n-1(0)
acc = acc + (frac32 sat)(*(H+k+1) * (*(DLY+k+1)));
//acc = X(n-1) * (H_n-1(1)
//coefficient update
*(H+k) = (frac32 sat)((*(H+k)) + uerr * (*(DLY+k)));
*(H+k+1) = (frac32 sat)((*(H+k+1)) + uerr * (*(DLY+k+1)));
k = k + 2;
}
//Set DLY.index to the oldest value in Delay-Buffer
DLY--;
aDLY = *DLY;
//filter output stored to output buffer
R = acc;
//calculate error for the current output
*Err = D - R;
return R;
}
Techniques
User’s Manual
•
•
•
•
Loop unrolling, two taps/loop
Use of packed data Load/Store
Delay line implemented as circular-buffer
Instruction ordering for zero overhead Load/Store
4-230
V 1.2, 2000-01
Function Descriptions
Dlms_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Sample Processing (cont’d)
Assumptions
• Filter order is a multiple of two
• Inputs in 1Q15 format, all other parameters in 1Q31 format
Memory Note
Delay-Buffer
.
.
X
x(n-nH+1)
caDLY
aDLY
Coeff-Buffer
x(n)
Hn-1(0)
x(n-1)
Hn-1(1)
x(n-2)
.
.
MAC
.
aH
.
.
.
1Q15
.
doubleword
aligned
Hn-1(nH-1)
1Q31
Figure 4-57 Dlms_2_16x32
User’s Manual
4-231
V 1.2, 2000-01
Function Descriptions
Dlms_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Sample Processing (cont’d)
Coefficient
Update
Delay-Buffer
.
caDLY
Hn(0)
aH
Hn(1)
.
x(n-nH+1)
Updated
Coefficient
aDLY
.
x(n)
.
x(n-1)
.
x(n-2)
.
.
.
Hn(nH-1)
.
Error Value
1Q15
doubleword
aligned
Errn-1
1Q31
MAC
Errn = D - R
Figure 4-58 Dlms_2_16x32 Coefficient update
User’s Manual
4-232
V 1.2, 2000-01
Function Descriptions
Dlms_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Sample Processing (cont’d)
Implementation
LMS algorithm has been used to realize an adaptive FIR filter.
The implemented filter is a Delayed LMS adaptive filter i.e.,
the updation of coefficients in the current instant is done using
the error in the previous output.
The FIR filter is implemented using transversal structure and
is realized as a tapped delay line.
This routine processes one sample at a time and returns
output of that sample. The input for which the output is to be
calculated is sent as an argument to the function.
TriCore’s load word instruction loads two delay line values
and two coefficients in one cycle each. MAC instruction
performs a multiplication and an addition according to the
equation
acc = acc + X ( n – k ) ⋅ H n – 1 ( k )
[4.75]
where, k=0,1,...., nH-1.
The coefficient is updated using error from the previous
output, i.e., errn-1. A MAC instruction updates a coefficient in
one cycle according to the equation
H n ( k ) = H n – 1 ( k ) + X ( n – k ) ⋅ Err n – 1
[4.76]
where, k=0,1,...,nH-1.
By using four MACs two coefficients are used and updated in
one pass through the loop. The loop is unrolled for efficient
pointer update. Hence tap loop is executed (nH/2 - 1) times.
For delay line, circular addressing mode is used. The size of
the circular delay buffer is equal to the filter order, i.e., the
number of coefficients. Circular buffer needs doubleword
alignment and to use load word instruction, size of the buffer
should be multiple of four bytes. This implies that the
coefficients should be multiple of two.
User’s Manual
4-233
V 1.2, 2000-01
Function Descriptions
Dlms_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Sample Processing (cont’d)
Example
Trilib\Example\Tasking\Filters\Adaptive
\expDlms_2_16x32.c, expDlms_2_16x32.cpp
Trilib\Example\GreenHills\Filters\Adaptive
\expDlms_2_16x32.cpp, expDlms_2_16x32.c
Trilib\Example\GNU\Filters\Adaptive
\expDlms_2_16x32.c
Cycle Count
With DSP
Extensions
Pre-kernel
:
Kernel
:
12
nH
------- – 1 × 4 + 2
2
if LoopCount > 1
nH
------- – 1 × 4 + 1
2
if LoopCount = 1
Post-kernel
:
4+2
Without DSP
Extensions
Code Size
User’s Manual
Pre-kernel
:
12
Kernel
:
same as With DSP Extensions
Post-kernel
:
4+2
108 bytes
4-234
V 1.2, 2000-01
Function Descriptions
DlmsBlk_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Block Processing
Signature
void DlmsBlk_2_16x32(DataS
DataL
cptrDataL
cptrDataS
int
DataL
DataL
DataL
);
Inputs
X
:
Pointer to Input-Buffer
R
:
Pointer to Output-Buffer
H
:
With DSP Extension - circular
pointer of Coeff-Buffer of size nH
Without DSP Extension - circStruct. Whose members are base
address, size and index
DLY
:
With DSP Extension - Pointer to
circular pointer of Delay-Buffer of
size nH, where nH is the filter order
Without DSP Extension - Pointer to
Circ-Struct
(nH)
:
Implicit filter order stored in CircPointer DLY
D
:
Pointer to Desired-Output-Buffer
Output
Return
User’s Manual
*X,
*R,
H,
*DLY,
nX,
*D,
*Err,
U
Err
:
Pointer to Error value
U
:
Step size
DLY
:
Updated circular pointer with index
set to the oldest value of the filter
Delay-Buffer
H(nH)
:
Modified Coeff-Buffer
R(nX)
:
Output-Buffer
None
4-235
V 1.2, 2000-01
Function Descriptions
DlmsBlk_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Block Processing (cont’d)
Description
Delayed LMS algorithm implemented for mixed adaptive FIR
filter, FIR filter transversal structure (direct form), Block
processing, 16-bit fractional input, 32-bit coefficients and
output data format, Optimal implementation, requires filter
order to be multiple of two.
User’s Manual
4-236
V 1.2, 2000-01
Function Descriptions
DlmsBlk_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Block Processing (cont’d)
Pseudo code
{
frac32 acc;
//filter result
frac16 circ *aDLY = &DLY;
//ptr to Circ-ptr of Delay-Buffer
int i, j;
//loop for input buffer
for (i=0; i<nX; i++)
{
//Error value multiplied by step size
uerr = (frac32 rnd)(*Err * U);
//store input value in Delay-Buffer at the position
//of the oldest value
*DLY = *X++;
acc = 0;
k = 0;
//tap loop
for (j=0; j<nH/4; j++)
{
acc = acc + (frac32 sat)(*(H+k) * (*(DLY + k)));
//acc = acc + X(n)* H_n-1(0)
acc = acc + (frac32 sat)(*(H+k+1) * (*(DLY+k+1)));
//acc = X(n-1) * (H_n-1(1)
//coefficient update
*(H+k) = (frac32 sat)((*(H+k)) + uerr * (*(DLY+k)));
*(H+k+1) = (frac32 sat)((*(H+k+1)) + uerr * (*(DLY+k+1)));
k = k + 2;
}
//Set DLY.index to the oldest value in Delay-Buffer
DLY--;
aDLY = *DLY;
//filter output stored to output buffer
*R = acc;
//calculate error for the current output
*Err = *D++ - *R++;
}
}
User’s Manual
4-237
V 1.2, 2000-01
Function Descriptions
DlmsBlk_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Block Processing (cont’d)
Techniques
• Loop unrolling, two taps/loop
• Use of packed data Load/Store
• Delay line and coefficient array implemented as circularbuffer
• Instruction ordering for zero overhead Load/Store
Assumptions
• Filter size is a multiple of two
• Inputs in 1Q15, all other parameters in 1Q31 format
Memory Note
Input-Buffer
X(0)
Delay-Buffer
.
aX
X(1)
.
.
X(n-nH+1)
.
X(n)
Hn-1(0)
caDLY
aDLY
Coeff-Buffer
.
X(n-1)
Hn-1(1)
X(n)
X(n-2)
.
.
.
.
.
1Q15
1Q15
halfword
aligned
MAC
caH aH
.
.
.
.
doubleword
aligned
Hn-1(nH-1)
1Q31
doubleword
aligned
Figure 4-59 DlmsBlk_2_16x32
User’s Manual
4-238
V 1.2, 2000-01
Function Descriptions
DlmsBlk_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Block Processing (cont’d)
Coefficient
Update
Delay-Buffer
.
caDLY
Hn(0)
aH
Hn(1)
.
x(n-nH+1)
Updated
Coefficient
aDLY
.
x(n)
.
x(n-1)
.
x(n-2)
.
.
.
Hn(nH-1)
.
Error Value
1Q15
Errn-1
doubleword
aligned
doubleword
aligned
MAC
Desired
Output Buffer
D(0)
1Q31
Output-Buffer
R(0)
aD
D(1)
R(1)
.
.
.
.
.
.
.
.
D(n)
R(n)
.
.
1Q31
1Q31
aR
Errn = D(n) - R(n)
Figure 4-60 DlmsBlk_2_16x32 Coefficient update
User’s Manual
4-239
V 1.2, 2000-01
Function Descriptions
DlmsBlk_2_16x32
Mixed Adaptive FIR Filter, Coefficients - multiple of
two, Block Processing (cont’d)
Implementation
This DLMS routine processes a block of input values at a time.
The pointer to the input buffer is sent as an argument to the
function. The output is stored in output buffer, the starting
address of which is also sent as an argument to the function.
Implementation details are same as Dlms_4_16, except that
the Coeff-Buffer is also circular and needs doubleword
alignment. The advantage of using circular buffer for
coefficients is efficient pointer update. In this implementation
while exiting the tap loop, the first two coefficients are already
loaded for the next input value. This helps in saving one cycle
in the next sample processing.
Example
Trilib\Example\Tasking\Filters\Adaptive
\expDlmsBlk_2_16x32.c, expDlmsBlk_2_16x32.cpp
Trilib\Example\GreenHills\Filters\Adaptive
\expDlmsBlk_2_16x32.cpp, expDlmsBlk_2_16x32.c
Trilib\Example\GNU\Filters\Adaptive
\expDlmsBlk_2_16x32.c
Cycle Count
With DSP
Extensions
Pre-loop
:
Loop (for input data)
:
7
nH
nX × 9 +  ------- – 1 × 4 + 6
 2

+1+2
Post-loop
:
1+2
:
8
Without DSP
Extensions
Pre-loop
Code Size
User’s Manual
Loop
:
same as With DSP Extensions
Post-loop
:
1+2
136 bytes
4-240
V 1.2, 2000-01
Function Descriptions
4.7
Fast Fourier Transforms
Spectrum (Spectral) analysis is a very important methodology in Digital Signal
Processing. Many applications have a requirement of spectrum analysis. The spectrum
analysis is a process of determining the correct frequency domain representation of the
sequence. The analysis gives rise to the frequency content of the sampled waveform
such as bandwidth and centre frequency.
One of the method of doing the spectrum analysis in Digital Signal Processing is by
employing the Discrete Fourier Transform (DFT).
The DFT is used to analyze, manipulate and synthesize signals in ways not possible with
continuous (analog) signal processing. It is a mathematical procedure that helps in
determining the harmonic, frequency content of a discrete signal sequence. DFTs origin
is from a continuous fourier transform which is given by
∞
X(f) =
∫ x ( t )e
– j2πft
[4.77]
dt
–∞
where x(t) is continuous time varying signal and X(f) is the fourier transform of the same.
The DFT is given by
N–1
X(k) =
∑ x ( n )WN
nk
[exponential form]
[4.78]
n=0
where the DFT coefficients used in the DFT Kernel, W, is
WN = e
– j2π ⁄ N
[4.79]
N–1
X(k) =
∑ x ( n ) [ cos ( 2πnk ⁄ N ) – j sin ( 2πnk ⁄ N ) ]
[4.80]
n=0
X(k) is the kth DFT output component for k=0,1,2,....,N-1
x(n) is the sequence of discrete sample for n=0,1,2,...,N-1
j is imaginary unit
–1
N is the number of samples of the input sequence (and number of frequency points of
DFT output).
User’s Manual
4-241
V 1.2, 2000-01
Function Descriptions
While the DFT is used to convert the signal from time domain to frequency domain. The
complementary function for DFT is the IDFT, which is used to convert a signal from
frequency to time domain. The IDFT is given by
N–1
1
x ( n ) = ---N
∑ X ( k )e
j2πnk ⁄ N
[exponential form]
[4.81]
k=0
N–1
1
x ( n ) = ---N
∑ X ( k ) [ cos ( 2πnk ⁄ N ) + j sin ( 2πnk ⁄ N ) ]
[4.82]
k=0
Notice the difference between DFT in Equation [4.78] and Equation [4.80], the IDFT
Kernel is the complex conjugate of the DFT and the output is scaled by N.
WNnk, the Kernel of the DFT and IDFT is called the Twiddle-Factor and is given by,
In exponential form,
e
e
– j 2πnk ⁄ N
j2πnk ⁄ N
for DFT
for IDFT
In rectangular form,
cos ( 2πnk ⁄ N ) – j sin ( 2πnk ⁄ N ) for DFT
cos ( 2πnk ⁄ N ) + j sin ( 2πnk ⁄ N ) for IDFT
While calculating DFT, a complex summation of N complex multiplications is required for
each of N output samples. N2 complex multiplications and N(N-1) complex additions
compute an N-point DFT. The processing time required by large number of calculation
limits the usefulness of DFT. This drawback of DFT is overcome by a more efficient and
fast algorithm called Fast Fourier Transform (FFT). The radix-2 FFT computes the DFT
in N*log2(N) complex operations instead of N2 complex operations for that of the DFT.
(where N is the transform length.)
The FFT has the following preconditions to operate at a faster rate.
• The radix-2 FFT works only on the sequences with lengths that are power of two.
• The FFT has a certain amount of overhead that is unavoidable, called bit reversed
ordering. The output is scrambled for the ordered input or the input has to be arranged
in a predefined order to get output properly arranged. This makes the straight DFT
better suited for short length computation than FFT. The graph shows the algorithm
complexity of both on a typical processor like pentium.
User’s Manual
4-242
V 1.2, 2000-01
Function Descriptions
Execution time (seconds)
1000
correlation DFT
100
10
1
0.1
FFT
0.01
0.001
8
16
32
64
128 256 512
Number points in DFT
1024
2048 4096
Figure 4-61 Complexity Graph
The Fourier transform plays an important role in a variety of signal processing
applications. Anytime, if it is more comfortable to work with a signal in the frequency
domain than in the original time or space domain, we need to compute Fourier transform.
Given N input samples of a signal x(n) = 0,1,..., (N-1), its Fourier transform is defined by
N–1
X(f) =
∑ x ( n )e
– j2πfn
[4.83]
n=0
Since n is an integer, X(f) is periodic with the period 1. Therefore, we only consider X(f)
in the basic interval 0 ≤ f ≤ 1 . In digital computation, X(f) is often evaluated at N uniformly
spaced points f = k/N (k=0,1,.....,N-1). This leads to the Discrete Fourier Transform (DFT)
N–1
X(k) =
∑ x ( n )WN
nk
(k=0,1,.....,N-1)
[4.84]
n=0
– j2π ⁄ N
with W N = e
. Direct computation of this length N, DFT takes N2 complex
multiplications and N(N-1) complex additions. FFT is an incredibly efficient algorithm for
computing DFT. The main idea of FFT is to exploit the periodic and symmetric properties
User’s Manual
4-243
V 1.2, 2000-01
Function Descriptions
nk
of the DFT Kernel WN . The resulting algorithm depends strongly on the transform
length N. The basic Cooley-Tukey algorithm assumes that N is a power of two. Hence it
is called radix-2 algorithm. Depending on how the input samples x(n) and the output data
X(k) are grouped, either a decimation-in-time (DIT) or a decimation-in-frequency (DIF)
algorithm is obtained. The technique used by Cooley and Tukey can also be applied to
DFTs, where N is a power of r. The resulting algorithms are referred to as radix-r FFT. It
turns out that radix-4, radix-8, and radix-16 are especially interesting. In cases where N
cannot be represented in terms of powers of single number, mixed-radix algorithms must
be used. For example for 28 point input, since 28 cannot be represented in terms of
powers of 2 and 4 we use radix-7 and radix-4 FFT to get the frequency spectrum. The
basic radix-2 decimation-in-frequency FFT algorithm is implemented.
4.7.1
Radix-2 Decimation-In-Time FFT Algorithm
The decimation-in-time (DIT) FFT divides the input (time) sequence into two groups, one
of even samples and the other of odd samples. N/2-point DFTs are performed on these
sub-sequences and their outputs are combined to form the N-point DFT.
First, x(n) the input sequence in the Equation [4.84] is divided into even and odd subsequences.
---- – 1
2
---- – 1
2
∑ x ( 2n )WN
2nk
X(k) =
n=0
n=0
N
---- – 1
2
2nk
But, W N
n 0
2nk
for k=0 to N-1
N
---- – 1
2
∑ x ( 2n )WN
=
( 2n + 1 )k
∑ x ( 2n + 1 )WN
+
= (e
+ WN
( – j2π ) ⁄ N 2nk
)
[4.85]
∑ x ( 2n + 1 )WN
k
2nk
n
= (e
0
( – j2π ) ⁄ ( N ⁄ 2 ) nk
) = WN ⁄ 2
nk
By substituting the following in Equation [4.85]
x1(n)=x(2n)
x2(n)=x(2n+1)
Equation [4.85] becomes
N⁄2–1
X( k ) =
∑
N⁄2–1
nk
x 1 ( n )W N ⁄ 2
+ WN
k
∑
nk
x 2 ( n )WN ⁄ 2
n=0
n=0
for k=0 to N-1
[4.86]
k
= Y ( k ) + WN Z ( k )
User’s Manual
4-244
V 1.2, 2000-01
Function Descriptions
Equation [4.86] is the radix-2 DIT FFT equation. It consists of two N/2-point DFTs (Y(k)
and Z(k)) performed on the subsequences of even and odd samples respectively of the
input sequence, x(n). Multiples of WN, the Twiddle-Factors are the coefficients in the FFT
calculation.
Further,
WN
k+N⁄2
= (e
– j 2π ⁄ N k
) × (e
– j 2π ⁄ N N ⁄ 2
)
= –WN
k
[4.87]
Equation [4.86] can be expressed in two equations
k
X ( k ) = Y( k ) + WN Z ( k )
[4.88]
k
X ( k + N ⁄ 2 ) = Y ( k ) – WN Z ( k )
[4.89]
for k=0 to N/2-1
The complete 8-point DIT FFT is illustrated in figure.
x0
W
0
x4
X0
+
-
+
W
x2
W0
x6
x1
W0
x5
+
-
0
W2
x7
W0
+
-
+
W0
+
+
W1
+
+
W2
+
W0
x3
X1
+
+
-
W2
-
-
X2
X3
X4
X5
X6
W3
X7
Figure 4-62 8-point DIT FFT
User’s Manual
4-245
V 1.2, 2000-01
Function Descriptions
The complete 8-point DIF FFT is illustrated in figure.
x0
x4
x2
x6
x1
x5
+
-
X0
W0
+
+
-
W2
X1
+
W0
+
-
W0
+
+
+
-
x3
+
x7
-
W1
+
+
+
-
W2
X3
X4
X5
X6
-
W3
X2
W2
X7
Figure 4-63 8-point DIF FFT
In the diagram, each pair of arrows represents a Butterfly. The whole of FFT is computed
by different patterns of Butterflies. These are called groups and stages.
For 8-point FFT the first stage consists of four groups of one Butterfly each, second
consists of two groups of two butterflies and third stage has one group of four Butterflies.
Each Butterfly is represented as in diagram.
Primary
node
x0+jy 0
Dual
node
x1+jy 1
x0’+jy0’
Dual node
spacing
W=C+j(-S)
x1’+jy1’
Figure 4-64 Radix-2 DIT Butterfly
User’s Manual
4-246
V 1.2, 2000-01
Function Descriptions
The output is derived as follows
x 0’ = x 0 + [ ( C )x 1 – ( – S )y 1 ]
[4.90]
y 0’ = y 0 + [ ( C )y 1 + ( – S )x 1 ]
[4.91]
x 1’ = x 0 – [ ( C )x 1 – ( – S )y 1 ]
[4.92]
y 1’ = y 0 – [ ( C )y 1 + ( – S )x 1 ]
[4.93]
User’s Manual
4-247
V 1.2, 2000-01
Function Descriptions
4.8
TriCore Implementation Note
4.8.1
Organization of FFT functions
The FFT radix-2 DIT function set consists of the following functions.
•
•
•
•
Forward FFT
Inverse FFT
Forward Real FFT
Inverse Real FFT
The above set of functions makes use of macros for efficient computation. The basic bit
reversal module, Butterflies and the Spectrum split operations are implemented in form
of macros.
The TriLib FFT implementation is one of the most optimal implementation which makes
use of several optimization techniques. Further, it makes use of different optimization
methods at instruction level. Secondly, it is organized as macros to save time during
function calls and also overcome the conditional checks such as shift etc., which perhaps
is done during assembling time itself as it is implemented as macros. Thirdly, the
algorithmic optimization, where the first pass or the first stage Butterflies are computed
outside the loop separately. This saves time as the first stage Butterflies need not be
multiplied by Twiddle-Factors.
4.8.2
16 Bit Implementation Modules
The classical FFT takes the input and Twiddle-Factor in the form of 16 bit complex
number representation as in Figure 4-2. For computational efficiency and to make use
of the parallel architecture of TriCore, a more efficient form of complex representation is
devised for internal operations of the FFT. The REAL:IMAG, REAL:IMAG pairs are
converted to REAL:REAL, IMAG:IMAG representation before processing.
Twiddle-Factors for the computation of 16 bit FFT is done by a utility function called
FFT_TF_16().
The main modules of FFTs are:
FFT_2_16()
Forward FFT for 16 bit Complex input, radix-2 decimation-intime implementation
IFFT_2_16()
Inverse FFT for 16 bit Complex input, radix-2 decimation-intime implementation
User’s Manual
4-248
V 1.2, 2000-01
Function Descriptions
FFTReal_2_16()
Forward FFT for 16 bit Real sequence input, radix-2
decimation-in-time implementation
IFFTReal_2_16()
Inverse Real FFT for 16 bit Complex sequence input, radix-2
decimation-in-time implementation to generate the two real
output sequences
4.8.3
16 bit Implementation for Mixed FFT
The mixed 16 bit FFT is the combination of features of 32 bit and 16 bit FFT, while 16 bit
is more efficient and 32 bit is more precise. The mixed FFT is a combination of both. It
has better precision than 16 bit and better speed than 32 bit implementation.
Internally the mixed FFT uses 32 bit representation and the final stage output is
converted to 16 bit precision using ConvertBuf macro.
Twiddle-Factors for the computation of mixed FFT is done by a utility function called
FFT_TF_16x32().
The main modules of Mixed FFTs are:
FFT_2_16x32()
Forward FFT for 16 bit Complex input, radix-2 decimation-intime implementation. Internal processing will be 32 bits,
output will be rounded to 16 bits
IFFT_2_16x32()
Inverse FFT for 16 bit Complex input, radix-2 decimation-intime implementation. Internal processing will be 32 bits,
output will be rounded to 16 bits
FFTReal_2_16x32()
Forward FFT for 16 bit Real sequence input, radix-2
decimation-in-time implementation. Internal processing will be
32 bits, output will be rounded to 16 bits
IFFTReal_2_16x32()
Inverse Real FFT for 16 bit Complex sequence input, radix-2
decimation-in-time implementation to generate the two real
output sequences. Internal processing will be 32 bits, output
will be rounded to 16 bits
4.8.4
32 Bit Implementation
The 32 bit implementation follows the straight forward approach in implementation. The
first pass (stage) is done outside the stage loop for the optimization purpose like it is
done in the 16 bit implementation. This is done by the Firstpass macro.
User’s Manual
4-249
V 1.2, 2000-01
Function Descriptions
Subsequent passes (stages) uses the Butterfly2 macro for the forward FFT and the
IButterfly2 macro for the inverse FFT. This is same as the 16 bit implementation, except
that this doesn’t need the special arrangement of the data.
Twiddle-Factors for FFT and IFFT are complex conjugate of each other, the TwiddleFactors calculated for FFT are used for IFFT. The Butterfly calculation for IFFT is
changed accordingly.
The Real FFT uses the Complex FFT functionality for computation and the final output
is split to separate the real part from the complex result and is arranged as a real half in
and imaginary half like Re[0], Re[1],...,Re[N/2-1], Im[0], Im[1],...,Im[N/2-1] in a
continuous order.
Twiddle-Factors for the computation of FFT is done by a utility function called
FFT_TF_32() as shown in the example.
The input for the 32 bit FFT, IFFT, RFFT, RIFFT are all in 1Q31 packed into a 64 bit data
as shown in the Figure 4-3 the input and the output is in normal order.
The main modules of FFTs are:
FFT_2_32()
Forward FFT for 32 bit Complex input, radix-2 decimation-intime implementation
IFFT_2_32()
Inverse FFT for 32 bit Complex input, radix-2 decimation-intime implementation
FFTReal_2_32()
Forward FFT for 32 bit Real sequence input, radix-2
decimation-in-time implementation
IFFTReal_2_32()
Inverse Real FFT for 32 bit Complex sequence input, radix-2
decimation-in-time implementation to generate the two real
output sequences
4.8.5
Functional Implementation
The main functions tested in Section 4.8.2 has a generic structure. It uses three nested
loops. It computes the first pass outside the nested loops.
First Stage
The First stage is executed outside the nested loops. The advantage of having this has
been already discussed in the Section 4.8.1. The First stage makes use of the
User’s Manual
4-250
V 1.2, 2000-01
Function Descriptions
FirstPass macro. The idea to separate the first stage Butterfly outside the loop can be
depicted as follows
x 0’ = x 0 + [ ( C )x 1 – ( – S )y 1 ]
[4.94]
y 0’ = y 0 + [ ( C )y 1 + ( – S )x 1 ]
[4.95]
x 1’ = x 0 – [ ( C )x 1 – ( – S )y 1 ]
[4.96]
y 1’ = y 0 – [ ( C )y 1 + ( – S )x 1 ]
[4.97]
In the first stage, there are N/2 groups, each containing a single Butterfly. Each Butterfly
uses a Twiddle-Factor W0, where
0
W = e
j0
= cos ( 0 ) + j sin ( 0 ) = 1 + j0
[4.98]
All of the multiplications in the first stage are by a value of either 0 or 1 and therefore can
be removed. The first stage Butterflies do not need multiplications. The Butterfly
equations reduce to the following.
x 0’ = x 0 + x 1
[4.99]
y 0’ = y 0 + y 1
[4.100]
x 1’ = x 0 – x 1
[4.101]
y 1’ = y 0 – y 1
[4.102]
Because there is only one Butterfly per group in the first stage, the Butterfly loop (which
would execute only once per group) and the group loop can be combined.
The FirstPass macro does the following operations.
• It copies the Input-Buffer elements in the bit reversal order to output array which is
used for in-place processing.
• It calculates the first Butterfly.
• It converts the conventional complex notation REAL:IMAG, REAL:IMAG format to
REAL:REAL, IMAG:IMAG format for efficient computation.
The following sections describe each of the loops.
Butterfly Loop
The inner most loop is the Butterfly loop in the FFT.
User’s Manual
4-251
V 1.2, 2000-01
Function Descriptions
The Butterfly macro is used to perform the basic Butterfly operation with or without
shifting. The Butterfly operation is as given below.
The Butterfly macro exploits the parallel architecture of the TriCore to achieve two
parallel operations in one single operation. Therefore it can compute two Butterfly
outputs in parallel.
x 0’ = x 0 + [ ( C )x 1 – ( – S )y 1 ]
[4.103]
y 0’ = y 0 + [ ( C )y 1 + ( – S )x 1 ]
[4.104]
x 1’ = x 0 – [ ( C )x 1 – ( – S )y 1 ]
[4.105]
y 1’ = y 0 – [ ( C )y 1 + ( – S )x 1 ]
[4.106]
The Butterfly macro involves two packed multiplications and two packed additional
subtraction. The MAC operation can cause the output of Butterfly to grow by two bits
from input to output. So the Butterfly also has a version with shift to take care of the
conditions to avoid errors caused by bits growth.
The Inverse Butterfly (IButterfly) macro is used by the Inverse FFT functions to
compute the Butterfly operation. In classical method the Twiddle-Factor is the complex
conjugate of the forward FFT. For efficient computation, the Twiddle-Factor is computed
by the same method as that of the forward FFT. But the computational mechanism is
changed in case of Inverse Butterfly, so as to achieve the same output as that by using
the complex conjugate. In contrast to the Forward Butterfly, inverse will compute using
the following equations.
x 0’ = x 0 + [ ( C )x 1 + ( – S )y 1 ]
[4.107]
y 0’ = y 0 + [ ( C )y 1 – ( – S ) x 1 ]
[4.108]
x 1’ = x 0 – [ ( C )x 1 + ( – S )y 1 ]
[4.109]
y 1’ = y 0 – [ ( C )y 1 – ( – S ) x1 ]
[4.110]
An example of bit growth and overflow is shown below.
Bit Growth:
Input to the Butterfly
H#0C00
User’s Manual
= 0000 1100 0000 0000
4-252
V 1.2, 2000-01
Function Descriptions
Output from Butterfly
H#1800
= 0001 1000 0000 0000
Overflow:
Input to the Butterfly
H#3000
= 0011 0000 0000 0000
Output from Butterfly
H#C000
= 1100 0000 0000 0000
In overflow, the positive number H#3000 is multiplied by a positive number, resulting in
H#C000, which is too large to represent as a positive, signed 16 bit number. H#C000 is
erroneously interpreted as a negative number.
To avoid overflow errors there are methods for compensating the growth of bits.
Following are the standard methods of compensation for the bit growth error.
a) Scaling of Input data to the Butterfly
b) Scaling of the output data unconditionally using the block floating point fundamental
method
c) Scaling of the output data conditionally using the block floating point fundamental
method
d) Extra sign bits to protect the output data
The method depicted in (d) is the fastest and the most efficient method but unfortunately
this has limited accuracy and is not suited for large FFTs.
Method (a) Input data scaling requires the extra shifting or scaling for all the input before
passing to FFT for processing, this becomes overhead in using the FFT and the purpose
is not served since it involves extra processing and also programming effort.
Method (b) is another way of compensating the bit growth, it unconditionally scales down
the input to Butterfly by a factor of two so that the output never overflows. This adds extra
time as the overhead and also the precision is lost in every iteration. The method
adapted here is to shift the whole block of data one bit to the right and updating the block
exponent.
User’s Manual
4-253
V 1.2, 2000-01
Function Descriptions
Method adapted in the TriLib FFT implementation
The most optimal method (c), the conditional block floating point scales the input data
only if the bit growth occurs. This shifting is done for the entire block with the updating of
the block exponent if one or more output grows. The condition is checked before every
stage of the loop begins and then it is branched to execute the nested loops with or
without pre-shift depending upon the status of the Sticky Advance Overflow (SAV) flag
of the Program Status Word (PSW).
Group Loop
The main objective of the group loop is to control the group of Butterfly. It sets the
address pointers for each of the Butterflies for their respective Twiddle-Factor-Buffers
and the input data buffers.
Stage Loop
The Stage Loop is the outer most loop of the FFTs nested loop. It controls the group
count, the number of Butterflies for each of the group and most importantly it performs
the conditional block floating point scaling on the stage calculation before it enters the
Group Loop.
Post Processing
The Post processing is involved in case of 16 bits, Mixed 16 bits and all the Real FFT
implementations.
In case of 16 bit implementation, ToComplexSfm is used to convert the REAL:REAL,
IMAG:IMAG internal representation to REAL:IMAG format.
In case of mixed 16 bit implementation, the output buffer after the FFT has 32 bit
precision it uses the ConvertBuf macro to make it 16 bit.
In Real Forward FFT implementation of all the types, the Split macro is used to separate
the output of the two real sequences given as the input to the Real FFT.
4.8.6
Implementation of FFT to Process the Real Sequences of Data
Many applications have the real valued data to be processed. Though the data is real
valued, one trivial approach is to use the Complex FFT by making the real portion of the
complex sequence filled by the real values and the imaginary portion equated to zero.
User’s Manual
4-254
V 1.2, 2000-01
Function Descriptions
However, this method is very inefficient. Following steps are followed to efficiently
implement the Real FFT using the Complex FFT algorithm.
1. Input complex sequence x(n) has to be formed from the two N length real valued
sequences x1(n), x2(n).
For n = 0, 1,..., N-1
x(n).real = x1(n)
[4.111]
x(n).imag = x2(n)
[4.112]
2. Compute the N-length Complex FFT on x(n).
X ( k ) = FFT [ x ( n ) ]
[4.113]
3. Perform the Split of the output spectrum. The Splitting of the spectrum is done by
Split macro that implements the following equations.
X1 r ( 0 ) = Xr ( 0 )
X1 i ( 0 ) = 0
[4.114]
X 2 r ( 0 ) = Xi ( 0 )
X2 i ( 0 ) = 0
[4.115]
X1 r ( N ⁄ 2 ) = Xr ( N ⁄ 2 )
X1 i ( N ⁄ 2 ) = 0
[4.116]
X 2 r ( N ⁄ 2 ) = Xi ( N ⁄ 2 )
X2 i ( N ⁄ 2 ) = 0
[4.117]
For k = 1,..., N/2-1
X 1 r ( k ) = 0.5 × [ Xr ( k ) + Xr ( N – k ) ]
X 1 i ( k ) = 0.5 × [ Xi ( k ) + Xi ( N – k ) ]
[4.118]
X 2 r ( k ) = 0.5 × [ Xi ( k ) + Xi ( N – k ) ]
X 2 i ( k ) = – 0.5 × [ Xr ( k ) + Xr ( N – k ) ]
[4.119]
X1 r ( N – k ) = X1 r ( k )
X1 i ( N – k ) = –X1 i ( k )
[4.120]
X2 r ( N – k ) = X2 r ( k )
X2 i ( N – k ) = –X2 i ( k )
[4.121]
Implementation of the Inverse Real FFT is done by forming the single complex sequence
X(k) from two sequences X1(k) and X2(k). The Unify macro is used to perform this
operation. The following equations are implemented in the Unify macro.
User’s Manual
4-255
V 1.2, 2000-01
Function Descriptions
For k = 0,...,N-1
Xr ( k ) = X 1 r ( k ) + X 2 i ( k )
[4.122]
Xi ( k ) = X 1 i ( k ) + X2 r ( k )
[4.123]
The unified complex sequence X(k) is used as the single sequence as input to the
Inverse FFT.
x ( n ) = IDFT [ X ( k ) ]
4.8.7
[4.124]
Design of Test Cases for the FFT functions
The test cases are designed using the math lab references. The characteristics of the
FFT is used to simplify the design of test cases. The Complex FFT contains the real and
imaginary components in the input data. By careful examination of the FFT equation it
can be found that when the real component is a cosine term with or without the
harmonics and the imaginary component is the sine term with same frequency and
harmonics as that of the cosine term, the output of the FFT will have a peak in second
position of the output array
Say, the input is given by the following equation
N
∑
cos ( 2πnk ) + i sin ( 2πnk )
[4.125]
n=0
where k=0,...., ∞
User’s Manual
4-256
V 1.2, 2000-01
Function Descriptions
The corresponding output will have only one peak as shown in the graphics below.
Figure 4-65 The plot of Equation [4.125] for a typical value of k given as input
Figure 4-66 The output plot from the FFT contains only one peak
User’s Manual
4-257
V 1.2, 2000-01
Function Descriptions
Figure 4-67 The Real cosine component for the Real FFT input
Figure 4-68 The output of the FFT contains two peaks for the input Figure 4-67
The presence of only cosine component and the sine component if equated to zero, the
output should have two peaks in second and Nth position in the real part of the output
array. This is the test used for the real FFT
The DC test is optional which gives rise to one peak in the first position of the output
array.This can be used to verify the scaling factor of the FFT.
User’s Manual
4-258
V 1.2, 2000-01
Function Descriptions
4.8.8
Using FFT functions
TriLib has three versions of FFT implementation 16 bit precision, 32 bit precision and 16
bit mixed precision.
16 bit implementation is most efficient.
32 bit implementation is most accurate.
16 bit mixed implementation is a compromise between speed of 16 bit and accuracy of
32 bit. It should be noted that mixed FFT is not efficient at all for FFTs at low points say,
8, 16.
FFTs are demonstrated by respective example main files such as
expCplx FFT_2_16() - demonstrates 16 bit FFT
expCplx FFT_2_32() - demonstrates 32 bit FFT
expCplx FFT_2_16X32() - demonstrates 16 bit mixed FFT and so the Real too.
The test data can be included into the above main functions such as
FFT_X.h - where X is points of FFT. e.g.,
FFT_8.h - 8 point Complex 16 bit data
FFT_16_32.h - 16 point Complex 32 bit data
RFFT_16.h - 16 point Real 16 bit data and so on.
Important Note:
• The 16 bit, 32 bit Real FFT and 16 bit Real, Complex FFT requires an output buffer to
be 2N size
• The Real FFT functions of 16, 32 and 16 mixed versions modifies the contents of input
buffer
4.8.9
Description
The following FFT functions for 16 bit, 32 bit and mixed are described.
•
•
•
•
Complex Forward Radix-2 DIT FFT
Complex Inverse Radix-2 DIT FFT
Real Forward Radix-2 DIT FFT
Real Inverse Radix-2 DIT FFT
User’s Manual
4-259
V 1.2, 2000-01
Function Descriptions
Important Note on Cycle Count:
The actual cycle count depends upon the dynamic path followed while execution which
depends on the input given. The actual cycle count should lie within the range given by
higher and lower limit of cycle count.
I
User’s Manual
4-260
V 1.2, 2000-01
Function Descriptions
FFT_2_16
Complex Forward Radix-2 DIT FFT for 16 bits
Signature
short FFT_2_16(CplxS
*R,
CplxS
*X,
CplxS
*TF,
int
nX
);
Inputs
X
:
TF
:
nX
:
Output
R
:
Pointer to Output-Buffer of 16 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 16 bit
complex value
Pointer to Twiddle- Factor-Buffer of
16 bit complex value in predefined
format
Size of Input-Buffer (power of 2)
This function computes the Complex Forward Radix-2
decimation-in-time Fast fourier transform on the given input
complex array. The detailed implementation is given in the
Section 4.8.
4-261
V 1.2, 2000-01
Function Descriptions
FFT_2_16
Complex Forward Radix-2 DIT FFT for 16 bits (cont’d)
Pseudo code
{
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real
x’->imag = x->imag + (k->imag * y->real
y’->real = x->real - (k->real * y->real
y’->imag = x->imag - (k->real * y->imag
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
+
+
k->imag
k->real
k->imag
k->imag
*
*
*
*
y->imag);
y->imag);
y->imag);
y->real);
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q15 format
• Input and Output has real and imaginary part packed as 16
bit data to form 32 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order
User’s Manual
4-262
V 1.2, 2000-01
Function Descriptions
FFT_2_16
Complex Forward Radix-2 DIT FFT for 16 bits (cont’d)
Memory Note
Input-Buffer
Output-Spectrum
aX
x(0)
R(0)
x(1)
Bit
reversed
data fetch
x(2)
x(3)
R(2)
R(3)
x(4)
R(4)
.
.
.
Hi
Memory
aR
R(1)
FFT
.
x(N-1)
R(N-1)
32 bit
(16 bit Cplx)
32 bit
(16 bit Cplx)
Real and
Imaginary parts in
1Q15
Twiddle-Factor
aTF
The data is arranged as in
Figure 4-2
Hi
Memory
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
TF(0)
TF(1)
TF(2)
Buffers will have both
Real and Imaginary parts
.
.
.
.
TF(N/2-1)
32 bit
(16 bit Cplx)
Figure 4-69 FFT_2_16
Implementation
User’s Manual
Refer Section 4.8.2
4-263
V 1.2, 2000-01
Function Descriptions
FFT_2_16
Complex Forward Radix-2 DIT FFT for 16 bits (cont’d)
Example
Trilib\Example\Tasking\Transforms\FFT\expCplxFFT_2_16
.c, expCplxFFT_2_16.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expCplxFFT_2_16.cpp, expCplxFFT_2_16.c
Trilib\Example\GNU\Transforms\FFT\expCplxFFT_2_16.c
Cycle Count
Initialization
:
7
First Pass Loop
:
Kernel
:
7+7×N⁄2+2
10 × ( Log 2 N – 1 ) + 2
+8 × ( N ⁄ 2 – 1 ) + 2
+ ( 13or11 ) ( Log 2 N – 1 ) × N ⁄ 4 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
8 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 13or11 ) ( Log 2 N – 1 ) × N ⁄ 4 + 2
Post Processing
:
6+4×N⁄2+4
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
167
172
164
256
8350
8350
7453
344 bytes
4-264
V 1.2, 2000-01
Function Descriptions
IFFT_2_16
Complex Inverse Radix-2 DIT IFFT for 16 bits
Signature
short IFFT_2_16(CplxS
*R,
CplxS
*X,
CplxS
*TF,
int
nX
);
Inputs
X
:
TF
:
nX
:
Output
R
:
Pointer to Output-Buffer of 16 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 16 bit
complex value
Pointer to Twiddle- Factor-Buffer of
16 bit complex number value in
predefined format
Size of Input-Buffer (power of 2)
This function computes the Complex Inverse Radix-2
decimation-in-time Fast fourier transform on the given input
complex array. The detailed implementation is given in the
Section 4.8.
4-265
V 1.2, 2000-01
Function Descriptions
IFFT_2_16
Complex Inverse Radix-2 DIT IFFT for 16 bits (cont’d)
Pseudo code
{
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real
x’->imag = x->imag + (k->imag * y->real
y’->real = x->real - (k->real * y->real
y’->imag = x->imag - (k->real * y->imag
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
-
k->imag
k->imag
y->imag
y->real
*
*
*
*
y->imag);
y->real);
k->imag);
k->imag);
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q15 format
• Input and Output has real and imaginary part packed as 16
bit data to form 32 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order
User’s Manual
4-266
V 1.2, 2000-01
Function Descriptions
IFFT_2_16
Complex Inverse Radix-2 DIT IFFT for 16 bits (cont’d)
Memory Note
Input-Buffer
Output-Spectrum
aX
X(0)
R(0)
X(1)
Bit
reversed
data fetch
X(2)
X(3)
R(2)
R(3)
X(4)
R(4)
.
.
.
Hi
Memory
aR
R(1)
IFFT
X(N-1)
.
R(N-1)
32 bit
(16 bit Cplx)
Hi
Memory
32 bit
(16 bit Cplx)
Real and
Imaginary parts in
1Q15
Twiddle-Factor
aTF
The data is arranged as in
Figure 4-2
TF(0)
TF(1)
TF(2)
.
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
Buffers will have both
Real and Imaginary parts
.
.
.
TF(N/2-1)
32 bit
(16 bit Cplx)
Figure 4-70 IFFT_2_16
Implementation
User’s Manual
Refer Section 4.8.2
4-267
V 1.2, 2000-01
Function Descriptions
IFFT_2_16
Complex Inverse Radix-2 DIT IFFT for 16 bits (cont’d)
Example
Trilib\Example\Tasking\Transforms\FFT
\expCplxFFT_2_16.c, expCplxFFT_2_16.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expCplxFFT_2_16.cpp, expCplxFFT_2_16.c
Trilib\Example\GNU\Transforms\FFT\expCplxFFT_2_16.c
Cycle Count
Initialization
:
7
First Pass Loop
:
Kernel
:
7+7×N⁄2+2
10 × ( Log 2 N – 1 ) + 2
+8 × ( N ⁄ 2 – 1 ) + 2
+ ( 13or11 ) ( Log 2 N – 1 ) × N ⁄ 4 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
8 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 13or11 ) ( Log 2 N – 1 ) × N ⁄ 4 + 2
Post Processing
:
6+4×N⁄2+4
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
162
172
164
256
7581
8350
7453
345 bytes
4-268
V 1.2, 2000-01
Function Descriptions
FFTReal_2_16
Real Forward Radix-2 DIT FFT for 16 bits
Signature
short FFTReal_2_16(CplxS
*R,
CplxS
*X,
CplxS
*TF,
int
nX
);
Inputs
X
:
TF
:
nX
:
Output
R
:
Pointer to Output-Buffer of 16 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 16 bit
complex value
Pointer to Twiddle- Factor-Buffer of
16 bit complex value in predefined
format
Size of Input-Buffer (power of 2)
This function computes the Real Forward Radix-2 decimationin-time Fast Fourier Transform on the given input complex
array. The detailed implementation is given in the Section 4.8.
The Real FFT is implemented by using the complex FFT and
the output spectrum is split to separate the Real FFT results.
4-269
V 1.2, 2000-01
Function Descriptions
FFTReal_2_16
Real Forward Radix-2 DIT FFT for 16 bits (cont’d)
Pseudo code
{
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real x’->imag = x->imag + (k->imag * y->real +
y’->real = x->real - (k->real * y->real y’->imag = x->imag - (k->real * y->imag +
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
Split Spectrum
// separate the real from the
k->imag
k->imag
y->imag
y->real
*
*
*
*
y->imag);
y->real);
k->imag);
k->imag);
complex output
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q15 format
• Input and Output has real and imaginary part packed as 16
bit data to form 32 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order
• Input contains two real sequences, x1 and x2, each of
length N. x1 is in real part and x2 is in imaginary part of
input complex data
• The output spectra has two complex blocks, each of length
N, wherein the first block is for x1 and subsequent block for
x2
User’s Manual
4-270
V 1.2, 2000-01
Function Descriptions
FFTReal_2_16
Real Forward Radix-2 DIT FFT for 16 bits (cont’d)
Memory Note
Input-Buffer
aX
x(0)
Output-Spectrum
aR
R(0)
x(1)
R(1)
Bit
reversed
data fetch
x(2)
x(3)
R(2)
R(3)
x(4)
R(4)
.
Hi
Memory
.
RFFT
.
Hi
Memory
.
x(N-1)
R(N-1)
32 bit*
(16 bit Cplx)
32 bit*
(16 bit Cplx)
aR
*
Real and
Imaginary parts in
1Q15
R(0) Real
R(1) Real
Twiddle-Factor
aTF
The data is arranged as in
Figure 4-2
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
Buffers will have both
Real and Imaginary parts
*
Real and
Imaginary parts in
1Q15
Split
Spectrum
.
TF(0)
.
TF(1)
.
TF(2)
.
.
.
.
R(N-1) Real
.
R(N) Imag
.
R(N+1) Imag
TF(N/2-1)
.
32 bit*
(16 bit Cplx)
.
.
.
.
Complex
results of
first Real
sequence
stored in
real part of
the InputBuffer
Complex
results of
second Real
sequence
stored in
imaginary
part of the
Input-Buffer
R(2N-1) Imag
32 bit*
(16 bit Cplx)
Figure 4-71 FFTReal_2_16
User’s Manual
4-271
V 1.2, 2000-01
Function Descriptions
FFTReal_2_16
Real Forward Radix-2 DIT FFT for 16 bits (cont’d)
Implementation
Refer Section 4.8.2
Example
Trilib\Example\Tasking\Transforms\FFT
\expRealFFT_2_16.c, expRealFFT_2_16.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expRealFFT_2_16.cpp, expRealFFT_2_16.c
Trilib\Example\GNU\Transforms\FFT
\expRealFFT_2_16.c
Cycle Count
Initialization
:
7
First Pass Loop
:
Kernel
:
7+7×N⁄2+2
10 × ( Log 2 N – 1 ) + 2
+8 × ( N ⁄ 2 – 1 ) + 2
+ ( 13or11 ) ( Log 2 N – 1 ) × N ⁄ 4 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
8 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 13or11 ) ( Log 2 N – 1 ) × N ⁄ 4 + 2
Post Processing
:
6+4×N⁄2+4
Split Spectrum
:
14 + 11 × ( N ⁄ 2 – 1 ) + 5
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
219
224
216
256
9766
9766
8869
678 bytes
4-272
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_16
Real Inverse Radix-2 DIT IFFT for 16 bits
Signature
short IFFTReal_2_16(CplxS
*R,
CplxS
*X,
CplxS
*TF,
int
nX,
int
SFlg
);
Inputs
X
:
TF
:
nX
SFlg
:
:
Output
R
:
Pointer to Output-Buffer of 16 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 16 bit
complex value
Pointer to Twiddle-Factor-Buffer of
16 bit complex value in predefined
format
Size of Input-Buffer (power of 2)
Indicates scale down the input by 2
if this flag is TRUE
This function computes the Real Inverse Radix-2 decimationin-time Fast fourier transform on the given input complex array.
The detailed implementation is given in the Section 4.8.The
Real IFFT is implemented by using the complex IFFT and
before processing the input is arranged to form a single valued
complex sequence from two complex sequences.
4-273
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_16
Real Inverse Radix-2 DIT IFFT for 16 bits (cont’d)
Pseudo code
{
unify spectrum
//Forms a single valued complex sequence from two
sequences
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real - k->imag * y->imag);
x’->imag = x->imag + (k->imag * k->real - k->imag * y->real);
y’->real = x->real - (k->real * y->real - y->imag * k->imag);
y’->imag = x->imag - (k->real * y->imag - y->real * k->imag);
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q15 format
• Input and Output has real and imaginary part packed as 16
bit data to form 32 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order
• Input contains two complex blocks each of length N,
wherein the first block is for x1 and subsequent block is for
x2
• The output spectra contains two real sequences x1 and x2,
each of length N. x1 is in real part and x2 is in imaginary
part of output complex data
Caution
• The input array gets modified after processing
User’s Manual
4-274
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_16
Real Inverse Radix-2 DIT IFFT for 16 bits (cont’d)
Memory Note
Output-Spectrum
R(0)
aR
Input-Buffer
aX
X(0)
R(1)
X(1)
Bit
reversed
data fetch
X(2)
Unify
Spectrum
X(3)
R(4)
.
RIFFT
.
Hi
Memory
aX
.
R(N-1)
X(N-1)
32 bit*
(16 bit Cplx)
X(0) Real
X(1) Real
.
.
.
.
Real and
Imaginary parts
in 1Q15
Twiddle-Factor
TF(0)
aTF
TF(1)
TF(2)
X(N-1) Real
.
.
X(N+1) Imag
.
.
.
.
TF(N/2-1)
.
.
X(2N-1) Imag
32 bit*
(16 bit Cplx)
*
Contains X1, the
first real
sequence in
Real part and
X2, the second
Real sequence
in imaginary part
The data is arranged
as in Figure 4-2
.
.
Hi
Memory
32 bit*
(16 bit Cplx)
*
X(N) Imag
Complex
input
sequence
to generate
X2, the
second
Real output
sequence
R(3)
X(4)
.
Complex
input
sequence
to
generate
X1, the
first Real
output
sequence
R(2)
*
32 bit*
(16 bit Cplx)
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
Buffers will have both
Real and Imaginary
parts
Real and
Imaginary parts in
1Q15
Figure 4-72 IFFTReal_2_16
User’s Manual
4-275
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_16
Real Inverse Radix-2 DIT IFFT for 16 bits (cont’d)
Implementation
Refer Section 4.8.2
Example
Trilib\Example\Tasking\Transforms\FFT
\expRealFFT_2_16.c, expRealFFT_2_16.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expRealFFT_2_16.cpp, expRealFFT_2_16.c
Trilib\Example\GNU\Transforms\FFT
\expRealFFT_2_16.c
Cycle Count
Initialization
:
6
Unify
:
5 + ( 10 × N ⁄ 2 ) + 2
First Pass Loop
:
Kernel
:
7+7×N⁄2
10 × ( Log 2 N – 1 ) + 2
+8 × ( N ⁄ 2 – 1 ) + 2
+ ( 13or11 ) ( Log 2 N – 1 ) × N ⁄ 4 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
8 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 13or11 ) ( Log 2 N – 1 ) × N ⁄ 4 + 2
Post Processing
:
6+4×N⁄2+4
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
209
219
211
256
8868
9637
8740
680 bytes
4-276
V 1.2, 2000-01
Function Descriptions
FFT_2_32
Complex Forward Radix-2 DIT FFT for 32 bits
Signature
short FFT_2_32(CplxL
*R,
CplxL
*X,
CplxL
*TF,
int
nX
);
Inputs
X
:
TF
:
nX
:
Output
R
:
Pointer to Output-Buffer of 32 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 32 bit
complex value
Pointer to Twiddle- Factor-Buffer of
32 bit complex value in predefined
format
Size of Input-Buffer (power of 2)
This function computes the Complex Forward Radix-2
decimation-in-time Fast fourier transform on the given input
complex array. The detailed implementation is given in the
Section 4.8.4.
4-277
V 1.2, 2000-01
Function Descriptions
FFT_2_32
Complex Forward Radix-2 DIT FFT for 32 bits (cont’d)
Pseudo code
{
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real
x’->imag = x->imag + (k->imag * k->real
y’->real = x->real - (k->real * y->real
y’->imag = x->imag - (k->real * y->imag
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
+
+
k->imag
k->imag
y->imag
y->real
*
*
*
*
y->imag);
y->real);
k->imag);
k->imag);
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q31 format
• Input and Output has real and imaginary part packed as 32
bit data to form 64 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order
User’s Manual
4-278
V 1.2, 2000-01
Function Descriptions
FFT_2_32
Complex Forward Radix-2 DIT FFT for 32 bits (cont’d)
Memory Note
Input-Buffer
Output-Spectrum
aX
x(0)
R(0)
x(1)
Bit
reversed
data fetch
x(2)
x(3)
R(2)
R(3)
x(4)
R(4)
.
.
.
Hi
Memory
aR
R(1)
FFT
.
x(N-1)
R(N-1)
64 bit
(32 bit Cplx)
64 bit
(32 bit Cplx)
Real and
Imaginary parts in
1Q31
Twiddle-Factor
aTF
The data is arranged as in
Figure 4-3
Hi
Memory
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
TF(0)
TF(1)
TF(2)
Buffers will have both
Real and Imaginary parts
.
.
.
.
TF(N/2-1)
64 bit
(32 bit Cplx)
Figure 4-73 FFT_2_32
Implementation
User’s Manual
Refer Section 4.8.4
4-279
V 1.2, 2000-01
Function Descriptions
FFT_2_32
Complex Forward Radix-2 DIT FFT for 32 bits (cont’d)
Example
Trilib\Example\Tasking\Transforms\FFT\expCplxFFT_2_32
.c, expCplxFFT_2_32.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expCplxFFT_2_32.cpp, expCplxFFT_2_32.c
Trilib\Example\GNU\Transforms\FFT\expCplxFFT_2_32.c
Cycle Count
Initialization
:
8
First Pass Loop
:
Kernel
:
7+9×N⁄2+2
10 × ( Log 2 N – 1 ) + 2
+7 × ( N ⁄ 2 – 1 ) + 2
+ ( 20or18 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
7 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 20or18 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
Post Processing
:
4
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
260
264
244
256
19803
20058
18267
350 bytes
4-280
V 1.2, 2000-01
Function Descriptions
IFFT_2_32
Complex Inverse Radix-2 DIT IFFT for 32 bits
Signature
short IFFT_2_32(CplxL
*R,
CplxL
*X,
CplxL
*TF,
int
nX
);
Inputs
X
:
TF
:
nX
:
Output
R
:
Pointer to Output-Buffer of 32 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 32 bit
complex value
Pointer to Twiddle- Factor-Buffer of
32 bit complex value in predefined
format
Size of Input-Buffer (power of 2)
This function computes the Complex Inverse Radix-2
decimation-in-time Fast fourier transform on the given input
complex array. The detailed implementation is given in the
Section 4.8.4.
4-281
V 1.2, 2000-01
Function Descriptions
IFFT_2_32
Complex Inverse Radix-2 DIT IFFT for 32 bits (cont’d)
Pseudo code
{
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real
x’->imag = x->imag + (k->imag * y->real
y’->real = x->real - (k->real * y->real
y’->imag = x->imag - (k->real * y->imag
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
-
k->imag
k->imag
y->imag
y->real
*
*
*
*
y->imag);
y->real);
k->imag);
k->imag);
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q31 format
• Input and Output has real and imaginary part packed as 32
bit data to form 64 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order
User’s Manual
4-282
V 1.2, 2000-01
Function Descriptions
IFFT_2_32
Complex Inverse Radix-2 DIT IFFT for 32 bits (cont’d)
Memory Note
Input-Buffer
Output-Spectrum
aX
X(0)
R(0)
X(1)
Bit
reversed
data fetch
X(2)
X(3)
R(2)
R(3)
X(4)
R(4)
.
.
.
Hi
Memory
aR
R(1)
IFFT
X(N-1)
.
R(N-1)
64 bit
(32 bit Cplx)
Hi
Memory
64 bit
(32 bit Cplx)
Real and
Imaginary parts in
1Q31
Twiddle-Factor
aTF
The data is arranged as in
Figure 4-3
TF(0)
TF(1)
TF(2)
.
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
Buffers will have both
Real and Imaginary parts
.
.
.
TF(N/2-1)
64 bit
(32 bit Cplx)
Figure 4-74 IFFT_2_32
Implementation
User’s Manual
Refer Section 4.8.4
4-283
V 1.2, 2000-01
Function Descriptions
IFFT_2_32
Complex Inverse Radix-2 DIT IFFT for 32 bits (cont’d)
Example
Trilib\Example\Tasking\Transforms\FFT\expCplxFFT_2_32
.c, expCplxFFT_2_32.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expCplxFFT_2_32.cpp, expCplxFFT_2_32.c
Trilib\Example\GNU\Transforms\FFT\expCplxFFT_2_32.c
Cycle Count
Initialization
:
8
First Pass Loop
:
Kernel
:
7+9×N⁄2+2
10 × ( Log 2 N – 1 ) + 2
+7 × ( N ⁄ 2 – 1 ) + 2
+ ( 20or18 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
7 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 20or18 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
Post Processing
:
4
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
244
264
244
256
18523
20058
18267
352 bytes
4-284
V 1.2, 2000-01
Function Descriptions
FFTReal_2_32
Real Forward Radix-2 DIT FFT for 32 bits
Signature
short FFTReal_2_32(CplxL
*R,
CplxL
*X,
CplxL
*TF,
int
nX
);
Inputs
X
:
TF
:
nX
:
Output
R
:
Pointer to Output-Buffer of 32 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 32 bit
complex value
Pointer to Twiddle- Factor-Buffer of
32 bit complex value in predefined
format
Size of Input-Buffer (power of 2)
This function computes the Real Forward Radix-2 decimationin-time Fast fourier transform on the given input complex array.
The detailed implementation is given in the Section 4.8.4.
4-285
V 1.2, 2000-01
Function Descriptions
FFTReal_2_32
Real Forward Radix-2 DIT FFT for 32 bits (cont’d)
Pseudo code
{
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real x’->imag = x->imag + (k->imag * y->real +
y’->real = x->real - (k->real * y->real y’->imag = x->imag - (k->real * y->imag +
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
Split Spectrum
// separate the real from the
k->imag
k->imag
y->imag
y->real
*
*
*
*
y->imag);
y->real);
k->imag);
k->imag);
complex output
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q31 format
• Input and Output has real and imaginary part packed as 32
bit data to form 64 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order
• Input contains two real sequences, x1 and x2, each of
length N. x1 is in real part and x2 is in imaginary part of
input complex data
• The output spectra has two complex blocks, each of length
N, wherein the first block is for x1 and subsequent block for
x2
User’s Manual
4-286
V 1.2, 2000-01
Function Descriptions
FFTReal_2_32
Real Forward Radix-2 DIT FFT for 32 bits (cont’d)
Memory Note
Input-Buffer
aX
x(0)
Output-Spectrum
aR
R(0)
x(1)
R(1)
Bit
reversed
data fetch
x(2)
x(3)
R(2)
R(3)
x(4)
R(4)
.
Hi
Memory
.
RFFT
.
Hi
Memory
.
x(N-1)
R(N-1)
64 bit*
(32 bit Cplx)
64 bit*
(32 bit Cplx)
aR
*
Real and
Imaginary parts in
1Q31
R(0) Real
R(1) Real
Twiddle-Factor
aTF
The data is arranged as in
Figure 4-3
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
Buffers will have both
Real and Imaginary parts
*
Real and
Imaginary parts in
1Q31
Split
Spectrum
.
TF(0)
.
TF(1)
.
TF(2)
.
.
.
.
R(N-1) Real
.
R(N) Imag
.
R(N+1) Imag
TF(N/2-1)
.
64 bit*
(32 bit Cplx)
.
.
.
.
Complex
results of
first Real
sequence
stored in
real part of
the InputBuffer
Complex
results of
second Real
sequence
stored in
imaginary
part of the
Input-Buffer
R(2N-1) Imag
64 bit*
(32 bit Cplx)
Figure 4-75 FFTReal_2_32
User’s Manual
4-287
V 1.2, 2000-01
Function Descriptions
FFTReal_2_32
Real Forward Radix-2 DIT FFT for 32 bits (cont’d)
Implementation
Refer Section 4.8.4
Example
Trilib\Example\Tasking\Transforms\FFT
\expRealFFT_2_32.c, expRealFFT_2_32.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expRealFFT_2_32.cpp, expRealFFT_2_32.c
Trilib\Example\GNU\Transforms\FFT
\expRealFFT_2_32.c
Cycle Count
Initialization
:
8
First Pass Loop
:
Kernel
:
7+9×N⁄2+2
10 × ( Log 2 N – 1 ) + 2
+7 × ( N ⁄ 2 – 1 ) + 2
+ ( 20or18 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
7 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 20or18 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
Post Processing
:
4
Split Spectrum
:
13 + 8 × ( N ⁄ 2 – 1 ) + 5
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
302
306
286
256
20837
21092
19301
784 bytes
4-288
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_32
Real Inverse Radix-2 DIT IFFT for 32 bits
Signature
short IFFTReal_2_32(CplxL
*R,
CplxL
*X,
CplxL
*TF,
int
nX,
int
SFlg
);
Inputs
X
:
TF
:
nX
SFlg
:
:
Output
R
:
Pointer to Output-Buffer of 32 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 32 bit
complex value
Pointer to Twiddle- Factor-Buffer of
32 bit complex value in predefined
format
Size of Input-Buffer (power of 2)
Indicates scale down the input by 2
if this flag is TRUE
This function computes the Real Inverse Radix-2 decimationin-time Fast fourier transform on the given input complex array.
The detailed implementation is given in the Section 4.8.4. The
Real IFFT is implemented by using the complex IFFT and
before processing the input is arranged to form a single valued
complex sequence from two complex sequences.
4-289
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_32
Real Inverse Radix-2 DIT IFFT for 32 bits (cont’d)
Pseudo code
{
unify spectrum
//Forms a single valued complex sequence from two
sequences
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real - k->imag * y->imag);
x’->imag = x->imag + (k->imag * k->real - k->imag * y->real);
y’->real = x->real - (k->real * y->real - y->imag * k->imag);
y’->imag = x->imag - (k->real * y->imag - y->real * k->imag);
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q31 format
• Input and Output has real and imaginary part packed as 32
bit data to form 64 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order
• Input contains two complex blocks each of length N,
wherein the first block is for x1 and subsequent block is for
x2
• The output spectra contains two real sequences x1 and x2,
each of length N. x1 is in real part and x2 is in imaginary
part of output complex data
Caution
• The input array gets modified after processing
User’s Manual
4-290
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_32
Real Inverse Radix-2 DIT IFFT for 32 bits (cont’d)
Memory Note
Output-Spectrum
R(0)
aR
Input-Buffer
aX
X(0)
R(1)
X(1)
Bit
reversed
data fetch
X(2)
Unify
Spectrum
X(3)
R(4)
.
RIFFT
.
Hi
Memory
aX
R(N-1)
X(0) Real
X(1) Real
.
.
.
.
Real and
Imaginary parts
in 1Q15
Twiddle-Factor
TF(0)
aTF
TF(1)
TF(2)
X(N-1) Real
.
.
X(N+1) Imag
.
.
.
.
TF(N/2-1)
.
.
X(2N-1) Imag
32 bit*
(16 bit Cplx)
Figure 4-76 IFFTReal_2_32
*
Contains X1, the
first real
sequence in
Real part and
X2, the second
Real sequence
in imaginary part
The data is arranged
as in Figure 4-2
.
.
Hi
Memory
32 bit*
(16 bit Cplx)
*
X(N) Imag
User’s Manual
.
X(N-1)
32 bit*
(16 bit Cplx)
Complex
input
sequence
to generate
X2, the
second
Real output
sequence
R(3)
X(4)
.
Complex
input
sequence
to
generate
X1, the
first Real
output
sequence
R(2)
*
32 bit*
(16 bit Cplx)
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
Buffers will have both
Real and Imaginary
parts
Real and
Imaginary parts in
1Q15
4-291
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_32
Real Inverse Radix-2 DIT IFFT for 32 bits (cont’d)
Implementation
Refer Section 4.8.4
Example
Trilib\Example\Tasking\Transforms\FFT
\expRealFFT_2_32.c, expRealFFT_2_32.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expRealFFT_2_32.cpp, expRealFFT_2_32.c
Trilib\Example\GNU\Transforms\FFT
\expRealFFT_2_32.c
Cycle Count
Initialization
:
8
Unify
:
4+4×N+2
First Pass Loop
:
Kernel
:
7+9×N⁄2+2
10 × ( Log 2 N – 1 ) + 2
+7 × ( N ⁄ 2 – 1 ) + 2
+ ( 20or18 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
7 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 20or18 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
Post Processing
:
4
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
298
302
282
256
20833
21088
19297
816 bytes
4-292
V 1.2, 2000-01
Function Descriptions
FFT_2_16X32
Complex Forward Radix-2 DIT 16 bit mixed
FFT
Signature
short FFT_2_16X32(CplxS
*R,
CplxS
*X,
CplxS
*TF,
int
nX
);
Inputs
X
:
TF
:
nX
:
Output
R
:
Pointer to Output-Buffer of 16 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 16 bit
complex value
Pointer to Twiddle- Factor-Buffer of
16 bit complex value in predefined
format
Size of Input-Buffer (power of 2)
This function computes the Complex Forward Radix-2
decimation-in-time Fast fourier transform on the given input
complex array with better precision where it internally uses 32
bit for computation. The detailed implementation is given in the
Section 4.8.
4-293
V 1.2, 2000-01
Function Descriptions
FFT_2_16X32
Complex Forward Radix-2 DIT 16 bit mixed
FFT (cont’d)
Pseudo code
{
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real
x’->imag = x->imag + (k->imag * y->real
y’->real = x->real - (k->real * y->real
y’->imag = x->imag - (k->real * y->imag
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
+
+
k->imag
k->real
k->imag
k->imag
*
*
*
*
y->imag);
y->imag);
y->imag);
y->real);
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q15 format
• Input and Output has real and imaginary part packed as 16
bit data to form 32 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order
User’s Manual
4-294
V 1.2, 2000-01
Function Descriptions
FFT_2_16X32
Complex Forward Radix-2 DIT 16 bit mixed
FFT (cont’d)
Memory Note
Input-Buffer
x(0)
Output-Spectrum
aX
R(0)
x(1)
Bit
reversed
data fetch
x(2)
x(3)
R(2)
R(3)
x(4)
R(4)
.
.
.
Hi
Memory
aR
R(1)
FFT
x(N-1)
.
Hi
Memory
R(N-1)
32 bit
(16 bit Cplx)
Real and
Imaginary parts in
1Q15
Twiddle-Factor
aTF
The data is arranged as in
Figure 4-2
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
Buffers will have both
Real and Imaginary parts
TF(0)
Extra space
for
intermediate
computation
(2N-1)
TF(1)
TF(2)
(2N-1)
.
32 bit
(16 bit Cplx)
.
.
.
TF(N/2-1)
32 bit
(16 bit Cplx)
Figure 4-77 FFT_2_16X32
Implementation
User’s Manual
Refer Section 4.8.3
4-295
V 1.2, 2000-01
Function Descriptions
FFT_2_16X32
Complex Forward Radix-2 DIT 16 bit mixed
FFT (cont’d)
Example
Trilib\Example\Tasking\Transforms\FFT
\expCplxFFT_2_16X32.c, expCplxFFT_2_16X32.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expCplxFFT_2_16X32.cpp, expCplxFFT_2_16X32.c
Trilib\Example\GNU\Transforms\FFT
\expCplxFFT_2_16X32.c
Cycle Count
Initialization
:
8
First Pass Loop
:
Kernel
:
10 + 9 × nX ⁄ 2
10 × ( Log 2 N – 1 ) + 2
+7 × ( N ⁄ 2 – 1 ) + 2
+ ( 16or14 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
7 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 16or14 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
Post Processing
:
11 + 4 × nX
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
269
272
256
256
17508
17508
15712
374 bytes
4-296
V 1.2, 2000-01
Function Descriptions
IFFT_2_16X32
Complex Inverse Radix-2 DIT 16 bit mixed
IFFT
Signature
short IFFT_2_16X32(CplxS
*R,
CplxS
*X,
CplxS
*TF,
int
nX
);
Inputs
X
:
TF
:
nX
:
Output
R
:
Pointer to Output-Buffer of 16 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 16 bit
complex value
Pointer to Twiddle- Factor-Buffer of
16 bit complex number value in
predefined format
Size of Input-Buffer (power of 2)
This function computes the Complex Inverse Radix-2
decimation-in-time Fast fourier transform on the given input
complex array with better precision where it internally uses 32
bit for computation. The detailed implementation is given in the
Section 4.8.
4-297
V 1.2, 2000-01
Function Descriptions
IFFT_2_16X32
Complex Inverse Radix-2 DIT 16 bit mixed
IFFT (cont’d)
Pseudo code
{
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real
x’->imag = x->imag + (k->imag * y->real
y’->real = x->real - (k->real * y->real
y’->imag = x->imag - (k->real * y->imag
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
-
k->imag
k->imag
y->imag
y->real
*
*
*
*
y->imag);
y->real);
k->imag);
k->imag);
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q15 format
• Input and Output has real and imaginary part packed as 16
bit data to form 32 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order
User’s Manual
4-298
V 1.2, 2000-01
Function Descriptions
IFFT_2_16X32
Complex Inverse Radix-2 DIT 16 bit mixed
IFFT (cont’d)
Memory Note
Input-Buffer
X(0)
Output-Spectrum
aX
R(0)
X(1)
Bit
reversed
data fetch
X(2)
X(3)
R(2)
R(3)
X(4)
R(4)
.
Hi
Memory
aR
R(1)
.
.
IFFT
X(N-1)
.
Hi
Memory
R(N-1)
32 bit
(16 bit Cplx)
Real and
Imaginary parts in
1Q15
Twiddle-Factor
aTF
The data is arranged as in
Figure 4-2
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
Buffers will have both
Real and Imaginary parts
TF(0)
Extra space
for
intermediate
computation
(2N-1)
TF(1)
TF(2)
(2N-1)
.
32 bit
(16 bit Cplx)
.
.
.
TF(N/2-1)
32 bit
(16 bit Cplx)
Figure 4-78 IFFT_2_16X32
Implementation
User’s Manual
Refer Section 4.8.3
4-299
V 1.2, 2000-01
Function Descriptions
IFFT_2_16X32
Complex Inverse Radix-2 DIT 16 bit mixed
IFFT (cont’d)
Example
Trilib\Example\Tasking\Transforms\FFT
\expCplxFFT_2_16X32.c, expCplxFFT_2_16X32.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expCplxFFT_2_16X32.cpp, expCplxFFT_2_16X32.c
Trilib\Example\GNU\Transforms\FFT
\expCplxFFT_2_16X32.c
Cycle Count
Initialization
:
8
First Pass Loop
:
Kernel
:
10 + 9 × nX ⁄ 2
10 × ( Log 2 N – 1 ) + 2
+7 × ( N ⁄ 2 – 1 ) + 2
+ ( 16or14 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
7 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 16or14 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
Post Processing
:
11 + 4 × nX
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
270
272
256
256
17506
17508
15712
376 bytes
4-300
V 1.2, 2000-01
Function Descriptions
FFTReal_2_16X32
Real Forward Radix-2 DIT 16 bit mixed FFT
Signature
short FFTReal_2_16X32(CplxS
*R,
CplxS
*X,
CplxS
*TF,
int
nX
);
Inputs
X
:
TF
:
nX
:
Output
R
:
Pointer to Output-Buffer of 16 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
User’s Manual
Pointer to Input-Buffer of 16 bit
complex value
Pointer to Twiddle- Factor-Buffer of
16 bit complex value in predefined
format
Size of Input-Buffer (power of 2)
This function computes the Real Forward Radix-2 decimationin-time Fast Fourier Transform on the given input complex
array with better precision where it internally uses 32 bit for
computation. The detailed implementation is given in the
Section 4.8. The Real FFT is implemented by using the
complex FFT and the output spectrum is split to separate the
Real FFT results.
4-301
V 1.2, 2000-01
Function Descriptions
FFTReal_2_16X32
Real Forward Radix-2 DIT 16 bit mixed FFT (cont’d)
Pseudo code
{
Bit reverse input
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real x’->imag = x->imag + (k->imag * y->real +
y’->real = x->real - (k->real * y->real y’->imag = x->imag - (k->real * y->imag +
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
Split Spectrum
// separate the real from the
k->imag
k->imag
y->imag
y->real
*
*
*
*
y->imag);
y->real);
k->imag);
k->imag);
complex output
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q15 format
• Input and Output has real and imaginary part packed as 16
bit data to form 32 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order with the real part
separated from the complex part
User’s Manual
4-302
V 1.2, 2000-01
Function Descriptions
FFTReal_2_16X32
Real Forward Radix-2 DIT 16 bit mixed FFT (cont’d)
Memory Note
Input-Buffer
aX
x(0)
Output-Spectrum
aR
R(0)
x(1)
R(1)
Bit
reversed
data fetch
x(2)
x(3)
R(2)
R(3)
x(4)
R(4)
.
Hi
Memory
.
RFFT
.
.
x(N-1)
R(N-1)
32 bit*
(16 bit Cplx)
The data is arranged
as in Figure 4-2
Twiddle-Factor
(2N-1)
TF(0)
aTF
32 bit*
(16 bit Cplx)
TF(1)
TF(2)
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
Buffers will have both
Real and Imaginary parts
*
Real and
Imaginary parts in
1Q15
Split
Spectrum
Extra space
for
intermediate
computation
*
Real and
Imaginary parts in
1Q15
Hi
Memory
aR
.
R(0) Real
.
R(1) Real
.
.
.
R(N-1) Real
TF(N/2-1)
32 bit*
(16 bit Cplx)
Complex
results of first
Real sequence
stored in real
part of the
Input-Buffer
R(N) Imag
R(N+1) Imag
.
R(2N-1) Imag
Complex results of
second Real
sequence stored in
imaginary part of the
Input-Buffer
32 bit*
(16 bit Cplx)
Figure 4-79 FFTReal_2_16X32
User’s Manual
4-303
V 1.2, 2000-01
Function Descriptions
FFTReal_2_16X32
Real Forward Radix-2 DIT 16 bit mixed FFT (cont’d)
Implementation
Refer Section 4.8.3
Example
Trilib\Example\Tasking\Transforms\FFT
\expRealFFT_2_16X32.c, expRealFFT_2_16X32.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expRealFFT_2_16X32.cpp, expRealFFT_2_16X32.c
Trilib\Example\GNU\Transforms\FFT
\expRealFFT_2_16X32.c
Cycle Count
Initialization
:
8
First Pass Loop
:
Kernel
:
10 + 9 × nX ⁄ 2
10 × ( Log 2 N – 1 ) + 2
+7 × ( N ⁄ 2 – 1 ) + 2
+ ( 16or14 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
7 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 16or14 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
Post Processing
:
11 + 4 × nX
Split Spectrum
:
14 + 11 × ( N ⁄ 2 – 1 ) + 5
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
320
324
308
256
18004
18924
17128
662 bytes
4-304
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_16X32
Real Inverse Radix-2 DIT 16 bit mixed IFFT
Signature
short IFFTReal_2_16X32(CplxS
*R,
CplxS
*X,
CplxS
*TF,
int
nX,
int
SFlg
);
Inputs
X
:
TF
:
nX
SFlg
:
:
Output
R
:
Pointer to Output-Buffer of 16 bit
complex value
Return
NF
:
Scaling factor used for
normalization
Description
Pointer to Input-Buffer of 16 bit
complex value
Pointer to Twiddle-Factor-Buffer of
16 bit complex value in predefined
format
Size of Input-Buffer (power of 2)
Indicates scale down the input by 2
if this flag is TRUE
This function computes the Real Inverse Radix-2 decimationin-time Fast fourier transform on the given input complex array
with better precision where it internally uses 32 bit for
computation. The detailed implementation is given in the
Section 4.8.The Real IFFT is implemented by using the
complex IFFT and before processing the input is arranged to
form a single valued complex sequence from two complex
sequences.
Pseudo code
{
unify spectrum
//Forms a single valued complex sequence from two
sequences
Bit reverse input
User’s Manual
4-305
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_16X32
Real Inverse Radix-2 DIT 16 bit mixed IFFT (cont’d)
for(l=1;l<=L;l++) //Loop 1 Stage loop
{
for(i=1;i<=I;i++);
//Loop 2 Group loop
{
for(j=1;j<=J;j++)
//Loop 3 Butterfly loop
{
x’->real = x->real + (k->real * y->real
x’->imag = x->imag + (k->imag * k->real
y’->real = x->real - (k->real * y->real
y’->imag = x->imag - (k->real * y->imag
}
initialize k pointer
initialize x,y pointer
}
I = I/2;
J = J*2;
}
-
k->imag
k->imag
y->imag
y->real
*
*
*
*
y->imag);
y->real);
k->imag);
k->imag);
}
Techniques
• Packed multiplication
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Inputs are in 1Q15 format
• Input and Output has real and imaginary part packed as 16
bit data to form 32 bit complex data
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• Input and Output are in normal order with the real part
separated from the complex part
• Input contains two complex blocks each of length N,
wherein the first block is for x1 and subsequent block is for
x2
• The output spectra contains two real sequences x1 and x2,
each of length N. x1 is in real part and x2 is in imaginary
part of output complex data
Caution
• The input array gets modified after processing
User’s Manual
4-306
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_16X32
Real Inverse Radix-2 DIT 16 bit mixed IFFT (cont’d)
Memory Note
Output-Spectrum
R(0)
aR
Input-Buffer
aX
X(0)
R(1)
X(1)
Bit
reversed
data fetch
X(2)
Unify
Spectrum
X(3)
R(4)
.
RIFFT
.
Hi
Memory
aX
.
R(N-1)
X(N-1)
32 bit*
(16 bit Cplx)
X(0) Real
X(1) Real
.
.
.
.
*
Real and
Imaginary parts
in 1Q15
TF(1)
first real sequence
in Real part and
X2, the second
Real sequence in
imaginary part
.
.
X(N+1) Imag
.
.
.
.
TF(N/2-1)
.
32 bit*
(16 bit Cplx)
.
32 bit*
(16 bit Cplx)
* Contains X1, the
TF(2)
X(N-1) Real
X(2N-1) Imag
32 bit*
(16 bit Cplx)
TF(0)
.
.
(2N-1)
Twiddle-Factor
aTF
*
Hi
Memory
Extra space
for
intermediate
computation
The data is arranged
as in Figure 4-2
X(N) Imag
Complex
input
sequence
to generate
X2, the
second
Real output
sequence
R(3)
X(4)
.
Complex
input
sequence
to
generate
X1, the
first Real
output
sequence
R(2)
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
Real and
Imaginary parts in
1Q15
Buffers will have both
Real and Imaginary
parts
Figure 4-80 IFFTReal_2_16X32
User’s Manual
4-307
V 1.2, 2000-01
Function Descriptions
IFFTReal_2_16X32
Real Inverse Radix-2 DIT 16 bit mixed IFFT (cont’d)
Implementation
Refer Section 4.8.3
Example
Trilib\Example\Tasking\Transforms\FFT
\expRealFFT_2_16X32.c, expRealFFT_2_16X32.cpp
Trilib\Example\GreenHills\Transforms\FFT
\expRealFFT_2_16X32.cpp, expRealFFT_2_16X32.c
Trilib\Example\GNU\Transforms\FFT
\expRealFFT_2_16X32.c
Cycle Count
Initialization
:
8
Unify
:
5 + ( 10 × N ⁄ 2 ) + 2
First Pass Loop
:
Kernel
:
10 + 9 × nX ⁄ 2
10 × ( Log 2 N – 1 ) + 2
+7 × ( N ⁄ 2 – 1 ) + 2
+ ( 16or14 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
• Stage Loop
:
10 × ( Log 2 N – 1 ) + 2
• Group Loop
:
7 × (N ⁄ 2 – 1) + 2
• Butterfly
:
( 16or14 ) ( Log 2 N – 1 ) × N ⁄ 2 + 2
Post Processing
:
11 + 4 × nX
Example
N is the number of points of FFT
Code Size
User’s Manual
N
Actual
Higher limit
Lower limit
8
314
319
303
256
17004
18795
16999
482 bytes
4-308
V 1.2, 2000-01
Function Descriptions
4.9
Discrete Cosine Transform (DCT)
4.9.1
Algorithm
Similar to the Discrete Fourier Transform (DFT) the Discrete Cosine Transform (DCT) is
widely used for transforming a signal or image from the time or spatial domain to the
frequency domain. The DCT, especially the two-dimensional (2D) DCT plays an
important role in applications such as signal or image compression, e.g. in the JPEG and
MPEG standards. In contrast to FFT, DCT is a real valued transform. The onedimensional (1D) DCT of a discrete time sequence u(n) (n = 0, 1,...,N-1) is defined as
N–1
∑ u ( n ) ⋅ αN ( k ) cos
v(k) =
( 2n + 1 )kπ
--------------------------- (k = 0, 1,...,N-1)
2N
[4.126]
n=0
with
 1⁄N
αN ( k ) = 
for k = 0
 2⁄N
for k = 1, 2,...N-1
The DCT Equation [4.126] can be represented in a matrix vector form
v = CNu
[4.127]
where
u =
CN =
u( 0)
u( 1)
u(N – 1)
v =
v( 0)
v( 1)
v(N – 1)
[4.128]
c N ( 0, 0 )
c N ( 0, 1 )
…
c N ( 0, N – 1 )
c N ( 1, 0 )
c N ( 1, 1 )
…
c N ( 1, N – 1 )
[4.129]
c N ( N – 1, 0 ) c N ( N – 1, 1 ) … c N ( N – 1, N – 1 )
with
( 2n + 1 )kπ
c N ( k, n ) = α N ( k ) cos --------------------------2
Notice that CN is an orthogonal matrix, i.e., its inverse is equal to its transpose.
CN-1 = CNT
User’s Manual
[4.130]
4-309
V 1.2, 2000-01
Function Descriptions
or
CNCNT = CNTCN = identity matrix
The 2D DCT separates a two dimensional signal (i.e., an image) u(n1, n2), (n1 = 0,
1,...,N1-1; n2 = 0, 1,...,N2-1) into parts or spectral subbands of differing importance (with
respect to the visual quality of the image). The transformed image v(n1,n2) has the same
size N 1 × N 2 and is defined as
N1 – 1 N2 – 1
∑ ∑
v ( k 1, k 2 ) =
u ( n 1, n 2 ) ⋅ α N1 ( k 1 )α N2 ( k 2 )
[4.131]
n1 = 0 n2 = 0
( 2n 1 + 1 )k 1 π
( 2n 2 + 1 )k 2 π
cos -------------------------------- cos -------------------------------2N 1
2N 2
(k1 = 0, 1,...,N1-1; k2 = 0,1,...,N2-1)
By using the matrix notation
U =
u ( 0, 0 )
u ( 0, 1 )
u ( 0, N 2 – 1 )
u ( 1, 0 )
u ( 1, 1 )
u ( 1, N 2 – 1 )
[4.132]
u ( N 1 – 1, 0 ) u ( N 1 – 1, 1 ) u ( N 1 – 1, N 2 – 1 )
V =
v ( 0, 0 )
v ( 0, 1 )
v ( 0, N 2 – 1 )
v ( 1, 0 )
v ( 1, 1 )
v ( 1, N 2 – 1 )
[4.133]
v ( N 1 – 1, 0 ) v ( N 1 – 1, 1 ) v ( N 1 – 1, N 2 – 1 )
We can write the 2D DCT as a multiplication of three matrices
V = CN1UCN2T
The N 1 × N 2 matrix CN1 and the N 2 × N 2 CN2 are defined as in Equation [4.129].
It is easy to see that the 2D DCT is separable into a sequence of 1D DCTs, N2 times 1D
DCTs of the length N1 applied to the columns of U, followed by another N1 times 1D
DCTs of the length N2 applied to the rows of CN1U. Hence, we can say that the 1D DCT
algorithm is the Kernel of the 2D one.
A direct implementation of the DCT given in Equation [4.126] requires NxN
multiplications and additions/subtractions of the same order. Like the DFT, the DCT can
be implemented more efficiently by using a fast algorithm. In the literature many fast DCT
algorithms have been developed “References” on Page 423. Among them, the sparse
User’s Manual
4-310
V 1.2, 2000-01
Function Descriptions
matrix factorization algorithms decompose the coefficient matrix CN into a product of
several sparse matrices in order to reduce the number of multiplications and additions.
One such algorithm is proposed in “References” on Page 423. It is applicable to any
DCT whose transform length is a power of 2. For a length N 1D DCT, this algorithm
requires (3N/2)(log2N-1)+2 real additions and Nlog2N-(3N/2)+4 real multiplications.
The number of additions and multiplications for this particular case is 26 and 16. Note
that the input samples u(n) are in natural order while the output samples v’(k) are in bit
reversed order. The output samples v’(k) are exactly identical to those defined in
Equation [4.126] except for scaling
v(k) =
2
---- v’(k) (k = 0, 1,...,N-1)
N
= v’(k)/2
[4.134]
for N = 8
DCT is an orthogonal transform. If we decompose the scaling factor 1/2 in
Equation [4.134] in two 1/ 2 and scale all butterflies in Figure 4-81 whose branch
coefficients are 1 and -1, by 1/ 2 , all butterflies become an orthogonal transform.
In the following, we use this algorithm to compute an 8 × 8 DCT. A C code is given below.
It computes actually 2 × 8 , 8 sample 1D DCTs, based on the signal flow graph in
Figure 4-81. The first 8 DCTs (j = 8) are applied to the 8 columns of the original image
and the last 8 DCTs (j = 1) are applied to the 8 rows of the resulting image. The results
we obtain correspond to the transformed image V in Equation [4.133] except for a
scaling ( 2 ⁄ N )2 = 2/N due to Equation [4.134]. The program is for 16 bit fractional data
and works in an in-place manner. The 8 × 8 input image U is stored in the raster scan
(row-by-row) order in a buffer of the length 64. The same buffer is also used to store the
immediate result C8U during the processing, as well as the final output V in the same
order.
User’s Manual
4-311
V 1.2, 2000-01
Function Descriptions
C π/4
x0
X0
C π/4
C π/4
x1
X4
-C π/4
S π/8
x2
-1
x3
X2
C π/8
-S 3π/8
X6
C 3π/8
-1
S π/16
x4
x5
x6
x7
-1
S 5π/16
-C π/4
-1
-1
C π/4
C π/4
X5
C5 π/16
-S 3π/16
X3
-1
-1
C π/4
C 3π/16
C 7π/16
-1
X1
C π/16
-S 7π/16
X7
Figure 4-81 Signal Flow Graph for an 8-sample 1D DCT
User’s Manual
4-312
V 1.2, 2000-01
Function Descriptions
C π/4
X0
C π/4
C π/4
X4
S π/8
C π/8
-S 3π/8
X6
C 3π/8
x2
-1
x3
-1
S π/16
S 5π/16
X5
C5 π/16
-S 3π/16
C 3π/16
-1
C π/16
X3
X7
x1
-C π/4
X2
X1
x0
-C π/4
-1
C π/4
C π/4
-1
C π/4
-S 3π/16
C 7π/16
-1
x4
x5
-1
x6
-1
x7
Figure 4-82 Signal Flow Graph for an 8-sample 1D IDCT
User’s Manual
4-313
V 1.2, 2000-01
Function Descriptions
4.10
Inverse Discrete Cosine Transform (IDCT)
4.10.1
Algorithm
The Inverse Discrete Cosine Transform (IDCT) is easily derived from the DCT. By
multiplying both sides of Equation [4.127] with CN-1 from left and considering the
orthogonality Equation [4.130] we obtain
u = CNTv
or
N–1
u(n) =
∑ v ( k ) ⋅ αN ( k ) cos
( 2n + 1 )kπ
--------------------------- (n = 0, 1,...,N-1)
2N
[4.135]
k=0
In other words, to get the IDCT we simply replace the DCT matrix CN by its transpose
CNT. The same is true for the 2D IDCT, i.e.
U = CN1TVCN2
or
N1 – 1 N2 – 1
∑ ∑
u ( n 1, n 2 ) =
v ( k 1, k 2 ) ⋅ α N1 ( k 1 )α N2 ( k 2 )
[4.136]
k1 = 0 k2 = 0
( 2n 1 + 1 )k 1 π
( 2n 2 + 1 )k 2 π
cos -------------------------------- cos -------------------------------2N 1
2N 2
(n1 = 0, 1,...,N1-1; n2 = 0,1,...,N2-1)
For the fast computation of IDCT, we use the same idea “References” on Page 423
as for DCT. Because each butterfly in Figure 4-81 represents an orthogonal transform
(except for a possible scaling), we only need to reserve the signal flow in Figure 4-81
in order to get a signal flow graph for IDCT. By introducing the transformed samples v(k)
in bit reversed order at the right side, we recover u’(n) in natural order at the left side.
The original samples u(n) defined in Equation [4.135] are given by
u(n) =
2
---- u’(n) (n = 0, 1,...,N)
N
= u’(n)/2
[4.137]
for n = 8
like in Equation [4.134]. The number of additions and multiplications is exactly the same
as for DCT. A C code of 16 bit 8 × 8 IDCT is given below. It has the same structure as
for the DCT and differs only in the reversed signal flow.
User’s Manual
4-314
V 1.2, 2000-01
Function Descriptions
4.11
Multidimensional DCT (General Information)
As DCT is a separable transform, 1D DCT, defined in Equation [4.126] can be extended
to 2D DCT as follows.
2D DCT (separable)
N–1M–1
X u, v
c2
4
= ---------- c u c v
NM
∑ ∑
( 2n + 1 )uπ
( 2m + 1 )vπ
x n, m cos --------------------------- cos ----------------------------2N
2M
[4.138]
n = 0m = 0
cl = 1/ 2
u = 0, 1,...,N-1,
v = 0, 1,...,M-1,
1,
l=0
l≠0
2D IDCT
N – 1M – 1
x n, m =
∑ ∑
( 2n + 1 )uπ
( 2m + 1 )vπ
c2
c u c v X u, v cos --------------------------- cos ----------------------------2N
2M
[4.139]
u = 0v = 0
n = 0, 1,...,N-1
m = 0, 1,...,M-1,
The normalized version of 2D DCT is
2D DCT (normalized)
X u, v
c2
2
= c u c v -------------NM
( 2n + 1 )uπ
( 2m + 1 )vπ
x n, m cos --------------------------- cos ----------------------------2N
2M
∑ ∑
n = 0m = 0
N–1
=
2
---N
∑ cu
M–1
2
--------- c v
M
n=0
∑
[4.140]
( 2m + 1 )vπ
( 2n + 1 )uπ
x n, m cos ----------------------------- cos --------------------------2M
2N
m=0
u = 0, 1,...,N-1,
cl = 1/ 2
l=0
v = 0, 1,...,M-1,
1,
l≠0
2D IDCT (normalized)
N – 1M – 1
x n, m
2
= -------------NM
∑ ∑
( 2n + 1 )uπ
( 2m + 1 )vπ
c2
c u c v Xu, v cos --------------------------- cos ----------------------------2N
2M
[4.141]
u = 0v = 0
n = 0, 1,...,N-1
m = 0, 1,...,M-1
User’s Manual
4-315
V 1.2, 2000-01
Function Descriptions
DCT is a separable transform, as is IDCT. An implication of this is that 2D DCT can be
implemented by a series of 1D DCTs, i.e., 1D DCTs along rows (columns) of a 2D array
followed by 1D DCTs along columns (rows) of the semi-transformed array Figure 4-83
Data domain
Transform domain
0, 1, 2, .... , M-1
0
1
.
N-1
x(n,m)
N
(a)
M (N-point 1D-DCT’s) along columns
followed by
N (M-point 1D-DCT’s) along rows
(NxM) 2D-DCT
0, 1, 2, .... , M-1
0
1
.
N-1
x(n,m)
N
(b)
N (M-point 1D-DCT’s) along rows
followed by
M (N-point 1D-DCT’s) along columns
(NxM) 2D-DCT
Figure 4-83 Implementation of 2D (NxM) DCT by Series of 1D DCTs
a) 1D DCTs along columns followed by 1D DCTs along rows.
b) 1D DCTs along rows followed by 1D DCTs along columns.
User’s Manual
4-316
V 1.2, 2000-01
Function Descriptions
Theoretically, both are equivalent. All the properties of the ID DCT (fast algorithms,
recursivity, etc.) extend automatically to the MD-DCT. The separability property can be
observed by rewriting Equation [4.138] as follows.
X u, v
c2
2
= ---N
∑ cu
2
----M
∑
n=0
m=0
M–1
N–1
2
= ---N
2
c u ---N
∑
m
0
( 2m + 1 )vπ
( 2n + 1 )uπ
c v x n, m cos ----------------------------- cos --------------------------2M
2N
[4.142]
( 2n + 1 )uπ
∑ cu x n, m cos -------------------------2N
n
( 2m + 1 )vπ
cos ----------------------------2M
0
u = 0, 1,...,N-1,
v = 0, 1,...,M-1,
A similar manipulation on Equation [4.139] yields the separability property of the 2D
IDCT. This property is illustrated in Figure 4-83.
Since DCT is a separable transform, it can be expressed in a matrix form as follows
2D DCT
T
.
π
π
2
2
x
---- C N
= ---- C N
N
N
(N × N)
(N × N)(N × N) (N × N)
X
c2
[4.143]
2D IDCT
x
(N × N)
=
CN
π
T
X
c2
CN
π
(N × N)(N × N)(N × N )
T
T
π
π
π
π
2
2
CN
CN
---- C N
= ---- C N
N
N
(N × N )( N × N)
( N × N) ( N × N)
=
[4.144]
IN
(N × N )
For the 2D DCT, the sizes (dimensions) along each coordinate need not be the same.
2D DCT
T
π
π
2
2
x
----- C M
= ---- C N
N
M
( N × M)
( N × N )( N × M ) ( M × M )
X
c2
User’s Manual
[4.145]
4-317
V 1.2, 2000-01
Function Descriptions
2D IDCT
x
(N × M)
CN
=
π
T
c2
CM
π
(N × N )( N × M)(M × M )
2
---- C π C π
N
N N
T
2
= ---- C π
N N
2
----- C π C π
M
M M
4.11.1
X
T
[4.146]
π
C N = IN
T
= IM
Descriptions
The following DCT functions are described.
• Discrete Cosine Transform
• Inverse Discrete Cosine Transform
4.11.2
2D 8x8 Spatial Block DCT/IDCT Implementation
The DCT, IDCT is implemented using the Chen’s “References” on Page 423 Fast DCT/
IDCT one dimensional algorithm which is discussed in the earlier Section 4.10.1. The
2D DCT /IDCT exploits the orthogonal property of the algorithm and breaks the 2D 8x8
Spatial block into the 8 rows and 8 columns.
Each row is taken as a whole and is processed by the Chen’s ID DCT as in
Equation [4.135]and the schematic is shown in the signal flow graph Figure 4-81. This
is achieved by the RDct1d macro for the DCT and the RIdct1d macro for the IDCT. The
column is then processed by the CDct1d for the DCT and the CIDct1d for the IDCT.
User’s Manual
4-318
V 1.2, 2000-01
Function Descriptions
DCT_2_8
Discrete Cosine Transform
Signature
DataS* DCT_2_8(DataS *X);
Inputs
X
Output
None
Return
R
Description
User’s Manual
:
Pointer to Real Data block 8 × 8
array Input coefficients
:
Pointer to the Real Data block of
8 × 8 DCT coefficient
This function implements the 2 dimensional Discrete Cosine
Transform. This is implemented using the FDCT algorithm
based on the Chen’s, that falls in the class of orthogonal DCTs.
The data is organized in the 8 × 8 block, the result is returned
in the same block.
4-319
V 1.2, 2000-01
Function Descriptions
DCT_2_8
Discrete Cosine Transform (cont’d)
Pseudo code
{
int t[12],i,j;
for (j=8; j>0; j-=7,d-=8)
{
t[0] = d[0];
t[1] = d[j];
t[2] = d[2 * j];
t[3] = d[3 * j];
t[4] = d[4 * j];
t[5] = d[5 * j];
t[6] = d[6 * j];
t[7] = d[7 * j];
t[8] = t[0] + t[7];
t[7] = t[0] - t[7];
t[9] = t[1] + t[6];
t[6] = t[1] - t[6];
t[10] = t[2] + t[5];
t[5] = t[2] - t[5];
t[11] = t[3] + t[4];
t[4] = t[3] - t[4];
t[0]
t[1]
t[2]
t[3]
=
=
=
=
t[8]
t[8]
t[9]
t[9]
+
+
-
t[11];
t[11];
t[10];
t[10];
t[10] = r[0] * (short) (t[6] - t[5]);
t[11] = r[0] * (short) (t[6] + t[5]);
t[8] = t[4] + t[10];
t[9] = t[4] - t[10];
t[10] = t[7] + t[11];
t[11] = t[7] - t[11];
d[0] = (r[0] * (short)(t[0] + t[2])) >> 15;
d[j] = (r[3] * t[11] + r[4] * t[8]) >> 15;
d[2 * j] = (r[1] * t[1] + r[2] * t[3]) >> 15;
d[3 * j] = (r[5] * t[10] - r[6] * t[9]) >> 15;
d[4 * j] = (r[0] * (short)(t[0] - t[2])) >> 15;
d[5 * j] = (r[6] * (t[10] + r[5] * t[9]) >> 15;
d[6 * j] = (r[2] * t[1] - r[1] * t[3]) >> 15;
d[7 * j] = (r[4] * t[11] - r[3] * t[8]) >> 15;
User’s Manual
4-320
V 1.2, 2000-01
Function Descriptions
DCT_2_8
Discrete Cosine Transform (cont’d)
}
}
Techniques
•
•
•
•
Assumptions
• Input is real sign extended data packed in 16 bit
• Output is the sign extended data shifted to left by 3 bit
positions and packed in 16 bits
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• The processing is done inplace so the input block itself gets
modified by the program
• Dynamic Input range is -2048 to 2047 before scaling
User’s Manual
Packed multiplication/addition
Software pipelining
Load/Store scheduling
Packed Load/Store
4-321
V 1.2, 2000-01
Function Descriptions
DCT_2_8
Discrete Cosine Transform (cont’d)
Memory Note
16 bit 8x8 2Dimensional Block
8 columns
0 1 2
3 4
5 6
7
0
1
8
r
o
w
s
2
3
r
o
w
i
4
5
6
7
DCT-Row
16
bit
i
Note: Input spatial block has to be
scaled up by 8
i+1
0
1
DCT-Column
2
3
4
5
6
7
Figure 4-84 DCT_2_8
User’s Manual
4-322
V 1.2, 2000-01
Function Descriptions
DCT_2_8
Discrete Cosine Transform (cont’d)
Implementation
Section 4.11.2
Example
Trilib\Example\Tasking\Transforms\DCT\expDCT_2_8.c,
expDCT_2_8.cpp
Trilib\Example\GreenHills\Transforms\DCT
\expDCT_2_8.cpp, expDCT_2_8.c
Trilib\Example\GNU\Transforms\DCT\expDCT_2_8.c
Cycle Count
Initialization
:
4
Kernel
:
453
Post Processing
:
3
Code Size
User’s Manual
444 bytes
4-323
V 1.2, 2000-01
Function Descriptions
IDCT_2_8
Inverse Discrete Cosine Transform
Signature
DataS* IDCT_2_8(DataS *X);
Inputs
X
Output
None
Return
R
Description
User’s Manual
:
Pointer to Real Data block 8 × 8
array Input coefficients
:
Pointer to the Real Data block of
8 × 8 DCT coefficient
This function implements the 2D Inverse Discrete Cosine
Transform. This is implemented using the FIDCT algorithm
based on the Chen’s, that falls in the class of orthogonal DCTs.
The data is organized in the 8 × 8 block, the result is returned
in the same block.
4-324
V 1.2, 2000-01
Function Descriptions
IDCT_2_8
Inverse Discrete Cosine Transform (cont’d)
Pseudo code
{
int t[12],i,j;
for (j=8; j>0; j-=7,d-=8)
{
t[0] = d[0];
t[1] = d[j];
t[2] = d[2 * j];
t[3] = d[3 * j];
t[4] = d[4 * j];
t[5] = d[5 * j];
t[6] = d[6 * j];
t[7] = d[7 * j];
t[8] = (r[4] * t[1] - r[3] * t[7]) >> 15;
t[9] = (r[3] * t[1] + r[4] * t[7]) >> 15;
t[10] = (r[5] * t[5] - r[6] * t[3]) >> 15;
t[11] = (r[6] * t[5] + r[5] * t[3]) >> 15;
t[1]
t[3]
t[5]
t[7]
=
=
=
=
(r[0]
(r[0]
(r[2]
(r[1]
*
*
*
*
(short) (t[0]
(short) (t[0]
t[2] - r[1] *
t[2] + r[2] *
t[0]
t[2]
t[4]
t[6]
=
=
=
=
t[1]
t[1]
t[3]
t[3]
+
+
-
t[7];
t[7];
t[5];
t[5];
t[1]
t[3]
t[5]
t[7]
=
=
=
=
t[8]
t[8]
t[9]
t[9]
+
-
t[10];
t[10];
t[11];
t[11];
+ t[4]))
- t[4]))
t[6]) >>
t[6]) >>
>> 15;
>> 15;
15;
15;
t[10] = r[0] * (short) (t[5] - t[3]) >> 15;
t[11] = r[0] * (short) (t[5] + t[3]) >> 15;
d[0] = t[0] + t[7];
d[j] = t[4] + t[11];
d[2 * j] = t[6] + t[10];
d[3 * j] = t[2] + t[1];
d[4 * j] = t[2] - t[1];
User’s Manual
4-325
V 1.2, 2000-01
Function Descriptions
IDCT_2_8
Inverse Discrete Cosine Transform (cont’d)
d[5 * j] = t[6] - t[10];
d[6 * j] = t[4] - t[11];
d[7 * j] = t[0] - t[7];
}
}
Techniques
• Packed multiplication/additions
• Load/Store scheduling
• Packed Load/Store
Assumptions
• Input is real sign extended data packed in 16 bit and has to
be scaled up by a factor of 8 (left shifted by 3)
• Output is the sign extended data packed in the 16 bit
• Input is halfword aligned in IntMem and word aligned in
ExtMem
• The processing is done inplace so the input block itself gets
modified by the program
• Dynamic Input range is -2048 to 2047 before scaling
User’s Manual
4-326
V 1.2, 2000-01
Function Descriptions
IDCT_2_8
Inverse Discrete Cosine Transform (cont’d)
Memory Note
16 bit 8x8 2Dimensional Block
8 columns
0 1 2
3 4
5 6
7
0
1
8
r
o
w
s
2
3
r
o
w
i
4
5
6
7
IDCT-Row
16
bit
i
Note: Input spatial block has to be
scaled up by 8
i+1
0
1
IDCT-Column
2
3
4
5
6
7
Figure 4-85 IDCT_2_8
User’s Manual
4-327
V 1.2, 2000-01
Function Descriptions
IDCT_2_8
Inverse Discrete Cosine Transform (cont’d)
Implementation
Section 4.11.2
Example
Trilib\Example\Tasking\Transforms\DCT\expDCT_2_8.c,
expDCT_2_8.cpp
Trilib\Example\GreenHills\Transforms\DCT
\expDCT_2_8.cpp, expDCT_2_8.c
Trilib\Example\GNU\Transforms\DCT\expDCT_2_8.c
Cycle Count
Initialization
:
4
Kernel
:
439
Post Processing
:
3
Code Size
User’s Manual
430 bytes
4-328
V 1.2, 2000-01
Function Descriptions
4.12
Mathematical Functions
4.12.1
Functions using Polynomial Approximation
The Mathematical and Trignometrical functions can be approximated by polynomial
expansion. Generally, Taylor & McLaren series are used for expansion of these
functions. The function uses the coefficients calculated by statistical analysis technique
of regression. Only limited terms of series are used. To improve the accuracy of the
output of the function, the optimized coefficients are used.
4.12.1.1 Descriptions
The following series functions are described.
•
•
•
•
•
•
•
•
Sine
Cosine
Arctan
Square Root
Natural log
Natural Antilog
Exponential
X Power Y
User’s Manual
4-329
V 1.2, 2000-01
Function Descriptions
Sine_32
Sine
Signature
DataS Sine_32(int X);
Inputs
X
Output
None
Return
Description
R
:
The radian input in [-pi,pi] range
:
Output sine value of the function
This function calculates the sine of an angle. It takes 32 bit
input which represents the angle in radians and returns the 16
bit sine value.
Pseudo code
{
int Xabs;
int sign;
frac32 XbyPi;
frac32 acc;
frac32 Rf;
frac16 R;
//Stores Absolute value
//sign of the result
//angle scaled down by pi
//Output of polynomial calculation in 4Q28 format
//32-bit Sine value in 1Q31
//Result in 1Q15 format
Xabs = |X|;
if (Xabs != X)
sign = 1;
//sign = 1 if X is in III or IV quadrant
if (Xabs > Pi/2)
Xabs = Pi - Xabs;
//if input angle in II or III quadrant subtract
//absolute value from pi
XbyPi = Xabs (*) one_Pi;
//angle is scaled down by pi before being used in the
//polynomial calculation
acc = ((((H[4] (*) XbyPi + H[3]) (*) XbyPi + H[2]) (*) XbyPi
+ H[1]) (*) XbyPi + H[0]) (*) XbyPi;
//polynomial calculation - acc in 4Q28 format
acc = acc << 3;
//acc in 1Q31 format
if (sign == 1)
Rf = 0 - acc;
//sine is negative in III and IV quadrant
R = (frac16)Rf;
//16 bit result in 1Q15 format
return R;
//Returns the calculated sine value
}
Techniques
• Use of MAC instructions
• Instruction ordering provided for zero overhead Load/Store
Assumptions
• Input is the radian value in 3Q29 format, output is the sine
value in 1Q15 format and coefficients are in 4Q28 format
User’s Manual
4-330
V 1.2, 2000-01
Function Descriptions
Sine_32
Sine (cont’d)
Memory Note
None
Implementation
Sin(x), where x is in radians is approximated using the
polynomial expansion.
sin ( x ) = 3.140625 ( x ⁄ π ) + 0.02026367 ( x ⁄ π )
3
2
– 5.325196 ( x ⁄ π ) + 0.5446778 ( x ⁄ π )
+ 1.800293 ( x ⁄ π )
4
[4.147]
5
0 ≤ x ≤ π ⁄ 2 radians.
Sine value in other quadrants is computed by using the
relations,
sin ( – x ) = – sin ( x ) and sin ( 180 – x ) = sin x
The function takes 32 bit radian input in 3Q29 format to
accommodate the range ( – π, π ) . The output is 16 bits in 1Q15
format. Coefficients are stored in 4Q28 format. Constants pi,
pi/2 and 1/pi are also stored in the data segment in 3Q29,
3Q29 and 1Q31 formats respectively.
The absolute value of the radian input is calculated. If the
input angle is negative (III/IV Quadrant), then sign=1. If
absolute value of the angle is greater than pi/2 (II/III
Quadrant), it is subtracted from pi. The angle is then scaled
down by pi, converted to 1Q31 and used in polynomial
calculation. The result is negated, if sign=1 to give the final
sine result.
To have an optimal implementation with zero overhead load
store, the polynomial in Equation [4.147] is rearranged as
below.
sin ( x ) = ( ( ( ( 1.800293 ( x ⁄ π ) + 0.5446778 ) ( x ⁄ π )
– 5.325196 ) ( x ⁄ π ) + 0.02026367 ) ( x ⁄ π )
[4.148]
+ 3.140625 ) ( x ⁄ π )
Hence, 4 multiply-accumulate and 1 multiply instruction will
compute the expression Equation [4.148] with a load of
coefficient done in parallel with each of them.
User’s Manual
4-331
V 1.2, 2000-01
Function Descriptions
Sine_32
Sine (cont’d)
Example
Trilib\Example\Tasking\Mathematical\expSine_32.c,
expSine_32.cpp
Trilib\Example\GreenHills\Mathematical\expSine_32.cpp,
expSine_32.c
Trilib\Example\GNU\Mathematical\expSine_32.c
Cycle Count
With DSP
Extensions
If input angle is in
(I/II Quadrant)
:
15+2
If input angle is in
(III/IV Quadrant)
:
18+2
If input angle is in
(I/II Quadrant)
:
16+2
If input angle is in
(III/IV Quadrant)
:
19+2
Without DSP
EXtensions
Code Size
76 bytes
32 bytes (Data)
User’s Manual
4-332
V 1.2, 2000-01
Function Descriptions
Cos_32
Cosine
Signature
DataS Cos_32(int X);
Inputs
X
Output
None
Return
Description
R
:
The radian input in [-pi,pi] range
:
Output cosine value of the function
This function calculates the cosine of an angle. It takes 32 bit
input which represents the angle in radians and returns the 16
bit cosine value.
Pseudo code
{
int Xabs;
//absolute value of angle
frac32 XbyPi;
//angle scaled down by pi
frac32 Pi = pi;
frac32 one_Pi = 1/pi;
//Constant 1/pi in 1Q31 format
int sign;
//sign of the result
frac32 acc;
//Output of polynomial calculation in 4Q28 format
frac32 Rf;
//32-bit Cosine value in 1Q31
frac16 R;
//Result in 1Q15 format
Xabs = |X|;
X = Pi/2 - Xabs;
Xabs = |X|;
if (X != Xabs)
sign = 1;
//Complementary angle is calculated
//sign = 1 if input angle is in the II or III
//quadrant
XbyPi = Xabs (*) one_Pi;
//angle is scaled down by pi before being used in the
//polynomial calculation
acc = ((((H[4] (*) XbyPi + H[3]) (*) XbyPi + H[2]) (*) XbyPi
+ H[1]) (*) XbyPi + H[0]) (*) XbyPi;
//polynomial calculation - acc in 4Q28 format
Rf = acc << 3;
//acc in 1Q31 format
if (sign == 1)
//cosine value is negative in the II or III quadrant
Rf = 0 - acc;
R = (frac16)Rf;
return R;
//cosine result in 1Q15 format
//Returns the calculated cosine value
}
Techniques
User’s Manual
• Use of MAC instructions
• Instruction ordering provided for zero overhead Load/Store
4-333
V 1.2, 2000-01
Function Descriptions
Cos_32
Cosine (cont’d)
Assumptions
• Input is the radian value in 3Q29 format, output is the
cosine value in 1Q15 format and coefficients are in 4Q28
format
Memory Note
None
Implementation
Cos(x) is approximated by the same polynomial expression
used for sine as cos ( x ) = sin ( 90 – x ) .
The function takes 32 bit radian input in 3Q29 format to
accommodate the range ( – π, π ) . The output is 16 bits in
1Q15 format. Coefficients are stored in 4Q28 format.
Constants pi, pi/2 and 1/pi are also stored in the data segment
in 3Q29, 3Q29 and 1Q31 formats respectively.
Absolute value of the radian input is calculated. Its
complementary angle is determined. If the complementary
angle is negative, the input angle is in II/III Quadrant where
cos is negative. Hence sign=1. The absolute value of
complementary angle is scaled down by pi, brought to 1Q31
format and is used in the polynomial calculation. If sign=1, the
result of the polynomial calculation is negated, to give the final
cosine result.
The implementation of the polynomial is optimal with zero
overhead Load/Store.
Example
Trilib\Example\Tasking\Mathematical\expCos_32.c,
expCos_32.cpp
Trilib\Example\GreenHills\Mathematical\expCos_32.cpp,
expCos_32.c
Trilib\Example\GNU\Mathematical\expCos_32.c
Cycle Count
With DSP
Extensions
User’s Manual
If input angle is in
(I/IV Quadrant)
:
15+2
If input angle is in
(III/II Quadrant)
:
18+2
4-334
V 1.2, 2000-01
Function Descriptions
Cos_32
Cosine (cont’d)
Without DSP
Extensions
Code Size
If input angle is in
(I/IV Quadrant)
:
16+2
If input angle is in
(III/II Quadrant)
:
19+2
68 bytes
28 bytes (Data)
User’s Manual
4-335
V 1.2, 2000-01
Function Descriptions
Arctan_32
Arctan
Signature
short Arctan_32(int X);
Inputs
X
Output
None
Return
Description
User’s Manual
R
:
tan value in the range [-215, 215)
:
Output arctan value of the function
This function calculates the arc tangent of the input. The input
to the function is 32 bits. The input range is [-215, 215). The
function returns 16 bit value which represents the angle in
radians.
4-336
V 1.2, 2000-01
Function Descriptions
Arctan_32
Arctan (cont’d)
Pseudo code
{
frac32 Xabs;
frac32 X1Q31;
frac32 acc;
int sign;
frac32 Rf;
frac16 R;
//absolute value of input
//|X| or 1/|X| in 1Q31 format used in the polynomial
//calculation
//Output of the polynomial calculation in 1Q31 format
//sign of the result
//32 bit arctan value in 2Q30 format
//16 bit arctan result in 2Q14 format
Xabs = |X|;
if (X != Xabs)
sign = 1;
//if input tan value is negative,sign = 1
if (Xabs > 1)
X1Q31 = 1/Xabs;
//X1Q31 = 1/|X| in 1Q31 format if |X| > 1
else
X1Q31 = Xabs << 15;
//X1Q31 = |X| in 1Q31 format
acc = ((((H[4] (*) X1Q31 + H[3]) (*) X1Q31 + H[2]) (*) X1Q31
+ H[1]) (*) X1Q31 + H[0]) (*) X1Q31;
//polynomial calculation - acc in 1Q31 format
if (Xabs > 1)
acc = 0.5 - acc;//polynomial result is subtracted from 0.5 if
//1/|X| has been used in the calculation
Rf = acc (*) Pi;
R = (frac16)Rf;
return R;
//32 bit arctan value in radians - Rf in 2Q30 format
//16 bit arctan value in radians in 2Q14 format
//Returns the calculated arctan value
}
Techniques
• Use of MAC instructions
• Instruction ordering provided for zero overhead Load/Store
Assumptions
• Input tan value is in 16Q16 format, output is the angle in
radians in 2Q14 format and coefficients are in 1Q31 format
Memory Note
None
User’s Manual
4-337
V 1.2, 2000-01
Function Descriptions
Arctan_32
Arctan (cont’d)
Implementation
Arctan(x) in radians is approximated using the following
polynomial expansion.
For x<1,
2
arc tan ( x ) = π ( 0.318253x + 0.003314x – 0.130908x
4
5
+ 0.068542x – 0.009159x )
3
[4.149]
For x ≥ 1 the formula
arc tan ( x ) = π ⁄ 2 – arc tan ( 1 ⁄ x )
[4.150]
can be used.
As 1/x < 1 (for x>1), the polynomial of Equation [4.149] can
be used to compute arctan(1/x).
Combining Equation [4.149] and Equation [4.150],
For x ≥ 1 ,
arc tan ( x ) = π ( 0.5 – arc tan ( 1 ⁄ x ) )
The input to the function is 32 bits in 16Q16 format. Hence
input is in the range [-215, 215). The function returns 16 bit
output which is the arctan value in radians. Since arctan
values lie in the range [-pi/2, pi/2] output format is 2Q14. 32
bits are used to store coefficients in 1Q31 format in the data
segment. π value is also stored in 3Q29 format in data
segment. The absolute value of the input is taken in a register
and if input is less than 0, sign is set to 1. When input is less
than 1, the upper 16 bits of absolute value will be zero and the
lower 16 bits represent the tan value in 0Q16. Shifting 15
times to the left will bring the input to 1Q31 format. This value
is used in polynomial calculation. The output of the polynomial
is multiplied by π and if sign=1, the result is negated to give
the final arctan result.
If x > 1, the reciprocal is calculated by dividing a one in
16Q16 format by the given input. The result gives reciprocal
of input in 0Q32, which is converted to 1Q31. This value is
now used in the polynomial calculation.
User’s Manual
4-338
V 1.2, 2000-01
Function Descriptions
Arctan_32
Arctan (cont’d)
The result of the polynomial calculation is subtracted from 0.5
and then multiplied by pi. Once again, it is negated if sign =1
to give the final arctan result in radians.
The implementation of the polynomial is optimal with zero
overhead Load/Store.
Example
Trilib\Example\Tasking\Mathematical\expArctan_32.c,
expArctan_32.cpp
Trilib\Example\GreenHills\Mathematical\expArctan_32.cpp,
expArctan_32.c
Trilib\Example\GNU\Mathematical\expArctan_32.c
Cycle Count
For |X| < 1 and X > 0
:
28+2
For |X| < 1 and X < 0
:
31+2
For |X| > 1 and X > 0
:
50+2
For |X| > 1 and X < 0
:
53+2
Code Size
126 bytes
24 bytes(Data)
User’s Manual
4-339
V 1.2, 2000-01
Function Descriptions
Sqrt_32
Square Root
Signature
short Sqrt_32(int X);
Inputs
X
Output
None
Return
R
Description
User’s Manual
:
Real input value in the range
[0, 214)
:
Output value of the function
This function calculates the square root of a given number. It
takes 32 bit input in the range [0, 214) and returns 16 bit square
root value in the range [0, 27).
4-340
V 1.2, 2000-01
Function Descriptions
Sqrt_32
Square Root (cont’d)
Pseudo code
{
int Shcnt;
int Scale;
frac32 acc;
frac32 X1Q31;
frac16 R;
//Shift count
//Scaling factor
//Result of Polynomial calculation
//Input scaled to 1Q31 format
//Result in 8Q8 format
Shcnt = count_lead_sign(X);
// number of leading sign values
Scale = Shcnt - 15;//Get the scale factor
X1Q31 = X << Shcnt;// 1Q31 <- 16Q16
acc = ((((H5 (*) X1Q31 + H4) (*) X1Q31 + H3) (*) X1Q31 + H2) (*) X1Q31 +
H1) (*) X1Q31 + H0
//polynomial calculation - acc in 1Q31 format
//Input less than 1
if (Scale >= 0)
{
acc = acc (*) SqrtTab[Scale];
//acc = acc * Scale factor
R = (frac16) acc >> 22;
//8Q8 format <- 2Q30 format
}
//Input greater than 1
else
{
acc = acc (*) SqrtTab[ShCnt+1];
//acc = acc * Scale factor
R = (frac16) acc >> 14;
//8Q8 format <- 10Q22 format
}
return R;
//Returns the calculated square root
}
Techniques
• Use of MAC instructions
• Instruction ordering for zero overhead Load/Store
Assumptions
• Inputs are in 16Q16 format and returned output is in 8Q8
format
• Input is always positive
Memory Note
None
User’s Manual
4-341
V 1.2, 2000-01
Function Descriptions
Sqrt_32
Square Root (cont’d)
Implementation
The square root of the input value x can be calculated by
using the following approximation series.
2
sqrt ( x ) = 1.454895x – 1.34491x + 1.106812x
4
5
3
[4.151]
– 0.536499x + 0.1121216x + 0.2075806
where, 0.5 ≥ x ≥ 1
The coefficients of polynomial are stored in 2Q30 format. The
n
square root table (table of scale factors) stores ( 1 ⁄ 2 ) in
1Q31 format where n ranges from 0 to 15. This is same as
n
( 2 ) in 9Q23 format, where n ranges from 16 to 1. The 32
bit input given is in 16Q16 format which can take values in the
range [-215, 215). As input should be positive it will be subset
of actual input range, i.e., it is in the range [0, 215). The 16 bit
output returned is in 8Q8 format. So the output values are in
the range of [0, 27). So it can accommodate inputs in the
range [0, 214).
As the polynomial expansion needs input only in the range 0.5
to 1, the given input has to be scaled up or scaled down. If the
given input number is greater than 1, then it is scaled down by
powers of two, so that scaled input value lies in the range 0.5
to 1.This scaled input is used in polynomial calculation. The
calculated output is scaled up by power of 2 to get the
actual output.
If the input is less than 1, then it is scaled up by power of two,
so that scaled value lies in the range 0.5 to 1. This scaled
input is used in polynomial calculation. The calculated output
is scaled down by power of 1 ⁄ 2 to get actual output.
The CLS instructions of TriCore gives directly the shiftcount,
to scale up or scale down the input. When input is shifted by
this count, it is brought into 1Q15 format. If shiftcount is15,
input already exists in the range of 0.5 to 1. If shiftcount is less
than 15, indicates input is greater than 1 and has to be scaled
down.
User’s Manual
4-342
V 1.2, 2000-01
Function Descriptions
Sqrt_32
Square Root (cont’d)
If shiftcount is greater than 15, indicates input is less than 1
and has to be scaled up.
Scale factor is obtained as (15-shiftcount). The output of
polynomial calculation is scaled by a value from square root
table. The appropriate scale factor is obtained and multiplied
to get the square root of given input.
The implementation of the polynomial is optimal with zero
overhead Load/Store.
Example
Trilib\Example\Tasking\Mathematical\expSqrt_32.c,
expSqrt_32.cpp
Trilib\Example\GreenHills\Mathematical\expSqrt_32.cpp,
expSqrt_32.c
Trilib\Example\GNU\Mathematical\expSqrt_32.c
Cycle Count
If X>1
:
14+2
If X<=1
:
16+2
Code Size
88 byes
88 bytes(Data)
User’s Manual
4-343
V 1.2, 2000-01
Function Descriptions
Ln_32
Natural logarithm
Signature
short Ln_32(int X);
Inputs
X
Output
None
Return
R
Description
User’s Manual
:
Real input value in the range
[2-16, 215)
:
Output value of the function
This function calculates logarithm of a function to the base e,
i.e., natural logarithm. It takes 32 bit input in the range
[2-16, 215) and returns the output logarithm in the range
[-24, 24).
4-344
V 1.2, 2000-01
Function Descriptions
Ln_32
Natural logarithm (cont’d)
Pseudo code
{
int Shcnt
int Scale;
frac32 acc;
frac32 Xu1Q31;
frac32 Xsub1;
frac32 Rf;
frac16 R;
//Shift count
//Scaling factor
//Result of Polynomial calculation
//Input scaled to unsigned 1Q31 format
//X-1
//Output of polynomial calculation
//Result in 5Q11 format
Shcnt = count_lead_sign(X);
// number of leading sign values
Scale = 14 - Shcnt;//Get the scale factor
Shcnt = Shcnt + 1; //add 1 to shift count to bring input to
//1 to 2(unsigned 1Q15)from 0.5 to 1
Xu1Q31 = X << Shcnt;
//unsigned 1Q15 <- 16Q16
Xsub1 = Xu1Q31 - 1;//X = X - 1
acc = ((((H4 * Xsub1 + H3) * Xsub1 + H2) * Xsub1 + H1) * Xsub1 + H0) *
Xsub1
//polynomial calculation - acc in 1Q31 format
acc = acc << 4;
//5Q27 <- 1Q31
Add = Scale (*) ln2;
//Get the adding factor by scaling Ln2
Add = Add << 12;
//5Q27 <- 17Q15
Rf = acc + Add;
R = (frac16)Rf;
//Add the factor to get the result in 5Q27
//format
//result in 5Q11 format
return R;
//Returns the calculated natural logarithm
}
Techniques
• Use of MAC instructions
• Instruction ordering for zero overhead Load/Store
Assumptions
• Inputs are in 16Q16 format and returned output is in 5Q11
format
• Input is always positive
Memory Note
None
User’s Manual
4-345
V 1.2, 2000-01
Function Descriptions
Ln_32
Natural logarithm (cont’d)
Implementation
The natural logarithm of the input value x can be calculated
using the following approximation series.
ln ( x ) = 0.9991150 ( x – 1 ) – 0.4899597 ( x – 1 )
2
3
+ 0.2856751 ( x – 1 ) – 0.1330566 ( x – 1 )
+ 0.03137207 ( x – 1 )
4
[4.152]
5
where, 1 ≥ x ≥ 2 which means 0 ≥ ( x – 1 ) ≥ 1
The coefficients of polynomial are stored in 1Q31 format. The
constant ln2 is also stored in 1Q31 format.
The 32 bit input is in 16Q16 format which can take values in
the range [-215, 215). As input to logarithm should always be
positive it will be subset of actual input range, i.e., it is in the
range [2-16, 215). The 16 bit output returned format is in 5Q11
format.
As the polynomial expansion needs x in the range 1 to 2, the
input has to be scaled up or scaled down. If the given input
number is greater than 1, then it is scaled down. If less than
1, it is scaled up by powers of two, so that scaled input lies in
the range 1 to 2. One is subtracted from this scaled input and
this is used in polynomial calculation.
The scale factor is positive, if input is greater than 1 and
negative, if input is less than 1. The CLS instruction of TriCore
gives the shiftcount. When the input is shifted by this
shiftcount it will be scaled in the range 0.5 to 1. The
polynomial expects input to be in the range 1 to 2 (unsigned).
So 1 is added to the shiftcount.
Scale factor is obtained as (14-shiftcount). The output of
polynomial is added with scale times ln2 to get the natural
logarithm of given input.
The implementation of the polynomial is optimal with zero
overhead Load/Store.
User’s Manual
4-346
V 1.2, 2000-01
Function Descriptions
Ln_32
Natural logarithm (cont’d)
Example
Trilib\Example\Tasking\Mathematical\expLn_32.c,
expLn_32.cpp
Trilib\Example\GreenHills\Mathematical\expLn_32.cpp,
expLn_32.c
Trilib\Example\GNU\Mathematical\expLn_32.c
Cycle Count
For all X
Code Size
86 bytes
:
19+2
24 bytes (Data)
User’s Manual
4-347
V 1.2, 2000-01
Function Descriptions
AntiLn_16
Natural Antilogarithm
Signature
int AntiLn_16(short X);
Inputs
X
Output
None
Return
Description
R
:
Real Input value in the range [-8, 8)
:
Output value of the function
This function calculates antilog of a function. It takes 16 bit
input in the range [-23, 23) and returns 32 bit antilog value in
the range [2-16, 216).
Pseudo code
{
int Shcnt
int Scale;
frac32 acc;
frac32 Rf;
frac32 X1Q31;
int Expow;
frac32 R;
//Shift count
//Scaling factor
//Result of Polynomial calculation
//Result of antilog in Q format
//Input scaled to 1Q31 format
//Power of calculated polynomial
//Result in 16Q16 format
Shcnt = count_lead_sign(X);
//number of leading sign values
X1Q31 = X << Shcnt;//1Q15 <- 4Q12
Scale = 19 - Shcnt;//Get the scale factor
acc = ((((H5 (*) X1Q31 + H4) (*) X1Q31 + H3) (*) X1Q31 + H2) (*) X1Q31
+ H1) (*) X1Q31 + H0
//polynomial calculation - acc in 3Q29 format
if(Scale <= 0)
{
R = acc >> 13;
}
User’s Manual
//Final result in 16Q16 format
4-348
V 1.2, 2000-01
Function Descriptions
AntiLn_16
Natural Antilogarithm (cont’d)
else{
Rf = acc;
//Rf <- acc
Expow = 1 << Scale;
// Get power of e^x1Q31
tmp = Expow - 1;
//x^n needs (n-1) multiplications
for (i=0;i<tmp;i++)
{
Rf = Rf (*) acc;
//Multiply calculated e^x1Q31 with itself power times
}
//Get the shift count to convert final result in 16Q16 format
Expow = Expow << 1;
ShCnt = Expow - 15;
R = Rf << ShCnt;
//Final result in 16Q16 format
}
return R;
//Returns the calculated natural antilogarithm
}
Techniques
• Use of MAC instructions
• Instruction ordering for zero overhead Load/Store
Assumptions
• Input 4Q12 format, output is the antilog of the input in
16Q16 format and coefficients are in 3Q29 format
Memory Note
None
User’s Manual
4-349
V 1.2, 2000-01
Function Descriptions
AntiLn_16
Natural Antilogarithm (cont’d)
Implementation
The antilog of the input value x can be calculated by using the
following approximation series.
2
AntiLn ( x ) = 1.0000 + 1.0001x + 0.4990x + 0.1705x
4
+ 0.0348x + 0.0139x
5
3
[4.153]
The coefficients of polynomial are stored in 3Q29 format. The
16 bit input is in 4Q12 format which can take values in the
range [-23, 23). The output returned is in 16Q16 format.
The input is scaled in the range -1 to +1. If the given number
is greater than 1, it is scaled down and if it is less than -1, it is
scaled up by powers of 2. This scaled input is used in
polynomial calculation.
The CLS instruction of TriCore gives the shiftcount to scale up
or scale down the input. Only when shiftcount is less than 19,
input is scaled up or scaled down. Otherwise input is in the
range -1 to +1. The scale factor is obtained as (19-shiftcount).
This scale factor will always be positive for the inputs greater
than 1 and less than -1. The output of polynomial calculation
is multiplied with itself scale factor times to get the actual
output.
The implementation of the polynomial is optimal with zero
overhead Load/Store.
Example
Trilib\Example\Tasking\Mathematical\expAntiLn_16.c,
expAntiLn_16.cpp
Trilib\Example\GreenHills\Mathematical\expAntiLn_16.cpp,
expAntiLn_16.c
Trilib\Example\GNU\Mathematical\expAntiLn_16.c
Cycle Count
If X in the range
-1 to 1
:
14+2
else
:
16 + ( scale × 2 ) + 5 + 2
Code Size
104 bytes
24 bytes (Data)
User’s Manual
4-350
V 1.2, 2000-01
Function Descriptions
Expn_16
Exponential
Signature
short Expn_16(DataS X);
Inputs
X
Output
None
Return
R
Description
:
Real Input value in the range [-1, 1)
:
Output exponent value of the
function
This function calculates the exponent of the given input. It
takes 16 bit input in the range [-1, 1) and returns the
exponential value in 16 bits.
Pseudo code
{
frac32 acc;
//result of polynomial calculation in 3Q29 format
frac16 R;
//16 bit exponential result in 3Q13 format
acc = ((((H[5] (*) X + H[4]) (*) X + H[3]) (*) X
+ H[2]) (*) X + H[1]) (*) X + H0;
//polynomial calculation - acc is result in 3Q29 format
R = (frac16)acc;
//16 bit exponential result in 3Q13 format
}
Techniques
• Use of packed data Load/Store
• Use of MAC instructions
• Instruction ordering for zero overhead Load/Store
Assumptions
• Input 1Q15 format, output is the exponential of the input in
3Q13 format and coefficients are in 3Q29 format
Memory Note
None
Implementation
Exp(x) is approximated using the polynomial expansion given
below.
2
exp ( x ) = 1.0000 + 1.0001x + 0.4990x + 0.1705x
4
+ 0.0348x + 0.0139x
5
3
[4.154]
The input to the function is 16 bits in 1Q15 format. Hence input
range is [-1, 1). Input outside this range should be scaled to
this range before calling the function. Coefficients are stored
in 3Q29 format. Output of the function is in 3Q13 format.
The polynomial is implemented in an optimal way so as to
have zero overhead Load/Store.
User’s Manual
4-351
V 1.2, 2000-01
Function Descriptions
Expn_16
Exponential (cont’d)
Example
Trilib\Example\Tasking\Mathematical\expExpn_16.c,
expExpn_16.cpp
Trilib\Example\GreenHills\Mathematical\expExpn_16.cpp,
expExpn_16.c
Trilib\Example\GNU\Mathematical\expExpn_16.c
Cycle Count
10+2
Code Size
42 bytes
24 bytes (Data)
User’s Manual
4-352
V 1.2, 2000-01
Function Descriptions
XpowY_32
X Power Y
Signature
int XpowY_32(int X, DataS Y);
Inputs
X
:
Y
:
Output
None
Return
R
Description
User’s Manual
:
Real input value in the range [2-11,
211)
power in the range [-1,1)
Output value of the function in the
range [2-11, 211)
X power Y is calculated. The input is 32-bit in 12Q20 format but
it should lie within the range [2-11, 211). The exponent
Y is 16-bit in 1Q15 format and is in the range [-1,1). The output
is 32-bit in 12Q20 format and lies in the range
[2-11, 211)
4-353
V 1.2, 2000-01
Function Descriptions
XpowY_32
X Power Y (cont’d)
Pseudo code
{
int Shcnt
int Scale;
frac32 acc;
frac32 Xu1Q31;
frac32 Xsub1;
frac32 Rf;
frac32 LnX;
frac32 LnXPowY;
int Expow;
frac32 R;
//Shift count
//Scaling factor
//Result of Polynomial calculation
//Input scaled to unsigned 1Q31 format
//X-1
//Output of polynomial calculation
//Result of ln in 4Q28 format
//Y*lnX in 4Q28 format
//Power of calculated polynomial
//Result in 12Q20 format
Shcnt = count_lead_sign(X);
// number of leading sign values
Scale = 10 - Shcnt;//Get the scale factor
Shcnt = Shcnt + 1; //add 1 to shift count to bring input to
//1 to 2(unsigned 1Q15)from 0.5 to 1
Xu1Q31 = X << Shcnt;
//unsigned 1Q15 <- 16Q16
Xsub1 = Xu1Q31 - 1;//X = X - 1
if(Xsub1 == 0)
go to XpowY_2
acc = ((((H4 * Xsub1 + H3) * Xsub1 + H2) * Xsub1 + H1) * Xsub1 + H0) *
Xsub1
//polynomial calculation - acc in 1Q31 format
acc = acc << 3;
//4Q28 <- 1Q31
XpowY_2:
Scale = Scale << 26;
//6Q26 <Add = Scale (*) ln2;
//Get the
Add = Add << 2;
//4Q28 <LnX = acc + Add;
//Add the
//format
LnXpowY = LnX (*) Y;
32Q0
adding factor by scaling Ln2
6Q26
factor to get the result in 4Q28
Shcnt = count_lead_sign(LnXpowY);
//number of leading sign values
X1Q31 = LnXpowY << Shcnt;//1Q31 <- 4Q28
User’s Manual
4-354
V 1.2, 2000-01
Function Descriptions
XpowY_32
X Power Y (cont’d)
Scale = 19 - Shcnt;//Get the scale factor
acc = ((((H5 (*) X1Q31 + H4) (*) X1Q31 + H3) (*) X1Q31 + H2) (*) X1Q31
+ H1) (*) X1Q31 + H0
//polynomial calculation - acc in 3Q29 format
if(Scale <= 0)
{
R = acc >> 9;
//Final result in 12Q20 format
}
else
{
Rf = acc;
//Rf <- acc
Expow = 1 << Scale;
// Get power of e^x1Q31
tmp = Expow - 1;
//x^n needs (n-1) multiplications
for (i=0;i<tmp;i++)
{
Rf = Rf (*) acc;
//Multiply calculated e^x1Q31 with itself power times
}
//Get the shift count to convert final result in 12Q20 format
Expow = Expow << 1;
ShCnt = Expow - 11;
R = Rf << ShCnt;
//Final result in 12Q20 format
}
return R;
//Returns the calculated X power Y
}
Techniques
• Use of MAC instructions
• Instruction ordering for zero overhead Load/Store
Assumptions
• Inputs are in 12Q20 format and should in the range [2-11,
211) which is a subset of actual range. Exponent is in 1Q15
format and is in the range [-1,1).The returned output is in
12Q20 format and lies in the range [2-11, 211)
• Input is always positive
Memory Note
None
User’s Manual
4-355
V 1.2, 2000-01
Function Descriptions
XpowY_32
X Power Y (cont’d)
Implementation
X power Y can be calculated as e(Y.lnX). The natural logarithm
of the input value x can be calculated using the following
approximation series.
ln ( x ) = 0.9991150 ( x – 1 ) – 0.4899597 ( x – 1 )
2
3
+ 0.2856751 ( x – 1 ) – 0.1330566 ( x – 1 )
+ 0.03137207 ( x – 1 )
4
[4.155]
5
where, 1 ≥ x ≥ 2 which means 0 ≥ ( x – 1 ) ≥ 1
The coefficients of polynomial are stored in 1Q31 format. The
constant ln2 is also stored in 1Q31 format.
The 32 bit input is in 12Q20 format which can take values in
the range [-211, 211). As input to logarithm should always be
positive it will be subset of actual input range, i.e., in the range
[2-20, 211). For proper operation of lnX and antiln(Y.lnX) input
should lie in the range [2-11, 211). The 32 bit output format is
12Q20 which lies in the range [2-11, 211). Implementation of
lnX is same as natural logarithm of X except that scale factor
is obtained as (10 - shiftcount) [Refer Natural Logarithm].
The output (lnX) is multiplied with the exponent Y. The
resulting product is in 4Q28 format. The antilog of this product
gives the desired output.
The antilog of the input value X can be calculated by using the
following approximation series.
2
AntiLn ( x ) = 1.0000 + 1.0001x + 0.4990x + 0.1705x
4
+ 0.0348x + 0.0139x
5
3
[4.156]
The coefficients of polynomial are stored in 3Q29 format.
The 32 bit input is in 4Q28 format. The output is in 12Q20
format. Implementation is same as natural antilog of function.
[Refer Natural Antilog].
User’s Manual
4-356
V 1.2, 2000-01
Function Descriptions
XpowY_32
X Power Y (cont’d)
Example
Trilib\Example\Tasking\Mathematical\expXpowY_32.c,
expXpowY_32.cpp
Trilib\Example\GreenHills\Mathematical
\expXpowY_32.cpp, expXpowY_32.c
Trilib\Example\GNU\Mathematical\expXpowY_32.c
Cycle Count
When X is a power of 2 and XY in the range [e-1, e)
38+2
When X is a power of 2 and XY not in the range [e-1, e)
42 + 2 × scale + 1 + 2 for scale = 1
scale factor for antiln(YlnX)
42 + 2 × scale + 2 + 2 otherwise
scale factor for antiln(YlnX)
When X is not a power of 2 and XY in the range [e-1, e)
47+2
When X is not a power of 2 and XY not in the range [e-1, e)
51 + 2 × scale + 1 + 2 for scale = 1
scale factor for antiln(YlnX)
51 + 2 × scale + 2 + 2 otherwise
scale factor for antiln(YlnX)
Code Size
190 bytes
48 bytes (Data)
4.12.2
Random Number Generation
Randomness is typically associated with unpredictability. Mathematics provides a
precise definition of randomness that is then applied here to evaluate random number
vector. Random numbers within the context of the function Rand_16 refers to "a
sequence of independent numbers with a specified distribution and a specified
probability of falling in any given range of values".
User’s Manual
4-357
V 1.2, 2000-01
Function Descriptions
Here Random Number Generator is implemented using Linear Congruential Method
(L.C.M). RNG using linear congruential method is also called pseudo RNG because
they require a seed and produce a deterministic sequence of numbers. Algorithm used
here is called L.C.M introduced by D. Lehmen in 1951.
Linear Congruential Method
This method produces a sequence of integers X1, X2, X3,... between zero and M-1
according to the following recursive relationship
X i + 1 = ( aX i + c )modM i = 0,1,2,...
[4.157]
where,
Xi
:
the initial value, called the seed
a
:
constant multiplier (RNDMULT)
c
:
increment (RNDINC)
M
:
modulus
Apart from LCM many Random Number Generators exist, but this method is arguably
the fastest for a 16-bit value. If a 32-bit value is needed, the code can be modified by
performing a 32-bit multiply and using 32-bit constants (RNDMULT, RNDINC). This
method, however, does have one major disadvantage. It is very sensitive to the values
of RNDMULT and RNDINC.
Much research has been done to identify the optimal choices of these constants to avoid
degeneration. The constants used in the subroutine below were chosen based on this
research.
M: The modulus value. This routine returns a random number from 0 to 65536 (64K) and
is not internally bounded. If the user needs a min/max limit, this must be coded externally
to this routine.
RNDSEED: An arbitrary constant, can be chosen to be any value representable by the
(0-64K) word. If zero is chosen, RNDINC should be some larger value than one.
Otherwise, the first two values will be zero and one. This is ok if the generator is given
three cycles to warm up. To change the set of random numbers generated by this
routine, change the RNDSEED value. RNDSEED=21845 is used in this routine because
it is 65536/3.
RNDMULT: Should be chosen such that the last three digits are even-2-1 (such as
xx821, x421, etc). RNDMULT=31821 is used in this routine.
User’s Manual
4-358
V 1.2, 2000-01
Function Descriptions
RNDINC: In general, this constant can be any prime number related to M (or 64K in this
case).Two values were actually tested, 1 and 13849. Research shows that RNDINC (the
increment value) should be chosen by the following formula
RNDINC = ( ( 1 ⁄ 2 – ( 1 ⁄ 6 × SQRT ( 3 ) ) ) × M )
[4.158]
Using M=65536, RNDINC=13849. (as indicated above.)
RNDINC=13849 is used in this routine.
Because PRNG’s employ a mathematical algorithm for number generation, all PRNG’s
possess the following properties:
• A seed value is required to initialize the equation
• The sequence will cycle after a particular period
4.12.2.1 Description
The following Random Number Generation functions are described.
• Random Number Initialization
• Random Number Generator
User’s Manual
4-359
V 1.2, 2000-01
Function Descriptions
RandInit_16
Random Number Initialization
Signature
void RandInit_16(void);
Inputs
None
Output
None
Return
None
Description
RandInit_16 function initializes the value of seed stored in
global memory location for 16-bit random number generation
routine.
Pseudo code
None
Techniques
None
Assumptions
None
Memory Note
aRndSeed
RandSeed
Declared as Global
Figure 4-86 RandInit_16
Implementation
RndSeed, the seed for Random Vector Generator is initialized
from global memory. Assembler directive .space is used to
reserve a block of memory. The seed value is stored in this
memory. This memory is declared as global so that seed
value can be accessed while generating random vector.
Example
Trilib\Example\Tasking\Mathematical\expRandInit_16.c,
expRandInit_16.cpp
Trilib\Example\GreenHills\Mathematical
\expRandInit_16.cpp, expRandInit_16.c
Trilib\Example\GNU\Mathematical\expRandInit_16.c
Cycle Count
2+2
Code Size
14 bytes
User’s Manual
4-360
V 1.2, 2000-01
Function Descriptions
Rand_16
Random Number Generator
Signature
void Rand_16(int nX,
int *R
);
Inputs
nX
:
Size of output vector
R
:
Pointer to output vector
Output
R[nX]
:
Output vector
Return
None
Description
Rand_16 function computes vector of 16 bit random numbers.
Seed value is initialized by RandInit_16 function. This function
uses 16 bit predefined RandMul, RandInc values to calculate
output vector of given size. After calculation of random vector
the seed in memory is updated. So if this function is called
again, will use this new seed value and vector generated will
be different from the original one.
Pseudo code
{
int i;
for (i=0;i<max;i++)
{
rndvec[i] = (rndseed*rndmul+rndinc)%modulus;
//Rndvec=16-bit random number
//RndSeed=Seed value=21845,Userdefined constant
//RndMul=Multiplier=31821,Userdefined constant
//RndInc=Increment=13849, Userdefined constant
//Modulus=65536,Userdefined constant
}
rndseed = rndvec[i];
}
Techniques
• Instruction ordering for zero overhead Load/Store
Assumptions
• Uses seed value from the memory location which can be
initialized by Rand initialization routine
User’s Manual
4-361
V 1.2, 2000-01
Function Descriptions
Rand_16
Random Number Generator (cont’d)
Memory Note
aRndSeed
RandSeed
Initialized in Rndinit
Figure 4-87 Rand_16
Implementation
Random vector generation uses
Randvec = ( RndSeed × RndMul + RndInc )Modulus [4.159]
RndSeed is initialized by routine RandInit_16, rest other
constant values are stored immediate to data registers.
viz.,RndMul, RndInc, Modulus.
Rndseed stored in global memory is accessed as external
variable and Random Vector is calculated as per above
equation.
Example
Trilib\Example\Tasking\Mathematical\expRand_16.c,
expRand_16.cpp
Trilib\Example\GreenHills\Mathematical\expRand_16.cpp,
expRand_16.c
Trilib\Example\GNU\Mathematical\expRand_16.c
Cycle Count
With DSP
Extensions
4 + nX × ( 8 ) + 1 + 2
Without DSP
Extensions
4 + nX × ( 8 ) + 1 + 2
Code Size
User’s Manual
38 bytes
4-362
V 1.2, 2000-01
Function Descriptions
4.13
Matrix Operations
A matrix is a rectangular array of numbers (or functions) enclosed in brackets. These
numbers (or functions) are called entries or elements of the matrix.The number of entries
in the matrix is product of number of rows and columns. An m × n matrix means matrix
with m rows and n columns. In the double-subscript notation for the entries, the first
subscript always denotes the row and the second the column.
4.13.1
Descriptions
The following Matrix Operations are described.
•
•
•
•
Addition
Subtraction
Multiplication
Transpose
User’s Manual
4-363
V 1.2, 2000-01
Function Descriptions
MatAdd_16
Addition
Signature
void MatAdd_16 (short
short
short
int
int
);
Inputs
X
Y
R
nRow
nCol
:
:
:
:
:
Pointer to first matrix
Pointer to second matrix
Pointer to output matrix
Number of rows
Number of columns
Output
R
:
Pointer to output matrix which is
the sum of the matrices X and Y
Return
None
Description
X[ ] [MAXCOL],
Y[ ] [MAXCOL],
R[ ] [MAXCOL],
nRow,
nCol
This function performs the addition of two matrices. It takes
pointers to the two matrices, pointer to the output matrix, size
of row and size of column as input. The entries in the matrices
are 16 bit values. The output matrix is stored starting from the
address which is sent as input.
Pseudo code
{
short *R;
//Ptr to a two dimensional output array of nRow
//rows and nCol columns
int Tmp;
Tmp = nRow * nCol; //number of elements
loopCnt = Tmp/4
//4 additions performed per loop
for(i=0;i<loopCnt;i+=4)
{
*(R+i) = *(X+i) + *(Y+i);
*(R+i+1) = *(X+i+1) + *(Y+i+1);
*(R+i+2) = *(X+i+2) + *(Y+i+2);
*(R+i+3) = *(X+i+3) + *(Y+i+3);
}
}
Techniques
User’s Manual
•
•
•
•
Loop Unrolling, 4 additions/loop
Use of packed data Load/Store
Use of packed addition with saturation
Instruction ordering provided for zero overhead Load/Store
4-364
V 1.2, 2000-01
Function Descriptions
MatAdd_16
Addition (cont’d)
Assumptions
• nRow = 2*m, m = 1,2,3...
• nCol = 2*n, n = 1,2,3...
Memory Note
Input-Buffer-1
aX
X[0][0]
X[0][1]
Input-Buffer-2
Y[0][0]
+
+
aY
Y[0][1]
.
.
X[0][nCol-1]
X[1][0]
Packed
add
Y[0][nCol-1]
Y[1][0]
X[1][1]
Y[1][1]
.
.
X[nRow-1][nCol-1]
Y[nRow-1][nCol-1]
short
short
Output-Buffer
aR
R[0][0]
R[0][1]
.
R[0][nCol-1]
R[1][0]
R[1][1]
Alignment of Input &
Output Buffers
IntMem - halfword aligned
ExtMem - word aligned
.
R[nRow-1][nCol-1]
short
Figure 4-88 MatAdd_16
User’s Manual
4-365
V 1.2, 2000-01
Function Descriptions
MatAdd_16
Addition (cont’d)
Implementation
The inputs to the function are three pointers (one each to each
of the input matrices to be added and one to the output matrix)
and the number of rows and number of columns. Both number
of rows and number of columns are multiple of two. Hence the
number of elements could be 4,8,12,.... This fact is made use
of in implementing the matrix addition in an optimal manner.
Addition is performed in a loop. Using TriCore’s load
doubleword instruction, four elements of each matrix are
loaded in two data register pairs. Using packed arithmetic on
halfwords, two of the 16 bit entries can be added in one cycle.
Hence, by using two packed add instructions per loop, the
loop count is brought down by a factor of four. The loop is
executed (nRow * nCol)/4 times.
Example
Trilib\Example\Tasking\Matrix\expMatAdd_16.c,
expMatAdd_16.cpp
Trilib\Example\GreenHills\Matrix\expMatAdd_16.cpp,
expMatAdd_16.c
Trilib\Example\GNU\Matrix\expMatAdd_16.c
Cycle Count
Pre-loop
:
Loop
:
Post-loop
:
Code Size
User’s Manual
5
3 × nRow × nCol
------------------------------------------- + 2
4
0+2
52 bytes
4-366
V 1.2, 2000-01
Function Descriptions
MatSub_16
Subtract
Signature
void MatSub_16(short
short
short
int
int
);
Inputs
X
Y
R
nRow
nCol
:
:
:
:
:
Pointer to first matrix
Pointer to second matrix
Pointer to output matrix
Number of rows
Number of columns
Output
R
:
Pointer to output matrix which is
the subtraction of the matrices X
and Y
Return
Description
X[ ] [MAXCOL],
Y[ ] [MAXCOL],
R[ ] [MAXCOL],
nRow,
nCol
None
This function performs the subtraction of two matrices. It takes
pointers to the two matrices, pointer to the output matrix, size
of row and size of column as input. The entries in the matrices
are 16 bit values. The output matrix is stored starting from the
address which is sent as input.
Pseudo code
{
short *R;
//Ptr to a two dimensional output array of nRow
//rows and nCol columns
int Tmp;
Tmp = nRow * nCol; //number of elements
loopCnt = Tmp/4
//4 subtractions performed per loop
for(i=0;i<loopCnt;i+=4)
{
*(R+i) = *(X+i) - *(Y+i);
*(R+i+1) = *(X+i+1) - *(Y+i+1);
*(R+i+2) = *(X+i+2) - *(Y+i+2);
*(R+i+3) = *(X+i+3) - *(Y+i+3);
}
}
User’s Manual
4-367
V 1.2, 2000-01
Function Descriptions
MatSub_16
Subtract (cont’d)
Techniques
•
•
•
•
Assumptions
• nRow = 2*m, m = 1,2,3...
• nCol = 2*n, n = 1,2,3...
User’s Manual
Loop Unrolling, 4 subtractions/loop
Use of packed data Load/Store
Use of packed subtraction with saturation
Instruction ordering provided for zero overhead Load/Store
4-368
V 1.2, 2000-01
Function Descriptions
MatSub_16
Subtract (cont’d)
Memory Note
Input-Buffer-1
aX
X[0][0]
X[0][1]
Input-Buffer-2
Y[0][0]
-
aY
Y[0][1]
.
.
X[0][nCol-1]
X[1][0]
Packed
sub
Y[0][nCol-1]
Y[1][0]
X[1][1]
Y[1][1]
.
.
X[nRow-1][nCol-1]
Y[nRow-1][nCol-1]
short
short
Output-Buffer
aR
R[0][0]
R[0][1]
.
R[1][0]
Alignment of Input &
Output Buffers
IntMem - halfword aligned
R[1][1]
ExtMem - word aligned
R[0][nCol-1]
.
R[nRow-1][nCol-1]
short
Figure 4-89 MatSub_16
User’s Manual
4-369
V 1.2, 2000-01
Function Descriptions
MatSub_16
Subtract (cont’d)
Implementation
The inputs to the function are three pointers (one each to each
of the input matrices to be subtracted and one to the output
matrix) and the number of rows and number of columns. Both
number of rows and number of columns are multiple of two.
Hence the number of elements could be 4, 8, 12,.... This fact
is made use of in implementing the matrix subtraction in an
optimal manner. Subtraction is performed in a loop. Using
TriCore’s load doubleword instruction, four elements of each
matrix are loaded in two data register pairs. Using packed
arithmetic on halfwords, two of the 16 bit entries can be
subtracted in one cycle. Hence by using two packed subtract
instructions per loop, the loop count is brought down by a
factor of four. The loop is executed (nRow * nCol)/4 times.
Example
Trilib\Example\Tasking\Matrix\expMatSub_16.c,
expMatSub_16.cpp
Trilib\Example\GreenHills\Matrix\expMatSub_16.cpp,
expMatSub_16.c
Trilib\Example\GNU\Matrix\expMatSub_16.c
Cycle Count
Pre-loop
:
Loop
:
Post-loop
:
Code Size
User’s Manual
5
3 × nRow × nCol
------------------------------------------- + 2
4
0+2
52 bytes
4-370
V 1.2, 2000-01
Function Descriptions
MatMult_16
Multiplication
Signature
DataS MatMult_16(DataS X[] [MaxCol],
DataS Y[] [MaxCol],
DataS R[] [MaxCol],
int
nRowX,
int
nColX,
int
nColY
);
Inputs
X
Y
R
nRowX
nColX
nColY
:
:
:
:
:
:
Pointer to first matrix
Pointer to second matrix
Pointer to output matrix
Number of rows of first matrix
Number of columns of first matrix
Number of columns of second
matrix
Output
R
:
Pointer to output matrix which is
the multiplication of the matrices X
and Y
Return
None
Description
User’s Manual
The multiplication of two matrices X and Y is done. Both the
input matrices and output matrix are 16-bit. All the matrices are
halfword aligned. All the element of the matrix are stored rowby-row in the buffer.
4-371
V 1.2, 2000-01
Function Descriptions
MatMult_16
Multiplication (cont’d)
Pseudo code
{
int nRowX;
int nColX;
int nColY;
frac16 R;
frac32 acc;
//Number
//Number
//Number
//Result
of
of
of
of
rows of first matrix
columns of first matrix
columns of second matrix
matrix multiplication
for(i=0; i<nRowX; i++)
//Outer loop is executed nRow times
{
for(j=0; j<nColY; j=j+2)
//Middle loop is executed nColY/2 times
{
acc = 0;
for(k=0; k<nColX/2; k++)
//Inner loop is executed nColX/2 times
{
acc += (sat rnd) Y[i][j+1] (*) X[i][j] || Y[i][j] (*) X[i][j]
acc += (sat rnd) Y[i+1][j+1] (*) X[i][j+1] || Y[i+1][j] (*)
X[i][j+1]
}
R[i][j] = (frac16)accLo;
R[i][j+1] = (frac16)accHi;
}
}
}
Techniques
• Use of packed data Load/Store
• Use of packed MAC instruction
• Instruction ordering for zero overhead Load/Store
Assumptions
• nRowX = 2*l,
l = 1,2,3...
• nColX = nRowY = 2*m, m = 1,2,3...
• nColY = 2*n,
n = 1,2,3...
User’s Manual
4-372
V 1.2, 2000-01
Function Descriptions
MatMult_16
Multiplication (cont’d)
Memory Note
Input-Matrix-1
aX
Input-Matrix-2
X[0][0]
Y[0][0]
X[0][1]
Y[0][1]
.
.
X[0][nColX-1]
Y[0][nColY-1]
X[1][0]
Y[1][0]
X[1][1]
Y[1][1]
.
.
X[nRowX-1][nColX-1]
Y[nColX-1][nColY-1]
halfword
aligned
PACKED
MAC
aY
halfword
aligned
Output-Matrix
R[0][0]
aR
R[0][1]
.
R[0][nColY-1]
R[1][0]
R[1][1]
.
R[nRowX-1][nColY-1]
halfword
aligned
Figure 4-90 MatMult_16
User’s Manual
4-373
V 1.2, 2000-01
Function Descriptions
MatMult_16
Multiplication (cont’d)
Implementation
The pointer to both the input matrices (X and Y), pointer to
output matrix (R), number of rows of X (nRowX), number of
columns of X (nColX) and number of columns of Y (nColY) are
sent as arguments.
The implementation uses three loops:
The outer loop is executed nRowX times. The middle loop is
executed nColY/2 times and the inner loop is executed nColX/
2 times.
In the outer loop, the pointer is initialized to first element of X
(X[0][0]). For every next iteration of loop it is updated to point
to next row (X[i+1][0]). Thus this loop is executed nRowX
times.
In the middle loop, the pointer to X is always initialized to point
to the row of X selected by outer loop. The pointer to Y is
initialized to first element of Y (Y[0][0]). For every next iteration
of loop it is updated to point to next to next column of Y
(Y[i][j+2]). Since the two columns are considered in one pass
of inner loop, this loop is executed nColY/2 times.
In the inner loop two values of X and two values of Y are
loaded using load word instruction. Two packed MAC
instructions are used in this loop.
First packed MAC uses X[i][j] and following operation is
performed.
acc = acc + Y [ i ] [ j + 1 ] ⋅ X [ i ] [ j ] || Y [ i ] [ j ] ⋅ X [ i ] [ j ]
[4.160]
Second packed MAC uses X[i][j+1] and following operation is
performed.
acc = acc + Y [ i + 1 ] [ j + 1 ] ⋅ X [ i ] [ j + 1 ] || Y [ i + i ] [ j ]
X [ i ][ j + 1 ]
[4.161]
As two values from the selected row of X are used in each
pass, this loop is executed nColX/2 times.
User’s Manual
4-374
V 1.2, 2000-01
Function Descriptions
MatMult_16
Multiplication (cont’d)
Example
Trilib\Example\Tasking\Matrix\expMatMult_16.c,
expMatMult_16.cpp
Trilib\Example\GreenHills\Matrix\expMatMult_16.cpp,
expMatMult_16.c
Trilib\Example\GNU\Matrix\expMatMult_16.c
Cycle Count
 nColY

nColX
8 + nRowX  ----------------- 6 + ----------------- ( 6 ) + 2 ( or1 ) + 1 + 4  + 1
2
2


Code Size
100 bytes
User’s Manual
4-375
V 1.2, 2000-01
Function Descriptions
MatTrans_16
Transpose
Signature
void MatTrans_16(short
short
int
int
);
X[ ] [MAXCOL],
R[ ] [MAXROW],
nRow,
nCol
Inputs
X
R
nRow
nCol
:
:
:
:
Pointer to input matrix
Pointer to output matrix
Number of rows
Number of columns
Output
R
:
Pointer to output matrix which is
the transpose of the matrix X
Return
Description
None
This function performs transpose of the given matrix. It takes
pointers to input and output matrix, size of row and size of
column as input. The entries in the matrix are 16 bit values.
The output matrix is stored from the address which is sent as
input.
Pseudo code
{
int i,j;
for(i=0;i<nCol;i++)//Column loop
{ K = 0;
for(j=0;j<nRow/2;j++)
//Row loop
{
R[i][k] = X[k][i];
//Two elements of input matrix are read
//and stored
R[i][k+1] = X[k+1][i];
k = K+2;
}
}
}
Techniques
• Use of packed data Load/Store
• Instruction ordering provided for zero overhead Load/Store
Assumptions
• nRow = 2*m, m = 1,2,3...
• nCol = 2*n, n = 1,2,3...
User’s Manual
4-376
V 1.2, 2000-01
Function Descriptions
MatTrans_16
Transpose (cont’d)
Memory Note
Input-Buffer
aX
Output-Buffer
X[0][0]
R[0][0]
X[0][1]
R[0][1]
.
.
X[0][nCol-1]
R[0][nCol-1]
X[1][0]
R[1][0]
X[1][1]
R[1][1]
.
aR
.
X[nRow-1][nCol-1]
R[nRow-1][nCol-1]
short
short
Figure 4-91 MatTrans_16
Implementation
The inputs to the function are two pointers to the matrices
(input matrix and output matrix respectively), number of rows
and number of columns. Both number of rows and number of
columns are multiple of 2. The outer loop is executed number
of column times. The inner loop is executed nRow/2 times. In
the row loop two input elements from first column are read and
packed. Using TriCore’s store word instruction, it is stored in
first row of output matrix. The inner loop is executed for the
first column. Then pointer is made to point to second element
in the first row. Then inner loop is executed for second
column. Thus outer loop is executed number of column times
and transpose is obtained.
Example
Trilib\Example\Tasking\Matrix\expMatTrans_16.c,
expMatTrans_16.cpp
Trilib\Example\GreenHills\Matrix\expMatTrans_16.cpp,
expMatTrans_16.c
Trilib\Example\GNU\Matrix\expMatTrans_16.c
User’s Manual
4-377
V 1.2, 2000-01
Function Descriptions
MatTrans_16
Transpose (cont’d)
Cycle Count
For all
X[nRow][nCol]
:
nRow
3 +  --------------- × 5 + 2 + 5 × nCol
 2 
+2+2
Code Size
User’s Manual
52 bytes
4-378
V 1.2, 2000-01
Function Descriptions
4.14
Statistical Functions
4.14.1
Descriptions
The following Statistical functions are described.
• Autocorrelation
• Convolution
• Mean Value
Autocorrelation
Correlation determines the degree of similarity between two signals. If two signals are
identical their correlation coefficient is 1, and if they are completely different it is 0. If the
phase shift between them is 180 and otherwise they are identical, then correlation
coefficient is -1.
There are two types of correlation Cross Correlation and Autocorrelation.
When two independent signals are compared, the procedure is cross correlation. When
the same signal is compared to phase shifted copies of itself, the procedure is
autocorrelation. Autocorrelation is used to extract the fundamental frequency of a signal.
The distance between correlation peaks is the fundamental period of the signal. Discrete
correlation is simply a vector dot product.
N
R(j) =
∑ x(i) × y(i + j )
[4.162]
i=0
where,
N = nX - j -1 (j = 0, 1,...,nR-1),
nX = Size of input vector
nR = Desired number of outputs. It can take values from 1 to nX
Autocorrelation is given by
N
R(j) =
∑ x(i) × x(i + j )
(j = 0, 1,...,nR-1)
[4.163]
i=0
i is the index of the array, j is the lag value, as it indicates the shift/lag considered for the
R(j) autocorrelation. N is the correlation length and it determines how much data is used
for each correlation result. When R(j) is calculated for a number of j values, it is referred
to as autocorrelation function.
User’s Manual
4-379
V 1.2, 2000-01
Function Descriptions
Convolution
Discrete convolution is a process, whose input is two sequences, that provide a single
output sequence.
Convolution of two time domain sequences results in a time domain sequence. Same
thing applies to frequency domain.
Both the input sequences should be in the same domain but the length of the two input
sequences need not be the same.
Convolution of two sequences X(k) and H(k) of length nX and nH respectively can be
given mathematically as
nX + nH – 2
R(n) =
∑
H( k ) ⋅ X( n – k)
[4.164]
k=0
The resulting output sequence R(n) is of length nX+nH-1.
The convolution in time domain is multiplication in frequency domain and vice versa.
User’s Manual
4-380
V 1.2, 2000-01
Function Descriptions
ACorr_16
Autocorrelation
Signature
void ACorr_16( DataS
DataL
int
int
);
Inputs
X
R
:
:
nX
nR
:
:
Pointer to Input-Vector
Pointer to Output-Vector containing
the first nR elements of the positive
side of the autocorrelation function
of the vector X
Size of vector X
Size of vector R
Output
R
:
Output-Vector
Return
None
Description
User’s Manual
*X,
*R,
nX,
nR
The function performs the positive side of the autocorrelation
function of real vector X. The arguments to the function are
pointer to the input vector, pointer to output buffer to store
autocorrelation result, size of input buffer (only even) and
number of auto correlated outputs desired. The input values
are in 16 bit fractional format and output values are in 32 bit
fractional format. The implementation is optimal and works if
size of output buffer is even/odd.
4-381
V 1.2, 2000-01
Function Descriptions
ACorr_16
Autocorrelation (cont’d)
Pseudo code
{
frac16 *X1;
frac16 *X2;
frac64 acc;
int dCnt;
//Macro
macro ACorr;
//Ptr to input vector
//Ptr to input vector + LagCount
//Autocorrelation result
//Correlation loop count
{
int aCorlen;
//Correlation loop count
aCorlen = dCnt; //Correlation loop count for current autocorrelation
//output
for(i=0; i<aCorlen; i++)
{
acc = acc + *(X1++) * *(X2++) + *(X1++) * *(X2++);
//acc = acc + X(0) * X(0+aLagCnt) + X(1) *
//X(1+aLagCnt)(even correlation length) (or)
//acc = acc + X(1) * X(1+aLagCnt) + X(2) * X(2+aLagCnt)
//(odd correlation length)
}
}
ACorr_16:
{
int lflag = 0;
int aLagCnt = 0;//First autocorrelation output is with zero lag
int dCnt = nX/2;
X1 = X;
//Initialize first Ptr to start of input vector
if (nR%2 != 0)
{
nR++;
lflag = 1;
//lflag = 1 if nR is odd
}
//If desired no. of output is 1 or 2 skip ACorr_OutDataL
if (nR == 2)
go to ACorr_R_1or2;
//ACorr_OutDataL
for (i=0; i<nR/2-1; i++)
{
acc = 0;
//Clear accumulator
X2 = X + aLagCnt;
//Second Ptr initialized to first Ptr plus an offset
User’s Manual
4-382
V 1.2, 2000-01
Function Descriptions
ACorr_16
Autocorrelation (cont’d)
//of aLagCnt
ACorr;
//Autocorrelation computation
*R++ = (frac32_sat) acc;
//Autocorrelation result converted to 32 bits with
//saturation and stored to output buffer
acc = 0;
//Clear accumulator
aLagCnt = aLagCnt + 2;
//Lag count is incremented for the next correlation
X1 = X;
//Initialize first Ptr to start of input vector
X2 = X2 + alagCnt;
//Second Ptr initialized to first Ptr plus an offset
//of aLagCnt
//Autocorrelation computation
dCnt--;
acc = acc + *(X1++) * *(X2++);
//acc = acc + X(0) * X(0+aLagCnt)
ACorr;
X1 = X;
//Initialize first Ptr to start of input vector
aLagCnt = aLagCnt + 1;
//Lag cnt incremented for next autocorrelation
//computation
}
//Last two results (if nR is even) or last one result (if nR is
//odd) is calculated outside the loop
ACorr_R_1or2:
acc = 0;
//Clear accumulator
X2 = X + aLagCnt;
ACorr;
*R++ = (frac32_sat)acc;
if (lflag == 1) //Jump to ACorr_16_Ret if lflag = 1
go to ACorr_Ret;
else
acc = 0;
//Clear accumulator
X1 = X;
//Initialize first Ptr to start of input vector
X2 = X2 + aLagCnt;
acc = acc + *(X1++) * *(X2++);
//If nR = nX, jump to ACorr_Rlast
if (dCnt = 0)
go to ACorr_Rlast;
else
User’s Manual
4-383
V 1.2, 2000-01
Function Descriptions
ACorr_16
Autocorrelation (cont’d)
{
dCnt--;
ACorr;
}
ACorr_Rlast:
(*R++)(frac32_sat)acc;
ACorr_Ret:
}
}
Techniques
• Loop unrolling is done so that implementation is efficient for
both even and odd number of desired outputs. Last two
outputs (for nR even) or last one output (for nR odd) is
computed outside the loop
• A macro ACorr is used to calculate each autocorrelation
output. The macro uses packed load and dual MAC to
reduce the number of cycles for a given correlation length
• One pass through the loop calculates two outputs, i.e.,
there are two calls to the macro
• For odd correlation length one multiplication is performed
before calling the macro
• Implementation is optimal for both even and odd values of
nR
• Intermediate result stored in 64 bit register (16 guard bits)
• Instruction ordering for zero overhead Load/Store
Assumptions
• Input is in 1Q15 format
• Output is in 1Q31 format
User’s Manual
4-384
V 1.2, 2000-01
Function Descriptions
ACorr_16
Autocorrelation (cont’d)
Memory Note
Input-Vector
aX1
aX2
aX2 = aX1 + lag count
Output-Vector
aX
X(0)
R(0)
aR
X(1)
R(1)
X(2)
R(2)
.
.
X(n-1)
.
X(n)
.
X(n+1)
.
.
R(nR-1)
1Q15
1Q31
halfword
aligned
halfword
Dual MAC
aligned
(odd
Corr.len)
Dual MAC
(even
Corr.len)
MAC (odd
Corr.len)
Figure 4-92 ACorr_16
User’s Manual
4-385
V 1.2, 2000-01
Function Descriptions
ACorr_16
Autocorrelation (cont’d)
Implementation
Correlation is similar to FIR filtering without the time reversal
of the second input variable. In autocorrelation, the signal is
multiplied with phase shifted copies of itself. The
implementation begins with zero lag, i.e., the value at each
instant is squared and added to produce the first
autocorrelation output.
The lag value is incremented by one for each next output.
Hence, in autocorrelation computation the number of
multiplication (correlation length) needed for each R(i)
decreases as i increases from 1 to nR-1. Since the given
assumption is that the number of input is always even,
correlation length is even for all R(j) where j = 0, 2, 4,....,nR-2
and it is odd when j = 1, 3, 5,...,nR-1.
For each autocorrelation output computation, two pointers to
input buffer aX1, aX2 are initialized such that aX1 points to
beginning of input vector and the difference between them is
equal to the lag value for that output, i.e., aX2 = aX1+lag
count.
A macro ACorr is used to calculate each autocorrelation
output. The macro uses packed load and dual MAC to reduce
the number of cycles for a given correlation length. This brings
down the loop count for each autocorrelation by a factor of 2.
For all R(i), i = 0, 2, 4,...., the call to ACorr will directly give the
autocorrelation result in a 64 bit register which is then
converted with saturation to 1Q31 format and stored to output
buffer. In case of R(i) with i = 1, 3, 5,..., the correlation length
is odd. Hence, one MAC is performed before calling the ACorr
macro. This makes the implementation optimal for all R(i).
The loop in the ACorr_16 function runs (nR/2-1) times. During
each pass through the loop two outputs are calculated and
written to output buffer (there are two calls to ACorr). The
implementation works for both odd and even values of nR,
i.e., nR = 1, 2,...,nX.
User’s Manual
4-386
V 1.2, 2000-01
Function Descriptions
ACorr_16
Autocorrelation (cont’d)
Example
Trilib\Example\Tasking\Statistical\expACorr_16.c,
expACorr_16.cpp
Trilib\Example\GreenHills\Statistical\expACorr_16.cpp,
expACorr_16.c
Trilib\Example\GNU\Statistical\expACorr_16.c
Cycle Count
For Macro ACorr
Mcall ( 1 ) = 1 + nX + 2
Mcall ( i ) = 1 + 2 × ( ( nX ) ⁄ 2 – ( i – imod2 ) ⁄ 2 ) + 2
i = 2, 3,...,nX-2
Mcall ( i ) = 1 + 2 × ( ( nX ) ⁄ 2 – ( i – imod2 ) ⁄ 1 ) + 2
i = 2, 3,...,nX-2
Mcall ( i ) = 1 + 2 × ( ( nX ) ⁄ 2 – ( i – imod2 ) ⁄ 1 ) + 2
i = nX-1
where Mcall(i) refers to the ith call to the macro
For ACorr_16
a) When nR = any Even value less than nX and greater than 2
Pre-loop
:
9
Loop
:
19 × ( nR ⁄ 2 – 1 ) + Mcall ( 1 ) + …
+ Mcall ( nR – 2 )
Post-loop
:
2 + 2 + Mcall ( nR – 1 ) + 14 +
Mcall ( nR ) + 6 + 2
Example
:
When nX = 54, nR = 4
:
Cycle Count = 274 cycles
b) When nR = any Odd value less than nX and greater than 1
User’s Manual
Pre-loop
:
9
Loop
:
19 × ( ( nR + 1 ) ⁄ 2 – 1 ) ) + Mcall ( 1 ) +
… + Mcall ( nR – 1 )
Post-loop
:
2 + 2 + Mcall ( nR ) + 9 + 2
4-387
V 1.2, 2000-01
Function Descriptions
ACorr_16
Autocorrelation (cont’d)
Example
:
When nX = 54, nR = 5
:
Cycle Count = 335 cycles
Pre-loop
:
9
Loop
:
19 × ( nR ⁄ 2 – 1 ) + Mcall ( 1 ) + …
+ Mcall ( nX – 2 )
Post-loop
:
2 + 2 + Mcall ( nX – 1 ) + 17 + 2
Example
:
When nR = nX = 54
:
Cycle Count = 2141 cycles
c) When nR = nX
d) When nR = 1
The OutData loop is bypassed
Cycle Count
:
13 + Mcall ( 1 ) + 9 + 2
Example
:
When nX = 54, nR = 1
:
Cycle Count = 79 cycles
e) When nR = 2
The OutData loop is bypassed
Code Size
User’s Manual
Cycle Count
:
13 + Mcall ( 1 ) + 14 + Mcall ( 2 )
+6+2
Example
:
When nX = 54, nR = 2
:
Cycle Count = 145 cycles
268 bytes
4-388
V 1.2, 2000-01
Function Descriptions
Conv_16
Convolution
Signature
void Conv_16(DataS
DataS
DataL
int
int
);
Inputs
X
H
R
nH
nR
:
:
:
:
:
Pointer to First Input-Vector
Pointer to Second Input-Vector
Pointer to Output-Vector
Size of Second Input-Vector
Size of Output-Vector
Output
R(nR)
:
Output-Vector
Return
None
Description
User’s Manual
*X,
*H,
*R,
nR,
nH
The convolution of two sequences X and Y is done. The input
vectors are 16-bit and returned output is 32-bit. All the vectors
are halfword aligned. The length of input vectors is even.
Therefore for full convolution length output vector length is
always odd.
4-389
V 1.2, 2000-01
Function Descriptions
Conv_16
Convolution (cont’d)
Pseudo code
{
frac16 *X;
frac16 *H;
frac64 acc;
int dCnt;
//Ptr to First Input-Vector
//Ptr to Second Input-Vector
//Convolution result
//Convolution loop count
//Macro
macro Conv;
{
int aOvlpCnt;
//Convolution loop count
aOvlpCnt = dCnt;//Convolution loop count for current convolution
//output
for(i=0; i<aOvlpCnt; i++)
{
acc = acc + (*(X-K)) (*) H(K) + (*(X-K-1)) (*) H(K+1)
//acc += X(n) * H(0) + X(n-1) * H(i)
K = K + 2;
}
Conv_16:
{
int anHCnt;
int anX_nHCnt;
int anR_nXCnt;
int dCnt = 1;
int nX_1;
dnHCnt = nH/2 - 1;
anHCnt = dnHCnt;
X1 = X;
//Store Ptr to First Input-Vector
H1 = H;
//Store Ptr to Second Input-Vector
*R++ = X[0].H[0]
acc = 0.0;
Conv;
//Convolution computation
*R++ = (frac32 sat)acc;
//Result stored
X1 = X1 + 2;
X = X1;
H = H1;
User’s Manual
4-390
V 1.2, 2000-01
Function Descriptions
Conv_16
Convolution (cont’d)
if (nR == 3)
go to Conv_R_3;
for (i=0; i<anHCnt; i++)
{
acc = 0.0;
acc = X[n] (*) H[0];
Conv;
//Convolution computation
*R++ = (frac16 sat)acc;
//Result stored
dCnt++;
X = X1;
H = H1;
acc = 0.0;
Conv;
//Convolution computation
X1 = X1 + 2;
X = X1;
H = H1;
*R++ = (frac32 sat)acc;
}
nX_1 = nR - nH;
X1 = X1 - 1;
X = X1;
anR_nXCnt = dnHCnt;
if (nX == nH)
go to Conv_DCntr;
H = H1;
anX_nHCnt = nX - nH;
for (i=0; i<anX_nHCnt; i++)
{
X = X1;
acc = 0.0;
Conv;
//Convolution computation
X1 = X1 + 1;
H = H1;
*R++ = (frac32 sat)acc;
//Result stored
}
User’s Manual
4-391
V 1.2, 2000-01
Function Descriptions
Conv_16
Convolution (cont’d)
X = X1;
for (i=0; i<anR_nXCnt; i++)
{
dCnt--;
H1 = H1 + 1;
H = H1;
acc = 0.0;
acc = X(n) (*) H(0);
Conv;
//Convolution computation
*R++ = (frac32 sat)acc;
X1 = X1 - 1;
H1 = H1 + 1;
X = X1;
H = H1;
acc = 0.0;
Conv;
//Convolution computation
*R++ = (frac32 sat)acc;
X1 = X1 + 1;
X = X1;
}
Conv_R_3;
acc = 0.0;
acc = X(nX - 1) (*) H(nH - 1);
K++ = (frac32)acc;
return;
}
}
Techniques
• For optimization implementation is divided into three loops.
First loop where overlap count increases, second loop
overlap count remains same and third loop overlap count
decreases
• A macro Conv is used which calculates convolution output.
The macro uses packed load and dual MAC to reduce the
number of cycles for a given overlap count of two
sequences
• Use of dual MAC and MAC instructions
• Intermediate results stored in 64 bit register (16 guard bits)
• Instruction ordering for zero overhead Load/Store
Assumptions
• Inputs are in 1Q15 format, Output is in 1Q31 format
• nX and nH are even and hence nR is always odd
User’s Manual
4-392
V 1.2, 2000-01
Function Descriptions
Conv_16
Convolution (cont’d)
Memory Note
First Input-Vector
aX1
Second Input-Vector Output-Vector
X(0)
H(0)
aH
R(0)
aR
.
H(1)
R(1)
X(n-2)
H(2)
R(2)
X(n-1)
.
.
X(n)
.
.
.
.
.
.
.
.
X(nR-nH)
H(nH-1)
R(nR-1)
1Q15
1Q31
halfword
aligned
halfword
aligned
1Q15
halfword
aligned
MAC
(odd
overlap
count)
Dual MAC
(even
overlap
count)
Dual MAC
(odd
overlap
count)
Figure 4-93 Conv_16
User’s Manual
4-393
V 1.2, 2000-01
Function Descriptions
Conv_16
Convolution (cont’d)
Implementation
Convolution is same as FIR filtering. For convolution one of
the two sequences is inverted in time. To implement the
convolution, the two sequences are multiplied together and
the products are summed to compute the output sample. To
calculate next output sample time inverted signal is shifted by
one and process is repeated. If two sequences of length nX
and nH are convolved the convolution length is given by nR =
nX+nH-1.
The pointer to input vectors, output vector, the size of output
vector (nR) and size of the input sequence of smaller length
(nH) are sent as arguments. The size of the other input
sequence is calculated as (nR-nH+1).
Implementation uses macro Conv. The macro uses two load
word and one dual MAC instruction. Thus two multiplications
and one addition is performed per loop according to the
equation
acc = acc + X ( n ) ⋅ H ( 0 ) + X ( n – 1 ) ⋅ H ( 1 )
[4.165]
Thus loop count is always (overlap count/2-2) for even and
odd lengths of overlap count. For odd one more MAC is
performed before the macro is called.
The convolution is divided into three loops.
First loop: The first two convolution outputs are given as
R( 0 ) = X( 0 ) ⋅ H( 0 )
[4.166]
R( 1 ) = X( 1 ) ⋅ H( 0 ) + X( 0) ⋅ H( 1 )
[4.167]
The number of multiplication and additions required for
computation of R(i) increases as i is increased from 0 to nH1. The overlap count of the two input sequences is even for i
= 1, 3, 5,...,nH-1 and odd for i = 0, 2, 4,...,nH-2. Macro is called
for every R(n).
User’s Manual
4-394
V 1.2, 2000-01
Function Descriptions
Conv_16
Convolution (cont’d)
The first loop is unrolled and first two outputs are calculated
outside the loop. One pass through the first loop gives two
outputs. Thus loop count for first loop is (nH/2-2). This loop
gives first nH outputs.
Second loop: Here the overlap count is always constant and
is nH. Macro Conv is called for (nX-nH) times. This loop gives
next (nX-nH) outputs.
This loop is skipped if nX = nH.
Third loop: The overlap count decreases from (nH-1) to 1 as i
increases from (nX+1) to (nR-1). The loop is unrolled and last
output which needs only one multiplication is done outside the
loop. Thus loop count for this loop is (nH/2-2).
Example
Trilib\Example\Tasking\Statistical\expConv_16.c,
expConv_16.cpp
Trilib\Example\GreenHills\Statistical\expConv_16.cpp,
expConv_16.c
Trilib\Example\GNU\Statistical\expConv_16.c
Cycle Count
For i = 1 to nH-1
Mcall(1) and Mcall(2) = 1+2+1
Mcall ( i ) = 1 + 2 × ( i + 1 ) ⁄ 2 + 2
for i = 3, 5,...,(nH-1)
Mcall ( i ) = 1 + 2 × i ⁄ 2 + 2
for i = 4,...,(nH-2)
For i = nH to nX-1
Mcall ( i ) = 1 + 2 × nH ⁄ 2 + 2
for i = nH,nH+1,...,(nX-1)
For i = nX to nR-2
User’s Manual
4-395
V 1.2, 2000-01
Function Descriptions
Conv_16
Convolution (cont’d)
Mcall ( i ) ) = 1 + 2 × ( nH ⁄ 2 – ( i ⁄ 2 – ( nX ) ⁄ 2 + 1 ) ) + 2
for i = nX, nX+2,...,(nR-5)
Mcall ( i ) ) = 1 + 2 × ( nH ⁄ 2 – ( ( i – 1 ) ⁄ 2 – ( nX ) ⁄ 2 + 1 ) ) + 2
for i = nX+1, nX+3,...,(nR-4)
Mcall(nR-3) and Mcall(nR-2) = 1+2+1
For nX>nH
14+Mcall(1)
First loop
( nH ⁄ 2 – 1 ) [ 18 + Mcall ( 2 ) + Mcall ( 3 ) + … + Mcall ( nH – 1 ) ]
+8
For nH>4
( nH ⁄ 2 – 1 ) [ 18 + Mcall ( 2 ) + Mcall ( 3 ) + … + Mcall ( nH – 1 ) ]
+7
For nH = 4
Second loop
( nX – nH ) [ 8 + Mcall ( nH ) + Mcall ( nH + 1 ) + … +
Mcall ( nX – 1 ) ] + 3
Third loop
( nH ⁄ 2 – 1 ) [ 19 + Mcall ( nX ) + Mcall ( nX + 1 ) + … +
Mcall ( nR – 2 ) ] + 2
2+2
For nX = nH
Second loop is skipped and first loop will take 2 extra cycles
for jump
For nH = nX =2
16+Mcall(1)+4
Code Size
User’s Manual
420 bytes
4-396
V 1.2, 2000-01
Function Descriptions
Avg_16
Mean Value
Signature
DataS Avg_16(DataS *X,
int nX
);
Inputs
X
nX
Output
None
Return
R
Description
:
:
Pointer to Input-Buffer
Size of Input-Buffer
:
Mean value of the input values
This function calculates the mean of a given array of values. It
takes pointer to the array and size of the array as input. Input
range is [-1, 1). The return is the mean value represented using
32 bits.
Pseudo code
{
frac32
frac32
frac64
frac32
acc = 0;
one_nX;
Ra;
R;
//Sum of inputs
//1/no. Of inputs
for(i=0; i<nx; i++)
{
acc = acc + X[i];
//acc in 17Q15 format
}
one_nX = 1/nX;
//one_nX in 1Q31 format
Ra = acc (*) one_nX;
//Mean value in 17Q47 format
R = (frac32)Ra;
//32 bit result in 1Q31 format
}
Techniques
• 32 bit addition is used to provide 16 guard bits for addition
• Instruction ordering provided for zero overhead Load/Store
Assumptions
• Inputs are in the range [-1,1) and in 1Q15 format. Output is
also in 1Q15 format.
User’s Manual
4-397
V 1.2, 2000-01
Function Descriptions
Avg_16
Mean Value (cont’d)
Memory Note
aX
Input-Buffer
X(0)
X(1)
.
.
.
.
.
X(nX-1)
1Q15
Figure 4-94 Avg_16
Implementation
The function takes a short pointer to an array whose mean is
to be calculated and the size of the array as input. The return
value is the 32 bit mean value.
x ( 0 ) + x ( 1 ) + … + x ( nx – 1 )
mean = ----------------------------------------------------------------------nx
[4.168]
Load of inputs and addition are performed in a loop. The input
values are read into the lower 16 bits of a 32 bit register.
Hence 32 bit addition is performed on 17Q15 values thereby
providing 16 guard bits for addition. The reciprocal of the size
is calculated.
The product of the sum and the reciprocal gives the mean
value in 17Q47 format. This is converted to 1Q31 and
returned.
User’s Manual
4-398
V 1.2, 2000-01
Function Descriptions
Avg_16
Mean Value (cont’d)
Example
Trilib\Example\Tasking\Statistical\expAvg_16.c,
expAvg_16.cpp
Trilib\Example\GreenHills\Statistical\expAvg_16.cpp,
expAvg_16.c
Trilib\Example\GNU\Statistical\expAvg_16.c
Cycle Count
Pre-loop
:
3
Loop
:
nX + 2
Post-loop
:
27+2
Code Size
User’s Manual
54 bytes
4-399
V 1.2, 2000-01
Function Descriptions
User’s Manual
4-400
V 1.2, 2000-01
Applications
5
Applications
The following applications are described.
• Spectrum Analyzer
• Sweep Oscillator
• Equalizer
5.1
Spectrum Analyzer
To perform a spectral analysis of any signal spectrum analyzer is used. The spectrum
analyzer uses radix-2 FFT to get the frequency content of a signal. The FFT algorithm
takes N-data-samples x(n), n=0,1,...,N-1 of the input given and produces N-point
complex frequency samples X(K), K=0,1,...,N-1. The power spectrum is obtained by
squaring the scaled magnitude of complex frequency samples.
1
1
2
2
2
P ( K ) = ---- X ( K ) = ---- { Re [ X ( K ) ] + Im [ X ( K ) ] } K=0,1,...,N/2
N
N
[5.1]
The Power Spectrum Density (PSD) gives a measure of the distribution of the average
power of a signal over frequency.
The PSD can be actual or averaged. The actual PSD gives N/2 point output from N point
complex FFT output. The averaged PSD gives b band output where the number of bands
is user input.
A simple example showing functioning of Spectrum Analyzer.
The following are the diagrams where input given is a mixture of 4kHz and 12kHz sine
waves sampled at 32kHz. The FIR filter has a cutoff frequency of 8 kHz. So after filtering
the input to FFT contains only 4kHz wave. The power spectrum gives the corresponding
frequency. Here the number of FFT points taken is 512. The maximum frequency value
represented by the spectrum is 16K as sampling frequency is 32K. Since FFT is of 512
complex points it will result in a power spectrum of 256 points. Here 256th doppler bin
represents frequency of 16K. So the frequency corresponding to 64th doppler bin is 4K.
User’s Manual
5-401
V 1.2, 2000-01
Applications
Figure 5-1
Input given to Spectrum Analyzer
Figure 5-2
Output of FIR filter
User’s Manual
5-402
V 1.2, 2000-01
Applications
Figure 5-3
Output power spectrum considering actual PSD
Figure 5-4
20 Band averaged power spectrum
User’s Manual
5-403
V 1.2, 2000-01
Applications
5.2
Sweep Oscillator
The generation of pure tones is often used for testing DSP systems and to synthesize
waveforms of required frequencies. The basic oscillator is a special case of an IIR filter
where the poles are on the unit circle and the initial conditions are such that the input is
an impulse. If the poles are moved outside the unit circle, the oscillator output will grow
at an exponential rate. If the poles are placed inside the unit circle, the output will decay
toward zero. The state (or history) of the second-order section determines the amplitude
and phase of the future output.
The impulse of a continuous second order oscillator is given by
R(t) = e
– dt sin ωt
[5.2]
-------------ω
If d>0 then the output will decay toward zero and the peak will occur at
Arc tan ( ω ⁄ d )
t peak = ---------------------------------ω
[5.3]
The peak value will be
– dt peak
e
R ( t peak ) = ---------------------2
2
d +ω
[5.4]
A second order difference can be used to generate an approximation response of this
continuous-time output. The equation for a second-order discrete time oscillators is
based on an IIR filter and is as follows
[5.5]
Rn + 1 = a 1 yn – a 2 yn – 1 + b1 x n
where, the x input is only present for t=0 as an initial condition to start the oscillator and
a 1 = 2e
a2 = e
– dτ
cos ( ωτ )
[5.6]
– dτ
[5.7]
where, τ is the sampling period (1/fs) and ω is 2 π times the oscillator frequency.
The frequency and rate of change of envelope of the oscillator output can be changed
by modifying the values of d and ω on a sample by sample basis.
The sweep oscillator implemented here uses the function IirBiq_4_16.
When the oscillator has to be started, the function oscillator is called with one of the
arguments indicating to start new oscillator where impulse is given as an input and the
User’s Manual
5-404
V 1.2, 2000-01
Applications
delay line gets updated. From the next sample onwards input is made zero, but as the
poles lie on the unit circle the output is oscillatory at given frequency. The coefficients,
whenever there is frequency change, are calculated for that particular frequency.
Following parameters are programmable
•
•
•
•
•
The sampling frequency
Start frequency
The factor, by which frequency has to be incremented or decremented
The number of cycles for a start frequency
Number of cycles for changed frequency
Figure 5-5
User’s Manual
Sweep Oscillator
5-405
V 1.2, 2000-01
Applications
5.3
Equalizer
A Graphic Equalizer is a powerful tool to characterize and enhance audio signals.
Technically it is composed of a bank of band-pass filters, each with a fixed center
frequency and a variable gain. This kind of processing unit is called Graphic since the
position of the slider resembles the frequency response of the filters bank. Thus its
usage is extremely intuitive, moving the slider up boosts a selected band, moving it down
will cut it.
Graphic equalizer uses high quality constant Q digital filters. This allows to isolate every
filter section from the effects of the amplitude with respect to the centre frequency and
bandwidth. The result is an accurate control permitting each band not to affect the
adjacent ones.
5-band equalizer implemented uses 128-tap FIR filters to get the desired band pass filter
response. Here the function FirBlk_16 is used for FIR filtering.
The five bands are
•
0 - 170
• 170 - 600
• 600 - 3K
• 3K - 12K
• 12K - 16K
The gain in dB for each band is programmable. Also the common master gain is
programmable. The filters are designed for three sampling frequencies 32kHz, 44.1kHz,
48kHz. The user gives the desired sampling frequency as an input. Depending on this
corresponding filter bank is selected. After input is passed through all the five filters the
output of each filter is multiplied with the gain for that particular band. All the outputs are
added and then finally multiplied with master gain to get the equalizer output.
User’s Manual
5-406
V 1.2, 2000-01
Applications
0dB
-3dB
85 170 385 600
Figure 5-6
User’s Manual
1800
3K
7.5K
12K 14K 16K
frequency
5 Band Graphic Equalizer
5-407
V 1.2, 2000-01
Applications
5.4
Hardware Setup for Applications
Serial port
Parallel port
Power supply
Figure 5-7
Hardware Setup
1. Preparing the TriBoard for Debugging
Connect a parallel cable from the parallel port on the PC to the On Board Wiggler (DB25)
on the TriBoard as shown in Figure 5-7. Connect a “one to one” serial port cable from
the RS232 interface on the PC to the serial interface (RS232-0) on the TriBoard. For
details refer TriBoard manual.
2. Starting a Terminal Program
A terminal program can be used to communicate with the TriBoard via RS232. Both
transmit and receive of data is possible. The TriBoard has an RS232 transceiver on
board to meet the RS232 specification of your PC.
User’s Manual
5-408
V 1.2, 2000-01
Applications
3. Power Up the TriBoard
Connect the power supply (6V to 25V DC, power plug with surrounding ground) to the
lower left edge of the card as shown in Figure 5-7. Power up the unit. The green LED’s
next to the OCDS2 Connector indicates the right power status. The red LED near
the reset button indicates the reset status.
Once the connections are done the applications can be run over the TriBoard. The
spectrum analyzer and the equalizer applications can be run by reading the input from
the serial port of TriBoard and calculated output is sent again to serial port of TriBoard.
User’s Manual
5-409
V 1.2, 2000-01
Applications
5.4.1
Spectrum Analyzer
Frontend for Spectrum Analyzer:
Figure 5-8
Frontend of Spectrum Analyzer
Figure 5-9
Settings for Spectrum Analyzer
User’s Manual
5-410
V 1.2, 2000-01
Applications
Figure 5-10 Actual PSD of the input (128 point power spectrum)
Figure 5-11 Averaged PSD of the input (10 bands)
User’s Manual
5-411
V 1.2, 2000-01
Applications
The inputs taken from the user are
1. Actual band or average band
2. Sampling frequency
3. Cutoff frequency
Actual band gives 128 point power spectrum of the given 1024 input samples.
Sampling frequency can be one of the three choices 32K, 44.1K, and 48K.
Cutoff frequency can be one of the three choices 4K, 8K, and 16K.
From the host machine, first 1 byte is sent to the serial port of TriBoard to get the above
user inputs. Then acknowledgement is sent to host machine as 1 byte is received. Then
follows the data from the host machine to the TriBoard. 1024, 16 bit data is sent to the
TriBoard. This data is read in a buffer. The FFT of 1024 points input data is calculated.
From the frequency spectrum, power spectrum density is calculated by squaring the
scaled magnitude complex frequency samples. Then 128 point PSD is calculated from
512 point PSD by averaging. If user input is actual PSD, the 128 point PSD is sent to
serial port of TriBoard. If the user input is average input then calculated PSD is divided
into 10 segments and averaged 10 bands are sent to serial port. The host machine reads
the data on the serial port and displays actual or averages spectrum depending on user
input.
User’s Manual
5-412
V 1.2, 2000-01
Applications
5.4.2
Equalizer
Frontend for Equalizer:
Settings:
Figure 5-12 Frontend of Equalizer
User’s Manual
5-413
V 1.2, 2000-01
Applications
Figure 5-13 Settings for Equalizer
The inputs taken from the user are
1. Sampling frequency
2. 5 band gains in dB
3. Master gain in dB
Sampling frequency can be one of the three choices 32K, 44.1K and 48K.
Band gains can be from -20dB to +20dB.
Master gain can be from 0 to +50dB.
User’s Manual
5-414
V 1.2, 2000-01
Applications
From the host machine, first 13 bytes are sent to the serial port of TriBoard to get the
above user inputs. Then a one byte acknowledgement is sent to the host machine. This
is followed by the data from the host machine. 128, 16 bit data is sent to the TriBoard.
This data is read in a buffer. This is band passed through 5 Band pass filters. Each of
the outputs of the filters is multiplied by the respective gain and the final output is
generated by their sum. This is then multiplied by the master gain and sent back to the
host machine. The host machine then sends this data to an output file.
User’s Manual
5-415
V 1.2, 2000-01
Applications
User’s Manual
5-416
V 1.2, 2000-01
References
6
References
1. Digital Signal Processing by Alan V Oppenheim and Ronald W Schafer
2. Digital Signal Processing, A Practical Approach by Emmanuel C Ifeachor and Barrie
W Jervis
3. Discrete-Time Signal Processing by Alan V Oppenheim and Ronald W Schafer
4. Advanced Engineering Mathematics by Erwin Kreyszig
5. K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages,
Applications
6. W. H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the
Discrete Cosine Transform"
User’s Manual
6-417
V 1.2, 2000-01
References
User’s Manual
6-418
V 1.2, 2000-01
Frequently Asked Questions
7
Frequently Asked Questions
7.1
FIR Basics
1. What are FIR filters?
FIR filters are one of two primary types of digital filters used in Digital Signal Processing
(DSP) applications (the other type being IIR). FIR means Finite Impulse Response.
2. Why is the impulse response "finite"?
The impulse response is "finite" because there is no feedback in the filter, if an impulse
is given as an input (i.e., a single one sample followed by many zero samples), zeroes
will eventually come out after the one sample has made its way in the delay line past all
the coefficients.
3. What is the alternative to FIR filters?
DSP filters can also be Infinite Impulse Response (IIR). IIR filters use feedback, so when
an impulse is input the output theoretically rings indefinitely.
4. How do FIR filters compare to IIR filters?
Each has advantages and disadvantages. Overall, the advantages of FIR filters
outweigh the disadvantages, so they are used much more than IIRs.
a) What are the advantages of FIR Filters as compared to IIR filters?
Compared to IIR filters, FIR filters have the following advantages.
• They can easily be designed to be "linear phase". Simple linear-phase filters delay the
input signal, but do not distort its phase.
• They are simple to implement. On most DSP microprocessors, the FIR calculation can
be done by looping a single instruction.
• They are suited to multi-rate applications. By multi-rate, we mean either decimation
(reducing the sampling rate), interpolation (increasing the sampling rate) or both.
Whether decimating or interpolating, the use of FIR filters allows some of the
calculations to be omitted, thus providing an important computational efficiency. In
contrast, if IIR filters are used, each output must be individually calculated, even if that
output is discarded. (so the feedback will be incorporated into the filter.)
• They have desirable numeric properties. In practice, all DSP filters must be
implemented using finite-precision arithmetic, i.e., a limited number of bits. The use of
finite-precision arithmetic in IIR filters can cause significant problems due to the use
of feedback, but FIR filters have no feedback, so they can usually be implemented
using fewer bits.
User’s Manual
7-419
V 1.2, 2000-01
Frequently Asked Questions
• They can be implemented using fractional arithmetic. Unlike IIR filters, it is always
possible to implement an FIR filter using coefficients with magnitude of less than 1.0.
(The overall gain of the FIR filter can be adjusted at its output, if desired). This is an
important consideration when using fixed-point DSP's, because it makes the
implementation much simpler.
b) What are the disadvantages of FIR Filters as compared to IIR filters?
FIR filters sometimes have the disadvantage that they require more memory and/or
calculation to achieve a given filter response characteristic. Also, certain responses are
not practical to implement with FIR filters.
5. What terms are used in describing FIR filters?
Impulse Response - The impulse response of an FIR filter is actually just the set of FIR
coefficients. (If an impulse is put into an FIR filter which consists of a one sample
followed by many zero samples, the output of the filter will be the set of coefficients, as
the one sample moves past each coefficient in turn to form the output.)
Tap - An FIR tap is simply a coefficient/delay pair. The number of FIR taps, (often
designated as N) is an indication of
• The amount of memory required to implement the filter
• The number of calculations required
• The amount of filtering the filter can do
In effect, more taps means more stopband attenuation, less ripple, narrower filters, etc.
7.1.1
FIR Properties
Linear Phase
1. What is the association between FIR filters and linear-phase?
Most FIRs are linear-phase filters. When a linear-phase filter is desired an FIR is usually
used.
2. What is a linear phase filter?
Linear Phase refers to the condition where the phase response of the filter is a linear
(straight-line) function of frequency (excluding phase wraps at +/- 180 degrees). This
results in the delay through the filter being the same at all frequencies. Therefore, the
filter does not cause phase distortion or delay distortion. The lack of phase/delay
distortion can be a critical advantage of FIR filters over IIR and analog filters in certain
systems, for example, in digital data modems.
User’s Manual
7-420
V 1.2, 2000-01
Frequently Asked Questions
3. What is the condition for linear phase?
FIR filters are usually designed to be linear-phase (but they don’t have to be). An FIR
filter is linear-phase if (and only if) its coefficients are symmetrical around the center
coefficient, i.e., the first coefficient is the same as the last, the second is the same as the
next-to-last, etc. (A linear-phase FIR filter having an odd number of coefficients will have
a single coefficient in the center which has no mate.)
4. What is the delay of a linear-phase FIR?
The formula is simple. Given an FIR filter which has N taps, the delay is
(N - 1) / Fs, where Fs is the sampling frequency. So, for example, a 21 tap linear-phase
FIR filter operating at a 1 kHz rate has delay (21 - 1) / 1 kHz = 20 milliseconds.
Frequency Response
1. What is the Z transform of an FIR filter?
For an N-tap FIR filter with coefficients h(k), whose output is described by
y(n) = h(0) ⋅ x(n) + h(1) ⋅ x(n – 1 ) + h( 2) ⋅ x( n – 2) + … + h(N – 1) ⋅ x(n – N – 1)
[7.1]
The filter’s Z transform is
H ( z ) = h ( 0 )z
–0
+ h ( 1 )z
–1
+ h ( 2 )z
–2
+ … + h ( N – 1 )z
–( N – 1 )
[7.2]
2. What is the frequency response formula for an FIR filter?
The variable z in H(z) is a continuous complex variable and can be described as
z = re
jw
[7.3]
where,
r is the magnitude and w is the angle of z.
let r = 1, then H(z) around the unit circle becomes the filter’s frequency response H(ejw).
This means that substituting ejw for z in H(z) gives an expression for the filter’s frequency
response H(ejw), which is
jw
H ( e ) = h ( 0 )e
– j0w
+ h ( 1 )e
– j1w
+ h ( 2 )e
– j2w
+ … + h ( N – 1 )e
– j ( N – 1 )w
or
[7.4]
Using Euler’s identity,
e
– ja
= cos ( a ) – j sin ( a )
User’s Manual
[7.5]
7-421
V 1.2, 2000-01
Frequently Asked Questions
H(w) can be written in rectangular form as
H ( jw ) = h ( 0 ) [ cos ( 0w ) – j sin ( 0w ) ] + h ( 1 ) [ cos ( 1w ) – j sin ( 1w ) ] + …
+ h ( N – 1 ) [ cos ( ( N – 1 )w ) – j sin ( ( N – 1 )w ) ]
[7.6]
3. How to scale the gain of an FIR filter?
Multiply all coefficients by the scale factor.
Numeric Properties
1. Are FIR filters inherently stable?
Yes, since they have no feedback elements, any bounded input results in a bounded
output.
2. What makes the numerical properties of FIR filters good?
The key is the lack of feedback. The numeric errors that occur when implementing FIR
filters in computer arithmetic occur separately with each calculation, the FIR does not
remember its past numeric errors. In contrast, the feedback aspect of IIR filters can
cause numeric errors to compound with each calculation, as numeric errors are fed back.
The practical impact of this is that FIRs can generally be implemented using fewer bits
of precision than IIRs. For example, FIRs can usually be implemented with 16-bits, but
IIRs generally require 32-bits, or even more.
6. Why are FIR filters generally preferred over IIR filters in multirate (decimating and
interpolating) systems?
Because only a fraction of the calculations that would be required to implement a
decimating or interpolating FIR in a literal way actually needs to be done.
Since FIR filters do not use feedback, only those outputs which are actually going to be
used have to be calculated. Therefore, in case of decimating FIRs (in which only 1 of N
outputs will be used), the other N-1 outputs do not have to be calculated. Similarly, for
interpolating filters (in which zeroes are inserted between the input samples to raise the
sampling rate) the inserted zeroes need not have to be multiplied with their
corresponding FIR coefficients and sum the result, the multiplication-additions that are
associated with the zeroes are just omitted. (because they don’t change the result
anyway.)
In contrast, since IIR filters use feedback, every input must be used, and every input
must be calculated because all inputs and outputs contribute to the feedback in the filter.
User’s Manual
7-422
V 1.2, 2000-01
Frequently Asked Questions
7.1.2
FIR Design
1. What are the methods of designing FIR filters?
The three most popular design methods are (in order):
a) Parks-McClellan: The Parks-McClellan method is probably the most widely used
FIR filter design method. It is an iteration algorithm that accepts filter specifications
in terms of passband and stopband frequencies, passband ripple, and stopband
attenuation. The fact that all the important filter parameters can be directly specified
is what makes this method so popular. The Parks-McClellan method can design not
only FIR filters but also FIR differentiators and FIR Hilbert transformers.
b) Windowing: In the windowing method, an initial impulse response is derived by
taking the Inverse Discrete Fourier Transform (IDFT) of the desired frequency
response. Then, the impulse response is refined by applying a data window to it.
c) Direct Calculation: The impulse responses of certain types of FIR filters (e.g.
Raised Cosine and Windowed Sine) can be calculated directly from formulae.
User’s Manual
7-423
V 1.2, 2000-01
Frequently Asked Questions
7.2
IIR Basics
1. What are IIR filters?
IIR filters are one of two primary types of digital filters used in Digital Signal Processing
(DSP) applications (the other type being FIR). IIR means Infinite Impulse Response.
2. Why is the impulse response "infinite"?
The impulse response is "infinite" because there is feedback in the filter, if an impulse is
given as an input (a single 1 sample followed by many 0 samples), an infinite number of
non-zero values will come out (theoretically).
3. What is the alternative to IIR filters?
DSP filters can also be Finite Impulse Response (FIR). FIR filters do not use feedback.
So, for an FIR filter with N coefficients, the output always becomes zero after putting in N
samples of an impulse response.
4. What are the advantages of IIR filters as compared to FIR filters?
IIR filters can achieve a given filtering characteristic using less memory and fewer
calculations than a similar FIR filter.
5. What are the disadvantages of IIR filters as compared to FIR filters?
• They are more susceptible to problems of finite-length arithmetic, such as noise
generated by calculations and limit cycles. (This is a direct consequence of
feedback, when the output is not computed perfectly and is fed back, the imperfection
can compound.)
• They are harder (slower) to implement using fixed-point arithmetic.
• They do not offer the computational advantages of FIR filters for multirate
(decimation and interpolation) applications.
User’s Manual
7-424
V 1.2, 2000-01
Frequently Asked Questions
7.3
FFT
The Fast Fourier Transform is one of the most important topics in Digital Signal
Processing but it is a confusing subject which frequently raises questions. Here, we
answer Frequently Asked Questions (FAQs) about the FFT.
7.3.1
FFT Basics
1. What is FFT?
The Fast Fourier Transform (FFT) is a fast (computationally efficient) way to calculate
the Discrete Fourier Transform (DFT).
2. How does the FFT work?
By making use of periodicities in the sines that are multiplied to do the transforms, the
FFT greatly reduces the amount of calculation required.
Functionally, the FFT decomposes the set of data to be transformed into a series of
smaller data sets to be transformed. Then, it decomposes those smaller sets into even
smaller sets. At each stage of processing, the results of the previous stage are combined
in special way. Finally, it calculates the DFT of each small data set. For example, an FFT
of size 32 is broken into 2 FFTs of size 16, which are broken into 4 FFTs of size 8,
which are broken into 8 FFTs of size 4, which are broken into 16 FFTs of size 2.
Calculating a DFT of size 2 is trivial.
This can be explained as follows. It is possible to take the DFT of the first N/2 points and
combine them in a special way with the DFT of the second N/2 points to produce a single
N-point DFT. Each of these N/2-point DFTs can be calculated using smaller DFTs in the
same way. One (radix-2) FFT begins, therefore, by calculating N/2 2-point DFTs. These
are combined to form N/4 4-point DFTs. The next stage produces N/8 8-point DFTs and
so on, until a single N-point DFT is produced.
3. How efficient is the FFT?
The DFT takes N2 operations for N points. Since at any stage the computation required
to combine smaller DFTs into larger DFTs is proportional to N and there are log2(N)
stages (for radix-2), the total computation is proportional to N * log2(N). Therefore, the
ratio between a DFT computation and an FFT computation for the same N is
proportional to N / log2(n). In cases where N is small this ratio is not very significant, but
when N becomes large, this ratio gets very large. (Every time N is doubled, the
numerator doubles, but the denominator only increases by 1.)
4. Are FFTs limited to sizes that are powers of 2?
User’s Manual
7-425
V 1.2, 2000-01
Frequently Asked Questions
No. The most common and familiar FFTs are radix-2. However, other radices are
sometimes used, which are usually small numbers less than 10. For example, radix-4 is
especially attractive because the twiddle factors are all 1, -1, j or -j, which can be
applied without any multiplications at all.
Also, mixed radix FFTs can be done on composite sizes. In this case, you break a nonprime size down into its prime factors and do an FFT whose stages use those factors.
For example, an FFT of size 1000 might be done in six stages using radices of 2 and 5,
since 1000 = 2 * 2 * 2 * 5 * 5 * 5. It can also be done in three stages using radix-10, since
1000 = 10 * 10 * 10.
5. Can FFTs be done on prime sizes?
Yes, although these are less efficient than single-radix or mixed-radix FFTs. It is almost
always possible to avoid using prime sizes.
7.3.2
FFT Terminology
1. What is an FFT radix?
The radix is the size of an FFT decomposition. For single-radix FFTs, the transform size
must be a power of the radix.
2. What are twiddle factors?
Twiddle factors are the coefficients used to combine results from a previous stage to
form inputs to the next stage.
3. What is an "in place" FFT?
An "in place" FFT is an FFT that is calculated entirely inside its original sample
memory. In other words, calculating an "in place" FFT does not require additional buffer
memory. (as some FFTs do.)
4. What is bit reversal?
Bit reversal is just what it sounds like, reversing the bits in a binary word from left to
right. Therefore the MSB’s become LSB’s and the LSB’s become MSB’s. The data
ordering required by radix-2 FFTs turns out to be in bit reversed order, so bit-reversed
indices are used to combine FFT stages. It is possible (but slow) to calculate these bitreversed indices in software. However, bit reversals are trivial when implemented in
hardware. Therefore, almost all DSP processors include a hardware bit-reversal
indexing capability. (which is one of the things that distinguishes them from other
microprocessors.)
User’s Manual
7-426
V 1.2, 2000-01
Frequently Asked Questions
5. What is decimation in time versus decimation in frequency?
FFTs can be decomposed using DFTs of even and odd points, which is called a
Decimation-In-Time (DIT) FFT or they can be decomposed using a first-half/second-half
approach, which is called a Decimation-In-Frequency (DIF) FFT.
User’s Manual
7-427
V 1.2, 2000-01
Frequently Asked Questions
User’s Manual
7-428
V 1.2, 2000-01
Appendix
8
Appendix
Convention Document for TriLib
8.1
Introduction
8.1.1
Scope of the Document
This document describes the Programming Conventions for the TriCore DSP Library.
The purpose of the document is to bring out a unified programming style for the TriCore
DSP. It is recommended that the guidelines and the conventions be observed to
organize each DSP application software. This ensures uniform and well-structured code.
User’s Manual
8-429
V 1.2, 2000-01
Appendix
8.2
File Organization
8.2.1
File Extensions
The Software application, TriLib should be organized as a collection of modules or files
that belongs to any one of the following categories. The following table brings out the
details of the different categories of files.
Table 8-1
Type
Directory Structure
Extension
Description
’C’ Source files *.c
C Language Source files
Include files
The include files for the ’C’ and the assembly
functions. The C include files generally have *.h as
extension. Assembly can have different extensions
based on the compiler in use. All the include files
should define the global constants and variable
types, if any. They should not allocate memory or
define functions as this prevents them from being
included by multiple source files. All subroutines
which form part of the overall interface to a source
file should be declared in include file. This provides
a convenient overview of the interface and allows
the compiler or assembler to check for errors
*.h, *.inc
Testvector files *.dat
These files should only contain data to be used for
test purposes or algorithmic usage. There must not
be any code in these data files. These files, if used,
will probably be included or copied (.include
directive) in other source files or assembled as
stand-alone modules. These files can also be given
as the command line argument for the example
programs depending upon the implementation
Build files
*.pjt, *.bld, *.out It is strongly recommended that a project make file
is maintained that checks for any out-of-date target
files and builds them automatically. Different
compilers use different extension for the build files.
TriCore
Source files
*.asm, *.tri, *.S
User’s Manual
Different compilers use different extensions for the
assembly source files. Generally *.asm file is widely
accepted by many compilers.
8-430
V 1.2, 2000-01
Appendix
8.2.2
File Naming Conventions
The Files will be named using the following convention. This helps in easy identification
of the file.
• All the Source files of TriCore assembly will have *.asm, *.tri or *.S extension
depending upon the compiler being used. The name can be formulated by using the
following convention.
<Function class Operation name>_<Suffix info>.asm/(.tri)/(.S)
The suffix has to be numeric that gives the
information such as data size (16 or 32 bits)
of input in case of arithmetic operations, or
constraint on the order of Filters, say multiple
of four (this is optional and can be used
wherever applicable). When order and bit
information are required, the suffix info is
exploded as <order>_<no.bits>
Abbreviated function name approximately in multiples of three
letters for each concept or words.
a. The initial three letters will be the class of the functions such as
Finite Impulse Response filters and can be represented as ’Fir’
b. The next three letters will be operation name such as for block
operation it can be represented as ’Blk’ or for Maximum Index
as ’MaxIdx’
8.2.3
File Header and Guidelines
The following is the format of the file header.
//**********************************************************************************************
User’s Manual
8-431
V 1.2, 2000-01
Appendix
// @Module:
Name of the function or module (e.g., main())
// @Filename:
Name of the file with extension (e.g., expFir_4_16.c)
// @Project:
Name of the Project (DSP Library for Tricore V1.2,
V1.3)
// @Controller:
Name of the controller (TriCore V1.2, V1.3)
// @Compiler:
Compiler name (Tasking or GHS or GNU)
// @Version:
Version of the S/W
// @Description:
The description of the file
// @See Also:
List the include files used
// @References:
List the reference documents /manuals
// @Caveats:
Caveats if any
// @Date:
Date (only in this format dd mm yy e.g., 14th Jan
2000)
// @History:
Revision history or the modification details
//------------------------------------------------------------------------------------------------------------Notes
• The names in the fields - module, file name etc., should match exactly with the existing
name of the file and the module. Consistency should be maintained in all the fields
wherever there are multiple references.
• The description should provide the information about the implementation in the file
and the global issues, if any.
User’s Manual
8-432
V 1.2, 2000-01
Appendix
8.3
Coding Rules and Conventions for ’C’ and ’C++’
This section describes the coding rules and conventions for C/C++ languages.
8.3.1
File Organization
• It is recommended to have one functional module in one file. This can be relaxed when
the functional module is very small and does not justify having a separate file.
• Tab size is always set to four white spaces.
8.3.2
Function Declaration
The general recommendations and rules for the function declaration are as follows.
• Declaration of all global interface functions should be done in a header file, which
should be made available to the external programs.
• All local functions should be declared in the respective C files that makes use of them.
This should not be visible outside.
• All functions, arguments, and variables must be explicitly declared. If a function does
not return a value, then the return type should be void.
• Function definition should never be put in a .h header file unless it is an inline function
this is applicable only for C++.
• Declare all external functions in a .h header file.
• Do not #include .c files.
• Any module that needs to provide extern variables must provide a header file that
declares them. Other modules that need to reference the extern variable should
include that header file.
• All global variables should be declared as extern in the common header file. This
avoids the multiple declaration if included in multiple files.
Function definition should have the following syntax.
<return_type> <func_name>(<data_type><param1>,
<data_type><param2>,
...
...
<data_type><paramn>)
{
/*********Declaration of local variables
/* comments */
/* comments */
/* comments */
********/
/***** Description about the body below**********/
/**** Body *****/
....
....
....
/***** Start of loop *****/
User’s Manual
8-433
V 1.2, 2000-01
Appendix
{
} /***** Mark end of loop here *****/
/*****Mark end of body here ******/
}/* Mark end of function here with the <func_name> ***/
8.3.3
Variable Declaration
The general recommendations and rules for the variable declaration is as follows.
• All global variables should be defined in a .c file and not in a .h file. In the .h header
file, it should be declared as extern.
• If different types of variables are declared in a file, there should be a clear demarcation
between the global variables for the project and the global variables for a file.
• Declare the class of variables in groups with a general comment. Determination of the
class can be done on basis of usage, locality, etc.
• Local variables should be declared only at the beginning of the function for greater
visibility.
Example:
void func_name()
{
int x;
/****** body of the function*****/
int y; /* improper - never declare a variable inside the body of the
function */
/******end of the body***********/
}
• Never mix the index variables or pointer variables with that of the other local variables
in the declaration.
Example:
int
int
int
int
i, temp_32, *pTable;
i;
*pTable;
temp_32;
/*
/*
/*
/*
Improper */
Correct */
Correct */
Correct */
• Declare and use the variables as per the naming convention that is formalized for
each of the projects.
• For pointer variable declaration, use the '*' sign near to the variable name and in case
of multiple pointer declaration, use the '*' sign separately for each of the variables.
User’s Manual
8-434
V 1.2, 2000-01
Appendix
• Never initialize the pointer in the same line where it is declared, do it explicitly to
increase the visibility.
8.3.4
Comments
• Comments should be written at the beginning of the body of the function to describe
its activity.
• Comments and code should not cross the 79th column of the line. In case there is a
need to further comment, use the next line and start in the same column it was started
in previous line.
• Comments should be to the point.
• Comments should be avoided where the code itself is sufficient to understand the flow
of the program.
• Comments are mandatory at the beginning of the new block. It should explain the
purpose and the operation of that block.
• Arithmetic and logical operations can be represented by means of symbols in the
comments to make it short and increase the readability.
User’s Manual
8-435
V 1.2, 2000-01
Appendix
8.4
Coding Rules and Conventions for Assembly Language
This section describes the coding rules and conventions for the Assembly language.
8.4.1
File Organization
• It is recommended to have one functional module in one file. This can be relaxed when
the functional module is very small and does not justify having a separate file.
• Tab size is always set to four white spaces.
8.4.2
General Coding Guidelines
The following describes the order of declaration and syntax for the same in the assembly
language programs.
• Include syntax should start from the 1st column since some assemblers does not
accept if it is other than 1st column.
Example:
; -------- Section for all include header files -------------.include file.h
• All include files should have a preprocessor directive at the beginning.
Example:
#ifndef _TriLib_h
#define _TriLib_h
....
....
#endif // end of _TriLib_h include file
• Describe the external references
Example:
; -------- Section for external references ------------------.global
_mpy32
;here _mpy32 is the global label that
;can be referenced in other files by using extern
.extern
_mpy32
;used to refer the global labels.
; -------- Section for constants ----------------------------Pi
.set
3.14
Localvarsize .set
1
User’s Manual
8-436
V 1.2, 2000-01
Appendix
Note: .equ directive can also be used here but .set can be used if one needs to
change the value at a later point in the program.
• Constant definitions for the pointer offsets
Example for Tasking Compiler:
.define
.define
.define
W16
W32
W64
’2’
’4’
’8’
;Two bytes offset
;Four bytes offset
;Eight bytes offset
Example for GHS Compiler:
#define
#define
#define
W16
W32
W64
2
4
8
;Two bytes offset
;Four bytes offset
;Eight bytes offset
Example for GNU Compiler:
.equ
.equ
.equ
W16
W32
W64
2
4
8
;Two bytes offset
;Four bytes offset
;Eight bytes offset
• Use the freely available registers for local variables and document the same.
Otherwise, use the macros which will set aside a frame for the required size by
decrementing the stack.
Example:
FEnter
5
;will decrement the stack by 5 words
(FEnter is the macro that subtracts the stack pointer by the required number which is
passed as the argument)
• Labels must be written in the same convention as that of the function naming
convention and should start from the 1st column. It is recommended that all labels
should have some prefix that relates it to the function it belongs. This helps to avoid
duplicate label names in different files.
For instance, all labels in an assembly function named Function1 could begin with the
prefix F1_. A label should end with a colon character.
User’s Manual
8-437
V 1.2, 2000-01
Appendix
Example:
In case of a Finite Impulse Response filter, a label can be written as FirS4_TapL: for tap
loop of FIR on sample, coefficient multiple of 4. This helps to identify a label from
mnemonics and other assembler directives.
• All instruction mnemonics must be written in lower-case letters. Instruction
mnemonics must begin from the 5th column of each line. All operands must start from
the 17th column. Most text editors can be configured to position tabs to any column
number. In case of multiple operands, they should be separated with a comma.
• When writing a complex assembly language function, it is sometimes difficult to keep
track of the contents of registers. Use of symbolic names to replace registers can
improve readability of code. It is recommended that .define or #define assembler
directives be used depending upon the compiler used to substitute registers with
appropriate symbolic names. Since a register may be used for more than one purpose
during the execution of a program, more than one symbolic name can be equated to
one register. Note that all symbols replacing registers should be in the convention as
described in the section 7.4.4, as shown in the following example.
Example for Tasking compiler:
.define
.define
.define
caeDLY
caoDLY
aTapLoops
"a12"
"a13"
"a14"
;Even-Reg of Circ-Ptr
;Odd-Reg of Circ-Ptr
;Number of taps
Another advantage of using symbolic names to identify registers is maintainability of the
code. By using symbolic names for registers, it becomes easier to change register
assignments later. For example, if a function uses A1 as an input parameter pointing to
an array but the calling function prefers using A2 for that purpose, the .define directive
in the called function can be modified to equate the input array symbol with A2 instead
of A1. If a symbol had not been equated to A1 in the called function, it would have
required a search-and-replace operation to find all occurrences of A1 and replace them
with A2. Symbolic names should be used whenever it is possible.
• Comments can either begin from the 37th column or from the 1st column if the entire
line is required for lengthy comments at the beginning of the block. This rule is for
general instruction wise commenting only. In case of block or program commenting,
which is trying to explain about the overall function/algorithm, it can start from 1st
column. Remember the commenting is inclusive of the semicolon also. Comments
should be avoided between parallel instructions. The commenting conventions are
described in the later section.
User’s Manual
8-438
V 1.2, 2000-01
Appendix
Example:
1st Column
5th Column
Fir_b:
Ld.da
17th Column
37th Column
caDLY,[aDLY]
;Load the Circ-Ptr of
;Delay-Buffer to reg
;pair caDLY
; This long comment refers to the next group of instructions.
; for readability, this sentence begins from the fourth column.
1st Column
8.4.3
Function Organization
The general function organization is as follows. Changes can be made to suit the
requirements.
Function_name_label
----------Prolog of fn starts here-------SP = SP + Locvarsize
;Allocate local variables in stack
----------End of prolog------------------Body of function......
----------Epilog starts here-------------SP=SP-Locvarsize
;Deallocate local variables
;in stack
----------End of epilog------------------RETURN
User’s Manual
8-439
V 1.2, 2000-01
Appendix
• If there is a reference code or pseudocode, use the same variable names for easy
debugging and maintenance.
• Loop start and end should be commented for easy identification.
;--------------------------loop start---------------------------Body of loop
;--------------------------loop end------------------------------
8.4.4
Variables and Argument Convention
The variables should have following conventions.
Prefix
Variables
s
Short (16 bit value)
ss
Two short values in a 32 bit register
ssss
Four short values in a 64 bit register
l
Long (32 bit) in a 32 bit register
ll
Two long in a 64 bit register
a
Address register or data type prefix
dTmp
Temporary data register
n
Loop count data register
ca
Circular buffer address register pair
aa
Pointer to pointer
o
Odd register
e
Even register
Example:
;Registers used for storing input Data Registers (Tasking)
.define
ssXa
"d10"
;D10-Register holds 2 inputs
.define
ssXb
"d11"
;D11-Register holds 2 inputs
.define
ssssXab "d10"
;E10-Register holds 4 inputs
.define
aVec1
"d11"
;A1 is the address register
.define
nCnt
"a5"
;A5 used as loop counter
.define
caH
"a6"
;A6 is the pointer to circular
User’s Manual
8-440
V 1.2, 2000-01
Appendix
;buffer address pointer
• Define a temporary register of two short values
Example:
.define
dTmp
"d4"
;Generic temp-data-reg
• Define the lower half or the upper half of the registers explicitly for GHS and GNU
compilers whereas for Tasking it is not needed.
Example for the incorrect implementation:
.define
.define
lKa
lKa_UL
"d8"
"D8ul"
;d8-Register
;
maddm.h
Acc,Acc,drXb,lKa_UL,1
Example for the correct implementation:
.define
ssKa
"d8"
;d8-Register holds
maddm.h
Acc,Acc,ssXb,ssKa ul,#1
• Use a consistent notation. Always use the symbolic name that is defined. Do not mix
the symbolic names with the register names.
Example for the incorrect implementation:
.define
caCoef
"a6/a7"
ld.da
caDelay,[A7]
ld.w
lKb,[caCoef+c]2*w16
;A6/A7-Circ-buf
;Use absolute
;register name
;Use define
If the defines are changed then the absolute names will not match. Also the probability
of making errors is high, and the code is not readable. In case of defines that use a
register pair (e.g. caH), additional defines can be used for individual odd and even
registers.
User’s Manual
8-441
V 1.2, 2000-01
Appendix
8.4.5
Function Header and Guidelines
The format of the function header is as follows.
;**********************************************************************
; Return_Value
Function_Name ( Arg1,
Arg2,
……..
……..
Arg N);
; INPUTS:
Input parameters
; OUTPUTS:
Output parameters
; RETURN:
Return value and type and its significance
; DESCRIPTION:
Describe the function if relevant give the formula,
C code, Error conditions, etc.
; ALGORITHM:
Algorithm of the implementation in simple english or
in the pseudo C syntax equations etc.
; TECHNIQUES:
List the different techniques of optimization used in
the implementation
; ASSUMPTIONS:
List the assumptions made
; MEMORY NOTE:
Table to depict the variables and the its type, name,
alignment, etc.
; REGISTER USAGE:
List of registers used in this function
; CYCLE COUNTS:
Profiled result in terms of number of cycles
; CODE SIZE:
Size in terms of words of memory
; DATE:
Date
; VERSION:
Version of the function
;**********************************************************************************************
User’s Manual
8-442
V 1.2, 2000-01
Appendix
Notes
• The signature of the function should be same as what is declared as the function
prototype.
• The input/output parameters are passed to the function as arguments. Sometimes the
input parameters can also act as the output parameters, such as a pointer variable
getting used and updated inside the function. This information should be explained in
this field. This field should have information about the type of parameter, its normal
value or range of values and it's significance.
• Return values should not be mixed with the output parameters. Sometimes return
values are themselves the output values of the function. In DSPLIB implementation,
the return values are generally void in many cases as the output will be in form of an
array, etc. The return value should give information about the type, range of values
and its significance.
• The description field should contain the required description of the function, without
any redundant information. It should contain equations wherever applicable. The
purpose of the description is to give a good overview of the function and the
methodology of implementation. It should also contain information on the
implementation with right justification for a specific method, which is followed in the
implementation. Alternative methodologies can also be discussed which are optional.
Error conditions should be discussed wherever applicable.
• Any assumptions that are made in the implementation such as bits of precision, range
of values etc., should be mentioned under assumptions. The assumption should deal
only with the implicit requirements of the function. Any direct given data or the
requirements should not be listed in the assumptions list.
User’s Manual
8-443
V 1.2, 2000-01
Appendix
8.5
Testing
8.5.1
Test Methodology
• Testing of the DSP library is done using the test vectors that are developed internally.
• The reference 'C' code is developed and reviewed critically.
• For few codes the input test vectors (test cases) are used to generate the reference
output test vectors using the reference 'C' code.
• The module under test will be tested using the test vector. The output of the module
will be cross-examined for correctness with the reference output test vectors. This is
test for the PASS/FAIL criterion.
• For all the codes the input test vectors are given in the example main of the function.
Same test case can be given to test code and outputs of both can be verified.
8.5.2
Convention
Refer Test Design Specification: INF_DSP.1.0.TD.1.0 dated March 01, 2000.
User’s Manual
8-444
V 1.2, 2000-01
Appendix
8.6
Compiler Support
8.6.1
General Common System
The TriLib implementation is designed for multiple compilers. TriCore processor is
supported by three compilers at present namely,
• Tasking
• GHS
• GNU
TriLib should be implemented with and without language extensions. It is intended not to
have any changes in the organization of the code to support the different compilers.
Since the implementation of each of the compilers varies from one another, it is expected
that the implementation of the TriLib cannot be uniform across the compilers.
The following sections will bring in the details of how to support the TriLib in Tasking,
GHS and the GNU compilers. The main idea of this is to bring in the aspects of portability
and extensibility across different platforms.
8.6.2
Distinguishing Tasking, GHS and GNU Specific Directives
Tasking compiler, GHS and GNU have a specific set of assembler directives, refer the
individual documentation for more details.
Principally, all the compilers have some directive which are same by syntax and usage
perspective. There are also some equivalent directives whose syntax differs. Finally
there are some distinctive sets of directives, which are specific to each of the compilers.
Refer individual documentation for more details on the language extensions part of each
of the compilers.
8.6.3
Note on Implementation on Different Compilers
Table 8-2
Equal Directives
Tasking Compiler
GHS Compiler
GNU Compiler
.align
.align
.align
.byte
.byte
.byte
.word
.word
.word
.double
.double
.double
.float
.float
.float
User’s Manual
8-445
V 1.2, 2000-01
Appendix
Table 8-2
Equal Directives
.space
.space
.space
.set
.set
.set
.extern
.extern
.extern
.include
.include
.include
.macro
.macro
.macro
.endm
.endm
.endm
.exitm
.exitm
.exitm
.if
.if
.if
.else
.else
.else
.endif
.endif
.endif
Table 8-3
Directives with the same functionality but different syntax
Tasking Compiler
GHS Compiler
GNU Compiler
.define
#define
#define
.global
.globl
.global/.globl
.sect ".text"
.text
.text
.sect ".data"
.data
.data
.half
.hword
.hword
Table 8-4
Datatypes with DSPEXT
Tasking Compiler
GHS Compiler
GNU Compiler
_sfract
fract16
Not applicable
_fract
fract32
Not applicable
_sfract_circ
circptr<frac16>
Not applicable
_fract_circ
circptr<frac32>
Not applicable
User’s Manual
8-446
V 1.2, 2000-01
Appendix
Table 8-4
Datatypes with DSPEXT
struct
{
_sfract imag;
_sfract real;
} CplxS;
struct
{
frac16 imag;
frac16 real;
} CplxS;
Not applicable
struct
{
_fract imag;
_fract real;
} CplxL;
struct
{
frac32 imag;
frac32 real;
} CplxL;
Not applicable
Datatypes without DSPEXT are same for all compilers. They are as shown
Table 8-5
Datatypes without DSPEXT
Data Size
Data Type
16-bit
short
32-bit
int
Circular buffer structure 16-bit
struct
{
short *base;
short index;
short base;
} CptrDataS
Circular buffer structure 32-bit
struct
{
int *base;
short index;
short base;
} CptrDataL
Complex 16-bit
{
short imag;
short real;
} CplxS
Complex 32-bit
{
int imag;
int real;
} CplxL
User’s Manual
8-447
V 1.2, 2000-01
Appendix
The instructions which need to be changed for porting.
1. Instructions using address register pair: In case of instruction using address
register pair for GNU one need to specify even address register of the register pair.
Example for Tasking Compiler:
ld.da caDLY,[aDLY]0
Example for GHS Compiler:
ld.da caDLY,[aDLY]0
Example for GNU Compiler:
ld.da caeDLY,[aDLY]0
2. Definition of data register pair: It should be as shown below.
Example for Tasking Compiler:
.define
llAcc
"d12/d13" or
.define
llAcc
"e12"
Example for GHS Compiler:
#define
llAcc
"d12/d13 or
#define
llAcc
e12
Example for GNU Compiler:
#define
llAcc
%e12
3. Instructions using packed multiply-add: For instructions using packed multiply-add
where lower or upper 16-bits of registers have to be specified, in case of GHS and
GNU those registers need to be explicitly defined.
Example for Tasking Compiler:
maddm llAcc, llAcc, ssex, ssOH ul, #1
In case of GHS the ssOH_ul need to be defined as
#define ssOH d9
#define ssoH_ul d9ul
User’s Manual
8-448
V 1.2, 2000-01
Appendix
Example for GHS Compiler:
maddm llAcc, llAcc, ssex, ssOH_ul, 1
In case of GNU the ssOH_ul need to be defined as
#define ssOH %d9
#define ssoH_ul %d9ul
Example for GNU Compiler:
maddm llAcc, llAcc, ssex, ssOH_ul, 1
4. Arithmetic Instruction using same source and destination register: Any
arithmetic instruction where source and destination registers are same GHS needs to
explicitly specify registers but it works on Tasking.
Example for Tasking Compiler:
add dTmp, #1 or
add dTmp, dTmp, #1
Example for GHS Compiler:
add dTmp, dTmp, 1
Example for GNU Compiler:
add dTmp, dTmp, 1
5. Reading data from the data section: While reading data from the data section of the
code the label of data section should be preceded by %sdaoff in case of GHS
Example for Tasking Compiler:
lea aH, CoeffTab
Example for GHS Compiler:
lea aH, %sdaoff(CoeffTab)
Example for GNU Compiler:
lea aH, CoeffTab
User’s Manual
8-449
V 1.2, 2000-01
Appendix
6. Macro definition:
Example for Tasking Compiler:
macro_name .macro
Example for GHS Compiler:
.macro macro_name
Example for GNU Compiler:
.macro macro_name
7. The arguments sent to macro:
For Tasking and GHS they will be used as it is where as in case of GNU it is preceded
by \ in the code of macro.
Example for Tasking Compiler:
FirDec .macro Ev_Coef,Ev_Coef_Od_Df
.if Ev_Coef == TRUE
sh
dTmp1, dTmp1, #-1
;>>1 2Taps/loop
Example for GHS Compiler:
.macro FirDec Ev_Coef,Ev_Coef_Od_Df
.if Ev_Coef == TRUE
sh
dTmp1, dTmp1, -1
;>>1 2Taps/loop
Example for GNU Compiler:
.macro FirDec Ev_Coef,Ev_Coef_Od_Df
.if \Ev_Coef == TRUE
sh
dTmp1, dTmp1, -1
//>>1 2Taps/loop
8. Loop within macro:
For Tasking the label for loop within macro should always have first character as ^ , e.g.
^conv_conL where as for GHS label need to be a number and where the loop
instruction encounters the label should be that number with a letter b as it is a backward
jump. For forward jump it should be f.
User’s Manual
8-450
V 1.2, 2000-01
Appendix
Example:
For Tasking: ^conv_conL :
.
.
loop aloopcount, ^conv_conL
For GHS: 1:
.
.
loop aloopcount, 1b
For GNU: 1:
.
.
loop aloopcount, 1b
9. cmov instruction: Instruction cmovn does not work for GHS ver 2.0 it has to be
replaced by seln.
Example for Tasking Compiler:
cmovn loAcc, dTmp2, dTmp1
Example for GHS Compiler:
seln loAcc, dTmp2, dTmp1, loAcc
Example for GNU Compiler:
seln loAcc, dTmp2, dTmp1, loAcc
10. Jump Instruction: Jump instruction syntax is different across these compilers.
Example for Tasking Compiler:
jnz.t
dTmp:0, label
Example for GHS Compiler:
jnz.t
dTmp,0, label
User’s Manual
8-451
V 1.2, 2000-01
Appendix
Example for GNU Compiler:
jnz.t
dTmp,0, label
Note:
The instruction jz works only for the GreenHills V2.0.2. For old versions of GreenHills this
instruction is not supported.
User’s Manual
8-452
V 1.2, 2000-01
Glossary
9
Glossary
A
Acquisition Time
The time required for a sample-and-hold (S/H) circuit to capture
an input analog value. Specifically, the time for the S/H output
to approximately equal its input.
Adaptive Delta
Modulation (ADM)
A variation of delta modulation in which the step size may vary
from sample to sample.
ADC (or A/D,
Analog-to-Digital
Converter)
The electronic component which converts the instantaneous
value of an analog input signal to a digital word (represented as
a binary number) for Digital Signal Processing. The ADC is the
first link in the digital chain of signal processing.
ADPCM (Adaptive
Differential Pulse
Code Modulation)
A very fast data compression algorithm based on the
differences occurring between two samples.
Algorithm
A structured set of instructions and operations tailored to
accomplish a signal processing task. For example, a Fast
Fourier Transform (FFT), or a Finite Impulse Response (FIR)
filter are common DSP algorithms.
Aliasing
The problem of unwanted frequencies created when sampling
a signal of a frequency higher than half the sampling rate.
All-Pass Filter
A filter that provides only phase shift or phase delay without
appreciable changing the magnitude characteristic.
Amplitude
1. Greatness of size, magnitude.
2. Physics. The maximum absolute value of a periodically
varying quantity.
3. Mathematics.
a) The maximum absolute value of a periodic curve
measured along its vertical axis.
b) The angle made with the positive horizontal axis by the
vector representation of a complex number.
4. Electronics. The maximum absolute value reached by a
voltage or current waveform.
User’s Manual
9-453
V 1.2, 2000-01
Glossary
Analog
A real world physical quantity or data, characterized by being
continuously variable (rather than making discrete jumps),
and can be as precise as the available measuring technique.
ANSI (American
National
Standards
Institute)
A private organization that develops and publishes standards
for voluntary use in the U.S.A.
Anti-Aliasing Filter
A low-pass filter used at the input of digital audio converters to
attenuate frequencies above the half-sampling frequency to
prevent aliasing.
Anti-Imaging Filter
A low-pass filter used at the output of digital audio converters
to attenuate frequencies above the half-sampling frequency to
eliminate image spectra present at multiples of the sampling
frequency.
ASCII
(pronounced "askee") (American
Standard Code for
Information
Interchange)
An ANSI standard data transmission code consisting of seven
information bits, used to code 128 letters, numbers, and special
characters. Many systems now use an 8-bit binary code, called
ASCII-8, in which 256 symbols are represented (for example,
IBM’s "extended ASCII").
Asymmetrical
(non-reciprocal)
Response
Term used to describe the comparative shapes of the boost/cut
curves for variable equalizers. The cut curves do not mirror the
boost curves, but instead are quite narrow, intended to act as
notch filters.
Asynchronous
A transmission process where the signal is transmitted without
any fixed timing relationship between one word and the next
(and the timing relationship is recovered from the data stream).
B
Bandpass Filter
A filter that has a finite passband, neither of the cutoff
frequencies being zero or infinite. The bandpass frequencies
are normally associated with frequencies that define the half
power points, i.e., the -3 dB points.
Band-Limiting
Filters
A low-pass and a high-pass filter in series, acting together to
restrict (limit) the overall bandwidth of a system.
User’s Manual
9-454
V 1.2, 2000-01
Glossary
Bandwidth
Abbreviation. BW
The numerical difference between the upper and lower -3 dB
points of a band of audio frequencies. Used to figure the Q, or
quality factor for a filter.
Bilinear Transform
A mathematical method used in the transformation of a
continuous time (analog) function into an equivalent discrete
time (digital) function. Fundamentally important for the design
of digital filters. A bilinear transform ensures that a stable
analog filter results in a stable digital filter, and it exactly
preserves the frequency-domain characteristics, albeit with
frequency compression.
Bit Error Rate
The number of bits processed before an erroneous bit is found
(e.g. 10E13), or the frequency of erroneous bits (e.g. 10E-13).
Bit Rate
The rate or frequency at which bits appear in a bit stream. The
bit rate of raw data from a CD, for example, is 4.3218 MHz.
Bit Stream
A binary signal without regard to grouping.
Bit-Mapped
Display
A display in which each pixel’s color and intensity data are
stored in a separate memory location.
Boost/Cut
Equalizer
The most common graphic equalizer. Available with 10 to 31
bands on octave to 1/3-octave spacing. The flat (0 dB) position
locates all sliders at the center of the front panel. Comprised of
bandpass filters, all controls start at their center 0 dB position
and boost (amplify or make larger) signals by raising the
sliders, or cut (attenuate or make smaller) the signal by
lowering the sliders on a band-by-band basis. Commonly
provide a center-detent feature identifying the 0 dB position.
Proponents of boosting in permanent sound systems argue
that cut-only use requires adding make-up gain which runs
the same risk of reducing system headroom as boosting.
Buffer
In data transmission, a temporary storage location for
information being sent or received.
Burst Error
A large number of data bits lost on the medium because of
excessive damage to or obstruction on the medium.
User’s Manual
9-455
V 1.2, 2000-01
Glossary
Bus
One or more electrical conductors used for transmitting signals
or power from one or more sources to one or more
destinations. Often used to distinguish between a single
computer system (connected together by a bus) and multicomputer systems connected together by a network.
C
Cartesian
Coordinate
System
1. A two-dimensional coordinate system in which the
coordinates of a point in a plane are its distances from two
perpendicular lines that intersect at an origin, the distance
from each line being measured along a straight line parallel
to the other.
2. A three-dimensional coordinate system in which the
coordinates of a point in space are its distances from each
of three perpendicular lines that intersect at an origin. After
the Latin form of Descartes, the mathematician who
invented it.
Codec (CodeDecode)
A device for converting voice signals from analog to digital for
use in digital transmission schemes, normally telephone
based, and then converting them back again. Most codecs
employ proprietary coding algorithms for data compression,
common examples being Dolby’s AC-2, ADPCM, and MPEG
schemes.
Compander
A contraction of compressor-expander. A term referring to
dynamic range reduction and expansion performed by first a
compressor acting as an encoder, and second by an expander
acting as the decoder. Normally used for noise reduction or
headroom reasons.
Complex
Frequency
Variable
An AC frequency in complex number form.
Complex Number
Mathematics
Any number of the form a + bj, where a and b are real numbers
and j is an imaginary number whose square equals -1 and a
represents the real part (e.g., the resistive effect of a filter, at
zero phase angle) and b represents the imaginary part (e.g.,
the reactive effect, at 90 phase angle).
User’s Manual
9-456
V 1.2, 2000-01
Glossary
Compression
1. An increase in density and pressure in a medium, such as
air, caused by the passage of a sound wave.
2. The region in which this occurs.
Compression
Wave
A wave propagated by means of the compression of a fluid,
such as a sound wave in air.
Constant-Q
Equalizer (also
ConstantBandwidth)
Term applied to graphic and rotary equalizers describing
bandwidth behavior as a function of boost/cut levels. Since Q
and bandwidth are inverse sides of the same coin, the terms
are fully interchangeable. The bandwidth remains constant for
all boost/cut levels. For constant-Q designs, the skirts vary
directly proportional to boost/cut amounts. Small boost/cut
levels produce narrow skirts and large boost/cut levels produce
wide skirts.
Convolution
A mathematical operation producing a function from a certain
kind of summation or integral of two other functions. In the time
domain, one function may be the input signal, and the other the
impulse response. The convolution than yields the result of
applying that input to a system with the given impulse
response. In DSP, the convolution of a signal with FIR filter
coefficients results in the filtering of that signal.
Correlation
A mathematical operation that indicates the degree to which
two signals are alike.
Crest Factor
The term used to represent the ratio of the peak (crest) value
to the RMS value of a waveform.
Critical Band
Physiology of
Hearing
A range of frequencies that is integrated (summed together) by
the neural system, equivalent to a bandpass filter (auditory
filter) with approximately 10-20% bandwidth (approximately
one-third octave wide).
[Although the latest research says critical bands are more like
1/6-octave above 500 Hz, and about 100 Hz wide below 500
Hz]. The ear can be said to be a series of overlapping critical
bands, each responding to a narrow range of frequencies.
Introduced by Fletcher (1940) to deal with the masking of a
pure-tone by wideband noise.
User’s Manual
9-457
V 1.2, 2000-01
Glossary
Cut-Only Equalizer
Term used to describe graphic equalizers designed only for
attenuation. (Also referred to as notch equalizers, or bandreject equalizers). The flat (0 dB) position locates all sliders at
the top of the front panel. Comprised only of notch filters
(normally spaced at 1/3-octave intervals), all controls start at 0
dB and reduce the signal on a band-by-band basis. Proponents
of cut-only philosophy argue that boosting runs the risk of
reducing system headroom.
Cutoff Frequency
Filters
The frequency at which the signal falls off by 3 dB (the half
power point) from its maximum value. Also referred to as the 3 dB points, or the corner frequencies.
D
DAC (or D/A,
Digital-to-Analog
Converter)
The electronic component which converts digital words into
analog signals that can then be amplified and used to drive
loudspeakers, etc. The DAC is the last link in the digital chain
of signal processing.
Decibel
Abbreviation. dB
A unit used to express relative difference in power, intensity,
voltage or other, between two acoustic or electric signals, equal
to ten times (for power ratios - twenty times for all other ratios)
the common logarithm of the ratio of the two levels. Equal to
one-tenth of a bel.
Delta Modulation
A single-bit coding technique in which a constant step size
digitizes the input waveform. Past knowledge of the information
permits encoding only the differences between consecutive
values.
User’s Manual
9-458
V 1.2, 2000-01
Glossary
Delta-Sigma
Modulation (also
Sigma-Delta)
An analog-to-digital conversion scheme rooted in a design
originally proposed in 1946, but not made practical until 1974
by James C. Candy. The name delta-sigma modulation was
coined by Inose and Yasuda at the University of Tokyo in 1962,
but due to a misunderstanding the words were interchanged
and taken to be sigma-delta. Both names are still used for
describing this modulator. Characterized by oversampling and
digital filtering to achieve high performance at low cost, a deltasigma A/D thus consists of an analog modulator and a digital
filter. The fundamental principle behind the modulator is that of
a single-bit A/D converter embedded in an analog negative
feedback loop with high open loop gain. The modulator loop
oversamples and processes the analog input at a rate much
higher than the bandwidth of interest. The modulator’s output
provides 1-bit information at a very high rate and in a format
that a digital filter can process to extract higher resolution (such
as 20-bits) at a lower rate.
Digital Audio Data
Compression,
commonly
shortened to
"Audio
Compression."
Any of several algorithms designed to reduce the number of
bits (hence, bandwidth and storage requirements) required for
accurate digital audio storage and transmission. Characterized
by being "lossless" or "lossy". The audio compression is "lossy"
if actual data is lost due to the compression scheme, and
"lossless" if it is not. Well designed algorithms ensure "lost"
information is inaudible.
Digital Audio
The use of sampling and quantization techniques to store or
transmit audio information in binary form. The use of numbers
(typically binary) to represent audio signals.
Digital Filter
Any filter accomplished in the digital domain.
Digital Signal
Any signal which is quantized (i.e., limited to a distinct set of
values) into digital words at discrete points in time. The
accuracy of a digital value is dependent on the number of bits
used to represent it.
Digitization
Any conversion of analog information into a digital form.
Discrete Fourier
Transform (DFT)
A DSP algorithm used to determine the fourier coefficient
corresponding to a set of frequencies, normally linearly
spaced.
User’s Manual
9-459
V 1.2, 2000-01
Glossary
DSP (Digital Signal
Processing)
A technology for signal processing that combines algorithms
and fast number-crunching digital hardware and is capable of
high-performance and flexibility.
F
FFT (Fast Fourier
Transform)
A DSP algorithm that is the computational equivalent to
performing a specific number of discrete fourier transforms, but
by taking advantage of computational symmetries and
redundancies, significantly reduces the computational burden.
FIR (Finite
ImpulseResponse) Filter
A commonly used type of digital filter. Digitized samples of the
audio signal serve as inputs and each filtered output is
computed from a weighted sum of a finite number of previous
inputs. An FIR filter can be designed to have completely linear
phase (i.e., constant time delay, regardless of frequency). FIR
filters designed for frequencies much lower than the sample
rate and/or with sharp transitions are computationally intensive
with large time delays. Popularly used for adaptive filters.
Floating Point
An encoding technique consisting of two parts:
1. A mantissa representing a fractional value with magnitude
less than one
2. An exponent providing the position of the decimal point.
Floating point arithmetic allows the representation of very
large or very small numbers with fewer bits.
Fourier Analysis
Mathematics
The approximation of a function through the application of a
Fourier Series to periodic data.
Fourier Series
Application of the Fourier theorem to a periodic function,
resulting in sine and cosine terms which are harmonics of the
periodic frequency. (After Baron Jean Baptiste Joseph
Fourier.)
Fourier Theorem
A mathematical theorem stating that any function may be
resolved into sine and cosine terms with known amplitudes and
phases.
User’s Manual
9-460
V 1.2, 2000-01
Glossary
Frequency
1. The property or condition of occurring at frequent intervals.
2. Mathematics. Physics. The number of times a specified
phenomenon occurs within a specified interval as
a) The number of repetitions of a complete sequence of
values of a periodic function per unit variation of an
independent variable.
b) The number of complete cycles of a periodic process
occurring per unit time.
c) The number of repetitions per unit time of a complete
waveform, as of an electric current.
G
Graphic Equalizer
A multi-band variable equalizer using slide controls as the
amplitude adjustable elements. Named for the positions of the
sliders “graphing” the resulting frequency response of the
equalizer. Only found on active designs. Center frequency and
bandwidth are fixed for each band.
H
Harmonic Series
1. Mathematics. A series whose terms are in harmonic
progression as 1 + 1/3 + 1/5 + 1/7 +...
2. Music. A series of tones consisting of a fundamental tone
and the overtones produced by it and whose frequencies are
consecutive integral multiples of the frequency of the
fundamental.
High-Pass Filter
A filter having a passband extending from some finite cutoff
frequency (not zero) up to infinite frequency. An infrasonic filter
is a high-pass filter.
I
IIR (Infinite
ImpulseResponse) Filter
User’s Manual
A commonly used type of digital filter. This recursive structure
accepts as inputs digitized samples of the audio signal and
then each output point is computed on the basis of a weighted
sum of past output (feedback) terms, as well as past input
values. An IIR filter is more efficient than its FIR counterpart,
but poses more challenging design issues. Its strength is in not
requiring as much DSP power as FIR, while its weakness is not
having linear group delay and possible instabilities.
9-461
V 1.2, 2000-01
Glossary
Interpolating
Response
Term adopted by Rane Corporation to describe the summing
response of adjacent bands of variable equalizers using
buffered summing stages. If two adjacent bands, when
summed together, produce a smooth response without a dip in
the center, they are said to interpolate between the fixed center
frequencies, or combine well.
Inverse Square
Law Sound
Pressure Level
Sound propagates in all directions to form a spherical field, thus
sound energy is inversely proportional to the square of the
distance, i.e., doubling the distance quarters the sound energy
(the inverse square law), so SPL is attenuated 6dB for each
doubling.
Interleaving
The process of rearranging data in time. Upon de-interleaving,
errors in consecutive bits or words are distributed to a wider
area to guard against consecutive errors in the storage media.
L
Linear PCM
A pulse code modulation system in which the signal is
converted directly to a PCM word without companding, or other
processing.
Low-Pass Filter
A filter having a passband extending from DC (zero Hz) to
some finite cutoff frequency (not infinite). A filter with a
characteristic that allows all frequencies below a specified
rolloff frequency to pass and attenuate all frequencies above.
Anti-aliasing and anti-imaging filters are low-pass filters.
M
Minimum-Phase
Filters
User’s Manual
Electrical circuits from an electrical engineering viewpoint, the
precise definition of a minimum-phase function is a detailed
mathematical concept involving positive real transfer functions,
i.e., transfer functions with all zeros restricted to the left half splane (complex frequency plane using the Laplace transform
operator s). This guarantees unconditional stability in the
circuit. For example, all equalizer designs based on 2nd-order
bandpass or band-reject networks have minimum-phase
characteristics.
9-462
V 1.2, 2000-01
Glossary
MIPS (Million
Instructions
Processed Per
Second)
A measure of computing power.
MLS (MaximumLength
Sequences)
A time-domain-based analyzer using a mathematically
designed test signal optimized for sound analysis. The test
signal (a maximum-length sequence) is electronically
generated and characterized by having a flat energy-vsfrequency curve over a wide frequency range. Sounding similar
to white noise, it is actually periodic, with a long repetition rate.
Similar in principle to impulse response testing - think of the
maximum-length sequence test signal as a series of randomly
distributed positive- and negative-going impulses.
N
Narrow-Band Filter
Term popularized by equalizer pioneer C.P. Boner to describe
his patented (tapped toroidal inductor) passive notch filters.
Boner’s filters were very high Q (around 200) and extremely
narrow (5 Hz at the -3 dB points). Boner used 100-150 of these
sections in series to reduce feedback modes. Today’s usage
extends this terminology to include all filters narrower than 1/3octave. This includes parametrics, notch filter sets, and certain
cut-only variable equalizer designs.
Noise Shaping
A technique used in oversampling low-bit converters and other
quantizers to shift (shape) the frequency range of quantizing
error (noise and distortion). The output of a quantizer is fed
back through a filter and summed with its input signal. Dither is
sometimes used in the process. Oversampling A/D converters
shift much of it out of the audio range completely. In this case,
the in-band noise is decreased, which allows low-bit converters
(such as delta-sigma) to equal or out-perform high-bit
converters (those greater than 16 bits). When oversampling is
not involved, the noise still appears to decrease by 12dB or
more because it is redistributed into less audible frequency
areas. The benefits of this kind of noise shaping are usually
reversed by further digital processing.
User’s Manual
9-463
V 1.2, 2000-01
Glossary
Nyquist Frequency
The highest frequency that may be accurately sampled. The
Nyquist frequency is one-half the sampling frequency. For
example, the theoretical Nyquist Frequency of a CD system is
22.05 kHz.
O
Octave
1. Audio. The interval between any two frequencies having a
ratio of 2 to 1.
2. Music
a) The interval of eight diatonic degrees between two tones,
one of which has twice as many vibrations per second as
the other.
b) A tone that is eight full tones above or below another given
tone.
c) An organ stop that produces tones an octave above those
usually produced by the keys played.
One-Third Octave
1. Term referring to frequencies spaced every one-third of an
octave apart. One-third of an octave represents a frequency
1.26-times above a reference, or 0.794-times below the
same reference. The math goes like this: 1/3-octave = 2E1/
3 = 1.260 and the reciprocal, 1/1.260 = 0.794. Therefore, for
example, a frequency 1/3-octave above a 1kHz reference
equals 1.26kHz (which is rounded-off to the ANSI-ISO
preferred frequency of "1.25 kHz" for equalizers and
analyzers), while a frequency 1/3-octave below 1 kHz equals
794 Hz (labeled "800 Hz"). Mathematically it is significant to
note that, to a very close degree, 2E1/3 equals 10E1/10
(1.2599 vs. 1.2589). This bit of natural niceness allows the
same frequency divisions to be used to divide and mark an
octave into one-thirds and a decade into one-tenths.
2. Term used to express the bandwidth of equalizers and other
filters that are 1/3-octave wide at their -3dB (half-power)
points.
3. Approximates the smallest region (bandwidth) humans
reliably detect change. Compare with third-octave.
Oversampling
A technique where each sample from the converter is sampled
more than once, i.e., oversampled. This multiplication of
samples permits digital filtering of the signal, thus reducing the
need for sharp analog filters to control aliasing.
User’s Manual
9-464
V 1.2, 2000-01
Glossary
P
Parametric
Equalizer
A multi-band variable equalizer offering control of all the
"parameters" of the internal bandpass filter sections. These
parameters being amplitude, center frequency and bandwidth.
This allows the user not only to control the amplitude of each
band, but also to shift the center frequency and to widen or
narrow the affected area. Available with rotary and slide
controls. Subcategories of parametric equalizers exist which
allow control of center frequency but not bandwidth. For rotary
control units the most used term is quasi-parametric. For units
with slide controls the popular term is paragraphic. The
frequency control may be continuously variable or switch
selectable in steps. Cut-only parametric equalizers (with
adjustable bandwidth or not) are called notch equalizers or
band-reject equalizers.
Passive Equalizer
A variable equalizer requiring no power to operate. Consisting
only of passive components (inductors, capacitors and
resistors) passive equalizers have no AC line cord. Favored for
their low noise performance (no active components to generate
noise), high dynamic range (no active power supplies to limit
voltage swing), extremely good reliability (passive components
rarely break), and lack of RFI interference (no semiconductors
to detect radio frequencies). Disliked for their cost (inductors
are expensive), size (and bulky), weight (and heavy), hum
susceptibility (and need careful shielding) and signal loss
characteristic (passive equalizers always reduce the signal).
Also inductors saturate easily with large low frequency signals,
causing distortion. Rarely seen today, but historically they were
used primarily for notching in permanent sound systems.
PCM (Pulse Code
Modulation)
A conversion method in which digital words in a bit stream
represent samples of analog information. The basis of most
digital audio systems.
Peaking Response
Term used to describe a bandpass shape when applied to
program equalization.
User’s Manual
9-465
V 1.2, 2000-01
Glossary
Period
Abbreviation T, t
1. The period of a periodic function is the smallest time interval
over which the function repeats itself. (For example, the
period of a sine wave is the amount of time T, it takes for the
waveform to pass through 360 degrees. Also, it is the
reciprocal of the frequency itself, i.e., T = 1/f.)
2. Mathematics.
a) The least interval in the range of the independent variable
of a periodic function of a real variable in which all
possible values of the dependent variable are assumed.
b) A group of digits separated by commas in a written
number.
c) The number of digits that repeat in a repeating decimal.
For example, 1/7 = 0.142857142857... has a six-digit
period.
Phaser also called
a "Phase Shifter,"
This is an electronic device creating an effect similar to
flanging, but not as pronounced. Based on phase shift
(frequency dependent), rather than true signal delay
(frequency independent), the phaser is much easier and
cheaper to construct. Using a relatively simple narrow notch
filter (all-pass filters also were used) and sweeping it up and
down through some frequency range, then summing this
output with the original input, creates the desired effect. Narrow
notch filters are characterized by having sudden and rather
extreme phase shifts just before and just after the deep notch.
This generates the needed phase shifts for the ever-changing
magnitude cancellations.
Phase Shift
The fraction of a complete cycle elapsed as measured from a
specified reference point and expressed as an angle out of
phase. In an un-synchronized or un-correlated way.
Phase Delay
A phase-shifted sine wave appears displaced in time from the
input waveform. This displacement is called phase delay.
Phasor
1. A complex number expressing the magnitude and phase of
a time-varying quantity. It is math shorthand for complex
numbers. Unless otherwise specified, it is used only within
the context of steady-state alternating linear systems.
(Example: 1.5 /27° is a phasor representing a vector with a
magnitude of 1.5 and a phase angle of 27 degrees.)
2. For some unknown reason, used a lot by Star Fleet
personnel.
User’s Manual
9-466
V 1.2, 2000-01
Glossary
Pink Noise
Pink noise is a random noise source characterized by a flat
amplitude response per octave band of frequency (or any
constant percentage bandwidth), i.e., it has equal energy, or
constant power, per octave. Pink noise is created by passing
white noise through a filter having a 3 dB/octave roll-off rate.
See white noise discussion for details. Due to this roll-off, pink
noise sounds less bright and richer in low frequencies than
white noise. Since pink noise has the same energy in each
1/3-octave band, it is the preferred sound source for many
acoustical measurements due to the critical band concept of
human hearing.
Polarity
A signal’s electromechanical potential with respect to a
reference potential. For example, if a loudspeaker cone moves
forward when a positive voltage is applied between its red and
black terminals, then it is said to have a positive polarity. A
microphone has positive polarity if a positive pressure on its
diaphragm results in a positive output voltage.
Pre-Emphasis
A high-frequency boost used during recording, followed by deemphasis during playback, designed to improve signal-tonoise performance.
Proportional-Q
Equalizer (also
Variable-Q)
Term applied to graphic and rotary equalizers describing
bandwidth behavior as a function of boost/cut levels. The term
"proportional-Q" is preferred as being more accurate and less
ambiguous than "variable-Q." If nothing else, "variable-Q"
suggests the unit allows the user to vary (set) the Q, when no
such controls exist. The bandwidth varies inversely
proportional to boost (or cut) amounts, being very wide for
small boost/cut levels and becoming very narrow for large
boost/cut levels. The skirts, however, remain constant for all
boost/cut levels.
Psychoacoustics
The scientific study of the perception of sound.
PWM (Pulse Width
Modulation)
A conversion method in which the widths of pulses in a pulse
train represent the analog information.
Q
Quantization Error
User’s Manual
Error resulting from quantizing an analog waveform to a
discrete level. In general the longer the word length, the less
the error.
9-467
V 1.2, 2000-01
Glossary
Quantization
The process of converting, or digitizing, the almost infinitely
variable amplitude of an analog waveform to one of a finite
series of discrete levels. Performed by the A/D converter.
R
Real-Time
Operation
What is perceived to be instantaneous to a user (or more
technically, processing which completes in a specific time
allotment).
Reconstruction
Filter
A low-pass filter used at the output of digital audio processors
(following the DAC) to remove (or at least greatly attenuate)
any aliasing products (image spectra present at multiples of the
sampling frequency) produced by the use of real-world (nonbrickwall) input filters.
Recursive
A data structure that is defined in terms of itself. For example,
in mathematics, an expression, such as a polynomial, each
term of which is determined by application of a formula to
preceding terms. Pertaining to a process that is defined or
generated in terms of itself, i.e., its immediate past history.
Rotary Equalizer
A multi-band variable equalizer using rotary controls as the
amplitude adjustable elements. Both active and passive
designs exist with rotary controls. Center frequency and
bandwidth are fixed for each band.
S
Sample Rate
Conversion
The process of converting one sample rate to another, e.g.
44.1kHz to 48kHz. Necessary for the communication and
synchronization of dissimilar digital audio devices, e.g., digital
tape machines to CD mastering machines.
Sample-and-Hold
(S/H)
A circuit which captures and holds an analog signal for a finite
period of time. The input S/H proceeds the A/D converter,
allowing time for conversion. The output S/H follows the D/A
converter, smoothing glitches.
Sampling
(Nyquist)Theorem
A theorem stating that a bandlimited continuous waveform may
be represented by a series of discrete samples if the sampling
frequency is at least twice the highest frequency contained in
the waveform.
User’s Manual
9-468
V 1.2, 2000-01
Glossary
Sampling
Frequency or
Sampling Rate
The frequency or rate at which an analog signal is sampled or
converted into digital data. Expressed in Hertz (cycles per
second). For example, compact disc sampling rate is 44,100
samples per second or 44.1kHz, however in pro audio other
rates exist, common examples being 32kHz, 48kHz and
50kHz.
Sampling
The process of representing the amplitude of a signal at a
particular point in time.
S/N ratio (Signalto-Noise ratio)
The ratio of signal level (or power) to noise level (or power),
normally expressed in decibels.
T
Third-Octave
Term referring to frequencies spaced every three octaves
apart. For example, the third-octave above 1kHz is 8kHz.
Commonly misused to mean one-third octave. While it can be
argued that "third" can also mean one of three equal parts and
as such might be used to correctly describe one part of an
octave spit into three equal parts, it is potentially too confusing.
The preferred term is one-third octave.
Transversal
Equalizer
A multi-band variable equalizer using a tapped audio delay line
as the frequency selective element, as opposed to bandpass
filters built from inductors (real or synthetic) and capacitors.
The term "transversal filter" does not mean "digital filter". It is
the entire family of filter functions done by means of a tapped
delay line. There exists a class of digital filters realized as
transversal filters, using a shift register rather than an analog
delay line, with the inputs being numbers rather than analog
functions.
W
Wavelength
Symbol (Greek
lower-case
Lambda)
User’s Manual
The distance between one peak or crest of a sine wave and the
next corresponding peak or crest. The wavelength of any
frequency may be found by dividing the speed of sound by the
frequency.
9-469
V 1.2, 2000-01
Glossary
White Noise
Analogous to white light containing equal amounts of all visible
frequencies, white noise contains equal amounts of all audible
frequencies (technically the bandwidth of noise is infinite, but
for audio purposes it is limited to just the audio frequencies).
From an energy standpoint white noise has constant power per
hertz (also referred to as unit bandwidth), i.e., at every
frequency there is the same amount of power (while pink noise,
for instance, has constant power per octave band of
frequency). A plot of white noise power vs. frequency is flat if
the measuring device uses the same width filter for all
measurements. This is known as a fixed bandwidth filter. For
instance, a fixed bandwidth of 5 Hz is common, i.e., the test
equipment measures the amplitude at each frequency using a
filter that is 5 Hz wide. It is 5 Hz wide when measuring 50 Hz or
2 kHz or 9.4 kHz, etc. A plot of white noise power vs. frequency
change is not flat if the measuring device uses a variable width
filter. This is known as a fixed percentage bandwidth filter. A
common example of which is 1/3-octave wide, which equals a
bandwidth of 23%. This means that for every frequency
measured the bandwidth of the measuring filter changes to
23% of that new center frequency. For example the measuring
bandwidth at 100 Hz is 23 Hz wide, then changes to 230 Hz
wide when measuring 1 kHz, and so on. Therefore the plot of
noise power vs. frequency is not flat, but shows a 3 dB rise in
amplitude per octave of frequency change. Due to this rising
frequency characteristic, white noise sounds very bright and
lacking in low frequencies.
Z
Z-Transform
User’s Manual
A mathematical method used to relate coefficients of a digital
filter to its frequency response, and to evaluate stability of the
filter. It is equivalent to the Laplace transform of sampled data
and is the building block of digital filters.
9-470
V 1.2, 2000-01
"Microcontrollers" Template
for Technical Documentation
A
Adaptive Digital Filters 197
CplxDlms_4_16 214
CplxDlmsBlk_4_16 222
Dlms_2_16x32 229
Dlms_4_16 201
DlmsBlk_2_16x32 235
DlmsBlk_4_16 208
Applications 401
Equalizer 406
Hardware Setup for Applications 408
Oscillators 404
Spectrum Analyzer 401
Argand Diagram 32
Argument Conventions 29
aR 30
CplxL 30
CplxS 30
cptrDataS 30
DataD 29
DataL 29
DataS 29
nH 29
B
Building DSPLIB 18
C
Canonical Form (Direct Form II) Second-order Section 174
Cascaded Biquad IIR Filter 175
Complex Arithmetic 32
Addition 32
Conjugate 33
Magnitude 33
Multiplication 32
Phase 33
Shift 33
Subtraction 32
Complex Arithmetic Functions 31
CplxAdd_16 36
CplxAdd_32 61
CplxAdds_16 38
User’s Manual
471
V 1.1, 2000-01
"Microcontrollers" Template
for Technical Documentation
CplxAdds_32 63
CplxConj_16 49
CplxConj_32 74
CplxMag_16 51
CplxMag_32 76
CplxMul_16 44
CplxMul_32 69
CplxMuls_16 46
CplxMuls_32 71
CplxPhase_16 54
CplxPhase_32 79
CplxShift_16 59
CplxShift_32 83
CplxSub_16 40
CplxSub_32 65
CplxSubs_16 42
CplxSubs_32 67
Complex Data Structure 35
ANSI C 35
GHS 35
Tasking 35
Complex Functions
CplxSub_16 40
CplxSubs_16 42
Complex Number Representation 31
Exponential form 31
Magnitude and angle form 31
Rectangular form 31
Trigonometric form 31
Complex Number Schematic 34
Complex Plane 31
D
Design of Test Cases for the FFT functions 256
Directory Structure 17, 430, 445, 446, 447
Discrete Cosine Transform
DCT_2_8 319
IDCT_2_8 324
Discrete Cosine Transform (DCT) 309
DSP Library Notations 23
F
Fast Fourier Transforms 241
User’s Manual
472
V 1.1, 2000-01
"Microcontrollers" Template
for Technical Documentation
FFT_2_16 261
FFT_2_16X32 293
FFT_2_32 277
FFTReal_2_16 269
FFTReal_2_16x32 301
FFTReal_2_32 285
IFFT_2_16 265
IFFT_2_16X32 297
IFFT_2_32 281
IFFTReal_2_16 273
IFFTReal_2_16X32 305
IFFTReal_2_32 289
Features 15
FIR Filters 106
Multirate Filters
FirDec_16 156
FirInter_16 165
Normal FIR 106
Fir_16 108
Fir_4_16 121
FirBlk_16 115
FirBlk_4_16 126
Symmetric FIR
FirSym_16 132
FirSym_4_16 142
FirSymBlk_16 137
FirSymBlk_4_16 148
Function Descriptions 29
Functional Implementation 250
Future of TriLib 16
I
IIR Filters 173
IirBiq_4_16 176
IirBiq_5_16 187
IirBiqBlk_4_16 182
IirBiqBlk_5_16 192
Implementation of FFT to Process the Real Sequences of Data 254
Installation and Build 17
Installing DSPLIB 18
Introduction 15
Inverse Discrete Cosine Transform (IDCT) 314
User’s Manual
473
V 1.1, 2000-01
"Microcontrollers" Template
for Technical Documentation
M
Mathematical Functions 329
AntiLn_16 348
Arctan_32 336
Cos_32 333
Expn_16 351
Ln_32 344
Rand_16 361
RandInit_16 360
Sine_32 330
Sqrt_32 340
XpowY_32 353
Matrix Operations 363
MatAdd_16 364
MatMult_16 371
MatSub_16 367
MatTrans_16 376
Memory Issues 24
Multidimensional DCT 315
O
Optimization Approach 24
Options in Library Configurations 26
R
Register Naming Conventions 30
a 30
ca 30
S
Source Files List 19
Statistical Functions 379
ACorr_16 381
Avg_16 397
Conv_16 389
Support Information 16
T
TriCore Implementation Note 248
TriLib Content 17
TriLib Data Types 23
TriLib Implementation - A Technical Note 24
User’s Manual
474
V 1.1, 2000-01
"Microcontrollers" Template
for Technical Documentation
V
Vector Arithmetic Functions 85
VecAdd 86
VecDotPro 92
VecMaxIdx 94
VecMaxVal 100
VecMinIdx 97
VecMinVal 103
VecSub 89
User’s Manual
475
V 1.1, 2000-01
"Microcontrollers" Template
for Technical Documentation
User’s Manual
476
V 1.1, 2000-01
((477))
Infineon goes for Business Excellence
“Business excellence means intelligent approaches and clearly
defined processes, which are both constantly under review and
ultimately lead to good operating results.
Better operating results and business excellence mean less
idleness and wastefulness for all of us, more professional
success, more accurate information, a better overview and,
thereby, less frustration and more satisfaction.”
Dr. Ulrich Schumacher
http://www.infineon.com
Published by Infineon Technologies AG