Software Optimizatin of DFTs and IDFTs Using the StarCore SC3850 DSP Core

Freescale Semiconductor
Application Note
Document Number: AN3680
Rev. 0, 11/2010
Software Optimization of DFTs and
IDFTs Using the StarCore SC3850
DSP Core
The Discrete Fourier transform (DFT) is a transform used for
Fourier analysis of a finite sampled sequence of a signal. The
DFT is widely employed in signal processing and related
fields to analyze the frequencies contained in a signal. The
Fast Fourier Transform (FFT) is a numerically efficient
algorithm used to compute the DFT. For a complex N-point
Fourier transform, the FFT reduces the number of complex
multiplications from the order of N2 to the order of NlogN.
Mixed radix-2/3/4/5 FFTs can be used to implement the DFT
algorithm with reduced computation if the number of DFT
point N = 2k 3l 4m 5n (k, l, m, and n are positive integers).
For example, an FFT of size 1200 can be calculated in five
stages using radices of 5, 4, and 3 (1,200 = 5 × 5 × 4 × 4 ×
3). Therefore, an efficient implementation for the DFT is
possible using the mixed radix FFT calculation.
This application note describes the implementation of the
DFT/IDFT using mixed radix-2/3/4/5 decimation in time
(DIT) FFTs/IFFTs on the Freescale Semiconductor SC3850
Digital Signal Processor (DSP) core. This document
discusses how to use new features available in the SC3850
core, such as dual-multiplier, to improve the performance of
the FFTs. The data structure, code optimization, test vector
generation, and performance results are also investigated in
this document. Typical reference code is included in this
document to demonstrate the implementation details.
© 2009–2010 Freescale Semiconductor, Inc..
1.
2.
3.
4.
5.
6.
7.
A.
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Mixed Radix FFT Algorithms . . . . . . . . . . . . . . . . . . 4
Implementation of DFT on StarCore SC3850 DSP . 20
Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 33
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
DFT Reference Code . . . . . . . . . . . . . . . . . . . . . . . . 36
Introduction
1
Introduction
The discrete Fourier transform (DFT) plays an important role in the analysis, design, and implementation
of discrete-time signal processing algorithms and systems because efficient algorithms exist for the
computation of the DFT. These efficient algorithms are called fast Fourier transform (FFT) algorithms. In
terms of multiplications and additions, the FFT algorithms can be orders of magnitude more efficient than
competing algorithms.
It is well known that the DFT takes N2 complex multiplications and N2 complex additions for complex
N-point. Direct computation of the DFT is inefficient. The basic idea of the FFT algorithm is to break up
an N-point DFT transform into successive smaller and smaller transform known as a butterfly (basic
computational element). The small transform used can be a 2-point DFT known as Radix-2, 4-point DFT
known as Radix-4, or other points. A two-point butterfly requires 1 complex multiplication and 2 complex
additions, and a 4-point butterfly requires 3 complex multiplications and 8 complex additions. Therefore,
a Radix-2 FFT reduces the complexity of a N-point DFT down to (N/2)log2N complex multiplications and
Nlog2N complex additions because there are log2N stages and at each stage there are N/2 2-point
butterflies. For a Radix-4 FFT, there are log4N stages and, at each stage, there are N/4 butterflies. Thus,
the total number of complex multiplication is (3N/4)log4N = (3N/8)log2N, and the number of required
complex additions is 8(N/4)log4N = Nlog2N.
Because the FFTs reduce the number of complex multiplications from the order of N2 to the order of
NlogN, they are often used to compute the DFT. For instance, the size of the DFT/IDFT in 3GPP-LTE can
be calculated as a product of the factors 2, 3, 4, and 5, which is equal to 2k 3l 4m 5n (k, l, m, and n are
positive integers). The specific sizes are [12, 24, 36, 48, 60, 72, 96, 108, 120, 144, 180, 192, 216, 240, 288,
300, 324, 360, 384, 432, 480, 540, 576, 600, 648, 720, 768, 864, 900, 960, 972, 1080, 1152, and 1200].
Therefore, an efficient implementation of DFT/IDFT is possible using the mixed radix FFT/IFFT
calculation.
The FFTs are of importance to a wide variety of applications, such as telecommunications. For instance,
M-point DFT and IDFT are used to convert the signal between frequency domain and time domain in Third
Generation Partnership Project-Long Term Evolution (3GPP-LTE).
To begin, consider an example that demonstrates how FFTs are used in real applications. In the 3GPP-LTE
(Long Term Evolution), M-point DFT and Inverse DFT (IDFT) are used to convert the signal between
frequency domain and time domain. 3GPP-LTE aims to provide for an uplink speed of up to 50 Mbps and
a downlink speed of up to 100 Mbps. For this purpose, the 3GPP-LTE physical layer uses Orthogonal
Frequency Division Multiple Access (OFDMA) on the downlink and Single Carrier-Frequency Division
Multiple Access (SC-FDMA) on the uplink. Figure 1 shows the transmitter and receiver structure of
OFDMA and SC-FDMA systems.
We can see from this example that DFT and IDFT are the key elements used to represent changing signals
in time and frequency domains. In real applications, FFTs are normally used instead of direct DFT
calculation to achieve higher performance.
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
2
Freescale Semiconductor
Organization
SC-FDMA Transmitter
User
N-point
DFT
{Xn}
Subcarrier
Mapping
M-point
IDFT
Cyclic Prefix &
Pulse Shapi ng
DAC & RF
Subcarrier DeMapping&
Equalization
M-point
DFT
Cyclic Prefix
Removal
RF & ADC
Cyclic Prefix &
Puls e Shaping
DAC & RF
Channel
SC-FDMA Receiver
Base Station
N-point
IDFT
Uplink SC-FDMA
OFDMA Transmitter
Base Station
Subcarrier
Mapping
M-point
IDFT
Channel
OFDMA Receiver
User
{Xn}
Subcarrier DeMapping&
Eq ualization
M-point
DFT
Cyclic Prefix
Removal
RF & ADC
Downlink: OFDMA
Figure 1. Transmitter and Receiver Structure of SC-FDMA and OFDMA Systems
2
Organization
This application note describes how to implement DFT/IDFT using the mixed radix FFT/IFFT algorithms
on the StarCore DSP SC3850.The rest of the document is organized as follows:
• Section 3, “Mixed Radix FFT Algorithms”—Reviews the basic information on DFTs s and
provides a brief introduction to the mixed radix FFT algorithms.
• Section 4, “Implementation of DFT on StarCore SC3850 DSP”—Discusses the algorithm
implementation on SC3850, fixed point issues, and how to optimize the implementation for
SC3850.
• Section 5, “Experimental Results”—Present results and discusses performance.
• Section 6, “Conclusions”—Presents the conclusions.
• Section 7, “References”—Provides a list of useful references.
• Appendix A, “DFT Reference Code”—Provides example code.
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
3
Mixed Radix FFT Algorithms
3
Mixed Radix FFT Algorithms
The following sections provide background information, definitions, and examples for using FFT and
inverse FFT algorithms.
3.1
Definition of DFT and IDFT
The DFT X(k), k=0,1,2,...,N-1 of a sequence x(n), n=0,1,2,...,N-1 is defined as
N–1
X(k) =
-⎞
∑ x ( n ) exp ⎛⎝ –j -----------N ⎠
2πnk
n=0
N–1
=
∑ x ( n )W
Eqn. 1
nk
N
n=0
2πnk
W Nnk = exp ⎛ – j -------------⎞
⎝
N ⎠
Eqn. 2
2πnk
2πnk
= cos ⎛ -------------⎞ – j sin ⎛ -------------⎞
⎝ N ⎠
⎝ N ⎠
In Equation 1 and Equation 2, N is the number of data, j = – 1 , and W Nnk is the twiddle factor. Equation 1
is called the N-point DFT of the sequence of x(n). For each value of k, the value of X(k) represents the
Fourier transform at the frequency 2πk
---------- . The IDFT is defined as follows:
N
N–1
1
x ( n ) = ---N
-⎞
∑ X ( k ) exp ⎛⎝ j -----------N ⎠
2πnk
k=0
Eqn. 3
N–1
1
= ---N
∑ X ( k )W
– nk
N
k=0
2πnk
W N– nk = exp ⎛ j -------------⎞
⎝ N ⎠
Eqn. 4
2πnk
2πnk
= cos ⎛ -------------⎞ + j sin ⎛ -------------⎞
⎝ N ⎠
⎝ N ⎠
Equation 3 is essentially the same as Equation 1. The differences are that the exponent of the twiddle factor
in Equation 3 is the negative of the one in Equation 1 and the scaling factor is 1/N. The IDFT can be simply
computed using the same algorithms for DFT but with conjugated twiddle factors. Alternatively, we can
use the same twiddles factors for DFT with conjugated input and output to compute IDFT. Equation 1 is
also called the analysis equation and Equation 3 the synthesis equation.
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
4
Freescale Semiconductor
Mixed Radix FFT Algorithms
From Equation 1, it is clear that computing X(k) for each k requires N complex multiplications and N-1
complex additions. So, for N values of k, that is, for the entire DFT, it requires N2 complex multiplications
and N ( N – 1 ) ≈ N 2 complex additions. Thus, the DFT is very computationally intensive. Note that a
multiplication of two complex numbers ( ( a + jb )c + jd ) = ( ac – bd ) + j ( bc + ad ) requires four real
multiplications and two real additions. A complex addition ( a + jb ) + ( c + jd ) = ( a + c ) + j ( b + d ) requires
two real additions.
The Fast Fourier transforms (FFT) refers to algorithms that compute the DFT in a numerically efficient
manner. Different radices 2, 3, 4, and 5 can be combined to calculate DFT if the number of DFT point
N = 2k 3l 4m 5n (k, l, m, and n are positive integers). Therefore, in the following sections, we describe how
to combine different radix FFTs and then review the decimation-in-time (DIT) algorithms for radix-2/3/4/5
FFTs. Please note that the DIT algorithm is more efficient to run on an SC3850 core because the core
performs dual MACs.
3.2
Combination of Different Radix FFTs
The number of input data (the number of DFT points), N, is equal to the product of the FFT radixes, which
is defined as follows:
N = R( 0) × R( 1) × R( 2) × … × R(S – 1 )
S–1
=
∏ R(s)
Eqn. 5
s=0
In Equation 5, the parameters are defined as follows:
• S: Specifies the number of FFT stages
• s: Specifies stage index starting at the FFT input
• R(s): Specifies radix of the FFT stage
Neither he FFT stage radixes nor the R(s) parameters are constrained to be prime numbers. Mixed radix
FFTs can be arbitrarily assigned in the sequence order. For example, the DFT of 300 points can be factored
as 5 × 5 × 3 × 4 and can be calculated as a combination of either of the following FFT radixes:
• (radix-5) × (radix-5) × (radix-3) × (radix-4)
• (radix-4) × (radix-3) × (radix-5) × (radix-5)
Each stage of the FFT represents N/R(s) executions of an R(s) point butterfly calculation. In this
application note, radices 2, 3, 4, and 5 DIT FFTs are investigated because they are relatively easy to
develop and provide a fairly large set of possible number of points for DFT calculation.
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
5
Mixed Radix FFT Algorithms
3.3
Radix-5 DIT FFT
The example uses the symmetry and periodicity properties in the derivation, which are defined as shown
in Equation 6.
symmetry property:
periodicity property:
N
k + ---2
WN
= – W nk
Eqn. 6
W Nk + N = W Nk
The Radix-5 DIT FFT algorithm breaks the DFT calculation into several 5-point DFTs. The Radix-5 DIT
FFT can be derived as follows:
N–1
k) =
∑ x ( n )WNnk
n=0
N
---- – 1
5
=
N
---- – 1
5
n=0
N
---- – 1
5
=
N
---- – 1
5
N
---- – 1
5
∑ x ( 5n )WN5nk + ∑ x ( 5n + 1 )WN( 5n + 1)k + ∑ x ( 5n + 2 )WN( 5n + 2)k + ∑ x ( 5n + 3 )WN( 5n + 3 )k + ∑ x ( 5n + 4 )WN( 5n + 4 )
n=0
=
N
---- – 1
5
n=0
N
---- – 1
5
n=0
N
---- – 1
5
n=0
N
---- – 1
5
N
---- – 1
5
∑ x ( 5n )WN5nk + WNk ∑ x ( 5n + 1 )WN5nk + WN2k ∑ x ( 5n + 2 )WN5nk + WN3k ∑ x ( 5n + 3 )WN5nk + WN4k ∑ x ( 5n + 4 )WN5nEqn. 7
n=0
n=0
N
---- – 1
5
N
---- – 1
5
n=0
N
---- – 1
5
n=0
N
---- – 1
5
n=0
N
---- – 1
5
∑ x ( 5n )WNnk⁄ 5 + WNk ∑ x ( 5n + 1 )WNnk⁄ 5 + WN2k ∑ x ( 5n + 2 )WNnk⁄ 5 + WN3k ∑ x ( 5n + 3 )WNnk⁄ 5 + WN4k ∑ x ( 5n + 4 )WNn
n=0
n=0
n=0
n=0
n=0
k Q ( k ) + W 2k R ( k ) + W 3k S ( k ) + W 4k T ( k )
= P ( k ) + WN
N
N
N
k = 0 , 1 , 2 , 3 , …, N – 1
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
6
Freescale Semiconductor
Mixed Radix FFT Algorithms
Each of the sums, P(k), Q(k), R(k), S(k), and T(k), in Equation 7 is recognized as an N/5-point DFT.
Although the index k ranges over N values, k=0,1,2,...,N-1, each of the sums can be computed only for
k=0,1,2,...,N/5-1, since they are periodic with period N/5. The transform X(k) can be broken into five parts
as follows:
X ( k ) = P ( k ) + W Nk Q ( k ) + W N2k R ( k ) + W N3k S ( k ) + W N4k T ( k )
⎛
N⎞
⎛
N⎞
⎛
N⎞
⎛
N⎞
k + ---2 k + ---3 k + ---4 k + ---N
⎝
⎠
⎝
⎠
⎝
⎠
⎝
⎠
X ⎛⎝ k + ----⎞⎠ = P ( k ) + W N 5 Q ( k ) + W N 5 R ( k ) + W N 5 S ( k ) + W N 5 T ( k )
5
= P ( k ) + W 1 / 5 W Nk Q ( k ) + W 2 / 5 W N2k R ( k ) + W 3 / 5 W N3k S ( k ) + W 4 / 5 W N4k T ( k )
⎛
2N⎞
⎛
2N⎞
⎛
2N⎞
⎛
2N⎞
k + ------2 k + ------3 k + ------4 k + ------2N
⎝
⎠
⎝
⎝
⎝
5⎠
5⎠
5⎠
X ⎛ k + -------⎞ = P ( k ) + W N 5 Q ( k ) + W N
R ( k ) + WN
S ( k ) + WN
T( k)
⎝
5⎠
= P ( k ) + W 2 / 5 W Nk Q ( k ) + W 4 / 5 W N2k R ( k ) + W 1 / 5 W N3k S ( k ) + W 3 / 5 W N4k T ( k )
⎛
3N⎞
⎛
3N⎞
⎛
3N⎞
⎛
3N⎞
2 ⎝ k + -------⎠
3 ⎝ k + -------⎠
4 ⎝ k + -------⎠
3N
⎝ k + -------⎠
5
5
5
X ⎛ k + -------⎞ = P ( k ) + W N 5 Q ( k ) + W N
R ( k ) + WN
S ( k ) + WN
T( k)
⎝
⎠
5
Eqn. 8
= P ( k ) + W 3 / 5 W Nk Q ( k ) + W 1 / 5 W N2k R ( k ) + W 4 / 5 W N3k S ( k ) + W 2 / 5 W N4k T ( k )
⎛ k + 4N
-------⎞⎠
5
4N
⎝
X ⎛ k + -------⎞ = P ( k ) + W N
⎝
5⎠
4N
2 ⎛⎝ k + -------⎞⎠
5
Q ( k ) + WN
4N
3 ⎛⎝ k + -------⎞⎠
5
R ( k ) + WN
4N
4 ⎛⎝ k + -------⎞⎠
5
S ( k ) + WN
T( k)
= P ( k ) + W 4 / 5 W Nk Q ( k ) + W 3 / 5 W N2k R ( k ) + W 2 / 5 W N3k S ( k ) + W 1 / 5 W N4k T ( k )
N
k = 0, 1, 2, …, ---- – 1
5
In the equation above, the periodicity propertyW Nk + N = W Nkis used to simplify W5k . For example,W 512 = W52.
The four complex numbers W51, W52, W53, and W54 can be expressed as shown in Figure 2.
Im(imaginary)
W53
W54
b
d
-c
a
Re(real)
0
-d
W52
-b
W51
a = 0.3090169944
b = 0.9510565163
c = 0.8090169944
d = 0.5877852523
Figure 2. Complex numbers of W51, W52, W53, and W54
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
7
Mixed Radix FFT Algorithms
2π
2π
2π
W 51 = exp ⎛ – j ------⎞ = cos ⎛ ------⎞ – j sin ⎛ ------⎞ = a – jb
⎝ 5⎠
⎝ 5⎠
⎝ 5⎠
2π × 2
4π
4π
W 52 = exp ⎛ – j ---------------⎞ = cos ⎛ ------⎞ – j sin ⎛ ------⎞ = – c – jd
⎝
⎝ 5⎠
⎝ 5⎠
5 ⎠
W 53
Eqn. 9
6π
6π
2π × 3
= exp ⎛ – j ---------------⎞ = cos ⎛ ------⎞ – j sin ⎛ ------⎞ = – c + jd
⎝ 5⎠
⎝ 5⎠
⎝
5 ⎠
8π
8π
2π × 4
W 54 = exp ⎛ – j ---------------⎞ = cos ⎛ ------⎞ – j sin ⎛ ------⎞ = a + jb
⎝ 5⎠
⎝ 5⎠
⎝
5 ⎠
The constants in Equation 9 are as follows:
a = 0.3090169944,
b = 0.9510565163,
c = 0.8090169944,
d = 0.5877852523.
Figure 3 shows the corresponding Radix-5 DIT butterfly diagram.
P(k)
A
Q(k)
Wb
B
R(k)
Wc
C
S(k)
Wd
D
T(k)
E
We
A’
(1,1,1,1,1)
X(k)
B’
(1,a-jb,-c-jd,
-c+jd,a+jb)
X(k+N/5)
C’
(1,-c-jd,a+jb,
a-jb,-c+jd)
X(k+2N/5)
D’
(1,-c+jd,a-jb,
a+jb,-c-jd)
E’
(1,a+jb,-c+jd,
-c-jd,a-jb)
2π ⎞
⎛
Wb = exp ⎜ − j
k⎟
N ⎠
⎝
2π
⎛
⎞
Wc = exp⎜ − j
2k ⎟
N
⎝
⎠
⎛
2π ⎞
Wd = exp⎜ − j
3k ⎟
N
⎝
⎠
(k = 0,1, ...,N/5-1)
Aout
Bout
Cout
X(k+3N/5)
Dout
X(k+3N/5)
Eout
a = 0x278e
b = 0x79bc
c = 0x678e
d = 0x4b3d
2π
⎛
⎞
We = exp ⎜ − j
4k ⎟
N
⎝
⎠
Figure 3. Radix-5 DIT Butterfly
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
8
Freescale Semiconductor
Mixed Radix FFT Algorithms
The twiddle factors in Figure 3 are defined as follows:
W Nk = Wb = Wbr + j × Wbi
W N2k = Wc = Wcr + j × Wci
W N3k = Wd = Wdr + j × Wdi
Eqn. 10
W N4k = We = Wer + j × Wei
The real and imaginary part of the outputs of Radix-5 DIT butterfly is specified as follows:
Aout_r = Ar'+Br'+Cr'+Dr'+Er'
Aout_i = Ai'+Bi'+Ci'+Di'+Ei'
Bout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))+(b(Bi'-Ei')+d(Ci'-Di'))
Bout_i = Ai'+(a(Bi'+Ei')-c(Ci'+Di'))-(b(Br'-Er')+d(Cr'-Dr'))
Cout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+(d(Bi'-Ei')-b(Ci'-Di'))
Cout_i = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))-(d(Br'-Er')-b(Cr'-Dr'))
Dout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))-(d(Bi'-Ei')-b(Ci'-Di'))
Dout_i = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))+(d(Br'-Er')-b(Cr'-Dr'))
Eout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))-(b(Bi'-Ei')+d(Ci'-Di'))
Eout_i = Ai'+(a(Bi'+Ei')-c(Ci'+Di'))+(b(Br'-Er')+d(Cr'-Dr'))
where Ar’, Ai’, ...., Er’, and Ei’ are calculated as follows:
Ar' = Ar
Ai' = Ai
Br' = (Br*Wbr - Bi*Wbi)
Bi' = (Br*Wbi + Bi*Wbr)
Cr' = (Cr*Wcr - Ci*Wci)
Ci' = (Cr*Wci + Ci*Wcr)
Dr' = (Dr*Wdr - Di*Wdi)
Di' = (Dr*Wdi + Di*Wdr)
Er' = (Er*Wer - Ei*Wei)
Ei' = (Er*Wei + Ei*Wer)
NOTE
The principal difference between DIT and DIF (decimation in frequency)
butterflies is that the order of calculation changes. In the DIF algorithm, the
time domain data is “twiddled” before the sub-transforms are performed. In
DIT, however, the sub-transforms are performed first, and the output is
obtained by “twiddling” the resulting frequency domain data.
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
9
Mixed Radix FFT Algorithms
3.4
Radix-4 DIT FFT
The Radix-4 DIT FFT algorithm breaks a N-point DFT calculation into a number of 4-point DFTs (4-point
butterflies). Compared with direct computation of N-point DFT, 4-point butterfly calculation requires
much less operations. The Radix-4 DIT FFT can be derived as follows:
N–1
X(k) =
∑ x( n )W
nk
N
n=0
N
---- – 1
4
=
N
---- – 1
4
∑ x( 4n )W
4nk
N
n=0
N
---- – 1
4
=
=
+
N
---- – 1
4
∑ x( 4n + 1 )W
( 4n + 1 )k
N
+
n=0
N
---- – 1
4
∑ x( 4n + 2 )W
4nk
N
n=0
N
---- – 1
4
N
---- – 1
4
nk
N⁄4
+ W Nk
n=0
∑ x ( 4n + 3 )W
4nk
N
+ W N2k
N
---- – 1
4
∑ x ( 4n + 2 )W
4nk
N
+ W N3k
n=0
∑ x ( 4n + 3 )W
nk
N⁄4
n=0
N
---- – 1
4
∑ x( 4n + 2)W
+ W N2k
Eqn. 11
4nk
N
n=0
N
---- – 1
4
∑ x ( 4n + 1 )W
( 4n + 3 )k
N
n=0
N
---- – 1
4
∑ x ( 4n + 1 )W
+ W Nk
n=0
∑ x( 4n )W
+
n=0
N
---- – 1
4
∑ x( 4n )W
( 4n + 2 )k
N
nk
N⁄4
∑ x ( 4n + 3 )W
+ W N3k
n=0
nk
N⁄4
n=0
= P ( k ) + W Nk Q ( k ) + W N2k R ( k ) + W N3k S ( k )
k = 0, 1, 2, 3, …, N – 1
Each of the sums, P(k), Q(k), R(k), and S(k), in Equation 11 is recognized as an N/4-point DFT. Although
the index k ranges over N values, k=0,1,2,...,N-1, each of the sums can be computed only for
k=0,1,2,...,N/4-1, since they are periodic with period N/4. The transform X(k) can be broken into four parts
as follows:
X ( k ) = P ( k ) + W Nk Q ( k ) + W N2k R ( k ) + W N3k S ( k )
⎛k + N
----⎞
4⎠
N
⎝
X ⎛ k + ----⎞ = P ( k ) + W N
⎝
4⎠
N
2 ⎛ k + ----⎞
⎝
4⎠
Q ( k ) + WN
N
3 ⎛ k + ----⎞
⎝
4⎠
R ( k ) + WN
S( k )
= P ( k ) – jW Nk Q ( k ) – W N2k R ( k ) + jW N3k S ( k )
⎛ k + 2N
-------⎞⎠
4
2N
⎝
X ⎛ k + -------⎞ = P ( k ) + W N
⎝
4⎠
2N
2 ⎛⎝ k + -------⎞⎠
4
Q ( k ) + WN
2N
3 ⎛⎝ k + -------⎞⎠
4
R ( k ) + WN
S( k)
Eqn. 12
= P ( k ) – W Nk Q ( k ) + W N2k R ( k ) – W N3k S ( k )
⎛ k + 3N
-------⎞
4⎠
3N
⎝
X ⎛ k + -------⎞ = P ( k ) + W N
⎝
4⎠
3N
2 ⎛ k + -------⎞
⎝
4⎠
Q ( k ) + WN
3N
3 ⎛ k + -------⎞
⎝
4⎠
R ( k ) + WN
S( k)
= P ( k ) + jW Nk Q ( k ) – W N2k R ( k ) – jW N3k S ( k )
N
k = 0, 1, 2, …, ---- – 1
4
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
10
Freescale Semiconductor
Mixed Radix FFT Algorithms
Figure 4 shows the corresponding Radix-4 DIT butterfly diagram, and Figure 5 shows its simplified
version.
P(k)
A’
A
Q(k)
Wc
C
Wb
R(k)
B
S(k)
Wd
D
B’
(1,1,1,1)
X(k)
(1,-j,-1,j)
X(k+N/4)
Aout
Bout
C’
(1,-1,1,-1)
D’
(1,j,-1,-j)
X(k+2N/4)
Cout
X(k+3N/4)
Dout
2π ⎞
⎛
Wb = exp ⎜ − j
k⎟
N ⎠
⎝
2π
⎛
⎞
Wc = exp ⎜ − j
2k ⎟
N
⎝
⎠
2π ⎞
⎛
Wd = exp ⎜ − j
3k ⎟
N
⎝
⎠
(k = 0,1, ...,N/4-1)
Figure 4. Radix-4 DIT Butterfly
A
0
Aout
B
n
Bout
C
2n
Cout
3n
Dout
D
Figure 5. Simplified Radix-4 DIT Butterfly
The real and imaginary part of output data for Radix-4 DIT butterfly is specified in Equation 13.
Aoutr′ = Ar + ( Cr × Wcr – Ci × Wci ) + ( Br × Wbr – Bi × Wbi ) + ( Dr × Wdr – Di × Wdi )
Aouti′ = Ai + ( Cr × Wci + Ci × Wcr ) + ( Br × Wbi + Bi × Wbr ) + ( Dr × Wdi + Di × Wdr
Boutr′ = Ar – ( Cr × Wcr – Ci × Wci ) + ( Br × Wbi + Bi × Wbr ) – ( Dr × Wdi + Di × Wdr )
Bouti′ = Ai – ( Cr × Wci + Ci × Wcr ) – ( Br × Wbr – Bi × Wbi ) + ( Dr × Wdr – Di × Wdi )
Coutr′ = Ar + ( Cr × Wcr – Ci × Wci ) – ( Br × Wbr – Bi × Wbi ) – ( Dr × Wdr – Di × Wdi )
Eqn. 13
Couti′ = Ai + ( Cr × Wci + Ci × Wcr ) – ( Br × Wbi + Bi × Wbr ) – ( Dr × Wdi + Di × Wdr )
Doutr′ = Ar – ( Cr × Wcr – Ci × Wci ) – ( Br × Wbi + Bi × Wbr ) + ( Dr × Wdi + Di × Wdr )
Douti′ = Ai – ( Cr × Wci + Ci × Wcr ) + ( Br × Wbr – Bi × Wbi ) – ( Dr × Wdr – Di × Wdi )
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
11
Mixed Radix FFT Algorithms
3.5
Radix-3 DIT FFT
The Radix-3 DIF FFT can be derived as shown in Equation 14.
X(k) =
∑ x ( n )W
nk
N
n=0
N
---- – 1
3
=
N
---- – 1
3
∑ x ( 3n )W
3nk
N
n=0
+
∑ x ( 3n + 1 )W
=
( 3n + 1 )k
N
+
n=0
N
---- – 1
3
=
N
---- – 1
3
∑ x( 3n + 2)W
n=0
N
---- – 1
3
∑ x ( 3n )W
3nk
N
n=0
N
---- – 1
3
N
---- – 1
3
nk
N⁄3
+ W Nk
n=0
N
---- – 1
3
∑ x ( 3n + 1 )W
+ W Nk
n=0
∑ x ( 3n )W
( 3n + 2 )k
N
3nk
N
+ W N2k
∑ x( 3n + 2 )W
Eqn. 14
3nk
N
n=0
N
---- – 1
3
∑ x( 3n + 1 )W
nk
N⁄3
+ W N2k
n=0
∑ x ( 3n + 2 )W
nk
N⁄3
n=0
= P ( k ) + W Nk Q ( k ) + W N2k R ( k )
k = 0 1 2 3
N–1
Each of the sums, P(k), Q(k), and R(k), in Equation 14 is recognized as an N/3-point DFT. The transform
X(k) can be broken into three parts as shown in Equation 15.
X ( k ) = P ( k ) + W Nk Q ( k ) + W N2k R ( k )
⎛
N⎞
⎛
N⎞
2 ⎝ k + ----⎠
N
⎝ k + ----⎠
X ⎛ k + ----⎞ = P ( k ) + W N 3 Q ( k ) + W N 3 R ( k )
⎝
⎠
3
= P ( k ) + W 1 / 3 W Nk Q ( k ) + W 2 / 3 W N2k R ( k
⎛
2N⎞
⎛
Eqn. 15
2N⎞
2 ⎝ k + -------⎠
2N
⎝ k + -------⎠
3
X ⎛ k + -------⎞ = P ( k ) + W N 3 Q ( k ) + W N
R(k
⎝
⎠
3
= P ( k ) + W 2 / 3 W Nk Q ( k ) + W 1 / 3 W N2k R ( k
N
k = 0, 1, 2, …, ---- – 1
3
In Equation 15, the periodicity propertyWNk + N = W Nkis used to simplifyW 34 = W 31. The complex numbers W31
and W32 can be expressed as shown in Equation 16.
1
3
2π
W 31 = exp ⎛ – j ------⎞ = – --- – ------- j
⎝ 3⎠
2 2
W 32
Eqn. 16
1
3
2π × 2
= exp ⎛ – j ---------------⎞ = – --- + ------- j
⎝
2 2
3 ⎠
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
12
Freescale Semiconductor
Mixed Radix FFT Algorithms
Figure 6 shows the corresponding Radix-3 DIT butterfly diagram.
P(k)
A’
A
Q(k)
Wb
B
R(k)
Wc
Aout
X(k+N/3)
B’
(1,-1/2-j
-1/2+j
C
X(k)
(1,1,1)
3 /2,
Bout
3 /2)
X(k+2N/3)
C’
(1, -1/2+j
-1/2-j
2π ⎞
⎛
Wb = exp ⎜ − j
k⎟
N ⎠
⎝
2π
⎛
⎞
Wc = exp⎜ − j
2k ⎟
N
⎝
⎠
3 /2,
Cout
3 /2)
(k = 0,1, ...,N/3-1)
Figure 6. Radix-3 DIT Butterfly
The twiddle factors in Figure 6 are defined as shown in Equation 17.
W Nk = Wb = Wbr + j × Wbi
W N2k = Wc = Wcr + j × Wci
Eqn. 17
The real and imaginary part of the outputs of Radix-3 DIT butterfly is specified as follows:
Aout_r = Ar'+Br'+Cr'
Aout_i = Ai'+Bi'+Ci'
Bout_r = Ar'-(Br’+Cr’)/2+(Bi’-Ci’)* 3 /2
Bout_i = Ai'-(Bi’+Ci’)/2-(Br’-Cr’)* 3 /2
Cout_r = Ar'-(Br’+Cr’)/2-(Bi’-Ci’)* 3 /2
Cout_i = Ai'-(Bi’+Ci’)/2+(Br’-Cr’)* 3 /2
where Ar’, Ai’, Br’, Bi’, Cr’, and Ci’ are calculated as follows:
Ar' = Ar
Ai' = Ai
Br' = (Br*Wbr - Bi*Wbi)
Bi' = (Br*Wbi + Bi*Wbr)
Cr' = (Cr*Wcr - Ci*Wci)
Ci' = (Cr*Wci + Ci*Wcr)
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
13
Mixed Radix FFT Algorithms
3.6
Radix-2 DIT FFT
The Radix-2 DIF FFT can be derived as shown in Equation 18.
X(k) =
∑ x ( n )W
nk
N
n=0
N
---- – 1
2
=
N
---- – 1
2
∑ x ( 2n )W
2nk
N
+
n=0
∑ x( 2n + 1 )W
n=0
N
---- – 1
2
=
=
( 2n + 1 )k
N
N
---- – 1
2
∑ x ( 2n )W
2nk
N
∑ x( 2n + 1 )W
+ W Nk
n=0
n=0
N
---- – 1
2
N
---- – 1
2
∑ x ( 2n )W
nk
N⁄2
+ W Nk
n=0
Eqn. 18
2nk
N
∑ x( 2n + 1)W
nk
N⁄2
n=0
= P ( k ) + W Nk Q ( k )
k = 0 1 2 3
N–1
Each of the sums, P(k) and Q(k), in Equation 18 is recognized as an N/2-point DFT. The transform X(k)
can be broken into two parts as shown in Equation 19.
X ( k ) = P ( k ) + W Nk Q ( k )
⎛
N⎞
N
⎝ k + ----⎠
X ⎛ k + ----⎞ = P ( k ) + W N 2 Q ( k
⎝
2⎠
Eqn. 19
= P ( k ) – W Nk Q ( k )
N
k = 0, 1, 2, …, ---- – 1
2
Figure 7 shows the corresponding Radix-2 DIT butterfly diagram.
P(k)
A’
A
Q(k)
B
Wb
(1,1)
(1,-1)
B’
X(k)
Aout
X(k+N/2)
Bout
2π ⎞
⎛
Wb = exp ⎜ − j
k⎟
N ⎠
⎝
(k = 0,1, ...,N/2-1)
Figure 7. Radix-2 DIT Butterfly
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
14
Freescale Semiconductor
Mixed Radix FFT Algorithms
The twiddle factors in Figure 7 are defined in Equation 20.
W Nk = Wb = Wbr + j × Wb
Eqn. 20
The real and imaginary part of the outputs of Radix-2 DIT butterfly is specified as follows:
Aout_r
Aout_i
Bout_r
Bout_i
3.7
=
=
=
=
Ar+(Br*Wbr
Ai+(Br*Wbi
Ar-(Br*Wbr
Ai-(Br*Wbi
+
+
Bi*Wbi)
Bi*Wbr)
Bi*Wbi)
Bi*Wbr)
An Example of 12-Point Mixed Radix DIT FFT
An example of 12-point mixed Radix DIT FFT diagram is shown in Figure 8 to illustrate an entire DIT
algorithm. There are 2 stages to process the whole DFT computation. Radix-4 DIT butterflies are used in
the first stage, and there 3 butterflies; Radix-3 DIT butterflies are used in the second stage, and there 4
butterflies.
x(0)
X(0)
0
0
0
x(3)
0
X(1)
0
x(2)
0
x(6)
0
x(9)
X(3)
0
x(1)
0
x(4)
0
X(5)
0
x(7)
X(4)
0
1
2
X(6)
3
X(7)
0
x(10)
0
x(2)
0
x(5)
0
2
0
4
x(8)
X(8)
X(9)
X(10)
0
6
x(11)
X(11)
Stage
Radix
Butterflies / Subgroup
Subgroups
0
4
1
3
1
3
4
1
Figure 8. An Example of a 12-point Mixed Radix DIT FFT
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
15
Mixed Radix FFT Algorithms
In Figure 8, the inputs are digit-reversed ordered while the outputs are presented in a normal order. The
digital reversed order of the inputs are stored in an array and pre-calculated outside of the DFT function.
The pointer of the array is passed into the DFT function to get the digital reversed indices.
The digital reversed indices n can be calculated as follows:
n = r ( 0 )A ( 0 ) + r ( 1 )A ( 1 ) + … + r ( S – 1 )A ( S – 1 )
Eqn. 21
where Ss –is1 the total number of stages, r ( s ) = { 0, 1, …, R ( s ) – 1 } specifies possible digits of Radix R(s), and
A(s ) =
∏ R ( m ) specifies the product of the radixes of the following stages.
m=0
For example, consider the digit-reversed order of the 300-point DFT. The 300-point DFT can be factored
as 4 × 3 × 5 × 5 and we assume that the calculated radix of FFT stage is performed in this order. So the
products of radixes of the following FFT stages and the digits at each stage are
At first stage, A(0) = 3 x 5 x 5 = 75, r(0) = {0,1,2,3},
At second stage, A(1) = 5 x 5 = 25, r(0) = {0,1,2},
At third stage, A(2) = 5, r(0) = {0,1,2,3,4},
At fourth stage, A(3) = 1, r(0) = {0,1,2,3,4},
The final output indexes in the digit-reversed order are calculated as
k
0
75
150
225
25
100
175
250
50
125
200
275
5
80
155
230
...
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
r(0)A(0)
0
75
150
225
0
75
150
225
0
75
150
225
0
75
150
225
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
r(1)A(1)
0
0
0
0
25
25
25
25
50
50
50
50
0
0
0
0
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
r(2)A(2)
0
0
0
0
0
0
0
0
0
0
0
0
5
5
5
5
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
r(3)A(3)
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
16
Freescale Semiconductor
Mixed Radix FFT Algorithms
The general C program to calculate the output indexes in the digit-reversed order is shown below. This
program can calculate the output indexes in the digit-reversed order up to 6 FFT stages.
//=================================================
// Input indices are in the digit-reversed order
//=================================================
short *psiRadix;
// Pointer to the array of radix every stage
short *psiDigitReversedAddress; // Pointer to the Digit Reversed Address
short
short
short
short
short
short
siI;
siI0,siI1,siI2,siI3,siI4,siI5;
siIr0,siIr1,siIr2,siIr3,siIr4,siIr5;
siR[6];
// Local variable for radix every stage
siRR[6][5];
// Digits of the radix number
my_psiSubgroup[6];
// Product of radixes of the following DFT stages
// Radix number every stage
for(siI=0;siI<6;siI++)
{
siR[siI] = psiRadix[siI];
if(psiRadix[siI] == 0)
{
siR[siI] = 1;
}
}
// Loop
siIr0 =
siIr1 =
siIr2 =
siIr3 =
siIr4 =
siIr5 =
index to calculate digit reversed address
siR[0];
siR[1];
siR[2];
siR[3];
siR[4];
siR[5];
// Digits of the radix number
for(siI=0;siI<5;siI++)
{
siRR[0][siI] = siI;
siRR[1][siI] = siI;
siRR[2][siI] = siI;
siRR[3][siI] = siI;
siRR[4][siI] = siI;
siRR[5][siI] = siI;
}
// Product of radixes of the following DFT stages
my_psiSubgroup[0] = siIr1*siIr2*siIr3*siIr4*siIr5;
for(siI=1;siI<6;siI++)
{
if(psiRadix[siI] == 0)
{
my_psiSubgroup[siI] = 0;
}
else
{
my_psiSubgroup[siI] = my_psiSubgroup[siI-1]/siR[siI];
}
}
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
17
Mixed Radix FFT Algorithms
// Digit reversed address
siI = 0;
for(siI5=0;siI5<siIr5;siI5++)
{
for(siI4=0;siI4<siIr4;siI4++)
{
for(siI3=0;siI3<siIr3;siI3++)
{
for(siI2=0;siI2<siIr2;siI2++)
{
for(siI1=0;siI1<siIr1;siI1++)
{
for(siI0=0;siI0<siIr0;siI0++)
{
psiDigitReversedAddress[siI]
= siRR[0][siI0]*my_psiSubgroup[0] +
siRR[1][siI1]*my_psiSubgroup[1]
+ siRR[2][siI2]*my_psiSubgroup[2] +
siRR[3][siI3]*my_psiSubgroup[3]
+ siRR[4][siI4]*my_psiSubgroup[4] +
siRR[5][siI5]*my_psiSubgroup[5];
siI++;
}
}
}
}
}
}
3.8
Mixed Radix Inverse FFT Algorithms
The common method of calculating the IFFT is to conjugate the twiddle factors and use the FFT routine
to get the IFFT. However, this method needs an additional memory to store the twiddle factors. Therefore,
it is not an efficient way for reduction of memory requirement.
If the complex conjugate of IDFT (Equation 3) is considered, the equation is defined as shown in
Equation 22.
1
x∗ ( n ) = ---N
∑
2πnk
X ( k ) exp ⎛ j -------------⎞
⎝ N ⎠
k=0
N–1
1
= ---N
-⎞
∑ X∗ ( k ) exp ⎛⎝ ( –j ) -----------N ⎠
2πnk
k=0
N–1
1
= ---N
Eqn. 22
⎧
2πnk ⎫
-⎞ – j sin ⎛ -------------⎞ ⎬
∑ X∗ ( k )⎨⎩ cos ⎛⎝ -----------⎝ N ⎠
N ⎠
⎭
2πnk
k=0
N–1
1
= ---N
∑ X∗ ( k )W
nk
N
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
18
Freescale Semiconductor
Mixed Radix FFT Algorithms
The asterisk indicates the complex conjugate. In this form, there is no need to have additional memory for
the twiddle factors since the twiddle factor is the same as the DFT. The IDFT is calculated by applying the
DFT on the complex conjugate of X(k). The complex conjugate of the resulting time signal, x(n), is the
output of the IDFT (with appropriate scaling). The advantage of this calculation is that the same twiddle
factors can be used for both the DFT and the IDFT, and there is no need to save additional conjugated
twiddle factors.
The following code describes the Radix-5 DIT butterfly calculations of the IFFT: Note that the conjugating
process is embedded in the equations.
Aout_r = Ar'+Br'+Cr'+Dr'+Er'
Aout_i = Ai'-Bi'-Ci'-Di'-Ei'
Bout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))+(b(Bi'-Ei')+d(Ci'-Di'))
Bout_i = Ai'-(a(Bi'+Ei')-c(Ci'+Di'))+(b(Br'-Er')+d(Cr'-Dr'))
Cout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+(d(Bi'-Ei')-b(Ci'-Di'))
Cout_i = Ai'+(c(Bi'+Ei')-a(Ci'+Di'))+(d(Br'-Er')-b(Cr'-Dr'))
Dout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))-(d(Bi'-Ei')-b(Ci'-Di'))
Dout_i = Ai'+(c(Bi'+Ei')-a(Ci'+Di'))-(d(Br'-Er')-b(Cr'-Dr'))
Eout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))-(b(Bi'-Ei')+d(Ci'-Di'))
Eout_i = Ai'-(a(Bi'+Ei')-c(Ci'+Di'))-(b(Br'-Er')+d(Cr'-Dr'))
where Ar', Ai', ...., Er', and Ei' are calculated as follows:
Ar' = Ar
Ai' = Ai
Br' = (Br*Wbr + Bi*Wbi)
Bi' = (Br*Wbi - Bi*Wbr)
Cr' = (Cr*Wcr + Ci*Wci)
Ci' = (Cr*Wci - Ci*Wcr)
Dr' = (Dr*Wdr + Di*Wdi)
Di' = (Dr*Wdi - Di*Wdr)
Er' = (Er*Wer + Ei*Wei)
Ei' = (Er*Wei - Ei*Wer)
a = 0.3090169944
b = 0.9510565163
c = 0.8090169944
d = 0.5877852523
The following code listing describes the Radix-4 DIT butterfly calculations of the IFFT:
Aout_r = (Ar'+Cr')+(Br'+Dr')
Aout_i = (Ai'-Ci')-(Bi'+Di'))
Bout_r = (Ar'+Cr')-(Br'+Dr')
Bout_i = (Ai'-Ci')+(Bi'+Di')
Cout_r = (Ar'-Cr')+(Bi'-Di')
Cout_i = (Ai'+Ci')+(Br'-Dr'))
Dout_r = (Ar'-Cr')-(Bi'-Di')
Dout_i = (Ai'+Ci')-(Br'-Dr')
where Ar', Ai', ...., Dr', and Di' are calculated as follows:
Ar' = Ar
Ai' = -Ai
Br' = (Br*Wbr + Bi*Wbi)
Bi' = (Br*Wbi - Bi*Wbr)
Cr' = (Cr*Wcr + Ci*Wci)
Ci' = (Cr*Wci - Ci*Wcr)
Dr' = (Dr*Wdr + Di*Wdi)
Di' = (Dr*Wdi - Di*Wdr)
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
19
Implementation of DFT on StarCore SC3850 DSP
The following code listing describes the Radix-3 DIT butterfly calculations of the IFFT:
Aout_r = Ar'+Br'+Cr'
Aout_i = Ai'-Bi'-Ci'
Bout_r = Ar'-(Br’+Cr’)/2+(Bi’-Ci’)* 3 /2
Bout_i = Ai'+(Bi’+Ci’)/2+(Br’-Cr’)* 3 /2
Cout_r = Ar'-(Br’+Cr’)/2-(Bi’-Ci’)* 3 /2
Cout_i = Ai'+(Bi’+Ci’)/2-(Br’-Cr’)* 3 /2
where Ar’, Ai’, Br’, Bi’, Cr’, and Ci’ are calculated as follows:
Ar' = Ar
Ai' = Ai
Br' = (Br*Wbr + Bi*Wbi)
Bi' = (Br*Wbi - Bi*Wbr)
Cr' = (Cr*Wcr + Ci*Wci)
Ci' = (Cr*Wci - Ci*Wcr)
The following code listing describes the Radix-2 DIT butterfly calculations of the IFFT:
Aout_r
Aout_i
Bout_r
Bout_i
4
=
=
=
=
Ar+(Br*Wbr
Ai-(Br*Wbi
Ar-(Br*Wbr
Ai+(Br*Wbi
+
+
-
Bi*Wbi)
Bi*Wbr)
Bi*Wbi)
Bi*Wbr)
Implementation of DFT on StarCore SC3850 DSP
In this section, an implementation of the mixed Radix DIT FFT algorithm will be presented. Please note
that the implementation of IDFT is similar, and thus the reference code of IDFT is not listed in this
application note.
4.1
Overflow in Fixed-Point Computation and Scaling
The SC3850 is a fixed-point digital signal processor. In fixed-point processors, numbers are represented
either as integers or as fractions between [-1.0, +1.0). In the implementation of the DFT algorithm,
fractional number are used to interpret all input data, output data, and twiddle factors to avoid
multiplication overflow because a fractional number multiplied with another fractional number always
results in value less than one (under an assumption that the two operants can’t take -1 at the same time).
Another overflow comes from additions in the DFT calculation. For example, in a Radix-5 DIT butterfly,
although the inputs of the butterfly have real and imaginary parts with magnitudes less than one, the
maximum magnitude of outputs can have a up to 3-bit growth as shown in Equation 23.
A =
=
2
( Ar + Br + Cr + Dr + Er ) + ( Ai + Bi + Ci + Di + Ei )
2
(1 + 1 + 1 + 1 + 1) + (1 + 1 + 1 + 1 + 1)
2
2
Eqn. 23
= 5 2
= 7.071
Scaling is an important operation in fixed-point DSP applications to avoid the overflow with the loss of
precision and dynamic range. When the inputs are scaled to a smaller range, some of the lower-order bits
of the original value are lost through truncation or rounding resulting in reduced precision. Thus, scaling
must be done with great care, balancing the need to eliminate the possibility of overflow with the need to
maintain adequate dynamic range and precision.
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
20
Freescale Semiconductor
Implementation of DFT on StarCore SC3850 DSP
Two scaling methods to prevent overflowing are implemented in the DFT calculation. The first method is
to scale down the input data by a fixed number of bits at each stage without considering the range of the
input data. This is called fixed scaling method. The scaling factor is determined based on the maximum
bit growth of the butterfly calculation. For example, the inputs to the Radix-5 butterfly are scaled by 3 bits
(3-bit right shift) since it may cause up to 3-bit growth. This method is easy to implement. However, it may
results in unnecessary scaling down of the input data, especially when the input data values are in a small
range.
Another method is to detect the bit growth at the output of each DFT stage. Once the number of bit growth
is known, the number of scaling bits can be determined to reverse the bit growth. This method is called
automatic scaling method. The precision of the DFT stage can be enhanced by predicting bit growth of an
upcoming stage and normalizing the data accordingly.
The automatic scaling method algorithm is as follows:
1. The input data are scaled down at the first stage to gain extra guard bits. The scaling factors are 2
for the first Radix-2 stage and 4 for the first Radix-4 stage.
2. At each butterfly calculation, search the maximum magnitude of the input data and detect the bit
growth. The number of leading zeros or ones (except the sign bit) of the data with the maximum
magnitude is recorded.
3. Based on the number of leading bits and the maximum bit growth S of the current DFT stage, the
number of scaling bits is determined. The input data to the DFT stage are shifted accordingly.
Table 1 shows the detailed scaling parameters used at different stages in the DFT calculation.
Table 1. Automatic Scaling at Different Stages
Radix-2 First Stage
Automatic
Scaling
Automatic
Scaling
Automatic
Scaling
Automatic
Scaling
if(leadbits > 2) Shift = 2 - leadBits;
else Shift = 2 - leadBits
N/A
N/A
Radix-4 First Stage
if(leadbits > 3) Shift = 3- leadBits;
else Shift = 3- leadBits
N/A
N/A
N/A
N/A
N/A
N/A
Radix-3 Middle Stage
if(leadBits > 2) Shift = 0;
else if(leadBits == 2) Shift = 1;
else Shift = 2;
Radix-4 Middle Stage
if(leadBits > 3) Shift = 0;
else if(leadBits == 3) Shift = 2;
else Shift = 2;
Radix-5 Middle Stage
if(leadBits > 3) Shift = 0;
else Shift = 3 - leadBits;
Radix-3 Last Stage
if(leadBits > 2) Shift = 0;
else Shift = 2 - leadBits;
N/A
N/A
Radix-5 Last Stage
if(leadBits > 2) Shift = 2 - leadBits;
else Shift = 2 - leadBits;
4. The final output data obtained by the DFT kernel may need to be scaled because the automatic
scaling process may use different scaling factors for different input data. The DFT kernel does not
implement this step since difference applications may require different scaling factors. Users can
scale the output data outside of the DFT kernel based on application requirement.
The following code is used to find the maximum magnitude and determines the number of leading bits for
the automatic scaling. Four real parts and four imaginary parts are compared simultaneously in a loop
within two cycles.
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
21
Implementation of DFT on StarCore SC3850 DSP
;-- Find max value (real & imag) & stage initialization
[
move.l (sp-8),r0
; input base (A)
move.l (sp-60),r2
; DFT point
]
[
move.w (r2),d15
; DFT point
]
[
tfr d15,d14
sub #8,d15
; d15-8
]
[
asrr #2,d15
; (d15-8)/4
]
[
adda #8,r0,r1
; Input address
move.w #2,n0
]
[
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
move.2l (r0)+n0,d4:d5
move.2l (r1)+n0,d6:d7
]
[
abs2 d4,d4
abs2 d5,d5
abs2 d6,d6
abs2 d7,d7
doensh0 d15
; loop count for max search
]
; Loop for max (real,imag) search
FALIGN
loopstart0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
]
loopend0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.l (sp-40),r3
; Stage counter
]
[
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
22
Freescale Semiconductor
Implementation of DFT on StarCore SC3850 DSP
tfr d6,d0
tfr d7,d1
move.l (sp-20),r2
; num_butterfly
move.l (sp-24),r4
; num_subgroup
]
[
max2 d0,d4
max2 d1,d5
]
[
tfr d5,d0
]
[
clr d15
max2 d0,d4
move.l (sp-32),r13
; digit_reversed_address
]
[
add #2,d15
; scale down = 2
sxt.w d4,d0
adda r3,r2
; num_butterfly[s]
adda r3,r4
; num_subgroup[s]
]
[
extract #16,#16,d4,d4
move.l (sp-68),r1 ; psiScale
]
[
max d0,d4
; Max real/imag
move.l (sp-8),r0
; A input pointer
move.l (sp-12),r9 ; A output pointer
]
[
clb d4,d0
; count reading bit
move.w (r2),r14
; num_butterfly[s]
move.w (r4),r5
; num_subgroup[s]
]
[
neg d0
; negates
dosetup0 radix2_dit_1_stage_loop1
move.l (sp-28),r4 ; num_radix_offset
]
[
clr d4
sub #16,d0
; limit 16-bit
move.w #S22,d2
]
[
add #1,d4
cmpgt.w #(S21+1),d0
; Check if siNorm (leading bits) > S21+1
move.w #S21,d1
]
[
ift clr d15
; scale down = 0
if siNorm > S21+1
ifa
cmpeq.w #(S21+1),d0
; Check if siNorm == S21+1
]
tfrt d4,d15
; scale down = 1
if siNorm == S21+1
; d15 is the automatic scaling bits
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
23
Implementation of DFT on StarCore SC3850 DSP
4.2
Implementation of Mixed Radix DIT FFTs
Figure 9 shows the actual process of calculation for 600-point DFT as an example.
Ex. N = 600
Bit Reverse
0
Radix-2 FFT
Radix-4 FFT
Radix-3 FFT
Radix-5 FFT
Radix-5 FFT
1
Radix-2 FFT
Radix-4 FFT
Radix-3 FFT
Radix-5 FFT
Radix-5 FFT
2
Radix-2 FFT
Radix-4 FFT
Radix-3 FFT
Radix-5 FFT
Radix-5 FFT
3
Radix-2 FFT
Radix-4 FFT
Radix-3 FFT
Radix-5 FFT
Radix-5 FFT
N/R(s)-4
Radix-2 FFT
Radix-4 FFT
Radix-3 FFT
Radix-5 FFT
Radix-5 FFT
N/R(s)-3
Radix-2 FFT
Radix-4 FFT
Radix-3 FFT
Radix-5 FFT
Radix-5 FFT
N/R(s)-2
Radix-2 FFT
Radix-4 FFT
Radix-3 FFT
Radix-5 FFT
Radix-5 FFT
N/R(s)-1
Radix-2 FFT
Radix-4 FFT
Radix-3 FFT
Radix-5 FFT
Radix-5 FFT
First stage
R(s)=2, s=0
Middle stage
R(s)=4, s=1
R(s)=3, s=2
Last stage
R(s)=5, s=3
R(s)=5, s=4
2 x 4 x 3 x 5 x 5 -> 5 stages
Figure 9. Diagram of 1024-point FFT Calculation Using Radix-4 DIT Algorithm
In this implementation, the DFT is factorized as 2 × 4 × 3 × 5 × 5. There are S=5 stages, and the stage index
s = 0, 1, 2, 3, 4. The whole DFT process is summarized as follows:
1. First stage: The input data are loaded in the bit-reversed addressing mode. The bit-reversed indices
are calculated outside the DFT kernel and they are passed into the kernel as one of the parameters.
The input data is checked to detect the maximum magnitude, and the scaling factor is determined.
The Radix-2 (R(s=0)=2) DIT butterfly calculation is repeated for N/2 times on the input data. In
the first stage, the twiddle factors are all equal to 1. As a result, there is no multiplication in the first
stage. The data are scaled according to the scaling factor to prevent the overflow.
2. Middle stage: There 3 middle stages in this example. In each stage, there are N/R(s) Radix-R(s)
DIT butterfly calculations, where s = 1, 2, 3. The twiddle factors vary with stages.
— In stage s=1, the twiddle factors are equal to (WN75k, WN2*75k, WN3*75k) (k = 0, N/300-1).
— In stage s=2, the twiddle factors are (WN25k, WN2*25k) (k = 0, 1, 2, ... , N/75-1).
— In stage s=3, the twiddle factors are (WN5k, WN2*5k, WN3*5k,WN4*5k) (k = 0, 1, 2, ... , N/25-1),
and so on.
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
24
Freescale Semiconductor
Implementation of DFT on StarCore SC3850 DSP
The input data is checked to detect the maximum magnitude, and the scaling factor is determined.
The data are scaled according to the scaling factor to prevent the overflow.
3. Last stage: The Radix-5 (R(s=4)=5) DIT butterfly calculation is repeated for N/5 times. The
twiddle factors are (WNk, WN2k, WN3k,WN4k) (k = 0, 1, 2, ... , N/5-1) in this stage. The scaling
factor is determined based on the maximum magnitude of the input data and the data are scaled
accordingly.
Input and output data are 16-bit complex array, each of length N. Two separate buffers are used to store
the input and output data, as shown in Figure 10. Each stage uses the input and output data buffer one after
the other. Therefore, if stage n takes the input data buffer as the input and the output data buffer as the
output, the next stage, n + 1, uses the output data buffer as the input and the input data buffer as the output.
The final output data are stored in input data buffer if the number of stages S is even or in output data buffer
if S is odd.
Input Data Buffer
Number of stages
N x 2-byte x 2 IQ
[bytes]
Output Data Buffer
Final output data
Even
Input data buffer
Odd
Output data buffer
Input and output data are 16-bit
complex array
N x 2-byte x 2 IQ
[bytes]
Figure 10. Input and Output Data Buffers
At each stage, the output data are stored into the memory and they are used as input to the next stage. For
example, if Radix-4 butterflies are used in stage n, then the outputs of a butterfly are A’, B’, C’, and D’.
Note that A’, B’, C’, and D’ are not consecutively stored in memory, as shown in Figure 11.
The input addressing of the first stage is in digital reversed order
Input addressing
r0
A
B
C
Output addressing
r8
D
A’
r1
N/4
r9
B’
N/4
N
r10
C’
N/4
r11
D’
N/4
,
Figure 11. Input and output addressing in the DFT implementation
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
25
Implementation of DFT on StarCore SC3850 DSP
The offset value between the output data samples is calculated as shown in Equation 24.
N
4 × ----------R(s)
Eqn. 24
where s specifies the stage index and R ( s ) is the Radix of the FFT stage s . The offset value is multiplied
by 4 to align the 2-byte for real part and 2-byte for imaginary part.
4.3
Optimization Used in DFT Implementation
The optimization techniques improve the performance by taking special advantage of StarCore’s parallel
computation capability. The SC3850 cores efficiently deploy a Variable Length Execution Set (VLES)
instruction set. The SC38500 architecture has:
• Four parallel ALUs, each is capable of performing dual MAC and most other arithmetic operations
• Two Address Arithmetic Units (AAUs) for address arithmetic operations
A VLES can contain up to four data ALU instructions and two AAU instructions. To maximize the
computational performance, four ALUs and two AAUs should be utilized simultaneously as much as
possible. Because the StarCore architecture has high instruction-level parallelism, it is possible to schedule
independent blocks in parallel to increase performance.
Code optimization also considers the memory structure to improve the performance. The SC3850
architecture provides a total sixteen 40-bit data registers, D0-D15 and sixteen 32-bit address registers,
R0-R15. The dual Harvard architecture in the SC3850 is capable to access up to two 64-bit data per cycle.
The following subsections present the optimization techniques used to increase the speed of the DIT DFT
algorithm.
4.3.1
Single Instruction Multiple Data (SIMD) and Parallel Computing
The multiplication and addition operations in the DIT butterflies can be calculated in parallel using the
multiple ALUs in SC3850. The SC3850 has some instructions which enable faster implementation of the
DIT butterfly algorithms. The instructions used for the computation are shown in Table 2.
Table 2. SC3850 SIMD Instructions for DFT
Instruction
MAX2 Da,Db
ABS2 Da,Dn
ASRR2 Da,Dn
SOD2FFCC
MAC2FFGGR
Description
Writes the larger of the signed 16-bit values in each of the corresponding portions in a pair of source data
registers to the second of the two registers (SIMD). Compares the high and low portions of the two registers
independently.
Stores the absolute values of the high portion and low portion of the source data register in the high portion
and low portion, respectively, of the destination data register (SIMD).
Shifts the bits in each portion of the destination data register to the right, the number of positions specified in
an immediate unsigned 4-bit value (SIMD).
Sum or difference of two 16-bit values - function & cross (SIMD instruction).
Performs two separate 16-bit additions or subtractions between the high and low portions of two source data
registers and stores the results in the two portions of the destination data register.
FF: A for addition and S for subtraction
CC: XX for crossed and II for not crossed
SIMD signed fractional multiply and wide accumulate - real
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
26
Freescale Semiconductor
Implementation of DFT on StarCore SC3850 DSP
Table 2. SC3850 SIMD Instructions for DFT
Instruction
MAC2FFGGI
MOVE2.2F
MOVE2.4F
MOVERH.4F
MOVERL.4F
Description
SIMD signed fractional multiply and wide accumulate - imaginary
Transfers 2 16-bit fractional data between the memory and one data registers in packed 20-bit format, in a
single 32-bit access
Transfers 4 16-bit fractional data between the memory and two data registers in packed 20-bit format, in a
single 64-bit access
Move Four Wide High Fractional Words to Memory With Scaling, Rounding, and Saturation (AGU)
Move Four Wide Low Fractional Words to Memory With Scaling, Rounding, and Saturation (AGU)
The functionality of the MAC2FFGGR and MACFFGGI are shown in Figure 12 and Figure 13,
respectively.
Figure 12. MAC2AASSR Da,Db,Ds.H/L,Dn
Figure 13. MAC2AASSI Da,Db,Ds.H/L,Dn
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
27
Implementation of DFT on StarCore SC3850 DSP
In the implementation, the parallel computing technique is used to maximize the usage of StarCore
multiple ALUs for the calculation of independent output values that have an overlap input source data
values. For example, in a Radix-4 DIT butterfly, outputs A’, C’, B’, and D’ can be calculated in parallel.
4.3.2
Packed Data Move
In the StarCore architecture, most element combinations that are supported have data width up to 64-bit.
This means that four short words (4 × 16-bits) or two long words (2 × 32-bits) can be fetched at the same
time. For example, the MOVE2.2F, MOVE2.4F, MOVERH.4F and MOVERL.4F instructions are used
in the DFT computation to efficiently load and store the data samples. For this purpose, the memory
addresses should be aligned to the access width of the instructions used. For example, 8-byte (4 × short
words) accessing should be aligned to the 8-byte addresses.
4.3.3
Software Pipelining
Software pipelining is an optimization technique where a sequence of instructions is transformed into a
pipeline of several copies of the sequence. The sequences then work in parallel to leverage more of the
available parallelism of the architecture. For example, two set of Radix-4 DIT butterflies can be calculated
inside one loop with the software pipelining.
4.3.4
Function Alignment
Functions should be aligned so that they fall on cache boundaries. On a Change of Flow (COF), the core
has only filled one fetch buffer. If the destination VLES is within two fetch sets, the core will have to fill
another fetch buffer and that will cause a stall. To avoid this situation, ensure that the destination VLES of
all COF operations are contained within one fetch set. The FALIGN directive will pad the previous VLES
with NOP to ensure the following VLES is contained within one fetch set.
4.4
DFT Implementation Diagrams
This is an implementation of the mixed Radix FFTs, where Radix-5, 4, 3, and 2 FFTs are combined
together to perform the DFT computation. Decimation in time (DIT) algorithm is used. Bit-reversed
addressing is not performed inside the kernel. The "psiDigitReversedAddress" array contains offsets for
digit reversed address, which is pre calculated.
The butterfly diagrams are shown here to help understand the DFT algorithm. The DFT reference code is
listed at Appendix A, “DFT Reference Code”. The implementation diagrams of IDFT are the same except
some operation sign changes. The IDFT reference code is not included to save space.
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
28
Freescale Semiconductor
Implementation of DFT on StarCore SC3850 DSP
Figure 14. First Stage Radix-2 Butterfly Diagram
Figure 15. Middle Stage Radix-3 Butterfly Diagram
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
29
Implementation of DFT on StarCore SC3850 DSP
Figure 16. Last Stage Radix-3 Butterfly Diagram
Figure 17. First Stage Radix-4 Butterfly Diagram
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
30
Freescale Semiconductor
Implementation of DFT on StarCore SC3850 DSP
Figure 18. Middle Stage Radix-4 Butterfly Diagram
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
31
Implementation of DFT on StarCore SC3850 DSP
Figure 19. Middle and Last Stage Radix-5 Butterfly Diagram
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
32
Freescale Semiconductor
Experimental Results
5
Experimental Results
This section provide the performance results of the DFT/IDFT kernel running on SC3850.
The real-time cycle counts for the implemented DIT DFT and IDFT kernels are measured using the
SC3850 cycle accurate simulator. CodeWarrior Integrated Development Environment (IDE) R10.0 is used
to build the DFT kernels and the corresponding test harness. The cycle counts for the DIT DFT and IDFT
kernels are shown in Figure 20 and Figure 21. Automatic scaling method finds the maximum magnitude
data at each DFT stage and thus requires roughly 30% more computation as shown in the figure.
16
SC3850 DFT Performance
14
Cycles (Thousands)
12
10
8
Fixed Scaling
6
Auto Scaling
4
2
52
0
8
8
6
2
11
97
90
76
64
57
4
4
8
6
0
0
0
48
38
32
28
21
18
12
96
60
36
12
0
DFT points
Figure 20. SC3850 DIT DFT Cycle Count at Different Points
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
33
Experimental Results
16
SC3850 IDFT Performance
14
Cycles (Thousands)
12
10
8
Fixed Scaling
6
Auto Scaling
4
2
76
8
90
0
97
2
11
52
12
0
18
0
21
6
28
8
32
4
38
4
48
0
57
6
64
8
60
96
12
36
0
IDFT points
Figure 21. SC3850 DIT IDFT Cycle Count at Different Points
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
34
Freescale Semiconductor
Conclusions
6
Conclusions
The purpose of this application note is to develop DIT DFT/IDFT algorithm implemented on the SC3850
core architecture. The DFT/IDFT is computed by a combination of Radix-2, 3, 4, and 5 FFTs/IFFTs to
achieve good speed performance. The implementation benefits the following SC3850 features:
• VLES execution model, which utilizes maximum parallelism by allowing multiple address
generation and data arithmetic logic units to execute multiple instructions in a single cycle
• Special SIMD instructions, such as MAX2, ABS2, SOD2ffcc, MAC2ffggR, and MAC2ffggI
• MOVER instructions with scaling, rounding, and limiting to avoid overflowing
• Memory access with multiple data, such as MOVE2.4F, MOVE2.2f, MOVERH.4f, and
MOVERL.4F.
The SC3850 fixed-point code incurs very little loss of accuracy compared to the floating-point Matlab
results. Fixed scaling and automatic scaling methods can be used to avoid the overflowing in the DFT
calculation. In fixed scaling method, the scaling factors for the DFT stages are pre defined, and the
algorithm scales the data without considering the data dynamic range. On the contrary, the automatic
scaling method scales the data based on their dynamic range to obtain good bit precision, and thus produce
results with higher SNR. But it needs to find the maximum magnitude of the input data to each stage,
resulting in roughly 30% more computation.
The powerful architecture and instruction set of the SC3850 core permits flexible and compact coding of
the algorithms in assembly language. The optimization of the DIT DFT/IDFT is done by taking special
advantage of StarCore parallel computation capabilities, such as SIMD, parallel computing, and software
pipelining.
7
References
1. Discrete-Time Signal Processing (Second Edition). Alan V. Oppenheim, Ronald W. Schafer, John
R. Buck, Prentice Hall International, 1999.
2. The Fast Fourier Transform and Its Applications. E. Oran Brigham, Prentice Hall, 1988.
3. SC3850 DSP Core Reference Manual. Rev.A, Freescale Semiconductor, 08/2008.
4. Functional Differences Between the SC3400 and the SC3x50 Cores. Rev. C 06/2008.
5. SC3850 Programmer’s Guide. Rev. A, 06/2008.
6. Implementation of Radix-4 Fast Fourier Transform on StarCore DSP SC3400. AN3426 Rev. 0,
06/2007.
7. Implementation of Mixed Radix Fast Fourier Transform (FFT) and Inverse FFT on StarCore DSP
SC3400. AN3443 Rev. A, 11/2007.
8. Software Optimization of FFTs and IFFTs Using the SC3850 Core. AN3666 Rev. C, 1/2009
NOTE
Freescale documents in the above list that use a letter for a revision
designator are only available under NDA. Contact your local sales office or
representative to access to these documents.
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
35
DFT Reference Code
Appendix A
DFT Reference Code
DEFINE FIX_SCALE '1'
UNDEF FIX_SCALE
IF @DEF('FIX_SCALE')
SCALE_RADIX_2 EQU 1
SCALE_RADIX_3 EQU 2
SCALE_RADIX_4 EQU 2
SCALE_RADIX_5 EQU 3
MSG 'FIX_SCALE is ON'
ELSE
MSG 'FIX_SCALE is OFF'
ENDIF
S51
S52
S53
S54
S31
S32
S33
S34
S41
S42
S43
S44
S21
S22
equ
equ
equ
equ
equ
equ
equ
equ
equ
equ
equ
equ
equ
equ
2
2
3
3
2
2
2
2
2
2
2
2
1
1
;
;
;
;
;
;
;
;
;
;
;
;
;
;
Scaling
Scaling
Scaling
Scaling
Scaling
Scaling
Scaling
Scaling
Scaling
Scaling
Scaling
Scaling
Scaling
Scaling
check
check
check
check
check
check
check
check
check
check
check
check
check
check
for
for
for
for
for
for
for
for
for
for
for
for
for
for
radix-5
radix-5
radix-5
radix-5
radix-3
radix-3
radix-3
radix-3
radix-4
radix-4
radix-4
radix-4
radix-2
radix-2
stage
stage
stage
stage
stage
stage
stage
stage
stage
stage
stage
stage
stage
stage
(Last stage)
(Last stage)
(Middle stage)
(Middle stage)
(Last stage)
(Last stage)
(Middle stage)
(Middle stage)
(Middle stage)
(Middle stage)
(1st stage)
(1st stage)
(1st stage)
(1st stage)
section .data
radix5parameter
align 256
dcw $278E
dcw $79BC
dcw $678E
dcw $4B3D
radix3parameter
dcw $6ED9
;
;
;
;
radix-5
radix-5
radix-5
radix-5
a
b
c
d
; radix-3 sqrt(3)/2
ALIGN 8
my_OUTPUT_pointer:
DS 8
endsec
section .text
global _sc3850_dft_dit_complex_16x16_auto_scale_asm
align 16
_sc3850_dft_dit_complex_16x16_auto_scale_asm
init:
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
36
Freescale Semiconductor
DFT Reference Code
push.2l d6:d7
push.2l r6:r7
[
clr d3
; Clear for status register stack
clr d4
; Clear stage counter
adda #72,sp,r4
move.l (r0)+,r1
; base_address
]
[
tfra r4,sp
move.l r0,(my_OUTPUT_pointer)
;// --> psFft->psiOut
]
[
move.l
move.l
]
[
move.l
move.l
]
[
move.l
move.l
]
[
move.l
move.l
]
[
move.l
move.l
]
[
move.l
move.l
]
[
move.l
move.l
]
[
move.l
move.l
]
[
move.l
move.l
]
[
move.l
move.l
]
[
move.l
move.l
]
(r0)+,r14
r1,(sp-8)
; out_address
; base_address
(r0)+,r2
r14,(sp-12)
; num_radix
; out_address
(r0)+,r3
r2,(sp-16)
; num_butterfly
; num_radix
(r0)+,r4
r3,(sp-20)
; num_subgroup
; num_butterfly
(r0)+,r5
r4,(sp-24)
; num_radix_offset
; num_subgroup
(r0)+,r6
r5,(sp-28)
; digit_reversed_address
; num_radix_offset
r6,(sp-32)
sr,d3
; digit_reversed_address
; Status Register
d3,(sp-36)
; Status register
#$0BE4008C,SR ; no scaling, Two's-complement rounding, 32 bit saturate,
; 16 bit saturate, W20=0
(r0)+,r7
d4,(sp-40)
; wb
; Initialize the stage counter
r7,(sp-44)
(r0)+,r7
; wb
; wc
r7,(sp-48)
(r0)+,r7
; wc
; wd
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
37
DFT Reference Code
[
move.l r7,(sp-52)
;
move.l (r0)+,r7
;
]
[
move.l (r0)+,r1
;
move.w (r2),d0
;
]
[
move.l (r0)+,r4
;
move.l r7,(sp-56)
;
]
[
move.l r1,(sp-60)
;
move.l r4,(sp-68)
;
]
move.w (r2),d0
;
cmpeq.w #2,d0
;
jf radix4_dit_1_stage
wd
we
DFT point
num_radix
pliAutoScale
we
DFT point
pliAutoScale
num_radix[s]
num_radix[s]=0 or not
; Go to adix4_1st_stage if num_radix[s]!=2
;///////////////////////////////////////////////////////////////////////
; Radix-2 Stage:
;///////////////////////////////////////////////////////////////////////
radix2_init:
;--------------------------------------------------------; 1st stage (radix-2)
;--------------------------------------------------------radix2_dit_1_stage:
IF @DEF('FIX_SCALE')
[
move.l (sp-60),r2
; DFT point
move.l (sp-40),r3
; Stage counter
]
[
move.l (sp-32),r13 ; digit_reversed_address
move.l (sp-68),r1
; psiScale
]
[
move.l (sp-8),r0
; A input pointer
move.l (sp-12),r9
; A output pointer
]
[
dosetup0 radix2_dit_1_stage_loop1
move.l (sp-28),r4
; num_radix_offset
]
[
move.w (r2),d14
; DFT point
move.w #SCALE_RADIX_2,d15
]
ELSE
;//IF @DEF('FIX_SCALE')
;-- Find max value (real & imag) & stage initialization
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
38
Freescale Semiconductor
DFT Reference Code
[
move.l (sp-8),r0
; input base (A)
move.l (sp-60),r2
; DFT point
]
[
move.w (r2),d15
; DFT point
]
[
tfr d15,d14
sub #8,d15
; d15-8
]
[
asrr #2,d15
; (d15-8)/4
]
[
adda #8,r0,r1
; Input address
move.w #2,n0
]
[
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
move.2l (r0)+n0,d4:d5
move.2l (r1)+n0,d6:d7
]
[
abs2 d4,d4
abs2 d5,d5
abs2 d6,d6
abs2 d7,d7
doensh0 d15
; loop count for max search
]
; Loop for max (real,imag) search
loopstart0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
]
loopend0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.l (sp-40),r3
; Stage counter
]
[
tfr d6,d0
tfr d7,d1
move.l (sp-20),r2
; num_butterfly
move.l (sp-24),r4
; num_subgroup
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
39
DFT Reference Code
]
[
max2 d0,d4
max2 d1,d5
]
[
tfr d5,d0
]
[
clr d15
max2 d0,d4
move.l (sp-32),r13
; digit_reversed_address
]
[
add #2,d15
; scale down = 2
sxt.w d4,d0
adda r3,r2
; num_butterfly[s]
adda r3,r4
; num_subgroup[s]
]
[
extract #16,#16,d4,d4
move.l (sp-68),r1 ; psiScale
]
[
max d0,d4
; Max real/imag
move.l (sp-8),r0
; A input pointer
move.l (sp-12),r9 ; A output pointer
]
[
clb d4,d0
; count reading bit
move.w (r2),r14
; num_butterfly[s]
move.w (r4),r5
; num_subgroup[s]
]
[
neg d0
; negates
dosetup0 radix2_dit_1_stage_loop1
move.l (sp-28),r4 ; num_radix_offset
]
[
clr d4
sub #16,d0
; limit 16-bit
move.w #S22,d2
]
[
add #1,d4
cmpgt.w #S21,d0
; Check if siNorm > S21
move.w #S21,d1
]
[
ift clr d15
; scale down = 0
if siNorm > S21+1
ifa
cmpeq.w #S21,d0
; Check if siNorm == S21
]
tfrt d4,d15
; scale down = 1
if siNorm == S21+1
ENDIF
;// IF @DEF('FIX_SCALE')
[
move.l d15,(sp-64) ; Set norm to stack
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
40
Freescale Semiconductor
DFT Reference Code
move.w d15,(r1)+
; psiScale <- norm
]
[
move.l r1,(sp-68) ; psiScale
asrr #2,d14
; DFT point/4
move.l d15,r5
; scale down
]
[
doen0 d14
; loop count
adda #2,r13,r14
]
[
adda r3,r4
; num_radix_offset[s]
move.l SR,d3
]
[
move.w (r4),r1
; num_radix_offset[s]
move.l (sp-32),r15 ; digit_reversed_address
]
[
move.l d3,r4
move.l #(_psiDigitReversedAddress_absolut_address),r15
]
;///
; For next stage input/output
[
move.l r9,(sp-8)
; base_address
move.l r0,(sp-12) ; out_address
]
[
bmset #$1000,SR.L ; W20=1
tfra r9,r8
; A output
]
[
bmclr #$0080,SR.L ; SM2=0
adda r1,r9
; B output pointer (long)
]
[
bmclr #$0030,SR.L ; SCM=00
tfra r9,r10
; B output
]
;--- Start Radix-2 (1st stage)
[
move.l (r15)+,r0
; digit_reversed_address(i+0)
cmpeqa.w #1,r5
]
[
ift bmset #$0010,SR.L
; SCM=01
ifa
move.l (r15)+,r1
; digit_reversed_address(i+1)
]
[
move.l (r15)+,r2
; digit_reversed_address(i+2)
cmpeqa.w #2,r5
]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
41
DFT Reference Code
[
ift bmset #$0030,SR.L
; SCM=11
ifa
move.l (r15)+,r3
; digit_reversed_address(i+3)
]
[
MOVE2.2F (r0),d0
; d0=Ar:Ai
move.l (r15)+,r0
; digit_reversed_address(i+0)
]
[
MOVE2.2F (r1),d1
; d1=Br:Bi
move.l (r15)+,r1
; digit_reversed_address(i+1)
]
[
MOVE2.2F (r2),d4
; d4=Ar:Ai
move.l (r15)+,r2
; digit_reversed_address(i+2)
]
FALIGN
radix2_dit_1_stage_loop1:
loopstart0
[ ; 01
MOVE2.2F (r3),d5
; d5=Br:Bi
move.l (r15)+,r3
; digit_reversed_address(i+3)
]
[ ; 02
sod2aaii d1,d0,d2
; d2=(Ar+Br),(Ai+Bi)
sod2ssii d1,d0,d6
; d3=(Ar-Br),(Ai-Bi)
sod2aaii d5,d4,d3
; d6=(Ar+Br),(Ai+Bi)
sod2ssii d5,d4,d7
; d7=(Ar-Br),(Ai-Bi)
MOVE2.2F (r0),d0
; d0=Ar:Ai
move.l (r15)+,r0
; digit_reversed_address(i+0)
]
[ ; 03
MOVE2.2F (r1),d1
; d1=Br:Bi
move.l (r15)+,r1
; digit_reversed_address(i+1)
]
[ ; 04
MOVER2.4F d2:d3,(r8)+
; out:[Aout_r,Aout_i], out:[Aout_r,Aout_i]
MOVE2.2F (r2),d4
; d4=Ar:Ai
]
[ ; 05
MOVER2.4F d6:d7,(r9)+
; out:[Bout_r,Bout_i], out:[Bout_r,Bout_i]
move.l (r15)+,r2
; digit_reversed_address(i+2)
]
loopend0
[
move.l (sp-16),r2
; num_radix
move.l (sp-40),r3
; Load the stage counter
]
[
move.l r4,d3
adda #2,r3,r3
; +2
]
[
move.l d3,SR
adda r3,r2
; num_butterfly[s]
]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
42
Freescale Semiconductor
DFT Reference Code
[
move.w (r2),r0
; num_radix[s]
move.l r3,(sp-40)
; Update the stage counter
]
cmpeqa.w #0,r0
; num_radix[s]=0 or not
[
jt radix3_dit_last_stage ; Go to radix-3 Last stage processing if num_radix[s]=0
cmpeqa.w #5,r0
; num_radix[s]=5 or not
]
[
jt radix5_dit_m_stage
; Go to radix-5 Middle stage processing if num_radix[s]=5
cmpeqa.w #1,r0
; num_radix[s]=1 or not
]
[
jt radix5_dit_last_stage
; Go to radix-5 Last stage processing if num_radix[s]=1
cmpeqa.w #4,r0
; num_radix[s]=4 or not
]
jt radix4_dit_m_stage
; Go to radix-4 Middle stage processing if num_radix[s]=4
;///////////////////////////////////////////////////////////////////////
; Radix-3 Stages:
;///////////////////////////////////////////////////////////////////////
radix3_init:
;--------------------------------------------------------; Radix-3 (Middle stage)
;--------------------------------------------------------radix3_dit_m_stage:
IF @DEF('FIX_SCALE')
[
move.l (sp-40),r3 ; Stage counter
dosetup0 radix3_dit_m_stage_loop1
]
[
move.l (sp-20),r2
; num_butterfly
move.l (sp-24),r4
; num_subgroup
]
[
move.l (sp-44),r5
; r5 -> Wb
move.l (sp-28),r12
; num_radix_offse
]
[
move.l (sp-8),r0
; A input pointer
move.l (sp-12),r9
; A output pointer
]
[
dosetup1 radix3_dit_m_stage_loop2
dosetup2 radix3_dit_m_stage_loop2_2
]
[
adda r3,r2
; num_butterfly[s]
adda r3,r12
; num_radix_offse[s]
]
[
move.w (r2),n0
; num_butterfly[s],The stage twiddle factor's offset
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
43
DFT Reference Code
move.w (r2),r13
; num_butterfly[s]
]
[
move.w (r12),r15
; num_radix_offse[s]
move.l (sp-68),r2 ; psiScale
]
[
adda r3,r4
; num_subgroup[s]
move.l (sp-48),r6 ; r6 -> Wc
]
[
move.w (r4),r14
; num_subgroup[s]
move.w #SCALE_RADIX_3,d15
]
ELSE
;//IF @DEF('FIX_SCALE')
move.l (sp-40),r3
; Stage counter
;-- Find max value (real & imag) & stage initialization
[
move.l (sp-8),r0
; input base (A)
move.l (sp-60),r2
; DFT point
]
[
move.w (r2),d15
; DFT point
]
[
sub #8,d15
; d15-8
]
[
asrr #2,d15
; (d15-8)/4
]
[
adda #8,r0,r1
; Input address
move.w #2,n0
]
[
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
move.2l (r0)+n0,d4:d5
move.2l (r1)+n0,d6:d7
]
[
abs2 d4,d4
abs2 d5,d5
abs2 d6,d6
abs2 d7,d7
doensh0 d15
; loop count for max search
]
; Loop for max (real,imag) search
loopstart0
[
max2 d0,d4
max2 d1,d5
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
44
Freescale Semiconductor
DFT Reference Code
max2 d2,d6
max2 d3,d7
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
]
loopend0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.l (sp-20),r2
; num_butterfly
move.l (sp-24),r4
; num_subgroup
]
[
tfr d6,d0
tfr d7,d1
move.l (sp-44),r5
; r5 -> Wb
move.l (sp-28),r12
; num_radix_offse
]
[
max2 d0,d4
max2 d1,d5
dosetup0 radix3_dit_m_stage_loop1
dosetup1 radix3_dit_m_stage_loop2
]
[
tfr d5,d0
move.l (sp-8),r0
; A input pointer
move.l (sp-12),r9
; A output pointer
]
[
max2 d0,d4
dosetup2 radix3_dit_m_stage_loop2_2
]
[
sxt.w d4,d0
adda r3,r12
; num_radix_offse[s]
]
[
extract #16,#16,d4,d4
adda r3,r2
; num_butterfly[s]
adda r3,r4
; num_subgroup[s]
]
[
max d0,d4
; Max real/imag
move.l (sp-48),r6 ; r6 -> Wc
move.w (r12),r15
; num_radix_offse[s]
]
[
clb d4,d0
; count reading bit
move.w (r2),r13
; num_butterfly[s]
move.w (r4),r14
; num_subgroup[s]
]
[
clr d14
neg d0
; negates
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
45
DFT Reference Code
move.l (sp-68),r2 ; psiScale
]
[
add #1,d14
; d14 = 1
sub #16,d0
; limit 16-bit
move.w #S34,d2
; d2 = 2
move.w #2,d15
; scale down = 2;
]
[
cmpgt.w #S33,d0 ; Check if siNorm > S33
move.w #S33,d1
]
[
ift clr d15
; scale down = 0
if siNorm > S33+1
ifa
cmpeq.w #S33,d0 ; Check if siNorm == S33
]
[
tfrt d14,d15
; scale down = 1
if siNorm == S33+1
tfra r13,n0
; The stage twiddle factor's offset
]
ENDIF
;// IF @DEF('FIX_SCALE')
; For next stage input/output
[
move.l r9,(sp-8)
; base_address at the next stage
move.l r0,(sp-12)
; out_address at the next stage
]
[
move.l d15,(sp-64) ; Set norm to stack
move.w d15,(r2)+
; psiScale <- norm
]
[
move.l r2,(sp-68)
; psiScale
move.l d15,r11
]
[
tfra r9,r8
; A output
move.l SR,d4
]
[
adda r15,r9
; B output pointer (long)
move.l d4,r3
]
[
tfra r9,r10
; B output
move.f #$4000,d7
; M2_3 = (1/2):0
]
[
tfra r13,r12
bmset #$1000,SR.L
; W20=1
]
[
adda r15,r10
; C output pointer (long)
bmclr #$0080,SR.L
; SM2=0
]
[
asra r12
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
46
Freescale Semiconductor
DFT Reference Code
bmclr #$0030,SR.L
; SCM=00
]
;--- Start Radix-3 (Middle stage)
[
move.l #radix3parameter,r15 ; Point to the radix-3 parameter array
cmpeqa.w #1,r11
]
[
ift bmset #$0010,SR.L
; SCM=01
ifa
asra r14
]
[
doen0 r14
cmpeqa.w #2,r11
]
[
ift bmset #$0030,SR.L
; SCM=11
ifa
move.2f (r0)+,d4:d5
; IAre:IAim=Ar:Ai
]
[
bmtsts #$0001,r13.l
; if num_butterfly[s] is odd, T=1
tfra r12,r13
]
[
suba #1,r13
move.l (r0)+,d2
; v_IB1=Br:Bi
]
; Intra loop
FALIGN
radix3_dit_m_stage_loop1:
; --> Butterflies/Subgroup
loopstart0
[
doen1 r12
move.l (r5)+n0,d9
; v_WB1=Wbr:Wbi
]
FALIGN
radix3_dit_m_stage_loop2:
; --> Subgroup
loopstart1
[ ;01
MPYRE d2,d9,d12
; M1_1 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d2,d9,d13
; M1_2 = Bi' = (Br*Wbi + Bi*Wbr)
move.l (r0)+,d8
; v_IC1=Cr:Ci
move.l (r6),d10
; v_WC1=Wcr:Wci
]
[ ;02
mac2assar d8,d10,d12.H,d6
; M2_1 = (Br'+Cr'):(Br'-Cr')
mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci')
move.f (r15),d10
; M3_4 = (sqrt(3)/2):(0)
move.2f (r0)+,d14:d15
; IAre:IAim=Ar:Ai
]
;stall
[ ;04
sod2aaii d4,d6,d0
; Aout_r = Ar + M2_1_H
// Aout_r = Ar'+Br'+Cr'
sod2aaii d5,d13,d1
; Aout_i = Ai + M2_2_H
// Aout_i = Ai'+Bi'+Ci'
mac -d6,d7,d4
; M3_1_H = Ar - M2_1_H*1/2
// Ar' - (Br'+Cr')*1/2
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
47
DFT Reference Code
mac -d13,d7,d5
; M3_2_H = Ai - M2_2_H*1/2
// Ai' - (Bi'+Ci')*1/2
move.l d13,d8
move.l (r0)+,d2
; v_IB1=Br:Bi
]
[ ;05
mac2aassi d8,d10,d4.H,d4
; Bout_r = Ar' - (Br'+Cr')*1/2 + (Bi'-Ci')*sqrt(3)/2
:Cout_r = Ar' - (Br'+Cr')*1/2 - (Bi'-Ci')*sqrt(3)/2
MPYRE d2,d9,d12
; M1_1 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d2,d9,d13
; M1_2 = Bi' = (Br*Wbi + Bi*Wbr)
move.l (r0)+,d8
; v_IC1=Cr:Ci
move.l (r6),d10
; v_WC1=Wcr:Wci
]
[ ;06
mac2assar d8,d10,d12.H,d12 ; M2_1 = (Br'+Cr'):(Br'-Cr')
mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci')
move.f (r15),d10
; M3_4 = (sqrt(3)/2):(0)
move.l d6,d8
]
;stall
[ ;08
sod2aaii d14,d12,d2
; Aout_r = Ar + M2_1_H
// Aout_r = Ar'+Br'+Cr'
sod2aaii d15,d13,d3
; Aout_i = Ai + M2_2_H
// Aout_i = Ai'+Bi'+Ci'
mac -d12,d7,d14
; M3_1_H = Ar - M2_1_H*1/2
// Ar' - (Br'+Cr')*1/2
mac2ssaai d8,d10,d5.H,d5
; Bout_i=Ai' - (Bi'+Ci')*1/2 - (Br'-Cr')*sqrt(3)/2
:Cout_i=Ai' - (Bi'+Ci')*1/2 + (Br'-Cr')*sqrt(3)/2
move.l d13,d8
]
[ ;09
mac2aassi d8,d10,d14.H,d14 ; Bout_r:Cout_r
mac -d13,d7,d15
; M3_2_H = Ai - M2_2_H*1/2
// Ai' - (Bi'+Ci')*1/2
move.l d12,d8
move.2f (r0)+,d12:d13
; IAre:IAim=Ar:Ai
]
[ ;10
mac2ssaai d8,d10,d15.H,d15 ; Bout_i:Cout_i
MOVERH.4F d0:d1:d2:d3,(r8)+ ; out:[Aout_r,Aout_i]
move.l (r0)+,d2
; v_IB1=Br:Bi
]
[ ;11
tfr d12,d4
tfr d13,d5
MOVERH.4F d4:d5:d14:d15,(r9)+
; out:[Bout_r,Bout_i],[Bout_r,Bout_i]
MOVERL.4F d4:d5:d14:d15,(r10)+
; out:[Cout_r,Cout_i],[Cout_r,Cout_i]
]
loopend1
[
iff addl2a n0,r6
iff move.l (r5)+n0,d9 ; v_WB1=Wbr:Wbi
]
[
MPYRE d2,d9,d12
; M1_1 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d2,d9,d13
; M1_2 = Bi' = (Br*Wbi + Bi*Wbr)
move.l (r0)+,d8
; v_IC1=Cr:Ci
move.l (r6),d10
; v_WC1=Wcr:Wci
]
[
mac2assar d8,d10,d12.H,d6 ; M2_1 = (Br'+Cr'):(Br'-Cr')
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
48
Freescale Semiconductor
DFT Reference Code
mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci')
move.f
(r15),d10
; M3_4 = (sqrt(3)/2):(0)
move.2f (r0)+,d14:d15
; IAre:IAim=Ar:Ai
]
[
ift addl2a n0,r6
ift move.l (r5),d9
; v_WB1=Wbr:Wbi
;stall
]
[
sod2aaii d4,d6,d0
; Aout_r = Ar + M2_1_H
// Aout_r = Ar'+Br'+Cr'
sod2aaii d5,d13,d1
; Aout_i = Ai + M2_2_H
// Aout_i = Ai'+Bi'+Ci'
mac -d6,d7,d4
; M3_1_H = Ar - M2_1_H*1/2
// Ar' - (Br'+Cr')*1/2
mac -d13,d7,d5
; M3_2_H = Ai - M2_2_H*1/2
// Ai' - (Bi'+Ci')*1/2
move.l d13,d8
move.l (r0)+,d2
; v_IB1=Br:Bi
]
[
mac2aassi d8,d10,d4.H,d4 ; Bout_r:Cout_r
MPYRE d2,d9,d12
; M1_1 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d2,d9,d13
; M1_2 = Bi' = (Br*Wbi + Bi*Wbr)
move.l (r0)+,d8
; v_IC1=Cr:Ci
move.l (r6),d10
; v_WC1=Wcr:Wci
]
[
mac2assar d8,d10,d12.H,d12
; M2_1 = (Br'+Cr'):(Br'-Cr')
mac2aassi d8,d10,d13.H,d13
; M2_2 = (Bi'+Ci'):(Bi'-Ci')
move.f (r15),d10
; M3_4 = (sqrt(3)/2):(0)
move.l d6,d8
]
[
ift addl2a n0,r5
ift doen2 r12
ifa
sod2aaii d14,d12,d2
; Aout_r = Ar + M2_1_H
// Aout_r = Ar'+Br'+Cr'
sod2aaii d15,d13,d3
; Aout_i = Ai + M2_2_H
// Aout_i = Ai'+Bi'+Ci'
;stall
]
[
iff doen2 r13
ifa
mac -d12,d7,d14
; M3_1_H = Ar - M2_1_H*1/2
// Ar' - (Br'+Cr')*1/2
mac2ssaai d8,d10,d5.H,d5
; Bout_i:Cout_i
move.l d13,d8
]
[
mac2aassi d8,d10,d14.H,d14 ; Bout_r:Cout_r
mac -d13,d7,d15
; M3_2_H = Ai - M2_2_H*1/2
// Ai' - (Bi'+Ci')*1/2
move.l d12,d8
move.2f (r0)+,d12:d13
; IAre:IAim=Ar:Ai
]
[
mac2ssaai d8,d10,d15.H,d15 ; Bout_i:Cout_i
moverh.4f d0:d1:d2:d3,(r8)+ ; out:[Aout_r,Aout_i]
move.l (r0)+,d2
; v_IB1=Br:Bi
]
radix3_dit_m_stage_loop2_2:
; --> Subgroup
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
49
DFT Reference Code
loopstart2
[ ;01
tfr d12,d4
tfr d13,d5
MOVERH.4F d4:d5:d14:d15,(r9)+
; out:[Bout_r,Bout_i],[Bout_r,Bout_i]
MOVERL.4F d4:d5:d14:d15,(r10)+
; out:[Cout_r,Cout_i],[Cout_r,Cout_i]
]
[ ;02
MPYRE d2,d9,d12
; M1_1 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d2,d9,d13
; M1_2 = Bi' = (Br*Wbi + Bi*Wbr)
move.l (r0)+,d8
; v_IC1=Cr:Ci
move.l (r6),d10
; v_WC1=Wcr:Wci
]
[ ;03
mac2assar d8,d10,d12.H,d6
; M2_1 = (Br'+Cr'):(Br'-Cr')
mac2aassi d8,d10,d13.H,d13
; M2_2 = (Bi'+Ci'):(Bi'-Ci')
move.f (r15),d10
; M3_4 = (sqrt(3)/2):(0)
move.2f (r0)+,d14:d15
; IAre:IAim=Ar:Ai
]
;STALL
[ ;05
sod2aaii d4,d6,d0
; Aout_r = Ar + M2_1_H
// Aout_r = Ar'+Br'+Cr'
sod2aaii d5,d13,d1
; Aout_i = Ai + M2_2_H
// Aout_i = Ai'+Bi'+Ci'
mac -d6,d7,d4
; M3_1_H = Ar - M2_1_H*1/2
// Ar' - (Br'+Cr')*1/2
mac -d13,d7,d5
; M3_2_H = Ai - M2_2_H*1/2
// Ai' - (Bi'+Ci')*1/2
move.l d13,d8
move.l (r0)+,d2
; v_IB1=Br:Bi
]
[ ;06
mac2aassi d8,d10,d4.H,d4 ; Bout_r:Cout_r
MPYRE d2,d9,d12
; M1_1 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d2,d9,d13
; M1_2 = Bi' = (Br*Wbi + Bi*Wbr)
move.l (r0)+,d8
; v_IC1=Cr:Ci
move.l (r6),d10
; v_WC1=Wcr:Wci
]
[ ;07
mac2assar d8,d10,d12.H,d12
; M2_1 = (Br'+Cr'):(Br'-Cr')
mac2aassi d8,d10,d13.H,d13
; M2_2 = (Bi'+Ci'):(Bi'-Ci')
move.f (r15),d10
; M3_4 = (sqrt(3)/2):(0)
move.l d6,d8
]
;stall
[ ;09
sod2aaii d14,d12,d2
; Aout_r = Ar + M2_1_H
// Aout_r = Ar'+Br'+Cr'
sod2aaii d15,d13,d3
; Aout_i = Ai + M2_2_H
// Aout_i = Ai'+Bi'+Ci'
mac -d12,d7,d14
; M3_1_H = Ar - M2_1_H*1/2
// Ar' - (Br'+Cr')*1/2
mac2ssaai d8,d10,d5.H,d5
; Bout_i:Cout_i
move.l d13,d8
]
[ ;10
mac2aassi d8,d10,d14.H,d14 ; Bout_r:Cout_r
mac -d13,d7,d15
; M3_2_H = Ai - M2_2_H*1/2
// Ai' - (Bi'+Ci')*1/2
move.l d12,d8
move.2f (r0)+,d12:d13
; IAre:IAim=Ar:Ai
]
[ ;11
mac2ssaai d8,d10,d15.H,d15 ; Bout_i:Cout_i
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
50
Freescale Semiconductor
DFT Reference Code
MOVERH.4F d0:d1:d2:d3,(r8)+ ; out:[Aout_r,Aout_i]
move.l (r0)+,d2
; v_IB1=Br:Bi
]
loopend2
[
tfr d12,d4
tfr d13,d5
MOVERH.4F d4:d5:d14:d15,(r9)+ ; out:[Bout_r,Bout_i],[Bout_r,Bout_i]
MOVERL.4F d4:d5:d14:d15,(r10)+ ; out:[Cout_r,Cout_i],[Cout_r,Cout_i]
]
addl2a n0,r6
loopend0
[
move.l r3,d3
move.l (sp-40),r3
; Load the stage counter
]
[
move.l d3,SR
move.l (sp-16),r2
; num_radix
]
adda #2,r3,r3
; +2
[
move.l r3,(sp-40)
; Update the stage counter
adda r3,r2
; num_butterfly[s]
]
move.w (r2),r0
; num_radix[s]
cmpeqa.w #3,r0
; num_radix[s]=3 or not
[
jt radix3_dit_m_stage
; Go to radix-3 Middle stage processing if num_radix[s]=3
cmpeqa.w #5,r0
; num_radix[s]=5 or not
]
[
jt radix5_dit_m_stage
; Go to radix-5 Middle stage processing if num_radix[s]=5
cmpeqa.w #1,r0
; num_radix[s]=1 or not
]
jt radix5_dit_last_stage
; Go to radix-5 Last stage processing if num_radix[s]=1
;--------------------------------------------------------; Radix-3 (Last stage)
;--------------------------------------------------------radix3_dit_last_stage:
IF @DEF('FIX_SCALE')
[
move.l (sp-40),r3
move.l (sp-28),r2
]
[
move.l (sp-24),r4
move.l (sp-44),r12
]
[
move.l (sp-12),r8
move.l (sp-48),r13
]
[
clr d15
; Stage counter
; num_radix_offse
; num_subgroup
; r12 -> Wb
; A output pointer
; r13 -> Wc
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
51
DFT Reference Code
move.l (sp-8),r0
; A input pointer
move.l (sp-68),r1
; psiScale
]
[
add #SCALE_RADIX_3,d15
adda r3,r2
; num_radix_offse[s]
dosetup0 radix3_dit_last_stage_loop1
]
[
move.w (r2),r15
; num_radix_offse[s]
adda r3,r4
; num_subgroup[s]
]
[
move.w (r4),r14
; num_subgroup[s]
tfra r8,r9
; A output
]
[
move.l d15,(sp-64)
; Set norm to stack
adda r15,r9
; B output pointer (long)
]
[
tfra r9,r10
; B output
move.w d15,(r1)+
; psiScale <- norm
]
[
adda r15,r10
; C output pointer (long)
move.l #radix3parameter,r15 ; Point to the radix-3 parameter array
]
ELSE
;//IF @DEF('FIX_SCALE')
;-- Find max value (real & imag) & stage initialization
[
move.l (sp-8),r0
; input base (A)
move.l (sp-60),r2
; DFT point
]
[
move.w (r2),d15
; DFT point
]
[
sub #8,d15
; d15-8
]
[
asrr #2,d15
; (d15-8)/4
]
[
adda #8,r0,r1
; Input address
move.w #2,n0
]
[
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
move.2l (r0)+n0,d4:d5
move.2l (r1)+n0,d6:d7
]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
52
Freescale Semiconductor
DFT Reference Code
[
abs2 d4,d4
abs2 d6,d6
doensh0 d15
]
abs2 d5,d5
abs2 d7,d7
; loop count for max search
; Loop for max (real,imag) search
loopstart0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
]
loopend0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.l (sp-20),r2
; num_butterfly
move.l (sp-24),r4
; num_subgroup
]
[
tfr d6,d0
tfr d7,d1
move.l (sp-40),r3
; Stage counter
move.l (sp-28),r2
; num_radix_offse
]
[
max2 d0,d4
max2 d1,d5
move.l (sp-44),r12
; r12 -> Wb
move.l (sp-48),r13
; r13 -> Wc
]
[
tfr d5,d0
move.l (sp-8),r0
; A input pointer
move.l (sp-12),r8
; A output pointer
]
[
max2 d0,d4
move.l (sp-68),r1 ; psiScale
]
[
sxt.w d4,d0
]
[
extract #16,#16,d4,d4
]
[
max d0,d4
; Max real/imag
adda r3,r2
; num_radix_offse[s]
adda r3,r4
; num_subgroup[s]
]
[
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
53
DFT Reference Code
clb d4,d0
; count reading bit
dosetup0 radix3_dit_last_stage_loop1
move.w (r2),r15
; num_radix_offse[s]
]
[
neg d0
; negates
move.w (r4),r14
; num_subgroup[s]
tfra r8,r9
; A output
]
[
sub #16,d0
; limit 16-bit
move.w #S32,d2
adda r15,r9
; B output pointer (long)
]
[
clr d11
cmpgt.w #S31,d0
; Check if norm > S or norm <= S
move.w #S31,d1
tfra r9,r10
; B output
]
[
iff sub d0,d1,d15 ; scale down = S - d0 if norm <=S
ift sub d0,d2,d15 ; scale up
= S - d0 if norm > S
]
[
max d11,d15
; shift shouldn't be to the left
adda r15,r10
; C output pointer (long)
move.l #radix3parameter,r15 ; Point to the radix-3 parameter array
]
[
move.l d15,(sp-64)
; Set norm to stack
move.w d15,(r1)+
; psiScale <- norm
]
ENDIF
;// IF @DEF('FIX_SCALE')
[
move.l r1,(sp-68)
; psiScale
move.l d15,r1
]
;--- Start Radix-3 (Last stage)
[
asra r14
move.l SR,d3
]
[
move.l d3,r3
bmset #$1000,SR.L
]
[
doen0 r14
]
; Inner loop
[
move.2f (r0)+,d4:d5
bmclr #$0080,SR.L
]
; W20=1
; set the loop
; IAre:IAim=Ar:Ai
; SM2=0
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
54
Freescale Semiconductor
DFT Reference Code
[
move.w (r4),r14
; num_subgroup[s]
bmclr #$0030,SR.L
; SCM=00
]
[
cmpeqa.w #1,r1
move.l (r0)+,d12
; v_IB1=Br:Bi
]
[
ift bmset #$0010,SR.L
; SCM=01
ifa
move.2l (r12)+,d8:d9
; v_WB1=Wbr:Wbi
]
[
MPYRE d12,d8,d12
; M1_1 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d12,d8,d13
; M1_2 = Bi' = (Br*Wbi + Bi*Wbr)
move.l (r0)+,d8
; v_IC1=Cr:Ci
move.l (r13)+,d10
; v_WC1=Wcr:Wci
]
[
mac2assar d8,d10,d12.H,d6
; M2_1 = (Br'+Cr'):(Br'-Cr')
move.f #$4000,d7
; M2_3 = (1/2):0
cmpeqa.w #2,r1
]
[
ift bmset #$0030,SR.L
; SCM=11
ifa
mac2aassi d8,d10,d13.H,d13
; M2_2 = (Bi'+Ci'):(Bi'-Ci')
move.f (r15),d10
; M3_4 = (sqrt(3)/2):(0)
]
[
bmtsts #$0001,r14.l
; if num_subgroup[s] is odd, T=1
move.2f (r0)+,d14:d15
; IAre:IAim=Ar:Ai
]
FALIGN
radix3_dit_last_stage_loop1:
loopstart0
[ ;01
sod2aaii d4,d6,d0
; Aout_r = Ar + M2_1_H
// Aout_r
sod2aaii d5,d13,d1
; Aout_i = Ai + M2_2_H
// Aout_i
mac -d6,d7,d4
; M3_1_H = Ar - M2_1_H*1/2
//
mac -d13,d7,d5
; M3_2_H = Ai - M2_2_H*1/2
//
move.l d13,d8
move.l (r0)+,d12
; v_IB1=Br:Bi
]
[ ;02
mac2aassi d8,d10,d4.H,d4 ; Bout_r:Cout_r
MPYRE d12,d9,d12
; M1_1 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d12,d9,d13
; M1_2 = Bi' = (Br*Wbi + Bi*Wbr)
move.l (r0)+,d8
; v_IC1=Cr:Ci
move.2l (r13)+,d10:d11
; v_WC1=Wcr:Wci
]
[ ;03
mac2assar d8,d10,d12.H,d12
; M2_1 = (Br'+Cr'):(Br'-Cr')
mac2aassi d8,d10,d13.H,d13
; M2_2 = (Bi'+Ci'):(Bi'-Ci')
tfr d6,d9
move.f (r15),d10
; M3_4 = (sqrt(3)/2):(0)
= Ar'+Br'+Cr'
= Ai'+Bi'+Ci'
Ar' - (Br'+Cr')*1/2
Ai' - (Bi'+Ci')*1/2
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
55
DFT Reference Code
]
[ ;04
sod2aaii d14,d12,d2
; Aout_r = Ar + M2_1_H
// Aout_r = Ar'+Br'+Cr'
sod2aaii d15,d13,d3
; Aout_i = Ai + M2_2_H
// Aout_i = Ai'+Bi'+Ci'
move.l d13,d8
]
[ ;05
tfr d5,d0
mac -d12,d7,d14
; M3_1_H = Ar - M2_1_H*1/2
// Ar' - (Br'+Cr')*1/2
mac -d13,d7,d15
; M3_2_H = Ai - M2_2_H*1/2
// Ai' - (Bi'+Ci')*1/2
MOVERH.4F d0:d1:d2:d3,(r8)+
; out:[Aout_r,Aout_i]
move.2f (r0)+,d2:d3
; IAre:IAim=Ar:Ai
]
[ ;06
mac2ssaai d9,d10,d0.H,d5
; Bout_i:Cout_i
mac2aassi d8,d10,d14.H,d14 ; Bout_r:Cout_r
move.l d12,d8
move.l (r0)+,d12
; v_IB1=Br:Bi
]
[ ;07
mac2ssaai d8,d10,d15.H,d15 ; Bout_i:Cout_i
tfr d11,d10
move.l (r12)+,d9
; v_WB1=Wbr:Wbi
move.l (r0)+,d8
; v_IC1=Cr:Ci
]
[ ;08
MPYRE d12,d9,d12
; M1_1 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d12,d9,d13
; M1_2 = Bi' = (Br*Wbi + Bi*Wbr)
move.l (r12)+,d9
; v_WB1=Wbr:Wbi
]
[ ;09
tfr d2,d4
tfr d3,d5
mac2assar d8,d10,d12.H,d6
; M2_1 = (Br'+Cr'):(Br'-Cr')
mac2aassi d8,d10,d13.H,d13
; M2_2 = (Bi'+Ci'):(Bi'-Ci')
MOVERH.4F d4:d5:d14:d15,(r9)+ ; out:[Bout_r,Bout_i],[Bout_r,Bout_i]
MOVERL.4F d4:d5:d14:d15,(r10)+; out:[Cout_r,Cout_i],[Cout_r,Cout_i]
]
[ ;10
move.f (r15),d10
; M3_4 = (sqrt(3)/2):(0)
move.2f (r0)+,d14:d15
; IAre:IAim=Ar:Ai
;stall
]
loopend0
[
sod2aaii d4,d12,d0
; Aout_r = Ar + M2_1_H
// Aout_r = Ar'+Br'+Cr'
sod2aaii d5,d13,d1
; Aout_i = Ai + M2_2_H
// Aout_i = Ai'+Bi'+Ci'
mac -d12,d7,d4
; M3_1_H = Ar - M2_1_H*1/2
// Ar' - (Br'+Cr')*1/2
mac -d13,d7,d5
; M3_2_H = Ai - M2_2_H*1/2
// Ai' - (Bi'+Ci')*1/2
move.l d13,d8
]
[
mac2aassi d8,d10,d4.H,d4
; Bout_r:Cout_r
move.l d12,d8
]
[
ift
MOVERH.4F d0:d1:d2:d3,(r8)+
; out:[Aout_r,Aout_i]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
56
Freescale Semiconductor
DFT Reference Code
ifa
mac2ssaai d8,d10,d5.H,d5
; Bout_i:Cout_i
move.l r3,d6
]
[
ift
MOVERH.4F d4:d5:d14:d15,(r9)+
; out:[Bout_r,Bout_i],[Bout_r,Bout_i]
ift
MOVERL.4F d4:d5:d14:d15,(r10)+
; out:[Cout_r,Cout_i],[Cout_r,Cout_i]
]
[
jmp exit_mixed_radix
move.l d6,SR
]
;///////////////////////////////////////////////////////////////////////
; Radix-4 Stages:
;///////////////////////////////////////////////////////////////////////
radix4_init:
;--------------------------------------------------------; Radix-4 (1st stage)
;--------------------------------------------------------radix4_dit_1_stage:
IF @DEF('FIX_SCALE')
[
clr d15
move.l (sp-20),r15
; num_butterfly
move.l (sp-28),r4
; num_radix_offset
]
[
add #SCALE_RADIX_4,d15
move.l (sp-8),r0
; A input pointer
move.l (sp-12),r8
; A output pointer
]
[
dosetup0 radix4_dit_1_stage_loop1
move.l (sp-68),r2
; psiScale
]
[
move.l d15,(sp-64)
; Set norm to stack
move.l d15,r5
; psiScale <- norm
]
[
move.w (r4),r14
; num_radix_offset[s]
move.w (r15),r15
; num_butterfly[s]
]
; For next stage input/output
[
move.l r8,(sp-8)
; base_address at the next stage
move.l r0,(sp-12)
; out_address at the next stage
]
[
tfra r8,r9
; A output
move.w d15,(r2)+
; psiScale <- norm
]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
57
DFT Reference Code
[
move.l r2,(sp-68)
; psiScale
adda r14,r9
; B output
]
[
tfra r9,r10
; B output
suba #1,r15
]
[
adda r14,r10
; C output
doen0 r15
; loop0 = num_butterfly[s]
]
[
tfra r10,r11
; C output
move.l #(_psiDigitReversedAddress_absolut_address),r15
]
ELSE
;//IF @DEF('FIX_SCALE')
;-- Find max value (real & imag) & stage initialization
[
move.l (sp-8),r0
; input base (A)
move.l (sp-60),r2
; DFT point
]
move.w #2,n0
[
move.w (r2),d15
; DFT point
adda #8,r0,r1
; Input address
]
[
sub #8,d15
; d15-8
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
asrr #2,d15
; (d15-8)/4
move.l (sp-20),r2
; num_butterfly
move.l (sp-24),r4
; num_subgroup
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
move.2l (r0)+n0,d4:d5
move.2l (r1)+n0,d6:d7
]
[
abs2 d4,d4
abs2 d5,d5
abs2 d6,d6
abs2 d7,d7
doensh0 d15
; loop count for max search
]
; Loop for max (real,imag) search
loopstart0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
58
Freescale Semiconductor
DFT Reference Code
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
]
loopend0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
]
[
tfr d6,d0
tfr d7,d1
dosetup0 radix4_dit_1_stage_loop1
]
[
max2 d0,d4
max2 d1,d5
move.l (sp-8),r0
; A input pointer
move.l (sp-12),r9
; A output pointer
]
[
tfr d5,d0
]
[
max2 d0,d4
move.w (r4),r5
; num_subgroup[s]
move.l (sp-28),r4
; num_radix_offset
]
[
sxt.w d4,d0
move.w (r2),r15
; num_butterfly[s]
move.w #S44,d2
]
[
extract #16,#16,d4,d4
move.l (sp-68),r2
; psiScale
]
[
max d0,d4
; Max real/imag
move.w #S43,d1
]
[
clb d4,d0
; count leading bit
]
[
neg d0
; negates
move.w (r4),r14
; num_radix_offset[s]
suba #1,r15
]
[
clr d14
sub #16,d0
; limit 16-bit
suba #1,r5
; r5 - 1
tfra r9,r8
; A output
]
[
add #1,d14
cmpgt.w #S43,d0 ; Check if siNorm > S43
move.w #2,d15
; scale down = 2
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
59
DFT Reference Code
adda r14,r9
; B output
]
[
ift clr d15
; scale down = 0
if siNorm > S43+1
ifa
cmpeq.w #S43,d0 ; Check if siNorm = S43
tfra r9,r10
; B output
]
[
tfrt d14,d15
; scale down = 1
if siNorm = S43
adda r14,r10
; C output
]
[
move.w d15,(r2)+
; psiScale <- norm
tfra r10,r11
; C output
]
[
move.l d15,(sp-64) ; Set norm to stack
move.l r2,(sp-68)
; psiScale
]
[
doen0 r15
; loop0 = num_butterfly[s]
move.l d15,r5
; psiScale <- norm
]
; For next stage input/output
[
move.l r8,(sp-8)
; base_address at the next stage
move.l r0,(sp-12)
; out_address at the next stage
]
move.l #(_psiDigitReversedAddress_absolut_address),r15
ENDIF
;// IF @DEF('FIX_SCALE')
[
adda r14,r11
; D output
;--- Start Radix-4 (1st stage)
move.l (r15)+,r0
]
[
move.l SR,d3
move.l (r15)+,r1
]
[
move.l d3,r4
move.l (r15)+,r2
]
[
bmset #$1000,SR.L
move.l (r15)+,r3
]
[
bmclr #$0080,SR.L
MOVE2.2F (r0),d0
]
[
bmclr #$0030,SR.L
move.l (r15)+,r0
; digit_reversed_address(i+0)
; digit_reversed_address(i+1)
; digit_reversed_address(i+2)
; W20=1
; digit_reversed_address(i+3)
; SM2=0
; d0=Ar:Ai
; SCM=00
; digit_reversed_address(i+0)
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
60
Freescale Semiconductor
DFT Reference Code
]
[
cmpeqa.w #1,r5
MOVE2.2F (r1),d2
; d2=Br:Bi
]
[
ift bmset #$0010,SR.L
; SCM=01
ifa
move.l (r15)+,r1
; digit_reversed_address(i+1)
]
[
cmpeqa.w #2,r5
MOVE2.2F (r2),d1
; d1=Cr:Ci
]
[
ift bmset #$0030,SR.L
; SCM=11
ifa
move.l (r15)+,r2
; digit_reversed_address(i+2)
]
[
MOVE2.2F (r3),d3
; d3=Dr:Di
move.l (r15)+,r3
; digit_reversed_address(i+3)
]
[
sod2aaii d1,d0,d4
; d0=(Ar+Cr),(Ai+Ci)
sod2ssii d1,d0,d5
; d1=(Ar-Cr),(Ai-Ci)
sod2aaii d3,d2,d6
; d2=(Br+Dr),(Bi+Di)
sod2ssii d3,d2,d7
; d3=(Br-Dr),(Bi-Di)
MOVE2.2F (r0),d0
; d0=Ar:Ai
]
FALIGN
radix4_dit_1_stage_loop1:
loopstart0
[
MOVE2.2F (r1),d2
; d2=Br:Bi
move.l (r15)+,r0
; digit_reversed_address(i+0)
]
[
sod2aaii d6,d4,d4
; d4=(Ar':Ai')=((Ar+Br)+(Cr+Dr),(Ai+Bi)+(Ci+Di))
sod2saxx d7,d5,d5
; d5=(Br':Bi')=((Ar-Cr)+(Bi-Di),(Ai-Ci)-(Br-Dr))
sod2ssii d6,d4,d6
; d6=(Cr':Ci')=((Ar+Cr)-(Br+Dr),(Ai+Ci)-(Bi+Di))
sod2asxx d7,d5,d7
; d7=(Dr':Di')=((Ar-Cr)-(Bi-Di),(Ai-Ci)+(Br-Dr))
MOVE2.2F (r2),d1
; d1=Cr:Ci
move.l (r15)+,r1
; digit_reversed_address(i+1)
]
[
MOVER2.2F d4,(r8)+ ; out:[Aout_r,Aout_i], out:[Bout_r,Bout_i]
move.l (r15)+,r2
; digit_reversed_address(i+2)
]
[
MOVER2.2F d5,(r9)+ ; out:[Aout_r,Aout_i], out:[Bout_r,Bout_i]
MOVE2.2F (r3),d3
; d3=Dr:Di
]
[
MOVER2.2F d6,(r10)+ ; out:[Cout_r,Cout_i], out:[Dout_r,Dout_i]
move.l (r15)+,r3
; digit_reversed_address(i+3)
]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
61
DFT Reference Code
[
sod2aaii d1,d0,d4
; d0=(Ar+Cr),(Ai+Ci)
sod2ssii d1,d0,d5
; d1=(Ar-Cr),(Ai-Ci)
sod2aaii d3,d2,d6
; d2=(Br+Dr),(Bi+Di)
sod2ssii d3,d2,d7
; d3=(Br-Dr),(Bi-Di)
MOVER2.2F d7,(r11)+ ; out:[Cout_r,Cout_i], out:[Dout_r,Dout_i]
MOVE2.2F (r0),d0
; d0=Ar:Ai
]
loopend0
[
sod2aaii d6,d4,d4
; d4=(Ar':Ai')=((Ar+Br)+(Cr+Dr),(Ai+Bi)+(Ci+Di))
sod2saxx d7,d5,d5
; d5=(Br':Bi')=((Ar-Cr)+(Bi-Di),(Ai-Ci)-(Br-Dr))
sod2ssii d6,d4,d6
; d6=(Cr':Ci')=((Ar+Cr)-(Br+Dr),(Ai+Ci)-(Bi+Di))
sod2asxx d7,d5,d7
; d7=(Dr':Di')=((Ar-Cr)-(Bi-Di),(Ai-Ci)+(Br-Dr))
move.l (sp-16),r2
; num_radix
move.l (sp-40),r3
; Load the stage counter
]
[
MOVER2.2F d4,(r8)+
; out:[Aout_r,Aout_i]
MOVER2.2F d5,(r9)+
; out:[Bout_r,Bout_i]
]
[
MOVER2.2F d6,(r10)+
; out:[Cout_r,Cout_i]
move.l r4,d3
]
[
MOVER2.2F d7,(r11)+
; out:[Dout_r,Dout_i]
adda #2,r3,r3
; +2
]
[
move.l r3,(sp-40)
; Update the stage counter
adda r3,r2
; num_butterfly[s]
]
[
move.w (r2),r0
; num_radix[s]
move.l d3,SR
]
[
cmpeqa.w #0,r0
; num_radix[s]=0 or not
]
[
jt radix3_dit_last_stage ; Go to radix-3 Last stage processing if num_radix[s]=0
cmpeqa.w #3,r0
; num_radix[s]=3 or not
]
[
jt radix3_dit_m_stage
; Go to radix-3 Middle stage processing if num_radix[s]=3
cmpeqa.w #5,r0
; num_radix[s]=5 or not
]
[
jt radix5_dit_m_stage
; Go to radix-5 Middle stage processing if num_radix[s]=5
cmpeqa.w #1,r0
; num_radix[s]=1 or not
]
jt radix5_dit_last_stage ; Go to radix-5 Last stage processing if num_radix[s]=1
;--------------------------------------------------------; Radix-4 (Middle stage)
;--------------------------------------------------------radix4_dit_m_stage:
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
62
Freescale Semiconductor
DFT Reference Code
IF @DEF('FIX_SCALE')
[
clr d15
move.l (sp-40),r3
; Stage counter
dosetup0 radix4_dit_m_stage_loop1
; radix4 loop1 (outer loop)
]
[
add #SCALE_RADIX_4,d15
move.l (sp-20),r2
; num_butterfly
move.l (sp-24),r4
; num_subgroup
]
[
move.l (sp-44),r5
; r5 -> Wb
move.l (sp-48),r6
; r6 -> Wc
]
[
move.l (sp-28),r12
; num_radix_offset
move.l (sp-52),r7
; r7 -> Wd
]
[
dosetup1 radix4_dit_m_stage_loop2_1
; radix4 loop2_1 (inner loop)
move.l (sp-8),r0
; A input pointer
]
[
adda r3,r2
; num_butterfly[s]
adda r3,r4
; num_subgroup[s]
]
[
move.w (r2),r13
; num_butterfly[s]
move.l (sp-68),r2 ; psiScale
]
[
move.w (r4),r14
; num_subgroup[s]
move.l (sp-12),r9 ; A output pointer
]
[
dosetup2 radix4_dit_m_stage_loop2_2
; radix4 loop2_2 (inner loop)
adda r3,r12
; num_radix_offset[s]
]
[
move.w #3,n3
; output offset for radix-4
]
ELSE
;//IF @DEF('FIX_SCALE')
[
move.l (sp-40),r3
; Stage counter
]
;-- Find max value (real & imag) & stage initialization
[
move.l (sp-8),r0
; input base (A)
move.l (sp-60),r2
; DFT point
]
[
move.w (r2),d15
; DFT point
]
[
sub #8,d15
; d15-8
]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
63
DFT Reference Code
[
asrr #2,d15
; (d15-8)/4
]
[
adda #8,r0,r1
; Input address
move.w #2,n3
]
[
move.2l (r0)+n3,d0:d1
move.2l (r1)+n3,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
move.2l (r0)+n3,d4:d5
move.2l (r1)+n3,d6:d7
]
[
abs2 d4,d4
abs2 d5,d5
abs2 d6,d6
abs2 d7,d7
doensh0 d15
; loop count for max search
]
; Loop for max (real,imag) search
loopstart0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.2l (r0)+n3,d0:d1
move.2l (r1)+n3,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
]
loopend0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.l (sp-20),r2
; num_butterfly
move.l (sp-24),r4
; num_subgroup
]
[
tfr d6,d0
tfr d7,d1
move.l (sp-44),r5
; r5 -> Wb
move.l (sp-48),r6
; r6 -> Wc
]
[
max2 d0,d4
max2 d1,d5
move.l (sp-28),r12
; num_radix_offset
move.l (sp-52),r7
; r7 -> Wd
]
[
tfr d5,d0
dosetup0 radix4_dit_m_stage_loop1
; radix4 loop1 (outer loop)
]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
64
Freescale Semiconductor
DFT Reference Code
[
max2 d0,d4
dosetup1 radix4_dit_m_stage_loop2_1 ; radix4 loop2_1 (inner loop)
]
[
move.l (sp-8),r0
; A input pointer
]
[
sxt.w d4,d0
dosetup2 radix4_dit_m_stage_loop2_2
; radix4 loop2_2 (inner loop)
]
[
extract #16,#16,d4,d4
move.l (sp-12),r9 ; A output pointer
]
[
max d0,d4
; Max real/imag
adda r3,r2
; num_butterfly[s]
adda r3,r4
; num_subgroup[s]
]
[
clb d4,d0
; count reading bit
move.w (r2),r13
; num_butterfly[s]
move.w (r4),r14
; num_subgroup[s]
]
[
move.l (sp-68),r2 ; psiScale
]
[
neg d0
; negates
adda r3,r12
; num_radix_offset[s]
]
[
clr d14
sub #16,d0
; limit 16-bit
move.w #S42,d2
move.w #3,n3
; output offset for radix-4
]
[
add #2,d14
cmpgt.w #(S41+1),d0 ; Check if siNorm > S41+1
move.w #2,d15
; scale down = 2
]
[
ift clr d15
; scale down = 0
if siNorm > S41+1
ifa
cmpeq.w #(S41+1),d0 ; Check if siNorm = S41+1
]
[
tfrt d14,d15
; scale down = 2
if siNorm = S41+1
move.w #S41,d1
]
ENDIF
;// IF @DEF('FIX_SCALE')
[
move.l d15,(sp-64) ; Set norm to stack
move.w d15,(r2)+
; psiScale <- norm
]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
65
DFT Reference Code
[
move.l r2,(sp-68) ; psiScale
move.w (r12),r15
; num_radix_offset[s]
]
[
tfra r13,n0
; The stage twiddle factor's offset
tfra r14,n1
]
; For next stage input/output
[
move.l r9,(sp-8)
; base_address
move.l r0,(sp-12)
; out_address
]
[
tfra r9,r8
; A output
adda r15,r9
; B output pointer (long)
]
[
move.l d15,r1
; psiScale <- norm
tfra r9,r10
; B output
]
;--- Start Radix-4 (Middle stage)
[
adda r15,r10
move.l SR,d3
]
[
move.l d3,r4
bmset #$1000,SR.L
;
]
[
tfra r10,r11
;
bmclr #$0080,SR.L
;
]
[
adda r15,r11
;
bmclr #$0030,SR.L
;
]
[
move.l (sp-44),r5
cmpeqa.w #1,r1
]
[
ift bmset #$0010,SR.L
ifa
move.l (sp-48),r6
]
[
move.l (sp-52),r7
cmpeqa.w #2,r1
]
[
ift bmset #$0030,SR.L
ifa
move.l (r6)+n0,d2
; C output pointer (long)
W20=1
C output
SM2=0
D output pointer (long)
SCM=00
; r5 -> Wb
; SCM=01
; r6 -> Wc
; r7 -> Wd
; SCM=11
; d14=Wcr:Wci
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
66
Freescale Semiconductor
DFT Reference Code
]
[
tfra r13,n0
bmtsts #$0001,r13.l
]
[
asra r13
asra r14
]
[
adda #-1,r13,r12
adda #8,r0,r1
]
[
MOVE2.2F (r0)+,d0
MOVE2.2F (r1)+,d1
]
[ ;01
mac2assar d1,d2,d0.H,d12
mac2aassi
d1,d2,d0.L,d13
MOVE2.2F (r0)+n3,d8
move.l (r5)+n0,d10
]
[ ;02
mac2assar d8,d10,d12.H,d12
mac2aassi
d8,d10,d12.L,d4
mac2aassi
d8,d10,d13.H,d13
mac2saasr
d8,d10,d13.L,d5
move.l (r7)+n0,d11
doen0 r14
]
FALIGN
radix4_dit_m_stage_loop1:
loopstart0
[
doen1 r13
MOVE2.2F (r1)+n3,d9
]
;/// Inter loop
FALIGN
radix4_dit_m_stage_loop2_1
loopstart1
[ ;03
mac2assar d9,d11,d12
mac2ssaai d9,d11,d4
mac2aassi d9,d11,d13
mac2assar d9,d11,d5
; Input address
; d0=Ar:Ai
; d1=Cr:Ci
;
;
;
;
;
;
M1_1.H = IA[re]
M1_1.L = IA[re]
M1_2.H = IA[im]
M1_2.L = IA[im]
d8=Br:Bi
d10=Wbr:Wbi
+
+
-
IC[re]
IC[re]
IC[re]
IC[re]
*
*
*
*
WC[re]
WC[re]
WC[im]
WC[im]
+
+
-
IC[im]
IC[im]
IC[im]
IC[im]
*
*
*
*
WC[im]
WC[im]
WC[re]
WC[re]
;
;
;
;
;
;
;
;
;
;
M2_1.H = M1_1.H + IB[re]
M2_1.L = M1_1.H - IB[re]
M2_2.H = M1_1.L + IB[re]
M2_2.L = M1_1.L - IB[re]
M2_3.H = M1_2.H + IB[re]
M2_3.L = M1_2.H - IB[re]
M2_4.H = M1_2.L - IB[re]
M2_4.L = M1_2.L + IB[re]
d11=Wdr:Wdi
loop0 = num_subgroup[s]
*
*
*
*
*
*
*
*
WB[re]
WB[re]
WB[im]
WB[im]
WB[im]
WB[im]
WB[re]
WB[re]
+
+
+
+
-
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
*
*
*
*
*
*
*
*
WB[im]
WB[im]
WB[re]
WB[re]
WB[re]
WB[re]
WB[im]
WB[im]
; --> Butterflies/Subgroup
; loop1 = num_butterfly[s]
; d9=Dr:Di
; --> Subgroup
; OA[re] = M2_1.H + ID[re]
; OC[re] = M2_1.L - ID[re]
; OB[re] = M2_2.H - ID[re]
; OD[re] = M2_2.L + ID[re]
; OA[im] = M2_3.H + ID[re]
; OC[im] = M2_3.L - ID[re]
; OB[im] = M2_4.H + ID[re]
; OD[im] = M2_4.L - ID[re]
*
*
*
*
*
*
*
*
WD[re]
WD[re]
WD[im]
WD[im]
WD[im]
WD[im]
WD[re]
WD[re]
+
+
+
+
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
*
*
*
*
*
*
*
*
WD[im]
WD[im]
WD[re]
WD[re]
WD[re]
WD[re]
WD[im]
WD[im]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
67
DFT Reference Code
MOVE2.2F (r0)+,d0
MOVE2.2F (r1)+,d8
]
[ ;01
mac2assar d8,d2,d0.H,d6
; d0=Ar:Ai
; d8=Cr:Ci
+
+
-
IC[im]
IC[im]
IC[im]
IC[im]
*
*
*
*
WC[im]
WC[im]
WC[re]
WC[re]
; M2_1.H = M1_1.H + IB[re] * WB[re] ; M2_1.L = M1_1.H - IB[re] * WB[re] +
mac2aassi d8,d10,d6.L,d14 ; M2_2.H = M1_1.L + IB[re] * WB[im] +
; M2_2.L = M1_1.L - IB[re] * WB[im] mac2aassi d8,d10,d7.H,d7
; M2_3.H = M1_2.H + IB[re] * WB[im] +
; M2_3.L = M1_2.H - IB[re] * WB[im] mac2saasr d8,d10,d7.L,d15 ; M2_4.H = M1_2.L - IB[re] * WB[re] +
; M2_4.L = M1_2.L + IB[re] * WB[re] MOVE2.2F (r0)+,d0
; d0=Ar:Ai,
MOVE2.2F (r1)+,d1
; d1=Cr:Ci
]
[ ;03
mac2assar d9,d11,d6
; OA[re] = M2_1.H + ID[re] * WD[re] ; OC[re] = M2_1.L - ID[re] * WD[re] +
mac2ssaai d9,d11,d14
; OB[re] = M2_2.H - ID[re] * WD[im] ; OD[re] = M2_2.L + ID[re] * WD[im] +
mac2aassi d9,d11,d7
; OA[im] = M2_3.H + ID[re] * WD[im] +
; OC[im] = M2_3.L - ID[re] * WD[im] mac2assar d9,d11,d15
; OB[im] = M2_4.H + ID[re] * WD[re] ; OD[im] = M2_4.L - ID[re] * WD[re] +
MOVE2.2F (r0)+n3,d8
; d8=Br:Bi,
MOVE2.2F (r1)+n3,d9
; d9=Dr:Di
]
[ ;01
mac2assar d1,d2,d0.H,d12
; M1_1.H = IA[re] + IC[re] * WC[re] ; M1_1.L = IA[re] - IC[re] * WC[re] +
mac2aassi d1,d2,d0.L,d13
; M1_2.H = IA[im] + IC[re] * WC[im] +
; M1_2.L = IA[im] - IC[re] * WC[im] MOVERH.4F d12:d13:d6:d7,(r8)+
; save OA1:OA2
MOVERL.4F d12:d13:d6:d7,(r10)+
; save OC1:OC2
]
[ ;02
mac2assar d8,d10,d12.H,d12 ; M2_1.H = M1_1.H + IB[re] * WB[re] ; M2_1.L = M1_1.H - IB[re] * WB[re] +
mac2aassi d8,d10,d12.L,d4 ; M2_2.H = M1_1.L + IB[re] * WB[im] +
; M2_2.L = M1_1.L - IB[re] * WB[im] mac2aassi d8,d10,d13.H,d13 ; M2_3.H = M1_2.H + IB[re] * WB[im] +
; M2_3.L = M1_2.H - IB[re] * WB[im] mac2saasr d8,d10,d13.L,d5 ; M2_4.H = M1_2.L - IB[re] * WB[re] +
; M2_4.L = M1_2.L + IB[re] * WB[re] MOVERH.4F d4:d5:d14:d15,(r9)+
; save OB1:OB2
MOVERL.4F d4:d5:d14:d15,(r11)+
; save OD1:OD2
]
loopend1
[
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
*
*
*
*
*
*
*
*
WB[im]
WB[im]
WB[re]
WB[re]
WB[re]
WB[re]
WB[im]
WB[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
*
*
*
*
*
*
*
*
WD[im]
WD[im]
WD[re]
WD[re]
WD[re]
WD[re]
WD[im]
WD[im]
IC[im]
IC[im]
IC[im]
IC[im]
*
*
*
*
WC[im]
WC[im]
WC[re]
WC[re]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
*
*
*
*
*
*
*
*
WB[im]
WB[im]
WB[re]
WB[re]
WB[re]
WB[re]
WB[im]
WB[im]
mac2aassi d8,d2,d0.L,d7
MOVE2.2F (r0)+n3,d8
MOVE2.2F (r1)+n3,d9
]
[ ;02
mac2assar d8,d10,d6.H,d6
; M1_1.H = IA[re] + IC[re] * WC[re]
; M1_1.L = IA[re] - IC[re] * WC[re]
; M1_2.H = IA[im] + IC[re] * WC[im]
; M1_2.L = IA[im] - IC[re] * WC[im]
; d8=Br:Bi
; d9=Dr:Di
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
68
Freescale Semiconductor
DFT Reference Code
iff doen2 r12
iff move.l (r6)+n0,d2
ifa
tfr d9,d3
]
[ ;01
iff move.l (r5)+n0,d10
iff move.l (r7)+n0,d11
ifa
mac2assar d1,d2,d0.H,d12
IC[re]
IC[re]
IC[re]
IC[re]
*
*
*
*
WC[re]
WC[re]
WC[im]
WC[im]
+
+
-
IC[im]
IC[im]
IC[im]
IC[im]
*
*
*
*
WC[im]
WC[im]
WC[re]
WC[re]
; M2_1.H = M1_1.H +
; M2_1.L = M1_1.H d8,d10,d12.L,d4
; M2_2.H = M1_1.L +
; M2_2.L = M1_1.L d8,d10,d13.H,d13 ; M2_3.H = M1_2.H +
; M2_3.L = M1_2.H d8,d10,d13.L,d5
; M2_4.H = M1_2.L ; M2_4.L = M1_2.L +
(r0)+,d0
; d0=Ar:Ai,
(r1)+,d1
; d1=Cr:Ci
IB[re]
IB[re]
IB[re]
IB[re]
IB[re]
IB[re]
IB[re]
IB[re]
*
*
*
*
*
*
*
*
WB[re]
WB[re]
WB[im]
WB[im]
WB[im]
WB[im]
WB[re]
WB[re]
+
+
+
+
-
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
IB[im]
*
*
*
*
*
*
*
*
WB[im]
WB[im]
WB[re]
WB[re]
WB[re]
WB[re]
WB[im]
WB[im]
ID[re]
ID[re]
ID[re]
ID[re]
ID[re]
ID[re]
ID[re]
ID[re]
*
*
*
*
*
*
*
*
WD[re]
WD[re]
WD[im]
WD[im]
WD[im]
WD[im]
WD[re]
WD[re]
+
+
+
+
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
*
*
*
*
*
*
*
*
WD[im]
WD[im]
WD[re]
WD[re]
WD[re]
WD[re]
WD[im]
WD[im]
d1,d2,d0.L,d13
]
[ ;02
mac2assar
d8,d10,d12.H,d12
mac2aassi
mac2saasr
MOVE2.2F
MOVE2.2F
]
[ ;03
mac2assar
; d10=Wbr:Wbi
; d11=Wdr:Wdi
+
+
-
mac2aassi
mac2aassi
; loop1 = num_butterfly[s]
; d2=Wcr:Wci
d9,d11,d12
mac2ssaai
d9,d11,d4
mac2aassi
d9,d11,d13
mac2assar
d9,d11,d5
MOVE2.2F (r0)+n3,d8
MOVE2.2F (r1)+n3,d9
]
[
ift doen2 r13
ift move.l (r6)+n0,d2
]
[ ;01
ift move.l (r5)+n0,d10
ift move.l (r7)+n0,d11
ifa
mac2assar d1,d2,d0.H,d6
mac2aassi
d1,d2,d0.L,d7
]
[ ;02
ifa
mac2assar
d8,d10,d6.H,d6
mac2aassi
d8,d10,d6.L,d14
; M1_1.H =
; M1_1.L =
; M1_2.H =
; M1_2.L =
;
;
;
;
;
;
;
;
IA[re]
IA[re]
IA[im]
IA[im]
OA[re] = M2_1.H
OC[re] = M2_1.L
OB[re] = M2_2.H
OD[re] = M2_2.L
OA[im] = M2_3.H
OC[im] = M2_3.L
OB[im] = M2_4.H
OD[im] = M2_4.L
; d8=Br:Bi,
; d9=Dr:Di
+
+
+
+
-
; loop1 = num_butterfly[s]
; d14=Wcr:Wci
; d10=Wbr:Wbi
; d11=Wdr:Wdi
; M1_1.H =
; M1_1.L =
; M1_2.H =
; M1_2.L =
IA[re]
IA[re]
IA[im]
IA[im]
+
+
-
IC[re]
IC[re]
IC[re]
IC[re]
*
*
*
*
WC[re]
WC[re]
WC[im]
WC[im]
+
+
-
IC[im]
IC[im]
IC[im]
IC[im]
*
*
*
*
WC[im]
WC[im]
WC[re]
WC[re]
; M2_1.H = M1_1.H
; M2_1.L = M1_1.H
; M2_2.H = M1_1.L
; M2_2.L = M1_1.L
+
+
-
IB[re]
IB[re]
IB[re]
IB[re]
*
*
*
*
WB[re]
WB[re]
WB[im]
WB[im]
+
+
-
IB[im]
IB[im]
IB[im]
IB[im]
*
*
*
*
WB[im]
WB[im]
WB[re]
WB[re]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
69
DFT Reference Code
mac2aassi
d8,d10,d7.H,d7
mac2saasr
d8,d10,d7.L,d15
*
*
*
*
WB[im]
WB[im]
WB[re]
WB[re]
+
+
-
IB[im]
IB[im]
IB[im]
IB[im]
*
*
*
*
WB[re]
WB[re]
WB[im]
WB[im]
*
*
*
*
*
*
*
*
WD[re]
WD[re]
WD[im]
WD[im]
WD[im]
WD[im]
WD[re]
WD[re]
+
+
+
+
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
*
*
*
*
*
*
*
*
WD[im]
WD[im]
WD[re]
WD[re]
WD[re]
WD[re]
WD[im]
WD[im]
; M1_1.H = IA[re] + IC[re] * WC[re]
; M1_1.L = IA[re] - IC[re] * WC[re]
mac2aassi d1,d2,d0.L,d13 ; M1_2.H = IA[im] + IC[re] * WC[im]
; M1_2.L = IA[im] - IC[re] * WC[im]
MOVERH.4F d12:d13:d6:d7,(r8)+
; save OA1:OA2
MOVERL.4F d12:d13:d6:d7,(r10)+
; save OC1:OC2
]
[ ;02
mac2assar d8,d10,d12.H,d12
; M2_1.H = M1_1.H + IB[re] *
; IB[im] * WB[im]
; M2_1.L = M1_1.H - IB[re] *
; IB[im] * WB[im]
mac2aassi d8,d10,d12.L,d4
; M2_2.H = M1_1.L + IB[re] *
; IB[im] * WB[re]
; M2_2.L = M1_1.L - IB[re] *
; IB[im] * WB[re]
mac2aassi d8,d10,d13.H,d13
; M2_3.H = M1_2.H + IB[re] *
; IB[im] * WB[re]
; M2_3.L = M1_2.H - IB[re] *
; IB[im] * WB[re]
mac2saasr d8,d10,d13.L,d5
; M2_4.H = M1_2.L - IB[re] *
; IB[im] * WB[im]
; M2_4.L = M1_2.L + IB[re] *
; IB[im] * WB[im]
MOVERH.4F d4:d5:d14:d15,(r9)+
; save OB1:OB2
MOVERL.4F d4:d5:d14:d15,(r11)+
; save OD1:OD2
]
[ ;03
mac2assar d9,d11,d12
; OA[re] = M2_1.H + ID[re] * WD[re]
; OC[re] = M2_1.L - ID[re] * WD[re]
mac2ssaai d9,d11,d4
; OB[re] = M2_2.H - ID[re] * WD[im]
; OD[re] = M2_2.L + ID[re] * WD[im]
mac2aassi d9,d11,d13
; OA[im] = M2_3.H + ID[re] * WD[im]
; OC[im] = M2_3.L - ID[re] * WD[im]
+
+
-
IC[im]
IC[im]
IC[im]
IC[im]
*
*
*
*
WC[im]
WC[im]
WC[re]
WC[re]
*
*
*
*
*
*
WD[im]
WD[im]
WD[re]
WD[re]
WD[re]
WD[re]
MOVE2.2F (r0)+,d0
MOVE2.2F (r1)+,d1
]
FALIGN
radix4_dit_m_stage_loop2_2
loopstart2
[ ;03
mac2assar d9,d11,d6
mac2ssaai d9,d11,d14
mac2aassi d9,d11,d7
mac2assar d9,d11,d15
MOVE2.2F (r0)+n3,d8
MOVE2.2F (r1)+n3,d9
]
[ ;01
mac2assar d1,d2,d0.H,d12
; M2_3.H = M1_2.H
; M2_3.L = M1_2.H
; M2_4.H = M1_2.L
; M2_4.L = M1_2.L
; d0=Ar:Ai,
; d1=Cr:Ci
+
+
IB[re]
IB[re]
IB[re]
IB[re]
; --> Subgroup
; OA[re] = M2_1.H + ID[re]
; OC[re] = M2_1.L - ID[re]
; OB[re] = M2_2.H - ID[re]
; OD[re] = M2_2.L + ID[re]
; OA[im] = M2_3.H + ID[re]
; OC[im] = M2_3.L - ID[re]
; OB[im] = M2_4.H + ID[re]
; OD[im] = M2_4.L - ID[re]
; d8=Br:Bi,
; d9=Dr:Di
WB[re] WB[re] +
WB[im] +
WB[im] WB[im] +
WB[im] WB[re] +
WB[re] -
+
+
+
-
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
ID[im]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
70
Freescale Semiconductor
DFT Reference Code
mac2assar d9,d11,d5
MOVE2.2F (r0)+,d0
MOVE2.2F (r1)+,d1
]
[ ;01
mac2assar d1,d2,d0.H,d6
; OB[im] = M2_4.H + ID[re] * WD[re] - ID[im] * WD[im]
; OD[im] = M2_4.L - ID[re] * WD[re] + ID[im] * WD[im]
; d0=Ar:Ai,
; d1=Cr:Ci
IC[im]
IC[im]
IC[im]
IC[im]
*
*
*
*
WC[im]
WC[im]
WC[re]
WC[re]
; M2_1.H = M1_1.H + IB[re] * WB[re] - IB[im]
; M2_1.L = M1_1.H - IB[re] * WB[re] + IB[im]
mac2aassi d8,d10,d6.L,d14 ; M2_2.H = M1_1.L + IB[re] * WB[im] + IB[im]
; M2_2.L = M1_1.L - IB[re] * WB[im] - IB[im]
mac2aassi d8,d10,d7.H,d7
; M2_3.H = M1_2.H + IB[re] * WB[im] + IB[im]
; M2_3.L = M1_2.H - IB[re] * WB[im] - IB[im]
mac2saasr d8,d10,d7.L,d15 ; M2_4.H = M1_2.L - IB[re] * WB[re] + IB[im]
; M2_4.L = M1_2.L + IB[re] * WB[re] - IB[im]
MOVE2.2F (r0)+,d0
; d0=Ar:Ai,
MOVE2.2F (r1)+,d1
; d1=Cr:Ci
]
loopend2
[ ;03
mac2assar d9,d11,d6
; OA[re] = M2_1.H + ID[re] * WD[re] - ID[im]
; OC[re] = M2_1.L - ID[re] * WD[re] + ID[im]
mac2ssaai d9,d11,d14
; OB[re] = M2_2.H - ID[re] * WD[im] - ID[im]
; OD[re] = M2_2.L + ID[re] * WD[im] + ID[im]
mac2aassi d9,d11,d7
; OA[im] = M2_3.H + ID[re] * WD[im] + ID[im]
; OC[im] = M2_3.L - ID[re] * WD[im] - ID[im]
mac2assar d9,d11,d15
; OB[im] = M2_4.H + ID[re] * WD[re] - ID[im]
; OD[im] = M2_4.L - ID[re] * WD[re] + ID[im]
MOVE2.2F (r0)+n3,d8
; d8=Br:Bi
move.l (r5)+n0,d10
; d10=Wbr:Wbi
]
[
move.l (r6)+n0,d2
; d2 =Wcr:Wci
move.l (r7)+n0,d11
; d11=Wdr:Wdi
]
[ ;01
mac2assar d1,d2,d0.H,d12
; M1_1.H = IA[re] + IC[re] * WC[re] - IC[im]
; M1_1.L = IA[re] - IC[re] * WC[re] + IC[im]
mac2aassi d1,d2,d0.L,d13
; M1_2.H = IA[im] + IC[re] * WC[im] + IC[im]
; M1_2.L = IA[im] - IC[re] * WC[im] - IC[im]
MOVERH.4F d12:d13:d6:d7,(r8)+ ; save OA1:OA2
MOVERL.4F d12:d13:d6:d7,(r10)+ ; save OC1:OC2
]
[ ;02
mac2assar d8,d10,d12.H,d12
; M2_1.H = M1_1.H + IB[re] * WB[re] - IB[im]
; M2_1.L = M1_1.H - IB[re] * WB[re] + IB[im]
mac2aassi d8,d10,d12.L,d4
; M2_2.H = M1_1.L + IB[re] * WB[im] + IB[im]
; M2_2.L = M1_1.L - IB[re] * WB[im] - IB[im]
mac2aassi d8,d10,d13.H,d13
; M2_3.H = M1_2.H + IB[re] * WB[im] + IB[im]
; M2_3.L = M1_2.H - IB[re] * WB[im] - IB[im]
*
*
*
*
*
*
*
*
WB[im]
WB[im]
WB[re]
WB[re]
WB[re]
WB[re]
WB[im]
WB[im]
*
*
*
*
*
*
*
*
WD[im]
WD[im]
WD[re]
WD[re]
WD[re]
WD[re]
WD[im]
WD[im]
*
*
*
*
WC[im]
WC[im]
WC[re]
WC[re]
*
*
*
*
*
*
WB[im]
WB[im]
WB[re]
WB[re]
WB[re]
WB[re]
mac2aassi d1,d2,d0.L,d7
MOVE2.2F (r0)+n3,d8
MOVE2.2F (r1)+n3,d9
]
[ ;02
mac2assar d8,d10,d6.H,d6
; M1_1.H = IA[re] + IC[re] * WC[re]
; M1_1.L = IA[re] - IC[re] * WC[re]
; M1_2.H = IA[im] + IC[re] * WC[im]
; M1_2.L = IA[im] - IC[re] * WC[im]
; d8=Br:Bi,
; d9=Dr:Di
+
+
-
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
71
DFT Reference Code
mac2saasr
d8,d10,d13.L,d5
; M2_4.H = M1_2.L - IB[re] * WB[re] + IB[im] * WB[im]
; M2_4.L = M1_2.L + IB[re] * WB[re] - IB[im] * WB[im]
MOVERH.4F d4:d5:d14:d15,(r9)+ ; save OB1:OB2
MOVERL.4F d4:d5:d14:d15,(r11)+ ; save OD1:OD2
]
loopend0
[
move.l (sp-16),r2
; num_radix
move.l (sp-40),r3
; Load the stage counter
]
move.l r4,d3
[
move.l d3,SR
adda #2,r3,r3
; +2
]
[
move.l r3,(sp-40)
; Update the stage counter
adda r3,r2
; num_butterfly[s]
]
move.w (r2),r0
; num_radix[s]
cmpeqa.w #0,r0
; num_radix[s]=0 or not
[
jt radix3_dit_last_stage ; Go to radix-3 Last stage processing if num_radix[s]=0
cmpeqa.w #3,r0
; num_radix[s]=3 or not
]
[
jt radix3_dit_m_stage
; Go to radix-3 Middle stage processing if num_radix[s]=3
cmpeqa.w #1,r0
; num_radix[s]=1 or not
]
jt radix5_dit_last_stage ; Go to radix-5 Last stage processing if num_radix[s]=1
cmpeqa.w #4,r0
; num_radix[s]=4 or not
]
jt radix4_dit_m_stage
; Go to radix-4 Middle stage processing if num_radix[s]=4
;///////////////////////////////////////////////////////////////////////
; Radix-5 Stages:
;///////////////////////////////////////////////////////////////////////
radix5_init:
;--------------------------------------------------------; Radix-5 (Middle stage)
;--------------------------------------------------------radix5_dit_m_stage:
IF @DEF('FIX_SCALE')
[
clr d0
move.l (sp-20),r2
move.l (sp-24),r4
]
[
add #SCALE_RADIX_5,d0
move.l (sp-40),r3
move.l (sp-44),r5
]
[
; num_butterfly
; num_subgroup
; Stage counter
; r5 -> Wb
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
72
Freescale Semiconductor
DFT Reference Code
move.w #2,n0
dosetup0 radix5_dit_m_stage_loop1
]
[
move.l (sp-48),r6
; r6 -> Wc
move.l (sp-28),r1
; num_radix_offset
]
[
move.l (sp-52),r7
; r7 -> Wd
move.l (sp-56),r8
; r8 -> We
]
[
adda r3,r2
; num_butterfly[s]
adda r3,r4
; num_subgroup[s]
]
[
move.w (r2),r14
; num_butterfly[s] ; butterflies/subgroup
move.w (r4),r13
; num_subgroup[s] ; subgroups/stage
]
[
move.l (sp-12),r9
; A output pointer
move.l (sp-68),r4
; psiScale
]
[
move.l (sp-8),r0
; A input pointer
adda r3,r1
; num_radix_offset[s]
]
[
move.w (r1),r15
; num_radix_offset[s]
]
; For next stage input/output
[
move.l r9,(sp-8)
; base_address
move.l r0,(sp-12)
; out_address
]
[
tfra r14,r2
; Middle stage twiddle factor's offset
tfra r9,r8
; A output
]
ELSE
;//IF @DEF('FIX_SCALE')
;-- Find max value (real & imag) & stage initialization
[
move.l (sp-8),r0
; input base (A)
move.l (sp-60),r2
; DFT point
]
; 3 STALLS on r2
[
move.w (r2),d15
; DFT point
move.l (sp-40),r3
; Stage counter
]
[
sub #8,d15
; d15-8
]
[
asrr #2,d15
; (d15-8)/4
]
[
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
73
DFT Reference Code
adda #8,r0,r1
; Input address
move.w #2,n0
]
[
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
move.2l (r0)+n0,d4:d5
move.2l (r1)+n0,d6:d7
]
[
abs2 d4,d4
abs2 d5,d5
abs2 d6,d6
abs2 d7,d7
doensh0 d15
; loop count for max search
]
; Loop for max (real,imag) search
loopstart0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
]
loopend0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.l (sp-20),r2
; num_butterfly
move.l (sp-24),r4
; num_subgroup
]
[
tfr d6,d0
tfr d7,d1
move.l (sp-44),r5
; r5 -> Wb
dosetup0 radix5_dit_m_stage_loop1
]
[
max2 d0,d4
max2 d1,d5
move.l (sp-48),r6
; r6 -> Wc
move.l (sp-28),r1
; num_radix_offset
]
[
tfr d5,d0
move.l (sp-52),r7
; r7 -> Wd
move.l (sp-56),r8
; r8 -> We
]
[
max2 d0,d4
adda r3,r2
; num_butterfly[s]
adda r3,r4
; num_subgroup[s]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
74
Freescale Semiconductor
DFT Reference Code
]
[
sxt.w d4,d0
move.w (r2),r14
; num_butterfly[s] ; butterflies/subgroup
move.w (r4),r13
; num_subgroup[s] ; subgroups/stage
]
[
extract #16,#16,d4,d4
move.l (sp-8),r0
; A input pointer
move.l (sp-12),r9
; A output pointer
]
[
max d0,d4
; Max real/imag
move.l (sp-68),r4
; psiScale
adda r3,r1
; num_radix_offset[s]
]
; For next stage input/output
[
clb d4,d0
; count reading bit
move.w #S54,d2
move.w (r1),r15
; num_radix_offset[s]
]
[
neg d0
; negates
move.l r9,(sp-8)
; base_address
move.l r0,(sp-12)
; out_address
]
[
sub #16,d0
; limit 16-bit
tfra r14,r2
; Middle stage twiddle factor's offset
tfra r9,r8
; A output
]
[
cmpgt.w #S53,d0
; Check if norm > S or norm <= S
move.w #S53,d1
]
[
iff sub d0,d1,d0
; scale down = S - d0 if norm <= S
ift clr d0
; scale down = 0
if norm > S
]
ENDIF
;// IF @DEF('FIX_SCALE')
[
move.l d0,(sp-64)
; Set norm to stack
move.w d0,(r4)+
; psiScale <- norm
]
[
move.l r4,(sp-68)
; psiScale
adda r15,r9
; B output pointer (long)
]
[
tfra r9,r10
; B output
move.l (sp-44),r4
; r4 -> Wb
]
[
adda r15,r10
; C output pointer (long)
move.l (sp-48),r5
; r5 -> Wc
]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
75
DFT Reference Code
[
tfra r10,r11
; C output
adda #4,r0,r1
; A input
]
[
adda r15,r11
; D output pointer (long)
dosetup1 radix5_dit_m_stage_loop2
]
[
tfra r11,r12
; D output
move.l #radix5parameter,r14 ; Point to the radix-5 parameter array
]
[
adda r15,r12
; E output pointer (long)
move.l #radix5parameter+4,r15 ; Point to the radix-5 parameter array
]
;--- Start Radix-5 (Middle stage)
[ ;
move.2f (r0)+n0,d12:d13
; d12.H = Ar, d13.H = Ai
move.l (r1)+n0,d14
; d14 = Br:Bi
]
[ ;
move.l (r0)+n0,d15
; d15 = Cr:Ci
move.l (sp-64),d8
; Scaling factor
]
[ ;
asrr2 d8,d14
; d14 = Br:Bi>>S
asrr2 d8,d15
; d15 = Cr:Ci>>S
move.l (sp-52),r6
; r6 -> Wd
move.l (sp-56),r7
; r7 -> We
]
[ ;
asrr d8,d12
; d12 Ar>>S
move.l (r0)+,d9
; d9 = Er:Ei
doen0 r13
; num_subgroup ; subgroups/stage
]
FALIGN
radix5_dit_m_stage_loop1:
; --> Butterflies/Subgroup
loopstart0
[
doen1 r2
; num_butterfly[s]
move.l (r4),d2
; d2 = Wbr:Wbi
]
FALIGN
radix5_dit_m_stage_loop2:
; --> Subgroup
loopstart1
[ ; 01
asrr2 d8,d9
; d9 = Er:Ei>>S
MPYRE d2,d14,d6
; d6 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d2,d14,d7
; d7 = Bi' = (Br*Wbi + Bi*Wbr)
tfr d8,d1
; Scaling factor
move.l (r1)+n0,d8
; d8 = Dr:Di
move.l (r5),d10
; d10 = Wcr:Wci
]
[ ; 02
asrr2 d1,d8
; d8 = Dr:Di>>S
pack.2f d7,d6,d0
; d0 = Br':Bi'
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
76
Freescale Semiconductor
DFT Reference Code
MPYRE d10,d15,d14
MPYIM d10,d15,d15
move.l (r7),d3
]
[ ; 03
asrr d1,d13
mac2assar d9,d3,d0.h,d6
mac2aassi d9,d3,d0.l,d7
pack.2f d15,d14,d0
move.l (r6),d11
]
[ ; 04
sod2aaii d12,d6,d4
sod2aaii d13,d7,d5
mac2assar d8,d11,d0.h,d14
mac2aassi d8,d11,d0.l,d15
move.2f (r14),d0:d1
move.2f (r15),d8:d9
]
[ ; 05
move.l d6,d2
move.l d7,d3
sod2aaii d4,d14,d4
sod2aaii d5,d15,d5
]
[ ; 06
mac +d0,d2,d12
mac +d0,d3,d13
tfr d12,d14
tfr d13,d15
move.l d14,d10
move.l d15,d11
]
[ ; 07
mac -d8,d10,d12
mac -d8,d11,d13
mac -d8,d2,d14
mac -d8,d3,d15
adda #4,r0,r1
moves.2f d4:d5,(r8)+
]
[ ; 08
pack.2f d13,d12,d0
mac +d0,d10,d14
mac +d0,d11,d15
move.l (sp-64),d8
move.2f (r0)+n0,d12:d13
]
[ ; 09
mac2aassi d1,d3,d0.h,d4
;
mac2ssaai d1,d2,d0.l,d5
; d14 = Cr' = (Cr*Wcr - Ci*Wci)
; d15 = Ci' = (Cr*Wci + Ci*Wcr)
; d3 = Wer:Wei
;
;
;
;
;
d13 = Ai>>S
M2_1 = (Br'+Er'):(Br'-Er')
M2_2 = (Bi'+Ei'):(Bi'-Ei')
d0 = Cr':Ci'
d11 = Wdr:Wdi
;
;
;
;
;
;
Ar'+Br'+Er'
Ai'+Bi'+Ei'
M2_3 = (Cr'+Dr'):(Cr'-Dr')
M2_4 = (Ci'+Di'):(Ci'-Di')
d0:d1 = [a,b], Radix-5 parameters
d8:d9 = [c,d], Radix-5 parameters
; Aout_r = Ar'+Br'+Cr'+Dr'+Er'
; Aout_i = Ai'+Bi'+Ci'+Di'+Ei'
; Ar' +
; Ai' +
;
;
;
;
;
;
a*(Br'+Er')
a*(Bi'+Ei')
Ar' + (a*(Br'+Er')-c*(Cr'+Dr'))
Ai' + (a*(Bi'+Ei')-c*(Ci'+Di'))
Ar' - c*(Br'+Er')
Ai' - c*(Bi'+Ei')
A input
out:[Aout_r,Aout_i]
;
;
;
;
Ar' - (c*(Br'+Er')-a*(Cr'+Dr'))
Ai' - (c*(Bi'+Ei')-a*(Ci'+Di'))
Scaling factor
d12.H = Ar, d13.H = Ai
M5_1.H = Ar'+(a(Br'+Er')-c(Cr'+Dr'))+(b(Bi'-Ei')),
; M5_1.L = Ar'+(a(Br'+Er'); c(Cr'+Dr'))-(b(Bi'-Ei'))
; M5_2.H = Ai'+(a(Bi'+Ei'); c(Ci'+Di'))-(b(Br'-Er')),
; M5_2.L = Ai'+(a(Bi'+Ei'); c(Ci'+Di'))+(b(Br'-Er'))
pack.2f d15,d14,d0
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
77
DFT Reference Code
move.l (r1)+n0,d14
move.l (r0)+n0,d15
]
[ ; 10
mac2aassi d9,d11,d4
mac2ssaai d9,d10,d5
mac2aassi d9,d3,d0.h,d6
mac2ssaai d9,d2,d0.l,d7
move.l (r0)+,d9
move.l (r4),d2
]
[ ; 11
asrr2 d8,d14
asrr2 d8,d15
mac2saasi d1,d11,d6
mac2assai d1,d10,d7
; d14 = Br:Bi
; d15 = Cr:Ci
; Bout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))+
; (b(Bi'-Ei')+d(Ci'-Di')),
; Eout_r = Ar'+(a(Br'+Er'); c(Cr'+Dr'))-(b(Bi'-Ei')+d(Ci'-Di'))
; Bout_i = Ai'+(a(Bi'+Ei')-c(Ci'+Di')); (b(Br'-Er')+d(Cr'-Dr')), Eout_i =
; Ai'+(a(Bi'+Ei')-c(Ci'+Di'))+
; (b(Br'-Er')+d(Cr'-Dr'))
; M5_3.H = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+(d(Bi'-Ei')),
; M5_3.L = Ar'-(c(Br'+Er')-a(Cr'+Dr'));(d(Bi'-Ei'))
; M5_4.H = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))-(d(Br'-Er')),
; M5_4.L = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))+
; (d(Br'-Er'))
; d9 = Er:Ei
; d2 = Wbr:Wbi
; d14 = Br:Bi>>S
; d15 = Cr:Ci>>S
; Cout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+
;(d(Bi'-Ei')-b(Ci'-Di')), Dout_r =
; Ar'-(c(Br'+Er')-a(Cr'+Dr')); (d(Bi'-Ei')-b(Ci'-Di'))
; Cout_i = Ai'-(c(Bi'+Ei')-a(Ci'+Di')); (d(Br'-Er')-b(Cr'-Dr')), Dout_i =
; Ai'-(c(Bi'+Ei')-a(Ci'+Di'))+
; (d(Br'-Er')-b(Cr'-Dr'))
; out:[Bout_r,Bout_i]
; out:[Eout_r,Eout_i]
moves.2f d4:d5,(r9)+
move.2w d4:d5,(r12)+
]
[ ; 12
asrr d8,d12
; d12 = Ar>>S
moves.2f d6:d7,(r10)+
; out:[Cout_r,Cout_i]
move.2w d6:d7,(r11)+
; out:[Dout_r,Dout_i]
]
loopend1
[
addl2a r2,r4
; updata twiddle factor pointers
addl2a r2,r5
]
[
addl2a r2,r6
addl2a r2,r7
]
loopend0
[
move.l (sp-16),r2
; num_radix
move.l (sp-40),r3
; Load the stage counter
]
; 3 STALLS on r3
adda #2,r3,r3
; +2
[
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
78
Freescale Semiconductor
DFT Reference Code
move.l r3,(sp-40)
; Update the stage counter
adda r3,r2
; num_butterfly[s]
]
move.w (r2),r0
; num_radix[s]
cmpeqa.w #5,r0
; num_radix[s]=5 or not
[
jt radix5_dit_m_stage
; Go to radix-5 Middle stage processing if num_radix[s]=5
cmpeqa.w #1,r0
; num_radix[s]=1 or not
]
jt radix5_dit_last_stage ; Go to radix-5 Last stage processing if num_radix[s]=1
;--------------------------------------------------------; Radix-5 (Last stage)
;--------------------------------------------------------radix5_dit_last_stage:
IF @DEF('FIX_SCALE')
[
clr d12
move.l (sp-40),r3
; Stage counter
move.l (sp-28),r15
; num_radix_offset[0]
]
[
add #SCALE_RADIX_5,d12
move.l (sp-24),r14
; num_subgroup
move.l (sp-12),r9
; A output pointer
]
[
dosetup0 radix5_dit_last_stage_loop1
move.l (sp-8),r0
; A input pointer
]
[
move.w #2,n0
move.l (sp-68),r2
; psiScale
]
[
adda r3,r15
; num_radix_offset[s]
move.l (sp-44),r4
; r4 -> Wb
]
[
adda r3,r14
; siNumSubgroup[s]
move.w (r15),r15
; num_radix_offset[s]
]
[
adda #4,r0,r1
; A input
move.l (sp-48),r5
; r5 -> Wc
]
[
move.l (sp-52),r6
; r6 -> Wd
move.l (sp-56),r7
; r7 -> We
]
[
tfra r9,r8
; A output
move.w (r14),d5
; siNumSubgroup[s]
]
[
adda r15,r9
; B output pointer (long)
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
79
DFT Reference Code
doen0 d5
; Loop for num_butterfly
]
[
tfra r9,r10
; B output
move.l #radix5parameter,r14 ; Point to the radix-5 parameter array
]
[
adda r15,r10
; C output pointer (long)
move.l d12,(sp-64)
; Set norm to stack
]
[
tfra r10,r11
; C output
move.w d12,(r2)+
; psiScale <- norm
]
[
adda r15,r11
; D output pointer (long)
move.l r2,(sp-68)
; psiScale
]
[
tfra r11,r12
; D output
]
[
adda r15,r12
; E output pointer (long)
move.l #radix5parameter+4,r15 ; Point to the radix-5 parameter array
]
ELSE
;//IF @DEF('FIX_SCALE')
;-- Find max value (real & imag) & stage initialization
[
move.l (sp-8),r0
; input base (A)
move.l (sp-60),r2
; DFT point
]
; 3 STALLS on r2
[
move.w (r2),d15
; DFT point
move.l (sp-40),r3
; Stage counter
]
[
sub #8,d15
; d15-8
move.l (sp-28),r2
; num_radix_offset[0]
move.l (sp-24),r14 ; num_subgroup
]
[
asrr #2,d15
; (d15-8)/4
dosetup0 radix5_dit_last_stage_loop1
]
[
adda #8,r0,r1
; Input address
move.w #2,n0
]
[
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
move.2l (r0)+n0,d4:d5
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
80
Freescale Semiconductor
DFT Reference Code
move.2l (r1)+n0,d6:d7
]
[
abs2 d4,d4
abs2 d5,d5
abs2 d6,d6
abs2 d7,d7
doensh0 d15
; loop count for max search
]
; Loop for max (real,imag) search
loopstart0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
move.2l (r0)+n0,d0:d1
move.2l (r1)+n0,d2:d3
]
[
abs2 d0,d0
abs2 d1,d1
abs2 d2,d2
abs2 d3,d3
]
loopend0
[
max2 d0,d4
max2 d1,d5
max2 d2,d6
max2 d3,d7
adda r3,r2
; num_radix_offset[s]
adda r3,r14
; siNumSubgroup[s]
]
[
tfr d6,d0
tfr d7,d1
move.w (r2),r15
; num_radix_offset[s]
move.l (sp-12),r9
; A output pointer
]
[
max2 d0,d4
max2 d1,d5
move.l (sp-8),r0
; A input pointer
move.l (sp-68),r2
; psiScale
]
[
tfr d5,d0
move.l (sp-44),r4
; r4 -> Wb
move.l (sp-48),r5
; r5 -> Wc
]
[
max2 d0,d4
move.l (sp-52),r6
; r6 -> Wd
move.l (sp-56),r7
; r7 -> We
]
[
sxt.w d4,d0
tfra r9,r8
; A output
adda r15,r9
; B output pointer (long)
]
[
extract #16,#16,d4,d4
adda #4,r0,r1
; B input
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
81
DFT Reference Code
tfra r9,r10
; B output
]
[
max d0,d4
; Max real/imag
move.w (r14),d5
; siNumSubgroup[s]
adda r15,r10
; C output pointer (long)
]
[
clb d4,d0
; count reading bit
tfra r10,r11
; C output
doen0 d5
; Loop for num_butterfly
]
[
neg d0
; negates
adda r15,r11
; D output pointer (long)
move.l #radix5parameter,r14 ; Point to the radix-5 parameter array
]
[
sub #16,d0
; limit 16-bit
move.w #S52,d2
tfra r11,r12
; D output
]
[
cmpgt.w #S51,d0
; Check if norm > S or norm <= S
move.w #S51,d1
adda r15,r12
; E output pointer (long)
]
[
iff sub d0,d1,d12
; scale down = S - d0 if norm <= S
ift sub d0,d2,d12
; scale up
= S - d0 if norm > S
]
[
move.l d12,(sp-64)
; Set norm to stack
move.w d12,(r2)+
; psiScale <- norm
]
[
move.l r2,(sp-68)
; psiScale
move.l #radix5parameter+4,r15 ; Point to the radix-5 parameter array
]
ENDIF
;// IF @DEF('FIX_SCALE')
;--- Start Radix-5 (Last stage)
[ ;
move.2f (r0)+n0,d12:d13
; d12.H = Ar, d13.H = Ai
move.l (r1)+n0,d14
; d14 = Br:Bi
]
[ ;
move.l (r0)+n0,d15
; d15 = Cr:Ci
move.l (sp-64),d8
; Scaling factor
]
[ ;
asrr2 d8,d14
; d14 = Br:Bi>>S
asrr2 d8,d15
; d15 = Cr:Ci>>S
]
[ ;
asrr d8,d12
; d12 = Ar>>S
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
82
Freescale Semiconductor
DFT Reference Code
move.l (r0)+,d9
; d9 = Er:Ei
move.l (r4)+,d2
; d2 = Wbr:Wbi
]
FALIGN
radix5_dit_last_stage_loop1:
; --> Butterflies/Subgroup
loopstart0
[ ; 01
asrr2 d8,d9
; d9 = Er:Ei>>S
MPYRE d2,d14,d6
; d6 = Br' = (Br*Wbr - Bi*Wbi)
MPYIM d2,d14,d7
; d7 = Bi' = (Br*Wbi + Bi*Wbr)
tfr d8,d1
; Scaling factor
move.l (r1)+n0,d8
; d8 = Dr:Di
move.l (r5)+,d10
; d10 = Wcr:Wci
]
[ ; 02
asrr2 d1,d8
; d8 = Dr:Di>>S
pack.2f d7,d6,d0
MPYRE d10,d15,d14
; d14 = Cr' = (Cr*Wcr - Ci*Wci)
MPYIM d10,d15,d15
; d15 = Ci' = (Cr*Wci + Ci*Wcr)
move.l (r7)+,d3
; d3 = Wer:Wei
]
[ ; 03
asrr d1,d13
; d13 = Ai>>S
mac2assar d9,d3,d0.h,d6
; M2_1 = (Br'+Er'):(Br'-Er')
mac2aassi d9,d3,d0.l,d7
; M2_2 = (Bi'+Ei'):(Bi'-Ei')
pack.2f d15,d14,d0
move.l (r6)+,d11
; d11 = Wdr:Wdi
]
[ ; 04
sod2aaii d12,d6,d4
; Ar'+Br'+Er'
sod2aaii d13,d7,d5
; Ai'+Bi'+Ei'
mac2assar d8,d11,d0.h,d14 ; M2_3 = (Cr'+Dr'):(Cr'-Dr')
mac2aassi d8,d11,d0.l,d15 ; M2_4 = (Ci'+Di'):(Ci'-Di')
move.2f (r14),d0:d1
; d0:d1 = [a,b], Radix-5 parameters
move.2f (r15),d8:d9
; d8:d9 = [c,d], Radix-5 parameters
]
[ ; 05
move.l d6,d2
move.l d7,d3
sod2aaii d4,d14,d4
; Aout_r = Ar'+Br'+Cr'+Dr'+Er'
sod2aaii d5,d15,d5
; Aout_i = Ai'+Bi'+Ci'+Di'+Ei'
]
[ ; 06
mac +d0,d2,d12
; Ar' + a*(Br'+Er')
mac +d0,d3,d13
; Ai' + a*(Bi'+Ei')
tfr d12,d14
tfr d13,d15
move.l d14,d10
move.l d15,d11
]
[ ; 07
mac -d8,d10,d12
; Ar' + (a*(Br'+Er')-c*(Cr'+Dr'))
mac -d8,d11,d13
; Ai' + (a*(Bi'+Ei')-c*(Ci'+Di'))
mac -d8,d2,d14
; Ar' - c*(Br'+Er')
mac -d8,d3,d15
; Ai' - c*(Bi'+Ei')
adda #4,r0,r1
; B input
moves.2f d4:d5,(r8)+
; out:[Aout_r,Aout_i]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
83
DFT Reference Code
]
[ ; 08
pack.2f d13,d12,d0
mac +d0,d10,d14
mac +d0,d11,d15
move.l (sp-64),d8
move.2f (r0)+n0,d12:d13
]
[ ; 09
mac2aassi d1,d3,d0.h,d4
mac2ssaai d1,d2,d0.l,d5
pack.2f d15,d14,d0
move.l (r1)+n0,d14
move.l (r0)+n0,d15
]
[ ; 10
mac2aassi d9,d11,d4
mac2ssaai d9,d10,d5
mac2aassi d9,d3,d0.h,d6
mac2ssaai d9,d2,d0.l,d7
move.l (r0)+,d9
move.l (r4)+,d2
]
[ ; 11
asrr2 d8,d14
asrr2 d8,d15
mac2saasi d1,d11,d6
mac2assai d1,d10,d7
moves.2f d4:d5,(r9)+
move.2w d4:d5,(r12)+
]
[ ; 12
asrr d8,d12
moves.2f d6:d7,(r10)+
move.2w d6:d7,(r11)+
]
loopend0
;
;
;
;
Ar' - (c*(Br'+Er')-a*(Cr'+Dr'))
Ai' - (c*(Bi'+Ei')-a*(Ci'+Di'))
Scaling factor
d12.H = Ar, d13.H = Ai
;
;
;
;
M5_1.H
M5_1.L
M5_2.H
M5_2.L
=
=
=
=
Ar'+(a(Br'+Er')-c(Cr'+Dr'))+(b(Bi'-Ei')),
Ar'+(a(Br'+Er')-c(Cr'+Dr'))-(b(Bi'-Ei'))
Ai'+(a(Bi'+Ei')-c(Ci'+Di'))-(b(Br'-Er')),
Ai'+(a(Bi'+Ei')-c(Ci'+Di'))+(b(Br'-Er'))
; d14 = Br:Bi
; d15 = Cr:Ci
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
Bout_r =
Ar'+(a(Br'+Er')-c(Cr'+Dr'))+
(b(Bi'-Ei')+d(Ci'-Di')), Eout_r =
Ar'+(a(Br'+Er')-c(Cr'+Dr'))(b(Bi'-Ei')+d(Ci'-Di'))
Bout_i =
Ai'+(a(Bi'+Ei')-c(Ci'+Di'))(b(Br'-Er')+d(Cr'-Dr')), Eout_i =
Ai'+(a(Bi'+Ei')-c(Ci'+Di'))+
(b(Br'-Er')+d(Cr'-Dr'))
M5_3.H = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+(d(Bi'-Ei')),
M5_3.L = Ar'-(c(Br'+Er')-a(Cr'+Dr'))-(d(Bi'-Ei'))
M5_4.H = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))-(d(Br'-Er')),
M5_4.L = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))+(d(Br'-Er'))
d9 = Er:Ei
d2 = Wbr:Wbi
;
;
;
;
;
;
;
;
;
;
;
;
;
;
d14 = Br:Bi>>S
d15 = Cr:Ci>>S
Cout_r =
Ar'-(c(Br'+Er')-a(Cr'+Dr'))+
(d(Bi'-Ei')-b(Ci'-Di')), Dout_r =
Ar'-(c(Br'+Er')-a(Cr'+Dr'))(d(Bi'-Ei')-b(Ci'-Di'))
Cout_i =
Ai'-(c(Bi'+Ei')-a(Ci'+Di'))(d(Br'-Er')-b(Cr'-Dr')), Dout_i =
Ai'-(c(Bi'+Ei')-a(Ci'+Di'))+
(d(Br'-Er')-b(Cr'-Dr'))
out:[Bout_r,Bout_i]
out:[Eout_r,Eout_i]
; d12 = Ar>>S
; out:[Cout_r,Cout_i]
; out:[Dout_r,Dout_i]
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
84
Freescale Semiconductor
DFT Reference Code
exit_mixed_radix:
[
move.l (my_OUTPUT_pointer),r14 ;// --> psFft->psiOut
move.l (sp-12),d3
; out_address psRadix.psiOut
]
[
move.l (sp-36),d4
adda #-72,sp,r5
]
[
tfra r5,sp
]
pop.2l d6:d7
pop.2l r6:r7
F_sc3850_dft_dit_complex_16x16_auto_scale_asm_end
[
rtsd
move.l d4,sr
]
move.l d3,(r14)
; psFft->psiOut = psRadix.psiOut;
endsec
Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0
Freescale Semiconductor
85
How to Reach Us:
Home Page:
www.freescale.com
Web Support:
http://www.freescale.com/support
USA/Europe or Locations Not Listed:
Freescale Semiconductor, Inc.
Technical Information Center, EL516
2100 East Elliot Road
Tempe, Arizona 85284
+1-800-521-6274 or
+1-480-768-2130
www.freescale.com/support
Europe, Middle East, and Africa:
Freescale Halbleiter Deutschland GmbH
Technical Information Center
Schatzbogen 7
81829 Muenchen, Germany
+44 1296 380 456 (English)
+46 8 52200080 (English)
+49 89 92103 559 (German)
+33 1 69 35 48 48 (French)
www.freescale.com/support
Information in this document is provided solely to enable system and software
implementers to use Freescale Semiconductor products. There are no express or
implied copyright licenses granted hereunder to design or fabricate any integrated
circuits or integrated circuits based on the information in this document.
Freescale Semiconductor reserves the right to make changes without further notice to
any products herein. Freescale Semiconductor makes no warranty, representation or
guarantee regarding the suitability of its products for any particular purpose, nor does
Freescale Semiconductor assume any liability arising out of the application or use of
any product or circuit, and specifically disclaims any and all liability, including without
limitation consequential or incidental damages. “Typical” parameters which may be
provided in Freescale Semiconductor data sheets and/or specifications can and do
vary in different applications and actual performance may vary over time. All operating
parameters, including “Typicals” must be validated for each customer application by
customer’s technical experts. Freescale Semiconductor does not convey any license
Japan:
Freescale Semiconductor Japan Ltd.
Headquarters
ARCO Tower 15F
1-8-1, Shimo-Meguro, Meguro-ku
Tokyo 153-0064
Japan
0120 191014 or
+81 3 5437 9125
[email protected]
under its patent rights nor the rights of others. Freescale Semiconductor products are
Asia/Pacific:
Freescale Semiconductor China Ltd.
Exchange Building 23F
No. 118 Jianguo Road
Chaoyang District
Beijing 100022
China
+86 010 5879 8000
[email protected]
claims, costs, damages, and expenses, and reasonable attorney fees arising out of,
For Literature Requests Only:
Freescale Semiconductor
Literature Distribution Center
+1-800 441-2447 or
+1-303-675-2140
Fax: +1-303-675-2150
LDCForFreescaleSemiconductor
@hibbertgroup.com
Document Number: AN3680
Rev. 0
11/2010
not designed, intended, or authorized for use as components in systems intended for
surgical implant into the body, or other applications intended to support or sustain life,
or for any other application in which the failure of the Freescale Semiconductor product
could create a situation where personal injury or death may occur. Should Buyer
purchase or use Freescale Semiconductor products for any such unintended or
unauthorized application, Buyer shall indemnify and hold Freescale Semiconductor
and its officers, employees, subsidiaries, affiliates, and distributors harmless against all
directly or indirectly, any claim of personal injury or death associated with such
unintended or unauthorized use, even if such claim alleges that Freescale
Semiconductor was negligent regarding the design or manufacture of the part.
Freescale, the Freescale logo, StarCore, and CodeWarrior are trademarks of
Freescale Semiconductor, Inc., Reg. U.S. Pat. & Tm. Off. All other product or service
names are the property of their respective owners.
© 2009–2010 Freescale Semiconductor, Inc.