Freescale Semiconductor Application Note Document Number: AN3680 Rev. 0, 11/2010 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core The Discrete Fourier transform (DFT) is a transform used for Fourier analysis of a finite sampled sequence of a signal. The DFT is widely employed in signal processing and related fields to analyze the frequencies contained in a signal. The Fast Fourier Transform (FFT) is a numerically efficient algorithm used to compute the DFT. For a complex N-point Fourier transform, the FFT reduces the number of complex multiplications from the order of N2 to the order of NlogN. Mixed radix-2/3/4/5 FFTs can be used to implement the DFT algorithm with reduced computation if the number of DFT point N = 2k 3l 4m 5n (k, l, m, and n are positive integers). For example, an FFT of size 1200 can be calculated in five stages using radices of 5, 4, and 3 (1,200 = 5 × 5 × 4 × 4 × 3). Therefore, an efficient implementation for the DFT is possible using the mixed radix FFT calculation. This application note describes the implementation of the DFT/IDFT using mixed radix-2/3/4/5 decimation in time (DIT) FFTs/IFFTs on the Freescale Semiconductor SC3850 Digital Signal Processor (DSP) core. This document discusses how to use new features available in the SC3850 core, such as dual-multiplier, to improve the performance of the FFTs. The data structure, code optimization, test vector generation, and performance results are also investigated in this document. Typical reference code is included in this document to demonstrate the implementation details. © 2009–2010 Freescale Semiconductor, Inc.. 1. 2. 3. 4. 5. 6. 7. A. Contents Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Mixed Radix FFT Algorithms . . . . . . . . . . . . . . . . . . 4 Implementation of DFT on StarCore SC3850 DSP . 20 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 33 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 DFT Reference Code . . . . . . . . . . . . . . . . . . . . . . . . 36 Introduction 1 Introduction The discrete Fourier transform (DFT) plays an important role in the analysis, design, and implementation of discrete-time signal processing algorithms and systems because efficient algorithms exist for the computation of the DFT. These efficient algorithms are called fast Fourier transform (FFT) algorithms. In terms of multiplications and additions, the FFT algorithms can be orders of magnitude more efficient than competing algorithms. It is well known that the DFT takes N2 complex multiplications and N2 complex additions for complex N-point. Direct computation of the DFT is inefficient. The basic idea of the FFT algorithm is to break up an N-point DFT transform into successive smaller and smaller transform known as a butterfly (basic computational element). The small transform used can be a 2-point DFT known as Radix-2, 4-point DFT known as Radix-4, or other points. A two-point butterfly requires 1 complex multiplication and 2 complex additions, and a 4-point butterfly requires 3 complex multiplications and 8 complex additions. Therefore, a Radix-2 FFT reduces the complexity of a N-point DFT down to (N/2)log2N complex multiplications and Nlog2N complex additions because there are log2N stages and at each stage there are N/2 2-point butterflies. For a Radix-4 FFT, there are log4N stages and, at each stage, there are N/4 butterflies. Thus, the total number of complex multiplication is (3N/4)log4N = (3N/8)log2N, and the number of required complex additions is 8(N/4)log4N = Nlog2N. Because the FFTs reduce the number of complex multiplications from the order of N2 to the order of NlogN, they are often used to compute the DFT. For instance, the size of the DFT/IDFT in 3GPP-LTE can be calculated as a product of the factors 2, 3, 4, and 5, which is equal to 2k 3l 4m 5n (k, l, m, and n are positive integers). The specific sizes are [12, 24, 36, 48, 60, 72, 96, 108, 120, 144, 180, 192, 216, 240, 288, 300, 324, 360, 384, 432, 480, 540, 576, 600, 648, 720, 768, 864, 900, 960, 972, 1080, 1152, and 1200]. Therefore, an efficient implementation of DFT/IDFT is possible using the mixed radix FFT/IFFT calculation. The FFTs are of importance to a wide variety of applications, such as telecommunications. For instance, M-point DFT and IDFT are used to convert the signal between frequency domain and time domain in Third Generation Partnership Project-Long Term Evolution (3GPP-LTE). To begin, consider an example that demonstrates how FFTs are used in real applications. In the 3GPP-LTE (Long Term Evolution), M-point DFT and Inverse DFT (IDFT) are used to convert the signal between frequency domain and time domain. 3GPP-LTE aims to provide for an uplink speed of up to 50 Mbps and a downlink speed of up to 100 Mbps. For this purpose, the 3GPP-LTE physical layer uses Orthogonal Frequency Division Multiple Access (OFDMA) on the downlink and Single Carrier-Frequency Division Multiple Access (SC-FDMA) on the uplink. Figure 1 shows the transmitter and receiver structure of OFDMA and SC-FDMA systems. We can see from this example that DFT and IDFT are the key elements used to represent changing signals in time and frequency domains. In real applications, FFTs are normally used instead of direct DFT calculation to achieve higher performance. Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 2 Freescale Semiconductor Organization SC-FDMA Transmitter User N-point DFT {Xn} Subcarrier Mapping M-point IDFT Cyclic Prefix & Pulse Shapi ng DAC & RF Subcarrier DeMapping& Equalization M-point DFT Cyclic Prefix Removal RF & ADC Cyclic Prefix & Puls e Shaping DAC & RF Channel SC-FDMA Receiver Base Station N-point IDFT Uplink SC-FDMA OFDMA Transmitter Base Station Subcarrier Mapping M-point IDFT Channel OFDMA Receiver User {Xn} Subcarrier DeMapping& Eq ualization M-point DFT Cyclic Prefix Removal RF & ADC Downlink: OFDMA Figure 1. Transmitter and Receiver Structure of SC-FDMA and OFDMA Systems 2 Organization This application note describes how to implement DFT/IDFT using the mixed radix FFT/IFFT algorithms on the StarCore DSP SC3850.The rest of the document is organized as follows: • Section 3, “Mixed Radix FFT Algorithms”—Reviews the basic information on DFTs s and provides a brief introduction to the mixed radix FFT algorithms. • Section 4, “Implementation of DFT on StarCore SC3850 DSP”—Discusses the algorithm implementation on SC3850, fixed point issues, and how to optimize the implementation for SC3850. • Section 5, “Experimental Results”—Present results and discusses performance. • Section 6, “Conclusions”—Presents the conclusions. • Section 7, “References”—Provides a list of useful references. • Appendix A, “DFT Reference Code”—Provides example code. Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 3 Mixed Radix FFT Algorithms 3 Mixed Radix FFT Algorithms The following sections provide background information, definitions, and examples for using FFT and inverse FFT algorithms. 3.1 Definition of DFT and IDFT The DFT X(k), k=0,1,2,...,N-1 of a sequence x(n), n=0,1,2,...,N-1 is defined as N–1 X(k) = -⎞ ∑ x ( n ) exp ⎛⎝ –j -----------N ⎠ 2πnk n=0 N–1 = ∑ x ( n )W Eqn. 1 nk N n=0 2πnk W Nnk = exp ⎛ – j -------------⎞ ⎝ N ⎠ Eqn. 2 2πnk 2πnk = cos ⎛ -------------⎞ – j sin ⎛ -------------⎞ ⎝ N ⎠ ⎝ N ⎠ In Equation 1 and Equation 2, N is the number of data, j = – 1 , and W Nnk is the twiddle factor. Equation 1 is called the N-point DFT of the sequence of x(n). For each value of k, the value of X(k) represents the Fourier transform at the frequency 2πk ---------- . The IDFT is defined as follows: N N–1 1 x ( n ) = ---N -⎞ ∑ X ( k ) exp ⎛⎝ j -----------N ⎠ 2πnk k=0 Eqn. 3 N–1 1 = ---N ∑ X ( k )W – nk N k=0 2πnk W N– nk = exp ⎛ j -------------⎞ ⎝ N ⎠ Eqn. 4 2πnk 2πnk = cos ⎛ -------------⎞ + j sin ⎛ -------------⎞ ⎝ N ⎠ ⎝ N ⎠ Equation 3 is essentially the same as Equation 1. The differences are that the exponent of the twiddle factor in Equation 3 is the negative of the one in Equation 1 and the scaling factor is 1/N. The IDFT can be simply computed using the same algorithms for DFT but with conjugated twiddle factors. Alternatively, we can use the same twiddles factors for DFT with conjugated input and output to compute IDFT. Equation 1 is also called the analysis equation and Equation 3 the synthesis equation. Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 4 Freescale Semiconductor Mixed Radix FFT Algorithms From Equation 1, it is clear that computing X(k) for each k requires N complex multiplications and N-1 complex additions. So, for N values of k, that is, for the entire DFT, it requires N2 complex multiplications and N ( N – 1 ) ≈ N 2 complex additions. Thus, the DFT is very computationally intensive. Note that a multiplication of two complex numbers ( ( a + jb )c + jd ) = ( ac – bd ) + j ( bc + ad ) requires four real multiplications and two real additions. A complex addition ( a + jb ) + ( c + jd ) = ( a + c ) + j ( b + d ) requires two real additions. The Fast Fourier transforms (FFT) refers to algorithms that compute the DFT in a numerically efficient manner. Different radices 2, 3, 4, and 5 can be combined to calculate DFT if the number of DFT point N = 2k 3l 4m 5n (k, l, m, and n are positive integers). Therefore, in the following sections, we describe how to combine different radix FFTs and then review the decimation-in-time (DIT) algorithms for radix-2/3/4/5 FFTs. Please note that the DIT algorithm is more efficient to run on an SC3850 core because the core performs dual MACs. 3.2 Combination of Different Radix FFTs The number of input data (the number of DFT points), N, is equal to the product of the FFT radixes, which is defined as follows: N = R( 0) × R( 1) × R( 2) × … × R(S – 1 ) S–1 = ∏ R(s) Eqn. 5 s=0 In Equation 5, the parameters are defined as follows: • S: Specifies the number of FFT stages • s: Specifies stage index starting at the FFT input • R(s): Specifies radix of the FFT stage Neither he FFT stage radixes nor the R(s) parameters are constrained to be prime numbers. Mixed radix FFTs can be arbitrarily assigned in the sequence order. For example, the DFT of 300 points can be factored as 5 × 5 × 3 × 4 and can be calculated as a combination of either of the following FFT radixes: • (radix-5) × (radix-5) × (radix-3) × (radix-4) • (radix-4) × (radix-3) × (radix-5) × (radix-5) Each stage of the FFT represents N/R(s) executions of an R(s) point butterfly calculation. In this application note, radices 2, 3, 4, and 5 DIT FFTs are investigated because they are relatively easy to develop and provide a fairly large set of possible number of points for DFT calculation. Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 5 Mixed Radix FFT Algorithms 3.3 Radix-5 DIT FFT The example uses the symmetry and periodicity properties in the derivation, which are defined as shown in Equation 6. symmetry property: periodicity property: N k + ---2 WN = – W nk Eqn. 6 W Nk + N = W Nk The Radix-5 DIT FFT algorithm breaks the DFT calculation into several 5-point DFTs. The Radix-5 DIT FFT can be derived as follows: N–1 k) = ∑ x ( n )WNnk n=0 N ---- – 1 5 = N ---- – 1 5 n=0 N ---- – 1 5 = N ---- – 1 5 N ---- – 1 5 ∑ x ( 5n )WN5nk + ∑ x ( 5n + 1 )WN( 5n + 1)k + ∑ x ( 5n + 2 )WN( 5n + 2)k + ∑ x ( 5n + 3 )WN( 5n + 3 )k + ∑ x ( 5n + 4 )WN( 5n + 4 ) n=0 = N ---- – 1 5 n=0 N ---- – 1 5 n=0 N ---- – 1 5 n=0 N ---- – 1 5 N ---- – 1 5 ∑ x ( 5n )WN5nk + WNk ∑ x ( 5n + 1 )WN5nk + WN2k ∑ x ( 5n + 2 )WN5nk + WN3k ∑ x ( 5n + 3 )WN5nk + WN4k ∑ x ( 5n + 4 )WN5nEqn. 7 n=0 n=0 N ---- – 1 5 N ---- – 1 5 n=0 N ---- – 1 5 n=0 N ---- – 1 5 n=0 N ---- – 1 5 ∑ x ( 5n )WNnk⁄ 5 + WNk ∑ x ( 5n + 1 )WNnk⁄ 5 + WN2k ∑ x ( 5n + 2 )WNnk⁄ 5 + WN3k ∑ x ( 5n + 3 )WNnk⁄ 5 + WN4k ∑ x ( 5n + 4 )WNn n=0 n=0 n=0 n=0 n=0 k Q ( k ) + W 2k R ( k ) + W 3k S ( k ) + W 4k T ( k ) = P ( k ) + WN N N N k = 0 , 1 , 2 , 3 , …, N – 1 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 6 Freescale Semiconductor Mixed Radix FFT Algorithms Each of the sums, P(k), Q(k), R(k), S(k), and T(k), in Equation 7 is recognized as an N/5-point DFT. Although the index k ranges over N values, k=0,1,2,...,N-1, each of the sums can be computed only for k=0,1,2,...,N/5-1, since they are periodic with period N/5. The transform X(k) can be broken into five parts as follows: X ( k ) = P ( k ) + W Nk Q ( k ) + W N2k R ( k ) + W N3k S ( k ) + W N4k T ( k ) ⎛ N⎞ ⎛ N⎞ ⎛ N⎞ ⎛ N⎞ k + ---2 k + ---3 k + ---4 k + ---N ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ ⎝ ⎠ X ⎛⎝ k + ----⎞⎠ = P ( k ) + W N 5 Q ( k ) + W N 5 R ( k ) + W N 5 S ( k ) + W N 5 T ( k ) 5 = P ( k ) + W 1 / 5 W Nk Q ( k ) + W 2 / 5 W N2k R ( k ) + W 3 / 5 W N3k S ( k ) + W 4 / 5 W N4k T ( k ) ⎛ 2N⎞ ⎛ 2N⎞ ⎛ 2N⎞ ⎛ 2N⎞ k + ------2 k + ------3 k + ------4 k + ------2N ⎝ ⎠ ⎝ ⎝ ⎝ 5⎠ 5⎠ 5⎠ X ⎛ k + -------⎞ = P ( k ) + W N 5 Q ( k ) + W N R ( k ) + WN S ( k ) + WN T( k) ⎝ 5⎠ = P ( k ) + W 2 / 5 W Nk Q ( k ) + W 4 / 5 W N2k R ( k ) + W 1 / 5 W N3k S ( k ) + W 3 / 5 W N4k T ( k ) ⎛ 3N⎞ ⎛ 3N⎞ ⎛ 3N⎞ ⎛ 3N⎞ 2 ⎝ k + -------⎠ 3 ⎝ k + -------⎠ 4 ⎝ k + -------⎠ 3N ⎝ k + -------⎠ 5 5 5 X ⎛ k + -------⎞ = P ( k ) + W N 5 Q ( k ) + W N R ( k ) + WN S ( k ) + WN T( k) ⎝ ⎠ 5 Eqn. 8 = P ( k ) + W 3 / 5 W Nk Q ( k ) + W 1 / 5 W N2k R ( k ) + W 4 / 5 W N3k S ( k ) + W 2 / 5 W N4k T ( k ) ⎛ k + 4N -------⎞⎠ 5 4N ⎝ X ⎛ k + -------⎞ = P ( k ) + W N ⎝ 5⎠ 4N 2 ⎛⎝ k + -------⎞⎠ 5 Q ( k ) + WN 4N 3 ⎛⎝ k + -------⎞⎠ 5 R ( k ) + WN 4N 4 ⎛⎝ k + -------⎞⎠ 5 S ( k ) + WN T( k) = P ( k ) + W 4 / 5 W Nk Q ( k ) + W 3 / 5 W N2k R ( k ) + W 2 / 5 W N3k S ( k ) + W 1 / 5 W N4k T ( k ) N k = 0, 1, 2, …, ---- – 1 5 In the equation above, the periodicity propertyW Nk + N = W Nkis used to simplify W5k . For example,W 512 = W52. The four complex numbers W51, W52, W53, and W54 can be expressed as shown in Figure 2. Im(imaginary) W53 W54 b d -c a Re(real) 0 -d W52 -b W51 a = 0.3090169944 b = 0.9510565163 c = 0.8090169944 d = 0.5877852523 Figure 2. Complex numbers of W51, W52, W53, and W54 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 7 Mixed Radix FFT Algorithms 2π 2π 2π W 51 = exp ⎛ – j ------⎞ = cos ⎛ ------⎞ – j sin ⎛ ------⎞ = a – jb ⎝ 5⎠ ⎝ 5⎠ ⎝ 5⎠ 2π × 2 4π 4π W 52 = exp ⎛ – j ---------------⎞ = cos ⎛ ------⎞ – j sin ⎛ ------⎞ = – c – jd ⎝ ⎝ 5⎠ ⎝ 5⎠ 5 ⎠ W 53 Eqn. 9 6π 6π 2π × 3 = exp ⎛ – j ---------------⎞ = cos ⎛ ------⎞ – j sin ⎛ ------⎞ = – c + jd ⎝ 5⎠ ⎝ 5⎠ ⎝ 5 ⎠ 8π 8π 2π × 4 W 54 = exp ⎛ – j ---------------⎞ = cos ⎛ ------⎞ – j sin ⎛ ------⎞ = a + jb ⎝ 5⎠ ⎝ 5⎠ ⎝ 5 ⎠ The constants in Equation 9 are as follows: a = 0.3090169944, b = 0.9510565163, c = 0.8090169944, d = 0.5877852523. Figure 3 shows the corresponding Radix-5 DIT butterfly diagram. P(k) A Q(k) Wb B R(k) Wc C S(k) Wd D T(k) E We A’ (1,1,1,1,1) X(k) B’ (1,a-jb,-c-jd, -c+jd,a+jb) X(k+N/5) C’ (1,-c-jd,a+jb, a-jb,-c+jd) X(k+2N/5) D’ (1,-c+jd,a-jb, a+jb,-c-jd) E’ (1,a+jb,-c+jd, -c-jd,a-jb) 2π ⎞ ⎛ Wb = exp ⎜ − j k⎟ N ⎠ ⎝ 2π ⎛ ⎞ Wc = exp⎜ − j 2k ⎟ N ⎝ ⎠ ⎛ 2π ⎞ Wd = exp⎜ − j 3k ⎟ N ⎝ ⎠ (k = 0,1, ...,N/5-1) Aout Bout Cout X(k+3N/5) Dout X(k+3N/5) Eout a = 0x278e b = 0x79bc c = 0x678e d = 0x4b3d 2π ⎛ ⎞ We = exp ⎜ − j 4k ⎟ N ⎝ ⎠ Figure 3. Radix-5 DIT Butterfly Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 8 Freescale Semiconductor Mixed Radix FFT Algorithms The twiddle factors in Figure 3 are defined as follows: W Nk = Wb = Wbr + j × Wbi W N2k = Wc = Wcr + j × Wci W N3k = Wd = Wdr + j × Wdi Eqn. 10 W N4k = We = Wer + j × Wei The real and imaginary part of the outputs of Radix-5 DIT butterfly is specified as follows: Aout_r = Ar'+Br'+Cr'+Dr'+Er' Aout_i = Ai'+Bi'+Ci'+Di'+Ei' Bout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))+(b(Bi'-Ei')+d(Ci'-Di')) Bout_i = Ai'+(a(Bi'+Ei')-c(Ci'+Di'))-(b(Br'-Er')+d(Cr'-Dr')) Cout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+(d(Bi'-Ei')-b(Ci'-Di')) Cout_i = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))-(d(Br'-Er')-b(Cr'-Dr')) Dout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))-(d(Bi'-Ei')-b(Ci'-Di')) Dout_i = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))+(d(Br'-Er')-b(Cr'-Dr')) Eout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))-(b(Bi'-Ei')+d(Ci'-Di')) Eout_i = Ai'+(a(Bi'+Ei')-c(Ci'+Di'))+(b(Br'-Er')+d(Cr'-Dr')) where Ar’, Ai’, ...., Er’, and Ei’ are calculated as follows: Ar' = Ar Ai' = Ai Br' = (Br*Wbr - Bi*Wbi) Bi' = (Br*Wbi + Bi*Wbr) Cr' = (Cr*Wcr - Ci*Wci) Ci' = (Cr*Wci + Ci*Wcr) Dr' = (Dr*Wdr - Di*Wdi) Di' = (Dr*Wdi + Di*Wdr) Er' = (Er*Wer - Ei*Wei) Ei' = (Er*Wei + Ei*Wer) NOTE The principal difference between DIT and DIF (decimation in frequency) butterflies is that the order of calculation changes. In the DIF algorithm, the time domain data is “twiddled” before the sub-transforms are performed. In DIT, however, the sub-transforms are performed first, and the output is obtained by “twiddling” the resulting frequency domain data. Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 9 Mixed Radix FFT Algorithms 3.4 Radix-4 DIT FFT The Radix-4 DIT FFT algorithm breaks a N-point DFT calculation into a number of 4-point DFTs (4-point butterflies). Compared with direct computation of N-point DFT, 4-point butterfly calculation requires much less operations. The Radix-4 DIT FFT can be derived as follows: N–1 X(k) = ∑ x( n )W nk N n=0 N ---- – 1 4 = N ---- – 1 4 ∑ x( 4n )W 4nk N n=0 N ---- – 1 4 = = + N ---- – 1 4 ∑ x( 4n + 1 )W ( 4n + 1 )k N + n=0 N ---- – 1 4 ∑ x( 4n + 2 )W 4nk N n=0 N ---- – 1 4 N ---- – 1 4 nk N⁄4 + W Nk n=0 ∑ x ( 4n + 3 )W 4nk N + W N2k N ---- – 1 4 ∑ x ( 4n + 2 )W 4nk N + W N3k n=0 ∑ x ( 4n + 3 )W nk N⁄4 n=0 N ---- – 1 4 ∑ x( 4n + 2)W + W N2k Eqn. 11 4nk N n=0 N ---- – 1 4 ∑ x ( 4n + 1 )W ( 4n + 3 )k N n=0 N ---- – 1 4 ∑ x ( 4n + 1 )W + W Nk n=0 ∑ x( 4n )W + n=0 N ---- – 1 4 ∑ x( 4n )W ( 4n + 2 )k N nk N⁄4 ∑ x ( 4n + 3 )W + W N3k n=0 nk N⁄4 n=0 = P ( k ) + W Nk Q ( k ) + W N2k R ( k ) + W N3k S ( k ) k = 0, 1, 2, 3, …, N – 1 Each of the sums, P(k), Q(k), R(k), and S(k), in Equation 11 is recognized as an N/4-point DFT. Although the index k ranges over N values, k=0,1,2,...,N-1, each of the sums can be computed only for k=0,1,2,...,N/4-1, since they are periodic with period N/4. The transform X(k) can be broken into four parts as follows: X ( k ) = P ( k ) + W Nk Q ( k ) + W N2k R ( k ) + W N3k S ( k ) ⎛k + N ----⎞ 4⎠ N ⎝ X ⎛ k + ----⎞ = P ( k ) + W N ⎝ 4⎠ N 2 ⎛ k + ----⎞ ⎝ 4⎠ Q ( k ) + WN N 3 ⎛ k + ----⎞ ⎝ 4⎠ R ( k ) + WN S( k ) = P ( k ) – jW Nk Q ( k ) – W N2k R ( k ) + jW N3k S ( k ) ⎛ k + 2N -------⎞⎠ 4 2N ⎝ X ⎛ k + -------⎞ = P ( k ) + W N ⎝ 4⎠ 2N 2 ⎛⎝ k + -------⎞⎠ 4 Q ( k ) + WN 2N 3 ⎛⎝ k + -------⎞⎠ 4 R ( k ) + WN S( k) Eqn. 12 = P ( k ) – W Nk Q ( k ) + W N2k R ( k ) – W N3k S ( k ) ⎛ k + 3N -------⎞ 4⎠ 3N ⎝ X ⎛ k + -------⎞ = P ( k ) + W N ⎝ 4⎠ 3N 2 ⎛ k + -------⎞ ⎝ 4⎠ Q ( k ) + WN 3N 3 ⎛ k + -------⎞ ⎝ 4⎠ R ( k ) + WN S( k) = P ( k ) + jW Nk Q ( k ) – W N2k R ( k ) – jW N3k S ( k ) N k = 0, 1, 2, …, ---- – 1 4 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 10 Freescale Semiconductor Mixed Radix FFT Algorithms Figure 4 shows the corresponding Radix-4 DIT butterfly diagram, and Figure 5 shows its simplified version. P(k) A’ A Q(k) Wc C Wb R(k) B S(k) Wd D B’ (1,1,1,1) X(k) (1,-j,-1,j) X(k+N/4) Aout Bout C’ (1,-1,1,-1) D’ (1,j,-1,-j) X(k+2N/4) Cout X(k+3N/4) Dout 2π ⎞ ⎛ Wb = exp ⎜ − j k⎟ N ⎠ ⎝ 2π ⎛ ⎞ Wc = exp ⎜ − j 2k ⎟ N ⎝ ⎠ 2π ⎞ ⎛ Wd = exp ⎜ − j 3k ⎟ N ⎝ ⎠ (k = 0,1, ...,N/4-1) Figure 4. Radix-4 DIT Butterfly A 0 Aout B n Bout C 2n Cout 3n Dout D Figure 5. Simplified Radix-4 DIT Butterfly The real and imaginary part of output data for Radix-4 DIT butterfly is specified in Equation 13. Aoutr′ = Ar + ( Cr × Wcr – Ci × Wci ) + ( Br × Wbr – Bi × Wbi ) + ( Dr × Wdr – Di × Wdi ) Aouti′ = Ai + ( Cr × Wci + Ci × Wcr ) + ( Br × Wbi + Bi × Wbr ) + ( Dr × Wdi + Di × Wdr Boutr′ = Ar – ( Cr × Wcr – Ci × Wci ) + ( Br × Wbi + Bi × Wbr ) – ( Dr × Wdi + Di × Wdr ) Bouti′ = Ai – ( Cr × Wci + Ci × Wcr ) – ( Br × Wbr – Bi × Wbi ) + ( Dr × Wdr – Di × Wdi ) Coutr′ = Ar + ( Cr × Wcr – Ci × Wci ) – ( Br × Wbr – Bi × Wbi ) – ( Dr × Wdr – Di × Wdi ) Eqn. 13 Couti′ = Ai + ( Cr × Wci + Ci × Wcr ) – ( Br × Wbi + Bi × Wbr ) – ( Dr × Wdi + Di × Wdr ) Doutr′ = Ar – ( Cr × Wcr – Ci × Wci ) – ( Br × Wbi + Bi × Wbr ) + ( Dr × Wdi + Di × Wdr ) Douti′ = Ai – ( Cr × Wci + Ci × Wcr ) + ( Br × Wbr – Bi × Wbi ) – ( Dr × Wdr – Di × Wdi ) Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 11 Mixed Radix FFT Algorithms 3.5 Radix-3 DIT FFT The Radix-3 DIF FFT can be derived as shown in Equation 14. X(k) = ∑ x ( n )W nk N n=0 N ---- – 1 3 = N ---- – 1 3 ∑ x ( 3n )W 3nk N n=0 + ∑ x ( 3n + 1 )W = ( 3n + 1 )k N + n=0 N ---- – 1 3 = N ---- – 1 3 ∑ x( 3n + 2)W n=0 N ---- – 1 3 ∑ x ( 3n )W 3nk N n=0 N ---- – 1 3 N ---- – 1 3 nk N⁄3 + W Nk n=0 N ---- – 1 3 ∑ x ( 3n + 1 )W + W Nk n=0 ∑ x ( 3n )W ( 3n + 2 )k N 3nk N + W N2k ∑ x( 3n + 2 )W Eqn. 14 3nk N n=0 N ---- – 1 3 ∑ x( 3n + 1 )W nk N⁄3 + W N2k n=0 ∑ x ( 3n + 2 )W nk N⁄3 n=0 = P ( k ) + W Nk Q ( k ) + W N2k R ( k ) k = 0 1 2 3 N–1 Each of the sums, P(k), Q(k), and R(k), in Equation 14 is recognized as an N/3-point DFT. The transform X(k) can be broken into three parts as shown in Equation 15. X ( k ) = P ( k ) + W Nk Q ( k ) + W N2k R ( k ) ⎛ N⎞ ⎛ N⎞ 2 ⎝ k + ----⎠ N ⎝ k + ----⎠ X ⎛ k + ----⎞ = P ( k ) + W N 3 Q ( k ) + W N 3 R ( k ) ⎝ ⎠ 3 = P ( k ) + W 1 / 3 W Nk Q ( k ) + W 2 / 3 W N2k R ( k ⎛ 2N⎞ ⎛ Eqn. 15 2N⎞ 2 ⎝ k + -------⎠ 2N ⎝ k + -------⎠ 3 X ⎛ k + -------⎞ = P ( k ) + W N 3 Q ( k ) + W N R(k ⎝ ⎠ 3 = P ( k ) + W 2 / 3 W Nk Q ( k ) + W 1 / 3 W N2k R ( k N k = 0, 1, 2, …, ---- – 1 3 In Equation 15, the periodicity propertyWNk + N = W Nkis used to simplifyW 34 = W 31. The complex numbers W31 and W32 can be expressed as shown in Equation 16. 1 3 2π W 31 = exp ⎛ – j ------⎞ = – --- – ------- j ⎝ 3⎠ 2 2 W 32 Eqn. 16 1 3 2π × 2 = exp ⎛ – j ---------------⎞ = – --- + ------- j ⎝ 2 2 3 ⎠ Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 12 Freescale Semiconductor Mixed Radix FFT Algorithms Figure 6 shows the corresponding Radix-3 DIT butterfly diagram. P(k) A’ A Q(k) Wb B R(k) Wc Aout X(k+N/3) B’ (1,-1/2-j -1/2+j C X(k) (1,1,1) 3 /2, Bout 3 /2) X(k+2N/3) C’ (1, -1/2+j -1/2-j 2π ⎞ ⎛ Wb = exp ⎜ − j k⎟ N ⎠ ⎝ 2π ⎛ ⎞ Wc = exp⎜ − j 2k ⎟ N ⎝ ⎠ 3 /2, Cout 3 /2) (k = 0,1, ...,N/3-1) Figure 6. Radix-3 DIT Butterfly The twiddle factors in Figure 6 are defined as shown in Equation 17. W Nk = Wb = Wbr + j × Wbi W N2k = Wc = Wcr + j × Wci Eqn. 17 The real and imaginary part of the outputs of Radix-3 DIT butterfly is specified as follows: Aout_r = Ar'+Br'+Cr' Aout_i = Ai'+Bi'+Ci' Bout_r = Ar'-(Br’+Cr’)/2+(Bi’-Ci’)* 3 /2 Bout_i = Ai'-(Bi’+Ci’)/2-(Br’-Cr’)* 3 /2 Cout_r = Ar'-(Br’+Cr’)/2-(Bi’-Ci’)* 3 /2 Cout_i = Ai'-(Bi’+Ci’)/2+(Br’-Cr’)* 3 /2 where Ar’, Ai’, Br’, Bi’, Cr’, and Ci’ are calculated as follows: Ar' = Ar Ai' = Ai Br' = (Br*Wbr - Bi*Wbi) Bi' = (Br*Wbi + Bi*Wbr) Cr' = (Cr*Wcr - Ci*Wci) Ci' = (Cr*Wci + Ci*Wcr) Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 13 Mixed Radix FFT Algorithms 3.6 Radix-2 DIT FFT The Radix-2 DIF FFT can be derived as shown in Equation 18. X(k) = ∑ x ( n )W nk N n=0 N ---- – 1 2 = N ---- – 1 2 ∑ x ( 2n )W 2nk N + n=0 ∑ x( 2n + 1 )W n=0 N ---- – 1 2 = = ( 2n + 1 )k N N ---- – 1 2 ∑ x ( 2n )W 2nk N ∑ x( 2n + 1 )W + W Nk n=0 n=0 N ---- – 1 2 N ---- – 1 2 ∑ x ( 2n )W nk N⁄2 + W Nk n=0 Eqn. 18 2nk N ∑ x( 2n + 1)W nk N⁄2 n=0 = P ( k ) + W Nk Q ( k ) k = 0 1 2 3 N–1 Each of the sums, P(k) and Q(k), in Equation 18 is recognized as an N/2-point DFT. The transform X(k) can be broken into two parts as shown in Equation 19. X ( k ) = P ( k ) + W Nk Q ( k ) ⎛ N⎞ N ⎝ k + ----⎠ X ⎛ k + ----⎞ = P ( k ) + W N 2 Q ( k ⎝ 2⎠ Eqn. 19 = P ( k ) – W Nk Q ( k ) N k = 0, 1, 2, …, ---- – 1 2 Figure 7 shows the corresponding Radix-2 DIT butterfly diagram. P(k) A’ A Q(k) B Wb (1,1) (1,-1) B’ X(k) Aout X(k+N/2) Bout 2π ⎞ ⎛ Wb = exp ⎜ − j k⎟ N ⎠ ⎝ (k = 0,1, ...,N/2-1) Figure 7. Radix-2 DIT Butterfly Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 14 Freescale Semiconductor Mixed Radix FFT Algorithms The twiddle factors in Figure 7 are defined in Equation 20. W Nk = Wb = Wbr + j × Wb Eqn. 20 The real and imaginary part of the outputs of Radix-2 DIT butterfly is specified as follows: Aout_r Aout_i Bout_r Bout_i 3.7 = = = = Ar+(Br*Wbr Ai+(Br*Wbi Ar-(Br*Wbr Ai-(Br*Wbi + + Bi*Wbi) Bi*Wbr) Bi*Wbi) Bi*Wbr) An Example of 12-Point Mixed Radix DIT FFT An example of 12-point mixed Radix DIT FFT diagram is shown in Figure 8 to illustrate an entire DIT algorithm. There are 2 stages to process the whole DFT computation. Radix-4 DIT butterflies are used in the first stage, and there 3 butterflies; Radix-3 DIT butterflies are used in the second stage, and there 4 butterflies. x(0) X(0) 0 0 0 x(3) 0 X(1) 0 x(2) 0 x(6) 0 x(9) X(3) 0 x(1) 0 x(4) 0 X(5) 0 x(7) X(4) 0 1 2 X(6) 3 X(7) 0 x(10) 0 x(2) 0 x(5) 0 2 0 4 x(8) X(8) X(9) X(10) 0 6 x(11) X(11) Stage Radix Butterflies / Subgroup Subgroups 0 4 1 3 1 3 4 1 Figure 8. An Example of a 12-point Mixed Radix DIT FFT Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 15 Mixed Radix FFT Algorithms In Figure 8, the inputs are digit-reversed ordered while the outputs are presented in a normal order. The digital reversed order of the inputs are stored in an array and pre-calculated outside of the DFT function. The pointer of the array is passed into the DFT function to get the digital reversed indices. The digital reversed indices n can be calculated as follows: n = r ( 0 )A ( 0 ) + r ( 1 )A ( 1 ) + … + r ( S – 1 )A ( S – 1 ) Eqn. 21 where Ss –is1 the total number of stages, r ( s ) = { 0, 1, …, R ( s ) – 1 } specifies possible digits of Radix R(s), and A(s ) = ∏ R ( m ) specifies the product of the radixes of the following stages. m=0 For example, consider the digit-reversed order of the 300-point DFT. The 300-point DFT can be factored as 4 × 3 × 5 × 5 and we assume that the calculated radix of FFT stage is performed in this order. So the products of radixes of the following FFT stages and the digits at each stage are At first stage, A(0) = 3 x 5 x 5 = 75, r(0) = {0,1,2,3}, At second stage, A(1) = 5 x 5 = 25, r(0) = {0,1,2}, At third stage, A(2) = 5, r(0) = {0,1,2,3,4}, At fourth stage, A(3) = 1, r(0) = {0,1,2,3,4}, The final output indexes in the digit-reversed order are calculated as k 0 75 150 225 25 100 175 250 50 125 200 275 5 80 155 230 ... = = = = = = = = = = = = = = = = = r(0)A(0) 0 75 150 225 0 75 150 225 0 75 150 225 0 75 150 225 + + + + + + + + + + + + + + + + + r(1)A(1) 0 0 0 0 25 25 25 25 50 50 50 50 0 0 0 0 + + + + + + + + + + + + + + + + + r(2)A(2) 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 5 + + + + + + + + + + + + + + + + + r(3)A(3) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 16 Freescale Semiconductor Mixed Radix FFT Algorithms The general C program to calculate the output indexes in the digit-reversed order is shown below. This program can calculate the output indexes in the digit-reversed order up to 6 FFT stages. //================================================= // Input indices are in the digit-reversed order //================================================= short *psiRadix; // Pointer to the array of radix every stage short *psiDigitReversedAddress; // Pointer to the Digit Reversed Address short short short short short short siI; siI0,siI1,siI2,siI3,siI4,siI5; siIr0,siIr1,siIr2,siIr3,siIr4,siIr5; siR[6]; // Local variable for radix every stage siRR[6][5]; // Digits of the radix number my_psiSubgroup[6]; // Product of radixes of the following DFT stages // Radix number every stage for(siI=0;siI<6;siI++) { siR[siI] = psiRadix[siI]; if(psiRadix[siI] == 0) { siR[siI] = 1; } } // Loop siIr0 = siIr1 = siIr2 = siIr3 = siIr4 = siIr5 = index to calculate digit reversed address siR[0]; siR[1]; siR[2]; siR[3]; siR[4]; siR[5]; // Digits of the radix number for(siI=0;siI<5;siI++) { siRR[0][siI] = siI; siRR[1][siI] = siI; siRR[2][siI] = siI; siRR[3][siI] = siI; siRR[4][siI] = siI; siRR[5][siI] = siI; } // Product of radixes of the following DFT stages my_psiSubgroup[0] = siIr1*siIr2*siIr3*siIr4*siIr5; for(siI=1;siI<6;siI++) { if(psiRadix[siI] == 0) { my_psiSubgroup[siI] = 0; } else { my_psiSubgroup[siI] = my_psiSubgroup[siI-1]/siR[siI]; } } Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 17 Mixed Radix FFT Algorithms // Digit reversed address siI = 0; for(siI5=0;siI5<siIr5;siI5++) { for(siI4=0;siI4<siIr4;siI4++) { for(siI3=0;siI3<siIr3;siI3++) { for(siI2=0;siI2<siIr2;siI2++) { for(siI1=0;siI1<siIr1;siI1++) { for(siI0=0;siI0<siIr0;siI0++) { psiDigitReversedAddress[siI] = siRR[0][siI0]*my_psiSubgroup[0] + siRR[1][siI1]*my_psiSubgroup[1] + siRR[2][siI2]*my_psiSubgroup[2] + siRR[3][siI3]*my_psiSubgroup[3] + siRR[4][siI4]*my_psiSubgroup[4] + siRR[5][siI5]*my_psiSubgroup[5]; siI++; } } } } } } 3.8 Mixed Radix Inverse FFT Algorithms The common method of calculating the IFFT is to conjugate the twiddle factors and use the FFT routine to get the IFFT. However, this method needs an additional memory to store the twiddle factors. Therefore, it is not an efficient way for reduction of memory requirement. If the complex conjugate of IDFT (Equation 3) is considered, the equation is defined as shown in Equation 22. 1 x∗ ( n ) = ---N ∑ 2πnk X ( k ) exp ⎛ j -------------⎞ ⎝ N ⎠ k=0 N–1 1 = ---N -⎞ ∑ X∗ ( k ) exp ⎛⎝ ( –j ) -----------N ⎠ 2πnk k=0 N–1 1 = ---N Eqn. 22 ⎧ 2πnk ⎫ -⎞ – j sin ⎛ -------------⎞ ⎬ ∑ X∗ ( k )⎨⎩ cos ⎛⎝ -----------⎝ N ⎠ N ⎠ ⎭ 2πnk k=0 N–1 1 = ---N ∑ X∗ ( k )W nk N Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 18 Freescale Semiconductor Mixed Radix FFT Algorithms The asterisk indicates the complex conjugate. In this form, there is no need to have additional memory for the twiddle factors since the twiddle factor is the same as the DFT. The IDFT is calculated by applying the DFT on the complex conjugate of X(k). The complex conjugate of the resulting time signal, x(n), is the output of the IDFT (with appropriate scaling). The advantage of this calculation is that the same twiddle factors can be used for both the DFT and the IDFT, and there is no need to save additional conjugated twiddle factors. The following code describes the Radix-5 DIT butterfly calculations of the IFFT: Note that the conjugating process is embedded in the equations. Aout_r = Ar'+Br'+Cr'+Dr'+Er' Aout_i = Ai'-Bi'-Ci'-Di'-Ei' Bout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))+(b(Bi'-Ei')+d(Ci'-Di')) Bout_i = Ai'-(a(Bi'+Ei')-c(Ci'+Di'))+(b(Br'-Er')+d(Cr'-Dr')) Cout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+(d(Bi'-Ei')-b(Ci'-Di')) Cout_i = Ai'+(c(Bi'+Ei')-a(Ci'+Di'))+(d(Br'-Er')-b(Cr'-Dr')) Dout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))-(d(Bi'-Ei')-b(Ci'-Di')) Dout_i = Ai'+(c(Bi'+Ei')-a(Ci'+Di'))-(d(Br'-Er')-b(Cr'-Dr')) Eout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))-(b(Bi'-Ei')+d(Ci'-Di')) Eout_i = Ai'-(a(Bi'+Ei')-c(Ci'+Di'))-(b(Br'-Er')+d(Cr'-Dr')) where Ar', Ai', ...., Er', and Ei' are calculated as follows: Ar' = Ar Ai' = Ai Br' = (Br*Wbr + Bi*Wbi) Bi' = (Br*Wbi - Bi*Wbr) Cr' = (Cr*Wcr + Ci*Wci) Ci' = (Cr*Wci - Ci*Wcr) Dr' = (Dr*Wdr + Di*Wdi) Di' = (Dr*Wdi - Di*Wdr) Er' = (Er*Wer + Ei*Wei) Ei' = (Er*Wei - Ei*Wer) a = 0.3090169944 b = 0.9510565163 c = 0.8090169944 d = 0.5877852523 The following code listing describes the Radix-4 DIT butterfly calculations of the IFFT: Aout_r = (Ar'+Cr')+(Br'+Dr') Aout_i = (Ai'-Ci')-(Bi'+Di')) Bout_r = (Ar'+Cr')-(Br'+Dr') Bout_i = (Ai'-Ci')+(Bi'+Di') Cout_r = (Ar'-Cr')+(Bi'-Di') Cout_i = (Ai'+Ci')+(Br'-Dr')) Dout_r = (Ar'-Cr')-(Bi'-Di') Dout_i = (Ai'+Ci')-(Br'-Dr') where Ar', Ai', ...., Dr', and Di' are calculated as follows: Ar' = Ar Ai' = -Ai Br' = (Br*Wbr + Bi*Wbi) Bi' = (Br*Wbi - Bi*Wbr) Cr' = (Cr*Wcr + Ci*Wci) Ci' = (Cr*Wci - Ci*Wcr) Dr' = (Dr*Wdr + Di*Wdi) Di' = (Dr*Wdi - Di*Wdr) Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 19 Implementation of DFT on StarCore SC3850 DSP The following code listing describes the Radix-3 DIT butterfly calculations of the IFFT: Aout_r = Ar'+Br'+Cr' Aout_i = Ai'-Bi'-Ci' Bout_r = Ar'-(Br’+Cr’)/2+(Bi’-Ci’)* 3 /2 Bout_i = Ai'+(Bi’+Ci’)/2+(Br’-Cr’)* 3 /2 Cout_r = Ar'-(Br’+Cr’)/2-(Bi’-Ci’)* 3 /2 Cout_i = Ai'+(Bi’+Ci’)/2-(Br’-Cr’)* 3 /2 where Ar’, Ai’, Br’, Bi’, Cr’, and Ci’ are calculated as follows: Ar' = Ar Ai' = Ai Br' = (Br*Wbr + Bi*Wbi) Bi' = (Br*Wbi - Bi*Wbr) Cr' = (Cr*Wcr + Ci*Wci) Ci' = (Cr*Wci - Ci*Wcr) The following code listing describes the Radix-2 DIT butterfly calculations of the IFFT: Aout_r Aout_i Bout_r Bout_i 4 = = = = Ar+(Br*Wbr Ai-(Br*Wbi Ar-(Br*Wbr Ai+(Br*Wbi + + - Bi*Wbi) Bi*Wbr) Bi*Wbi) Bi*Wbr) Implementation of DFT on StarCore SC3850 DSP In this section, an implementation of the mixed Radix DIT FFT algorithm will be presented. Please note that the implementation of IDFT is similar, and thus the reference code of IDFT is not listed in this application note. 4.1 Overflow in Fixed-Point Computation and Scaling The SC3850 is a fixed-point digital signal processor. In fixed-point processors, numbers are represented either as integers or as fractions between [-1.0, +1.0). In the implementation of the DFT algorithm, fractional number are used to interpret all input data, output data, and twiddle factors to avoid multiplication overflow because a fractional number multiplied with another fractional number always results in value less than one (under an assumption that the two operants can’t take -1 at the same time). Another overflow comes from additions in the DFT calculation. For example, in a Radix-5 DIT butterfly, although the inputs of the butterfly have real and imaginary parts with magnitudes less than one, the maximum magnitude of outputs can have a up to 3-bit growth as shown in Equation 23. A = = 2 ( Ar + Br + Cr + Dr + Er ) + ( Ai + Bi + Ci + Di + Ei ) 2 (1 + 1 + 1 + 1 + 1) + (1 + 1 + 1 + 1 + 1) 2 2 Eqn. 23 = 5 2 = 7.071 Scaling is an important operation in fixed-point DSP applications to avoid the overflow with the loss of precision and dynamic range. When the inputs are scaled to a smaller range, some of the lower-order bits of the original value are lost through truncation or rounding resulting in reduced precision. Thus, scaling must be done with great care, balancing the need to eliminate the possibility of overflow with the need to maintain adequate dynamic range and precision. Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 20 Freescale Semiconductor Implementation of DFT on StarCore SC3850 DSP Two scaling methods to prevent overflowing are implemented in the DFT calculation. The first method is to scale down the input data by a fixed number of bits at each stage without considering the range of the input data. This is called fixed scaling method. The scaling factor is determined based on the maximum bit growth of the butterfly calculation. For example, the inputs to the Radix-5 butterfly are scaled by 3 bits (3-bit right shift) since it may cause up to 3-bit growth. This method is easy to implement. However, it may results in unnecessary scaling down of the input data, especially when the input data values are in a small range. Another method is to detect the bit growth at the output of each DFT stage. Once the number of bit growth is known, the number of scaling bits can be determined to reverse the bit growth. This method is called automatic scaling method. The precision of the DFT stage can be enhanced by predicting bit growth of an upcoming stage and normalizing the data accordingly. The automatic scaling method algorithm is as follows: 1. The input data are scaled down at the first stage to gain extra guard bits. The scaling factors are 2 for the first Radix-2 stage and 4 for the first Radix-4 stage. 2. At each butterfly calculation, search the maximum magnitude of the input data and detect the bit growth. The number of leading zeros or ones (except the sign bit) of the data with the maximum magnitude is recorded. 3. Based on the number of leading bits and the maximum bit growth S of the current DFT stage, the number of scaling bits is determined. The input data to the DFT stage are shifted accordingly. Table 1 shows the detailed scaling parameters used at different stages in the DFT calculation. Table 1. Automatic Scaling at Different Stages Radix-2 First Stage Automatic Scaling Automatic Scaling Automatic Scaling Automatic Scaling if(leadbits > 2) Shift = 2 - leadBits; else Shift = 2 - leadBits N/A N/A Radix-4 First Stage if(leadbits > 3) Shift = 3- leadBits; else Shift = 3- leadBits N/A N/A N/A N/A N/A N/A Radix-3 Middle Stage if(leadBits > 2) Shift = 0; else if(leadBits == 2) Shift = 1; else Shift = 2; Radix-4 Middle Stage if(leadBits > 3) Shift = 0; else if(leadBits == 3) Shift = 2; else Shift = 2; Radix-5 Middle Stage if(leadBits > 3) Shift = 0; else Shift = 3 - leadBits; Radix-3 Last Stage if(leadBits > 2) Shift = 0; else Shift = 2 - leadBits; N/A N/A Radix-5 Last Stage if(leadBits > 2) Shift = 2 - leadBits; else Shift = 2 - leadBits; 4. The final output data obtained by the DFT kernel may need to be scaled because the automatic scaling process may use different scaling factors for different input data. The DFT kernel does not implement this step since difference applications may require different scaling factors. Users can scale the output data outside of the DFT kernel based on application requirement. The following code is used to find the maximum magnitude and determines the number of leading bits for the automatic scaling. Four real parts and four imaginary parts are compared simultaneously in a loop within two cycles. Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 21 Implementation of DFT on StarCore SC3850 DSP ;-- Find max value (real & imag) & stage initialization [ move.l (sp-8),r0 ; input base (A) move.l (sp-60),r2 ; DFT point ] [ move.w (r2),d15 ; DFT point ] [ tfr d15,d14 sub #8,d15 ; d15-8 ] [ asrr #2,d15 ; (d15-8)/4 ] [ adda #8,r0,r1 ; Input address move.w #2,n0 ] [ move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 move.2l (r0)+n0,d4:d5 move.2l (r1)+n0,d6:d7 ] [ abs2 d4,d4 abs2 d5,d5 abs2 d6,d6 abs2 d7,d7 doensh0 d15 ; loop count for max search ] ; Loop for max (real,imag) search FALIGN loopstart0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 ] loopend0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.l (sp-40),r3 ; Stage counter ] [ Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 22 Freescale Semiconductor Implementation of DFT on StarCore SC3850 DSP tfr d6,d0 tfr d7,d1 move.l (sp-20),r2 ; num_butterfly move.l (sp-24),r4 ; num_subgroup ] [ max2 d0,d4 max2 d1,d5 ] [ tfr d5,d0 ] [ clr d15 max2 d0,d4 move.l (sp-32),r13 ; digit_reversed_address ] [ add #2,d15 ; scale down = 2 sxt.w d4,d0 adda r3,r2 ; num_butterfly[s] adda r3,r4 ; num_subgroup[s] ] [ extract #16,#16,d4,d4 move.l (sp-68),r1 ; psiScale ] [ max d0,d4 ; Max real/imag move.l (sp-8),r0 ; A input pointer move.l (sp-12),r9 ; A output pointer ] [ clb d4,d0 ; count reading bit move.w (r2),r14 ; num_butterfly[s] move.w (r4),r5 ; num_subgroup[s] ] [ neg d0 ; negates dosetup0 radix2_dit_1_stage_loop1 move.l (sp-28),r4 ; num_radix_offset ] [ clr d4 sub #16,d0 ; limit 16-bit move.w #S22,d2 ] [ add #1,d4 cmpgt.w #(S21+1),d0 ; Check if siNorm (leading bits) > S21+1 move.w #S21,d1 ] [ ift clr d15 ; scale down = 0 if siNorm > S21+1 ifa cmpeq.w #(S21+1),d0 ; Check if siNorm == S21+1 ] tfrt d4,d15 ; scale down = 1 if siNorm == S21+1 ; d15 is the automatic scaling bits Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 23 Implementation of DFT on StarCore SC3850 DSP 4.2 Implementation of Mixed Radix DIT FFTs Figure 9 shows the actual process of calculation for 600-point DFT as an example. Ex. N = 600 Bit Reverse 0 Radix-2 FFT Radix-4 FFT Radix-3 FFT Radix-5 FFT Radix-5 FFT 1 Radix-2 FFT Radix-4 FFT Radix-3 FFT Radix-5 FFT Radix-5 FFT 2 Radix-2 FFT Radix-4 FFT Radix-3 FFT Radix-5 FFT Radix-5 FFT 3 Radix-2 FFT Radix-4 FFT Radix-3 FFT Radix-5 FFT Radix-5 FFT N/R(s)-4 Radix-2 FFT Radix-4 FFT Radix-3 FFT Radix-5 FFT Radix-5 FFT N/R(s)-3 Radix-2 FFT Radix-4 FFT Radix-3 FFT Radix-5 FFT Radix-5 FFT N/R(s)-2 Radix-2 FFT Radix-4 FFT Radix-3 FFT Radix-5 FFT Radix-5 FFT N/R(s)-1 Radix-2 FFT Radix-4 FFT Radix-3 FFT Radix-5 FFT Radix-5 FFT First stage R(s)=2, s=0 Middle stage R(s)=4, s=1 R(s)=3, s=2 Last stage R(s)=5, s=3 R(s)=5, s=4 2 x 4 x 3 x 5 x 5 -> 5 stages Figure 9. Diagram of 1024-point FFT Calculation Using Radix-4 DIT Algorithm In this implementation, the DFT is factorized as 2 × 4 × 3 × 5 × 5. There are S=5 stages, and the stage index s = 0, 1, 2, 3, 4. The whole DFT process is summarized as follows: 1. First stage: The input data are loaded in the bit-reversed addressing mode. The bit-reversed indices are calculated outside the DFT kernel and they are passed into the kernel as one of the parameters. The input data is checked to detect the maximum magnitude, and the scaling factor is determined. The Radix-2 (R(s=0)=2) DIT butterfly calculation is repeated for N/2 times on the input data. In the first stage, the twiddle factors are all equal to 1. As a result, there is no multiplication in the first stage. The data are scaled according to the scaling factor to prevent the overflow. 2. Middle stage: There 3 middle stages in this example. In each stage, there are N/R(s) Radix-R(s) DIT butterfly calculations, where s = 1, 2, 3. The twiddle factors vary with stages. — In stage s=1, the twiddle factors are equal to (WN75k, WN2*75k, WN3*75k) (k = 0, N/300-1). — In stage s=2, the twiddle factors are (WN25k, WN2*25k) (k = 0, 1, 2, ... , N/75-1). — In stage s=3, the twiddle factors are (WN5k, WN2*5k, WN3*5k,WN4*5k) (k = 0, 1, 2, ... , N/25-1), and so on. Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 24 Freescale Semiconductor Implementation of DFT on StarCore SC3850 DSP The input data is checked to detect the maximum magnitude, and the scaling factor is determined. The data are scaled according to the scaling factor to prevent the overflow. 3. Last stage: The Radix-5 (R(s=4)=5) DIT butterfly calculation is repeated for N/5 times. The twiddle factors are (WNk, WN2k, WN3k,WN4k) (k = 0, 1, 2, ... , N/5-1) in this stage. The scaling factor is determined based on the maximum magnitude of the input data and the data are scaled accordingly. Input and output data are 16-bit complex array, each of length N. Two separate buffers are used to store the input and output data, as shown in Figure 10. Each stage uses the input and output data buffer one after the other. Therefore, if stage n takes the input data buffer as the input and the output data buffer as the output, the next stage, n + 1, uses the output data buffer as the input and the input data buffer as the output. The final output data are stored in input data buffer if the number of stages S is even or in output data buffer if S is odd. Input Data Buffer Number of stages N x 2-byte x 2 IQ [bytes] Output Data Buffer Final output data Even Input data buffer Odd Output data buffer Input and output data are 16-bit complex array N x 2-byte x 2 IQ [bytes] Figure 10. Input and Output Data Buffers At each stage, the output data are stored into the memory and they are used as input to the next stage. For example, if Radix-4 butterflies are used in stage n, then the outputs of a butterfly are A’, B’, C’, and D’. Note that A’, B’, C’, and D’ are not consecutively stored in memory, as shown in Figure 11. The input addressing of the first stage is in digital reversed order Input addressing r0 A B C Output addressing r8 D A’ r1 N/4 r9 B’ N/4 N r10 C’ N/4 r11 D’ N/4 , Figure 11. Input and output addressing in the DFT implementation Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 25 Implementation of DFT on StarCore SC3850 DSP The offset value between the output data samples is calculated as shown in Equation 24. N 4 × ----------R(s) Eqn. 24 where s specifies the stage index and R ( s ) is the Radix of the FFT stage s . The offset value is multiplied by 4 to align the 2-byte for real part and 2-byte for imaginary part. 4.3 Optimization Used in DFT Implementation The optimization techniques improve the performance by taking special advantage of StarCore’s parallel computation capability. The SC3850 cores efficiently deploy a Variable Length Execution Set (VLES) instruction set. The SC38500 architecture has: • Four parallel ALUs, each is capable of performing dual MAC and most other arithmetic operations • Two Address Arithmetic Units (AAUs) for address arithmetic operations A VLES can contain up to four data ALU instructions and two AAU instructions. To maximize the computational performance, four ALUs and two AAUs should be utilized simultaneously as much as possible. Because the StarCore architecture has high instruction-level parallelism, it is possible to schedule independent blocks in parallel to increase performance. Code optimization also considers the memory structure to improve the performance. The SC3850 architecture provides a total sixteen 40-bit data registers, D0-D15 and sixteen 32-bit address registers, R0-R15. The dual Harvard architecture in the SC3850 is capable to access up to two 64-bit data per cycle. The following subsections present the optimization techniques used to increase the speed of the DIT DFT algorithm. 4.3.1 Single Instruction Multiple Data (SIMD) and Parallel Computing The multiplication and addition operations in the DIT butterflies can be calculated in parallel using the multiple ALUs in SC3850. The SC3850 has some instructions which enable faster implementation of the DIT butterfly algorithms. The instructions used for the computation are shown in Table 2. Table 2. SC3850 SIMD Instructions for DFT Instruction MAX2 Da,Db ABS2 Da,Dn ASRR2 Da,Dn SOD2FFCC MAC2FFGGR Description Writes the larger of the signed 16-bit values in each of the corresponding portions in a pair of source data registers to the second of the two registers (SIMD). Compares the high and low portions of the two registers independently. Stores the absolute values of the high portion and low portion of the source data register in the high portion and low portion, respectively, of the destination data register (SIMD). Shifts the bits in each portion of the destination data register to the right, the number of positions specified in an immediate unsigned 4-bit value (SIMD). Sum or difference of two 16-bit values - function & cross (SIMD instruction). Performs two separate 16-bit additions or subtractions between the high and low portions of two source data registers and stores the results in the two portions of the destination data register. FF: A for addition and S for subtraction CC: XX for crossed and II for not crossed SIMD signed fractional multiply and wide accumulate - real Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 26 Freescale Semiconductor Implementation of DFT on StarCore SC3850 DSP Table 2. SC3850 SIMD Instructions for DFT Instruction MAC2FFGGI MOVE2.2F MOVE2.4F MOVERH.4F MOVERL.4F Description SIMD signed fractional multiply and wide accumulate - imaginary Transfers 2 16-bit fractional data between the memory and one data registers in packed 20-bit format, in a single 32-bit access Transfers 4 16-bit fractional data between the memory and two data registers in packed 20-bit format, in a single 64-bit access Move Four Wide High Fractional Words to Memory With Scaling, Rounding, and Saturation (AGU) Move Four Wide Low Fractional Words to Memory With Scaling, Rounding, and Saturation (AGU) The functionality of the MAC2FFGGR and MACFFGGI are shown in Figure 12 and Figure 13, respectively. Figure 12. MAC2AASSR Da,Db,Ds.H/L,Dn Figure 13. MAC2AASSI Da,Db,Ds.H/L,Dn Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 27 Implementation of DFT on StarCore SC3850 DSP In the implementation, the parallel computing technique is used to maximize the usage of StarCore multiple ALUs for the calculation of independent output values that have an overlap input source data values. For example, in a Radix-4 DIT butterfly, outputs A’, C’, B’, and D’ can be calculated in parallel. 4.3.2 Packed Data Move In the StarCore architecture, most element combinations that are supported have data width up to 64-bit. This means that four short words (4 × 16-bits) or two long words (2 × 32-bits) can be fetched at the same time. For example, the MOVE2.2F, MOVE2.4F, MOVERH.4F and MOVERL.4F instructions are used in the DFT computation to efficiently load and store the data samples. For this purpose, the memory addresses should be aligned to the access width of the instructions used. For example, 8-byte (4 × short words) accessing should be aligned to the 8-byte addresses. 4.3.3 Software Pipelining Software pipelining is an optimization technique where a sequence of instructions is transformed into a pipeline of several copies of the sequence. The sequences then work in parallel to leverage more of the available parallelism of the architecture. For example, two set of Radix-4 DIT butterflies can be calculated inside one loop with the software pipelining. 4.3.4 Function Alignment Functions should be aligned so that they fall on cache boundaries. On a Change of Flow (COF), the core has only filled one fetch buffer. If the destination VLES is within two fetch sets, the core will have to fill another fetch buffer and that will cause a stall. To avoid this situation, ensure that the destination VLES of all COF operations are contained within one fetch set. The FALIGN directive will pad the previous VLES with NOP to ensure the following VLES is contained within one fetch set. 4.4 DFT Implementation Diagrams This is an implementation of the mixed Radix FFTs, where Radix-5, 4, 3, and 2 FFTs are combined together to perform the DFT computation. Decimation in time (DIT) algorithm is used. Bit-reversed addressing is not performed inside the kernel. The "psiDigitReversedAddress" array contains offsets for digit reversed address, which is pre calculated. The butterfly diagrams are shown here to help understand the DFT algorithm. The DFT reference code is listed at Appendix A, “DFT Reference Code”. The implementation diagrams of IDFT are the same except some operation sign changes. The IDFT reference code is not included to save space. Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 28 Freescale Semiconductor Implementation of DFT on StarCore SC3850 DSP Figure 14. First Stage Radix-2 Butterfly Diagram Figure 15. Middle Stage Radix-3 Butterfly Diagram Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 29 Implementation of DFT on StarCore SC3850 DSP Figure 16. Last Stage Radix-3 Butterfly Diagram Figure 17. First Stage Radix-4 Butterfly Diagram Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 30 Freescale Semiconductor Implementation of DFT on StarCore SC3850 DSP Figure 18. Middle Stage Radix-4 Butterfly Diagram Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 31 Implementation of DFT on StarCore SC3850 DSP Figure 19. Middle and Last Stage Radix-5 Butterfly Diagram Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 32 Freescale Semiconductor Experimental Results 5 Experimental Results This section provide the performance results of the DFT/IDFT kernel running on SC3850. The real-time cycle counts for the implemented DIT DFT and IDFT kernels are measured using the SC3850 cycle accurate simulator. CodeWarrior Integrated Development Environment (IDE) R10.0 is used to build the DFT kernels and the corresponding test harness. The cycle counts for the DIT DFT and IDFT kernels are shown in Figure 20 and Figure 21. Automatic scaling method finds the maximum magnitude data at each DFT stage and thus requires roughly 30% more computation as shown in the figure. 16 SC3850 DFT Performance 14 Cycles (Thousands) 12 10 8 Fixed Scaling 6 Auto Scaling 4 2 52 0 8 8 6 2 11 97 90 76 64 57 4 4 8 6 0 0 0 48 38 32 28 21 18 12 96 60 36 12 0 DFT points Figure 20. SC3850 DIT DFT Cycle Count at Different Points Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 33 Experimental Results 16 SC3850 IDFT Performance 14 Cycles (Thousands) 12 10 8 Fixed Scaling 6 Auto Scaling 4 2 76 8 90 0 97 2 11 52 12 0 18 0 21 6 28 8 32 4 38 4 48 0 57 6 64 8 60 96 12 36 0 IDFT points Figure 21. SC3850 DIT IDFT Cycle Count at Different Points Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 34 Freescale Semiconductor Conclusions 6 Conclusions The purpose of this application note is to develop DIT DFT/IDFT algorithm implemented on the SC3850 core architecture. The DFT/IDFT is computed by a combination of Radix-2, 3, 4, and 5 FFTs/IFFTs to achieve good speed performance. The implementation benefits the following SC3850 features: • VLES execution model, which utilizes maximum parallelism by allowing multiple address generation and data arithmetic logic units to execute multiple instructions in a single cycle • Special SIMD instructions, such as MAX2, ABS2, SOD2ffcc, MAC2ffggR, and MAC2ffggI • MOVER instructions with scaling, rounding, and limiting to avoid overflowing • Memory access with multiple data, such as MOVE2.4F, MOVE2.2f, MOVERH.4f, and MOVERL.4F. The SC3850 fixed-point code incurs very little loss of accuracy compared to the floating-point Matlab results. Fixed scaling and automatic scaling methods can be used to avoid the overflowing in the DFT calculation. In fixed scaling method, the scaling factors for the DFT stages are pre defined, and the algorithm scales the data without considering the data dynamic range. On the contrary, the automatic scaling method scales the data based on their dynamic range to obtain good bit precision, and thus produce results with higher SNR. But it needs to find the maximum magnitude of the input data to each stage, resulting in roughly 30% more computation. The powerful architecture and instruction set of the SC3850 core permits flexible and compact coding of the algorithms in assembly language. The optimization of the DIT DFT/IDFT is done by taking special advantage of StarCore parallel computation capabilities, such as SIMD, parallel computing, and software pipelining. 7 References 1. Discrete-Time Signal Processing (Second Edition). Alan V. Oppenheim, Ronald W. Schafer, John R. Buck, Prentice Hall International, 1999. 2. The Fast Fourier Transform and Its Applications. E. Oran Brigham, Prentice Hall, 1988. 3. SC3850 DSP Core Reference Manual. Rev.A, Freescale Semiconductor, 08/2008. 4. Functional Differences Between the SC3400 and the SC3x50 Cores. Rev. C 06/2008. 5. SC3850 Programmer’s Guide. Rev. A, 06/2008. 6. Implementation of Radix-4 Fast Fourier Transform on StarCore DSP SC3400. AN3426 Rev. 0, 06/2007. 7. Implementation of Mixed Radix Fast Fourier Transform (FFT) and Inverse FFT on StarCore DSP SC3400. AN3443 Rev. A, 11/2007. 8. Software Optimization of FFTs and IFFTs Using the SC3850 Core. AN3666 Rev. C, 1/2009 NOTE Freescale documents in the above list that use a letter for a revision designator are only available under NDA. Contact your local sales office or representative to access to these documents. Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 35 DFT Reference Code Appendix A DFT Reference Code DEFINE FIX_SCALE '1' UNDEF FIX_SCALE IF @DEF('FIX_SCALE') SCALE_RADIX_2 EQU 1 SCALE_RADIX_3 EQU 2 SCALE_RADIX_4 EQU 2 SCALE_RADIX_5 EQU 3 MSG 'FIX_SCALE is ON' ELSE MSG 'FIX_SCALE is OFF' ENDIF S51 S52 S53 S54 S31 S32 S33 S34 S41 S42 S43 S44 S21 S22 equ equ equ equ equ equ equ equ equ equ equ equ equ equ 2 2 3 3 2 2 2 2 2 2 2 2 1 1 ; ; ; ; ; ; ; ; ; ; ; ; ; ; Scaling Scaling Scaling Scaling Scaling Scaling Scaling Scaling Scaling Scaling Scaling Scaling Scaling Scaling check check check check check check check check check check check check check check for for for for for for for for for for for for for for radix-5 radix-5 radix-5 radix-5 radix-3 radix-3 radix-3 radix-3 radix-4 radix-4 radix-4 radix-4 radix-2 radix-2 stage stage stage stage stage stage stage stage stage stage stage stage stage stage (Last stage) (Last stage) (Middle stage) (Middle stage) (Last stage) (Last stage) (Middle stage) (Middle stage) (Middle stage) (Middle stage) (1st stage) (1st stage) (1st stage) (1st stage) section .data radix5parameter align 256 dcw $278E dcw $79BC dcw $678E dcw $4B3D radix3parameter dcw $6ED9 ; ; ; ; radix-5 radix-5 radix-5 radix-5 a b c d ; radix-3 sqrt(3)/2 ALIGN 8 my_OUTPUT_pointer: DS 8 endsec section .text global _sc3850_dft_dit_complex_16x16_auto_scale_asm align 16 _sc3850_dft_dit_complex_16x16_auto_scale_asm init: Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 36 Freescale Semiconductor DFT Reference Code push.2l d6:d7 push.2l r6:r7 [ clr d3 ; Clear for status register stack clr d4 ; Clear stage counter adda #72,sp,r4 move.l (r0)+,r1 ; base_address ] [ tfra r4,sp move.l r0,(my_OUTPUT_pointer) ;// --> psFft->psiOut ] [ move.l move.l ] [ move.l move.l ] [ move.l move.l ] [ move.l move.l ] [ move.l move.l ] [ move.l move.l ] [ move.l move.l ] [ move.l move.l ] [ move.l move.l ] [ move.l move.l ] [ move.l move.l ] (r0)+,r14 r1,(sp-8) ; out_address ; base_address (r0)+,r2 r14,(sp-12) ; num_radix ; out_address (r0)+,r3 r2,(sp-16) ; num_butterfly ; num_radix (r0)+,r4 r3,(sp-20) ; num_subgroup ; num_butterfly (r0)+,r5 r4,(sp-24) ; num_radix_offset ; num_subgroup (r0)+,r6 r5,(sp-28) ; digit_reversed_address ; num_radix_offset r6,(sp-32) sr,d3 ; digit_reversed_address ; Status Register d3,(sp-36) ; Status register #$0BE4008C,SR ; no scaling, Two's-complement rounding, 32 bit saturate, ; 16 bit saturate, W20=0 (r0)+,r7 d4,(sp-40) ; wb ; Initialize the stage counter r7,(sp-44) (r0)+,r7 ; wb ; wc r7,(sp-48) (r0)+,r7 ; wc ; wd Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 37 DFT Reference Code [ move.l r7,(sp-52) ; move.l (r0)+,r7 ; ] [ move.l (r0)+,r1 ; move.w (r2),d0 ; ] [ move.l (r0)+,r4 ; move.l r7,(sp-56) ; ] [ move.l r1,(sp-60) ; move.l r4,(sp-68) ; ] move.w (r2),d0 ; cmpeq.w #2,d0 ; jf radix4_dit_1_stage wd we DFT point num_radix pliAutoScale we DFT point pliAutoScale num_radix[s] num_radix[s]=0 or not ; Go to adix4_1st_stage if num_radix[s]!=2 ;/////////////////////////////////////////////////////////////////////// ; Radix-2 Stage: ;/////////////////////////////////////////////////////////////////////// radix2_init: ;--------------------------------------------------------; 1st stage (radix-2) ;--------------------------------------------------------radix2_dit_1_stage: IF @DEF('FIX_SCALE') [ move.l (sp-60),r2 ; DFT point move.l (sp-40),r3 ; Stage counter ] [ move.l (sp-32),r13 ; digit_reversed_address move.l (sp-68),r1 ; psiScale ] [ move.l (sp-8),r0 ; A input pointer move.l (sp-12),r9 ; A output pointer ] [ dosetup0 radix2_dit_1_stage_loop1 move.l (sp-28),r4 ; num_radix_offset ] [ move.w (r2),d14 ; DFT point move.w #SCALE_RADIX_2,d15 ] ELSE ;//IF @DEF('FIX_SCALE') ;-- Find max value (real & imag) & stage initialization Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 38 Freescale Semiconductor DFT Reference Code [ move.l (sp-8),r0 ; input base (A) move.l (sp-60),r2 ; DFT point ] [ move.w (r2),d15 ; DFT point ] [ tfr d15,d14 sub #8,d15 ; d15-8 ] [ asrr #2,d15 ; (d15-8)/4 ] [ adda #8,r0,r1 ; Input address move.w #2,n0 ] [ move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 move.2l (r0)+n0,d4:d5 move.2l (r1)+n0,d6:d7 ] [ abs2 d4,d4 abs2 d5,d5 abs2 d6,d6 abs2 d7,d7 doensh0 d15 ; loop count for max search ] ; Loop for max (real,imag) search loopstart0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 ] loopend0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.l (sp-40),r3 ; Stage counter ] [ tfr d6,d0 tfr d7,d1 move.l (sp-20),r2 ; num_butterfly move.l (sp-24),r4 ; num_subgroup Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 39 DFT Reference Code ] [ max2 d0,d4 max2 d1,d5 ] [ tfr d5,d0 ] [ clr d15 max2 d0,d4 move.l (sp-32),r13 ; digit_reversed_address ] [ add #2,d15 ; scale down = 2 sxt.w d4,d0 adda r3,r2 ; num_butterfly[s] adda r3,r4 ; num_subgroup[s] ] [ extract #16,#16,d4,d4 move.l (sp-68),r1 ; psiScale ] [ max d0,d4 ; Max real/imag move.l (sp-8),r0 ; A input pointer move.l (sp-12),r9 ; A output pointer ] [ clb d4,d0 ; count reading bit move.w (r2),r14 ; num_butterfly[s] move.w (r4),r5 ; num_subgroup[s] ] [ neg d0 ; negates dosetup0 radix2_dit_1_stage_loop1 move.l (sp-28),r4 ; num_radix_offset ] [ clr d4 sub #16,d0 ; limit 16-bit move.w #S22,d2 ] [ add #1,d4 cmpgt.w #S21,d0 ; Check if siNorm > S21 move.w #S21,d1 ] [ ift clr d15 ; scale down = 0 if siNorm > S21+1 ifa cmpeq.w #S21,d0 ; Check if siNorm == S21 ] tfrt d4,d15 ; scale down = 1 if siNorm == S21+1 ENDIF ;// IF @DEF('FIX_SCALE') [ move.l d15,(sp-64) ; Set norm to stack Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 40 Freescale Semiconductor DFT Reference Code move.w d15,(r1)+ ; psiScale <- norm ] [ move.l r1,(sp-68) ; psiScale asrr #2,d14 ; DFT point/4 move.l d15,r5 ; scale down ] [ doen0 d14 ; loop count adda #2,r13,r14 ] [ adda r3,r4 ; num_radix_offset[s] move.l SR,d3 ] [ move.w (r4),r1 ; num_radix_offset[s] move.l (sp-32),r15 ; digit_reversed_address ] [ move.l d3,r4 move.l #(_psiDigitReversedAddress_absolut_address),r15 ] ;/// ; For next stage input/output [ move.l r9,(sp-8) ; base_address move.l r0,(sp-12) ; out_address ] [ bmset #$1000,SR.L ; W20=1 tfra r9,r8 ; A output ] [ bmclr #$0080,SR.L ; SM2=0 adda r1,r9 ; B output pointer (long) ] [ bmclr #$0030,SR.L ; SCM=00 tfra r9,r10 ; B output ] ;--- Start Radix-2 (1st stage) [ move.l (r15)+,r0 ; digit_reversed_address(i+0) cmpeqa.w #1,r5 ] [ ift bmset #$0010,SR.L ; SCM=01 ifa move.l (r15)+,r1 ; digit_reversed_address(i+1) ] [ move.l (r15)+,r2 ; digit_reversed_address(i+2) cmpeqa.w #2,r5 ] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 41 DFT Reference Code [ ift bmset #$0030,SR.L ; SCM=11 ifa move.l (r15)+,r3 ; digit_reversed_address(i+3) ] [ MOVE2.2F (r0),d0 ; d0=Ar:Ai move.l (r15)+,r0 ; digit_reversed_address(i+0) ] [ MOVE2.2F (r1),d1 ; d1=Br:Bi move.l (r15)+,r1 ; digit_reversed_address(i+1) ] [ MOVE2.2F (r2),d4 ; d4=Ar:Ai move.l (r15)+,r2 ; digit_reversed_address(i+2) ] FALIGN radix2_dit_1_stage_loop1: loopstart0 [ ; 01 MOVE2.2F (r3),d5 ; d5=Br:Bi move.l (r15)+,r3 ; digit_reversed_address(i+3) ] [ ; 02 sod2aaii d1,d0,d2 ; d2=(Ar+Br),(Ai+Bi) sod2ssii d1,d0,d6 ; d3=(Ar-Br),(Ai-Bi) sod2aaii d5,d4,d3 ; d6=(Ar+Br),(Ai+Bi) sod2ssii d5,d4,d7 ; d7=(Ar-Br),(Ai-Bi) MOVE2.2F (r0),d0 ; d0=Ar:Ai move.l (r15)+,r0 ; digit_reversed_address(i+0) ] [ ; 03 MOVE2.2F (r1),d1 ; d1=Br:Bi move.l (r15)+,r1 ; digit_reversed_address(i+1) ] [ ; 04 MOVER2.4F d2:d3,(r8)+ ; out:[Aout_r,Aout_i], out:[Aout_r,Aout_i] MOVE2.2F (r2),d4 ; d4=Ar:Ai ] [ ; 05 MOVER2.4F d6:d7,(r9)+ ; out:[Bout_r,Bout_i], out:[Bout_r,Bout_i] move.l (r15)+,r2 ; digit_reversed_address(i+2) ] loopend0 [ move.l (sp-16),r2 ; num_radix move.l (sp-40),r3 ; Load the stage counter ] [ move.l r4,d3 adda #2,r3,r3 ; +2 ] [ move.l d3,SR adda r3,r2 ; num_butterfly[s] ] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 42 Freescale Semiconductor DFT Reference Code [ move.w (r2),r0 ; num_radix[s] move.l r3,(sp-40) ; Update the stage counter ] cmpeqa.w #0,r0 ; num_radix[s]=0 or not [ jt radix3_dit_last_stage ; Go to radix-3 Last stage processing if num_radix[s]=0 cmpeqa.w #5,r0 ; num_radix[s]=5 or not ] [ jt radix5_dit_m_stage ; Go to radix-5 Middle stage processing if num_radix[s]=5 cmpeqa.w #1,r0 ; num_radix[s]=1 or not ] [ jt radix5_dit_last_stage ; Go to radix-5 Last stage processing if num_radix[s]=1 cmpeqa.w #4,r0 ; num_radix[s]=4 or not ] jt radix4_dit_m_stage ; Go to radix-4 Middle stage processing if num_radix[s]=4 ;/////////////////////////////////////////////////////////////////////// ; Radix-3 Stages: ;/////////////////////////////////////////////////////////////////////// radix3_init: ;--------------------------------------------------------; Radix-3 (Middle stage) ;--------------------------------------------------------radix3_dit_m_stage: IF @DEF('FIX_SCALE') [ move.l (sp-40),r3 ; Stage counter dosetup0 radix3_dit_m_stage_loop1 ] [ move.l (sp-20),r2 ; num_butterfly move.l (sp-24),r4 ; num_subgroup ] [ move.l (sp-44),r5 ; r5 -> Wb move.l (sp-28),r12 ; num_radix_offse ] [ move.l (sp-8),r0 ; A input pointer move.l (sp-12),r9 ; A output pointer ] [ dosetup1 radix3_dit_m_stage_loop2 dosetup2 radix3_dit_m_stage_loop2_2 ] [ adda r3,r2 ; num_butterfly[s] adda r3,r12 ; num_radix_offse[s] ] [ move.w (r2),n0 ; num_butterfly[s],The stage twiddle factor's offset Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 43 DFT Reference Code move.w (r2),r13 ; num_butterfly[s] ] [ move.w (r12),r15 ; num_radix_offse[s] move.l (sp-68),r2 ; psiScale ] [ adda r3,r4 ; num_subgroup[s] move.l (sp-48),r6 ; r6 -> Wc ] [ move.w (r4),r14 ; num_subgroup[s] move.w #SCALE_RADIX_3,d15 ] ELSE ;//IF @DEF('FIX_SCALE') move.l (sp-40),r3 ; Stage counter ;-- Find max value (real & imag) & stage initialization [ move.l (sp-8),r0 ; input base (A) move.l (sp-60),r2 ; DFT point ] [ move.w (r2),d15 ; DFT point ] [ sub #8,d15 ; d15-8 ] [ asrr #2,d15 ; (d15-8)/4 ] [ adda #8,r0,r1 ; Input address move.w #2,n0 ] [ move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 move.2l (r0)+n0,d4:d5 move.2l (r1)+n0,d6:d7 ] [ abs2 d4,d4 abs2 d5,d5 abs2 d6,d6 abs2 d7,d7 doensh0 d15 ; loop count for max search ] ; Loop for max (real,imag) search loopstart0 [ max2 d0,d4 max2 d1,d5 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 44 Freescale Semiconductor DFT Reference Code max2 d2,d6 max2 d3,d7 move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 ] loopend0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.l (sp-20),r2 ; num_butterfly move.l (sp-24),r4 ; num_subgroup ] [ tfr d6,d0 tfr d7,d1 move.l (sp-44),r5 ; r5 -> Wb move.l (sp-28),r12 ; num_radix_offse ] [ max2 d0,d4 max2 d1,d5 dosetup0 radix3_dit_m_stage_loop1 dosetup1 radix3_dit_m_stage_loop2 ] [ tfr d5,d0 move.l (sp-8),r0 ; A input pointer move.l (sp-12),r9 ; A output pointer ] [ max2 d0,d4 dosetup2 radix3_dit_m_stage_loop2_2 ] [ sxt.w d4,d0 adda r3,r12 ; num_radix_offse[s] ] [ extract #16,#16,d4,d4 adda r3,r2 ; num_butterfly[s] adda r3,r4 ; num_subgroup[s] ] [ max d0,d4 ; Max real/imag move.l (sp-48),r6 ; r6 -> Wc move.w (r12),r15 ; num_radix_offse[s] ] [ clb d4,d0 ; count reading bit move.w (r2),r13 ; num_butterfly[s] move.w (r4),r14 ; num_subgroup[s] ] [ clr d14 neg d0 ; negates Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 45 DFT Reference Code move.l (sp-68),r2 ; psiScale ] [ add #1,d14 ; d14 = 1 sub #16,d0 ; limit 16-bit move.w #S34,d2 ; d2 = 2 move.w #2,d15 ; scale down = 2; ] [ cmpgt.w #S33,d0 ; Check if siNorm > S33 move.w #S33,d1 ] [ ift clr d15 ; scale down = 0 if siNorm > S33+1 ifa cmpeq.w #S33,d0 ; Check if siNorm == S33 ] [ tfrt d14,d15 ; scale down = 1 if siNorm == S33+1 tfra r13,n0 ; The stage twiddle factor's offset ] ENDIF ;// IF @DEF('FIX_SCALE') ; For next stage input/output [ move.l r9,(sp-8) ; base_address at the next stage move.l r0,(sp-12) ; out_address at the next stage ] [ move.l d15,(sp-64) ; Set norm to stack move.w d15,(r2)+ ; psiScale <- norm ] [ move.l r2,(sp-68) ; psiScale move.l d15,r11 ] [ tfra r9,r8 ; A output move.l SR,d4 ] [ adda r15,r9 ; B output pointer (long) move.l d4,r3 ] [ tfra r9,r10 ; B output move.f #$4000,d7 ; M2_3 = (1/2):0 ] [ tfra r13,r12 bmset #$1000,SR.L ; W20=1 ] [ adda r15,r10 ; C output pointer (long) bmclr #$0080,SR.L ; SM2=0 ] [ asra r12 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 46 Freescale Semiconductor DFT Reference Code bmclr #$0030,SR.L ; SCM=00 ] ;--- Start Radix-3 (Middle stage) [ move.l #radix3parameter,r15 ; Point to the radix-3 parameter array cmpeqa.w #1,r11 ] [ ift bmset #$0010,SR.L ; SCM=01 ifa asra r14 ] [ doen0 r14 cmpeqa.w #2,r11 ] [ ift bmset #$0030,SR.L ; SCM=11 ifa move.2f (r0)+,d4:d5 ; IAre:IAim=Ar:Ai ] [ bmtsts #$0001,r13.l ; if num_butterfly[s] is odd, T=1 tfra r12,r13 ] [ suba #1,r13 move.l (r0)+,d2 ; v_IB1=Br:Bi ] ; Intra loop FALIGN radix3_dit_m_stage_loop1: ; --> Butterflies/Subgroup loopstart0 [ doen1 r12 move.l (r5)+n0,d9 ; v_WB1=Wbr:Wbi ] FALIGN radix3_dit_m_stage_loop2: ; --> Subgroup loopstart1 [ ;01 MPYRE d2,d9,d12 ; M1_1 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d2,d9,d13 ; M1_2 = Bi' = (Br*Wbi + Bi*Wbr) move.l (r0)+,d8 ; v_IC1=Cr:Ci move.l (r6),d10 ; v_WC1=Wcr:Wci ] [ ;02 mac2assar d8,d10,d12.H,d6 ; M2_1 = (Br'+Cr'):(Br'-Cr') mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci') move.f (r15),d10 ; M3_4 = (sqrt(3)/2):(0) move.2f (r0)+,d14:d15 ; IAre:IAim=Ar:Ai ] ;stall [ ;04 sod2aaii d4,d6,d0 ; Aout_r = Ar + M2_1_H // Aout_r = Ar'+Br'+Cr' sod2aaii d5,d13,d1 ; Aout_i = Ai + M2_2_H // Aout_i = Ai'+Bi'+Ci' mac -d6,d7,d4 ; M3_1_H = Ar - M2_1_H*1/2 // Ar' - (Br'+Cr')*1/2 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 47 DFT Reference Code mac -d13,d7,d5 ; M3_2_H = Ai - M2_2_H*1/2 // Ai' - (Bi'+Ci')*1/2 move.l d13,d8 move.l (r0)+,d2 ; v_IB1=Br:Bi ] [ ;05 mac2aassi d8,d10,d4.H,d4 ; Bout_r = Ar' - (Br'+Cr')*1/2 + (Bi'-Ci')*sqrt(3)/2 :Cout_r = Ar' - (Br'+Cr')*1/2 - (Bi'-Ci')*sqrt(3)/2 MPYRE d2,d9,d12 ; M1_1 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d2,d9,d13 ; M1_2 = Bi' = (Br*Wbi + Bi*Wbr) move.l (r0)+,d8 ; v_IC1=Cr:Ci move.l (r6),d10 ; v_WC1=Wcr:Wci ] [ ;06 mac2assar d8,d10,d12.H,d12 ; M2_1 = (Br'+Cr'):(Br'-Cr') mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci') move.f (r15),d10 ; M3_4 = (sqrt(3)/2):(0) move.l d6,d8 ] ;stall [ ;08 sod2aaii d14,d12,d2 ; Aout_r = Ar + M2_1_H // Aout_r = Ar'+Br'+Cr' sod2aaii d15,d13,d3 ; Aout_i = Ai + M2_2_H // Aout_i = Ai'+Bi'+Ci' mac -d12,d7,d14 ; M3_1_H = Ar - M2_1_H*1/2 // Ar' - (Br'+Cr')*1/2 mac2ssaai d8,d10,d5.H,d5 ; Bout_i=Ai' - (Bi'+Ci')*1/2 - (Br'-Cr')*sqrt(3)/2 :Cout_i=Ai' - (Bi'+Ci')*1/2 + (Br'-Cr')*sqrt(3)/2 move.l d13,d8 ] [ ;09 mac2aassi d8,d10,d14.H,d14 ; Bout_r:Cout_r mac -d13,d7,d15 ; M3_2_H = Ai - M2_2_H*1/2 // Ai' - (Bi'+Ci')*1/2 move.l d12,d8 move.2f (r0)+,d12:d13 ; IAre:IAim=Ar:Ai ] [ ;10 mac2ssaai d8,d10,d15.H,d15 ; Bout_i:Cout_i MOVERH.4F d0:d1:d2:d3,(r8)+ ; out:[Aout_r,Aout_i] move.l (r0)+,d2 ; v_IB1=Br:Bi ] [ ;11 tfr d12,d4 tfr d13,d5 MOVERH.4F d4:d5:d14:d15,(r9)+ ; out:[Bout_r,Bout_i],[Bout_r,Bout_i] MOVERL.4F d4:d5:d14:d15,(r10)+ ; out:[Cout_r,Cout_i],[Cout_r,Cout_i] ] loopend1 [ iff addl2a n0,r6 iff move.l (r5)+n0,d9 ; v_WB1=Wbr:Wbi ] [ MPYRE d2,d9,d12 ; M1_1 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d2,d9,d13 ; M1_2 = Bi' = (Br*Wbi + Bi*Wbr) move.l (r0)+,d8 ; v_IC1=Cr:Ci move.l (r6),d10 ; v_WC1=Wcr:Wci ] [ mac2assar d8,d10,d12.H,d6 ; M2_1 = (Br'+Cr'):(Br'-Cr') Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 48 Freescale Semiconductor DFT Reference Code mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci') move.f (r15),d10 ; M3_4 = (sqrt(3)/2):(0) move.2f (r0)+,d14:d15 ; IAre:IAim=Ar:Ai ] [ ift addl2a n0,r6 ift move.l (r5),d9 ; v_WB1=Wbr:Wbi ;stall ] [ sod2aaii d4,d6,d0 ; Aout_r = Ar + M2_1_H // Aout_r = Ar'+Br'+Cr' sod2aaii d5,d13,d1 ; Aout_i = Ai + M2_2_H // Aout_i = Ai'+Bi'+Ci' mac -d6,d7,d4 ; M3_1_H = Ar - M2_1_H*1/2 // Ar' - (Br'+Cr')*1/2 mac -d13,d7,d5 ; M3_2_H = Ai - M2_2_H*1/2 // Ai' - (Bi'+Ci')*1/2 move.l d13,d8 move.l (r0)+,d2 ; v_IB1=Br:Bi ] [ mac2aassi d8,d10,d4.H,d4 ; Bout_r:Cout_r MPYRE d2,d9,d12 ; M1_1 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d2,d9,d13 ; M1_2 = Bi' = (Br*Wbi + Bi*Wbr) move.l (r0)+,d8 ; v_IC1=Cr:Ci move.l (r6),d10 ; v_WC1=Wcr:Wci ] [ mac2assar d8,d10,d12.H,d12 ; M2_1 = (Br'+Cr'):(Br'-Cr') mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci') move.f (r15),d10 ; M3_4 = (sqrt(3)/2):(0) move.l d6,d8 ] [ ift addl2a n0,r5 ift doen2 r12 ifa sod2aaii d14,d12,d2 ; Aout_r = Ar + M2_1_H // Aout_r = Ar'+Br'+Cr' sod2aaii d15,d13,d3 ; Aout_i = Ai + M2_2_H // Aout_i = Ai'+Bi'+Ci' ;stall ] [ iff doen2 r13 ifa mac -d12,d7,d14 ; M3_1_H = Ar - M2_1_H*1/2 // Ar' - (Br'+Cr')*1/2 mac2ssaai d8,d10,d5.H,d5 ; Bout_i:Cout_i move.l d13,d8 ] [ mac2aassi d8,d10,d14.H,d14 ; Bout_r:Cout_r mac -d13,d7,d15 ; M3_2_H = Ai - M2_2_H*1/2 // Ai' - (Bi'+Ci')*1/2 move.l d12,d8 move.2f (r0)+,d12:d13 ; IAre:IAim=Ar:Ai ] [ mac2ssaai d8,d10,d15.H,d15 ; Bout_i:Cout_i moverh.4f d0:d1:d2:d3,(r8)+ ; out:[Aout_r,Aout_i] move.l (r0)+,d2 ; v_IB1=Br:Bi ] radix3_dit_m_stage_loop2_2: ; --> Subgroup Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 49 DFT Reference Code loopstart2 [ ;01 tfr d12,d4 tfr d13,d5 MOVERH.4F d4:d5:d14:d15,(r9)+ ; out:[Bout_r,Bout_i],[Bout_r,Bout_i] MOVERL.4F d4:d5:d14:d15,(r10)+ ; out:[Cout_r,Cout_i],[Cout_r,Cout_i] ] [ ;02 MPYRE d2,d9,d12 ; M1_1 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d2,d9,d13 ; M1_2 = Bi' = (Br*Wbi + Bi*Wbr) move.l (r0)+,d8 ; v_IC1=Cr:Ci move.l (r6),d10 ; v_WC1=Wcr:Wci ] [ ;03 mac2assar d8,d10,d12.H,d6 ; M2_1 = (Br'+Cr'):(Br'-Cr') mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci') move.f (r15),d10 ; M3_4 = (sqrt(3)/2):(0) move.2f (r0)+,d14:d15 ; IAre:IAim=Ar:Ai ] ;STALL [ ;05 sod2aaii d4,d6,d0 ; Aout_r = Ar + M2_1_H // Aout_r = Ar'+Br'+Cr' sod2aaii d5,d13,d1 ; Aout_i = Ai + M2_2_H // Aout_i = Ai'+Bi'+Ci' mac -d6,d7,d4 ; M3_1_H = Ar - M2_1_H*1/2 // Ar' - (Br'+Cr')*1/2 mac -d13,d7,d5 ; M3_2_H = Ai - M2_2_H*1/2 // Ai' - (Bi'+Ci')*1/2 move.l d13,d8 move.l (r0)+,d2 ; v_IB1=Br:Bi ] [ ;06 mac2aassi d8,d10,d4.H,d4 ; Bout_r:Cout_r MPYRE d2,d9,d12 ; M1_1 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d2,d9,d13 ; M1_2 = Bi' = (Br*Wbi + Bi*Wbr) move.l (r0)+,d8 ; v_IC1=Cr:Ci move.l (r6),d10 ; v_WC1=Wcr:Wci ] [ ;07 mac2assar d8,d10,d12.H,d12 ; M2_1 = (Br'+Cr'):(Br'-Cr') mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci') move.f (r15),d10 ; M3_4 = (sqrt(3)/2):(0) move.l d6,d8 ] ;stall [ ;09 sod2aaii d14,d12,d2 ; Aout_r = Ar + M2_1_H // Aout_r = Ar'+Br'+Cr' sod2aaii d15,d13,d3 ; Aout_i = Ai + M2_2_H // Aout_i = Ai'+Bi'+Ci' mac -d12,d7,d14 ; M3_1_H = Ar - M2_1_H*1/2 // Ar' - (Br'+Cr')*1/2 mac2ssaai d8,d10,d5.H,d5 ; Bout_i:Cout_i move.l d13,d8 ] [ ;10 mac2aassi d8,d10,d14.H,d14 ; Bout_r:Cout_r mac -d13,d7,d15 ; M3_2_H = Ai - M2_2_H*1/2 // Ai' - (Bi'+Ci')*1/2 move.l d12,d8 move.2f (r0)+,d12:d13 ; IAre:IAim=Ar:Ai ] [ ;11 mac2ssaai d8,d10,d15.H,d15 ; Bout_i:Cout_i Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 50 Freescale Semiconductor DFT Reference Code MOVERH.4F d0:d1:d2:d3,(r8)+ ; out:[Aout_r,Aout_i] move.l (r0)+,d2 ; v_IB1=Br:Bi ] loopend2 [ tfr d12,d4 tfr d13,d5 MOVERH.4F d4:d5:d14:d15,(r9)+ ; out:[Bout_r,Bout_i],[Bout_r,Bout_i] MOVERL.4F d4:d5:d14:d15,(r10)+ ; out:[Cout_r,Cout_i],[Cout_r,Cout_i] ] addl2a n0,r6 loopend0 [ move.l r3,d3 move.l (sp-40),r3 ; Load the stage counter ] [ move.l d3,SR move.l (sp-16),r2 ; num_radix ] adda #2,r3,r3 ; +2 [ move.l r3,(sp-40) ; Update the stage counter adda r3,r2 ; num_butterfly[s] ] move.w (r2),r0 ; num_radix[s] cmpeqa.w #3,r0 ; num_radix[s]=3 or not [ jt radix3_dit_m_stage ; Go to radix-3 Middle stage processing if num_radix[s]=3 cmpeqa.w #5,r0 ; num_radix[s]=5 or not ] [ jt radix5_dit_m_stage ; Go to radix-5 Middle stage processing if num_radix[s]=5 cmpeqa.w #1,r0 ; num_radix[s]=1 or not ] jt radix5_dit_last_stage ; Go to radix-5 Last stage processing if num_radix[s]=1 ;--------------------------------------------------------; Radix-3 (Last stage) ;--------------------------------------------------------radix3_dit_last_stage: IF @DEF('FIX_SCALE') [ move.l (sp-40),r3 move.l (sp-28),r2 ] [ move.l (sp-24),r4 move.l (sp-44),r12 ] [ move.l (sp-12),r8 move.l (sp-48),r13 ] [ clr d15 ; Stage counter ; num_radix_offse ; num_subgroup ; r12 -> Wb ; A output pointer ; r13 -> Wc Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 51 DFT Reference Code move.l (sp-8),r0 ; A input pointer move.l (sp-68),r1 ; psiScale ] [ add #SCALE_RADIX_3,d15 adda r3,r2 ; num_radix_offse[s] dosetup0 radix3_dit_last_stage_loop1 ] [ move.w (r2),r15 ; num_radix_offse[s] adda r3,r4 ; num_subgroup[s] ] [ move.w (r4),r14 ; num_subgroup[s] tfra r8,r9 ; A output ] [ move.l d15,(sp-64) ; Set norm to stack adda r15,r9 ; B output pointer (long) ] [ tfra r9,r10 ; B output move.w d15,(r1)+ ; psiScale <- norm ] [ adda r15,r10 ; C output pointer (long) move.l #radix3parameter,r15 ; Point to the radix-3 parameter array ] ELSE ;//IF @DEF('FIX_SCALE') ;-- Find max value (real & imag) & stage initialization [ move.l (sp-8),r0 ; input base (A) move.l (sp-60),r2 ; DFT point ] [ move.w (r2),d15 ; DFT point ] [ sub #8,d15 ; d15-8 ] [ asrr #2,d15 ; (d15-8)/4 ] [ adda #8,r0,r1 ; Input address move.w #2,n0 ] [ move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 move.2l (r0)+n0,d4:d5 move.2l (r1)+n0,d6:d7 ] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 52 Freescale Semiconductor DFT Reference Code [ abs2 d4,d4 abs2 d6,d6 doensh0 d15 ] abs2 d5,d5 abs2 d7,d7 ; loop count for max search ; Loop for max (real,imag) search loopstart0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 ] loopend0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.l (sp-20),r2 ; num_butterfly move.l (sp-24),r4 ; num_subgroup ] [ tfr d6,d0 tfr d7,d1 move.l (sp-40),r3 ; Stage counter move.l (sp-28),r2 ; num_radix_offse ] [ max2 d0,d4 max2 d1,d5 move.l (sp-44),r12 ; r12 -> Wb move.l (sp-48),r13 ; r13 -> Wc ] [ tfr d5,d0 move.l (sp-8),r0 ; A input pointer move.l (sp-12),r8 ; A output pointer ] [ max2 d0,d4 move.l (sp-68),r1 ; psiScale ] [ sxt.w d4,d0 ] [ extract #16,#16,d4,d4 ] [ max d0,d4 ; Max real/imag adda r3,r2 ; num_radix_offse[s] adda r3,r4 ; num_subgroup[s] ] [ Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 53 DFT Reference Code clb d4,d0 ; count reading bit dosetup0 radix3_dit_last_stage_loop1 move.w (r2),r15 ; num_radix_offse[s] ] [ neg d0 ; negates move.w (r4),r14 ; num_subgroup[s] tfra r8,r9 ; A output ] [ sub #16,d0 ; limit 16-bit move.w #S32,d2 adda r15,r9 ; B output pointer (long) ] [ clr d11 cmpgt.w #S31,d0 ; Check if norm > S or norm <= S move.w #S31,d1 tfra r9,r10 ; B output ] [ iff sub d0,d1,d15 ; scale down = S - d0 if norm <=S ift sub d0,d2,d15 ; scale up = S - d0 if norm > S ] [ max d11,d15 ; shift shouldn't be to the left adda r15,r10 ; C output pointer (long) move.l #radix3parameter,r15 ; Point to the radix-3 parameter array ] [ move.l d15,(sp-64) ; Set norm to stack move.w d15,(r1)+ ; psiScale <- norm ] ENDIF ;// IF @DEF('FIX_SCALE') [ move.l r1,(sp-68) ; psiScale move.l d15,r1 ] ;--- Start Radix-3 (Last stage) [ asra r14 move.l SR,d3 ] [ move.l d3,r3 bmset #$1000,SR.L ] [ doen0 r14 ] ; Inner loop [ move.2f (r0)+,d4:d5 bmclr #$0080,SR.L ] ; W20=1 ; set the loop ; IAre:IAim=Ar:Ai ; SM2=0 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 54 Freescale Semiconductor DFT Reference Code [ move.w (r4),r14 ; num_subgroup[s] bmclr #$0030,SR.L ; SCM=00 ] [ cmpeqa.w #1,r1 move.l (r0)+,d12 ; v_IB1=Br:Bi ] [ ift bmset #$0010,SR.L ; SCM=01 ifa move.2l (r12)+,d8:d9 ; v_WB1=Wbr:Wbi ] [ MPYRE d12,d8,d12 ; M1_1 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d12,d8,d13 ; M1_2 = Bi' = (Br*Wbi + Bi*Wbr) move.l (r0)+,d8 ; v_IC1=Cr:Ci move.l (r13)+,d10 ; v_WC1=Wcr:Wci ] [ mac2assar d8,d10,d12.H,d6 ; M2_1 = (Br'+Cr'):(Br'-Cr') move.f #$4000,d7 ; M2_3 = (1/2):0 cmpeqa.w #2,r1 ] [ ift bmset #$0030,SR.L ; SCM=11 ifa mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci') move.f (r15),d10 ; M3_4 = (sqrt(3)/2):(0) ] [ bmtsts #$0001,r14.l ; if num_subgroup[s] is odd, T=1 move.2f (r0)+,d14:d15 ; IAre:IAim=Ar:Ai ] FALIGN radix3_dit_last_stage_loop1: loopstart0 [ ;01 sod2aaii d4,d6,d0 ; Aout_r = Ar + M2_1_H // Aout_r sod2aaii d5,d13,d1 ; Aout_i = Ai + M2_2_H // Aout_i mac -d6,d7,d4 ; M3_1_H = Ar - M2_1_H*1/2 // mac -d13,d7,d5 ; M3_2_H = Ai - M2_2_H*1/2 // move.l d13,d8 move.l (r0)+,d12 ; v_IB1=Br:Bi ] [ ;02 mac2aassi d8,d10,d4.H,d4 ; Bout_r:Cout_r MPYRE d12,d9,d12 ; M1_1 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d12,d9,d13 ; M1_2 = Bi' = (Br*Wbi + Bi*Wbr) move.l (r0)+,d8 ; v_IC1=Cr:Ci move.2l (r13)+,d10:d11 ; v_WC1=Wcr:Wci ] [ ;03 mac2assar d8,d10,d12.H,d12 ; M2_1 = (Br'+Cr'):(Br'-Cr') mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci') tfr d6,d9 move.f (r15),d10 ; M3_4 = (sqrt(3)/2):(0) = Ar'+Br'+Cr' = Ai'+Bi'+Ci' Ar' - (Br'+Cr')*1/2 Ai' - (Bi'+Ci')*1/2 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 55 DFT Reference Code ] [ ;04 sod2aaii d14,d12,d2 ; Aout_r = Ar + M2_1_H // Aout_r = Ar'+Br'+Cr' sod2aaii d15,d13,d3 ; Aout_i = Ai + M2_2_H // Aout_i = Ai'+Bi'+Ci' move.l d13,d8 ] [ ;05 tfr d5,d0 mac -d12,d7,d14 ; M3_1_H = Ar - M2_1_H*1/2 // Ar' - (Br'+Cr')*1/2 mac -d13,d7,d15 ; M3_2_H = Ai - M2_2_H*1/2 // Ai' - (Bi'+Ci')*1/2 MOVERH.4F d0:d1:d2:d3,(r8)+ ; out:[Aout_r,Aout_i] move.2f (r0)+,d2:d3 ; IAre:IAim=Ar:Ai ] [ ;06 mac2ssaai d9,d10,d0.H,d5 ; Bout_i:Cout_i mac2aassi d8,d10,d14.H,d14 ; Bout_r:Cout_r move.l d12,d8 move.l (r0)+,d12 ; v_IB1=Br:Bi ] [ ;07 mac2ssaai d8,d10,d15.H,d15 ; Bout_i:Cout_i tfr d11,d10 move.l (r12)+,d9 ; v_WB1=Wbr:Wbi move.l (r0)+,d8 ; v_IC1=Cr:Ci ] [ ;08 MPYRE d12,d9,d12 ; M1_1 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d12,d9,d13 ; M1_2 = Bi' = (Br*Wbi + Bi*Wbr) move.l (r12)+,d9 ; v_WB1=Wbr:Wbi ] [ ;09 tfr d2,d4 tfr d3,d5 mac2assar d8,d10,d12.H,d6 ; M2_1 = (Br'+Cr'):(Br'-Cr') mac2aassi d8,d10,d13.H,d13 ; M2_2 = (Bi'+Ci'):(Bi'-Ci') MOVERH.4F d4:d5:d14:d15,(r9)+ ; out:[Bout_r,Bout_i],[Bout_r,Bout_i] MOVERL.4F d4:d5:d14:d15,(r10)+; out:[Cout_r,Cout_i],[Cout_r,Cout_i] ] [ ;10 move.f (r15),d10 ; M3_4 = (sqrt(3)/2):(0) move.2f (r0)+,d14:d15 ; IAre:IAim=Ar:Ai ;stall ] loopend0 [ sod2aaii d4,d12,d0 ; Aout_r = Ar + M2_1_H // Aout_r = Ar'+Br'+Cr' sod2aaii d5,d13,d1 ; Aout_i = Ai + M2_2_H // Aout_i = Ai'+Bi'+Ci' mac -d12,d7,d4 ; M3_1_H = Ar - M2_1_H*1/2 // Ar' - (Br'+Cr')*1/2 mac -d13,d7,d5 ; M3_2_H = Ai - M2_2_H*1/2 // Ai' - (Bi'+Ci')*1/2 move.l d13,d8 ] [ mac2aassi d8,d10,d4.H,d4 ; Bout_r:Cout_r move.l d12,d8 ] [ ift MOVERH.4F d0:d1:d2:d3,(r8)+ ; out:[Aout_r,Aout_i] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 56 Freescale Semiconductor DFT Reference Code ifa mac2ssaai d8,d10,d5.H,d5 ; Bout_i:Cout_i move.l r3,d6 ] [ ift MOVERH.4F d4:d5:d14:d15,(r9)+ ; out:[Bout_r,Bout_i],[Bout_r,Bout_i] ift MOVERL.4F d4:d5:d14:d15,(r10)+ ; out:[Cout_r,Cout_i],[Cout_r,Cout_i] ] [ jmp exit_mixed_radix move.l d6,SR ] ;/////////////////////////////////////////////////////////////////////// ; Radix-4 Stages: ;/////////////////////////////////////////////////////////////////////// radix4_init: ;--------------------------------------------------------; Radix-4 (1st stage) ;--------------------------------------------------------radix4_dit_1_stage: IF @DEF('FIX_SCALE') [ clr d15 move.l (sp-20),r15 ; num_butterfly move.l (sp-28),r4 ; num_radix_offset ] [ add #SCALE_RADIX_4,d15 move.l (sp-8),r0 ; A input pointer move.l (sp-12),r8 ; A output pointer ] [ dosetup0 radix4_dit_1_stage_loop1 move.l (sp-68),r2 ; psiScale ] [ move.l d15,(sp-64) ; Set norm to stack move.l d15,r5 ; psiScale <- norm ] [ move.w (r4),r14 ; num_radix_offset[s] move.w (r15),r15 ; num_butterfly[s] ] ; For next stage input/output [ move.l r8,(sp-8) ; base_address at the next stage move.l r0,(sp-12) ; out_address at the next stage ] [ tfra r8,r9 ; A output move.w d15,(r2)+ ; psiScale <- norm ] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 57 DFT Reference Code [ move.l r2,(sp-68) ; psiScale adda r14,r9 ; B output ] [ tfra r9,r10 ; B output suba #1,r15 ] [ adda r14,r10 ; C output doen0 r15 ; loop0 = num_butterfly[s] ] [ tfra r10,r11 ; C output move.l #(_psiDigitReversedAddress_absolut_address),r15 ] ELSE ;//IF @DEF('FIX_SCALE') ;-- Find max value (real & imag) & stage initialization [ move.l (sp-8),r0 ; input base (A) move.l (sp-60),r2 ; DFT point ] move.w #2,n0 [ move.w (r2),d15 ; DFT point adda #8,r0,r1 ; Input address ] [ sub #8,d15 ; d15-8 move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ asrr #2,d15 ; (d15-8)/4 move.l (sp-20),r2 ; num_butterfly move.l (sp-24),r4 ; num_subgroup ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 move.2l (r0)+n0,d4:d5 move.2l (r1)+n0,d6:d7 ] [ abs2 d4,d4 abs2 d5,d5 abs2 d6,d6 abs2 d7,d7 doensh0 d15 ; loop count for max search ] ; Loop for max (real,imag) search loopstart0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 58 Freescale Semiconductor DFT Reference Code [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 ] loopend0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 ] [ tfr d6,d0 tfr d7,d1 dosetup0 radix4_dit_1_stage_loop1 ] [ max2 d0,d4 max2 d1,d5 move.l (sp-8),r0 ; A input pointer move.l (sp-12),r9 ; A output pointer ] [ tfr d5,d0 ] [ max2 d0,d4 move.w (r4),r5 ; num_subgroup[s] move.l (sp-28),r4 ; num_radix_offset ] [ sxt.w d4,d0 move.w (r2),r15 ; num_butterfly[s] move.w #S44,d2 ] [ extract #16,#16,d4,d4 move.l (sp-68),r2 ; psiScale ] [ max d0,d4 ; Max real/imag move.w #S43,d1 ] [ clb d4,d0 ; count leading bit ] [ neg d0 ; negates move.w (r4),r14 ; num_radix_offset[s] suba #1,r15 ] [ clr d14 sub #16,d0 ; limit 16-bit suba #1,r5 ; r5 - 1 tfra r9,r8 ; A output ] [ add #1,d14 cmpgt.w #S43,d0 ; Check if siNorm > S43 move.w #2,d15 ; scale down = 2 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 59 DFT Reference Code adda r14,r9 ; B output ] [ ift clr d15 ; scale down = 0 if siNorm > S43+1 ifa cmpeq.w #S43,d0 ; Check if siNorm = S43 tfra r9,r10 ; B output ] [ tfrt d14,d15 ; scale down = 1 if siNorm = S43 adda r14,r10 ; C output ] [ move.w d15,(r2)+ ; psiScale <- norm tfra r10,r11 ; C output ] [ move.l d15,(sp-64) ; Set norm to stack move.l r2,(sp-68) ; psiScale ] [ doen0 r15 ; loop0 = num_butterfly[s] move.l d15,r5 ; psiScale <- norm ] ; For next stage input/output [ move.l r8,(sp-8) ; base_address at the next stage move.l r0,(sp-12) ; out_address at the next stage ] move.l #(_psiDigitReversedAddress_absolut_address),r15 ENDIF ;// IF @DEF('FIX_SCALE') [ adda r14,r11 ; D output ;--- Start Radix-4 (1st stage) move.l (r15)+,r0 ] [ move.l SR,d3 move.l (r15)+,r1 ] [ move.l d3,r4 move.l (r15)+,r2 ] [ bmset #$1000,SR.L move.l (r15)+,r3 ] [ bmclr #$0080,SR.L MOVE2.2F (r0),d0 ] [ bmclr #$0030,SR.L move.l (r15)+,r0 ; digit_reversed_address(i+0) ; digit_reversed_address(i+1) ; digit_reversed_address(i+2) ; W20=1 ; digit_reversed_address(i+3) ; SM2=0 ; d0=Ar:Ai ; SCM=00 ; digit_reversed_address(i+0) Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 60 Freescale Semiconductor DFT Reference Code ] [ cmpeqa.w #1,r5 MOVE2.2F (r1),d2 ; d2=Br:Bi ] [ ift bmset #$0010,SR.L ; SCM=01 ifa move.l (r15)+,r1 ; digit_reversed_address(i+1) ] [ cmpeqa.w #2,r5 MOVE2.2F (r2),d1 ; d1=Cr:Ci ] [ ift bmset #$0030,SR.L ; SCM=11 ifa move.l (r15)+,r2 ; digit_reversed_address(i+2) ] [ MOVE2.2F (r3),d3 ; d3=Dr:Di move.l (r15)+,r3 ; digit_reversed_address(i+3) ] [ sod2aaii d1,d0,d4 ; d0=(Ar+Cr),(Ai+Ci) sod2ssii d1,d0,d5 ; d1=(Ar-Cr),(Ai-Ci) sod2aaii d3,d2,d6 ; d2=(Br+Dr),(Bi+Di) sod2ssii d3,d2,d7 ; d3=(Br-Dr),(Bi-Di) MOVE2.2F (r0),d0 ; d0=Ar:Ai ] FALIGN radix4_dit_1_stage_loop1: loopstart0 [ MOVE2.2F (r1),d2 ; d2=Br:Bi move.l (r15)+,r0 ; digit_reversed_address(i+0) ] [ sod2aaii d6,d4,d4 ; d4=(Ar':Ai')=((Ar+Br)+(Cr+Dr),(Ai+Bi)+(Ci+Di)) sod2saxx d7,d5,d5 ; d5=(Br':Bi')=((Ar-Cr)+(Bi-Di),(Ai-Ci)-(Br-Dr)) sod2ssii d6,d4,d6 ; d6=(Cr':Ci')=((Ar+Cr)-(Br+Dr),(Ai+Ci)-(Bi+Di)) sod2asxx d7,d5,d7 ; d7=(Dr':Di')=((Ar-Cr)-(Bi-Di),(Ai-Ci)+(Br-Dr)) MOVE2.2F (r2),d1 ; d1=Cr:Ci move.l (r15)+,r1 ; digit_reversed_address(i+1) ] [ MOVER2.2F d4,(r8)+ ; out:[Aout_r,Aout_i], out:[Bout_r,Bout_i] move.l (r15)+,r2 ; digit_reversed_address(i+2) ] [ MOVER2.2F d5,(r9)+ ; out:[Aout_r,Aout_i], out:[Bout_r,Bout_i] MOVE2.2F (r3),d3 ; d3=Dr:Di ] [ MOVER2.2F d6,(r10)+ ; out:[Cout_r,Cout_i], out:[Dout_r,Dout_i] move.l (r15)+,r3 ; digit_reversed_address(i+3) ] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 61 DFT Reference Code [ sod2aaii d1,d0,d4 ; d0=(Ar+Cr),(Ai+Ci) sod2ssii d1,d0,d5 ; d1=(Ar-Cr),(Ai-Ci) sod2aaii d3,d2,d6 ; d2=(Br+Dr),(Bi+Di) sod2ssii d3,d2,d7 ; d3=(Br-Dr),(Bi-Di) MOVER2.2F d7,(r11)+ ; out:[Cout_r,Cout_i], out:[Dout_r,Dout_i] MOVE2.2F (r0),d0 ; d0=Ar:Ai ] loopend0 [ sod2aaii d6,d4,d4 ; d4=(Ar':Ai')=((Ar+Br)+(Cr+Dr),(Ai+Bi)+(Ci+Di)) sod2saxx d7,d5,d5 ; d5=(Br':Bi')=((Ar-Cr)+(Bi-Di),(Ai-Ci)-(Br-Dr)) sod2ssii d6,d4,d6 ; d6=(Cr':Ci')=((Ar+Cr)-(Br+Dr),(Ai+Ci)-(Bi+Di)) sod2asxx d7,d5,d7 ; d7=(Dr':Di')=((Ar-Cr)-(Bi-Di),(Ai-Ci)+(Br-Dr)) move.l (sp-16),r2 ; num_radix move.l (sp-40),r3 ; Load the stage counter ] [ MOVER2.2F d4,(r8)+ ; out:[Aout_r,Aout_i] MOVER2.2F d5,(r9)+ ; out:[Bout_r,Bout_i] ] [ MOVER2.2F d6,(r10)+ ; out:[Cout_r,Cout_i] move.l r4,d3 ] [ MOVER2.2F d7,(r11)+ ; out:[Dout_r,Dout_i] adda #2,r3,r3 ; +2 ] [ move.l r3,(sp-40) ; Update the stage counter adda r3,r2 ; num_butterfly[s] ] [ move.w (r2),r0 ; num_radix[s] move.l d3,SR ] [ cmpeqa.w #0,r0 ; num_radix[s]=0 or not ] [ jt radix3_dit_last_stage ; Go to radix-3 Last stage processing if num_radix[s]=0 cmpeqa.w #3,r0 ; num_radix[s]=3 or not ] [ jt radix3_dit_m_stage ; Go to radix-3 Middle stage processing if num_radix[s]=3 cmpeqa.w #5,r0 ; num_radix[s]=5 or not ] [ jt radix5_dit_m_stage ; Go to radix-5 Middle stage processing if num_radix[s]=5 cmpeqa.w #1,r0 ; num_radix[s]=1 or not ] jt radix5_dit_last_stage ; Go to radix-5 Last stage processing if num_radix[s]=1 ;--------------------------------------------------------; Radix-4 (Middle stage) ;--------------------------------------------------------radix4_dit_m_stage: Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 62 Freescale Semiconductor DFT Reference Code IF @DEF('FIX_SCALE') [ clr d15 move.l (sp-40),r3 ; Stage counter dosetup0 radix4_dit_m_stage_loop1 ; radix4 loop1 (outer loop) ] [ add #SCALE_RADIX_4,d15 move.l (sp-20),r2 ; num_butterfly move.l (sp-24),r4 ; num_subgroup ] [ move.l (sp-44),r5 ; r5 -> Wb move.l (sp-48),r6 ; r6 -> Wc ] [ move.l (sp-28),r12 ; num_radix_offset move.l (sp-52),r7 ; r7 -> Wd ] [ dosetup1 radix4_dit_m_stage_loop2_1 ; radix4 loop2_1 (inner loop) move.l (sp-8),r0 ; A input pointer ] [ adda r3,r2 ; num_butterfly[s] adda r3,r4 ; num_subgroup[s] ] [ move.w (r2),r13 ; num_butterfly[s] move.l (sp-68),r2 ; psiScale ] [ move.w (r4),r14 ; num_subgroup[s] move.l (sp-12),r9 ; A output pointer ] [ dosetup2 radix4_dit_m_stage_loop2_2 ; radix4 loop2_2 (inner loop) adda r3,r12 ; num_radix_offset[s] ] [ move.w #3,n3 ; output offset for radix-4 ] ELSE ;//IF @DEF('FIX_SCALE') [ move.l (sp-40),r3 ; Stage counter ] ;-- Find max value (real & imag) & stage initialization [ move.l (sp-8),r0 ; input base (A) move.l (sp-60),r2 ; DFT point ] [ move.w (r2),d15 ; DFT point ] [ sub #8,d15 ; d15-8 ] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 63 DFT Reference Code [ asrr #2,d15 ; (d15-8)/4 ] [ adda #8,r0,r1 ; Input address move.w #2,n3 ] [ move.2l (r0)+n3,d0:d1 move.2l (r1)+n3,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 move.2l (r0)+n3,d4:d5 move.2l (r1)+n3,d6:d7 ] [ abs2 d4,d4 abs2 d5,d5 abs2 d6,d6 abs2 d7,d7 doensh0 d15 ; loop count for max search ] ; Loop for max (real,imag) search loopstart0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.2l (r0)+n3,d0:d1 move.2l (r1)+n3,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 ] loopend0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.l (sp-20),r2 ; num_butterfly move.l (sp-24),r4 ; num_subgroup ] [ tfr d6,d0 tfr d7,d1 move.l (sp-44),r5 ; r5 -> Wb move.l (sp-48),r6 ; r6 -> Wc ] [ max2 d0,d4 max2 d1,d5 move.l (sp-28),r12 ; num_radix_offset move.l (sp-52),r7 ; r7 -> Wd ] [ tfr d5,d0 dosetup0 radix4_dit_m_stage_loop1 ; radix4 loop1 (outer loop) ] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 64 Freescale Semiconductor DFT Reference Code [ max2 d0,d4 dosetup1 radix4_dit_m_stage_loop2_1 ; radix4 loop2_1 (inner loop) ] [ move.l (sp-8),r0 ; A input pointer ] [ sxt.w d4,d0 dosetup2 radix4_dit_m_stage_loop2_2 ; radix4 loop2_2 (inner loop) ] [ extract #16,#16,d4,d4 move.l (sp-12),r9 ; A output pointer ] [ max d0,d4 ; Max real/imag adda r3,r2 ; num_butterfly[s] adda r3,r4 ; num_subgroup[s] ] [ clb d4,d0 ; count reading bit move.w (r2),r13 ; num_butterfly[s] move.w (r4),r14 ; num_subgroup[s] ] [ move.l (sp-68),r2 ; psiScale ] [ neg d0 ; negates adda r3,r12 ; num_radix_offset[s] ] [ clr d14 sub #16,d0 ; limit 16-bit move.w #S42,d2 move.w #3,n3 ; output offset for radix-4 ] [ add #2,d14 cmpgt.w #(S41+1),d0 ; Check if siNorm > S41+1 move.w #2,d15 ; scale down = 2 ] [ ift clr d15 ; scale down = 0 if siNorm > S41+1 ifa cmpeq.w #(S41+1),d0 ; Check if siNorm = S41+1 ] [ tfrt d14,d15 ; scale down = 2 if siNorm = S41+1 move.w #S41,d1 ] ENDIF ;// IF @DEF('FIX_SCALE') [ move.l d15,(sp-64) ; Set norm to stack move.w d15,(r2)+ ; psiScale <- norm ] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 65 DFT Reference Code [ move.l r2,(sp-68) ; psiScale move.w (r12),r15 ; num_radix_offset[s] ] [ tfra r13,n0 ; The stage twiddle factor's offset tfra r14,n1 ] ; For next stage input/output [ move.l r9,(sp-8) ; base_address move.l r0,(sp-12) ; out_address ] [ tfra r9,r8 ; A output adda r15,r9 ; B output pointer (long) ] [ move.l d15,r1 ; psiScale <- norm tfra r9,r10 ; B output ] ;--- Start Radix-4 (Middle stage) [ adda r15,r10 move.l SR,d3 ] [ move.l d3,r4 bmset #$1000,SR.L ; ] [ tfra r10,r11 ; bmclr #$0080,SR.L ; ] [ adda r15,r11 ; bmclr #$0030,SR.L ; ] [ move.l (sp-44),r5 cmpeqa.w #1,r1 ] [ ift bmset #$0010,SR.L ifa move.l (sp-48),r6 ] [ move.l (sp-52),r7 cmpeqa.w #2,r1 ] [ ift bmset #$0030,SR.L ifa move.l (r6)+n0,d2 ; C output pointer (long) W20=1 C output SM2=0 D output pointer (long) SCM=00 ; r5 -> Wb ; SCM=01 ; r6 -> Wc ; r7 -> Wd ; SCM=11 ; d14=Wcr:Wci Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 66 Freescale Semiconductor DFT Reference Code ] [ tfra r13,n0 bmtsts #$0001,r13.l ] [ asra r13 asra r14 ] [ adda #-1,r13,r12 adda #8,r0,r1 ] [ MOVE2.2F (r0)+,d0 MOVE2.2F (r1)+,d1 ] [ ;01 mac2assar d1,d2,d0.H,d12 mac2aassi d1,d2,d0.L,d13 MOVE2.2F (r0)+n3,d8 move.l (r5)+n0,d10 ] [ ;02 mac2assar d8,d10,d12.H,d12 mac2aassi d8,d10,d12.L,d4 mac2aassi d8,d10,d13.H,d13 mac2saasr d8,d10,d13.L,d5 move.l (r7)+n0,d11 doen0 r14 ] FALIGN radix4_dit_m_stage_loop1: loopstart0 [ doen1 r13 MOVE2.2F (r1)+n3,d9 ] ;/// Inter loop FALIGN radix4_dit_m_stage_loop2_1 loopstart1 [ ;03 mac2assar d9,d11,d12 mac2ssaai d9,d11,d4 mac2aassi d9,d11,d13 mac2assar d9,d11,d5 ; Input address ; d0=Ar:Ai ; d1=Cr:Ci ; ; ; ; ; ; M1_1.H = IA[re] M1_1.L = IA[re] M1_2.H = IA[im] M1_2.L = IA[im] d8=Br:Bi d10=Wbr:Wbi + + - IC[re] IC[re] IC[re] IC[re] * * * * WC[re] WC[re] WC[im] WC[im] + + - IC[im] IC[im] IC[im] IC[im] * * * * WC[im] WC[im] WC[re] WC[re] ; ; ; ; ; ; ; ; ; ; M2_1.H = M1_1.H + IB[re] M2_1.L = M1_1.H - IB[re] M2_2.H = M1_1.L + IB[re] M2_2.L = M1_1.L - IB[re] M2_3.H = M1_2.H + IB[re] M2_3.L = M1_2.H - IB[re] M2_4.H = M1_2.L - IB[re] M2_4.L = M1_2.L + IB[re] d11=Wdr:Wdi loop0 = num_subgroup[s] * * * * * * * * WB[re] WB[re] WB[im] WB[im] WB[im] WB[im] WB[re] WB[re] + + + + - IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] * * * * * * * * WB[im] WB[im] WB[re] WB[re] WB[re] WB[re] WB[im] WB[im] ; --> Butterflies/Subgroup ; loop1 = num_butterfly[s] ; d9=Dr:Di ; --> Subgroup ; OA[re] = M2_1.H + ID[re] ; OC[re] = M2_1.L - ID[re] ; OB[re] = M2_2.H - ID[re] ; OD[re] = M2_2.L + ID[re] ; OA[im] = M2_3.H + ID[re] ; OC[im] = M2_3.L - ID[re] ; OB[im] = M2_4.H + ID[re] ; OD[im] = M2_4.L - ID[re] * * * * * * * * WD[re] WD[re] WD[im] WD[im] WD[im] WD[im] WD[re] WD[re] + + + + ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] * * * * * * * * WD[im] WD[im] WD[re] WD[re] WD[re] WD[re] WD[im] WD[im] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 67 DFT Reference Code MOVE2.2F (r0)+,d0 MOVE2.2F (r1)+,d8 ] [ ;01 mac2assar d8,d2,d0.H,d6 ; d0=Ar:Ai ; d8=Cr:Ci + + - IC[im] IC[im] IC[im] IC[im] * * * * WC[im] WC[im] WC[re] WC[re] ; M2_1.H = M1_1.H + IB[re] * WB[re] ; M2_1.L = M1_1.H - IB[re] * WB[re] + mac2aassi d8,d10,d6.L,d14 ; M2_2.H = M1_1.L + IB[re] * WB[im] + ; M2_2.L = M1_1.L - IB[re] * WB[im] mac2aassi d8,d10,d7.H,d7 ; M2_3.H = M1_2.H + IB[re] * WB[im] + ; M2_3.L = M1_2.H - IB[re] * WB[im] mac2saasr d8,d10,d7.L,d15 ; M2_4.H = M1_2.L - IB[re] * WB[re] + ; M2_4.L = M1_2.L + IB[re] * WB[re] MOVE2.2F (r0)+,d0 ; d0=Ar:Ai, MOVE2.2F (r1)+,d1 ; d1=Cr:Ci ] [ ;03 mac2assar d9,d11,d6 ; OA[re] = M2_1.H + ID[re] * WD[re] ; OC[re] = M2_1.L - ID[re] * WD[re] + mac2ssaai d9,d11,d14 ; OB[re] = M2_2.H - ID[re] * WD[im] ; OD[re] = M2_2.L + ID[re] * WD[im] + mac2aassi d9,d11,d7 ; OA[im] = M2_3.H + ID[re] * WD[im] + ; OC[im] = M2_3.L - ID[re] * WD[im] mac2assar d9,d11,d15 ; OB[im] = M2_4.H + ID[re] * WD[re] ; OD[im] = M2_4.L - ID[re] * WD[re] + MOVE2.2F (r0)+n3,d8 ; d8=Br:Bi, MOVE2.2F (r1)+n3,d9 ; d9=Dr:Di ] [ ;01 mac2assar d1,d2,d0.H,d12 ; M1_1.H = IA[re] + IC[re] * WC[re] ; M1_1.L = IA[re] - IC[re] * WC[re] + mac2aassi d1,d2,d0.L,d13 ; M1_2.H = IA[im] + IC[re] * WC[im] + ; M1_2.L = IA[im] - IC[re] * WC[im] MOVERH.4F d12:d13:d6:d7,(r8)+ ; save OA1:OA2 MOVERL.4F d12:d13:d6:d7,(r10)+ ; save OC1:OC2 ] [ ;02 mac2assar d8,d10,d12.H,d12 ; M2_1.H = M1_1.H + IB[re] * WB[re] ; M2_1.L = M1_1.H - IB[re] * WB[re] + mac2aassi d8,d10,d12.L,d4 ; M2_2.H = M1_1.L + IB[re] * WB[im] + ; M2_2.L = M1_1.L - IB[re] * WB[im] mac2aassi d8,d10,d13.H,d13 ; M2_3.H = M1_2.H + IB[re] * WB[im] + ; M2_3.L = M1_2.H - IB[re] * WB[im] mac2saasr d8,d10,d13.L,d5 ; M2_4.H = M1_2.L - IB[re] * WB[re] + ; M2_4.L = M1_2.L + IB[re] * WB[re] MOVERH.4F d4:d5:d14:d15,(r9)+ ; save OB1:OB2 MOVERL.4F d4:d5:d14:d15,(r11)+ ; save OD1:OD2 ] loopend1 [ IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] * * * * * * * * WB[im] WB[im] WB[re] WB[re] WB[re] WB[re] WB[im] WB[im] ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] * * * * * * * * WD[im] WD[im] WD[re] WD[re] WD[re] WD[re] WD[im] WD[im] IC[im] IC[im] IC[im] IC[im] * * * * WC[im] WC[im] WC[re] WC[re] IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] * * * * * * * * WB[im] WB[im] WB[re] WB[re] WB[re] WB[re] WB[im] WB[im] mac2aassi d8,d2,d0.L,d7 MOVE2.2F (r0)+n3,d8 MOVE2.2F (r1)+n3,d9 ] [ ;02 mac2assar d8,d10,d6.H,d6 ; M1_1.H = IA[re] + IC[re] * WC[re] ; M1_1.L = IA[re] - IC[re] * WC[re] ; M1_2.H = IA[im] + IC[re] * WC[im] ; M1_2.L = IA[im] - IC[re] * WC[im] ; d8=Br:Bi ; d9=Dr:Di Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 68 Freescale Semiconductor DFT Reference Code iff doen2 r12 iff move.l (r6)+n0,d2 ifa tfr d9,d3 ] [ ;01 iff move.l (r5)+n0,d10 iff move.l (r7)+n0,d11 ifa mac2assar d1,d2,d0.H,d12 IC[re] IC[re] IC[re] IC[re] * * * * WC[re] WC[re] WC[im] WC[im] + + - IC[im] IC[im] IC[im] IC[im] * * * * WC[im] WC[im] WC[re] WC[re] ; M2_1.H = M1_1.H + ; M2_1.L = M1_1.H d8,d10,d12.L,d4 ; M2_2.H = M1_1.L + ; M2_2.L = M1_1.L d8,d10,d13.H,d13 ; M2_3.H = M1_2.H + ; M2_3.L = M1_2.H d8,d10,d13.L,d5 ; M2_4.H = M1_2.L ; M2_4.L = M1_2.L + (r0)+,d0 ; d0=Ar:Ai, (r1)+,d1 ; d1=Cr:Ci IB[re] IB[re] IB[re] IB[re] IB[re] IB[re] IB[re] IB[re] * * * * * * * * WB[re] WB[re] WB[im] WB[im] WB[im] WB[im] WB[re] WB[re] + + + + - IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] IB[im] * * * * * * * * WB[im] WB[im] WB[re] WB[re] WB[re] WB[re] WB[im] WB[im] ID[re] ID[re] ID[re] ID[re] ID[re] ID[re] ID[re] ID[re] * * * * * * * * WD[re] WD[re] WD[im] WD[im] WD[im] WD[im] WD[re] WD[re] + + + + ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] * * * * * * * * WD[im] WD[im] WD[re] WD[re] WD[re] WD[re] WD[im] WD[im] d1,d2,d0.L,d13 ] [ ;02 mac2assar d8,d10,d12.H,d12 mac2aassi mac2saasr MOVE2.2F MOVE2.2F ] [ ;03 mac2assar ; d10=Wbr:Wbi ; d11=Wdr:Wdi + + - mac2aassi mac2aassi ; loop1 = num_butterfly[s] ; d2=Wcr:Wci d9,d11,d12 mac2ssaai d9,d11,d4 mac2aassi d9,d11,d13 mac2assar d9,d11,d5 MOVE2.2F (r0)+n3,d8 MOVE2.2F (r1)+n3,d9 ] [ ift doen2 r13 ift move.l (r6)+n0,d2 ] [ ;01 ift move.l (r5)+n0,d10 ift move.l (r7)+n0,d11 ifa mac2assar d1,d2,d0.H,d6 mac2aassi d1,d2,d0.L,d7 ] [ ;02 ifa mac2assar d8,d10,d6.H,d6 mac2aassi d8,d10,d6.L,d14 ; M1_1.H = ; M1_1.L = ; M1_2.H = ; M1_2.L = ; ; ; ; ; ; ; ; IA[re] IA[re] IA[im] IA[im] OA[re] = M2_1.H OC[re] = M2_1.L OB[re] = M2_2.H OD[re] = M2_2.L OA[im] = M2_3.H OC[im] = M2_3.L OB[im] = M2_4.H OD[im] = M2_4.L ; d8=Br:Bi, ; d9=Dr:Di + + + + - ; loop1 = num_butterfly[s] ; d14=Wcr:Wci ; d10=Wbr:Wbi ; d11=Wdr:Wdi ; M1_1.H = ; M1_1.L = ; M1_2.H = ; M1_2.L = IA[re] IA[re] IA[im] IA[im] + + - IC[re] IC[re] IC[re] IC[re] * * * * WC[re] WC[re] WC[im] WC[im] + + - IC[im] IC[im] IC[im] IC[im] * * * * WC[im] WC[im] WC[re] WC[re] ; M2_1.H = M1_1.H ; M2_1.L = M1_1.H ; M2_2.H = M1_1.L ; M2_2.L = M1_1.L + + - IB[re] IB[re] IB[re] IB[re] * * * * WB[re] WB[re] WB[im] WB[im] + + - IB[im] IB[im] IB[im] IB[im] * * * * WB[im] WB[im] WB[re] WB[re] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 69 DFT Reference Code mac2aassi d8,d10,d7.H,d7 mac2saasr d8,d10,d7.L,d15 * * * * WB[im] WB[im] WB[re] WB[re] + + - IB[im] IB[im] IB[im] IB[im] * * * * WB[re] WB[re] WB[im] WB[im] * * * * * * * * WD[re] WD[re] WD[im] WD[im] WD[im] WD[im] WD[re] WD[re] + + + + ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] * * * * * * * * WD[im] WD[im] WD[re] WD[re] WD[re] WD[re] WD[im] WD[im] ; M1_1.H = IA[re] + IC[re] * WC[re] ; M1_1.L = IA[re] - IC[re] * WC[re] mac2aassi d1,d2,d0.L,d13 ; M1_2.H = IA[im] + IC[re] * WC[im] ; M1_2.L = IA[im] - IC[re] * WC[im] MOVERH.4F d12:d13:d6:d7,(r8)+ ; save OA1:OA2 MOVERL.4F d12:d13:d6:d7,(r10)+ ; save OC1:OC2 ] [ ;02 mac2assar d8,d10,d12.H,d12 ; M2_1.H = M1_1.H + IB[re] * ; IB[im] * WB[im] ; M2_1.L = M1_1.H - IB[re] * ; IB[im] * WB[im] mac2aassi d8,d10,d12.L,d4 ; M2_2.H = M1_1.L + IB[re] * ; IB[im] * WB[re] ; M2_2.L = M1_1.L - IB[re] * ; IB[im] * WB[re] mac2aassi d8,d10,d13.H,d13 ; M2_3.H = M1_2.H + IB[re] * ; IB[im] * WB[re] ; M2_3.L = M1_2.H - IB[re] * ; IB[im] * WB[re] mac2saasr d8,d10,d13.L,d5 ; M2_4.H = M1_2.L - IB[re] * ; IB[im] * WB[im] ; M2_4.L = M1_2.L + IB[re] * ; IB[im] * WB[im] MOVERH.4F d4:d5:d14:d15,(r9)+ ; save OB1:OB2 MOVERL.4F d4:d5:d14:d15,(r11)+ ; save OD1:OD2 ] [ ;03 mac2assar d9,d11,d12 ; OA[re] = M2_1.H + ID[re] * WD[re] ; OC[re] = M2_1.L - ID[re] * WD[re] mac2ssaai d9,d11,d4 ; OB[re] = M2_2.H - ID[re] * WD[im] ; OD[re] = M2_2.L + ID[re] * WD[im] mac2aassi d9,d11,d13 ; OA[im] = M2_3.H + ID[re] * WD[im] ; OC[im] = M2_3.L - ID[re] * WD[im] + + - IC[im] IC[im] IC[im] IC[im] * * * * WC[im] WC[im] WC[re] WC[re] * * * * * * WD[im] WD[im] WD[re] WD[re] WD[re] WD[re] MOVE2.2F (r0)+,d0 MOVE2.2F (r1)+,d1 ] FALIGN radix4_dit_m_stage_loop2_2 loopstart2 [ ;03 mac2assar d9,d11,d6 mac2ssaai d9,d11,d14 mac2aassi d9,d11,d7 mac2assar d9,d11,d15 MOVE2.2F (r0)+n3,d8 MOVE2.2F (r1)+n3,d9 ] [ ;01 mac2assar d1,d2,d0.H,d12 ; M2_3.H = M1_2.H ; M2_3.L = M1_2.H ; M2_4.H = M1_2.L ; M2_4.L = M1_2.L ; d0=Ar:Ai, ; d1=Cr:Ci + + IB[re] IB[re] IB[re] IB[re] ; --> Subgroup ; OA[re] = M2_1.H + ID[re] ; OC[re] = M2_1.L - ID[re] ; OB[re] = M2_2.H - ID[re] ; OD[re] = M2_2.L + ID[re] ; OA[im] = M2_3.H + ID[re] ; OC[im] = M2_3.L - ID[re] ; OB[im] = M2_4.H + ID[re] ; OD[im] = M2_4.L - ID[re] ; d8=Br:Bi, ; d9=Dr:Di WB[re] WB[re] + WB[im] + WB[im] WB[im] + WB[im] WB[re] + WB[re] - + + + - ID[im] ID[im] ID[im] ID[im] ID[im] ID[im] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 70 Freescale Semiconductor DFT Reference Code mac2assar d9,d11,d5 MOVE2.2F (r0)+,d0 MOVE2.2F (r1)+,d1 ] [ ;01 mac2assar d1,d2,d0.H,d6 ; OB[im] = M2_4.H + ID[re] * WD[re] - ID[im] * WD[im] ; OD[im] = M2_4.L - ID[re] * WD[re] + ID[im] * WD[im] ; d0=Ar:Ai, ; d1=Cr:Ci IC[im] IC[im] IC[im] IC[im] * * * * WC[im] WC[im] WC[re] WC[re] ; M2_1.H = M1_1.H + IB[re] * WB[re] - IB[im] ; M2_1.L = M1_1.H - IB[re] * WB[re] + IB[im] mac2aassi d8,d10,d6.L,d14 ; M2_2.H = M1_1.L + IB[re] * WB[im] + IB[im] ; M2_2.L = M1_1.L - IB[re] * WB[im] - IB[im] mac2aassi d8,d10,d7.H,d7 ; M2_3.H = M1_2.H + IB[re] * WB[im] + IB[im] ; M2_3.L = M1_2.H - IB[re] * WB[im] - IB[im] mac2saasr d8,d10,d7.L,d15 ; M2_4.H = M1_2.L - IB[re] * WB[re] + IB[im] ; M2_4.L = M1_2.L + IB[re] * WB[re] - IB[im] MOVE2.2F (r0)+,d0 ; d0=Ar:Ai, MOVE2.2F (r1)+,d1 ; d1=Cr:Ci ] loopend2 [ ;03 mac2assar d9,d11,d6 ; OA[re] = M2_1.H + ID[re] * WD[re] - ID[im] ; OC[re] = M2_1.L - ID[re] * WD[re] + ID[im] mac2ssaai d9,d11,d14 ; OB[re] = M2_2.H - ID[re] * WD[im] - ID[im] ; OD[re] = M2_2.L + ID[re] * WD[im] + ID[im] mac2aassi d9,d11,d7 ; OA[im] = M2_3.H + ID[re] * WD[im] + ID[im] ; OC[im] = M2_3.L - ID[re] * WD[im] - ID[im] mac2assar d9,d11,d15 ; OB[im] = M2_4.H + ID[re] * WD[re] - ID[im] ; OD[im] = M2_4.L - ID[re] * WD[re] + ID[im] MOVE2.2F (r0)+n3,d8 ; d8=Br:Bi move.l (r5)+n0,d10 ; d10=Wbr:Wbi ] [ move.l (r6)+n0,d2 ; d2 =Wcr:Wci move.l (r7)+n0,d11 ; d11=Wdr:Wdi ] [ ;01 mac2assar d1,d2,d0.H,d12 ; M1_1.H = IA[re] + IC[re] * WC[re] - IC[im] ; M1_1.L = IA[re] - IC[re] * WC[re] + IC[im] mac2aassi d1,d2,d0.L,d13 ; M1_2.H = IA[im] + IC[re] * WC[im] + IC[im] ; M1_2.L = IA[im] - IC[re] * WC[im] - IC[im] MOVERH.4F d12:d13:d6:d7,(r8)+ ; save OA1:OA2 MOVERL.4F d12:d13:d6:d7,(r10)+ ; save OC1:OC2 ] [ ;02 mac2assar d8,d10,d12.H,d12 ; M2_1.H = M1_1.H + IB[re] * WB[re] - IB[im] ; M2_1.L = M1_1.H - IB[re] * WB[re] + IB[im] mac2aassi d8,d10,d12.L,d4 ; M2_2.H = M1_1.L + IB[re] * WB[im] + IB[im] ; M2_2.L = M1_1.L - IB[re] * WB[im] - IB[im] mac2aassi d8,d10,d13.H,d13 ; M2_3.H = M1_2.H + IB[re] * WB[im] + IB[im] ; M2_3.L = M1_2.H - IB[re] * WB[im] - IB[im] * * * * * * * * WB[im] WB[im] WB[re] WB[re] WB[re] WB[re] WB[im] WB[im] * * * * * * * * WD[im] WD[im] WD[re] WD[re] WD[re] WD[re] WD[im] WD[im] * * * * WC[im] WC[im] WC[re] WC[re] * * * * * * WB[im] WB[im] WB[re] WB[re] WB[re] WB[re] mac2aassi d1,d2,d0.L,d7 MOVE2.2F (r0)+n3,d8 MOVE2.2F (r1)+n3,d9 ] [ ;02 mac2assar d8,d10,d6.H,d6 ; M1_1.H = IA[re] + IC[re] * WC[re] ; M1_1.L = IA[re] - IC[re] * WC[re] ; M1_2.H = IA[im] + IC[re] * WC[im] ; M1_2.L = IA[im] - IC[re] * WC[im] ; d8=Br:Bi, ; d9=Dr:Di + + - Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 71 DFT Reference Code mac2saasr d8,d10,d13.L,d5 ; M2_4.H = M1_2.L - IB[re] * WB[re] + IB[im] * WB[im] ; M2_4.L = M1_2.L + IB[re] * WB[re] - IB[im] * WB[im] MOVERH.4F d4:d5:d14:d15,(r9)+ ; save OB1:OB2 MOVERL.4F d4:d5:d14:d15,(r11)+ ; save OD1:OD2 ] loopend0 [ move.l (sp-16),r2 ; num_radix move.l (sp-40),r3 ; Load the stage counter ] move.l r4,d3 [ move.l d3,SR adda #2,r3,r3 ; +2 ] [ move.l r3,(sp-40) ; Update the stage counter adda r3,r2 ; num_butterfly[s] ] move.w (r2),r0 ; num_radix[s] cmpeqa.w #0,r0 ; num_radix[s]=0 or not [ jt radix3_dit_last_stage ; Go to radix-3 Last stage processing if num_radix[s]=0 cmpeqa.w #3,r0 ; num_radix[s]=3 or not ] [ jt radix3_dit_m_stage ; Go to radix-3 Middle stage processing if num_radix[s]=3 cmpeqa.w #1,r0 ; num_radix[s]=1 or not ] jt radix5_dit_last_stage ; Go to radix-5 Last stage processing if num_radix[s]=1 cmpeqa.w #4,r0 ; num_radix[s]=4 or not ] jt radix4_dit_m_stage ; Go to radix-4 Middle stage processing if num_radix[s]=4 ;/////////////////////////////////////////////////////////////////////// ; Radix-5 Stages: ;/////////////////////////////////////////////////////////////////////// radix5_init: ;--------------------------------------------------------; Radix-5 (Middle stage) ;--------------------------------------------------------radix5_dit_m_stage: IF @DEF('FIX_SCALE') [ clr d0 move.l (sp-20),r2 move.l (sp-24),r4 ] [ add #SCALE_RADIX_5,d0 move.l (sp-40),r3 move.l (sp-44),r5 ] [ ; num_butterfly ; num_subgroup ; Stage counter ; r5 -> Wb Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 72 Freescale Semiconductor DFT Reference Code move.w #2,n0 dosetup0 radix5_dit_m_stage_loop1 ] [ move.l (sp-48),r6 ; r6 -> Wc move.l (sp-28),r1 ; num_radix_offset ] [ move.l (sp-52),r7 ; r7 -> Wd move.l (sp-56),r8 ; r8 -> We ] [ adda r3,r2 ; num_butterfly[s] adda r3,r4 ; num_subgroup[s] ] [ move.w (r2),r14 ; num_butterfly[s] ; butterflies/subgroup move.w (r4),r13 ; num_subgroup[s] ; subgroups/stage ] [ move.l (sp-12),r9 ; A output pointer move.l (sp-68),r4 ; psiScale ] [ move.l (sp-8),r0 ; A input pointer adda r3,r1 ; num_radix_offset[s] ] [ move.w (r1),r15 ; num_radix_offset[s] ] ; For next stage input/output [ move.l r9,(sp-8) ; base_address move.l r0,(sp-12) ; out_address ] [ tfra r14,r2 ; Middle stage twiddle factor's offset tfra r9,r8 ; A output ] ELSE ;//IF @DEF('FIX_SCALE') ;-- Find max value (real & imag) & stage initialization [ move.l (sp-8),r0 ; input base (A) move.l (sp-60),r2 ; DFT point ] ; 3 STALLS on r2 [ move.w (r2),d15 ; DFT point move.l (sp-40),r3 ; Stage counter ] [ sub #8,d15 ; d15-8 ] [ asrr #2,d15 ; (d15-8)/4 ] [ Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 73 DFT Reference Code adda #8,r0,r1 ; Input address move.w #2,n0 ] [ move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 move.2l (r0)+n0,d4:d5 move.2l (r1)+n0,d6:d7 ] [ abs2 d4,d4 abs2 d5,d5 abs2 d6,d6 abs2 d7,d7 doensh0 d15 ; loop count for max search ] ; Loop for max (real,imag) search loopstart0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 ] loopend0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.l (sp-20),r2 ; num_butterfly move.l (sp-24),r4 ; num_subgroup ] [ tfr d6,d0 tfr d7,d1 move.l (sp-44),r5 ; r5 -> Wb dosetup0 radix5_dit_m_stage_loop1 ] [ max2 d0,d4 max2 d1,d5 move.l (sp-48),r6 ; r6 -> Wc move.l (sp-28),r1 ; num_radix_offset ] [ tfr d5,d0 move.l (sp-52),r7 ; r7 -> Wd move.l (sp-56),r8 ; r8 -> We ] [ max2 d0,d4 adda r3,r2 ; num_butterfly[s] adda r3,r4 ; num_subgroup[s] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 74 Freescale Semiconductor DFT Reference Code ] [ sxt.w d4,d0 move.w (r2),r14 ; num_butterfly[s] ; butterflies/subgroup move.w (r4),r13 ; num_subgroup[s] ; subgroups/stage ] [ extract #16,#16,d4,d4 move.l (sp-8),r0 ; A input pointer move.l (sp-12),r9 ; A output pointer ] [ max d0,d4 ; Max real/imag move.l (sp-68),r4 ; psiScale adda r3,r1 ; num_radix_offset[s] ] ; For next stage input/output [ clb d4,d0 ; count reading bit move.w #S54,d2 move.w (r1),r15 ; num_radix_offset[s] ] [ neg d0 ; negates move.l r9,(sp-8) ; base_address move.l r0,(sp-12) ; out_address ] [ sub #16,d0 ; limit 16-bit tfra r14,r2 ; Middle stage twiddle factor's offset tfra r9,r8 ; A output ] [ cmpgt.w #S53,d0 ; Check if norm > S or norm <= S move.w #S53,d1 ] [ iff sub d0,d1,d0 ; scale down = S - d0 if norm <= S ift clr d0 ; scale down = 0 if norm > S ] ENDIF ;// IF @DEF('FIX_SCALE') [ move.l d0,(sp-64) ; Set norm to stack move.w d0,(r4)+ ; psiScale <- norm ] [ move.l r4,(sp-68) ; psiScale adda r15,r9 ; B output pointer (long) ] [ tfra r9,r10 ; B output move.l (sp-44),r4 ; r4 -> Wb ] [ adda r15,r10 ; C output pointer (long) move.l (sp-48),r5 ; r5 -> Wc ] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 75 DFT Reference Code [ tfra r10,r11 ; C output adda #4,r0,r1 ; A input ] [ adda r15,r11 ; D output pointer (long) dosetup1 radix5_dit_m_stage_loop2 ] [ tfra r11,r12 ; D output move.l #radix5parameter,r14 ; Point to the radix-5 parameter array ] [ adda r15,r12 ; E output pointer (long) move.l #radix5parameter+4,r15 ; Point to the radix-5 parameter array ] ;--- Start Radix-5 (Middle stage) [ ; move.2f (r0)+n0,d12:d13 ; d12.H = Ar, d13.H = Ai move.l (r1)+n0,d14 ; d14 = Br:Bi ] [ ; move.l (r0)+n0,d15 ; d15 = Cr:Ci move.l (sp-64),d8 ; Scaling factor ] [ ; asrr2 d8,d14 ; d14 = Br:Bi>>S asrr2 d8,d15 ; d15 = Cr:Ci>>S move.l (sp-52),r6 ; r6 -> Wd move.l (sp-56),r7 ; r7 -> We ] [ ; asrr d8,d12 ; d12 Ar>>S move.l (r0)+,d9 ; d9 = Er:Ei doen0 r13 ; num_subgroup ; subgroups/stage ] FALIGN radix5_dit_m_stage_loop1: ; --> Butterflies/Subgroup loopstart0 [ doen1 r2 ; num_butterfly[s] move.l (r4),d2 ; d2 = Wbr:Wbi ] FALIGN radix5_dit_m_stage_loop2: ; --> Subgroup loopstart1 [ ; 01 asrr2 d8,d9 ; d9 = Er:Ei>>S MPYRE d2,d14,d6 ; d6 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d2,d14,d7 ; d7 = Bi' = (Br*Wbi + Bi*Wbr) tfr d8,d1 ; Scaling factor move.l (r1)+n0,d8 ; d8 = Dr:Di move.l (r5),d10 ; d10 = Wcr:Wci ] [ ; 02 asrr2 d1,d8 ; d8 = Dr:Di>>S pack.2f d7,d6,d0 ; d0 = Br':Bi' Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 76 Freescale Semiconductor DFT Reference Code MPYRE d10,d15,d14 MPYIM d10,d15,d15 move.l (r7),d3 ] [ ; 03 asrr d1,d13 mac2assar d9,d3,d0.h,d6 mac2aassi d9,d3,d0.l,d7 pack.2f d15,d14,d0 move.l (r6),d11 ] [ ; 04 sod2aaii d12,d6,d4 sod2aaii d13,d7,d5 mac2assar d8,d11,d0.h,d14 mac2aassi d8,d11,d0.l,d15 move.2f (r14),d0:d1 move.2f (r15),d8:d9 ] [ ; 05 move.l d6,d2 move.l d7,d3 sod2aaii d4,d14,d4 sod2aaii d5,d15,d5 ] [ ; 06 mac +d0,d2,d12 mac +d0,d3,d13 tfr d12,d14 tfr d13,d15 move.l d14,d10 move.l d15,d11 ] [ ; 07 mac -d8,d10,d12 mac -d8,d11,d13 mac -d8,d2,d14 mac -d8,d3,d15 adda #4,r0,r1 moves.2f d4:d5,(r8)+ ] [ ; 08 pack.2f d13,d12,d0 mac +d0,d10,d14 mac +d0,d11,d15 move.l (sp-64),d8 move.2f (r0)+n0,d12:d13 ] [ ; 09 mac2aassi d1,d3,d0.h,d4 ; mac2ssaai d1,d2,d0.l,d5 ; d14 = Cr' = (Cr*Wcr - Ci*Wci) ; d15 = Ci' = (Cr*Wci + Ci*Wcr) ; d3 = Wer:Wei ; ; ; ; ; d13 = Ai>>S M2_1 = (Br'+Er'):(Br'-Er') M2_2 = (Bi'+Ei'):(Bi'-Ei') d0 = Cr':Ci' d11 = Wdr:Wdi ; ; ; ; ; ; Ar'+Br'+Er' Ai'+Bi'+Ei' M2_3 = (Cr'+Dr'):(Cr'-Dr') M2_4 = (Ci'+Di'):(Ci'-Di') d0:d1 = [a,b], Radix-5 parameters d8:d9 = [c,d], Radix-5 parameters ; Aout_r = Ar'+Br'+Cr'+Dr'+Er' ; Aout_i = Ai'+Bi'+Ci'+Di'+Ei' ; Ar' + ; Ai' + ; ; ; ; ; ; a*(Br'+Er') a*(Bi'+Ei') Ar' + (a*(Br'+Er')-c*(Cr'+Dr')) Ai' + (a*(Bi'+Ei')-c*(Ci'+Di')) Ar' - c*(Br'+Er') Ai' - c*(Bi'+Ei') A input out:[Aout_r,Aout_i] ; ; ; ; Ar' - (c*(Br'+Er')-a*(Cr'+Dr')) Ai' - (c*(Bi'+Ei')-a*(Ci'+Di')) Scaling factor d12.H = Ar, d13.H = Ai M5_1.H = Ar'+(a(Br'+Er')-c(Cr'+Dr'))+(b(Bi'-Ei')), ; M5_1.L = Ar'+(a(Br'+Er'); c(Cr'+Dr'))-(b(Bi'-Ei')) ; M5_2.H = Ai'+(a(Bi'+Ei'); c(Ci'+Di'))-(b(Br'-Er')), ; M5_2.L = Ai'+(a(Bi'+Ei'); c(Ci'+Di'))+(b(Br'-Er')) pack.2f d15,d14,d0 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 77 DFT Reference Code move.l (r1)+n0,d14 move.l (r0)+n0,d15 ] [ ; 10 mac2aassi d9,d11,d4 mac2ssaai d9,d10,d5 mac2aassi d9,d3,d0.h,d6 mac2ssaai d9,d2,d0.l,d7 move.l (r0)+,d9 move.l (r4),d2 ] [ ; 11 asrr2 d8,d14 asrr2 d8,d15 mac2saasi d1,d11,d6 mac2assai d1,d10,d7 ; d14 = Br:Bi ; d15 = Cr:Ci ; Bout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))+ ; (b(Bi'-Ei')+d(Ci'-Di')), ; Eout_r = Ar'+(a(Br'+Er'); c(Cr'+Dr'))-(b(Bi'-Ei')+d(Ci'-Di')) ; Bout_i = Ai'+(a(Bi'+Ei')-c(Ci'+Di')); (b(Br'-Er')+d(Cr'-Dr')), Eout_i = ; Ai'+(a(Bi'+Ei')-c(Ci'+Di'))+ ; (b(Br'-Er')+d(Cr'-Dr')) ; M5_3.H = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+(d(Bi'-Ei')), ; M5_3.L = Ar'-(c(Br'+Er')-a(Cr'+Dr'));(d(Bi'-Ei')) ; M5_4.H = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))-(d(Br'-Er')), ; M5_4.L = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))+ ; (d(Br'-Er')) ; d9 = Er:Ei ; d2 = Wbr:Wbi ; d14 = Br:Bi>>S ; d15 = Cr:Ci>>S ; Cout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+ ;(d(Bi'-Ei')-b(Ci'-Di')), Dout_r = ; Ar'-(c(Br'+Er')-a(Cr'+Dr')); (d(Bi'-Ei')-b(Ci'-Di')) ; Cout_i = Ai'-(c(Bi'+Ei')-a(Ci'+Di')); (d(Br'-Er')-b(Cr'-Dr')), Dout_i = ; Ai'-(c(Bi'+Ei')-a(Ci'+Di'))+ ; (d(Br'-Er')-b(Cr'-Dr')) ; out:[Bout_r,Bout_i] ; out:[Eout_r,Eout_i] moves.2f d4:d5,(r9)+ move.2w d4:d5,(r12)+ ] [ ; 12 asrr d8,d12 ; d12 = Ar>>S moves.2f d6:d7,(r10)+ ; out:[Cout_r,Cout_i] move.2w d6:d7,(r11)+ ; out:[Dout_r,Dout_i] ] loopend1 [ addl2a r2,r4 ; updata twiddle factor pointers addl2a r2,r5 ] [ addl2a r2,r6 addl2a r2,r7 ] loopend0 [ move.l (sp-16),r2 ; num_radix move.l (sp-40),r3 ; Load the stage counter ] ; 3 STALLS on r3 adda #2,r3,r3 ; +2 [ Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 78 Freescale Semiconductor DFT Reference Code move.l r3,(sp-40) ; Update the stage counter adda r3,r2 ; num_butterfly[s] ] move.w (r2),r0 ; num_radix[s] cmpeqa.w #5,r0 ; num_radix[s]=5 or not [ jt radix5_dit_m_stage ; Go to radix-5 Middle stage processing if num_radix[s]=5 cmpeqa.w #1,r0 ; num_radix[s]=1 or not ] jt radix5_dit_last_stage ; Go to radix-5 Last stage processing if num_radix[s]=1 ;--------------------------------------------------------; Radix-5 (Last stage) ;--------------------------------------------------------radix5_dit_last_stage: IF @DEF('FIX_SCALE') [ clr d12 move.l (sp-40),r3 ; Stage counter move.l (sp-28),r15 ; num_radix_offset[0] ] [ add #SCALE_RADIX_5,d12 move.l (sp-24),r14 ; num_subgroup move.l (sp-12),r9 ; A output pointer ] [ dosetup0 radix5_dit_last_stage_loop1 move.l (sp-8),r0 ; A input pointer ] [ move.w #2,n0 move.l (sp-68),r2 ; psiScale ] [ adda r3,r15 ; num_radix_offset[s] move.l (sp-44),r4 ; r4 -> Wb ] [ adda r3,r14 ; siNumSubgroup[s] move.w (r15),r15 ; num_radix_offset[s] ] [ adda #4,r0,r1 ; A input move.l (sp-48),r5 ; r5 -> Wc ] [ move.l (sp-52),r6 ; r6 -> Wd move.l (sp-56),r7 ; r7 -> We ] [ tfra r9,r8 ; A output move.w (r14),d5 ; siNumSubgroup[s] ] [ adda r15,r9 ; B output pointer (long) Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 79 DFT Reference Code doen0 d5 ; Loop for num_butterfly ] [ tfra r9,r10 ; B output move.l #radix5parameter,r14 ; Point to the radix-5 parameter array ] [ adda r15,r10 ; C output pointer (long) move.l d12,(sp-64) ; Set norm to stack ] [ tfra r10,r11 ; C output move.w d12,(r2)+ ; psiScale <- norm ] [ adda r15,r11 ; D output pointer (long) move.l r2,(sp-68) ; psiScale ] [ tfra r11,r12 ; D output ] [ adda r15,r12 ; E output pointer (long) move.l #radix5parameter+4,r15 ; Point to the radix-5 parameter array ] ELSE ;//IF @DEF('FIX_SCALE') ;-- Find max value (real & imag) & stage initialization [ move.l (sp-8),r0 ; input base (A) move.l (sp-60),r2 ; DFT point ] ; 3 STALLS on r2 [ move.w (r2),d15 ; DFT point move.l (sp-40),r3 ; Stage counter ] [ sub #8,d15 ; d15-8 move.l (sp-28),r2 ; num_radix_offset[0] move.l (sp-24),r14 ; num_subgroup ] [ asrr #2,d15 ; (d15-8)/4 dosetup0 radix5_dit_last_stage_loop1 ] [ adda #8,r0,r1 ; Input address move.w #2,n0 ] [ move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 move.2l (r0)+n0,d4:d5 Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 80 Freescale Semiconductor DFT Reference Code move.2l (r1)+n0,d6:d7 ] [ abs2 d4,d4 abs2 d5,d5 abs2 d6,d6 abs2 d7,d7 doensh0 d15 ; loop count for max search ] ; Loop for max (real,imag) search loopstart0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 move.2l (r0)+n0,d0:d1 move.2l (r1)+n0,d2:d3 ] [ abs2 d0,d0 abs2 d1,d1 abs2 d2,d2 abs2 d3,d3 ] loopend0 [ max2 d0,d4 max2 d1,d5 max2 d2,d6 max2 d3,d7 adda r3,r2 ; num_radix_offset[s] adda r3,r14 ; siNumSubgroup[s] ] [ tfr d6,d0 tfr d7,d1 move.w (r2),r15 ; num_radix_offset[s] move.l (sp-12),r9 ; A output pointer ] [ max2 d0,d4 max2 d1,d5 move.l (sp-8),r0 ; A input pointer move.l (sp-68),r2 ; psiScale ] [ tfr d5,d0 move.l (sp-44),r4 ; r4 -> Wb move.l (sp-48),r5 ; r5 -> Wc ] [ max2 d0,d4 move.l (sp-52),r6 ; r6 -> Wd move.l (sp-56),r7 ; r7 -> We ] [ sxt.w d4,d0 tfra r9,r8 ; A output adda r15,r9 ; B output pointer (long) ] [ extract #16,#16,d4,d4 adda #4,r0,r1 ; B input Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 81 DFT Reference Code tfra r9,r10 ; B output ] [ max d0,d4 ; Max real/imag move.w (r14),d5 ; siNumSubgroup[s] adda r15,r10 ; C output pointer (long) ] [ clb d4,d0 ; count reading bit tfra r10,r11 ; C output doen0 d5 ; Loop for num_butterfly ] [ neg d0 ; negates adda r15,r11 ; D output pointer (long) move.l #radix5parameter,r14 ; Point to the radix-5 parameter array ] [ sub #16,d0 ; limit 16-bit move.w #S52,d2 tfra r11,r12 ; D output ] [ cmpgt.w #S51,d0 ; Check if norm > S or norm <= S move.w #S51,d1 adda r15,r12 ; E output pointer (long) ] [ iff sub d0,d1,d12 ; scale down = S - d0 if norm <= S ift sub d0,d2,d12 ; scale up = S - d0 if norm > S ] [ move.l d12,(sp-64) ; Set norm to stack move.w d12,(r2)+ ; psiScale <- norm ] [ move.l r2,(sp-68) ; psiScale move.l #radix5parameter+4,r15 ; Point to the radix-5 parameter array ] ENDIF ;// IF @DEF('FIX_SCALE') ;--- Start Radix-5 (Last stage) [ ; move.2f (r0)+n0,d12:d13 ; d12.H = Ar, d13.H = Ai move.l (r1)+n0,d14 ; d14 = Br:Bi ] [ ; move.l (r0)+n0,d15 ; d15 = Cr:Ci move.l (sp-64),d8 ; Scaling factor ] [ ; asrr2 d8,d14 ; d14 = Br:Bi>>S asrr2 d8,d15 ; d15 = Cr:Ci>>S ] [ ; asrr d8,d12 ; d12 = Ar>>S Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 82 Freescale Semiconductor DFT Reference Code move.l (r0)+,d9 ; d9 = Er:Ei move.l (r4)+,d2 ; d2 = Wbr:Wbi ] FALIGN radix5_dit_last_stage_loop1: ; --> Butterflies/Subgroup loopstart0 [ ; 01 asrr2 d8,d9 ; d9 = Er:Ei>>S MPYRE d2,d14,d6 ; d6 = Br' = (Br*Wbr - Bi*Wbi) MPYIM d2,d14,d7 ; d7 = Bi' = (Br*Wbi + Bi*Wbr) tfr d8,d1 ; Scaling factor move.l (r1)+n0,d8 ; d8 = Dr:Di move.l (r5)+,d10 ; d10 = Wcr:Wci ] [ ; 02 asrr2 d1,d8 ; d8 = Dr:Di>>S pack.2f d7,d6,d0 MPYRE d10,d15,d14 ; d14 = Cr' = (Cr*Wcr - Ci*Wci) MPYIM d10,d15,d15 ; d15 = Ci' = (Cr*Wci + Ci*Wcr) move.l (r7)+,d3 ; d3 = Wer:Wei ] [ ; 03 asrr d1,d13 ; d13 = Ai>>S mac2assar d9,d3,d0.h,d6 ; M2_1 = (Br'+Er'):(Br'-Er') mac2aassi d9,d3,d0.l,d7 ; M2_2 = (Bi'+Ei'):(Bi'-Ei') pack.2f d15,d14,d0 move.l (r6)+,d11 ; d11 = Wdr:Wdi ] [ ; 04 sod2aaii d12,d6,d4 ; Ar'+Br'+Er' sod2aaii d13,d7,d5 ; Ai'+Bi'+Ei' mac2assar d8,d11,d0.h,d14 ; M2_3 = (Cr'+Dr'):(Cr'-Dr') mac2aassi d8,d11,d0.l,d15 ; M2_4 = (Ci'+Di'):(Ci'-Di') move.2f (r14),d0:d1 ; d0:d1 = [a,b], Radix-5 parameters move.2f (r15),d8:d9 ; d8:d9 = [c,d], Radix-5 parameters ] [ ; 05 move.l d6,d2 move.l d7,d3 sod2aaii d4,d14,d4 ; Aout_r = Ar'+Br'+Cr'+Dr'+Er' sod2aaii d5,d15,d5 ; Aout_i = Ai'+Bi'+Ci'+Di'+Ei' ] [ ; 06 mac +d0,d2,d12 ; Ar' + a*(Br'+Er') mac +d0,d3,d13 ; Ai' + a*(Bi'+Ei') tfr d12,d14 tfr d13,d15 move.l d14,d10 move.l d15,d11 ] [ ; 07 mac -d8,d10,d12 ; Ar' + (a*(Br'+Er')-c*(Cr'+Dr')) mac -d8,d11,d13 ; Ai' + (a*(Bi'+Ei')-c*(Ci'+Di')) mac -d8,d2,d14 ; Ar' - c*(Br'+Er') mac -d8,d3,d15 ; Ai' - c*(Bi'+Ei') adda #4,r0,r1 ; B input moves.2f d4:d5,(r8)+ ; out:[Aout_r,Aout_i] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 83 DFT Reference Code ] [ ; 08 pack.2f d13,d12,d0 mac +d0,d10,d14 mac +d0,d11,d15 move.l (sp-64),d8 move.2f (r0)+n0,d12:d13 ] [ ; 09 mac2aassi d1,d3,d0.h,d4 mac2ssaai d1,d2,d0.l,d5 pack.2f d15,d14,d0 move.l (r1)+n0,d14 move.l (r0)+n0,d15 ] [ ; 10 mac2aassi d9,d11,d4 mac2ssaai d9,d10,d5 mac2aassi d9,d3,d0.h,d6 mac2ssaai d9,d2,d0.l,d7 move.l (r0)+,d9 move.l (r4)+,d2 ] [ ; 11 asrr2 d8,d14 asrr2 d8,d15 mac2saasi d1,d11,d6 mac2assai d1,d10,d7 moves.2f d4:d5,(r9)+ move.2w d4:d5,(r12)+ ] [ ; 12 asrr d8,d12 moves.2f d6:d7,(r10)+ move.2w d6:d7,(r11)+ ] loopend0 ; ; ; ; Ar' - (c*(Br'+Er')-a*(Cr'+Dr')) Ai' - (c*(Bi'+Ei')-a*(Ci'+Di')) Scaling factor d12.H = Ar, d13.H = Ai ; ; ; ; M5_1.H M5_1.L M5_2.H M5_2.L = = = = Ar'+(a(Br'+Er')-c(Cr'+Dr'))+(b(Bi'-Ei')), Ar'+(a(Br'+Er')-c(Cr'+Dr'))-(b(Bi'-Ei')) Ai'+(a(Bi'+Ei')-c(Ci'+Di'))-(b(Br'-Er')), Ai'+(a(Bi'+Ei')-c(Ci'+Di'))+(b(Br'-Er')) ; d14 = Br:Bi ; d15 = Cr:Ci ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; Bout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))+ (b(Bi'-Ei')+d(Ci'-Di')), Eout_r = Ar'+(a(Br'+Er')-c(Cr'+Dr'))(b(Bi'-Ei')+d(Ci'-Di')) Bout_i = Ai'+(a(Bi'+Ei')-c(Ci'+Di'))(b(Br'-Er')+d(Cr'-Dr')), Eout_i = Ai'+(a(Bi'+Ei')-c(Ci'+Di'))+ (b(Br'-Er')+d(Cr'-Dr')) M5_3.H = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+(d(Bi'-Ei')), M5_3.L = Ar'-(c(Br'+Er')-a(Cr'+Dr'))-(d(Bi'-Ei')) M5_4.H = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))-(d(Br'-Er')), M5_4.L = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))+(d(Br'-Er')) d9 = Er:Ei d2 = Wbr:Wbi ; ; ; ; ; ; ; ; ; ; ; ; ; ; d14 = Br:Bi>>S d15 = Cr:Ci>>S Cout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))+ (d(Bi'-Ei')-b(Ci'-Di')), Dout_r = Ar'-(c(Br'+Er')-a(Cr'+Dr'))(d(Bi'-Ei')-b(Ci'-Di')) Cout_i = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))(d(Br'-Er')-b(Cr'-Dr')), Dout_i = Ai'-(c(Bi'+Ei')-a(Ci'+Di'))+ (d(Br'-Er')-b(Cr'-Dr')) out:[Bout_r,Bout_i] out:[Eout_r,Eout_i] ; d12 = Ar>>S ; out:[Cout_r,Cout_i] ; out:[Dout_r,Dout_i] Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 84 Freescale Semiconductor DFT Reference Code exit_mixed_radix: [ move.l (my_OUTPUT_pointer),r14 ;// --> psFft->psiOut move.l (sp-12),d3 ; out_address psRadix.psiOut ] [ move.l (sp-36),d4 adda #-72,sp,r5 ] [ tfra r5,sp ] pop.2l d6:d7 pop.2l r6:r7 F_sc3850_dft_dit_complex_16x16_auto_scale_asm_end [ rtsd move.l d4,sr ] move.l d3,(r14) ; psFft->psiOut = psRadix.psiOut; endsec Software Optimization of DFTs and IDFTs Using the StarCore SC3850 DSP Core, Rev. 0 Freescale Semiconductor 85 How to Reach Us: Home Page: www.freescale.com Web Support: http://www.freescale.com/support USA/Europe or Locations Not Listed: Freescale Semiconductor, Inc. Technical Information Center, EL516 2100 East Elliot Road Tempe, Arizona 85284 +1-800-521-6274 or +1-480-768-2130 www.freescale.com/support Europe, Middle East, and Africa: Freescale Halbleiter Deutschland GmbH Technical Information Center Schatzbogen 7 81829 Muenchen, Germany +44 1296 380 456 (English) +46 8 52200080 (English) +49 89 92103 559 (German) +33 1 69 35 48 48 (French) www.freescale.com/support Information in this document is provided solely to enable system and software implementers to use Freescale Semiconductor products. There are no express or implied copyright licenses granted hereunder to design or fabricate any integrated circuits or integrated circuits based on the information in this document. Freescale Semiconductor reserves the right to make changes without further notice to any products herein. Freescale Semiconductor makes no warranty, representation or guarantee regarding the suitability of its products for any particular purpose, nor does Freescale Semiconductor assume any liability arising out of the application or use of any product or circuit, and specifically disclaims any and all liability, including without limitation consequential or incidental damages. “Typical” parameters which may be provided in Freescale Semiconductor data sheets and/or specifications can and do vary in different applications and actual performance may vary over time. All operating parameters, including “Typicals” must be validated for each customer application by customer’s technical experts. Freescale Semiconductor does not convey any license Japan: Freescale Semiconductor Japan Ltd. Headquarters ARCO Tower 15F 1-8-1, Shimo-Meguro, Meguro-ku Tokyo 153-0064 Japan 0120 191014 or +81 3 5437 9125 [email protected] under its patent rights nor the rights of others. Freescale Semiconductor products are Asia/Pacific: Freescale Semiconductor China Ltd. Exchange Building 23F No. 118 Jianguo Road Chaoyang District Beijing 100022 China +86 010 5879 8000 [email protected] claims, costs, damages, and expenses, and reasonable attorney fees arising out of, For Literature Requests Only: Freescale Semiconductor Literature Distribution Center +1-800 441-2447 or +1-303-675-2140 Fax: +1-303-675-2150 LDCForFreescaleSemiconductor @hibbertgroup.com Document Number: AN3680 Rev. 0 11/2010 not designed, intended, or authorized for use as components in systems intended for surgical implant into the body, or other applications intended to support or sustain life, or for any other application in which the failure of the Freescale Semiconductor product could create a situation where personal injury or death may occur. Should Buyer purchase or use Freescale Semiconductor products for any such unintended or unauthorized application, Buyer shall indemnify and hold Freescale Semiconductor and its officers, employees, subsidiaries, affiliates, and distributors harmless against all directly or indirectly, any claim of personal injury or death associated with such unintended or unauthorized use, even if such claim alleges that Freescale Semiconductor was negligent regarding the design or manufacture of the part. Freescale, the Freescale logo, StarCore, and CodeWarrior are trademarks of Freescale Semiconductor, Inc., Reg. U.S. Pat. & Tm. Off. All other product or service names are the property of their respective owners. © 2009–2010 Freescale Semiconductor, Inc.