AC398: Implementation of 9x9 Multiplications, Wide-Multiplier, and Extended Addition Using IGLOO2 / SmartFusion2 Mathblock - Libero SoC v11.4

Table of Contents
Purpose
Introduction
References
Design Requirements
Using 9x9 Multiplier Mode
    Overview
    Configuration
    Guidelines
    Design Examples
Wide-Multiplier
    Overview
    Configuration
    Guidelines
    Design Examples
Extended Addition
    Overview
    Configuration
    Guidelines
    Design Examples
Conclusion
Appendix A - Design Files
List of Changes
Purpose
This application note describes the design guidelines and the different implementation methods that give
better performance when implementing wide-multipliers, 9-bit×9-bit multiplications, and extended
addition with the IGLOO®2 field programmable gate array (FPGA)/SmartFusion®2 system-on-chip (SoC)
FPGA mathblock (MACC). The 9-bit×9-bit multiplications, wide-multiplier, and extended addition are
suited to applications with high-performance, computationally intensive signal processing operations,
such as finite impulse response (FIR) filtering, fast Fourier transforms (FFTs), and digital
up/down conversion. These functions are widely used in video processing, 2D/3D image processing,
wireless, industrial applications, and other digital signal processing (DSP) applications.
September 2014
© 2014 Microsemi Corporation
Introduction
The IGLOO2/SmartFusion2 mathblock architecture has been optimized to implement various common
DSP functions with maximum performance and minimum logic resource utilization. The dedicated routing
region around the mathblock and the feedback paths provided in each mathblock result in routing
improvements. The IGLOO2/SmartFusion2 mathblock has a variety of features for fast and easy
implementation of many basic math functions. The high speed multiplier (9×9, 18×18), adder/subtractor,
and accumulator in the mathblock deliver high speed math functions. For more information on the
IGLOO2/SmartFusion2 mathblock, refer to the IGLOO2 FPGA Fabric User Guide/SmartFusion2 FPGA
Fabric User Guide, and for usage of the mathblock, refer to the Inferring Microsemi SmartFusion2 MACC
Blocks Application Note.
This application note explains the design considerations and different methods for implementing the
following:
• Using 9x9 Multiplier Mode
• Wide-Multiplier
• Extended Addition
References
The following documents are referenced in this document.
• IGLOO2 FPGA Fabric User Guide
• SmartFusion2 FPGA Fabric User Guide
• Inferring Microsemi SmartFusion2 MACC Blocks Application Note
• IGLOO2/SmartFusion2 Hard Multiplier AddSub Configuration User Guide
• IGLOO2/SmartFusion2 Hard Multiplier Accumulator Configuration User Guide
• IGLOO2/SmartFusion2 Hard Multiplier Configuration User Guide
Revision 1
Design Requirements
Table 1 shows the design requirements.
Table 1 • Design Requirements
Design Requirements               Description
Hardware Requirements
  Host PC                         Any 64-bit Windows Operating System
Software Requirements
  Libero® System-on-Chip (SoC)    v11.4
  Modelsim®                       v10.3
Using 9x9 Multiplier Mode
Overview
The 9-bit×9-bit multipliers are extensively used in low precision video processing applications. In video
applications, they are used for color format conversions such as YUV to RGB, RGB to YUV, and RGB to
YCbCr, and for NTSC, PAL, and similar formats. In image processing, they are used for operations
involving 8-bit RGB such as 3×3, 5×5, and 7×7 matrix multiplications, image enhancement techniques,
scaling, and resizing. The IGLOO2/SmartFusion2 device addresses these applications by using the
mathblock in dot product (DOTP) mode.
The following sections explain the DOTP configurations and capabilities, guidelines, different
implementation methods with design examples, and their performance and simulation results.
The mathblock, when configured in DOTP mode, has two independent 9-bit×9-bit multipliers followed by
an adder. The sum of the two 9×9 products (the DOTP result) is stored in the upper 35 bits of the 44-bit
register. In DOTP mode, the mathblock implements the following equation:
Multiplier result = (A[8:0] × B[17:9] + A[17:9] × B[8:0]) × 2^9
EQ 1
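EQ 1 can be checked with a small behavioral model. The following Python sketch is illustrative only and is not taken from the design files. It assumes that the 9-bit operands are signed (two's complement), and the helper names to_signed, pack, and dotp are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pack(hi, lo):
    """Pack two 9-bit operands into one 18-bit bus as {hi[8:0], lo[8:0]}."""
    return ((hi & 0x1FF) << 9) | (lo & 0x1FF)


def dotp(a18, b18):
    """Model EQ 1 on the raw 18-bit A and B buses; returns the raw 44-bit value."""
    a_lo, a_hi = to_signed(a18, 9), to_signed(a18 >> 9, 9)
    b_lo, b_hi = to_signed(b18, 9), to_signed(b18 >> 9, 9)
    total = a_lo * b_hi + a_hi * b_lo          # A[8:0]*B[17:9] + A[17:9]*B[8:0]
    return (total << 9) & ((1 << 44) - 1)      # x 2^9, truncated to 44 bits


# Assumed example operands: A = {3, 5}, B = {7, -4} gives 5*7 + 3*(-4) = 23.
raw = dotp(pack(3, 5), pack(7, -4))
dot_product = to_signed(raw >> 9, 35)          # read back the upper 35 bits
```

Reading the result back from bits [43:9] undoes the 2^9 scaling, which is why the dot product occupies the upper 35 bits of the 44-bit register.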
Configuration
The IGLOO2/SmartFusion2 mathblock in DOTP mode can be used in three different configurations.
These configurations are available in the Libero software under Catalog > Arithmetic, as given below:
• Multiplier
• Multiplier accumulator
• Multiplier addsub
Figure 1 shows the dot product multiplier adder with the IGLOO2/SmartFusion2 mathblock.
Figure 1 • Dot Product Multiplier Adder
(Block diagram of the SF2/GL2 MACC. Signal labels: A0[8:0], B0[8:0], A1[8:0], B1[8:0], C[43:0], Carryin, CDOUT[43:0], CARRYOUT/OVERFLOW.)
PN = PN-1 + (A0*B0 + A1*B1) + Carryin + C[43:0]
Figure 2 shows the dot product multiplier accumulator with mathblock.
Figure 2 • Dot Product Multiplier Accumulator
(Block diagram of the SF2/GL2 MACC. Signal labels: A0[8:0], B0[8:0], A1[8:0], B1[8:0], C[43:0], Carryin (0 or 1), CDIN, 0's, P[43:0], CDOUT[43:0], CARRYOUT/OVERFLOW.)
PN = (A0*B0 + A1*B1) + Carryin + C[43:0] + CDIN
Figure 3 shows the implemented DOTP multiplier.
Figure 3 • Dot Product Multiplier
(Block diagram of the SF2/GL2 MACC. Signal labels: A0[8:0], B0[8:0], A1[8:0], B1[8:0], P[18:0].)
P = A0*B0 + A1*B1
Math Functions with DOTP
When DOTP is enabled, several mathematical functions can be implemented. Some of them are listed in
Table 2.
Table 2 • Math Functions with DOTP, Single Mathblock (DOTP Enabled)
Conditions                                          Implemented Equations
P = A[8:0] = B[17:9]; M = A[17:9]; N = B[8:0]       Y = P² + M×N
P = A[8:0] = B[17:9]; Q = A[17:9] = B[8:0]          Y = P² + Q²
A[8:0] = B[17:9] = 1; Q = A[17:9] = B[8:0]          Y = 1 + Q²
A[8:0] = B[17:9] = 1; P = A[17:9]; Q = B[8:0]       Y = 1 + P×Q
P = A[8:0] = A[17:9]; Q = B[17:9] = B[8:0]          Y = P×Q + P×Q = 2×P×Q
In this way, several 9-bit mathematical functions can be implemented using DOTP mode with a single
mathblock.
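The operand wiring behind these functions can be illustrated in a few lines of Python. This is a sketch only: dotp_sum and the wrapper names are hypothetical, the 2^9 scaling of EQ 1 is left out, and the operands are assumed to be already split into 9-bit halves.

```python
def dotp_sum(a_lo, a_hi, b_lo, b_hi):
    """A[8:0]*B[17:9] + A[17:9]*B[8:0] on already-split 9-bit operands."""
    return a_lo * b_hi + a_hi * b_lo


def square_plus_product(p, m, n):
    """Y = P^2 + M*N: P drives A[8:0] and B[17:9]; M drives A[17:9]; N drives B[8:0]."""
    return dotp_sum(a_lo=p, a_hi=m, b_lo=n, b_hi=p)


def sum_of_squares(p, q):
    """Y = P^2 + Q^2: P drives A[8:0] and B[17:9]; Q drives A[17:9] and B[8:0]."""
    return dotp_sum(a_lo=p, a_hi=q, b_lo=q, b_hi=p)


def one_plus_product(p, q):
    """Y = 1 + P*Q: A[8:0] and B[17:9] are tied to 1."""
    return dotp_sum(a_lo=1, a_hi=p, b_lo=q, b_hi=1)


def twice_product(p, q):
    """Y = 2*P*Q: P drives both halves of A; Q drives both halves of B."""
    return dotp_sum(a_lo=p, a_hi=p, b_lo=q, b_hi=q)
```

Each function differs only in which value is routed to which half of the A and B buses, so a single mathblock covers all of them.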
Guidelines
Microsemi recommends the following when designing with the DOTP multiplier:
• To implement the equation Y = A×B + C×D, instantiate Arithmetic IP cores with DOTP enabled for 9×9
multiplications. This avoids inferring two 18×18 multipliers.
• Register the inputs and outputs when using Arithmetic IP cores (Mathblock).
• Use the same clock for the registered inputs and outputs.
• Use the cascade feature to connect multiple mathblocks. This is achieved by connecting the
cascade output (CDOUT) of one mathblock to the cascade input (CDIN) of another mathblock.
• For more information on VHDL/Verilog coding styles for inferring mathblocks, refer to the Inferring
Microsemi SmartFusion2 MACC Blocks Application Note.
Design Examples
This section illustrates the 9×9 Multiplier mode usage with the following design examples:
• Example 1: 6-tap FIR Filter Using Multiple Mathblocks
• Example 2: 6-tap FIR Filter Using Single Mathblock
• Example 3: Alpha Blending
Example 1: 6-tap FIR Filter Using Multiple Mathblocks
This design example (Figure 4 on page 7) shows the 6-tap FIR filter (systolic FIR filter) implementation
with multiple mathblocks and also shows the performance results of the implementation.
Design Description
The 6-tap FIR filter design with multiple mathblocks is a systolic architecture implementation, refer to
Figure 4. This architecture uses a single IGLOO2/SmartFusion2 mathblock to perform two
independent 9×9 multiplications followed by an addition, instead of using two mathblocks that each have a
single multiplication unit. With this architecture, only three mathblocks are required to
design a 6-tap FIR filter. The 6-tap FIR design uses cascaded chains (CDOUT to CDIN) to propagate
the sum, which gives the best performance and reduces fabric resource usage. In this implementation
technique, the mathblock is configured as DOTP multiplier adder. Eight pipeline registers are added in
fabric only at the input.
When designing n-tap systolic FIR filters with the IGLOO2/SmartFusion2 mathblock for 9-bit input data and
9-bit coefficients, only n/2 mathblocks are used, saving n/2 mathblock resources.
Figure 4 • 6-tap Systolic FIR Filter
(Block diagram, 6-tap FIR (9-bit x 9-bit). Signal labels: Xin[8:0], C0[8:0] to C5[8:0], reset_n, clk, three SF2/GL2 MACC blocks each with a CDIN input, Zeros, Yn_out.)
In this design, the FIR filter generates an output on every clock cycle after an initial latency of 10 clock
cycles.
Total initial latency = 8 clock cycles for 6 input samples + 2 clock cycles (MACC block input and output
are registered) = 10 clock cycles
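The arithmetic performed by the three cascaded mathblocks can be sketched as a behavioral model. The Python code below is illustrative only: the function name fir_dotp is hypothetical, the pipeline registers and the 10-cycle latency are not modeled, and the coefficient values are the ones quoted for the simulation of this example.

```python
def fir_dotp(samples, coeffs):
    """Compute an n-tap FIR output stream, two taps per mathblock (n must be even)."""
    if len(coeffs) % 2:
        raise ValueError("this sketch assumes an even number of taps")
    history = [0] * len(coeffs)            # delay line, newest sample first
    outputs = []
    for x in samples:
        history = [x] + history[:-1]
        acc = 0                            # zeros on the first cascade input
        for k in range(0, len(coeffs), 2):
            # One mathblock in DOTP mode: two 9x9 products plus the cascaded sum.
            acc += coeffs[k] * history[k] + coeffs[k + 1] * history[k + 1]
        outputs.append(acc)
    return outputs


COEFFS = [5, 3, 7, -4, 1, -2]              # C0 to C5
impulse_response = fir_dotp([1, 0, 0, 0, 0, 0], COEFFS)
```

The inner loop runs n/2 times per sample, which corresponds to the n/2 mathblocks used by the systolic implementation.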
Design Files
For information on the implementation of the 6-tap FIR filter design, refer to the FIR_6_tap.vhd design
file provided in <Design files -> FIR_6_TAP>.
Hardware Configuration
For the 6-tap systolic FIR filter, the mathblock is configured as DOTP multiplier adder with inputs and outputs
registered, refer to Figure 5.
Figure 5 • DOTP Multiplier Adder for 6-tap Systolic FIR
Synthesis and Place-and-Route Results
Figure 6 on page 9 shows the 6-tap systolic FIR filter resource utilization that uses multiple mathblocks.
Note: The results shown are specific to the IGLOO2 device. Similar results can be achieved using the
SmartFusion2 device. Refer to SmartFusion2 design files for more information.
Resource Utilization
Figure 6 • Resource Utilization for a 6-tap Systolic FIR Filter
Place-and-Route Results
The frequency of operation achieved with this implementation after place-and-route is shown in Figure 7.
Figure 7 • Place-and-Route Results for 6-tap Systolic FIR Filter
Simulation Results
Figure 8 shows the post synthesis simulation results. The coefficient values (c0-c5) are configured in
design as C0 = 5, C1 = 3, C2 = 7, C3 = -4, C4 = 1, C5 = -2. The simulation results show that the 6-tap FIR
filter outputs on every clock cycle. It has an initial latency of 10 clock cycles.
Figure 8 • 6-tap FIR Filter Post Synthesis Simulation
Example 2: 6-tap FIR Filter Using Single Mathblock
This design example shows the 6-tap FIR filter implementation with single-mathblock (MAC FIR filter)
and also shows the performance result of the implementations, refer to Figure 9.
Design Description
The 6-tap FIR filter can also be implemented with a single mathblock as shown in Figure 9. This design
uses a coefficient memory that stores the coefficients and an input memory that stores the input samples. The
control logic reads two consecutive coefficients from the coefficient memory and two consecutive input
samples from the input memory and provides them to the mathblock. Because of the dual independent 9-bit×9-bit
multipliers, the filter result is calculated in four clock cycles instead of the six clock cycles needed with a single
multiplier and accumulator.
If a single multiplier and accumulator is used for the sum of products, the number of cycles taken for the
result is the same as the number of coefficients (taps) used in the filter design. With this relationship,
the performance of a single multiplier and accumulator is given as follows:
Maximum input sample rate = System Clock / (Number of taps + 1)
With the IGLOO2/SmartFusion2 mathblock, that is, for two products followed by an accumulator, the sample rate
= Clock / ((1/2 × number of taps) + 1)
For the 6-tap FIR filter, sample rate = Clock / (6/2 + 1) = Clock / 4
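The two sample-rate formulas can be compared numerically. The following Python sketch applies them exactly as printed above; the function names and the 100 MHz clock are assumptions chosen for illustration, not measured results.

```python
def cycles_single_mac(taps):
    """Cycles per output with one multiplier and accumulator: taps + 1."""
    return taps + 1


def cycles_dotp_mac(taps):
    """Cycles per output with two products per cycle: taps/2 + 1 (even taps assumed)."""
    return taps // 2 + 1


def max_sample_rate(clock_hz, cycles_per_output):
    """Maximum input sample rate for a given clock and cycles per output."""
    return clock_hz / cycles_per_output


CLOCK_HZ = 100e6                                         # hypothetical system clock
rate_dotp = max_sample_rate(CLOCK_HZ, cycles_dotp_mac(6))      # Clock / 4
rate_single = max_sample_rate(CLOCK_HZ, cycles_single_mac(6))  # Clock / 7
```

For the same clock, halving the number of multiply cycles raises the achievable input sample rate of the 6-tap filter from Clock/7 to Clock/4 under these formulas.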
Figure 9 • 6-tap FIR Filter With Single Mathblock
(Block diagram, single MAC 6-tap FIR (9-bit×9-bit). Signal labels: clk, reset_n, Xin[8:0], Xin_valid, Coef_in[8:0], Coef_valid, Filter_en, FiltOp_en, Coef_addr, Data_addr, Input 1[8:0], Input 2[8:0], Coef 1[8:0], Coef 2[8:0], ready, Yn_out. Blocks: control logic, input samples memory 8×9 (depth×width), coefficient memory 8×9 (depth×width), SF2/GL2 MACC.)
Design Files
For information on the implementation of the 6-tap FIR filter design, refer to the MAC_FIR_6_tap.vhd
design file provided in <Design files -> FIR_6_TAP_singleMACC>.
Hardware Configuration
In this implementation, the mathblock used is DOTP multiplier accumulator as shown in Figure 10 on
page 11.
Figure 10 • Dot Product Multiplier Accumulator
Synthesis and Place-and-Route Results
Figure 11 shows the resource utilization results for the 6-tap FIR filter with a single mathblock.
Note: The results shown are specific to the IGLOO2 device. Similar results can be achieved using the
SmartFusion2 device. Refer to SmartFusion2 design files for more information.
Resource Utilization
Figure 11 • Resource Utilization Results for a Single MAC FIR
Place-and-Route Results
The frequency of operation achieved with this implementation after place-and-route is shown in
Figure 12.
Figure 12 • Place-and-Route Results for Single MAC FIR
Example 3: Alpha Blending
The following example shows the implementation of Alpha blending used in image processing as shown
in Figure 13 on page 13. Alpha blending is the process of combining a translucent foreground color with
a background color, thereby producing a new blended color.
Design Description
The Alpha blending for each of Rnew, Gnew, and Bnew, as shown in Figure 13, is implemented using the following
equations:
Rnew = (1-alpha) x R0[7:0] + alpha x R1[7:0]
EQ 2
Gnew = (1-alpha) x G0[7:0] + alpha x G1[7:0]
EQ 3
Bnew = (1-alpha) x B0[7:0] + alpha x B1[7:0]
EQ 4
This implementation uses three mathblocks to output the R', G', B' values simultaneously for the blended image.
Each mathblock is configured as dot product multiplier for performing 9-bit×9-bit multiplications.
Figure 13 • Alpha Blending Implementation Using IGLOO2/SmartFusion2 Mathblocks
(Block diagram. Signal labels: RGB0[23:0] (Image1 pixel), RGB1[23:0] (Image2 pixel), (1-Alpha) and Alpha inputs to each of three SF2/GL2 MACC blocks, outputs Rnew, Gnew, Bnew.)
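EQ 2 through EQ 4 map directly onto the DOTP form A0×B0 + A1×B1. The Python sketch below is illustrative only. The fixed-point convention is an assumption that this document does not specify: alpha is taken as an integer from 0 to 255 with weights (255 - alpha) and alpha, and the sum is rescaled by 255 with rounding. The function names are hypothetical.

```python
def blend_channel(c0, c1, alpha):
    """Blend two 8-bit channel values; alpha is an integer weight from 0 to 255."""
    # One mathblock in DOTP mode: (1-alpha)*c0 + alpha*c1. All four operands are
    # non-negative and at most 255, so they fit in signed 9-bit inputs.
    acc = (255 - alpha) * c0 + alpha * c1
    return (acc + 127) // 255              # assumed rescaling step, with rounding


def blend_pixel(rgb0, rgb1, alpha):
    """Blend two (R, G, B) pixels; one blend_channel call per mathblock."""
    return tuple(blend_channel(a, b, alpha) for a, b in zip(rgb0, rgb1))


half_blend = blend_pixel((255, 0, 10), (0, 255, 10), 128)
```

The three channels are independent, which is why three mathblocks can produce the blended R, G, and B values in parallel.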
Hardware Configuration
For Alpha blending, mathblock is configured as DOTP multiplier with inputs and outputs registered.
Synthesis and Place-and-Route Results
Figure 14 on page 14 shows the Alpha blending resource utilization using three mathblocks.
Note: The results shown are specific to the IGLOO2 device. Similar results can be achieved using the
SmartFusion2 device. Refer to SmartFusion2 design files for more information.
Resource Utilization
Figure 14 • Resource Utilization Results for Alpha Blending
Place-and-Route Results
The frequency of operation achieved with this implementation after place-and-route is shown in
Figure 15.
Figure 15 • Place-and-Route Results for Alpha Blending
Wide-Multiplier
Overview
The wide-multipliers are extensively used in high precision (more than 18×18 multiplication) wireless and
medical applications. These applications require high precision at every stage when implementing
complex arithmetic functions used in FFTs, filters, and similar functions. Military, test, and high-performance
computing applications also have performance and precision requirements, and sometimes require
single-precision and double-precision floating-point calculations for implementing complex matrix operations
and signal transforms.
To implement DSP functions that require high precision, the IGLOO2/SmartFusion2 device supports
wide-multipliers (that is, operand widths of more than 18×18) built with the IGLOO2/SmartFusion2
mathblock. The wide-multipliers are implemented by cascading multiple IGLOO2/SmartFusion2
mathblocks using CDOUT and CDIN to propagate the result and to achieve the best performance
results.
This section describes wide-multiplier guidelines and different implementation methods with design
examples to achieve the best performance results.
Configuration
When implementing the wide-multipliers, the IGLOO2/SmartFusion2 mathblock is configured in Normal
mode to function as normal multiplier (18×18), normal multiplier accumulator, and normal multiplier
addsub.
Guidelines
The following are recommended when implementing a wide-multiplier to achieve the best results:
• Register the inputs and outputs with the same clock.
• Add pipeline stages in RTL so that the synthesis tool can automatically infer the mathblock
registers, or register the inputs and outputs of the mathblock if Arithmetic IP cores (Mathblock) are
used.
• Connect the CDOUT of one mathblock to the CDIN of another mathblock.
Design Examples
This section shows the wide-multiplier with the following design examples:
• Multiplier 32×32 implementation using multiple mathblocks
• Multiplier 32×32 implementation using single mathblock
The following section explains the 32×32 multiplier implementation with multiple mathblocks and with
single mathblock. It also shows the performance results for both the implementations.
Example 1: Multiplier 32×32 Implementation Using Multiple Mathblocks
The following section explains the 32×32 multiplier implementation with multiple mathblocks and shows
the performance results.
Design Description
The 32×32 multiplier is implemented using the following algorithm:
A = (AH × 2^17) + AL;
B = (BH × 2^17) + BL;
A×B = (AH × 2^17 + AL) × (BH × 2^17 + BL)
= ((AH×BH) × 2^34) + ((AH×BL + AL×BH) × 2^17) + AL×BL
The 32×32 multiplier is implemented efficiently using four mathblocks, without using fabric resources, to
produce a 64-bit result as shown in Figure 16 and Figure 17. To achieve the best performance
results, use the mathblock input and output registers.
AH = {A[31], A[31], A[31], A[31:17]}; AL = {'0', A[16:0]}
BH = {B[31], B[31], B[31], B[31:17]}; BL = {'0', B[16:0]}
Figure 16 • 32x32 Multiplication
(Diagram of A[31:0] x B[31:0]: the partial products AL x BL, AH x BL, AL x BH, and AH x BH are computed in Mathblock1 to Mathblock4, sign-extended, aligned at 0-bit, 17-bit, and 34-bit offsets, and summed to produce P[16:0], P[33:17], and P[63:34].)
Figure 17 • Implementation of 32x32 Multiplier
(Block diagram, Multiplier 32x32: four SF2/GL2 MACC blocks with operand inputs AL, BL, AH, and BH, a Zeros input, 17-bit shifts between stages, and outputs P[16:0], P[33:17], and P[63:34].)
When implementing using HDL, pipeline stages are added at the input and output so that the synthesis tool
infers the mathblock input and output registers and maximum throughput is achieved. In this design, two pipeline stages
are added at input and output. Refer to the design files for information on the implementation of the 32x32 multiplier.
Design Files
For information on the implementation of the multiplier 32×32 design, refer to the
Mult32×32_multipleMACC.vhd design file provided in <Design files -> Mult32×32_multipleMACC>.
Hardware Configuration
For the 32×32 multiplier using multiple mathblocks, the mathblocks are configured to function as normal multiplier and
normal multiplier addsub with ARSHFT enabled, with inputs and outputs registered.
Normal Multiplier -> P = A0×B0
Normal Multiplier Accumulator -> Pn = Pn-1 + CARRYIN + C +/- A0×B0
Normal Multiplier Addsub -> Pn = D + CARRYIN + C +/- A0×B0 (if ARSHFT is disabled)
-> Pn = (D>>17) + CARRYIN + C +/- A0×B0 (if ARSHFT is enabled)
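The decomposition and the 17-bit arithmetic shift can be checked with a behavioral model. The Python sketch below is illustrative only and is not taken from the design files: the function names are hypothetical, the ordering of the four stages is one possible arrangement, and registers and latency are not modeled. Python integers are unbounded, so the 44-bit width of the mathblock is not enforced here.

```python
MASK17 = (1 << 17) - 1


def split(value):
    """Split a signed 32-bit value into a signed high part and unsigned low 17 bits."""
    return value >> 17, value & MASK17     # value == high * 2**17 + low


def mult32x32(a, b):
    """Multiply two signed 32-bit values using four 18x18-style partial products."""
    ah, al = split(a)
    bh, bl = split(b)
    s1 = al * bl                           # stage 1: AL x BL
    p_low = s1 & MASK17                    # P[16:0]
    s2 = (s1 >> 17) + ah * bl              # stage 2: shifted cascade input + AH x BL
    s3 = s2 + al * bh                      # stage 3: cascade input + AL x BH
    p_mid = s3 & MASK17                    # P[33:17]
    s4 = (s3 >> 17) + ah * bh              # stage 4: shifted cascade input + AH x BH
    return (s4 << 34) + (p_mid << 17) + p_low


product = mult32x32(123456789, -987654321)
```

The right shift by 17 at stages 2 and 4 plays the role of the (D>>17) term, carrying the upper bits of one partial sum into the next stage while the low 17 bits go straight to the result.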
Synthesis and Place-and-Route Results
Figure 18 on page 18 shows the 32×32 multiplier resource utilization when using multiple mathblocks.
Note: The results shown are specific to the IGLOO2 device. Similar results can be achieved using the
SmartFusion2 device. Refer to SmartFusion2 design files for more information.
Resource Utilization
Figure 18 • Resource Utilization for Multiple Mathblocks
Place-and-Route Results
The frequency of operation achieved with this implementation after place-and-route is shown in
Figure 19.
Figure 19 • Place-and-Route Results for 32×32 With Multiple Mathblock
Example 2: 32×32 Multiplier Implementation Using Single Mathblock
The following section explains the 32×32 multiplier implementation with a single mathblock and also
shows the performance results.
Design Description
The 32×32 multiplier is implemented using the same algorithm as shown in the "Example 1: Multiplier 32×32
Implementation Using Multiple Mathblocks" section.
A×B = ((AH×BH) × 2^34) + ((AH×BL + AL×BH) × 2^17) + AL×BL
= ((AH×BH) × 2^34) + (AH×BL × 2^17) + (AL×BH × 2^17) + AL×BL
In this implementation, the four multiplications are computed sequentially using a single mathblock.
The control finite-state machine (FSM) in the design provides the inputs to the mathblock
sequentially in four successive states, as shown in Figure 20, and enables the
shift operation in the corresponding states. The mathblock used in this design is configured as a normal
multiplier accumulator Arithmetic IP core. Refer to the IGLOO2/SmartFusion2 Hard Multiplier Accumulator
Configuration User Guide for configuration.
The time taken to generate the output = 4 clock cycles for providing inputs
+ 2 clock cycles as the inputs and output are registered
+ 2 clock cycles by the mathblock at input and output
= 8 clock cycles
Figure 20 • Multiplier 32×32 with One MACC Block
(Block diagram, Multiplier 32 x 32. Signal labels: clk, reset_n, A[31:0], B[31:0], mul_en, Curr_State, ARSHFT, Zeros, Result, mul_result_valid. Blocks: control FSM and SF2/GL2 MACC block with ports A, B, C, D, and P. The FSM supplies the operand pairs AL/BL, AH/BL, AL/BH, and AH/BH, each 18 bits wide, in turn.)
Design Files
For more information on the implementation of the multiplier 32×32 design, refer to the Mult32×32.vhd
design file provided in <Design files -> Mult32×32>.
Hardware Configuration
For 32×32 multiplier using single mathblock, it is configured to function as normal multiplier accumulator
with inputs and outputs registered.
Synthesis and Place-and-Route results
Figure 21 on page 20 shows the 32×32 multiplier resource utilization when using a single mathblock.
Note: The results shown are specific to the IGLOO2 device. Similar results can be achieved using the
SmartFusion2 device. Refer to SmartFusion2 design files for more information.
Resource Utilization
Figure 21 • Resource Utilization for a Single Mathblock
Place-and-Route Results
The frequency of operation achieved with this implementation after place-and-route is shown in
Figure 22.
Figure 22 • Place-and-Route Results for 32×32 Multiplier with Single Mathblock
Simulation Results
Figure 23 shows the post synthesis simulation results. The simulation results show that the multiplier
output is available 8 clock cycles after the input is provided.
Figure 23 • Multiplier 32×32 Post Synthesis Simulation Results
Extended Addition
Overview
Mathblock has a 3-input adder and supports accumulation up to 44 bits. In some applications, such as
floating point multiplication, complex-FFT and filters, high precision data has to be maintained at every
stage. These DSP functions require more than 44-bit addition (extended addition) which can be realized
using the IGLOO2/SmartFusion2 mathblock (3-input adder) and fabric logic. The extended addition is
implemented by dividing the addition into two parts. The lower part (LSB) of addition is implemented
using IGLOO2/SmartFusion2 mathblock and upper part (MSB) of addition is implemented with minimal
fabric adder logic.
For a 2-input addition, the inputs can be from any one of the following:
1. CDIN and C input
2. Multiplier output and CDIN
3. Multiplier output and C input
For a 3-input addition, the inputs are from multiplier output, CDIN, and C-input. To perform arithmetic
additions, the IGLOO2/SmartFusion2 mathblock provides Carryin input and Carryout signal for
propagating the carry from one mathblock to another mathblock or from mathblock to fabric logic.
Configuration
When implementing the extended addition, the IGLOO2/SmartFusion2 mathblock is configured in
Normal mode to function as normal multiplier addsub.
Guidelines
• Configure the mathblock to function as multiplier adder/subtractor to perform 2-input
extended signed addition.
• Add pipeline stages in RTL so that the synthesis tool can automatically infer the mathblock
registers, or register the inputs and outputs of the mathblock if Arithmetic IP cores (Mathblock) are
used.
• Make sure that the CDOUT of one mathblock is connected to the CDIN of another mathblock.
Design Examples
This section shows the extended addition with the following design examples:
• 2-input extended signed addition
• 3-input extended signed addition
Example 1: 2-input Signed Extended Addition
The following section shows a 2-input extended signed addition in which one operand is more than 44 bits
wide. This section also shows how the 2-input extended signed addition is implemented using the
multiplier adder together with fabric logic.
Design Description
2-Input Addition
For computing 2-input extended signed addition Z = U + V, with one operand width more than the
mathblock output width 44, the following logic must be implemented in fabric as shown in Figure 24.
Figure 24 • 2-input Extended Signed Addition
Where U is an m-bit value (where m > 44) and V is a signed value narrower than 44 bits that is sign-extended.
The 2-input extended signed addition is divided into two parts. The lower part is computed in the mathblock and the
upper part is computed in the fabric.
Z = {Sumupper, Sumlower}
EQ 5
The lower part of the sum, Z = U + V, is calculated by providing the U[(n-1):0] and V[(n-1):0] inputs to the
mathblock, where n = 44 is the mathblock output width.
Sumlower = U[(n-1):0] + V[(n-1):0]
EQ 6
The upper part of the sum Z = U + V is calculated as shown below
(where U[m:n] and V[m:n] are the MSBs):
Sumupper = U[m:n] + V[m:n]
EQ 7
V[m:n] = {S, S, ..., S, X}
S = P[n-1] AND X
Where,
P[n-1] is the MSB of Sumlower
X is the overflow of Sumlower (from the mathblock)
(m-n-1) copies of S must be placed in the MSBs of V[m:n].
Hardware Implementation
Figure 25 shows the implementation of the 2-input extended signed addition for a 52-bit wide operand C.
For 2-input addition, the mathblock is configured as multiplier addsub in
Normal mode. The lower and upper parts of the sum are as follows:
For 52-bit, 2-input extended signed addition,
Sumlower = C[43:0] + A[17:0]×B[17:0]
Sumupper = C[51:44] + {S, S, S, S, S, S, S, CARRYOUT}
Result[51:0] = {Sumupper, Sumlower}
Result[51:0] = {C[51:44] + {S, S, S, S, S, S, S, CARRYOUT}, P[43:0]}
Where,
S = P[43] AND CARRYOUT
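The split between the mathblock and the fabric can be checked with a behavioral model. The Python sketch below is illustrative only and rests on a stated assumption: X is modeled as a flag that is 1 whenever the exact lower sum of C[43:0] (taken as unsigned) and the signed product leaves the range 0 to 2^44 - 1. How the CARRYOUT/OVERFLOW pin of the mathblock produces this flag is not modeled, and the function names are hypothetical.

```python
N = 44                                     # mathblock adder width
M = 52                                     # operand width of the 52-bit example


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def extended_add_2(c, v):
    """Return c + v for a signed 52-bit c and a signed v with |v| < 2**43."""
    c_bits = c & ((1 << M) - 1)
    c_lo, c_hi = c_bits & ((1 << N) - 1), c_bits >> N
    lower = c_lo + v                       # exact lower sum
    p = lower & ((1 << N) - 1)             # P[43:0]
    x = 0 if 0 <= lower < (1 << N) else 1  # assumed meaning of X (see lead-in)
    s = (p >> (N - 1)) & x                 # S = P[43] AND X
    v_hi = (0xFE if s else 0x00) | x       # {S, S, S, S, S, S, S, X}
    sum_upper = (c_hi + v_hi) & 0xFF       # fabric adder, 8 bits
    return to_signed((sum_upper << N) | p, M)
```

With this definition, {S, ..., S, X} encodes +1, 0, or -1 in two's complement, which is the only correction the upper bits can need when V is narrower than the mathblock width.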
Figure 25 • Fabric Logic for 2-input Extended Addition
(Block diagram. Signal labels: SF2/GL2 MACC with inputs A[17:0], B[17:0], and C[43:0] and outputs P[43:0] and CARRYOUT; fabric logic for the 2-input adder with P[43], X, S, {S, S, S, S, S, S, X}, and C[51:44]; output Result[51:0].)
Design Files
For information on the implementation of the 2-input extended addition, refer to the
Extended_adder_2_input.vhd design file provided in <Design files -> Extended_adder_2_input>.
Synthesis and Place-and-Route Results
Figure 26 on page 24 shows the 2-input extended addition resource utilization when using the mathblock
and fabric logic.
Note: The results shown are specific to the IGLOO2 device. Similar results can be achieved using the
SmartFusion2 device. Refer to SmartFusion2 design files for more information.
Resource Utilization with Fabric Adder Logic
Figure 26 • Resource Utilization for 2-input Extended Addition with Fabric Resources
Place-and-Route Results with Fabric Adder Logic
The frequency of operation achieved with this implementation after place-and-route is shown in
Figure 27.
Figure 27 • Place-and-Route Results for 2-input Extended Addition with Fabric Resources
Simulation Results
Figure 28 shows the post synthesis simulation results. The simulation results show that the 2-input
addition output is available on the next clock cycle after the input is provided.
Figure 28 • Post Synthesis Simulation Results for 2-Input Extended Addition with Fabric Adder
Example 1: 3-input Signed Extended Addition
This section explains 3-input extended signed addition when one or more operands are more than
44 bits wide, and shows the implementation logic for 3-input extended signed addition using
fabric resources.
Design Description
3-input Extended Addition
To perform the 3-input extended addition Z = T + U + V, where two operands are wider than the
44-bit mathblock input, the logic shown in Figure 29 must be implemented in the fabric.
Figure 29 • 3-input Extended Signed Addition
Where T and U are m-bit values (m > 44) and V is a sign-extended n-bit value (n < 44). The
3-input extended signed addition is divided into two parts: the lower part is computed in the
mathblock and the upper part is computed in the fabric.
Z = {Sumupper, Sumlower}    (EQ 8)
The lower part of the sum Z = T + U + V is calculated by providing the inputs {'0', T[(n-2): 0]},
{'0', U[(n-2): 0]}, and V[(n-1): 0] to the mathblock, where n = 44 is the mathblock output width.
Sumlower = {'0', T[(n-2): 0]} + {'0', U[(n-2): 0]} + V[(n-1): 0]    (EQ 9)
The upper part of the sum Z = T + U + V is calculated as follows:
Sumupper = T[m: n-1] + U[m: n-1] + V[m: n]    (EQ 10)
(where T[m: n], U[m: n], and V[m: n] are the MSB bits)
V[m: n] = {S, S, …, S, X, P[n-1]}
S = P[n-1] AND X
where P[n-1] is the MSB of Sumlower, X is the overflow of Sumlower (from the mathblock), and
(m-n-2) S bits are appended in the MSB positions of V[m: n].
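The role of S can be illustrated with a small Python sketch (not part of the design files). It interprets the vector {S, …, S, X, P[n-1]} as a two's-complement number for each combination of X and P[n-1]. The function name `extension_value` and the default of seven S bits (matching the 52-bit hardware example) are illustrative assumptions, and X is assumed to behave as the bit just above P[n-1] in the lower sum.

```python
def extension_value(x, p_msb, num_s=7):
    """Signed value of the vector {S, ..., S, X, P[n-1]}, with S = P[n-1] AND X.

    x      : overflow bit of the lower sum (assumed to be the bit above P[n-1])
    p_msb  : MSB of the lower sum, P[n-1]
    num_s  : number of S bits placed above X (illustrative default)
    """
    s = p_msb & x
    width = num_s + 2                      # S bits + X + P[n-1]
    s_field = ((1 << num_s) - 1) << 2 if s else 0
    bits = s_field | (x << 1) | p_msb
    # Convert the raw bit pattern to a two's-complement integer.
    return bits - (1 << width) if bits >> (width - 1) else bits


# Enumerate all four (X, P[n-1]) combinations.
table = {(x, p): extension_value(x, p) for x in (0, 1) for p in (0, 1)}
for (x, p), value in sorted(table.items()):
    print(f"X={x} P[n-1]={p} -> contribution {value}")
```

Under these assumptions, the vector encodes a contribution of 0, 1, 2, or -1 to the upper sum. Only the combination X = 1 and P[n-1] = 1 (a negative lower sum) requires the S bits to be set, which is why S is the AND of the two bits.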
Hardware Implementation
Figure 30 shows the implementation of the 3-input extended signed addition for 52-bit wide
operands C and D. For 3-input addition, the mathblock is configured as a multiplier addsub in
Normal mode. The lower and upper parts of the sum are as follows:
For 52-bit, 3-input extended signed addition,
Sumlower = P[43:0] = {'0', C[42:0]} + {'0', D[42:0]} + A[17:0] × B[17:0]
Sumupper = {C[51:44] + {S, S, S, CARRYOUT}}
Result[51:0] = {Sumupper, Sumlower}
Result[51:0] = {C[51:43] + D[51:43] + {S, S, S, S, S, S, S, CARRYOUT, P[43]}, P[42:0]}
where S = P[43] AND CARRYOUT
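The 52-bit expression for Result[51:0] above can be checked with a bit-accurate Python model. This is a minimal sketch, not the contents of Extended_adder_3_input.vhd. The function names are hypothetical, and CARRYOUT is modeled as bit 44 of the two's-complement lower sum, which is an assumption about the mathblock's behavior when the MSBs of its C and D inputs are forced to '0'.

```python
import random

MASK43 = (1 << 43) - 1
MASK44 = (1 << 44) - 1
MASK52 = (1 << 52) - 1


def to_signed(value, width):
    """Interpret the low `width` bits of `value` as two's complement."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def macc_model(a, b, c44, d44):
    """Hypothetical mathblock model: returns (P[43:0], CARRYOUT).

    a, b     : signed 18-bit multiplier inputs
    c44, d44 : 44-bit adder inputs whose MSB is 0
    """
    total = a * b + c44 + d44
    p = total & MASK44
    carryout = (total >> 44) & 1   # assumption: bit 44 of the signed sum
    return p, carryout


def extended_add3(c, d, a, b):
    """Return C + D + A*B as a signed 52-bit value using the split adder."""
    cu, du = c & MASK52, d & MASK52
    # Lower part: {'0', C[42:0]} + {'0', D[42:0]} + A*B in the mathblock.
    p, x = macc_model(a, b, cu & MASK43, du & MASK43)
    p43 = (p >> 43) & 1
    s = p43 & x
    # Fabric extension vector {S, S, S, S, S, S, S, CARRYOUT, P[43]} (9 bits).
    ext = (0x1FC if s else 0) | (x << 1) | p43
    # Upper part: C[51:43] + D[51:43] + extension, truncated to 9 bits.
    upper = ((cu >> 43) + (du >> 43) + ext) & 0x1FF
    return to_signed((upper << 43) | (p & MASK43), 52)


def check(trials=10000, seed=0):
    """Compare the split adder against direct 52-bit wrapped addition."""
    rng = random.Random(seed)
    for _ in range(trials):
        c = rng.randrange(-(1 << 51), 1 << 51)
        d = rng.randrange(-(1 << 51), 1 << 51)
        a = rng.randrange(-(1 << 17), 1 << 17)
        b = rng.randrange(-(1 << 17), 1 << 17)
        if extended_add3(c, d, a, b) != to_signed(c + d + a * b, 52):
            return False
    return True
```

Calling `check()` compares the split computation against a direct addition wrapped to 52 bits for random operands. Only bits [42:0] of P are taken from the mathblock, while bit 43 of the result is produced by the 9-bit fabric adder.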
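[Figure 30 diagram. Labels recovered (bit ranges lost in extraction): SF2 MACC blocks with inputs A[], B[], C[], D[] and outputs P[], CARRYOUT; fabric logic for 3-input adder with signals X, S, the vector {S,S,S,S,S,S,X,P[]}, C[], D[]; output Z[].]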
Figure 30 • Fabric Logic for 3-input Extended Addition
Design Files
For more information on how to implement the 3-input extended addition, refer to the
Extended_adder_3_input.vhd design file provided in <Design files'Extended_adder_3_input>.
Synthesis and Place-and-Route Results
Figure 31 shows the 3-input extended addition resource utilization when using fabric logic.
Note: The results shown are specific to the IGLOO2 device. Similar results can be achieved using the
SmartFusion2 device. Refer to SmartFusion2 design files for more information.
Resource Utilization with Fabric Adder Logic Implemented with MACC Block
Figure 31 • Resource Utilization for 3-input Extended Addition with Fabric Resources
Place-and-Route Results with Fabric Adder Logic Implemented with MACC Block
The frequency of operation achieved with this implementation after place-and-route is shown in
Figure 32.
Figure 32 • Place-and-Route Results for 3-input Extended Addition with Fabric Resources
Simulation Results
Figure 33 shows the post-synthesis simulation results. The 3-input addition result appears three
clock cycles after the inputs are applied.
Figure 33 • Post Synthesis Simulation Results for 3-input Extended Addition with Fabric Adder
Tools Required
The example designs for 9x9 Multiplier mode, wide-multiplier, and extended addition are developed,
synthesized, and simulated using the following software tools on the IGLOO2 M2GL050/SmartFusion2
M2S050 device:
Software Tools
• Libero SoC 11.4.0.112
• Modelsim 10.3a
• Synplify Pro I-2013.09M-SP1-1
IP Cores
• Arithmetic IP cores v1.0.100
Conclusion
This application note explains IGLOO2/SmartFusion2 mathblock features such as 9x9 Multiplier mode,
wide-multiplier, and extended addition. It also provides implementation techniques and guidelines,
along with design examples, for achieving optimum performance with 9x9 multiplication,
wide-multiplier, and extended addition.
Appendix A - Design Files
Download the design files (VHDL) from the Microsemi SoC Products Group website:
http://soc.microsemi.com/download/rsc/?f=m2s_m2gl_ac398_implementation_of_9x9_widemultiplier_extended_addition_liberov11p4_an_df
Refer to the Readme.txt file included in the design file for the directory structure and description.
List of Changes
The following table lists critical changes that were made in each revision of this application
note.
Date                          Changes                                                                Page
Revision 1 (September 2014)   Updated the document for Libero v11.4 software release (SAR 59686).    NA
Revision 0 (June 2013)        Initial release.                                                       NA
Microsemi Corporate Headquarters
One Enterprise, Aliso Viejo CA 92656 USA
Within the USA: +1 (800) 713-4113
Outside the USA: +1 (949) 380-6100
Sales: +1 (949) 380-6136
Fax: +1 (949) 215-4996
E-mail: [email protected]
Microsemi Corporation (Nasdaq: MSCC) offers a comprehensive portfolio of semiconductor
and system solutions for communications, defense and security, aerospace, and industrial
markets. Products include high-performance and radiation-hardened analog mixed-signal
integrated circuits, FPGAs, SoCs, and ASICs; power management products; timing and
synchronization devices and precise time solutions, setting the world's standard for time; voice
processing devices; RF solutions; discrete components; security technologies and scalable
anti-tamper products; Power-over-Ethernet ICs and midspans; as well as custom design
capabilities and services. Microsemi is headquartered in Aliso Viejo, Calif. and has
approximately 3,400 employees globally. Learn more at www.microsemi.com.
© 2014 Microsemi Corporation. All rights reserved. Microsemi and the Microsemi logo are trademarks of
Microsemi Corporation. All other trademarks and service marks are the property of their respective owners.
51900274-1/09.14