High-speed MARS hardware

High-Speed MARS Hardware
Akashi Satoh†, Nobuyuki Ooba†, Kohji Takano†, Edward D’Avignon††
IBM research, Tokyo Research Laboratory, IBM Japan Ltd., 1623-14, Shimotsuruma,
Yamato-shi, Kanagawa 242-8502, Japan
{akashi, ooba, chano}@jp.ibm.com
††
IBM Corporation, Poughkeepsie, NY 12601, USA
[email protected]
March 15, 2000
†
Abstract. High-speed MARS encryption/decryption hardware was developed using a 0.18µm IBM CMOS
technology. In order to boost performance, a special adder and multiplier was designed by optimizing the adder
block structure and interconnections between adder cells using signal delay profiles. A description of the
hardware including block diagrams and data flow diagrams is presented. One of the most critical portions of
the design is the special adder and multiplier. The design philosophy and tradeoffs used in these pieces are
discussed. Finally, performance and size estimates are presented along with the rationale behind them. The
design achieves 677Mbit/s data rate for encryption when using cipher block chaining and 1.28Gbit/s for
decryption and other encryption modes in 13.8Kgates + 2.25Kbyte SRAM.
1.
Introduction
MARS [1] is a symmetric-key block cipher, supporting 128-bit blocks and a variable key size. It is designed to
take advantage of the powerful operations supported by today’s computers, resulting in a much improved
security/performance tradeoff over existing ciphers. We developed high-speed MARS hardware for use when
additional performance or security is required over a software implementation. Since MARS uses 32-bit
multiplications and additions in conjunction with S-box lookups, it is essential for MARS hardware to have a
high-speed multiplier and adder. The key to realizing high-speed arithmetic circuits is to first break one operation
into parallel sub-operation blocks, then precisely adjust and control the number of signal delays from each block.
We developed an automatic circuit generation program, which optimizes the parallel block structure and the
wiring interconnection by using the signal delay profiles. A high-speed adder with the combination of carry-skip
[2] and carry-select [3] techniques designed for an RSA encryption LSI [4] was implemented in the final stage of
the multiplier. These arithmetic circuits boost the speed of MARS hardware while maintaining compact silicon
area.
In this paper, we first show the data path level design of the MARS hardware with an overview of the MARS
algorithm and how encryption and decryption are performed. Next, we discuss the techniques that apply to the
adders and multipliers to realize the high-speed MARS computation. Finally, we give estimated performance
results and the size of the MARS hardware.
2.
2.1.
MARS Algorithm and Hardware Architecture
Hardware Block Diagram
We designed the MARS hardware entirely from the gate level to the chip level, so that it is ready for chip
fabrication. Figure 1 shows the block diagram of the hardware. It has a chip external bus, which consists of a
32-bit data bus, a 10-bit address bus, four control signals, and a clock, to interface with external logic, such as a
CPU. Through the bus, the external logic will read and write message data and the key. The hardware has a
forward/backward mixer, a cryptographic core for MARS encryption/decryption, and a key expander for key setup.
During those operations, two S-boxes and key storage are accessed. Each S-box is a 32-bit × 256-word SRAM.
The key storage is a 32-bit × 64-word SRAM.
1
Data
32bit
Address
Control
Clock
Chip interface and
controller
MARS Hardware
128bit
32bit × 2
Forward/backward
mixer
S-box 0
128bit
Cryptographic core
S-box 1
Key expander
Key storage
32bit
32bit
Figure 1. Block diagram of MARS hardware.
2.2.
Encryption Procedure
The MARS encryption procedure has three phases: 8-round “forward mixing,” 16-round “cryptographic core,”
and 8-round “backward mixing,” as shown in Figure 2. Figure 3 shows the type-3 Feistel network structure of
MARS. A 128-bit plain text block is divided into four 32-bit data words M0, M1, M2, M3, and encrypted as four
words D0, D1, D2 and D3. In the figure, ⊕ denotes XOR, “<<<n” and “>>>n” denote n-bit cyclic left and right
rotations, respectively. The lower 32 bits of the results of addition, subtraction and multiplication are used; the
higher bits are discarded. MARS uses S-box (32-bit × 512-word table) lookups in the key setup, encryption, and
decryption procedures. The S-box is composed of two 256-entry tables S0 (the first 256 words) and S1 (the last
256 words), used in the forward and backward mixing phases. The decryption procedure is the inverse of the
encryption operation, and all circuits shown in this paper are used for both procedures by switching selectors in the
data paths.
M0
M1
M2
M3
Four 32-bit words
Forward Mixing
(D0, D1, D2, D3) = (M0, M1, M2, M3) + (K0, K1, K2, K3)
For i = 0 to 7 do {
D1 = (D1 ⊕ S0[1st byte of D0]) + S1[2nd byte of D0]
D2 = D2 + S0[3rd byte of D0]
D3 = D3 ⊕ S1[4th byte of D0]
D0 = D0 + D1 (if i = 1,5) + D3 (if i = 0,4)
(D0, D1, D2, D3) = (D1, D2, D3, D0)
}
Cryptographic Core
For i = 0 to 15 do {
R = ((D0<<<13) × K2i+5) <<< 10
M = (D0 + K2i+4) <<< (low 5 bits of (R>>>5))
L = (S[low 9bits of M] ⊕ (R>>>5) ⊕ R) <<< (low 5bits of R)
D1 = D1 + L (if i < 8) ⊕ R (if i ≥ 8)
D2 = D2 + M
D3 = D3 ⊕ R (if i < 8) + L (if i ≥ 8)
(D0, D1, D2, D3) = (D1, D2, D3, D0<<<13)
* S is the concatenation
}
of S0 and S1
Backward Mixing
For i = 0 to 7 do {
D0 = D0 - D1 (if i = 3,7) - D 3 (if i = 2,6)
D1 = D1 ⊕ S1[1st byte of D0]
D2 = D2- S0[4th byte of D0]
D3 = (D3 - S1[3rd byte of D0]) ⊕ S0[2nd byte of D0]
(D0, D1, D2, D3) = (D1, D2, D3, D0<<<24)
}
(D0, D1, D2, D3) = (D0, D1, D2, D3) - (K36, K37, K38, K 39)
D0
D1
D2
D3
Figure 2. MARS encryption procedure.
2
M0
M1 M2 M 3
K1 K2 K3
K0
b0
S0
S1
S0
b3
S1
b1
b2
8-round forward mixing
24>>>
E-func
R
<<<5
<<<5
<<<13
R
E-func
<<<13
M
L
IN
8-round cryptographic core
forward transformation
<<<13
5
M
<<<
9
Keven S
L
<<<
XOR
R
E-func
5
Kodd
M
L
ADD
8-round cryptographic core
backward transformation
SUB
MUL
S
b0
b1
b2
b3
S1
S0
S1
S0
S-box
<<< data-dependent rotation
8-round backward mixing
<<<24
K36
D0
K37 K38 K39
D1 D2 D3
Figure 3. Type-3 Feistel network structure.
S0
S1
Swap
D0
F
B
0
D2
D3
S0[0]
>>>24 or 0
F,B
B2,6
F0,4
B3,7
F1,5
BK
FK
D1
+
-
S1[0]
0
0
S1[1]
F
B
FK,BK
F
B
BK
FK
S1[3]
+
S0[2]
S0[3]
0
0
F,B
BK
FK
+
-
+
-
S1[2]
F
F,FK,BK
F
B
BK
FK
K3
K0
K1
K2
K 36
K 37
K 38
K 39
+
+
-
S0[1]
B
<<<24 or 0
+ XOR
+
- ADD/SUB
0
F,FK,BK
+
S0[0]: S0[1st byte of D[0]]
S1[2]: S1[3rd byte of D[0]]
Figure 4. Forward / backward mixing data paths.
Figure 4 shows the circuit block diagram of the forward and backward mixing data paths. This circuit is also
shared by the encryption and decryption procedures. Switching the selectors changes the order of the operations.
The S-boxes, S0 and S1, are implemented by three-port SRAMs, one port for the write and two ports for the read
operations. The thick lines show the critical path for the backward mixing process, which contains subtraction,
S-box, subtraction, and XOR operations in order. Two sets of key registers K0-3 and K36-39 are dedicated to this
mixing operation, and eight key words are copied from the 32-bit × 40-word expanded keys stored in SRAM “K”.
3
This circuit block is used 9 times in the forward mixing mode, then one cycle is required to add the sub-keys K0-3 to
the data D0-3, and then 8 times in the rounds of mixing operation. The backward mixing operation takes 9 cycles.
Figure 5 is the block diagram of the cryptographic core (Feistel network) data path. The thick lines specify the
critical path. It consists of a multiplier, two XORs, a conditional rotator, an adder and a selector. The S-box read
operation is executed in parallel with the multiplication, so that the memory access time does not affect the critical
path. The S-box shares the SRAM used for the forward and backward phases shown in Figure 4. The
cryptographic core operation uses this circuit in the 8-round keyed forward transformation followed by the 8-round
keyed backward transformation. The cryptographic core requires 16 cycles for each 128-bit block encryption.
E-Func
D0
+ XOR
+ ADD
MUL
*
even
+
<<<13
<<<13
odd
K
9
*
S
<<<5
5
<<<
D1
D2
+
D3
<<<5
+
B
F
F
2:1
B
2:1
5
<<<
+
L
R
M
+
+
F
B
B
2:1
F
2:1
Figure 5. Cryptographic core data path.
The cryptographic core and the forward/backward mixer can operate simultaneously on separate 128-bit blocks
when four-port (one for write and three for read) SRAM is used as the S-box. A 128-bit bus connection can swap
data between these two circuits without additional cycles. If we share the circuits of Figure 4 with forward and
backward mixing operations to save hardware resources, 18 cycles are required for one set of encryption
procedures. A timing chart for this case, which is suitable for electronic codebook (ECB) encryption mode and all
decryption modes, is given in Figure 6 (a). The data throughput of this architecture is 128 bits / 18 cycles. For
cipher block chaining (CBC) encryption mode, the encrypted data D in the previous cycle is required before
starting the current encryption of block M. In this case, the mixing phases cannot be pipelined with the
cryptographic core. CBC operations require 34 cycles, with the throughput becoming 128 bits / 34 cycles. The
timing chart for cipher block chaining encryption mode is shown in Figure 6 (b).
(a) Piplined Operation
M
M
Fwd-Mix
M
Fwd-Mix
Fwd-Mix
Crypto-Core
Crypto-Core
Bkwd-Mix
16 CLKs
9 CLKs
Crypto-Core
Bkwd-Mix
D
9 CLKs
Bkwd-Mix
D
D
18 CLKs
(b) Non-Piplined Operation
M
M
Fwd-Mix
Fwd-Mix
Crypto-Core
Crypto-Core
Bkwd-Mix
9 CLKs
16 CLKs
34 CLKs
9 CLKs
Bkwd-Mix
D
D
Figure 6. Timing chart of MARS encryption.
4
2.3.
Key Expansion
The key expansion procedure, shown in Figure 7, expands the user-supplied key array, k0, …, kn-1, into a 40-word
internal key array, K0, …, K39. The range of n is from 4 to 14 32 bit words, that is, MARS supports user key
lengths from 128 bits to 448 bits. In the figure, bit-wise OR and AND are denoted by ∨ and ∧, respectively. The
block diagram of the key expander data paths is shown in Figure 8. The major components of the key expansion
circuit are a barrel rotator, two registers, an adder, and multiplexers. The key storage “K” is implemented using a
three-port SRAM. It is capable of one write and two read operations in parallel. We designed the key expander
with a small number of latches in order to keep it small in size. The temporary storage T, which is used during the
key expansion procedure, is implemented in the SRAM. For this reason, the key storage has 64 entries of 32 bits
data. Key expansion takes 752 to 848 cycles depending on the value of the key.
k0
k1
kn-1
Initialization
(T0, ..., Tn-1) = (k 0, ..., kn-1)
Tn = n
(Tn+1, ..., T14 ) = (0, ..., 0)
T1
T0
T14
Linear Key-Word Expansion
For j = 0 to 3 do{
For i = 0 to 14 do {
Ti = T i ⊕ ((Ti-7 mod 15 ⊕ Ti-2 mod 15 ) <<< 3) ⊕ (4i+j)
}
S-box Based Stirring of Key-Words
Repeat 4 times {
For i = 0 to 14 do {
Ti = (T i + S[low 9 bits of Ti-1 mod 15]) <<< 9
}}}
For i = 0 to 9 do {
K10j+i = T4i mod 15
}
K0
K1
K39
Modifying Multiplication Key-Words
B = {0xa4a8d57b; 0x5b5d193b; 0xc8a8309b; 0x73f9a978}
For i = 5 to 35 step 2 do {
j = LSB of K i
w = K i ∨ "0...011"
M=0
For j = 2 to 30 do {
jth bit of M = 1
if jth bit of w belongs to a sequence of 10 consecutive
th
0’s or 1’s, and equals to (j-1) th and (j+1) bit of w
}
r = least five bits of Ki-1
p = B j <<< r
K i = w ⊕ (p ∧ M)
}
K1
K0
K39
Figure 7. Key expansion procedure.
S
+ XOR
+ ADD
+
<<<9
<<<
2:1
+265
∧
4i+j
+
Reg
+
<<<3
M
+
W
∨3
K
5:1
Figure 8. Key expander data path.
5
+
3.
3.1.
High-Speed Adder and Multiplier
High-Speed Adder
In this section, we first explain the design of a high-speed adder employing a combination of carry-skip [2] and
carry-select [3] techniques used in the RSA encryption LSI [4]. This adder is used in the E-function and in the last
stage of the multiplier. It is one of the most critical parts affecting MARS hardware performance.
Figure 9 shows the basic structure of the adder. It consists of ripple-carry adder blocks where each successive
block is one bit longer than the block immediately below along with a carry-skip path jumping over each adder
block. The delays in the ripple-carry adders and the carry-skip path are well balanced so that every carry
propagates from the LSB to the MSB without waiting for the results from the other blocks. To simplify the figure,
a full adder cell FA is used in the first bit of each adder block. It can be replaced by a half adder cell in the actual
implementation.
gi +1
p i +1
z 10
G3
C3
x9
y 89
G2
C2
C1
P0
x0
y0
FA
C0
z0
FA
z2
x4
y4
FA
z1
x3
y3
G1
x2
y2
G0
x5
y5
x1
y1
P1
FA
FA
P2
z5
FA
FA
x8
y8
FA
z4
x7
y7
FA
z3
x6
y6
P3
z9
z8
FA
0 1
0 1
0 1
0 1
Block 0
Block 1
Block 2
Block 3
xi
yi
zi
Cj
gi
z7
pi
Full Adder Cell
MSB
z6
LSB
Image of Carry Propagation
Figure 9. High-speed adder.
If two or more bits of xi, yi and gi are ‘1’ in the i-th full-adder cell, carry gi+1 = 1 is generated and fed to the next cell.
The cell never generates a carry if both xi and yi are ‘0’, regardless of the input gi. If gi+1 = 0, either xi or yi is ‘0’ and
the other is ‘1’, it will generate a carry if carry Cj = 1 comes up from the lower ripple-carry adder block j-1. For
example, when (x3, x4, x5) = (1, 1, 1) and (y3, y4, y5) = (0, 0, 0), block 2 does not generate carries g4, g5, g6 (= G2).
However, if the carry C2 = 1 reached the block, the carry output C3 immediately becomes ‘1.’ This means that the
carry Cj can skip over the blocks one after another by pre-calculating a condition between xi and yi in each adder
block j. The condition is defined by
Pj =
∏
i
xi ⊕ yi = 1, where ⊕ is XOR.
By making the adder block size bigger toward the MSB side, the propagation time of Pj and Cj are equalized, and
therefore the total delay time is minimized. Output zi initially holds a sum as if the block carry Cj is 0, and is
inverted by the XOR gate if Cj = 1 comes up later.
3.2.
High-Speed Multiplier
A standard n-bit × n-bit multiplier gives a 2n-bit result by repeatedly summing up the n-bit partial product rows.
The multiplier used in MARS is not required to calculate the higher half of the result, as shown in Figure 10, so it
is faster and smaller than standard multipliers. The high-speed techniques described in this section, however, can
be applied to any multiplier. Figure 11 shows a Wallace tree [5], which is an adder cell array commonly used in a
multiplier to reduce the number of partial product rows. The tree takes three rows and produces one carry row and
6
one sum row, so the full adder cell, FA, is called “3:2 compressor.” This reduction is repeated until there are only
two partial product rows, which are added together with a high-speed carry-propagation adder. Several tree
architectures, which use 4:2, 6:2 and 9:2 compressors, were proposed [6][7] to optimize the critical path of this tree,
but these compressors basically consist of 3:2 compressors. Booth encoding [8] is widely used to reduce the
number of partial products, but it is a kind of 4:2 compression technique and does not change the tree structure.
Oklobdzija et al [9] suggested that not all inputs and outputs from a compressor contribute equally to the delay, and
the difference in using 4:2 and higher order compressors is not in the structure of the compressor but in the way
they are interconnected.
(a) Normal Multiplier
(b) MARS Multiplier
Multiplicant
Multiplicator
Partial Products
Result
Figure 10. Partial products in MARS multiplier (n = 8).
In Figure 11, the input signals x and y of the full adder FA pass through two XORs to the output s, but the input ci
goes through only one XOR gate. The full-adder FA and half-adder HA located at the later stage of the tree are
shaded in the figure. The delay profile of the tree is shown with the same shading. Here, all the XOR, NAND and
AND gates are assumed to have the same propagation delay. The two signals fed into the adder at the bit-5
location come from the third-stage half adders marked with ‘*,’ but the right signal arrives earlier than the left one.
In addition, the propagation delay from an input to an output varies with the types of gate and input pin locations.
For example, AND usually operates faster than XOR, and NAND is faster than AND. For that reason, we
developed an optimal Wallace tree generation program in consideration of six delay propagation paths of a full
adder (combination of the three inputs to two outputs) and four paths of a half adder, based on a 0.18µm IBM
CMOS standard cell library.
Partial
Product
Inputs
FA
FA
FA
FA
FA
FA
FA
FA
FA
HA
2nd
FA
Full Adder Cell
x y ci
HA
1st
Half Adder Cell
x y
HA co
FA
FA
FA
FA
FA
HA 3rd
*
FA
FA
Wallace Tree
Adder Array
7
6
Gate 5
Delay 4
3
2
1
*
HA 4th
6
s
co
HA 3rd
HA 4th
7
s
5
4
3
Carry-Propagation Adder
2
1
Delay Profile of
Wallace Tree
7
0
6
5
4
Bit Location
Figure 11. MARS multiplier using Wallace tree and its delay profile (n = 8).
7
3
2
1
0
Since the delay of the final carry-propagation adder is an addendum to the Wallace tree delay, the adder should
have the optimized carry-propagation path for the tree delay profile. At the same time, we should consider the
adder structure to determine the tree interconnection. Figure 12 shows the carry-propagation path in high-speed
adders with equal and non-equal input signal arrival profiles. Both adders are identical to the one shown in Figure
9. The adder exhibits the best performance for the equal input profile (a). In case (b), the carry skipping over the
adder blocks, though carry generator CGEN, has to wait until the carries propagates from the ripple-carry adder
blocks. This is due to the slow input signals. In other words, an adder which is faster than the input delay slope is
not needed. A simple ripple-carry adder can run fast enough in this case. The input signals at bit 4, 8 and 9 arrive
very quickly, but these fast inputs also waste time waiting for the carry propagation from the next adder cells. To
optimize performance of the multiplier, we have to make the positive delay slope gentle, and make the top of the
hill as low as possible in the Wallace tree. This is achieved by optimizing the connection between the full adder
and half adder gates according to their pin-to-pin internal delay profiles.
11
(b) Non-equal arrival time
7
9
8
7
4
(a) Equal arrival time
5
4
3
1
1
1
0
2
CG
2
1
FA
FA HA
3
2
FA
1
FA
HA
4
CG
4
3
CG
5
1
FA
2
FA
FA
HA
0
HA
0
0
0
0
0
0
0
0
0
0
0
1
2
3
4
5
6
7
8
9
0
FA
6
FA
3
FA
FA
FA
FA
Carry Generator
6
FA
Full Adder
5
4
HA
Half Adder
4
3
2
1
1
Input Delay Profile
1
CG
7
CG HA
0
0
FA
CG HA
CG
10
2
3
4
5
6
Delay
7
8
Bit
9
Figure 12. Carry propagation in high-speed adder on equal and non-equal input signal arrival profiles.
(a)
3
(b)
Carry Skip
Ripple Carry
1
1
2
Carry Skip
Ripple Carry
5
3
2
3
CG
CG
CG
FA
FA
CG HA HA
CG
CG
FA
FA
HA
FA
FA HA
HA
FA
HA
FA
FA
FA
Input Delay Profile
Input Delay Profile
FA
0
FA
FA
FA
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
6
7
8
9
Figure 13. Adder selection over input delay profile.
Figure 13 shows an example adder structure with a positive delay slope profile. From bits 0 to 2, the slope is
steeper than one FA delay, so a ripple carry adder is chosen for this part. A carry skip adder with bit blocks 1-1-2-3
is used after bit 3, in example (a). The operation of one half adder HA with a carry generator CGEN is identical to
that of one full adder FA, so they are replaced in example (b) to simplify the structure.
8
In Figures 12 and 13, the input delay time in each bit location is defined by a multiple of one adder cell delay, thus
it is not difficult to optimize the adder structure. Actually, as shown in Figure 14(a), there is slack time between the
input signals C and P into CGEN at the 9th bit. In case (a), C is generated earlier than P, and waits for the arrival of
P at CGEN. When we move the location of CGEN one bit left (to bit 8) so that the carry C does not waste time, the
signal P has to wait instead. It is not clear which choice is better until the final adder cell is placed, and we have to
choose the right combination of bit locations where CGENs are placed. In case (a), the output signal delay from the
CGEN is longer than that of case (b), but the carry C reaches a higher bit. We should keep the slope of the carry path
over the adder blocks as gentle as possible. Therefore, the Wallace tree generator should employ a structure that
has smaller value of delay/bit shown as the slope of triangles in the figure. If the structures (a) and (b) have the
same slope, then we chose the former because it has higher probability to have fewer CGEN cells.
(a)
<Example>
bit
C
CG
slack FAP
delay
CG
FA
HA
5
6
C
FA
CG
FA
FA
HA
HA
HA
8
CG
P
FA
7
C
CG
9
5
(b)
6
7
CG
P
FA
FA
HA
HA
8
9
5
6
HA
7
8
9
<Example>
bit
C
delay
HA
5
6
CG
CG
FA
HA
7
HA
8
P
FA
bit
FA
FA HA
CG
5
FA
FA HA
HA
HA
9
CG
delay
CG
P
slack
P
CG
FA
C
C
HA
6
7
8
9
5
6
7
8
9
Figure 14. Carry propagation block design.
Figure 15 shows the actual delay profile of the MARS multiplier using 0.18µm IBM copper CMOS technology
under nominal (VDD=1.8V, Temp=25°C and Leff=0.11µm) and worst case (VDD=1.65V, Temp=125°C and
Leff=0.14µm) conditions. The output delay from the Wallace tree has an almost perfect gentle positive slope. The
delay line that looks like a saw blade shows the ripple carry adder blocks. The carry skips over them smoothly. As
a result, using the techniques described in this chapter realize a 2.32ns (nominal) to 3.41ns (worst) operation for the
32-bit MARS multiplier with a compact size of 3.2Kgates.
(b) Worst Case
(a) Nominal Case
Delay Time (ns)
Delay Time (ns)
Carry Skip
Adder Block
Wallace Tree
Carry Skip
Adder Block
Wallace Tree
0
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
0
Bit Location
2
4
6
8
10
12
14
16
18
Bit Location
Figure 15. Actual delay profile of the MARS multiplier.
9
20
22
24
26
28
30
4.
Performance Evaluation
We designed MARS hardware including an external interface, the control logic, and the calculation core. All the
design files are written in VHDL 93. We synthesized the design using a 0.18µm IBM copper CMOS standard cell
technology and evaluated its performance and size.
Table 1 shows the gate size of the logic where one gate is the size of a 2-input NAND. The memory area for the
key register and S-box is shown in Table 2. We get data throughputs of 128bits / 34cycles for CBC encryption and
128bits / 18cycles for all decryption and other encryption modes, assuming a four-port SRAM implementation.
However, the area of a four-port SRAM becomes larger than that of the logic part. If we do not need the high data
rate, the area can be greatly reduced by using fewer-port SRAMs and a mask ROM. A two-port SRAM (one for
read and one for write) for the key register halves the memory area while adding only one additional cycle for one
128-byte block encryption process. If the S-box is implemented with a single-port memory or a ROM, the cycles
for the forward/backward mixing increase to 34, then the throughput of all encryption and decryption modes
becomes 128 bits / 50 cycles.
The critical path delays in the forward/backward mixer and the cryptographic core under nominal and worst case
conditions are shown in Figure 16. The longer delay of the cryptographic core, 5.57ns, determines the operation
frequency of 180MHz (= 1 / 5.57ns) (122MHz worst case). As a result, we get a maximum data throughput of
677Mbit/s (459Mbits/s worst case) for CBC encryption and 1.28Gbit/s (867Mbits/s worst case) for other modes.
All decryption modes achieve maximum throughput of 1.28Gbit/s (867Mbits/s worst case). The throughput and
gate sizes for other memory implementations are summarized in Table 3.
Table 1. Logic area
Circuit Block
Key Expansion
Enc/Dec Controller
Enc/Dec Data path
Interface + Memory Controller
Total
Gate Size
2.2K
4.5K
6.1K
1.0K
13.8K
Table 2. Memory area
Function
Key Register
(256bytes)
S-box
(2Kbytes)
Type
3-port SRAM
2-prot SRAM
4-port SRAM
3-port SRAM
1-port SRAM
ROM
Gate Size
6.8K
4.8K
46.2K
30.8K
15.4K
6.3K
Table 3. Performance of each implementation
Memory Type
Throughput
Key
S-box
Total Gate Size
(Mem+Logic)
3-port
3-port
3-port
2-port
2-port
4-port
3-port
1-port
1-port
ROM*
66.8K
51.4K
36.0K
34.0K
24.9K
CBC Encryption
Nominal Case
677Mbit/s (34cycles)
677Mbit/s (34cycles)
460Mbit/s (50cycles)
451Mbit/s (51cycles)
263Mbit/s (51cycles)
Worst Case
459Mbit/s
459Mbit/s
312Mbit/s
306Mbit/s
263Mbit/s
Other Encryption and
All Decryption Modes
Nominal Case
1.28Gbit/s (18cycles)
677Mbit/s (34cycles)
460Mbit/s (50cycles)
451Mbit/s (51cycles)
263Mbit/s (51cycles)
Worst Case
867Mbit/s
459Mbit/s
312Mbit/s
306Mbit/s
263Mbit/s
* 105MHz operation limited by ROM performance
10
(a) Forward/Backword Mixer
adder1
1.10
adder2
1.29
1.62
S-box
1.22
selector+latch
1.70
1.90
Total (Nominal Case)
5.31ns
1.80
Total
7.61ns (Worst Case)
2.50
(b) Cryptographic Core
adder1
1.10
1.62
multiplyer
2.32
rotator
0.99
3.41
selector+latch
1.16
Total (Nominal Case)
5.57ns
1.45
1.70
Total (Worst Case)
8.18ns
Figure 16. Critical path delay.
The technology chosen for the above estimations is a low cost copper CMOS technology several generations
behind the state of the art CMOS technology. As such the performance cannot be directly compared with that of
software running on today’s high performance microprocessors. If built using a newer CMOS technology the
performance can be expected to improve by approximately 60%.
5.
Conclusion
We designed MARS hardware and estimated its size and performance. Since the MARS algorithm uses 32-bit
additions and multiplications, its performance is highly dependent on the hardware design of the adder and
multiplier. We designed multipliers and adders, which fully take into account the carry propagation delay. This
work demonstrates that MARS can be implemented efficiently in hardware, both in terms of area and performance.
We believe the design point chosen is a reasonable tradeoff of area vs. performance. We do not claim that this is
the highest performance MARS design possible. Other tradeoffs may yield faster hardware implementations.
Considering the size and performance in both hardware and software along with the very high security, we believe
MARS is well suited to serve as the Advanced Encryption Standard algorithm.
Acknowledgement
The authors are grateful to Ms. Carolynn Burwick, Dr. Don Coppersmith, Dr. Mike Matyas, Mr. Hideto Niijima,
and Mr. Nev Zunic for their reviews and comments. We also would like to thank Mrs. Akemi Ogura for
supporting us generously with CAD tool operation.
References
[1]
C. Burwick, D. Coppersmith, E. D’Avignon, R. Gennaro, S. Halevi, C. Jutla, S. M. Matyas, L. O’Connor, M.
Peyravian, D. Safford and N. Zunic: “MARS – a candidate cipher for AES,”
http://csrc.nist.gov/encryption/aes/round2/AESAlgs /MARS/mars.pdf, Aug. 1999.
[2]
M. Lehman and N. Burla: “Skip Techniques for High-Speed Carry Propagation in Binary Arithmetic Units,”
IRE Trans. Elec. Comput., vol. EC-10, pp. 691-698, Dec. 1961.
[3]
O. J. Bedrij: “Carry-Select Adder,” IRE Trans. Elec. Comput., vol. EC-11, pp. 340-346, June 1962.
[4]
A. Satoh, Y. Kobayashi, H. Niijima, N. Ooba, S. Munetoh, and S. Sone: “A High-Speed Small RSA
Encryption LSI with Low Power Dissipation,” LNCS 1396, pp. 174-187, 1997.
[5]
C. S. Wallace: “Suggestion for a Fast Multiplier,” IEEE Trans. Computers, vol. 13, no. 2, pp.14-17, Feb.
1964.
[6]
A. Weinberger: “4:2 Carry Save Adder Module,” IBM Technical Disclosure Bulletin, vol. 23, Jan. 1981.
11
[7]
P. Song and G. De Michelli: “Circuit and Architecture Trade-Offs for High Speed Multiplication,” IEEE J.
Solid State Circuits, vol. 26, no. 9, Sept. 1991.
[8]
A. D. Booth: “A Signed Binary Multiplication Technique,” Quarterly J. Mechanical Applications in Math,
vol. 4, part 2, pp. 236-240, 1951.
[9]
V. G. Oklobdzija, D. Villeger and S. S. Liu: “A Method for Speed Optimized Partial Product Reduction and
Generation of Fast Parallel Multipliers Using an Algorithmic Approach,” IEEE Trans. on Comp., vol. 35, no.
3, pp. 294-305, Mar. 1996.
12