High-Speed MARS Hardware Akashi Satoh†, Nobuyuki Ooba†, Kohji Takano†, Edward D’Avignon†† IBM research, Tokyo Research Laboratory, IBM Japan Ltd., 1623-14, Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan {akashi, ooba, chano}@jp.ibm.com †† IBM Corporation, Poughkeepsie, NY 12601, USA [email protected] March 15, 2000 † Abstract. High-speed MARS encryption/decryption hardware was developed using a 0.18µm IBM CMOS technology. In order to boost performance, a special adder and multiplier was designed by optimizing the adder block structure and interconnections between adder cells using signal delay profiles. A description of the hardware including block diagrams and data flow diagrams is presented. One of the most critical portions of the design is the special adder and multiplier. The design philosophy and tradeoffs used in these pieces are discussed. Finally, performance and size estimates are presented along with the rationale behind them. The design achieves 677Mbit/s data rate for encryption when using cipher block chaining and 1.28Gbit/s for decryption and other encryption modes in 13.8Kgates + 2.25Kbyte SRAM. 1. Introduction MARS [1] is a symmetric-key block cipher, supporting 128-bit blocks and a variable key size. It is designed to take advantage of the powerful operations supported by today’s computers, resulting in a much improved security/performance tradeoff over existing ciphers. We developed high-speed MARS hardware for use when additional performance or security is required over a software implementation. Since MARS uses 32-bit multiplications and additions in conjunction with S-box lookups, it is essential for MARS hardware to have a high-speed multiplier and adder. The key to realizing high-speed arithmetic circuits is to first break one operation into parallel sub-operation blocks, then precisely adjust and control the number of signal delays from each block. We developed an automatic circuit generation program, which optimizes the parallel block structure and the wiring interconnection by using the signal delay profiles. A high-speed adder with the combination of carry-skip [2] and carry-select [3] techniques designed for an RSA encryption LSI [4] was implemented in the final stage of the multiplier. These arithmetic circuits boost the speed of MARS hardware while maintaining compact silicon area. In this paper, we first show the data path level design of the MARS hardware with an overview of the MARS algorithm and how encryption and decryption are performed. Next, we discuss the techniques that apply to the adders and multipliers to realize the high-speed MARS computation. Finally, we give estimated performance results and the size of the MARS hardware. 2. 2.1. MARS Algorithm and Hardware Architecture Hardware Block Diagram We designed the MARS hardware entirely from the gate level to the chip level, so that it is ready for chip fabrication. Figure 1 shows the block diagram of the hardware. It has a chip external bus, which consists of a 32-bit data bus, a 10-bit address bus, four control signals, and a clock, to interface with external logic, such as a CPU. Through the bus, the external logic will read and write message data and the key. The hardware has a forward/backward mixer, a cryptographic core for MARS encryption/decryption, and a key expander for key setup. During those operations, two S-boxes and key storage are accessed. Each S-box is a 32-bit × 256-word SRAM. The key storage is a 32-bit × 64-word SRAM. 1 Data 32bit Address Control Clock Chip interface and controller MARS Hardware 128bit 32bit × 2 Forward/backward mixer S-box 0 128bit Cryptographic core S-box 1 Key expander Key storage 32bit 32bit Figure 1. Block diagram of MARS hardware. 2.2. Encryption Procedure The MARS encryption procedure has three phases: 8-round “forward mixing,” 16-round “cryptographic core,” and 8-round “backward mixing,” as shown in Figure 2. Figure 3 shows the type-3 Feistel network structure of MARS. A 128-bit plain text block is divided into four 32-bit data words M0, M1, M2, M3, and encrypted as four words D0, D1, D2 and D3. In the figure, ⊕ denotes XOR, “<<<n” and “>>>n” denote n-bit cyclic left and right rotations, respectively. The lower 32 bits of the results of addition, subtraction and multiplication are used; the higher bits are discarded. MARS uses S-box (32-bit × 512-word table) lookups in the key setup, encryption, and decryption procedures. The S-box is composed of two 256-entry tables S0 (the first 256 words) and S1 (the last 256 words), used in the forward and backward mixing phases. The decryption procedure is the inverse of the encryption operation, and all circuits shown in this paper are used for both procedures by switching selectors in the data paths. M0 M1 M2 M3 Four 32-bit words Forward Mixing (D0, D1, D2, D3) = (M0, M1, M2, M3) + (K0, K1, K2, K3) For i = 0 to 7 do { D1 = (D1 ⊕ S0[1st byte of D0]) + S1[2nd byte of D0] D2 = D2 + S0[3rd byte of D0] D3 = D3 ⊕ S1[4th byte of D0] D0 = D0 + D1 (if i = 1,5) + D3 (if i = 0,4) (D0, D1, D2, D3) = (D1, D2, D3, D0) } Cryptographic Core For i = 0 to 15 do { R = ((D0<<<13) × K2i+5) <<< 10 M = (D0 + K2i+4) <<< (low 5 bits of (R>>>5)) L = (S[low 9bits of M] ⊕ (R>>>5) ⊕ R) <<< (low 5bits of R) D1 = D1 + L (if i < 8) ⊕ R (if i ≥ 8) D2 = D2 + M D3 = D3 ⊕ R (if i < 8) + L (if i ≥ 8) (D0, D1, D2, D3) = (D1, D2, D3, D0<<<13) * S is the concatenation } of S0 and S1 Backward Mixing For i = 0 to 7 do { D0 = D0 - D1 (if i = 3,7) - D 3 (if i = 2,6) D1 = D1 ⊕ S1[1st byte of D0] D2 = D2- S0[4th byte of D0] D3 = (D3 - S1[3rd byte of D0]) ⊕ S0[2nd byte of D0] (D0, D1, D2, D3) = (D1, D2, D3, D0<<<24) } (D0, D1, D2, D3) = (D0, D1, D2, D3) - (K36, K37, K38, K 39) D0 D1 D2 D3 Figure 2. MARS encryption procedure. 2 M0 M1 M2 M 3 K1 K2 K3 K0 b0 S0 S1 S0 b3 S1 b1 b2 8-round forward mixing 24>>> E-func R <<<5 <<<5 <<<13 R E-func <<<13 M L IN 8-round cryptographic core forward transformation <<<13 5 M <<< 9 Keven S L <<< XOR R E-func 5 Kodd M L ADD 8-round cryptographic core backward transformation SUB MUL S b0 b1 b2 b3 S1 S0 S1 S0 S-box <<< data-dependent rotation 8-round backward mixing <<<24 K36 D0 K37 K38 K39 D1 D2 D3 Figure 3. Type-3 Feistel network structure. S0 S1 Swap D0 F B 0 D2 D3 S0[0] >>>24 or 0 F,B B2,6 F0,4 B3,7 F1,5 BK FK D1 + - S1[0] 0 0 S1[1] F B FK,BK F B BK FK S1[3] + S0[2] S0[3] 0 0 F,B BK FK + - + - S1[2] F F,FK,BK F B BK FK K3 K0 K1 K2 K 36 K 37 K 38 K 39 + + - S0[1] B <<<24 or 0 + XOR + - ADD/SUB 0 F,FK,BK + S0[0]: S0[1st byte of D[0]] S1[2]: S1[3rd byte of D[0]] Figure 4. Forward / backward mixing data paths. Figure 4 shows the circuit block diagram of the forward and backward mixing data paths. This circuit is also shared by the encryption and decryption procedures. Switching the selectors changes the order of the operations. The S-boxes, S0 and S1, are implemented by three-port SRAMs, one port for the write and two ports for the read operations. The thick lines show the critical path for the backward mixing process, which contains subtraction, S-box, subtraction, and XOR operations in order. Two sets of key registers K0-3 and K36-39 are dedicated to this mixing operation, and eight key words are copied from the 32-bit × 40-word expanded keys stored in SRAM “K”. 3 This circuit block is used 9 times in the forward mixing mode, then one cycle is required to add the sub-keys K0-3 to the data D0-3, and then 8 times in the rounds of mixing operation. The backward mixing operation takes 9 cycles. Figure 5 is the block diagram of the cryptographic core (Feistel network) data path. The thick lines specify the critical path. It consists of a multiplier, two XORs, a conditional rotator, an adder and a selector. The S-box read operation is executed in parallel with the multiplication, so that the memory access time does not affect the critical path. The S-box shares the SRAM used for the forward and backward phases shown in Figure 4. The cryptographic core operation uses this circuit in the 8-round keyed forward transformation followed by the 8-round keyed backward transformation. The cryptographic core requires 16 cycles for each 128-bit block encryption. E-Func D0 + XOR + ADD MUL * even + <<<13 <<<13 odd K 9 * S <<<5 5 <<< D1 D2 + D3 <<<5 + B F F 2:1 B 2:1 5 <<< + L R M + + F B B 2:1 F 2:1 Figure 5. Cryptographic core data path. The cryptographic core and the forward/backward mixer can operate simultaneously on separate 128-bit blocks when four-port (one for write and three for read) SRAM is used as the S-box. A 128-bit bus connection can swap data between these two circuits without additional cycles. If we share the circuits of Figure 4 with forward and backward mixing operations to save hardware resources, 18 cycles are required for one set of encryption procedures. A timing chart for this case, which is suitable for electronic codebook (ECB) encryption mode and all decryption modes, is given in Figure 6 (a). The data throughput of this architecture is 128 bits / 18 cycles. For cipher block chaining (CBC) encryption mode, the encrypted data D in the previous cycle is required before starting the current encryption of block M. In this case, the mixing phases cannot be pipelined with the cryptographic core. CBC operations require 34 cycles, with the throughput becoming 128 bits / 34 cycles. The timing chart for cipher block chaining encryption mode is shown in Figure 6 (b). (a) Piplined Operation M M Fwd-Mix M Fwd-Mix Fwd-Mix Crypto-Core Crypto-Core Bkwd-Mix 16 CLKs 9 CLKs Crypto-Core Bkwd-Mix D 9 CLKs Bkwd-Mix D D 18 CLKs (b) Non-Piplined Operation M M Fwd-Mix Fwd-Mix Crypto-Core Crypto-Core Bkwd-Mix 9 CLKs 16 CLKs 34 CLKs 9 CLKs Bkwd-Mix D D Figure 6. Timing chart of MARS encryption. 4 2.3. Key Expansion The key expansion procedure, shown in Figure 7, expands the user-supplied key array, k0, …, kn-1, into a 40-word internal key array, K0, …, K39. The range of n is from 4 to 14 32 bit words, that is, MARS supports user key lengths from 128 bits to 448 bits. In the figure, bit-wise OR and AND are denoted by ∨ and ∧, respectively. The block diagram of the key expander data paths is shown in Figure 8. The major components of the key expansion circuit are a barrel rotator, two registers, an adder, and multiplexers. The key storage “K” is implemented using a three-port SRAM. It is capable of one write and two read operations in parallel. We designed the key expander with a small number of latches in order to keep it small in size. The temporary storage T, which is used during the key expansion procedure, is implemented in the SRAM. For this reason, the key storage has 64 entries of 32 bits data. Key expansion takes 752 to 848 cycles depending on the value of the key. k0 k1 kn-1 Initialization (T0, ..., Tn-1) = (k 0, ..., kn-1) Tn = n (Tn+1, ..., T14 ) = (0, ..., 0) T1 T0 T14 Linear Key-Word Expansion For j = 0 to 3 do{ For i = 0 to 14 do { Ti = T i ⊕ ((Ti-7 mod 15 ⊕ Ti-2 mod 15 ) <<< 3) ⊕ (4i+j) } S-box Based Stirring of Key-Words Repeat 4 times { For i = 0 to 14 do { Ti = (T i + S[low 9 bits of Ti-1 mod 15]) <<< 9 }}} For i = 0 to 9 do { K10j+i = T4i mod 15 } K0 K1 K39 Modifying Multiplication Key-Words B = {0xa4a8d57b; 0x5b5d193b; 0xc8a8309b; 0x73f9a978} For i = 5 to 35 step 2 do { j = LSB of K i w = K i ∨ "0...011" M=0 For j = 2 to 30 do { jth bit of M = 1 if jth bit of w belongs to a sequence of 10 consecutive th 0’s or 1’s, and equals to (j-1) th and (j+1) bit of w } r = least five bits of Ki-1 p = B j <<< r K i = w ⊕ (p ∧ M) } K1 K0 K39 Figure 7. Key expansion procedure. S + XOR + ADD + <<<9 <<< 2:1 +265 ∧ 4i+j + Reg + <<<3 M + W ∨3 K 5:1 Figure 8. Key expander data path. 5 + 3. 3.1. High-Speed Adder and Multiplier High-Speed Adder In this section, we first explain the design of a high-speed adder employing a combination of carry-skip [2] and carry-select [3] techniques used in the RSA encryption LSI [4]. This adder is used in the E-function and in the last stage of the multiplier. It is one of the most critical parts affecting MARS hardware performance. Figure 9 shows the basic structure of the adder. It consists of ripple-carry adder blocks where each successive block is one bit longer than the block immediately below along with a carry-skip path jumping over each adder block. The delays in the ripple-carry adders and the carry-skip path are well balanced so that every carry propagates from the LSB to the MSB without waiting for the results from the other blocks. To simplify the figure, a full adder cell FA is used in the first bit of each adder block. It can be replaced by a half adder cell in the actual implementation. gi +1 p i +1 z 10 G3 C3 x9 y 89 G2 C2 C1 P0 x0 y0 FA C0 z0 FA z2 x4 y4 FA z1 x3 y3 G1 x2 y2 G0 x5 y5 x1 y1 P1 FA FA P2 z5 FA FA x8 y8 FA z4 x7 y7 FA z3 x6 y6 P3 z9 z8 FA 0 1 0 1 0 1 0 1 Block 0 Block 1 Block 2 Block 3 xi yi zi Cj gi z7 pi Full Adder Cell MSB z6 LSB Image of Carry Propagation Figure 9. High-speed adder. If two or more bits of xi, yi and gi are ‘1’ in the i-th full-adder cell, carry gi+1 = 1 is generated and fed to the next cell. The cell never generates a carry if both xi and yi are ‘0’, regardless of the input gi. If gi+1 = 0, either xi or yi is ‘0’ and the other is ‘1’, it will generate a carry if carry Cj = 1 comes up from the lower ripple-carry adder block j-1. For example, when (x3, x4, x5) = (1, 1, 1) and (y3, y4, y5) = (0, 0, 0), block 2 does not generate carries g4, g5, g6 (= G2). However, if the carry C2 = 1 reached the block, the carry output C3 immediately becomes ‘1.’ This means that the carry Cj can skip over the blocks one after another by pre-calculating a condition between xi and yi in each adder block j. The condition is defined by Pj = ∏ i xi ⊕ yi = 1, where ⊕ is XOR. By making the adder block size bigger toward the MSB side, the propagation time of Pj and Cj are equalized, and therefore the total delay time is minimized. Output zi initially holds a sum as if the block carry Cj is 0, and is inverted by the XOR gate if Cj = 1 comes up later. 3.2. High-Speed Multiplier A standard n-bit × n-bit multiplier gives a 2n-bit result by repeatedly summing up the n-bit partial product rows. The multiplier used in MARS is not required to calculate the higher half of the result, as shown in Figure 10, so it is faster and smaller than standard multipliers. The high-speed techniques described in this section, however, can be applied to any multiplier. Figure 11 shows a Wallace tree [5], which is an adder cell array commonly used in a multiplier to reduce the number of partial product rows. The tree takes three rows and produces one carry row and 6 one sum row, so the full adder cell, FA, is called “3:2 compressor.” This reduction is repeated until there are only two partial product rows, which are added together with a high-speed carry-propagation adder. Several tree architectures, which use 4:2, 6:2 and 9:2 compressors, were proposed [6][7] to optimize the critical path of this tree, but these compressors basically consist of 3:2 compressors. Booth encoding [8] is widely used to reduce the number of partial products, but it is a kind of 4:2 compression technique and does not change the tree structure. Oklobdzija et al [9] suggested that not all inputs and outputs from a compressor contribute equally to the delay, and the difference in using 4:2 and higher order compressors is not in the structure of the compressor but in the way they are interconnected. (a) Normal Multiplier (b) MARS Multiplier Multiplicant Multiplicator Partial Products Result Figure 10. Partial products in MARS multiplier (n = 8). In Figure 11, the input signals x and y of the full adder FA pass through two XORs to the output s, but the input ci goes through only one XOR gate. The full-adder FA and half-adder HA located at the later stage of the tree are shaded in the figure. The delay profile of the tree is shown with the same shading. Here, all the XOR, NAND and AND gates are assumed to have the same propagation delay. The two signals fed into the adder at the bit-5 location come from the third-stage half adders marked with ‘*,’ but the right signal arrives earlier than the left one. In addition, the propagation delay from an input to an output varies with the types of gate and input pin locations. For example, AND usually operates faster than XOR, and NAND is faster than AND. For that reason, we developed an optimal Wallace tree generation program in consideration of six delay propagation paths of a full adder (combination of the three inputs to two outputs) and four paths of a half adder, based on a 0.18µm IBM CMOS standard cell library. Partial Product Inputs FA FA FA FA FA FA FA FA FA HA 2nd FA Full Adder Cell x y ci HA 1st Half Adder Cell x y HA co FA FA FA FA FA HA 3rd * FA FA Wallace Tree Adder Array 7 6 Gate 5 Delay 4 3 2 1 * HA 4th 6 s co HA 3rd HA 4th 7 s 5 4 3 Carry-Propagation Adder 2 1 Delay Profile of Wallace Tree 7 0 6 5 4 Bit Location Figure 11. MARS multiplier using Wallace tree and its delay profile (n = 8). 7 3 2 1 0 Since the delay of the final carry-propagation adder is an addendum to the Wallace tree delay, the adder should have the optimized carry-propagation path for the tree delay profile. At the same time, we should consider the adder structure to determine the tree interconnection. Figure 12 shows the carry-propagation path in high-speed adders with equal and non-equal input signal arrival profiles. Both adders are identical to the one shown in Figure 9. The adder exhibits the best performance for the equal input profile (a). In case (b), the carry skipping over the adder blocks, though carry generator CGEN, has to wait until the carries propagates from the ripple-carry adder blocks. This is due to the slow input signals. In other words, an adder which is faster than the input delay slope is not needed. A simple ripple-carry adder can run fast enough in this case. The input signals at bit 4, 8 and 9 arrive very quickly, but these fast inputs also waste time waiting for the carry propagation from the next adder cells. To optimize performance of the multiplier, we have to make the positive delay slope gentle, and make the top of the hill as low as possible in the Wallace tree. This is achieved by optimizing the connection between the full adder and half adder gates according to their pin-to-pin internal delay profiles. 11 (b) Non-equal arrival time 7 9 8 7 4 (a) Equal arrival time 5 4 3 1 1 1 0 2 CG 2 1 FA FA HA 3 2 FA 1 FA HA 4 CG 4 3 CG 5 1 FA 2 FA FA HA 0 HA 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 0 FA 6 FA 3 FA FA FA FA Carry Generator 6 FA Full Adder 5 4 HA Half Adder 4 3 2 1 1 Input Delay Profile 1 CG 7 CG HA 0 0 FA CG HA CG 10 2 3 4 5 6 Delay 7 8 Bit 9 Figure 12. Carry propagation in high-speed adder on equal and non-equal input signal arrival profiles. (a) 3 (b) Carry Skip Ripple Carry 1 1 2 Carry Skip Ripple Carry 5 3 2 3 CG CG CG FA FA CG HA HA CG CG FA FA HA FA FA HA HA FA HA FA FA FA Input Delay Profile Input Delay Profile FA 0 FA FA FA 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 Figure 13. Adder selection over input delay profile. Figure 13 shows an example adder structure with a positive delay slope profile. From bits 0 to 2, the slope is steeper than one FA delay, so a ripple carry adder is chosen for this part. A carry skip adder with bit blocks 1-1-2-3 is used after bit 3, in example (a). The operation of one half adder HA with a carry generator CGEN is identical to that of one full adder FA, so they are replaced in example (b) to simplify the structure. 8 In Figures 12 and 13, the input delay time in each bit location is defined by a multiple of one adder cell delay, thus it is not difficult to optimize the adder structure. Actually, as shown in Figure 14(a), there is slack time between the input signals C and P into CGEN at the 9th bit. In case (a), C is generated earlier than P, and waits for the arrival of P at CGEN. When we move the location of CGEN one bit left (to bit 8) so that the carry C does not waste time, the signal P has to wait instead. It is not clear which choice is better until the final adder cell is placed, and we have to choose the right combination of bit locations where CGENs are placed. In case (a), the output signal delay from the CGEN is longer than that of case (b), but the carry C reaches a higher bit. We should keep the slope of the carry path over the adder blocks as gentle as possible. Therefore, the Wallace tree generator should employ a structure that has smaller value of delay/bit shown as the slope of triangles in the figure. If the structures (a) and (b) have the same slope, then we chose the former because it has higher probability to have fewer CGEN cells. (a) <Example> bit C CG slack FAP delay CG FA HA 5 6 C FA CG FA FA HA HA HA 8 CG P FA 7 C CG 9 5 (b) 6 7 CG P FA FA HA HA 8 9 5 6 HA 7 8 9 <Example> bit C delay HA 5 6 CG CG FA HA 7 HA 8 P FA bit FA FA HA CG 5 FA FA HA HA HA 9 CG delay CG P slack P CG FA C C HA 6 7 8 9 5 6 7 8 9 Figure 14. Carry propagation block design. Figure 15 shows the actual delay profile of the MARS multiplier using 0.18µm IBM copper CMOS technology under nominal (VDD=1.8V, Temp=25°C and Leff=0.11µm) and worst case (VDD=1.65V, Temp=125°C and Leff=0.14µm) conditions. The output delay from the Wallace tree has an almost perfect gentle positive slope. The delay line that looks like a saw blade shows the ripple carry adder blocks. The carry skips over them smoothly. As a result, using the techniques described in this chapter realize a 2.32ns (nominal) to 3.41ns (worst) operation for the 32-bit MARS multiplier with a compact size of 3.2Kgates. (b) Worst Case (a) Nominal Case Delay Time (ns) Delay Time (ns) Carry Skip Adder Block Wallace Tree Carry Skip Adder Block Wallace Tree 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 0 Bit Location 2 4 6 8 10 12 14 16 18 Bit Location Figure 15. Actual delay profile of the MARS multiplier. 9 20 22 24 26 28 30 4. Performance Evaluation We designed MARS hardware including an external interface, the control logic, and the calculation core. All the design files are written in VHDL 93. We synthesized the design using a 0.18µm IBM copper CMOS standard cell technology and evaluated its performance and size. Table 1 shows the gate size of the logic where one gate is the size of a 2-input NAND. The memory area for the key register and S-box is shown in Table 2. We get data throughputs of 128bits / 34cycles for CBC encryption and 128bits / 18cycles for all decryption and other encryption modes, assuming a four-port SRAM implementation. However, the area of a four-port SRAM becomes larger than that of the logic part. If we do not need the high data rate, the area can be greatly reduced by using fewer-port SRAMs and a mask ROM. A two-port SRAM (one for read and one for write) for the key register halves the memory area while adding only one additional cycle for one 128-byte block encryption process. If the S-box is implemented with a single-port memory or a ROM, the cycles for the forward/backward mixing increase to 34, then the throughput of all encryption and decryption modes becomes 128 bits / 50 cycles. The critical path delays in the forward/backward mixer and the cryptographic core under nominal and worst case conditions are shown in Figure 16. The longer delay of the cryptographic core, 5.57ns, determines the operation frequency of 180MHz (= 1 / 5.57ns) (122MHz worst case). As a result, we get a maximum data throughput of 677Mbit/s (459Mbits/s worst case) for CBC encryption and 1.28Gbit/s (867Mbits/s worst case) for other modes. All decryption modes achieve maximum throughput of 1.28Gbit/s (867Mbits/s worst case). The throughput and gate sizes for other memory implementations are summarized in Table 3. Table 1. Logic area Circuit Block Key Expansion Enc/Dec Controller Enc/Dec Data path Interface + Memory Controller Total Gate Size 2.2K 4.5K 6.1K 1.0K 13.8K Table 2. Memory area Function Key Register (256bytes) S-box (2Kbytes) Type 3-port SRAM 2-prot SRAM 4-port SRAM 3-port SRAM 1-port SRAM ROM Gate Size 6.8K 4.8K 46.2K 30.8K 15.4K 6.3K Table 3. Performance of each implementation Memory Type Throughput Key S-box Total Gate Size (Mem+Logic) 3-port 3-port 3-port 2-port 2-port 4-port 3-port 1-port 1-port ROM* 66.8K 51.4K 36.0K 34.0K 24.9K CBC Encryption Nominal Case 677Mbit/s (34cycles) 677Mbit/s (34cycles) 460Mbit/s (50cycles) 451Mbit/s (51cycles) 263Mbit/s (51cycles) Worst Case 459Mbit/s 459Mbit/s 312Mbit/s 306Mbit/s 263Mbit/s Other Encryption and All Decryption Modes Nominal Case 1.28Gbit/s (18cycles) 677Mbit/s (34cycles) 460Mbit/s (50cycles) 451Mbit/s (51cycles) 263Mbit/s (51cycles) Worst Case 867Mbit/s 459Mbit/s 312Mbit/s 306Mbit/s 263Mbit/s * 105MHz operation limited by ROM performance 10 (a) Forward/Backword Mixer adder1 1.10 adder2 1.29 1.62 S-box 1.22 selector+latch 1.70 1.90 Total (Nominal Case) 5.31ns 1.80 Total 7.61ns (Worst Case) 2.50 (b) Cryptographic Core adder1 1.10 1.62 multiplyer 2.32 rotator 0.99 3.41 selector+latch 1.16 Total (Nominal Case) 5.57ns 1.45 1.70 Total (Worst Case) 8.18ns Figure 16. Critical path delay. The technology chosen for the above estimations is a low cost copper CMOS technology several generations behind the state of the art CMOS technology. As such the performance cannot be directly compared with that of software running on today’s high performance microprocessors. If built using a newer CMOS technology the performance can be expected to improve by approximately 60%. 5. Conclusion We designed MARS hardware and estimated its size and performance. Since the MARS algorithm uses 32-bit additions and multiplications, its performance is highly dependent on the hardware design of the adder and multiplier. We designed multipliers and adders, which fully take into account the carry propagation delay. This work demonstrates that MARS can be implemented efficiently in hardware, both in terms of area and performance. We believe the design point chosen is a reasonable tradeoff of area vs. performance. We do not claim that this is the highest performance MARS design possible. Other tradeoffs may yield faster hardware implementations. Considering the size and performance in both hardware and software along with the very high security, we believe MARS is well suited to serve as the Advanced Encryption Standard algorithm. Acknowledgement The authors are grateful to Ms. Carolynn Burwick, Dr. Don Coppersmith, Dr. Mike Matyas, Mr. Hideto Niijima, and Mr. Nev Zunic for their reviews and comments. We also would like to thank Mrs. Akemi Ogura for supporting us generously with CAD tool operation. References [1] C. Burwick, D. Coppersmith, E. D’Avignon, R. Gennaro, S. Halevi, C. Jutla, S. M. Matyas, L. O’Connor, M. Peyravian, D. Safford and N. Zunic: “MARS – a candidate cipher for AES,” http://csrc.nist.gov/encryption/aes/round2/AESAlgs /MARS/mars.pdf, Aug. 1999. [2] M. Lehman and N. Burla: “Skip Techniques for High-Speed Carry Propagation in Binary Arithmetic Units,” IRE Trans. Elec. Comput., vol. EC-10, pp. 691-698, Dec. 1961. [3] O. J. Bedrij: “Carry-Select Adder,” IRE Trans. Elec. Comput., vol. EC-11, pp. 340-346, June 1962. [4] A. Satoh, Y. Kobayashi, H. Niijima, N. Ooba, S. Munetoh, and S. Sone: “A High-Speed Small RSA Encryption LSI with Low Power Dissipation,” LNCS 1396, pp. 174-187, 1997. [5] C. S. Wallace: “Suggestion for a Fast Multiplier,” IEEE Trans. Computers, vol. 13, no. 2, pp.14-17, Feb. 1964. [6] A. Weinberger: “4:2 Carry Save Adder Module,” IBM Technical Disclosure Bulletin, vol. 23, Jan. 1981. 11 [7] P. Song and G. De Michelli: “Circuit and Architecture Trade-Offs for High Speed Multiplication,” IEEE J. Solid State Circuits, vol. 26, no. 9, Sept. 1991. [8] A. D. Booth: “A Signed Binary Multiplication Technique,” Quarterly J. Mechanical Applications in Math, vol. 4, part 2, pp. 236-240, 1951. [9] V. G. Oklobdzija, D. Villeger and S. S. Liu: “A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach,” IEEE Trans. on Comp., vol. 35, no. 3, pp. 294-305, Mar. 1996. 12