AN3038, Using the ColdFire EMAC Unit to Improve RSA Performance - Application Notes

Freescale Semiconductor
Application Note
AN3038
Rev. 0, 09/2005
Using the ColdFire EMAC Unit to
Improve RSA Performance
by: Jim Stephens
Freescale Semiconductor
The widely used RSA public key cryptographic
algorithm requires modular exponentiation of large
integers (typically 512 to 2048 bits). A significant
amount of work has been done on efficient methods for
making this calculation. A critical element in a high
performance RSA implementation is the modular
multiplication function. This application note describes a
technique for speeding up the modular multiplication
using the enhanced multiply accumulate (EMAC) unit
present on most ColdFire processors. A low level
function is described that can be incorporated in RSA
implementations to achieve a 2.5x improvement in
multiplication performance.1
At a top level, the modular exponentiation function can
be efficiently implemented using the square and multiply
method or similar algorithms (m-ary, sliding window).
This breaks the problem down to a series of modular
multiplications. The Montgomery multiplication
algorithm is an efficient way of calculating the required
modular multiplications and squarings. The
1. Based on timing of a 256-bit multiply running a ColdFire V2
processor using the Green Hills compiler v3.6.
© Freescale Semiconductor, Inc., 2005. All rights reserved.
Table of Contents
Appendix A
mp_mul64.c Code
Appendix B
mul_add64.s Code
Appendix C
mp_mul.c Code
Montgomery algorithm practically eliminates the need to perform division, but still requires efficient
implementation of regular multiple precision multiplication.
There are various ways to implement Montgomery multiplication. However, the performance of an
implementation is mostly dependent on the low level multiplication function used to calculate and
accumulate partial products. An efficient function, mul_add64(), does a 1 by k digit multiply and addition
using 64-bit digits. This function can be used in multiple precision multiplication, squaring, and reduction
functions. Use of the mul_add64() function is illustrated using a regular multiple precision multiplication
function. Application to squaring and reduction is straightforward.
The core of a multiple precision multiply function that makes the calculation t = a*b is shown below. The
full function is in the file mp_mul.c.
for(i=size-1; i>=0; i--)
{
carry = 0;
for(j=size-1;j>=0;j--)
{
z0
= (a[i]&0xffff) * (b[j]&0xffff);
z1a = (a[i]&0xffff) * (b[j]>>16);
z1b = (a[i]>>16) * (b[j]&0xffff);
z2
= (a[i]>>16) * (b[j]>>16);
temp = z0;
temp += (ULLONG)(z1a) <<16;
temp += (ULLONG)(z1b) <<16;
temp += (ULLONG)(z2) <<32;
temp += (ULLONG) t[i+j+1] + carry;
t[i+j+1] = (temp & 0xffffffffL);
carry = temp>>32;
}
t[i]=carry;
}
Regular Multiple Precision Multiply
In this example, the input and output values are represented by arrays of unsigned long (32-bit) elements,
with the most significant element first. The calculation is made using 32-bit digits, with the inner loop
performing a full 32-bit multiply.
To improve performance, an EMAC-based multiply and add function, mul_add64(), is used to replace the
inner loop of the multiply function, as shown below. The full function is in the file mp_mul64.c.
Using the ColdFire EMAC Unit to Improve RSA Performance, Rev. 0
2
Freescale Semiconductor
for(i=size-1; i>=0; i--)
{
carry = 0;
mul_add64(&a[i], &b[0], size, &carry, &t[i+1]);
t[i] = carry;
}
EMAC Based Multiple Precision Multiply
To make better use of the EMAC's capabilities, the digit size is increased to 64-bits. The input and output
arrays (a, b, and t) use unsigned, long (64-bit) elements. Therefore, the value of the size input is reduced
by a factor of two. The mul_add64() function performs a series of 64-bit multiplies and adds the products,
as shown in the pseudocode below.
For i = k-1 to 0
{c,z} = x*y[i] + t[i] + c
t[i] = z
Pseudocode for mul_add64()
Note that all values except the counter are 64-bits, and the carry (c) is both an input and an output.
The mul_add64() function is implemented using the Comba method in ColdFire assembly language. All
of the code inside the loop is completely unrolled. A full 64-bit multiply is implemented using 16 mac.w
instructions. Using the Comba method, partial products can be accumulated directly. The only requirement
to do this is an extension register to store carry bits, which the EMAC unit has. To optimize performance,
instructions are scheduled such that results are not immediately read from the accumulator after a mac.w
instruction. The code for this function can be found in the file mul_add64.s.
Benchmark tests show that a multiple precision multiply using the mul_add64() function is more than
twice as fast as the 32-bit C implementation. This can be attributed to using the Comba method to take
advantage of the ColdFire EMAC unit features. The mul_add64() function can be used directly or as a
reference for developers of high performance RSA implementations on ColdFire processors.
Using the ColdFire EMAC Unit to Improve RSA Performance, Rev. 0
Freescale Semiconductor
3
References
1. C. Koc, “High-Speed RSA Implementation,” RSA Laboratories, 1994
2. C. Koc, T. Acar, B. Kaliski, “Analyzing and Comparing Montgomery Multiplication
Algorithms,” IEEE Micro, 1996
3. P. Comba, “Exponentiation Cryptosystems on the IBM PC,” IBM Systems Journal, 1990
Using the ColdFire EMAC Unit to Improve RSA Performance, Rev. 0
4
Freescale Semiconductor
Appendix A
mp_mul64.c Code
/*
mp_mul64.c
multiple precision multiply using mul_add64 function
t = a * b
The size input is the length of a and b which have
64-bit elements. Data is most significant element first.
*/
#define ULONG unsigned long
#define ULLONG unsigned long long
void mul_add64(ULLONG *x, ULLONG y[], int k, ULLONG *c, ULLONG t[]);
void
mp_mul64(ULLONG a[],
ULLONG b[],
ULLONG t[],
int size)
{
int
ULLONG
i;
carry;
for(i=size; i<size*2; i++)
{
t[i] = 0;
}
for(i=size-1;i>=0;i--)
{
carry = 0;
mul_add64(&a[i],&b[0],size,&carry,&t[i+1]);
t[i] = carry;
}
}
Using the ColdFire EMAC Unit to Improve RSA Performance, Rev. 0
Freescale Semiconductor
5
Appendix B
mul_add64.s Assembly File
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
mul_add64.s
Multiply and add function with 64 bit values.
ColdFire assembly implementation using the EMAC and the
Comba method. This function assumes k > 0
void mul_add64(ULLONG *x, ULLONG y[], int k, ULLONG *c, ULLONG t[])
8
12
16
20
24
For i = k-1 to 0
{c,z} = x*y[i] + t[i] + c
t[i] = z
{a0,a1}
{a2,a3}
{d0,d1}
{d2,d3}
a5
a6
d4,d5,d6
d7
.globl
mul_add64:
link.w
movem.l
moveq
move.l
movclr.l
movclr.l
moveq
move.l
move.l
move.l
move.l
subq.l
lsl.l
add.l
add.l
move.l
move.l
move.l
loop:
move.l
move.l
move.l
moveq
=
=
=
=
=
=
=
=
x
y
c
z
scratch ptr
frame ptr
scratch data reg
16 (for shifts)
mul_add64
%a6,#-40
%d2-%d7/%a2-%a5,(%sp)
#0x40,%d6
%d6,%macsr
%acc0,%d6
%acc1,%d6
#16,%d7
8(%a6),%a5
(%a5)+,%a0
(%a5),%a1
16(%a6),%d5
#1,%d5
#3,%d5
%d5,12(%a6)
%d5,24(%a6)
20(%a6),%a5
(%a5)+,%d0
(%a5),%d1
12(%a6),%a5
(%a5)+,%a2
(%a5),%a3
#8,%d5
; unsigned integer mode
; clear acc and ext
; d7 used for shifts
; load x {a0,a1}
;
;
;
;
;
;
d5 = k
d5 = k-1
d5 = 8*(k-1)
y_ptr = &y[k-1]
t_ptr = &t[k-1]
load c (moved to z in loop)
; load y[i] {a2,a3}
Using the ColdFire EMAC Unit to Improve RSA Performance, Rev. 0
6
Freescale Semiconductor
sub.l
mac.w
move.l
move.l
clr.l
clr.l
movclr.l
mac.w
mac.w
add.l
clr.l
addx.l
addx.l
move.l
movclr.l
mac.w
mac.w
mac.w
move.l
lsl.l
lsr.l
lsl.l
add.l
add.l
addx.l
clr.l
addx.l
move.l
movclr.l
mac.w
mac.w
mac.w
mac.w
add.l
addx.l
move.l
movclr.l
mac.w
mac.w
mac.w
move.l
lsl.l
lsr.l
lsl.l
add.l
add.l
addx.l
addx.l
move.l
movclr.l
mac.w
mac.w
add.l
%d5,12(%a6)
%a1.l,%a3.l
%d0,%d2
%d1,%d3
%d0
%d1
%acc0,%d6
%a1.l,%a3.u
%a1.u,%a3.l
%d6,%d3
%d5
%d5,%d2
%d5,%d1
%accext01,%d4
%acc0,%d5
%a1.l,%a2.l
%a1.u,%a3.u
%a0.l,%a3.l; x3*y1
%d5,%d6
%d7,%d4
%d7,%d5
%d7,%d6
%d4,%d5
%d6,%d3
%d5,%d2
%d6
%d6,%d1
%accext01,%d4
%acc0,%d5
%a1.l,%a2.u
%a1.u,%a2.l
%a0.l,%a3.u; x3*y2
%a0.u,%a3.l; x4*y1
%d5,%d2
%d4,%d1
%accext01,%d4
%acc0,%d5
%a1.u,%a2.u
%a0.l,%a2.l
%a0.u,%a3.u; x4*y2
%d5,%d6
%d7,%d4
%d7,%d5
%d7,%d6
%d4,%d5
%d6,%d2
%d5,%d1
%d0,%d0
%accext01,%d4
%acc0,%d5
%a0.l,%a2.u
%a0.u,%a2.l
%d5,%d1
; y_ptr-; x1*y1
; move c to z
; clear c
;
;
;
;
d6 = {z2,z1}
x1*y2
x2*y1
d3 += {z2,z1}
(add col 1)
; d2 += carry
; d1 = carry
; d4:d5 = {z4,z3,z2}
; x1*y3
; x2*y2
;
;
;
;
;
;
;
add col 2
d4 = {z4,0}
d5 = {0,z3}
d6 = {z2,0}
d5 = {z4,z3}
d3 += {z2,0}
d2 += {z4,z3} + carry
; d1 += carry
; d4:d5 = {z4,z3,z2}
; x1*y4
; x2*y3
; d2 += {z3,z2} (add col3)
; d1 += {0,z4}
; d4:d5 = {c2,c1,z4}
; x2*y4
; x3*y3
;
;
;
;
;
;
;
;
add col 4
d4 = {c2,0}
d5 = {0,c1}
d6 = {z4,0}
d5 = {c2,c1}
d2 += {z4,0}
d1 += {c2,c1} + carry
d0 = carry
;
;
;
;
d4:d5 = {c3,c2,c1}
x3*y4
x4*y3
d1 += {c2,c1} (add col 5)
Using the ColdFire EMAC Unit to Improve RSA Performance, Rev. 0
Freescale Semiconductor
7
addx.l
move.l
movclr.l
mac.w
move.l
lsl.l
lsr.l
lsl.l
add.l
add.l
addx.l
movclr.l
add.l
move.l
add.l
move.l
addx.l
clr.l
addx.l
addx.l
move.l
move.l
subq.l
move.l
moveq
sub.l
bne.w
%d4,%d0
%accext01,%d4
%acc0,%d5
%a0.u,%a2.u
%d5,%d6
%d7,%d4
%d7,%d5
%d7,%d6
%d4,%d5
%d6,%d1
%d5,%d0
%acc0,%d5
%d5,%d0
24(%a6),%a5
4(%a5),%d3
(%a5),%d5
%d5,%d2
%d5
%d5,%d1
%d5,%d0
%d2,(%a5)
%d3,4(%a5)
#8,%a5
%a5,24(%a6)
#1,%d5
%d5,16(%a6)
loop
; d0 += {0,c3}
move.l
move.l
move.l
20(%a6),%a5
%d0,(%a5)+
%d1,(%a5)
; store c
movem.l
unlk
rts
(%sp),%d2-%d7/%a2-%a5
%a6
;
;
;
;
;
;
;
;
;
;
;
;
;
d4:d5 = {c4,c3,c2}
x4*y4
add col 6
d4 = {c4,0}
d5 = {0,c3}
d6 = {c2,0}
d5 = {c4,c3}
d1 += {c2,0}
d0 += {c4,c3}
d5 = {c4,c3} (add col 7)
d0 += {c4,c3}
a5 = t_ptr
add t[i]
; t[i] = z
; t_ptr--
; k-; loop k times
Using the ColdFire EMAC Unit to Improve RSA Performance, Rev. 0
8
Freescale Semiconductor
Appendix C
mp_mul.c Code
/*
mp_mul.c
multiple precision multiply using long data type
t = a * b
The size input is the length of a and b.
Data is most significant element first.
*/
#define ULONG unsigned long
#define ULLONG unsigned long long
void
mp_mul(ULONG a[],
ULONG b[],
ULONG t[],
int size)
{
int
ULONG
ULONG
ULLONG
i,j,k;
carry;
z0,z1a,z1b,z2;
temp;
for(k=size; k<size*2; k++)
{
t[k] = 0;
}
for(i=size-1;i>=0;i--)
{
carry = 0;
for(j=size-1;j>=0;j--)
{
z0
= (a[i]&0xffff) * (b[j]&0xffff);
z1a = (a[i]&0xffff) * (b[j]>>16);
z1b = (a[i]>>16) * (b[j]&0xffff);
z2
= (a[i]>>16) * (b[j]>>16);
temp = z0;
temp += (ULLONG)(z1a) <<16;
temp += (ULLONG)(z1b) <<16;
temp += (ULLONG)(z2) <<32;
temp += (ULLONG) t[i+j+1] + carry;
t[i+j+1] = (temp & 0xffffffffL);
carry = temp>>32;
}
t[i]=carry;
}
}
Using the ColdFire EMAC Unit to Improve RSA Performance, Rev. 0
Freescale Semiconductor
9
This page intentionally left blank.
Using the ColdFire EMAC Unit to Improve RSA Performance, Rev. 0
10
Freescale Semiconductor
This page intentionally left blank.
Using the ColdFire EMAC Unit to Improve RSA Performance, Rev. 0
Freescale Semiconductor
11
How to Reach Us:
Home Page:
www.freescale.com
E-mail:
[email protected]
USA/Europe or Locations Not Listed:
Freescale Semiconductor
Technical Information Center, CH370
1300 N. Alma School Road
Chandler, Arizona 85224
+1-800-521-6274 or +1-480-768-2130
[email protected]
Europe, Middle East, and Africa:
Freescale Halbleiter Deutschland GmbH
Technical Information Center
Schatzbogen 7
81829 Muenchen, Germany
+44 1296 380 456 (English)
+46 8 52200080 (English)
+49 89 92103 559 (German)
+33 1 69 35 48 48 (French)
[email protected]
Japan:
Freescale Semiconductor Japan Ltd.
Headquarters
ARCO Tower 15F
1-8-1, Shimo-Meguro, Meguro-ku,
Tokyo 153-0064
Japan
0120 191014 or +81 3 5437 9125
[email protected]
Asia/Pacific:
Freescale Semiconductor Hong Kong Ltd.
Technical Information Center
2 Dai King Street
Tai Po Industrial Estate
Tai Po, N.T., Hong Kong
+800 2666 8080
[email protected]
For Literature Requests Only:
Freescale Semiconductor Literature Distribution Center
P.O. Box 5405
Denver, Colorado 80217
1-800-441-2447 or 303-675-2140
Fax: 303-675-2150
[email protected]
AN3038
Rev. 0, 09/2005
Information in this document is provided solely to enable system and software
implementers to use Freescale Semiconductor products. There are no express or
implied copyright licenses granted hereunder to design or fabricate any integrated
circuits or integrated circuits based on the information in this document.
Freescale Semiconductor reserves the right to make changes without further notice to
any products herein. Freescale Semiconductor makes no warranty, representation or
guarantee regarding the suitability of its products for any particular purpose, nor does
Freescale Semiconductor assume any liability arising out of the application or use of any
product or circuit, and specifically disclaims any and all liability, including without
limitation consequential or incidental damages. “Typical” parameters that may be
provided in Freescale Semiconductor data sheets and/or specifications can and do vary
in different applications and actual performance may vary over time. All operating
parameters, including “Typicals”, must be validated for each customer application by
customer’s technical experts. Freescale Semiconductor does not convey any license
under its patent rights nor the rights of others. Freescale Semiconductor products are
not designed, intended, or authorized for use as components in systems intended for
surgical implant into the body, or other applications intended to support or sustain life,
or for any other application in which the failure of the Freescale Semiconductor product
could create a situation where personal injury or death may occur. Should Buyer
purchase or use Freescale Semiconductor products for any such unintended or
unauthorized application, Buyer shall indemnify and hold Freescale Semiconductor and
its officers, employees, subsidiaries, affiliates, and distributors harmless against all
claims, costs, damages, and expenses, and reasonable attorney fees arising out of,
directly or indirectly, any claim of personal injury or death associated with such
unintended or unauthorized use, even if such claim alleges that Freescale
Semiconductor was negligent regarding the design or manufacture of the part.
Freescale™ and the Freescale logo are trademarks of Freescale Semiconductor, Inc.
All other product or service names are the property of their respective owners.
© Freescale Semiconductor, Inc. 2005. All rights reserved.