
POWER8 in-core cryptography
An introduction to using AES instructions
Leonidas S. Barbosa
Software engineer, IBM
IBM
21 September 2015
POWER8 provides in-core instructions that speed up encryption and decryption with the
Advanced Encryption Standard (AES) in cryptographic applications. This article explains
how to use these in-core instructions.
POWER8 is a family of superscalar symmetric multiprocessors based on the POWER
architecture. The POWER8 series enhanced its cryptographic capabilities by adding
in-core support for the Advanced Encryption Standard (AES) symmetric key
cryptography standard.
The POWER8 AES instruction set provides five vector instructions to process AES block cipher
encryption/decryption. POWER8 also provides instructions for multiplication in Galois Field, used
to implement the Galois Counter Mode (GCM) and GHASH algorithms [1].
This article introduces these cryptographic instructions and shows simple examples to
demonstrate how you can use them to implement AES or AES Modes in your application or driver.
What is AES?
The Advanced Encryption Standard is also known as Rijndael. It was established as a standard for
the encryption of electronic data by the U.S. National Institute of Standards and Technology (NIST)
in 2001. It is a symmetric key algorithm that processes data in blocks of 16 bytes (128 bits). In other
words, it is a block cipher algorithm. The 128-bit block fits in a VMX/VSX 128-bit register. Keys for
this algorithm can be 128, 192, or 256 bits long. The POWER8 architecture provides five
instructions that run critical steps of the AES algorithm in-core, especially the key expansion
and the AES encryption/decryption rounds.
Vector Multimedia eXtension (VMX)
One of the POWER8 enhancements is the implementation of an integrated multi-pipeline vector
SIMD facility, which supports thirty-two 128-bit VMX vector registers. Vector data can be
represented in different ways, as shown in the following table.
© Copyright IBM Corporation 2015
developerWorks®
ibm.com/developerWorks/
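Element  Byte offsets covered
qword    0x00-0x0f
dword    0x00-0x07 | 0x08-0x0f
word     0x00-0x03 | 0x04-0x07 | 0x08-0x0b | 0x0c-0x0f
hword    0x00-0x01 | 0x02-0x03 | 0x04-0x05 | 0x06-0x07 | 0x08-0x09 | 0x0a-0x0b | 0x0c-0x0d | 0x0e-0x0f
byte     0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0a 0x0b 0x0c 0x0d 0x0e 0x0f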
* hword = 2 bytes, word = 4 bytes, dword = 8 bytes, qword = 16 bytes.
For the purpose of using AES, consider using the full 16-byte vector (qword), which can hold a
complete round key or the state/cipher text during the encryption/decryption steps.
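The different views of one 16-byte vector can be illustrated with a short Python sketch. It is only a model of the table above: it assumes big-endian element ordering, and the variable names are chosen for illustration.

```python
import struct

# A 16-byte vector whose bytes are 0x00 .. 0x0f, as in the table above.
vector = bytes(range(16))

# Reinterpret the same 16 bytes at each element size (big-endian assumed).
hwords = struct.unpack(">8H", vector)    # eight 2-byte halfwords
words = struct.unpack(">4I", vector)     # four 4-byte words
dwords = struct.unpack(">2Q", vector)    # two 8-byte doublewords
qword = int.from_bytes(vector, "big")    # one 16-byte quadword

print([f"{h:#06x}" for h in hwords])
print([f"{w:#010x}" for w in words])
print([f"{d:#018x}" for d in dwords])
print(f"{qword:#034x}")
```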
AES Algorithm
The AES algorithm can be split into the following steps:
• KeyExpansion/Generate Round keys
  • RotWord
  • SubBytes
  • Rcon Xor
• InitialRound
  • AddRoundKey
• Rounds
  • SubBytes
  • ShiftRows
  • MixColumns
  • AddRoundKey
• Final Round (no MixColumns)
  • SubBytes
  • ShiftRows
  • AddRoundKey
Figure 1 shows an overview of the algorithm. Each of the steps is
described in the following sections.
Figure 1. AES flowchart
Key Expansion/Generate Round Keys
The Key Expansion/Generate Round Keys step starts with a given key and expands it to multiple
keys. 128-bit keys are expanded to 11 keys. 192-bit keys are expanded to 13 keys. 256-bit keys
are expanded to 15 keys.
The first expanded key is generated from the last word of the original key and is processed in three
steps that produce a word, which is then used to generate all 4 words of an expanded key. The
next round uses the last word from the key generated in the previous round. This process repeats
until all keys are generated. Keep in mind that regardless of the size of the initial key, each
expanded (round) key that AES uses internally is 16 bytes.
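The key counts above follow from the number of AES rounds: one round key per round, plus one for the initial AddRoundKey. A minimal sketch, where the function name is chosen for illustration:

```python
def aes_round_key_count(key_bits):
    """Return how many 16-byte round keys AES needs for a given key size."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192 or 256 bits long")
    rounds = key_bits // 32 + 6   # 10, 12 or 14 rounds
    return rounds + 1             # one extra key for the initial AddRoundKey

for bits in (128, 192, 256):
    print(bits, aes_round_key_count(bits))
```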
RotWord Step
The Rotate Word (RotWord) step processes a word and rotates its bytes as follows:
Bytes 0 1 2 3 -> 1 2 3 0
Example:
Given a word:
79 d2 85 46
The RotWord step would result in:
d2 85 46 79
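As a software illustration only (the POWER8 code shown later does this with a vector permute), RotWord can be sketched in Python. The function name is hypothetical.

```python
def rot_word(word):
    """Rotate a 4-byte word left by one byte: b0 b1 b2 b3 -> b1 b2 b3 b0."""
    if len(word) != 4:
        raise ValueError("RotWord operates on exactly 4 bytes")
    return word[1:] + word[:1]

print(rot_word(bytes.fromhex("79d28546")).hex(" "))  # d2 85 46 79
```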
SubBytes Step
The SubBytes step uses a substitution box (S-Box) that replaces each byte in a word with a value
derived from that byte's multiplicative inverse in the Galois Field
GF(2^8) = GF(2)[x]/(x^8 + x^4 + x^3 + x + 1) [2]. For decryption, it uses
an Inverse S-Box.
Figure 2. S-Box
For example, using the S-Box, byte 0x9a is replaced by 0xb8. This step is done internally using the
vcipher and vcipherlast instructions. Inverse S-Box operations are done internally with the vncipher
and vncipherlast instructions.
For example, given the word:
af 7f 67 98
The SubBytes(word) operation would yield:
79 d2 85 46
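The S-Box values can be reproduced in software. The sketch below is an illustration, not how the hardware computes them: it assumes the standard AES construction, in which the multiplicative inverse in GF(2^8) is followed by a fixed affine transformation. All function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11b)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse in GF(2^8); 0 is mapped to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 equals a^-1 because a^255 == 1
        result = gf_mul(result, a)
    return result

def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF

def sub_byte(a):
    """S-Box lookup computed from scratch: inverse, then affine transform."""
    inv = gf_inv(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63

def sub_bytes(word):
    """Apply the S-Box to every byte of a word."""
    return bytes(sub_byte(b) for b in word)

print(f"{sub_byte(0x9a):#04x}")                       # 0xb8
print(sub_bytes(bytes.fromhex("af7f6798")).hex(" "))  # 79 d2 85 46
```

In practice the 256 results are precomputed once and stored as a lookup table.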
Rcon Xor Step
The Rcon (round constant) step uses a power of 2 that depends on a given value [3]: Rcon(i) is 2
raised to the power (i-1), computed in GF(2^8). In AES, this value i is the round number.
The number of Rcon values needed depends on the size of the key. For 128-bit keys, AES requires
up to rcon(10); for 192-bit keys up to rcon(8); and for 256-bit keys up to rcon(7). Thus, for all AES
possibilities, we need to have rcon 1 to 10 saved:
Rcon(1-10) = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36
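The list can be regenerated by repeated doubling in GF(2^8), which explains why 0x80 is followed by 0x1b rather than 0x100. A minimal sketch with hypothetical function names:

```python
def xtime(a):
    """Multiply a byte by 2 in GF(2^8), reducing modulo 0x11b on overflow."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a

def rcon(i):
    """Return Rcon(i) = 2^(i-1) in GF(2^8), for i >= 1."""
    value = 1
    for _ in range(i - 1):
        value = xtime(value)
    return value

print(", ".join(f"{rcon(i):#04x}" for i in range(1, 11)))
# 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36
```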
To generate the first expanded key, the rcon is applied by XORing the key word with rcon(1),
that is, key-word xor 01 00 00 00.
Here's a full example of how the expand key works:
Key: 0xac 0x2b 0x3c 0xdd 0xee 0x04 0x11 0x44 0xa1 0x4b 0x5c 0xd1 0x6a 0xb9 0x1c 0xdd
Splitting the key into words for the first key in the key expand buffer:
W0 = ac 2b 3c dd
W1 = ee 04 11 44
W2 = a1 4b 5c d1
W3 = 6a b9 1c dd
The key expand algorithm always uses the last word to execute its steps, in our case W3:
X1 <- RotWord(W3)
X1 = b9 1c dd 6a
Y1 <- SubBytes(X1)
Y1 = 56 9c c1 02
Rcon(1) = 01 00 00 00
Z1 <- Y1 xor 01 00 00 00
Z1 = 57 9c c1 02
------------------------------
Second key in key expand buffer:
W4 = (W0 xor Z1) = fb b7 fd df
W5 = (W4 xor W1) = 15 b3 ec 9b
W6 = (W5 xor W2) = b4 f8 b0 4a
W7 = (W6 xor W3) = de 41 ac 97
------------------------------
X2 <- RotWord(W7)
X2 = 41 ac 97 de
Y2 <- SubBytes(X2)
Y2 = 83 91 88 1d
Rcon(2) = 02 00 00 00
Z2 <- Y2 xor 02 00 00 00
Z2 = 81 91 88 1d
------------------------------
Third key in key expand buffer:
W8 = (W4 xor Z2) = 7a 26 75 c2
W9 = (W8 xor W5) = 6f 95 99 59
W10 = (W9 xor W6) = db 6d 29 13
W11 = (W10 xor W7) = 05 2c 85 84
(...)
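The walkthrough above can be checked with a small software model of the AES-128 key expansion. This is a reference sketch in Python, not the POWER8 implementation. The helper names are hypothetical, and the S-Box is generated from its standard definition instead of being typed in as a table.

```python
def xtime(a):
    """Multiply a byte by 2 in GF(2^8)."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

def build_sbox():
    """Build the AES S-Box: multiplicative inverse, then affine transform."""
    sbox = []
    for a in range(256):
        inv = next((c for c in range(1, 256) if gf_mul(a, c) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox

SBOX = build_sbox()

def expand_key_128(key):
    """Expand a 16-byte key into 44 four-byte words (11 round keys)."""
    words = [key[i:i + 4] for i in range(0, 16, 4)]
    rcon = 1
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]                  # RotWord
            temp = bytes(SBOX[b] for b in temp)         # SubBytes
            temp = bytes([temp[0] ^ rcon]) + temp[1:]   # Rcon xor
            rcon = xtime(rcon)
        words.append(bytes(x ^ y for x, y in zip(words[i - 4], temp)))
    return words

key = bytes.fromhex("ac2b3cddee041144a14b5cd16ab91cdd")
words = expand_key_128(key)
print(" ".join(w.hex() for w in words[4:8]))  # fbb7fddf 15b3ec9b b4f8b04a de41ac97
```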
Now let's look at creating the expanded keys by using the POWER8 vector and AES instructions.
First, we need to look at the AES instructions. As previously described, POWER8 comes with five AES
instructions: vcipher, vcipherlast, vncipher, vncipherlast, and vsbox. Let's focus on the first two:
vcipher VRT,VRA,VRB
State ← VR[VRA]
RoundKey ← VR[VRB]
vtemp1 ← SubBytes(State)
vtemp2 ← ShiftRows(vtemp1)
vtemp3 ← MixColumns(vtemp2)
VR[VRT] ← vtemp3 ^ RoundKey
State is the cipher text in the current round, or the plain text in the first step of AES.
RoundKey, as the name suggests, is the key for the current round. Both values are 16 bytes,
the AES block size. Because VMX vectors are 16 bytes, they can hold the full-size round keys
and cipher text.
vcipherlast VRT,VRA,VRB
State ← VR[VRA]
RoundKey ← VR[VRB]
vtemp1 ← SubBytes(State)
vtemp2 ← ShiftRows(vtemp1)
VR[VRT] ← vtemp2 ^ RoundKey
vcipherlast is the same as vcipher, except that it omits one step, MixColumns.
vncipher and vncipherlast mirror vcipher and vcipherlast, except that they use the
inverse steps and are intended for decryption.
POWER8 does not have a specific instruction for Key Expand. However, vcipherlast, with some
additional steps, can be used to perform the Key Expand operations.
The following steps show an example of how to use vcipherlast to perform an Expand Key
operation.
In this example, the rcon value is already loaded into a vector register. You may want to look in the
vmx-crypto driver for more information [5]. Also note that in PowerPC assembly code, registers are
referenced by numbers, not by names. For example, vperm 3,1,1,5 uses vr3 as the result
register and vr1 (twice) and vr5 as source operands. See [4] for more details.
/**
* vr0 is assumed to hold all zeros (it feeds the vsldoi shifts).
* vr1 is the first key = 0xac 0x2b 0x3c 0xdd 0xee 0x04 0x11 0x44 0xa1 0x4b 0x5c 0xd1 0x6a 0xb9 0x1c 0xdd
* vr5 is a permute mask that rotates the last word of the key and copies it into all four words.
* vr5 = 0x0d0e0f0c 0d0e0f0c 0d0e0f0c 0d0e0f0c
* vr3 is the destination register for the rotated key word
* vr4 is the first rcon loaded: 01 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00
**/
Loop128:
1 vperm 3,1,1,5
2 vsldoi 6,0,1,12
3 vcipherlast 3,3,4
4 vxor 1,1,6
5 vsldoi 6,0,6,12
6 vxor 1,1,6
7 vsldoi 6,0,6,12
8 vxor 1,1,6
9 vadduwm 4,4,4
10 vxor 1,1,3
11 bdnz Loop128
Line 1 applies a mask against the key. In this case, after the vperm instruction vr3 will be:
0xb91cdd6a 0xb91cdd6a 0xb91cdd6a 0xb91cdd6a
Line 2 shifts the key right by one word into vr6, which results in:
0x00000000ac2b3cddee041144a14b5cd1
Line 3 calls vcipherlast to execute SubBytes, ShiftRows, and an xor with Rcon(n). By the
definition of the ShiftRows function in Power ISA 2.07B [4], ShiftRows has no effect when applied
to this vector, because all four words (columns) are identical. In this particular scenario, it
performs only SubBytes and xor Rcon(n). In other words, it generates the first word Z or Z1.
Thus, after vcipherlast we have:
Z1: 0x579cc102 579cc102 579cc102 579cc102
Lines 4 to 8 perform the math behind the key words generation in the key expansion algorithm.
W4 = (W0 xor Z1)
W5 = (W1 xor W4)
W6 = (W2 xor W5)
W7 = (W3 xor W6)
This can be rewritten as:
W4' = W0
W5' = W1 xor W0
W6' = W2 xor W1 xor W0
W7' = W3 xor W2 xor W1 xor W0
Where ' means a temporary word before Z1 is applied.
The detailed operations of lines 4 to 8 are:
        W0         W1         W2         W3
vr1 0xac2b3cdd 0xee041144 0xa14b5cd1 0x6ab91cdd
                   W0         W1         W2
vr6 0x00000000 0xac2b3cdd 0xee041144 0xa14b5cd1   (from line 2)
------------------------------------------------
        W4'        W5'     temp-W6'   temp-W7'
vr1 0xac2b3cdd 0x422f2d99 0x4f4f4d95 0xcbf2400c   (after line 4)
                              W0         W1
vr6 0x00000000 0x00000000 0xac2b3cdd 0xee041144   (after line 5)
------------------------------------------------
        W4'        W5'        W6'     temp-W7'
vr1 0xac2b3cdd 0x422f2d99 0xe3647148 0x25f65148   (after line 6)
                                         W0
vr6 0x00000000 0x00000000 0x00000000 0xac2b3cdd   (after line 7)
------------------------------------------------
        W4'        W5'        W6'        W7'
vr1 0xac2b3cdd 0x422f2d99 0xe3647148 0x89dd6d95   (after line 8)
Line 9 adds vr4 to itself, which doubles each rcon word and gives the rcon for the next round:
vr4 02 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00
Finally, Line 10 applies Z1 words to generate the first key that is expanded.
        W4'        W5'        W6'        W7'
vr1 0xac2b3cdd 0x422f2d99 0xe3647148 0x89dd6d95
        Z1         Z1         Z1         Z1
vr3 0x579cc102 0x579cc102 0x579cc102 0x579cc102
------------------------------------------------
        W4         W5         W6         W7
vr1 0xfbb7fddf 0x15b3ec9b 0xb4f8b04a 0xde41ac97
Line 11 decrements the count register and, while it is nonzero, jumps back to the beginning of
the loop, repeating all previous steps once for each round key that is still needed.
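To make the data flow of the loop easier to follow, here is a Python sketch that models each vector instruction on 16-byte values and runs the loop body once. It is a simplified model, not the instructions themselves: it assumes big-endian byte numbering and that vr0 holds zeros, and the Python function names merely borrow the mnemonics.

```python
def xtime(a):
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a

def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

def build_sbox():
    """AES S-Box from its definition: GF(2^8) inverse plus affine transform."""
    sbox = []
    for a in range(256):
        inv = next((c for c in range(1, 256) if gf_mul(a, c) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox

SBOX = build_sbox()

def vperm(a, b, mask):
    """Pick bytes from the 32-byte concatenation a||b, indexed by mask."""
    src = a + b
    return bytes(src[i & 0x1F] for i in mask)

def vsldoi(a, b, shift):
    """Shift a||b left by `shift` bytes and keep the top 16 bytes."""
    return (a + b)[shift:shift + 16]

def vxor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def vadduwm(a, b):
    """Add the four 32-bit words of a and b, modulo 2^32."""
    out = b""
    for i in range(0, 16, 4):
        total = int.from_bytes(a[i:i + 4], "big") + int.from_bytes(b[i:i + 4], "big")
        out += (total & 0xFFFFFFFF).to_bytes(4, "big")
    return out

def shift_rows(s):
    """ShiftRows on a column-major state: row r rotates left by r columns."""
    return bytes(s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4))

def vcipherlast(state, round_key):
    return vxor(shift_rows(bytes(SBOX[b] for b in state)), round_key)

vr = {
    0: bytes(16),                                          # assumed zero
    1: bytes.fromhex("ac2b3cddee041144a14b5cd16ab91cdd"),  # first key
    4: bytes.fromhex("01000000") * 4,                      # rcon(1)
    5: bytes.fromhex("0d0e0f0c") * 4,                      # rotate mask
}

def expand_step(vr):
    """One pass through lines 1 to 10 of Loop128."""
    vr[3] = vperm(vr[1], vr[1], vr[5])   # line 1: RotWord of last word, splatted
    vr[6] = vsldoi(vr[0], vr[1], 12)     # line 2
    vr[3] = vcipherlast(vr[3], vr[4])    # line 3: SubBytes and xor rcon
    vr[1] = vxor(vr[1], vr[6])           # line 4
    vr[6] = vsldoi(vr[0], vr[6], 12)     # line 5
    vr[1] = vxor(vr[1], vr[6])           # line 6
    vr[6] = vsldoi(vr[0], vr[6], 12)     # line 7
    vr[1] = vxor(vr[1], vr[6])           # line 8
    vr[4] = vadduwm(vr[4], vr[4])        # line 9: next rcon
    vr[1] = vxor(vr[1], vr[3])           # line 10
    return vr[1]

second_key = expand_step(vr)
print(second_key.hex())  # fbb7fddf15b3ec9bb4f8b04ade41ac97
```

Lines 4 to 8 amount to a running (prefix) xor across the four words, which is why three shift-and-xor pairs are enough to build all four words of the next key.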
In comparison with Key Expand, AES rounds are simple because they require only the expanded
keys and the data to be encrypted or decrypted.
Following is a simple example that shows how to use the in-core instructions. For a more complete
code example, see Appendix A.
/**
* vr0 is our state: the vector register that holds the
* plaintext block.
* vr1 is key0, the key provided by the user (the first key)
* vr2 is the second key (key1), generated by expand key
* vr3 is the third key (key2), and so on up to vr11 (key10)
**/
1 vxor 0,0,1
2 vcipher 0,0,2
3 vcipher 0,0,3
4 vcipher 0,0,4
5 vcipher 0,0,5
...
11 vcipherlast 0,0,11
Line 1 adds key0 to the initial state of AES (the initial AddRoundKey).
Line 2 is the first round of AES with key1.
Line 3 is the second round of AES with key2, and so on.
Line 11 is the last round of AES with the last key, key10.
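The same sequence can be modeled in software to check results against a known test vector. The sketch below is a Python reference model of AES-128 block encryption, in which the functions named vcipher and vcipherlast follow the pseudocode shown earlier. The helper names and the column-major byte layout are assumptions made for illustration, and the code is meant for checking values, not for production use.

```python
def xtime(a):
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a

def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

def build_sbox():
    sbox = []
    for a in range(256):
        inv = next((c for c in range(1, 256) if gf_mul(a, c) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox

SBOX = build_sbox()

def vxor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def expand_key_128(key):
    """Return the 11 round keys (16 bytes each) for a 128-bit key."""
    words = [key[i:i + 4] for i in range(0, 16, 4)]
    rcon = 1
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            temp = bytes(SBOX[b] for b in temp[1:] + temp[:1])
            temp = bytes([temp[0] ^ rcon]) + temp[1:]
            rcon = xtime(rcon)
        words.append(vxor(words[i - 4], temp))
    return [b"".join(words[i:i + 4]) for i in range(0, 44, 4)]

def sub_bytes(s):
    return bytes(SBOX[b] for b in s)

def shift_rows(s):
    """Column-major state: row r rotates left by r columns."""
    return bytes(s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4))

def mix_columns(s):
    out = bytearray(16)
    for c in range(4):
        col = s[4 * c:4 * c + 4]
        for r in range(4):
            out[4 * c + r] = (gf_mul(col[r], 2) ^ gf_mul(col[(r + 1) % 4], 3)
                              ^ col[(r + 2) % 4] ^ col[(r + 3) % 4])
    return bytes(out)

def vcipher(state, round_key):
    return vxor(mix_columns(shift_rows(sub_bytes(state))), round_key)

def vcipherlast(state, round_key):
    return vxor(shift_rows(sub_bytes(state)), round_key)

def encrypt_block_128(plaintext, key):
    keys = expand_key_128(key)
    state = vxor(plaintext, keys[0])        # line 1: add key0
    for round_key in keys[1:10]:            # lines 2 to 10: nine full rounds
        state = vcipher(state, round_key)
    return vcipherlast(state, keys[10])     # line 11: final round

# Test vector from FIPS 197, Appendix C.1.
ciphertext = encrypt_block_128(
    bytes.fromhex("00112233445566778899aabbccddeeff"),
    bytes.fromhex("000102030405060708090a0b0c0d0e0f"),
)
print(ciphertext.hex())  # 69c4e0d86a7b0430d8cdb78070b4c55a
```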
Kernel Driver that uses POWER8 in-core instructions
vmx-crypto is the kernel driver that supports AES in-core for POWER8. Initially, the driver supports
AES in CBC and CTR modes. It also supports the GHASH algorithm. It is available in kernel 4.1
and later, and is both little- and big-endian capable.
To verify whether your kernel is using vmx-crypto, you can run: lsmod | grep vmx. If your machine is
not using it already, you can run modprobe vmx-crypto and then verify again with lsmod, or run
cat /proc/crypto | less and look for the p8 prefix. The algorithms/modes supported by the driver are:
• name : ghash
driver : p8_ghash
module : vmx_crypto
• name : aes
driver : p8_aes
module : vmx_crypto
• name : cbc(aes)
driver : p8_aes_cbc
module : vmx_crypto
• name : ctr(aes)
driver : p8_aes_ctr
module : vmx_crypto
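The same check can be scripted. The following Python sketch parses text in the /proc/crypto layout shown above and lists the entries whose driver name starts with p8_. The function names are hypothetical, and the parser assumes that entries are separated by blank lines and that each line has the form "field : value".

```python
def parse_proc_crypto(text):
    """Split /proc/crypto style text into one dict per algorithm entry."""
    entries, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                entries.append(current)
                current = {}
            continue
        field, _, value = line.partition(":")
        current[field.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries

def p8_algorithms(text):
    """Map algorithm name to driver for every driver with the p8_ prefix."""
    return {
        entry.get("name", "?"): entry["driver"]
        for entry in parse_proc_crypto(text)
        if entry.get("driver", "").startswith("p8_")
    }

if __name__ == "__main__":
    try:
        with open("/proc/crypto") as handle:
            print(p8_algorithms(handle.read()))
    except OSError:
        print("/proc/crypto is not available on this system")
```

An empty result means that no p8_ driver is registered, for example because the vmx-crypto module is not loaded.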
POWER8 in-core instructions in user space
Many projects use OpenSSL as their crypto provider. Starting with Version 1.0.2 of OpenSSL,
the code implements its cryptography by using the in-core POWER8 instructions. If enabled on the
running system, this version of OpenSSL (and later) provides better performance by using the
VMX POWER8 assembly code and hardware optimization. Because so many applications use
OpenSSL for their cryptography, this enhanced OpenSSL enables a wide variety of applications to
take advantage of the POWER8 AES instructions.
Conclusion
In-core instructions on POWER8 systems give you the ability to implement a cryptography stack
that uses the power of POWER8 hardware directly, which helps your code to perform well on
cryptography benchmark tests.
References
[1] Brian Hall; Ryan Arnold; Peter Bergner; Wainer dos Santos Moschetta; Robert Enenkel; Pat
Haugen; Michael R. Meissner; Alex Mericas; Philipp Oehler; Berni Shiefer; Brian F. Veale; Suresh
Warrier; Daniel Zabawa; Adhemerval Zanella. Performance Optimization and Tuning Techniques
for IBM Processors, including IBM POWER8. An IBM Redbooks publication, 2014.
[2] G. David Forney, Principles of Digital Communication II - Spring 2005. Introduction to Finite
Fields, 2005.
[3] Federal Information Processing Standards Publication – Announcing the Advanced Encryption
Standard – AES, 2001.
[4] IBM. Power ISA 2.07B. Vector Facilities, 2015. p.217. Available in:
https://www.power.org/documentation/power-isa-v-2-07b/
[5] Cerri, M. VMX-crypto driver. Available in:
http://lxr.free-electrons.com/source/drivers/crypto/vmx/.
Appendix A
/**
* vmx-crypto AES encrypt/OpenSSL aes encryption
* At kernel: /drivers/crypto/vmx/aesp8-ppc.S
* At OpenSSL: /crypto/aes/aesp8-ppc.s
**/
.aes_p8_encrypt:
lwz 6,240(5)
lis 0,0xfc00
mfspr 12,256
li 7,15
mtspr 256,0
lvx 0,0,3
neg 11,4
lvx 1,7,3
lvsl 2,0,3
lvsl 3,0,11
li 7,16
vperm 0,0,1,2
lvx 1,0,5
lvsl 5,0,5
srwi 6,6,1
lvx 2,7,5
addi 7,7,16
subi 6,6,1
vperm 1,1,2,5
vxor 0,0,1
lvx 1,7,5
addi 7,7,16
mtctr 6
.Loop_enc:
vperm 2,2,1,5
vcipher 0,0,2
lvx 2,7,5
addi 7,7,16
vperm 1,1,2,5
vcipher 0,0,1
lvx 1,7,5
addi 7,7,16
bdnz .Loop_enc
vperm 2,2,1,5
vcipher 0,0,2
lvx 2,7,5
vperm 1,1,2,5
vcipherlast 0,0,1
vspltisb 2,-1
vxor 1,1,1
li 7,15
vperm 2,1,2,3
lvx 1,0,4
vperm 0,0,0,3
vsel 1,1,0,2
lvx 4,7,4
stvx 1,0,4
vsel 0,0,4,2
stvx 0,7,4
mtspr 256,12
blr
/**
* vmx-crypto AES decrypt
* At kernel: /drivers/crypto/vmx/aesp8-ppc.S
* At OpenSSL: /crypto/aes/aesp8-ppc.s
**/
.aes_p8_decrypt:
lwz 6,240(5)
lis 0,0xfc00
mfspr 12,256
li 7,15
mtspr 256,0
lvx 0,0,3
neg 11,4
lvx 1,7,3
lvsl 2,0,3
lvsl 3,0,11
li 7,16
vperm 0,0,1,2
lvx 1,0,5
lvsl 5,0,5
srwi 6,6,1
lvx 2,7,5
addi 7,7,16
subi 6,6,1
vperm 1,1,2,5
vxor 0,0,1
lvx 1,7,5
addi 7,7,16
mtctr 6
.Loop_dec:
vperm 2,2,1,5
vncipher 0,0,2
lvx 2,7,5
addi 7,7,16
vperm 1,1,2,5
vncipher 0,0,1
lvx 1,7,5
addi 7,7,16
bdnz .Loop_dec
vperm 2,2,1,5
vncipher 0,0,2
lvx 2,7,5
vperm 1,1,2,5
vncipherlast 0,0,1
vspltisb 2,-1
vxor 1,1,1
li 7,15
vperm 2,1,2,3
lvx 1,0,4
vperm 0,0,0,3
vsel 1,1,0,2
lvx 4,7,4
stvx 1,0,4
vsel 0,0,4,2
stvx 0,7,4
mtspr 256,12
blr
© Copyright IBM Corporation 2015
(www.ibm.com/legal/copytrade.shtml)
Trademarks
(www.ibm.com/developerworks/ibm/trademarks/)