POWER8 in-core cryptography
An introduction to using AES instructions

Leonidas S. Barbosa
Software engineer, IBM

21 September 2015

POWER8 provides in-core instructions that speed up encryption and decryption with the Advanced Encryption Standard (AES) in cryptographic applications. This article explains how to use those instructions.

POWER8 is a family of superscalar symmetric multiprocessors based on the POWER architecture. The POWER8 series enhanced its cryptographic capabilities with in-core support for the Advanced Encryption Standard (AES) symmetric-key cryptography standard. The POWER8 AES instruction set provides five vector instructions to process AES block cipher encryption and decryption. POWER8 also provides instructions for multiplication in a Galois field, which are used to implement Galois/Counter Mode (GCM) and the GHASH algorithm [1]. This article introduces these cryptographic instructions and shows simple examples that demonstrate how you can use them to implement AES or AES modes in your application or driver.

What is AES?

The Advanced Encryption Standard, also known as Rijndael, was established as a standard for the encryption of electronic data by the U.S. National Institute of Standards and Technology (NIST) in 2001. It is a symmetric-key block cipher that processes data in blocks of 16 bytes (128 bits), so one block fits exactly in a 128-bit VMX/VSX register. Keys for this algorithm can be 128, 192, or 256 bits long.

The POWER8 architecture lets you run the critical steps of the AES algorithm in-core with five instructions, in particular the key expansion and the encryption/decryption rounds.

Vector Multimedia eXtension (VMX)

One of the POWER8 enhancements is an integrated, multi-pipeline SIMD vector unit with 32 128-bit VMX vector registers. Vector data can be represented in different ways, as shown in the following table.

    Element   Count   Byte offsets covered by each element
    qword     1       0x00-0x0f
    dword     2       0x00-0x07, 0x08-0x0f
    word      4       0x00-0x03, 0x04-0x07, 0x08-0x0b, 0x0c-0x0f
    hword     8       0x00-0x01, 0x02-0x03, 0x04-0x05, 0x06-0x07,
                      0x08-0x09, 0x0a-0x0b, 0x0c-0x0d, 0x0e-0x0f

    * hword = 2 bytes, word = 4 bytes, dword = 8 bytes, qword = 16 bytes.

For the purpose of using AES, consider using the full 16-byte vector, which holds one complete round key or the state/cipher text during the encryption/decryption steps.

AES Algorithm

The AES algorithm can be split into the following steps:

• KeyExpansion/Generate Round Keys
    • RotWord
    • SubBytes
    • Rcon Xor
• Initial Round
    • AddRoundKey
• Rounds
    • SubBytes
    • ShiftRows
    • MixColumns
    • AddRoundKey
• Final Round (no MixColumns)
    • SubBytes
    • ShiftRows
    • AddRoundKey

Figure 1 shows an overview of the algorithm. Each of the steps is described in the following sections.

Figure 1. AES flowchart

Key Expansion/Generate Round Keys

The Key Expansion/Generate Round Keys step starts with a given key and expands it to multiple round keys. 128-bit keys are expanded to 11 keys, 192-bit keys to 13 keys, and 256-bit keys to 15 keys. The first expanded key is derived from the last word of the original key: that word is processed in three steps (RotWord, SubBytes, Rcon Xor) to produce a new word, which is then used to generate all 4 words of the expanded key. The next round uses the last word of the key generated in the previous round. This process repeats until all keys are generated.
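
The round-key counts quoted above follow from the round counts that FIPS-197 [3] defines (Nr = Nk + 6, where Nk is the key length in 32-bit words), plus one extra key for the initial AddRoundKey. A minimal Python sketch of that arithmetic is shown here; the function name round_key_count is only illustrative and is not part of any library.

```python
def round_key_count(key_bits):
    """Return how many 16-byte round keys AES needs for a given key size.

    Hypothetical helper for illustration: FIPS-197 defines Nr = Nk + 6
    rounds, and one additional round key is used before the first round.
    """
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192 or 256 bits long")
    nk = key_bits // 32          # key length in 32-bit words
    rounds = nk + 6              # Nr: 10, 12 or 14
    return rounds + 1            # one key per round plus the initial key


if __name__ == "__main__":
    for bits in (128, 192, 256):
        keys = round_key_count(bits)
        # Each round key is 16 bytes, that is four 32-bit words.
        print(f"AES-{bits}: {keys} round keys, {4 * keys} expanded words")
```

Running the script lists 11, 13, and 15 round keys, matching the counts given in the text.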

Keep in mind that regardless of the size of the initial key, the round keys that AES uses internally are always 16 bytes long.

RotWord Step

The Rotate Word (RotWord) step takes a word and rotates its bytes one position to the left:

    Bytes 0 1 2 3 -> 1 2 3 0

Example: Given the word:

    79 d2 85 46

The RotWord step would result in:

    d2 85 46 79

SubBytes Step

The SubBytes step uses a substitution box (S-Box) that replaces each byte of a word by a value derived from the byte's multiplicative inverse in the Galois field GF(2^8) = GF(2)[x]/(x^8 + x^4 + x^3 + x + 1) [2]. For decryption, it uses an Inverse S-Box.

Figure 2. S-Box

For example, using the S-Box, byte 0x9a is replaced by 0xb8. This step is done internally by the vcipher and vcipherlast instructions. Inverse S-Box operations are done internally by the vncipher and vncipherlast instructions.

For example, given the word:

    af 7f 67 98

The SubBytes(word) operation would yield:

    79 d2 85 46

Rcon Xor Step

Rcon, the round constant, is 2 raised to a power in GF(2^8) [3]. In AES, the exponent is derived from the round number: rcon(i) is 2 to the power i - 1. The number of round constants needed depends on the size of the key. For 128-bit keys, AES requires up to rcon(10); for 192-bit keys, up to rcon(8); and for 256-bit keys, up to rcon(7). Thus, to cover all AES key sizes, we need rcon 1 to 10 saved:

    Rcon(1-10) = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36

To generate the first expanded key, the rcon is added with an xor: key-word xor rcon(1), that is, key-word xor 01 00 00 00.
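
The three steps above can be sketched in plain Python to make the byte-level behavior easier to check. This is a software model under stated assumptions, not how the POWER8 hardware computes anything: the names gf_mul, build_sbox, rot_word, sub_word, rcon, and key_schedule_core are illustrative, and the S-Box is built as FIPS-197 [3] defines it (multiplicative inverse followed by a fixed affine transformation).

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # reduce when the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Build the AES S-Box: inverse in GF(2^8), then the affine transform."""
    sbox = []
    for value in range(256):
        # Brute-force inverse; 0 has no inverse and maps to 0.
        inv = next((c for c in range(1, 256) if gf_mul(value, c) == 1), 0)
        out = 0x63
        for shift in range(5):  # inv xor its left rotations by 1..4 bits
            out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(out)
    return sbox


SBOX = build_sbox()


def rot_word(word):
    """RotWord: bytes 0 1 2 3 -> 1 2 3 0."""
    return word[1:] + word[:1]


def sub_word(word):
    """SubBytes applied to each byte of a 4-byte word."""
    return bytes(SBOX[b] for b in word)


def rcon(i):
    """Round constant for round i (1-based): 2^(i-1) in GF(2^8)."""
    value = 1
    for _ in range(i - 1):
        value = gf_mul(value, 2)
    return value


def key_schedule_core(word, round_number):
    """RotWord, then SubBytes, then xor Rcon into the first byte."""
    out = bytearray(sub_word(rot_word(word)))
    out[0] ^= rcon(round_number)
    return bytes(out)


if __name__ == "__main__":
    print(rot_word(bytes.fromhex("79d28546")).hex())   # d2854679
    print(sub_word(bytes.fromhex("af7f6798")).hex())   # 79d28546
    print(", ".join(f"0x{rcon(i):02x}" for i in range(1, 11)))
```

The first two printed values reproduce the RotWord and SubBytes examples from the text, and the last line reproduces the Rcon(1-10) list.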

Here's a full example of how the key expansion works.

    Key: 0xac 0x2b 0x3c 0xdd 0xee 0x04 0x11 0x44 0xa1 0x4b 0x5c 0xd1 0x6a 0xb9 0x1c 0xdd

Splitting the key into words for the first key in the key expand buffer:

    W0 = ac 2b 3c dd
    W1 = ee 04 11 44
    W2 = a1 4b 5c d1
    W3 = 6a b9 1c dd

The key expand algorithm always uses the last word to execute the steps, in our case W3:

    X1 <- RotWord(W3)            X1 = b9 1c dd 6a
    Y1 <- SubBytes(X1)           Y1 = 56 9c c1 02
    Rcon(1) = 01 00 00 00
    Z1 <- Y1 xor 01 00 00 00     Z1 = 57 9c c1 02

Second key in the key expand buffer:

    W4 = (W0 xor Z1) = fb b7 fd df
    W5 = (W4 xor W1) = 15 b3 ec 9b
    W6 = (W5 xor W2) = b4 f8 b0 4a
    W7 = (W6 xor W3) = de 41 ac 97

    X2 <- RotWord(W7)            X2 = 41 ac 97 de
    Y2 <- SubBytes(X2)           Y2 = 83 91 88 1d
    Rcon(2) = 02 00 00 00
    Z2 <- Y2 xor 02 00 00 00     Z2 = 81 91 88 1d

Third key in the key expand buffer:

    W8  = (W4 xor Z2)   = 7a 26 75 c2
    W9  = (W8 xor W5)   = 6f 95 99 59
    W10 = (W9 xor W6)   = db 6d 29 13
    W11 = (W10 xor W7)  = 05 2c 85 84
    (...)

Now let's look at creating the expanded keys by using the POWER8 vector and AES instructions. First, we need to look at the AES instructions. As previously described, POWER8 comes with five AES instructions: vcipher, vcipherlast, vncipher, vncipherlast, and vsbox. Let's focus on the first two:

    vcipher VRT,VRA,VRB

    State    ← VR[VRA]
    RoundKey ← VR[VRB]
    vtemp1   ← SubBytes(State)
    vtemp2   ← ShiftRows(vtemp1)
    vtemp3   ← MixColumns(vtemp2)
    VR[VRT]  ← vtemp3 ^ RoundKey

State is the cipher text in the current round, or the plain text in the first step of AES. RoundKey, as the name suggests, is the key for the current round. Both values are 16 bytes long, which is the AES block size. Because VMX vectors are 16 bytes, they can hold a full-size round key or cipher text block.
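
As a reading aid, the vcipher pseudocode above can be modelled in plain Python. The sketch below is an assumption-laden software model, not an emulator: the function names are illustrative, the state is treated as 16 bytes in the column order that FIPS-197 [3] uses, and the real register layout depends on how the data is loaded and on endianness. It also models vcipherlast, which the AES outline describes as the final round without MixColumns. The test data is taken from the worked example in FIPS-197 Appendix B.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """AES S-Box: inverse in GF(2^8) followed by the affine transform."""
    sbox = []
    for value in range(256):
        inv = next((c for c in range(1, 256) if gf_mul(value, c) == 1), 0)
        out = 0x63
        for shift in range(5):
            out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(out)
    return sbox


SBOX = build_sbox()


def sub_bytes(state):
    return [SBOX[b] for b in state]


def shift_rows(state):
    # state[4*c + r] is row r of column c; row r rotates left by r columns.
    return [state[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def mix_columns(state):
    out = []
    for c in range(4):
        col = state[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gf_mul(col[r], 2) ^ gf_mul(col[(r + 1) % 4], 3)
                       ^ col[(r + 2) % 4] ^ col[(r + 3) % 4])
    return out


def vcipher(state, round_key):
    """Model of one full round: SubBytes, ShiftRows, MixColumns, xor key."""
    temp = mix_columns(shift_rows(sub_bytes(state)))
    return bytes(t ^ k for t, k in zip(temp, round_key))


def vcipherlast(state, round_key):
    """Model of the final round: the same steps without MixColumns."""
    temp = shift_rows(sub_bytes(state))
    return bytes(t ^ k for t, k in zip(temp, round_key))


if __name__ == "__main__":
    # Round 1 of the FIPS-197 Appendix B example.
    state = bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808")
    key1 = bytes.fromhex("a0fafe1788542cb123a339392a6c7605")
    print(vcipher(state, key1).hex())  # a49c7ff2689f352b6b5bea43026a5049
```

Chaining one initial xor, nine vcipher calls, and one vcipherlast call with the eleven round keys gives a complete AES-128 block encryption in this model.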

    vcipherlast VRT,VRA,VRB

    State    ← VR[VRA]
    RoundKey ← VR[VRB]
    vtemp1   ← SubBytes(State)
    vtemp2   ← ShiftRows(vtemp1)
    VR[VRT]  ← vtemp2 ^ RoundKey

vcipherlast is the same as vcipher, except that it omits one step, MixColumns. vncipher and vncipherlast are exactly the same as vcipher and vcipherlast, except that they use the inverse steps and are intended for decryption.

POWER8 does not have a specific instruction for Key Expand, but vcipherlast, with some additional steps, can be used to perform the Key Expand operations. The following steps show an example of how to use vcipherlast to perform an Expand Key operation.

In this example, the rcon is already loaded into a vector; you may want to look in the vmx-crypto driver for more information [5]. Also note that in PowerPC assembly code, registers are referenced by numbers, not by names. For example, vperm 3,1,1,5 uses vr3 as the result register and vr1 and vr5 as parameters. See [4] for more details.

    /**
     * vr0 is assumed to hold all zeros (it supplies the zero fill for vsldoi)
     * vr1 is the first key = 0xac 0x2b 0x3c 0xdd 0xee 0x04 0x11 0x44
     *                        0xa1 0x4b 0x5c 0xd1 0x6a 0xb9 0x1c 0xdd
     * vr5 is a permute mask that rotates the last word of the key and
     *     copies it into all four words of the result:
     *     vr5 = 0x0d0e0f0c 0d0e0f0c 0d0e0f0c 0d0e0f0c
     * vr3 is the destination for the word being processed
     * vr4 is the first rcon loaded: 01 00 00 00 01 00 00 00 01 00 00 00
     *                               01 00 00 00
     * vr6 is a scratch register
     **/
    Loop128:
     1  vperm       3,1,1,5
     2  vsldoi      6,0,1,12
     3  vcipherlast 3,3,4
     4  vxor        1,1,6
     5  vsldoi      6,0,6,12
     6  vxor        1,1,6
     7  vsldoi      6,0,6,12
     8  vxor        1,1,6
     9  vadduwm     4,4,4
    10  vxor        1,1,3
    11  bdnz        Loop128

Line 1 applies the mask to the key. After the vperm instruction, vr3 holds RotWord(W3) in every word:

    0xb91cdd6a 0xb91cdd6a 0xb91cdd6a 0xb91cdd6a

Line 2 shifts the key right by one word, filling with zeros, so that vr6 holds:

    0x00000000 0xac2b3cdd 0xee041144 0xa14b5cd1

Line 3 calls vcipherlast to execute SubBytes, ShiftRows, and an xor with Rcon(n). Because all four words of vr3 are identical, every row of the AES state consists of four equal bytes, so by the definition of the ShiftRows function in Power ISA 2.07B [4], ShiftRows has no effect when applied to this vector.
In this particular scenario, it performs only SubBytes and the xor with Rcon(n). In other words, it generates the word Z, here Z1. Thus, after vcipherlast we have:

    Z1: 0x579cc102 0x579cc102 0x579cc102 0x579cc102

Lines 4 to 8 perform the math behind the generation of the key words in the key expansion algorithm:

    W4 = (W0 xor Z1)
    W5 = (W1 xor W4)
    W6 = (W2 xor W5)
    W7 = (W3 xor W6)

This can be rewritten as:

    W4' = W0
    W5' = W1 xor W0
    W6' = W2 xor W1 xor W0
    W7' = W3 xor W2 xor W1 xor W0

where ' marks a temporary word before Z1 is applied. The detailed operations of lines 4 to 8 are:

              W0          W1          W2          W3
    vr1   0xac2b3cdd  0xee041144  0xa14b5cd1  0x6ab91cdd
                          W0          W1          W2
    vr6   0x00000000  0xac2b3cdd  0xee041144  0xa14b5cd1
    ------------------------------------------------------
              W4'         W5'      temp-W6'    temp-W7'
    vr1   0xac2b3cdd  0x422f2d99  0x4f4f4d95  0xcbf2400c
                                      W0          W1
    vr6   0x00000000  0x00000000  0xac2b3cdd  0xee041144
    ------------------------------------------------------
              W4'         W5'         W6'      temp-W7'
    vr1   0xac2b3cdd  0x422f2d99  0xe3647148  0x25f65148
                                                  W0
    vr6   0x00000000  0x00000000  0x00000000  0xac2b3cdd
    ------------------------------------------------------
              W4'         W5'         W6'         W7'
    vr1   0xac2b3cdd  0x422f2d99  0xe3647148  0x89dd6d95

Line 9 prepares the rcon for the next round by adding vr4 to itself, which doubles each word:

    vr4   02 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00

Plain doubling produces Rcon(2) through Rcon(8); Rcon(9) = 0x1b and Rcon(10) = 0x36 from the list shown earlier require loading 0x1b first.

Finally, Line 10 applies the Z1 words to generate the first expanded key:

              W4'         W5'         W6'         W7'
    vr1   0xac2b3cdd  0x422f2d99  0xe3647148  0x89dd6d95
              Z1          Z1          Z1          Z1
    vr3   0x579cc102  0x579cc102  0x579cc102  0x579cc102
    ------------------------------------------------------
              W4          W5          W6          W7
    vr1   0xfbb7fddf  0x15b3ec9b  0xb4f8b04a  0xde41ac97

Line 11 decrements the count register and jumps back to the beginning of the loop, repeating all previous steps once for each round key that is still needed.

In comparison with Key Expand, the AES rounds are simple because they require only the expanded keys and the data to be encrypted or decrypted. Following is a simple example that shows how to use the in-core instructions. For a more complete code example, see Appendix A.

    /**
     * vr0 is our state, the vector register that holds the
     *     plaintext block.
     * vr1 is key0, the key provided by the user (the first key)
     * vr2 is the second key, generated by expand key
     * vr3 is the third, and so on up to vr11
     **/
     1  vxor        0,0,1
     2  vcipher     0,0,2
     3  vcipher     0,0,3
     4  vcipher     0,0,4
     5  vcipher     0,0,5
    ...
    11  vcipherlast 0,0,11

Line 1 adds key0 to the initial state of AES (the initial AddRoundKey). Line 2 is the first round of AES, with key1. Line 3 is the second round of AES, and so on. Line 11 is the last round of AES, with the last key, key10.

Kernel Driver that uses POWER8 in-core instructions

vmx-crypto is the kernel driver that supports in-core AES for POWER8. Initially, the driver supports AES in CBC and CTR modes. It also supports the GHASH algorithm. It is available in kernel 4.1 and later, and it is both little- and big-endian capable.

To verify whether your kernel is using vmx-crypto, you can run lsmod | grep vmx. If your machine is not using it already, you can run modprobe vmx-crypto and then verify again with lsmod, or run cat /proc/crypto | less and look for the p8 prefix. The algorithms/modes supported by the driver are:

• name   : ghash
  driver : p8_ghash
  module : vmx_crypto
• name   : aes
  driver : p8_aes
  module : vmx_crypto
• name   : cbc(aes)
  driver : p8_aes_cbc
  module : vmx_crypto
• name   : ctr(aes)
  driver : p8_aes_ctr
  module : vmx_crypto

POWER8 in-core instructions in user space

Many projects use OpenSSL as their crypto provider. Starting with Version 1.0.2, OpenSSL implements its AES cryptography by using the in-core POWER8 instructions. If enabled on the running system, this version of OpenSSL (and later) provides better performance by using the POWER8 VMX assembly code and hardware optimization. Because so many applications use OpenSSL for their cryptography, this enhanced OpenSSL enables a wide variety of applications to take advantage of the POWER8 AES instructions.
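
The /proc/crypto check described in the kernel driver section can also be scripted. The following Python sketch parses the name/driver/module fields and lists the entries that belong to the vmx_crypto module. The function names and the SAMPLE text are illustrative assumptions (the sample mimics the field layout only and is not captured from a real system); on a machine without /proc/crypto the script falls back to the sample.

```python
import re
from pathlib import Path

# Illustrative sample that mimics the /proc/crypto field layout.
SAMPLE = """\
name         : cbc(aes)
driver       : p8_aes_cbc
module       : vmx_crypto

name         : aes
driver       : aes-generic
module       : kernel

name         : ghash
driver       : p8_ghash
module       : vmx_crypto
"""


def parse_proc_crypto(text):
    """Split /proc/crypto-style text into one dict per algorithm entry."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def vmx_crypto_entries(text):
    """Return (name, driver) pairs provided by the vmx_crypto module."""
    return [(e.get("name"), e.get("driver"))
            for e in parse_proc_crypto(text)
            if e.get("module") == "vmx_crypto"]


if __name__ == "__main__":
    proc = Path("/proc/crypto")
    text = proc.read_text() if proc.exists() else SAMPLE
    found = vmx_crypto_entries(text)
    if not found:
        print("no vmx_crypto algorithms registered")
    for name, driver in found:
        print(f"{name:12} {driver}")
```

An empty result on a POWER8 system suggests that the module is not loaded, in which case modprobe vmx-crypto is the next step, as described above.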

Conclusion

In-core instructions on POWER8 systems give you the ability to implement a cryptography stack that uses the power of the POWER8 hardware directly, which helps your code to perform well on cryptography benchmark tests.

References

[1] Brian Hall; Ryan Arnold; Peter Bergner; Wainer dos Santos Moschetta; Robert Enenkel; Pat Haugen; Michael R. Meissner; Alex Mericas; Philipp Oehler; Berni Shiefer; Brian F. Veale; Suresh Warrier; Daniel Zabawa; Adhemerval Zanella. Performance Optimization and Tuning Techniques for IBM Processors, including IBM POWER8. An IBM Redbooks publication, 2014.

[2] G. David Forney. Principles of Digital Communication II, Spring 2005. Introduction to Finite Fields, 2005.

[3] Federal Information Processing Standards Publication. Announcing the Advanced Encryption Standard (AES), 2001.

[4] IBM. Power ISA 2.07B. Vector Facilities, 2015. p. 217. Available at: https://www.power.org/documentation/power-isa-v-2-07b/

[5] Cerri, M. VMX-crypto driver. Available at: http://lxr.free-electrons.com/source/drivers/crypto/vmx/

Appendix A

    /**
     * vmx-crypto AES encrypt/OpenSSL aes encryption
     * At kernel: /drivers/crypto/vmx/aesp8-ppc.S
     * At OpenSSL: /crypto/aes/aesp8-ppc.s
     **/
    .aes_p8_encrypt:
        lwz         6,240(5)        /* r6 = number of rounds, read from the key schedule at r5 */
        lis         0,0xfc00
        mfspr       12,256          /* save VRSAVE (SPR 256) in r12 */
        li          7,15
        mtspr       256,0

        lvx         0,0,3           /* load the input block from r3 (may be unaligned) */
        neg         11,4
        lvx         1,7,3
        lvsl        2,0,3

        lvsl        3,0,11
        li          7,16
        vperm       0,0,1,2         /* vr0 = input block */
        lvx         1,0,5
        lvsl        5,0,5
        srwi        6,6,1
        lvx         2,7,5
        addi        7,7,16
        subi        6,6,1
        vperm       1,1,2,5         /* vr1 = first round key */

        vxor        0,0,1           /* initial AddRoundKey */
        lvx         1,7,5
        addi        7,7,16
        mtctr       6               /* loop count = rounds/2 - 1 */

    .Loop_enc:                      /* two AES rounds per iteration */
        vperm       2,2,1,5
        vcipher     0,0,2
        lvx         2,7,5
        addi        7,7,16
        vperm       1,1,2,5
        vcipher     0,0,1
        lvx         1,7,5
        addi        7,7,16
        bdnz        .Loop_enc

        vperm       2,2,1,5
        vcipher     0,0,2
        lvx         2,7,5
        vperm       1,1,2,5
        vcipherlast 0,0,1           /* final round, no MixColumns */

        vspltisb    2,-1
        vxor        1,1,1
        li          7,15
        vperm       2,1,2,3

        lvx         1,0,4           /* merge the result into the output at r4 */
        vperm       0,0,0,3
        vsel        1,1,0,2
        lvx         4,7,4
        stvx        1,0,4
        vsel        0,0,4,2
        stvx        0,7,4

        mtspr       256,12          /* restore VRSAVE */
        blr

    /**
     * vmx-crypto AES decrypt
     * At kernel: /drivers/crypto/vmx/aesp8-ppc.S
     * At OpenSSL: /crypto/aes/aesp8-ppc.s
     **/
    .aes_p8_decrypt:
        lwz         6,240(5)        /* r6 = number of rounds, read from the key schedule at r5 */
        lis         0,0xfc00
        mfspr       12,256          /* save VRSAVE (SPR 256) in r12 */
        li          7,15
        mtspr       256,0

        lvx         0,0,3           /* load the input block from r3 (may be unaligned) */
        neg         11,4
        lvx         1,7,3
        lvsl        2,0,3

        lvsl        3,0,11
        li          7,16
        vperm       0,0,1,2         /* vr0 = input block */
        lvx         1,0,5
        lvsl        5,0,5
        srwi        6,6,1
        lvx         2,7,5
        addi        7,7,16
        subi        6,6,1
        vperm       1,1,2,5         /* vr1 = first round key */

        vxor        0,0,1           /* initial AddRoundKey */
        lvx         1,7,5
        addi        7,7,16
        mtctr       6               /* loop count = rounds/2 - 1 */

    .Loop_dec:                      /* two inverse AES rounds per iteration */
        vperm       2,2,1,5
        vncipher    0,0,2
        lvx         2,7,5
        addi        7,7,16
        vperm       1,1,2,5
        vncipher    0,0,1
        lvx         1,7,5
        addi        7,7,16
        bdnz        .Loop_dec

        vperm       2,2,1,5
        vncipher    0,0,2
        lvx         2,7,5
        vperm       1,1,2,5
        vncipherlast 0,0,1          /* final inverse round */

        vspltisb    2,-1
        vxor        1,1,1
        li          7,15
        vperm       2,1,2,3

        lvx         1,0,4           /* merge the result into the output at r4 */
        vperm       0,0,0,3
        vsel        1,1,0,2
        lvx         4,7,4
        stvx        1,0,4
        vsel        0,0,4,2
        stvx        0,7,4

        mtspr       256,12          /* restore VRSAVE */
        blr

© Copyright IBM Corporation 2015 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)