I’ve been working to optimize our AES implementation. BitLocker encrypts and decrypts more data than all other features in Windows Vista combined, so we have the most to gain from a fast implementation.

I won’t bore you with the details of optimizing AES in assembler. Let’s just say that the Pentium 4 has various quirks that make performance measurement and optimization harder than you’d think. And of course the code has to be fast on a variety of CPU architectures, so you cannot optimize for one CPU only.

Properly testing an AES implementation is hard if you don’t already have a good AES implementation. NIST included some test vectors in FIPS 197 but these are limited to one or two key/plaintext/ciphertext values for each key length. It is quite conceivable that you could have a small error in an implementation that would not be caught by the FIPS 197 test vectors.

So I defined my own test vectors using the construction the Twofish team used for the Twofish test vectors. This is actually a generic construction that you can use for any block cipher.

Notation:

b          # bytes in a plaintext or ciphertext block

k          # bytes in a key

E(K,P)  Block cipher encryption function, K = key, P = plaintext

D(K,P)  Block cipher decryption function, K = key, C = ciphertext

Test algorithm:

S = a string of k+b zero bytes.

repeat 1000 times:

n = length(S)

K = S[n-k..n-1]                         the last k bytes of S

P = S[n-k-b..n-k-1]                   the b bytes just before K

append E(K,E(K,P)) to S

The last b bytes of S are the test vector value.

When implementing this you don’t have to keep the whole S. You only need to store the last n+b bytes of S, plus another b bytes to store the data you are appending to S.

There is a reason why each iteration uses two consecutive block encryptions. In most software implementations each time you use a key you first compute an expanded key which is then used by the encryption function. If you only use the expanded key once the test would never notice if the encryption function were to accidentally damage the expanded key it is using. (That is an easy mistake to make in assembler.) By encrypting twice we make sure that the test is sensitive to this type of error.

One very nice property is that you can compute the recursion backwards. Given the last n+b bytes of S you can compute all the earlier values using the decryption function and check that you get the original all-zero starting string. That is how the decryption function is tested.

It is hard to see how an implementation could pass this test and still be wrong. There are 2000 encryptions in each test, for a total of at least 320000 S-box lookups. Even if the implementation uses a separate S-box for the last round (which is slightly different) there are 32000 lookups into that S-box, so every table entry will be used several times. There could be a hidden bug if your implementation uses very large tables with 2^16 entries, but I’ve never seen that done in practice.

Here are the actual values for AES (with b=16 of course):

AES-128 (k=16): 0xbd, 0x88, 0x3f, 0x01, 0x03, 0x5e, 0x58, 0xf4, 0x2f, 0x9d, 0x81, 0x2f, 0x2d, 0xac, 0xbc, 0xd8

AES-192 (k=24): 0x41, 0xaf, 0xb1, 0x00, 0x4c, 0x07, 0x3d, 0x92, 0xfd, 0xef, 0xa8, 0x4a, 0x4a, 0x6b, 0x26, 0xad

AES-256 (k=32): 0xc8, 0x4b, 0x0f, 0x3a, 0x2c, 0x76, 0xdd, 0x98, 0x71, 0x90, 0x0b, 0x07, 0xf0, 0x9b, 0xdd, 0x3e

Feel free to use this to test your own AES implementation. If you verify these test vectors against a known good AES implementation, please let me know and I’ll update this post with a list of people and implementations that have verified these vectors. (That should also catch the case of me making a mistake in this posting.) I can be reached at <my-first-name>@microsoft.com.

Update:

2006-06-06 Srdjan Sobajic verified the AES-128 test vector on an 8051 implementation.

2006-06-21 Dominik Weber verified the test vectors in C on an x86.

2006-12-06 Taral verified test vectors on AMD64 against OpenSSL.

2006-12-19 Sergey Gerasimenko verified the AES-128 test vectors on a Hitachi SH2 implementation.

- Niels Ferguson (developer)