How to search encrypted data is a question that came up several times on forums and I should have blogged about this issue earlier, but better later than never.

So the problem is that we have just encrypted that confidential column in our table, but we would also like to continue retrieving records based on its values, and we would like to do this in an efficient manner, which in a database means making use of an index. How can we achieve this?

Let's first start by figuring out why we can't directly index the encrypted data and search on it. The encryption algortihms in SQL Server 2005 are salted. By salting, I mean that the encryption algorithms are always using a random initialization vector (IV), which leads to the following property: encrypting twice the same piece of data using the same key will produce two different ciphertexts. The benefit of this is that if in a table we have a column value appearing several times and we encrypt that column, then the fact that the value repeats in the column is no longer apparent by examining the encrypted data. This adds additional protection against the analysis of the ciphertext (encrypted data), but also prevents the ability to search it efficiently. Also note that while the absence of salting would have permited equality searches, range searches would still not be possible as encryption algorithms are not preserving order.

So, how can we perform an equality search given that the encryption is salted and does not permit this? The solution is to add an additional column for holding hashes of the cleartext. The column containing hashes can then be indexed, and searching a piece of sensitive data can be done by first hashing it and then searching the hash value in the new column. This is all nice and easy, but it introduces the threat of a dictionary attack: an attacker could build hashes for as much secret data as he can generate or guess, and then he can verify whether this secret data exists in the table by comparing his hashes with the ones stored in the hash column. This is a significant threat.

How can we mitigate the threat of a dictionary attack against the secret data hashes? I will first describe the general idea behind the solution before moving to its description. The general idea is that we would want to add some secret to the hash computation, such that an attacker would not be able to construct a dictionary of hashes without knowing that secret. This mechanism is known as a MAC - message authentication code. There can be several ways of implementing MACs, but one solution can be to generate some random secret X and keep it stored encrypted some place in the database - it could be, for example, stored in a special table and encrypted with the same key that encrypts the secret data. When building the hashes of the secret data, this random secret can be added to the secret data before computing the hash. This would make the hash values dependent on this X, so even if an attacker knows an individual secret and the hashing algorithm, he still cannot compute the proper hash that we store, without knowing X. When searching data, we would hash X and the secret that we need to look up, and then we would search the resulting value in the hash column (which is now actually a MAC column). For more information on MACs, see for example "Applied Cryptography" by Bruce Schneier (MACs are covered in pages 455-459 of the second edition).

To conclude, encrypted data cannot be directly searched, but searches can be performed on MACs of the secret data that are stored in separate columns.