Canonicalization is the process of preparing a message for signing. It is necessary because of the way email is handled in transit by various mail servers. For example, some mail relays handle white space and line wraps just fine; others strip them or insert them. All email was once 7-bit ASCII and now most of it is 8-bit ASCII. What happens if the message is forwarded through one and then the other?
The intent of canonicalization is to make a minimal transformation of the message for the purpose of signing it (the actual message itself is not changed). Thus, DomainKeys specifies two methods of canonicalizing the message (man, there are a lot of red, wavy underlines in my LiveWriter... let's hit Ignore All... ah, that's better, maybe I should stop inventing words as I go along).
From RFC 4870:
The "simple" Canonicalization Algorithm
If the body consists entirely of empty lines, then the header/body line is similarly ignored.
For those of you who don't understand geek-speak, CRLF means "carriage return, line feed", the two-character sequence the RFC uses to end a line. It is the equivalent of hitting "enter" on your keyboard to wrap the line.
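To make that a little more concrete, here is a minimal Python sketch of the "simple" algorithm. It assumes the steps as I read them in RFC 4870: remove the local line terminator, append CRLF, and ignore all trailing empty lines. The function name canonicalize_simple is made up for illustration, and the sketch leaves out the "h" tag header selection entirely, so check it against the RFC before relying on it.

```python
def canonicalize_simple(lines):
    """Sketch of "simple" canonicalization over the raw lines of a message.

    Assumes `lines` holds every line of the message in order (headers,
    the blank header/body separator, then the body), each possibly still
    carrying its local line terminator.
    """
    # Remove whatever local line terminator each line carries.
    stripped = [line.rstrip("\r\n") for line in lines]

    # Ignore all trailing empty lines. If the body is nothing but empty
    # lines, this also drops the blank header/body separator line.
    while stripped and stripped[-1] == "":
        stripped.pop()

    # Append CRLF to every remaining line.
    return "".join(line + "\r\n" for line in stripped)
```

Note that a message whose body is entirely empty lines comes out as just its headers, which is the case the last sentence of the RFC text covers.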
The "nofws" Canonicalization Algorithm
The "No Folding Whitespace" algorithm (nofws) is more complicated than the "simple" algorithm for two reasons; folding whitespace is removed from all lines and header continuation lines are unwrapped.
If the body consists entirely of empty lines, then the header/body line is similarly ignored.
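Here is a similarly rough Python sketch of "nofws", following the two differences named above: header continuation lines are unwrapped, and folding whitespace is removed from every line. The whitespace set used here (tab, LF, CR, space) is my reading of RFC 4870, and the function name canonicalize_nofws and its two-argument shape are invented for illustration. As before, "h" tag handling is left out.

```python
# Folding whitespace as assumed here: HTAB, LF, CR, SP.
FWS = "\t\n\r "


def canonicalize_nofws(header_lines, body_lines):
    """Sketch of "nofws" canonicalization for one message."""
    # Unwrap header continuation lines: a header line that starts with
    # a space or tab is glued onto the header line before it.
    headers = []
    for line in header_lines:
        line = line.rstrip("\r\n")
        if headers and line[:1] in (" ", "\t"):
            headers[-1] += line
        else:
            headers.append(line)

    # Headers, the blank header/body separator line, then the body.
    lines = headers + [""] + [line.rstrip("\r\n") for line in body_lines]

    # Remove all folding whitespace from every line.
    strip_fws = str.maketrans("", "", FWS)
    lines = [line.translate(strip_fws) for line in lines]

    # Ignore trailing empty lines. The emptiness test happens after the
    # whitespace removal, so whitespace-only lines count as empty.
    while lines and lines[-1] == "":
        lines.pop()

    return "".join(line + "\r\n" for line in lines)
```

A header folded across two lines and the same header on one line come out identical, which is the whole point: a relay that rewraps headers or pads lines with spaces does not change what gets signed.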
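So, we see that canonicalization puts the headers and body of the message into a standard form before signing. The receiver applies the same canonicalization to what arrives and checks the signature against the result, which verifies that the contents before sending are the same as after receiving, even if a relay rewrapped lines or changed whitespace along the way.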
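A toy illustration of why this works, in Python. Everything here is an assumption made for the sake of the example: toy_canonical is a stripped-down stand-in for "nofws", and a bare SHA-1 digest stands in for the signature. A real DomainKeys signer would sign with its private key, and the receiver would verify with the public key published in DNS.

```python
import hashlib


def toy_canonical(message):
    """Whitespace-insensitive form of a message (illustrative only)."""
    out = []
    in_headers = True
    for line in message.splitlines():
        if in_headers and out and line[:1] in (" ", "\t"):
            out[-1] += line  # unwrap a folded header line
            continue
        if line == "":
            in_headers = False  # blank line ends the headers
        out.append(line)
    # Drop all whitespace inside each line, then trailing empty lines.
    out = ["".join(line.split()) for line in out]
    while out and out[-1] == "":
        out.pop()
    return "".join(line + "\r\n" for line in out).encode("ascii")


def toy_digest(message):
    """SHA-1 of the canonical form, standing in for a signature."""
    return hashlib.sha1(toy_canonical(message)).hexdigest()


sent = (
    "Subject: weekly status report\r\n"
    "From: a@example.com\r\n"
    "\r\n"
    "All systems nominal.\r\n"
)
# Same message after a relay rewrapped a header, changed the line
# endings, and added trailing whitespace and an empty line.
received = (
    "Subject: weekly\n status report\n"
    "From: a@example.com\n"
    "\n"
    "All systems nominal.  \n"
    "\n"
)
tampered = received.replace("nominal", "failing")

print(toy_digest(sent) == toy_digest(received))  # digests match
print(toy_digest(sent) == toy_digest(tampered))  # digests differ
```

The relay's cosmetic changes leave the digest untouched, while a change to the actual words does not.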
> All email was once 7-bit ASCII
In ancient history that was true. Subsequently (and still anciently) escape sequences were added to shift into and out of various encoding modes for most of the world's written languages.
> and now most of it is 8-bit ASCII.
There is no such thing as 8-bit ASCII. Most ANSI code pages are 8 bits, and ISO-8859 encoding schemes are 8 bits. Some ANSI code pages are mixtures of 8 and 16 bits. ISO-2022 schemes encode 16-bit characters in sequences of 8-bit bytes. UTF-8 does the same. Anyway, ASCII is still 7 bits; it is essentially a 7-bit ANSI code page, and even .NET gets that part right.
(I don't think UTF-16, as used by Windows NT-series APIs and .NET, can be sent through e-mail without the coding being changed to UTF-8.)
I just noticed that even today e-mail messages with ISO-8859-1 character set are encoded into sequences of 7-bit characters in order to traverse mail servers. On the web you can get encodings into sequences of 8-bit characters (UTF-8 and national encodings) but in e-mail they're encoded into sequences of 7-bit characters (UTF-7 and national encodings).
When creating a new message in Outlook Express, one of the encodings it offers is UTF-8, but I wonder how that fares when going through mail servers.