Terry Zink: Security Talk

Discussing Internet security in (mostly) plain English

Sender authentication part 30: The canonicalization process

Canonicalization is the process of preparing a message for signing.  It is necessary because of the way various mail servers handle email in transit.  For example, some mail relays handle white space and line wraps just fine; others strip them or insert their own.  All email was once 7-bit ASCII and now most of it is 8-bit ASCII.  What happens if the message is forwarded through one and then the other?

The intent of canonicalization is to make a minimal transformation of the message for the purpose of signing it (the actual message itself is not changed).  To that end, DomainKeys specifies two methods of canonicalizing the message (man, there are a lot of red, wavy underlines in my LiveWriter... let's hit Ignore All... ah, that's better, maybe I should stop inventing words as I go along).

From RFC 4870:

The "simple" Canonicalization Algorithm

  • Each line of the email is presented to the signing algorithm in the order it occurs in the complete email, from the first line of the headers to the last line of the body.
  • If the "h" tag is used, only those header lines (and their continuation lines if any) added to the "h" tag list are included.
  • The "h" tag only constrains header lines. It has no bearing on body lines, which are always included.
  • Remove any local line terminator.
  • Append CRLF to the resulting line.
  • All trailing empty lines are ignored. An empty line is a line of zero length after removal of the local line terminator.

If the body consists entirely of empty lines, then the header/body line is similarly ignored. For those of you who don't understand geek-speak, CRLF means "carriage-return line-feed", which is the equivalent of hitting "enter" on your keyboard to wrap the line.
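To make those steps concrete, here is a rough Python sketch of the "simple" algorithm as described in the list above. It is not taken from the RFC: the function name canonicalize_simple and the signed_headers parameter (standing in for the "h" tag list) are made up for illustration, and it assumes the local line terminator may be CRLF, a bare LF, or a bare CR.

```python
import re


def canonicalize_simple(message, signed_headers=None):
    """Sketch of "simple" canonicalization (hypothetical helper, not an official API).

    message        -- the raw email as a str
    signed_headers -- optional list of header names, standing in for the "h" tag
    Returns the canonical text; a signer would encode it to bytes before signing.
    """
    # Remove the local line terminator (assumed here to be CRLF, LF, or CR).
    lines = re.split(r"\r\n|\r|\n", message)

    # The first empty line is the header/body separator.
    try:
        sep = lines.index("")
    except ValueError:
        sep = len(lines)
    headers, body = lines[:sep], lines[sep + 1:]

    # If an "h" list is given, keep only those headers and their continuation lines.
    if signed_headers is not None:
        wanted = {name.lower() for name in signed_headers}
        kept, keep = [], False
        for line in headers:
            if line[:1] not in (" ", "\t"):
                # A new header starts here; decide whether it is in the list.
                keep = line.split(":", 1)[0].strip().lower() in wanted
            if keep:
                kept.append(line)
        headers = kept

    # Trailing empty lines are ignored.
    while body and body[-1] == "":
        body.pop()

    # If the body was entirely empty, the separator line is ignored too.
    out = headers + ([""] + body if body else [])

    # Append CRLF to every line that is presented to the signing algorithm.
    return "".join(line + "\r\n" for line in out)


example = "From: a@example.com\nSubject: hi\n there\nX-Junk: 1\n\nHello\n\n\n"
print(repr(canonicalize_simple(example, signed_headers=["From", "Subject"])))
```

Note that "simple" leaves whitespace inside each line alone, so a relay that re-wraps a header or adds a trailing space would still change the canonical form.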

The "nofws" Canonicalization Algorithm

The "No Folding Whitespace" algorithm (nofws) is more complicated than the "simple" algorithm for two reasons; folding whitespace is removed from all lines and header continuation lines are unwrapped.

  • Each line of the email is presented to the signing algorithm in the order it occurs in the complete email, from the first line of the headers to the last line of the body.
  • Header continuation lines are unwrapped so that header lines are processed as a single line.
  • If the "h" tag is used, only those header lines (and their continuation lines if any) added to the "h" tag list are included.
  • The "h" tag only constrains header lines. It has no bearing on body lines, which are always included.
  • For each line in the email, remove all folding whitespace characters. Folding whitespace is defined in RFC 2822 as being the decimal ASCII values 9 (HTAB), 10 (NL), 13 (CR), and 32 (SP).
  • Append CRLF to the resulting line.
  • Trailing empty lines are ignored. An empty line is a line of zero length after removal of the local line terminator. Note that the test for an empty line occurs after removing all folding whitespace characters.

If the body consists entirely of empty lines, then the header/body line is similarly ignored.
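Again as a rough sketch only, the "nofws" steps above might look like this in Python. The name canonicalize_nofws and the signed_headers parameter (standing in for the "h" tag list) are hypothetical, and the line terminator is assumed to be CRLF, LF, or CR. The two sample messages at the end are invented to show a relay re-wrapping a header and adding trailing whitespace.

```python
import re

# Folding whitespace per the list above: HTAB (9), NL (10), CR (13), SP (32).
FWS = re.compile(r"[\t\n\r ]")


def canonicalize_nofws(message, signed_headers=None):
    """Sketch of "nofws" canonicalization (hypothetical helper, not an official API)."""
    lines = re.split(r"\r\n|\r|\n", message)

    # The first empty line is the header/body separator.
    try:
        sep = lines.index("")
    except ValueError:
        sep = len(lines)
    raw_headers, body = lines[:sep], lines[sep + 1:]

    # Unwrap continuation lines so each header is processed as a single line.
    headers = []
    for line in raw_headers:
        if headers and line[:1] in (" ", "\t"):
            headers[-1] += line
        else:
            headers.append(line)

    # If an "h" list is given, keep only the listed headers.
    if signed_headers is not None:
        wanted = {name.lower() for name in signed_headers}
        headers = [h for h in headers
                   if h.split(":", 1)[0].strip().lower() in wanted]

    # Remove all folding whitespace from every line.
    headers = [FWS.sub("", h) for h in headers]
    body = [FWS.sub("", b) for b in body]

    # The empty-line test happens after whitespace removal.
    while body and body[-1] == "":
        body.pop()

    # If the body was entirely empty, the separator line is ignored too.
    out = headers + ([""] + body if body else [])
    return "".join(line + "\r\n" for line in out)


# The same message as sent, and as it might look after a relay re-wrapped it.
sent = "From: a@example.com\r\nSubject: quarterly report\r\n\r\nSee attached.\r\n"
relayed = "From: a@example.com\nSubject: quarterly\n report\n\nSee attached. \n\n"
print(canonicalize_nofws(sent) == canonicalize_nofws(relayed))
```

The two copies differ byte for byte, yet under this sketch they reduce to the same canonical text, which is the property a signature over the canonical form depends on.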

So, we see that canonicalization arranges the headers and body of the message into a standard form that the receiver can reconstruct later, which lets it verify that the contents after receiving the message are the same as they were before sending it.

  • PingBack from http://www.artofbam.com/wordpress/?p=2823

  • > All email was once 7-bit ASCII

    In ancient history that was true.  Subsequently (and still anciently) escape sequences were added to shift into and out of various encoding modes for most of the world's written languages.

    > and now most of it is 8-bit ASCII.

    There is no such thing as 8-bit ASCII.  Most ANSI code pages are 8-bits, and ISO-8859 encoding schemes are 8-bits.  Some ANSI code pages are mixtures of 8 and 16 bits.  ISO-2022 schemes encode 16-bit characters in sequences of 8-bit bytes.  UTF-8 does the same.  Anyway, ASCII is still 7-bits, ASCII is essentially a 7-bit ANSI code page, and even .Net gets that part right.

    (I don't think UTF-16, as used by Windows NT-series APIs and .Net, can be sent through e-mail without the coding being changed to UTF-8.)

  • I just noticed that even today e-mail messages with ISO-8859-1 character set are encoded into sequences of 7-bit characters in order to traverse mail servers.  On the web you can get encodings into sequences of 8-bit characters (UTF-8 and national encodings) but in e-mail they're encoded into sequences of 7-bit characters (UTF-7 and national encodings).

    When creating a new message in Outlook Express, it offers one possibility as encoding into UTF-8, but I wonder how that fares when going through mail servers.
