Holy cow, I wrote a book!
I have kept every single piece of spam and virus email since mid-1997. Occasionally, it comes in handy, for example, to add naïve Bayesian spam filter to my custom-written email filter. And occasionally I use it to build a chart of spam and virus email.
The following chart plots every single piece of spam and virus email that arrived at my work email address since April 1997. Blue dots are spam and red dots are email viruses. The horizontal axis is time, and the vertical axis is size of mail (on a logarithmic scale). Darker dots represent more messages. (Messages larger than 1MB have been treated as if they were 1MB.)
Note that this chart is not scientific. Only mail which makes it past the corporate spam and virus filters show up on the chart.
Why does so much spam and virus mail get through the filters? Because corporate mail filters cannot take the risk of accidentally classifying valid business email as spam. Consequently, the filters have to make sure to remove something only if they has extremely high confidence that the message is unwanted.
Okay, enough dawdling. Let's see the chart.
Overall statistics and extrema:
From: 15841. To: 15841. Subject: About your account... Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit P
Things you can see on the chart:
As a comparison, here's the same chart based on email received at one of my inactive personal email addresses.
This particular email address has been inactive since 1995; all the mail it gets is therefore from harvesting done prior to 1995. (That's why you don't see any red dots: None of my friends have this address in their address book since it is inactive.) The graph doesn't go back as far because I didn't start saving spam from this address until late 2000.
Received: from dhcp065-025-005-032.neo.rr.com ([65.25.5.32]) by ... Sat, 24 Jul 2004 12:30:35 -0700 X-Message-Info: 10
I cannot explain the mysterious "quiet period" at the beginning of 2004. Perhaps my ISP instituted a filter for a while? Perhaps I didn't log on often enough to pick up my spam and it expired on the server? I don't know.
One theory is that the lull was due to uncertainty created by the CAN-SPAM Act, which took effect on January 1, 2004. I don't buy this theory since there was no significant corresponding lull at my other email account, and follow-up reports indicate that CAN-SPAM was widely disregarded. Even in its heyday, compliance was only 3%.
Curiously, the trend in spam size for this particular account is that it has been going down since 2002. In the previous chart, you could see a clear upward trend since 1997. My theory is that since this second dataset is more focused on current trends, it missed out on the growth trend in the late 1990's and instead is seeing the shift in spam from text to <IMG> tags.