I have kept every single piece of spam and virus email since mid-1997.
Occasionally, it comes in handy, for example, to add a
naïve Bayesian spam filter to my custom-written email filter.
And occasionally I use it to build a chart of spam and virus email.
The following chart plots every single piece of spam and virus email
that arrived at my work email address since April 1997.
Blue dots are spam and red dots are email viruses.
The horizontal axis is time, and the vertical axis is size of mail
(on a logarithmic scale).
Darker dots represent more messages.
(Messages larger than 1MB have been treated as if they were 1MB.)
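The binning behind a chart like this can be sketched roughly as follows. This is only an illustration of the description above, not the script that produced the chart: the function name `bin_messages`, the bucket counts, the 1-byte lower clamp, and the `(timestamp, size)` input format are all assumptions.

```python
import math
from collections import Counter

MAX_SIZE = 1024 * 1024  # messages larger than 1MB are treated as 1MB


def bin_messages(messages, start, end, x_bins=100, y_bins=50):
    """Count messages per (time, log-size) cell.

    messages: iterable of (timestamp, size_in_bytes) pairs (assumed format).
    start, end: range covered by the horizontal axis, in timestamp units.
    Returns a Counter mapping (x, y) cell -> message count; a cell with a
    higher count would be drawn as a darker dot.
    """
    counts = Counter()
    log_max = math.log10(MAX_SIZE)
    for timestamp, size in messages:
        if not (start <= timestamp < end):
            continue  # outside the charted time range
        size = min(max(size, 1), MAX_SIZE)  # clamp to [1 byte, 1MB]
        x = int((timestamp - start) / (end - start) * x_bins)
        # Logarithmic vertical axis; the top edge falls into the last row.
        y = min(int(math.log10(size) / log_max * y_bins), y_bins - 1)
        counts[(x, y)] += 1
    return counts
```

Running the spam and the virus mail through a function like this separately would give one set of counts to draw in blue and another to draw in red.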
Note that this chart is not scientific. Only mail that makes it past
the corporate spam and virus filters shows up on the chart.
Why does so much spam and virus mail get through the filters?
Because corporate mail filters cannot take the risk of accidentally
classifying valid business email as spam. Consequently, the filters
have to make sure to remove something only if they have extremely high
confidence that the message is unwanted.
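The trade-off can be illustrated with a small sketch. Nothing here reflects how the corporate filters actually work: the scores, the thresholds, and the function name `filter_outcomes` are made up for illustration.

```python
def filter_outcomes(scored_messages, threshold):
    """Return (spam_delivered, legitimate_discarded) for a given threshold.

    scored_messages: iterable of (spam_score, is_spam) pairs, where
    spam_score is a hypothetical confidence in [0, 1] that the message
    is unwanted. A message is discarded only when its score reaches
    the threshold.
    """
    spam_delivered = 0
    legitimate_discarded = 0
    for score, is_spam in scored_messages:
        discarded = score >= threshold
        if is_spam and not discarded:
            spam_delivered += 1
        elif not is_spam and discarded:
            legitimate_discarded += 1
    return spam_delivered, legitimate_discarded


# Made-up scores, purely for illustration.
sample = [(0.999, True), (0.97, True), (0.80, True),
          (0.92, False), (0.10, False)]

print(filter_outcomes(sample, 0.90))  # looser threshold
print(filter_outcomes(sample, 0.99))  # stricter threshold
```

With these made-up numbers, the looser threshold throws away one legitimate message, while the stricter one loses none but lets more spam through. That is the direction a filter has to lean when it cannot risk discarding valid business mail.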
Okay, enough dawdling. Let's see the chart.
Overall statistics and extrema:
Subject: About your account...
Content-Type: text/plain; charset=ISO-8859-1
Things you can see on the chart:
As a comparison, here's the same chart based on email received
at one of my inactive personal email addresses.
This particular email address has been inactive since 1995;
all the mail it gets is therefore from harvesting done prior to 1995.
(That's why you don't see any red dots: None of my friends have this address
in their address book since it is inactive.)
The graph doesn't go back as far because
I didn't start saving spam from this address until late 2000.
Received: from dhcp065-025-005-032.neo.rr.com ([126.96.36.199]) by ...
Sat, 24 Jul 2004 12:30:35 -0700
I cannot explain the mysterious "quiet period" at the beginning
of 2004. Perhaps my ISP instituted a filter for a while?
Perhaps I didn't log on often enough to pick up my spam and it
expired on the server? I don't know.
One theory is that the lull was due to uncertainty created by the
CAN-SPAM Act, which took effect on January 1, 2004.
I don't buy this theory since there was no significant corresponding
lull at my other email account, and follow-up reports indicate
that CAN-SPAM was widely disregarded.
Even in its heyday, compliance was only 3%.
Curiously, spam size for this particular account
has been trending downward since 2002.
In the previous chart, you could see a clear upward trend since 1997.
My theory is that since this second dataset is more focused on current
trends, it missed out on the growth trend in the late 1990s
and instead is seeing the shift in spam from text to <IMG> tags.