A visual history of spam (and virus) email
I have kept every single piece of spam and virus email since mid-1997.
Occasionally, it comes in handy, for example, to add a
naïve Bayesian spam filter to my custom-written email filter.
And occasionally I use it to build a chart of spam and virus email.
The following chart plots every single piece of spam and virus email
that arrived at my work email address since April 1997.
Blue dots are spam and red dots are email viruses.
The horizontal axis is time, and the vertical axis is size of mail
(on a logarithmic scale).
Darker dots represent more messages.
(Messages larger than 1MB have been treated as if they were 1MB.)
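As a rough sketch of how a chart like this could be laid out (this is not the code that produced the chart above), the snippet below maps each message to a pixel cell using a linear time axis and a logarithmic size axis, clamps oversized messages to 1MB, and counts messages per cell so that a renderer could draw busier cells darker. The names `plot_cell` and `density`, the canvas dimensions, the 100-byte lower bound, and the reading of "1MB" as 1,000,000 bytes are all assumptions made for illustration.

```python
import math
from collections import Counter
from datetime import datetime

MAX_SIZE = 1_000_000  # assumption: "1MB" read as 10**6 bytes


def plot_cell(when, size, start, end, width=800, height=400, min_size=100):
    """Map one message (arrival time, size in bytes) to an (x, y) cell.

    x grows linearly with time; y grows with log10(size).
    A renderer would typically flip y so larger messages appear higher.
    """
    # Clamp the size: anything above 1MB is treated as 1MB, and anything
    # below the (assumed) floor is pinned to the bottom row.
    clamped = max(min_size, min(size, MAX_SIZE))
    frac_t = (when - start) / (end - start)
    frac_s = math.log10(clamped / min_size) / math.log10(MAX_SIZE / min_size)
    # Keep the cell inside the canvas even for out-of-range timestamps.
    x = max(0, min(width - 1, int(frac_t * width)))
    y = max(0, min(height - 1, int(frac_s * height)))
    return x, y


def density(messages, start, end, **kwargs):
    """Count messages per cell; a higher count means a darker dot."""
    counts = Counter()
    for when, size in messages:
        counts[plot_cell(when, size, start, end, **kwargs)] += 1
    return counts
```

Keeping one `Counter` per category (spam, virus) would be enough to draw the two colors separately.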
Note that this chart is not scientific. Only mail which makes it past
the corporate spam and virus filters shows up on the chart.
Why does so much spam and virus mail get through the filters?
Because corporate mail filters cannot take the risk of accidentally
classifying valid business email as spam. Consequently, the filters
have to make sure to remove something only if they have extremely high
confidence that the message is unwanted.
Okay, enough dawdling. Let's see the chart.
Overall statistics and extrema:
Things you can see on the chart:
-
Spam went ballistic starting in 2002.
You could see it growing in 2001, but 2002 was when it really took off.
-
Vertical blue lines are "bad spam days".
Vertical red lines are "bad virus days".
-
Horizontal red lines let you watch the lifetime of a particular email virus.
(This works only for viruses with a fixed-size payload.
Viruses with variable-size payload are smeared vertically.)
-
The big red splotch in August 2003 around the 100K mark is the Sobig virus.
-
The horizontal line in 2004 that wanders around
the 2K mark is the Netsky virus.
-
For most of this time, the company policy on
spam filtering was not to filter it out at all,
because all the filters they tried had too high a false-positive rate.
(I.e., they were rejecting too many valid messages as spam.)
You can see that in late 2003, the blue dot density diminished
considerably.
That's when mail administrators found a filter
whose false-positive rate was low enough to be acceptable.
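To make the false-positive trade-off concrete, here is a minimal sketch that assumes a filter which assigns each message a spam score between 0 and 1 and discards anything at or above a threshold. The function name, the scores, and the thresholds are made up for illustration; they do not describe the filter the mail administrators actually deployed.

```python
def false_positive_rate(scored, threshold):
    """Fraction of valid messages that would be rejected as spam.

    scored: list of (spam_score, is_actually_spam) pairs.
    """
    valid_scores = [score for score, is_spam in scored if not is_spam]
    if not valid_scores:
        return 0.0
    rejected = sum(1 for score in valid_scores if score >= threshold)
    return rejected / len(valid_scores)


# Hypothetical scores, invented purely to show the effect of the threshold.
sample = [
    (0.9999, True),   # obvious spam
    (0.95, True),     # spam the filter is less sure about
    (0.97, False),    # valid business mail that merely looks spammy
    (0.10, False),    # ordinary valid mail
]

loose = false_positive_rate(sample, threshold=0.9)     # rejects the 0.97 valid message
strict = false_positive_rate(sample, threshold=0.999)  # rejects no valid mail
```

With the stricter threshold no valid mail is lost, but the 0.95 spam message now gets through, which is the kind of leakage that shows up as blue dots on the chart.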
As a comparison, here's the same chart based on email received
at one of my inactive personal email addresses.
This particular email address has been inactive since 1995;
all the mail it gets is therefore from harvesting done prior to 1995.
(That's why you don't see any red dots: None of my friends have this address
in their address book since it is inactive.)
The graph doesn't go back as far because
I didn't start saving spam from this address until late 2000.
Overall statistics and extrema:
I cannot explain the mysterious "quiet period" at the beginning
of 2004. Perhaps my ISP instituted a filter for a while?
Perhaps I didn't log on often enough to pick up my spam and it
expired on the server? I don't know.
One theory is that the lull was due to uncertainty created by the
CAN-SPAM Act, which took effect on January 1, 2004.
I don't buy this theory since there was no significant corresponding
lull at my other email account, and follow-up reports indicate
that CAN-SPAM was widely disregarded.
Even in its heyday, compliance was only 3%.
Curiously, the trend in spam size for this particular account is
that it has been going down since 2002.
In the previous chart, you could see a clear upward trend since 1997.
My theory is that since this second dataset is more focused on current
trends, it missed out on the growth trend in the late 1990s
and instead is seeing the shift in spam from text to <IMG> tags.