We sometimes hear on various forums that spam is always on the increase and that email servers are getting blasted with it. I decided to investigate how total spam traffic in one week relates to the next. If it increases one week, is it more likely or less likely to increase the next week? If we get a sudden blast of spam, will that blast continue?
To determine this, I collected all of our historical data going back to the beginning of 2005. I then calculated the week-over-week % change in overall message traffic. Specifically, I calculated the % change from the previous week's traffic to this week's traffic, and then the % change from this week's traffic to next week's traffic. I then compared the two sets of % changes.
When comparing the % change in this week's traffic to last week's traffic, the average change is 1.6%. When comparing this week's traffic to next week's traffic, the average change is 1.5%. These two values are pretty much the same, which tells us that, on average, spam is increasing by about 1.5% per week. However, the change in volume from one week to the next does not follow the same pattern. I calculated the correlation between this week's % change in volume and next week's % change in volume, and the correlation coefficient is -0.22, a weak negative relationship. In other words, if we get an increase in spam this week, we are somewhat more likely to see a drop in spam next week. The difference is made up when the next increase occurs, and that increase will be larger than the average 1.5%.
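As a rough sketch of the calculation described above: the code below computes week-over-week % changes and then the correlation between each week's change and the following week's change. This is not the actual script or data behind the numbers in this post. The weekly counts are made up, and the function names (`pct_changes`, `pearson`, `lag1_correlation`) are hypothetical.

```python
from math import sqrt

def pct_changes(volumes):
    """Week-over-week % change: 100 * (this_week - last_week) / last_week."""
    return [100.0 * (cur - prev) / prev
            for prev, cur in zip(volumes, volumes[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lag1_correlation(changes):
    """Correlate each period's % change with the next period's % change."""
    return pearson(changes[:-1], changes[1:])

# Made-up weekly message counts, for illustration only.
weekly = [100, 110, 104, 118, 112, 125, 119, 133]

changes = pct_changes(weekly)
print("average % change:", sum(changes) / len(changes))
print("lag-1 correlation:", lag1_correlation(changes))
```

A negative result from `lag1_correlation` means that increases tend to be followed by decreases, which is the pattern reported for the weekly data.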
Next, I decided to check the relationship for two-week periods. For example, I calculated the total amount of mail between Jan 1 - Jan 14, then Jan 15 - Jan 28, Jan 29 - Feb 11, and so forth, and then ran the same tests. On average, spam volume increases by about 2.5% every two weeks. However, the correlation coefficient for the change in volume from one two-week period to the next is 0.032, which is effectively zero. In other words, if we get an increase in spam over a two-week period, that tells us nothing about whether spam will rise or fall over the following two weeks.
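The two-week grouping can be sketched the same way: sum the weekly totals into non-overlapping two-week blocks, then compute the % change between consecutive blocks. Again, the counts are invented and `aggregate` is a hypothetical helper, not the code used for the figures above.

```python
def aggregate(volumes, width=2):
    """Sum consecutive non-overlapping blocks of `width` weeks.

    A trailing partial block is dropped so every total covers a full period.
    """
    usable = len(volumes) - len(volumes) % width
    return [sum(volumes[i:i + width]) for i in range(0, usable, width)]

# Made-up weekly message counts, for illustration only.
weekly = [100, 110, 104, 118, 112, 125, 119, 133, 140]

biweekly = aggregate(weekly, width=2)

# % change from each two-week block to the next.
biweekly_changes = [100.0 * (cur - prev) / prev
                    for prev, cur in zip(biweekly, biweekly[1:])]

print("two-week totals:", biweekly)
print("two-week % changes:", biweekly_changes)
```

The same lag-1 correlation can then be run on `biweekly_changes`. Note that grouping halves the number of data points, so any correlation estimate becomes noisier.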
Your two-week measurements were skewed by including New Year's in one measurement. Two or three days each year, so many spammers are on holiday that you can hardly recognize your inbox.
For my numbers, I actually did not include the Christmas period in the 2005 data, and I went back and removed the Christmas 2006 period as well. It doesn't change the results at all: for the week-over-week data we are more likely to see a reversal of the previous week's change, while for the bi-weekly data we still see no correlation.
Fine, you didn't include Christmas and I didn't say you did, but you said you included New Year's and I said you did.
New Year's day is a holiday in more countries than Christmas is. Some non-Christian countries use the Christian calendar, some shift old religious holidays onto the Christian calendar's New Year's day, etc. A lot of spammers take that day off, and it skewed your results.
When I went back and redid the calculations, in both 2005 and 2006 I removed the final two weeks in December and the first two weeks in January.
Your point remains, however. Public holidays can skew the data, but with enough data points we should still be able to observe statistically significant relationships.
> I removed the final two weeks in December and the
> first two weeks in January.
OK. Your original article said something different:
> total amount of mail between Jan 1 - Jan 14, then