About 15 months ago I started work on a project that measures our spam effectiveness. Just last week the first part of it finally went live, end-to-end. It was a long time coming but we finally got it done. If you're wondering what took so long, let me tell you:
None of those things is trivial because while the network is designed to mimic our existing filtering infrastructure, there are lots and lots of small differences. A pile of small differences adds up to a major engineering challenge.
Anyhow, the project originally started off as a way to gauge our spam catch rate and false positive rate. As we went along, it became clear to me that I had to scale back my expectations, and I started concentrating on how to measure spam. Fancy charts, training the filter on false negatives, measuring false positives, post-examination, correlation between filters on missed messages... all of this stuff is cool, but I first had to get up the first rung of the ladder.
Now that we're looking at part 2, measuring our false positive rate, lots and lots of questions are popping up. How do we measure ourselves against our competition? How do we improve our effectiveness? How do we leverage this network? How do we correlate different false positives and false negatives across different filters? In other words, we now have some visibility and questions are arising about what this thing will look like at the end.
The truth is that I haven't completely thought everything through; I only have a rough outline. George Lucas has stated, of the Star Wars prequels, that when he wrote the stories back in 1975, he had a pretty good idea of what they would all look like. While he didn't have all the details ironed out, the three new movies pretty much adhered to his basic storyline.
Well, similarly, while I haven't completely thought through all of the details and plot points, I have a pretty good idea of what this network will do when all is said and done. The end game is to create a network that measures how well we are doing on spam and non-spam, trains on false negatives and false positives, determines our response time, compares us to competitors, and includes piles of statistics (because I like charts).
Now I need to hire a writer to get the dialogue to not be so cheesy.
A key component you also need to look at is the granularity of the measurement. On average, most filters are very good. What is important to understand is how they respond to new spam attacks. For this you need to measure capture rates at the time scale of the response, which should be on the order of one minute or less.
Understanding how capture rates respond is very important because the spam that gets through during the dips in effectiveness that accompany new campaigns is, most importantly, what people notice.
You need to measure the depth of the dips and the time taken to recover to good effectiveness. These are the parameters that should govern the activity of any spam lab or any test of new approaches.
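To make the idea of dip depth and recovery time concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the per-minute counts, the `baseline` and `threshold` values, and the names `capture_rates`, `find_dips` and `Dip` are hypothetical and not taken from any real filtering system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dip:
    start_minute: int                 # first minute the rate fell below the threshold
    depth: float                      # baseline minus the lowest rate seen in the dip
    recovery_minutes: Optional[int]   # minutes until back at/above threshold; None if still down


def capture_rates(caught: List[int], total: List[int]) -> List[float]:
    """Per-minute capture rate: spam caught / spam seen in each one-minute bucket.

    Assumption: a minute with no spam at all counts as 1.0 (nothing was missed).
    """
    return [c / t if t else 1.0 for c, t in zip(caught, total)]


def find_dips(rates: List[float], baseline: float, threshold: float) -> List[Dip]:
    """Find runs of consecutive minutes where the rate is below `threshold`."""
    dips: List[Dip] = []
    start: Optional[int] = None
    lowest = baseline
    for minute, rate in enumerate(rates):
        if rate < threshold:
            if start is None:
                start, lowest = minute, rate      # a new dip begins
            else:
                lowest = min(lowest, rate)        # dip continues, track the bottom
        elif start is not None:
            # Rate is back at or above the threshold: the dip is over.
            dips.append(Dip(start, baseline - lowest, minute - start))
            start = None
    if start is not None:
        # The data ended while the filter was still below the threshold.
        dips.append(Dip(start, baseline - lowest, None))
    return dips


# Hypothetical data: a new campaign hits at minute 2 and is handled by minute 4.
caught = [99, 98, 60, 70, 97, 99]
total = [100, 100, 100, 100, 100, 100]
rates = capture_rates(caught, total)
dips = find_dips(rates, baseline=0.98, threshold=0.95)
for d in dips:
    print(d)
```

With one-minute buckets, a campaign that is blocked after two minutes shows up as a short, deep dip. With hourly or daily buckets the same event would be averaged away almost entirely.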
If you do this, you should be capturing it at a high level with a report on the standard deviation of your capture rate. This stat is far more important to anti-spam effectiveness than the average capture rate but, as far as I can tell, has dropped out of use in the world of statistical reporting.
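As a rough illustration of why the standard deviation says more than the average, here is a small Python sketch. The two series of per-minute capture rates are made up, and the choice of the population standard deviation (`statistics.pstdev`) over the sample version is an assumption; `summarize` is a hypothetical helper name.

```python
from statistics import mean, pstdev

# Hypothetical per-minute capture rates for two filters over the same ten minutes.
# Both average 97%, but the second one lets a new campaign through at minute 4.
steady = [0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.97, 0.97]
spiky = [0.99, 0.99, 0.99, 0.99, 0.80, 0.99, 0.99, 0.99, 0.99, 0.98]


def summarize(rates):
    """Return the average, the spread, and the worst minute of a capture-rate series."""
    return {"mean": mean(rates), "stdev": pstdev(rates), "worst": min(rates)}


for name, rates in (("steady", steady), ("spiky", spiky)):
    s = summarize(rates)
    print(f"{name}: mean={s['mean']:.3f} stdev={s['stdev']:.3f} worst={s['worst']:.2f}")
```

A report that only shows the mean cannot tell these two filters apart; the standard deviation and the worst minute are what expose the dip.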