The other day, I was taking a look at some of our traffic statistics. One of the challenges that I have is determining what our catch rate is. We know how much traffic we see (more or less), we know how much we catch with blocklists and we know how much mail we filter as spam. We also know how much mail we deliver to the end user. What we don't know is how much of that mail we deliver to the end user is spam.
To find out, we would either need every customer to submit their spam to us (which will never happen), or we would have to randomly sample and manually grade the mail that reaches end users and extrapolate that to the rest of the network (which is equally unlikely to happen).
I decided to do a worst case scenario. I can see that the traffic on weekends always dips, and the amount of mail we deliver always drops by about 2/3. For example, if we delivered 30 million messages during the week, on the weekends we deliver 10 million. These numbers are fairly consistent regardless of our inbound traffic.
I made an approximation that the amount of spam we deliver to users is about the same on weekends as it is during the week, and that all mail we deliver on weekends is spam. In other words, on a weekday 2/3 of the mail we deliver is non-spam and 1/3 is spam. This is difficult to believe, but it is also a worst-case scenario. Using these numbers, I calculated that our spam filtering is over 99% effective.
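As a rough sketch of that worst-case arithmetic: the delivered volumes below are the example figures from the post, but the blocked volume is a purely hypothetical number chosen for illustration, since the post does not state how much mail is blocked or filtered. The function names are likewise made up for this example.

```python
def worst_case_spam_share(weekday_delivered, weekend_delivered):
    """Share of weekday delivered mail that is spam, assuming ALL weekend
    mail is spam and the same amount of spam is delivered on a weekday."""
    return weekend_delivered / weekday_delivered


def worst_case_catch_rate(blocked, weekend_delivered):
    """Lower bound on the catch rate under the same worst-case assumption.

    catch rate = spam caught / (spam caught + spam delivered)
    """
    delivered_spam = weekend_delivered  # worst case: every weekend message is spam
    return blocked / (blocked + delivered_spam)


# Example figures from the post: 30 million delivered on a weekday, 10 million on a weekend day.
weekday_delivered = 30_000_000
weekend_delivered = 10_000_000

# HYPOTHETICAL: the post gives no blocked volume; this value is an assumption for illustration only.
blocked = 1_500_000_000

spam_share = worst_case_spam_share(weekday_delivered, weekend_delivered)
catch_rate = worst_case_catch_rate(blocked, weekend_delivered)

print(f"worst-case spam share of delivered mail: {spam_share:.1%}")
print(f"worst-case catch rate: {catch_rate:.2%}")
```

With these assumed inputs, the catch rate stays above 99% even though a third of delivered mail is counted as spam, because the blocked volume dwarfs the delivered volume.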
If I didn't know better, I'd be tempted to say "Wow, that's pretty good." Unfortunately, being a spam analyst, all I ever hear is how much our service "sucks" (why did this mail come through, we're getting spoofed, why do I submit this spam over and over again and not see any improvements even though in reality I only submitted it twice and the message headers were missing and the body contents were garbled, etc). In other words, even though we block 99% of spam, we still get plenty of complaints from end users and are reminded all the time that we have to improve our service.
I guess I should rephrase that: I hear plenty of complaints that get filtered up to me, but that's natural because nobody ever calls to compliment the service; they only call to complain. That comes with the territory. The point is that even hitting 99% spam effectiveness isn't enough because of the sheer volume of spam being sent; it's entirely dependent on end-user perception. If a user receives 10,000 spam messages and we block 99%, they still see 100 spam messages. That means there's some work to do.
On the other hand, I'm pretty proud of our false positive rate.
It could be even worse than you think.
If 90% of your incoming email is spam, and you block 99% of that, then one in twelve emails reaching your end users will be spam. For several users the numbers of incoming spam will exceed their incoming ham. That's why, even though we're doing a sterling job, the end-users' experience may be far worse than we expect.
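The arithmetic behind that "one in twelve" figure can be checked with a short sketch. The function names are invented for this example, and it assumes no legitimate mail is blocked (a false positive rate of zero), which the comment does not state explicitly.

```python
def delivered_spam_fraction(spam_share, catch_rate, false_positive_rate=0.0):
    """Fraction of mail reaching the inbox that is spam.

    spam_share          -- fraction of incoming mail that is spam
    catch_rate          -- fraction of spam that the filter blocks
    false_positive_rate -- fraction of legitimate mail wrongly blocked
                           (assumed 0 here; not given in the comment)
    """
    spam_through = spam_share * (1 - catch_rate)
    ham_through = (1 - spam_share) * (1 - false_positive_rate)
    return spam_through / (spam_through + ham_through)


def spam_seen(spam_received, catch_rate):
    """Number of spam messages a user still sees after filtering."""
    return spam_received * (1 - catch_rate)


# 90% of incoming mail is spam and 99% of it is blocked.
fraction = delivered_spam_fraction(spam_share=0.90, catch_rate=0.99)
print(f"spam share of the inbox: {fraction:.1%} (about 1 in {round(1 / fraction)})")

# The post's own example: 10,000 spam messages at a 99% catch rate.
print(f"spam messages still seen: {round(spam_seen(10_000, 0.99))}")
```

Raising `spam_share` for a heavily targeted user shows how quickly spam can outnumber ham in that user's inbox even at the same catch rate.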
Exactly, Phil. You've phrased it in a way that I was unable to capture in my post.
I think you guys kick ass, actually. I'd much rather use EHS and receive an occasional piece of spam (which I always send off with the EHS Outlook plug-in) than go back to the mess that I had before.
Since Yahoo puts most incoming spams in my spam box instead of deleting them entirely, I get to do two comparisons. One is the number of spams in my spam box vs. the number of spams in my inbox, and one is the number of legitimate messages in my spam box vs. the number of legitimate messages in my inbox.
So I think I can estimate pretty accurately that Yahoo catches about 90% of incoming spams with 10% false negatives.
And I think I can estimate pretty accurately that Yahoo passes about 97% of incoming legitimate messages with 3% false positives.
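Those two comparisons amount to computing two rates from four folder counts. Here is a minimal sketch; the counts are hypothetical values picked to reproduce the 90% and 97% estimates above, not figures from my actual mailbox, and the class and function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class FolderCounts:
    spam_in_spam_box: int  # spam correctly filed as spam
    spam_in_inbox: int     # spam that got through (false negatives)
    ham_in_spam_box: int   # legitimate mail filed as spam (false positives)
    ham_in_inbox: int      # legitimate mail correctly delivered


def catch_rate(c):
    """Fraction of all spam that ended up in the spam box."""
    return c.spam_in_spam_box / (c.spam_in_spam_box + c.spam_in_inbox)


def false_positive_rate(c):
    """Fraction of all legitimate mail that ended up in the spam box."""
    return c.ham_in_spam_box / (c.ham_in_spam_box + c.ham_in_inbox)


# HYPOTHETICAL counts, chosen only to match the estimates in the comment.
counts = FolderCounts(spam_in_spam_box=900, spam_in_inbox=100,
                      ham_in_spam_box=3, ham_in_inbox=97)

print(f"catch rate: {catch_rate(counts):.0%}")
print(f"false negative rate: {1 - catch_rate(counts):.0%}")
print(f"false positive rate: {false_positive_rate(counts):.0%}")
```

Note that the two rates use different denominators: the catch rate is measured against all spam, while the false positive rate is measured against all legitimate mail.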
The latest false positive was informative. Yahoo judged an incoming legitimate message to be spam, not because of its contents, and not because of the IP address of the originating sender, but because of the IP address of the sending mail server. And the sending mail server was ... ready for this?
I guess Yahoo discovered that Yahoo sends enough spams that they judged themselves. But this particular message wasn't spam.
Sorry, I forgot to mention a way that you can improve your estimates. If you have a web mail service like Yahoo, then you can add buttons for users to report misfiled messages before they download mail to their real inboxes.
ya NORMAN I really agree with ....
The problem is that without filters that are individually tuned to each user you are going to really struggle to get past your current filtering rates.
It's always difficult to see the picture from the customer's side... as Phil says, some of your customers may be receiving more spam than ham, which is very bad for the customer, even though your stats may show 99%, which looks pretty good on paper!
Interesting blog though (Just discovered it and bookmarked).
The world's only guaranteed spam blocker
MS's Frontbridge.com and 88.blacklist.zap should provide valid information on why a domain is blacklisted!!! It routinely blocks one of our users' email, and we have zero tolerance for spammers. It doesn't even provide a delisting method that actually works. Why reinvent the wheel? Shouldn't blacklisting be left to one of the already credible (and reliable) services available like spamhaus.net?