An idea that has been floating around in my head for several months is using probability to pre-emptively mark messages as spam when they come from IPs that we have never seen before.
Spammers are getting clever. Because they know that a lot of companies will reject mail from their botnets, their solution is to make botnets wider and thinner (i.e., send fewer mails from more and more machines). This makes it more difficult for spam filters that rely on reputation-based filtering to catch the mail. If an IP is brand new, we can't really assume anything about the content.
Or can we? I think that the spammer's strength can be used against them. The newness of an IP should actually be a weakness that can be exploited. Let's say that we reject 70% of our mail at the connection level because all of those IPs are on blacklists. I think this can be extended to the following: if 70% of the IPs hitting our networks are known spammers, then all things being equal, an IP with no history has a 70% chance of being a spammer. We can use probability to make a judgment about a message without ever having seen mail from that IP before. This would be a truly pre-emptive strategy: we don't bother to content filter to figure out if the IP is clean or dirty, and we don't build up a reputation. We just make a guess that the IP is probably going to end up being a spammer.
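To make the arithmetic concrete, here is a minimal sketch of that guess in Python. The function name and the counts (700 blacklisted IPs out of 1,000 connecting IPs) are made up for illustration; the only thing taken from the reasoning above is that the blacklisted share of connecting IPs serves as the prior for an IP with no history.

```python
def prior_spam_probability(blacklisted_ips: int, total_ips: int) -> float:
    """Estimate the chance that an unseen IP is a spammer.

    Assumption: an IP with no history behaves like a random draw from
    the population of IPs that already connect to us ("all things
    being equal").
    """
    if total_ips <= 0:
        raise ValueError("total_ips must be positive")
    if not 0 <= blacklisted_ips <= total_ips:
        raise ValueError("blacklisted_ips must be between 0 and total_ips")
    return blacklisted_ips / total_ips


# Hypothetical counts matching the 70% example.
prior = prior_spam_probability(blacklisted_ips=700, total_ips=1000)
print(f"Chance a brand-new IP is a spammer: {prior:.0%}")
```

In practice the counts would come from whatever connection logs are on hand, and the estimate is only as good as the "all things being equal" assumption.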
Somewhere in there has got to be a spam filtering methodology. What if we combined a new IP with a very effective spam rule, one that hits lots of spam but does not carry enough weight to push a message over the spam threshold? What if we then generated a random number and marked the message as spam whenever that number was above a threshold? For example, suppose a message comes from a new IP and hits a rule whose hits are 85% spam. We generate a random number between 0 and 1, and it comes out as 0.91. If we set our threshold at 0.85 (the same as the spam percentage of the rule), then since 0.91 > 0.85, we classify the message as spam. The idea here is that we don't have enough content to classify the strange new message as spam, but there's a 70% chance that it is spam and that the sender is a spammer. We combine that knowledge with a rule that already hits a lot of spam, and then randomly drop messages that are probably spam. Would this work? I don't know, but I think it would be interesting to test and even more fun to refine and polish.
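The decision step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, its parameters, and the injectable `rng` argument are hypothetical, and a real filter would get `is_new_ip` and `rule_hit` from its own reputation store and rule engine.

```python
import random
from typing import Callable


def classify_new_ip_message(
    is_new_ip: bool,
    rule_hit: bool,
    threshold: float = 0.85,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Return True if the message should be treated as spam.

    Only messages from an unseen IP that also hit the high-spam rule
    are eligible. For those, draw a number in [0, 1) and mark the
    message as spam when the draw is above the threshold.
    """
    if not (is_new_ip and rule_hit):
        return False  # leave the message to the normal filtering path
    return rng() > threshold


# The worked example: a draw of 0.91 against a threshold of 0.85.
print(classify_new_ip_message(True, True, threshold=0.85, rng=lambda: 0.91))
```

One thing the sketch makes visible: with a uniform draw, "above 0.85" happens about 15% of the time, so this exact rule would mark roughly 15% of eligible messages as spam. Marking about 85% of them would mean flipping the comparison or lowering the threshold, which is the kind of tuning a test would have to settle.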
Then maybe I can win a Microsoft patent award.
This is reminiscent of the "x" constant in Gary Robinson's f(w) equation (see "Dealing with rare words" in http://www.linuxjournal.com/article/6467). We use this approach in SpamAssassin's bayes-style classifier with words we haven't seen before.
It's also similar to greylisting -- tempfail SMTP traffic from new IPs.
'Then maybe I can win a Microsoft patent award.'
Please don't. The world doesn't need more software patents... ;)
You're right, it is similar to that. I think the difference is that I'd like to go beyond greylisting and make a decision right away, without having to temporarily fail SMTP traffic from new IPs. Ideally I'd like to permanently fail them, or at the very least tempfail only a very small amount. I'm not sure it's possible to get around tempfailing, but that'd be my hope.
'Please don't. The world doesn't need more software patents... ;)'
Heh, I would like to use it to negotiate my next salary increase. :p