A few weeks back, my crazy co-worker (the one who insists that all mail must have SMTP AUTH in order to not be considered spam, regardless of its content or what proportion of the target end users consider it spam) and I were in another discussion about how to classify mail.

My position is that you should never reject mail that you classify as spam in the SMTP session without consideration of the source of the mail.  Here is what I mean by that:

  1. It is perfectly fine to reject mail based upon the sender if you can trust the source of that mail.  I don’t mean trust in the sense that you trust someone to send you good mail, but rather trust that your analysis that you have identified the actual source is correct.  For example, rejecting because an IP is on a blocklist is fine because you can trust that the source IP really is the source IP.  You are rejecting the source because it is a source with a bad reputation.

    Of course, some of you will point out that some spam is sent through a relay (such as two bots where one sends through the other) or through a web mail service via a bot.  Assuming you could walk through Received headers in a reliable way, it would be perfectly legitimate to reject based upon those IPs as well.  If you can’t, don’t traverse anything and only reject on the connecting IP; you still get to reject at least 75% of all spam.

    But in general, for IPs that are on a blocklist that is reputable, you can reject mail from the source (connecting) IP.

  2. It is not fine to reject on content only.  An example of this is URL filtering.  Suppose you download a list of URLs that are known to appear in spam.  You get a random message and the sending IP is not on any blocklists.  You proceed to scan the message and discover that the content contains a URL that is on a URL list, say SURBL.  Should your engine classify the message as spam right there without any additional scanning?  Or not?  Should you reject the message in SMTP?

    My co-worker says yes, I say no.  The reason is that the body content of a message is not trustworthy because anyone can “spoof” the body of a message.  For example, a spammer can send a message containing only the following body contents:

    Hey, check out this link: hxxp://www.freemedz4you.com

    This is obviously a spam message sent by a spammer and should be treated as such.  But what if I sent the following message to our Digital Crimes Unit:

    Hey, DCU, the following spammers are using hxxp://freemicrosoftsoftware.com as a redirect to a free warez page, you might want to go and check it out and take action if necessary.

    In this case, I have used untrusted content in the message and it contains spammy content, but the source of the message (me) is entirely legitimate.  I have used spammy content in a valid manner, so I don’t want the message marked as spam.  The source of the message trumps the content of the message.  Even if my spam filter marks it as spam, I don’t want it rejected in SMTP before a safe senders list can be applied to it.

  3. One might argue that safe senders is a workaround.  I’ve never really been a fan of safe senders for a few reasons.  The first is that a spam filter should be smart enough to be able to avoid false positives as much as possible with as little user intervention as possible.  It’s the spam filter’s job to cut down on false positives without forcing the user to do a hacky workaround.  I do use safe senders, but only when I have to.

    Second, safe senders should really only be honored with an SPF pass, Sender ID pass or DKIM validation.  In other words, only do safe senders on senders you can trust.  But if you do that, you have to do another DNS query and wait for the check.  So, you’re forcing the filter to do extra lookups.  That’s not that big a deal, but what is a big deal is that you are losing the operational efficiency of rejecting in SMTP vs having to hold the connection open while you do a DNS lookup for a sending domain’s TXT record.

    The problem is that many domains don’t even have SPF, Sender ID or DKIM records.  So, how would you even validate the sender?  Many large organizations have these records set up, but most small ones do not.  That’s simply a fact of life in the world of email, and people who work with email know this.  The world does not revolve around large organizations.

    If a safe sender is the only mitigation to get around something like this (auto-marking a message as spam), then to me it implies that the design of the filter isn’t a good one.

  4. And therein lies the rub.  As a large filtering organization, you can’t possibly know what everyone’s trusted sources are.  You might be able to take a stab at 20% of them, but there are still 80% left.  We, as a filtering org, will never be able to build a system complex enough to be able to automatically determine when a source that contains suspicious content can actually be trusted.  The cost/benefit ratio just isn’t there because there isn’t a lot of value in building such a large sender/recipient reputation table.

    if (message contains a URL listed on SURBL) {
        block;
    } unless (
        (sender == A && recipient == B) ||
        (sender == C && recipient == D) ||
        (sender == E && recipient == F) ||
        <repeat near infinitely>
    )


    However, there is a substantial benefit to avoiding false positives, because everyone hates false positives.  I’ve said it before and I will say it again: a false positive is much more important than a false negative.  We shouldn’t require users to build up a hacky safe senders mechanism to get around all of our false positives.  Better to be conservative and account for the fact that rejecting on content only is a risky business, and stick to rejecting on source only.
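
To make the source-versus-content distinction concrete, here is a minimal Python sketch of the policy argued for above.  It is only an illustration, not how any particular filter works: the names `Verdict`, `dnsbl_query_name` and `classify`, the zone `dnsbl.example.org`, and the stubbed lookup functions are all made up for the example.  The one real-world detail it relies on is the usual DNSBL convention of reversing the IPv4 octets and appending the blocklist zone to form the name to query.

```python
import ipaddress
from enum import Enum
from typing import Callable, Iterable


class Verdict(Enum):
    REJECT_IN_SMTP = "reject in SMTP"
    ACCEPT_AND_MARK = "accept, but mark as spam"
    ACCEPT = "accept"


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name for an IPv4 blocklist lookup:
    the octets in reverse order, followed by the blocklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def classify(
    connecting_ip: str,
    body_urls: Iterable[str],
    ip_is_listed: Callable[[str], bool],
    url_is_listed: Callable[[str], bool],
) -> Verdict:
    """Reject in SMTP on source only; a content hit just marks the message."""
    # Source check: we can trust that the connecting IP really is the
    # connecting IP, so a blocklist hit justifies an SMTP-time rejection.
    if ip_is_listed(connecting_ip):
        return Verdict.REJECT_IN_SMTP
    # Content check: anyone can put any URL in a body, so a URL-list hit
    # only marks the message.  It is still delivered, which leaves room
    # for safe senders or further scanning to overrule the mark.
    if any(url_is_listed(url) for url in body_urls):
        return Verdict.ACCEPT_AND_MARK
    return Verdict.ACCEPT


if __name__ == "__main__":
    # Hypothetical stand-ins for real blocklist lookups.
    listed_ips = {"192.0.2.10"}
    listed_hosts = {"spam.example"}

    print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
    print(classify("192.0.2.10", [], listed_ips.__contains__,
                   listed_hosts.__contains__))
    print(classify("198.51.100.7", ["spam.example"], listed_ips.__contains__,
                   listed_hosts.__contains__))
```

In a real deployment the two lookup functions would presumably be backed by DNS queries rather than in-memory sets.  The point of the sketch is only the ordering: the source check is the sole path to an SMTP rejection, and a content hit never rejects on its own.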

That’s the way I see things.  When it comes to spam filtering, I’ve been around long enough to know that false positives always happen.  Better to build a filter on the assumption that your filters will occasionally be wrong rather than assuming that they will always be right, and then defending your errors with the three-letter acronym “SLA”.

Of course, the latter part of that phrase is the topic for a future post.