Following on from my previous post, I thought I'd go into a bit of detail about how we create spam rules. To be more accurate, I'll describe how I used to create spam rules: I stopped going through our abuse submission folder daily about a year and a half ago, when false positives and other miscellaneous work kept me too busy. The process today is a little different than it was back then.
When I first started, we had an abuse inbox that we viewed in an email client. We also had a "scanner log" that parsed the subject lines and the domains within the messages and sorted them by frequency. I built my own application that let me sort messages by extension (.com, .net, and so on) while also honoring whitelists (frontbridge.com, yahoo.com, and so on).
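The frequency sorting and whitelist filtering could be sketched roughly like this. This is a toy reconstruction, not the actual tool; the whitelist entries and function names are illustrative:

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative whitelist of domains that should never trigger a rule review.
WHITELIST = {"frontbridge.com", "yahoo.com"}

def tally_domains(urls):
    """Count how often each non-whitelisted domain appears in submissions."""
    counts = Counter()
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if domain in WHITELIST:
            continue  # skip known-good senders so they don't clutter the report
        counts[domain] += 1
    return counts

submissions = [
    "http://cheap-meds.example/offer",
    "http://cheap-meds.example/buy-now",
    "http://yahoo.com/mail",
]
# The most frequent non-whitelisted domain floats to the top of the report.
print(tally_domains(submissions).most_common(1))
```

The same tally works for subject lines; only the key function changes.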
I'd gather all the common subjects and URLs and generally sort by URL, then go through them one by one looking for suspicious entries. In general, a spammy URL usually has a lot of cousins, so I'd search for all the messages containing those URLs and create spam rules to block them. This was not a difficult process because I could usually eyeball which messages were spam and which were not.
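The "cousin" idea can be illustrated with a toy grouping function. Here I'm assuming two URLs are cousins when they share the same last two domain labels, which is a deliberate simplification of whatever judgment a real analyst would apply:

```python
import re

def cousin_key(url):
    """Reduce a URL to its last two domain labels, e.g.
    http://pills.cheap-meds.example/a -> cheap-meds.example."""
    host = re.sub(r"^https?://", "", url).split("/")[0]
    return ".".join(host.lower().split(".")[-2:])

def find_cousins(messages, suspicious_url):
    """Return every message containing a URL that is a cousin of the suspect."""
    key = cousin_key(suspicious_url)
    return [m for m in messages if any(cousin_key(u) == key for u in m["urls"])]

messages = [
    {"id": 1, "urls": ["http://pills.cheap-meds.example/a"]},
    {"id": 2, "urls": ["http://deals.cheap-meds.example/b"]},
    {"id": 3, "urls": ["http://example.org/c"]},
]
print([m["id"] for m in find_cousins(messages, "http://cheap-meds.example/landing")])  # → [1, 2]
```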
Once I found the messages, I would look at the message headers. One of the biggest complaints our spam analysts have is that messages rarely contain the original headers (the other complaint is that people regularly submit messages that are not spam). By looking at the headers, I could see which spam rules a particular message hit by extracting our own custom header and decoding it. From there, I could decide either to write new spam rules or to adjust existing ones.
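As a rough sketch of that decoding step: the real header name and its encoding are proprietary, so the ones below (an X-Filter-Results header carrying a semicolon-separated list of rule IDs) are purely hypothetical:

```python
from email import message_from_string

# Hypothetical header name and format; the real custom header and its
# encoding are proprietary and not shown here.
RULE_HEADER = "X-Filter-Results"

def rules_hit(raw_message):
    """Extract the rule IDs recorded in our (hypothetical) custom header."""
    msg = message_from_string(raw_message)
    value = msg.get(RULE_HEADER, "")
    return [rule.strip() for rule in value.split(";") if rule.strip()]

raw = (
    "Subject: Cheap meds!!\n"
    "X-Filter-Results: RULE_1042; RULE_2210\n"
    "\n"
    "message body\n"
)
print(rules_hit(raw))  # → ['RULE_1042', 'RULE_2210']
```

An analyst seeing which rules already fired (and which didn't) can then decide whether to write a new rule or rescore an existing one.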
This explains why submitting headers is so critical to our process -- without them we don't know which rules a message hit or how aggressively to score those rules.
This also explains a common complaint that people make - why am I submitting so many messages while nothing seems to happen? Relatedly, some customers want to know what we're doing with all of their submissions and which rules are being created from them. When it comes to abuse submissions, messages are triaged by volume, not by customer. Messages are searched by characteristics such as body text, URLs, or subject lines. The most commonly occurring ones are examined first, and then a common rule is written that covers them all. We do not, nor have we ever, mapped rules one-to-one to specific abuse submissions. That's far too time consuming, and it doesn't make sense to track because spam rules benefit everyone, not just one specific customer.
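The triage-by-volume idea can be sketched like this (the data and key function are made up for illustration):

```python
from collections import defaultdict

def triage(submissions, key):
    """Group submissions by a shared characteristic and return the clusters
    largest-first, so one rule can cover the biggest batch."""
    clusters = defaultdict(list)
    for msg in submissions:
        clusters[key(msg)].append(msg)
    return sorted(clusters.values(), key=len, reverse=True)

batch = [
    {"subject": "Cheap meds!!", "body": "..."},
    {"subject": "Cheap meds!!", "body": "..."},
    {"subject": "Re: lunch", "body": "..."},
]
# Cluster on subject line; body text or URLs would work the same way.
biggest = triage(batch, key=lambda m: m["subject"])[0]
print(len(biggest))  # → 2
```

Note there is no per-customer bookkeeping anywhere in this loop, which is exactly why submissions can't be mapped back to individual rules.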
That's the general rule creation process. Again, back then we didn't have as much automation built in as we have now, but there are still relevant parts that we use even today.
> a spammy URL usually has a lot of cousins
True. So in many cases it seems kind of useless to filter on the visible URL. You have to do a DNS lookup on the URLs in each incoming message and filter according to the IP addresses (or in some cases the domains) where the DNS servers are, and where the illegal proxies (oops I mean web servers) are.
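A minimal sketch of the commenter's suggestion, assuming a simple resolve-and-filter helper (the URLs below are placeholders, and a real filter would also check name servers and cache aggressively):

```python
import socket
from urllib.parse import urlparse

def resolve_url_host(url):
    """Resolve the host in a URL to an IPv4 address; None if it won't resolve."""
    host = urlparse(url).netloc
    try:
        return socket.gethostbyname(host)
    except socket.gaierror:
        return None

# A filter could then bucket messages by resolved IP instead of the visible
# URL, so cousin domains hosted on the same box collapse into one rule.
print(resolve_url_host("http://localhost/landing-page"))
```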