Pattern Detection and Noise Reduction

The amount of noise inherent in outbound spam detection is high. End users routinely mark messages as spam that aren’t actually spam. Company billing reports are one example: they are not spam, but many recipients mark them as such. How do you know when you actually have an outbound spam problem, as opposed to somebody marking legitimate content as spam? If an alert is too noisy and fires hundreds of times every hour, the people who are supposed to monitor it start ignoring it. Complacency sets in because an actual problem is indistinguishable from a false alarm. It takes too much time to sort through it all, so people take shortcuts[1].

One way to reduce the noise and bing[2] the data is to combine pattern analysis with a threshold check.

  1. Every hour, measure the number of messages that are marked as spam and write the result to a log file.

  2. Next, convert this information to a normalized value and compare it to previous iterations of the measurement.

  3. Finally, analyze the pattern that these measurements form and take action based on the type of pattern.
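
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual system: the normalization (spam-marked messages divided by total messages sent in the hour), the 2% threshold, the three-hour window, and the pattern names "spike" and "sustained" are all assumptions made for the example.

```python
def normalize(spam_marked, total_sent):
    """Assumed normalization: share of sent messages that were marked as spam."""
    return spam_marked / total_sent if total_sent else 0.0


def classify(rates, threshold=0.02, window=3):
    """Label the most recent hourly rates with a coarse, hypothetical pattern name."""
    recent = rates[-window:]
    if len(recent) < window:
        return "insufficient-data"
    above = [rate > threshold for rate in recent]
    if all(above):
        return "sustained"  # every hour in the window exceeds the threshold
    if above[-1]:
        return "spike"      # the latest hour exceeds it, but not the whole window
    return "normal"


# Hypothetical hourly log entries: (messages marked as spam, messages sent).
hourly_log = [(5, 1000), (8, 1000), (30, 1000)]
rates = [normalize(marked, sent) for marked, sent in hourly_log]
pattern = classify(rates)
```

With a scheme like this, a single noisy hour could be logged without paging anyone, while a "sustained" pattern would trigger an alert. Which patterns deserve which action is the question the chart is meant to answer.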

image

Given the above chart, what patterns are useful for taking action?


[1] This is a kind of Turing test. How do we automate something with a computer that humans are much better at interpreting?

[2] Bing is a new search engine from Microsoft. Its philosophy is to exclude the non-relevant data and only bring the user useful information. I use it as a verb to mean the same thing – exclude the noise, alert on useful data.