Terry Zink's Cyber Security Blog

Discussing Internet security in (mostly) plain English

My paper on spam metrics, part 1

I just finished a series on spam metrics and submitted it as a paper to CEAS, hoping to get it accepted so that I could speak at the conference this year. I put it together in two days. Well, as it turns out, it was rejected.

The reviews were anonymous, but since the paper was rejected, I believe I have the right to respond to the comments. Here is what the first reviewer said:


Does not reference other work on which FP rate to use.  See for example Joel Snyder's discussion on which FP rate to use: http://www.networkworld.com/reviews/2004/122004spamside3.html

In this article, he proposes the use of the PPV.

Does not discuss the ROC approach used by researchers.  See all sorts of papers at CEAS and also see the TREC methodology documents.  In the introduction, the paper makes claims that the vendors can make various claims about the effectiveness of filters.

In the software industry, testing the quality of software is difficult. I would say that evaluating the quality of spam filters is done better than most. Quite a few reputable organizations / magazines have published evaluations:
Consumer Reports
Network World
PC Mag
etc etc


On the first point, I'll credit them that: I don't reference other work on which FP rate to use. The PPV in the linked article corresponds to my first metric for FP rate, messages incorrectly flagged as spam / total messages flagged as spam. Yet the reviewer, and even the writer of the article, plainly assume that this is the metric everyone uses. It isn't; both Postini and MessageLabs use my second metric for FP rate, messages incorrectly flagged as spam / total legitimate messages, as part of their SLAs (and if they don't, then the language is ambiguous, and it's probably intentional).
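To make the distinction concrete, here is a rough Python sketch of how the two FP rates, and the PPV they relate to, fall out of the same set of counts. The numbers are made up for illustration; they are not from the paper or from any vendor.

    # Hypothetical counts for one day of mail (made-up numbers, illustration only):
    tp = 89_000  # spam correctly flagged as spam
    fn = 1_000   # spam that slipped through to the inbox
    fp = 10      # legitimate messages incorrectly flagged as spam
    tn = 9_990   # legitimate messages correctly delivered

    catch_rate = tp / (tp + fn)           # spam caught / total spam
    ppv = tp / (tp + fp)                  # positive predictive value (Snyder's metric)
    fp_rate_of_flagged = fp / (tp + fp)   # first FP rate: FPs per message flagged as spam (= 1 - PPV)
    fp_rate_of_legit = fp / (fp + tn)     # second FP rate: FPs per legitimate message (SLA-style)

    print(f"Catch rate:           {catch_rate:.4%}")          # 98.8889%
    print(f"PPV:                  {ppv:.4%}")                 # 99.9888%
    print(f"FP rate (of flagged): {fp_rate_of_flagged:.4%}")  # 0.0112%
    print(f"FP rate (of legit):   {fp_rate_of_legit:.4%}")    # 0.1000%

Same filter, same mail stream, and yet the two FP rates differ by almost a factor of ten, which is exactly why it matters which definition a vendor is quoting.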

I don't have a problem with this definition of FP rate.  My point is that the researcher and reviewer might think it's obvious which FP rate ought to be used, but in the industry we don't agree.  That's why I was proposing a common set of metrics to be used by the industry, not trying to define something that no one has ever thought of.

Secondly, the reviewer says that evaluating the quality of spam filters is done better (i.e., more easily) than most, in contrast to the testing of software in general, which is difficult. If that were actually true, you wouldn't need so many different metrics, each with its own caveats, to tell you which filter is the best.

The linked article has PPV, NPV, and a long list of conditions. Well, this filter is better than that one, but you also have to remember the FP rate; then again, the catch rate on this one means the FP rate balances out. If you have to look at that many different variables, you really can't make a determination without weighing one thing against another, because there are so many different definitions.

This is confusing, and that's the point... again. I was trying to illustrate a move toward a common set of metrics that everyone could agree on. And at the end of the day, you really need one metric to tell you which is the best. If your boss walks into your office and says "Quick, tell me, which filter is the best?", what are you going to say? "Well, if you look at this factor, filter A is better; if you look at that factor, filter B comes out ahead..." and so forth. My common metric was designed to wrap everything into one number and say "This particular filter is the best one because this super-metric says so." No hedging, just a straight answer.
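Just to illustrate the shape of the idea, a super-metric could be as simple as a weighted combination of catch rate and false positive rate. To be clear, this is not the actual formula from my paper, and the weight below is an arbitrary number picked for the example.

    # Illustrative sketch only -- not the formula from the paper.
    def super_metric(catch_rate, fp_rate, fp_weight=50.0):
        # fp_weight says how much worse a false positive is than a missed spam;
        # 50 is an arbitrary choice for this example.
        return catch_rate - fp_weight * fp_rate

    filter_a = super_metric(catch_rate=0.97, fp_rate=0.001)  # 0.97 - 0.05 = 0.92
    filter_b = super_metric(catch_rate=0.99, fp_rate=0.003)  # 0.99 - 0.15 = 0.84

    print("Filter A wins" if filter_a > filter_b else "Filter B wins")  # Filter A wins

Filter B catches more spam, but once false positives are weighted in, filter A comes out on top, and there is a single number to point to when the boss asks.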

In the stock market, if someone were to ask me which of the stocks I invested in was the best, I could say "Well, this one had accelerating sales, this one had a good price-volume relationship, and this one had the highest dividend rate, so it depends on what you look at." You see, I could say that, but the correct answer is "the one that made me the most money." Simple, no hedging.

And when it comes to spam filters, that was my point.

Comments
  • I agree with your complaints about the review process. It seems to me this is a necessary discussion. But I'd like to go further and discuss the fact that capture rate, as used today as a static measure of capability, is also a fairly useless and inaccurate basis of comparison.

    CR varies considerably over time, especially when new campaigns hit. And it is the dips in effectiveness that determine the visible success or failure of the system. The dips result in help desk calls and peak loads on the infrastructure.

    So any comparison of solutions needs to look at how the system responds to new campaigns. How long does it take to get back up to 95% CR? How much time is spent below 75% CR? What is the standard deviation of the CR, and how many samples were taken? What you need is a test that tells you the probability that the results achieved in testing reflect the actual capabilities of the system.

  • David,

    I actually was considering your point of view when I wrote the paper.  I was thinking to myself "I wonder what those guys from MailChannels will say with regards to the (down) blip factor..."

    I would have to look into it some more in order to come up with numbers that incorporate all of the metrics into one super-metric.

  • Thanks, I'm glad you thought about it and about MailChannels. A single metric is a very difficult measure to achieve. What tends to be the issue is that people are looking for absolute measures, and that isn't what capture rate tests give you.

    CR is a statistical measure that varies over time, but for some reason people are unable to report statistics properly these days (what happened to the standard deviations!), either because they don't understand them or they think others don't. Heck, we can't even get people to report significant digits properly.

    When looking at capture rate measures, it may be better to think of the results as a test of the actual capture rate. So one good question to examine would be "What is the probability that this test result represents an actual average capture rate of n, let's say 98% for example?" That would at least let the test compare results across wildly variable sample sizes.
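One rough way to frame that last question in code (an illustrative sketch of my own, not MailChannels' methodology and not something from the paper): if a filter's true catch rate really were 98%, how likely is it that a test of n messages would catch k or fewer of them?

    from math import exp, lgamma, log

    def binom_pmf(i, n, p):
        # Probability of exactly i catches out of n, computed in log space so
        # that large n doesn't overflow.
        return exp(lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))

    def prob_k_or_fewer(k, n, true_cr=0.98):
        # P(catching k or fewer of n messages) if the true catch rate is true_cr.
        return sum(binom_pmf(i, n, true_cr) for i in range(k + 1))

    # Two tests that both report a 97% catch rate, with very different sample sizes:
    print(prob_k_or_fewer(k=97, n=100))    # ~0.32: easily consistent with a true 98% CR
    print(prob_k_or_fewer(k=970, n=1000))  # ~0.02: much harder to square with 98%

The same reported 97% means very different things depending on the sample size, which is the commenter's point about reporting the statistics, not just the headline number.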
