Terry Zink: Security Talk

Discussing Internet security in (mostly) plain English

Why change the FP metrics?

In the comments on my other post about accurate metrics, a fellow blogger writes the following:

In my experience every vendor who quotes a FP figure bases it on the total number of inbound messages (including those that get 5xx-rejected).

On the other hand, it is arguably the fairest way to measure FPs, as it reflects the total workload of the spam filter. All those messages have to go through the filter, so it makes sense to reflect them in the calculations.

In the past two weeks, I have been arguing internally here at Microsoft that measuring false positives as a proportion of total inbound traffic is not an accurate representation of the user experience, and therefore we should avoid using it.

Using Spam Filter A - a user receives 100 legitimate messages and 1000 spam messages.  The spam filter correctly delivers 95 of the legitimate messages and marks the other 5 as spam.  Using the traditional way of measuring false positives, the FP rate is 5 / (100 + 1000) = 5 / 1100 = 0.45%.

Using Spam Filter B - a user receives 100 legitimate messages and 3000 spam messages.  The spam filter correctly delivers 93 of the legitimate messages and marks the other 7 as spam.  Using the traditional way, the FP rate is 7 / (100 + 3000) = 7 / 3100 = 0.23%.

Using the traditional FP metric, Spam Filter A's FP rate is double Spam Filter B's, so Filter B appears better.  However, this ignores the effect of false positives on the user experience.  As a proportion of legitimate mail, Spam Filter A's FP rate is 5% while Spam Filter B's is 7%.  By that measure, Spam Filter A is superior.
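To make the comparison concrete, here is a short sketch (mine, not from any vendor's tooling) that computes both metrics for the hypothetical Filters A and B described above:

```python
# Compare the two FP metrics for the hypothetical Filters A and B.

def fp_of_total(fp, legit, spam):
    """Traditional metric: FPs as a share of all inbound mail."""
    return fp / (legit + spam)

def fp_of_legit(fp, legit):
    """Proposed metric: FPs as a share of legitimate mail only."""
    return fp / legit

# Filter A: 100 legit (5 misclassified), 1000 spam
# Filter B: 100 legit (7 misclassified), 3000 spam
for name, fp, legit, spam in [("A", 5, 100, 1000), ("B", 7, 100, 3000)]:
    print(f"Filter {name}: {fp_of_total(fp, legit, spam):.2%} of total, "
          f"{fp_of_legit(fp, legit):.2%} of legitimate mail")
# Filter A: 0.45% of total, 5.00% of legitimate mail
# Filter B: 0.23% of total, 7.00% of legitimate mail
```

Note how the ranking of the two filters flips depending on which denominator you choose.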

I propose that measuring false positives as a proportion of a user's legitimate mail stream is the proper metric for the following reasons:

• The increase in spam this year is making the traditional method irrelevant.  Doubling the amount of spam in the mail stream without improving accuracy on non-spam does not make for a better filter.  The growth in spam merely dwarfs the legitimate mail, shrinking its share of the denominator.  The metric is skewed.

• A user's legitimate mail stream stays more or less constant; at most, it grows slowly over time.  Users talk to the same people, subscribe to the same newsletters and read the same jokes.  Thus, when they encounter FPs, they experience them relative to how many messages they wanted to see, not the total of wanted plus unwanted messages.

Finally, with regards to the final point:

Personally, I can see both sides of the argument, but the pragmatic fact is that "the market" measures FPs as a proportion of *total* email, so arguments that they should do otherwise are a bit academic.

This is a valid point.  This is certainly the way the market (i.e., the industry) advertises its FP rates.  I would counter by saying that the market has been ambiguous on the point, if not in the past, then certainly now.  It's time for a redefinition of success.

• fwiw, not every vendor does this -- at least, we in SpamAssassin measure FPs as a percentage of the total *ham* input, not counting spam.

We've always done it this way because, as you say, measuring FPs against the total mail stream makes no sense. ;)

• Good for you guys.  You're doing it right.

• I personally like the way DSPAM produces its statistics. I entered the data for A and B into DSPAM, and this is what it prints out:

mail ~ # dspam_stats -H user_a
user_a:
TP True Positives:           1000
TN True Negatives:             95
FP False Positives:             5
FN False Negatives:             0
SC Spam Corpusfed:              0
NC Nonspam Corpusfed:           0
TL Training Left:            2400
SHR Spam Hit Rate:        100.00%
HSR Ham Strike Rate:        5.00%
OCA Overall Accuracy:      99.55%

mail ~ # dspam_stats -H user_b
user_b:
TP True Positives:           3000
TN True Negatives:             93
FP False Positives:             7
FN False Negatives:             0
SC Spam Corpusfed:              0
NC Nonspam Corpusfed:           0
TL Training Left:            2400
SHR Spam Hit Rate:        100.00%
HSR Ham Strike Rate:        7.00%
OCA Overall Accuracy:      99.77%

mail ~ #

Filter A has the better ham catch rate (a lower ham strike rate), but overall accuracy is higher for B.
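The headline numbers above can be reproduced directly from the confusion-matrix counts. This sketch is my reading of the output shown (TP = spam caught, TN = ham delivered, FP = ham flagged as spam, FN = spam missed), not DSPAM's documented internals:

```python
# Reproduce DSPAM-style headline statistics from confusion-matrix counts.

def dspam_style_stats(tp, tn, fp, fn):
    shr = tp / (tp + fn)                    # Spam Hit Rate
    hsr = fp / (tn + fp)                    # Ham Strike Rate
    oca = (tp + tn) / (tp + tn + fp + fn)   # Overall Accuracy
    return shr, hsr, oca

for user, counts in {"user_a": (1000, 95, 5, 0),
                     "user_b": (3000, 93, 7, 0)}.items():
    shr, hsr, oca = dspam_style_stats(*counts)
    print(f"{user}: SHR {shr:.2%}  HSR {hsr:.2%}  OCA {oca:.2%}")
# user_a: SHR 100.00%  HSR 5.00%  OCA 99.55%
# user_b: SHR 100.00%  HSR 7.00%  OCA 99.77%
```

The ham strike rate is exactly the "FPs as a proportion of legitimate mail" metric proposed in the post, while overall accuracy dilutes it with the spam volume.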

• '''Personally, I can see both sides of the argument, but the pragmatic fact is that "the market" measures FPs as a proportion of *total* email, so arguments that they should do otherwise are a bit academic.'''

I would call that an accurate but invalid point.  It's just like saying "Everyone else screws their customers, so we should too," which is also accurate but invalid.

You stated exactly what's needed here.  The ratio of false positives to legitimate messages captures this half of the real impact on the user.

• Interesting.

But the question I have is: "according to whose perspective?"

Personally, I agree with you that Filter A is better, but only because I am "joe schmoe, average internet user"...  To me, FP is a problem defined by how many messages I wanted to see but didn't.  I'd much rather have 10 spam emails get through than have 1 important personal or business email get lost.

But what if I am "joe schmoe, CEO of an ISP or some large tech firm with significant investment in network infrastructure"?  Would my perspective on the importance of a few lost emails, relative to thousands of bandwidth-clogging spams, change?

I don't know.

• Another question...

If Filter A catches 1000 spams and Filter B catches 3000, what's to say that those 3000 (or 1000) spams are unique?

What's to say that the 3000 spams caught by Filter B didn't just exercise the exact same lines of code as the 1000 caught by Filter A?

It would hardly be fair to say that Filter B is somehow "better" because it matched the same exact patterns as Filter A, just multiple times...

Now I know I'm not the first one to think of this.  I must have overlooked where this detail was already mentioned...
