Continuing on in my series of rebuttals to the reviewers of my paper, I'd like to respond to the third and final review.

I agree with the author that a set of common metrics is paramount for being able to measure and compare current approaches, and use these as a guide for future development.

Excellent.  I'm glad someone got it.

The paper does not provide a single reference to any prior work, though. E.g. "catch rate" is called "recall" in Information retrieval, and the "FP rate" would be 1.0-precision.

True.

The "RPI" would need to be modified to include different costs, as I personally find FPs a lot worse and the odd spam slipping through, and so do a lot of people I know, as they still basically blindly trust email and still see it as a reliable communication channel.

True enough, but in my paper I say explicitly "In addition, this assumes that the tradeoff between FPs and FNs are linear; some users may believe that an FP is twice as bad as FNs, and so forth."

If the relationship is linear, then it doesn't matter that much.  Simply survey people and find out how much worse the average user thinks an FP is compared to an FN.  Then, put that constant in the denominator of the equation.

If the relationship is non-linear (e.g., exponential or logarithmic), then simply adjust the formula as necessary.  The point is that we need to combine Catch Rate and FP rate in order to arrive at a unified metric for cross-filter comparisons.
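To make the linear case concrete, here is a rough sketch in Python.  It is not the RPI formula from the paper (that equation isn't reproduced in this post); it is a stand-in linear cost, and the name `weighted_error`, the `fp_weight` parameter and the two example filters are all made up for illustration.  It only shows how a surveyed FP-vs-FN constant can change which filter comes out ahead.

```python
def weighted_error(catch_rate, fp_rate, fp_weight=1.0):
    """Combine missed spam and false positives into one cost (lower is better).

    fp_weight is the hypothetical survey constant: how many missed spams
    (FNs) the average user considers one false positive to be worth.
    """
    if not (0.0 <= catch_rate <= 1.0 and 0.0 <= fp_rate <= 1.0):
        raise ValueError("rates must be between 0 and 1")
    fn_rate = 1.0 - catch_rate            # fraction of spam that slipped through
    return fn_rate + fp_weight * fp_rate  # linear tradeoff between FNs and FPs


# Two made-up filters: A catches more spam, B has fewer false positives.
filter_a = {"catch_rate": 0.98, "fp_rate": 0.01}
filter_b = {"catch_rate": 0.95, "fp_rate": 0.001}

for weight in (1.0, 10.0):
    cost_a = weighted_error(fp_weight=weight, **filter_a)
    cost_b = weighted_error(fp_weight=weight, **filter_b)
    better = "A" if cost_a < cost_b else "B"
    print(f"fp_weight={weight}: A={cost_a:.3f}  B={cost_b:.3f}  better={better}")
```

With these invented numbers, filter A looks better when an FP counts the same as an FN, and filter B looks better once an FP counts ten times as much.  A non-linear relationship would mean swapping the last line of `weighted_error` for a different penalty.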

The "grey mail" section is interesting, but again I do not agree with some of the statements. I don't think that down-sampling is the solution to "mass marketing". What the x% response rate shows is that a message can be spam to one user, and ham to the next, i.e. spamminess is a function of message x user. Disallowing forwarded spam again is arbitrary.

All in all you do not fully answer your own question about including grey mail or not.

I think that down-sampling is the solution to mass marketing.  Let's say you randomly select a corpus of 10,000 grey messages.  Let's also assume that surveys say that 2% of people respond to grey mail.  If you filter 98% of it, then you can assume your filter is doing its job because 2% of the mail went to the end users who wanted it.  This can be further adjusted into the spam filtering equation; I didn't go into it in further detail in the paper due to space restrictions but it would have been an interesting discussion.
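The arithmetic in that example can be sketched in a few lines of Python.  The function names here are hypothetical and the 2% response rate is the assumed survey figure from the example above, not a measured value.

```python
def expected_grey_delivered(total_grey, response_rate):
    """Number of grey messages that should reach users if the filter's
    pass rate matches the surveyed response rate."""
    return round(total_grey * response_rate)


def grey_pass_gap(delivered, total_grey, response_rate):
    """Observed pass rate minus the surveyed response rate.

    A value near zero means the filter lets through about as much grey
    mail as users say they want; positive means too much, negative too little.
    """
    return delivered / total_grey - response_rate


# The example from the text: 10,000 grey messages, 2% assumed response rate.
target = expected_grey_delivered(10_000, 0.02)
print("grey messages expected to be delivered:", target)
print("gap if the filter passes exactly that many:",
      grey_pass_gap(target, 10_000, 0.02))
```

This only checks the aggregate pass rate.  It says nothing about whether the 2% that got through went to the particular users who wanted it, which is the part of the discussion I left out of the paper.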

Disallowing forwarded spam is arbitrary, but not totally.  The message is not unsolicited, nor is it bulk, and it comes from someone you might normally want to talk to.  So, there's a method behind my madness, but since we don't know whether this is good mail or bad mail, we call it grey, and if it passes filtering 2% of the time, then the filter is doing its job.