Several people have asked why Internet Explorer 7 will send "real" URLs instead of hashes to the AP (Anti-Phishing) server. That's a good question, and I know it's a good question because it's the same thing just about everybody at Microsoft (including me) says the first time they hear about the feature :-). Nevertheless, a fairly quick investigation into the issue shows that hashing buys very little in terms of privacy but comes at significant cost.
First we need to figure out what threats are mitigated by sending hashes instead of URLs. Next we need to figure out what additional threats surface if we send hashes instead of URLs. Finally we determine which is "better" using some subjective measurement.
There are two main threats that hashes would mitigate:
1) Attacker sniffs data travelling to the AP server over the internet and records a list of all URLs that a particular person is visiting, using the list for further phishing attacks, user profiling, stalking, sale to a marketing company, etc.
2) Attacker gains unauthorised access to client data once it reaches the AP server and uses it for the same purposes as in #1.
Threat #1 is trivially mitigated by using SSL, and since the AP filter already does this nobody can sniff the traffic in transit. So hashes are not needed to mitigate threat #1.
Threat #2 is more interesting. I will start by assuming that Microsoft itself is not malicious and Microsoft's policies forbid unauthorised use of the AP filter data. You may choose to assume the opposite (ie, you might believe that Microsoft fully intends to use the AP data for profiling or marketing purposes), but in that case you should turn off the feature altogether. Sending hashes won't protect at all against a system intentionally designed to mine the data (we'll talk about why in a second).
Threat #2 boils down to an "insider" at Microsoft deliberately going against corporate policy to mine the data, or a malicious outsider gaining access to the server and doing the same, or some accidental disclosure of the information via recycling hard-drives that haven't been wiped, leaving print-outs in the garbage, etc. These things can (and do) happen, so it's not an unrealistic thing to be concerned about.
But what do hashes buy us? At first blush, they seem to buy us a lot. Now, instead of a list that says "User 12345 visited sites www.microsoft.com, www.slashdot.org, and www.apple.com" they now get a list that says something like "User 12345 visited sites A538E10D, 83B1E7C9, and 746C2B9A". Great! The bad guys don't know where I've been.
Or do they?
First things first: if Microsoft really was trying to track you then the server would simply have a database matching hashes-to-URLs, so it would be trivial to reverse every known hash back into its corresponding URL. That's why I said above that if you really believed we were going to be naughty then you should just turn off the feature.
The next thing to think about is what the attackers really want to know. Do they want a laundry-list of the hundred arbitrary web sites you visited last week, or do they just want to know if you visited a specific site such as www.citibank.com or www.their-competitor.com? In the latter case, the attacker can do the trivial reverse-match themselves, since they will have a database of URLs they care about along with the corresponding hashes.
At this point you may be asking yourself questions such as "I thought hashes were 1-way?" or "Can't you embed some kind of salt in the hash to make it irreversible?" or some other such questions. None of that helps. For the first objection, yes, hashes are 1-way but the same source text always hashes to the same value, so you can do reverse lookups if you know all the possible source texts and can pre-compute their hashes.
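To make the reverse-lookup point concrete, here is a minimal sketch in Python. The choice of SHA-256, the function names, and the list of URLs are all assumptions for illustration; nothing here reflects how the AP server actually works.

```python
import hashlib


def url_hash(url: str) -> str:
    """Hash a URL with SHA-256 (an assumed algorithm, chosen for illustration)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# Hypothetical list of URLs the attacker cares about.
urls_of_interest = ["www.citibank.com", "www.their-competitor.com"]

# Pre-computed reverse-lookup table: hash -> original URL.
reverse_table = {url_hash(u): u for u in urls_of_interest}


def reverse_lookup(hashed: str):
    """Return the URL behind a hash if it was pre-computed, else None."""
    return reverse_table.get(hashed)


if __name__ == "__main__":
    # Simulate a hash found in leaked server data.
    leaked = url_hash("www.citibank.com")
    print(reverse_lookup(leaked))
```

The hash function is never inverted; the attacker simply hashes every candidate URL once and looks the answer up.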
For the second objection, it just won't work. Imagine that each client request was hashed with a unique GUID so nobody could pre-compute the hash. Now how is the server-side filter going to work, since it has nothing to match the hash against (remember the whole point of AP is to match client requests with known-bad URLs and block them)? Even if you pass the GUID along with the hash in the request, the server would have to go and re-hash every known URL with the GUID on each request, which would be prohibitively expensive (and would still require a database of all known URLs in plain text on the server... resulting in additional threats to the system). And obviously there can't be a "shared secret" salt that Microsoft adds on both the client and the server since the bad guys would simply reverse-engineer the "secret" out of the Internet Explorer binaries.
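The cost of a per-request salt can be sketched as follows. This is only an illustration under assumptions: SHA-256 as the hash, a GUID prepended to the URL as the salt, and a tiny made-up list of known-bad URLs. The point is that the server has to do one hash per known URL for every single request, and has to keep those URLs in plain text to do it.

```python
import hashlib
import uuid


def salted_hash(url: str, salt: str) -> str:
    """Hash salt + URL with SHA-256 (assumed scheme, for illustration only)."""
    return hashlib.sha256((salt + url).encode("utf-8")).hexdigest()


# Hypothetical known-bad list; note it must be stored in plain text.
KNOWN_BAD = ["www.evil.com/evil.htm", "www.phish.example/login.htm"]


def client_request(url: str):
    """Client side: pick a fresh GUID per request and send (salt, hash)."""
    salt = uuid.uuid4().hex
    return salt, salted_hash(url, salt)


def server_check(salt: str, hashed: str):
    """Server side: re-hash known-bad URLs with this request's salt.

    Returns (is_bad, number_of_hashes_computed).
    """
    computed = 0
    for bad_url in KNOWN_BAD:
        computed += 1
        if salted_hash(bad_url, salt) == hashed:
            return True, computed
    return False, computed


if __name__ == "__main__":
    salt, hashed = client_request("www.phish.example/login.htm")
    print(server_check(salt, hashed))
```

With a real database the loop would run over every known URL for each clean request, which is the "prohibitively expensive" part.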
So the only thing that hashing really buys us is that it forces the attackers to keep their own database for matching URLs to hashes for reverse-lookup. Not much of a benefit if you ask me, especially since an enterprising businessperson could easily build and maintain such a database and sell it as a "legitimate" web service :-)
Now let's look at what new threats arise if we only send hashes to the server. Because hashes are 1-way functions (as noted above) there is no way to introspect on the hash to figure out what the original source text was (the only recourse being the known-source-text database lookup as noted above). Attackers can use this to their advantage:
1) Attacker appends random characters to their URL, resulting in unique hashes that are not in the AP database and thus bypassing the feature
In this attack, let's say that the AP server knows that www.evil.com/evil.htm is a known phishing page. It has a specific hash which is stored in the AP database. Now the attacker can simply send out e-mails such as www.evil.com/evil.htm.AAA to generate a completely different hash that is not in the database yet still sends the browser to the same page. (Of course they could also just rename evil.htm to evil2.htm, or use server-side 404 processing to respond to any random sequence of characters, etc.). Clearly a single hash for the entire URL is not good enough.
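A short sketch of why a single whole-URL hash is so easy to dodge. The hash algorithm (SHA-256) and the blocklist structure are assumptions made for the example, not a description of the real AP database.

```python
import hashlib


def url_hash(url: str) -> str:
    """Hash the entire URL as one opaque string (assumed SHA-256)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# Hypothetical server-side blocklist holding only hashes.
blocklist = {url_hash("www.evil.com/evil.htm")}


def is_blocked(url: str) -> bool:
    """True only if the exact URL string was hashed into the blocklist."""
    return url_hash(url) in blocklist


if __name__ == "__main__":
    print(is_blocked("www.evil.com/evil.htm"))      # exact match is caught
    print(is_blocked("www.evil.com/evil.htm.AAA"))  # trivial variant slips through
```

Because the server only sees opaque hashes, it has no way to notice that the second URL is a near-copy of the first.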
OK, so what about if we send separate hashes for the domain and the path, and now say "Anything from www.evil.com [with specific hash] is bad" and ignore the path portion. Well, now we have a different threat:
2) Attacker uses wildcard DNS to generate random host names, resulting in same outcome as above
In this attack, the attacker simply generates unique domain names like foo1.evil.com, foo2.evil.com, foo3.evil.com, and so on. They all point to the same server, but will hash to different values so again they will not appear in the AP database. OK, next solution is to hash each portion of the URL individually ("com", "evil", "foo1", etc.) and send all the hashes to the server. Just as with the path problem above, the server could have a rule such as "If you have 'evil' and 'com' as the top-level domain then ignore the subdomains and return it as a bad site." We still have a threat though:
3) Attacker hosts phishing site on a hosting server (such as www.geocities.com) and uses same URL obfuscation technique as above to avoid AP server
Using a hosting site means that the detection can't be based off the broken-up domain names, and can't be based on a hash of the entire path either. Instead, we have to perform the same break-up operation that was used for domains and split the entire path into its components (so, for example, in the URL www.my-hosting-company.com/path/to/phisher/evil.htm we would send seven distinct hashes, one for each of the following: "www", "my-hosting-company", "com", "path", "to", "phisher", and "evil.htm"). Now the server has to have rules that say "If you get the domain hashes for 'www', 'my-hosting-company' and 'com' and you get the path hashes for 'path', 'to', and 'phisher' in that order then return it as a bad site."
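The component-by-component scheme just described might look something like the sketch below. Everything here is an assumption for illustration: SHA-256 hex digests as the hash, a URL written without a scheme, and a single hard-coded rule. It also compares the size of the hashes with the size of the original URL.

```python
import hashlib


def part_hash(part: str) -> str:
    """Hash one URL component (assumed SHA-256, hex-encoded)."""
    return hashlib.sha256(part.encode("utf-8")).hexdigest()


def component_hashes(url: str):
    """Split 'host/path...' into hashed domain labels and hashed path segments."""
    host, _, path = url.partition("/")
    domain = [part_hash(label) for label in host.split(".")]
    segments = [part_hash(seg) for seg in path.split("/") if seg]
    return domain, segments


# Hypothetical server rule: these domain hashes plus this path prefix means "bad".
BAD_DOMAIN = [part_hash(p) for p in ("www", "my-hosting-company", "com")]
BAD_PATH_PREFIX = [part_hash(p) for p in ("path", "to", "phisher")]


def matches_rule(url: str) -> bool:
    """True if the domain matches exactly and the path starts with the bad prefix."""
    domain, segments = component_hashes(url)
    return domain == BAD_DOMAIN and segments[: len(BAD_PATH_PREFIX)] == BAD_PATH_PREFIX


if __name__ == "__main__":
    url = "www.my-hosting-company.com/path/to/phisher/evil.htm"
    domain, segments = component_hashes(url)
    sent = domain + segments
    print(len(sent), "hashes sent")
    print(sum(len(h) for h in sent), "characters of hashes vs", len(url), "characters of URL")
    print(matches_rule(url))
```

Even in this toy form, the client sends far more characters than the URL itself contains, and the server needs an ordered rule per bad site rather than a simple lookup.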
Now, not only are we sending orders of magnitude more data to the server (all those hashes are much bigger than the original source text) and increasing server-side processing dramatically, we still haven't solved the problem:
4) Attacker uses non-path-based resource identification, resulting in same outcome as above
An example of such identification can be found on this very web server, which will take a URL such as http://blogs.msdn.com/452453.aspx (note there is no path component) yet it still sends you to content that I control living under the /ptorr/ directory. (In this specific case, MSDN sends a re-direct which results in a navigation to the full URL including the path, but there is no reason why that has to be the case). Couple this with arbitrary query-string processing, custom server-based path parsers, URL re-writers, etc. and there's really no way to figure out where the content is going when all you have is an opaque hash. Fundamentally, you need to see the original source text of the URL to effectively mitigate these kinds of attacks on the server.
So is hashing really worth it? On the plus side, you put a trivial extra burden on the attackers who want to invade your privacy (they now need to maintain a URL-to-hash database for reverse lookups), and on the down side you significantly increase both the amount of data sent to the server and the time to process a request, and you introduce by-design loopholes for the attackers who want to phish you to bypass the feature. I'll let you decide which you think is the better approach...
It is important to note that I have only talked about the threats involved in both choices here; I have not touched on any potential benefits of sending the raw URLs (that's beyond the scope of this blog entry).
Also, I want to briefly touch on the query-string question -- why doesn't IE send the query string to the AP server along with the full URL? Won't the attackers simply use query strings to decide whether to respond with the real phishing page (for the people who click on the e-mail links including the query-string) or a "legitimate" page for the people processing requests at the AP server (who click on links without the query-string)?
Just like hashes, this doesn't really help (and in fact makes things worse). If the attackers are going to dynamically return a "good" or "bad" page depending on whether it has a query-string or not, they can also do so depending on whether that particular query string has been visited before! The server processing is like this:
1) If no query string, return "legitimate" page
2) If query string exists but is not in database of valid query strings, return "404" page
3) If query string is in database of valid query strings, remove it from database and return phishing page
Now when the victim visits the site by clicking on a link in an e-mail message, they send the query string victim=12345 and condition 3 is satisfied, so the user gets the phishing page. They report it to the AP server (including the query string) and soon afterwards a URL researcher at Microsoft goes to visit the site to see if it really is a phishing site. But because victim=12345 has already been visited, the researcher satisfies condition 2 and gets the 404 page, fooling them into thinking the site has already been taken down. So query strings don't really work. You could also do some trickery with cookies to ensure that the first victim always saw the phishing page even on subsequent visits, whilst the researcher always saw 404s.
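The three server-side conditions above can be sketched in a few lines. The function name, the return values, and the token set are hypothetical and exist only to show the one-time-use logic; this is not code from any real phishing kit.

```python
def handle_request(query_string: str, valid_tokens: set) -> str:
    """Decide which page to serve, following conditions 1 to 3 above.

    valid_tokens is the attacker's database of query strings that have been
    e-mailed out but not yet visited; it is modified in place.
    """
    # 1) No query string: show the harmless page.
    if not query_string:
        return "legitimate"
    # 2) Unknown or already-used query string: pretend the site is gone.
    if query_string not in valid_tokens:
        return "404"
    # 3) First visit with a valid query string: burn it and serve the phish.
    valid_tokens.remove(query_string)
    return "phishing"


if __name__ == "__main__":
    tokens = {"victim=12345", "victim=67890"}
    print(handle_request("victim=12345", tokens))  # the victim's click
    print(handle_request("victim=12345", tokens))  # the researcher's later visit
    print(handle_request("", tokens))              # a visit with no query string
```

The victim and the researcher send exactly the same request, yet see different pages, which is why reporting the query string to the AP server does not help.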
Query strings also make things theoretically worse, because they are the most likely portion of the URL to contain personally-identifiable information (username, tracking GUID, etc.). So if there was ever a breach of security at the AP server then not only would the attacker know all the websites you visited, they might also know your username and / or password to each of them as well!
But do not fear! User feedback is not the only way that the AP researchers will get new entries to put into the database -- they get the very same phishing e-mails that you and I get, so they will still be able to detect the phishing sites even without IE sending the data and even if the phishing server employs the fancy query-string detection logic.