Delay's Blog is the blog of David Anson, a Microsoft developer who works with the Silverlight, WPF, Windows Phone, and web platforms.
http://dlaa.me/
@DavidAns
Internet downloads (particularly large ones) are often published with an associated checksum that can be used to verify that the file was successfully downloaded. While transmission errors are relatively rare, the popularity of file sharing and malware introduce the possibility that what you get isn't always exactly what you wanted. The posting of checksums by the original content publisher attempts to solve this problem by giving the end user an easy way to validate the downloaded file. Checksums are typically computed by applying a cryptographic hash function to the original file; the hash function computes a small (10-30 character) textual "snapshot" of the file that satisfies the following properties (quoting from Wikipedia):
It is easy to compute the hash for any given data It is extremely difficult to construct a [file] that has a given hash It is extremely difficult to modify a given [file] without changing its hash It is extremely unlikely that two different [files] will have the same hash
Because of these four properties, a user can be fairly confident a file has not been tampered with or garbled as long as the checksum computed on their own machine matches what the publisher posted. I say fairly confident because there's always the possibility of a hash collision - two different files with the same checksum. No matter how good the hash function is, the pigeonhole principle guarantees there will ALWAYS be the possibility of collisions. The point of a good hash function is to make this possibility so unlikely as to be impossible for all intents and purposes.
Popular hash functions in use today are MD5 and SHA-1, with CRC-32 rapidly losing favor. Speaking in very broad terms, one might say that the quality of CRC-32 is "not good", MD5 is "good", and SHA-1 is "very good". For now, that is; research is always under way that could render any of these algorithms useless tomorrow... (For more information about the weaknesses of each algorithm, refer to the links above.)
In order for published checksums to be useful, the user needs an easy way to calculate them. I looked around a bit and didn't a lot of free tools for computing these popular hash functions that I was comfortable with, so I wrote my own using .NET. Here it is:
ComputeFileHashes Computes CRC32, MD5, and SHA1 hashes for the specified file(s) Version 2009-01-12 http://blogs.msdn.com/Delay/ Syntax: ComputeFileHashes FileOrPattern [FileOrPattern [...]] Each FileOrPattern can specify a single file or a set of files Examples: ComputeFileHashes Image.iso ComputeFileHashes *.iso ComputeFileHashes *.iso *.vhd
[Click here to download ComputeFileHashes along with its complete source code.]
Here's the output of running ComputeFileHashes on the recently released Windows 7 Beta ISO images. Anyone can verify the checksums below on the MSDN Subscriber Downloads site. (Note: You need to be a subscriber to download files from there; everyone else can download from the official Windows 7 link.)
C:\T>ComputeFileHashes.exe "M:\Windows 7\*" M:\Windows 7\en_windows_7_beta_dvd_x64_x15-29074.iso CRC32: 8E2FAD39 MD5: 773FC9CC60338C612AF716A2A14F177D SHA1: E09FDBC1CB3A92CF6CC872040FDAF65553AB62A5 M:\Windows 7\en_windows_7_beta_dvd_x86_x15-29073.iso CRC32: AABA5A48 MD5: F9DCE6EBD0A63930B44D8AE802B63825 SHA1: 6071184282B2156FF61CDC5260545C078CCA31EE
One of the things that was important to me when writing ComputeFileHashes was performance. Nobody likes to wait, and I'm probably even less patient than the average bear. One of the things I wanted my program to do was take advantage of multi-processing and the multi-core CPUs that are so prevalent these days. So ComputeFileHashes runs the three hash functions in parallel with each other and with loading the next bytes of the file. Theoretically, this can take advantage of four different cores - though in practice my limited testing suggests there's just not enough work to saturate them all. :)
While I paid attention to performance at a macro level (i.e., algorithm design), I didn't worry about it at a micro level (i.e., focused optimization). And I used .NET of all things. (Cue the doubters: "ZOMG, EPIC FAIL!!") If it uses .NET, it must be slow, right? Well, let's do the numbers:
So depending on how you want to look at things, ComputeFileHashes is either the best or the worst of the lot. :) But recall that we're pitching an unoptimized .NET application against two solid native-code implementations. ComputeFileHashes pretty clearly held its ground and I'd say the doubters don't have much to complain about here.
Implementation notes:
WaitingRoom
ComputeFileHashes is a simple tool intended to make verifying checksums easy for anyone. I've wanted a good, fast, free, all-in-one solution I could trust for some time now, and the recent release of Windows 7 finally prompted me to write my own. I hope others to put ComputeFileHashes to use for themselves - perhaps even for the Windows 7 Beta! :)