Delay's Blog is the blog of David Anson, a Microsoft developer who works with the Silverlight, WPF, Windows Phone, and web platforms.
http://dlaa.me/
@DavidAns
In yesterday's post announcing ComputeFileHashes's new support for MD5 on Silverlight, I promised to share some details about my experience getting an MD5 HashAlgorithm implementation for Silverlight. Recall that an MD5 class is available in the desktop .NET Framework, but is not part of Silverlight 2's subset of the .NET Framework. (Probably in order to save space by excluding one of the less-secure cryptographic hash functions - a completely sensible tradeoff.) Because I didn't want to write my own code for MD5 (it's a non-trivial algorithm), the challenge was to find something freely available that I could just drop in and take advantage of. So I was very interested when I found out about Reid Borsuk's managed implementation of an MD5 HashAlgorithm for Silverlight because it sounded perfect for my needs.
The first step of incorporating something like this is to check the license: this code is under Ms-PL, so there were no problems there. The next step is to skim the code and get a general feel for how it works - and it was while doing this that I realized I wouldn't be able to use this code as-is...
To understand why, a little background is required:
The HashAlgorithm abstract class requires that derived classes implement the following methods: Initialize, HashCore, and HashFinal. Initialize gets called once at the start of hashing, then HashCore is called many times (being passed a different block of data each time), and then HashFinal is called once at the end of hashing to finalize any computations and return the computed hash value. It's a straightforward model and is flexible enough to accommodate a wide variety of checksum algorithms. Other than maintaining a few bytes of internal state across calls, there's no need for the hash algorithm to allocate anything: the data flows in from the user, gets processed, and is immediately forgotten about.
Or at least that's how it's supposed to work...
What I found when I looked at the aforementioned MD5 implementation was that it would allocate an internal buffer during Initialize, repeatedly re-allocate that buffer and append new data to the end of it during every call to HashCore, and then process the entire buffer all at once in HashFinal. While this approach works fine for fairly small inputs, it was completely impractical for ComputeFileHashes which is expected to process multi-gigabyte files as a matter of course. All those reallocations and the large internal buffer would quickly exhaust the physical memory of virtually any system in use today on something like the Windows 7 Beta ISO images I've been using for my examples. (In fact, it's a bit more dire than it initially seems: this technique requires twice the memory of the original data: that last call to HashCore needs to copy the nearly-full-sized buffer into a new full-sized buffer.)
Okay, so if I couldn't use the code as-is; the next step is to see what it would take to modify it so that it would work for my scenario. Well, this implementation uses a small HashAlgorithm wrapper class around a core MD5 implementation class, I wondered if it would be a simple matter of changing the way the wrapper called into the core. But looking at things a bit more closely, it seemed the core was not structured for that - and separating things like I wanted might not be trivial.
It's decision time: Do I start changing the structure of this code to work the way I need it to, or do I investigate other options? I considered the implications of both approaches, but it was something a coworker said that convinced me to spend a bit of time looking elsewhere. He asked, "If you don't like a fundamental part of the implementation and feel the need to fix it, why do you think you won't be compelled to make changes to the rest of it as well?" That question expressed my concerns pretty well, so I decided to look into other options for a bit. After all, I could always come back to this if nothing panned out.
The obvious place to start was the MD5 specification: RFC1321, The MD5 Message-Digest Algorithm. The body of this document describes the algorithm in great detail and would be a great place to start writing my own implementation if I was willing to spend a considerable amount of time developing and testing. But the real gem is in the appendix: a reference implementation of MD5 written in C! Fortunately, C's not so different from C# - and I've ported things before - so I had a decent idea what to expect. And it sure is hard to beat the reference implementation from the point of view of obtaining an accurate, (typically) bug-free, chunk of code. There is an accompanying license, but it's open (this is a public specification, after all) and primarily requires that derivative works identify themselves as being "derived from the RSA Data Security, Inc. MD5 Message-Digest Algorithm." (like I just did). So things seemed promising!
I decided to spend a bus ride porting the reference implementation and see how far I got. As it happens (and I'm sure this is no accident), the reference implementation uses exactly the Initialize/HashCore/HashFinal pattern that HashAlgorithm expects. Consequently, each of my own HashAlgorithm wrapper methods simply makes a single call into the ported reference implementation - and all of a sudden concerns about memory exhaustion are a thing of the past! By the end of the bus ride, I had successfully ported the reference implementation to C# and had it passing the seven test cases that are part of the specification.
My mind was pretty much made up at this point: I'd use my port of the MD5 reference implementation for the Silverlight version of ComputeFileHashes. This was code from a reliable source, code I had become familiar with, and code that I'd feel comfortable debugging or tuning if necessary. I beefed up the test cases a bit by exercising all of them for all the possible chunk sizes, addressed a couple of code analysis warnings, and had something ready in a jiffy. I added the MD5Managed class to the Silverlight build of ComputeFileHashes and - yep - it just worked. :)
MD5Managed
So here's a(nother) completely managed MD5 implementation that anyone is free to use (subject to the reference implementation's license). I haven't spent time optimizing it - but that was kind of the point (see the class comments below for more). I'd started out trying to avoid writing my own MD5 implementation and I only partly succeeded - but I'm glad with how this worked out and maybe some of you can benefit from what I've done. Even if all you do is run the Silverlight version of ComputeFileHashes from time to time, I feel like my relatively minimal investment was worthwhile! :)
[Click here to download a Visual Studio solution containing the source code for a Silverlight-ready managed implementation of MD5 along with the simple test cases discussed above.]
using System; using System.Diagnostics.CodeAnalysis; using System.Security.Cryptography; namespace Delay { /// <summary> /// MD5Managed: A HashAlgorithm implementation that acts as a thin wrapper /// around a C# translation of the MD5 reference implementation. The C code /// has been translated as closely as possible so that most of the original /// structure remains and comparisons between the two are straightforward. /// </summary> /// <remarks> /// Derived from the RSA Data Security, Inc. MD5 Message-Digest Algorithm. /// /// Specification: /// RFC1321 - The MD5 Message-Digest Algorithm /// http://www.faqs.org/rfcs/rfc1321.html /// /// Original license: /// Copyright (C) 1991-2, RSA Data Security, Inc. Created 1991. All /// rights reserved. /// /// License to copy and use this software is granted provided that it /// is identified as the "RSA Data Security, Inc. MD5 Message-Digest /// Algorithm" in all material mentioning or referencing this software /// or this function. /// /// License is also granted to make and use derivative works provided /// that such works are identified as "derived from the RSA Data /// Security, Inc. MD5 Message-Digest Algorithm" in all material /// mentioning or referencing the derived work. /// /// RSA Data Security, Inc. makes no representations concerning either /// the merchantability of this software or the suitability of this /// software for any particular purpose. It is provided "as is" /// without express or implied warranty of any kind. /// /// These notices must be retained in any copies of any part of this /// documentation and/or software. /// </remarks> public class MD5Managed : HashAlgorithm { // Current context private readonly MD5_CTX _context = new MD5_CTX(); // Last hash result private readonly byte[] _digest = new byte[16]; // True if HashCore has been called private bool _hashCoreCalled; // True if HashFinal has been called private bool _hashFinalCalled; /// <summary> /// Initializes a new instance. /// </summary> public MD5Managed() { InitializeVariables(); } /// <summary> /// Initializes internal state. /// </summary> public override void Initialize() { InitializeVariables(); } /// <summary> /// Initializes variables. /// </summary> private void InitializeVariables() { MD5Init(_context); _hashCoreCalled = false; _hashFinalCalled = false; } /// <summary> /// Updates the hash code with the data provided. /// </summary> /// <param name="array">Data to hash.</param> /// <param name="ibStart">Start position.</param> /// <param name="cbSize">Number of bytes.</param> protected override void HashCore(byte[] array, int ibStart, int cbSize) { if (null == array) { throw new ArgumentNullException("array"); } if (_hashFinalCalled) { throw new CryptographicException("Hash not valid for use in specified state."); } _hashCoreCalled = true; MD5Update(_context, array, (uint)ibStart, (uint)cbSize); } /// <summary> /// Finalizes the hash code and returns it. /// </summary> /// <returns></returns> protected override byte[] HashFinal() { _hashFinalCalled = true; MD5Final(_digest, _context); return Hash; } /// <summary> /// Returns the hash as an array of bytes. /// </summary> [SuppressMessage("Microsoft.Design", "CA1065:DoNotRaiseExceptionsInUnexpectedLocations", Justification = "Matching .NET behavior by throwing here.")] [SuppressMessage("Microsoft.Usage", "CA2201:DoNotRaiseReservedExceptionTypes", Justification = "Matching .NET behavior by throwing NullReferenceException.")] public override byte[] Hash { get { if (!_hashCoreCalled) { throw new NullReferenceException(); } if (!_hashFinalCalled) { // Note: Not CryptographicUnexpectedOperationException because that can't be instantiated on Silverlight 4 throw new CryptographicException("Hash must be finalized before the hash value is retrieved."); } return _digest; } } // Return size of hash in bits. public override int HashSize { get { return _digest.Length * 8; } } /////////////////////////////////////////////// // MD5 reference implementation begins here. // /////////////////////////////////////////////// /* MD5 context. */ private class MD5_CTX { public readonly uint[] state; /* state (ABCD) */ public readonly uint[] count; /* number of bits, modulo 2^64 (lsb first) */ public readonly byte[] buffer; /* input buffer */ public MD5_CTX() { state = new uint[4]; count = new uint[2]; buffer = new byte[64]; } public void Clear() { Array.Clear(state, 0, state.Length); Array.Clear(count, 0, count.Length); Array.Clear(buffer, 0, buffer.Length); } } /* Constants for MD5Transform routine. */ private const int S11 = 7; private const int S12 = 12; private const int S13 = 17; private const int S14 = 22; private const int S21 = 5; private const int S22 = 9; private const int S23 = 14; private const int S24 = 20; private const int S31 = 4; private const int S32 = 11; private const int S33 = 16; private const int S34 = 23; private const int S41 = 6; private const int S42 = 10; private const int S43 = 15; private const int S44 = 21; private static byte[] PADDING; [SuppressMessage("Microsoft.Performance", "CA1810:InitializeReferenceTypeStaticFieldsInline", Justification = "More compact this way")] static MD5Managed() { PADDING = new byte[64]; PADDING[0] = 0x80; } /* F, G, H and I are basic MD5 functions. */ private static uint F(uint x, uint y, uint z) { return (((x) & (y)) | ((~x) & (z))); } private static uint G(uint x, uint y, uint z) { return (((x) & (z)) | ((y) & (~z))); } private static uint H(uint x, uint y, uint z) { return ((x) ^ (y) ^ (z)); } private static uint I(uint x, uint y, uint z) { return ((y) ^ ((x) | (~z))); } /* ROTATE_LEFT rotates x left n bits. */ private static uint ROTATE_LEFT(uint x, int n) { return (((x) << (n)) | ((x) >> (32 - (n)))); } /* FF, GG, HH, and II transformations for rounds 1, 2, 3, and 4. Rotation is separate from addition to prevent recomputation. */ private static void FF(ref uint a, uint b, uint c, uint d, uint x, int s, uint ac) { (a) += F((b), (c), (d)) + (x) + (uint)(ac); (a) = ROTATE_LEFT((a), (s)); (a) += (b); } private static void GG(ref uint a, uint b, uint c, uint d, uint x, int s, uint ac) { (a) += G((b), (c), (d)) + (x) + (uint)(ac); (a) = ROTATE_LEFT((a), (s)); (a) += (b); } private static void HH(ref uint a, uint b, uint c, uint d, uint x, int s, uint ac) { (a) += H((b), (c), (d)) + (x) + (uint)(ac); (a) = ROTATE_LEFT((a), (s)); (a) += (b); } private static void II(ref uint a, uint b, uint c, uint d, uint x, int s, uint ac) { (a) += I((b), (c), (d)) + (x) + (uint)(ac); (a) = ROTATE_LEFT((a), (s)); (a) += (b); } /* MD5 initialization. Begins an MD5 operation, writing a new context. */ private static void MD5Init(MD5_CTX context) /* context */ { context.count[0] = context.count[1] = 0; /* Load magic initialization constants. */ context.state[0] = 0x67452301; context.state[1] = 0xefcdab89; context.state[2] = 0x98badcfe; context.state[3] = 0x10325476; } /* MD5 block update operation. Continues an MD5 message-digest operation, processing another message block, and updating the context. */ private static void MD5Update(MD5_CTX context, /* context */ byte[] input, /* input block */ uint inputIndex, // Starting index for input block uint inputLen) /* length of input block */ { /* Compute number of bytes mod 64 */ uint index = (uint)((context.count[0] >> 3) & 0x3F); /* Update number of bits */ if ((context.count[0] += ((uint)inputLen << 3)) < ((uint)inputLen << 3)) { context.count[1]++; } context.count[1] += ((uint)inputLen >> 29); uint partLen = 64 - index; /* Transform as many times as possible. */ uint i = 0; if (inputLen >= partLen) { Buffer.BlockCopy(input, (int)inputIndex, context.buffer, (int)index, (int)partLen); MD5Transform(context.state, context.buffer, 0); for (i = partLen; i + 63 < inputLen; i += 64) { MD5Transform(context.state, input, inputIndex + i); } index = 0; } /* Buffer remaining input */ Buffer.BlockCopy(input, (int)(inputIndex + i), context.buffer, (int)index, (int)(inputLen - i)); } /* MD5 finalization. Ends an MD5 message-digest operation, writing the the message digest and zeroizing the context. */ private static void MD5Final(byte[] digest, /* message digest */ MD5_CTX context) /* context */ { byte[] bits = new byte[8]; /* Save number of bits */ Encode(bits, context.count, 8); /* Pad out to 56 mod 64. */ uint index = (uint)((context.count[0] >> 3) & 0x3f); uint padLen = (index < 56) ? (56 - index) : (120 - index); MD5Update(context, PADDING, 0, padLen); /* Append length (before padding) */ MD5Update(context, bits, 0, 8); /* Store state in digest */ Encode(digest, context.state, 16); /* Zeroize sensitive information. */ context.Clear(); } /* MD5 basic transformation. Transforms state based on block. */ private static void MD5Transform(uint[] state, byte[] block, uint blockIndex) { uint a = state[0], b = state[1], c = state[2], d = state[3]; uint[] x = new uint[16]; Decode(x, block, blockIndex, 64); /* Round 1 */ FF(ref a, b, c, d, x[0], S11, 0xd76aa478); /* 1 */ FF(ref d, a, b, c, x[1], S12, 0xe8c7b756); /* 2 */ FF(ref c, d, a, b, x[2], S13, 0x242070db); /* 3 */ FF(ref b, c, d, a, x[3], S14, 0xc1bdceee); /* 4 */ FF(ref a, b, c, d, x[4], S11, 0xf57c0faf); /* 5 */ FF(ref d, a, b, c, x[5], S12, 0x4787c62a); /* 6 */ FF(ref c, d, a, b, x[6], S13, 0xa8304613); /* 7 */ FF(ref b, c, d, a, x[7], S14, 0xfd469501); /* 8 */ FF(ref a, b, c, d, x[8], S11, 0x698098d8); /* 9 */ FF(ref d, a, b, c, x[9], S12, 0x8b44f7af); /* 10 */ FF(ref c, d, a, b, x[10], S13, 0xffff5bb1); /* 11 */ FF(ref b, c, d, a, x[11], S14, 0x895cd7be); /* 12 */ FF(ref a, b, c, d, x[12], S11, 0x6b901122); /* 13 */ FF(ref d, a, b, c, x[13], S12, 0xfd987193); /* 14 */ FF(ref c, d, a, b, x[14], S13, 0xa679438e); /* 15 */ FF(ref b, c, d, a, x[15], S14, 0x49b40821); /* 16 */ /* Round 2 */ GG(ref a, b, c, d, x[1], S21, 0xf61e2562); /* 17 */ GG(ref d, a, b, c, x[6], S22, 0xc040b340); /* 18 */ GG(ref c, d, a, b, x[11], S23, 0x265e5a51); /* 19 */ GG(ref b, c, d, a, x[0], S24, 0xe9b6c7aa); /* 20 */ GG(ref a, b, c, d, x[5], S21, 0xd62f105d); /* 21 */ GG(ref d, a, b, c, x[10], S22, 0x02441453); /* 22 */ GG(ref c, d, a, b, x[15], S23, 0xd8a1e681); /* 23 */ GG(ref b, c, d, a, x[4], S24, 0xe7d3fbc8); /* 24 */ GG(ref a, b, c, d, x[9], S21, 0x21e1cde6); /* 25 */ GG(ref d, a, b, c, x[14], S22, 0xc33707d6); /* 26 */ GG(ref c, d, a, b, x[3], S23, 0xf4d50d87); /* 27 */ GG(ref b, c, d, a, x[8], S24, 0x455a14ed); /* 28 */ GG(ref a, b, c, d, x[13], S21, 0xa9e3e905); /* 29 */ GG(ref d, a, b, c, x[2], S22, 0xfcefa3f8); /* 30 */ GG(ref c, d, a, b, x[7], S23, 0x676f02d9); /* 31 */ GG(ref b, c, d, a, x[12], S24, 0x8d2a4c8a); /* 32 */ /* Round 3 */ HH(ref a, b, c, d, x[5], S31, 0xfffa3942); /* 33 */ HH(ref d, a, b, c, x[8], S32, 0x8771f681); /* 34 */ HH(ref c, d, a, b, x[11], S33, 0x6d9d6122); /* 35 */ HH(ref b, c, d, a, x[14], S34, 0xfde5380c); /* 36 */ HH(ref a, b, c, d, x[1], S31, 0xa4beea44); /* 37 */ HH(ref d, a, b, c, x[4], S32, 0x4bdecfa9); /* 38 */ HH(ref c, d, a, b, x[7], S33, 0xf6bb4b60); /* 39 */ HH(ref b, c, d, a, x[10], S34, 0xbebfbc70); /* 40 */ HH(ref a, b, c, d, x[13], S31, 0x289b7ec6); /* 41 */ HH(ref d, a, b, c, x[0], S32, 0xeaa127fa); /* 42 */ HH(ref c, d, a, b, x[3], S33, 0xd4ef3085); /* 43 */ HH(ref b, c, d, a, x[6], S34, 0x04881d05); /* 44 */ HH(ref a, b, c, d, x[9], S31, 0xd9d4d039); /* 45 */ HH(ref d, a, b, c, x[12], S32, 0xe6db99e5); /* 46 */ HH(ref c, d, a, b, x[15], S33, 0x1fa27cf8); /* 47 */ HH(ref b, c, d, a, x[2], S34, 0xc4ac5665); /* 48 */ /* Round 4 */ II(ref a, b, c, d, x[0], S41, 0xf4292244); /* 49 */ II(ref d, a, b, c, x[7], S42, 0x432aff97); /* 50 */ II(ref c, d, a, b, x[14], S43, 0xab9423a7); /* 51 */ II(ref b, c, d, a, x[5], S44, 0xfc93a039); /* 52 */ II(ref a, b, c, d, x[12], S41, 0x655b59c3); /* 53 */ II(ref d, a, b, c, x[3], S42, 0x8f0ccc92); /* 54 */ II(ref c, d, a, b, x[10], S43, 0xffeff47d); /* 55 */ II(ref b, c, d, a, x[1], S44, 0x85845dd1); /* 56 */ II(ref a, b, c, d, x[8], S41, 0x6fa87e4f); /* 57 */ II(ref d, a, b, c, x[15], S42, 0xfe2ce6e0); /* 58 */ II(ref c, d, a, b, x[6], S43, 0xa3014314); /* 59 */ II(ref b, c, d, a, x[13], S44, 0x4e0811a1); /* 60 */ II(ref a, b, c, d, x[4], S41, 0xf7537e82); /* 61 */ II(ref d, a, b, c, x[11], S42, 0xbd3af235); /* 62 */ II(ref c, d, a, b, x[2], S43, 0x2ad7d2bb); /* 63 */ II(ref b, c, d, a, x[9], S44, 0xeb86d391); /* 64 */ state[0] += a; state[1] += b; state[2] += c; state[3] += d; /* Zeroize sensitive information. */ Array.Clear(x, 0, x.Length); } /* Encodes input (UINT4) into output (unsigned char). Assumes len is a multiple of 4. */ private static void Encode(byte[] output, uint[] input, uint len) { for (uint i = 0, j = 0; j < len; i++, j += 4) { output[j] = (byte)(input[i] & 0xff); output[j + 1] = (byte)((input[i] >> 8) & 0xff); output[j + 2] = (byte)((input[i] >> 16) & 0xff); output[j + 3] = (byte)((input[i] >> 24) & 0xff); } } /* Decodes input (unsigned char) into output (UINT4). Assumes len is a multiple of 4. */ private static void Decode(uint[] output, byte[] input, uint inputIndex, uint len) { for (uint i = 0, j = 0; j < len; i++, j += 4) { output[i] = ((uint)input[inputIndex + j]) | (((uint)input[inputIndex + j + 1]) << 8) | (((uint)input[inputIndex + j + 2]) << 16) | (((uint)input[inputIndex + j + 3]) << 24); } } } }
Updated 2009-02-16: Call MD5Init from the constructor for consistency with the Framework's HashAlgorithm classes where a call to Initialize is not necessary for a newly constructed instance.
MD5Init
Initialize
Updated 2010-12-06: Added missing inputIndex offset to MD5Update method.
inputIndex
MD5Update
I was very happy with last week's release of ComputeFileHashes supporting the command-line, WPF, Silverlight, *and* ClickOnce. Only one thing bothered me: the Silverlight version didn't do MD5 due to the lack of support for that type of checksum by Silverlight 2. Recall that I'd fairly happily implemented my own CRC-32 class because none of the platforms supported it. [Also, it was relatively simple and had a good reference implementation. :) ] But because MD5 is a more complex algorithm and was only missing on Silverlight, I was reluctant to do the same thing for MD5...
What I really wanted was a freely available, Silverlight compatible HashAlgorithm-based MD5 implementation that I could trivially drop into my code and use on Silverlight. So I was excited when kind reader (and teammate!) Jeff Wilcox left a comment pointing to something that sounded perfect for my needs. I told Jeff I'd add MD5 for Silverlight and mentally breathed a sigh of relief that all four of ComputeFileHashes's supported platforms would provide the same set of checksums.
As it turns out, after a bit of research I decided not to use that MD5 implementation. (I'll explain why in my next post.) However, now that I'd fully bought in to the idea of MD5 on Silverlight, I was reluctant to let it go... So I spent some time working on an alternate solution and developed something I'm quite happy with. So I'm able to release an update to ComputeFileHashes that offers MD5 support on Silverlight!
The latest version of ComputeFileHashes is now 2009-01-26. I've updated all the binaries in order to avoid version number confusion - but the only real change here is the addition of MD5 for Silverlight. (FYI, I only updated the screenshot of the Silverlight version below.)
Please refer to the original release announcement for more information about supported platforms, source code, implementation, etc..
Click here or on the image below to install the ClickOnce version of ComputeFileHashes.
Click here or on the image below to run the Silverlight version of ComputeFileHashes in your browser.
Click here or on the image below to download the command-line and WPF versions of ComputeFileHashes - along with the ClickOnce and Silverlight versions AND the complete source code for everything!
I've said that "ComputeFileHashes is a simple tool intended to make verifying checksums easy for anyone.". And in some ways, I think the Silverlight version is the easiest option of all because there's no need to install it on your machine and it runs everywhere Silverlight 2 does (PC, Mac, (Linux soon!), Internet Explorer, Firefox, Safari, ...). So I'm really glad to add MD5 support to ComputeFileHashes for Silverlight - I hope you enjoy the new functionality!
While working on code for an upcoming blog post, I found myself dealing with the HashAlgorithm.HashSize property again and realized I'd made a silly mistake a few days ago... :(
HashAlgorithm.HashSize
I'm pretty sure I remember consulting the documentation when implementing this method for my free CRC-32 HashAlgorithm implementation, and I obviously believed the correct behavior was to return the size in bytes because that's what my comment says and that's what my code does. However, the documentation seems pretty clear on the matter: Gets the size, in bits, of the computed hash code." (emphasis mine). So my initial implementation of this property was wrong. Fortunately, none of the four implementations of ComputeFileHashes (command-line, WPF, ClickOnce, Silverlight) make use of HashSize, so they're not affected by this bug. Unfortunately, anyone who decided to use my CRC-32 implementation and referenced this property would see the wrong value. For their sake, I've just made the trivial fix to the code (multiplying the byte count by 8 to get bit count), updated the comment for that property, added a note about the update to the bottom of the original post, and republished it.
HashSize
I'm very sorry for the error and any trouble this may have caused.
PS - The version of CRC32.cs in the ComputeFileHashes source code download has not been updated - but will be in a few days as part of an upcoming post.
CRC32.cs
Last week, I released the ComputeFileHashes tool for calculating file checksums. (To read more about what checksums are and why they're useful, please refer to that post.) ComputeFileHashes is a fairly simple .NET command-line application for calculating the MD5, SHA-1, and CRC-32 hashes of one or more files. It takes advantage of the multi-processing capabilities of today's hardware to complete that task quickly - roughly on par with native-code implementations. ComputeFileHashes works quite well and I happily used it to verify the recently released Windows 7 Beta ISO images I'd downloaded.
Because not everybody is a fan of command-line tools, I thought it would be nice to use WPF to create a more user-friendly version of ComputeFileHashes. Once I'd done that, I knew it would be a trivial matter to publish the WPF version via ClickOnce to enable an absurdly easy install scenario. From there, porting to Silverlight would be straightforward and would offer an install-free, completely web-based solution with cross-platform (ex: PC/Mac), cross-browser (ex: IE/Firefox/Safari) appeal. What's more, because all of these platforms are built on .NET, so I expected to be able to take significant advantage of code sharing!
Implementation notes:
ComputeFileHashesCore.cs
ComputeFileHashesUI.cs
CRC32.csWaitingRoom.cs
HashFileInfo.csBlockingQueue.cs
ComputeFileHashesCL\
ComputeFileHashesCL.cs
ComputeFileHashesWPF\
Window1.xaml
Window1.xaml.cs
ComputeFileHashesSL\
Page.xaml
Page.xaml.cs
In the original release announcement, I wrote that "ComputeFileHashes is a simple tool intended to make verifying checksums easy for anyone.". Well, that's still the case - and adding support for WPF, ClickOnce, and Silverlight should make it even easier for everyone to use. Just decide what kind of user interface you prefer, and start using that version of ComputeFileHashes for all your checksumming needs! :)
In my notes for the release of the ComputeFileHashes tool (and source code), I mentioned that I'd written a synchronization object for managing the parallel computation of checksums across multiple threads. I called this class WaitingRoom (after failing to come up with anything better) and thought I'd write about the pattern it implements so others might use it as well.
WaitingRoom
To understand the motivation behind the WaitingRoom class, it's helpful to understand a bit about how ComputeFileHashes works. For those who haven't read the implementation notes, the basic goal is to enable parallel computation of multiple checksum algorithms for sequential chunks of a file. So there's a primary thread which is responsible for opening the file and making consecutive blocks of it available for processing and there are one or more worker threads which are responsible for processing the data in each block that becomes available. For performance reasons, there are two buffers for these blocks and they're swapped repeatedly: one is the "current" block (which has valid data) and the other is the "next" block (which is being filled-in). It's important to ensure the worker threads only access a block when it's valid, so the primary thread needs a way to tell the worker threads when it's safe to start as well as a way to find out when they've all finished with a block.
The WaitingRoom class makes this synchronization process easy by exposing three methods: Wait, Release, and Arrive. Both Wait and Release are used by the primary thread: it calls Wait to block until all worker threads are "in the waiting room" (during which time they're all blocked) and then calls Release to "open the waiting room door" and unblock the worker threads. [Okay, the analogy breaks down a bit here because most doctors admit only *one* patient at a time. :) ] The worker threads call Arrive to signal they've "entered the waiting room" and automatically block until the primary thread releases them. What's great about using WaitingRoom is that the order of the threads doesn't matter: the primary thread can be ready first, or the worker threads can be ready first, or the primary thread can be ready after some - but not all - of the worker threads are ready. Whatever the order, WaitingRoom coordinates things smoothly!
Wait
Release
Arrive
As it happens, ComputeFileHashes uses two WaitingRoom instances. The first instance is used to wait for the worker threads to be ready to process a new file. When this is going on, the primary thread has nothing else to do, so it calls Wait and then immediately calls Release when the wait completes. The second instance is used to wait for the worker threads to finish processing a block of data. In this case, the primary thread has some work it can do in parallel with the workers: it reads the next block of data from disk and updates the status display. So it makes a call to Release (which gets the worker threads going), does its own work while they're busy, then calls Wait to wait for them to finish. After that, it's safe to swap the "current" and "next" buffers and repeat the same steps with the next block.
The complete implementation of WaitingRoom is below. It uses only the standard .NET Monitor class - along with the C# lock statement to simplify the syntax a bit. There are a few Debug.Asserts in there to try to keep callers honest, but it's up to the developer to ensure that Wait and Release are only called by the primary thread and that Arrive is only called by the worker threads.
Here's what it looks like:
using System.Diagnostics; using System.Threading; namespace ComputeFileHashes { /// <summary> /// Implements a synchronization object that allows an owner /// thread to synchronize with its worker threads. /// </summary> class WaitingRoom { // Number of worker threads private readonly int _capacity; // Object on which to lock for entry private readonly object _entryLock = new object(); // Object on which to lock for exit private readonly object _exitLock = new object(); // Current count of worker threads private int _count; // "Sign" of owner/worker threads private bool _sign; /// <summary> /// Initializes a new instance. /// </summary> /// <param name="capacity">Number of worker threads.</param> public WaitingRoom(int capacity) { Debug.Assert(0 < capacity); _capacity = capacity; } /// <summary> /// Waits for all worker threads to call the Arrive method. /// </summary> public void Wait() { // Claim entry lock lock (_entryLock) { Debug.Assert((0 <= _count) && (_count <= _capacity)); // Block if all worker threads have not arrived if (_count < _capacity) { Monitor.Wait(_entryLock); } } } /// <summary> /// Signals the presence/availability of a worker thread. /// </summary> public void Arrive() { // Claim the entry lock bool sign; lock (_entryLock) { Debug.Assert((0 <= _count) && (_count < _capacity)); // Capture sign sign = _sign; // Wake owner thread if all worker threads present _count++; if (_count == _capacity) { Monitor.Pulse(_entryLock); } } // Claim the exit lock lock (_exitLock) { // Block if owner has not yet released the worker threads if (sign == _sign) { Monitor.Wait(_exitLock); } } } public void Release() { // Claim the exit lock lock (_exitLock) { Debug.Assert(_count == _capacity); // Reset count and flip sign _count = 0; _sign = !_sign; // Wake worker threads Monitor.PulseAll(_exitLock); } } } }
There's probably an official name for this kind of synchronization primitive, but I don't know what it is. :| I poked around a bit on Wikipedia's "Currency Control" page just now and found something called room synchronization that sounds close... I think WaitingRoom may be a type of room synchronization - with some specialized restrictions for the specifics of the ComputeFileHashes scenario. But whatever you call it, WaitingRoom is a pretty handy class to work with. Feel free to use WaitingRoom in your own code - maybe even drop me a note if you find other interesting uses. And if anyone reading knows the real name for what it's doing, please let me know! :)
In the notes for yesterday's release of the ComputeFileHashes tool (and source code), I mentioned that I'd written my own .NET HashAlgorithm class to compute CRC-32 hash values. The complete implementation can be found below and should behave just like every other HashAlgorithm subclass (ex: MD5 or SHA1). The code here is based on the CRC-32 reference implementation provided in Annex D of the PNG specification and pretty much "just worked". It implements the necessary Initialize, HashCore, and HashFinal methods as well as the technically optional (but practically necessary) Hash and HashSize properties. There's no test code to speak of, though it's worth pointing out that I've run tens of gigabytes of data through my ComputeFileHashes tool and have verified the correctness of the computed CRC-32 value for each test file. :)
HashCore
HashFinal
Hash
Without further ado:
using System; using System.Security.Cryptography; namespace Delay { /// <summary> /// HashAlgorithm implementation for CRC-32. /// </summary> [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Naming", "CA1709:IdentifiersShouldBeCasedCorrectly", MessageId = "CRC", Justification = "Matching algorithm acronym.")] public class CRC32 : HashAlgorithm { // Shared, pre-computed lookup table for efficiency private static readonly uint[] _crc32Table; /// <summary> /// Initializes the shared lookup table. /// </summary> [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Performance", "CA1810:InitializeReferenceTypeStaticFieldsInline", Justification = "Table values must be computed; not possible to remove the static constructor.")] static CRC32() { // Allocate table _crc32Table = new uint[256]; // For each byte for (uint n = 0; n < 256; n++) { // For each bit uint c = n; for (int k = 0; k < 8; k++) { // Compute value if (0 != (c & 1)) { c = 0xedb88320 ^ (c >> 1); } else { c = c >> 1; } } // Store result in table _crc32Table[n] = c; } } // Current hash value private uint _crc32Value; // True if HashCore has been called private bool _hashCoreCalled; // True if HashFinal has been called private bool _hashFinalCalled; /// <summary> /// Initializes a new instance. /// </summary> public CRC32() { InitializeVariables(); } /// <summary> /// Initializes internal state. /// </summary> public override void Initialize() { InitializeVariables(); } /// <summary> /// Initializes variables. /// </summary> private void InitializeVariables() { _crc32Value = uint.MaxValue; _hashCoreCalled = false; _hashFinalCalled = false; } /// <summary> /// Updates the hash code for the provided data. /// </summary> /// <param name="array">Data.</param> /// <param name="ibStart">Start position.</param> /// <param name="cbSize">Number of bytes.</param> protected override void HashCore(byte[] array, int ibStart, int cbSize) { if (null == array) { throw new ArgumentNullException("array"); } if (_hashFinalCalled) { throw new CryptographicException( "Hash not valid for use in specified state."); } _hashCoreCalled = true; for (int i = ibStart; i < ibStart + cbSize; i++) { byte index = (byte)(_crc32Value ^ array[i]); _crc32Value = _crc32Table[index] ^ ((_crc32Value >> 8) & 0xffffff); } } /// <summary> /// Finalizes the hash code and returns it. /// </summary> /// <returns></returns> protected override byte[] HashFinal() { _hashFinalCalled = true; return Hash; } /// <summary> /// Returns the hash as an array of bytes. /// </summary> [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Design", "CA1065:DoNotRaiseExceptionsInUnexpectedLocations", Justification = "Matching .NET behavior by throwing here.")] [System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Usage", "CA2201:DoNotRaiseReservedExceptionTypes", Justification = "Matching .NET behavior by throwing NullReferenceException.")] public override byte[] Hash { get { if (!_hashCoreCalled) { throw new NullReferenceException(); } if (!_hashFinalCalled) { // Note: Not CryptographicUnexpectedOperationException because // that can't be instantiated on Silverlight 4 throw new CryptographicException( "Hash must be finalized before the hash value is retrieved."); } // Convert complement of hash code to byte array byte[] bytes = BitConverter.GetBytes(~_crc32Value); // Reverse for proper endianness, and return Array.Reverse(bytes); return bytes; } } // Return size of hash in bits. public override int HashSize { get { return 4 * 8; } } } }
The CRC32 class presented here is nothing fancy, but it should be a pretty solid implementation of the once-popular CRC-32 algorithm that's ripe for reuse. Thanks to .NET's HashAlgorithm base class, it's easy to drop in CRC32 anywhere hashes are already being computed. I hope you find it useful!
Updated 2009-01-22: Corrected implementation of HashSize to return hash size in bits instead of bytes.
Updated 2009-02-16: Initialize the _crc32Value variable to its starting value for consistency with the Framework's HashAlgorithm classes where a call to Initialize is not necessary for a newly constructed instance.
_crc32Value
Updated 2010-12-06: Added missing ibStart offset to HashCore loop.
ibStart
Internet downloads (particularly large ones) are often published with an associated checksum that can be used to verify that the file was successfully downloaded. While transmission errors are relatively rare, the popularity of file sharing and malware introduce the possibility that what you get isn't always exactly what you wanted. The posting of checksums by the original content publisher attempts to solve this problem by giving the end user an easy way to validate the downloaded file. Checksums are typically computed by applying a cryptographic hash function to the original file; the hash function computes a small (10-30 character) textual "snapshot" of the file that satisfies the following properties (quoting from Wikipedia):
It is easy to compute the hash for any given data It is extremely difficult to construct a [file] that has a given hash It is extremely difficult to modify a given [file] without changing its hash It is extremely unlikely that two different [files] will have the same hash
Because of these four properties, a user can be fairly confident a file has not been tampered with or garbled as long as the checksum computed on their own machine matches what the publisher posted. I say fairly confident because there's always the possibility of a hash collision - two different files with the same checksum. No matter how good the hash function is, the pigeonhole principle guarantees there will ALWAYS be the possibility of collisions. The point of a good hash function is to make this possibility so unlikely as to be impossible for all intents and purposes.
Popular hash functions in use today are MD5 and SHA-1, with CRC-32 rapidly losing favor. Speaking in very broad terms, one might say that the quality of CRC-32 is "not good", MD5 is "good", and SHA-1 is "very good". For now, that is; research is always under way that could render any of these algorithms useless tomorrow... (For more information about the weaknesses of each algorithm, refer to the links above.)
In order for published checksums to be useful, the user needs an easy way to calculate them. I looked around a bit and didn't a lot of free tools for computing these popular hash functions that I was comfortable with, so I wrote my own using .NET. Here it is:
ComputeFileHashes Computes CRC32, MD5, and SHA1 hashes for the specified file(s) Version 2009-01-12 http://blogs.msdn.com/Delay/ Syntax: ComputeFileHashes FileOrPattern [FileOrPattern [...]] Each FileOrPattern can specify a single file or a set of files Examples: ComputeFileHashes Image.iso ComputeFileHashes *.iso ComputeFileHashes *.iso *.vhd
[Click here to download ComputeFileHashes along with its complete source code.]
Here's the output of running ComputeFileHashes on the recently released Windows 7 Beta ISO images. Anyone can verify the checksums below on the MSDN Subscriber Downloads site. (Note: You need to be a subscriber to download files from there; everyone else can download from the official Windows 7 link.)
C:\T>ComputeFileHashes.exe "M:\Windows 7\*" M:\Windows 7\en_windows_7_beta_dvd_x64_x15-29074.iso CRC32: 8E2FAD39 MD5: 773FC9CC60338C612AF716A2A14F177D SHA1: E09FDBC1CB3A92CF6CC872040FDAF65553AB62A5 M:\Windows 7\en_windows_7_beta_dvd_x86_x15-29073.iso CRC32: AABA5A48 MD5: F9DCE6EBD0A63930B44D8AE802B63825 SHA1: 6071184282B2156FF61CDC5260545C078CCA31EE
One of the things that was important to me when writing ComputeFileHashes was performance. Nobody likes to wait, and I'm probably even less patient than the average bear. One of the things I wanted my program to do was take advantage of multi-processing and the multi-core CPUs that are so prevalent these days. So ComputeFileHashes runs the three hash functions in parallel with each other and with loading the next bytes of the file. Theoretically, this can take advantage of four different cores - though in practice my limited testing suggests there's just not enough work to saturate them all. :)
While I paid attention to performance at a macro level (i.e., algorithm design), I didn't worry about it at a micro level (i.e., focused optimization). And I used .NET of all things. (Cue the doubters: "ZOMG, EPIC FAIL!!") If it uses .NET, it must be slow, right? Well, let's do the numbers:
So depending on how you want to look at things, ComputeFileHashes is either the best or the worst of the lot. :) But recall that we're pitching an unoptimized .NET application against two solid native-code implementations. ComputeFileHashes pretty clearly held its ground and I'd say the doubters don't have much to complain about here.
ComputeFileHashes is a simple tool intended to make verifying checksums easy for anyone. I've wanted a good, fast, free, all-in-one solution I could trust for some time now, and the recent release of Windows 7 finally prompted me to write my own. I hope others to put ComputeFileHashes to use for themselves - perhaps even for the Windows 7 Beta! :)
A few years ago I found myself spending a lot of time writing batch files to perform a variety of relatively simple tasks. For those who aren't familiar, you can do some surprisingly powerful things with batch files - but it usually involves a lot of trickery and arcane knowledge. I wanted a simpler way of doing things, but I didn't want to create a bunch of compiled executables and lose the elegant, easy-to-modify transparency that batch files offer. These days, I'd probably look to PowerShell for the solution, but back then there was no such thing...
What I did have was the power of .NET and some inspiration from a coworker's tool that was able to take .NET 1.1 source code and execute it "on the fly" without the need for compilation. What I did not have was something similar for .NET 2.0 or access to the source code for that tool (because the author had left the company). So I wrote my own:
============================================ == CSI: C# Interpreter == == Delay (http://blogs.msdn.com/Delay/) == ============================================ Summary ======= CSI: C# Interpreter Version 2009-01-06 for .NET 3.5 http://blogs.msdn.com/Delay/ Enables the use of C# as a scripting language by executing source code files directly. The source code IS the executable, so it is easy to make changes and there is no need to maintain a separate EXE file. CSI (CodeFile)+ (-d DEFINE)* (-r Reference)* (-R)? (-q)? (-c)? (-a Arguments)? (CodeFile)+ One or more C# source code files to execute (*.cs) (-d DEFINE)* Zero or more symbols to #define (-r Reference)* Zero or more assembly files to reference (*.dll) (-R)? Optional 'references' switch to include common references (-q)? Optional 'quiet' switch to suppress unnecessary output (-c)? Optional 'colorless' switch to suppress output coloring (-a Arguments)? Zero or more optional arguments for the executing program The list of common references included by the -R switch is: System.dll System.Data.dll System.Drawing.dll System.Windows.Forms.dll System.Xml.dll PresentationCore.dll PresentationFramework.dll WindowsBase.dll System.Core.dll System.Data.DataSetExtensions.dll System.Xml.Linq.dll CSI's return code is 2147483647 if it failed to execute the program or 0 (or whatever value the executed program returned) if it executed successfully. Examples: CSI Example.cs CSI Example.cs -r System.Xml.dll -a ArgA ArgB -Switch CSI ExampleA.cs ExampleB.cs -d DEBUG -d TESTING -R Notes ===== CSI was inspired by net2bat, an internal .NET 1.1 tool whose author had left Microsoft. CSI initially added support for .NET 2.0 and has now been extended to support .NET 3.0 and .NET 3.5. Separate executables are provided to accommodate environments where the latest version of .NET is not available. Version History =============== Version 2009-01-06 Initial public release Version 2005-12-15 Initial internal release
[Click here to download CSI for .NET 3.5, 3.0, 2.0, and 1.1 - along with the complete source code and test suite.]
Using CSI is quite simple. Here's an example from the release package:
C:\T\CSI>TYPE Samples\HelloWorld.cs using System; public class HelloWorld { public static void Main(string[] args) { Console.WriteLine("Hello world."); } } C:\T\CSI>CSI.exe Samples\HelloWorld.cs Hello world.
And if you use the included batch files to register the .CSI file type (more on this in the notes below), it's even easier:
C:\T\CSI>TYPE Samples\Greetings.csi using System; public class Greetings { public static void Main(string[] args) { Console.WriteLine("Hello {0}", string.Join(" ", args)); } } C:\T\CSI>Samples\Greetings.csi out there world. Hello out there world.
Notes:
RegisterCSI.cmd
UnregisterCSI.cmd
-R
var
LINQ
CSC.exe
CSI.exe
CSI was a fun project in its day and I've done quite a bit with it that would have been fairly tedious otherwise. Today's world offers plenty of alternatives - but if you're comfortable with C# and prefer to stick to one language for all your programming needs, CSI is pretty handy to have around. Because you never know when the urge to code will strike! :)