So why on earth did Microsoft choose natural alignment as the default?
After all, the compiler would be completely within the standard if it chose to pack its data tightly. And, as I said, on MS-DOS platforms, memory size was everything.
Well, I lied :). It turns out that on all x86-based platforms, memory size isn't everything. Memory size is critically important, but so is memory alignment.
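To make "naturally aligned" versus "packed tightly" concrete, here's a minimal C sketch. The struct and the two helper functions are hypothetical names invented for illustration, and the exact amount of padding depends on your compiler and target, so treat the numbers you get as something to measure rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record used only for illustration. */
struct sample {
    uint8_t  tag;    /* 1 byte */
    uint32_t value;  /* 4 bytes; a naturally aligning compiler
                        typically places it on a 4-byte boundary */
};

/* Padding bytes the compiler inserted between tag and value. */
size_t padding_before_value(void)
{
    return offsetof(struct sample, value) - sizeof(uint8_t);
}

/* Total bytes in the struct that hold no member data. */
size_t total_padding(void)
{
    return sizeof(struct sample) - (sizeof(uint8_t) + sizeof(uint32_t));
}
```

A tightly packing compiler would report zero for both functions; a naturally aligning one usually reports a few bytes of padding so that value starts on a boundary that matches its size.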
You can see this really clearly if you hook an 8088, 80186, 80286, or 80386 machine to an In Circuit Emulator (ICE) (the behavior's not as obvious with newer processors because of the L1 cache). For those who have never had the opportunity to use one, an ICE is essentially a processor add-on that lets you analyze what's happening on the processor at the external pin level. For the 8088-80286, it required a special version of the processor (called a bond-out chip) that added some additional trace information on the external bus. An ICE would let you look at all the memory accesses done by the machine, and would even allow you to look back in time - if your system crashed mysteriously (and on the 8088, where there was no illegal instruction trap, this happened often), you could look at the history of instructions fetched and rendered and see what happened.
For Intel machines before the 486, there was a really simple rule for performance: If your code or data was small, then your program ran faster. This is because the processor didn't have a particularly effective cache and memory was significantly slower than the processor (this is still the case). So the single thing you need to ensure is that you don't have to go to memory too often.
If you look at an instruction like:
mov ax, ds:[20h] ; Load ax with the word at offset 20h in the DS segment
Under an ICE, you'll see something like (assuming that DS points to 0x300):
mov ax, ds:[20h] ; Load ax with the word at offset 20h in the DS segment
<< FETCH: WORD @003020, Value 0x3525
But if you look at a fetch from offset 21h, you see:
mov ax, ds:[21h] ; Load ax with the word at offset 21h in the DS segment
<< FETCH: BYTE @003021, Value 25
<< FETCH: BYTE @003022, Value 35
So an unaligned memory access is broken up by the processor and turned into two aligned fetches. This is because the memory bus can't perform unaligned word accesses to memory. Some processors (like most RISC machines) don't even bother to break the memory access up; they just fault, generating an exception which is usually handled by the operating system and takes tens of thousands of instructions to process.
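As a rough model of what the ICE trace shows, here's a small C sketch. It assumes real-mode addressing (segment times 16 plus offset) and a 16-bit data bus that can only transfer a word in one cycle when the address is even; both function names are made up for this example, and the model ignores address wraparound and everything else a real bus does.

```c
#include <stdint.h>

/* Real-mode physical address: segment * 16 + offset. */
uint32_t physical_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Bus cycles needed to fetch one 16-bit word on a 16-bit bus
   (simplified model): one cycle at an even address, two at an
   odd address, because the word straddles two bus-width slots. */
int word_fetch_cycles(uint32_t address)
{
    return (address & 1u) ? 2 : 1;
}
```

With DS set to 0x300, physical_address(0x300, 0x20) gives 0x3020, which this model fetches in one cycle, while the word at 0x3021 takes two.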
So unaligned access to memory is a huge source of performance problems. One solution to this was the __unaligned attribute on data - it acts as a flag that tells the compiler the data referenced by this pointer might be unaligned, and thus it should generate code to access the data one byte at a time. Even though that involves many fetches of memory, it's STILL faster than handling the exception that might be generated.
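Here's roughly what "access the data one byte at a time" looks like if you write it by hand in portable C. This is a sketch, not the code any particular compiler emits for __unaligned; the function name is hypothetical, and it assumes the value is stored in little-endian byte order (as on x86).

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from an address of any
   alignment by loading four separate bytes and combining them.
   No single load is wider than one byte, so none can be
   misaligned. */
uint32_t load_u32_bytewise(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Four byte loads plus shifts and ORs is clearly more work than one aligned load, which is why you only want this path when the pointer really might be unaligned.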
But even with the unaligned attribute, it's important to realize that there are concurrency issues associated with unaligned memory access. In general, aligned reads and writes are atomic - if you write a value of 0x12345678 and then write 0x87654321 to an aligned block of memory, when you read it back, you'll get either 0x12345678 or 0x87654321. On the other hand, if the memory is unaligned, you might get 0x12344321 when you read it back. Now, given that modern CPUs can (and do) reorder writes, you'd probably need to put a memory barrier between the two writes, and I believe that the memory barrier would ensure that this wouldn't happen, but in the absence of a memory barrier, this is a very real problem (I've seen it happen in production code).
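To see where a value like 0x12344321 comes from, here's a deterministic C sketch of a torn write. It doesn't reproduce a real race; it just assumes the unaligned 32-bit write gets split into two 16-bit halves, with the low half landing first, and computes what a reader would see in between. The function name is invented for this example.

```c
#include <stdint.h>

/* Value a reader could observe if only the low 16 bits of
   new_value have reached memory and the high 16 bits still
   hold old_value (assumed split; real hardware may differ). */
uint32_t torn_after_low_half(uint32_t old_value, uint32_t new_value)
{
    return (old_value & 0xFFFF0000u) | (new_value & 0x0000FFFFu);
}
```

Calling torn_after_low_half(0x12345678, 0x87654321) yields 0x12344321: the top half of the old value glued to the bottom half of the new one.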
There's a great write-up on alignment on MSDN that I found here.
For more current processor architectures, the alignment issues aren't as significant - the L2 cache essentially renders many (most) of the issues moot, because it only accesses main memory a cache line (32 bytes) at a time. If the unaligned access crosses a cache line boundary, however, all bets are off - your memory access will be indescribably slow. Historically, unaligned data access across cache line (and page) boundaries has also been a source of many x86 CPU bugs (yes, CPUs have bugs too).
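If you want to check whether a particular access would straddle a cache line, the arithmetic is simple. Here's a sketch in C that assumes the 32-byte line size mentioned above (your CPU's line size may well be different) and uses a hypothetical function name.

```c
#include <stddef.h>
#include <stdint.h>

/* Line size taken from the text; an assumption for any given CPU. */
#define CACHE_LINE_SIZE 32u

/* Returns nonzero if the access covering the bytes
   [address, address + size) touches more than one cache line.
   size must be greater than zero. */
int crosses_cache_line(uintptr_t address, size_t size)
{
    uintptr_t first_line = address / CACHE_LINE_SIZE;
    uintptr_t last_line  = (address + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

A 4-byte access at address 28 stays inside one line, while the same access at address 30 spills into the next one. A naturally aligned 4-byte access can never cross a line, since the line size is a multiple of 4.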
Edit: Added comment about breaking atomicity, thanks Dmitry.