I had a brief e-mail exchange with one of the devs on the optimizer team about a checkin he put up for review. He modified the compiler so that it only aligns the stack for functions that call other functions. In other words, it skips the alignment for what compiler lingo typically calls a 'leaf function': one that makes no calls of its own. My first response was "don't do that - you may still have to align for reasons A, B, & C". To which he responded with a quote from the ABI doc that explicitly says you only have to align the stack if you're calling a function. So I started reading. Turns out, he's right (and so is the doc), but there are some really nasty gotchas involved. When we initially did this stuff, we just lived with the scenario where you might be wasting a bit more stack space...
The theme of the problems revolves around restrictions on encoding information in the unwind data. If you are saving XMM registers and have an unaligned stack pointer, you must use the SAVE_XMM128_FAR opcode because the SAVE_XMM128 opcode will only accept offsets in multiples of 16 bytes. Or I guess you could use MOVUPS instead of MOVAPS to save & restore your XMM register, but I wouldn't recommend that on current hardware. Similarly, you have to use the ALLOC_LARGE descriptor for stack allocation of sizes that aren’t [n * 8 + 8], with n in [0-15] (so 8 to 128 bytes). ALLOC_LARGE actually has 2 variants - one that multiplies by 8, and the other that doesn't. You can use the latter to allocate arbitrary amounts of data from the stack. But then if you want to use a frame pointer, you're going to be in a pretty weird spot, because it will have to be unaligned, as well. Unwind data dictates that you can only have a frame pointer that = RSP + 16 * [0-15].
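To make those encoding rules concrete, here's a rough Python sketch of how a compiler might pick between the opcodes. The function names are made up for illustration and this is not how any real compiler is structured; the opcode names use the UWOP_ prefix from the unwind data docs, and the limits assume 4-bit operation info fields and 16-bit unwind code slots.

```python
# Hypothetical helpers illustrating the unwind-code restrictions above.
# These names are illustrative only, not part of any real compiler.

def alloc_opcode(size):
    """Return (opcode, encoded operand) for a 'sub rsp, size' in the prolog."""
    if size <= 0:
        raise ValueError("allocation size must be positive")
    if size % 8 == 0 and size <= 128:
        # ALLOC_SMALL stores n in a 4-bit field, where size = n * 8 + 8.
        return ("UWOP_ALLOC_SMALL", (size - 8) // 8)
    if size % 8 == 0 and size // 8 <= 0xFFFF:
        # ALLOC_LARGE, scaled variant: size / 8 stored in one 16-bit slot.
        return ("UWOP_ALLOC_LARGE (scaled)", size // 8)
    # ALLOC_LARGE, unscaled variant: raw size stored in two 16-bit slots.
    return ("UWOP_ALLOC_LARGE (unscaled)", size)


def xmm_save_opcode(offset):
    """Return (opcode, encoded operand) for saving an XMM register at [rsp + offset]."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if offset % 16 == 0 and offset // 16 <= 0xFFFF:
        # SAVE_XMM128 stores offset / 16, so the offset must be 16-byte aligned.
        return ("UWOP_SAVE_XMM128", offset // 16)
    # SAVE_XMM128_FAR stores the raw 32-bit offset.
    return ("UWOP_SAVE_XMM128_FAR", offset)


def frame_offset_encodable(offset):
    """True if 'frame pointer = RSP + offset' fits the 4-bit, scaled-by-16 field."""
    return offset % 16 == 0 and 0 <= offset <= 15 * 16


if __name__ == "__main__":
    for size in (40, 136, 20):
        print(size, alloc_opcode(size))
    for offset in (32, 40):
        print(offset, xmm_save_opcode(offset))
```

Note how an XMM save at offset 40 is forced onto the FAR form, and a frame pointer at RSP + 24 simply can't be described at all.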
I'll probably come back to this article & add some nice hyperlinks to ABI details in the doc itself, but it's been a while since I blogged anything useful, so I figured I'd just get this out there quickly.
If the leaf function has to adjust rsp at all, then there's no point in *not* aligning the stack. Rounding rsp down to a 16-byte boundary can't cross a 4K page boundary, because 4096 is itself a multiple of 16, so the "wasted" memory is already allocated. If you want to get really pedantic, it can't even affect the size of the sub instruction used to adjust rsp.
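A quick sanity check of that arithmetic, as a sketch under a few assumptions that aren't spelled out above: rsp is 16n + 8 on entry (the return address was just pushed), allocation sizes are multiples of 8, and `sub rsp, imm` uses the sign-extended 8-bit immediate form for values up to 127 and the 32-bit form otherwise. The helper names are hypothetical.

```python
PAGE = 4096

def pad_to_align(alloc):
    """Pad 'alloc' so that rsp ends up 16-byte aligned after 'sub rsp, alloc'.

    Assumes rsp is 8 mod 16 on entry, so the allocation must be 8 mod 16.
    """
    return alloc if alloc % 16 == 8 else alloc + 8

def sub_imm_bytes(alloc):
    """Immediate size of 'sub rsp, alloc': 1 byte up to 127, else 4 bytes."""
    return 1 if alloc <= 127 else 4

def same_page(a, b):
    return a // PAGE == b // PAGE

def check(entry_rsp, max_alloc=2 * PAGE):
    """Verify both claims for every multiple-of-8 allocation up to max_alloc."""
    assert entry_rsp % 16 == 8
    for alloc in range(8, max_alloc + 1, 8):
        padded = pad_to_align(alloc)
        unaligned_rsp = entry_rsp - alloc
        aligned_rsp = entry_rsp - padded
        assert aligned_rsp % 16 == 0
        # The padding never lands in a page the unpadded frame didn't touch.
        assert same_page(unaligned_rsp, aligned_rsp)
        # The padding never pushes the immediate from imm8 to imm32.
        assert sub_imm_bytes(alloc) == sub_imm_bytes(padded)
    return True

if __name__ == "__main__":
    # Try entry values that sit right at and around a page boundary.
    for entry in (0x7FF000 + 8, 0x7FF000 + 24, 0x7FF000 - 8):
        print(hex(entry), check(entry))
```

The instruction-size part works out because the only sizes that need padding are multiples of 16: 112 pads to 120, which still fits in an 8-bit immediate, and 128 already needs the 32-bit form.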