Holy cow, I wrote a book!
After DirectX 3 was released, development on two successor products took place simultaneously: a shorter-term release called DirectX 4 and a more substantial longer-term release called DirectX 5.
But based on the feedback we were getting from the game development community, they didn't really care about the small features in DirectX 4; what they were much more interested in were the features of DirectX 5. So it was decided to cancel DirectX 4 and roll all of its features into DirectX 5.
So why wasn't DirectX 5 renamed to DirectX 4?
Because there were already hundreds upon hundreds of documents that referred to the two projects as DirectX 4 and DirectX 5. Documents that said things like "Feature XYZ will not appear until DirectX 5". Changing the name of the projects mid-cycle was going to create even more confusion. You would end up with headlines like "Microsoft removes DirectX 5 from the table - kiss good-bye to feature XYZ" and conversations reminiscent of Who's on First:
"I have some email from you saying that feature ABC won't be ready until DirectX 5. When do you plan on releasing DirectX 5?"
"We haven't even started planning DirectX 5; we're completely focused on DirectX 4, which we hope to have ready by late spring."
"But I need feature XYZ and you said that won't be ready until DirectX 5."
"Oh, that email was written two weeks ago. Since then, DirectX 5 got renamed to DirectX 4, and DirectX 4 was cancelled."
"So when I have a letter from you talking about DirectX 5, I should pretend it says DirectX 4, and when it says DirectX 4, I should pretend it says 'a project that has since been cancelled'?"
"Right, but check the date at the top of the letter, because if it's newer than last week, then when it says DirectX 4, it really means the new DirectX 4."
"And what if it says DirectX 5?"
"Then somebody screwed up and didn't get the memo."
"Okay, thanks. Clear as mud."
The AMD64 takes the traditional x86 and expands the registers to 64 bits, naming them rax, rbx, etc. It also adds eight more general purpose registers, named simply R8 through R15.
The first four parameters to a function are passed in rcx, rdx, r8 and r9. Any further parameters are pushed on the stack. Furthermore, space for the register parameters is reserved on the stack, in case the called function wants to spill them; this is important if the function is variadic.
Parameters that are smaller than 64 bits are not zero-extended; the upper bits are garbage, so remember to zero them explicitly if you need to. Parameters that are larger than 64 bits are passed by address.
The return value is placed in rax. If the return value is larger than 64 bits, then a secret first parameter is passed which contains the address where the return value should be stored.
All registers must be preserved across the call, except for rax, rcx, rdx, r8, r9, r10, and r11, which are scratch.
The callee does not clean the stack. It is the caller's job to clean the stack.
The stack must be kept 16-byte aligned. Since the "call" instruction pushes an 8-byte return address, this means that every non-leaf function is going to adjust the stack by a value of the form 16n+8 in order to restore 16-byte alignment.
Here's a sample:
void SomeFunction(int a, int b, int c, int d, int e); void CallThatFunction() { SomeFunction(1, 2, 3, 4, 5); SomeFunction(6, 7, 8, 9, 10); }
On entry to CallThatFunction, the stack looks like this:
Due to the presence of the return address, the stack is misaligned. CallThatFunction sets up its stack frame, which might go like this:
sub rsp, 0x28
Notice that the local stack frame size is 16n+8, so that the result is a realigned stack.
Now we can set up for the first call:
mov dword ptr [rsp+0x20], 5 ; output parameter 5 mov r9d, 4 ; output parameter 4 mov r8d, 3 ; output parameter 3 mov edx, 2 ; output parameter 2 mov ecx, 1 ; output parameter 1 call SomeFunction ; Go Speed Racer!
When SomeFunction returns, the stack is not cleaned, so it still looks like it did above. To issue the second call, then, we just shove the new values into the space we already reserved:
mov dword ptr [rsp+0x20], 10 ; output parameter 5 mov r9d, 9 ; output parameter 4 mov r8d, 8 ; output parameter 3 mov edx, 7 ; output parameter 2 mov ecx, 6 ; output parameter 1 call SomeFunction ; Go Speed Racer!
add rsp, 0x28 ret
Notice that you see very few "push" instructions in amd64 code, since the paradigm is for the caller to reserve parameter space and keep re-using it.
[Updated 11:00am: Fixed some places where I said "ecx" and "edx" instead of "rcx" and "rdx"; thanks to Mike Dimmick for catching it.]
Intel provides the Intel® Itanium® Architecture Software Developer's Manual which you can read to get extraordinarily detailed information on the instruction set and processor architecture. I'm going to describe just enough to explain the calling convention.
The Itanium has 128 integer registers, 32 of which (r0 through r31) are global and do not participate in function calls. The function declares to the processor how many registers of the remaining 96 it wants to use for purely local use ("local region"), the first few of which are used for parameter passing, and how many are used to pass parameters to other functions ("output registers").
For example, suppose a function takes two parameters, requires four registers for local variables, and calls a function that takes three parameters. (If it calls more than one function, take the maximum number of parameters used by any called function.) It would then declare at function entry that it wants six registers in its local region (numbered r32 through r37) and three output registers (numbered r38, r39 and r40). Registers r41 through r127 are off-limits.
Note to pedants: This isn't actually how it works, I know. But it's much easier to explain this way.
When the function wants to call that child function, it puts the first parameter in r38, the second in r39, the third in r40, then calls the function. The processor shifts the caller's output registers so they can act as the input registers for the called function. In this case r38 moves to r32, r39 moves to r33 and r40 moves to r34. The old registers r32 through r38 are saved in a separated register stack, different from the "stack" pointed to by the sp register. (In reality, of course, these "spills" are deferred, in the same way that SPARC register windows don't spill until needed. Actually, you can look at the whole ia64 parameter passing convention as the same as SPARC register windows, just with variable-sized windows!)
When the called function returns, the register then move back to their previous position and the original values of r32 through r38 are restored from the register stack.
This creates some surprising answers to the traditional questions about calling conventions.
What registers are preserved across calls? Everything in your local region (since it is automatically pushed and popped by the processor).
What registers contain parameters? Well, they go into the output registers of the caller, which vary depending on how many registers the caller needs in its local region, but the callee always sees them as r32, r33, etc.
Who cleans the parameters from the stack? Nobody. The parameters aren't on the stack to begin with.
What register contains the return value? Well that's kind of tricky. Since the caller's registers aren't accessible from the called function, you'd think that it would be impossible to pass a value back! That's where the 32 global registers come in. One of the global registers (r8, as I recall) is nominated as the "return value register". Since global registers don't participate in the register window magic, a value stored there stays there across the function call transition and the function return transition.
The return address is typically stored in one of the registers in the local region. This has the neat side-effect that a buffer overflow of a stack variable cannot overwrite a return address since the return address isn't kept on the stack in the first place. It's kept in the local region, which gets spilled onto the register stack, a chunk of memory separate from the stack.
A function is free to subtract from the sp register to create temporary stack space (for string buffers, for example), which it of course must clean up before returning.
One curious detail of the stack convention is that the first 16 bytes on the stack (the first two quadwords) are always scratch. (Peter Lund calls it a "red zone".) So if you need some memory for a short time, you can just use the memory at the top of the stack without asking for permission. But remember that if you call out to another function, then that memory becomes scratch for the function you called! So if you need the value of this "free scratchpad" preserved across a call, you need to subtract from sp to reserve it officially.
One more curious detail about the ia64: A function pointer on the ia64 does not point to the first byte of code. Intsead, it points to a structure that describes the function. The first quadword in the structure is the address of the first byte of code, and the second quadword contains the value of the so-called "gp" register. We'll learn more about the gp register in a later blog entry.
(This "function pointer actually points to a structure" trick is not new to the ia64. It's common on RISC machines. I believe the PPC used it, too.)
Okay, this was a really boring entry, I admit. But believe it or not, I'm going to come back to a few points in this entry, so it won't have been for naught.
class SomeClass { ... static DWORD CALLBACK s_ThreadProc(LPVOID lpParameter) { return ((SomeClass*)lpParameter)->ThreadProc(); } DWORD ThreadProc() { ... fun stuff ... } };
Some callback function signatures place the context parameter (also known as "reference data") as the first parameter. How convenient, for the secret "this" parameter is also the first parameter. Looking at the various calling conventions available to us, it sure looks like the __stdcall calling convention for member functions matches our desired stack layout rather well. Let's take WAITORTIMERCALLBACK for example:
__stdcall
WAITORTIMERCALLBACK
Well, "thiscall" doesn't match, but the two "__stdcall"s do. Fortunately the compiler is smart enough to recognize this and can optimize the s_ThreadProc static method to nothing if you just give it enough of a nudge:
s_ThreadProc
class SomeClass { ... static DWORD CALLBACK s_ThreadProc(LPVOID lpParameter) { return ((SomeClass*)lpParameter)->ThreadProc(); } DWORD __stdcall ThreadProc() { ... fun stuff ... } };
If you look at the code generation for the s_ThreadProc function, you'll see that has been reduced to nothing but a jump instruction, since the compiler has realized that the two calling conventions coincide here so there is no actual translation to do.
?s_ThreadProc@SomeClass@@SGKPAX@Z PROC NEAR jmp ?ThreadProc@SomeClass@@QAGKXZ ?s_ThreadProc@SomeClass@@SGKPAX@Z ENDP
Now some people would take this one step further and just cast the second parameter to CreateThread to LPTHREAD_START_ROUTINE and get rid of the helper s_ThreadProc function entirely. I strongly advise against this. I have seen too many people cause trouble by miscasting function pointers; more on this in a future entry.
CreateThread
LPTHREAD_START_ROUTINE
Although we took advantage above of a coincidence between the two __stdcall calling conventions, we did not rely on it. If the coincidence in calling conventions fails to occur, the code is still correct. This is important when it comes time to port this code to another architecture, one where the coincidence may longer be true!
(By the way, in case people didn't get it: I'm only talking in the context of calling conventions you're likely to encounter when doing Windows programming or which are used by Microsoft compilers. I do not intend to cover calling conventions for other operating systems or that are specific to a particular language or compiler vendor.)
Remember: If a calling convention is used for a C++ member function, then there is a hidden "this" parameter that is the implicit first parameter to the function.
The 32-bit x86 calling conventions all preserve the EDI, ESI, EBP, and EBX registers, using the EDX:EAX pair for return values.
The same constraints apply to the 32-bit world as in the 16-bit world. The parameters are pushed from right to left (so that the first parameter is nearest to top-of-stack), and the caller cleans the parameters. Function names are decorated by a leading underscore.
This is the calling convention used for Win32, with exceptions for variadic functions (which necessarily use __cdecl) and a very few functions that use __fastcall. Parameters are pushed from right to left [corrected 10:18am] and the callee cleans the stack. Function names are decorated by a leading underscore and a trailing @-sign followed by the number of bytes of parameters taken by the function.
The first two parameters are passed in ECX and EDX, with the remainder passed on the stack as in __stdcall. Again, the callee cleans the stack. Function names are decorated by a leading @-sign and a trailing @-sign followed by the number of bytes of parameters taken by the function (including the register parameters).
The first parameter (which is the "this" parameter) is passed in ECX, with the remainder passed on the stack as in __stdcall. Once again, the callee cleans the stack. Function names are decorated by the C++ compiler in an extraordinarily complicated mechanism that encodes the types of each of the parameters, among other things. This is necessary because C++ permits function overloading, so a complex decoration scheme must be used so that the various overloads have different decorated names.
There are some nice diagrams on MSDN illustrating some of these calling conventions.
Remember that a calling convention is a contract between the caller and the callee. For those of you crazy enough to write in assembly language, this means that your callback functions need to preserve the registers mandated by the calling convention because the caller (the operating system) is relying on it. If you corrupt, say, the EBX register across a call, don't be surprised when things fall apart on you. More on this in a future entry.
Curiously, it is only the 8086 and x86 platforms that have multiple calling conventions. All the others have only one!
Now we're going deep into trivia that absolutely nobody remembers or even cares about: The 32-bit calling conventions you don't see any more.
All of the processors listed here are RISC-style, which means there are lots of registers, none of which have any particular meaning. Well, aside from the zero register which is hard-wired to zero. (It turns out zero is a very handy number to have readily available.) Any meanings attached to the registers are those imposed by the calling convention.
As a throwback to the processors of old, the "call" instruction stores the return address in a register instead of being pushed onto the stack. A good thing, too, since the processor doesn't officially know about a "stack", it being a construction of the calling convention.
As always, registers or stack space used to pass parameters may be used as scratch by the called function, as can the return value register.
You may notice that all of the RISC calling conventions are basically the same. Once again, evidence that the 8086/x86 is the weirdo. A wildly popular weirdo, mind you.
The Alpha AXP ("AXP" being yet another of those faux-acronyms that officially doesn't stand for anything) has 32 integer registers, one of which is hard-wired to zero. By convention, one of the registers is the "stack pointer", one is the "return address" register; and two others have special meanings unrelated to parameter passing.
The first six parameters are passed in registers, with the remaining parameters on the stack. If the function is variadic, the parameters can be spilled onto the stack so they can be accessed as an array.
Seven other registers are preserved across calls, one is the return value, and the remaining thirteen are scratch. 1 zero register + 1 stack pointer + 1 return address + 2 special + 6 parameters + 7 preserved + 1 return value + 13 scratch = 32 total integer registers.
Function names on the Alpha AXP are completely undecorated.
The first four parameters are passed in a0, a1, a2 and a3; the remainder are spilled onto the stack. What's more, there are four "dead spaces" on the stack where the four register parameters "would have been" if they had been passed on the stack. These are for use by the callee to spill the register parameters back onto the stack if desired. (Particularly handy for variadic functions.)
Function names on the MIPS are completely undecorated.
The first eight parameters are passed in registers (r3 through r10), and the return address is managed manually.
I forget what happens to parameters nine and up...
Function names on the PowerPC are decorated by prepending two periods.
Postclaimer: I haven't had personal experience with the MIPS or PPC processors, so my discussion of those processors may be a tad off, but the basic idea I think is sound.
The 8086 was a 16-bit version of the even older 8080 processor, which had six 8-bit registers, named A, B, C, D, E, H, and L. The registers could be used in pairs to products three 16-bit pseudo-registers, BC, DE, and HL. What's more, you could put a 16-bit address into the HL register and use the pseudo-register "M" to deference it. So, for example, you could write "MOV B, M" and this meant to load the 8-bit value pointed to by the HL register pair into the B register.
The 8086 took these 8080 registers and mapped them sort of like this:
This is why the 8086 instruction set can only dereference through the [BX] register and not the [CX] or [DX] registers: On the original 8080, you could not dereference through [BC] or [DE], only thorugh M=[HL].
This much so far is pretty official. The instruction set for the 8086 was chosen to be upwardly-compatible with the 8080, so as to facilitate machine translation of existing 8-bit code to this new 16-bit processor. Even the MS-DOS function calls were designed so as to faciliate machine translation.
What about the SI and DI registers? I suspect they were inspired by the IX and IY registers available on the Z-80, a competitor to the 8080 which took the 8080 instruction set and extended it with more registers. The Z-80 allowed you to dereference through [IX] and [IY], so the 8086 lets you dereference through [SI] and [DI].
And what about the BP register? I suspect that was invented on the fly in order to facilitate stack-based parameter passing. Notice that the BP register is the only 8086 register that defaults to the SS segment register and which can be used to access memory directly.
Why not add even more registers, like today's processors with their palette of 16 or even 128 registers? Why limit the 8086 to only eight registers (AX, BX, CX, DX, SI, DI, BP, SP)? Well, that was then and this is now. At that time, processors did not have lots of registers. The 68000 had a whopping sixteen registers, but if you look more closely, only half of them were general purpose arithmetic registers; the other half were used only for accessing memory.
In the 16-bit world, part of the calling convention was fixed by the instruction set: The BP register defaults to the SS selector, whereas the other registers default to the DS selector. So the BP register was necessarily the register used for accessing stack-based parameters.
The registers for return values were also chosen automatically by the instruction set. The AX register acted as the accumulator and therefore was the obvious choice for passing the return value. The 8086 instruction set also has special instructions which treat the DX:AX pair as a single 32-bit value, so that was the obvious choice to be the register pair used to return 32-bit values.
That left SI, DI, BX and CX.
(Terminology note: Registers that do not need to be preserved across a function call are often called "scratch".)
When deciding which registers should be preserved by a calling convention, you need to balance the needs of the caller against the needs of the callee. The caller would prefer that all registers be preserved, since that removes the need for the caller to worry about saving/restoring the value across a call. The callee would prefer that no registers be preserved, since that removes the need to save the value on entry and restore it on exit.
If you require too few registers to be preserved, then callers become filled with register save/restore code. But if you require too many registers to be preserved, then callees become obligated to save and restore registers that the caller might not have really cared about. This is particularly important for leaf functions (functions that do not call any other functions).
The non-uniformity of the x86 instruction set was also a contributing factor. The CX register could not be used to access memory, so you wanted to have some register other than CX be scratch, so that a leaf function can at least access memory without having to preserve any registers. So BX was chosen to be scratch, leaving SI and DI as preserved.
So here's the rundown of 16-bit calling conventions:
In summary: Caller cleans the stack, parameters pushed right to left.
Function name decoration consists of a leading underscore. My guess is that the leading underscore prevented a function name from accidentally colliding with an assembler reserved word. (Imagine, for example, if you had a function called "call".)
Nearly all Win16 functions are exported as Pascal calling convention. The callee-clean convention saves three bytes at each call point, with a fixed overhead of two bytes per function. So if a function is called ten times, you save 3*10 = 30 bytes for the call points, and pay 2 bytes in the function itself, for a net savings of 28 bytes. It was also fractionally faster. On Win16, saving a few hundred bytes and a few cycles was a big deal.
Consequently, __fastcall was typically faster only for short leaf functions, and even then it might not be.
Okay, those are the 16-bit calling conventions I remember. Part 2 will discuss 32-bit calling conventions, if I ever get around to writing it.
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Compatibility
Actually, this list is only partial. Many times, the compatibility fix is made inside the core component for all programs rather than targetting a specific program, as this list does.
(The Windows 2000-to-Windows XP list is stored in your C:\WINDOWS\AppPatch directory, in a binary format to permit rapid scanning. Sorry, you won't be able to browse it easily. I think the Application Compatibility Toolkit includes a viewer, but I may be mistaken.)
Would you have bought Windows XP if you knew that all these programs were incompatible?
It takes only one incompatible program to sour an upgrade.
Suppose you're the IT manager of some company. Your company uses Program X for its word processor and you find that Program X is incompatible with Windows XP for whatever reason. Would you upgrade?
Of course not! Your business would grind to a halt.
"Why not call Company X and ask them for an upgrade?"
Sure, you could do that, and the answer might be, "Oh, you're using Version 1.0 of Program X. You need to upgrade to Version 2.0 for $150 per copy." Congratulations, the cost of upgrading to Windows XP just tripled.
And that's if you're lucky and Company X is still in business.
I recall a survey taken a few years ago by our Setup/Upgrade team of corporations using Windows. Pretty much every single one has at least one "deal-breaker" program, a program which Windows absolutely must support or they won't upgrade. In a high percentage of the cases, the program in question was developed by their in-house programming staff, and it's written in Visual Basic (sometimes even 16-bit Visual Basic), and the person who wrote it doesn't work there any more. In some cases, they don't even have the source code any more.
And it's not just corporate customers. This affects consumers too.
For Windows 95, my application compatibility work focused on games. Games are the most important factor behind consumer technology. The video card that comes with a typical computer has gotten better over time because games demand it. (Outlook certainly doesn't care that your card can do 20 bajillion triangles a second.) And if your game doesn't run on the newest version of Windows, you aren't going to upgrade.
Anyway, game vendors are very much like those major corporations. I made phone call after phone call to the game vendors trying to help them get their game to run under Windows 95. To a one, they didn't care. A game has a shelf life of a few months, and then it's gone. Why would they bother to issue a patch for their program to run under Windows 95? They already got their money. They're not going to make any more off that game; its three months are over. The vendors would slipstream patches and lose track of how many versions of their program were out there and how many of them had a particular problem. Sometimes they wouldn't even have the source code any more.
They simply didn't care that their program didn't run on Windows 95. (My favorite was the one that tried to walk me through creating a DOS boot disk.)
Oh, and that Application Compatibility Toolkit I mentioned above. It's a great tool for developers, too. One of the components is the Verifier: If you run your program under the verifier, it will monitor hundreds of API calls and break into the debugger when you do something wrong. (Like close a handle twice or allocate memory with GlobalAlloc but free it with LocalAlloc.)
Three examples off the top of my head of the consequences of grovelling into and relying on undocumented structures.
Of course there was no mention of this illicit behavior in the documentation. So when the background defragmenter corrupted their disks, Microsoft got the blame.
From the shell extension, they used an undocumented window message to get a pointer to one of the internal Explorer structures. Then they walked the structure until they found something they recognized. Then they knew, "The thing immediately after the thing that I recognize is the thing that I want."
Well, the thing that they recognize and the thing that they want happened to be base classes of a multiply-derived class. If you have a class with multiple base classes, there is no guarantee from the compiler which order the base classes will be ordered. It so happened that they appeared in the order X,Y,Z in all the versions of Windows this software company tested against.
Except Windows 2000.
In Windows 2000, the compiler decided that the order should be X,Z,Y. So now they grovelled in, saw the "X" and said "Aha, the next thing must be a Y" but instead they got a Z. And then they crashed your system some time later.
So I had to create a "fake X,Y" so when the program went looking for X (so it could grab Y), it found the fake one first.
This took the good part of a week to figure out.
It so happens that the NMHDR is allocated on the stack, so this program is reaching up into the stack and grabbing the value of some local variable (which happens to be two frames up the stack!) and using it to control their logic.
For Windows 2000, we upgraded the compiler to a version which did a better job of reordering and re-using local variables, and now the program couldn't find the local variable it wanted and stopped working.
I got tagged to investigate and fix this. I had to create a special NMHDR structure that "looked like" the stack the program wanted to see and pass that special "fake stack".
I think this one took me two days to figure out.
I hope you understand why I tend to go ballistic when people recommend relying on undocumented behavior. These weren't hobbyists in their garage seeing what they could do. These were major companies writing commercial software.
When you upgrade to the next version of Windows and you experience (a) disk corruption, (b) sporadic Explore crashes, or (c) sporadic loss of functionality in your favorite program, do you blame the program or do you blame Windows?
If you say, "I blame the program," the first problem is of course figuring out which program. In cases (a) and (b), the offending program isn't obvious.