September, 2011

  • The Old New Thing

    2011 Q3 link clearance: Microsoft blogger edition


    It's that time again: Linking to other Microsoft bloggers.

  • The Old New Thing

    There's also a large object heap for unmanaged code, but it's inside the regular heap


    Occasionally, a customer will ask for assistance explaining some strange heap behavior, or at least heap behavior that appears to be strange if you assume that the heap behaves purely classically.

    I need to understand the heap behavior we're seeing. I have a sample program which allocates five blocks of memory from the process heap, each of size 100 bytes. When we dump the heap blocks with the !heap -x command, we find that all of them belong to the same heap segment, and when we do a !vadump -v, we find that they all live on the same page. On the other hand, if we allocate five blocks of size 512KB, then we find that each one gets placed in its own virtually-allocated range, and the range is not size 512KB but rather 516KB.


    1. Why does the system combine the 100-byte allocations? Why is there no extra overhead for the small allocations, but the large allocations have a 4KB overhead?
    2. Why does the system allocate a separate virtual address range for the 512KB block? Which component makes this decision and what are the factors involved?
    3. Why does allocating a 512KB block require 4KB of overhead?
    4. Is there any documentation for this behavior?

    Let's look at the last question first. There is no documentation for this behavior because it is all implementation detail. All the heap is required to do is return you a block of memory of the size you requested. How it goes about getting this memory block is at the discretion of the heap manager, and as we've seen in the past, the precise algorithm has varied over time.

    The claim that there is no extra overhead for the small allocations is incorrect. There is still a small amount of overhead, but you can't see it because the !heap -x command doesn't report it. There is still overhead in the heap to keep track of the memory block's actual size, the fact that the memory block is allocated (and not free), and other overhead-type things. Indeed, even for large blocks, you didn't see the overhead reported by the !heap -x command; you had to drop below the heap and use the !vadump command to see it. (And how do you know that the other 3.9KB are being lost to overhead? Maybe they are being held for a future 3.9KB allocation?)

    Okay, so fine, small blocks have overhead, but why do larger blocks have significantly higher overhead? The answer to this is in the second question: The system allocates large blocks in a separate virtual address range because the default process heap treats large memory blocks differently from small memory blocks, in the same way that the CLR treats large and small objects differently.

    You might say that the unmanaged heap also has a large-object sub-heap.

    When objects get large, the heap switches to VirtualAlloc. You can think of it as creating a custom segment just for that object. And since VirtualAlloc allocates memory in integer multiples of the page size, any nonzero overhead will result in a full extra page being allocated since the requested allocation size was itself already an integral multiple of the page size. Mathematically:

    roundup(n × PageSize + v, PageSize) = n × PageSize + roundup(v, PageSize).

    Therefore, even if a heap allocation has only one byte of overhead, you will have to pay a full page for it due to rounding.

    The precise point at which the heap will switch to VirtualAlloc has changed over time, so don't rely on it. In Windows 95, the switchover point was around 4MB, but Windows NT set the cutover to something close to 512KB. If you're interested in details of the internal heap bookkeeping, you can check out Advanced Windows Debugging or Windows Internals. Note that information about the heap implementation is to be used for diagnostic and educational purposes only. Do not write code that takes advantage of the internals of the heap architecture, because those details change all the time.

    The purpose of a heap is to reduce memory and address space overhead compared to direct allocation via methods like VirtualAlloc by combining multiple small allocations into a single 64KB block, at a cost of some overhead per item. Although there is a small amount of overhead per item, the overall savings makes it a win. If you need eight hundred 100-byte chunks of memory, and you didn't have a heap manager, you would allocate eight hundred 64KB blocks (since VirtualAlloc allocates address space in 64KB chunks) and commit the first page in each, for a total memory usage of around 3MB and address space usage of around 50MB. Even if the heap overhead were a ridiculous 100%, the eight hundred allocations would fit into forty 4KB pages, reducing the memory usage to 160KB and the address space usage to 192KB, a massive savings. (Looking at things in terms of relative overhead, 4KB out of a 512KB allocation is still less than one percent. You wish your index fund had such a low overhead!)

    The advantage of chunking allocations together diminishes as the size of the allocations increases. By the time you reach 512KB, the heap is not really buying you much savings, since you're already buying in bulk. In fact, the switchover to direct VirtualAlloc is an admission by the heap manager that the allocation size has become so large that it is starting to make the overall heap degrade. If the 512KB allocation were made in the classic heap, it would probably not start on an exact page boundary; when you went to free the memory, there would be little fragments left over at the start and end which could not be decommitted and which would end up as little committed pages scattered about fragmenting your address space. (Besides, I don't think the heap decommits partial segments anyway.)

    If you know ahead of time that your allocation is large, and you control both the code which allocates the memory and the code which frees the memory, you can switch to VirtualAlloc directly and avoid burdening the heap with very large allocations. (On the other hand, the heap does this cutover automatically, so perhaps you're better off just letting the heap designers decide how big is "too big.")

    Bonus chatter: And no, the heap manager will never ask VirtualAlloc for large pages.

  • The Old New Thing

    Appearing to succeed is a valid form of undefined behavior, but it's still undefined


    A customer requested a clarification on the MSDN documentation for the HeapFree function.

    The MSDN documentation says that if the lpMem parameter is NULL, then the behavior is undefined. Is this true?

    As explicitly stated in MSDN, the behavior is undefined. Observe that the annotation on the lpMem parameter is __in, which means that the parameter must be a non-NULL value provided by the caller. (If NULL were permitted, the annotation would have been __in_opt.)

    Undefined behavior means that anything can happen. The program might crash immediately. It might crash five minutes later. It might send email to your boss saying that you screwed up and then read you Vogon poetry. Or maybe not.

    MSDN says don't do it, so don't do it.

    The customer explained why they were interested in knowing more information about undefined behavior:

    We were interested because there is a mismatch between the semantics of a function we are implementing (where NULL is valid and ignored) and the function HeapFree we are using as the implementation. It looks like Windows Vista returns TRUE if you pass NULL.

    If there is a mismatch in semantics between the function you are implementing and the function you are calling, it is your responsibility as the programmer to bridge the gap. The customer didn't say what function they were implementing, but I'm guessing it was something like operator delete. Since your function accepts NULL but HeapFree doesn't, it is your responsibility to filter out NULL parameters.

    void operator delete(void* ptr) throw()
    {
     if (ptr != NULL)
      HeapFree(CustomHeap, 0, ptr);
    }

    This concept goes by the fancy name of the Adapter Pattern. The less fancy name is wrapper function.

    And the value returned by HeapFree on Windows Vista is irrelevant. Pretending to succeed is a valid form of undefined behavior, because anything qualifies as undefined behavior.

    (Of course, you can't assume that returning TRUE will always be the result of triggering undefined behavior. After all, if you could rely on it, then it wouldn't be undefined any more!)

  • The Old New Thing

    Does this operation work when impersonating? The default answer is NO


    I'll often see a customer ask for assistance with a scenario like this: "We're having trouble doing X. We're doing X1, X2, and X3, but it looks like we're getting the wrong answer back."

    The next step in the conversation goes something like "There must be something else going on, because X1, X2 and X3 is the correct way of doing X. To demonstrate, I've written the following sample program that illustrates doing X by the X1, X2, X3 technique. When I run it, I get the correct answer. What do you get when you run it?"

    "When we run your program we get the correct answer, but it doesn't work when we do it from our program." And then, as if by afterthought, "Could the problem be that we're impersonating?"

    Ohhhhhh, you're impersonating. Thanks for not mentioning that.

    By default, nothing works when impersonating. Impersonation requires end-to-end awareness. A function might create a worker thread—the worker thread runs with the identity of the process. A function might use a function like QueueUserWorkItem—by default, the work item runs with the identity of the process. (You have to pass WT_TRANSFER_IMPERSONATION if you want the work item to respect impersonation.) A function might send a message to another window—that window will do its work under its own security token, not the token of the sender. A function might invoke a method on a remote COM object—that object will run under its own security token, not the token of the invoker. (COM requires you to call CoSetProxyBlanket to enable impersonation transfer during marshaling, and the server needs to call CoImpersonateClient. For some reason, this is called cloaking.) The registry keys HKEY_CURRENT_USER and HKEY_CLASSES_ROOT don't work when you're impersonating. (You have to use RegOpenCurrentUser or RegOpenUserClassesRoot.) Functions like SHGetKnownFolderPath have a token parameter which is used when impersonating; if you pass NULL, then it assumes you aren't impersonating.

    The requirements go beyond just code that runs during the execution of the function in question. If you have a function which caches information across calls, the cache needs to be made impersonation-aware so that a value calculated when called while impersonating user X isn't mistakenly used while impersonating user Y.

    In order for impersonation to work, every function all the way down the chain needs to be impersonation-safe. Sure, you might be careful to call QueueUserWorkItem with the WT_TRANSFER_IMPERSONATION flag, and your work item is careful to call SetProxyBlanket on its COM objects, and your COM server is careful to call CoImpersonateClient when servicing the call, but if your COM server then calls a helper object which calls SHGetKnownFolderPath and passes NULL for the impersonation token, then all your careful work has been for naught.

    This is another special case of When you create an object with constraints, you have to make sure everybody who uses the object understands those constraints.

    The Programming Golden Rule can be applied here as well: When you write your own code, do you do this? Since most people who write code do not think about impersonation (indeed, the operating system even encourages not-impersonation-safe coding when it provides conveniences like HKEY_CURRENT_USER) the default answer to "Does this work when I'm impersonating" is "No."

  • The Old New Thing

    Ah, the exciting world of cross-forest dogfood


    The Windows group has its own domain (known as NTDEV for historical reasons) which operates separately from the domain forest operated by the Microsoft IT department. Various trust relationships need to be set up between them so that people on the Windows team can connect to resources managed by the Microsoft IT department and vice versa, but it generally works out okay.

    There are some glitches, but that's the price of dogfood. What better way to make sure that Windows works well in a cross-forest environment than by putting the entire Windows division in its own domain separate from the rest of the company?

    Subject: Newbie domain join question

    My user account is in the REDMOND domain. Should I join my machine to the REDMOND domain or the NTDEV domain?

    I was much amused by this cynical recommendation:

    Subject: RE: Newbie domain join question

    It's pretty simple, really.

    If you want to have trouble accessing NTDEV resources, then join the REDMOND domain.

    If you want to have trouble accessing REDMOND resources, then join the NTDEV domain.

  • The Old New Thing

    Sending a window a WM_DESTROY message is like prank calling somebody pretending to be the police


    A customer was trying to track down a memory leak in their program. Their leak tracking tool produced the stacks which allocated memory that was never freed, and they all seemed to come from uxtheme.dll, which is a DLL that comes with Windows. The customer naturally contacted Microsoft to report what appeared to be a memory leak in Windows.

    I was one of the people who investigated this case, and the customer was able to narrow down the scenario which was triggering the leak. Eventually, I tracked it down. First, here's the thread that caused the leak:

    DWORD CALLBACK ThreadProc(void *lpParameter)
    {
     // This CreateWindow caused uxtheme to allocate some memory
     HWND hwnd = CreateWindow(...);
     MSG msg;
     while (GetMessage(&msg, NULL, 0, 0)) {
      TranslateMessage(&msg);
      DispatchMessage(&msg);
     }
     return 0;
    }

    This thread creates an invisible window whose job is to do something until it is destroyed, at which point the thread is no longer needed. The window procedure for the window looks like this:

    LRESULT CALLBACK WndProc(HWND hwnd, UINT uMsg,
                             WPARAM wParam, LPARAM lParam)
    {
     switch (uMsg) {
     ... business logic deleted ...
     case WM_DESTROY:
      // (the window also unregisters itself here; details elided)
      PostQuitMessage(0);
      break;
     }
     return DefWindowProc(hwnd, uMsg, wParam, lParam);
    }

    Since this is the main window on the thread, its destruction posts a quit message to signal the message loop to exit.

    There's nothing obviously wrong here that would cause uxtheme to leak memory. And yet it does. The memory is allocated when the window is created, and it's supposed to be freed when the window is destroyed. And the only time we exit the message loop is when the window is destroyed. So how is it that this thread manages to exit without destroying the window?

    The key is how the program signals this window that it should go away.

    void MakeWorkerGoAway()
    {
     // Find the worker window if it is registered
     HWND hwnd = GetWorkerWindow();
     // If we have one, destroy it
     if (hwnd) {
      // DestroyWindow doesn't work for windows that belong
      // to other threads.
      // DestroyWindow(hwnd);
      SendMessage(hwnd, WM_DESTROY, 0, 0);
     }
    }

    The authors of this code first tried destroying the window with DestroyWindow but ran into the problem that you cannot destroy a window that belongs to a different thread. "But aha, since the DestroyWindow function sends the WM_DESTROY message, we can just cut out the middle man and send the message directly."

    Well, yeah, you can do that, but that doesn't actually destroy the window. It just pretends to destroy the window by prank-calling the window procedure and saying "Ahem, um, yeah, this is the, um, window manager? (stifled laughter) And, um, like, we're just calling you to tell you, um, you're being destroyed. (giggle) So, um, you should like pack up your bags and (snort) sell all your furniture! (raucous laughter)"

    The window manager sends the WM_DESTROY message to a window as part of the window destruction process. If you send the message yourself, then you're making the window think that it's being destroyed, even though it isn't. (Because it's DestroyWindow that destroys windows.)

    The victim window procedure goes through its "Oh dear, I'm being destroyed, I guess I'd better clean up my stuff" logic, and in this case, it unregisters the worker window and posts a quit message to the message loop. The message loop picks up the WM_QUIT and exits the thread.

    And that's the memory leak: The thread exited before all its windows were destroyed. That worker window is still there, because it never got DestroyWindow'd. Since the window wasn't actually destroyed, the internal memory used to keep track of the window didn't get freed, and there you have your leak.

    "You just got punk'd!"

    The correct solution is for the MakeWorkerGoAway function to send a message to the worker window to tell it, "Hey, I'd like you to go away. Please call DestroyWindow on yourself." You can invent a private message for this, or you can take advantage of the fact that the default behavior of the WM_CLOSE message is to destroy the window. Since our window procedure doesn't override WM_CLOSE, the message will fall through to DefWindowProc which will convert the WM_CLOSE into a DestroyWindow.
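
    Concretely, the fixed MakeWorkerGoAway would look something like this (a Win32 sketch, reusing the hypothetical GetWorkerWindow helper from above):

```cpp
void MakeWorkerGoAway()
{
 // Find the worker window if it is registered
 HWND hwnd = GetWorkerWindow();
 if (hwnd) {
  // Ask the window to destroy itself, on its own thread.
  // Our window procedure doesn't handle WM_CLOSE, so the message
  // falls through to DefWindowProc, which calls DestroyWindow.
  SendMessage(hwnd, WM_CLOSE, 0, 0);
 }
}
```

    Because the request now travels as a message, the actual DestroyWindow call happens on the thread that owns the window, which is exactly the constraint that DestroyWindow enforces.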

    Now that you understand the difference between destroying a window and prank-calling a window to tell it that it is being destroyed, you might be able to help Arno with his problem.

  • The Old New Thing

    Why does my asynchronous I/O complete synchronously?


    A customer was creating a large file and found that, even though the file was opened with FILE_FLAG_OVERLAPPED and the WriteFile call was being made with an OVERLAPPED structure, the I/O was nevertheless completing synchronously.

    Knowledge Base article 156932 covers some cases in which asynchronous I/O will be converted to synchronous I/O. And in this case, it was scenario number three in that document.

    The reason the customer's asynchronous writes were completing synchronously is that all of the writes were to the end of the file. It so happens that in the current implementation of NTFS, writes which extend the length of the file always complete synchronously. (More specifically, writes which extend the valid data length are forced synchronous.)

    We saw last time that merely calling SetEndOfFile to pre-extend the file to the final size doesn't help, because that updates the file size but not the valid data length. To avoid synchronous behavior, you need to make sure your writes do not extend the valid data length. The suggestions provided in yesterday's article apply here as well.

  • The Old New Thing

    Why does my single-byte write take forever?


    A customer found that a single-byte write was taking several seconds, even though the write was to a file on the local hard drive that was fully spun-up. Here's the pseudocode:

    // Create a new file - returns quickly
    hFile = CreateFile(..., CREATE_NEW, ...);
    // Make the file 1GB - returns quickly
    SetFilePointer(hFile, 1024*1024*1024, NULL, FILE_BEGIN);
    SetEndOfFile(hFile);
    // Write 1 byte into the middle of the file
    SetFilePointer(hFile, 512*1024*1024, NULL, FILE_BEGIN);
    BYTE b = 42;
    DWORD nBytesWritten;
    // This write call takes several seconds!
    WriteFile(hFile, &b, 1, &nBytesWritten, NULL);

    The customer experimented with using asynchronous I/O, but it didn't help. The write still took a long time. Even using FILE_FLAG_NO_BUFFERING (and writing full sectors, naturally) didn't help.

    The reason is that on NTFS, extending a file reserves disk space but does not zero out the data. Instead, NTFS keeps track of the "last byte written", technically known as the valid data length, and only zeroes out up to that point. The data past the valid data length are logically zero but are not physically zero on disk. When you write to a point past the current valid data length, all the bytes between the valid data length and the start of your write need to be zeroed out before the new valid data length can be set to the end of your write operation. (You can manipulate the valid data length directly with the SetFileValidData function, but be very careful since it comes with serious security implications.)

    Two solutions were proposed to the customer.

    Option 1 is to force the file to be zeroed out immediately after setting the end of file by writing a zero byte to the end. This front-loads the cost so that it doesn't get imposed on subsequent writes at seemingly random points.

    Option 2 is to make the file sparse. Mark the file as sparse with the FSCTL_SET_SPARSE control code, and immediately after setting the end of file, use the FSCTL_SET_ZERO_DATA control code to make the entire file sparse. This logically fills the file with zeroes without committing physical disk space. Anywhere you actually write gets converted from "sparse" to "real". This does open the possibility that a later write into the middle of the file will encounter a disk-full error, so it's not a "just do this and you won't have to worry about anything" solution, and depending on how randomly you convert the file from "sparse" to "real", the file may end up more fragmented than it would have been if you had "kept it real" the whole time.

  • The Old New Thing

    Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?


    If you look at the disassembly of functions inside Windows DLLs, you'll find that they begin with the seemingly pointless instruction MOV EDI, EDI. This instruction copies a register to itself and updates no flags; it is completely meaningless. So why is it there?

    It's a hot-patch point.

    The MOV EDI, EDI instruction is a two-byte NOP, which is just enough space to patch in a jump instruction so that the function can be updated on the fly. The intention is that the MOV EDI, EDI instruction will be replaced with a two-byte JMP $-5 instruction to redirect control to five bytes of patch space that comes immediately before the start of the function. Five bytes is enough for a full jump instruction, which can send control to the replacement function installed somewhere else in the address space.

    Although the five bytes of patch space before the start of the function consists of five one-byte NOP instructions, the function entry point uses a single two-byte NOP.

    Why not use Detours to hot-patch the function, then you don't need any patch space at all.

    The problem with Detouring a function during live execution is that you can never be sure that at the moment you are patching in the Detour, another thread isn't in the middle of executing an instruction that overlaps the first five bytes of the function. (And you have to alter the code generation so that no instruction starting at offsets 1 through 4 of the function is ever the target of a jump.) You could work around this by suspending all the threads while you're patching, but that still won't stop somebody from doing a CreateRemoteThread after you thought you had successfully suspended all the threads.

    Why not just use two NOP instructions at the entry point?

    Well, because a NOP instruction consumes one clock cycle and one pipe, so two of them would consume two clock cycles and two pipes. (The instructions will likely be paired, one in each pipe, so the combined execution will take one clock cycle.) On the other hand, the MOV EDI, EDI instruction consumes one clock cycle and one pipe. (In practice, the instruction will occupy one pipe, leaving the other available to execute another instruction in parallel. You might say that the instruction executes in half a cycle.) However you calculate it, the MOV EDI, EDI instruction executes in half the time of two NOP instructions.

    On the other hand, the five NOPs inserted before the start of the function are never executed, so it doesn't matter what you use to pad them. It could've been five garbage bytes for all anybody cares.

    But much more important than cycle-counting is that the use of a two-byte NOP avoids the Detours problem: If the code had used two single-byte NOP instructions, then there is the risk that you will install your patch just as a thread has finished executing the first single-byte NOP and is about to begin executing the second single-byte NOP, resulting in the thread treating the second half of your JMP $-5 as the start of a new instruction.

    There's a lot of patching machinery going on that most people don't even realize. Maybe at some point, I'll get around to writing about how the operating system manages patches for software that isn't installed yet, so that when you do install the software, the patch is already there, thereby closing the vulnerability window between installing the software and downloading the patches.

  • The Old New Thing

    Random notes from //build/ 2011


    Here are some random notes from //build/ 2011, information of no consequence whatsoever.

    • A game we played while walking to and from the convention center was spot the geek. "Hey, there's a guy walking down the street. He's wearing a collared shirt and khakis, with a black bag over his shoulder, staring into his phone. I call geek."
    • One of the stores on Harbor Boulevard has the direct-and-to-the-point name Brand Name Mart, or as it was known at night (due to burnt-out lights) Bra d N    Mart.
    • In the room where the prototype devices were being handed out to attendees, the boxes were stacked in groups. Each group consisted of 512 devices. Why 512? Because the boxes were stacked 8 across, 8 high, and 8 wide. Somebody was being way too cute.
    • Nearly all the machines were handed out in the first 55 minutes of availability. During that time, they were distributed at a rate of one machine per second. Kudos to the event staff for managing the enormous onslaught! Also, kudos to my colleagues who flew down a week early for the thankless task of preparing 5,000 computers to be handed out!
    • In the way, way back of the Expo room were a bunch of makeshift private meeting rooms for the various vendors. As you can see from the picture, it was a depressing hallway of what looked like sensory deprivation chambers or interrogation rooms from 1984. All that was missing was the screaming. Upon seeing the photo, one of my friends remarked, "Mental institutions look more cheerful than this," and she should know: She's a professional nurse.
    • The setup at our demo booth consisted of a table with a touch monitor, with the image duplicated onto a wall-mounted display for better visibility. More than once, somebody would walk up to the wall-mounted display and try touching it. The game I played was to surreptitiously manipulate the touch monitor to match what the person was doing on the wall-mounted display, and see how long before they figure out that somebody was messing with them. (It didn't take long.)
    • Two of my colleagues played an even more elaborate trick. One of them stood about ten feet from the wall-mounted display and waved his arms as if he were using a Kinect. The other matched his colleague's actions on the touch monitor. So if you see a media report about seeing a Kinect-enabled Windows 8 machine at the //build/ conference, you'll know that they were pranked.
    • John Sheehan stopped by our booth, and around his neck were so many access passes he could've played solitaire. Security was tight, as you might expect, and any time he needed to go backstage, the security guard would ask to see his pass. "I'd just hold up all of them, saying 'Go ahead, pick one. Whatever pass you're looking for, it's in here somewhere.'"
    • One of my colleagues stopped by our booth, and I made some remark about the backstage passes around her neck. She replied, "You so don't want a backstage pass. Because if you have one, it means that you will be working on three hours' sleep for days on end."
    • Instead of "Hello, world," I think Aleš should have acknowledged that the programming landscape has changed, and the standard first application to write for a new platform is now a fart app. Wouldn't that have been an awesome app to have written on stage at the keynote?
    • You may have noticed that everybody was wearing a white or green T-shirt under their //build/ conference uniform. When we arrived, each staff member was issued two uniform shirts, plus four undershirts. And for people who didn't understand what that meant, there were instructions to wear a different undershirt each day. (The engineer would optimize the solution to two uniform shirts and only two undershirts, with instructions to wear undershirts the first two days and skip them on the last two days.)
    • Ever since PDC sessions started being put online, attending sessions has tended to take a back seat to business networking as a primary goal for coming to the conference, since you can always catch up on sessions later. As a result, the Expo floor tended to remain busy even when breakout sessions were taking place. Also, the last day of the conference tended to be a bit dead, with a lot of people leaving early, and the remaining people just taking it easy. But this year was different: People actually went to the breakout sessions! And despite being held on the final day of the conference, Matt Merry's session was not only well-attended, it overflowed.
