• The Old New Thing

    The 2010 Niney Award nominees have been announced

    • 18 Comments

    The nominees for the first (annual?) Niney Awards have been announced. The Nineys are an award which recognizes those who have had the greatest impact on the technical/developer community over the past year. Winners are selected by you, the technical/developer community.

    The winners will be announced at the MIX11 conference in April. But before they can announce winners, they need to collect votes.

    That's where you come in.

    Cast your vote online in the following categories:

    • Favorite Channel 9 Show (video) of 2010
    • Favorite Channel 9 Series (video) of 2010
    • Favorite Community Show (video) of 2010
    • Favorite Twitter User of 2010
    • Most Helpful Niner of 2010
    • Channel 9 Video of the Year for 2010
    • Favorite New Microsoft Technology or Product for 2010
    • Favorite CodePlex Project of 2010
    • Favorite Microsoft Blogger of 2010
    • Favorite Blog About Microsoft of 2010
    • Favorite Audio Podcast of 2010
    • Favorite Web Site Design of 2010

    Now, it so happens that among the nominees is an author of a somewhat unsuccessful book on programming (not to be confused with an author of several successful books on programming and a to-be-published novel), but please don't let that distract you from voting for whoever you feel best deserves to win.

  • The Old New Thing

    Ready... cancel... wait for it! (part 3)

    • 16 Comments

    A customer reported that their application was crashing in RPC, and they submitted a sample program which illustrated the same crash as their program. Their sample program was actually based on the AsyncRPC sample client program, which was nice, because it provided a mutually-known starting point. They made quite a few changes to the program, but this is the important one:

    // old code:
    // status = RpcAsyncCancelCall(&Async, FALSE);

    // new code:
    status = RpcAsyncCancelCall(&Async, TRUE);
    

    (It was actually more complicated than this, but this is the short version.)

    The program was crashing for the same reason that Wednesday's I/O cancellation program was crashing: The program issued an asynchronous cancel and didn't wait for the cancel to complete. In this case, the crash occurred when the RPC call finally completed and RPC went about cleaning up the call based on the information in the now-freed RPC_ASYNC_STATE structure.

    The error was probably caused by the not-very-helpful name for that last parameter to Rpc­Async­Cancel­Call: fAbort­Call, and the accompanying documentation which says, "In an abortive cancel (fAbort­Call is TRUE), the Rpc­Async­Cancel­Call function sends a cancel notification to the server and client side and the asynchronous call is canceled immediately, not waiting for a response from the server." Compare this to a nonabortive cancel, where "the Rpc­Async­Cancel­Call function notifies the server of the cancel and the client waits for the server to complete the call."

    Obviously, it's faster if you don't wait for the server to respond, right? Let's pass TRUE, so that the function cancels the asynchronous call immediately without waiting for the server. Wow, look at how fast our program runs now!

    Unfortunately, the documentation doesn't make it sufficiently clear that when you issue a cancellation, you still have to wait for the operation to complete before you can clean up all the resources associated with that operation. Another way of looking at that last parameter is to think of it as fAsync. If you pass fAsync = TRUE, then the Rpc­Async­Cancel­Call function issues the cancellation and returns before the operation completes. If you pass fAsync = FALSE, then the Rpc­Async­Cancel­Call function issues the cancellation and waits for the operation to complete before returning.

    If you switch from a synchronous cancel to an asynchronous cancel, then you become responsible for keeping the RPC_ASYNC_STATE valid until the cancellation completes. In this case, the customer was using the Rpc­Notification­Type­Event notification type, which means that they need to wait for the Async.u.hEvent to become signaled before they can free the RPC_ASYNC_STATE.
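
    A minimal sketch of the corrected cleanup sequence (assuming the Rpc­Notification­Type­Event notification type, with error handling abbreviated and the function name illustrative) might look like this:

```c
#include <windows.h>
#include <rpcasync.h>

// Issue an abortive (asynchronous) cancel, then wait for the call to
// complete before tearing down the RPC_ASYNC_STATE.
void CancelAndCleanUp(RPC_ASYNC_STATE *pAsync)
{
    // Issue the cancellation without waiting for the server.
    RpcAsyncCancelCall(pAsync, TRUE);

    // The RPC_ASYNC_STATE must remain valid until the call completes,
    // which is signaled here via the notification event.
    WaitForSingleObject(pAsync->u.hEvent, INFINITE);

    // Now it is safe to close the books on the call and free everything.
    int reply;
    RpcAsyncCompleteCall(pAsync, &reply);
    CloseHandle(pAsync->u.hEvent);
    // ... free the RPC_ASYNC_STATE allocation itself, if heap-allocated ...
}
```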

    The customer confirmed the fix and closed the support case. Another problem solved.

    Three months later, the customer reopened the case, reporting that after they released a new version of their program with the aforementioned fix, they were nevertheless getting WinQual crashes which looked exactly like the ones that they were having before they applied the fix. It appears that the fix wasn't working.

    Upon closer investigation, it turns out that the customer originally did apply the fix as recommended: They added a Wait­For­Single­Object(Async.u.hEvent, INFINITE) call before destroying the Async object to ensure that the cancellation was complete. However, they became frustrated that sometimes the cancellation would take a long time to complete, so they changed it to

    WaitForSingleObject(Async.u.hEvent, 5000); // wait up to 5 seconds
    

    The customer explained, "After the wait fails due to timeout, we just proceed as normal and call Rpc­Async­Complete­Call and free the RPC_ASYNC_STATE. Is that wrong?"

    Um, yeah. Changing the Wait­For­Single­Object from an infinite wait to one with a timeout means that you just reintroduced the bug that the Wait­For­Single­Object was originally supposed to fix! If the cancellation takes more than 5 seconds, then your code will continue and free the RPC_ASYNC_STATE, just like it did when you didn't wait at all.

    "How long can I wait before assuming that the event will simply never get signaled?"

    There is no such duration after which you can safely abandon the operation. Even if the event doesn't get signaled for 30 minutes (say because the computer is thrashing its guts out), it may get signaled at 30 minutes and 1 second.

    "But we don't want our program to get stuck waiting for the server."

    Great. It's fine to have your program continue running after issuing the cancellation, even if the RPC call hasn't completed. Just don't free the RPC_ASYNC_STATE until the call is complete. And if you set things up so that your completion notification takes the form of a callback, you can just have the callback free the RPC_ASYNC_STATE. Then you don't have to keep track of the asynchronous call any more; the system will simply call you when it's finished, and then you can free the state structure.
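
    As a hedged sketch of that callback arrangement (the helper names and heap-allocation scheme here are illustrative, not from the original report):

```c
#include <windows.h>
#include <rpcasync.h>
#include <stdlib.h>

// Completion callback: runs when the RPC operation completes, whether it
// succeeded, failed, or was cancelled. Its final act is to free the state.
void RPC_ENTRY CallCompleted(RPC_ASYNC_STATE *pAsync,
                             void *Context, RPC_ASYNC_EVENT Event)
{
    if (Event == RpcCallComplete) {
        int reply;
        RpcAsyncCompleteCall(pAsync, &reply); // close the books on the call
        free(pAsync);                         // heap-allocated; now safe
    }
}

RPC_ASYNC_STATE *BeginCall(void)
{
    RPC_ASYNC_STATE *pAsync = malloc(sizeof(*pAsync));
    if (!pAsync) return NULL;
    RpcAsyncInitializeHandle(pAsync, sizeof(*pAsync));
    pAsync->NotificationType = RpcNotificationTypeCallback;
    pAsync->u.NotificationRoutine = CallCompleted;
    // ... issue the MIDL-generated async call with pAsync ...
    return pAsync;
}
```

    With this arrangement, an abortive Rpc­Async­Cancel­Call merely accelerates the completion; the callback still runs exactly once, and the state structure is freed only then.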

    Bonus RPC chatter: (For the purpose of this discussion, I'll use the term RPC operation instead of RPC call so we don't have confusion between function calls and RPC calls.) A colleague explained the lifetime of an RPC operation as follows:

    • Submit phase: You call into the MIDL-generated stub, the stub does magic RPC stuff, and then the stub returns control back to the caller. You cannot call Rpc­Async­Cancel­Call during the submit phase.
    • Pending phase: RPC is waiting for the response to the operation. The operation remains in this phase until the operation completes or is cancelled. You can call Rpc­Async­Cancel­Call to cancel the RPC operation and accelerate the transition to the Notified phase.
    • Notified phase: RPC informs the application of the result of the operation in a manner described by the Notification­Type and RPC_ASYNC_NOTIFICATION_INFO members of the RPC_ASYNC_STATE structure. You can call Rpc­Async­Cancel­Call, but it will have no effect since the operation is already complete.
    • Completion phase: The application calls the Rpc­Async­Complete­Call function to clean up the resources used to track the RPC operation. You exit the completion phase when Rpc­Async­Complete­Call returns something other than RPC_S_ASYNC_CALL_PENDING. You cannot call Rpc­Async­Cancel­Call after Rpc­Async­Complete­Call indicates that the operation is complete, since that is the call that says "I'm all done!"
  • The Old New Thing

    I am no longer impressed by your fancy new 10,000 BTU hot pot burner

    • 31 Comments

    Two years ago, we had a gathering at my house for some friends for hot pot, the traditional way of ringing in the lunar new year (which takes place today). It was actually a bit of a cross-cultural event, since the attendees came from different regions of Asia, where different traditions reign. (And the American guests just had to choose sides!)

    My house has but one portable stove for hot pot, so one of the guests brought her own unit, a unit as it turns out which was purchased specifically for the occasion, which gleamed in the light and proudly proclaimed 10,000 BTU of raw heating power. This was cause for much boasting, particularly since I didn't know the heating power of my own puny old unit, but I accepted my second-place position with grace.

    Some time later, we had a quiet family hot pot, and my old and horrifically unfashionable burner was brought out to do its tired but important job, and it was then that I found the sticker that specified its heating power.

    9,925 BTU.

    Now I am no longer impressed by my friend's 10,000 BTU burner.

  • The Old New Thing

    Ready... cancel... wait for it! (part 2)

    • 9 Comments

    A customer had a question about I/O cancellation. They have a pending Read­File­Ex call with a completion procedure. They then cancel the I/O with Cancel­Io­Ex and wait for the completion by passing TRUE as the bWait parameter to Get­Overlapped­Result.

    Assuming both return success, can I assume that my completion procedure will not be called after GetOverlappedResult returns? It appears that GetOverlappedResult waits non-alertably for the I/O to complete, so I'm assuming it just eats the APC if there was one. But if an APC had been posted just before I called CancelIoEx, will it also cancel that APC?

    Get­Overlapped­Result does not magically revoke completion callbacks. Why should it?

    Recall that completion is not the same as success. Completion means that the I/O subsystem has closed the books on the I/O operation. The underlying operation may have completed successfully or it may have failed (and cancellation is just one of the many possible reasons for failure). Either way, the completion procedure signed up to be notified when the I/O completes, and therefore it will be called to be informed of the completion due to cancellation.

    Besides, as the customer noted, there is a race condition if the Cancel­Io­Ex call is made just after the I/O completed, in which case it didn't get cancelled after all.

    This answers our question from last time, namely, how our fix for the cancellation code was incomplete. If the I/O had been issued with a completion routine (or equivalently, if it had been issued against an I/O completion port), then the code frees the OVERLAPPED structure before the completion routine runs. The kernel doesn't care that you did that (the kernel is finished with the OVERLAPPED structure), but your completion routine is probably not going to be happy that it was given a pointer to freed memory as its lpOverlapped parameter.

    You have to delay freeing the OVERLAPPED structure until the completion routine executes. Typically, this is done by allocating the OVERLAPPED structure on the heap rather than the stack, and making it the completion routine's responsibility to free the memory as its final act.
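
    A sketch of that pattern, with illustrative names and minimal error handling, might look like this:

```c
#include <windows.h>
#include <stdlib.h>

// Bundle the OVERLAPPED and its buffer into one heap allocation so they
// live until the completion routine runs.
typedef struct READCONTEXT {
    OVERLAPPED o;       // first member, so lpOverlapped maps back to the context
    BYTE buffer[1024];
} READCONTEXT;

void CALLBACK OnReadComplete(DWORD dwError, DWORD dwBytes,
                             LPOVERLAPPED lpOverlapped)
{
    READCONTEXT *ctx = (READCONTEXT *)lpOverlapped;
    if (dwError == ERROR_SUCCESS) {
        // ... use ctx->buffer, dwBytes bytes of it ...
    }
    free(ctx); // final act: release the OVERLAPPED and buffer together
}

BOOL StartRead(HANDLE h)
{
    READCONTEXT *ctx = calloc(1, sizeof(*ctx));
    if (!ctx) return FALSE;
    if (!ReadFileEx(h, ctx->buffer, sizeof(ctx->buffer),
                    &ctx->o, OnReadComplete)) {
        free(ctx); // the I/O never started, so we still own the memory
        return FALSE;
    }
    return TRUE; // ownership has transferred to the completion routine
}
```

    The design point is ownership transfer: once the I/O is issued, the completion routine owns the allocation, so nobody else may free it, even after a cancellation.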

  • The Old New Thing

    Ready... cancel... wait for it! (part 1)

    • 31 Comments

    One of the cardinal rules of the OVERLAPPED structure is the OVERLAPPED structure must remain valid until the I/O completes. The reason is that the OVERLAPPED structure is manipulated by address rather than by value.

    The word complete here has a specific technical meaning. It doesn't mean "must remain valid until you are no longer interested in the result of the I/O." It means that the structure must remain valid until the I/O subsystem has signaled that the I/O operation is finally over, that there is nothing left to do, it has passed on: You have an ex-I/O operation.

    Note that an I/O operation can complete successfully, or it can complete unsuccessfully. Completion is not the same as success.

    A common mistake when performing overlapped I/O is issuing a cancel and immediately freeing the OVERLAPPED structure. For example:

    // this code is wrong
     HANDLE h = ...; // handle to file opened as FILE_FLAG_OVERLAPPED
     OVERLAPPED o;
     BYTE buffer[1024];
     InitializeOverlapped(&o); // creates the event etc
     if (ReadFile(h, buffer, sizeof(buffer), NULL, &o) ||
         GetLastError() == ERROR_IO_PENDING) {
      if (WaitForSingleObject(o.hEvent, 1000) != WAIT_OBJECT_0) {
       // took longer than 1 second - cancel it and give up
       CancelIo(h);
       return WAIT_TIMEOUT;
      }
      ... use the results ...
     }
     ...
    

    The bug here is that after calling Cancel­Io, the function returns without waiting for the Read­File to complete. Returning from the function implicitly frees the automatic variable o. When the Read­File finally completes, the I/O system is now writing to stack memory that has been freed and is probably being reused by another function. The result is impossible to debug: First of all, it's a race condition between your code and the I/O subsystem, and breaking into the debugger doesn't stop the I/O subsystem. If you step through the code, you don't see the corruption, because the I/O completes while you're broken into the debugger.

    Here's what happens when the program is run outside the debugger:

    • ReadFile: I/O begins.
    • WaitForSingleObject: I/O still in progress.
    • WaitForSingleObject times out.
    • CancelIo: I/O cancellation submitted to the device driver.
    • return.
    • Device driver, which was busy reading from the hard drive, receives the cancellation.
    • Device driver abandons the rest of the read operation and reports that the I/O has been canceled.
    • I/O subsystem writes STATUS_CANCELED to the OVERLAPPED structure.
    • I/O subsystem queues the completion function (if applicable).
    • I/O subsystem signals the completion event (if applicable).
    • I/O operation is now complete.

    When the I/O subsystem receives word from the device driver that the cancellation has completed, it performs the usual operations when an I/O operation completes: It updates the OVERLAPPED structure with the results of the I/O operation, and notifies whoever wanted to be notified that the I/O is finished.

    Notice that when it updates the OVERLAPPED structure, it's updating memory that has already been freed back to the stack, which means that it's corrupting the stack of whatever function happens to be running right now. (It's even worse if you happened to catch it while it was in the process of updating the buffer!) Since the precise timing of I/O is unpredictable, the program crashes with memory corruption that keeps changing each time it happens.

    If you try to debug the program, you get this:

    • ReadFile: I/O begins.
    • WaitForSingleObject: I/O still in progress.
    • WaitForSingleObject times out.
    • Breakpoint hit on the CancelIo statement; stops in the debugger.
    • Hit F10 to step over the CancelIo call: I/O cancellation submitted to the device driver.
    • Breakpoint hit on the return statement; stops in the debugger.
    • Device driver, which was busy reading from the hard drive, receives the cancellation.
    • Device driver abandons the rest of the read operation and reports that the I/O has been canceled.
    • I/O subsystem writes STATUS_CANCELED to the OVERLAPPED structure.
    • I/O subsystem queues the completion function (if applicable).
    • I/O subsystem signals the completion event (if applicable).
    • I/O operation is now complete.
    • Look at the OVERLAPPED structure in the debugger: it says STATUS_CANCELED.
    • Hit F5 to resume execution.
    • No memory corruption.

    Breaking into the debugger changed the timing of the I/O operation relative to program execution. Now, the I/O completes before the function returns, and consequently there is no memory corruption. You look at the OVERLAPPED structure and say, "See? Immediately on return from the Cancel­Io function, the OVERLAPPED structure has been updated with the result, and the buffer contents are not being written to. It's safe to free them both now. Therefore, this can't be the source of my memory corruption bug."

    Except, of course, that it is.

    This is even more crazily insidious because the OVERLAPPED structure and the buffer are updated by the I/O subsystem, which means that it happens from kernel mode. This means that write breakpoints set by your debugger won't fire. Even if you manage to narrow down the corruption to "it happens somewhere in this function", your breakpoints will never see it as it happens. You're going to see that the value was good, then a little while later, the value was bad, and yet your write breakpoint never fired. You're then going to declare that the world has gone mad and seriously consider a different line of work.

    To fix this race condition, you have to delay freeing the OVERLAPPED structure and the associated buffer until the I/O is complete and anything else that's using them has also given up their claim to it.

       // took longer than 1 second - cancel it and give up
       CancelIo(h);
       WaitForSingleObject(o.hEvent, INFINITE); // added
       // Alternatively: GetOverlappedResult(h, &o, TRUE);
       return WAIT_TIMEOUT;
    

    The Wait­For­Single­Object after the Cancel­Io waits for the I/O to complete before finally returning (and implicitly freeing the OVERLAPPED structure and the buffer on the stack). Better would be to use GetOverlapped­Result with bWait = TRUE, because that also handles the case where the hEvent member of the OVERLAPPED structure is NULL.

    Exercise: If you retrieve the completion status after canceling the I/O (either by looking at the OVERLAPPED structure directly or by using GetOverlapped­Result) there's a chance that the overlapped result will be something other than STATUS_CANCELED (or ERROR_CANCELLED if you prefer Win32 error codes). Explain.

    Exercise: If this example had used Read­File­Ex, the proposed fix would be incomplete. Explain and provide a fix. Answer to come next time, and then we'll look at another version of this same principle.

  • The Old New Thing

    There is no longer any pleasure in reading the annual Microsoft injury reports

    • 22 Comments

    Microsoft is required by law to file reports on employees who have sustained injuries on the job. They are also required to post the reports in a location where employees can see them. These reports come out every year on February 1.

    Back in the old days, these reports were filled out by hand, and reading them was oddly amusing for the details. My favorite from the mid 1990's was a report on an employee who was injured on the job, and the description was simply pencil lead embedded in hand.

    Sadly, the reports are now computerized, and there isn't a place to describe the nature of each injury. It's just a bunch of numbers.

    Numbers are nice, but they don't tell a story in quite the same way.

  • The Old New Thing

    Solutions that require a time machine: Making applications which require compatibility behaviors crash so the developers will fix their bug before they ship

    • 76 Comments

    A while ago, I mentioned that there are many applications that rely on WM_PAINT messages being delivered even if there is nothing to paint because they put business logic inside their WM_PAINT handler. As a result, Windows sends them dummy WM_PAINT messages.

    Jerry Pisk opines,

    Thanks to the Windows team going out of their way not to break poorly written applications developers once again have no incentive to clean up their act and actually write working applications. If an application requires a dummy WM_PAINT not to crash it should be made to crash as soon as possible so the developers go in and fix it before releasing their "code".

    In other words, Jerry recommends that Microsoft use the time machine that Microsoft Research has been secretly perfecting for the past few years. (They will sometimes take it out for a spin and fail to cover their tracks.)

    In 1993, Company X writes a program that relies on WM_PAINT messages arriving in a particular order relative to other messages. (And just to make things more interesting, in 1994, Company X goes out of business, or they discontinue the program in question, or the only person who understands the code leaves the company or dies in a plane crash.)

    In 1995, changes to Windows alter the order of messages, and in particular, WM_PAINT messages are no longer sent under certain circumstances. I suspect that the reason for this is the introduction of the taskbar. Before the taskbar, minimized windows appeared as icons on your desktop and therefore received WM_PAINT messages while minimized. But now that applications minimize to the taskbar, minimized windows are sent off screen and never actually paint. The taskbar button does the job of representing the program on the screen.

    Okay, now let's put Jerry in charge of solving this compatibility problem. He recommends that instead of sending a dummy WM_PAINT message to these programs to keep them happy, these programs should instead be made to crash as soon as possible, so that the developers can go in and fix the problem before they release the program.

    In other words, he wants to take the Microsoft Research time machine back to 1993 with a beta copy of Windows 95 and give it to the programmers at Company X and tell them, "Your program crashes on this future version of Windows that doesn't exist yet in your time. Fix the problem before you release your code. (Oh, and by the way, the Blue Jays are going to repeat.)"

    Or maybe I misunderstood his recommendation.

  • The Old New Thing

    Some remarks on VirtualAlloc and MEM_LARGE_PAGES

    • 41 Comments

    If you try to run the sample program demonstrating how to create a file mapping using large pages, you'll probably run into the error ERROR_NOT_ALL_ASSIGNED (Not all privileges or groups referenced are assigned to the caller) when calling Adjust­Token­Privileges. What is going on?

    The Adjust­Token­Privileges function enables privileges that you already have (but which are masked). Sort of like how a super hero can't use super powers while disguised as a normal mild-mannered citizen. In order to enable the Se­Lock­Memory­Privilege privilege, you must already have it. But where do you get it?

    You do this by using the group policy editor. The list of privileges says that the Se­Lock­Memory­Privilege corresponds to "Lock pages in memory".

    Why does allocating very large pages require permission to lock pages in memory?

    Because very large pages are not pageable. This is not an inherent limitation of large pages; the processor is happy to page them in or out, but you have to do it all or nothing. In practice, you don't want a single page-out or page-in operation to consume 4MB or 16MB of disk I/O; that's a thousand times more I/O than your average paging operation. And in practice, the programs which use these large pages are "You paid $40,000 for a monster server whose sole purpose is running my one application and nothing else" type applications, like SQL Server. Those applications don't want this memory to be pageable anyway, so adding code to allow them to be pageable is not only a bunch of work, but it's a bunch of work to add something nobody who uses the feature actually wants.

    What's more, allocating very large pages can be time-consuming. All the physical pages which are involved in a very large page must be contiguous (and must be aligned on a large page boundary). Prior to Windows XP, allocating a very large page could take 15 seconds or more if your physical memory was fragmented. (And even machines with as much as 2GB of memory will probably have highly fragmented physical memory once they've been running for a little while.) Internally, allocating the physical pages for a very large page is performed by the kernel function which allocates physically contiguous memory, which is something device drivers need to do quite often for I/O transfer buffers. Some drivers behave "highly unfavorably" if their request for contiguous memory fails, so the operating system tries very hard to scrounge up the memory, even if it means shuffling megabytes of memory around and performing a lot of disk I/O to get it. (It's essentially performing a time-critical defragmentation.)

    If you followed the discussion so far, you'll see another reason why large pages aren't paged out: When they need to be paged back in, the system may not be able to find a suitable chunk of contiguous physical memory!

    In Windows Vista, the memory manager folks recognized that these long delays made very large pages less attractive for applications, so they changed the behavior so requests for very large pages from applications went through the "easy parts" of looking for contiguous physical memory, but gave up before the memory manager went into desperation mode, preferring instead just to fail. (In Windows Vista SP1, this part of the memory manager was rewritten so the really expensive stuff is never needed at all.)

    Note that the MEM_LARGE_PAGES flag triggers an exception to the general principle that MEM_RESERVE only reserves address space, MEM_COMMIT makes the memory manager guarantee that physical pages will be there when you need them, and that the physical pages aren't actually allocated until you access the memory. Since very large pages have special physical memory requirements, the physical allocation is done up front so that the memory manager knows that when it comes time to produce the memory on demand, it can actually do so.

  • The Old New Thing

    How do you obtain the icon for a shortcut without the shortcut overlay?

    • 12 Comments

    The easy one-stop-shopping way to get the icon for a file is to use the SHGet­File­Info function with the SHGFI_ICON flag. One quirk of the SHGet­File­Info function is that if you pass the path to a shortcut file, it will always place the shortcut overlay on the icon, regardless of whether you passed the SHGFI_ADD­OVERLAYS flag. (Exercise: What is so special about the shortcut overlay that makes it exempt from the powers of the SHGFI_ADD­OVERLAYS flag? The information you need is on the MSDN page for SHGet­File­Info, though you'll have to apply some logic to the situation.)

    I'm using SHGet­File­Info to get the icon of a file to display in my application. When the file is a shortcut, rather than displaying the exe icon with a link overlay (as in SHGFI_LINK­OVERLAY) I'd like to display the original exe icon. Is there a way to do this with SHGet­File­Info? Thanks,

    First, correcting a minor error in the question: The icon for a shortcut is, by default, the icon for the shortcut target, but it doesn't have to be. The IShell­Link::Set­Icon­Location method lets you set the icon for a shortcut to anything you like. (This is the method used when you click Change Icon on the shortcut property page.)

    Anyway, the SHGet­File­Info function gets the icon first by asking the shell namespace for the icon index in the system imagelist, and then converting that imagelist/icon index into a HICON. If you want to change the conversion, you can just ask SHGet­File­Info to stop halfway and then finish the process the way you like.

    HICON GetIconWithoutShortcutOverlay(PCTSTR pszFile)
    {
     SHFILEINFO sfi;
     HIMAGELIST himl = reinterpret_cast<HIMAGELIST>(
      SHGetFileInfo(pszFile, 0, &sfi, sizeof(sfi),
                    SHGFI_SYSICONINDEX));
     if (himl) {
      return ImageList_GetIcon(himl, sfi.iIcon, ILD_NORMAL);
     } else {
      return NULL;
     }
    }
    

    Of course, if you're going to be doing this for a lot of files, you may want to just stop once you have the imagelist and the index, using Image­List_Draw to draw the image when necessary, instead of creating thousands of little icons.
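
    For example, a sketch of that imagelist-based approach (illustrative helper name, no caching shown):

```c
#include <windows.h>
#include <shellapi.h>
#include <commctrl.h>

// Draw a file's icon directly from the shared system imagelist,
// without manufacturing a per-file HICON.
void DrawFileIcon(HDC hdc, int x, int y, PCTSTR pszFile)
{
    SHFILEINFO sfi;
    HIMAGELIST himl = (HIMAGELIST)
        SHGetFileInfo(pszFile, 0, &sfi, sizeof(sfi), SHGFI_SYSICONINDEX);
    if (himl) {
        // The imagelist draws the image itself; no HICON is created,
        // so there is nothing to DestroyIcon afterward.
        ImageList_Draw(himl, sfi.iIcon, hdc, x, y, ILD_NORMAL);
    }
}
```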

  • The Old New Thing

    Microspeak: Leverage

    • 25 Comments

    At Microsoft, leverage is not a term of physics whereby a force can be magnified by the application of mechanical advantage. It is also not a term of finance whereby the power of a small amount of money can be magnified by the assumption of debt. In fact, at Microsoft, the word leverage isn't even a noun. It is a verb: to leverage, leverages, leveraging, leveraged, has leveraged. Oh, and it has nothing to do with magnification.

    Here are some citations:

    How do I leverage a SiteMap?
    Allow advertising partners to leverage this resource for providing targeted advertising links.
    Leverage existing design to power other designs
    Do we have documents on how Windows 95 can leverage Windows 2000 Active Directory?

    At Microsoft, to leverage means to take advantage of, or in many cases, simply to use.

    Verbal use of the word leverage appears to be popular outside of Microsoft as well, such as this headline from eWeek: How to Leverage IT to Speed R&D Innovation.

    But can you do this? At Microsoft, you can leverage people.

    Alice, does Bob perform any component-Foo testing? Charles can be leveraged to actually execute tests if Donald can drive him with these asks. Let me know.

    That snippet was a whirlwind of Microspeak, with the passive form of the verb to leverage, plus to drive and the plural noun asks.
