February, 2011

  • The Old New Thing

    Microspeak: Recycling bits or recycling electrons

    • 11 Comments

    To recycle bits (or recycle electrons) is to take an old piece of email and use it to answer a similar (often identical) question or discussion on a mailing list. This is usually done by simply replying to the thread with the two-word message "Recycling bits" (or "Recycling electrons") and attaching the original email message.

    An important aspect of the use of this term is that the attached email message definitively answers the question or resolves the discussion. Usually, the attached email message comes from the very same mailing list that is hosting the current discussion. For example, consider this question:

    From: X
    To: Q Discussion
    Subject: How do I do X?
    Date: February 8, 2011

    Blah blah blah blah? Blah blah. Thanks.

    An example reply which earns maximum style points would go something like this:

    From: Y
    To: Q Discussion
    Subject: RE: How do I do X?
    Date: February 8, 2011

    Recycling electrons.

    <Attached message>

    From: S
    To: Q Discussion
    Subject: RE: How do I do X?
    Date: October 31, 2010

    You first need to frob the blurble and then blah blah blah blah.

    ----Original message----

    From: R
    To: Q Discussion
    Subject: How do I do X?
    Date: October 30, 2010

    Blah blah blah blah? Blah blah. Thanks.

    Since the question asked on February 8 is identical to the question asked on October 30, the answer from S can be re-used verbatim. (This of course assumes that nothing has happened in between to invalidate S's original answer.)

    See also: "I refer the honourable gentleman to the answer given some moments ago."

  • The Old New Thing

    The cursor isn't associated with a window or a window class; it's associated with a thread group

    • 21 Comments

    In my earlier discussion of the fact that changing a class property affects all windows of that class, commenters LittleHelper and Norman Diamond wanted to know "Why is the cursor associated with class and not a window?"

    This is another one of those questions that start off with an invalid assumption. The cursor is not associated with a class. The cursor is not associated with a window. The cursor is associated with an input state. (Initially, each thread has its own input state, but functions like Attach­Thread­Input can cause threads to share their input states.)

    As we saw when we explored the process by which the cursor gets set, the cursor-setting process is initiated by the WM_SETCURSOR message, which is percolated up and down the window hierarchy until somebody calls Set­Cursor and returns TRUE to say "Okay, I set the cursor. You can stop searching now." And that cursor remains in effect until somebody else in the same thread group calls Set­Cursor.

    It so happens that the Def­Window­Proc function, when asked to set a cursor, will use the window's class cursor. But that's just the default in the absence of any customization to the contrary. If you want to customize the cursor when it is over a particular window, then use the customization; don't go changing the default. If you change the default, then you affect what happens to all the other windows of the class. Just handle the WM_SETCURSOR message to establish your "per-window cursor". (And you can be even more specific than just per-window. For example, you might decide to show a hand cursor if the user is over a hyperlink but an arrow cursor otherwise.)
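
    The decision a per-window WM_SETCURSOR handler makes can be sketched as a small pure function. This is only a model: the CursorChoice enum and the choose_cursor function are hypothetical names invented for illustration, and the two inputs stand in for the hit-test code carried by the message and for whatever hyperlink test the application performs.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for what a real handler would do next. */
typedef enum {
    CURSOR_USE_DEFAULT, /* do not handle; let DefWindowProc apply its default */
    CURSOR_ARROW,       /* handler would set an arrow cursor */
    CURSOR_HAND         /* handler would set a hand cursor */
} CursorChoice;

/*
 * Decide what a WM_SETCURSOR handler for one particular window should do.
 *   in_client_area: the message's hit-test code says "client area"
 *   over_hyperlink: application-defined test of the mouse position
 */
CursorChoice choose_cursor(bool in_client_area, bool over_hyperlink)
{
    if (!in_client_area) {
        /* Borders, caption, and so on: leave the default behavior alone. */
        return CURSOR_USE_DEFAULT;
    }
    return over_hyperlink ? CURSOR_HAND : CURSOR_ARROW;
}
```

    In a real window procedure, any result other than CURSOR_USE_DEFAULT would turn into a SetCursor call followed by returning TRUE, and CURSOR_USE_DEFAULT would mean passing the message along to DefWindowProc. Nothing here touches the class cursor, so other windows of the class are unaffected.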

    Many of the fields in the WND­CLASS structure are merely defaults which are applied to windows of the class. You can still override them on a per-window basis.

    Field          How to override
    lpfnWndProc    SetWindowLongPtr(GWLP_WNDPROC)
    hIcon          SendMessage(WM_SETICON)
    hCursor        Handle the WM_SETCURSOR message
    hbrBackground  Handle the WM_ERASEBKGND message
    lpszMenuName   SetMenu()

    (This is the same table I wrote up some time ago, but the original table didn't have an entry for the window procedure, so this table is slightly more complete.)

  • The Old New Thing

    The 2010 Niney Award nominees have been announced

    • 18 Comments

    The nominees for the first (annual?) Niney Awards have been announced. The Nineys recognize those who have had the greatest impact on the technical/developer community over the past year. Winners are selected by you, the technical/developer community.

    The winners will be announced at the MIX11 conference in April. But before they can announce winners, they need to collect votes.

    That's where you come in.

    Cast your vote online in the following categories:

    • Favorite Channel 9 Show (video) of 2010
    • Favorite Channel 9 Series (video) of 2010
    • Favorite Community Show (video) of 2010
    • Favorite Twitter User of 2010
    • Most Helpful Niner of 2010
    • Channel 9 Video of the Year for 2010
    • Favorite New Microsoft Technology or Product for 2010
    • Favorite CodePlex Project of 2010
    • Favorite Microsoft Blogger of 2010
    • Favorite Blog About Microsoft of 2010
    • Favorite Audio Podcast of 2010
    • Favorite Web Site Design of 2010

    Now, it so happens that among the nominees is an author of a somewhat unsuccessful book on programming (not to be confused with an author of several successful books on programming and a to-be-published novel) but please don't let that distract you from voting for whoever you feel best deserves to win.

  • The Old New Thing

    Ready... cancel... wait for it! (part 3)

    • 16 Comments

    A customer reported that their application was crashing in RPC, and they submitted a sample program which illustrated the same crash as their program. Their sample program was actually based on the AsyncRPC sample client program, which was nice, because it provided a mutually-known starting point. They made quite a few changes to the program, but this is the important one:

    // old code:
    // status = RpcAsyncCancelCall(&Async, FALSE);
    
    // new code:
     status = RpcAsyncCancelCall(&Async, TRUE);
    

    (It was actually more complicated than this, but this is the short version.)

    The program was crashing for the same reason that Wednesday's I/O cancellation program was crashing: The program issued an asynchronous cancel and didn't wait for the cancel to complete. In this case, the crash occurred when the RPC call finally completed and RPC went about cleaning up the call based on the information in the now-freed RPC_ASYNC_STATE structure.

    The error was probably caused by the not-very-helpful name for that last parameter to Rpc­Async­Cancel­Call: fAbort­Call, and the accompanying documentation which says, "In an abortive cancel (fAbort­Call is TRUE), the Rpc­Async­Cancel­Call function sends a cancel notification to the server and client side and the asynchronous call is canceled immediately, not waiting for a response from the server." Compare this to a nonabortive cancel, where "the Rpc­Async­Cancel­Call function notifies the server of the cancel and the client waits for the server to complete the call."

    Obviously, it's faster if you don't wait for the server to respond, right? Let's pass TRUE, so that the function cancels the asynchronous call immediately without waiting for the server. Wow, look at how fast our program runs now!

    Unfortunately, the documentation doesn't make it sufficiently clear that when you issue a cancellation, you still have to wait for the operation to complete before you can clean up all the resources associated with that operation. Another way of looking at that last parameter is to think of it as fAsync. If you pass fAsync = TRUE, then the Rpc­Async­Cancel­Call function issues the cancellation and returns before the operation completes. If you pass fAsync = FALSE, then the Rpc­Async­Cancel­Call function issues the cancellation and waits for the operation to complete before returning.

    If you switch from a synchronous cancel to an asynchronous cancel, then you become responsible for keeping the RPC_ASYNC_STATE valid until the cancellation completes. In this case, the customer was using the Rpc­Notification­Type­Event notification type, which means that they need to wait for the Async.u.hEvent to become signaled before they can free the RPC_ASYNC_STATE.

    The customer confirmed the fix and closed the support case. Another problem solved.

    Three months later, the customer reopened the case, reporting that after they released a new version of their program with the aforementioned fix, they were nevertheless getting WinQual crashes which looked exactly like the ones that they were having before they applied the fix. It appears that the fix wasn't working.

    Upon closer investigation, it turns out that the customer originally did apply the fix as recommended: They added a Wait­For­Single­Object(Async.u.hEvent, INFINITE) call before destroying the Async object to ensure that the cancellation was complete. However, they became frustrated that sometimes the cancellation would take a long time to complete, so they changed it to

    WaitForSingleObject(Async.u.hEvent, 5000); // wait up to 5 seconds
    

    The customer explained, "After the wait fails due to timeout, we just proceed as normal and call RpcAsyncCompleteCall and free the RPC_ASYNC_STATE. Is that wrong?"

    Um, yeah. Changing the Wait­For­Single­Object from an infinite wait to one with a timeout means that you just reintroduced the bug that the Wait­For­Single­Object was originally supposed to fix! If the cancellation takes more than 5 seconds, then your code will continue and free the RPC_ASYNC_STATE, just like it did when you didn't wait at all.

    "How long can I wait before assuming that the event will simply never get signaled?"

    There is no such duration after which you can safely abandon the operation. Even if the event doesn't get signaled for 30 minutes (say because the computer is thrashing its guts out), it may get signaled at 30 minutes and 1 second.

    "But we don't want our program to get stuck waiting for the server."

    Great. It's fine to have your program continue running after issuing the cancellation, even if the RPC call hasn't completed. Just don't free the RPC_ASYNC_STATE until the call is complete. And if you set things up so that your completion event takes the form of a callback, you can just make the callback free the RPC_ASYNC_STATE. Then you don't have to keep track of the asynchronous call any more; the system will merely call you when it's finished, and then you can free the state structure.
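
    The rule running through this exchange is that a timed-out wait tells you nothing about whether the call has completed. A toy model of that rule, with no real RPC involved, might look like the following. WaitOutcome, CallState and release_if_complete are hypothetical names made up for this sketch, and the released flag stands in for actually freeing the RPC_ASYNC_STATE.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two outcomes of waiting on the event. */
typedef enum { WAIT_SIGNALED, WAIT_TIMED_OUT } WaitOutcome;

/* Stand-in for the memory that stays in use until the call completes. */
typedef struct {
    bool released;
} CallState;

/*
 * Release the state only if the wait reported completion.
 * Returns true if the state is now released, false if the call
 * is still outstanding and the state must stay alive.
 */
bool release_if_complete(CallState *state, WaitOutcome outcome)
{
    if (outcome != WAIT_SIGNALED) {
        /* Timeout: the call may still complete later, so keep the state. */
        return false;
    }
    state->released = true; /* a real program would free the structure here */
    return true;
}
```

    A caller that gets false back is free to carry on with other work; what it may not do is treat the timeout as permission to tear the state down.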

    Bonus RPC chatter: (For the purpose of this discussion, I'll use the term RPC operation instead of RPC call so we don't have confusion between function calls and RPC calls.) A colleague explained the lifetime of an RPC operation as follows:

    Submit phase
        You call into the MIDL-generated stub.
        The stub does magic RPC stuff.
        The stub returns control back to the caller.
        You cannot call RpcAsyncCancelCall during the submit phase.

    Pending phase
        RPC is waiting for the response to the operation. The operation remains in this phase until the operation completes or is cancelled.
        You can call RpcAsyncCancelCall to cancel the RPC operation and accelerate the transition to the Notified phase.

    Notified phase
        RPC informs the application of the result of the operation in a manner described by the NotificationType and RPC_ASYNC_NOTIFICATION_INFO members of the RPC_ASYNC_STATE structure.
        You can call RpcAsyncCancelCall but it will have no effect since the operation is already complete.

    Completion phase
        The application calls the RpcAsyncCompleteCall function to clean up the resources used to track the RPC operation. You exit the completion phase when RpcAsyncCompleteCall returns something other than RPC_S_ASYNC_CALL_PENDING.
        You cannot call RpcAsyncCancelCall after RpcAsyncCompleteCall indicates that the operation is complete, since that is the call that says "I'm all done!"
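
    One way to keep these rules straight is to write them down as a tiny lookup. The sketch below is a model only; RpcPhase, cancel_is_allowed and cancel_has_effect are hypothetical names, and PHASE_COMPLETED is my reading of "after RpcAsyncCompleteCall has returned something other than RPC_S_ASYNC_CALL_PENDING".

```c
#include <stdbool.h>

/* Hypothetical model of where an RPC operation is in its lifetime. */
typedef enum {
    PHASE_SUBMIT,    /* inside the MIDL-generated stub */
    PHASE_PENDING,   /* waiting for the response */
    PHASE_NOTIFIED,  /* application has been told the result */
    PHASE_COMPLETED  /* RpcAsyncCompleteCall has reported completion */
} RpcPhase;

/* May the application call RpcAsyncCancelCall at this point? */
bool cancel_is_allowed(RpcPhase phase)
{
    return phase == PHASE_PENDING || phase == PHASE_NOTIFIED;
}

/* Would such a call actually change anything? */
bool cancel_has_effect(RpcPhase phase)
{
    /* Only a pending operation can still be hurried along to Notified. */
    return phase == PHASE_PENDING;
}
```

    Note that neither function says anything about when the state structure can be freed; per the discussion above, that has to wait until the operation is complete.
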
  • The Old New Thing

    I am no longer impressed by your fancy new 10,000 BTU hot pot burner

    • 31 Comments

    Two years ago, we had a gathering at my house for some friends for hot pot, the traditional way of ringing in the lunar new year (which takes place today). It was actually a bit of a cross-cultural event, since the attendees came from different regions of Asia, where different traditions reign. (And the American guests just had to choose sides!)

    My house has but one portable stove for hot pot, so one of the guests brought her own unit, a unit as it turns out which was purchased specifically for the occasion, which gleamed in the light and proudly proclaimed 10,000 BTU of raw heating power. This was cause for much boasting, particularly since I didn't know the heating power of my own puny old unit, but I accepted my second-place position with grace.

    Some time later, we had a quiet family hot pot, and my old and horrifically unfashionable burner was brought out to do its tired but important job, and it was then that I found the sticker that specified its heating power.

    9,925 BTU.

    Now I am no longer impressed by my friend's 10,000 BTU burner.

  • The Old New Thing

    Ready... cancel... wait for it! (part 2)

    • 9 Comments

    A customer had a question about I/O cancellation. They have a pending Read­File­Ex call with a completion procedure. They then cancel the I/O with Cancel­Io­Ex and wait for the completion by passing TRUE as the bWait parameter to Get­Overlapped­Result.

    "Assuming both return success, can I assume that my completion procedure will not be called after GetOverlappedResult returns? It appears that GetOverlappedResult waits non-alertably for the I/O to complete, so I'm assuming it just eats the APC if there was one. But if an APC had been posted just before I called CancelIoEx, will it also cancel that APC?"

    Get­Overlapped­Result does not magically revoke completion callbacks. Why should it?

    Recall that completion is not the same as success. Completion means that the I/O subsystem has closed the books on the I/O operation. The underlying operation may have completed successfully or it may have failed (and cancellation is just one of the many possible reasons for failure). Either way, the completion procedure signed up to be notified when the I/O completes, and therefore it will be called to be informed of the completion due to cancellation.

    Besides, as the customer noted, there is a race condition if the Cancel­Io­Ex call is made just after the I/O completed, in which case it didn't get cancelled after all.

    This answers our question from last time, namely, how our fix for the cancellation code was incomplete. If the I/O had been issued with a completion routine (or equivalently, if it had been issued against an I/O completion port), then the code frees the OVERLAPPED structure before the completion routine runs. The kernel doesn't care that you did that (the kernel is finished with the OVERLAPPED structure), but your completion routine is probably not going to be happy that it was given a pointer to freed memory as its lpOverlapped parameter.

    You have to delay freeing the OVERLAPPED structure until the completion routine executes. Typically, this is done by allocating the OVERLAPPED structure on the heap rather than the stack, and making it the completion routine's responsibility to free the memory as its final act.
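
    Here is a minimal sketch of that arrangement, under some assumptions. ReadContext, alloc_read_context and read_completed are hypothetical names, the completions counter exists only so the effect can be observed, and the #else branch supplies bare-bones stand-ins for the Windows types so that the sketch also compiles without the Windows SDK. The OVERLAPPED and the buffer share one heap block, and the completion routine is the only place that block is freed.

```c
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Minimal stand-ins so the sketch compiles without the Windows SDK. */
typedef unsigned long DWORD;
typedef struct _OVERLAPPED { void *hEvent; } OVERLAPPED, *LPOVERLAPPED;
#define CALLBACK
#endif

/* One heap block holding everything the I/O may touch until it completes. */
typedef struct ReadContext {
    OVERLAPPED overlapped;      /* first member, so the pointer casts back */
    unsigned char buffer[1024]; /* the read buffer lives and dies with it */
    int *completions;           /* demo-only counter */
} ReadContext;

ReadContext *alloc_read_context(int *completions)
{
    ReadContext *ctx = calloc(1, sizeof *ctx);
    if (ctx != NULL) {
        ctx->completions = completions;
    }
    return ctx;
}

/*
 * Same shape as a ReadFileEx completion routine. It runs whether the
 * I/O succeeded, failed, or was cancelled.
 */
void CALLBACK read_completed(DWORD error, DWORD bytes, LPOVERLAPPED overlapped)
{
    ReadContext *ctx = (ReadContext *)overlapped;
    (void)error; /* a real routine would inspect these first */
    (void)bytes;
    ++*ctx->completions;
    free(ctx);   /* final act: release the OVERLAPPED and the buffer */
}
```

    In a real program, ctx->buffer, &ctx->overlapped and read_completed would be handed to ReadFileEx, and the code that calls CancelIoEx would simply let go of ctx rather than free it. Because the routine is delivered as an APC, it runs only once the issuing thread enters an alertable wait.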

  • The Old New Thing

    Ready... cancel... wait for it! (part 1)

    • 31 Comments

    One of the cardinal rules of the OVERLAPPED structure is that the OVERLAPPED structure must remain valid until the I/O completes. The reason is that the OVERLAPPED structure is manipulated by address rather than by value.

    The word complete here has a specific technical meaning. It doesn't mean "must remain valid until you are no longer interested in the result of the I/O." It means that the structure must remain valid until the I/O subsystem has signaled that the I/O operation is finally over, that there is nothing left to do, it has passed on: You have an ex-I/O operation.

    Note that an I/O operation can complete successfully, or it can complete unsuccessfully. Completion is not the same as success.

    A common mistake when performing overlapped I/O is issuing a cancel and immediately freeing the OVERLAPPED structure. For example:

    // this code is wrong
     HANDLE h = ...; // handle to file opened as FILE_FLAG_OVERLAPPED
     OVERLAPPED o;
     BYTE buffer[1024];
     InitializeOverlapped(&o); // creates the event etc
     if (ReadFile(h, buffer, sizeof(buffer), NULL, &o) ||
         GetLastError() == ERROR_IO_PENDING) {
      if (WaitForSingleObject(o.hEvent, 1000) != WAIT_OBJECT_0) {
       // took longer than 1 second - cancel it and give up
       CancelIo(h);
       return WAIT_TIMEOUT;
      }
      ... use the results ...
     }
     ...
    

    The bug here is that after calling Cancel­Io, the function returns without waiting for the Read­File to complete. Returning from the function implicitly frees the automatic variable o. When the Read­File finally completes, the I/O system is now writing to stack memory that has been freed and is probably being reused by another function. The result is impossible to debug: First of all, it's a race condition between your code and the I/O subsystem, and breaking into the debugger doesn't stop the I/O subsystem. If you step through the code, you don't see the corruption, because the I/O completes while you're broken into the debugger.

    Here's what happens when the program is run outside the debugger:

    Program                        I/O system
    ReadFile                       I/O begins
    WaitForSingleObject            I/O still in progress
    WaitForSingleObject times out
    CancelIo                       I/O cancellation submitted to device driver
    return
                                   Device driver was busy reading from the hard drive
                                   Device driver receives the cancellation
                                   Device driver abandons the rest of the read operation
                                   Device driver reports that I/O has been canceled
                                   I/O subsystem writes STATUS_CANCELED to OVERLAPPED structure
                                   I/O subsystem queues the completion function (if applicable)
                                   I/O subsystem signals the completion event (if applicable)
                                   I/O operation is now complete

    When the I/O subsystem receives word from the device driver that the cancellation has completed, it performs the usual operations when an I/O operation completes: It updates the OVERLAPPED structure with the results of the I/O operation, and notifies whoever wanted to be notified that the I/O is finished.

    Notice that when it updates the OVERLAPPED structure, it's updating memory that has already been freed back to the stack, which means that it's corrupting the stack of whatever function happens to be running right now. (It's even worse if you happened to catch it while it was in the process of updating the buffer!) Since the precise timing of I/O is unpredictable, the program crashes with memory corruption that keeps changing each time it happens.

    If you try to debug the program, you get this:

    Program                                           I/O system
    ReadFile                                          I/O begins
    WaitForSingleObject                               I/O still in progress
    WaitForSingleObject times out
    Breakpoint hit on CancelIo statement
    Stops in debugger
    Hit F10 to step over the CancelIo call            I/O cancellation submitted to device driver
    Breakpoint hit on return statement
    Stops in debugger
                                                      Device driver was busy reading from the hard drive
                                                      Device driver receives the cancellation
                                                      Device driver abandons the rest of the read operation
                                                      Device driver reports that I/O has been canceled
                                                      I/O subsystem writes STATUS_CANCELED to OVERLAPPED structure
                                                      I/O subsystem queues the completion function (if applicable)
                                                      I/O subsystem signals the completion event (if applicable)
                                                      I/O operation is now complete
    Look at the OVERLAPPED structure in the debugger
    It says STATUS_CANCELED
    Hit F5 to resume execution
    No memory corruption

    Breaking into the debugger changed the timing of the I/O operation relative to program execution. Now, the I/O completes before the function returns, and consequently there is no memory corruption. You look at the OVERLAPPED structure and say, "See? Immediately on return from the Cancel­Io function, the OVERLAPPED structure has been updated with the result, and the buffer contents are not being written to. It's safe to free them both now. Therefore, this can't be the source of my memory corruption bug."

    Except, of course, that it is.

    This is even more crazily insidious because the OVERLAPPED structure and the buffer are updated by the I/O subsystem, which means that it happens from kernel mode. This means that write breakpoints set by your debugger won't fire. Even if you manage to narrow down the corruption to "it happens somewhere in this function", your breakpoints will never see it as it happens. You're going to see that the value was good, then a little while later, the value was bad, and yet your write breakpoint never fired. You're then going to declare that the world has gone mad and seriously consider a different line of work.

    To fix this race condition, you have to delay freeing the OVERLAPPED structure and the associated buffer until the I/O is complete and anything else that's using them has also given up its claim to them.

       // took longer than 1 second - cancel it and give up
       CancelIo(h);
       WaitForSingleObject(o.hEvent, INFINITE); // added
       // Alternatively: GetOverlappedResult(h, &o, TRUE);
       return WAIT_TIMEOUT;
    

    The Wait­For­Single­Object after the Cancel­Io waits for the I/O to complete before finally returning (and implicitly freeing the OVERLAPPED structure and the buffer on the stack). Better would be to use GetOverlapped­Result with bWait = TRUE, because that also handles the case where the hEvent member of the OVERLAPPED structure is NULL.

    Exercise: If you retrieve the completion status after canceling the I/O (either by looking at the OVERLAPPED structure directly or by using GetOverlapped­Result) there's a chance that the overlapped result will be something other than STATUS_CANCELED (or ERROR_CANCELLED if you prefer Win32 error codes). Explain.

    Exercise: If this example had used Read­File­Ex, the proposed fix would be incomplete. Explain and provide a fix. Answer to come next time, and then we'll look at another version of this same principle.

  • The Old New Thing

    There is no longer any pleasure in reading the annual Microsoft injury reports

    • 22 Comments

    Microsoft is required by law to file reports on employees who have sustained injuries on the job. They are also required to post the reports in a location where employees can see them. These reports come out every year on February 1.

    Back in the old days, these reports were filled out by hand, and reading them was oddly amusing for the details. My favorite from the mid-1990s was a report on an employee who was injured on the job, and the description was simply "pencil lead embedded in hand."

    Sadly, the reports are now computerized, and there isn't a place to describe the nature of each injury. It's just a bunch of numbers.

    Numbers are nice, but they don't tell a story in quite the same way.
