• The Old New Thing

    We can't cut that; it's our last feature

    • 38 Comments

    Many years ago, I was asked to help a customer with a problem they were having. I don't remember the details, and they aren't important to the story anyway, but as I was investigating one of their crashes, I started to wonder why they were even doing it.

    I expressed my concerns to the customer liaison. "Why are they writing this code in the first place? The performance will be terrible, and it'll never work exactly the way they want it to."

    The customer liaison confided, "Yeah, I thought the same thing. But this is a feature they're adding to the next version of their product. The product is so far behind schedule, they've been cutting features like mad to get back on track. But they can't cut this feature. It's the last one left!"

  • The Old New Thing

    We've traced the call and it's coming from inside the house: A function call that always fails

    • 64 Comments

    A customer reported that they had a problem with a particular function added in Windows 7. The tricky bit was that the function was used only on very high-end hardware, not the sort of thing your average developer has lying around.

    GROUP_AFFINITY GroupAffinity;
    ... code that initializes the GroupAffinity structure ...
    if (!SetThreadGroupAffinity(hThread, &GrouAffinity, NULL));
    {
     printf("SetThreadGroupAffinity failed: %d\n", GetLastError());
     return FALSE;
    }
    

    The customer reported that the function always failed with error 122 (ERROR_INSUFFICIENT_BUFFER) even though the buffer seems perfectly valid.

    Since most of us don't have machines with more than 64 processors, we couldn't run the code on our own machines to see what happens. People asked some clarifying questions, like whether this code is compiled 32-bit or 64-bit (thinking that maybe there is an issue with the emulation layer), until somebody noticed that there was a stray semicolon at the end of the if statement.

    The customer was naturally embarrassed, but was gracious enough to admit that, yup, removing the semicolon fixed the problem.

    This reminds me of an incident many years ago. I was having a horrible time debugging a simple loop. It looked like the compiler was on drugs and was simply ignoring my loop conditions and always dropping out of the loop. At wit's end, I asked a colleague to come to my office and serve as a second set of eyes. I talked him through the code as I single-stepped:

    "Okay, so we set up the loop here..."

    NODE pn = GetActiveNode();
    

    "And we enter the loop, continuing while the node still needs processing."

    if (pn->NeedsProcessing())
    {
    

    "Okay, we entered the loop. Now we realign the skew rods on the node."

     pn->RealignSkewRods();
    

    "If the treadle is splayed, we need to calibrate the node against it."

     if (IsSplayed()) pn->Recalibrate(this);
    

    "And then we loop back to see if there is more work to be done on this node."

    }
    

    "But look, even though the node needs processing «view node members», we don't loop back. We just drop out of the loop. What's going on?"

    Um, that's an if statement up there, not a while statement.

    A moment of silence while I process this piece of information.

    "All right then, sorry to bother you, hey, how about that sporting event last night, huh?"

  • The Old New Thing

    Controlling which devices will wake the computer out of sleep

    • 62 Comments

    I haven't experienced this problem, but I know of people who have. They'll put their laptop into suspend or standby mode, and after a few seconds, the laptop will spontaneously wake itself up. Someone gave me this tip that might (might) help you figure out what is wrong.

    Open a command prompt and run the command

    powercfg -devicequery wake_from_any

    This lists all the hardware devices that are capable of waking up the computer from standby. But the operating system typically ignores most of them. To see the ones that are not being ignored, run the command

    powercfg -devicequery wake_armed

    This second list is typically much shorter. On my computer, it listed just the keyboard, the mouse, and the modem. (The modem? I never use that thing!)

    You can disable each of these devices one by one until you find the one that is waking up the computer.

    powercfg -devicedisablewake "device name"

    (How is this different from unchecking Allow this device to wake the computer from the device properties in Device Manager? Beats me.)

    Once you find the one that is causing problems, you can re-enable the others.

    powercfg -deviceenablewake "device name"

    I would start by disabling wake-up for the keyboard and mouse. Maybe the optical mouse is detecting tiny vibrations in your room. Or the device might simply be "chatty", generating activity even though you aren't touching it.

    This may not solve your problem, but at least's something you can try. I've never actually tried it myself, so who knows whether it works.

    Exercise: Count how many disclaimers there are in this article, and predict how many people will ignore them.

  • The Old New Thing

    What does an invalid handle exception in LeaveCriticalSection mean?

    • 27 Comments

    Internally, a critical section is a bunch of counters and flags, and possibly an event. (Note that the internal structure of a critical section is subject to change at any time—in fact, it changed between Windows XP and Windows 2003. The information provided here is therefore intended for troubleshooting and debugging purposes and not for production use.) As long as there is no contention, the counters and flags are sufficient because nobody has had to wait for the critical section (and therefore nobody had to be woken up when the critical section became available).

    If a thread needs to be blocked because the critical section it wants is already owned by another thread, the kernel creates an event for the critical section (if there isn't one already) and waits on it. When the owner of the critical section finally releases it, the event is signaled, thereby alerting all the waiters that the critical section is now available and they should try to enter it again. (If there is more than one waiter, then only one will actually enter the critical section and the others will return to the wait loop.)

    If you get an invalid handle exception in LeaveCriticalSection, it means that the critical section code thought that there were other threads waiting for the critical section to become available, so it tried to signal the event, but the event handle was no good.

    Now you get to use your brain to come up with reasons why this might be.

    One possibility is that the critical section has been corrupted, and the memory that normally holds the event handle has been overwritten with some other value that happens not to be a valid handle.

    Another possibility is that some other piece of code passed an uninitialized variable to the CloseHandle function and ended up closing the critical section's handle by mistake. This can also happen if some other piece of code has a double-close bug, and the handle (now closed) just happened to be reused as the critical section's event handle. When the buggy code closes the handle the second time by mistake, it ends up closing the critical section's handle instead.

    Of course, the problem might be that the critical section is not valid because it was never initialized in the first place. The values in the fields are just uninitialized garbage, and when you try to leave this uninitialized critical section, that garbage gets used as an event handle, raising the invalid handle exception.

    Then again, the problem might be that the critical section is not valid because it has already been destroyed. For example, one thread might have code that goes like this:

    EnterCriticalSection(&cs);
    ... do stuff...
    LeaveCriticalSection(&cs);
    

    While that thread is busy doing stuff, another thread calls DeleteCriticalSection(&cs). This destroys the critical section while another thread was still using it. Eventually that thread finishes doing its stuff and calls LeaveCriticalSection, which raises the invalid handle exception because the DeleteCriticalSection already closed the handle.

    All of these are possible reasons for an invalid handle exception in LeaveCriticalSection. To determine which one you're running into will require more debugging, but at least now you know what to be looking for.

    Postscript: One of my colleagues from the kernel team points out that the Locks and Handles checks in Application Verifier are great for debugging issues like this.

  • The Old New Thing

    Why doesn't the RunAs program accept a password on the command line?

    • 57 Comments

    The RunAs program demands that you type the password manually. Why doesn't it accept a password on the command line?

    This was a conscious decision. If it were possible to pass the password on the command line, people would start embedding passwords into batch files and logon scripts, which is laughably insecure.

    In other words, the feature is missing to remove the temptation to use the feature insecurely.

    If this offends you and you want to be insecure and pass the password on the command line anyway (for everyone to see in the command window title bar), you can write your own program that calls the CreateProcessWithLogonW function.

    (I'm told that there is a tool available for download which domain administrators might find useful, though it solves a slightly different problem.)

  • The Old New Thing

    Quick overview of how processes exit on Windows XP

    • 37 Comments

    Exiting is one of the scariest moments in the lifetime of a process. (Sort of how landing is one of the scariest moments of air travel.)

    Many of the details of how processes exit are left unspecified in Win32, so different Win32 implementations can follow different mechanisms. For example, Win32s, Windows 95, and Windows NT all shut down processes differently. (I wouldn't be surprised if Windows CE uses yet another different mechanism.) Therefore, bear in mind that what I write in this mini-series is implementation detail and can change at any time without warning. I'm writing about it because these details can highlight bugs lurking in your code. In particular, I'm going to discuss the way processes exit on Windows XP.

    I should say up front that I do not agree with many steps in the way processes exit on Windows XP. The purpose of this mini-series is not to justify the way processes exit but merely to fill you in on some of the behind-the-scenes activities so you are better-armed when you have to investigate into a mysterious crash or hang during exit. (Note that I just refer to it as the way processes exit on Windows XP rather than saying that it is how process exit is designed. As one of my colleagues put it, "Using the word design to describe this is like using the term swimming pool to refer to a puddle in your garden.")

    When your program calls ExitProcess a whole lot of machinery springs into action. First, all the threads in the process (except the one calling ExitProcess) are forcibly terminated. This dates back to the old-fashioned theory on how processes should exit: Under the old-fashioned theory, when your process decides that it's time to exit, it should already have cleaned up all its threads. The termination of threads, therefore, is just a safety net to catch the stuff you may have missed. It doesn't even wait two seconds first.

    Now, we're not talking happy termination like ExitThread; that's not possible since the thread could be in the middle of doing something. Injecting a call to ExitThread would result in DLL_THREAD_DETACH notifications being sent at times the thread was not prepared for. Nope, these threads are terminated in the style of TerminateThread: Just yank the rug out from under it. Buh-bye. This is an ex-thread.

    Well, that was a pretty drastic move, now, wasn't it. And all this after the scary warnings in MSDN that TerminateThread is a bad function that should be avoided!

    Wait, it gets worse.

    Some of those threads that got forcibly terminated may have owned critical sections, mutexes, home-grown synchronization primitives (such as spin-locks), all those things that the one remaining thread might need access to during its DLL_PROCESS_DETACH handling. Well, mutexes are sort of covered; if you try to enter that mutex, you'll get the mysterious WAIT_ABANDONED return code which tells you that "Uh-oh, things are kind of messed up."

    What about critical sections? There is no "Uh-oh" return value for critical sections; EnterCriticalSection doesn't have a return value. Instead, the kernel just says "Open season on critical sections!" I get the mental image of all the gates in a parking garage just opening up and letting anybody in and out. [See correction.]

    As for the home-grown stuff, well, you're on your own.

    This means that if your code happened to have owned a critical section at the time somebody called ExitProcess, the data structure the critical section is protecting has a good chance of being in an inconsistent state. (Afer all, if it were consistent, you probably would have exited the critical section! Well, assuming you entered the critical section because you were updating the structure as opposed to reading it.) Your DLL_PROCESS_DETACH code runs, enters the critical section, and it succeeds because "all the gates are up". Now your DLL_PROCESS_DETACH code starts behaving erratically because the values in that data structure are inconsistent.

    Oh dear, now you have a pretty ugly mess on your hands.

    And if your thread was terminated while it owned a spin-lock or some other home-grown synchronization object, your DLL_PROCESS_DETACH will most likely simply hang indefinitely waiting patiently for that terminated thread to release the spin-lock (which it never will do).

    But wait, it gets worse. That critical section might have been the one that protects the process heap! If one of the threads that got terminated happened to be in the middle of a heap function like HeapAllocate or LocalFree, then the process heap may very well be inconsistent. If your DLL_PROCESS_DETACH tries to allocate or free memory, it may crash due to a corrupted heap.

    Moral of the story: If you're getting a DLL_PROCESS_DETACH due to process termination,† don't try anything clever. Just return without doing anything and let the normal process clean-up happen. The kernel will close all your open handles to kernel objects. Any memory you allocated will be freed automatically when the process's address space is torn down. Just let the process die a quiet death.

    Note that if you were a good boy and cleaned up all the threads in the process before calling ExitThread, then you've escaped all this craziness, since there is nothing to clean up.

    Note also that if you're getting a DLL_PROCESS_DETACH due to dynamic unloading, then you do need to clean up your kernel objects and allocated memory because the process is going to continue running. But on the other hand, in the case of dynamic unloading, no other threads should be executing code in your DLL anyway (since you're about to be unloaded), so—assuming you coded up your DLL correctly—none of your critical sections should be held and your data structures should be consistent.

    Hang on, this disaster isn't over yet. Even though the kernel went around terminating all but one thread in the process, that doesn't mean that the creation of new threads is blocked. If somebody calls CreateThread in their DLL_PROCESS_DETACH (as crazy as it sounds), the thread will indeed be created and start running! But remember, "all the gates are up", so your critical sections are just window dressing to make you feel good.

    (The ability to create threads after process termination has begun is not a mistake; it's intentional and necessary. Thread injection is how the debugger breaks into a process. If thread injection were not permitted, you wouldn't be able to debug process termination!)

    Next time, we'll see how the way process termination takes place on Windows XP caused not one but two problems.

    Footnotes

    †Everybody reading this article should already know how to determine whether this is the case. I'm assuming you're smart. Don't disappoint me.

  • The Old New Thing

    What's the difference between the COM and EXE extensions?

    • 42 Comments

    Commenter Koro asks why you can rename a COM file to EXE without any apparent ill effects. (James MAstros asked a similar question, though there are additional issues in James' question which I will take up at a later date.)

    Initially, the only programs that existed were COM files. The format of a COM file is... um, none. There is no format. A COM file is just a memory image. This "format" was inherited from CP/M. To load a COM file, the program loader merely sucked the file into memory unchanged and then jumped to the first byte. No fixups, no checksum, nothing. Just load and go.

    The COM file format had many problems, among which was that programs could not be bigger than about 64KB. To address these limitations, the EXE file format was introduced. The header of an EXE file begins with the magic letters "MZ" and continues with other information that the program loader uses to load the program into memory and prepare it for execution.

    And there things lay, with COM files being "raw memory images" and EXE files being "structured", and the distinction was rigidly maintained. If you renamed an EXE file to COM, the operating system would try to execute the header as if it were machine code (which didn't get you very far), and conversely if you renamed a COM file to EXE, the program loader would reject it because the magic MZ header was missing.

    So when did the program loader change to ignore the extension entirely and just use the presence or absence of an MZ header to determine what type of program it is? Compatibility, of course.

    Over time, programs like FORMAT.COM, EDIT.COM, and even COMMAND.COM grew larger than about 64KB. Under the original rules, that meant that the extension had to be changed to EXE, but doing so introduced a compatibility problem. After all, since the files had been COM files up until then, programs or batch files that wanted to, say, spawn a command interpreter, would try to execute COMMAND.COM. If the command interpreter were renamed to COMMAND.EXE, these programs which hard-coded the program name would stop working since there was no COMMAND.COM any more.

    Making the program loader more flexible meant that these "well-known programs" could retain their COM extension while no longer being constrained by the "It all must fit into 64KB" limitation of COM files.

    But wait, what if a COM program just happened to begin with the letters MZ? Fortunately, that never happened, because the machine code for "MZ" disassembles as follows:

    0100 4D            DEC     BP
    0101 5A            POP     DX
    

    The first instruction decrements a register whose initial value is undefined, and the second instruction underflows the stack. No sane program would begin with two undefined operations.

  • The Old New Thing

    If you can detect the difference between an emulator and the real thing, then the emulator has failed

    • 2 Comments

    Recall that a corrupted program sometimes results in a "Program too big to fit in memory" error. In response, Dog complained that while that may have been a reasonable response back in the 1980's, in today's world, there's plenty of memory around for the MS-DOS emulator to add that extra check and return a better error code.

    Well yeah, but if you change the externally visible behavior, then you've failed as an emulator. The whole point of an emulator is to mimic another world, and any deviations from that other world can come back to bite you.

    MS-DOS is perhaps one of the strongest examples of requiring absolute unyielding backward compatibility. Hundreds if not thousands of programs scanned memory looking for specific byte sequences inside MS-DOS so it could patch them or hunted around inside MS-DOS's internal state variables so it could modify them. If you move one thing out of place, those programs stop working.

    MS-DOS contains chunks of "junk DNA", code fragments which do nothing but waste space, but which exist so that programs which go scanning through memory looking for specific byte sequences will find them. (This principle is not dead; there's even some junk DNA in Explorer.)

    Given the extreme compatibility required for MS-DOS emulation, I'm not surprised that the original error behavior was preserved. There is certainly some program out there that stops working if attempting to execute a COM-style image larger than 64KB returns any error other than 8. (Besides, if you wanted it to return some other error code, you had precious few to choose from.)

  • The Old New Thing

    Do not overload the E_NOINTERFACE error

    • 26 Comments

    One of the more subtle ways people mess up IUnknown::QueryInterface is returning E_NOINTERFACE when the problem wasn't actually an unsupported interface. The E_NOINTERFACE return value has very specific meaning. Do not use it as your generic "gosh, something went wrong" error. (Use an appropriate error such as E_OUTOFMEMORY or E_ACCESSDENIED.)

    Recall that the rules for IUnknown::QueryInterface are that (in the absence of catastrophic errors such as E_OUTOFMEMORY) if a request for a particular interface succeeds, then it must always succeed in the future for that object. Similarly, if a request fails with E_NOINTERFACE, then it must always fail in the future for that object.

    These rules exist for a reason.

    In the case where COM needs to create a proxy for your object (for example, to marshal the object into a different apartment), the COM infrastructure does a lot of interface caching (and negative caching) for performance reasons. For example, if a request for an interface fails, COM remembers this so that future requests for that interface are failed immediately rather than being marshalled to the original object only to have the request fail anyway. Requests for unsupported interfaces are very common in COM, and optimizing that case yields significant performance improvements.

    If you start returning E_NOINTERFACE for problems other than "The object doesn't support this interface", COM will assume that the object really doesn't support the interface and may not ask for it again even if you do. This in turn leads to very strange bugs that defy debugging: You are at a call to IUnknown::QueryInterface, you set a breakpoint on your object's implementation of IUnknown::QueryInterface to see what the problem is, you step over the call and get E_NOINTERFACE back without your breakpoint ever hitting. Why? Because at some point in the past, you said you didn't support the interface, and COM remembered this and "saved you the trouble" of having to respond to a question you already answered. The COM folks tell me that they and their comrades in product support end up spending hours debugging customer's problems like "When my computer is under load, sometimes I start getting E_NOINTERFACE for interfaces I definitely support."

    Save yourself and the COM folks several hours of frustration. Don't return E_NOINTERFACE unless you really mean it.

  • The Old New Thing

    Things I've written that have amused other people, Episode 4

    • 66 Comments

    One of my colleagues pointed out that my web site is listed in the references section of this whitepaper. It scares me that I'm being used as formal documentation because that is explicitly what this web site isn't. I wrote back,

    I really need to put a disclaimer on my web site.
    FOR ENTERTAINMENT PURPOSES ONLY

    Remember, this is a blog. The opinions (and even some facts) expressed here are those of the author and do not necessarily reflect those of Microsoft Corporation. Nothing I write here creates an obligation on Microsoft or establishes the company's official position on anything. I am not a spokesperson. I'm just this guy who strings people along in the hopes that they might hear a funny story once in a while.

    You'd think this was obvious, but apparently there are people who think that somehow what I write has the weight of official Microsoft policy and take my sentences apart as if they were legal documents or who take my articles and declare them to be official statements from Microsoft Corporation.

Page 3 of 425 (4,250 items) 12345»