• The Old New Thing

    Debugging walkthrough: Access violation on nonsense instruction, episode 3

    A colleague of mine asked for help debugging a strange failure. Execution halted on what appeared to be a nonsense instruction.

    eax=022b13a0 ebx=00000000 ecx=02570df4 edx=769f4544 esi=02570dec edi=05579748
    eip=76c49131 esp=05cce038 ebp=05cce07c iopl=0         nv up ei pl nz na po nc
    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
    KERNELBASE!GetFileAttributesExW+0x2:
    76c49131 ec              in      al,dx
    

    This is clearly an invalid instruction. But observe that the offset is +2, which is normally the start of the function, because the first two bytes of Windows operating system functions are a mov edi, edi instruction. Therefore, the function is corrupted. Let's look back two bytes to see if it gives any clues.

    0:006> u 76c49131-2
    KERNELBASE!GetFileAttributesExW:
    76c4912f e95aecebf3      jmp     IoLog!Mine_GetFileAttributesExW (6ab07d8e)
    

    Oh look, somebody is doing API patching (already unsupported) and they did a bad job. They tried to patch code while a thread was in the middle of executing it, resulting in a garbage instruction.
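
    The five bytes in the disassembly can be decoded by hand to check both halves of this diagnosis. The sketch below is only an illustration (the function and constant names are mine, not IoLog's): it computes the destination of the E9 near jump from the bytes shown, and picks out the byte that sits at offset +2.

```cpp
#include <cassert>
#include <cstdint>

// Bytes and address copied from the "u 76c49131-2" output in the post.
constexpr std::uint32_t kPatchAddress = 0x76C4912F;
constexpr std::uint8_t kPatchBytes[5] = { 0xE9, 0x5A, 0xEC, 0xEB, 0xF3 };

// Decode the destination of a 5-byte x86 near jump (opcode E9 + rel32).
// The little-endian displacement is relative to the address of the
// instruction that follows the jump, i.e. the jump's address plus 5.
std::uint32_t JmpRel32Target(std::uint32_t address,
                             const std::uint8_t (&bytes)[5])
{
    assert(bytes[0] == 0xE9); // this sketch handles only "jmp rel32"
    const std::uint32_t rel = static_cast<std::uint32_t>(bytes[1])
                            | static_cast<std::uint32_t>(bytes[2]) << 8
                            | static_cast<std::uint32_t>(bytes[3]) << 16
                            | static_cast<std::uint32_t>(bytes[4]) << 24;
    return address + 5 + rel; // unsigned arithmetic wraps modulo 2^32
}

// The byte a thread sees if it resumes at function start + 2.
std::uint8_t ByteAtOffset2() { return kPatchBytes[2]; }
```

    The decoded target is 6ab07d8e, matching the debugger's IoLog!Mine_GetFileAttributesExW, and the byte at offset +2 is ec, which on its own disassembles as the in al,dx that the thread faulted on.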

    This is a bug in IoLog. The great thing about API patching is that when you screw up, it looks like an OS bug. That way, nobody ever files bugs against you!

    (In this case, IoLog is a diagnostic tool which is logging file I/O performed by an application which is being instrumented.)

    My colleague replied, "Thanks. Looks like a missing lock in IoLog. It doesn't surprise me that API patching isn't supported..."

  • The Old New Thing

    The Windows 95 I/O system assumed that if it wrote a byte, then it could read it back

    In Windows 95, compressed data was read off the disk in three steps.

    1. The raw compressed data was read into a temporary buffer.
    2. The compressed data was uncompressed into a second temporary buffer.
    3. The uncompressed data was copied to the application-provided I/O buffer.

    But you could save a step if the I/O buffer was a full cluster:

    1. The raw compressed data was read into a temporary buffer.
    2. The compressed data was uncompressed directly into the application-provided I/O buffer.

    A common characteristic of dictionary-based compression is that a compressed stream can contain a code that says "Generate a copy of bytes X through Y from the existing uncompressed data."

    As a simplified example, suppose the cluster consisted of two copies of the same 512-byte block. The compressed data might say "Take these 512 bytes and copy them to the output. Then take bytes 0 through 511 of the uncompressed output and copy them to the output."

    So far, so good.

    Well, except that if the application wrote to the I/O buffer while the read was in progress, then the read would get corrupted because it would copy the wrong bytes to the second half of the cluster.
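
    To make the hazard concrete, here is a toy model of decompressing directly into the application's buffer. The token format is invented for illustration and is not the actual Windows 95 compression format; the appScribbles flag is a hypothetical stand-in for an application writing to the buffer while the read is in flight.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One instruction in a toy dictionary-compressed stream.
struct Token {
    bool isCopy;          // false: emit `literal`; true: copy earlier output
    std::string literal;  // bytes to emit when isCopy is false
    std::size_t from;     // offset into the output when isCopy is true
    std::size_t length;   // number of bytes to copy when isCopy is true
};

// Build the stream from the example: emit a block, then copy it again.
std::vector<Token> TwoCopiesOf(const std::string& block)
{
    return { Token{ false, block, 0, 0 },
             Token{ true, "", 0, block.size() } };
}

// Decompress straight into the "application buffer" `out`.
// If `appScribbles` is true, simulate the application overwriting the
// first byte of the buffer after each token, while the read is in flight.
std::string Decompress(const std::vector<Token>& tokens, bool appScribbles)
{
    std::string out;
    for (const Token& t : tokens) {
        if (t.isCopy) {
            assert(t.from + t.length <= out.size());
            const std::string chunk = out.substr(t.from, t.length);
            out += chunk; // back-reference reads whatever is in the buffer now
        } else {
            out += t.literal;
        }
        if (appScribbles && !out.empty()) {
            out[0] = '?';
        }
    }
    return out;
}
```

    With a well-behaved application, TwoCopiesOf("ABCD") decompresses to "ABCDABCD". With the scribbling application, the overwritten byte is picked up by the back-reference, so the second half of the buffer is wrong as well.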

    Fortunately, writing to the I/O buffer is forbidden during the read, so any application that pulled this sort of trick was breaking the rules, and if it got corrupted data, well, that's its own fault. (You can construct a similar scenario where writing to the buffer during a write can result in corrupted data being written to disk.)

    Things got even weirder if you passed a memory-mapped device as your I/O buffer. There was a bug that said, "The splash screen for this MS-DOS game is all corrupted if you run it from a compressed volume."

    The reason was that the game issued an I/O directly into the video frame buffer. The EGA and VGA video frame buffers used planar memory and latching. When you read or write a byte in video memory, the resulting behavior is a complicated combination of the byte you wrote, the values in the latches, other configuration settings, and the values already in memory. The details aren't important; the important thing is that video memory does not act like system RAM. Write a byte to video memory, then read it back, and not only will you not get the same value back, but you probably modified video memory in a strange way.

    The game in question loaded its splash screen by issuing I/O directly into video memory, knowing that MS-DOS copies the result into the output buffer byte by byte. It set up the control registers and the latches in such a way that the bytes written into memory go exactly where they should. (It issued four reads into the same buffer, with different control registers each time, so that each read ended up being issued to a different plane.)

    This worked great, unless the disk was compressed.

    The optimization above relied on the property that writing a byte followed by reading the byte produces the byte originally written. But this doesn't work for video memory because of the weird way video memory works. The result was that when the decompression engine tried to read what it thought was the uncompressed data, it was actually asking the video controller to do some strange operations. The result was corrupted decompressed data, and corrupted video data.

    The fix was to force double-buffering in non-device RAM if the I/O buffer was into device-mapped memory.

  • The Old New Thing

    Rules can exist not because there's a problem, but in order to prevent future problems

    I lost the link, but one commenter noted that the Read­File function documentation says

    Applications must not read from, write to, reallocate, or free the input buffer that a read operation is using until the read operation completes.

    The commenter noted, "What is the point of the rule that disallows reading from or writing to the input buffer while the I/O is in progress? If there is no situation today where this actually causes a problem, then why is the rule there?"

    Not all rules exist to address current problems. They can also exist to prevent future problems.

    In general, you don't want the application messing with an I/O buffer because the memory may have been given to the device, and now the device has to deal with bus contention. And there isn't really much interesting you can do with the buffer before the I/O completes. You can't assume that the I/O will complete the first byte of the buffer first, and the last byte of the buffer last. The I/O request may get split into multiple pieces, and the individual pieces may complete out of order.

    So the rule against accessing the buffer while I/O is in progress is not a significant impediment in practice because you couldn't reliably obtain any information from the buffer until the I/O completed. And the rule leaves room for future versions of the operating system to take advantage of the fact that the application will not read from or write to the buffer.

    Tomorrow, I'll tell a story of a case where accessing the I/O buffer before the I/O completed really did cause problems in Windows 95.

  • The Old New Thing

    Microspeak: DRI, the designated response individual

    Someone sent a message to a peer-to-peer discussion group and remarked, "This is critical. I'm a DRI at the moment and have some issues to fix."

    The term DRI was new to many people on the mailing list (including me), and while others on the mailing list helped to solve the person's problem, I also learned that DRI stands for designated response individual or designated responsible individual, depending on whom you ask. This is the person who is responsible for monitoring and replying to email messages sent to a hot issues mailing list. For online services, it's the person responsible for dealing with live site issues.

    From what I can gather, teams that use this model rotate the job of being the DRI through the members of the team, so that each person on the schedule serves as DRI for a set period of time (typically a day or a week). The DRI may also be responsible for running various tools at specific times. Each team sets its own rules.

    Other teams have come up with their own name for this job. Another term I've seen is Point Dev. On our team, we call it the Developer of the Day.

    Bonus chatter: I bought this hat back in the day when the stitching was done by hand on a specially-designed sewing machine. Nowadays, it's computerized.

  • The Old New Thing

    Insightful graph: The ship date predictor

    The best graphs are the ones that require no explanation. You are just told what the x- and y-axes represent, and the answer just jumps out at you.

    One of the greatest graphs I've seen at Microsoft is this one that a colleague of mine put together as Windows 95 was nearing completion. He took each email message from management that changed the Windows 95 RTM date (also known as the ship date) and plotted it on a chart. The x-axis is the date the statement was made and the y-axis is the number of days remaining in the project, according to the email. The dotted line is a linear least-squares fit, and the green star is the actual ship date (July 14, 1995).

    [Chart: days remaining in the project (y-axis, 100 to 600) plotted against the date of each announcement (x-axis, April 1992 through October 1995).]

    What's so amazing about this chart is that the linear approximation predicts the actual ship date with very high accuracy. The slope of the line is 0.43 in magnitude (the announced days remaining dropped by about 0.43 for each calendar day that passed), which means that if you took the predicted "days remaining before we ship" and multiplied it by around 2.3, you'd be pretty close to the actual ship date.

    In other words, management fairly consistently underestimated the number of days until RTM by a factor of 2.3. (Another way of looking at it is that the development team consistently underreported the number of days to completion to management by a factor of 2.3.)
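
    The factor of 2.3 follows from the slope. As a sketch, assume the fitted line is read as the announced days-remaining figure R(t) falling by 0.43 for each calendar day t, and let T be the date at which the line reaches zero. (The symbols R, t, and T are mine, not the chart's.)

```latex
R(t) \approx 0.43\,(T - t)
\quad\Longrightarrow\quad
T - t \approx \frac{R(t)}{0.43} \approx 2.3\,R(t)
```

    Under that reading, the true number of days remaining on any given date is roughly 2.3 times whatever the most recent announcement claimed.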

    Bonus amusement

    Here is a pull quote from each of the announcements, lightly edited.

    Date | Revised RTM | Remark
    February 1992 | June 1993 | "Ready to RTM 6/93. Otherwise, I'll be applying for a job at McDonalds."
    April 1992 | September 1993 | "This is a critical release."
    July 1992 | March 1994 | "The feature set will NOT be expanded to fill the new schedule."
    September 1992 | December 1993 | "This product must RTM by the end of 1993. If we miss this window of opportunity, then the value of this product goes way down."
    January 1993 | March 1994 | "I recently learned that Team X was planning around a Q4 94 ship date!" (Team X provided code to Windows 95.)
    March 1993 | April 1994 | "We need to formulate plans which get us there."
    August 1993 | May 1994 | "It's really important for the company that we make this date. This must be our last slip."
    December 1993 | August 1994 | "This is about as late as we can go without incurring big financial problems for the company."
    February 1994 | September 1994 | "What determines the ship date is the team's commitment to a ship date. We must make our RTM date."
    May 1994 | November 1994 | "Software and hardware vendors are counting on us."
    August 1994 | February 1995 | "Completing this milestone by the end of the year is absolutely critical to the product gaining quick success."
    December 1994 | May 1995 | "People all over are planning their business on when we release. We must make our current date."

    Today marks the 20th anniversary of the public release of Windows 95. Just one more year, and you'll be old enough to buy a drink!¹

    Bonus reading: Start Me Up (again): Brad Chase (who ran the worldwide launch of Windows 95) tells the story of how Start Me Up became the anthem for Windows 95, and addresses the legend that it cost $14 million to license the song. (Spoiler: It was more like $3 million.)

    Bonus chatter: The ticket price for the Windows 95 team reunion party is $47.50. This seems like an odd number, but it makes more sense when you buy two tickets (one for you, and one for your partner).

    ¹ In the United States, the age at which it is legal to purchase alcohol is 21.

  • The Old New Thing

    Handy delegate shortcut hides important details: The hidden delegate

    One of my colleagues was having trouble with a little tool he wrote.

    I installed a low-level keyboard hook following the code in this article, but it crashes randomly. Here's what I know so far:

    • I spawn a new STA thread to register the hook, so that it can run a message pump, which is a requirement for low-level hooks.
    • After setting the hook, the program waits on a Manual­Reset­Event with Wait­One(). Since this is being called from an STA thread, it will pump messages while waiting, which is what we want.
    • The event is signaled by another part of the program when the hook is no longer needed, at which point the thread unregisters the hook before exiting normally.

    The crash happens inside Wait­One() immediately after keyboard activity occurs. The debugger tells me that it is crashing trying to dispatch a call into a managed stub via the message pump, but that's all I was able to extract.

    I took a look at the article that my colleague referenced and observed that there was a subtlety in the code that was not obvious, and which may have been lost in translation. I shared my observation in the form of a psychic prediction.

    My psychic powers tell me that you did not prevent the delegate from getting GC'd. The next time GC runs, the delegate will get collected, and the next attempt to fire the callback will AV because it's calling into memory that has been freed.

    The sample code from the blog avoids this problem by putting the delegate in a private static, which makes it a GC root, ineligible for collection.

    private static LowLevelKeyboardProc _proc = HookCallback;
    

    This is subtle because the private static is decoupled from Set­Hook. If you copied Set­Hook but not the private static, then you inadvertently created a bug because local variables can get optimized out.

    Either put it in a static, like the sample does, or explicitly extend the delegate's lifetime by calling GC.KeepAlive() after you unhook the hook.

    LowLevelKeyboardProc proc = HookCallback;
    IntPtr hookId = SetHook(proc);
    WaitOne();
    RemoveHook(hookId);
    GC.KeepAlive(proc); // keep the proc alive until this line is reached
    

    My colleague realized that was the problem.

    I'd actually thought of that (mostly). I made my callback method itself a static, thinking that this was enough. What I forgot is that C# wraps that in an instance of the delegate automatically, and it was this hidden delegate that was getting GC'd, not the callback function itself. This explains why I could always inspect the callback method and see that it was alive and well, yet we were still jumping into space when invoking the callback.

    Explicitly calling out the assignment reminded me of the details of delegates. Thanks!

    The classical notation for creating a delegate is

        DelegateType d = new DelegateType(o.Method);
        DelegateType d = new DelegateType(Method); // this.Method
    

    C# version 2.0 added delegate inference which lets you omit the new DelegateType most of the time. The compiler will automatically convert the method name (and optional this object) into a delegate.

        DelegateType d = o.Method;
        DelegateType d = Method; // this.Method
    

    This shorthand is so old, you may not even remember (or realize) that it is a shorthand for a hidden delegate.

    In my colleague's program, the line

        IntPtr hookId = SetHook(HookCallback);
    

    was shorthand for

        LowLevelKeyboardProc temp = HookCallback;
        IntPtr hookId = SetHook(temp);
    

    Once the delegate was made explicit rather than hidden, the issue became clear: Since there was nothing keeping the delegate alive, the delegate disappeared at the next GC, and the unmanaged function pointer disappeared with it.

    And now CLR Week will disappear until next time.

  • The Old New Thing

    I saw a pinvoke signature that passed a UInt64 instead of a FILETIME, what's up with that?

    A customer had a question about a pinvoke signature that used a UInt64 to hold a FILETIME structure.

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetProcessTimes(
        IntPtr hProcess,
        out UInt64 creationTime,
        out UInt64 exitTime,
        out UInt64 kernelTime,
        out UInt64 userTime);
    

    Is this legal? The documentation for FILETIME says

    Do not cast a pointer to a FILETIME structure to either a ULARGE_INTEGER* or __int64* value because it can cause alignment faults on 64-bit Windows.

    Are we guilty of this cast in the above code? After all, you can't treat a FILETIME as an __int64.

    There are two types of casts possible in this scenario.

    • Casting from FILETIME* to __int64*.
    • Casting from __int64* to FILETIME*.

    The FILETIME structure requires 4-byte alignment, and the __int64 data type requires 8-byte alignment. Therefore the first cast is unsafe, because you are casting from a pointer with lax alignment requirements to one with stricter requirements. The second cast is safe because you are casting from a pointer with strict alignment requirements to one with laxer requirements.

    [Diagram: a pink box labeled "4-byte aligned" containing a smaller blue box labeled "8-byte aligned".]

    Everything in the blue box is also in the pink box, but not vice versa.

    Which cast is the one occurring in the above pinvoke signature?

    In the above signature, the UInt64 is being allocated by the interop code, and therefore it is naturally aligned for UInt64, which means that it is 8-byte aligned. The Get­Process­Times function then treats those eight bytes as a FILETIME. So we are in the second case, where we cast from __int64* to FILETIME*.
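
    The asymmetry between the two casts can be checked in portable C++. The sketch below uses a stand-in structure (FileTimeLike, a name of my own) with the same two 32-bit fields as FILETIME rather than the real Windows header, and a hypothetical Record type to show a perfectly legal placement that is 4-byte aligned but not 8-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in with the same shape as FILETIME: two 32-bit halves.
struct FileTimeLike {
    std::uint32_t dwLowDateTime;
    std::uint32_t dwHighDateTime;
};
static_assert(sizeof(FileTimeLike) == 8, "two 32-bit fields, no padding");
static_assert(alignof(FileTimeLike) == 4, "only 4-byte alignment required");

// A FileTimeLike may legally sit at an offset that is not a multiple of 8,
// which is why FileTimeLike* -> uint64_t* is the unsafe direction.
struct Record {
    std::uint32_t tag;
    FileTimeLike time; // lands at offset 4
};
static_assert(offsetof(Record, time) == 4, "4-aligned but not 8-aligned");

// Conversions by value sidestep pointer casts entirely.
std::uint64_t ToUInt64(const FileTimeLike& ft)
{
    return (static_cast<std::uint64_t>(ft.dwHighDateTime) << 32)
         | ft.dwLowDateTime;
}

FileTimeLike FromUInt64(std::uint64_t value)
{
    return { static_cast<std::uint32_t>(value),
             static_cast<std::uint32_t>(value >> 32) };
}
```

    Storage that was declared as a 64-bit integer can always hold a FileTimeLike, but a FileTimeLike embedded in something like Record cannot safely be read through a 64-bit pointer on platforms that enforce alignment.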

    Mind you, you can avoid all this worrying by simply declaring your pinvoke more accurately. The correct solution is to declare the last four parameters as ComTypes.FILETIME. Now there are no sneaky games. Everything is exactly what it says it is.

    Bonus reading: The article Use PowerShell to access registry last-modified time stamp shows how to use the ComTypes.FILETIME technique from PowerShell.

  • The Old New Thing

    If you are going to call Marshal.GetLastWin32Error, the function whose error you're retrieving had better be the one called most recently

    Even if you remember to set Set­Last­Error=true in your p/invoke signature, you still have to be careful with Marshal.Get­Last­Win32­Error because there is only one last-error code, and it gets overwritten each time.

    So let's try this program:

    using System;
    using System.Runtime.InteropServices;
    
    class Program
    {
      [DllImport("user32.dll", SetLastError=true)]
      public static extern bool OpenIcon(IntPtr hwnd);
    
      public static void Main()
      {
        // Intentionally pass an invalid parameter.
        var result = OpenIcon(IntPtr.Zero);
        Console.WriteLine("result: {0}", result);
        Console.WriteLine("last error = {0}",
                          Marshal.GetLastWin32Error());
      }
    }
    

    The expectation is that the call to Open­Icon will fail, and the error code will be some form of invalid parameter.

    But when you run the program, it prints this:

    result: False
    last error = 0
    

    Zero?

    Zero means "No error". But the function failed. Where's our error code? We printed the result immediately after calling Open­Icon. We didn't call any other p/invoke functions. The last-error code should still be there.

    Oh wait, printing the result to the screen involves a function call.

    That function call might itself do a p/invoke!

    We have to call Marshal.Get­Last­Win32­Error immediately after calling Open­Icon. Nothing else can sneak in between.

    using System;
    using System.Runtime.InteropServices;
    
    class Program
    {
      [DllImport("user32.dll", SetLastError=true)]
      public static extern bool OpenIcon(IntPtr hwnd);
    
      public static void Main()
      {
        // Intentionally pass an invalid parameter.
        var result = OpenIcon(IntPtr.Zero);
        var lastError = Marshal.GetLastWin32Error();
        Console.WriteLine("result: {0}", result);
        Console.WriteLine("last error = {0}",
                          lastError);
      }
    }
    

    Okay, now the program reports the error code as 1400: "Invalid window handle."

    This one was pretty straightforward, because the function call that modified the last-error code was right there in front of us. But there are other ways that code can run which are more subtle.

    • If you retrieve a property, the property retrieval may involve a p/invoke.
    • If you access a class that has a static constructor, the static constructor will secretly run if this is the first time the class is used.
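
    The ordering hazard in this post can be modeled without any Windows APIs. In the sketch below every name is hypothetical: g_lastError stands in for the single per-thread last-error slot, OpenIconLike for the failing call, and LogResult for any intervening code that happens to overwrite the slot.

```cpp
#include <cassert>

// Hypothetical stand-in for the single per-thread last-error slot.
thread_local int g_lastError = 0;

// Hypothetical API modeled on the OpenIcon call in the post: a null
// handle fails and records 1400 ("Invalid window handle").
bool OpenIconLike(const void* hwnd)
{
    g_lastError = (hwnd == nullptr) ? 1400 : 0;
    return hwnd != nullptr;
}

// Hypothetical logging helper. Its internal work succeeds and, in this
// model, overwrites the slot, as any intervening call is allowed to do.
void LogResult(bool /*result*/)
{
    g_lastError = 0;
}

// Reads the slot too late: the logging call has already overwritten it.
int ErrorReadAfterLogging()
{
    const bool ok = OpenIconLike(nullptr);
    LogResult(ok);
    return g_lastError;
}

// Captures the slot immediately, before anything else can run.
int ErrorCapturedImmediately()
{
    const bool ok = OpenIconLike(nullptr);
    const int captured = g_lastError;
    LogResult(ok);
    return captured;
}
```

    The late read reports 0 and the immediate capture reports 1400, mirroring the two C# programs above.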

  • The Old New Thing

    If you are going to call Marshal.GetLastWin32Error, the function whose error you're retrieving had better have SetLastError=true

    A customer reported that their p/invoke to a custom DLL was failing, and the error code made no sense.

    // C#
    using System;
    using System.Runtime.InteropServices;
    using System.Diagnostics;
    
    class Program
    {
      [DllImport("contoso.dll", CallingConvention=CallingConvention.Cdecl)]
      public static extern int Fribble();
    
      public static void Main()
      {
        Console.WriteLine("About to call Fribble");
    
        var result = Fribble();
        if (result >= 0) {
          Console.WriteLine("succeeded {0}", result);
        } else {
          Console.WriteLine("failed {0}, last error = {1}",
                            result, Marshal.GetLastWin32Error());
        }
      }
    }
    
    // C++
    
    int __cdecl Fribble()
    {
     HANDLE hEvent = OpenEvent(EVENT_MODIFY_STATE, FALSE,
                               TEXT("FribbleEvent"));
     if (hEvent == nullptr) {
      return -1;
     }
    
     if (!SetEvent(hEvent)) {
      CloseHandle(hEvent);
      return -2;
     }
    
     CloseHandle(hEvent);
     return 1;
    }
    

    The customer reported that their Fribble function was returning −1, indicating a failure to open the event, but the error code returned by Marshal.Get­Last­Win32­Error is 87, "The parameter is incorrect." But all of the parameters to Open­Event look correct. Why are we getting this strange error code?

    My psychic powers tell me that if the customer had taken the time to troubleshoot their problem by writing a C++ program that calls the Fribble function, Get­Last­Error would have returned the more reasonable error 2, meaning that the event does not exist.

    That's because Get­Last­Error is working fine. The last error code is 2.

    The problem is with the p/invoke declaration.

    The documentation for the Marshal.Get­Last­Win32­Error function says as its very first line

    Returns the error code returned by the last unmanaged function that was called using platform invoke that has the DllImportAttribute.SetLastError flag set.

    (Emphasis mine.)

    This reminder about Dll­Import­Attribute.Set­Last­Error is repeated in the Remarks.

    You can use this method to obtain error codes only if you apply the System.Runtime.Interop­Services.Dll­Import­Attribute to the method signature and set the Set­Last­Error field to true.

    Observe that the Set­Last­Error field was not set in the p/invoke declaration. Therefore, what you are actually getting when you call Marshal.Get­Last­Win32­Error is whatever error was lying around after the previous call to a p/invoke function that did specify Set­Last­Error = true.
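
    One way to picture the documented behavior is as two separate slots: the native last-error value, and a saved copy that the managed side reads. The sketch below is a simplified model built on that assumption, not the CLR's implementation, and all of its names are hypothetical.

```cpp
#include <cassert>

thread_local int g_nativeLastError = 0; // what GetLastError would report
thread_local int g_savedLastError = 0;  // what the managed side can see

// Hypothetical model of a p/invoke call: the native error is copied to the
// managed-visible slot only when the declaration opted in.
template <typename NativeCall>
int InvokeNative(NativeCall call, bool setLastError)
{
    const int result = call();
    if (setLastError) {
        g_savedLastError = g_nativeLastError;
    }
    return result;
}

// Some earlier, unrelated call that failed with 87 and did opt in.
int EarlierCall() { g_nativeLastError = 87; return 0; }

// Modeled on Fribble from the post: the event is missing, so error 2.
int FribbleLike() { g_nativeLastError = 2; return -1; }

// Returns the error the managed caller would observe after FribbleLike.
int ObservedError(bool fribbleSetsLastError)
{
    InvokeNative(EarlierCall, true);
    InvokeNative(FribbleLike, fribbleSetsLastError);
    return g_savedLastError;
}
```

    In this model, ObservedError(false) yields the stale 87 left behind by the earlier call, while ObservedError(true) yields the 2 that Fribble actually produced.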

    Changing the p/invoke to

    [DllImport("contoso.dll", SetLastError=true,
               CallingConvention=CallingConvention.Cdecl)]
    public static extern int Fribble();
    

    fixed the problem.

  • The Old New Thing

    p/invoke gotcha: C++ bool is not Win32 BOOLEAN is not UnmanagedType.Bool

    Welcome to CLR Week. I hope you enjoy your stay.

    A customer reported that their p/invoke was not working.

    We aren't getting the proper return codes from the Audit­Set­System­Policy. When the call succeeds, the return code is 1, as expected. But in our tests, when we force the call to fail (insufficient access), the return code is not zero. Instead, the return code is some value of the form 0xFFxxxxxx, where the x's vary, but the high byte is always 0xFF.

    For reference, the DllImport declaration we are using is

    [DllImport("advapi32.dll", SetLastError=true)]
    public static extern UInt32 AuditSetSystemPolicy(
        IntPtr pAuditPolicy,
        UInt32 policyCount);
    

    The corresponding Win32 declaration is

    BOOLEAN WINAPI AuditSetSystemPolicy(
      _In_  PCAUDIT_POLICY_INFORMATION pAuditPolicy,
      _In_  ULONG PolicyCount
    );
    

    Alas, the customer fell into one of the common gotchas when writing p/invoke: They confused BOOLEAN and BOOL.

    BOOL is a 32-bit integer, whereas BOOLEAN is an 8-bit integer.

    Since they were marshaling the return code as a UInt32, they were getting the byte returned by the function, plus three bonus uninitialized garbage bytes. If they studied more closely, they would have found that the erroneous return codes were all of the form 0xFFxxxx00 where the bottom 8 bits are all zero. That's because the bottom 8 bits are the actual value; the rest are garbage.
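
    The effect of reading an 8-bit result as a 32-bit value can be shown with plain integer arithmetic. The sketch below is a model only: it treats the upper 24 bits of the return register as stale data left over from earlier work, and its function names are illustrative rather than part of any real API.

```cpp
#include <cassert>
#include <cstdint>

// Model a 32-bit return register in which the callee writes only the low
// byte (the BOOLEAN) and leaves the upper 24 bits holding stale data.
std::uint32_t RegisterAfterBooleanReturn(std::uint32_t staleRegister,
                                         std::uint8_t booleanResult)
{
    return (staleRegister & 0xFFFFFF00u) | booleanResult;
}

// Wrong: treat all 32 bits as the result, as the UInt32 declaration did.
bool SucceededReadAsUInt32(std::uint32_t reg) { return reg != 0; }

// Right: look only at the low 8 bits, as a byte-sized return type does.
bool SucceededReadAsByte(std::uint32_t reg)
{
    return static_cast<std::uint8_t>(reg) != 0;
}
```

    A failing call that leaves 0xFF123456 in the register comes back as 0xFF123400: nonzero when read as 32 bits, but correctly zero when only the low byte is examined.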

    The correct declaration is to use Unmanaged­Type.U1 aka byte rather than Unmanaged­Type.U4 aka UInt32.

    [DllImport("advapi32.dll", SetLastError=true)]
    public static extern byte AuditSetSystemPolicy(
        IntPtr pAuditPolicy,
        UInt32 policyCount);
    

    The customer confirmed that switching to Unmanaged­Type.U1 fixed the problem.
