January, 2013

  • The Old New Thing

    What is this rogue version 1.0 of the HTML clipboard format?

    • 34 Comments

    At least as of the time this article was originally written, the HTML clipboard format is officially at version 0.9. A customer observed that sometimes they received HTML clipboard data that marked itself as version 1.0 and wanted to know where they could find documentation on that version.

    As far as I can tell, there is no official version 1.0 of the HTML clipboard format.

    I hunted around, and the source of the rogue version 1.0 format appears to be the WPF Toolkit. Version 1.0 has been the version used by ClipboardHelper.cs since its initial commit.

    If you read the code, it appears that they are not generating HTML clipboard data that uses any features beyond version 0.9, so the initial impression is that it's just somebody who jumped the gun and set their version number higher than they should have. The preliminary analysis says that you can treat version 1.0 the same as version 0.9.

    But that's merely the preliminary analysis.

    A closer look at the Get­Clipboard­Content­For­Html function shows that it generates the HTML content incorrectly: the code treats the fragment start and end offsets as character offsets, even though the offsets are explicitly documented as byte counts.

    • StartFragment: Byte count from the beginning of the clipboard to the start of the fragment.
    • EndFragment: Byte count from the beginning of the clipboard to the end of the fragment.

    My guess is that the author of that helper function made two mistakes that partially offset each other.

    1. The author failed to take into account that C# operates in Unicode, whereas the HTML clipboard format operates in UTF-8. The byte offset specified in the HTML format header is the byte count in the UTF-8 encoding, not the byte count in the UTF-16 encoding used by C#.
    2. The author did all their testing with ASCII strings. UTF-8 has the property that ASCII encodes to itself, so one byte equals one character. If they had tested with a non-ASCII character, they would have seen the importance of the byte count. (Or maybe they simply would have gotten more confused.)

    Now, WPF knows that the Data­Formats.HTML clipboard format is encoded in UTF-8, so when you pass a C# string to be placed on the clipboard as HTML, it knows to convert the string to UTF-8 before putting it on the clipboard. But it doesn't know to convert the offsets you provided in the HTML fragment itself. As a result, the values encoded in the offsets end up too small if the text contains non-ASCII characters. (You can see this by copying text containing non-ASCII characters from the DataGrid control, then pasting into Word. Result: Truncated text, possibly truncated to nothing depending on the nature of the text.)
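
    If you are generating CF_HTML yourself, the safe way to compute these offsets is to do the arithmetic on the UTF-8 bytes rather than on the UTF-16 string. Here is a minimal sketch from the native side; the helper name is made up for illustration, and error handling is omitted.

    // Sketch: an HTML clipboard offset is a UTF-8 byte count, not a UTF-16
    // character count, so measure the text as it will look once encoded as UTF-8.
    #include <windows.h>
    #include <string>

    int Utf8ByteCount(const std::wstring& text)
    {
        // Passing a null output buffer asks WideCharToMultiByte for the
        // required byte count without doing the conversion.
        return WideCharToMultiByte(CP_UTF8, 0,
                                   text.c_str(), static_cast<int>(text.size()),
                                   nullptr, 0, nullptr, nullptr);
    }

    // For example, L"héllo" is 5 UTF-16 code units but 6 UTF-8 bytes,
    // so using text.size() as the offset would come up one byte short.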

    There are two other errors in the Get­Clipboard­Content­For­Html function: Although the code attempts to follow the recommendation of the specification by placing a <!--EndFragment--> marker after the fragment, it erroneously inserts a \r\n in between. Furthermore, the EndHTML value is off by two. (It should be DATA­GRID­VIEW_html­End­Fragment.Length, which is 38, not 36.)

    Okay, now that we see the full situation, it becomes clear that at least five things need to happen.

    The immediate concern is what an application should do when it sees a rogue version 1.0. One approach is to exactly undo the errors in the WPF Toolkit: Treat the offsets as character offsets (after converting from UTF-8 to UTF-16) rather than byte offsets. This would address the direct problem of the WPF Toolkit, but it is also far too aggressive, because there may be another application which accidentally marked its HTML clipboard data as version 1.0 but which does not contain the exact same bug as the WPF Toolkit. Instead, applications which see a version number of 1.0 should treat the EndHTML, EndFragment, and EndSelection offsets as untrustworthy. The application should verify that the EndFragment lines up with the <!--EndFragment--> marker. If it does not, then ignore the specified value for EndFragment and infer the correct offset to the fragment end by searching for the last occurrence of the <!--EndFragment--> marker in the clipboard data, but trim off the spurious \r\n that the WPF Toolkit erroneously inserted, if present. Similarly, EndHTML should line up with the end of the </HTML> tag; if not, the specified offset should be ignored and the correct value inferred. Fortunately, the WPF Toolkit does not use EndSelection, so there is no need to attempt to repair that value, and it does not use multiple fragments, so only one fragment repair is necessary.
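
    Here is a rough sketch of that fragment-repair step, operating on the raw UTF-8 clipboard bytes. It is only an illustration of the idea described above, not production code, and the function name is made up.

    // Sketch: if the declared EndFragment doesn't line up with the marker,
    // infer the fragment end from the last <!--EndFragment--> in the data,
    // trimming the spurious \r\n that the WPF Toolkit inserts before it.
    #include <string>

    size_t RepairEndFragment(const std::string& utf8Clipboard, size_t declaredEndFragment)
    {
        static const std::string marker = "<!--EndFragment-->";

        size_t markerPos = utf8Clipboard.rfind(marker);
        if (markerPos == std::string::npos) return declaredEndFragment; // nothing to check against

        if (declaredEndFragment == markerPos) return declaredEndFragment; // lines up; trust it

        size_t inferredEnd = markerPos;          // declared value is untrustworthy
        if (inferredEnd >= 2 && utf8Clipboard.compare(inferredEnd - 2, 2, "\r\n") == 0) {
            inferredEnd -= 2;                    // drop the erroneous \r\n
        }
        return inferredEnd;
    }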

    Welcome to the world of application compatibility, where you have to accommodate the mistakes of others.

    Some readers of this Web site would suggest that the correct course of action for your application is to detect version 1.0 and put up an error message saying, "The HTML on the clipboard was placed there by a buggy application. Contact the vendor of that application and tell them to fix their bug. Until then, I will refuse to paste the data you copied. Don't blame me! I did nothing wrong!" Good luck with that.

    Second, the authors of the WPF Toolkit should fix their bug so that they encode the offsets correctly in their HTML clipboard format.

    Third, at the same time they fix their bug, they should switch their reported version number back to 0.9, so as to say, "Okay, everybody, this is the not-buggy version. No workaround needed any more." If they leave it as 1.0, then applications which took the more aggressive workaround will end up double-correcting.

    Fourth, the maintainers of the HTML clipboard format may want to document the rogue version 1.0 clipboard format and provide recommendations to applications (like I just did) as to what they should do when they encounter it.

    Fifth, the maintainers of the HTML clipboard format must not use version 1.0 as the version number for any future revision of the HTML clipboard format. If they make another version, they need to call it 0.99 or 1.01 or something different from 1.0. Version 1.0 is now tainted. It's the version number that proclaims, "I am buggy!"

    At first, we thought that all we found was a typo in an open-source helper library, but digging deeper and deeper revealed that it was actually a symptom of a much deeper problem that has now turned into an industry-wide five-pronged plan for remediation.

  • The Old New Thing

    A few stray notes on Windows patching and hot patching

    • 58 Comments

    Miscellaneous notes, largely unorganized.

    • A lot of people questioned the specific choice of MOV EDI, EDI as the two-byte NOP, with many people suggesting alternatives. The decision to use MOV EDI, EDI as the two-byte NOP instruction came after consulting with CPU manufacturers for their recommendations for the best two-byte NOP. So if you think something better should have been used, go take it up with the CPU manufacturers. They're the ones who came up with the recommendation. (Though I suspect they know more about the best way to optimize code for their CPUs than you do.)
    • You can enable hotpatching on your own binaries by passing the /hotpatch flag to the compiler.
    • The primary audience for hotpatching is server administrators who want to install a security update without having to reboot the computer.
    • There were some people who interpreted the presence of hotpatch points as a security hole, since it makes it easier for malware to redirect OS code. Well, yes, but it didn't enable anything that they didn't already know how to do. If malware can patch your process, then it has already made it to the other side of the airtight hatchway. And besides, malware authors aren't going to bother carefully patching code to avoid obscure race conditions. They're just going to patch the first five bytes of the function without regard for safety, because that'll work 99% of the time. (It's not like the other 1% are going to call the virus authors when the patch fails.)
  • The Old New Thing

    If NTFS is a robust journaling file system, why do you have to be careful when using it with a USB thumb drive?

    • 68 Comments

    Some time ago, I noted that in order to format a USB drive as NTFS, you have to promise to go through the removal dialog.

    But wait, NTFS is a journaling file system. The whole point of a journaling file system is that it is robust to these sorts of catastrophic failures. So how can surprise removal of an NTFS-formatted USB drive result in corruption?

    Well, no, it doesn't result in corruption, at least not from NTFS's point of view. The file system data structures remain intact (or at least can be repaired from the change journal) regardless of when you yank the drive out of the computer. So from the file system's point of view, the answer is "Go ahead, yank the drive any time you want!"

    This is a case of looking at the world through filesystem-colored glasses.

    Sure, the file system data structures are intact, but what about the user's data? The file system's autopilot system was careful to land the plane, but yanking the drive killed the passengers.

    Consider this from the user's point of view: The user copies a large file to the USB thumb drive. Chug chug chug. Eventually, the file copy dialog reports 100% success. As soon as that happens, the user yanks the USB thumb drive out of the computer.

    The user goes home and plugs in the USB thumb drive, and finds that the file is corrupted.

    "Wait, you told me the file was copied!"

    Here's what happened:

    1. The file copy dialog creates the destination file and sets the size to the final size. (This allows NTFS to allocate contiguous clusters to the file.)
    2. The file copy dialog writes a bunch of data to the file, and then closes the handle.
    3. The file system writes the data into the disk cache and returns success.
    4. The file copy dialog says, "All done!"
    5. The user yanks the USB thumb drive out of the computer.
    6. At some point, the disk cache tries to flush the data to the USB thumb drive, but discovers that the drive is gone! Oops, all the dirty data sitting in the disk cache never made it to the drive.

    Now you insert the USB drive into another computer. Since NTFS is a journaling file system, it can auto-repair the internal data structures that are used to keep track of files, so the drive itself remains logically consistent. The file is correctly set to the final size, and its directory entry is properly linked in. But the data you wrote to the file? It never made it. The journal didn't have a copy of the data you wrote in step 2. It only got as far as the metadata updates from step 1.

    That's why the default for USB thumb drives is to optimize for Quick Removal. Because people expect to be able to yank USB thumb drives out of the computer as soon as the computer says that it's done.

    If you want to format a USB thumb drive as NTFS, you have to specify that you are Optimizing for Performance and that you promise to warn the file system before yanking the drive, so that it can flush out all the data sitting in the disk cache.
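
    For what it's worth, "warn the file system" is also something a program writing its own files can do explicitly: flush the handle (or open it write-through) before declaring victory. A minimal sketch using standard Win32 calls, with most error handling omitted:

    // Sketch: push the bytes all the way to the device before reporting success.
    #include <windows.h>

    bool WriteImportantFile(const wchar_t* path, const void* data, DWORD size)
    {
        HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS,
                               FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h == INVALID_HANDLE_VALUE) return false;

        DWORD written = 0;
        bool ok = WriteFile(h, data, size, &written, nullptr) && written == size;

        // Ask the system to flush any cached data for this handle to the device.
        // Only after this succeeds is it reasonable to say "all done."
        ok = ok && FlushFileBuffers(h);

        CloseHandle(h);
        return ok;
    }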

    Even though NTFS is robust and can recover from the surprise removal, that robustness does not extend to the internal consistency of the data you lost. From NTFS's point of view, that's just a passenger.

    Update: It seems that people missed the first sentence of this article. Write-behind caching is disabled by default on removable drives. You get into this mess only if you override the default. And on the dialog box that lets you override the default, there is a warning message that says that when you enable write-behind caching, you must use the Safely Remove Hardware icon instead of just yanking the drive. In other words, this problem occurs because you explicitly changed a setting from the safe setting to the dangerous one, and you ignored the warning that came with the dangerous setting, and now you're complaining that the setting is dangerous.

  • The Old New Thing

    Heads-up: Phone scammers pretending to be JPMorgan Chase MasterCard security

    • 21 Comments

    Recently, a round of phone scammers have been dialing through our area with a caller-ID of (000) 000-0000, which should already raise suspicions.

    When you answer, a synthesized voice says that they are calling from JPMorgan Chase MasterCard security. They claim that your credit card has been disabled due to suspicious activity, and in order to reactivate it, you need to enter your 16-digit credit card number.

    I decided to see how far I could take the robot voice for a ride, so I entered 16 random digits. No luck: the robot voice knew about the checksum and asked me to enter it again. I did a quick search for "fake credit card number", landed on this blog entry, and entered the first fake credit card number on that page.
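
    For the curious, the checksum the robot was validating is presumably the standard Luhn check that the major card brands use. Here is a minimal sketch of that check; nothing about it is specific to this particular scam.

    // Sketch of the Luhn check: double every second digit from the right,
    // fold two-digit results, and require the sum to be a multiple of 10.
    #include <cctype>
    #include <string>

    bool PassesLuhnCheck(const std::string& digits)
    {
        int sum = 0;
        bool doubleIt = false;
        for (auto it = digits.rbegin(); it != digits.rend(); ++it) {
            if (!isdigit(static_cast<unsigned char>(*it))) return false;
            int d = *it - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9;   // same as adding the two digits
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }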

    The robot voice accepted the credit card and proceeded to ask me for the card's expiration date (I made one up) and the PIN (I made one up). At that point, it reported that it had successfully verified the information and that my card had been re-enabled.

    Of course, the joke's on them: The fake credit card I used wasn't even a fake MasterCard number. It was a fake VISA credit card number!

    By the way, don't read the comments on that blog entry if you want to retain any faith in humanity.

  • The Old New Thing

    Understanding errors in classical linking: The delay-load catch-22

    • 25 Comments

    Wrapping up our week of understanding the classical model for linking, we'll put together all the little pieces we've learned this week to puzzle out a linker problem: The delay-load catch-22.

    You do some code cleanup, then rebuild your project, and you get

    LNK4199: /DELAYLOAD:SHLWAPI ignored; no imports found from SHLWAPI
    

    What does this warning mean?

    It means that you passed a DLL via the /DELAYLOAD command line switch, but your program doesn't actually use that DLL, so the linker is saying, "Um, you said to treat this DLL special, but I don't see any imports from that DLL."

    "Oh, right," you say to yourself. "I got rid of a call to Hash­String, and that was probably the last remaining function with a dependency on SHLWAPI.DLL. The linker is complaining that I asked to delay-load a DLL that I wasn't even loading!"

    You fix the problem by deleting SHLWAPI.DLL from the /DELAYLOAD list, and removing SHLWAPI.LIB from the list of import libraries. And then you rebuild, and now you get

    LNK2019: unresolved external '__imp__HashData' referenced in function 'HashString'
    

    "Wait a second, I stopped calling that function. What's going on!"

    What's going on is that the Hash­String function got taken along for the ride by another function. The order of operations in the linker is

    • Perform classical linking
    • Perform nonclassical post-processing
      • Remove unused functions (if requested)
      • Apply DELAYLOAD (if requested)

    The linker doesn't have a crystal ball that lets it say, "I see that in the future, the 'remove unused functions' step is going to delete this function, so I can throw it away right now during the classical linking phase."
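
    To make "taken along for the ride" concrete, imagine a hypothetical library source file that bundles Hash­String into the same translation unit as something you still call:

    // stringhelpers.cpp -- a hypothetical library source file.
    // Both functions compile into the same OBJ, so pulling in ReverseCopy
    // also pulls in HashString, and with it the dependency on SHLWAPI.DLL.
    // (Link with shlwapi.lib, or delay-load shlwapi.dll.)
    #include <windows.h>
    #include <shlwapi.h>   // declares HashData
    #include <string.h>

    DWORD HashString(const char* s)
    {
        DWORD hash = 0;
        HashData(reinterpret_cast<BYTE*>(const_cast<char*>(s)),
                 static_cast<DWORD>(strlen(s)),
                 reinterpret_cast<BYTE*>(&hash), sizeof(hash));
        return hash;
    }

    void ReverseCopy(char* dest, const char* src)   // the function you still call
    {
        size_t len = strlen(src);
        for (size_t i = 0; i < len; i++) dest[i] = src[len - 1 - i];
        dest[len] = '\0';
    }

    Because the classical linker pulls in whole OBJs, resolving ReverseCopy resolves Hash­String too, and Hash­String's needed symbol HashData keeps SHLWAPI on the import list.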

    You have a few solutions available to you.

    If you can modify the library, you can split the Hash­String function out so that it doesn't come along for the ride.

    If you cannot modify the library, then you'll have to use the /IGNORE flag to explicitly ignore the warning.

    Exercise: Another option is to leave SHLWAPI.LIB in the list of import libraries, but remove it from the DELAYLOAD list. Why is this a dangerous option? What can you do to make it less dangerous?

  • The Old New Thing

    Eliot Chang's list of things Asians hate

    • 54 Comments

    One time, somebody asked me, "What nationality are you?"

    I answered, "American."

    "No, I mean what nationality are your parents?"

    "They're also American."

    "No, I mean where are your parents from?"

    "They're from New Jersey."

    "No, I mean before that."

    "North Carolina."

  • The Old New Thing

    What's the guidance on when to use rundll32? Easy: Don't use it

    • 28 Comments

    Occasionally, a customer will ask, "What is Rundll32.exe and when should I use it instead of just writing a standalone exe?"

    The guidance is very simple: Don't use rundll32. Just write your standalone exe.

    Rundll32 is a leftover from Windows 95, and it has been deprecated since at least Windows Vista because it violates a lot of modern engineering guidelines. If you run something via Rundll32, then you lose the ability to tailor the execution environment to the thing you're running. Instead, the environment is set up for whatever Rundll32 requests.

    • Data Execution Prevention policy cannot be applied to a specific Rundll32 command line. Any policy you set applies to all Rundll32 commands.
    • Address Space Layout Randomization cannot be applied to a specific Rundll32 command line. Any policy you set applies to all Rundll32 commands.
    • Application compatibility shims cannot be applied to a specific Rundll32 command line. Any application compatibility shim you enable will be applied to all Rundll32 commands.
    • SAFER policy cannot be applied to a specific Rundll32 command line. Any policy you set applies to all Rundll32 commands.
    • The Description in Task Manager will be Rundll32's description, which does not help users identify what the specific Rundll32 instance is doing.
    • You cannot apply a manifest to a specific Rundll32 command line. You have to use the manifest that comes with Rundll32. (In particular, this means that your code must be high DPI aware.)
    • The Fault Tolerant Heap cannot be enabled for a specific Rundll32 command line. Any policy you set applies to all Rundll32 commands.
    • All Rundll32.exe applications are treated as the same program for the purpose of determining which applications are most frequently run.
    • Explorer tracks various attributes of an application based on the executable name, so all Rundll32.exe commands will be treated as the same application. (For example, all windows hosted by Rundll32 will group together.)
    • You won't get any Windows Error Reporting reports for crashes in your Rundll32.exe command line, because they all get sent to the registered owner of Rundll32.exe (the Windows team).
    • Many environmental settings are implied by the executable. If you use Rundll32, then those settings are not chosen by you since you didn't control how Rundll32 configures its environment.
      • Rundll32 is marked as TSAWARE, so your Rundll32 command must be Terminal Services compatible.
      • Rundll32 is marked as LARGE­ADDRESS­AWARE, so your Rundll32 command must be 3GB-compatible.
      • Rundll32 specifies its preferred stack reserve and commit, so you don't control your stack size.
      • Rundll32 is marked as compatible with the version of Windows it shipped with, so it has opted into all new behaviors (even the breaking ones), such as automatically getting the Heap­Enable­Termination­On­Corruption flag set on all its heaps.
    • Windows N+1 may add a new behavior that Rundll32 opts into, but which your Rundll32 command line does not support. (It can't, because the new behavior didn't exist at the time you wrote your Rundll32 command line.) As you can see, this has happened many times in the past (for example, high DPI, Terminal Services compatibility, 3GB compatibility), and it will certainly happen again in the future.

    You get the idea.

    Note also that Rundll32 assumes that the entry point you provide corresponds to a task which pumps messages, since it creates a window on your behalf and passes it as the first parameter. A common mistake is writing a Rundll32 entry point for a long-running task that does not pump messages. The result is an unresponsive window that clogs up broadcasts.
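
    For reference, a Rundll32 entry point has this general shape (the name is up to you); it is shown here just as a reminder that the first parameter really is a window handle:

    // The shape of a Rundll32 entry point (Unicode flavor). Rundll32 creates a
    // window on your behalf and passes it as hwnd, expecting a message-pumping task.
    // (The function must be exported from the DLL, e.g. via a .def file.)
    #include <windows.h>

    void CALLBACK MyEntryPointW(HWND hwnd, HINSTANCE hinst,
                                LPWSTR pszCmdLine, int nCmdShow)
    {
        // ... do the work, pumping messages if it takes any significant time ...
    }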

    Digging deeper, we learned that one customer asked for guidance on making this choice because they wanted to create a scheduled task that runs code inside a DLL, and they needed to decide whether to create a Rundll32 entry point in their DLL or to create a custom executable whose sole job is loading the DLL and calling the custom code.

    By phrasing it as an either/or question, they missed the third (correct) option: Create your scheduled task with an ICom­Handler­Action that specifies a CLSID your DLL implements.

  • The Old New Thing

    Understanding the classical model for linking, groundwork: The algorithm

    • 23 Comments

    The classical model for linking goes like this:

    Each OBJ file contains two lists of symbols.

    1. Provided symbols: These are symbols the OBJ contains definitions for.
    2. Needed symbols: These are symbols the OBJ would like the definitions for.

    (The official terms for these are exported and imported, but I will use provided and needed to avoid confusion with the concepts of exported and imported functions in DLLs, and because provided and needed more clearly captures what the two lists are for.)

    Naturally, there is other bookkeeping information in there. For example, for provided symbols, not only is the name given, but also additional information on locating the definition. Similarly, for needed symbols, in addition to the name, there is also information about what should be done once its definition has been located.

    Collectively, provided and needed symbols are known as symbols with external linkage, or just externals for short. (Of course, by giving them the name symbols with external linkage, you would expect there to be things known as symbols with internal linkage, and you'd be right.)

    For example, consider this file:

    // inventory.c
    
    extern int InStock(int id);
    
    int GetNextInStock()
    {
      static int Current = 0;
      while (!InStock(++Current)) { }
      return Current;
    }
    

    This very simple OBJ file has one provided symbol, Get­Next­In­Stock: That is the object defined in this file that can be used by other files. It also has one needed symbol, In­Stock: That is the object required by this file in order to work, but which the file itself did not provide a definition for. It's hoping that somebody else will define it. There's also a symbol with internal linkage: Current, but that's not important to the discussion, so I will ignore it from now on.
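
    For completeness, a second file that provides the needed symbol might look something like this (a made-up companion, just to complete the picture):

    // warehouse.c (hypothetical companion file)

    // Provides the InStock symbol that inventory.c needs.
    int InStock(int id)
    {
      // Pretend we consulted a database; say only even ids are in stock.
      return id % 2 == 0;
    }

    When the linker puts these two OBJ files together, the needed symbol In­Stock from the first is matched to the provided symbol from the second, which is exactly the process described next.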

    OBJ files can hang around on their own, or they can be bundled together into a LIB file.

    When you ask the linker to generate a module, you hand it a list of OBJ files and a list of LIB files. The linker's goal is to resolve all of the needed symbols by matching them up to a provided symbol. Eventually, everything needed will be provided, and you have yourself a module.

    To do this, the linker keeps track of which symbols in the module are resolved and which are unresolved.

    • A resolved symbol is one for which a provided symbol has been located and added to the module. Under the classical model, a symbol can be resolved only once. (Otherwise, the linker wouldn't know which one to use!)
    • An unresolved symbol is one that is needed by the module, but for which no provider has yet been identified.

    Whenever the linker adds an OBJ file to the module, it goes through the list of provided and needed symbols and updates the list of symbols in the module. The algorithm for updating this list of symbols is obvious if you've been paying attention, because it is a simple matter of preserving the invariants described above.

    For each provided symbol in an OBJ file added to a module:

    • If the symbol is already in the module marked as resolved, then raise an error complaining that an object has multiple definitions.
    • If the symbol is already in the module marked as unresolved, then change its marking to resolved.
    • Otherwise, the symbol is not already in the module. Add it and mark it as resolved.

    For each needed symbol in an OBJ file added to a module:

    • If the symbol is already in the module marked as resolved, then leave it marked as resolved.
    • If the symbol is already in the module marked as unresolved, then leave it marked as unresolved.
    • Otherwise, the symbol is not already in the module. Add it and mark it as unresolved.

    The algorithm the linker uses to resolve symbols goes like this:

    • Initial conditions: Add all the explicitly-provided OBJ files to the module.
    • While there is an unresolved symbol:
      • Look through all the LIBs for the first OBJ to provide the symbol.
      • If found: Add that OBJ to the module.
      • If not found: Raise an error complaining of an unresolved external. (If the linker has the information available, it may provide additional details.)
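
    Here is a toy model of that bookkeeping, following the rules above. It is not a real linker, just a sketch on symbol names, and all the type and function names are invented.

    // Toy model of the classical resolution algorithm described above.
    #include <algorithm>
    #include <set>
    #include <stdexcept>
    #include <string>
    #include <vector>

    struct Obj {
        std::vector<std::string> provided;  // symbols this OBJ defines
        std::vector<std::string> needed;    // symbols this OBJ wants
    };

    using Lib = std::vector<Obj>;

    class Module {
        std::set<std::string> resolved;
        std::set<std::string> unresolved;

        void AddObj(const Obj& obj) {
            for (const auto& sym : obj.provided) {
                if (resolved.count(sym))
                    throw std::runtime_error("multiple definitions: " + sym);
                unresolved.erase(sym);      // unresolved (or absent) -> resolved
                resolved.insert(sym);
            }
            for (const auto& sym : obj.needed) {
                if (!resolved.count(sym))
                    unresolved.insert(sym); // stays unresolved until provided
            }
        }

    public:
        void Link(const std::vector<Obj>& explicitObjs, const std::vector<Lib>& libs) {
            for (const auto& obj : explicitObjs) AddObj(obj);  // initial conditions
            while (!unresolved.empty()) {
                const std::string sym = *unresolved.begin();
                const Obj* found = nullptr;
                for (const auto& lib : libs) {   // first OBJ in any LIB that provides it
                    for (const auto& obj : lib) {
                        if (std::find(obj.provided.begin(), obj.provided.end(), sym)
                                != obj.provided.end()) {
                            found = &obj;
                            break;
                        }
                    }
                    if (found) break;
                }
                if (!found) throw std::runtime_error("unresolved external: " + sym);
                AddObj(*found);  // may introduce new unresolved symbols
            }
        }
    };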

    That's all there is to linking and unresolved externals. At least, that's all there is to the classical model.

    Next time, we'll start looking at the consequences of the rules for classical linking.

    Sidebar: Modern linkers introduce lots of non-classical behavior. For example, the rule

    • If the symbol is already in the module marked as resolved, then raise an error complaining that an object has multiple definitions.

    has been replaced with the rules

    • If the symbol is already in the module marked as resolved:
      • If both the original symbol and the new symbol are marked __declspec(selectany), then do not raise an error. Pick one arbitrarily and discard the other.
      • Otherwise, raise an error complaining that an object has multiple definitions.

    Another example of non-classical behavior is dead code removal. If you pass the /OPT:REF linker flag, then after all externals have been resolved, the linker goes through and starts discarding functions and data that are never referenced, taking advantage of another non-classical feature (packaged functions) to know where each function begins and ends.

    But I'm going to stick with the classical model, because you need to understand classical linking before you can study non-classical behavior. Sort of like how in physics, you need to learn your classical mechanics before you study relativity.

  • The Old New Thing

    Poisoning your own DNS for fun and profit

    • 28 Comments

    When you type a phrase into the Windows Vista Start menu's search box and click Search the Internet, then the Start menu hands the query off to your default Internet search provider.

    Or at least that's what the Illuminati would have you believe.

    A customer reported that when they typed a phrase into the Search box and clicked Search the Internet, they got a screenful of advertisements disguised to look like search results.

    What kind of evil Microsoft shenanigans is this?

    A careful look at the URL for the bogus search "results" showed that they were not coming from Windows Live Search. They were coming from a server controlled by the customer's ISP.

    That was the key to the rest of the investigation. Here's what's going on:

    The ISP configured all its customers to use the ISP's custom DNS servers by default. That custom DNS server, when asked for the location of search.live.com, returned not the actual IP address of Windows Live Search but rather the IP address of a machine hosted by the ISP. (This was confirmed by manually running nslookup on the customer's machine and seeing that the wrong IP addresses were being returned.) The ISP was stealing traffic from Windows Live Search. It studied the URL you requested, and if it was the URL used by the Start menu Search feature, it sent you to the page of fake search results. Otherwise, it redirected you to the real Windows Live Search, and you were none the wiser, aside from your Web search taking a fraction of a second longer than usual. (Okay, snarky commenters, and aside from the fact that it was Windows Live Search.)

    The fake results page does have an About This Page link, but that page only talks about how the ISP intercepts failed DNS queries (which has by now become common practice). It doesn't talk about redirecting successful DNS queries.

    I remember when people noticed widespread hijacking of search traffic, and my response to myself was, "Well, duh. I've known about this for years."

    Bonus chatter: It so happens that the offending ISP's Acceptable Use Policy explicitly lists as a forbidden activity "to spoof the URL, DNS, or IP addresses of «ISP» or any other entity." In other words, they were violating their own AUP.

  • The Old New Thing

    Why does my program run really slow or even crash (or stop crashing, or crash differently) if running under a debugger?

    • 27 Comments

    More than once, a customer has noticed that running the exact same program under the debugger rather than standalone causes it to change behavior. And not just in the "oh, the timing of various operations changed to hit different race conditions" but in much more fundamental ways like "my program runs really slow" or "my program crashes in a totally different location" or (even more frustrating) "my bug goes away".

    What's going on? I'm not even switching between the retail and debug versions of my program, so I'm not a victim of changing program semantics in the debug build.

    When a program is running under the debugger, some parts of the system behave differently. One example is that the Close­Handle function raises an exception (I believe it's STATUS_INVALID_HANDLE but don't quote me) if you ask it to close a handle that isn't open. But the one that catches most people is that when run under the debugger, an alternate heap is used. This alternate heap has a different memory layout, and it does extra work when allocating and freeing memory to help try to catch common heap errors, like filling newly-allocated memory with a known sentinel value.

    But this change in behavior can make your debugging harder or impossible.

    So much for people's suggestions to switch to a stricter implementation of the Windows API when a debugger is attached.

    On Windows XP and higher, you can disable the debug heap even when debugging. If you are using a dbgeng-based debugger like ntsd or WinDbg, you can pass the -hd command line switch. If you are using Visual Studio, you can set the _NO_DEBUG_HEAP environment variable to 1.

    If you are debugging on a version of Windows prior to Windows XP, you can start the process without a debugger, then connect a debugger to the live process. The decision to use the debug heap is made at process startup, so connecting the debugger afterwards ensures that the retail heap is chosen.
