April, 2006

  • The Old New Thing

    A new scripting language doesn't solve everything

    • 96 Comments

    Yes, there are plenty of scripting languages that are much better than boring old batch. Batch files were definitely a huge improvement over SUBMIT back in 1981, but they've been showing their age for quite some time. The advanced age of boring old batch, on the other hand, means that you have millions of batch files out there that you had better not break if you know what's good for you. (Sure, in retrospect, you might decide to call the batch language a design mistake, but remember that it had to run in 64KB of memory on a 4.77MHz machine while still remaining compatible in spirit with CP/M.)

    Shipping a new command shell doesn't solve everything either. For one thing, you have to decide if you are going to support classic batch files or not. Maybe you decide that you won't and prefer to force people to rewrite all their batch files into your new language. Good luck on that.

    On the other hand, if you decide that you will support batch files after all, then presumably your new command shell will not execute old batch files natively, but rather will defer to CMD.EXE. And there's your problem: You see, batch files have the ability to modify environment variables and have the changes persist beyond the end of the batch file. Try it:

    C> copy con marco.cmd
    @set MARCO=polo
    ^Z
            1 file(s) copied.
    
    C> echo %MARCO%
    %MARCO%
    
    C> marco
    
    C> echo %MARCO%
    polo
    

    If your new command shell defers to CMD.EXE, these environment changes won't propagate back to your command shell since the batch file modifies the environment variables of CMD.EXE, not your shell. Many organizations have a system of batch files that rely on the ability to pass parameters between scripts by stashing them into environment variables. The DDK's own razzle does this, for example, in order to establish a consistent build environment and pass information to build.exe about what kind of build you're making. And I bet you have a batch file or two that sets your PROMPT or PATH environment variable or changes your current directory.

    So good luck with your replacement command shell. I hope you figure out how to run batch files.

  • The Old New Thing

    Adding a new flag to enable behavior that previously was on by default

    • 73 Comments

    One of the suggestions for addressing the network compatibility problem was to give up on fast mode and have a new "fast mode 2". (Equivalently, add a flag to the server capabilities that means "I support fast mode, and I'm not buggy.") This is another example of changing the rules after the game is over, by adding a flag to work around driver bugs.

    Consider a hypothetical program that uses fast mode on Windows XP. It runs against a Windows Server 2003 server and everybody is happy. Suppose you make a change to Windows Vista so that it requires that servers set a new "fast mode 2" flag in order to support fast mode. When the customer upgrades their client from Windows XP to Windows Vista, they would find that their hypothetical program ran much slower. Whose fault is it? Not the hypothetical program that was using fast mode on Windows XP; that program is using fast mode correctly. Not the Windows Server 2003 machine; that server supports fast mode correctly. Is it Windows Vista, then, that is at fault?

    "Hey, don't blame me," you answer. you answer. "It's that guy over there. That guy you've never heard of. He made me do it. Blame him!"

    To describe this sort of behavior I like to steal a phrase from Albert Einstein: "Spooky action at a distance". (Einstein used it to describe what in modern physics is known as quantum entanglement.) In this particular situation, we have a conversation between two participants (the client software and the server software) mediated by a third (Windows) which collapses due to the mere existence of a fourth party not involved in the conversation! It's as if your CD player suddenly lost the ability to play any of your music CDs because some company you've never heard of halfway around the world pressed a bunch of bad CDs for a few months earlier this year.

    Some people suggested, "Why not have a flag that says 'I support fast mode'?" Indeed that flag already exists; that's why Windows Vista was trying to use fast mode in the first place. The problem wasn't that the server didn't support fast mode. The problem was that the server had a bug in its fast mode implementation.

    "Okay, then add a new flag that says 'My fast mode isn't buggy.'" Consider also how this course of action would look after a few revisions of the specification:

    In response to the QUERY_CAPABILITIES request, the server shall return a 32-bit value consisting of zero of more of the following bits:

    0x00000001  This server supports fast mode
    0x00000002  This server supports fast mode and doesn't have the bug where enumerating a directory with more than 128 files fails on the 129th query
    0x00000004  This server supports fast mode and doesn't have the bug where the long file name is reported incorrectly in the response packet
    0x00000008  This server supports fast mode and doesn't have the bug where directories whose names consist entirely of digits are misreported as files
    0x00000010  This server supports fast mode and doesn't have the bug where the enumeration resets if a file is created in the directory while the enumeration is in progress
    0x00000020  This server supports fast mode and doesn't have the bug where FindNext returns failure even though there are still files to be enumerated
    ...

    If a new capabilities flag were created for every single server bug that was discovered, the capabilities mask would quickly fill up with all these random bits for bugs that were fixed ages ago. And each time a bug was found in any one server, all servers would have to be updated to add the new capabilities bit that says, "I'm not that buggy server you found on April 8th 2006," even the servers sitting in a locked closet whose operating systems are burned into EPROMs. And if you're the author of a new server, which capabilities bits do you set? Do you claim that you don't have the bug where FindNext returns failure even though the enumeration hasn't completed? What if, six months after you ship, somebody finds a bug in your server of exactly that sort? I guess this mean that the next revision of the protocol will have to have a new flag:

    0x00000020  This server supports fast mode and doesn't have the bug where it claims that it doesn't have the "FindNext returns failure even though there are still files to be enumerated" bug, even though it actually does have the bug, but in a more subtle manner

    Or maybe you're convinced that you don't have any bugs in your "fast mode" implementation. Do you report 0xFFFFFFFF to say "I have no bugs at all, not even the ones people might discover later in other implementations"? What happens when the 33rd "fast mode" bug is found? Do we have to have a QUERY_CAPABILITIES2 function? If a capabilities bit is created for every single bug that ever existed in a networking protocol implementation, you'd have a few thousand capability bits all of whom mean "I don't have that bug where..."

    Now, I'm not saying that this course of action is out of the question. Sometimes you have to do it, but you also have to realize that the cost for making this type of change is very high, and the benefit had better be worth it.

  • The Old New Thing

    Locale-sensitive number grouping

    • 67 Comments

    Most westerners are familiar with the fact that the way numbers are formatted differ between the United States and much of Europe.

    Culture Format
    United States 1,234,567.89
    France 1 234 567,89
    Germany 1.234.567,89
    Switzerland 1'234'567.89

    What people don't realize is that the grouping is not always in threes. In India, the least significant group consists of three digits, but subsequent groups are in pairs.

    India 12,34,567.89

    I've also seen reports that the first group consists of five digits, followed by pairs:

    India 12,34567.89

    Meanwhile, Chinese and Japanese traditionally group in fours.

    China, Japan 123 4567.89

    What does this mean for you? Don't assume that numbers group in threes, and of course you can't assume that the grouping separator is the comma and the decimal character is the period. Just use the GetNumberFormat function and let NLS do the work for you.

    Next time, a little more about that NUMBERFMT structure.

  • The Old New Thing

    Computing over a high-latency network means you have to bulk up

    • 65 Comments

    One of the big complaints about Explorer we've received from corporations is how often it accesses the network. If the computer you're accessing is in the next room, then accessing it a large number of times isn't too much of a problem since you get the response back rather quickly. But if the computer you're talking to is halfway around the world, then even if you can communicate at the theoretical maximum possible speed (namely, the speed of light), it'll take 66 milliseconds for your request to reach the other computer and another 66 milliseconds for the reply to come back. In practice, the signal takes longer than that to make its round trip. A latency of a half second is not unusual for global networks. A latency of one to two seconds is typical for satellite networks.

    Note that latency and bandwidth are independent metrics. Bandwidth is how fast you can shovel data, measured in data per unit time (e.g. bits per second); latency is how long it takes the data to reach its destination, measured in time (e.g. milliseconds). Even though these global networks have very high bandwidth, the high latency is what kills you.

    (If you're a physicist, you're going to see the units "data per unit time" and "time" and instinctively want to multiply them together to see what the resulting "data" unit means. Bandwidth times latency is known as the "pipe". When doing data transfer, you want your transfer window to be the size of your pipe.)

    High latency means that you should try to issue as few I/O requests as possible, although it's okay for each of those requests to be rather large if your bandwidth is also high. Significant work went into reducing the number of I/O requests issued by Explorer during common operations such as enumerating the contents of a folder.

    Enumerating the contents of a folder in Explorer is more than just getting the file names. The file system shell folder needs other file metadata such as the last-modification time and the file size in order to build up its SHITEMID, which is the unit of item identification in the shell namespace. One of the other pieces of information that the shell needs is the file's index, a 64-bit value that is different for each file on a volume. Now, this information is not returned by the "slow" FindNextFile function. As a result, the shell would have to perform three round-trip operations to retrieve this extra information:

    • CreateFile(),
    • GetFileInformationByHandle() (which returns the file index in the BY_HANDLE_FILE_INFORMATION structure), and finally
    • CloseHandle().

    If you assume a 500ms network latency, then these three additional operations add a second and a half for each file in the directory. If a directory has even just forty files, that's a whole minute spent just obtaining the file indices. (As we saw last time, the FindNextFile does its own internal batching to avoid this problem when doing traditional file enumeration.)

    And that's where this "fast mode" came from. The "fast mode" query is another type of bulk query to the server which returns all the normal FindNextFile information as well as the file indices. As a result, the file index information is piggybacked on top of the existing FindNextFile-like query. That's what makes it fast. In "fast mode", enumerating 200 files from a directory would take just a few seconds (two "bulk queries" that return the FindNextFile information and the file indices at one go, plus some overhead for establishing and closing the connection). In "slow mode", getting the normal FindNextFile information takes a few seconds, but getting the file indices would add another 1.5 seconds for each file, for an additional 1.5 × 200 = 300 seconds, or five minutes.

    I think most people would agree that reducing the time it takes to obtain the SHITEMIDs for all the files in a directory from five minutes to a few seconds is a big improvement. That's why the shell is so anxious to use this new "fast mode" query.

    If your program is going to be run by multinational corporations, you have to take high-latency networks into account. And this means bulking up.

    Sidebar: Some people have accused me of intentionally being misleading with the characterization of this bug. Any misleading on my part was unintentional. I didn't have all the facts when I wrote up that first article, and even now I still don't have all the facts. For example, FindNextFile using bulk queries? I didn't learn that until Tuesday night when I was investigating an earlier comment—time I should have been spending planning Wednesday night's dinner, mind you. (Yes, I'm a slacker and don't plan my meals out a week at a time like organized people do.)

    Note that the exercise is still valuable as a thought experiment. Suppose that FindNextFile didn't use bulk queries and that the problem really did manifest itself only after the 101st round-trip query. How would you fix it?

    I should also point out that the bug in question is not my bug. I just saw it in the bug database and thought it would be an interesting springboard for discussion. By now, I'm kind of sick of it and will probably not bother checking back to see how things have settled out.

  • The Old New Thing

    Doing the best we can until time travel has been perfected

    • 64 Comments

    Mistakes were made.

    Mistakes such as having Windows NT put Notepad in a different location from Windows 3.1. (Though I'm sure they had their reasons.) Mistakes such as having a TCS_VERTICAL when there is already a CCS_VERT style. Mistakes such as having listview state images be one-biased, whereas treeview state images are zero-biased.

    But what's done is done. The mistakes are out there. You can't go back and fix them—at least not until time travel has been perfected—or you'll break code that was relying on the mistakes. (And believe me, there's a lot of code that relies on mistakes.) You'll just have to do the best you can with the situation as it is.

    Often, when I discuss a compatibility problem, people will respond with "That's your own damn fault. If you had done XYZ, then you wouldn't have gotten into this mess." Maybe that's true, maybe it isn't, but that doesn't make any progress towards solving the problem and therefore isn't very constructive. I sure hope these people never become lifeguards.

    "Help me, I'm drowning!"

    "Are you wearing a life preserver?"

    "No."

    "Well, if you had worn a life preserver, then you wouldn't be drowning. It's your own damn fault."

    When faced with a problem, you first need to understand the problem, then you set about exploring solutions to the problem. Looking for someone to blame doesn't solve the problem. I'm not saying that one should never assign blame, just that doing so doesn't actually solve anybody's problem. (If you want to blame somebody, do it at the bug post-mortem. Then you can study the conditions that led to the mistake, assign blame, if you're looking for a scapegoat, and take steps to prevent a future mistake of the same sort from occurring. As a lifeguard, you first rescue the drowning person, and then you lecture them for not wearing a life preserver.)

  • The Old New Thing

    Ich habe meinen Computer zu Deutsch gewechselt

    • 49 Comments

    This weekend, I changed my computer's user interface language from Swedish (where it had been since November 2003) to German. Germany is the country I'm most likely to vacation to next, and I figured I should start pseudo-immersing myself. Of course, all it really means is that I'm going to be learning a lot of computer-related German words like Einstellungen and Speicher.

    The change will also take additional adjustment because I learned German under the old spelling rules, before the controversial spelling reform of 1996 was promulgated. Perhaps the most prominent change is the new rules for the ß character, but for me personally, that change is barely noticeable because I learned German from a textbook that uses Swiss spelling! (The Swiss do not use the ß character; they use double-s instead.) Learning from a Swiss textbook also means that I learned phrases like "Tschüss" and "Bilder knipsen", my use of which amuses Germans to no end. One of the lesser rules that affects me more is the regularization of rules surrounding noun capitalization, as in "zu Deutsch" above.

    This completes the switch to German that began in January when I changed my Microsoft Office language to German. For the past few months I had been running a mix of Swedish and German. That sounds confusing, but it wasn't that bad, really. I barely even realized that half of my dialog boxes were in one language and half were in another. (Well, okay, and the third half was in English. The programs that are neither part of Windows nor part of Office remain in English.) The real hard part is learning all the new keyboard shortcuts.

    (In marginally related news, the Swedish Academy recently released its latest official Swedish word list, and it changed its longstanding policy and now lists the words beginning with "W" separately from words beginning with "V". Up until now, "W" and "V" had been considered merely typographical variants of one another and had been treated as identical for alphabetization purposes.)

  • The Old New Thing

    Merchandise your food with pride

    • 43 Comments

    There is a new placard in our cafeteria which reads "Merchandise your food with pride". That's the first time I've seen the word "merchandise" used as a verb.

    Here, I'll translate that last paragraph into management speak for you:

    The cafeteria newly signed a placard whose read is "Merchandise your food with pride". That's my first see of a verbed "merchandise".

    Earlier this year, I was chatting with a Boeing employee who mentioned that he "had to status an action item". I have yet to see the word "status" used as a verb at Microsoft, but it's only a matter of time.

  • The Old New Thing

    Sometimes you just have to make a snap decision

    • 41 Comments

    Saturday afternoon, my phone rings.

    "Hello?"

    "Quick! We're on our way to the nursery. Do you want to come?"

    I recognize the voice as one of my friends who recently bought a house and presumably is doing some spring landscaping. But I have to answer fast. Time for a snap decision.

    "No."

    My friend seems surprised that I give my answer so quickly.

    "Oh! Well then! Bye."

    If you tell me I have to answer fast, you shouldn't act all offended if I give a quick answer.

  • The Old New Thing

    Why is the Microsoft Protection Service called "msmpsvc"?

    • 40 Comments

    (This is the first in a series of short posts on where Microsoft products got their names.)

    The original name for the malware protection service was "mpsvc" the "Microsoft Protection Service", but it was discovered later that that filename was already used by malware! As a result, the name of the service had to be changed by sticking an "ms" in front, making it "msmpsvc.exe".

    Therefore, technically, its name is the "Microsoft Microsoft Protection Service". (This is, of course, not to be confused with "mpssvc.exe", which is, I guess, the "Microsoft Protection Service Service".)

    Fortunately, the Marketing folks can attempt to recover by deciding that "msmpsvc" stands for "Microsoft Malware Protection Service". But you and I will know what it really stands for.

  • The Old New Thing

    No good deed goes unpunished: Bug assignment

    • 36 Comments

    Sometimes you're better off keeping your fool mouth shut.

    The other day I got a piece of email requesting that I look at a crashed system because the tester believed it was another instance of bug 12345. While that may very well have been the case, bug 12345 was a kernel pool corruption bug in the object manager, something of which I know next to nothing.

    Why was I being asked to look at the machine? Because the bug was originally assigned to my team. As a gesture of good will, I reassigned the bug to a more appropriate team, and that was my downfall, because that put my fingerprint on the bug report. The second occurence of the bug was inside another component entirely, still completely unrelated to my team. But the tester thinks that I am somehow responsible for fixing the bug since my name is now in the bug report.

    Perhaps I need an alter ego whose name I can use for cases like this, where my involvement in a bug is not as an interested party, but rather as somebody merely offering to help redirect the bug to the correct developer. But that merely defers the problem: What should I do if somebody asks the alter ego to investigate the bug?

Page 1 of 4 (34 items) 1234