March, 2010

  • The Old New Thing

    2010 Q1 link clearance: Microsoft blogger edition

    • 8 Comments

    It's that time again: Sending some link love to my colleagues.

  • The Old New Thing

    The great thing about URL encodings is that there are so many to choose from

    • 14 Comments

    The phrase URL encoding appears to mean different things to different people.

    First, Tim Berners-Lee says that URLs are encoded by using %xx to encode "dangerous" characters, or to suppress the special meaning that would normally be assigned to characters such as / or ?. For example, the URL http://server/why%3F/?q=bother is a request to the server server with the path /why?/ and with the query string q=bother. Notice that by escaping the question mark, we prevent it from being interpreted as the start of the query portion of the URL.

    Now, it so happens that when a form is submitted via GET, then the contents of the form are encoded (by default) into the query according to a set of rules laid out in the HTML 4.01 specification: The query string takes the basic form of var=value&var=value&.... If a variable name or a value contains a "dangerous" character or a special character like = or &, then it must be %-escaped. For example, co=AT%26T says that the variable co has the value AT&T. Encoding the ampersand prevents it from being interpreted as a separator.

    And here is the special additional rule that confuses a lot of people: When submitting a form via GET, the form data is encoded into the query portion of a URL, and under the default encoding, the character U+0020 (space) is encoded as U+002B (plus sign). This special use of the plus sign applies only to the query portion of the URL. Sometimes people get confused and think that it applies to URLs in general.

    Example:

    http://example.com/embedded%20space.html?key=apple+pie#result%20panel
    

    The base URL and fragment use the %20 sequence to encode the embedded space, whereas the query uses the plus sign.
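    Python's urllib.parse happens to expose the two conventions as separate functions, which makes for a compact illustration (a sketch; quote and quote_plus are only rough analogues of the rules described above):

```python
from urllib.parse import quote, quote_plus

# Path/fragment-style percent-encoding: a space becomes %20.
print(quote("embedded space.html"))  # embedded%20space.html

# Form/query-style encoding: a space becomes a plus sign,
# and characters like & are percent-escaped.
print(quote_plus("apple pie"))       # apple+pie
print(quote_plus("AT&T"))            # AT%26T
```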

    You'd think that would be the end of the story, but in fact it's just the beginning, because now we get to throw in all sorts of nonstandard URL encoders.

    The PHP function urlencode treats the entire string as if it were a value (or variable name) in a query string, encoding spaces as a plus sign and being careful to escape all other punctuation. Not to be confused with rawurlencode, which encodes everything (even characters like /).

    JScript comes with a whole bucketload of functions for URL encoding. There's escape(), which encodes almost everything but leaves the slash and—bafflingly—the plus sign unencoded. And then there's the encodeURI() function which leaves a few more characters unencoded (including the colon (U+003A), and question mark (U+003F)). But wait, there's also encodeURIComponent() which goes to the effort of encoding slashes too. It's a total mess, but this site tries to make some sense out of the whole thing.

    The ASP.Net function Server.UrlEncode behaves the same way as the PHP urlencode function.
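    The main axis on which all of these functions differ is which characters they leave unescaped. Python's quote exposes that axis directly through its safe parameter, which makes the distinction easy to demonstrate (loose analogues only, not character-for-character matches for any of the functions above):

```python
from urllib.parse import quote

s = "a/b c?d"

# Slash and question mark left alone, in the spirit of encodeURI:
print(quote(s, safe="/?"))  # a/b%20c?d

# Every reserved character escaped, in the spirit of encodeURIComponent:
print(quote(s, safe=""))    # a%2Fb%20c%3Fd
```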

    There are probably a dozen other functions which purport to perform some form of URL encoding. You have to read the documentation on each one carefully to see whether it does the type of encoding you want.

    But wait, you're not done yet. There are URL encodings which are built on top of the basic URL encoding.

    The punycode encoding is used to encode Unicode characters in domain names, which have an even more limited character set than URLs.
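    Python ships both the raw algorithm (the punycode codec) and the domain-label form with the xn-- prefix (the idna codec), which makes the layering easy to see:

```python
# The raw Punycode transformation of a single label:
print("bücher".encode("punycode"))  # b'bcher-kva'

# The form actually used in domain names adds the xn-- prefix:
print("bücher".encode("idna"))      # b'xn--bcher-kva'
```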

    When auto-generating a URL from a string, different Web sites use different algorithms. This isn't really an encoding in the URL encoding sense; it's just a convention for generating names for Web pages. The results of these conversion algorithms still need to be URL-encoded.

    For example, Wikipedia's URL auto-generation algorithm changes spaces to underscores. It leaves most punctuation marks unchanged, which means that once you've gone through Wikipedia's auto-generation algorithm, you still have to go back and escape all the characters which require escaping according to RFC3986.

    As another example, it is popular with many blog software packages to change spaces to hyphens when auto-generating a URL from the title of a blog post. The handling of special characters varies. Some packages simply omit them; others try to encode them, resulting in a double-encoded string if the encoding uses characters for which RFC3986 requires encodings!
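    The two conventions can be sketched in a few lines (hypothetical helper names; real sites have many more special cases):

```python
import re
from urllib.parse import quote

def wiki_style(title):
    # Wikipedia-style: spaces become underscores; the result still
    # has to be percent-escaped per RFC 3986.
    return quote(title.replace(" ", "_"))

def blog_style(title):
    # A common blog-slug style: lowercase, spaces to hyphens,
    # remaining punctuation simply dropped.
    return re.sub(r"[^a-z0-9-]", "", title.lower().replace(" ", "-"))

print(wiki_style("C++ (programming language)"))  # C%2B%2B_%28programming_language%29
print(blog_style("What's up, Doc?"))             # whats-up-doc
```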

    So if somebody asks a question about URL encoding, before you answer, make sure you understand what sense of the phrase "URL encoding" is being used.

  • The Old New Thing

    Non-Microspeak: Boiling the ocean

    • 34 Comments

    Some time ago, MSN Careers listed Boil the ocean as a workplace phrase you should learn. Thankfully, the phrase (meaning "to attempt something impossibly ambitious") is not currently in wide use in Microspeak. However, a friend of mine who works in another industry tells me that it is not only very much alive in his line of work, it became corrupted as it was imported.

    My friend's industry involves companies from around the world, and although the working language for meetings is English, most of the participants are not native speakers of the language. He suspects that the phrase boil the ocean was introduced into the collective consciousness without a formal definition, and the company representatives at the meeting missed out on the impossible part of the definition, interpreting it instead as merely meaning to attempt something ambitious.

    As a result, at their meetings, you will hear people say things like "Let's try not to boil more than one ocean."

  • The Old New Thing

    Ruth Bader Ginsburg, the yardstick for Wikipedia entries

    • 55 Comments

    I use Ruth Bader Ginsburg's Wikipedia entry as a yardstick for other Wikipedia entries. At the time I'm writing this blog entry, her Wikipedia article is 1600 words long. So 1600 words is how many words Wikipedia assigns to the 20th most powerful woman (and the most powerful female lawyer) in the world.

    By comparison, Wikipedia has collectively decided that the 2007 Philadelphia Eagles season merits 5500 words. The exploits of a lackluster last-place American football team therefore clock in at 3.4 Ginsburgs. In a sense, Wikipedia says that a last-place football team's exploits are over three times as significant as the most powerful female lawyer in the world.

    The fictional character John Locke from the television program Lost comes in at 4900 words, or just above 3 Ginsburgs. Apparently Wikipedia thinks that it's notable that Locke likes Gold Toe socks.

    The plot summary for the movie Little Man is approximately 1.5 Ginsburgs. That's just the plot summary, not the entire entry. We know more about the plot to a mediocre film than we know about a Supreme Court Justice.

    It's not just popular culture that has taken the lead on Wikipedia. It's specifically post-2000 popular culture. The television program The Brady Bunch, which has been influencing United States popular culture for decades, merits a simple list of episodes, whereas the short-lived and totally forgotten show Studio 60 on the Sunset Strip spends 1300 words just on the subject of references from the show to other Sorkin shows.

    Even the Internet video The Bus Uncle weighs in at 1.8 Ginsburgs.

    Slate's Timothy Noah, who is certainly more notable than I am, also takes issue with Wikipedia's notability guidelines after being evicted from Wikipedia and subsequently reinstated. Reuters noted that the Wikipedia page is the new status symbol.

    For some reason I have a Wikipedia entry, despite not being "the subject of multiple, non-trivial published works from sources that are reliable and independent of the subject and of each other." As far as I'm aware, nobody has written even one book about me, much less multiple.

  • The Old New Thing

    What happens to the control names in the IDE when my program is running?

    • 20 Comments

    nick_journals demonstrates some confusion about names in source code and their relationship to runtime behavior.

    A topic I am particularly interested in is the naming of controls, how it works...

    Every control gets a name from a developer...via the IDE (e.g btnOK)

    When using this function: GetWindowLong(handle,GWL_ID) it doesn't return the name itself but mostly a number or nothing.

    What is GWL_ID, the documentation isn't very clear on this.

    How does this whole system work, what are these numbers and where are the 'real' names?

    I'm going to answer the questions in order, most technical first. That way you can stop reading when you understand the topic.

    The GWL_ID window long (or, more precisely, the GWLP_ID window pointer-sized long) returns the value you passed to the CreateWindowEx function as the child window identifier (overloaded as the hMenu parameter). The call to CreateWindowEx might have happened explicitly in your code, or it may have been the result of a call to CreateWindowEx made on your behalf by another component, such as the dialog manager, which takes the control identifier from the dialog template. (Note that only child windows have child window identifiers. Top-level windows don't have child window identifiers.) And if you passed zero as the child window identifier, then when you ask for the GWLP_ID, you'll get zero back.

    The name that appears in your source code is just a name you decided to use to talk about the control. It's just a convenience for yourself, so instead of saying "Control number 103" all over the place, you can say "Control number IDC_ADD". And sometimes even that is too much typing, so you shorten it to "control btnAdd." The window manager doesn't know what cute shortcut names you've created for your child windows; the window manager still calls it control number 103. By analogy, the phone company doesn't know which numbers you've programmed into your speed dial. That's just something you set up in your phone to make dialing more convenient.

    In other words, the "real name" is really just a fake name you created to make things easier to talk about. The actual "real name" is the child window identifier. When you look in the telephone book for Bob, you'll see Bob's phone number, not his speed-dial number on your phone.
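    The point can be made with a plain (non-Win32) Python analogy: the symbolic name exists only in your source, and the window manager records only the number.

```python
IDC_ADD = 103   # the friendly name lives only in your source code

# A toy stand-in for the window manager: at creation time it records
# only the numeric child-window identifier, never the name you used.
window_ids = {}

def create_child_window(hwnd, child_id):
    window_ids[hwnd] = child_id
    return hwnd

def get_window_id(hwnd):
    # what asking for GWLP_ID hands back: the number (0 if none was set)
    return window_ids.get(hwnd, 0)

hwnd = create_child_window("hwndAdd", IDC_ADD)
print(get_window_id(hwnd))   # 103 - the number, with no trace of "IDC_ADD"
```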

    This is the same thing that happens to your variables after the code has been compiled. At runtime, your variable names don't exist any more. They were just convenient mnemonic names you gave to computational values. The compiler uses those convenient names to determine what you're talking about when you assign a variable or fetch a variable's value, but once that's done, it has no need for the name any more.

    In other words, the name in your source code is just something you did to make things easier to write. The compiler's job is to change your line x = y; into mov eax, [ebp-05ch]; mov [00437180h], eax, at which point the names x and y are no longer needed and are discarded. (Actually, it's saved off in a separate file for the debugger to use, so that when you ask the debugger to show the value of the variable x, it knows to look in 00437180h, but that happens outside of the execution environment.)

    It's like asking, "What happens to the phrase 1 cup sugar when I eat my cookies?" The phrase 1 cup sugar was part of the instructions for making the cookies. Once you finish following the recipe, the instructions aren't needed any more. Or at least not for the process of enjoying cookies. (You might still want to hang onto them to debug your recipe!)

  • The Old New Thing

    WaitForInputIdle waits for any thread, which might not be the thread you care about

    • 6 Comments

    We saw last time that the WaitForInputIdle function waits only once for a process to go input idle. Even if the process later stops processing messages, WaitForInputIdle will return immediately and say, "Yeah, he's idle."

    A process is determined to be input idle when it is waiting for user input and there is none. This translates into the process sitting in a function like GetMessage or WaitMessage when there are no messages. But what if a process has more than one thread? And what if one of the threads is waiting for input, while the other is busy and unresponsive? The WaitForInputIdle function will treat the process as having gone input idle, even if the ready thread is just displaying your splash screen and the busy thread is the one preparing the main window. The WaitForInputIdle function doesn't know that the main window is more important than the splash screen; as far as the window manager is concerned, you've got two threads, each with a window.

    What this means for your application is that if you create multiple threads, then the moment any of them goes input idle, the entire process is treated as input idle, and you need to be ready for people who were waiting for you via WaitForInputIdle to start trying to talk to your application.

    As I noted last time, the WaitForInputIdle function is really just a stopgap to help with the transition from 16-bit Windows to 32-bit Windows. Whereas a 16-bit program could just charge ahead knowing that the program it launched was ready (because if it weren't ready, it would still have control of the CPU), 32-bit programs need to wait for this faked-up version of Yield. And since it was created merely to aid in porting 16-bit programs, the WaitForInputIdle function didn't really worry about multiple threads. After all, 16-bit Windows didn't support multiple threads per process, so all 16-bit programs were necessarily single-threaded. If you're porting one of these programs, your initial 32-bit version is also going to be single-threaded. (At least I hope you're not going to try to add multiple threads right off the bat. The first step in porting is just to get the program to run without adding any new features!)

    In fact, back in the old days, the WaitForInputIdle function tried a bit too hard to emulate the Yield behavior from 16-bit Windows. When the target application received a message, it was taken out of the input idle state, and went back into the state when it once again reached a state where it was waiting for input. In other words, the one-line summary of the WaitForInputIdle function was actually correct at the time it was written.

    The old mechanism for WaitForInputIdle, taking the process in and out of the input idle state, mirrored the 16-bit behavior of Windows, but only if you had a single-threaded application. If you had multiple threads, then the input-idle state starts getting all wonky as each thread updates the global idle state:

    Step   Thread 1   Thread 2
      1    Busy       Busy
      2    Idle                  (application marked as idle)
      3               Idle       (application marked as idle)
      4    Busy                  (application marked as busy)
      5               Busy       (application marked as busy)

    There is only one state that gets updated by each thread as they go idle or busy. All you really know is that if all threads are busy, then the input idle state will be busy and that if all threads are idle, then the input idle state will be idle. But if one thread is idle and the other is busy, then the process idle state is unpredictable; it depends on what the most recent transition was. For example, notice that at step 2, we have one idle thread (thread 1) and one busy thread (thread 2), and the process is marked idle. On the other hand, at step 4, we also have one idle thread (thread 2) and one busy thread (thread 1), but this time, the process is marked busy.

    Of course, since this behavior was intended to mimic the 16-bit programming model, the fact that it got all confused when applications created multiple threads was "out of scope". Creating multiple threads meant that you have gone past the simple "Just trying to get it to work" stage and have moved on to adding Win32-specific enhancements. You were creating a situation that the WaitForInputIdle function was not designed to handle.

    My guess is that the unpredictable nature of the original design led the window manager folks to take a deep breath and go back to the spirit of the WaitForInputIdle function: To tell when a program has finished its initialization.

    So now it decides that when you've finished initialization, you've finished initialization. It's a one-way door.

  • The Old New Thing

    WaitForInputIdle should really be called WaitForProcessStartupComplete

    • 24 Comments

    The WaitForInputIdle function waits for a process to finish its initialization, which is determined when it reaches a state where it is just sitting around waiting for messages.

    The documentation for WaitForInputIdle doesn't even get around to the initialization part until the Remarks section. If all you read is the one-sentence summary, Waits until the specified process is waiting for user input with no input pending, or until the time-out interval has elapsed, it would not be unreasonable for you to conclude that a process goes into and out of the input idle state each time it processes a message. But no, it's a one-time transition.

    If you call WaitForInputIdle on a process which had previously gone input idle, but is now busy and not processing pending input messages, the function will still return immediately, because WaitForInputIdle only checks whether the process has gone input idle at all and not whether it is input idle right now.

    As the Remarks section notes, the purpose of the WaitForInputIdle function is for a process to determine whether another process (which is recently launched) has reached a state where it is okay to send that process messages. This is important to know when the form of communication between two processes is a message-based mechanism, and the two processes otherwise have no real way of knowing what the other is doing. (If the two processes had been written by the same author, then you could come up with some more expressive interface for the two to communicate through, one which avoids the need for one process to guess when the other one is ready.) The specific scenario that WaitForInputIdle was created to address is DDE.

    Back in the old 16-bit days, you didn't need a WaitForInputIdle function, because scheduling was co-operative. You know that the other process was sitting idle, because if it were busy, your code wouldn't be running in the first place. It's like waiting for the talking stick to be handed to you so that you can ask the question, "Are you ready to give up the talking stick?" The WaitForInputIdle function assisted in the porting of these 16-bit applications by allowing a process to wait and simulate the "Wait for the other person to stop talking" operation which had previously been implicit in a co-operative system.

    What would it mean for WaitForInputIdle to wait on a program that has already completed its initialization, when the program has multiple threads? Suppose one thread is sitting around waiting for messages, but another is busy and still has unprocessed input messages. Would a call to this WaitForInputIdleAgain function wait, or should it return immediately? According to the description, it would return immediately, because there is a thread in the process which is "waiting for user input with no input pending." So even if WaitForInputIdle worked like this imaginary WaitForInputIdleAgain function, it still wouldn't help you, because it wouldn't actually wait in cases where you probably wanted it to.

    Actually, the above analysis applies to WaitForInputIdle as well; we'll pick up this discussion next time.

  • The Old New Thing

    First, try reading the error message, episode 3: Even programmers see error messages without reading them

    • 37 Comments

    As we all know, users don't read error messages. And, sad to say, this behavior extends to the group of people you would think pay the most attention to error messages, namely, programmers, who are the people who cause your computer to display the error messages!

    Today's example comes from an internal programmer's tool which I will call Program Q.

    I'm trying to back out an update I made to record 19 of our table (which was entered as transaction 512), but I can't get it to work:

    C:\> q backout 512
    Transaction opened for record 19 - must reconcile backout
    
    C:\> q commit all
    Transaction 520 created for 1 record
    Committing transaction 520
    Record 19 - must reconcile backout
    Unreconciled changes pending - use 'q reconcile' to resolve
    Commit failed -- fix problems above and then use 'q commit 520'
    

    What is the problem here, and how do I fix it?

    The transaction couldn't be backed out because somebody else made a change to record 19 since your original erroneous update, and the system wants you to reconcile this conflict.

    But that's beside the point. The point is that the user here merely saw that there was an error and didn't actually read any of the error messages. There were four error messages, and every single one told him what he needed to do!

    C:\> q backout 512
    Transaction opened for record 19 - must reconcile backout        <-- message 1
    
    C:\> q commit all
    Transaction 520 created for 1 record
    Committing transaction 520
    Record 19 - must reconcile backout                               <-- message 2
    Unreconciled changes pending - use 'q reconcile' to resolve      <-- message 3
    Commit failed -- fix problems above and then use 'q commit 520'  <-- message 4
    
  • The Old New Thing

    Why does it take longer to reject an invalid password than to accept a valid one?

    • 37 Comments

    You may have noticed that it takes longer to reject an invalid password than to accept a valid one. There are a few reasons for this.

    First of all, it simply takes longer to confirm that a password is invalid.

    Your local computer retains a password cache. This password cache improves performance of local authentication operations, such as unlocking a workstation. If you unlock the workstation with the same password you used to log on, then the password is assumed to be good. This allows the workstation to unlock quickly. Without the password cache, unlocking the workstation would require going back to the domain controller to validate the password, which, for slow network connections, can take a very long time. In fact, you might not have any network connection at all (an extreme case of slow), say because you've taken your laptop on the road and are not connected to the corporate network any more. In that case, without the password cache, it would be impossible for you to unlock your workstation at all!

    Note that you can disable these password caches if they offend you.

    The algorithm for testing if a password is valid goes like this:

    1. If password is in cache and matches: Return valid.
    2. Else contact domain controller for password validation.

    If you pass a valid password, then the validation succeeds at step 1. Notice that step 1 can be performed entirely on the local machine. It doesn't need to contact any other computers to get an answer.

    On the other hand, if you pass an invalid password, then we go on to step 2, which attempts to contact the domain controller to validate the password. Obviously you have to do this for passwords not in the cache, because you have no information about those passwords. But why do you also have to do this for passwords that are in the cache and don't match? Why don't you just say invalid without contacting the domain controller?

    Because your cache itself may be invalid.

    If the user recently changed the password on another machine, then the password in your machine's cache is not valid. If the user tries to use the new password, your computer's cache says, "Nope, that's not the right password." If you returned invalid immediately instead of contacting the domain controller, then users whose passwords have changed would not be able to use that password to access any computer which had cached the old password! They would have to sit around and wait for the old password to fall out of the cache, so that the computer would continue to step 2 and get the new password from the domain controller.

    You can imagine the bizarro technical support calls that would have resulted. "Yes, I know you changed your password, but you have to keep using your old password until the system starts rejecting it, and then you switch to the new password. And the rejection time will vary from computer to computer, depending on how many other people use the computer also. Oh, and if you're the only person who uses the computer, then it will never accept your new password. But once you log onto the computer with the old password, you might need to give your new password when connecting from that machine to other machines, because those other machines might have received your new password."
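    Putting the two steps together, here is a minimal sketch of the lookup-then-fall-through behavior (toy Python with a fake domain controller; none of these names are real Windows APIs):

```python
import hashlib

def h(pw):
    # stand-in for whatever one-way function the real cache uses
    return hashlib.sha256(pw.encode()).hexdigest()

class FakeDomainController:
    def __init__(self, passwords):
        self.passwords = passwords
        self.calls = 0
    def validate(self, user, password):
        self.calls += 1
        return self.passwords.get(user) == password

def validate_password(user, password, cache, dc):
    # Step 1: a matching cache entry is accepted with no network traffic.
    if cache.get(user) == h(password):
        return True
    # Step 2: a miss OR a mismatch goes to the domain controller,
    # because the cached entry may simply be stale.
    if dc.validate(user, password):
        cache[user] = h(password)  # refresh the cache with the new password
        return True
    return False

dc = FakeDomainController({"alice": "n3w-pass"})
cache = {"alice": h("0ld-pass")}   # stale entry from before the password change

print(validate_password("alice", "n3w-pass", cache, dc))  # True, via the DC
print(validate_password("alice", "n3w-pass", cache, dc))  # True, from the cache
print(dc.calls)                                           # 1
```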

    Okay, so one reason why invalid passwords take longer to reject is that the computer has to try more things before finally deciding to reject it.

    Another reason why invalid passwords take longer to reject is to reduce the effectiveness of dictionary attacks. If invalid passwords were rejected just as quickly as valid passwords were accepted, then a bad guy could just churn through a dictionary trying out invalid passwords at high speed. Adding a delay of a few seconds before rejecting invalid passwords introduces a minor inconvenience to users who mistyped their passwords, but makes a huge dent in stopping dictionary attacks. For example (and these numbers are completely made up), suppose you have a 75,000 word password dictionary, and passwords are accepted or rejected in 100ms. It would take a little over two hours to attempt every password in the dictionary. Introducing even a simple 5-second delay into the rejection of invalid passwords increases the time to perform a dictionary search to over four days.
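    The arithmetic (same made-up numbers) is easy to check:

```python
words = 75_000

fast_hours = words * 0.100 / 3600   # 100 ms per attempt, in hours
slow_days  = words * 5.0 / 86400    # 5 s per rejection, in days

print(f"{fast_hours:.1f} hours")  # 2.1 hours
print(f"{slow_days:.1f} days")    # 4.3 days
```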

    The invalid password rejection time in some places can get quite high, especially if the delay escalates each time you get the password wrong. For example, after you type the third (fourth?) incorrect password to the Windows logon screen, it displays the incorrect password error for something like 30 seconds before letting you try again.

  • The Old New Thing

    Germans are falling for the same trap as the Japanese: Importing words from English and changing the meaning, but the Germans do it even though the words didn't need to be imported at all

    • 73 Comments

    Languages borrow from each other all the time. English has historically been a happy perpetrator of word-theft, but in recent decades, it has been serving as the source for a lot of theft, too. What I find particularly interesting, though, is when a word is borrowed and given a meaning in its new language different from its meaning in the source language.

    Japanese is famous for this. For example, they take the English phrase white shirt and import it as waishatsu, which means not white shirt but dress shirt.

    In Swedish, the phenomenon of importing English into Swedish is known as svengelska, a blend of svenska and engelska. The Swedes use the faux-English term service-mind to mean dedication to customer service. I find this interesting because they just took some English words and combined them in a way not used in English at all. And as you can see from the citation, it seems that there are some who are under the mistaken impression that we use the word in English, too.

    One thing that disturbs me is when a word is imported into a language even though there is already a perfectly good word for the concept. Many years ago, my aunt (who at the time was a Japanese teacher) went to Japan and sat down in a restaurant. She looked over the menu for the beverage section and couldn't find it. She asked the waiter where the beverage section was, and the waiter directed her to the section titled dorinku. The Japanese have imported the word dorinku from the English word drink, displacing the traditional Japanese word (which I believe is nomimono, but I could be wrong). My aunt was looking for the traditional Japanese word and couldn't find it.

    The award for using faux English terms, however, goes to the Germans and Denglisch, the term for the blending of Deutsch and Englisch. Unsatisfied with the perfectly good German word Rucksack, which means backpack, some marketing geniuses decided to adopt the English word body bag instead. This is wrong on so many levels. First of all, it's the phenomenon of replacing a perfectly good native word with a loanword. Second, the English language imported the word rucksack from German. We borrowed the word from you. Feel free to borrow it back; it was yours originally! And third, the import is disturbingly incorrect. In English, a body bag is a bag for carrying corpses, what in German would be called a Leichensack.

    This is a rather long and tedious set-up for my recent discovery. Apparently in Switzerland, the term for a light truck—not a semi-trailer but something that a family might own—is a Pickup, even though in English, the term pick-up is used only for a particular type of small truck.

    Mind you, the English language is hardly innocent in the matter of importing a word while changing its meaning. For example, the prefix über- (or simply uber-) is used in English to mean ultimate or super. As a random example, I produce this citation for the word ubermom. Of course, this prefix is nonsensical to native speakers of German, since über merely means over or on top of. The German word Übermensch was originally translated into English as superman, leading to the widespread misconception that über must mean super.
