May, 2006

  • The Old New Thing

    How do I write a regular expression that matches an IPv4 dotted address?

    • 54 Comments

    Writing a regular expression that matches an IPv4 dotted address is either easy or hard, depending on how good a job you want to do. In fact, to make things easier, let's match only the decimal dotted notation, leaving out the hexadecimal variant, as well as the non-dotted variants.

    For the purpose of this discussion, I'll restrict myself to the common subset of the regular expression languages shared by perl, JScript, and the .NET Framework, and I'll assume ECMA mode, wherein \d matches only the characters 0 through 9. (By default, in the .NET Framework, \d matches any decimal digit, not just 0 through 9.)

    The easiest version is just to take any string of four decimal numbers separated by periods.

    /^\d+\.\d+\.\d+\.\d+$/
    

    This is nice as far as it goes, but it erroneously accepts strings like "448.90210.0.65535". A proper decimal dotted address has no value larger than 255. But writing a regular expression that matches the integers 0 through 255 is hard work because regular expressions don't understand arithmetic; they operate purely textually. Therefore, you have to describe the integers 0 through 255 in purely textual means.

    • Any single digit is valid (representing 0 through 9).
    • Any nonzero digit followed by another digit is valid (representing 10 through 99).
    • A "1" followed by two digits is valid (100 through 199).
    • A "2" followed by "0" through "4" followed by another digit is valid (200 through 249).
    • A "25" followed by "0" through "5" is valid (250 through 255).

    Given this textual breakdown of the integers 0 through 255, your first try would be something like this:

    /^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])$/
    

    This can be shrunk a bit by recognizing that the first two rules above could be combined into

    • Any digit, optionally preceded by a nonzero digit, is valid.

    yielding

    /^([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$/
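
    A pattern this small is easy to check exhaustively. The sketch below tries every integer from 0 through 999 and counts the cases where the pattern disagrees with the arithmetic. The names octetPattern and countMismatches are made up for this illustration. Note that the alternatives are grouped in parentheses: alternation binds more loosely than the anchors, so without the group the ^ and $ would apply only to the first and last alternatives.

```javascript
// octetPattern and countMismatches are illustrative names, not from the article.
// The parentheses make ^ and $ apply to every alternative.
var octetPattern = /^([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$/;

// Count the integers in 0..999 on which the pattern disagrees with "n <= 255".
function countMismatches() {
  var mismatches = 0;
  for (var n = 0; n <= 999; n++) {
    var shouldMatch = n <= 255;
    if (octetPattern.test(String(n)) !== shouldMatch) {
      mismatches++;
    }
  }
  return mismatches;
}
```

    If the pattern does its job, countMismatches() returns 0.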
    

    Now we just have to do this four times with periods in between:

    /^([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$/
    

    Congratulations, we have just taken a simple description of the dotted decimal notation in words and converted it into a monstrous regular expression that is basically unreadable. Imagine you were maintaining a program and stumbled across this regular expression. How long would it take you to figure out what it did?

    Oh, and it might not be right yet, because some parsers accept leading zeroes in front of each decimal value without affecting it. (For example, 127.0.0.001 is the same as 127.0.0.1. On the other hand, some parsers treat a leading zero as an octal prefix.) Updating our regular expression to accept leading decimal zeroes means that we now have

    /^0*([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.0*([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.0*([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.0*([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$/
    

    This is why I both love and hate regular expressions. They are a great way to express simple patterns. And they are a horrific way to express complicated ones. Regular expressions are probably the world's most popular write-only language.
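
    If the pattern has to stay, building it from a named piece at least removes the repetition. This is only a sketch: the names octet, dottedIPv4, and looksLikeDottedIPv4 are hypothetical, and it builds the variant that does not accept leading zeroes.

```javascript
// One octet, 0 through 255, written once. (?:...) groups without capturing.
var octet = "(?:[1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])";

// An octet, followed by three more octets that are each preceded by a period.
var dottedIPv4 = new RegExp("^" + octet + "(?:\\." + octet + "){3}$");

function looksLikeDottedIPv4(s) {
  return dottedIPv4.test(s);
}
```

    The expression the engine sees is just as long as before, but the reader only has to understand one octet and one repetition.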

    Aha, but you see, all this time diving into regular expressions was a mistake. Because we failed to figure out what the actual problem was. This was a case of somebody "solving" half of their problem and then asking for help with the other half: "I have a string and I want to check whether it is a dotted decimal IPv4 address. I know, I'll write a regular expression! Hey, can anybody help me write this regular expression?"

    The real problem was not "How do I write a regular expression to recognize a dotted decimal IPv4 address." The real problem was simply "How do I recognize a dotted decimal IPv4 address." And with this broader goal in mind, you recognize that limiting yourself to a regular expression only made the problem harder.

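    function isDottedIPv4(s)
    {
     // Match the simple pattern: four runs of digits separated by periods.
     var match = s.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
     // Then do the math. The <= operator converts each captured string
     // to a number before comparing, so leading zeroes are harmless.
     return match != null &&
            match[1] <= 255 && match[2] <= 255 &&
            match[3] <= 255 && match[4] <= 255;
    }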
    
    WScript.StdOut.WriteLine(isDottedIPv4("127.0.0.001"));
    WScript.StdOut.WriteLine(isDottedIPv4("448.90210.0.65535"));
    WScript.StdOut.WriteLine(isDottedIPv4("microsoft.com"));
    

    And this was just a simple dotted decimal IPv4 address. Woe unto you if you decide you want to parse e-mail addresses.

    Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver."
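
    The same division of labor works with an even simpler pattern. The following sketch (isDottedIPv4BySplit is a hypothetical name, and this is one possible variation rather than the approach shown above) splits the string on periods, uses a trivial pattern to check that each piece is all digits, and leaves the range check to arithmetic.

```javascript
function isDottedIPv4BySplit(s) {
  var parts = s.split(".");
  if (parts.length !== 4) {
    return false;
  }
  for (var i = 0; i < parts.length; i++) {
    // Simple pattern: digits only (this also rejects empty pieces).
    if (!/^\d+$/.test(parts[i])) {
      return false;
    }
    // Math: compare the numeric value, so leading zeroes do not matter.
    if (Number(parts[i]) > 255) {
      return false;
    }
  }
  return true;
}
```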

  • The Old New Thing

    Raymond 1, Sidewalk 1

    • 12 Comments

    I successfully avoided the stealth sidewalk the other day. This evens the score.

    (Today is Starbucks Bike to Work Day.)

  • The Old New Thing

    Redirecting output can result in altered program behavior

    • 15 Comments

    Consider a program whose output to the console goes like this. (I've prefixed each line with the output stream.)

    stdout: line 1
    stderr: err 1
    stdout: line 2
    stderr: err 2
    

    You want to capture both the normal and error streams, so you run the program and append ">output 2>&1" to capture both streams into a single file. But when you look at the resulting output file, you get this:

    line 1
    line 2
    err 1
    err 2
    

    What happened?

    Most programs change their output behavior depending on whether the output stream is a file or a device. If the output stream is a device (such as the screen), then buffering is disabled and every print statement goes to the screen immediately. On the other hand, if the output stream is a file, then buffering is enabled and print statements do not go directly into the file but rather into an application-managed memory buffer. When the buffer fills, it is written to the file, and then the buffer is emptied so it can accept more output.

    This explains the behavior we see above. The program generates its output to both stdout and stderr, and the results are buffered. When the program exits, the buffers are flushed, first stdout and then stderr. That's why you see all the stdout output grouped together and all the stderr output grouped together.
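
    The effect can be reproduced with a toy model. The sketch below is a simulation only: it does not touch the C runtime or any real file, and makeStream and runProgram are names invented for the illustration. Two streams share one destination, and the only thing that changes between runs is whether writes are held in a memory buffer until the final flush.

```javascript
// Simulation only: models two output streams that share one destination.
function makeStream(destination, buffered) {
  var pending = [];
  return {
    write: function (line) {
      if (buffered) {
        pending.push(line);       // held in memory until flush
      } else {
        destination.push(line);   // goes to the destination immediately
      }
    },
    flush: function () {
      while (pending.length > 0) {
        destination.push(pending.shift());
      }
    }
  };
}

function runProgram(buffered) {
  var destination = [];
  var out = makeStream(destination, buffered);
  var err = makeStream(destination, buffered);
  out.write("line 1");
  err.write("err 1");
  out.write("line 2");
  err.write("err 2");
  // At exit the buffers are flushed, stdout first and then stderr.
  out.flush();
  err.flush();
  return destination;
}
```

    In this model, runProgram(false) yields the interleaved order seen on the console (line 1, err 1, line 2, err 2), while runProgram(true) yields the grouped order seen in the file (line 1, line 2, err 1, err 2).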

    "But I don't do any of this in my programs; why is it happening anyway?"

    If you use the C runtime for your output, then your program does behave this way whether you realize it or not. The default behavior of the C runtimes is to perform unbuffered I/O for devices and buffered I/O for files. You can override this behavior in your own programs by calling setvbuf to force buffering on or off for a file handle.

    (A classmate in college became legendary for fixing a bug in one of the programs used in the VLSI class, all of which were notoriously buggy. He patched the binary to disable buffered I/O.)

  • The Old New Thing

    Making up new Winter Olympic events

    • 13 Comments

    My approach to inventing new Winter Olympic events is to create new opportunities for head-to-head competition, opening the door to new dramatic possibilities. For example, in Ski Jump Biathlon, one team jumps while the other team tries to shoot them (with paint pellets, of course) as they sail through the air. In Figure Curling, one team performs a free skate while the other team hurls granite stones down the ice in an attempt to foil that triple toe loop.

    Japanese filmmaker Riichiro Mashima was also bitten by the sport-inventing bug. His creation: Ski Jumping Pairs. I'm going to have to check it out.

  • The Old New Thing

    The redirection operator can occur in the middle of the command line

    • 10 Comments

    Although the redirection operator traditionally appears at the end of a command line, there is no requirement that it do so. All of these commands are equivalent:

    echo A B>C
    echo A>C B
    echo>C A B
    >C echo A B
    

    All of them echo "A B" to the file "C".

    You can use this trick to avoid the redirection problem we discussed last time. We saw that writing

    set message=Meet at 2
    echo %message%>schedule
    

    inadvertently interprets the "2" as part of the redirection operator. One solution was to insert a space:

    echo %message% >schedule
    

    but this assumes that the space won't cause a problem. If you're in a case where that space will indeed cause a problem, you can use the trick above to move the redirection operator to a location where it won't cause any trouble:

    >schedule echo %message%
    
  • The Old New Thing

    Don't mention the war. I mentioned it once, but I think I got away with it all right

    • 14 Comments

    The Germans is probably the most well-known episode of Fawlty Towers. Who better than John Cleese, therefore, to release the song Don't Mention the War, just in time for the World Cup. The purpose is to mend fences between Britain and Germany, but it might just make things worse, who knows.

  • The Old New Thing

    Beware of digits before the redirection operator

    • 29 Comments

    If you want to put the string "Meet at 2" into the file "schedule", you might be tempted to use

    echo Meet at 2>schedule
    

    If you try this, however, you'll see the string "Meet at" on the screen and the "schedule" file will be blank.

    What happened?

    A digit immediately before a redirection operator modifies which stream the redirection operator applies to. If you're going to redirect an alternate output stream, it'll nearly always be the standard error stream, or stream 2. To put the error output into a file, you would write something like this:

    sort /invalidswitch 2>errorfile
    

    There is also the operator ">&" that reopens a stream as another stream. The idiom

    some-command >output 2>&1
    

    says, "Put the normal output into the file output, and then change the error output stream (2) to refer to the normal output stream (1)." The result is that both the regular output and error output end up in the output file.

    But what if you really want to put the string "Meet at 2" into the file "schedule"?

    You can insert a space between the "2" and the ">". This works for most programs since they ignore trailing spaces on their command line, but this was a trick question: The echo command is one of the few commands that actually pays attention to trailing spaces. As a result, the contents of the "schedule" file are "Meet at 2<space><cr><lf>". Maybe this is close enough for you, in which case you can skip the next paragraph.

    But what if you don't want that trailing space? For that, you can use the metacharacter escape character, the ^:

    echo Meet at ^2>schedule
    

    The last gotcha is that the pesky "2" might come from environment variable expansion.

    set message=Meet at 2
    echo %message%>schedule
    

    The trailing "2" in %message% interacts with the greater-than sign, leading to an unintended redirection. For this, you can insert a space before the greater-than sign, assuming you are in a scenario where that space is not going to cause you any problems. (And if you're in a scenario where that space will cause a problem, you can use a trick we'll look at next time.)

    Mind you, if you're going to take an environment variable whose contents you do not control and expand it onto your command line unquoted, you have much worse problems than a trailing digit messing up your file redirection. Somebody might have decided that the message should be "&format C: /y". Inserting this into the command line unquoted would yield "echo &format C: /y>schedule" which is a pretty good way to ruin somebody's day. (Well, okay, you can't format a drive with an active pagefile, but you get the idea.)

  • The Old New Thing

    The real scoop on the x64 calling convention on 64-bit Windows

    • 19 Comments

    Official (though preliminary) documentation on the x64 calling convention is available on MSDN, for those who want more than my quack overview. (Oops, I meant "quick overview". Little Freudian slip there.)

  • The Old New Thing

    Command line redirection is performed by the command line interpreter

    • 31 Comments

    The magic characters like <, >, and | in command lines like

    myprogram.exe | sort > output.txt
    

    are interpreted by the command interpreter CMD.EXE; they aren't built into the CreateProcess function. (This is obvious if you think about it. That command line created two processes; which one should CreateProcess return a handle to?)

    If you pass a command line like this to CreateProcess, it will merely run the myprogram.exe program with the command line arguments "| sort > output.txt". (The ShellExecute function behaves similarly.) If you want these characters to be interpreted as redirection operators, you need to give them to someone who will interpret those characters in the manner you intend:

    cmd.exe /C myprogram.exe | sort > output.txt
    

    Since different command line interpreters use different syntax, you have to specify which command line interpreter you want to use.

    If the command line came from the user, you probably want to use the COMSPEC variable in order to give the command to the user's command line interpreter of choice.
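
    As a rough sketch of that last point, the helper below builds the string you would hand to the process-creation function. The name wrapForInterpreter is hypothetical, and the sketch makes two assumptions that may not hold for you: that the interpreter named by COMSPEC accepts the /C switch the way cmd.exe does, and that its path needs no quoting. The environment is passed in as a plain object so that the function has no dependency on any particular host.

```javascript
// Build a command line that asks a command interpreter to run "command",
// so that characters like | and > are interpreted as redirection operators.
// Assumes the interpreter understands /C and that its path has no spaces.
function wrapForInterpreter(command, env) {
  var interpreter = env.COMSPEC || "cmd.exe";
  return interpreter + " /C " + command;
}
```

    With no COMSPEC in the environment object, wrapForInterpreter("myprogram.exe | sort > output.txt", {}) produces the same cmd.exe /C command line shown earlier.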

  • The Old New Thing

    The first word on the command line is the program name only by convention

    • 28 Comments

    The format of the command line returned by GetCommandLine is "program args", but this is only a convention. If you pass NULL for the lpApplicationName to the CreateProcess function, then the CreateProcess function will treat the first word of the lpCommandLine as the program name. However, if you pass a value for lpApplicationName, then that string determines the program that is run, and the string passed as the lpCommandLine is not used for that purpose.

    This means that if somebody runs your program with the following parameters to the CreateProcess function

    lpApplicationName  =  "C:\Path\To\Program.exe"
    lpCommandLine  =  "slithy toves"

    then when your program calls the GetCommandLine function, it will get the string "slithy toves", which doesn't give your program much help at all in determining its own name or location.

    If your program needs to determine its own name and location, use the GetModuleFileName function, as I noted some time ago.

    What is the point of letting a program specify something different as the first word on the command line from the actual program being run? There isn't much point to it in Windows, although it is used to greater effect in unix, where you can run a program under various "alias" names, executing one program but lying to it and putting a different name at the start of the command line. Some programs are specially designed to be run this way and alter their behavior depending on the "alias" name they were given. For example, the visual editor runs in screen mode if its name is given as "vi" but in line mode if its name is given as "ex".

    Although extremely few Windows programs use this quirk (I am not aware of any myself), the behavior nevertheless is supported, and you need to be aware of it when writing your own program, even if you don't intend to use it.

    For example, if you forget to repeat the program name on the command line and create the process like this

    lpApplicationName  =  "C:\Path\To\Program.exe"
    lpCommandLine  =  "arg1 arg2"

    then when that program runs, you will most likely see it ignore the arg1 because it thinks that arg1 is just the program name. If that program is a console program that uses the C runtime startup code, then it will receive its parameters as

    argv[0]  =  "arg1"
    argv[1]  =  "arg2"

    As I noted earlier, most console programs merely ignore their argv[0] since that slot is just the program name. (In this case, it's the alias program name, but the program being run doesn't know that.)

    Similarly, if the program is a Windows program that uses the C runtime startup code, then the C runtime startup code will merely skip over the first word on the command line, passing "arg2" to the WinMain function as its lpCmdLine.
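
    To see both effects side by side, here is a deliberately simplified model of what the startup code does with the command line. It is not the C runtime's actual parser: it splits on whitespace only and ignores quoting entirely, and modelStartup is a name invented for this sketch.

```javascript
// Simplified model: splits on whitespace only and ignores quoting rules.
function modelStartup(commandLine) {
  // Console-style startup: every token lands in argv, and the first token
  // is assumed to be the program name.
  var argv = commandLine.split(/\s+/).filter(function (token) {
    return token.length > 0;
  });
  // WinMain-style startup: skip the first token and pass along the rest.
  var lpCmdLine = commandLine.replace(/^\s*\S+\s*/, "");
  return { argv: argv, lpCmdLine: lpCmdLine };
}
```

    Feeding it "arg1 arg2" gives argv[0] = "arg1", argv[1] = "arg2", and an lpCmdLine of "arg2". Either way, arg1 has been consumed as if it were the program name.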

    What was the point of all this discussion? Two things. First, that if you are launching other programs and passing an explicit lpApplicationName, then it behooves you to format the command line in a compatible manner. Otherwise, the results may not be what you expect. Second, that you as a program should not use the first token on the command line to control any security decisions since the value is controlled by the launcher and need not have any connection to reality.
