Writing a regular expression that matches an IPv4 dotted address is either
easy or hard, depending on how good a job you want to do.
In fact, to make things easier, let's match only the decimal
dotted notation, leaving out the hexadecimal variant,
as well as the non-dotted variants.
For the purpose of this discussion,
I'll restrict myself to the common subset
of the regular expression languages
shared by perl, JScript, and the .NET Framework, and
I'll assume ECMA mode, wherein \d matches only the characters
0 through 9.
(By default, in the .NET Framework,
\d matches any decimal digit, not just 0 through 9.)
The easiest version is just to take any string of four decimal
numbers separated by periods.
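Here is a minimal sketch of that easiest version as a JavaScript regular expression literal. The name `looseDottedQuad` is made up for illustration; the pattern just asks for four runs of digits separated by literal periods, anchored at both ends.

```javascript
// Hypothetical name. Four runs of decimal digits separated by periods.
// In JavaScript, \d matches only the characters 0 through 9.
const looseDottedQuad = /^\d+\.\d+\.\d+\.\d+$/;

console.log(looseDottedQuad.test("192.168.0.1")); // true
console.log(looseDottedQuad.test("192.168.0"));   // false
```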
This is nice as far as it goes, but it erroneously accepts
strings like "448.90210.0.65535".
A proper decimal dotted address has no value larger than 255.
But writing a regular expression that matches the integers 0 through 255
is hard work because
regular expressions don't understand arithmetic;
they operate purely textually.
Therefore, you have to describe the integers 0 through 255 in purely
textual terms.
Given this textual breakdown of the integers 0 through 255,
your first try would be something like this:
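As a sketch of what that first try might look like, assume the breakdown is: a single digit, a nonzero digit followed by a digit, a 1 followed by two digits, a 2 followed by 0 through 4 and then a digit, or 25 followed by 0 through 5. The name `octetFirstTry` is illustrative only.

```javascript
// Assumed breakdown of 0..255, one alternative per rule:
//   \d        a single digit                  (0-9)
//   [1-9]\d   a nonzero digit, then a digit   (10-99)
//   1\d\d     a 1, then two digits            (100-199)
//   2[0-4]\d  a 2, then 0-4, then a digit     (200-249)
//   25[0-5]   25, then 0-5                    (250-255)
const octetFirstTry = /^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])$/;

console.log(octetFirstTry.test("255")); // true
console.log(octetFirstTry.test("256")); // false
```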
This can be shrunk a bit by recognizing that the first two rules above
could be combined into
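A sketch of that merge, assuming the first two rules are "a single digit" and "a nonzero digit followed by a digit": making the leading nonzero digit optional covers both. The name `octet` is illustrative.

```javascript
// \d and [1-9]\d collapse into [1-9]?\d
// (an optional nonzero digit, then a digit).
const octet = /^([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$/;

console.log(octet.test("7"));   // true
console.log(octet.test("42"));  // true
console.log(octet.test("042")); // false
```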
Now we just have to do this four times with periods in between:
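Spelled out as a sketch, here the full pattern is assembled from the single-value pattern with the `RegExp` constructor so the repetition stays visible. The variable names are illustrative, and the single-value pattern is the assumed 0 through 255 breakdown with its first two rules merged.

```javascript
// Pattern for one value in 0..255 (assumed breakdown, no leading zeroes).
const octetSource = "(?:[1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])";

// Four of them with literal periods in between, anchored at both ends.
const dottedQuad = new RegExp(
  "^" + [octetSource, octetSource, octetSource, octetSource].join("\\.") + "$"
);

console.log(dottedQuad.source); // the whole thing, written out in full
console.log(dottedQuad.test("127.0.0.1"));         // true
console.log(dottedQuad.test("448.90210.0.65535")); // false
```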
Congratulations, we have just taken a simple description of the
dotted decimal notation in words and converted it into a monstrous
regular expression that is basically unreadable.
Imagine you were maintaining a program and stumbled across this
regular expression.
How long would it take you to figure out what it did?
Oh, and it might not be right yet,
because some parsers accept leading zeroes
in front of each decimal value without affecting it.
(For example, 127.0.0.001 is the same as 127.0.0.1.
On the other hand, some parsers treat a leading zero as an octal prefix.)
Updating our regular expression to accept leading decimal zeroes means
that we now have
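Here is a sketch of one way to do that, assuming leading zeroes are simply ignored (the decimal interpretation, not the octal one): allow any run of zeroes in front of each value. The names are illustrative.

```javascript
// 0* soaks up leading zeroes; the rest is the assumed 0..255 breakdown.
const paddedOctet = "0*(?:[1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])";
const paddedDottedQuad = new RegExp(
  "^" + Array(4).fill(paddedOctet).join("\\.") + "$"
);

console.log(paddedDottedQuad.test("127.0.0.001"));  // true
console.log(paddedDottedQuad.test("127.0.0.0256")); // false
```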
This is why I both love and hate regular expressions.
They are a great way to express simple patterns.
And they are a horrific way to express complicated ones.
Regular expressions are probably the world's most popular
write-only language.
Aha, but you see, all this time diving into regular expressions
was a mistake.
Because we failed to figure out
what the actual problem was.
This was a case of somebody "solving" half of their problem
and then asking for help with the other half:
"I have a string and I want to check whether it is a dotted decimal
IPv4 address.
I know, I'll write a regular expression!
Hey, can anybody help me write this regular expression?"
The real problem was not "How do I write a regular expression to
recognize a dotted decimal IPv4 address."
The real problem was simply "How do I recognize a dotted decimal IPv4
address."
And with this broader goal in mind, you recognize that limiting
yourself to a regular expression only made the problem harder.
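function isDottedIPv4(s) {
 // Match the shape with a regular expression, then do the range check with arithmetic.
 var match = s.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
 return match != null &&
        match[1] <= 255 && match[2] <= 255 &&
        match[3] <= 255 && match[4] <= 255;
}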
And this was just a simple dotted decimal IPv4 address.
Woe unto you if you decide you want to
parse e-mail addresses.
Don't make regular expressions do what they're not good at.
If you want to match a simple pattern, then match a simple pattern.
If you want to do math, then do math.
As commenter Maurits put it,
"The trick is not to spend time developing a combination hammer/screwdriver,
but just use a hammer and a screwdriver."
I successfully avoided
the stealth sidewalk
the other day.
This evens the score.
(Starbucks Bike to Work Day.)
Consider a program whose output to the console goes like this.
(I've prefixed each line with the output stream.)
stdout: line 1
stderr: err 1
stdout: line 2
stderr: err 2
You want to capture both the normal and error streams,
so you run the program and append
">output 2>&1" to capture both streams
into a single file.
But when you look at the resulting output file, you get this:
line 1
line 2
err 1
err 2
What happened?
Many programs
change their output behavior depending on whether
the output stream is a file or a device.
If the output stream is a device (such as the screen),
then buffering is disabled and every print statement goes
to the screen immediately.
On the other hand, if the output stream is a file,
then buffering is enabled and print statements do not go
directly into the file but rather into an application-managed
buffer.
When the buffer fills, it is written to the file, and then the
buffer is emptied so it can accept more output.
This explains the behavior we see above.
The program generates its output to both stdout and stderr,
and the results are buffered.
When the program exits, the buffers are flushed, first
stdout and then stderr.
That's why you see all the stdout output grouped together
and all the stderr output grouped together.
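To make the mechanism concrete, here is a toy model in JavaScript. It is not the C runtime; `ToyStream` and `runProgram` are invented for illustration, and the model follows the description above by flushing only at exit, first stdout and then stderr.

```javascript
// Toy model, not the C runtime: each stream either writes straight
// through (device) or collects output in a buffer until flushed (file).
class ToyStream {
  constructor(sink, buffered) {
    this.sink = sink;         // shared destination, an array of lines
    this.buffered = buffered; // true when the destination is a "file"
    this.pending = [];
  }
  print(line) {
    if (this.buffered) this.pending.push(line);
    else this.sink.push(line);
  }
  flush() {
    this.sink.push(...this.pending);
    this.pending = [];
  }
}

function runProgram(buffered) {
  const sink = [];
  const out = new ToyStream(sink, buffered);
  const err = new ToyStream(sink, buffered);
  out.print("line 1");
  err.print("err 1");
  out.print("line 2");
  err.print("err 2");
  // At exit the buffers are flushed, first stdout and then stderr.
  out.flush();
  err.flush();
  return sink;
}

console.log(runProgram(false).join(", ")); // line 1, err 1, line 2, err 2
console.log(runProgram(true).join(", "));  // line 1, line 2, err 1, err 2
```

The unbuffered run interleaves the two streams in program order; the buffered run groups them, which is the pattern in the captured file.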
"But I don't do any of this in my programs; why is it happening anyway?"
If you use the C runtime for your output, then your program does
behave this way whether you realize it or not.
The default behavior of the C runtimes is to perform unbuffered I/O
for devices and buffered I/O for files.
You can override this behavior in your own programs by calling
setvbuf to force buffering on or off for a file handle.
(A classmate in college became legendary for fixing a bug
in one of the programs used in the VLSI class, all of which were
He patched the binary to disable buffered I/O.)
My approach to
inventing new Winter Olympic events
is to create new opportunities for head-to-head competition,
opening the door to new dramatic possibilities.
For example, in Ski Jump Biathlon,
one team jumps while the other team tries to shoot them (with paint
pellets, of course) as they sail through the air.
In Figure Curling, one team performs a free skate while the
other team hurls granite stones down the ice in an attempt to foil
that triple toe loop.
Japanese filmmaker Riichiro Mashima
was also bitten by the sport-inventing bug.
The result: Ski Jumping Pairs.
I'm going to have to check it out.
Although the redirection operator traditionally appears at the end
of a command line, there is no requirement that it do so.
All of these commands are equivalent:
echo A B>C
echo A>C B
echo>C A B
>C echo A B
All of them echo "A B" to the file "C".
You can use this trick to avoid
the redirection problem we discussed last time.
We saw that writing
set message=Meet at 2
echo %message%>schedule
inadvertently interprets the "2" as part of the redirection operator.
One solution was to insert a space:
echo %message% >schedule
but this assumes that the space won't cause a problem.
If you're in a case where that space will indeed cause a problem,
you can use the trick above to move the redirection operator
to a location where it won't cause any trouble:
>schedule echo %message%
The Germans
is probably the most well-known episode of
Fawlty Towers.
Who better than John Cleese, therefore, to release the song
Don't Mention the War,
just in time for the World Cup.
The purpose is to mend fences between Britain and Germany,
but it might just make things worse, who knows.
If you want to put the string "Meet at 2" into the file "schedule",
you might be tempted to use
echo Meet at 2>schedule
If you try this, however, you'll see the string "Meet at" on the
screen and the "schedule" file will be blank.
A digit immediately before a redirection operator modifies
which stream the redirection operator applies to.
If you're going to redirect an alternate output stream,
it'll nearly always be the standard error stream, or stream 2.
To put the error output into a file, you would write
something like this:
sort /invalidswitch 2>errorfile
There is also the operator ">&" that reopens a stream
as another stream.
some-command >output 2>&1
says, "Put the normal output into the file output,
and then change the error output stream (2) to refer to the
normal output stream (1)."
The result is that both the regular output and error output
end up in the output file.
But what if you really want to put the string "Meet at 2" into
the file "schedule"?
You can insert a space between the "2" and the ">".
This works for most programs since they ignore trailing spaces
on their command line, but this was a trick question:
The echo command is one of the few commands
that actually pays attention to trailing spaces.
As a result, the contents of the "schedule" file are
"Meet at 2<space><cr><lf>".
Maybe this is close enough for you, in which case you can skip
the next paragraph.
But what if you don't want that trailing space?
For that, you can use the metacharacter escape character, the ^:
echo Meet at ^2>schedule
The last gotcha is that the pesky "2" might come from environment
variable expansion:
set message=Meet at 2
echo %message%>schedule
The trailing "2" in %message% interacts with the
greater-than sign, leading to an unintended redirection.
For this, you can insert a space before the greater-than sign,
assuming you are in a scenario where that space is not going to
cause you any problems.
(And if you're in a scenario where that space will cause a problem,
you can use a trick we'll look at next time.)
Mind you, if you're going to take an environment variable
whose contents you do not control and expand it
onto your command line unquoted, you have much worse problems
than a trailing digit messing up your file redirection.
Somebody might have decided that the message
should be "&format C: /y". Inserting this into the command
line unquoted would yield "echo &format C: /y>schedule"
which is a pretty good way to ruin somebody's day.
(Well, okay, you can't format a drive with an active pagefile,
but you get the idea.)
Official (though preliminary) documentation on the x64 calling convention
is available on MSDN,
for those who want more than
my quack overview.
(Oops, I meant "quick overview". Little Freudian slip there.)
The magic characters like | and >
in command lines like
myprogram.exe | sort > output.txt
are interpreted by the command interpreter CMD.EXE;
they aren't built into the CreateProcess function.
(This is obvious if you think about it.
That command line created two processes; which one should
CreateProcess return a handle to?)
If you pass a command line like this to CreateProcess,
it will merely run the
myprogram.exe program with the command line arguments
"| sort > output.txt".
(The ShellExecute function behaves similarly.)
If you want these characters to be interpreted as redirection
operators, you need to give them to someone who will interpret
those characters in the manner you intend:
cmd.exe /C myprogram.exe | sort > output.txt
Since different command line interpreters use different syntax,
you have to specify which command line interpreter you want to use.
If the command line came from the user, you probably want to use
the COMSPEC variable in order to give the command
to the user's command line interpreter of choice.
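As a rough sketch of that idea, the hypothetical helper below builds the command line you would hand to CreateProcess. The helper name, the fallback to "cmd.exe" when COMSPEC is unset, and the assumption that the user's interpreter accepts /C are all mine; a real version would also need to quote an interpreter path that contains spaces.

```javascript
// Hypothetical helper: wrap a command line so that the user's command
// interpreter is the one that sees the | and > characters.
// Falling back to "cmd.exe" when COMSPEC is unset is an assumption.
function wrapForInterpreter(commandLine, env) {
  const interpreter = env.COMSPEC || "cmd.exe";
  return interpreter + " /C " + commandLine;
}

console.log(wrapForInterpreter("myprogram.exe | sort > output.txt", {
  COMSPEC: "C:\\Windows\\System32\\cmd.exe",
}));
```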
The format of the command line returned by
the GetCommandLine function is
"program args", but this is only a convention.
If you pass NULL for the lpApplicationName
to the CreateProcess function,
then the CreateProcess function will treat the first word
of the lpCommandLine as the program name.
However, if you pass a value for lpApplicationName,
then that string determines the program that is run, and the
string passed as the lpCommandLine is not used for
that purpose.
This means that if somebody runs your program with the following
parameters to the CreateProcess function
(an lpApplicationName that names your program
and an lpCommandLine of "slithy toves"),
then when your program calls the GetCommandLine function,
it will get the string "slithy toves",
which doesn't give your program much help at all in determining
its own name or location.
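The rule can be sketched as a toy model. This is not the real CreateProcess: the function name and the example path are hypothetical, and the model ignores quoting and path searching.

```javascript
// Toy model of the rule described above, not the real CreateProcess.
// Returns which program runs and what GetCommandLine would report.
function modelCreateProcess(lpApplicationName, lpCommandLine) {
  const firstWord = lpCommandLine.split(" ")[0];
  return {
    programRun: lpApplicationName !== null ? lpApplicationName : firstWord,
    commandLineSeen: lpCommandLine, // passed through untouched either way
  };
}

console.log(modelCreateProcess(null, "program args"));
console.log(modelCreateProcess("C:\\example\\yourprogram.exe", "slithy toves"));
```

In the second call the command line the program sees contains no trace of the program that is actually running.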
If your program needs to determine its own name and location,
use the GetModuleFileName function,
as I noted some time ago.
What is the point of letting a program specify something different
as the first word on the command line from the actual program being run?
There isn't much point to it in Windows, although it is used
to greater effect in unix,
where you can run a program under various "alias" names,
executing one program but lying to it and putting
a different name at the start of the command line.
Some programs are specially designed to be run this way
and alter their behavior depending on the "alias" name they were given.
For example, the visual editor runs in screen mode
if its name is given as "vi" but in line mode if its
name is given as "ex".
Although extremely few Windows programs use this quirk
(I am not aware of any myself),
the behavior nevertheless is supported, and you need to be aware
of it when writing your own program, even if you don't intend to use it.
For example, if you forget to repeat the program name on the command
line and create the process with an explicit lpApplicationName
and an lpCommandLine of "arg1 arg2",
then when that program runs, you will most likely see it ignore
the arg1 because it thinks that arg1 is
just the program name.
If that program is a console program that uses the C runtime startup code,
then it will receive its parameters as argv[0] = "arg1" and argv[1] = "arg2".
As I noted earlier,
most console programs merely ignore their argv[0] since
that slot is just the program name.
(In this case, it's the alias program name, but the program being
run doesn't know that.)
Similarly, if the program is a Windows program that uses the
C runtime startup code,
then the C runtime startup code will merely skip over the first word
on the command line, passing "arg2" to the WinMain
function as its lpCmdLine.
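Here is a toy model of what the startup code does with a command line of "arg1 arg2" when the launcher forgot to repeat the program name. The function `modelStartup` is invented for illustration, and it splits on whitespace only; the real C runtime parsing also handles quotation marks.

```javascript
// Toy model of C runtime startup; real parsing also handles quotes.
function modelStartup(lpCommandLine) {
  const words = lpCommandLine.trim().split(/\s+/);
  return {
    argv: words,                         // what main() would receive
    lpCmdLine: words.slice(1).join(" "), // what WinMain() would receive
  };
}

const started = modelStartup("arg1 arg2");
console.log(started.argv[0]);   // arg1
console.log(started.lpCmdLine); // arg2
```

Either way, "arg1" lands in the slot reserved for the program name and is effectively lost.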
What was the point of all this discussion?
First, that if you are launching other programs and passing
an explicit lpApplicationName, then it behooves you
to format the command line in a compatible manner.
Otherwise, the results may not be what you expect.
Second, that you as a program should not use the first token on
the command line to control any security decisions
since the value is controlled by the launcher and need not have
any connection to reality.