<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Fabulous Adventures In Coding : Regular Expressions</title><link>http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx</link><description>Tags: Regular Expressions</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Regular Expressions From Scratch, Part Twelve: Superposition of States</title><link>http://blogs.msdn.com/ericlippert/archive/2006/01/20/regular-expressions-from-scratch-part-twelve-superposition-of-states.aspx</link><pubDate>Fri, 20 Jan 2006 22:47:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:515471</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>7</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/515471.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=515471</wfw:commentRss><description>&lt;div class="mine"&gt;

&lt;P&gt;
Happy New Year everyone.  Over the break I had a wonderful time reconnecting with my friends and family. 
And of course I came back to a huge pile of work!  We're going through the flaws that were discovered in
C# 2.0 too late in the cycle to risk fixing, and some of them illustrate interesting corner cases in
the language.  
But those will have to wait, as we've still got a lot of ground to cover before I'm done with this crazy long
series on regular expressions.
&lt;/P&gt;
&lt;P&gt;
To summarize the story so far: we've defined alphabets and languages over those alphabets. 
We've shown how to create a special "regular expression" alphabet and language for any given alphabet. 
We've also come up with a rule that associates a language with a regular expression. This language is the
set of strings which "matches" the regular expression.
&lt;/P&gt;
&lt;P&gt;
We've determined that there are deterministic finite state computers which consume a string one character
at a time and can "recognize" the strings of some regular languages.  We've also determined that there are
"nondeterministic" finite automata that somehow "magically" figure out which rules to apply at any time to
match a string.
&lt;/P&gt;
&lt;P&gt;
We've also shown that any NFA that has rules that act on more than one character can be turned into an equivalent
NFA which only has single-character or no-character transitions.
&lt;/P&gt;
&lt;P&gt;
What we would like to do is show the following three facts: first, that every NFA is equivalent to a DFA. 
Second, that every DFA/NFA recognizes a regular language. And third, that every regular language has a DFA/NFA
that recognizes it.
&lt;/P&gt;
&lt;P&gt;
We've still got a ways to go to show that first result.  Rather than prove the general result, which will
be tedious, I'll just go through an example and hope that it is illustrative enough to convince you that
we could give this treatment to any NFA.
&lt;/P&gt;
&lt;P&gt;
 
Let M be an NFA with alphabet 
{&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;b&lt;/span&gt;}, states {1, 2, 3, 4}, start state is 1, and the set of acceptable states is {4}. 
We'll assume that we've already eliminated all the multi-character rules and have a set of rules as follows:
&lt;/P&gt;
&lt;P&gt;
(1, e) &amp;#x2192; 2&lt;br&gt;
(1, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; 3&lt;br&gt;
(2, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 1&lt;br&gt;
(2, e) &amp;#x2192; 3&lt;br&gt;
(2, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 4&lt;br&gt;
(3, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; 4
&lt;/P&gt;
&lt;P&gt;

This NFA accepts the language 
&lt;span class="code"&gt;a&lt;/span&gt;*(
&lt;span class="code"&gt;bb&lt;/span&gt;
&amp;#X222A;
&lt;span class="code"&gt;b&lt;/span&gt;
&amp;#X222A;
&lt;span class="code"&gt;a&lt;/span&gt;
), and as you can see, we've got lots of empty and ambiguous rules.
&lt;/P&gt;
&lt;P&gt;

Here's the trick:  given an NFA with 4 states like this one, we can find a DFA which is equivalent to this 
guy that has 2&lt;SUP&gt;4&lt;/SUP&gt; = 16 states or fewer.  
We can think of the NFA as at any time living in a "superstate" that consists of all of the states that it 
COULD be in right now.
&lt;/P&gt;
&lt;P&gt;
Let me try to explain that a little better.  
We start in state 1, right?  
But since 
(1, e) &amp;#x2192; 2 and
(2, e) &amp;#x2192; 3, we could also be starting in states 2 or 3 before we process any input.

Let's create a new automaton with a start state that represents the concept 
"right now M could be in states 1, 2, or 3."  
We'll call that state 123x, which is one big symbol, not a string of four symbols.  
In our new automaton, the start state is 123x.
&lt;/P&gt;
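&lt;P&gt;
Here is a minimal Python sketch of that "where could M be before reading anything?" computation. 
The names &lt;span class="code"&gt;RULES&lt;/span&gt; and &lt;span class="code"&gt;closure&lt;/span&gt; are my own 
inventions for illustration, and I'm assuming we spell the empty string e as an empty Python string.
&lt;/P&gt;
&lt;PRE&gt;
```python
# Rules of the example NFA M: (state, symbol) maps to the set of possible
# next states. The empty string "e" from the post is spelled "" here.
RULES = {
    (1, ""): {2},
    (1, "b"): {3},
    (2, "a"): {1, 4},
    (2, ""): {3},
    (3, "b"): {4},
}


def closure(states):
    """Return every state reachable from `states` using only empty-string rules."""
    reachable = set(states)
    pending = list(states)
    while pending:
        state = pending.pop()
        for nxt in RULES.get((state, ""), set()):
            if nxt not in reachable:
                reachable.add(nxt)
                pending.append(nxt)
    return frozenset(reachable)


# Starting from state 1 alone gives the superstate the post calls 123x.
print(sorted(closure({1})))
```
&lt;/PRE&gt;
&lt;P&gt;
A frozen set is a convenient stand-in for "one big symbol": it can be used as a dictionary key, 
just as 123x is used as a single state name.
&lt;/P&gt;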
&lt;P&gt;
We're trying to write a DFA here, so there needs to be a rule for every input and every state:
&lt;/P&gt;
&lt;P&gt;
(123x, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; ?&lt;br&gt;
(123x, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; ?
&lt;/P&gt;
&lt;P&gt;
Look at the original automaton.  
When we were in states 1, 2 or 3, what were the possible state transitions for 
&lt;span class="code"&gt;a&lt;/span&gt;?  
The only ones were 
(2, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 1
and 
(2, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 4.

But we're not finished! Again, we need to consider what could happen from states 1 and 4 if we processed the 
"empty string" transition rules.  
Since (1, e) &amp;#x2192; 2 and (2, e) &amp;#x2192; 3, we could end up in state 1, 2, 3 or 4.  Let's create a new state
called 1234 for our DFA and finish off that rule:
&lt;/P&gt;
&lt;P&gt;
(123x, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 1234
&lt;/P&gt;
&lt;P&gt;
Now do the same analysis for (123x, &lt;span class="code"&gt;b&lt;/span&gt;) and we'll see that in the original automaton,
the only possible resulting states starting in 1, 2 or 3, and processing a  
&lt;span class="code"&gt;b&lt;/span&gt; are 3 and 4.  Add another new state:
&lt;/P&gt;
&lt;P&gt;
(123x, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; xx34
&lt;/P&gt;
&lt;P&gt;
Are we done? No. We've added two more states, 1234 and xx34, so we need to figure out the state transitions for 
them too, which we do by the same process.  We discover that
&lt;/P&gt;
&lt;P&gt;
(1234, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 1234&lt;br&gt;
(1234, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; xx34&lt;br&gt;
(xx34, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; ?&lt;br&gt;
(xx34, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; xxx4
&lt;/P&gt;

&lt;P&gt;
Uh oh.  There are no state transitions in the original NFA that start in states 3 or 4 and take an 
&lt;span class="code"&gt;a&lt;/span&gt;.  Remember that in that case we assume that the NFA goes into a special
"reject" state.  Let's call the reject state xxxx.
&lt;/P&gt;
&lt;P&gt;
We now have two more new states to work out the transitions for. If we're in the reject state, we 
stay in the reject state, and we see from the original NFA that there are no transitions out of state
4, so we can round out our list with:

&lt;/P&gt;
&lt;P&gt;
(xx34, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; xxxx&lt;br&gt;
(xxxx, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; xxxx&lt;br&gt;
(xxxx, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; xxxx&lt;br&gt;
(xxx4, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; xxxx&lt;br&gt;
(xxx4, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; xxxx
&lt;/P&gt;
&lt;P&gt;

Since 4 was the acceptable state in the original NFA, any "superstate" that contains 4 must be an acceptable state.  
So our new DFA is a machine M&lt;SUB&gt;2&lt;/SUB&gt; with 

alphabet {&lt;span class="code"&gt;a&lt;/span&gt;, 
&lt;span class="code"&gt;b&lt;/span&gt;}, states {1234, 123x, xx34, xxx4, xxxx}, 
start state is 123x, and the set of acceptable states is {1234, xx34, xxx4}.  The rules are
&lt;/P&gt;
&lt;P&gt;
(1234, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 1234&lt;br&gt;
(1234, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; xx34&lt;br&gt;
(123x, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 1234&lt;br&gt;
(123x, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; xx34&lt;br&gt;
(xx34, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; xxxx&lt;br&gt;
(xx34, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; xxx4&lt;br&gt;
(xxx4, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; xxxx&lt;br&gt;
(xxx4, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; xxxx&lt;br&gt;
(xxxx, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; xxxx&lt;br&gt;
(xxxx, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; xxxx
&lt;/P&gt;

&lt;P&gt;
This DFA accepts the language 
&lt;span class="code"&gt;a&lt;/span&gt;*(
&lt;span class="code"&gt;bb&lt;/span&gt;
&amp;#X222A;
&lt;span class="code"&gt;b&lt;/span&gt;
&amp;#X222A;
&lt;span class="code"&gt;a&lt;/span&gt;
), so we've found an equivalent DFA to our NFA.
&lt;/P&gt;
&lt;P&gt;
We've turned an NFA with four states and six rules into an equivalent DFA with five states and ten rules.  
We did well -- you can find NFAs that require 2&lt;SUP&gt;n&lt;/SUP&gt; new states, where n is the number of NFA states.  
An NFA with a thousand states could require 2&lt;SUP&gt;1000&lt;/SUP&gt; states to represent as a DFA! 
2&lt;SUP&gt;1000&lt;/SUP&gt; is a finite, albeit rather large number.  
But it's not &lt;i&gt;that&lt;/i&gt; big. Clearly we could build a 2&lt;SUP&gt;1000&lt;/SUP&gt; state machine with only 1000 bits to 
store the state information.  
It's the &lt;i&gt;state transition rules&lt;/i&gt; that are the pain to work out!
&lt;/P&gt;
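&lt;P&gt;
The whole procedure above can be sketched in a few lines of Python. 
This is only an illustration under my own assumptions: the names &lt;span class="code"&gt;closure&lt;/span&gt;, 
&lt;span class="code"&gt;step&lt;/span&gt;, &lt;span class="code"&gt;build_dfa&lt;/span&gt; and 
&lt;span class="code"&gt;name&lt;/span&gt; are made up, the empty string e is spelled as an empty Python string, 
and a superstate is represented as a frozen set of NFA states.
&lt;/P&gt;
&lt;PRE&gt;
```python
ALPHABET = "ab"
START, ACCEPTING = 1, {4}
# Rules of the example NFA M; "" stands for the empty string e.
RULES = {
    (1, ""): {2},
    (1, "b"): {3},
    (2, "a"): {1, 4},
    (2, ""): {3},
    (3, "b"): {4},
}


def closure(states):
    """Add every state reachable through empty-string rules."""
    reachable = set(states)
    pending = list(states)
    while pending:
        state = pending.pop()
        for nxt in RULES.get((state, ""), set()):
            if nxt not in reachable:
                reachable.add(nxt)
                pending.append(nxt)
    return frozenset(reachable)


def step(superstate, symbol):
    """Superstate reached from `superstate` after consuming one symbol."""
    moved = set()
    for state in superstate:
        moved |= RULES.get((state, symbol), set())
    return closure(moved)


def build_dfa():
    """Work out rules for the start superstate and every superstate it can reach."""
    start = closure({START})
    seen = {start}
    pending = [start]
    table = {}
    while pending:
        current = pending.pop()
        for symbol in ALPHABET:
            target = step(current, symbol)
            table[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                pending.append(target)
    # Any superstate containing an accepting NFA state is accepting.
    accepting = {s for s in seen if not s.isdisjoint(ACCEPTING)}
    return start, seen, accepting, table


def name(superstate):
    """Render a superstate in the post's notation, e.g. 123x or xx34."""
    return "".join(str(n) if n in superstate else "x" for n in (1, 2, 3, 4))


start, states, accepting, table = build_dfa()
by_name = lambda item: (name(item[0][0]), item[0][1])
for (state, symbol), target in sorted(table.items(), key=by_name):
    print(f"({name(state)}, {symbol}) yields {name(target)}")
```
&lt;/PRE&gt;
&lt;P&gt;
Note that the sketch only ever creates superstates that are reachable from the start state, 
which is why we end up with five states here rather than all sixteen. The empty superstate plays the role of the reject state xxxx.
&lt;/P&gt;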
&lt;p&gt;
Using the techniques from this and the previous entry we can turn any NFA into an equivalent DFA, so NFAs are
really not magic at all. They're just a convenient way to talk about DFAs.
&lt;/P&gt; 
&lt;P&gt;
Now, obviously what I've done here isn't a proof; one example does not a proof make.  
But the proof is both tedious and complicated, so I'm going to skip it.  
The key result here is that we can stop saying "deterministic finite automaton" and 
"nondeterministic finite automaton" and just start saying "finite automaton", because they're essentially the 
same thing.
&lt;/P&gt;
&lt;P&gt;
Now that we know that we can use NFAs with impunity, we can try to answer questions such as:
&lt;/P&gt; 
&lt;P&gt;
Is every regular language recognizable by a finite automaton?
&lt;/P&gt;
&lt;P&gt;
Are there any FAs that recognize irregular languages?
&lt;/P&gt;
&lt;P&gt;
Tune in next time to find out!
&lt;/P&gt;
&lt;/DIV&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=515471" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part Eleven: Eliminating Multi-Symbol Rules</title><link>http://blogs.msdn.com/ericlippert/archive/2005/12/22/regular-expressions-from-scratch-part-eleven-eliminating-multi-symbol-rules.aspx</link><pubDate>Thu, 22 Dec 2005 18:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:504393</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>3</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/504393.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=504393</wfw:commentRss><description>
&lt;div class="mine"&gt;

&lt;P&gt;
The story so far: we have deterministic and nondeterministic finite automata.  
DFAs follow a rigid, fully specified set of rules which, on any string, yield either an accept 
or reject state after exactly as many moves as the string has characters.  
NFAs, on the other hand, are poorly specified.  
They somehow magically are able to choose which rules to apply given 
the input to find a path that yields an accept state, if one exists.
&lt;/P&gt;
&lt;P&gt;
NFA magic buys us a lot, but at the high price that we can now no longer easily see a 
clear relationship between this magic box and a computer we could actually build.  
We need to get back to reality.  
Here is the &lt;I&gt;amazing result&lt;/I&gt; that I'm going to try to justify in your minds:
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;&lt;I&gt;Every&lt;/I&gt; NFA is equivalent to some DFA. 
Even better: given a description of an NFA, we can construct a description of an equivalent DFA.
&lt;/B&gt;
&lt;/P&gt;
&lt;P&gt;
In other words, NFA magic doesn't buy us any extra power - no NFA can recognize a language 
that some DFA cannot also recognize.
&lt;/P&gt;
&lt;P&gt;
This is excellent news, because it means that we can describe and reason about machines using short, convenient, 
vague nondeterministic notation, but still have confidence that we could build a fully deterministic 
machine that did exactly the same job.
&lt;/P&gt;
&lt;P&gt;
But how the heck are we going to motivate that amazing result?
&lt;/P&gt;
&lt;P&gt;One step at a time.  
We're going to take some NFAs and make simple transformations that gradually turn them
into different but equivalent NFAs, and eventually one of those NFAs will actually be a DFA.
Rather than giving a full formal proof, which would be tedious and boring, I'm going to rely 
on the sketch being convincing enough that you believe me that we can take any NFA and turn it into a DFA.
Maybe later in this series we'll actually write some C# code that implements the transformation.
&lt;/P&gt;
&lt;P&gt;
The magical bits that we need to remove are: rules can have multiple characters in state transitions,
rules can have "empty" state transitions, rules can have ambiguous state transitions, there can be
situations for which no stated rule applies, and there's a magical 'crash' state. 
Once we eliminate all those weirdnesses, we'll be left with a DFA.
&lt;/P&gt;
&lt;P&gt;
Step one: Any NFA that has rules that have multiple-character state transitions is 
equivalent to some NFA that has only single-character or empty-string transitions. 
&lt;/P&gt;
&lt;P&gt;
We can get rid of the multiple-character transitions by adding one new state for each extra character.
Recall our previous example:
&lt;/P&gt;
&lt;P&gt;
Alphabet: S = {&lt;span class="code"&gt;a&lt;/span&gt;, &lt;span class="code"&gt;b&lt;/span&gt;}&lt;BR&gt;
States: K = {0}&lt;BR&gt;
Start: s = 0&lt;BR&gt;
Acceptable: F = {0}
&lt;/P&gt;
&lt;P&gt;Rules:&lt;/P&gt;
&lt;P&gt;
(0,&lt;span class="code"&gt;ab&lt;/span&gt;) &amp;#x2192; 0&lt;br&gt;
(0,&lt;span class="code"&gt;aba&lt;/span&gt;) &amp;#x2192; 0
&lt;/P&gt;
&lt;P&gt;
Suppose we added three new (non-acceptable) states, 1, 2 and 3.
You agree, I hope, that the first rule is equivalent to these two rules:
&lt;/P&gt;
&lt;P&gt;
(0,&lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 1&lt;br&gt;
(1,&lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; 0
&lt;/P&gt;
&lt;P&gt;
And the second rule is equivalent to these three rules:
&lt;/P&gt;
&lt;P&gt; 
(0,&lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 2&lt;br&gt;
(2,&lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; 3&lt;br&gt;
(3,&lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 0&lt;br&gt;
&lt;/P&gt;
&lt;P&gt;
We never need to add more than a finite number of new states or rules, so the new machine is always still an NFA.
&lt;/P&gt;
&lt;P&gt;
Throw away the original two rules and hey presto, we have an equivalent NFA where every rule only 
has a zero or one-symbol string.  
Multi-symbol transition rules are no longer 
magical; we can eliminate them easily.
&lt;/P&gt;
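&lt;P&gt;
As a rough sketch of step one, here is some Python that chops every multi-symbol rule into a chain of 
single-symbol rules. The function name &lt;span class="code"&gt;split_rules&lt;/span&gt; and the 
(state, string, new state) triple representation are assumptions I'm making for illustration; 
new states are simply numbered upwards from a number you supply.
&lt;/P&gt;
&lt;PRE&gt;
```python
from itertools import count


def split_rules(rules, first_new_state):
    """Replace each multi-symbol rule with a chain of single-symbol rules.

    `rules` is a list of (state, string, new_state) triples. One fresh,
    non-acceptable state is added for each extra character in a rule.
    """
    fresh = count(first_new_state)
    result = []
    for source, text, target in rules:
        if len(text) in (0, 1):
            # Empty and single-symbol rules are already fine.
            result.append((source, text, target))
            continue
        current = source
        for symbol in text[:-1]:
            nxt = next(fresh)
            result.append((current, symbol, nxt))
            current = nxt
        # The last symbol lands on the original target state.
        result.append((current, text[-1], target))
    return result


RULES = [(0, "ab", 0), (0, "aba", 0)]
for rule in split_rules(RULES, first_new_state=1):
    print(rule)
```
&lt;/PRE&gt;
&lt;P&gt;
Run on the example, this produces the same five rules and the same three new states 1, 2 and 3 that we worked out by hand.
&lt;/P&gt;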
&lt;P&gt;
This example above has no empty-string transitions.  
Next time we'll pick another example that has empty string transitions and one-character transitions, 
and then show that we can turn something that looks like that into a DFA.
&lt;/P&gt;
&lt;P&gt;
"Next time" will likely be in the new year, as I am heading back to my ancestral home for the next week to
visit friends and relatives during this festive holiday season.  
I hope all of you have a fun and safe holiday; thanks for reading and we'll see you in 2006.
&lt;/P&gt;

&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=504393" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part Ten: Magic!</title><link>http://blogs.msdn.com/ericlippert/archive/2005/12/19/regular-expressions-from-scratch-part-ten-magic.aspx</link><pubDate>Mon, 19 Dec 2005 18:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:502866</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/502866.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=502866</wfw:commentRss><description>
&lt;div class="mine"&gt;

&lt;P&gt;
Let's recap the story so far.
&lt;/P&gt;
&lt;P&gt;
Starting only with basic set theory, sequences, symbols and numbers, we've defined alphabets, languages,
regular expressions, and the mapping between regular expressions and regular languages.
We've also defined deterministic finite automata as machines that take in strings one character at a time, 
change their internal state to one of a finite set according to strict rules, 
and end up in either an accept or reject state as output.
&lt;/P&gt;
&lt;P&gt;
Where we're going with this is towards unification of these two concepts.
We want to show that for every DFA there's a regexp, and vice versa.
But to get there is going to take some magic.
&lt;/P&gt;

&lt;P&gt;
The trouble with DFAs is that they're kind of a pain to specify. 
You have an alphabet with, say, a hundred symbols, and you have, say, fifty states.
You need to come up with a state transition rule for 100 x 50 = 5000 possible combinations.
Since most of those will likely be transitions to rejection states anyway, that's majorly boring.
&lt;/P&gt;
&lt;P&gt;
I'm hereby declaring that we have &lt;B&gt;nondeterministic finite automata&lt;/B&gt;.
An NFA works just like a DFA, except that the "rules" for determining the state transitions can be 
&lt;B&gt;ambiguous and weird&lt;/B&gt;.
We'll write an NFA like this:
&lt;/P&gt;
&lt;P&gt;
Let M&lt;SUB&gt;1&lt;/SUB&gt; be an NFA such that:
&lt;/P&gt;
&lt;P&gt;
Alphabet: S = {&lt;span class="code"&gt;a&lt;/span&gt;, &lt;span class="code"&gt;b&lt;/span&gt;}&lt;BR&gt;
States: K = {0, 1, 2}&lt;BR&gt;
Start: s = 0&lt;BR&gt;
Acceptable: F = {0}
&lt;/P&gt;
&lt;P&gt;
Rules:
&lt;/P&gt;
&lt;P&gt;
(0,&lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 1&lt;br&gt;
(1,&lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; 2&lt;br&gt;
(2,&lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 0&lt;br&gt;
(2,e) &amp;#x2192; 0
&lt;/P&gt;

&lt;P&gt;
Notice that we're not specifying the six rules we ought to be, and one of those rules is a transition on an 
"empty input"!
This last rule means that if we are in state 2, we can go to state 0 "for free", without consuming any characters.
&lt;/P&gt;
&lt;P&gt;We say that M&lt;SUB&gt;1&lt;/SUB&gt; accepts a string if there is &lt;I&gt;any&lt;/I&gt; way to make it eventually 
yield an acceptable state with no more input.
For example, suppose we start with &lt;span class="code"&gt;ababa&lt;/span&gt;:
&lt;/P&gt;
&lt;P&gt;
(0, &lt;span class="code"&gt;ababa&lt;/span&gt;) &amp;#x2192; 
(1, &lt;span class="code"&gt;baba&lt;/span&gt;) &amp;#x2192;
(2, &lt;span class="code"&gt;aba&lt;/span&gt;) &amp;#x2192;
(0, &lt;span class="code"&gt;aba&lt;/span&gt;) &amp;#x2192;
(1, &lt;span class="code"&gt;ba&lt;/span&gt;) &amp;#x2192;
(2, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192;
(0, e) 
&lt;/P&gt;
&lt;P&gt;is one path to an acceptable state. Even though &lt;/P&gt;
&lt;P&gt;
(0, &lt;span class="code"&gt;ababa&lt;/span&gt;) &amp;#x2192; 
(1, &lt;span class="code"&gt;baba&lt;/span&gt;) &amp;#x2192;
(2, &lt;span class="code"&gt;aba&lt;/span&gt;) &amp;#x2192;
(0, &lt;span class="code"&gt;ba&lt;/span&gt;) 
gets us stuck in a state that we can go no further in, that's OK.
NFAs are magic.
If there exists &lt;I&gt;any&lt;/I&gt; path such that 
(0, &lt;span class="code"&gt;ababa&lt;/span&gt;) &amp;#x21D2; (0, e)
then the NFA will find it and end in an accepting state.
&lt;B&gt;The fact that some nonsensical or rejecting paths exist doesn't matter if an accepting path exists.&lt;/B&gt; 
Only if every possible path either ends in a rejecting state, or leaves us unable to find any applicable rule,
does the NFA reject the string.
&lt;/P&gt;
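&lt;P&gt;
One non-magical way to read "if any path exists, the NFA finds it" is as an exhaustive search over 
(state, remaining input) pairs. Here is a small Python sketch of that reading for M&lt;SUB&gt;1&lt;/SUB&gt;. 
The name &lt;span class="code"&gt;accepts&lt;/span&gt; and the dictionary layout are assumptions of mine, 
and the empty input e is spelled as an empty Python string.
&lt;/P&gt;
&lt;PRE&gt;
```python
# M1 from the post; "" stands for the empty-input rule (2, e) yields 0.
RULES = {
    (0, "a"): {1},
    (1, "b"): {2},
    (2, "a"): {0},
    (2, ""): {0},
}
START, ACCEPTING = 0, {0}


def accepts(text):
    """True if some sequence of rule choices consumes all of `text`
    and ends in an acceptable state."""
    seen = set()
    pending = [(START, text)]
    while pending:
        state, rest = pending.pop()
        if (state, rest) in seen:
            continue  # already explored; also guards against empty-rule loops
        seen.add((state, rest))
        if rest == "" and state in ACCEPTING:
            return True
        # Rules that consume nothing.
        for nxt in RULES.get((state, ""), set()):
            pending.append((nxt, rest))
        # Rules that consume the next character, if there is one.
        if rest:
            for nxt in RULES.get((state, rest[0]), set()):
                pending.append((nxt, rest[1:]))
    # Every path got stuck or ended in a rejecting state.
    return False


print(accepts("ababa"), accepts("abb"))
```
&lt;/PRE&gt;
&lt;P&gt;
The stuck path ending at (0, &lt;span class="code"&gt;ba&lt;/span&gt;) is explored and silently abandoned; 
only when every path is abandoned does the search report a rejection.
&lt;/P&gt;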
&lt;P&gt;
In fact, this machine accepts the regular language 
&lt;span class="code"&gt;(ab&amp;#x222A;aba)*&lt;/span&gt;
&lt;/P&gt;
&lt;p&gt;
(Recall that I am being sloppy about parenthesizing. 
This isn't a "real" regular expression, but I'm sure you get the point: when I write that, I mean 
L(&lt;span class="code"&gt;((ab)&amp;#x222A;((ab)a))*&lt;/span&gt;).)
&lt;/P&gt;
&lt;P&gt;
We've got lots of magic already, so let's add more.
We'll say that a rule can consume any number of matching symbols that it wants.
&lt;/P&gt;
&lt;P&gt;
Let M&lt;sub&gt;2&lt;/sub&gt; be an NFA such that:&lt;/P&gt;
&lt;P&gt;
Alphabet: S = {&lt;span class="code"&gt;a&lt;/span&gt;, &lt;span class="code"&gt;b&lt;/span&gt;}&lt;BR&gt;
States: K = {0}&lt;BR&gt;
Start: s = 0&lt;BR&gt;
Acceptable: F = {0}
&lt;/P&gt;
&lt;P&gt;Rules:&lt;/P&gt;
&lt;P&gt;
(0,&lt;span class="code"&gt;ab&lt;/span&gt;) &amp;#x2192; 0&lt;br&gt;
(0,&lt;span class="code"&gt;aba&lt;/span&gt;) &amp;#x2192; 0
&lt;/P&gt;
&lt;P&gt;
This machine clearly accepts 
&lt;span class="code"&gt;(ab&amp;#x222A;aba)*&lt;/span&gt;
except for the fact that it is now totally &lt;i&gt;unclear&lt;/i&gt; what we mean by "accepts" vs. "rejects" in
a machine with only one state! How does it reject anything?
&lt;/P&gt;
&lt;P&gt;
Let's assume that NFAs have a magic implicit "crash" state which means "For this input I was unable to find any
path that did not end in a state where I lacked a rule telling me what to do next."  
The crash state is always a rejecting state.
&lt;/P&gt;
&lt;P&gt;
Clearly M&lt;sub&gt;1&lt;/sub&gt; and M&lt;sub&gt;2&lt;/sub&gt; accept the same language.
Let's make up a word for that relationship.
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 15:&lt;/B&gt; Two machines are &lt;B&gt;equivalent&lt;/B&gt; if they both accept exactly the same language.
&lt;/P&gt;
&lt;P&gt;
Next time: Can we in fact build a device which does magic?
&lt;/P&gt;
&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=502866" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part Nine: A Dream of a Machine</title><link>http://blogs.msdn.com/ericlippert/archive/2005/12/15/regular-expressions-from-scratch-part-nine-a-dream-of-a-machine.aspx</link><pubDate>Thu, 15 Dec 2005 18:28:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:501683</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>8</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/501683.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=501683</wfw:commentRss><description>&lt;DIV class="mine"&gt;

&lt;P&gt;I want to come up with the &lt;I&gt;simplest possible device&lt;/I&gt; that can identify whether a given string is a 
member of a given regular language. 
We need some kind of computer, but to make it easy to analyze, I want to put as many restrictions upon it
as possible.
For example, I want there to be very limited memory storage, I want to process only one character at a time, and so on.
&lt;/P&gt;
&lt;P&gt;&lt;B&gt;Definition 14&lt;/B&gt;: A &lt;B&gt;deterministic finite automaton&lt;/B&gt; (DFA) is an idealized machine. 
It has an input &lt;B&gt;tape&lt;/B&gt; with a finite string written down on it in a given alphabet.
The tape is exactly as long as the string. 
The DFA has a finite number of distinct &lt;B&gt;states&lt;/B&gt; that it can be in. 
A DFA can be in only one state at a time, called the &lt;B&gt;current state&lt;/B&gt;. 
The DFA always starts in a particular state called the &lt;B&gt;start state&lt;/B&gt;. 
The DFA reads the string from left to right, one symbol at a time.
There is no backtracking allowed.
Each new symbol on the input tape may cause the machine to change the current state, 
but this choice can only be based on the current symbol and current state, and the
rules for how that choice is made must be specified in advance and cannot change.
Every state is either an &lt;B&gt;acceptable state&lt;/B&gt; or an &lt;B&gt;unacceptable state&lt;/B&gt;. 
When the machine is done reading the string, if it is in an acceptable state then the 
machine &lt;B&gt;accepts&lt;/B&gt; the string, otherwise it &lt;B&gt;rejects&lt;/B&gt; it.
&lt;/P&gt;
&lt;P&gt;
I think you'll agree that this is a very limiting set of restrictions to place upon a computer.
&lt;/P&gt;
&lt;P&gt;
We'll write down DFAs like this example:
&lt;/P&gt;
&lt;P&gt;Let M be a DFA such that:&lt;BR&gt;
Alphabet: S = {&lt;span class="code"&gt;a&lt;/span&gt;, &lt;span class="code"&gt;b&lt;/span&gt;}&lt;BR&gt;
States: K = {0, 1}&lt;BR&gt;
Start: s = 0&lt;BR&gt;
Acceptable: F = {1}
&lt;/P&gt;

&lt;P&gt;
Every DFA has rules for determining what the state transitions are, one rule for every possible combination of states and symbols. 
For example, M might have rules:
&lt;/P&gt;
&lt;P&gt;
&lt;TABLE FRAME="box"&gt;
&lt;THEAD&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="mine"&gt;current state&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;current input&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;new state&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="mine"&gt;0&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;a&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;1&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="mine"&gt;0&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;b&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;1&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="mine"&gt;1&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;a&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;0&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="mine"&gt;1&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;b&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;1&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TABLE&gt;
&lt;/P&gt;
&lt;P&gt;
Note that there MUST be a rule for every combination of K and S.
Since it will get cumbersome to write out these tables constantly I'm going to declare a new notation:
&lt;/P&gt;
&lt;P&gt;
(0, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 1&lt;BR&gt;
(0, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; 1&lt;BR&gt;
(1, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 0&lt;BR&gt;
(1, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; 1&lt;BR&gt;
&lt;/P&gt;
&lt;P&gt;
The arrow is read as "yields".
&lt;/P&gt;
&lt;P&gt;
Consider the action of M on a tape containing &lt;span class="code"&gt;aab&lt;/span&gt;.
The machine uses these rules in this order:
&lt;/P&gt;
&lt;P&gt;
(0, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 1&lt;BR&gt;
(1, &lt;span class="code"&gt;a&lt;/span&gt;) &amp;#x2192; 0&lt;BR&gt;
(0, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; 1&lt;BR&gt;
&lt;/P&gt;
&lt;P&gt;
The string is accepted because we've run out of string and we're in an acceptable state.
&lt;/P&gt;
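&lt;P&gt;
To make the mechanical nature of this concrete, here is a minimal Python sketch of M. 
The names &lt;span class="code"&gt;RULES&lt;/span&gt;, &lt;span class="code"&gt;run&lt;/span&gt; and 
&lt;span class="code"&gt;accepts&lt;/span&gt; are hypothetical; the table itself is copied from the rules above.
&lt;/P&gt;
&lt;PRE&gt;
```python
# The four rules of M: exactly one entry per (state, symbol) combination.
RULES = {
    (0, "a"): 1,
    (0, "b"): 1,
    (1, "a"): 0,
    (1, "b"): 1,
}
START, ACCEPTING = 0, {1}


def run(text):
    """Feed the tape to M one symbol at a time and return the final state."""
    state = START
    for symbol in text:
        state = RULES[(state, symbol)]
    return state


def accepts(text):
    """M accepts if it is in an acceptable state when the tape runs out."""
    return run(text) in ACCEPTING


print(run("aab"), accepts("aab"))
```
&lt;/PRE&gt;
&lt;P&gt;
The only thing the loop remembers between symbols is the single variable holding the current state, 
which is the whole point of the restrictions in Definition 14.
&lt;/P&gt;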

&lt;P&gt;
Since any set of strings is a language, the set of strings which a machine M accepts forms a language.
We can construct a finite sequence of sets (alphabet, states, transition rules, etc) that exactly characterizes M, 
so we can build a function L which maps M onto the language which M accepts. 
Call the language L(M).
&lt;/P&gt;
&lt;P&gt;
Convince yourself that in the case above, L(M) = 
{ (&lt;span class="code"&gt;a&lt;/span&gt;(&lt;span class="code"&gt;aa&lt;/span&gt;)*)
&amp;#x222A;
(((&lt;span class="code"&gt;a&lt;/span&gt; &amp;#x222A; 
&lt;span class="code"&gt;b&lt;/span&gt;)*&lt;span class="code"&gt;b&lt;/span&gt;)
(&lt;span class="code"&gt;aa&lt;/span&gt;)*) }

 - 
that is, the regular language where every string is either an odd number of 
&lt;span class="code"&gt;a&lt;/span&gt;s or any string ending in a
&lt;span class="code"&gt;b&lt;/span&gt; followed by an even number of 
&lt;span class="code"&gt;a&lt;/span&gt;s.
&lt;/P&gt;
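&lt;P&gt;
If you'd rather be convinced by brute force, here is a hedged Python sketch that compares M against that 
expression on every short string. I'm assuming Python's &lt;span class="code"&gt;re&lt;/span&gt; module as a 
stand-in for our regular expressions, writing the union as a vertical bar; the helper name 
&lt;span class="code"&gt;mismatches&lt;/span&gt; is made up. Agreement on short strings is evidence, not a proof.
&lt;/P&gt;
&lt;PRE&gt;
```python
import re
from itertools import product

RULES = {(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}
START, ACCEPTING = 0, {1}

# The expression from the post, with union written as | in Python syntax.
PATTERN = re.compile(r"a(aa)*|(a|b)*b(aa)*")


def accepts(text):
    """Run the DFA M over `text` and report whether it ends in an acceptable state."""
    state = START
    for symbol in text:
        state = RULES[(state, symbol)]
    return state in ACCEPTING


def mismatches(max_length):
    """List every string over {a, b} up to `max_length` where M and the pattern disagree."""
    bad = []
    for length in range(max_length + 1):
        for letters in product("ab", repeat=length):
            text = "".join(letters)
            if accepts(text) != bool(PATTERN.fullmatch(text)):
                bad.append(text)
    return bad


print(mismatches(10))
```
&lt;/PRE&gt;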

&lt;P&gt;
This idea of listing the rules used by the machine is pretty good, but it requires
us to mentally keep track of how far along the input tape the machine is.
Also, there's some redundancy there, since obviously the state of the next line is
going to be the same as the new state of the current line.  
I'm therefore going to change the notation I use to describe how a machine is working.
For the example above we'd say in our new notation&lt;/P&gt;
&lt;P&gt;
(0, &lt;span class="code"&gt;aab&lt;/span&gt;) &amp;#x2192; 
(1, &lt;span class="code"&gt;ab&lt;/span&gt;) &amp;#x2192; 
(0, &lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x2192; 
(1, e)
&lt;/P&gt;
&lt;P&gt;
Clearly this is the very definition of a mechanical process. 
It's going to get tedious to write out all of the steps in more complex machines.
Let me declare a new "super arrow", which means "yields eventually".
&lt;/P&gt;
&lt;P&gt;
(0, &lt;span class="code"&gt;aab&lt;/span&gt;) &amp;#x21D2; 
(1, e)
&lt;/P&gt;
&lt;P&gt;
I'll use the super arrow to mean "yields in zero or more steps".
Maybe one, maybe a thousand, but some finite number of steps.
&lt;/P&gt;
&lt;P&gt;
What's the point of a DFA?
It's pretty much the simplest thing we can possibly call a computer.
It's got input and output and storage, but is very limited.
The input, sure, it can be as long or as short as you want.
But the output and the storage consists of a single "register" which can only hold one of a finite number of states.
In our example today, we've got only a &lt;i&gt;single bit&lt;/i&gt; of storage and four rules, and we can already accept a fairly complex
language.
&lt;/P&gt;
&lt;P&gt;
If we can reason about the limits of a DFA then we can determine whether a machine with finite storage 
is buff enough to recognize any interesting languages. 
Today's machine recognizes a regular language, and that's a start.
&lt;/P&gt;
&lt;P&gt;
Next time, we'll add a little magic to a DFA.
&lt;/P&gt;

&lt;/div&gt;
&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=501683" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part Eight: The Diagonal Argument</title><link>http://blogs.msdn.com/ericlippert/archive/2005/12/12/regular-expressions-from-scratch-part-eight-the-diagonal-argument.aspx</link><pubDate>Mon, 12 Dec 2005 18:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:499587</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/499587.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=499587</wfw:commentRss><description>&lt;div class="mine"&gt;

&lt;P&gt;
As we know, each regular expression is associated with a language by our function L. 
We also determined last time that we could list all members of S* and R in order first by length and then
by alphabetical order.
Here's a weird question: is the nth string in S*'s alphabetical ordering a member of the language 
associated with the nth alphabetical regular expression?
&lt;/P&gt;
&lt;P&gt;
Let's make a table and see if we can discern any pattern.
&lt;/P&gt;
&lt;P&gt;
&lt;TABLE FRAME="box"&gt;
&lt;THEAD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;s&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;r&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;s in L(r)?&lt;/span&gt;&lt;/TD&gt;
&lt;/THEAD&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="mine"&gt;e&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;a&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;no&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="code"&gt;a&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;b&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;no&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="code"&gt;b&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;&amp;#x00D8;&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;no&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="code"&gt;aa&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;a*&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;yes&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="code"&gt;ab&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;b*&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;no&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="code"&gt;ba&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;&amp;#x00D8;*&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;no&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="code"&gt;bb&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;a**&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;no&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="code"&gt;aaa&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;b**&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;no&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="code"&gt;aab&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;&amp;#x00D8;**&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;no&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;span class="code"&gt;aba&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="code"&gt;(aa)&lt;/span&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;span class="mine"&gt;no&lt;/span&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TABLE&gt;
&lt;/p&gt;
&lt;P&gt;
We could continue this table forever, of course, and it seems that sometimes, pretty much at random, 
we're going to get a match.
Matches might not happen very often, but they will happen from time to time.
As it turns out, it doesn't really matter whether we can come up with some pattern here.
All that matters is that we can answer the question definitively for every possible entry.
&lt;/P&gt;
&lt;P&gt;
Now define the language D as follows:
D = { w in S* such that w has "no" in the third column of the table above }
&lt;/P&gt;
&lt;P&gt;
D = {e, 
&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;b&lt;/span&gt;,
&lt;span class="code"&gt;ab&lt;/span&gt;,
&lt;span class="code"&gt;ba&lt;/span&gt;,
&lt;span class="code"&gt;bb&lt;/span&gt;,
&lt;span class="code"&gt;aaa&lt;/span&gt;,
&lt;span class="code"&gt;aab&lt;/span&gt;,
&lt;span class="code"&gt;aba&lt;/span&gt;,
&lt;span class="code"&gt;abb&lt;/span&gt;, &amp;#x2026; }

&lt;/P&gt;
&lt;P&gt;This is a weird but perfectly well-defined property.
Given &lt;I&gt;any&lt;/I&gt; string we can determine very quickly whether it is a member of D or not. 
Just find its position n in the table, compute the nth regular expression, 
and see if the string is a member of that regular language. If it is not, then it is in D.
&lt;/P&gt;
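&lt;P&gt;
The procedure just described can be sketched in a few lines of Python. 
This is only a sketch under stated assumptions, not something from the series itself: the helper names 
(strings_in_order, split, is_regex, matches, in_D) are hypothetical names of my own, 
the symbol ordering is the arbitrary one picked last time, and the brute-force matcher simply stands in for 
the abstract machines we have not built yet. It assumes the argument to in_D is a string over S; 
for anything else the search never terminates.
&lt;/P&gt;
&lt;PRE&gt;
```python
from itertools import count, islice, product

S = "ab"                               # the alphabet
EMPTY, UNION = "\u00d8", "\u222a"      # the empty-language and union symbols
ORDER = "*()" + S + EMPTY + UNION      # arbitrary ordering of the symbols of R

def strings_in_order(alphabet):
    """Yield all strings over alphabet, by length and then alphabetically."""
    for length in count(0):
        for symbols in product(alphabet, repeat=length):
            yield "".join(symbols)

def split(r):
    """For r of the form (uw) or (u UNION w), return (u, w, is_union); else None."""
    if r.startswith("(") and r.endswith(")"):
        inner = r[1:-1]
        for i in range(1, len(inner)):
            u, rest = inner[:i], inner[i:]
            is_union = rest.startswith(UNION)
            w = rest[1:] if is_union else rest
            if is_regex(u) and is_regex(w):
                return u, w, is_union
    return None

def is_regex(r):
    """True if r is a well-formed member of R."""
    if len(r) == 1:
        return r == EMPTY or r in S
    if r.endswith("*"):
        return is_regex(r[:-1])
    return split(r) is not None

def matches(r, w):
    """True if the string w is in L(r); r must be a member of R."""
    if len(r) == 1:
        return r != EMPTY and w == r
    if r.endswith("*"):
        # Either w is empty, or a non-empty prefix matches the starred part
        # and the remainder matches the whole starred expression again.
        return w == "" or any(
            matches(r[:-1], w[:k]) and matches(r, w[k:])
            for k in range(1, len(w) + 1))
    u, v, is_union = split(r)
    if is_union:
        return matches(u, w) or matches(v, w)
    return any(matches(u, w[:k]) and matches(v, w[k:])
               for k in range(len(w) + 1))

def in_D(w):
    """True if w is NOT matched by the regular expression in its own row."""
    regexes = (r for r in strings_in_order(ORDER) if is_regex(r))
    for s, r in zip(strings_in_order(S), regexes):
        if s == w:
            return not matches(r, w)

# The members of D among the first eleven strings of S*.
print([w for w in islice(strings_in_order(S), 11) if in_D(w)])
```
&lt;/PRE&gt;
&lt;P&gt;
The empty string e shows up as '' in the printed list. 
The matcher tries every way of cutting the string in two, which is hopelessly slow for long strings 
but is enough to show that the question always has a definite answer.
&lt;/P&gt;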
&lt;P&gt;
Is there any regular expression that specifies the language D?
&lt;/P&gt;
&lt;P&gt;
Which one would it be?
We've got a &lt;I&gt;complete alphabetical list&lt;/I&gt; of regular expressions, so let's just go down the list.
By our definition of D, clearly it can't be &lt;I&gt;any&lt;/I&gt; of them. 
Look at the nth row.
If the third column is "yes" then the nth regexp matches a string not in D. 
If it is "no" then the nth regexp fails to match a string in D. 
Either way the language of the nth regexp differs from D, so no regular expression describes exactly the language D. 
Therefore D is not a regular language.
&lt;/P&gt;
&lt;P&gt;
By a similar argument we can show that for every nonempty alphabet there exists a language which is not regular.
&lt;/P&gt;
&lt;P&gt;
That's a pretty unlikely example of an irregular language though.
Later we'll see that many perfectly normal languages are not regular, including, ironically enough, 
the regular expression language itself.
&lt;/P&gt;
&lt;P&gt;
Remember that the reason why we came up with regular expressions in the first place is because
we wanted a concise way to characterize languages using short, simple expressions.  
We of course could come up with other mechanical means to cleverly map between strings in one 
alphabet and languages in another, but ultimately it wouldn't do us much good if our aim is to
capture all possible languages. 
The diagonal argument above doesn't depend upon any special features of regular expressions; 
it applies to &lt;i&gt;any&lt;/i&gt; function that maps strings onto languages.
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;No language description system which maps finite strings onto languages can describe every language.&lt;/B&gt;
&lt;/P&gt;
&lt;P&gt;This is a stunning result; there are languages which we cannot characterize in any finite number 
of symbols, no matter how clever we are with our symbolic manipulations! 
&lt;/P&gt;
&lt;P&gt;
It's simply a fundamental fact that the definition of "language" we've chosen - 
an arbitrary set of finite strings - affords an immense number of possible results, 
more immense than the number of strings in any description language you'd care to name. 
Some - infinitely many - will be indefinable.
&lt;/P&gt;
&lt;P&gt;
You might have noticed a bit of a hand-wave in today's entry: I've made the claim that we can easily
take a regular expression and determine if a string is in its regular language. 
Over the next few entries we'll justify that claim by building some simple abstract machines that can
make this determination.
&lt;/P&gt;

&lt;/DIV&gt;
&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=499587" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part Seven: Listing All Members Of A Language In Order</title><link>http://blogs.msdn.com/ericlippert/archive/2005/12/08/regular-expressions-from-scratch-part-seven-listing-all-members-of-a-language-in-order.aspx</link><pubDate>Thu, 08 Dec 2005 18:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:499573</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>15</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/499573.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=499573</wfw:commentRss><description>&lt;div class="mine"&gt;

&lt;P&gt;
Regular languages are by definition those languages which can be described by a regular expression. 
It should be clear from the definition that the union of any finite number of regular languages is a 
regular language, the concatenation of any finite number of regular languages is a regular language, 
and the Kleene Closure of any regular language is a regular language.
&lt;/P&gt;
&lt;P&gt;
That's rather a lot of languages. Are &lt;I&gt;all&lt;/I&gt; languages regular languages?
&lt;/P&gt;
&lt;P&gt;
We should have a strong intuition that regular languages are a very limited subset of all languages.
It seems like it would be hard to come up with a finite regular expression that, say, matches the language 
{w is in &lt;span class="code"&gt;1&lt;/span&gt;* such that there are a prime number of 
&lt;span class="code"&gt;1&lt;/span&gt;s in w}.
&lt;/P&gt;
&lt;P&gt;
But we can do better than intuition.
There are many interesting ways to prove that there exists at least one non-regular language.
Over the next couple entries we'll find a non-regular language by using a clever trick. 
The first thing we'll need to do is enumerate every member of a language.
&lt;/P&gt;
&lt;P&gt;
It doesn't really matter how we enumerate the language, as long as we guarantee that we eventually
hit every member of the language. 
Here's one way to do such an enumeration.
&lt;/P&gt;
&lt;P&gt;
Consider an alphabet, say S = {&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;b&lt;/span&gt;}.
We can enumerate all the strings of S* first by length and then by alphabetical order.
Of course, we'll need to define an order for our alphabet, but since alphabets are by definition finite
sets, we can choose any old order. 
In general, for alphabets consisting of Roman alphabet characters and numbers we'll use the standard
alphabetical ordering we're all used to.
We'll start with the one zero-length string, then the two one-symbol strings, then the four two-symbol strings, and so on:
&lt;/P&gt;
&lt;P&gt;
e,
&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;b&lt;/span&gt;,
&lt;span class="code"&gt;aa&lt;/span&gt;,
&lt;span class="code"&gt;ab&lt;/span&gt;,
&lt;span class="code"&gt;ba&lt;/span&gt;,
&lt;span class="code"&gt;bb&lt;/span&gt;,
&lt;span class="code"&gt;aaa&lt;/span&gt;,
&lt;span class="code"&gt;aab&lt;/span&gt;,
&lt;span class="code"&gt;aba&lt;/span&gt;,
&lt;span class="code"&gt;abb&lt;/span&gt;, &amp;#x2026;
&lt;/P&gt;
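&lt;P&gt;
For the programmers in the audience, here is a small Python sketch of that enumeration. 
The function name strings_in_order is just a name I made up for illustration, 
and the empty string e is represented by Python's empty string.
&lt;/P&gt;
&lt;PRE&gt;
```python
from itertools import count, islice, product

def strings_in_order(alphabet):
    """Yield every string over alphabet, shortest first, then alphabetically."""
    for length in count(0):
        # product() yields tuples in the lexicographic order of its input,
        # so the order of symbols in `alphabet` is the alphabetical order.
        for symbols in product(alphabet, repeat=length):
            yield "".join(symbols)

first_eleven = list(islice(strings_in_order("ab"), 11))
print(first_eleven)
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa', 'aab', 'aba', 'abb']
```
&lt;/PRE&gt;
&lt;P&gt;
The generator never finishes, since S* is infinite, but any particular string turns up after finitely many steps, 
which is all the coming argument needs.
&lt;/P&gt;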

&lt;P&gt;
(A commenter correctly pointed out that there are languages which cannot be easily enumerated in this 
way, but my coming argument does not rely upon an alphabetical ordering.  
All we need is some schema for enumerating a language which guarantees that we eventually reach every member.)
&lt;/P&gt;
&lt;P&gt;
Consider R, the regular expression language of S. 
We can enumerate it too, first by length and then by alphabetical order. 
Let's say that the alphabetical order of the regular expression alphabet is 
&lt;span class="code"&gt;* ( ) a b &amp;#x00D8; &amp;#x222A;&lt;/span&gt;
just to pick an arbitrary ordering for the symbols. 
We can then enumerate R in the same way like this:
&lt;/P&gt;
&lt;P&gt;
&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;b&lt;/span&gt;,
&lt;span class="code"&gt;&amp;#x00D8;&lt;/span&gt;,
&lt;span class="code"&gt;a*&lt;/span&gt;,
&lt;span class="code"&gt;b*&lt;/span&gt;,
&lt;span class="code"&gt;&amp;#x00D8;*&lt;/span&gt;,
&lt;span class="code"&gt;a**&lt;/span&gt;,
&lt;span class="code"&gt;b**&lt;/span&gt;,
&lt;span class="code"&gt;&amp;#x00D8;**&lt;/span&gt;,
&lt;span class="code"&gt;(aa)&lt;/span&gt;,
&lt;span class="code"&gt;(ab)&lt;/span&gt;,
&lt;span class="code"&gt;(a&amp;#x00D8;)&lt;/span&gt;,
&lt;span class="code"&gt;(ba)&lt;/span&gt;,
&lt;span class="code"&gt;(bb)&lt;/span&gt;,
&lt;span class="code"&gt;(b&amp;#x00D8;)&lt;/span&gt;,
&lt;span class="code"&gt;(&amp;#x00D8;a)&lt;/span&gt;,
&lt;span class="code"&gt;(&amp;#x00D8;b)&lt;/span&gt;,
&lt;span class="code"&gt;(&amp;#x00D8;&amp;#x00D8;)&lt;/span&gt;, &amp;#x2026;
&lt;/P&gt;
&lt;P&gt;
Of course we are leaving out any strings which are not in R, such as
&lt;span class="code"&gt;*&lt;/span&gt; or 
&lt;span class="code"&gt;))(&lt;/span&gt;.
&lt;/P&gt;
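&lt;P&gt;
One way to sketch this in Python is to enumerate every string over the regular expression alphabet in order 
and throw away the ones that are not in R. The names is_regex and regexes_in_order are hypothetical, 
and the brute-force recognizer is my own assumption about how to check well-formedness; 
it follows the rules of the definition directly and makes no attempt to be fast.
&lt;/P&gt;
&lt;PRE&gt;
```python
from itertools import count, islice, product

S = "ab"                               # the alphabet
EMPTY, UNION = "\u00d8", "\u222a"      # the empty-language and union symbols
ORDER = "*()" + S + EMPTY + UNION      # the arbitrary ordering chosen above

def is_regex(r):
    """True if r is in R, the regular expression language of S."""
    if len(r) == 1:
        return r == EMPTY or r in S
    if r.endswith("*"):
        return is_regex(r[:-1])
    if r.startswith("(") and r.endswith(")"):
        inner = r[1:-1]
        # Try every way of cutting the inside into u and w, or u UNION w.
        for i in range(1, len(inner)):
            u, w = inner[:i], inner[i:]
            if w.startswith(UNION):
                w = w[1:]
            if is_regex(u) and is_regex(w):
                return True
    return False

def regexes_in_order():
    """Yield the members of R by length, then in the order given by ORDER."""
    for length in count(1):
        for symbols in product(ORDER, repeat=length):
            candidate = "".join(symbols)
            if is_regex(candidate):
                yield candidate

print(list(islice(regexes_in_order(), 18)))
```
&lt;/PRE&gt;
&lt;P&gt;
The first eighteen results are the eighteen expressions listed above, in the same order.
&lt;/P&gt;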
&lt;P&gt;
Next time we'll use the fact that we can make both these lists to show that a non-regular language exists.
&lt;/P&gt;

&lt;/div&gt;


&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=499573" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part Six: The Insanely Clever Bit</title><link>http://blogs.msdn.com/ericlippert/archive/2005/12/05/regular-expressions-from-scratch-part-six-the-insanely-clever-bit.aspx</link><pubDate>Mon, 05 Dec 2005 18:10:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:497610</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>5</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/497610.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=497610</wfw:commentRss><description>&lt;div class="mine"&gt;
&lt;P&gt;
Let's start with an easy one today, because things are about to get a little tricky.
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 10&lt;/B&gt;: a &lt;B&gt;pair&lt;/B&gt; is a finite sequence with exactly two members.
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 11&lt;/B&gt;: a &lt;B&gt;function&lt;/B&gt; is a set of pairs, where the second member of each 
pair is the value associated with the first member by the function.
If F is a function { (a1, b1), (a2, b2) } then we might write F(a1) = b1, F(a2) = b2.
(Again, we will figure out from context whether parentheses mean function mapping, grouping or a finite sequence.)
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 12&lt;/B&gt;: Take any alphabet S.
The &lt;B&gt;regular expression language generating function&lt;/B&gt; for S is a function where the 
first member of the pair is a string in the regular expression language of S, 
and the second member is a language over S.
&lt;/P&gt;
&lt;P&gt;
Let's call the function L and the regular expression language R. The pairs in this set go like this.
&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
The string &lt;span class="code"&gt;&amp;#x00D8;&lt;/span&gt; is paired with the empty language. 
Or, in our new notation, L(&lt;span class="code"&gt;&amp;#x00D8;&lt;/span&gt;) = {}
&lt;/LI&gt;
&lt;LI&gt;
The strings in R that are just single members of S are paired with the languages that contain just 
that one single-symbol string. For example, if 
S = {&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;b&lt;/span&gt;} then  
L(&lt;span class="code"&gt;a&lt;/span&gt;)  
= {&lt;span class="code"&gt;a&lt;/span&gt;} and 
L(&lt;span class="code"&gt;b&lt;/span&gt;)  
= {&lt;span class="code"&gt;b&lt;/span&gt;}.
&lt;/LI&gt;
&lt;LI&gt;
If x and w are in R then 
L(&lt;span class="code"&gt;(&lt;/span&gt;xw&lt;span class="code"&gt;)&lt;/span&gt;) = L(x)L(w)
&lt;/LI&gt;
&lt;LI&gt;
Similarly, L(&lt;span class="code"&gt;(&lt;/span&gt;x&lt;span class="code"&gt;&amp;#x222A;&lt;/span&gt;w&lt;span class="code"&gt;)&lt;/span&gt;) = L(x) &amp;#x222A; L(w)
&lt;/LI&gt;
&lt;LI&gt;
Similarly, L(x&lt;span class="code"&gt;*&lt;/span&gt;) = L(x)*
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;
OK, let's try it out. What's L(&lt;span class="code"&gt;((ab)&amp;#x222A;a*)&lt;/span&gt;)?
&lt;/P&gt;
&lt;P&gt;
L(&lt;span class="code"&gt;((ab)&amp;#x222A;a*)&lt;/span&gt;)&lt;BR&gt;
= L(&lt;span class="code"&gt;(ab)&lt;/span&gt;) &amp;#x222A; L(&lt;span class="code"&gt;a*&lt;/span&gt;)&lt;BR&gt;
= L(&lt;span class="code"&gt;a&lt;/span&gt;)L(&lt;span class="code"&gt;b&lt;/span&gt;) &amp;#x222A; L(&lt;span class="code"&gt;a*&lt;/span&gt;)&lt;BR&gt;
= {&lt;span class="code"&gt;a&lt;/span&gt;}{&lt;span class="code"&gt;b&lt;/span&gt;} &amp;#x222A; L(&lt;span class="code"&gt;a&lt;/span&gt;)*&lt;BR&gt;
= {&lt;span class="code"&gt;a&lt;/span&gt;}{&lt;span class="code"&gt;b&lt;/span&gt;} &amp;#x222A; {&lt;span class="code"&gt;a&lt;/span&gt;}*&lt;BR&gt;
= ((&lt;span class="code"&gt;ab&lt;/span&gt;) &amp;#x222A; &lt;span class="code"&gt;a&lt;/span&gt;*)&lt;BR&gt;
= {&lt;span class="code"&gt;ab&lt;/span&gt;,
e,
&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;aa&lt;/span&gt;,
&lt;span class="code"&gt;aaa&lt;/span&gt;,
&lt;span class="code"&gt;aaaa&lt;/span&gt;, &amp;#x2026;}
&lt;/P&gt;
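&lt;P&gt;
The same calculation can be sketched in Python. Since most of these languages are infinite, 
this sketch only computes the strings of L(r) up to a chosen maximum length; that cutoff, 
the tuple-based tree representation, and the names parse, language, concat and L are all assumptions of mine 
rather than part of the definition. The parser assumes that r really is a well-formed member of R.
&lt;/P&gt;
&lt;PRE&gt;
```python
EMPTY, UNION = "\u00d8", "\u222a"   # the empty-language and union symbols

def parse(r, i=0):
    """Parse one regular expression starting at r[i]; return (tree, next index)."""
    if r[i] == "(":
        left, i = parse(r, i + 1)
        if r[i] == UNION:
            right, i = parse(r, i + 1)
            tree = ("union", left, right)
        else:
            right, i = parse(r, i)
            tree = ("concat", left, right)
        if r[i] != ")":
            raise ValueError("expected a closing paren")
        i += 1
    elif r[i] == EMPTY:
        tree, i = ("empty",), i + 1
    else:
        tree, i = ("symbol", r[i]), i + 1
    while r[i:i + 1] == "*":           # any number of trailing stars
        tree, i = ("star", tree), i + 1
    return tree, i

def concat(xs, ys, max_len):
    """All strings xy with x in xs and y in ys, no longer than max_len."""
    return {x + y for x in xs for y in ys if max_len >= len(x + y)}

def language(tree, max_len):
    """The strings of length at most max_len in the language of tree."""
    kind = tree[0]
    if kind == "empty":
        return set()
    if kind == "symbol":
        return {tree[1]} if max_len >= 1 else set()
    if kind == "union":
        return language(tree[1], max_len) | language(tree[2], max_len)
    if kind == "concat":
        return concat(language(tree[1], max_len),
                      language(tree[2], max_len), max_len)
    # kind == "star": keep appending members of the base language
    # until no new strings of acceptable length appear.
    base = language(tree[1], max_len)
    result = newest = {""}
    while newest:
        newest = concat(newest, base, max_len) - result
        result = result | newest
    return result

def L(r, max_len):
    """The members of L(r) that are at most max_len symbols long."""
    tree, end = parse(r)
    if end != len(r):
        raise ValueError("unexpected trailing symbols")
    return language(tree, max_len)

print(sorted(L("((ab)" + UNION + "a*)", 4), key=lambda w: (len(w), w)))
# ['', 'a', 'aa', 'ab', 'aaa', 'aaaa']
```
&lt;/PRE&gt;
&lt;P&gt;
Each branch of the language function corresponds to one rule of the definition, 
which is really the whole point: the structure of the string dictates the structure of the computation.
&lt;/P&gt;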
&lt;P&gt;
Hey, wouldja look at that:
L(&lt;span class="code"&gt;((ab)&amp;#x222A;a*)&lt;/span&gt;) 
= ((&lt;span class="code"&gt;ab&lt;/span&gt;) &amp;#x222A; &lt;span class="code"&gt;a&lt;/span&gt;*)
&lt;/P&gt;
&lt;P&gt;
What an amazing coincidence! 
&lt;/P&gt;
&lt;P&gt;
Obviously this is no coincidence at all. 
We have just defined a mapping between regular expressions and the languages which consist of 
all strings which match those regular expressions, and the mapping is basically "turn the regular 
expression string into an expression via the obvious substitutions."
&lt;/P&gt;
&lt;P&gt;Since we have this very strong mapping, I am probably going to become very sloppy 
about making a distinction between 
L(&lt;span class="code"&gt;((ab)&amp;#x222A;a*)&lt;/span&gt;) 
and ((&lt;span class="code"&gt;ab&lt;/span&gt;) &amp;#x222A; &lt;span class="code"&gt;a&lt;/span&gt;*).
If I say "the language &lt;span class="code"&gt;((ab)&amp;#x222A;a*)&lt;/span&gt;" what I mean is 
L(&lt;span class="code"&gt;((ab)&amp;#x222A;a*)&lt;/span&gt;).
&lt;/P&gt;
&lt;P&gt;
Note also that 
L(&lt;span class="code"&gt;((ab)a)&lt;/span&gt;) = 
L(&lt;span class="code"&gt;(a(ba))&lt;/span&gt;) = 
{&lt;span class="code"&gt;aba&lt;/span&gt;}.
In general, concatenation and union are associative, so the parens that group a run of either operator do not affect the resulting language. 
Therefore, from now on I will also be very sloppy with my parens. 
Even though &lt;span class="code"&gt;aba&lt;/span&gt; is not a "real" regular expression by the rules laid out earlier,
I will assume that you can mentally transform that back into the well-formed
&lt;span class="code"&gt;(a(ba))&lt;/span&gt;
string.
&lt;/P&gt;
&lt;P&gt;
Let's end off today's descent into the bowels of computer science with a rather obvious definition:
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 13&lt;/B&gt;: A language K over an alphabet S is called a &lt;B&gt;regular language&lt;/B&gt; if and 
only if there exists a string r in S's regular expression language such that L(r) = K.
That is, a language is called a "regular language" if and only if it can be described by a regular expression.
&lt;/P&gt;
&lt;P&gt;
Is &lt;I&gt;every&lt;/I&gt; language regular?
One would suspect not, given the incredible variety of languages that I mentioned earlier. 
Actually proving that a nonregular language exists is both amusing and character-building, so we'll do that next.
&lt;/P&gt;

&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=497610" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part Five: The Regular Expression Language</title><link>http://blogs.msdn.com/ericlippert/archive/2005/12/01/regular-expressions-from-scratch-part-five-the-regular-expression-language.aspx</link><pubDate>Thu, 01 Dec 2005 18:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:496290</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>10</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/496290.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=496290</wfw:commentRss><description>
&lt;div class="mine"&gt;

&lt;P&gt;
Now things start to get really weird.
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;
Definition 9&lt;/B&gt;: Take any alphabet S. The &lt;B&gt;regular expression alphabet&lt;/B&gt; of S is S plus a 
bunch of extra symbols; it's S &amp;#x222A; {&lt;span class="code"&gt;(&lt;/span&gt;,
&lt;span class="code"&gt;)&lt;/span&gt;,
&lt;span class="code"&gt;*&lt;/span&gt;,
&lt;span class="code"&gt;&amp;#x222A;&lt;/span&gt;,
&lt;span class="code"&gt;&amp;#x00D8;&lt;/span&gt;}

(I assume that none of those symbols are already in S.)
&lt;/P&gt;
&lt;P&gt;
I'm doing something that I said earlier that I would try not to do.
I'm using symbols in an &lt;I&gt;alphabet&lt;/I&gt; that I also use in &lt;I&gt;expressions&lt;/I&gt; that talk about strings in that alphabet.
(Of course, I also said that this would all fall apart when we got to regular expressions, and sure enough, it did.
Foreshadowing: the sign of a quality blog.)
&lt;/P&gt;
&lt;P&gt;
Again, keep a careful eye on when I'm using fixed-width blue, because those are the "meaningless" symbols, 
not expression notation.
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 10&lt;/B&gt;: Take any alphabet S.
The &lt;B&gt;regular expression language&lt;/B&gt; R of an alphabet S is a language formed from strings
over the regular expression alphabet of S, and is defined as follows:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;span class="code"&gt;&amp;#x00D8;&lt;/span&gt; is in R.
&lt;/LI&gt;
&lt;LI&gt;
Every member of S is in R.
&lt;/LI&gt;
&lt;LI&gt;
If u and w are strings in R then
&lt;span class="code"&gt;(&lt;/span&gt;uw&lt;span class="code"&gt;)&lt;/span&gt; is in R.
&lt;/LI&gt;
&lt;LI&gt;
Similarly, 
&lt;span class="code"&gt;(&lt;/span&gt;u&lt;span class="code"&gt;&amp;#x222A;&lt;/span&gt;w&lt;span class="code"&gt;)&lt;/span&gt; is in R.
&lt;/LI&gt;
&lt;LI&gt; Similarly, u&lt;span class="code"&gt;*&lt;/span&gt; is in R.
&lt;/LI&gt;
&lt;LI&gt;
Nothing else is in R.
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;
An example might help. Suppose that S = 
{&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;b&lt;/span&gt;}.  
The regular expression alphabet of S is 
{&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;b&lt;/span&gt;,
&lt;span class="code"&gt;(&lt;/span&gt;,
&lt;span class="code"&gt;)&lt;/span&gt;,
&lt;span class="code"&gt;*&lt;/span&gt;,
&lt;span class="code"&gt;&amp;#x222A;&lt;/span&gt;,
&lt;span class="code"&gt;&amp;#x00D8;&lt;/span&gt;}.
The regular expression language of S is R =
{&lt;span class="code"&gt;&amp;#x00D8;&lt;/span&gt;,
&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;b&lt;/span&gt;,
&lt;span class="code"&gt;(&amp;#x00D8;&amp;#x00D8;)&lt;/span&gt;,
&lt;span class="code"&gt;(&amp;#x00D8;a)&lt;/span&gt;,
&lt;span class="code"&gt;(&amp;#x00D8;b)&lt;/span&gt;,
&lt;span class="code"&gt;(a&amp;#x00D8;)&lt;/span&gt;,
&lt;span class="code"&gt;(aa)&lt;/span&gt;,
&lt;span class="code"&gt;(ab)&lt;/span&gt;,
&lt;span class="code"&gt;(b&amp;#x00D8;)&lt;/span&gt;,
&lt;span class="code"&gt;(ba)&lt;/span&gt;,
&lt;span class="code"&gt;(bb)&lt;/span&gt;,
&lt;span class="code"&gt;(&amp;#x00D8;&amp;#x222A;&amp;#x00D8;)&lt;/span&gt;,
&lt;span class="code"&gt;(&amp;#x00D8;&amp;#x222A;a)&lt;/span&gt;,
&lt;span class="code"&gt;(&amp;#x00D8;&amp;#x222A;b)&lt;/span&gt;,
&lt;span class="code"&gt;(a&amp;#x222A;&amp;#x00D8;)&lt;/span&gt;,
&lt;span class="code"&gt;(a&amp;#x222A;a)&lt;/span&gt;,
&lt;span class="code"&gt;(a&amp;#x222A;b)&lt;/span&gt;,
&lt;span class="code"&gt;(b&amp;#x222A;&amp;#x00D8;)&lt;/span&gt;,
&lt;span class="code"&gt;(b&amp;#x222A;a)&lt;/span&gt;,
&lt;span class="code"&gt;(b&amp;#x222A;b)&lt;/span&gt;,
&lt;span class="code"&gt;&amp;#x00D8;*&lt;/span&gt;,
&lt;span class="code"&gt;a*&lt;/span&gt;,
&lt;span class="code"&gt;b*&lt;/span&gt;,
&amp;#x2026;}
&lt;/P&gt;
&lt;P&gt;
We've defined an alphabet, we've defined a language -- a language which looks suspiciously like 
the expressions we've been using to talk &lt;i&gt;about languages&lt;/i&gt;. 
Next time we'll do something insanely clever to 
bridge the gap between strings in the regular expression language and the languages which these 
strings would define if they were interpreted as expressions.
&lt;/P&gt;





&lt;/div&gt;
&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=496290" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part Four: The Kleene Closure of a Language</title><link>http://blogs.msdn.com/ericlippert/archive/2005/11/28/regular-expressions-from-scratch-part-four-the-kleene-closure-of-a-language.aspx</link><pubDate>Mon, 28 Nov 2005 18:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:496289</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>7</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/496289.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=496289</wfw:commentRss><description>
&lt;div class="mine"&gt;


&lt;P&gt;
Languages are sets, so we can take any two languages (over the same alphabet) and take their 
union to form a new language. Just as a reminder:
&lt;/P&gt;
&lt;P&gt; 
&lt;B&gt;Definition 7&lt;/B&gt;: The &lt;B&gt;union&lt;/B&gt; of two sets L and K is the set with all the members 
found in &lt;I&gt;either&lt;/I&gt;, and is written L &amp;#x222A; K.
&lt;/P&gt;
&lt;P&gt;We're going to take the Kleene Star one level further and say that it applies to languages too.&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 8&lt;/B&gt;: The &lt;B&gt;Kleene Closure of a language&lt;/B&gt; L is L* = { e } &amp;#x222A; L &amp;#x222A; LL &amp;#x222A; 
LLL &amp;#x222A; &amp;#x2026; 
&lt;/P&gt;
&lt;P&gt;
Got it? L* is the language consisting of every string you can make by concatenating together any finite number of strings in 
L, including none at all, which gives the empty string.
&lt;/P&gt;
&lt;P&gt;
I want to extend our notion of concatenation in one more way. When I write a string or string variable next 
to a language, I mean "concatenate the language with the language that consists of only 
this string". So when I say 
&lt;span class="code"&gt;ab&lt;/span&gt;L&lt;span class="code"&gt;a&lt;/span&gt; I mean 
{&lt;span class="code"&gt;ab&lt;/span&gt;}L{&lt;span class="code"&gt;a&lt;/span&gt;}.
Again, you can infer the braces from the context.
&lt;/P&gt;
&lt;P&gt;
Let's do an example of using * and string concatenation of languages, just so it's clear.
&lt;/P&gt;
&lt;P&gt;S = {&lt;span class="code"&gt;0&lt;/span&gt;, 
&lt;span class="code"&gt;1&lt;/span&gt;}
&lt;BR&gt;
L = {&lt;span class="code"&gt;011&lt;/span&gt;*}
&lt;BR&gt;
&lt;span class="code"&gt;1&lt;/span&gt;L* = 
{&lt;span class="code"&gt;1&lt;/span&gt;, 
&lt;span class="code"&gt;101&lt;/span&gt;, 
&lt;span class="code"&gt;1011&lt;/span&gt;, 
&lt;span class="code"&gt;101101&lt;/span&gt;, 
&lt;span class="code"&gt;1011011&lt;/span&gt;, &amp;#x2026;}
&lt;/P&gt;
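&lt;P&gt;
Here is a rough Python sketch of that example. Because L* is infinite, the sketch only builds the strings 
up to a maximum length; that cutoff and the names concat and kleene_closure are my own assumptions. 
I am also reading L as "a 0, then a 1, then any number of further 1s", 
since the star applies only to the symbol immediately to its left.
&lt;/P&gt;
&lt;PRE&gt;
```python
def concat(xs, ys, max_len):
    """All strings xy with x in xs and y in ys, no longer than max_len."""
    return {x + y for x in xs for y in ys if max_len >= len(x + y)}

def kleene_closure(language, max_len):
    """The strings of L* up to max_len: {e}, L, LL, LLL and so on, unioned."""
    result = newest = {""}
    while newest:
        # Append one more member of the language to the newest strings,
        # keeping only the strings we have not already seen.
        newest = concat(newest, language, max_len) - result
        result = result | newest
    return result

max_len = 7
# L = 011*, truncated: 01, 011, 0111, ... up to max_len symbols.
L = {"01" + "1" * n for n in range(max_len - 1)}
one_L_star = concat({"1"}, kleene_closure(L, max_len), max_len)
print(sorted(one_L_star, key=lambda w: (len(w), w)))
```
&lt;/PRE&gt;
&lt;P&gt;
The loop stops because there are only finitely many strings under the length cutoff, 
so eventually a round produces nothing new.
&lt;/P&gt;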
&lt;P&gt;
Or, in English, this is the set of strings w in S* such that w:
&lt;UL&gt;
&lt;LI&gt; starts with a single &lt;span class="code"&gt;1&lt;/span&gt;, &lt;/LI&gt;
&lt;LI&gt; has no more than one &lt;span class="code"&gt;0&lt;/span&gt; in a row, and &lt;/LI&gt;
&lt;LI&gt; ends with &lt;span class="code"&gt;1&lt;/span&gt; &lt;/LI&gt;
&lt;/UL&gt; 
&lt;P&gt;
We need one more thing to make this notation sufficiently terse. 
You probably noticed that the Kleene Star above only applied to the thing 
immediately to the star's left, whether the language or the symbol.
Suppose we've got four languages: H, J, K and L.
We can form two new languages through concatenation: G = HJ and F = KL.
And then we can form a seventh language, D = G*F*.
But we can't write D = HJ*KL*, because that's a different language.
What we mean is D = (HJ)*(KL)*.
&lt;/P&gt;
&lt;P&gt;
Therefore I'm declaring that we can use parentheses in the normal way we're all used to for order of operations.
We will figure out from context whether parens mean "this is a finite sequence" or "this is a grouping of concatenations".
&lt;/P&gt;
&lt;P&gt;
Remember, we're looking for a way to concisely characterize languages, and clearly we're onto something here.
We now have &lt;B&gt;union&lt;/B&gt;, &lt;B&gt;grouping&lt;/B&gt;, &lt;B&gt;concatenation&lt;/B&gt; and &lt;B&gt;Kleene closure&lt;/B&gt;.
In other words, we have enough tools to define regular expressions, which we'll do next time.
Stay tuned!
&lt;/P&gt;
&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=496289" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part Three: Concatenation </title><link>http://blogs.msdn.com/ericlippert/archive/2005/11/25/regular-expressions-from-scratch-part-three-concatenation.aspx</link><pubDate>Fri, 25 Nov 2005 18:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:494645</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>9</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/494645.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=494645</wfw:commentRss><description>&lt;div class="mine"&gt;

&lt;P&gt;
You probably intuitively understood concatenation already, but let me define it anyway.
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 5&lt;/B&gt;: 
The &lt;B&gt;concatenation&lt;/B&gt; of two strings w and u (over the same alphabet) makes a string consisting of the 
sequence of every element in w followed by every element in u. We write concatenations the same way as before:
all run together. So if we have S = {
&lt;span class="code"&gt;a&lt;/span&gt;,
&lt;span class="code"&gt;b&lt;/span&gt;,
&lt;span class="code"&gt;c&lt;/span&gt;}
and two strings in S* called w and u, where 
w=&lt;span class="code"&gt;abc&lt;/span&gt; and 
u=&lt;span class="code"&gt;cccaaabbb&lt;/span&gt;,
then wu=&lt;span class="code"&gt;abccccaaabbb&lt;/span&gt;.
&lt;/P&gt;
&lt;P&gt;
Be very clear on this point: w and u are &lt;I&gt;variables&lt;/I&gt; which represent string values.
wu is an expression which represents concatenating the value of u onto the end of the value of w. 
Similarly, w&lt;span class="code"&gt;aaaa&lt;/span&gt;
would be the result of concatenating 
&lt;span class="code"&gt;aaaa&lt;/span&gt; onto w, so 
w&lt;span class="code"&gt;aaaa&lt;/span&gt; =
&lt;span class="code"&gt;abcaaaa&lt;/span&gt;.
&lt;/P&gt;
&lt;P&gt;
Recall that I'm using the variable e to represent the empty string. You can concatenate the empty string 
onto any old string, and the result is unchanged. So we = w, for any string w.
&lt;/P&gt;
&lt;P&gt;
We can concatenate any &lt;I&gt;finite&lt;/I&gt; number of strings together with impunity. 
If v=&lt;span class="code"&gt;ab&lt;/span&gt; then 
wuvev&lt;span class="code"&gt;a&lt;/span&gt;e&lt;span class="code"&gt;a&lt;/span&gt;e&lt;span class="code"&gt;a&lt;/span&gt; =
&lt;span class="code"&gt;abccccaaabbbababaaa&lt;/span&gt;.
&lt;/P&gt;
&lt;P&gt;We cannot concatenate an &lt;I&gt;infinite&lt;/I&gt; number of non-empty strings together to form a string, 
because strings are by definition finite sequences.
&lt;/P&gt;
&lt;P&gt;
Note that with this definition we can say that the set S* is the set of &lt;I&gt;all finite concatenations&lt;/I&gt; 
of any members of S, plus the empty string.
&lt;/P&gt;
&lt;P&gt;
That pretty much takes care of making strings out of other strings. 
We're going to need to make new languages from old languages as well.
Can we meaningfully concatenate two &lt;I&gt;languages&lt;/I&gt; together? Sure, why not?
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 6&lt;/B&gt;: Two languages L and K over the same alphabet may be concatenated together to form a new language. 
Concatenated languages are notated as you'd expect: by writing one after the other. 
The language LK consists of exactly those strings uv where u is in L and v is in K.
&lt;/P&gt;
&lt;P&gt;
If one of those languages is the empty language - the language with no strings, not even the empty string - 
then the concatenation is also the empty language.
&lt;/P&gt;
&lt;P&gt;
For example, suppose 
&lt;/P&gt;
&lt;P&gt;
S = {&lt;span class="code"&gt;a&lt;/span&gt;, &lt;span class="code"&gt;b&lt;/span&gt;}
&lt;BR&gt;
L = {e,
&lt;span class="code"&gt;a&lt;/span&gt;, 
&lt;span class="code"&gt;aa&lt;/span&gt;, 
&lt;span class="code"&gt;aaa&lt;/span&gt;, 
&lt;span class="code"&gt;aaaa&lt;/span&gt;, &amp;#x2026;}
&lt;BR&gt;
K = {e,
&lt;span class="code"&gt;b&lt;/span&gt;, 
&lt;span class="code"&gt;bb&lt;/span&gt;, 
&lt;span class="code"&gt;bbb&lt;/span&gt;, 
&lt;span class="code"&gt;bbbb&lt;/span&gt;, &amp;#x2026;}
&lt;/P&gt;
&lt;P&gt;
then
&lt;/P&gt;
&lt;P&gt;LK = {w in S* such that w has any number of 
&lt;span class="code"&gt;a&lt;/span&gt;s followed by any number of  
&lt;span class="code"&gt;b&lt;/span&gt;s} 
&lt;/P&gt;
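&lt;P&gt;
Here is a small sketch of language concatenation in Python, under some loudly stated assumptions: languages are modelled as Python sets of strings, the helper name concatenate is invented for this illustration, and because L and K above are infinite, the sketch uses only finite slices of them (strings of length zero through three).
&lt;/P&gt;
&lt;PRE&gt;
```python
def concatenate(L, K):
    """Return the language LK: every string uv with u in L and v in K."""
    return {u + v for u in L for v in K}

# Finite slices of the infinite languages L and K from the example:
# only strings of length 0 through 3 are included here.
L = {"a" * i for i in range(4)}   # {"", "a", "aa", "aaa"}
K = {"b" * j for j in range(4)}   # {"", "b", "bb", "bbb"}

LK = concatenate(L, K)
assert "aabbb" in LK          # some a's followed by some b's
assert "" in LK               # e concatenated with e
assert "ba" not in LK         # a b never comes before an a
assert len(LK) == 16          # 4 choices of u times 4 choices of v, all distinct

# Concatenating with the empty language gives the empty language.
empty_language = set()
assert concatenate(L, empty_language) == set()
assert concatenate(empty_language, K) == set()
```
&lt;/PRE&gt;
&lt;P&gt;
Note the difference between the empty language and the language containing only e: concatenating L with the latter gives back L, while concatenating L with the former gives nothing at all.
&lt;/P&gt;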
&lt;P&gt;
That's a little irksome. We need a more compact notation for 
"any number of 
&lt;span class="code"&gt;a&lt;/span&gt;s followed by any number of  
&lt;span class="code"&gt;b&lt;/span&gt;s".  Fortunately we already have such a notation -- 
we have the Kleene Closure over alphabets. We'll just create two one-character alphabets,
star them to form languages, and concatenate the languages:
&lt;/P&gt;
&lt;P&gt;
LK = {&lt;span class="code"&gt;a&lt;/span&gt;}*{&lt;span class="code"&gt;b&lt;/span&gt;}* 
&lt;/P&gt;
&lt;P&gt;
Let's make that notation a little more compact. From now on when I say 
&lt;/P&gt;
&lt;P&gt;
X = &lt;span class="code"&gt;a&lt;/span&gt;*
&lt;/P&gt;
&lt;P&gt;
what I mean is
&lt;/P&gt;
&lt;P&gt;
X = {&lt;span class="code"&gt;a&lt;/span&gt;}*
&lt;/P&gt;
&lt;P&gt;
That is, I'm going to omit the set braces, because you can figure them out from context.
&lt;/P&gt;
&lt;P&gt;
We can write that last example much more compactly:
&lt;/P&gt;
&lt;P&gt;
S = {&lt;span class="code"&gt;a&lt;/span&gt;, &lt;span class="code"&gt;b&lt;/span&gt;}
&lt;BR&gt;
L = &lt;span class="code"&gt;a&lt;/span&gt;*
&lt;BR&gt;
K = &lt;span class="code"&gt;b&lt;/span&gt;*
&lt;BR&gt;
LK = &lt;span class="code"&gt;a&lt;/span&gt;*&lt;span class="code"&gt;b&lt;/span&gt;*
&lt;/P&gt;
&lt;P&gt;
Perhaps you can see that we are in fact heading towards regular expressions.
&lt;/P&gt;
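&lt;P&gt;
As a hedged preview, here is a Python sketch of a membership test for the language of any number of 
&lt;span class="code"&gt;a&lt;/span&gt;s followed by any number of 
&lt;span class="code"&gt;b&lt;/span&gt;s. The function name in_a_star_b_star and the sample strings are invented for this illustration. The cross-check uses re.fullmatch from Python's standard library, whose pattern syntax happens to spell this language the same way; that library is a practical tool and not the formal construction this series is building.
&lt;/P&gt;
&lt;PRE&gt;
```python
import re

def in_a_star_b_star(s):
    """Return True if s is some number of a's followed by some number of b's."""
    rest = s.lstrip("a")               # drop the leading run of a's, possibly empty
    return set(rest).issubset({"b"})   # whatever remains must be only b's

samples = ["", "a", "b", "aabbb", "ba", "abab", "abc"]
accepted = [s for s in samples if in_a_star_b_star(s)]
assert accepted == ["", "a", "b", "aabbb"]

# Cross-check against Python's own pattern syntax.
for s in samples:
    assert in_a_star_b_star(s) == (re.fullmatch("a*b*", s) is not None)
```
&lt;/PRE&gt;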
&lt;P&gt;
Next time: The Kleene Closure applies to languages too.
&lt;/P&gt;
&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=494645" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part Two: Some Examples of Languages </title><link>http://blogs.msdn.com/ericlippert/archive/2005/11/22/regular-expressions-from-scratch-part-two-some-examples-of-languages.aspx</link><pubDate>Tue, 22 Nov 2005 18:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:494633</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/494633.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=494633</wfw:commentRss><description>
&lt;div class="mine"&gt;


&lt;P&gt;
Let's look at some sample languages to get a sense of just how flexible languages can be.
&lt;/P&gt;
&lt;P&gt;
For example, here are some languages over the alphabet 
S = {&lt;span class="code"&gt;0&lt;/span&gt;, &lt;span class="code"&gt;1&lt;/span&gt;}
&lt;/P&gt;
&lt;P&gt;
L&lt;SUB&gt;1&lt;/SUB&gt; = all members of S*, period -- the language where every string is in the language
&lt;/P&gt;
&lt;P&gt;
L&lt;SUB&gt;2&lt;/SUB&gt; = all members of S* such that there are more 
&lt;span class="code"&gt;1&lt;/span&gt;'s than
&lt;span class="code"&gt;0&lt;/span&gt;'s.
So L&lt;SUB&gt;2&lt;/SUB&gt; = {&lt;span class="code"&gt;1&lt;/span&gt;,
&lt;span class="code"&gt;11&lt;/span&gt;,
&lt;span class="code"&gt;011&lt;/span&gt;,
&lt;span class="code"&gt;101&lt;/span&gt;,
&lt;span class="code"&gt;110&lt;/span&gt;,
&lt;span class="code"&gt;111&lt;/span&gt;, &amp;#x2026;}
&lt;/P&gt;
&lt;P&gt;
L&lt;SUB&gt;3&lt;/SUB&gt; = all members of S* such that the string is a prime number of 
&lt;span class="code"&gt;1&lt;/span&gt;'s.
So L&lt;SUB&gt;3&lt;/SUB&gt; = {&lt;span class="code"&gt;11&lt;/span&gt;,
&lt;span class="code"&gt;111&lt;/span&gt;,
&lt;span class="code"&gt;11111&lt;/span&gt;,
&lt;span class="code"&gt;1111111&lt;/span&gt;,
&lt;span class="code"&gt;11111111111&lt;/span&gt;, &amp;#x2026;}
&lt;/P&gt;
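&lt;P&gt;
One way to get a feel for these first three languages is to write each one as a yes-or-no test on strings. The Python sketch below does that; the function names in_L1, in_L2 and in_L3 are invented for this illustration, and Python strings are assumed to stand in for strings over S. The primality test is the naive trial-division one, which is fine for short strings and nothing more.
&lt;/P&gt;
&lt;PRE&gt;
```python
def in_L1(s):
    """L1: every string over {0, 1} is a member."""
    return set(s).issubset({"0", "1"})

def in_L2(s):
    """L2: strings over {0, 1} with more 1's than 0's."""
    ones, zeros = s.count("1"), s.count("0")
    # zeros in range(ones) holds exactly when zeros is strictly smaller than ones.
    return in_L1(s) and zeros in range(ones)

def in_L3(s):
    """L3: strings made of a prime number of 1's and nothing else."""
    n = len(s)
    only_ones = s == "1" * n
    is_prime = n not in (0, 1) and all(n % d != 0 for d in range(2, n))
    return only_ones and is_prime

assert in_L1("") and not in_L1("012")
assert all(in_L2(s) for s in ["1", "11", "011", "101", "110", "111"])
assert not in_L2("10") and not in_L2("")
assert [n for n in range(12) if in_L3("1" * n)] == [2, 3, 5, 7, 11]
```
&lt;/PRE&gt;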
&lt;P&gt;L&lt;SUB&gt;4&lt;/SUB&gt; = all members of S* such that there are n 
&lt;span class="code"&gt;1&lt;/span&gt;'s
and no 
&lt;span class="code"&gt;0&lt;/span&gt;'s
in the string, and furthermore, n is such that the equation from Fermat's Last Theorem, 
a&lt;SUP&gt;n&lt;/SUP&gt; + b&lt;SUP&gt;n&lt;/SUP&gt; = c&lt;SUP&gt;n&lt;/SUP&gt;, has a nontrivial solution in natural numbers.
&lt;/P&gt;
&lt;P&gt;That is a perfectly good description of a language. In fact, it's a description of the finite language 
L&lt;SUB&gt;4&lt;/SUB&gt; = {&lt;span class="code"&gt;1&lt;/span&gt;,
&lt;span class="code"&gt;11&lt;/span&gt;}.
We know that now, as Fermat's Last Theorem was proven. Twenty years ago though we would not have been able to 
enumerate all the members of this language with any confidence.
&lt;/P&gt;
&lt;P&gt;
We can even come up with descriptions of languages whose members we cannot possibly know!
&lt;/P&gt;
&lt;P&gt;L&lt;SUB&gt;5&lt;/SUB&gt; = all members of S* such that there are n 
&lt;span class="code"&gt;1&lt;/span&gt;'s
and no 
&lt;span class="code"&gt;0&lt;/span&gt;'s
in the string, and furthermore, 
n is the largest number of quarters I had in my pocket at any time on the 13th of December, 1991.
&lt;/P&gt;
&lt;P&gt;
That language has only one member, but darned if I know what it is!
&lt;/P&gt;
&lt;P&gt;
This silly example raises an important point: in general, given a description of a language, we 
cannot know how many members the language has, whether it is finite or infinite, or whether any particular 
string is a member of the language. 
&lt;/P&gt;
&lt;P&gt;
We will be seeing later that there are definitely languages which 
cannot &lt;I&gt;in principle&lt;/I&gt; be determined. That is, there is no way whatsoever of computing the members 
of the language. 
&lt;/P&gt;
&lt;P&gt;
For some languages though, we can build and analyze devices which recognize them - that is, 
when given a string and a language, a device can tell you whether or not the string is a member of the language.
&lt;/P&gt;
&lt;P&gt;
To do that, we're going to need a way to describe languages in a very precise manner - we need a 
"language definition language" of some sort. The relationship between language recognizers and various 
metalanguages for describing languages will be a fundamental focus of this series.
&lt;/P&gt;
&lt;P&gt;
Next time: concatenation of strings and languages
&lt;/P&gt;


&lt;/div&gt;
&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=494633" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item><item><title>Regular Expressions From Scratch, Part One: Defining Terms</title><link>http://blogs.msdn.com/ericlippert/archive/2005/11/18/regular-expressions-from-scratch-part-one-defining-terms.aspx</link><pubDate>Fri, 18 Nov 2005 21:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:493482</guid><dc:creator>Eric Lippert</dc:creator><slash:comments>18</slash:comments><comments>http://blogs.msdn.com/ericlippert/comments/493482.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ericlippert/commentrss.aspx?PostID=493482</wfw:commentRss><description>&lt;div class="mine"&gt;

&lt;P&gt;
Over the years that I've been writing this blog some of the most positive feedback I've received has been 
for those entries where I've explored fundamental concepts in computer science. I thought that I might take 
that to its logical extreme, and do a series on what exactly a "regular expression" is. 
&lt;/P&gt;
&lt;P&gt;
Most script developers are familiar with regular expressions - they're those dense, hard-to-read, patterns 
that you can use to write self-obfuscating code that does string matching. I'd like to start by throwing 
out everything that we know about practical, real-world regular expressions and start from set theory, 
of all things.
&lt;/P&gt;
&lt;P&gt;
This is going to be a very long series I think. But it will build character!
&lt;/P&gt;
&lt;P&gt;Here's what I'm going to assume:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;
We have an &lt;I&gt;intuitive&lt;/I&gt; understanding of &lt;B&gt;sets&lt;/B&gt;. Sets are collections of stuff. They can be 
empty, finite, infinite, you name it. They can contain other sets. I am not going to start with the 
axioms of set theory, but trust me, we could justify our intuitive understanding of sets axiomatically 
if we wanted to.
&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;
We have an intuitive understanding of the set of &lt;B&gt;natural numbers&lt;/B&gt; 
&amp;#x2115; = {0, 1, 2, 3, &amp;#x2026;}
&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;
We can create arbitrary sets of &lt;B&gt;symbols&lt;/B&gt;. For clarity, throughout this series I'll be displaying 
symbols in a blue fixed-width typeface. We could have a set of symbols 
S = {&lt;span class="code"&gt;a&lt;/span&gt;, 
&lt;span class="code"&gt;b&lt;/span&gt;, 
&lt;span class="code"&gt;c&lt;/span&gt;,
&lt;span class="code"&gt;+&lt;/span&gt;,
&lt;span class="code"&gt;-&lt;/span&gt;,
&lt;span class="code"&gt;0&lt;/span&gt;,
&lt;span class="code"&gt;1&lt;/span&gt;,
&lt;span class="code"&gt;&amp;amp;&lt;/span&gt;}
for instance. 
&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;
We have an intuitive understanding of &lt;B&gt;finite sequences&lt;/B&gt;. Remember, sets are by definition unordered, 
may be infinite, and contain no duplicates. A finite sequence is like a set but is finite, has an order, 
and may contain duplicates. A finite sequence may be empty. (Again, we could derive a suitable definition 
of a finite sequence from axiomatic set theory, but I won't.) To distinguish finite sequences from sets, 
I'll use curly braces for sets and round parentheses for finite sequences, but not for long.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;
That's not starting with much. Let's see what we can do.
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 1&lt;/B&gt;: An &lt;B&gt;alphabet&lt;/B&gt; is a finite set of symbols.
&lt;/P&gt;
&lt;P&gt;
For example, we could have the alphabet of capital Roman letters 
R = {&lt;span class="code"&gt;A&lt;/span&gt;, 
&lt;span class="code"&gt;B&lt;/span&gt;, 
&lt;span class="code"&gt;C&lt;/span&gt;, &amp;#x2026;,
&lt;span class="code"&gt;Z&lt;/span&gt;}
or the alphabet of binary digits 
B = {&lt;span class="code"&gt;0&lt;/span&gt;, 
&lt;span class="code"&gt;1&lt;/span&gt;}. 
&lt;/P&gt;
&lt;P&gt;
The empty alphabet is a pretty boring alphabet, but it is a legal alphabet.
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 2&lt;/B&gt;: A &lt;B&gt;string&lt;/B&gt; is a finite sequence of symbols drawn from a given alphabet.
&lt;/P&gt;
&lt;P&gt;
It is a pain in the rear to continually say 
"(&lt;span class="code"&gt;b&lt;/span&gt;, 
&lt;span class="code"&gt;b&lt;/span&gt;, 
&lt;span class="code"&gt;c&lt;/span&gt;)
is a string over alphabet 
{&lt;span class="code"&gt;a&lt;/span&gt;, 
&lt;span class="code"&gt;b&lt;/span&gt;, 
&lt;span class="code"&gt;c&lt;/span&gt;}"
so for the vast majority of cases, I'll be writing strings all run together like this: 
&lt;span class="code"&gt;bbc&lt;/span&gt; and assuming that you can mentally treat this as a sequence.
&lt;/P&gt;
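&lt;P&gt;
To make the shorthand concrete, here is a small Python sketch. It assumes that a Python tuple stands in for the parenthesized finite-sequence notation, that a Python string stands in for the run-together notation, and that a Python set stands in for the alphabet. The helper name is_string_over is invented for this illustration.
&lt;/P&gt;
&lt;PRE&gt;
```python
def is_string_over(sequence, alphabet):
    """Return True if every element of the finite sequence is drawn from the alphabet."""
    return all(symbol in alphabet for symbol in sequence)

S = {"a", "b", "c"}

as_sequence = ("b", "b", "c")   # the explicit finite-sequence notation
as_run_together = "bbc"         # the run-together shorthand

# Both spellings describe the same sequence of symbols.
assert tuple(as_run_together) == as_sequence
assert is_string_over(as_sequence, S)
assert is_string_over(as_run_together, S)
assert not is_string_over("bbd", S)   # d is not a symbol of S

# Unlike a set, a sequence keeps its order and its duplicates.
assert set(as_sequence) == {"b", "c"}
assert ("b", "c", "b") != as_sequence
```
&lt;/PRE&gt;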
&lt;P&gt;
I'll also be assuming that you can figure out the alphabet for a given string from context. If the alphabet 
is ever unclear then I'll call it out specifically.
&lt;/P&gt;
&lt;P&gt;
I will sometimes use "variables" to refer to strings and languages. You'll be able to distinguish variables 
from symbols because symbols are always written in blue fixed-width 
font. I shall endeavour to also avoid saying something confusing such as 
"Suppose b = &lt;span class="code"&gt;bbc&lt;/span&gt;" 
though as we'll see later, this will become unavoidable when we talk about regular expressions.
&lt;/P&gt;
&lt;P&gt;
There's no good way to show the empty string, not without getting into a whole lot of issues with quotation 
marks. I'm hereby declaring that unless I say otherwise, the variable e represents the empty string in whatever 
alphabet we're presently talking about. 
&lt;/P&gt;
&lt;P&gt;
Note that the empty string is the only string that can be formed from the empty alphabet.
&lt;/P&gt;
&lt;P&gt;
I'll almost always use the variable S to represent an alphabet and u, v and w to represent strings.
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 3&lt;/B&gt;: Consider an alphabet S. The &lt;B&gt;Kleene Closure&lt;/B&gt; of S is the set of all strings
that can be formed from that alphabet and will be written as &lt;B&gt;S*&lt;/B&gt;.
&lt;/P&gt;
&lt;P&gt;
For example, let S = {&lt;span class="code"&gt;0&lt;/span&gt;, &lt;span class="code"&gt;1&lt;/span&gt;}.
Then S* = {e, 
&lt;span class="code"&gt;0&lt;/span&gt;,
&lt;span class="code"&gt;1&lt;/span&gt;,
&lt;span class="code"&gt;00&lt;/span&gt;,
&lt;span class="code"&gt;01&lt;/span&gt;,
&lt;span class="code"&gt;10&lt;/span&gt;,
&lt;span class="code"&gt;11&lt;/span&gt;,
&lt;span class="code"&gt;000&lt;/span&gt;, &amp;#x2026;}.
That is, every string of finite length that can be formed from only
&lt;span class="code"&gt;0&lt;/span&gt; and
&lt;span class="code"&gt;1&lt;/span&gt;, including the empty string.
&lt;/P&gt;

&lt;P&gt;S* is of course an infinite set, though I hasten to emphasize that no member of S* is infinitely long, 
because strings are by definition finite sequences.
&lt;/P&gt;
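&lt;P&gt;
Since S* is infinite we cannot write it all down, but we can list its members one at a time, shortest first. The Python sketch below does so with a generator; the name kleene_closure is invented for this illustration, and the ordering (by length, then by sorted symbol order) is just a convenient choice, since sets have no order of their own. Every string it produces is finite even though the listing never ends.
&lt;/P&gt;
&lt;PRE&gt;
```python
from itertools import count, islice, product

def kleene_closure(alphabet):
    """Yield the strings of S*, shortest first, for a finite alphabet."""
    symbols = sorted(alphabet)
    yield ""                      # the empty string e is always a member
    if not symbols:
        return                    # the empty alphabet yields only e
    for length in count(1):       # lengths 1, 2, 3, ... without end
        for letters in product(symbols, repeat=length):
            yield "".join(letters)

first_eight = list(islice(kleene_closure({"0", "1"}), 8))
assert first_eight == ["", "0", "1", "00", "01", "10", "11", "000"]

# The empty string is the only string over the empty alphabet.
assert list(kleene_closure(set())) == [""]
```
&lt;/PRE&gt;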
&lt;P&gt;
Here comes our most important definition today:
&lt;/P&gt;
&lt;P&gt;
&lt;B&gt;Definition 4&lt;/B&gt;: A &lt;B&gt;language&lt;/B&gt; over an alphabet S is any subset of S*.
&lt;/P&gt;
&lt;P&gt;This is a rather broad definition of "language" -- any subset &lt;I&gt;whatsoever&lt;/I&gt; of the set of all finite strings 
in any alphabet forms a language.
&lt;/P&gt;
&lt;P&gt;
Next time we'll look at some examples of languages, both sensible and crazy.
&lt;/P&gt;

&lt;/div&gt;

&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=493482" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Rarefied+Heights/default.aspx">Rarefied Heights</category><category domain="http://blogs.msdn.com/ericlippert/archive/tags/Regular+Expressions/default.aspx">Regular Expressions</category></item></channel></rss>