Fabulous Adventures In Coding

Eric Lippert's Blog

Regular Expressions From Scratch, Part One: Defining Terms

Over the years that I've been writing this blog some of the most positive feedback I've received has been for those entries where I've explored fundamental concepts in computer science. I thought that I might take that to its logical extreme, and do a series on what exactly a "regular expression" is.

Most script developers are familiar with regular expressions - they're those dense, hard-to-read, patterns that you can use to write self-obfuscating code that does string matching. I'd like to start by throwing out everything that we know about practical, real-world regular expressions and start from set theory, of all things.

This is going to be a very long series I think. But it will build character!

Here's what I'm going to assume:

  • We have an intuitive understanding of sets. Sets are collections of stuff. They can be empty, finite, infinite, you name it. They can contain other sets. I am not going to start with the axioms of set theory, but trust me, we could justify our intuitive understanding of sets axiomatically if we wanted to.

  • We have an intuitive understanding of the set of natural numbers ℕ = {0, 1, 2, 3, …}

  • We can create arbitrary sets of symbols. For clarity, throughout this series I'll be displaying symbols in a blue fixed-width typeface. We could have a set of symbols S = {a, b, c, +, -, 0, 1, &} for instance.

  • We have an intuitive understanding of finite sequences. Remember, sets are by definition unordered, may be infinite, and contain no duplicates. A finite sequence is like a set but is finite, has an order, and may contain duplicates. A finite sequence may be empty. (Again, we could derive a suitable definition of a finite sequence from axiomatic set theory, but I won't.) To distinguish finite sequences from sets, I'll use curly braces for sets and round parentheses for finite sequences, but not for long.

That's not starting with much. Let's see what we can do.

Definition 1: An alphabet is a finite set of symbols.

For example, we could have the alphabet of capital Roman letters R = {A, B, C, …, Z} or the alphabet of binary digits B = {0, 1}.

The empty alphabet is a pretty boring alphabet, but it is a legal alphabet.

Definition 2: a string is a finite sequence of symbols drawn from a given alphabet.

It is a pain in the rear to continually say "(b, b, c) is a string over alphabet {a, b, c}" so for the vast majority of cases, I'll be writing strings all run together like this: bbc and assuming that you can mentally treat this as a sequence.

I'll also be assuming that you can figure out the alphabet for a given string from context. If the alphabet is ever unclear then I'll call it out specifically.

I will sometimes use "variables" to refer to strings and languages. You'll be able to distinguish variables from symbols because symbols are always written in blue fixed-width font. I shall endeavour to also avoid saying something confusing such as "Suppose b = bbc" though as we'll see later, this will become unavoidable when we talk about regular expressions.

There's no good way to show the empty string, not without getting into a whole lot of issues with quotation marks. I'm hereby declaring that unless I say otherwise, the variable e represents the empty string in whatever alphabet we're presently talking about.

Note that the empty string is the only string that can be formed from the empty alphabet.

I'll almost always use the variable S to represent an alphabet and u, v and w to represent strings.

Definition 3: Consider an alphabet S. The Kleene Closure of S is the set of all strings that can be formed from that alphabet and will be written as S*.

For example, let S = {0, 1}. Then S* = {e, 0, 1, 00, 01, 10, 11, 000, …}. That is, every string of finite length that can be formed from only 0 and 1, including the empty string.

S* is of course an infinite set, though I hasten to emphasize that no member of S* is infinitely long, because strings are by definition finite sequences.

Here comes our most important definition today:

Definition 4: A language over an alphabet S is any subset of S*.

This is a rather broad definition of "language" -- any subset whatsoever of the set of all finite strings in any alphabet forms a language.

Next time we'll look at some examples of languages, both sensible and crazy.

Published Friday, November 18, 2005 10:00 AM by Eric Lippert

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Hasani said:

All I know is senior year, I took 'theory of computation' where we studied and created dfa and nfa graphs (if I remember correctly). A year later, I picked up an orielly book regex as I needed to use it for some work I was doing... Things quickly fell into place.

For a day, I was king of the world!
November 18, 2005 1:25 PM
 

Steve said:

Automata was a fun class that gave me plenty of headaches.
November 18, 2005 3:06 PM
 

Lance Fisher said:

I love the way you address these subjects - very mathematically. I'm looking forward to the series.
November 18, 2005 6:02 PM
 

Capt. Jean-Luc Pikachu said:

Have you thought about coordinating w/ <a href="http://blogs.msdn.com/ericgu/">Eric Gunnerson</a>?
November 18, 2005 11:32 PM
 

Capt. Jean-Luc Pikachu said:

Many apologies for messing up the link.
November 18, 2005 11:33 PM
 

Chris Moorhouse said:

Sweet! Can't wait for the rest...
November 19, 2005 12:29 AM
 

Slava Ivanyuk said:

Looks like fun sequence of articles. Looking forward to it.
November 19, 2005 1:35 PM
 

Rich C said:

I really enjoy reading this stuff, Eric. But just to make life easy, can't we use Ø for the empty set? That's the only one I've seen.

It's just a <ctrl> + /, <shift> + o away.

Of course, if it doesn't show up correctly when I post this, I'll know why, eh :)?
November 19, 2005 6:53 PM
 

Rich C said:

Then again, you might be trying to show empty _strings_ not empty _sets_. As that is the case, I'll just read and enjoy the next in this series and watch myself.
November 19, 2005 11:54 PM
 

Jonathan said:

Yeah, I liked Automata in University as well. Not that I remember any of it...
November 20, 2005 4:40 AM
 

Eric Lippert said:

Right -- e is a string, it is the empty _sequence_. But we'll be using Ø for something like the empty string in a few episodes, don't worry.
November 20, 2005 10:23 AM
 

Mark W said:

Looking forward to this series... Could you make the blue symbol color a bit brighter, though? I can't see any color difference between the regular text and the symbols without setting my text size to Largest.
November 21, 2005 10:59 AM
 

Adam G said:

I thought that λ was pretty standard-ly the empty string...

That being said, I remember CS Theory fondly and look forward to the rest of this series.
November 21, 2005 7:26 PM
 

Eric Lippert said:

Indeed, traditionally lambda is used for empty string, capital sigma is used for alphabets, etc. My feeling is that Greek letters are somewhat off-putting to non-mathematicians. I'm therefore going to sacrifice both rigor and unimportant factors in traditional notation to hopefully make it easier on an audience that doesn't necessarily have a background in CS theory. (If you do have a background in CS theory, you probably already know this stuff anyway.)
November 24, 2005 10:11 AM
 

Synesthesia » Links Roundup for 2006-02-21 said:

February 22, 2006 4:18 AM
 

Dan Sellers's WebLog said:

When it comes to validating input regular expression becomes a very important part of your security plan.&amp;nbsp;...
March 3, 2006 6:29 PM
 

Tanveer Badar said:

What about the empty language? The language with no strings in it.

May 29, 2007 6:51 AM
 

Eric Lippert said:

Since that is a subset of S*, yes, the empty language is a language.

May 29, 2007 10:39 AM

Leave a Comment

(required) 
(optional)
(required) 
Submit

About Eric Lippert

Eric Lippert is a senior developer on the Microsoft C# compiler team. Before that he worked on the framework of Visual Studio Tools For Office. Before that, he worked on the compilers, runtimes and tools for VBScript, JScript, Windows Script Host and other Microsoft Scripting technologies. He lives in Seattle and spends his free time editing books about programming languages, playing the piano, and trying to keep his tiny sailboat upright in Puget Sound.

This Blog

Syndication


© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker