Fabulous Adventures In Coding

Eric Lippert's Blog

Regular Expressions From Scratch, Part Five: The Regular Expression Language

Now things start to get really weird.

Definition 9: Take any alphabet S. The regular expression alphabet of S is S plus a bunch of extra symbols; it's S ∪ {(, ), *, , Ø} (I assume that none of those symbols are already in S.)

I'm doing something that I said earlier that I would try not to do. I'm using symbols in an alphabet that I also use in expressions that talk about strings in that alphabet. (Of course, I also said that this would all fall apart when we got to regular expressions, and sure enough, it did. Foreshadowing: the sign of a quality blog.)

Again, keep a careful eye on when I'm using fixed-width blue, because those are the "meaningless" symbols, not expression notation.

Definition 10: Take any alphabet S. The regular expression language R of an alphabet S is a language formed from strings of the regular expression language of S, and is defined as follows:

  • Ø is in R.
  • Every member of S is in R
  • If u and w are strings in R then (uw) is in R.
  • Similarly, (uw) is in R.
  • Similarly, u* is in R.
  • Nothing else is in R

An example might help. Suppose that S = {a, b}. The regular expression alphabet of S is {a, b, (, ), *, , Ø}. The regular expression language of S is R = {Ø, a, b, (ØØ), (Øa), (Øb), (aØ), (aa), (ab), (bØ), (ba), (bb), (Ø∪Ø), (Ø∪a), (Ø∪b), (a∪Ø), (a∪a), (a∪b), (b∪Ø), (b∪a), (b∪b), Ø*, a*, b*, …}

We've defined an alphabet, we've defined a language -- a language which looks suspiciously like the expressions we've been using to talk about languages. Next time we'll do something insanely clever to bridge the gap between strings in the regular expression language and the languages which these strings would define if they were interpreted as expressions.

Published Thursday, December 01, 2005 7:00 AM by Eric Lippert

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Curtis said:

Just wanted to let you know how much I am enjoying this series!
December 1, 2005 10:11 AM
 

Centaur said:

Norwegians are going to hate you for using the letter O with stroke (U+00D8) instead of the empty set character (U+2205) :)
December 1, 2005 1:39 PM
 

Eric Lippert said:

The empty set character doesn't render properly in Lucida Console.
December 1, 2005 1:56 PM
 

Lance Fisher said:

I'm really enjoying this series too. Probably my favorite series yet on your blog.
December 2, 2005 7:53 PM
 

Shawn Cheng said:

In definition 10, I think the phrase "the regular expression language of S" should be "the regular expression alphabet of S".


January 8, 2006 10:19 PM
 

Shawn Cheng said:

Again about definition 10, should we add one more entry to the list as follows:

If u is a string in R then (u) is in R.
January 8, 2006 10:36 PM
 

Jon Turner said:

That extra definition would be redundant, and could be removed. (u) === u so (u)* === u* and so on for all of the operation symbols.
January 24, 2006 11:27 AM
 

Jon Turner said:

In response to your other post, Shawn, definition 10 is correct in calling it the "regular expression language" since definition 4 says that a language is a subset of S* (which matches the notation used) and alphabets must be finite according to definition 3 (i think, maybe definition 2) while R is not.
January 24, 2006 11:33 AM
 

Synesthesia » Links Roundup for 2006-02-21 said:

February 22, 2006 4:18 AM
 

Dan Sellers's WebLog said:

When it comes to validating input regular expression becomes a very important part of your security plan. ...
March 3, 2006 6:29 PM

Leave a Comment

(required) 
(optional)
(required) 
Submit

About Eric Lippert

Eric Lippert is a senior developer on the Microsoft C# compiler team. Before that he worked on the framework of Visual Studio Tools For Office. Before that, he worked on the compilers, runtimes and tools for VBScript, JScript, Windows Script Host and other Microsoft Scripting technologies. He lives in Seattle and spends his free time editing books about programming languages, playing the piano, and trying to keep his tiny sailboat upright in Puget Sound.

This Blog

Syndication


© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker