Regular Expressions From Scratch, Part Three: Concatenation

Regular Expressions From Scratch, Part Three: Concatenation

  • Comments 9

You probably intuitively understood concatenation already, but let me define it anyway.

Definition 5: The concatenation of two strings w and u (over the same alphabet) makes a string consisting of the sequence of every element in w followed by every element in u. We write concatenations the same way as before: all run together. So if we have S = { a, b, c} and two strings in S* called w and u, where w=abc and u=cccaaabbb then then wu=abccccaaabbb.

Be very clear on this point: w and u are variables which represent string values. wu is an expression which represents concatenating the value of u onto the end of the value of w. Similarly, waaaa would be the result of concatenating aaaa ont w, so waaaa = abcaaaa.

Recall that I'm using the variable e to represent the empty string. You can concatenate the empty string onto any old string, and the result is unchanged. So we = w, for any string w.

We can concatenate any finite number of strings together with impunity. If v=ab then wuvevaeaea = abccccaaabbbababaaa.

We cannot concatenate an infinite number of non-empty strings together to form a string, because strings are by definition finite sequences.

Note that with this definition we can say that the set S* is the set of all finite concatenations of any members of S, plus the empty string.

That pretty much takes care of making strings out of other strings. We're going to need to make new languages from old languages as well. Can we meaningfully concatenate two languages together? Sure, why not?

Definition 6: Two languages L and K over the same alphabet may be concatenated together form a new language. Concatenated languages are notated as you'd expect: by writing one after the other. If u is in L, and v is in K, then uv is in LK.

If one of those languages is the empty language - the language with no strings, not even the empty string - then the concatenation is also the empty language.

For example, suppose

S = {a, b}
L = {e, a, aa, aaa, aaaa, …}
K = {e, b, bb, bbb, bbbb, …}

then

LK = {w in S* such that w has any number of as followed by any number of bs}

That's a little irksome. We need a more compact notation for "any number of as followed by any number of bs". Fortunately we already have such a notation -- we have the Kleene Closure over alphabets. We'll just create two one-character alphabets, star them to form languages, and concatenate the languages:

LK = {a}*{b}*

Let's make that notation a little more compact. From now on when I say

X = a*

what I mean is

X = {a}*

That is, I'm going to omit the set braces, because you can figure them out from context.

We can write that last example much more compactly:

S = {a, b}
L = a*
K = b*
LK = a*b*

Perhaps you can see that we are in fact heading towards regular expressions.

Next time: The Kleene Closure applies to languages too.

  • I noticed that no one had commented on the last RegEx post, or this one, and felt that I should speak up so you know at least one person is reading.

    I'm interested in seeing more of this, and I'm sure plenty of others are, too. I like seeing experienced developers put together in-depth articles/tutorials/whatever-you-want-to-call-them.
  • I've been reading your posts but am wondering if you actually explained what the Kleene Closure is?
  • Sembol,

    The Kleene closure was definition 3, from last week:

    http://blogs.msdn.com/ericlippert/archive/2005/11/18/493482.aspx
  • Good writing. So, based on what you say about the empty string, is it true that

    e* = e

    ???
  • Yes, this would be an example of our more compact syntax. Clearly {e}* = {e} U {ee} U {eee} ... = {e}, so {e}* = {e}, so e*=e.
  • Started to read this great series!

    May I revise the example under the definition 6 a little bit to become the following:

    S = {b}
    L = {}
    K = {e, b, bb, bbb, bbbb, …}

    then

    LK = {w in S* such that w has zero number of as followed by any number of bs}

    then, we have LK = K. Agree? If yes, then the paragraph before the example is wrong.

    Please correct me.
  • Shawn Cheng:

    If a language L does not even have the empty string, then any concatenation with L could have no possible strings.

    You could change your suggested example to this:

    S = {a, b}
    L = {e}
    K = {e, b, bb, bbb, bbbb, ...}

    In that case, it is true that LK = K
    (LK = {w in S* such that w has the empty string followed by any number of bs})

    If, however,  L = {}, then LK = {}


    Unless, of course, I'm wrong.
  • PingBack from http://www.synesthesia.co.uk/blog/archives/2006/02/22/links-24/
  • When it comes to validating input regular expression becomes a very important part of your security plan. ...
Page 1 of 1 (9 items)