# Regex 101

Posts
• #### MSDN Regular Expression Forum

There's now an MSDN Regular Expression Forum
• #### Rule #1 of Regex debugging

Rule #1 of Regex debugging

The regex engine isn't hung. It will finish eventually.

However, "eventually" may not arrive until after the heat death of the universe.

You either didn't anchor your string, or you didn't think clearly about how the quantifiers "*" or "+" work. Or both...

• #### Regex 101 posts - continue or not?

I've been getting bored with the regex 101 exercises that I've been posting, as lots of them are simply variants of what I've posted in the past, and there's not really much value to add in the discussion.

I have 9 more of the exercises remaining.

Things I could do:

1) Do all of the remaining 9

2) Pick a few (3-5) that I feel like talking about, and do those.

3) Give up now, and spend more time looking for the ultimate cordless butter warmer.

Do you have a preference?

• #### Regex 101 Answer I10 - Extract repeating hex blocks from a string

Regex 101 Exercise I10 - Extract repeating hex blocks from a string

Given the string:

PCORR:BLOCK=V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9;

Extract all the hex numbers in the form “H'xxxx”

*****

You can match the hex digits with:

H'(?<Values>[0-9a-fA-F]{4})

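Like our last example, you can call Match() multiple times, use Matches(), or do it in a single call with:

(H'(?<Values>[0-9a-fA-F]{4})&?)+

(The "&" is optional so that the last block, which is followed by ";" rather than "&", is captured as well.)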
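Here is a minimal sketch of two of those approaches in C#. The class and method names are made up for illustration. In the single-call version the "&" is written as "&?", which is an assumption on my part so that the final block (followed by ";" instead of "&") is picked up too; the values then come from the "Values" group's Captures collection.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HexBlocks
{
    // Approach 1: one match per hex block, collected with Matches().
    public static List<string> WithMatches(string input)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, @"H'(?<Values>[0-9a-fA-F]{4})"))
        {
            values.Add(m.Groups["Values"].Value);
        }
        return values;
    }

    // Approach 2: a single match in which the "Values" group is captured repeatedly.
    // The "&?" (optional separator) is an assumption, see the note above.
    public static List<string> WithCaptures(string input)
    {
        var values = new List<string>();
        Match m = Regex.Match(input, @"(H'(?<Values>[0-9a-fA-F]{4})&?)+");
        foreach (Capture c in m.Groups["Values"].Captures)
        {
            values.Add(c.Value);
        }
        return values;
    }

    public static void Main()
    {
        string input = "PCORR:BLOCK=V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9;";
        Console.WriteLine(string.Join(",", WithMatches(input)));
        Console.WriteLine(string.Join(",", WithCaptures(input)));
    }
}
```

Both methods should return the same six values for the sample string.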

• #### Regex 101 Exercise I10 - Extract repeating hex blocks from a string

Regex 101 Exercise I10 - Extract repeating hex blocks from a string

Given the string:

PCORR:BLOCK=V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9;

Extract all the hex numbers in the form “H'xxxx”

• #### Regex 101 Discussion I9 - Count the number of matches.

Regex 101 Exercise I9 - Count the number of matches

Given a string like:

# # 4 6 # # 7 # 45 # 43 # 65 56 2 # 4345 # # 23

Count how many numbers there are in this string

-----

There are a few ways to approach this problem. In all of them, we need a way to capture the numeric parts of the string. We've done that before, using something like:

\d+

We can then apply that repeatedly, with code that looks something like this:

Regex regex = new Regex(@"\d+");
Match match = regex.Match(inputString);
int count = 0;
while (match.Success)
{
count++;
match = match.NextMatch();
}

That's a bit ugly, though. There's a shortcut, using:

MatchCollection matches = regex.Matches(inputString);
int count = matches.Count;

That gives the same result.

There's another, more "advanced" approach we can use. The regex is more complex, and the code that you write is harder to understand, so I probably wouldn't prefer it over the last approach. To use it, you need to know a new piece of information about the Match class (where "new" means "something I haven't talked about").

In earlier examples, we used the "Groups" indexer to pull out values that we had captured. So, if we wrote something like:

(?<Digits>\d+)

We would use:

match.Groups["Digits"].Value

to get the string. It is possible to write a regex in which a given capture is used more than one time, and therefore corresponds to multiple strings. If we write:

(                  # Start of repeating section
(?<Digits>\d+)     # a sequence of digits
(\D+|$)            # not digits or end
)+                 # repeat match

We have a single regex that will match a repeating series of digits followed by non-digits, and each match is stored using the "Digits" capture. To get at these captures, we use:

match.Groups["Digits"].Captures

which is a collection of captures, with each one containing the value from one of the captures. To solve our problem, we'd be interested in the count, which is:

match.Groups["Digits"].Captures.Count

If you want extra credit, you can also do this by using Regex.Split(), though I've found that Regex.Match() is easier to use for this sort of problem.
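To make the three counting approaches concrete, here is a small sketch that runs all of them against the sample string. The class and method names are hypothetical, and the repeated-capture version assumes the group is named "Digits" and that the commented pattern is compiled with RegexOptions.IgnorePatternWhitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberCounter
{
    // Walk the matches one at a time with NextMatch().
    public static int CountWithNextMatch(string input)
    {
        int count = 0;
        for (Match m = Regex.Match(input, @"\d+"); m.Success; m = m.NextMatch())
        {
            count++;
        }
        return count;
    }

    // Let Matches() collect everything and read the Count.
    public static int CountWithMatches(string input)
    {
        return Regex.Matches(input, @"\d+").Count;
    }

    // One match, with the "Digits" group captured once per number.
    public static int CountWithCaptures(string input)
    {
        const string pattern = @"
            (                  # start of repeating section
              (?<Digits>\d+)   # a sequence of digits
              (\D+|$)          # non-digits or end of input
            )+                 # repeat";
        Match m = Regex.Match(input, pattern, RegexOptions.IgnorePatternWhitespace);
        return m.Groups["Digits"].Captures.Count;
    }

    public static void Main()
    {
        string input = "# # 4 6 # # 7 # 45 # 43 # 65 56 2 # 4345 # # 23";
        Console.WriteLine(CountWithNextMatch(input));
        Console.WriteLine(CountWithMatches(input));
        Console.WriteLine(CountWithCaptures(input));
    }
}
```

All three should agree on 10 for the sample string.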
• #### Regex 101 Exercise I9 - Count the number of matches

Regex 101 Exercise I9 - Count the number of matches

Given a string like:

# # 4 6 # # 7 # 45 # 43 # 65 56 2 # 4345 # # 23

Count how many numbers there are in this string

• #### Regex 101 Discussion I8 - replace space count with spaces

Exercise I8 - replace space count with spaces

Given a string with embedded space counts:

<15sp>Indented by 15 spaces

Replace the <<count>sp> with <count> spaces.

So, if you have

<4sp>Text

you should end up with

    Text

*******

This is fairly straightforward. First, we need to match the space count "thingy". We'll use:

<(?<Count>\d+)sp>

And then we'll use a MatchEvaluator to do the replacement.

static public string Evaluator(Match match)
{
int count = Int32.Parse(match.Groups["Count"].Value);
return new String(' ', count);
}
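Wiring the evaluator into Regex.Replace() might look something like the sketch below. The pattern <(?<Count>\d+)sp> is inferred from the "Count" group that the evaluator reads, and the SpaceExpander class and Expand method are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceExpander
{
    // Called once per match; returns the text that replaces the <Nsp> marker.
    static string Evaluator(Match match)
    {
        int count = Int32.Parse(match.Groups["Count"].Value);
        return new String(' ', count);
    }

    public static string Expand(string input)
    {
        return Regex.Replace(input, @"<(?<Count>\d+)sp>", new MatchEvaluator(Evaluator));
    }

    public static void Main()
    {
        // Brackets are printed around the result to make the leading spaces visible.
        Console.WriteLine("[" + Expand("<4sp>Text") + "]");
    }
}
```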

• #### Regex 101 Exercise I8 - replace space count with spaces

Exercise I8 - replace space count with spaces

Given a string with embedded space counts:

<15sp>Indented by 15 spaces

Replace the <<count>sp> with <count> spaces.

So, if you have

<4sp>Text

you should end up with

    Text

• #### Regex 101 Discussion I7 - Make sure all characters inside <> are uppercase

Regex 101 Exercise I7 - Make sure all characters inside <> are uppercase

First, as Sheva pointed out, making them all *lowercase* would make a lot more sense, but you have probably noticed that the correlation between these exercises and making sense is tenuous at best.

This is another case where that MatchEvaluator functionality is so useful. To match inside the <>, we simply use:

<.*?>

(Extra Credit: Discuss why the "?" is required, what other options are available, and the relation between the different match options and Adam Smith's market theories.)

Then, our MatchEvaluator is as follows:

static public string Evaluator(Match match)
{
return match.Groups[0].Value.ToUpper();
}
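Put together, a sketch of the whole thing could look like this. The TagUppercaser class and Uppercase method are hypothetical names, and I've swapped in ToUpperInvariant() so that the result doesn't depend on the current culture; that is a substitution, not what the evaluator above uses.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagUppercaser
{
    // Called once per <...> match; Groups[0] is the whole match.
    static string Evaluator(Match match)
    {
        // ToUpperInvariant is used here instead of ToUpper (an assumption for predictable output).
        return match.Groups[0].Value.ToUpperInvariant();
    }

    public static string Uppercase(string input)
    {
        return Regex.Replace(input, @"<.*?>", new MatchEvaluator(Evaluator));
    }

    public static void Main()
    {
        Console.WriteLine(Uppercase("a <b>x</b> <Font size=2>"));
    }
}
```

Only the text inside the angle brackets changes; everything between tags is left alone.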

• #### Regex 101 Exercise I7 - Make sure all characters inside <> are uppercase

Regex 101 Exercise I7 - Make sure all characters inside <> are uppercase

• #### Regex 101 Discussion I6 - Remove font directives from HTML

Regex 101 Exercise I6 - Remove font directives from HTML

Remove all <font…> or </font> directives from an HTML string.

*****

I've decided to start linking my answers back to the original posts, since the answers given there are often as good as or better than the one that I give.

The most obvious way to write this is:

<font.*>|</font>

That's pretty straightforward - match either a <font...>, or a </font>. But it's also wrong: ".*" is greedy, so the ">" in the first part will match the last ">" in the string. We need the non-greedy quantifier:

<font.*?>|</font>

That does what we want it to do (assuming we use singleline and ignorecase options...)

Other ways of doing this showed up in the comments. Maurits suggested using 3 regexes, or a simple one:

</?font.*?>

I don't know whether I prefer that one over mine. It is shorter, though it's a bit harder for me to read the /? part.

Kbiel suggested a version without the non-greedy option:

</?font[^>]*>

which also works well, though I prefer the non-greedy version due to readability.
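To see the difference between the variants, here is a rough sketch that runs the greedy, non-greedy, and negated-character-class patterns over the same input. The FontStripper class and its method names are made up, and the options used (IgnoreCase plus Singleline) follow the assumption mentioned earlier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontStripper
{
    const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Greedy: ".*" runs to the last ">" in the string, so too much is removed.
    public static string StripGreedy(string html)
    {
        return Regex.Replace(html, @"<font.*>|</font>", "", Options);
    }

    // Non-greedy: ".*?" stops at the first ">" after "<font".
    public static string StripLazy(string html)
    {
        return Regex.Replace(html, @"</?font.*?>", "", Options);
    }

    // Negated character class: "[^>]*" cannot run past a ">".
    public static string StripNegated(string html)
    {
        return Regex.Replace(html, @"</?font[^>]*>", "", Options);
    }

    public static void Main()
    {
        string html = "<p><FONT face=\"Arial\">Hi</font> there</p>";
        Console.WriteLine(StripGreedy(html));
        Console.WriteLine(StripLazy(html));
        Console.WriteLine(StripNegated(html));
    }
}
```

On this input the greedy version eats everything from the opening font tag to the end of the string, while the other two leave the paragraph intact.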

• #### Regex 101 Exercise I6 - Remove font directives from HTML

Regex 101 Exercise I6 - Remove font directives from HTML

Remove all <font…> or </font> directives from an HTML string.

• #### Regex 101 Discussion I5 - Remove unapproved HTML tags from a string

When accepting HTML input from a user, allow the following tags:

<b>

</b>

<a href=…>

</a>

<i>

</i>

<u>

</u>

and remove any others.

******

My first comment is that you should be very careful when you do this sort of thing, because if the user can slip script by your filter, they can execute code on your server. Which is bad. Some attacks exploit html escape characters so that what you see may not look like "<script>".

So be forewarned, and forearmed. Or lower-backed, it doesn't really matter which.

My approach is going to match all HTML tags, and then guard against the ones that I don't want to match. So, I start with:

<.*?>

as my initial match. I'll then refine it so that it won't match the good tags, so I can use replace on the bad ones. I'll start by not matching <b> and </b>:

<                        # opening <
(?!                      # negative lookahead
(b|/b)                   # b or /b
)
.*?                      # match between <>
>

What does that mean? Well, negative lookahead means "try to match the pattern at this point. If you do, the match fails". It doesn't eat any of the characters when it does this. In this case, it will try to match "b" or "/b" inside of the <>, and if it can, it will fail. If it can't, it will succeed.

It's very much like the ^ and $ anchors - no characters are consumed, and the match can only continue if a specific condition holds at that point (here, that the pattern does not match). There are both positive and negative variants of lookahead and lookbehind.

Adding the other tags is pretty simple:

<                        # opening <
(?!                      # negative lookahead
(
b|/b|                    # b or /b
i|/i|                    # i or /i
u|/u|                    # u or /u
a\s+href.+?|/a           # a href= or /a
)
)
.*?                      # match between <>
>

Doing the right thing with the string inside the "<a href=...>" is left as an exercise to the reader.
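Here is one way the negative-lookahead filter might be used with Regex.Replace(). It is a sketch rather than a vetted sanitizer. One deliberate change from the pattern above, and it is my assumption, not part of the original: the allowed names must be followed by an optional space and the closing ">", so that tags which merely start with an allowed letter (<br>, <img>, <ul>) are still removed. The TagFilter class and RemoveUnapproved method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagFilter
{
    const string Pattern = @"
        <                       # opening <
        (?!                     # negative lookahead: fail if an allowed tag starts here
          (
            b|/b|               # b or /b
            i|/i|               # i or /i
            u|/u|               # u or /u
            a\s+href.+?|/a      # a href= or /a
          )
          \s*>                  # added assumption: the allowed name runs up to the closing >
        )
        .*?                     # match between <>
        >";

    public static string RemoveUnapproved(string html)
    {
        return Regex.Replace(html, Pattern, "",
            RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
    }

    public static void Main()
    {
        string html = "<b>bold</b><script>alert(1)</script><br><a href=\"x\">link</a><i>i</i><img src=y>";
        Console.WriteLine(RemoveUnapproved(html));
    }
}
```

Note that this only strips the tags themselves; the text between a removed pair stays, and nothing here inspects what is inside an allowed <a href=...>.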

• #### Regex 101 Exercise I5 - Remove unapproved HTML tags from a string

When accepting HTML input from a user, allow the following tags:

<b>

</b>

<a href=…>

</a>

<i>

</i>

<u>

</u>

and remove any others.

• #### Regex 101 Discussion I4 - remove unprintable characters from a string

Exercise I4 - remove unprintable characters from a string

Given an input string, remove all characters that are not printable.

--------------

Assuming ASCII - or something like ASCII - non-printing characters have ASCII values of 31 or lower. You can match them with the following:

[\x00-\x1F]

and then use Regex.Replace() to replace each of those characters with an empty string. Note that this range includes tab, carriage return, and line feed.
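As a quick sketch, the replacement could be done like this. The range here starts at \x00, on the assumption that "31 or lower" is meant to include the NUL character; the Printable class and Strip method are invented names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Printable
{
    // Removes every character with a code of 31 or lower.
    // Tabs and line breaks fall in that range, so they are removed as well.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"[\x00-\x1F]", "");
    }

    public static void Main()
    {
        Console.WriteLine(Strip("a\tb\u0001c\r\n"));
    }
}
```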

• #### Regex 101 Exercise I4 - remove unprintable characters from a string

Exercise I4 - remove unprintable characters from a string

Given an input string, remove all characters that are not printable.

• #### Regex 101 Answer I3 - Expand ranges in a string

Sorry about the lateness of this one. I had a meeting on Friday afternoon, and then had some things to take care of today. So anyway...

I3 - Expand ranges in a string

Given a string like:

1,2,4,6-9,12,15-17,20

expand the ranges, so the final string is:

1,2,4,6,7,8,9,12,15,16,17,20

This is a fun one, because it uses a .NET regex capability that isn't in all versions of regex. If you look at the docs for Regex.Replace, you'll see that there's a version like this:

public string Replace(string, MatchEvaluator);

When the regex engine finds a match, it calls into the MatchEvaluator delegate to do the substitution. So, to match a range, we'll use a regex like:

(?<Start>\d+)\-(?<End>\d+)

and then, we use the following method as the MatchEvaluator:

static public string Evaluator(Match match) {
int start = Int32.Parse(match.Groups["Start"].Value);
int end = Int32.Parse(match.Groups["End"].Value);
string[] values = new string[end - start + 1];
for (int i = 0; i < values.Length; i++)
{
values[i] = (start + i).ToString();
}
return String.Join(",", values);
}

Figuring out how that works is left as an exercise to the reader.

The nice thing about using MatchEvaluator methods is that it lets you create a hybrid of regex and procedural code.
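For completeness, here is a sketch of the evaluator hooked up to Regex.Replace(). The RangeExpander class and Expand method are hypothetical names; the pattern and evaluator body are the ones shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeExpander
{
    // Turns a matched "start-end" range into "start,start+1,...,end".
    static string Evaluator(Match match)
    {
        int start = Int32.Parse(match.Groups["Start"].Value);
        int end = Int32.Parse(match.Groups["End"].Value);
        string[] values = new string[end - start + 1];
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = (start + i).ToString();
        }
        return String.Join(",", values);
    }

    public static string Expand(string input)
    {
        Regex regex = new Regex(@"(?<Start>\d+)\-(?<End>\d+)");
        return regex.Replace(input, new MatchEvaluator(Evaluator));
    }

    public static void Main()
    {
        Console.WriteLine(Expand("1,2,4,6-9,12,15-17,20"));
    }
}
```

Numbers that aren't part of a range never match, so they pass through untouched.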

• #### Regex 101 Exercise I3 - Expand ranges in a string

I3 - Expand ranges in a string

Given a string like:

1,2,4,6-9,12,15-17,20

expand the ranges, so the final string is:

1,2,4,6,7,8,9,12,15,16,17,20

• #### I2 - Find two words in a string

Find any string that has the following two words in it: “dog” and “vet”

******

This is an interesting one, since it's not something that regex is particularly suited for. The test strings that I'm using are:

I took my dog to the vet
The vet fixed my dog
My dog likes to visit veterans
dog dog
The vet is great
He continued with dogged determination

The first two should be successful, all others should fail.

In the comments to the original post, Maurits said that you should use two regexes. I think that it may be the best solution (clearest and easiest), though it may be less performant. But I'm going to talk about the single-regex solution.

The only tricky thing about this is that we need to match words rather than characters. To do that, we can write:

\sdog\s

to find a dog surrounded by whitespace (please spend two minutes, think up the best joke you can having to do with "dog surrounded by whitespace", and post it as a comment). Unfortunately, if I try to match that to:

I am going to walk my dog

it fails, because there's no whitespace after "dog". What we need is a way to match the boundary between a word and a non-word. We can do that with "\b", so if we write:

\bdog\b

we will get the behavior that we want. Two quick notes:

1. Like the $ and ^ anchors, \b doesn't consume any characters, it just asserts what condition must be true to match.
2. The boundary is really between word characters (\w - letters, digits, and underscore) and non-word characters.

So, time to string things together. We can match a sentence with dog followed by vet with the following:

\bdog\b.*?\bvet\b

That handles one case, and to handle the other case, we'll just switch the order. Finally, we get:

\bdog\b.*?\bvet\b
|
\bvet\b.*?\bdog\b

which does what we want it to do, assuming we use RegexOptions.IgnoreCase when we use it (and RegexOptions.IgnorePatternWhitespace if the pattern is kept on separate lines as shown).
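A small sketch that checks the combined pattern against the six test strings listed earlier. The pattern is written on one line here so that only RegexOptions.IgnoreCase is needed; the DogAndVet class and HasBoth method are names made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DogAndVet
{
    // "dog" then "vet", or "vet" then "dog", each as a whole word.
    static readonly Regex Both = new Regex(
        @"\bdog\b.*?\bvet\b|\bvet\b.*?\bdog\b",
        RegexOptions.IgnoreCase);

    public static bool HasBoth(string input)
    {
        return Both.IsMatch(input);
    }

    public static void Main()
    {
        string[] tests =
        {
            "I took my dog to the vet",
            "The vet fixed my dog",
            "My dog likes to visit veterans",
            "dog dog",
            "The vet is great",
            "He continued with dogged determination"
        };
        foreach (string test in tests)
        {
            Console.WriteLine(HasBoth(test) + ": " + test);
        }
    }
}
```

Only the first two strings should report True.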

That's all for now. The next one is a nice one, but it will have to wait until next year...

• #### Regex 101 Discussion I1 - Match a floating point number

Match a floating point number.

[Update: Fixed a cut/paste issue with the match for + and -.]

Many of the comments on the original post spoke of not having sufficient sample strings. I omitted them deliberately, so that the problem requires a bit more work and will, with any luck, be more educational. My hope is that a little more freedom will give me more issues to write about.

Or perhaps I'm just lazy.

Regardless, the floating point number I was thinking of was something like:

-333.33

It has an optional +/- at the front, at least one digit, and then an optional decimal part. And there's no internationalization, so you don't have to worry about characters between digits or "," instead of "." as the decimal point. (What is the term used instead of "decimal point" in such countries? Anybody know?)

And, of course, the usual caveat that if you want to validate a floating point number, something like Double.Parse() is likely to be a bit more robust than a regex you write...

So, to start, we need to match the optional sign character. We do that with:

(\+|-)?

Next one or more digits:

\d+

Optionally followed by a decimal point and one or more digits:

(\.\d+)?

which gives us:

^
(\+|-)?
\d+
(\.\d+)?
$

as the final regex (use RegexOptions.IgnorePatternWhitespace if you keep it split across lines like this). If you wanted to, you could easily extend that to add in "E-038" as an allowable suffix.
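Here is a short sketch of the assembled pattern in use, written on a single line so no extra options are needed. The FloatMatcher class and IsFloat method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FloatMatcher
{
    // Optional sign, one or more digits, optional decimal part; anchored at both ends.
    static readonly Regex Float = new Regex(@"^(\+|-)?\d+(\.\d+)?$");

    public static bool IsFloat(string input)
    {
        return Float.IsMatch(input);
    }

    public static void Main()
    {
        string[] tests = { "-333.33", "+42", "7", "1.", ".5", "1.2.3" };
        foreach (string test in tests)
        {
            Console.WriteLine(test + " -> " + IsFloat(test));
        }
    }
}
```

With this definition "1." and ".5" are rejected, since a digit is required on both sides of the decimal point.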

• #### I2 - Find two words in a string

Find any string that has the following two words in it: “dog” and “vet”

(yes, I know, I didn't get last week's discussion out there. It will be there shortly...)

• #### Regex 101 Exercise I1 - Match a floating point number

Regex 101 Exercise I1 - Match a floating point number

Match a floating point number.

Sample strings:

You know what a floating point number is.

• #### Regex 101 Discussion S6 - Change the extension on a file

Regex 101 Exercise S6 - Change the extension on a file

Given a filename including path, change the extension to .out.

Input string example:

C:\utility\Processor.cs

*****

I said in the exercise description that this takes a bit of care. On first blush (what a weird turn of phrase), one might think that this is a simple problem. But when you dig into it a little deeper, you will find that it remains a simple problem, mostly because the regex defaults give you the right behavior in this case. But not always, so I'm going to use this as a stepping-off point to talk about something that is close to many people's hearts this time of the year.

I'd like to talk about greed.

Though Michael Douglas may have said that "Greed is Good" in Wall Street, things aren't so clear-cut in the world of regular expressions. I started to write something about greediness and non-greediness, but then I realized that I already had. So go read that, and get back to me.

Now, back to the exercise:

If you are a seasoned regex professional, you are likely used to writing non-greedy expressions more often, because they are on the whole more well-behaved than greedy ones. So, here's the first thing you probably wrote:

(?<Path>.+?)
\.
(?<Extension>.+)

Which works fine on the example I gave, but if you add in a few more test cases:

C:\utility\processor.test.cs
C:\utility\fun.stuff\processor.cs

You'll find that it's not working correctly. The problem is that the first match is a non-greedy one, so it's giving you a minimal match - a match up to the first period, not to the last one. If you switch to greedy on the path match, things work right:

(?<Path>.+)
\.
(?<Extension>.+)

and the replacement string to use with this is simply:

${Path}.out

That was less than earthshaking, but I did notice that most of the respondents to the original post got the answer wrong, so at least it wasn't totally trivial.
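To see the greedy and non-greedy path matches side by side, here is a minimal sketch that runs both against the extra test cases. The ExtensionChanger class and its method names are invented for illustration; the patterns are the two shown above, written on one line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtensionChanger
{
    // Non-greedy path: stops at the first period in the string.
    public static string ChangeLazy(string path)
    {
        return Regex.Replace(path, @"(?<Path>.+?)\.(?<Extension>.+)", "${Path}.out");
    }

    // Greedy path: runs to the last period in the string.
    public static string ChangeGreedy(string path)
    {
        return Regex.Replace(path, @"(?<Path>.+)\.(?<Extension>.+)", "${Path}.out");
    }

    public static void Main()
    {
        string[] tests =
        {
            @"C:\utility\Processor.cs",
            @"C:\utility\processor.test.cs",
            @"C:\utility\fun.stuff\processor.cs"
        };
        foreach (string test in tests)
        {
            Console.WriteLine(ChangeLazy(test) + "  |  " + ChangeGreedy(test));
        }
    }
}
```

The two versions agree on the first path and differ on the other two, where the non-greedy one cuts the name off at the first period.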

Bonus exercise. Change the "Extension" match to be non-greedy (.+?), and explain the results.

So that's the last of the simple exercises, though looking at the intermediate ones, they don't really get that much harder.

• #### Regex 101 Exercise S6 - Change the extension on a file

Regex 101 Exercise S6 - Change the extension on a file

Given a filename including path, change the extension to .out.

Input string example:

C:\utility\Processor.cs

Notes:

1. The best answer to this is really to use System.IO.Path.ChangeExtension(), but that wouldn't be much of a Regex exercise, now would it?
2. It's not as simple as it looks