Lame title jokes aside, a metacharacter is a character that describes other characters.  Put together, they form Regular Expressions.  Regular expressions are used for string matching, which is a less threatening way of saying parsing.  Basically, with regular expressions, we can slice-and-dice strings into any which way we want.

I’ve covered Regular Expressions (shortened to RegEx or even RE) before.  However, after talking to a coworker, I realized that those posts assumed a basic knowledge of RegEx and built on it.  Here, we’ll start with the basics.  First, let’s lay down some underlying rules.

1. Anything that isn’t specifically defined to be a metacharacter matches itself.  ‘a’ is not a metacharacter (yet), so it matches the letter ‘a’.

2. As far as PowerShell’s –replace goes, RegExes match as often as they can in the string.  This means that the letter ‘a’ will match three times in aardvark.  (–match, by contrast, stops at the first match it finds.)

3. You can do RegEx with Unicode, but it’s best avoided.

4. Because MS-DOS already absconded with the ‘\’ as the path delimiter, PowerShell uses the ‘`’ (backtick) as the escape character in its own double-quoted strings (the ‘magical’ character that says, “metacharacter starting here,” or “not a metacharacter starting here”).  Inside the RegEx itself, ‘\’ is still the escape character.

5. This article is about .NET RegEx.  Classic RE (i.e. sed(1m) on Unix) and Perl RE have vastly different rules and features.

6. PowerShell RegExes are case-insensitive unless you specify –cmatch or –creplace.
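Rules 2 and 6 are the two that surprise people most, so here is a quick sketch of both.  The variable names are invented for illustration; the operators are the standard PowerShell ones.

```powershell
# Rule 2: -replace acts on every match, so all three 'a' characters are replaced.
$allReplaced = 'aardvark' -replace 'a', '_'      # '__rdv_rk'

# Rule 6: -match ignores case unless you ask for the case-sensitive -cmatch.
$insensitive = 'Aardvark' -match  'aardvark'     # $true
$sensitive   = 'Aardvark' -cmatch 'aardvark'     # $false
```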

===

Where are they used?  Primarily two places in a script:

- $string –replace $RegEx

- $string –match $RegEx

They’re also used as the argument value to any cmdlet’s –Pattern parameter, but how many of us use that parameter?
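As a rough sketch of all three spots, assuming some made-up sample strings (Select-String is one cmdlet that takes –Pattern):

```powershell
$string = 'The quick brown fox'

# -replace: the RegEx first, then (optionally) the replacement text.
$replaced = $string -replace 'quick', 'slow'     # 'The slow brown fox'

# -match: returns $true or $false when the left side is a single string.
$found = $string -match 'brown'                  # $true

# Select-String -Pattern: keeps only the input strings that match.
$lines = 'alpha', 'beta', 'gamma', 'delta' |
    Select-String -Pattern 'ta' |
    ForEach-Object { $_.Line }                   # 'beta', 'delta'
```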

===

Now, let’s meet the metacharacters:

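‘.’ – The lowly period matches any character except a newline.  Note that whitespace counts as characters: spaces, tabs, and carriage returns all count.  (To make it match newlines too, turn on single-line mode by starting the RegEx with ‘(?s)’.)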

‘\s’ – Whitespace.

‘\S’ – Not whitespace.  (Yes, metacharacters are case sensitive.)

‘\d’ – Digit.  [0-9]

‘\D’ – You guessed it, not digit.

‘\w’ – ‘Word’ character.  [A-Z], [a-z], [0-9], and the underscore.

‘\W’ – Not a word character.  Whitespace, punctuation, etc.

‘\n’ – Newline.  PowerShell itself writes a newline as `n inside double quotes, so “This `n is `n a `n multiline `n string” really is a multiline string, and we’ll need some way of matching those newlines.

‘\t’ – Tab character.

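‘^’ – Start of the string.  With the multiline option (‘(?m)’ at the front of the RegEx), it matches at the start of every line instead.

‘$’ – End of the string, or just before a trailing newline.  With ‘(?m)’, it matches at the end of every line.  Note that this cannot be confused for a variable – it is always followed by a single or double-quote.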

There are many more, but those cover 80% of my use cases.
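Here is a small sketch of several of these in action.  The sample line is hypothetical; it holds a space, two digits, and a tab so that each metacharacter has something to find.

```powershell
# Hypothetical sample line: a space, two digits, a tab, then more text.
$line = "Order 66`tshipped"

$hasTwoDigits = $line -match '\d\d'          # $true: '66'
$hasTab       = $line -match '\t'            # $true
$startsRight  = $line -match '^Order'        # $true
$endsWrong    = $line -match 'Order$'        # $false: 'Order' is not at the end
$noWhitespace = $line -replace '\s', '_'     # 'Order_66_shipped'
$digitsMasked = $line -replace '\d', '#'     # "Order ##`tshipped"
```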

===

Keep in mind, the individual metacharacters themselves, while useful, aren’t the whole story.  Here are the modifiers.

---

Iteration modifiers:

‘*’ – Zero or more of the previous (meta)character. These ‘this many or that many’ modifiers apply to both normal characters and metacharacters.  For example:  ‘a*’ is a RegEx for “zero or more ‘a’ characters”.  ‘.*’ is a RegEx for “everything”.  Yes, it’s “zero or more of any character”, which is effectively “everything” (up to the next newline, since ‘.’ stops there).

‘+’ – One or more of the previous (meta)character.

‘?’ – Zero or one of the previous (meta)character.

‘{5}’ – Five of the previous (meta)character.  I’m sure you can extrapolate how to specify “three of the previous (meta)character,” etc.

‘{3,5}’ – Between three and five of the previous (meta)character, inclusive.
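A minimal sketch of the iteration modifiers, using invented sample strings (a ZIP+4 code, two spellings of ‘colour’, and some ragged whitespace):

```powershell
# {5} and {4}: exactly five digits, a dash, exactly four digits.
$isZipPlus4 = '98052-1234' -match '^\d{5}-\d{4}$'    # $true

# ?: the 'u' may appear zero times or once.
$american = 'color'  -match '^colou?r$'              # $true
$british  = 'colour' -match '^colou?r$'              # $true

# +: squeeze each run of one or more whitespace characters into a single space.
$collapsed = 'a    b  c' -replace '\s+', ' '         # 'a b c'

# {3,5}: between three and five digits, inclusive.
$fourDigits = '1234' -match '^\d{3,5}$'              # $true
$twoDigits  = '12'   -match '^\d{3,5}$'              # $false
```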

---

Set modifiers:

‘[abc]’ – The letter ‘a’, OR the letter ‘b’, OR the letter ‘c’.  This matches a single character.

‘[a-y]’ – The letters between ‘a’ and ‘y’, inclusive.  Or, any letter in the alphabet other than the letter ‘z’.

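‘[a-ce-g]’ – The letters between ‘a’ and ‘g’ excluding ‘d’.  Don’t separate the ranges with a comma or a space: inside the brackets, those would be matched literally.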

‘[^z]’ – Any character that is not the letter ‘z’.  This isn’t the same as the above ‘[a-y]’: this one matches numbers, for example.

This brings up one of the common gotchas in RegEx: it’s easy to think you’ve exhaustively considered every possibility, only to find out you missed something.  In general, it’s better to specify what you allow than to specify “everything except these characters.”

Also, note that the above set modifiers only match a single character each.  If you do this:

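'abc' –match '[abc]'

You will match only on the ‘a’: $matches[0] is ‘a’, not ‘abc’.  (With –replace, each of the three characters would be matched and removed one at a time, leaving you with an empty string.)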
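To make the difference between a range and a negated set concrete, here is a short sketch.  The sample strings, including the hypothetical file name, are made up for illustration.

```powershell
# A plain set: strip every vowel, one character per match.
$vowelsGone = 'metacharacter' -replace '[aeiou]', ''   # 'mtchrctr'

# [a-y] touches only the letters a through y ...
$rangeOnly = 'jazz7' -replace '[a-y]', '.'             # '..zz7'

# ... while [^z] touches everything that is not 'z', digits included.
$negated = 'jazz7' -replace '[^z]', '.'                # '..zz.'

# Spelling out what is allowed: keep word characters and periods, replace the rest.
$safeName = 'file:name?.txt' -replace '[^\w.]', '_'    # 'file_name_.txt'
```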

---

Grouping modifiers:

‘(a+b+c+)’ – A string that consists of one or more ‘a’, followed by one or more ‘b’, followed by one or more ‘c’.

‘(a+b+c+){2}’ – The same thing as ‘(a+b+c+a+b+c+)’.  Like in Algebra, we can multiply a section enclosed in parentheses, and the multiplier applies to everything inside the parentheses.

‘((a+b+c+)(d*))’ – One or more ‘a’, followed by one or more ‘b’, followed by one or more ‘c’, followed by zero or more ‘d’.
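A quick sketch of those three patterns against some invented test strings.  The ‘^’ and ‘$’ anchors are added here so that the whole string has to fit the pattern.

```powershell
# One pass of a+b+c+ is not enough for {2} ...
$onePass   = 'abc'      -match '^(a+b+c+){2}$'        # $false
# ... but two passes are, and each pass may use different repeat counts.
$twoPasses = 'aabcabbc' -match '^(a+b+c+){2}$'        # $true

# d* means the trailing d's are optional.
$withD    = 'abcddd' -match '^((a+b+c+)(d*))$'        # $true
$withoutD = 'abc'    -match '^((a+b+c+)(d*))$'        # $true
```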

Wait a minute.  Except for the ‘multiply the parenthetical term’ example two lines above, we could do this without the parentheses.  Why are we putting them in?

Parentheses in RegEx are also ways to refer to the matched term(s).  If we’re looking at the typical “surname, firstname” field, we could convert that back to the more human-friendly “firstname surname” by doing something like this:

PSH> 'dunn, tim' -replace '([^,]+),\s*([\S]+)', "`$2 `$1"
tim dunn

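Note that the $1 and $2 need to reach the RegEx engine as literal text.  They are not PowerShell variables, which is why each ‘$’ is escaped with a backtick inside the double quotes.  (Single quotes, as in '$2 $1', work too.)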
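Here is the same swap as a sketch with a single-quoted replacement string, so no backticks are needed.  The second name is an invented sample; when the left side is an array, –replace works on each element.

```powershell
$names = 'dunn, tim', 'lovelace, ada'

# Single quotes keep PowerShell from touching $1 and $2,
# so the RegEx engine sees them as group references.
$flipped = $names -replace '([^,]+),\s*(\S+)', '$2 $1'
# 'tim dunn', 'ada lovelace'
```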

Parentheses are even more powerful when used with –match:

PSH> 'dunn, tim' -match '([^,]+),\s*([\S]+)'
True

PSH> $matches

Name                           Value
----                           -----
2                              tim
1                              dunn
0                              dunn, tim

Note that we now have entries in the $matches hashtable that correspond to each parenthetical subgroup, in addition to $matches[0], which always refers to the whole portion of the string matched.

However, that’s still not as clear as I’d like.  (Actually, it is, but I’m trying to show something here.) 

PSH> 'dunn, tim' -match '(?<LastName>[^,]+),\s*(?<FirstName>[\S]+)'
True

PSH> $matches

Name                           Value
----                           -----
LastName                       dunn
FirstName                      tim
0                              dunn, tim

As I said, named matches aren’t necessarily clearer, but they’re a neat trick.
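Where named matches do pay off is when you go on to use the pieces.  A minimal sketch, assuming the same sample string; the $first, $last and $display variables are just illustrative names:

```powershell
# (?<Name>...) stores the group under 'Name' in $matches instead of a number.
if ('dunn, tim' -match '(?<LastName>[^,]+),\s*(?<FirstName>\S+)') {
    $first   = $matches['FirstName']    # index syntax
    $last    = $matches.LastName        # property syntax works too
    $display = "$first $last"           # 'tim dunn'
}
```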

===

There.  That concludes what I think will suffice for 80+% of your RegEx uses.  There’s the whole [RegEx] class that we’re skipping, but you’ll need to know it if you want your RegExes to apply themselves only once, or to set other options.
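Just as a taste of the “only once” case, here is a rough sketch using the .NET Regex.Replace overload that takes a maximum replacement count.  Keep in mind that the [RegEx] class, unlike the PowerShell operators, is case-sensitive by default.

```powershell
# Cast the pattern to a [regex] object, then cap the number of replacements at 1.
$re   = [regex]'a'
$once = $re.Replace('aardvark', '_', 1)    # '_ardvark'
```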

There are many other fine RegEx tutorials on the web.  I’m posting this to my blog so I can point people at an article I know well, rather than hope they find a good RegEx tutorial off their search engine of choice.