Knowing when NOT to use RegEx to match Strings (System.Text.RegularExpressions) [Kit George]

Knowing when NOT to use RegEx to match Strings (System.Text.RegularExpressions) [Kit George]

  • Comments 18

We recently got asked this question by a customer: "In C#, how do I ensure that a string entered into a text box is of the format: letter,number,letter,number,letter,number ?"

The first answer seems to be pretty straightforward: use RegEx! Regular Expressions are a pretty powerful mechanism for matching strings, and seem the obvious choice. However, you've always got to remeber that RegEx, while powerful, is also a pretty hefty mechanism for String matching. When you're looking for complex strings it's often a good choice (since writing the code yourself can be unbelevably tricky), but when what you're looking for is pretty simple (as in this case), then doing your own matching shouldn't be too tough, and is going to perform a lot more solidly.

Here's a test I wrote to demonstrate this clearly. I've included two forms of RegEx matching. The first (the single line call, looking for whitespace and the specified pattern) is showing how robust RegEx can get, but check the numbers below: the operation is suitably expensive. The second helps the RegEx out a little by doing some of the more exhaustive searching work for it (in this case, a simple trim, followed by a length check). It therefore, doesn't need to do the space matching itself. The final form is simply doing the string match yourself. In this case, I use Cahr.IsLetter and Char.IsDigit to simply step through and ensure the string is in the right format.

The results bear out that, when you can write the simple test for the pattern you want yourself, it's going to be worth doing that. Even with an optimization, RegEx is not as performant as writing a simple check yourself. When the pattern checking is omre complex, then RegEx can be far more usable.

Duration for StringMatch test                   1798092 Ticks
Duration for RegExMatch1 test (no optimization) 31166928 Ticks (18x slower!!!)
Duration for RegExMatch2 test (optimization)    18380496 Ticks (10x slower)

using System;
using System.Text.RegularExpressions;
using System.IO;

class Test
{
  static void Main(string[] args)
  {
    // include your own test cases if needed
    string[] testCases = {"l2j1a9", "  l2j1a9  ", "l2j1a9  ",
      "a l2j1a9 ", " l2j1a9 a", "lwj1a9", "  l2 j1a9  "};
    long duration = 0;

    Console.WriteLine("Standard Test:");
    foreach(string test in testCases) {
      duration += RunTest(test, false);
    }
    Console.WriteLine("Duration of tests = {0}\r\n", duration);

    duration = 0;
    Console.WriteLine("\r\nRegex Test:");
    foreach(string test in testCases) {
      duration += RunTest(test, true);
    }
    Console.WriteLine("Duration of tests = {0}", duration);
  }

  static long RunTest(string s, bool useRegex) {
    bool result = false;

    // this is just a timing run, for interest
    long startTime = DateTime.Now.Ticks;
    if (useRegex) {
      for (int i=0;i<20000;i++) {
        RegexMatch(s);
      }
    } else {
      for (int i=0;i<20000;i++) {
        StringMatch(s);
      }
    }
    long duration = DateTime.Now.Ticks - startTime;

    result = (useRegex ? RegexMatch(s) : StringMatch(s) );

    if (result) {
      Console.WriteLine("String '{0}' matches the requirement", s);
    } else {
      Console.WriteLine("String '{0}' does NOT match the requirement", s);
    }

    return duration;
  }

  static bool RegexMatch(string s) {
    // Not optimized version. Read as:
    // any (*) amount of space (\s) at the beginining (^)
    // any word character (\w), any digit (\d), word, digit, word, digit
    // any (*) amount of space (\s) at the end ($)
    //Regex r = new Regex(@"(^\s*)\w\d\w\d\w\d(\s*$)");

    // Optimized version. We take out ONLY the checks for the whitepace
    // Try this instead of the above line, and see what the perf difference is
    /*   
    Regex r = new Regex(@"\w\d\w\d\w\d");

      s = s.Trim();
      if (s.Trim().Length != 6)
        return false;
       
    */
    return r.IsMatch(s);   
  }

  static bool StringMatch(string s) {
    if (s.Trim().Length != 6)
      return false;

    char[] chars = s.Trim().ToCharArray();

    return (Char.IsLetter(chars[0]) && Char.IsDigit(chars[1]) &&
      Char.IsLetter(chars[2]) && Char.IsDigit(chars[3]) &&
      Char.IsLetter(chars[4]) && Char.IsDigit(chars[5]));
  }
}


  • It depends, but I'd have to say in most cases performance isn't the issue. Particular in validating user input, which doesn't happen that frequently. So RTI'ing string validation code to shave off a few clock ticks is probably a waste of a programmer's time.

    Programmers should instead focus on which technique is clearer to understand for other programmers. Regexs have a reputation of being hard to read. But at least, in your simple example...

    new Regex(@"\w\d\w\d\w\d");

    ...I find that very easy to read. Easier then the StringMatch method. So I'd probably use the Regex.


  • Isn't this quite unfair to the regexes? Of course, they tend to be slower, but there are a couple of things here:

    1. You create a new RegEx each time. True, they're cached internally, but for simple expressions, the "new" line there will use much of the time.
    2. If I don't miss anything, you don't compile the regexes to IL, which is basically what you do by writing a StringMatch method on your own. If you're going to run 20000 iterations of an expression that is constant enough that you consider hardcoding it in C#, the costs of compiling the regex also seem low.

    I know you tried to make a point here, but there are certainly cases where just turning the verification into C# is a bit more complicated and the regex still may be slow.

    It would be interesting to see what numbers you get if you keep one static, compiled Regex instance.
  • Ah, I went ahead and tried it on my own.

    Standard test gives 2188984.
    "Unoptimzed" regex, compiled, externally gives 3439832. (only a factor 1.6, compare to your 18!)
    "Optimized" regex, compiled, externally gives 3752544. The trim call and length check actually made things slower!

    I then tried the unoptimized regex with new instances like your original version, but still only get a factor 7 difference, so I guess that my system (with the 1.1 framework) simply behaves differently.
  • Also, marking the two groups as non-capturing brought the factor down to 1.3 on my machine, but it is highly dependent on low timing resolution...
  • Wow, thanks for jumping on me quickly all, that's great. Without a shadow of a doubt, there's ways to make RegEx do this way better (CN points this out pretty well by showing that compiling the Regex can help significantly: just remeber, you don't want too many precompiled regexes, it ends up being a bind), but I was mostly trying to illustrate that if you can do things cheaply and easily yourself, go ahead. I love that people were leaping up in defense of RegEx, but know when to try something else: if performance is critical especially (in this case). We do this in plenty of places in the framework, where we need to optimize the common scenario above others. One example is providing overloads for members that take base types (check out Console.WriteLine as an example), to help ensure that operations for those specific types (the most commonly used) are as optimized as possible. I would urge a similar approach here
  • The techniques that make sense for the relative handful of developers writing the .NET framework rarely make sense for the vast legions of developers who are building on top of the framework.

    The guys at the top of the kernel-OS-framework pyramid don't have much in common with the lower (and order-of-magnitude larger!) levels of the coding pyramid.

    I occasionally run into developers who point to the Linux source code as a model for development, and my question to them is always the same: since when are we writing an operating system? Everyone likes to think that they're working on something fantastically complicated that will be used by millions of developers, but the reality is.. somewhat less exciting.
  • Well, I think the most important lesson is that the optimized Regex, relying on String.Trimming first, was actually inferior to the "unoptimized" regex, while both are inferior to C#.

    Of course, there are good reasons to optimize for the common case, but it's just as important to check that the optimization really gives you a benefit. I've seen things like keeping WeakReferences to a lot of objects to allow aggressive GC if needed. The only problem with the code was that the data protected was a simple combination of a DateTime and a primary ID from the database. That made the WeakReference itself just as heavy as the object referenced...

    Just my $ 0.02.
  • FWIW, here's what I get on an Athlon FX-53:

    Unoptimized = 7x slower
    Optimized = 6.6x slower
    (move Regex r up to outer loop)
    Unoptimized = 2.4x slower

    I think most of the perf penalty is just instantiating the Regex object over and over.

    > "We do this in plenty of places in the framework"

    And I agree, *you* should. But I dislike working with other developers who have delusions of grandeur and think they're working on the framework too. Hint: they're not.
  • If I move the Regex declaration up there to be static and put in RegexOptions.Compiled, the performance of RegEx will be much better only 50% slower. (1093750 vs 1562500)

  • The numbers gets even closer than 50% if you do a better Trimmming of your strings :)
    You call trim to much man!

    Also call the regex once before timing, so the dll is compiled and loaded.

    My number are now
    781260 vs 1093764 on a P4 2.8
  • I've made some measurements with an edited version of the code you posted. Specifically, I made some changes:

    * Made the regex shared and compiled (to give it similar advantages to C# code)
    * Warmed up *both* loops before timing them
    * Made the number of iterations configurable on the command line

    (I'm not posting it here for brevity, and because indenting gets lost in the blog comment formatter).

    I've timed the code for iterations between 100000 and 1000000 (that is, 100 thousand and 1 million) in 10 steps of 100 thousand.

    Here's the raw data as it is on my machine:

    100000,3281271,5937538,0.552631579
    200000,5937538,11250072,0.527777778
    300000,8906307,17031359,0.52293578
    400000,11875076,22812646,0.520547945
    500000,14687594,28593933,0.513661202
    600000,17812614,35781479,0.497816594
    700000,20781383,41406515,0.501886792
    800000,24531407,45469041,0.5395189
    900000,26718921,51406579,0.519756839
    1000000,29531439,56719113,0.520661157

    The leftmost column is iterations, the second column is the C# way, the third column is the Regex way, and the fourth column is the 2nd divided by the 3rd.

    Some points:

    1. The .Ticks field has low granularity: the number 5937538 occurs twice, (see 100000 and 200000)

    2. The scaling relationship is roughly flat (at 600000 it looks like we might have an upset because the Regex looks like it keeps on improving, but then things get jittery and randomness creeps in)

    3. The scaling factor is roughly 50%

    Conclusions:

    1. Don't use Ticks for measuring code performance, unless relatively large time periods are involved (that is, at the very least a couple of seconds).

    2. There is no algorithmic difference between the two methods.

    3. The performance gain of the C# method versus its inherent complexity increase / readability loss / flexibility loss is not sufficient to warrant its use.

    In summary: don't use this C# code. Stick to REs.
  • static bool StringMatch2(string s) {
    if (s.Trim().Length != 6)
    return false;

    return (Char.IsLetter(s, 0) && Char.IsDigit(s, 1) &&
    Char.IsLetter(s, 2) && Char.IsDigit(s, 3) &&
    Char.IsLetter(s, 4) && Char.IsDigit(s, 5));
    }
  • Well I've never seen such a good dose of feedback on an issue: awesome. Obviously I'll need to be more careful on making the right code available next time. I think Jeff had an especially good point on making sure we're aware of reality in these situations
  • I lost the reference, but one of your colleagues once wrote a blog entry on the RegexOptions.Compiled flag. To summarize, there are basically a few different modes of use and re-use for regexes.

    Disclaimer: this is straight from memory, there may be errors...

    First, the one-shot use, like in your post. This has a significant overhead cost of creating the regex. Instead of creating the Regex object on each call, you can as well call the static overloads of the methods in the Regex class. Creating a Regex instance in this case is mere overhead that makes your code less readable.

    Second, the uncompiled, but cached use. Create the Regex instance once, but *without* the RegexOption.Compiled flag and store it somewhere. Then reuse that same regex instance. For many practical situations, putting it in a static field for the class using the regex works well. In other situations (typical for regexes that are not known at compile time), use nonstatic fields. This works faster than your solution in all situations. From my experience, this way of using Regexes is the best in most practical situations.

    Third, the compiled and cached use. Create the regex instance once, but *with* the RegexOptions.Compiled flag. Your regex will be faster (about 30% IIRC), but at a very significant extra startup cost (you are creating a new dynamic assembly for each regex). Initially I used this option a lot in my code, but I noticed that the overhead cost is often more than it is worth! Measuring performance is the key to find out if this route is right for you.

    Fourth: the Regex class provides options to compile a set of Regexes to a 'real' assembly, which can be referenced from your project. This is probably the fastest option for runtime, but too much of a hassle at design/compile time in most cases.
Page 1 of 2 (18 items) 12