Regex perf bug

Regex perf bug

  • Comments 18

While trying to get a WordML -> HTML conversion that would work reasonably well with .TEXT, I came across an interesting performance bug in the Regex class of .NET.

Run the following code and see what happens. It doesn't produce any output; just step over each line in the debugger and note how long the "slow" search takes:

 

using System.Text.RegularExpressions;

 

class Class1

{

  static void Main()

  {

    string s = "fred" + new string(' ', 5000);

 

    Regex slow = new Regex(".*fred", RegexOptions.None);

    Regex fast = new Regex("^.*fred", RegexOptions.None);

 

    fast.Replace(s, "");

    slow.Replace(s, "");

  }

}

In theory the two expressions should be the same -- "anything followed by fred" is semantically the same as "anything from the start of the string followed by fred" -- but for some reason they behave very differently. I'll see if it's a known bug tomorrow (having trouble accessing the database from home).

  • This is not a bug. The slow replacement will indeed be slow because it will *laziliy* attempt to make a replacement at every place within the 5004 characters... yep *lazily*.

    The "fast" operation - although still lazy and much slower than it could be - will only ever attempt to match from its anchored marker.
  • Thanks Darren; can you elaborate a bit more on the "laziness"?

    It seems that the most naive way to implement a ".*foo" pattern would be just to do an IndexOf("foo") and return everything up to (and including) that. It only takes two tests (one on the original string, one on the replaced string) to figure out there's nothing left to do.

    Under what circumstances would they produce different results (ie, when is "everything" not equivalent to "everything from the start")?
  • I reckon that they are not semantically equivalent; consider the string "fredfred" and how many times replacement should be done in each case.

    Even better is the case where the string may be partially overlapping with itself; how should the pattern ".*abab" be matched with the string "ababab"?
  • Err... woops :-) When I wrote *lazy* I did - of course - mean *greedy*.
  • Jeffrey Friedl's "Mastering Regular Expressions 2nd Ed." has some great discussions of performance implications when using greedy and non-greedy expressions. It's well worth the read and has a specific chapter on the .NET implementation of regular expressions in PDF format that is freely downloadable.

    http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf


  • Roland,

    In both cases, there is no difference. Both patterns greedly match all the garbage at the start of the string and match the trailing "fred" / "abab".
  • Kelly,

    Thanks for the link to the .NET chapter. I actually read the Regexp book a long time ago and have it lying infront of me now :-). No obvious reasons are shown in the book though (quick perusal of the index...).
  • But can anyone show me where the "fast" and the "slow" patterns give different results?

    It doesn't matter if the engine is doing "the right thing" in the slow case; *if* it can do "the right thing" faster then it should, and not doing it is a (possibly low priority) bug with performance.

  • Peter, these things will never give different results; they will both find a match against *only* the last instance of "fred" contained within any given text.

    One of the things that Jeff Friedl mentions in his book - the Chapter about optimizing and perf I think - is that whenever possible you show expose leading/trailing characters in the pattern which allow many optimizations to kick in. It may be that, what you *percieve* as a bug is simply a bunch of optimizations "not" kicking in :)
  • Hey Darren,

    It may just be a "missing feature" but we still call those "bugs" at Microsoft :-)

    Do you remember the "news" story about there being 60,000 (or whatever) "bugs" in Windows 2000? Well now you know where they all come from. Everything at Microsoft is considered a bug -- perf issues, spelling errors, help text, new feature suggestions, icon changes, design ideas, re-arranging toolbar buttons, renaming menu items, you name it.

    We'll see if they accept my request for a change in Whudbey :-)
  • > but we still call those "bugs" at Microsoft :-)

    :-) Heh, thanks for the heads-up!

    > We'll see if they accept my request for a change in Whudbey :-)

    Great, hey... while they've got that file checked-out :-) can you ask them to fix my "pet-peeve" too

    http://regexblogs.com/dneimke/archive/2003/11/23/181.aspx

    See the section on that page titled: "My Biggest Gripe about Named Groups".

    Cheers (and Happy New Year)!
    - Darren
  • > It seems that the most naive way to
    > implement a ".*foo" pattern would be just
    > to do an IndexOf("foo") and return
    > everything up to (and including) that.

    Couple points.

    First off, you're begging the question. What algorithm does "IndexOf" use, and how do you know it is fast? Searching arbitrary strings for arbitrary patterns can be quite expensive.

    But more generally, there's a conceptual issue here. The thing about regular expressions is that they are actually a programming language for tersely specifying finite state machines. For a given set of inputs and desired outputs you could design an infinite number of different finite state machines that work, but they would have different performance characteristics!

    The regular expression engine does not make guesses as to how to optimize the FSM for a given set of inputs because the engine doesn't know what the inputs are going to be ahead of time. The regexp parser just builds up an FSM implementation and runs it on your string, for better or for worse.
Page 1 of 2 (18 items) 12