Sometimes you hear and accept advice but you don't really know the details behind it. That has always been the case with me and high CPU situations caused by badly formatted regular expressions. The reason I haven't really bothered to find out why is because I've never had a case on it until today...
It's actually not that I haven't been interested, in fact regular expressions fascinate me in a geeky way. My favourite class in college was a class on "automata theory and formal languages" a class about finite state machines, turing machines, and regular expression like functions and constructs.
The problem that I got involved a very complex regular expression, but rather than using that one for this case study, I will use a very simple regular expression that the BCL (Base Class Library) team blogged about here, since it shows the same problem.
<disclaimer>Before I go on, I just want to say that I am by no means an expert on regular expressions, so I may be missing some important points in the discussion or have some details wrong, and if you have regex specific questions you're probably better of directing them to the BCL blog.</disclaimer>
Overview of problem
The code I'm using looks like this
Regex slowRegex = new Regex("([a-z]+)*=");
Match m = slowRegex.Match("helloworldhelloworld");
The problem and solution is already given in the BCL post but here is a short description of what happens
The .NET Regular Expression Engine is an NFA (non-deterministic finite automation engine) which is great since it gives developers a great deal of flexibility when creating regular expressions. The drawbacks of an NFA vs. a DFA is that some regular expressions like the one above can take an extremely long time to run.
As David writes on the BCL blog the expression above will
It will quickly go through steps 1 and 2 and then find out that it can't match 3, then it has to backtrack and retry steps 1 and 2 in a different way, and because of the nested quantifiers ((..+)*) there is an almost endless number of combinations it has to backtrack a lot, and the string will match to all of these combinations of ([a-z]+)* so you have to go through the whole string for each combination to realize that the string isn't a match to the whole regexp.
There are a few things to point out here
To quickly explain backtracking, consider the following regular expression and string
Regexp: (abe|abc)(dg|dc|de)String: abcde
This regular expression will match any string that starts with abe or abc, and ends with any of dg, dc, or de, which could be rewritten as ab(e|c)d(g|c|e) to avoid the backtracks below.
The order of execution here will be something like
In other words we backtracked 3 times to match the expression. The reason I bring this up is because the number of backtracks is essential to figuring out how performant the regexp is.
Debugging the issue
What I want to show you in this post is how to debug it and identify it with windbg and sos along with some information that may help you figure out how to optimize your regular expressions.If i test with a fairly short string (about 20 chars) the regular expression will chug away for a good few minutes before returning to me that there are no matches.
If I attach windbg when CPU usage is high and run !clrstack on the only active thread (thread 0 in my case) I get this...
OS Thread Id: 0x90c (0)
0016eb50 7a4c5b28 System.Text.RegularExpressions.RegexInterpreter.Go()
0016ec84 7a4c5135 System.Text.RegularExpressions.RegexRunner.Scan(System.Text.RegularExpressions.Regex, System.String, Int32, Int32, Int32, Int32, Boolean)
0016ecb0 7a4b12b3 System.Text.RegularExpressions.Regex.Run(Boolean, Int32, System.String, Int32, Int32, Int32)
0016ece0 7a4b147c System.Text.RegularExpressions.Regex.Match(System.String)
0016ecec 0093050f BadRegexp.Form1.btnRun_Click(System.Object, System.EventArgs)
0016ed0c 7b062e97 System.Windows.Forms.Control.OnClick(System.EventArgs)
0016ed1c 7b11f4f9 System.Windows.Forms.Button.OnClick(System.EventArgs)
0016ed28 7b11f608 System.Windows.Forms.Button.OnMouseUp(System.Windows.Forms.MouseEventArgs)
0016ed50 7b0eaba2 System.Windows.Forms.Control.WmMouseUp(System.Windows.Forms.Message ByRef, System.Windows.Forms.MouseButtons, Int32)
0016ed9c 7b07687f System.Windows.Forms.Control.WndProc(System.Windows.Forms.Message ByRef)
0016eda0 7b08549e [InlinedCallFrame: 0016eda0]
0016ee40 7b0853bb System.Windows.Forms.Button.WndProc(System.Windows.Forms.Message ByRef)
0016ee48 7b07e43d System.Windows.Forms.Control+ControlNativeWindow.OnMessage(System.Windows.Forms.Message ByRef)
0016ee4c 7b07e416 System.Windows.Forms.Control+ControlNativeWindow.WndProc(System.Windows.Forms.Message ByRef)
0016ee60 7b07e21a System.Windows.Forms.NativeWindow.Callback(IntPtr, Int32, IntPtr, IntPtr)
0016f008 00940a0c [NDirectMethodFrameStandalone: 0016f008] System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG ByRef)
0016f018 7b0882e9 System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(Int32, Int32, Int32)
0016f0b8 7b087ea9 System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)
0016f130 7b087ce7 System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)
0016f160 7b0688ba System.Windows.Forms.Application.Run(System.Windows.Forms.Form)
0016f170 009300a8 BadRegexp.Program.Main()
0016f394 79e79dd3 [GCFrame: 0016f394]
This stack tells us that we are in the middle of matching a regular expression, and that the reason we are here is because we called Regex.Match in btnRun_ClickIf I let it continue running and break in a while later I see the exact same stack but the instruction pointer (EIP) in Go() has changed so i can tell that it is moving but appears to be stuck in some type of loop inside the Go function or potentially around it.
To figure out what regular expression we are running and what string we are parsing we can run !dso and dump out some of the objects on the stack
OS Thread Id: 0x90c (0)
ESP/REG Object Name
0016ec90 0152b7ac System.Text.RegularExpressions.Regex
0016ecac 0152c434 System.String helloworldhelloworld
0:000> !do 0152b7ac
MT Field Offset Type VT Attr Value Name
790fcb30 400060a 4 System.String 0 instance 0152b784 pattern
0:000> !do 0152b784
A few iterations later when i break in I see the exact same stack but this time it has a call to System.Text.RegularExpressions.RegexInterpreter.Backtrack on top, and based on the discussion above this looks extremely interesting.
In order to figure out if we are looping inside the Go function or outside of it I set a breakpoint on the Go function and on the Backtrack function.
0:000> !do 0152c47c
0:000> !dumpmt -md 7a764e3c
Entry MethodDesc JIT Name
7a4c5610 7a826100 PreJIT System.Text.RegularExpressions.RegexInterpreter.Go()
7a4c7a10 7a826040 PreJIT System.Text.RegularExpressions.RegexInterpreter.Backtrack()
0:000> !bpmd -md 7a826100
MethodDesc = 7a826100
Setting breakpoint: bp 7A4C5610 [System.Text.RegularExpressions.RegexInterpreter.Go()]
0:000> !bpmd -md 7a826040
MethodDesc = 7a826040
Setting breakpoint: bp 7A4C7A10 [System.Text.RegularExpressions.RegexInterpreter.Backtrack()]
Then I modify the breakpoints so that we just log rather than breaking in,
0 e 7a4c5610 0001 (0001) 0:**** System_ni!System.Text.RegularExpressions.RegexInterpreter.Go()
1 e 7a4c7a10 0001 (0001) 0:**** System_ni!System.Text.RegularExpressions.RegexInterpreter.Backtrack()
0:000> bp 7a4c5610 ".echo Go;g"
breakpoint 0 redefined
0:000> bp 7a4c7a10 ".echo Backtrack;g"
breakpoint 1 redefined
and finally, here are the results (I hope I got all the calculations correct when creating the formulas:))...
Number of characters
Go breakpoint (# times)
Backtrack breakpoint (# times)
so the number of backtracks grows exponentially with the length of the string
If we set the same breakpoints for the regexp ([a-z])*= we get the following results
Number of characters
Go breakpoint (# times)
Backtrack breakpoint (# times)
So here the number of backtracks doesn't grow exponentially which is why it performs a lot better even for long strings
Interestingly enough, for the slightly modified but equivalent regexp [a-z]*= the number of backtracks is 4n-1. It doesn't really make a huge difference and I don't know why there is a difference but still I found it interesting:)
When you work with regular expressions that use an NDA engine you should take extra care to try to avoid or decrease the following 3 elements
On a separate note, the technique I used to print out the number of backtracks using breakpoints can be used for other issues too if you need to determine where a tight loop exists.
Until next time,Tess
Let me just say, Automata and formal languages are soooo my last favorite class.
This is an excellent article (http://swtch.com/~rsc/regexp/regexp1.html) describing the backtracking problem in the context of Perl. It shows how the problem could even be fixed with a DFA implementation of regular expressions instead of the NFA most implementations use.
Thanks for the article link,
sure it could be fixed by using a DFA impl., but the problem with DFA is that it is less flexible.
This article describes why the .NET regexp engine is an NFA rather than a DFA http://msdn2.microsoft.com/en-us/library/0yzc2yb0.aspx
Link Listing - October 27, 2007
Silverlight and Astoria sample site from Shawn Wildermuth [Via: glengordon ] Designer Dinner (Nov....
Thanks for the article. I learnt a new command !bpmd and the technique to insert command strings into bp.
I had a similar experience with our site. I deployed a performance optimization release and horridly observed that on avg 10-20% CPU usage became 80-90%. Server was timing out at high load hours.
When I profiled the app I found a regaular expression is being used to validate a cookie for domain name. The expresson was quite expensive too.
Fixing that solve 2 nights of nightmares.
Thanks for this you've just explained why our webservers CPU maxes out at 100% in certain cases when long unicode text is passed through a certain reg expression. I think I will need to do some rewriting!
I'm a newbie with regular expressions, but I think that the option you pointed in number 3 is not an equivalent to the former regex.
I think that the original regex groups the matching characters that a [a-z]+ can match (using the "()" ) and then try to find another group of matching characters that are equal to the first group of characters.
What I try to say if ([a-z]+) matched "hello", it will try to find many "hello"s in the string, doing several backtracks searching for almost endless groups. I guess that that's one of the reasons behind the exponential backtracking.
Also, could be that I'm completely wrong :), and that's why I'm still a newbie...
Agree with nativecpp. Good tips with breakpoints.
I believe [a-z]+ and ([a-z]+) will match the same string, i.e. any string of one or more characters between a and z.
Yes... I did a test and you were right. I'm still a newbie :)
But http://www.regexpal.com/ showed that different kind of matches occured, with and without parenthesis and the *.
Well as you said, the idea of this post was to show about how to identify regex processing.
Cool, I just had to try it out with a large string and nested quantifiers and almost couldnt shut down the browser because it was hogging the CPU:) I should have known better than to try:)
An atomic group would also solve your problem. "(?>([a-z]+))*=" instead of "([a-z]+)*=".