Regular Expression performance [David Gutierrez]

I often get questions about Regex and what the RegexOptions.Compiled flag actually does. There are in fact three different modes that Regex can work in: interpreted (without the compiled flag), compiled on the fly (with the compiled flag), and precompiled.  Each of these modes has its own performance trade-offs. I'm mainly talking about startup performance, which is the initial cost of creating your Regex, and runtime performance, which is the cost of running matches.

Interpreted

This one is what you get by default when you don't pass in RegexOptions.Compiled as an option.  Here are some interpreted usages of Regex:

Regex r = new Regex("abc*");

Regex.Match("1234bar", @"(\d*)bar");

We parse your expression into a set of custom opcodes, and then use an interpreter to run the expression later.  The cost of creating the Regex is low, but this mode also has the lowest runtime performance of the three.

Compiled on the fly

In this case you've passed in RegexOptions.Compiled:

Regex r = new Regex("abc*", RegexOptions.Compiled);

Regex.Match("1234bar", @"(\d*)bar", RegexOptions.Compiled);

In this case, we first do the work to parse into opcodes.  Then we also do more work to turn those opcodes into actual IL using Reflection.Emit. As you can imagine, this mode trades increased startup time for quicker runtime: in practice, creating a compiled Regex takes about an order of magnitude longer, but yields 30% better runtime performance.  There are even more costs for compilation that should be mentioned, however.  Emitting IL with Reflection.Emit loads a lot of code and uses a lot of memory, and that's not memory that you'll ever get back.  In addition, in v1.0 and v1.1, we couldn't ever free the IL we generated, meaning you leaked memory by using this mode.  We've fixed that problem in Whidbey.  But the bottom line is that you should only use this mode for a finite set of expressions which you know will be used repeatedly.
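One way to follow the "finite set of expressions, used repeatedly" advice is to build each compiled Regex exactly once and keep it in a static field. The sketch below is only an illustration: the class name DigitsBeforeBar and the method TryExtract are hypothetical, not part of the framework.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative, not a framework API.
public static class DigitsBeforeBar
{
    // Constructed once per process, so the Reflection.Emit cost
    // of RegexOptions.Compiled is paid a single time.
    private static readonly Regex Pattern =
        new Regex(@"(\d*)bar", RegexOptions.Compiled);

    // Returns true and the captured digits when the input contains "bar".
    public static bool TryExtract(string input, out string digits)
    {
        Match m = Pattern.Match(input);
        digits = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }
}
```

Calling DigitsBeforeBar.TryExtract("1234bar", out digits) repeatedly reuses the same compiled instance. Constructing a new Regex with RegexOptions.Compiled inside a loop or per request would pay the compilation cost every time.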

Precompiled

Precompilation solves many of the problems associated with compiling on the fly.  The idea is that you do all of the work of parsing and generating IL when you compile your app, ending up with a custom class derived from Regex.  The big trade-off here is that you need to write a small app which will do the compilation for you (i.e. an app which calls Regex.CompileToAssembly(...) with the right parameters), and thus you need to know your important regexes in advance.  In general this isn't such a problem, since if you're writing a parser, you probably don't need to change your expressions at runtime. Your startup time reduces to loading and JITing your class, which should be comparable to the startup cost of interpreted mode.  Runtime performance will be identical to the compiled on the fly case.  It's the best of both worlds!
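The "small app" could look roughly like the sketch below. The class RegexPrecompiler, the generated type name DigitsBeforeBarRegex, and the namespace MyApp.Regexes are hypothetical names chosen for illustration. Note that Regex.CompileToAssembly is a .NET Framework API; on .NET Core and later it throws PlatformNotSupportedException, so treat this as a sketch for the framework this post describes.

```csharp
using System.Reflection;
using System.Text.RegularExpressions;

// Hypothetical build-time helper: run it once as part of your build.
public static class RegexPrecompiler
{
    // Describes each regex to precompile. The type and namespace
    // names here are made up for this example.
    public static RegexCompilationInfo[] DescribeRegexes()
    {
        return new[]
        {
            new RegexCompilationInfo(
                @"(\d*)bar",            // pattern
                RegexOptions.None,      // options baked into the generated class
                "DigitsBeforeBarRegex", // name of the generated Regex-derived class
                "MyApp.Regexes",        // namespace of the generated class
                true)                   // make the generated class public
        };
    }

    // Emits the assembly containing the generated classes
    // (.NET Framework only).
    public static void Build()
    {
        var assemblyName = new AssemblyName("MyApp.Regexes");
        Regex.CompileToAssembly(DescribeRegexes(), assemblyName);
    }
}
```

After RegexPrecompiler.Build() has run, the main application would reference the emitted assembly and use the generated class like any other Regex, for example new MyApp.Regexes.DigitsBeforeBarRegex().Match("1234bar").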

  • It would be really awesome if the C# compiler or IDE could do that for you, possibly by allowing a special file type and letting Regex.CompileToAssembly() or a more granular method like a hypothetical Regex.CompileToModule() have a crack at it. This approach could be extended to other DSLs too. I wonder if with the new extensible build engine it might not be possible to create such a tool as an add-in. Hmm...
  • The Whidbey C++ project system can manage and link multiple files from multiple languages (e.g. C++, VB.Net and C# all in the same project). Is it possible to add a .regex filetype into the C++ project system which will also compile and link into a single assembly?
  • We explored doing something like what Doug mentioned as a general purpose mechanism, though it didn't end up happening in Whidbey. I'll enter a feature request to make precompiling easier somehow, though.
  • > In addition, in v1.0 and v1.1, we couldn't ever free the IL we generated, meaning you leaked memory by using this mode. We've fixed that problem in Whidbey.

    As far as I know, Whidbey does not support unloading assemblies. Do you create an AppDomain for each compiled Regex (and take the cross-AppDomain performance hit), or do you use some sort of internal mechanism for unloading assemblies?
  • Lexp, you're right that Whidbey can't unload assemblies. What we've done is switch the compiled on the fly case to use a new form of Reflection.Emit called lightweight code-gen. Rather than generating assemblies, modules, types and methods, with lightweight code-gen you are only allowed to generate methods. In the end you receive a delegate to the generated method, and when that delegate is GC'ed, all of the IL and the JIT'ed code is reclaimed.

    Note that the precompiled case can't use lightweight code-gen, so you still can't unload those types.
  • I am currently in the middle of a way-overdue refactoring of MhtBuilder, which uses regular expressions extensively. I noticed that I had sort of mindlessly added the RegexOptions.Compiled all over the place. It says "compiled" so it must be...

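The lightweight code-gen mechanism mentioned in the comments above is exposed publicly as System.Reflection.Emit.DynamicMethod. The sketch below is not how Regex itself emits its matching code; it is only a minimal illustration of the pattern described: generate a single method, receive a delegate, and let the GC reclaim the code once the delegate is unreachable. The class name LightweightCodeGenDemo and the method BuildAdder are hypothetical.

```csharp
using System;
using System.Reflection.Emit;

// Hypothetical demo of lightweight code-gen; not the Regex implementation.
public static class LightweightCodeGenDemo
{
    // Emits a standalone method equivalent to (a, b) => a + b.
    // No assembly, module, or type is generated, only the method.
    public static Func<int, int, int> BuildAdder()
    {
        var method = new DynamicMethod(
            "Add", typeof(int), new[] { typeof(int), typeof(int) });

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0); // push first argument
        il.Emit(OpCodes.Ldarg_1); // push second argument
        il.Emit(OpCodes.Add);     // add the two values
        il.Emit(OpCodes.Ret);     // return the sum

        // The delegate keeps the generated code alive; once it is
        // collected, the IL and JIT'ed code can be reclaimed.
        return (Func<int, int, int>)method.CreateDelegate(
            typeof(Func<int, int, int>));
    }
}
```

For example, LightweightCodeGenDemo.BuildAdder()(2, 3) invokes the generated method through the returned delegate.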