<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Regex, HTML, and my sanity</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx</link><description>The answer I came up with is at the bottom. But first, a brief digression. There were several responses to my regex puzzle. They can be grouped into: Here's how you do it Here's how you do it without using regex Using regex on this problem will cause</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>re: Regex, HTML, and my sanity</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx#80508</link><pubDate>Thu, 26 Feb 2004 17:46:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:80508</guid><dc:creator>James Geurts</dc:creator><description>Thanks for taking the time to detail how you came up with the solution.</description></item><item><title>re: Regex, HTML, and my sanity</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx#80509</link><pubDate>Thu, 26 Feb 2004 17:49:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:80509</guid><dc:creator>Raymond Chen</dc:creator><description>I didn't mean that regexps shouldn't be involved at all. To me this was a job for using regexps partway and normal code the rest of the way.&lt;br&gt;&lt;br&gt;By the way I think you forgot a + after the grouping. 
Otherwise you can't handle &amp;lt;A HREF=&amp;quot;X&amp;quot; TARGET=&amp;quot;Y&amp;quot;&amp;gt;&lt;br&gt;&lt;br&gt;I would have used something like&lt;br&gt;&lt;br&gt;&amp;lt;/?(\w+)(?:[^&amp;quot;&amp;gt;]|&amp;quot;[^&amp;quot;]*&amp;quot;)*&amp;gt;&lt;br&gt;&lt;br&gt;to match a tag, and then used code to reject p and br.&lt;br&gt;&lt;br&gt;Note that non-greedy matching doesn't mean &amp;quot;never ever match a quote&amp;quot;; it will do it if it is forced to.  So your code will match&lt;br&gt;&lt;br&gt;&amp;lt;tag x=&amp;quot;a&amp;quot;b&amp;quot;&amp;gt;&lt;br&gt;&lt;br&gt;since the non-greedy .*+ between the quotes matches a&amp;quot;b.</description></item><item><title>re: Regex, HTML, and my sanity</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx#80510</link><pubDate>Thu, 26 Feb 2004 17:51:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:80510</guid><dc:creator>Michael Teper</dc:creator><description>I haven't tried this (sorry!) 
but would your example handle &amp;quot;&amp;lt;br /&amp;gt;&amp;quot; ?</description></item><item><title>re: Regex, HTML, and my sanity</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx#80521</link><pubDate>Thu, 26 Feb 2004 18:11:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:80521</guid><dc:creator>Andrew</dc:creator><description>I know you aren't claiming to have a robust solution here, but off the top of my head I can think of several cases where your regex will fail&lt;br&gt;&lt;br&gt;safe failures * (no promises:)&lt;br&gt;&lt;br&gt;&amp;lt;BR&amp;gt; [using case-insensitive regex would solve this]&lt;br&gt;&lt;br&gt;&amp;lt;tag blah='&amp;gt;' &amp;gt;  [HTML allows single quotes]&lt;br&gt;&lt;br&gt;&amp;lt;tag blah=&amp;quot;\&amp;quot;&amp;gt;&amp;quot; &amp;gt; [Escape sequences]&lt;br&gt;&lt;br&gt;&amp;lt;tag blah=&amp;quot;&amp;gt;&amp;quot; blah2=&amp;quot;&amp;gt;&amp;quot; &amp;gt; [Multiple quoted attributes]&lt;br&gt;&lt;br&gt;If you were going to be using this regex as an attempt to prevent script injection attacks (if only br and p are allowed (or other simple formatting), then cross site scripting attacks are prevented) you would find that it is easily circumvented.&lt;br&gt;&lt;br&gt;for example:&lt;br&gt;&lt;br&gt;&amp;lt;script blah='&amp;quot;' language='javascript'&amp;gt;&lt;br&gt;alert(&amp;quot;you were just hacked&amp;quot;);&lt;br&gt;&amp;lt;/script blah='&amp;quot;'&amp;gt;&lt;br&gt;</description></item><item><title>re: Regex, HTML, and my sanity</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx#80810</link><pubDate>Fri, 27 Feb 2004 02:04:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:80810</guid><dc:creator>Raymond Chen</dc:creator><description>Actually \ escape sequences are not allowed in HTML, so that's a non-starter. But the other concerns are valid.
&lt;br&gt;&lt;br&gt;If this were for revoking script injecting attacks, I would just use a sledgehammer.  Keep &amp;lt;P&amp;gt; and &amp;lt;BR&amp;gt; and change all other &amp;lt; and &amp;gt; to &amp;amp;lt; and &amp;amp;gt;.</description></item><item><title>Take Outs: The Digital Doggy Bag of Blog Bits for 26 February 2004</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx#80934</link><pubDate>Fri, 27 Feb 2004 10:36:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:80934</guid><dc:creator>Enjoy Every Sandwich</dc:creator><description>Take Outs: The Digital Doggy Bag of Blog Bits for 26 February 2004</description></item><item><title>re: Regex, HTML, and my sanity</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx#81056</link><pubDate>Fri, 27 Feb 2004 13:53:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:81056</guid><dc:creator>Per Soderlind</dc:creator><description>Nice challenge, I had to dust off my old Perl books :-)&lt;br&gt;I made a small correction to your regesp (I asume ignorecase is on)&lt;br&gt;&lt;br&gt;&amp;lt;/?               # match &amp;lt; and &amp;lt;/&lt;br&gt;\b                # start group&lt;br&gt;(?!               # start negative lookahead&lt;br&gt;\b(br|p)\b        # group the words, prevents matching &amp;lt;pre&amp;gt; &lt;br&gt;)                 # end negative lookahead&lt;br&gt;\b                # end group&lt;br&gt;[^&amp;gt;]+?            
# match everything except &amp;gt;&lt;br&gt;&amp;gt;                 # end of tag&lt;br&gt;&lt;br&gt;with the comments removed, it looks like this:&lt;br&gt;&amp;lt;/?\b(?!\b(br|p)\b)\b[^&amp;gt;]+?&amp;gt;&lt;br&gt;&lt;br&gt;btw: If you want to match every tag, you should use &amp;lt;[^&amp;gt;]+&amp;gt;&lt;br&gt;&lt;br&gt;An excellent tool for testing this is Expresso (&lt;a target="_new" href="http://www.ultrapico.com/expresso.htm"&gt;http://www.ultrapico.com/expresso.htm&lt;/a&gt;)&lt;br&gt;and you'll find a lot of information at &lt;a target="_new" href="http://www.regular-expressions.info/"&gt;http://www.regular-expressions.info/&lt;/a&gt;&lt;br&gt;&lt;br&gt;br,&lt;br&gt;Per&lt;br&gt;</description></item><item><title>re: Regex, HTML, and my sanity</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx#81804</link><pubDate>Sun, 29 Feb 2004 23:16:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:81804</guid><dc:creator>Eric TF Bat</dc:creator><description>Here's a question for you: my beard is too long; I need to shave; I have a chainsaw.  
How should I proceed?</description></item><item><title>re: Regex, HTML, and my sanity</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx#85443</link><pubDate>Sun, 07 Mar 2004 14:27:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:85443</guid><dc:creator>AvonWyss</dc:creator><description>Note that your regex does not respect single quotes, which I believe are valid in HTML:&lt;br&gt;&lt;br&gt;&amp;lt;bla attr='&amp;gt;'&amp;gt;&lt;br&gt;&lt;br&gt;Will not be matched correctly.</description></item><item><title>re: Regex, HTML, and my sanity</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/26/80465.aspx#211863</link><pubDate>Tue, 10 Aug 2004 13:09:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:211863</guid><dc:creator>Need_to_know</dc:creator><description>I have a typical case, where I am searching for different procedures (those starting with p_, sp_, dbo.sp_, etc.). But I want to exclude the procedures inside comments. 
ie., all proc's that fall in between /* to */</description></item></channel></rss>