<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx</link><description>I was reading our newsgroups, and I came across a post where the user wanted to filter out all tags from HTML text except for &amp;lt;br&amp;gt;, &amp;lt;/br&amp;gt;, &amp;lt;p&amp;gt;, and &amp;lt;/p&amp;gt;. What is the shortest .NET regex to do that?</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#79919</link><pubDate>Wed, 25 Feb 2004 18:20:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:79919</guid><dc:creator>haacked</dc:creator><description>Assuming that the allowed tags may not have attributes, something like this:&lt;br&gt;&lt;br&gt;Search for this:&lt;br&gt;&lt;br&gt;&amp;lt;/?(p[^p&amp;gt;]+|b[^r&amp;gt;]+|br[^&amp;gt;]+)&amp;gt;&lt;br&gt;&lt;br&gt;Replace with &amp;quot;&amp;quot;.&lt;br&gt;&lt;br&gt;If the allowed tags may have attributes, I'll get back to you.&lt;br&gt;&lt;br&gt;(Note: I haven't tested this.)</description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#79927</link><pubDate>Wed, 25 Feb 2004 18:31:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:79927</guid><dc:creator>haacked</dc:creator><description>Whoops. My solution was too simple. For example, it would not strip out this properly:&lt;br&gt;&lt;br&gt;&amp;lt;span title=&amp;quot;&amp;gt;&amp;quot;&amp;gt;&lt;br&gt;&lt;br&gt;This is better (but much harder to understand):
&lt;br&gt;&lt;br&gt;&amp;lt;/?(([^pb]|p[^p\s&amp;gt;]|b[^r\s&amp;gt;]|br[^\s&amp;gt;])\w+)(&amp;quot;[^&amp;quot;]*&amp;quot;|'[^']*'|[^'&amp;quot;&amp;gt;])*&amp;gt;</description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#79929</link><pubDate>Wed, 25 Feb 2004 18:33:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:79929</guid><dc:creator>haacked</dc:creator><description>sorry again. I forget that you can't edit your posts on these things. Eric, if you could delete my repost, I should point out that the &amp;quot;\w+&amp;quot; portion should be &amp;quot;\w*&amp;quot;.</description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#79936</link><pubDate>Wed, 25 Feb 2004 19:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:79936</guid><dc:creator>Ricky Dhatt</dc:creator><description>Simple? -- there is no such thing.  In my experience of doing this, there are too many things to deal with, like entities (&amp;amp;nbsp) and XHTML (&amp;lt;br/&amp;gt;.   I just used a full fledged parser now days.  
But I have the luxury of not worrying about the overhead.</description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#79947</link><pubDate>Wed, 25 Feb 2004 19:16:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:79947</guid><dc:creator>Darren Neimke</dc:creator><description>Take a look at Html Agility Pack:&lt;br&gt;&lt;br&gt;&lt;a target="_new" href="http://blogs.regexadvice.com/dneimke/archive/2004/02/11/500.aspx"&gt;http://blogs.regexadvice.com/dneimke/archive/2004/02/11/500.aspx&lt;/a&gt;&lt;br&gt;&lt;br&gt;There are a couple of examples at the bottom of that page which demonstrate the syntax for using it.</description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#79953</link><pubDate>Wed, 25 Feb 2004 19:29:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:79953</guid><dc:creator>Jake</dc:creator><description>Would it be easier to load the HTML into an XML reader and parse it that way? The only problem is what it would do with the content (I already know it would handle the tags properly).&lt;br&gt;&lt;br&gt;Jake</description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#79980</link><pubDate>Wed, 25 Feb 2004 20:07:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:79980</guid><dc:creator>Paschal</dc:creator><description>Eric, the user you talk about is not me by any chance? :-) I asked the question a few months ago and I'm still searching for a solution. This has really puzzled me.&lt;br&gt;I need it to clean some HTML documents, but I want to keep the line breaks and paragraphs.
&lt;br&gt;Most of the regex I saw stripped all the tags.</description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#80053</link><pubDate>Wed, 25 Feb 2004 21:59:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:80053</guid><dc:creator>Raymond Chen</dc:creator><description>A wise man once said,&lt;br&gt;&lt;br&gt;Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.&lt;br&gt;&lt;br&gt;I think this is one of the cases where trying a pure regex solution creates its own problems.</description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#80094</link><pubDate>Wed, 25 Feb 2004 23:31:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:80094</guid><dc:creator>Eric TF Bat</dc:creator><description>Step one: look for all character #255s (if any) and double them: #255#255 (you'll see why in a sec).&lt;br&gt;&lt;br&gt;Step two: look for all &amp;lt;p&amp;gt;, &amp;lt;/p&amp;gt;, &amp;lt;br&amp;gt; and &amp;lt;br/&amp;gt; (note typo in original question; &amp;lt;/br&amp;gt; is meaningless!) and replace with #255p, #255/p and #255br.  Use regexps to allow for spaces between the br and the /, if you like.&lt;br&gt;&lt;br&gt;Step three: convert all &amp;amp; to &amp;amp;amp;&lt;br&gt;&lt;br&gt;Step four: convert all &amp;lt; and &amp;gt; to &amp;amp;lt; and &amp;amp;gt;&lt;br&gt;&lt;br&gt;Step five: convert all #255p to &amp;lt;p&amp;gt;, #255/p to &amp;lt;/p&amp;gt; and #255br to &amp;lt;br /&amp;gt; (note space: required for old, dumb browsers)&lt;br&gt;&lt;br&gt;Step six: convert all remaining #255#255 to #255, if you care.&lt;br&gt;&lt;br&gt;Don't use regexps to handle HTML.  Raymond is right; that way lies insanity.
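<![CDATA[<br><br>A rough sketch of the six steps above, written in Python as an assumption (the comment names no language). The names SENTINEL, KEEP, DECODE, RESTORE and neutralize are hypothetical. One deliberate deviation: steps five and six run as a single left-to-right pass, because applying them one after another could misread a doubled #255 followed by a literal "p" as a protected tag.

```python
import re

SENTINEL = "\xff"  # character #255, used as the escape marker

# Step two's targets: <p>, </p>, <br> and <br/> (optional spaces before the slash).
KEEP = re.compile(r"<(/?p|br\s*/?)>", re.IGNORECASE)

# Steps five and six: a sentinel followed by another sentinel or a tag token.
DECODE = re.compile("\xff(\xff|/p|br|p)")
RESTORE = {"\xff": "\xff", "p": "<p>", "/p": "</p>", "br": "<br />"}


def protect(match):
    """Turn a kept tag into its sentinel token, e.g. <BR /> becomes #255br."""
    name = match.group(1).lower()
    return SENTINEL + ("br" if name.startswith("br") else name)


def neutralize(html):
    """Escape all markup except paragraph and line-break tags."""
    text = html.replace(SENTINEL, SENTINEL * 2)            # step one
    text = KEEP.sub(protect, text)                         # step two
    text = text.replace("&", "&amp;")                      # step three
    text = text.replace("<", "&lt;").replace(">", "&gt;")  # step four
    # Steps five and six in one pass, so doubled sentinels are never
    # mistaken for the start of a protected tag.
    return DECODE.sub(lambda m: RESTORE[m.group(1)], text)


print(neutralize("<p>Hi <b>there</b><br/>A & B</p>"))
```

Because other tags are escaped rather than deleted, they stay visible as text in the output instead of silently disappearing.]]>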
&lt;br&gt;&lt;br&gt;Incidentally, it's impolite to randomly delete people's html, which is why I convert the &amp;lt; and &amp;gt; instead of rudely deleting it and giving them a nasty surprise.  I just hope this blog's commenting system doesn't delete them, cos this message will be (even more) incomprehensible...</description></item><item><title>Take Outs: The Digital Doggy Bag of Blog Bits for 25 February 2004 </title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#80199</link><pubDate>Thu, 26 Feb 2004 07:23:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:80199</guid><dc:creator>Enjoy Every Sandwich</dc:creator><description>In the bag tonight: Less bitch'n and whin'n. Counts:Blogging: 8; Dev: 22; Otherwise: 8; SQL: 5; WILY: 8. Line of the night: </description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#85439</link><pubDate>Sun, 07 Mar 2004 14:03:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:85439</guid><dc:creator>AvonWyss</dc:creator><description>What's wrong with this?&lt;br&gt;&lt;br&gt;&amp;lt;(?!/?p|br)(&amp;quot;[^&amp;quot;]*&amp;quot;|'[^']*'|[^&amp;gt;])+&amp;gt;&lt;br&gt;&lt;br&gt;(I'd suggest to use this regex with the ExplicitCapture, IgnoreCase and SingleLine options enabled)&lt;br&gt;&lt;br&gt;For some reason that I don't know, many neat features of the .NET regex engine (not exclusive to it, though) are rarely used, like assertion groups, backreferences, named groups, UNICODE char groups, and match evaluators for replacement patterns...
&lt;br&gt;&lt;br&gt;Anyways, to specifically clean out HTML code, I'd also rather use some HTML to XHTML converter (like HTML Tidy) and then use some code that works on the XML, or maybe just some XSLT, to get the wanted result.</description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#181822</link><pubDate>Tue, 13 Jul 2004 15:46:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:181822</guid><dc:creator>Lost_In_JavaScript_Land</dc:creator><description>This works for me (in ASP/VBScript).  It keeps &amp;lt;p&amp;gt;, &amp;lt;/p&amp;gt;, and &amp;lt;br&amp;gt; (upper and lower case)  Also compensates for parameters.&lt;br&gt;&lt;br&gt;	Function FilterHTML(tempStr)&lt;br&gt;		Dim re, tempStr2&lt;br&gt;		Set re = New RegExp&lt;br&gt;&lt;br&gt;		re.IgnoreCase = True&lt;br&gt;&lt;br&gt;		re.Pattern = &amp;quot;&amp;lt;((?!P|BR).*).*&amp;gt;.*&amp;lt;\/\1&amp;gt;&amp;quot;&lt;br&gt;		tempStr2 = re.Replace(tempStr, &amp;quot;&amp;quot;)&lt;br&gt;&lt;br&gt;		re.Pattern = &amp;quot;&amp;lt;((?!P|\/P|BR).*)&amp;gt;&amp;quot;&lt;br&gt;		FilterHTML = re.Replace(tempStr2, &amp;quot;&amp;quot;)&lt;br&gt;&lt;br&gt;		Set re = Nothing&lt;br&gt;	End Function</description></item><item><title>re: Regex puzzle</title><link>http://blogs.msdn.com/ericgu/archive/2004/02/25/79909.aspx#181823</link><pubDate>Tue, 13 Jul 2004 15:52:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:181823</guid><dc:creator>Lost_In_JavaScript_Land</dc:creator><description>Stupid me...just realized this is a C# blog...but converting the code shouldn't be too nasty ;)</description></item></channel></rss>