
Regex puzzle

I was reading our newsgroups, and I came across a post where the user wanted to filter out all tags from HTML text except for <br>, </br>, <p>, and </p>.

What is the shortest .NET regex to do that?

Published Wednesday, February 25, 2004 9:57 AM by ericgu

Comments

Wednesday, February 25, 2004 10:20 AM by haacked

# re: Regex puzzle

Assuming that the allowed tags may not have attributes, something like this:

Search for this:

</?(p[^p>]+|b[^r>]+|br[^>]+)>

Replace with "".

If the allowed tags may have attributes, I'll get back to you.

(Note: I haven't tested this.)
Wednesday, February 25, 2004 10:31 AM by haacked

# re: Regex puzzle

Whoops. My solution was too simple. For example, it would not strip out this properly:

<span title=">">

This is better (but much harder to understand):

</?(([^pb]|p[^p\s>]|b[^r\s>]|br[^\s>])\w+)("[^"]*"|'[^']*'|[^'">])*>
Wednesday, February 25, 2004 10:33 AM by haacked

# re: Regex puzzle

Sorry again. I forgot that you can't edit your posts on these things. Eric, could you delete my repost? I should also point out that the "\w+" portion should be "\w*".
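
The quoted-attribute problem raised in this comment can be seen in a small Python sketch. The two patterns below are simplified stand-ins with hypothetical names, not the regex posted above: they strip every tag, and differ only in whether a ">" inside a quoted attribute value is treated as the end of the tag.

```python
import re

# Hypothetical names; simplified patterns for illustration only.
# Naive: a tag is "<", anything that is not ">", then ">".
NAIVE_TAG = re.compile(r"<[^>]*>")
# Quote-aware: quoted attribute values are consumed as a unit, so a ">"
# inside quotes does not end the tag.
QUOTE_AWARE_TAG = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^'">])*>""")

sample = '<span title=">">text</span>'

print(NAIVE_TAG.sub("", sample))        # prints: ">text
print(QUOTE_AWARE_TAG.sub("", sample))  # prints: text
```

The naive pattern ends the tag at the first ">", so the tail of the attribute value leaks into the output.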
Wednesday, February 25, 2004 11:00 AM by Ricky Dhatt

# re: Regex puzzle

Simple? -- there is no such thing. In my experience of doing this, there are too many things to deal with, like entities (&nbsp;) and XHTML (<br/>). I just use a full-fledged parser nowadays. But I have the luxury of not worrying about the overhead.
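
As a rough illustration of the parser-based approach, here is a minimal Python sketch built on the standard library's `html.parser`. The class name `TagFilter`, the function `filter_html`, and the `ALLOWED` set are hypothetical, and dropping attributes from the kept tags is an assumption, not something the thread specifies.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"p", "br"}  # assumption: the tags the original question wants kept


class TagFilter(HTMLParser):
    """Re-emit only allowed tags (without attributes) plus escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # XHTML-style self-closing tags such as <br/>
        if tag in ALLOWED:
            self.out.append(f"<{tag}/>")

    def handle_data(self, data):
        # Entities arrive decoded; escape again so the output stays valid.
        self.out.append(escape(data, quote=False))


def filter_html(text):
    parser = TagFilter()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


print(filter_html("<p>Hi <b>there</b><br/>x</p>"))  # prints: <p>Hi there<br/>x</p>
```

Text inside elements such as <script> is still passed through as data in this sketch, so a real filter would need a policy for that content.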
Wednesday, February 25, 2004 11:16 AM by Darren Neimke

# re: Regex puzzle

Take a look at Html Agility Pack:

http://blogs.regexadvice.com/dneimke/archive/2004/02/11/500.aspx

There are a couple of examples at the bottom of that page which demonstrate the syntax for using it.
Wednesday, February 25, 2004 11:29 AM by Jake

# re: Regex puzzle

Would it be easier to load the HTML into an XML reader and parse it that way? The only problem is what it would do with the content (I already know that it would handle the tags properly).

Jake
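
One caveat with the XML-reader idea is that ordinary HTML is often not well-formed XML. The Python sketch below uses the standard `xml.etree.ElementTree` parser as a stand-in for a .NET XML reader (an assumption; the helper name `parses_as_xml` is hypothetical) to show which inputs a strict XML parser accepts.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml("<p>one<br/>two</p>"))  # prints: True  (XHTML-style)
print(parses_as_xml("<p>one<br>two</p>"))   # prints: False (unclosed <br>)
print(parses_as_xml("<p>a&nbsp;b</p>"))     # prints: False (undefined entity)
```

So the input would first have to be converted to well-formed XHTML before an XML reader could handle it.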
Wednesday, February 25, 2004 12:07 PM by Paschal

# re: Regex puzzle

Eric, the user you're talking about isn't me by any chance? :-) I asked this question a few months ago and I'm still searching for a solution. It has really puzzled me.
I need it to clean some HTML documents, but I want to keep the line breaks and paragraphs.
Most of the regexes I saw strip all the tags.
Wednesday, February 25, 2004 1:59 PM by Raymond Chen

# re: Regex puzzle

A wise man once said,

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

I think this is one of the cases where trying a pure regex solution creates its own problems.
Wednesday, February 25, 2004 3:31 PM by Eric TF Bat

# re: Regex puzzle

Step one: look for all character #255s (if any) and double them: #255#255 (you'll see why in a sec).

Step two: look for all <p>, </p>, <br> and <br/> (note typo in original question; </br> is meaningless!) and replace with #255p, #255/p and #255br. Use regexps to allow for spaces between the br and the /, if you like.

Step three: convert all & to &amp;

Step four: convert all < and > to &lt; and &gt;

Step five: convert all #255p to <p>, #255/p to </p> and #255br to <br /> (note space: required for old, dumb browsers)

Step six: convert all remaining #255#255 to #255, if you care.

Don't use regexps to handle HTML. Raymond is right; that way lies insanity.

Incidentally, it's impolite to randomly delete people's html, which is why I convert the < and > instead of rudely deleting it and giving them a nasty surprise. I just hope this blog's commenting system doesn't delete them, cos this message will be (even more) incomprehensible...
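
The six steps above can be sketched in Python roughly as follows. The function name `neutralize` and the helper names are hypothetical, and case-insensitive matching of the allowed tags is an assumption. Steps five and six are combined into a single left-to-right pass so that a doubled #255 is never mistaken for the start of a placeholder.

```python
import re

SENTINEL = "\u00ff"  # character #255

# Step two pattern: <p>, </p>, <br>, <br/> (spaces allowed before the slash)
_ALLOWED = re.compile(r"<(p|/p|br\s*/?)>", re.IGNORECASE)
# Steps five and six: a doubled sentinel, or a sentinel plus a tag name
_TOKENS = re.compile(SENTINEL + "(" + SENTINEL + "|/p|p|br)")
_RESTORE = {SENTINEL: SENTINEL, "p": "<p>", "/p": "</p>", "br": "<br />"}


def _mark(match):
    name = match.group(1).lower()
    return SENTINEL + ("br" if name.startswith("br") else name)


def neutralize(text):
    text = text.replace(SENTINEL, SENTINEL * 2)             # step one
    text = _ALLOWED.sub(_mark, text)                        # step two
    text = text.replace("&", "&amp;")                       # step three
    text = text.replace("<", "&lt;").replace(">", "&gt;")   # step four
    # steps five and six in one pass
    return _TOKENS.sub(lambda m: _RESTORE[m.group(1)], text)


print(neutralize("<p>Hi <b>x</b><br/>y</p>"))
# prints: <p>Hi &lt;b&gt;x&lt;/b&gt;<br />y</p>
```

Disallowed tags stay visible as escaped text instead of being silently deleted, which matches the politeness argument made above.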
Sunday, March 07, 2004 6:03 AM by AvonWyss

# re: Regex puzzle

What's wrong with this?

<(?!/?p|br)("[^"]*"|'[^']*'|[^>])+>

(I'd suggest using this regex with the ExplicitCapture, IgnoreCase and Singleline options enabled)

For some reason that I don't know, many neat features of the .NET regex engine (not exclusive to it, though) are rarely used, like assertion groups, backreferences, named groups, Unicode character classes, and match evaluators for replacement patterns...

Anyways, to specifically clean out HTML code, I'd also rather use some HTML to XHTML converter (like HTML Tidy) and then use some code that works on the XML, or maybe just some XSLT, to get the wanted result.
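
To see how the lookahead pattern in this comment behaves, here is a small Python sketch; the lookahead syntax is the same in Python's `re` module as in .NET. The group is written as non-capturing to mimic ExplicitCapture. The names `POSTED` and `TIGHTENED` are hypothetical, and the tightened variant is only one possible adjustment, not something proposed in the thread.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL  # stand-ins for IgnoreCase and Singleline

# The pattern as posted (non-capturing group in place of ExplicitCapture).
POSTED = re.compile(r"""<(?!/?p|br)(?:"[^"]*"|'[^']*'|[^>])+>""", FLAGS)

# Hypothetical variant: group the names and require a word boundary, so that
# <pre> is no longer treated as <p> and </br> is kept.
TIGHTENED = re.compile(r"""<(?!/?(?:p|br)\b)(?:"[^"]*"|'[^']*'|[^>])+>""", FLAGS)

print(POSTED.sub("", "<p>a<i>b</i><br>c</p>"))      # prints: <p>ab<br>c</p>
print(POSTED.sub("", '<span title=">">x</span>'))   # prints: x
print(POSTED.sub("", "<pre>x</pre>"))               # prints: <pre>x</pre>
print(POSTED.sub("", "a</br>b"))                    # prints: ab
print(TIGHTENED.sub("", "<pre>x</pre> a</br>b"))    # prints: x a</br>b
```

The posted pattern handles quoted attributes, but its lookahead keeps any tag whose name merely starts with "p" or "br", and the optional slash applies only to the "p" alternative.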
Tuesday, July 13, 2004 8:46 AM by Lost_In_JavaScript_Land

# re: Regex puzzle

This works for me (in ASP/VBScript). It keeps <p>, </p>, and <br> (upper and lower case). It also allows for attributes on those tags.

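Function FilterHTML(tempStr)
    Dim re, tempStr2
    Set re = New RegExp

    re.IgnoreCase = True
    re.Global = True ' replace every match, not just the first one

    ' First pass: remove disallowed elements together with their content.
    ' [^>]* and .*? keep each match from running on to the last ">" on the line.
    re.Pattern = "<((?!P\b|BR\b)\w+)[^>]*>.*?<\/\1>"
    tempStr2 = re.Replace(tempStr, "")

    ' Second pass: remove any remaining tags other than <p>, </p> and <br>.
    re.Pattern = "<(?!P\b|\/P\b|BR\b)[^>]*>"
    FilterHTML = re.Replace(tempStr2, "")

    Set re = Nothing
End Function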
Tuesday, July 13, 2004 8:52 AM by Lost_In_JavaScript_Land

# re: Regex puzzle

Stupid me...just realized this is a C# blog...but converting the code shouldn't be too nasty ;)