On several projects, I have had the need to convert large HTML blobs into short text summaries that can be displayed in a list. For example, in SharePoint I often need to display lists of Publishing Page content and I want to summarize some of the HTML columns.
This blog post provides code and describes the process for converting HTML to a text summary.
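There are three basic steps to the process:
1. Remove the HTML tags, comments, and CDATA sections.
2. Normalize the whitespace.
3. Truncate the result to the desired length.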
To remove the HTML markup, I use one large regular expression. To keep my sanity, I build it by concatenating several smaller patterns that are stored in separate strings.
It is obvious that we want to remove normal HTML tags and comments, but it is less obvious that we want to remove CDATA sections. CDATA sections are not that common, and if we were to include their content we would need to HTML-encode it; it is much easier to simply remove them.
The first three patterns below represent the "contents" of tags (the stuff in between the "<" and ">"). The fourth pattern concatenates the results inside the opening/closing brackets.
    string TagContentsRegexPattern = @"(?:[^\>\""\']*(?:\""[^\""]*\""|\'[^\']*\')?)*";
    string CommentContentsRegexPattern = @"\!\-\-.*?\-\-";
    string CDataContentsRegexPattern = @"\!\[CDATA\[.*?\]\]";
    string HtmlTagCommentOrCDataRegexPattern =
        @"\<(?:" + CommentContentsRegexPattern
        + "|" + CDataContentsRegexPattern
        + "|" + TagContentsRegexPattern
        + @")\>";
The final combined regular expression for identifying HTML tags is below:
\<(?:\!\-\-.*?\-\-|\!\[CDATA\[.*?\]\]|(?:[^\>\"\']*(?:\"[^\"]*\"|\'[^\']*\')?)*)\>
In the final code, you will find a method called StripTags that replaces these tags with an empty string.
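As a quick illustration of what the combined pattern matches, here is a minimal sketch that applies the same expression to a string. The class name StripTagsSketch is hypothetical and is not part of the final code. RegexOptions.Singleline is used so that a comment or CDATA section spanning several lines is still matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsSketch
{
    // The combined pattern from above, written as one verbatim string.
    private static readonly Regex TagRegex = new Regex(
        @"\<(?:\!\-\-.*?\-\-|\!\[CDATA\[.*?\]\]|(?:[^\>\""\']*(?:\""[^\""]*\""|\'[^\']*\')?)*)\>",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture);

    // Replaces every tag, comment, and CDATA section with an empty string.
    public static string StripTags(string html)
    {
        return html == null ? null : TagRegex.Replace(html, string.Empty);
    }
}
```

For the input `<p class="a>b">Hello</p><!-- note -->world<![CDATA[x > y]]>!` this returns `Helloworld!`. The ">" inside the quoted attribute value does not end the tag early, and the two words end up joined because the tags are replaced with nothing.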
It was difficult to choose whether to replace tags with a space or a zero-length string. I ultimately chose a zero-length string, which introduces the risk of incorrectly joining two words (for example, if two <p> tags had no whitespace between them). In this case, I felt it was a better choice to occasionally combine two words than to introduce extra whitespace. An improvement to the code might be to detect certain tags such as <p> and <td> and always convert those to spaces.
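One possible way to approach that improvement is a pre-pass that turns selected block-level tags into spaces before the remaining tags are stripped. The sketch below is only an illustration: the class name, the method name, and the list of tags are assumptions, and the simplified pattern does not handle a ">" inside a quoted attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockTagSketch
{
    // Hypothetical list of tags that should act as word boundaries.
    // \b keeps "p" from matching <pre> and "br" from matching longer names.
    private static readonly Regex BlockTagRegex = new Regex(
        @"</?(?:p|div|br|td|th|tr|li|h[1-6])\b[^>]*>",
        RegexOptions.IgnoreCase);

    // Replaces each listed opening or closing tag with a single space.
    public static string BlockTagsToSpaces(string html)
    {
        return html == null ? null : BlockTagRegex.Replace(html, " ");
    }
}
```

Running this before StripTags leaves inline tags such as <b> untouched. The extra spaces it introduces are collapsed later by the whitespace normalization step, so "<p>one</p><p>two</p>" would end up as "one two" instead of "onetwo".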
The NormalizeWhitespace method is responsible for converting each sequence of whitespace (spaces, tabs, and linefeeds) into a single space. The string is also effectively trimmed, so all whitespace at the start and end of the string is removed.
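For comparison, roughly the same behavior can be written with a regular expression. This is a compact sketch rather than the method used in the final code, and the class name is hypothetical. The `\s` character class and char.IsWhiteSpace may disagree on a few unusual Unicode characters, so treat the two as approximately equivalent.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceSketch
{
    private static readonly Regex WhitespaceRun = new Regex(@"\s+");

    // Collapses each run of whitespace to one space, then trims both ends.
    public static string Normalize(string s)
    {
        return s == null ? null : WhitespaceRun.Replace(s, " ").Trim();
    }
}
```

For example, `WhitespaceSketch.Normalize("  Hello \t\r\n  world  ")` returns `Hello world`. The hand-written loop in the final code avoids the regular expression and returns the original string unchanged when it contains no whitespace, which matters if the method is called on many short values.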
Once we have removed the tags and normalized the whitespace, it's time to truncate the results.
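If life were simple, we would simply truncate the string at a particular length; unfortunately, it's a bit more complicated. To do a "great job", we perform the following steps:
1. Start with the maximum length as the truncation point.
2. If that point falls inside an HTML entity (such as &amp;), move back to the "&" so the entity is not split.
3. If that point falls inside a word, move back to the preceding space so the word is not split.
4. Append an ellipsis ("..."), if requested.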
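To make the truncation concrete, here is a small standalone sketch that avoids splitting an entity or a word. It follows the same approach as the final code, but the names TruncateSketch and Truncate are hypothetical, and it always appends the ellipsis to keep the example short.

```csharp
using System;

public static class TruncateSketch
{
    public static string Truncate(string text, int maxLength)
    {
        if (maxLength <= 0) return "";
        if (text.Length <= maxLength) return text;

        int cut = maxLength;

        // If the last '&' before the cut has no ';' after it,
        // the cut would split an entity, so back up to the '&'.
        int amp = text.LastIndexOf('&', cut - 1);
        if (amp != -1 && text.LastIndexOf(';', cut - 1) < amp)
            cut = amp;

        // Back up to the previous space so a word is not split.
        if (cut > 0 && text[cut] != ' ')
        {
            int space = text.LastIndexOf(' ', cut);
            if (space > 0) cut = space;
        }

        return text.Substring(0, cut) + "...";
    }
}
```

With the input `Fish &amp; chips are tasty`, a maximum length of 8 lands inside the entity, so the result backs up to `Fish...`. A maximum length of 13 lands inside "chips", so the result is `Fish &amp;...`.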
Note that HTML entities (such as &amp;) are not decoded. The assumption for this method is that we ultimately want to write the result back into an HTML stream; therefore, we can leave the entities as they are. Do not run the results of these methods through a function that HTML-encodes; otherwise, your output will be double-encoded!
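The double-encoding problem is easy to reproduce. The sketch below uses System.Net.WebUtility.HtmlEncode as a stand-in for whatever encoder your page or control applies; that choice, and the class name, are assumptions made for illustration.

```csharp
using System;
using System.Net;

public static class DoubleEncodingSketch
{
    // Encoding a summary that still contains entities encodes the '&' again.
    public static string EncodeAgain(string summary)
    {
        return WebUtility.HtmlEncode(summary);
    }
}
```

Passing `Fish &amp; chips` through this method yields `Fish &amp;amp; chips`, which a browser would display as the literal text "Fish &amp; chips" instead of "Fish & chips".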
Our final code is listed below:
    using System;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace Core.Web
    {
        public class HtmlToText
        {
            //
            // Html Tag Regex Patterns
            //
            public static readonly string TagContentsRegexPattern = @"(?:[^\>\""\']*(?:\""[^\""]*\""|\'[^\']*\')?)*";
            public static readonly string CommentContentsRegexPattern = @"\!\-\-.*?\-\-";
            public static readonly string CDataContentsRegexPattern = @"\!\[CDATA\[.*?\]\]";
            public static readonly string HtmlTagCommentOrCDataRegexPattern =
                @"\<(?:" + CommentContentsRegexPattern
                + "|" + CDataContentsRegexPattern
                + "|" + TagContentsRegexPattern
                + @")\>";

            public static Regex FindTagRegex = new Regex(
                HtmlTagCommentOrCDataRegexPattern,
                RegexOptions.Multiline | RegexOptions.Singleline
                | RegexOptions.Compiled | RegexOptions.ExplicitCapture);

            public static string CreateHtmlSummary(string s, int maximumLength, bool appendEllipse)
            {
                string result;

                if (s == null)
                    result = null;
                else if (s.Length == 0 || maximumLength <= 0)
                    result = "";
                else
                {
                    // Remove Tags...
                    result = StripTags(s);

                    // Normalize Whitespace...
                    result = NormalizeWhitespace(result);

                    if (result.Length > maximumLength)
                    {
                        int truncateLen = maximumLength;

                        //
                        // Find the last position of the "&" and ";".
                        // If the last ";" is not after the last "&"
                        // then we have split an Entity and need to truncate
                        // before the "&"...
                        //
                        int lastAmpersandPosition = result.LastIndexOf('&', truncateLen - 1);
                        if (lastAmpersandPosition != -1)
                        {
                            int lastSemicolonPosition = result.LastIndexOf(';', truncateLen - 1);
                            if (lastSemicolonPosition < lastAmpersandPosition)
                                truncateLen = lastAmpersandPosition;
                        }

                        // Locate the last space and truncate there so we don't
                        // split words...
                        if (truncateLen > 0 && result[truncateLen] != ' ')
                        {
                            int spacePosition = result.LastIndexOf(' ', truncateLen);
                            if (spacePosition > 0)
                                truncateLen = spacePosition;
                        }

                        result = result.Substring(0, truncateLen);

                        // Append ellipsis, if needed...
                        if (appendEllipse)
                            result += "...";
                    }
                }

                return result;
            }

            public static string NormalizeWhitespace(string s)
            {
                string result;

                if (s == null)
                    result = null;
                else if (s.Length == 0)
                    result = "";
                else
                {
                    int startPos = 0;

                    // Trim initial whitespace
                    while (startPos < s.Length && char.IsWhiteSpace(s[startPos]))
                    {
                        startPos++;
                    }

                    if (startPos == s.Length)
                        result = "";
                    else
                    {
                        // Skip over the first word
                        int firstNonWhitespaceCharacter = startPos;
                        while (startPos < s.Length && !char.IsWhiteSpace(s[startPos]))
                        {
                            startPos++;
                        }

                        if (startPos == s.Length)
                        {
                            // No whitespace after the first word; nothing to collapse
                            if (firstNonWhitespaceCharacter == 0)
                                result = s;
                            else
                                result = s.Substring(firstNonWhitespaceCharacter);
                        }
                        else
                        {
                            // s[startPos] is whitespace, so we start inside a whitespace run
                            bool haveSeenWhitespace = true;
                            StringBuilder sb = new StringBuilder(s.Length - firstNonWhitespaceCharacter);
                            sb.Append(s, firstNonWhitespaceCharacter, startPos - firstNonWhitespaceCharacter);

                            for (int i = startPos + 1; i < s.Length; i++)
                            {
                                char c = s[i];
                                if (char.IsWhiteSpace(c))
                                {
                                    // Remember the run; a space is only emitted if more
                                    // text follows, which also trims trailing whitespace
                                    haveSeenWhitespace = true;
                                }
                                else
                                {
                                    if (haveSeenWhitespace)
                                    {
                                        sb.Append(' ');
                                        haveSeenWhitespace = false;
                                    }
                                    sb.Append(c);
                                }
                            }

                            result = sb.ToString();
                        }
                    }
                }

                return result;
            }

            public static string StripTags(string s)
            {
                if (s == null)
                    return null;
                else
                    return FindTagRegex.Replace(s, string.Empty);
            }

            public static string StripTagsAndNormalize(string s)
            {
                return NormalizeWhitespace(StripTags(s));
            }
        }
    }
This code could be enhanced by:
Drop me a line if you find this code helpful!