Scraping text from web pages

We've all written some html over the years (ok, in my case not so much), and we probably all have our own standards, which most likely differ. My test automation had to crawl through a product which is in effect a fairly large website, capturing all the different screens and the html behind those screens. Each screen has multiple frames, and as the product has evolved over many years, different pages or even frames have been designed by various people or teams at different times. Pages and groups of pages belong to different feature teams on different continents and sometimes look very different. So the challenge was to write some code that could generically parse any html and, hopefully, identify just the text that was displayed on screen at the time the html was captured.

 

A short investigation into the potential of regular expressions to do this gave me a glimpse of the world of pain that would be. I am certainly a fan of Regex, and have used it in other places during this project, but it is a technology which suits a specific purpose; IMHO it is very powerful when you need to find and replace across multiple sets of strings which conform to a fairly simple pattern. Once the pattern gets complex, the expression becomes hard to write and test, and before you know it the clock is ticking into the wee small hours and you are working hard (very hard) to create a maintenance nightmare. As a wise man once said to me: "I had a problem and I tried to solve it with regular expressions. Now I have two problems."

But after a little more research on the web, reading various sites like http://www.codeproject.com/, I discovered the html agility pack (http://htmlagilitypack.codeplex.com/). This open source library, distributed under the Microsoft Public License (Ms-PL), can load even badly formed html as a document instance and provides many utility methods to "walk the DOM"; basically, to process the hierarchy of html elements.

 

Starting with the many overloads of the Load and LoadHtml methods (and it's open source, so you can add more), I found that this library (I used version 1.4.0) provided many useful utility methods to parse the html elements in my frames. I chose to process all the html elements by working down the tree from the document node, so initially I created a method called ProcessNodes() and passed it the highest-level set of html elements in the document as "ProcessNodes(doc.DocumentNode.ChildNodes);"
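To make that starting point concrete, here is a minimal sketch of loading a page from a string and walking down from the document node. It assumes the HtmlAgilityPack library is referenced by the project; the sample markup, the LoadSketch class and the PrintTree helper are invented for illustration and are not part of the automation described here.

```csharp
using System;
using HtmlAgilityPack; // external library, assumed to be referenced by the project

static class LoadSketch
{
    // Recursively print element names, indented by their depth in the tree.
    public static void PrintTree(HtmlNodeCollection nodes, int depth)
    {
        foreach (HtmlNode node in nodes)
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                Console.WriteLine(new string(' ', depth * 2) + node.Name);
            }
            PrintTree(node.ChildNodes, depth + 1);
        }
    }

    static void Main()
    {
        // Deliberately sloppy markup: the <p> and <b> tags are never closed.
        string html = "<html><body><p>Hello <b>world<div id='d1'>Second block</div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // doc.Load(path) reads from a file instead

        PrintTree(doc.DocumentNode.ChildNodes, 0);
    }
}
```

How the unclosed tags end up nested depends on the library's repair rules, so it is worth printing the tree for a few real captures before relying on a particular shape.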

 The ProcessNodes "master" method then is quite simple, and it just needs to manage the overall process. It is recursive so it "should" get right to the bottom of the tree:

private void ProcessNodes(HtmlNodeCollection nodes)
{
    if (nodes.Count == 0)
    {
        return;
    }

    HtmlNode node = nodes[0];

    while (node != null)
    {
        int siblingCount = 1; // initialize to 1 so we always move at least one node along

        // Don't scrape non-displayed elements or script elements.
        if (!IsHiddenNode(node) && !node.Name.Equals("SCRIPT", StringComparison.InvariantCultureIgnoreCase))
        {
            // Recurse first so the children are handled before this node.
            ProcessNodes(node.ChildNodes);

            if (node.NextSibling != null && node.NextSibling.Name.Equals("BR", StringComparison.InvariantCultureIgnoreCase))
            {
                // Text split up by <BR> elements: merge the run and note how many siblings it used.
                siblingCount = ProcessTextNodesWithBreaks(node);
            }
            else
            {
                ProcessOneNode(node);
            }
        }

        // Step past this node and any siblings already handled above.
        for (int i = 0; i < siblingCount && node != null; i++)
        {
            node = node.NextSibling;
        }
    }
}

 

Then I had to identify the categories of nodes which required different types of processing. I covered what I considered to be standard html with the "ProcessOneNode(...)" method. One of the issues was working out the nesting levels for the text in the html; depending on how each frame/page was rendered by the product, the same text sometimes appeared at multiple levels in the hierarchy. The first piece of logic in this method is aimed at ensuring I only grab the displayed text once:

 

private void ProcessOneNode(HtmlNode node)
{
    if (node != null && !string.IsNullOrEmpty(node.InnerText) && !node.InnerText.Equals("&nbsp;"))
    {
        // A parent's InnerText repeats the text of its children, so leave it to the children.
        // (Any() needs "using System.Linq;" at the top of the file.)
        if (node.ChildNodes.Any(p => node.InnerText.Contains(p.InnerText)))
        {
            return;
        }

        string nodeValue = System.Web.HttpUtility.HtmlDecode(node.InnerText).Trim(NonTextChars);

        if (!String.IsNullOrEmpty(nodeValue))
        {
            TextNode storedNode = new TextNode();

            // When this node carries all of its parent's text, record the parent's id instead.
            if (node.InnerText.Equals(node.ParentNode.InnerText))
            {
                storedNode.nodeID = node.ParentNode.Id;
            }
            else
            {
                storedNode.nodeID = node.Id;
            }

            // Blank out any angle brackets left over after decoding.
            nodeValue = nodeValue.Replace('<', ' ');
            storedNode.nodeValue = nodeValue.Replace('>', ' ');
            allNodes.Add(storedNode);
        }
    }
}
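The decode-and-trim step is easy to try in isolation. The sketch below is a stand-alone approximation, not the method itself: it swaps System.Web.HttpUtility.HtmlDecode for System.Net.WebUtility.HtmlDecode so that it runs without a System.Web reference, and the CleanTextSketch class, the Clean helper and the sample strings are invented for illustration.

```csharp
using System;
using System.Net;

static class CleanTextSketch
{
    // Same characters as the NonTextChars field used in the post.
    static readonly char[] NonTextChars = { '\r', '\t', '\n', '\"', '\\', ' ', '&' };

    // Decode entities, trim noise from both ends, then blank out stray angle brackets.
    public static string Clean(string innerText)
    {
        if (string.IsNullOrEmpty(innerText))
        {
            return string.Empty;
        }

        string value = WebUtility.HtmlDecode(innerText).Trim(NonTextChars);
        return value.Replace('<', ' ').Replace('>', ' ');
    }

    static void Main()
    {
        Console.WriteLine(Clean("\r\n  Save &amp; Close \t")); // prints "Save & Close"
        Console.WriteLine(Clean("&quot;Account Name&quot;"));  // prints "Account Name"
    }
}
```

Note that Trim only removes characters from the two ends of the string, which is why the ampersand in the middle of "Save & Close" survives even though '&' is in the trim list.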
 

Interestingly enough (if you find this kind of thing interesting, which I assume you do since you're reading this blog!), one of the categories I needed to handle separately was the category of html display elements which included break characters ("<BR>"). At the end of the day I need to match these displayed strings back to their resource file string tables, and because the break characters were inserted by the product's rendering logic, I needed to process these elements to remove them:

 

private int ProcessTextNodesWithBreaks(HtmlNode node)
{
    int nodesProcessed = 1;

    TextNode nodeWithBreaks = new TextNode();
    nodeWithBreaks.nodeID = node.Id;
    string text = node.InnerText;

    while (node.NextSibling != null && node.NextSibling.Name.Equals("BR", StringComparison.InvariantCultureIgnoreCase))
    {
        if (node.NextSibling.NextSibling == null) // because sometimes the <BR> is at the end of the element
        {
            node = node.NextSibling;
        }
        else
        {
            // Skip the <BR> and append the text that follows it.
            node = node.NextSibling.NextSibling;
            text += " " + node.InnerText;
            nodesProcessed += 2; // the <BR> plus the node after it
        }
    }

    nodeWithBreaks.nodeValue = text;
    allNodes.Add(nodeWithBreaks);
    return nodesProcessed;
}
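If all the text fragments hang off one parent element, a coarser way to get the same effect is to join every child of that parent that is not a <BR>. This is a different, simpler technique than the sibling-walking method above and is shown only as a sketch; it assumes the HtmlAgilityPack library is referenced, and the BreakJoinSketch class, the JoinAcrossBreaks helper and the sample markup are invented for illustration.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // external library, assumed to be referenced by the project

static class BreakJoinSketch
{
    // Concatenate the text of all non-<BR> children, separated by single spaces.
    public static string JoinAcrossBreaks(HtmlNode parent)
    {
        var joined = new StringBuilder();

        foreach (HtmlNode child in parent.ChildNodes)
        {
            if (child.Name.Equals("br", StringComparison.OrdinalIgnoreCase))
            {
                continue; // the break itself carries no text
            }

            string piece = child.InnerText.Trim();
            if (piece.Length == 0)
            {
                continue; // skip whitespace-only fragments
            }

            if (joined.Length > 0)
            {
                joined.Append(' ');
            }
            joined.Append(piece);
        }

        return joined.ToString();
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div>First line<BR>second line<BR></div>");

        // The <div> is the first child of the document node in this sample.
        Console.WriteLine(JoinAcrossBreaks(doc.DocumentNode.FirstChild));
    }
}
```

Unlike the method above, this does not keep track of how many siblings were consumed, so it suits a pass that works parent by parent and would not slot directly into the ProcessNodes loop.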

 

The last challenge was to try to work out whether the text I was "harvesting" from the html stream was actually visible, displayed text at the time the screen was scraped/saved. This is an area where I have much refinement left to do. I am by no means an html expert (in fact the vast bulk of my career has been in winforms systems and tools development), but from what I do know I tried to assess the various attributes of the html elements to determine whether they, while embedded in the page data, were actually not visible to the user. Here's the current version of that particular method:

 

private bool IsHiddenNode(HtmlNode node)
{
    string itemId = string.Empty;
    string itemClass = string.Empty;
    string itemStyle = string.Empty;

    if (node.HasAttributes)
    {
        foreach (HtmlAttribute att in node.Attributes)
        {
            switch (att.Name)
            {
                case "id":
                    itemId = att.Value;
                    break;
                case "class":
                case "Class":
                    itemClass = att.Value.Trim(NonTextChars);
                    break;
                case "style":
                case "Style":
                    itemStyle = att.Value.Trim(NonTextChars);
                    break;
                default:
                    if (att.Name.Contains("ms-crm"))
                    {
                        itemClass += ";" + att.Name.Trim(NonTextChars);
                    }
                    else
                    {
                        itemStyle += att.Name.Trim(NonTextChars);
                    }
                    break;
            }
        }
    }

    // These matches are case-sensitive, so they only catch the exact spellings shown.
    return itemClass.Contains("BACKGROUND-IMAGE")
        || itemStyle.Contains("DISPLAY: none")
        || itemClass.Contains("hidden");
}
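One possible refinement, given that the check above only matches the exact spelling "DISPLAY: none", is to split the inline style into declarations and compare property and value without regard to case or spacing. The following is a sketch under stated assumptions rather than a drop-in replacement: the StyleCheckSketch class and StyleHidesElement helper are invented names, the visibility:hidden test is an addition that the original method does not make, and styles applied through stylesheets or script are out of reach of any check like this.

```csharp
using System;

static class StyleCheckSketch
{
    // Returns true when an inline style string contains display:none or visibility:hidden.
    public static bool StyleHidesElement(string style)
    {
        if (string.IsNullOrWhiteSpace(style))
        {
            return false;
        }

        foreach (string declaration in style.Split(';'))
        {
            int colon = declaration.IndexOf(':');
            if (colon < 0)
            {
                continue; // not a "property: value" pair
            }

            string property = declaration.Substring(0, colon).Trim();
            string value = declaration.Substring(colon + 1).Trim();

            bool displayNone = property.Equals("display", StringComparison.OrdinalIgnoreCase)
                               && value.Equals("none", StringComparison.OrdinalIgnoreCase);
            bool visibilityHidden = property.Equals("visibility", StringComparison.OrdinalIgnoreCase)
                                    && value.Equals("hidden", StringComparison.OrdinalIgnoreCase);

            if (displayNone || visibilityHidden)
            {
                return true;
            }
        }

        return false;
    }

    static void Main()
    {
        Console.WriteLine(StyleHidesElement("DISPLAY: none"));           // True
        Console.WriteLine(StyleHidesElement("display:none; color:red")); // True
        Console.WriteLine(StyleHidesElement("color: red"));              // False
    }
}
```

Values carrying extra tokens, such as "none !important", would slip past the exact comparison here, so real captures may call for a looser match.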

Incidentally, the values "NonTextChars" (which you see getting trimmed off some of these values) and "allNodes" (to which each of these methods adds the text data) are defined at class level as:

char[] NonTextChars = { '\r', '\t', '\n', '\"', '\\', ' ', '&' };

 

List<TextNode> allNodes;
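The TextNode type itself is not shown in this post. The methods only touch two of its members, nodeID and nodeValue, so a minimal stand-in might look like the following. The use of public fields, the ScraperState class name, the eager initialization of allNodes and the sample id "lblName" are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in: only the member names nodeID and nodeValue come from the post.
class TextNode
{
    public string nodeID;    // id of the element (or its parent) the text was found in
    public string nodeValue; // the cleaned-up display text
}

class ScraperState
{
    public readonly char[] NonTextChars = { '\r', '\t', '\n', '\"', '\\', ' ', '&' };

    // Created up front so the processing methods can add to it without a null check.
    public readonly List<TextNode> allNodes = new List<TextNode>();

    static void Main()
    {
        var state = new ScraperState();
        state.allNodes.Add(new TextNode { nodeID = "lblName", nodeValue = "Account Name" });
        Console.WriteLine(state.allNodes.Count); // prints 1
    }
}
```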
