Better HTML parsing and validation with HtmlAgilityPack

Let's face it: sometimes the Microsoft.VisualStudio.TestTools.WebTesting.HtmlDocument class just doesn't cut it when you're writing custom extraction and validation code.  HtmlDocument was originally designed as an internal class to parse URLs for dependent requests (such as images) out of HTML response bodies very efficiently.  Before VS 2005 RTM, we made HtmlDocument part of the public WebTestFramework API, but scheduling and resource constraints prevented us from adding more general-purpose DOM features like InnerHtml, InnerText, and GetElementById.  You could always parse the HTML string yourself, but fortunately there's a better option: HtmlAgilityPack.

HtmlAgilityPack is an open source project on CodePlex.  It provides standard DOM APIs and XPath navigation -- even when the HTML is not well-formed!
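
As a rough illustration of that claim, the sketch below loads a fragment with unclosed tags and then queries it by id and by XPath. The markup, the class name MalformedHtmlDemo, and the id "main" are made up for this example; only LoadHtml, GetElementbyId, SelectSingleNode, and ParseErrors come from HtmlAgilityPack.

```csharp
using System;
using HtmlAgilityPack;

public static class MalformedHtmlDemo
{
    // Hypothetical markup: <b> and <span> are never closed.
    public const string SampleHtml =
        "<div id=\"main\"><b>Products<span>Windows</div>";

    public static HtmlDocument Load(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // unclosed tags are tolerated rather than rejected
        return doc;
    }

    public static void Main()
    {
        HtmlDocument doc = Load(SampleHtml);

        // DOM-style lookup by id.
        HtmlNode main = doc.GetElementbyId("main");
        Console.WriteLine(main != null ? main.InnerText : "(no element with id 'main')");

        // XPath lookup over the same tree.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        Console.WriteLine(span != null ? span.InnerText : "(no <span> found)");

        // Problems the parser noticed are collected here instead of being thrown.
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            Console.WriteLine(error.Code + ": " + error.Reason);
        }
    }
}
```

How the parser repairs the tree for a given piece of broken markup is up to HtmlAgilityPack, so it is worth printing the nodes you get back before relying on a particular shape.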

Here's a sample web test that uses the HtmlAgilityPack.HtmlDocument instead of the one in WebTestFramework.  It simply validates that Microsoft's home page lists Windows as the first item in the navigation sidebar.  Download HtmlAgilityPack and add a reference to it from your test project to try out this coded web test.

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.VisualStudio.TestTools.WebTesting;
using HtmlAgilityPack;

public class WebTest1Coded : WebTest
{
    public override IEnumerator<WebTestRequest> GetRequestEnumerator()
    {
        WebTestRequest request1 = new WebTestRequest("http://www.microsoft.com/");
        request1.ValidateResponse += new EventHandler<ValidationEventArgs>(request1_ValidateResponse);
        yield return request1;
    }

    void request1_ValidateResponse(object sender, ValidationEventArgs e)
    {
        //load the response body string as an HtmlAgilityPack.HtmlDocument
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(e.Response.BodyString);

        //locate the "Nav" element (null if the page has no element with that id)
        HtmlNode navNode = doc.GetElementbyId("Nav");

        //pick the first <li> element under it; ".//li" is relative to navNode
        HtmlNode firstNavItemNode = navNode == null ? null : navNode.SelectSingleNode(".//li");

        //validate the first list item in the Nav element says "Windows";
        //a missing element fails validation instead of throwing
        e.IsValid = firstNavItemNode != null && firstNavItemNode.InnerText == "Windows";
    }
}



Updated: Fixed XPath query thanks to Oleg's comment.  Also fixed indentation of the code.
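
The XPath fix noted in the update comes down to where the query starts: "//li" searches from the document root, while ".//li" searches only beneath the node it is called on. The sketch below shows the two side by side. The markup and the names XPathScopeDemo, SampleHtml, and FirstItemText are hypothetical and exist only for this illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical markup: one list before the "Nav" element and one inside it.
    public const string SampleHtml =
        "<ul><li>Home</li></ul>" +
        "<div id=\"Nav\"><ul><li>Windows</li><li>Office</li></ul></div>";

    // Runs the given XPath from the "Nav" element and returns the text of the
    // first match, or null if either the element or the match is missing.
    public static string FirstItemText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode navNode = doc.GetElementbyId("Nav");
        HtmlNode item = navNode == null ? null : navNode.SelectSingleNode(xpath);
        return item == null ? null : item.InnerText;
    }

    public static void Main()
    {
        // Absolute: starts at the document root even though it is called on navNode.
        Console.WriteLine(FirstItemText(SampleHtml, "//li"));

        // Relative: starts at navNode and only looks at its descendants.
        Console.WriteLine(FirstItemText(SampleHtml, ".//li"));
    }
}
```

If HtmlAgilityPack behaves as Oleg describes in the comments, the absolute query should pick up the "Home" item outside the Nav element and the relative one should pick up "Windows". That also explains why the original "//li" query could appear to work on a page where the first list item in the document happened to sit inside the Nav element.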

Published Sunday, December 10, 2006 9:56 PM by JoshCh

Comments

# Html Agility Pack

Now, this is cool if you do a lot of html parsing! You can tell I was drawn to it by the word "Agile"

Monday, December 11, 2006 4:23 PM by ISerializable - Roy Osherove's Blog

# re: Better HTML parsing and validation with HtmlAgilityPack

What's wrong with SgmlReader?

Tuesday, December 12, 2006 3:29 AM by Oleg Tkachenko

# re: Better HTML parsing and validation with HtmlAgilityPack

Josh, your sample is broken. //li is an absolute XPath selection, so navNode.SelectSingleNode("//li") returns the first <li> in the document, not the first one under navNode. If you need to select an <li> descendant of navNode, you need

navNode.SelectSingleNode(".//li") or

navNode.SelectSingleNode("descendant::li");

Tuesday, December 12, 2006 3:47 AM by Oleg Tkachenko

# re: Better HTML parsing and validation with HtmlAgilityPack

Thanks Oleg, I thought something wasn't right with that XPath, but it worked so I left it alone :)  I'll update the code.

I haven't used SgmlReader myself, but I've read multiple posts saying HtmlAgilityPack works much better for malformed HTML.

Josh

Tuesday, December 12, 2006 9:00 AM by JoshCh

# VSTS Links - 12/22/2006

Jeff Beehler on Sam's Credo. Josh Christie on Better HTML parsing and validation with HtmlAgilityPack....

Friday, December 22, 2006 9:57 AM by Team System News

# Content Index for Web Tests and Load Tests

Visual Studio Team System for Testers Content Index for Web Tests and Load Tests Getting Started Online

Wednesday, December 19, 2007 4:12 PM by Ed Glas's blog on VSTS load testing

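# Better HTML parsing and validation with HtmlAgilityPack

Let's face it: sometimes the Microsoft.VisualStudio.TestTools.WebTesting.HtmlDocument class just doesn't cut it when you're writing custom extraction and validation rules. HtmlDoc...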

Tuesday, October 21, 2008 11:59 PM by chenming