The Need

Given an HTML file, you want to extract all the href URLs.

 

You could use a Regex

I've done this before, but haven't found it entirely reliable. I also use regexes so infrequently that relearning the syntax every time is painful.
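
For comparison, here is roughly what the regex route looks like. This is only a sketch: the class name, method name, and pattern are my own for illustration, and the pattern assumes quoted href values on a tags. It will miss unquoted values and can be fooled by markup inside comments or scripts, which is the kind of unreliability I mean.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexHrefSketch
{
    // Illustrative pattern only (an assumption, not a robust HTML parser):
    // group 1 captures the opening quote, "url" captures the value,
    // and \1 requires the same quote character to close it.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*([""'])(?<url>.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Calling RegexHrefSketch.ExtractHrefs on the text of a page returns the quoted href values it can find, and nothing more.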

 

What I recommend

Use the HTML Agility Pack. If you are familiar with the XML DOM, using the HTML Agility Pack will come naturally.

 

HTML Agility Pack URL

http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack

 

Two things that make HTML Agility Pack interesting

- It doesn't depend on Internet Explorer

- It works on malformed HTML. See this post for a little context: .NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML
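
To make the malformed HTML point concrete, here is a small sketch. It assumes the HtmlAgilityPack assembly is referenced; the class name, method name, and the sample markup are made up for illustration. The input has unclosed p and b tags and no closing html tag, yet the anchors can still be queried with XPath.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class MalformedHtmlSketch
{
    // Hypothetical helper: parses an HTML string and returns every href on an <a> tag.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses from a string rather than a file

        var urls = new List<string>();
        // SelectNodes returns null when nothing matches, so check before looping.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return urls;

        foreach (HtmlNode link in links)
        {
            urls.Add(link.GetAttributeValue("href", ""));
        }
        return urls;
    }

    public static void Main()
    {
        // Deliberately malformed: unclosed <p> and <b>, no </html>.
        string html = "<html><body><p>First<p><b>Second <a href=\"page2.htm\">next</a></body>";
        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Loading the same string into an XML parser would typically fail on the unclosed tags, which is why a forgiving HTML parser is the better fit here.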

 

Sample code

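// This isn't a full sample, but enough to see the value of using the HTML Agility Pack.
// Requires: using HtmlAgilityPack;

HtmlDocument input_doc = new HtmlDocument();

input_doc.Load("foo.htm");

// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection links = input_doc.DocumentNode.SelectNodes("//a[@href]");

if (links != null)
{
    foreach (HtmlNode node in links)
    {
        // The second argument is the default returned if the attribute is missing.
        string href_url = node.GetAttributeValue("href", "");
    }
}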