Finding all the A HREF Urls in an HTML document (even in malformed HTML)

The Need

Given an HTML file, you want to extract all the HREF URLs.

 

You could use a Regex

I've done this before, but haven't found it entirely reliable. I also use regexes so infrequently that it is painful to relearn the syntax every time.
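
For comparison, here is a minimal sketch of what the regex approach might look like. The pattern, the class name RegexHrefSketch and the sample markup are illustrative assumptions, not code from the original post. The pattern only handles quoted href values, which is one example of how this approach can quietly miss links.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexHrefSketch
{
    // Hypothetical pattern: finds href="..." or href='...' inside an <a ...> tag.
    // It does not handle unquoted values, comments, or script blocks.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Three links: double-quoted, single-quoted, and unquoted.
        string html = "<A HREF=\"one.htm\">1</A> <a class=x href='two.htm'>2</a> <a href=three.htm>3</a>";

        // Prints one.htm and two.htm; the unquoted three.htm is missed.
        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Each gap like the unquoted case can be patched by extending the pattern, but the pattern then gets harder to read and maintain.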

 

What I recommend

Use the HTML Agility Pack. If you are familiar with the XML DOM, using the HTML Agility Pack will come naturally.

 

HTML Agility Pack URL

http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack

 

Two things that make HTML Agility Pack interesting

- It doesn't depend on Internet Explorer

- It works on malformed HTML. See this post for a little context: .NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML

 

Sample code

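// This isn't a full sample, but it is enough to see the value of using the HTML Agility Pack.
// Assumes "using HtmlAgilityPack;" at the top of the file.

HtmlDocument input_doc = new HtmlDocument();
input_doc.Load("foo.htm");

// SelectNodes can return null (rather than an empty collection) when nothing matches.
HtmlNodeCollection anchors = input_doc.DocumentNode.SelectNodes("//a");
if (anchors != null)
{
    foreach (HtmlNode node in anchors)
    {
        // Falls back to "" when the <a> element has no href attribute.
        string href_url = node.GetAttributeValue("href", "");
    }
}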

Published Monday, October 16, 2006 11:04 PM by saveenr

Comments

# re: Finding all the A HREF Urls in an HTML document

Doesn't this require the HTML document to be XHTML compliant, or at least well-formed XML? And to use lower-case element tags only?

Surely it's more convenient than regular expressions, but I'm not convinced about it being more reliable than regex. Especially with the help of http://regexlib.com, it shouldn't be too difficult to find a pattern that works.

Tuesday, October 17, 2006 5:05 AM by Sander

# re: Finding all the A HREF Urls in an HTML document (even in malformed HTML)

Great questions.

- xhtml compliant/well-formed - it does *NOT* require the HTML document to be XHTML compliant or even well-formed. As a result, the HTML Agility Pack is resilient in the face of the *real* HTML one finds in the wild.

- lower-case tags - not required. Elements can be in any case and the HTML Agility Pack handles them correctly.
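
To make both points concrete, here is a small sketch that parses a deliberately sloppy fragment with mixed-case tags and unclosed elements. The class name MalformedHtmlSketch and the sample markup are made up for illustration. It assumes a reference to the HtmlAgilityPack assembly and uses LoadHtml to parse from a string instead of a file.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class MalformedHtmlSketch
{
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string rather than from a file

        var urls = new List<string>();

        // The XPath uses lower-case "a" even though the markup may not.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a");
        if (anchors == null)
        {
            return urls; // no <a> elements found
        }

        foreach (HtmlNode anchor in anchors)
        {
            urls.Add(anchor.GetAttributeValue("href", ""));
        }
        return urls;
    }

    public static void Main()
    {
        // Not well-formed: unclosed <P> and <A>, mixed-case tags, no </HTML>.
        string html = "<HTML><BODY><P><A HREF=\"one.htm\">One<P><a href='two.htm'>Two</A></BODY>";

        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```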

Tuesday, October 17, 2006 7:02 AM by saveenr

# re: Finding all the A HREF Urls in an HTML document (even in malformed HTML)

Bear in mind that an <a> tag may not have an href, and your code should check for that.

It could be a named anchor, e.g. <a name="this">Something</a>
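
One possible way to do that check is to let the XPath expression filter out anchors without an href. This is a sketch under the assumption that the HtmlAgilityPack assembly is referenced; the class name AnchorsWithHrefSketch and the sample markup are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class AnchorsWithHrefSketch
{
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();

        // "//a[@href]" selects only <a> elements that carry an href attribute,
        // so named anchors such as <a name="this"> are skipped.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return urls; // nothing matched
        }

        foreach (HtmlNode anchor in anchors)
        {
            urls.Add(anchor.GetAttributeValue("href", ""));
        }
        return urls;
    }

    public static void Main()
    {
        string html = "<a name=\"this\">Something</a> <a href=\"page.htm\">A link</a>";

        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

An alternative is to keep the "//a" query and skip any node whose GetAttributeValue("href", "") result is an empty string.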

[)amien

Tuesday, October 17, 2006 9:39 AM by Damien Guard
