The first post in this series was learning to crawl: I introduced Get-Web, which lets you use System.Net.WebClient to download web sites in a variety of ways.  The next post was learning to walk: I showed you Get-MarkupTag, which helps coerce parts of the web into XML.  Now we can start to really have some fun with the data and run wild.

Pulling out semi-structured data is one thing, but it’s important to be able to pull out more complex information as well.  One interesting case is pulling all of the links out of a webpage.  This task breaks down into four smaller tasks:

  1. Downloading the page (done with Get-Web)
  2. Getting the <a> tags in a meaningful way (done with Get-MarkupTag)
  3. Extracting the href attribute
  4. Determining if the link is relative or absolute
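
The first three steps are already a short pipeline.  Here’s a rough sketch, assuming Get-Web and Get-MarkupTag from the earlier posts are loaded in your session (the url is just an example):

    $url = "http://blogs.msdn.com/"
    Get-MarkupTag a (Get-Web $url) |           # steps 1 and 2
        ForEach-Object { $_.Xml.href } |       # step 3: pull out each href
        Where-Object { $_ }                    # skip anchors with no href

Step 4 is where it gets interesting, and it’s what Resolve-Link handles.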

To determine whether the link is relative or absolute, I made a Resolve-Link function.  It takes a base url (e.g. http://www.foo.com/blah/blah.asp) and a link found on that page, and returns the absolute url the link resolves to.  It can optionally return a property bag with the type of link and the resolved link.

Here’s Resolve-Link:

function Resolve-Link([Uri]$uri,
    [string]$link,
    [switch]$returnLinkType) {
    #.Synopsis
    #   Resolves a relative or absolute link to an absolute url
    #.Description
    #   Takes a uri and a link to a page and returns the absolute url, or
    #   optionally returns a property bag with the link type
    #   (absolute, relative, or host relative) and the link
    #.Parameter uri
    #   The uri the link is located on
    #.Parameter link
    #   The original link text
    #.Parameter returnLinkType
    #   If set, returns a property bag with the link type and the resolved link
    #.Example
    #   Resolve-Link http://www.microsoft.com/ /technet/scriptcenter
    if ($link.StartsWith("/")) {
        # Relative to Host site
        if ($returnLinkType) {
            return New-Object Object |
                Add-Member NoteProperty Type "Host Relative" -PassThru |
                Add-Member NoteProperty Link ([uri]"$($uri.Scheme)://$($uri.DnsSafeHost)$($link)") -PassThru
        }
        return "$($uri.Scheme)://$($uri.DnsSafeHost)$($link)"
    } else {
        if ($link.StartsWith("$($uri.Scheme)://")) {
            # Absolute Link
            if ($returnLinkType) {
                return New-Object Object |
                    Add-Member NoteProperty Type "Absolute" -PassThru |
                    Add-Member NoteProperty Link ([uri]$link) -PassThru
            }            
            return $link
        } else {
            # Relative link
            $realLink = $uri.AbsoluteUri.Substring(0,
                $uri.AbsoluteUri.LastIndexOf("/")) + "/$link"    
            if ($returnLinkType) {
                return New-Object Object |
                    Add-Member NoteProperty Type "Relative" -PassThru |
                    Add-Member NoteProperty Link ([uri]$realLink) -PassThru
            }
            return $realLink            
        }
    }    
}
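
Here’s what each of the three branches looks like in practice (the base url and page names are made up for illustration):

    $base = "http://www.foo.com/blah/blah.asp"
    Resolve-Link $base "/technet/scriptcenter"  # http://www.foo.com/technet/scriptcenter
    Resolve-Link $base "http://www.bar.com/"    # http://www.bar.com/ (already absolute)
    Resolve-Link $base "page2.asp"              # http://www.foo.com/blah/page2.asp
    Resolve-Link $base "page2.asp" -returnLinkType  # property bag: Type and Link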

Once Resolve-Link was written, making Get-WebPageLink is a snap.  It’s below, and it takes only 3 lines to do the real work and 11 lines to explain the work and give examples.

function Get-WebPageLink($url) {
    #.Synopsis
    #   Returns all of the links within a webpage
    #.Description
    #   Resolves all <A> references and returns a property bag with
    #   the text contained in the link, the page the link came from,
    #   and the type of link returned (absolute, host relative, or relative)
    #.Parameter url
    #   The page to get links from
    #.Example
    #   Get-WebPageLink http://blogs.msdn.com/
    Get-MarkupTag a (Get-Web $url) | Foreach-Object {
        Resolve-Link $url $_.Xml.Href -returnLinkType |
            Add-Member NoteProperty Text $_.Xml."#text" -PassThru 
    }
}

Go ahead and give Get-WebPageLink a whirl:

Get-WebPageLink http://blogs.msdn.com
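
Since each result carries a Type property, you can also get a quick breakdown of how a page links (this needs Get-Web and Get-MarkupTag loaded, and a network connection):

    Get-WebPageLink http://blogs.msdn.com |
        Group-Object Type |
        Select-Object Count, Name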

Ready for some real fun?  Remember way back when I did a post about getting RSS feeds in PowerShell with Microsoft.FeedsManager (Get-Feed)?  If you have that script handy, go ahead and check out this one-liner, which will refresh every RSS item you’ve got and extract all of the links from it.

Get-Feed -recurse -articles | Foreach-Object { Get-WebPageLink $_.Link }

That particular command line can take a while, depending on how many blogs you subscribe to, but it gives you a brand new view on blogs (as a simmering stew of scripts, rather than just text to be read and comprehended).
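
For instance, since each result’s Link property is a [uri], you could tally which hosts your feeds point at most often.  This is a sketch, assuming Get-Feed from that earlier post is loaded alongside these scripts:

    Get-Feed -recurse -articles |
        Foreach-Object { Get-WebPageLink $_.Link } |
        Foreach-Object { $_.Link.DnsSafeHost } |
        Group-Object |
        Sort-Object Count -Descending |
        Select-Object Count, Name -First 10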

There’s more fun to come in unlocking the web, but these two scripts should get you started pulling a little more out of the wild world of the web.

Hope this Helps,

James Brundage [MSFT]