<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Media And Microcode : Get-MarkupTag</title><link>http://blogs.msdn.com/mediaandmicrocode/archive/tags/Get-MarkupTag/default.aspx</link><description>Tags: Get-MarkupTag</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Microcode: PowerShell Scripting Tricks: Scripting the Web (Part 3) (Resolve-Link, Get-WebPageLink)</title><link>http://blogs.msdn.com/mediaandmicrocode/archive/2008/12/12/microcode-powershell-scripting-tricks-scripting-the-web-part-3-resolve-link-get-webpagelink.aspx</link><pubDate>Fri, 12 Dec 2008 12:59:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9201602</guid><dc:creator>JamesBrundage</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/mediaandmicrocode/comments/9201602.aspx</comments><wfw:commentRss>http://blogs.msdn.com/mediaandmicrocode/commentrss.aspx?PostID=9201602</wfw:commentRss><description>&lt;P&gt;The first post in this series was learning to crawl.&amp;nbsp; I introduced &lt;A href="http://blogs.msdn.com/mediaandmicrocode/archive/2008/12/01/microcode-powershell-scripting-tricks-scripting-the-web-part-1-get-web.aspx" mce_href="http://blogs.msdn.com/mediaandmicrocode/archive/2008/12/01/microcode-powershell-scripting-tricks-scripting-the-web-part-1-get-web.aspx"&gt;Get-Web&lt;/A&gt;, which allows you to use System.Net.Webclient to download web sites in a variety of ways.&amp;nbsp; The next post was learning to walk.&amp;nbsp; I showed us &lt;A href="http://blogs.msdn.com/mediaandmicrocode/archive/2008/12/08/microcode-powershell-scripting-tricks-scripting-the-web-part-2-get-markuptag.aspx" 
mce_href="http://blogs.msdn.com/mediaandmicrocode/archive/2008/12/08/microcode-powershell-scripting-tricks-scripting-the-web-part-2-get-markuptag.aspx"&gt;Get-MarkupTag&lt;/A&gt;, which helps coerce parts of the web into XML.&amp;nbsp; Now we can start to really have some fun with the data and run wild.&lt;/P&gt;
&lt;P&gt;Pulling out semi-structured data is one thing, but it’s important to be able to pull out more complex information as well.&amp;nbsp; One interesting case is pulling out all of the links from a webpage.&amp;nbsp;&amp;nbsp; This task breaks down into four smaller tasks:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Downloading the page (done with Get-Web) 
&lt;LI&gt;Getting the &amp;lt;a&amp;gt; tags in a meaningful way (done with &lt;A href="http://blogs.msdn.com/mediaandmicrocode/archive/2008/12/08/microcode-powershell-scripting-tricks-scripting-the-web-part-2-get-markuptag.aspx" mce_href="http://blogs.msdn.com/mediaandmicrocode/archive/2008/12/08/microcode-powershell-scripting-tricks-scripting-the-web-part-2-get-markuptag.aspx"&gt;Get-MarkupTag&lt;/A&gt;) 
&lt;LI&gt;Extracting out the href attribute 
&lt;LI&gt;Determining if the link is relative or absolute &lt;/LI&gt;&lt;/OL&gt;
&lt;P&gt;To determine if the link is relative or absolute, I made a Resolve-Link function.&amp;nbsp; It takes a base url (e.g. &lt;A href="http://www.foo.com/blah/blah.asp" mce_href="http://www.foo.com/blah/blah.asp"&gt;http://www.foo.com/blah/blah.asp&lt;/A&gt;) and a link found on it, and returns the absolute url the link resolves to.&amp;nbsp; It optionally returns a property bag with the type of link and the resolved link.&lt;/P&gt;
&lt;P&gt;Here’s Resolve-Link:&lt;/P&gt;&lt;I&gt;
&lt;BLOCKQUOTE&gt;&lt;PRE class=CmdletDefinition&gt;function Resolve-Link([Uri]$uri,
    [string]$link,
    [switch]$returnLinkType) {
    #.Synopsis
    #   Resolves a relative or absolute link to an absolute url
    #.Description
    #   Takes a uri and a link to a page and returns the absolute url, or
    #   optionally returns a property bag with the link type
    #   (absolute, relative, or host relative) and the link
    #.Parameter uri
    #   The uri the link is located on
    #.Parameter link
    #   The original link text
    #.Parameter returnLinkType
    #   If set, returns a property bag with the link type and the resolved link
    #.Example
    #   Resolve-Link http://www.microsoft.com/ /technet/scriptcenter
    if ($link.StartsWith("/")) {
        # Relative to Host site
        if ($returnLinkType) {
            return New-Object Object |
                Add-Member NoteProperty Type "Host Relative" -PassThru |
                Add-Member NoteProperty Link ([uri]"$($uri.Scheme)://$($uri.Authority)$($link)") -PassThru
        }
        # Authority keeps a non-default port, which DnsSafeHost would drop
        return "$($uri.Scheme)://$($uri.Authority)$($link)"
    } else {
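        if ($link -match "^[a-z][a-z0-9+.-]*:") {
            # Absolute link: starts with any scheme (http:, https:, mailto:, ...),
            # not just the scheme of the page it was found on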
            if ($returnLinkType) {
                return New-Object Object |
                    Add-Member NoteProperty Type "Absolute" -PassThru |
                    Add-Member NoteProperty Link ([uri]$link) -PassThru
            }            
            return $link
        } else {
            # Relative link
            $realLink = $uri.AbsoluteUri.Substring(0,
                $uri.AbsoluteUri.LastIndexOf("/")) + "/$link"    
            if ($returnLinkType) {
                return New-Object Object |
                    Add-Member NoteProperty Type "Relative" -PassThru |
                    Add-Member NoteProperty Link ([uri]$realLink) -PassThru
            }
            return $realLink            
        }
    }    
}&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;/I&gt;
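&lt;P&gt;For comparison, here’s a minimal sketch of another way to get the resolved url: .NET’s System.Uri has a constructor that takes a base Uri and a relative string and does the resolution itself.&amp;nbsp; This is not what Resolve-Link does internally, and it doesn’t tell you the link type; the $base and $resolved variables and the sample links are only illustrations.&lt;/P&gt;&lt;I&gt;
&lt;BLOCKQUOTE&gt;&lt;PRE&gt;# Hypothetical base page, borrowed from the example url above
$base = [Uri]"http://www.foo.com/blah/blah.asp"
# One relative, one host relative, and one absolute link
$resolved = "page2.asp", "/technet/scriptcenter", "http://www.microsoft.com/" |
    Foreach-Object { (New-Object System.Uri $base, $_).AbsoluteUri }
$resolved&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;/I&gt;
&lt;P&gt;The first link resolves next to blah.asp, the second resolves against the host, and the third comes back unchanged.&amp;nbsp; Writing the function by hand is still worthwhile when you want to know which kind of link you found.&lt;/P&gt;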
&lt;P&gt;Once Resolve-Link was written, making Get-WebPageLink was a snap.&amp;nbsp; It’s below: the real work takes only a handful of lines, and most of the function is comments that explain the work and give an example.&lt;/P&gt;&lt;I&gt;
&lt;BLOCKQUOTE&gt;&lt;PRE class=CmdletDefinition&gt;function Get-WebPageLink($url) {
    #.Synopsis
    #   Returns all of the links within a webpage
    #.Description
    #   Resolves all &amp;lt;A&amp;gt; references and returns a property bag with
    #   the text contained in the link, the page the link came from,
    #   and the type of link returned (absolute, host relative, or relative)
    #.Parameter url
    #   The page to get links from
    #.Example
    #   Get-WebPageLink http://blogs.msdn.com/
    Get-MarkupTag a (Get-Web $url) | Foreach-Object {
        Resolve-Link $url $_.Xml.Href -returnLinkType |
            Add-Member NoteProperty Text $_.Xml."#text" -PassThru 
    }
}&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;/I&gt;
&lt;P&gt;Go ahead and give Get-WebpageLink a whirl: &lt;/P&gt;&lt;I&gt;
&lt;BLOCKQUOTE&gt;&lt;PRE&gt;Get-WebpageLink http://blogs.msdn.com&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;/I&gt;
&lt;P&gt;Ready for some real fun? Remember way back when I did a post about getting RSS feeds in PowerShell with Microsoft.FeedsManager (&lt;A href="http://blogs.msdn.com/mediaandmicrocode/archive/2008/11/11/microcode-scripting-rss-feeds-with-powershell-and-microsoft-feedsmanager.aspx" mce_href="http://blogs.msdn.com/mediaandmicrocode/archive/2008/11/11/microcode-scripting-rss-feeds-with-powershell-and-microsoft-feedsmanager.aspx"&gt;Get-Feed&lt;/A&gt;)?&amp;nbsp; If you have that script handy, go ahead and check out this one-liner that will refresh every RSS item you’ve got and extract all of the links from it.&lt;/P&gt;&lt;I&gt;
&lt;BLOCKQUOTE&gt;&lt;PRE&gt;    
Get-Feed -recurse -articles | Foreach-Object { Get-WebPageLink $_.Link }&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;/I&gt;
&lt;P&gt;That particular command line can take a while, depending on how many blogs you subscribe to, but it gives you a brand new view on blogs (as a simmering stew of scripts, rather than just text to be read and comprehended).&lt;/P&gt;
&lt;P&gt;There’s more fun to come in unlocking the web, but these two scripts should get you started extracting a little more from the wild world of the web.&lt;/P&gt;
&lt;P&gt;Hope this Helps,&lt;/P&gt;
&lt;P&gt;James Brundage [MSFT]&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9201602" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/PowerShell/default.aspx">PowerShell</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Microcode/default.aspx">Microcode</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Scripting+Tricks/default.aspx">Scripting Tricks</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Get-Feed/default.aspx">Get-Feed</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Get-Web/default.aspx">Get-Web</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Get-MarkupTag/default.aspx">Get-MarkupTag</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Get-WebPageLink/default.aspx">Get-WebPageLink</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Resolve-Link/default.aspx">Resolve-Link</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Scripting+The+Web/default.aspx">Scripting The Web</category></item><item><title>Microcode: PowerShell Scripting Tricks: Scripting The Web (Part 2) (Get-MarkupTag)</title><link>http://blogs.msdn.com/mediaandmicrocode/archive/2008/12/08/microcode-powershell-scripting-tricks-scripting-the-web-part-2-get-markuptag.aspx</link><pubDate>Mon, 08 Dec 2008 11:17:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9183927</guid><dc:creator>JamesBrundage</dc:creator><slash:comments>5</slash:comments><comments>http://blogs.msdn.com/mediaandmicrocode/comments/9183927.aspx</comments><wfw:commentRss>http://blogs.msdn.com/mediaandmicrocode/commentrss.aspx?PostID=9183927</wfw:commentRss><description>&lt;P&gt;The first post about scripting the web was a lot of waxing philosophical but little about how to extract 
data and give it form.&amp;nbsp; There are several approaches, with various difficulties.&amp;nbsp; I could build a full HTML parser and walk through object models, or I could use the object model of IE.&amp;nbsp; Since I personally like to minimize dependencies, I’ve chosen to use &lt;A href="http://blogs.msdn.com/mediaandmicrocode/archive/2008/12/01/microcode-powershell-scripting-tricks-scripting-the-web-part-1-get-web.aspx" mce_href="http://blogs.msdn.com/mediaandmicrocode/archive/2008/12/01/microcode-powershell-scripting-tricks-scripting-the-web-part-1-get-web.aspx"&gt;System.Net.Webclient&lt;/A&gt; to download the webpage as text rather than use Internet Explorer and get it through a proper object model. This means I will either have to write an HTML parser, or I’ll have to nudge HTML into something more useful to avoid writing a full parser.&lt;/P&gt;
&lt;P&gt;HTML is very close to XML, and XML is something that PowerShell supports quite well, so I decided to try to nudge HTML into XML.&amp;nbsp; I wrote a function, Get-MarkupTag, which will extract the text for a markup tag (e.g. &amp;lt;a&amp;gt;) and attempt to coerce it into XML.&lt;/P&gt;
&lt;P&gt;It’s not a perfect approach, because there are several things that are legal in HTML but not in XML.&amp;nbsp; Here’s an inventory of every curveball I’ve hit so far.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Unclosed tags like &amp;lt;IMG&amp;gt;: &lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Some tags in HTML, like &amp;lt;IMG&amp;gt;, are unmatched.&amp;nbsp; I first have to identify all of the unmatched tags and coerce them into properly closed XML (e.g. ensure that &amp;lt;IMG&amp;gt; becomes &amp;lt;IMG /&amp;gt;).&amp;nbsp; For things like &amp;lt;BR&amp;gt; or &amp;lt;HR&amp;gt; this is easy, because attributes are rare, but for IMG, getting this right is critically important, because it’s how you download images.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;HTML Escape Sequences end up as XML unrecognized entities:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;There are a lot of ways to embed special characters within HTML that an XML parser sees as unrecognized entities.&amp;nbsp; The most common example is &amp;amp;nbsp; (the non-breaking space), but foreign currency symbols often show up in this format as well.&amp;nbsp; I found a complete list online that I could extract all of the sequences from, and I embedded a hashtable within the script to hold each item.&amp;nbsp; Then I walked through the hashtable and replaced the escape sequences with their real value.&amp;nbsp; However, since the entity is legal in any case (e.g. &amp;amp;nbsp; is the same as &amp;amp;NBSP; to an HTML parser), I had to write a quick case insensitive replace.&lt;/P&gt;
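&lt;P&gt;As a rough sketch of that replace step (not the code in Get-MarkupTag itself; the three sample entities and the $entities and $html names are only for illustration), PowerShell’s -replace operator happens to be case insensitive by default:&lt;/P&gt;&lt;I&gt;
&lt;BLOCKQUOTE&gt;&lt;PRE&gt;# A tiny sample table: entity name to the character it stands for
$entities = @{ "nbsp" = [char]160; "pound" = [char]163; "euro" = [char]8364 }
$html = "Price:&amp;amp;NBSP;&amp;amp;pound;5"
foreach ($name in $entities.Keys) {
    # Escape the entity so it is matched literally; -replace ignores case
    $html = $html -replace [regex]::Escape("&amp;amp;$name;"), $entities[$name]
}
$html&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;/I&gt;
&lt;P&gt;Both the upper case and the lower case entity get swapped for their real characters, so the string no longer contains anything an XML parser would reject as an unknown entity.&lt;/P&gt;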
&lt;P&gt;&lt;STRONG&gt;Unquoted Attributes&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;Most browsers will accept HTML attributes without quotes, e.g. &amp;lt;a href=www.microsoft.com&amp;gt;, but XML can’t handle this.&amp;nbsp; Once I’ve figured out the text in each tag, I need to check each attribute within that tag to ensure that the attributes are quoted.&amp;nbsp; &lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Nested tags&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;One of the more interesting parts of parsing HTML was nested tags.&amp;nbsp; In order to match nested tags, I used two regular expressions, one for start tags and one for end tags, to identify all of the matches.&amp;nbsp; While PowerShell has a –match operator, –match only reports the first match it finds, and I needed every match across a multiline string.&amp;nbsp; So I used New-Object to create the regular expression for extracting a tag and then stored the result in a variable.&lt;/P&gt;
&lt;P&gt;If the number of start tags was equal to the number of end tags, I assumed that the tags were balanced.&amp;nbsp; If the counts differ, I only deal with the start tags instead.&amp;nbsp; To determine which start tag matches which end tag, I need a data structure called a stack.&amp;nbsp; Luckily, .NET comes with one (&lt;A href="http://msdn.microsoft.com/en-us/library/system.collections.stack.aspx" mce_href="http://msdn.microsoft.com/en-us/library/system.collections.stack.aspx"&gt;System.Collections.Stack&lt;/A&gt;).&amp;nbsp; A stack in code is just like a stack in real life.&amp;nbsp; You can put something on the top of the stack (Push), see what’s on top of the stack (Peek), and pull off the top of the stack (Pop).&lt;/P&gt;
&lt;P&gt;If I put start tags and end tags in one list, and sort them by the order that they occurred (in PowerShell, this is $list1 + $list2 | Sort-Object PropertyName), then I can simply walk through the list, pushing the start tags onto the stack and popping one off each time I encounter an end tag.&lt;/P&gt;
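&lt;P&gt;Here’s a minimal sketch of that walk, assuming a made-up snippet of markup and two deliberately simple regular expressions (the ones in Get-MarkupTag are more involved, and every variable name here is hypothetical):&lt;/P&gt;&lt;I&gt;
&lt;BLOCKQUOTE&gt;&lt;PRE&gt;# Sample markup with one tag nested inside another of the same name
$html = "&amp;lt;div&amp;gt;outer &amp;lt;div&amp;gt;inner&amp;lt;/div&amp;gt; tail&amp;lt;/div&amp;gt;"
$starts = [regex]::Matches($html, "&amp;lt;div\b[^&amp;gt;]*&amp;gt;", "IgnoreCase")
$ends = [regex]::Matches($html, "&amp;lt;/div&amp;gt;", "IgnoreCase")
$stack = New-Object System.Collections.Stack
$pairs = @()
# Merge both lists, then walk them in the order they occur in the text
foreach ($m in (@($starts) + @($ends) | Sort-Object Index)) {
    if ($m.Value.StartsWith("&amp;lt;/")) {
        if ($stack.Count -gt 0) {
            # An end tag closes the most recent unclosed start tag
            $open = $stack.Pop()
            $pairs += $html.Substring($open.Index, $m.Index + $m.Length - $open.Index)
        }
    } else {
        $stack.Push($m)
    }
}
$pairs&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;/I&gt;
&lt;P&gt;The inner div comes out first, followed by the outer div with the inner one still inside it, because each end tag is paired with whatever start tag is on top of the stack at that moment.&lt;/P&gt;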
&lt;P&gt;I don’t believe that I’ve taken care of all of the possible pains that exist (for instance, I know that geographic coordinates on Wikipedia do not coerce into XML at this point).&amp;nbsp; However, it’s a good start and it can yield some quite interesting results.&lt;/P&gt;
&lt;P&gt;Since Get-MarkupTag involves a lot of escape sequences, which leads to very annoying blog formatting, I’m going to attach the script instead of embedding it.&amp;nbsp; &lt;A title=Get-MarkupTag href="http://blogs.msdn.com/mediaandmicrocode/attachment/9183927.ashx" mce_href="http://blogs.msdn.com/mediaandmicrocode/attachment/9183927.ashx"&gt;You can download it here&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;If you look at the code in Get-MarkupTag, close to the end, you’ll see a trap statement. This means that any error coercing the chunk of HTML into XML will be swallowed into the verbose log.&amp;nbsp; If you want to examine why you can’t use the XML, simply set $verbosePreference to “continue” and you will be able to see the errors produced trying to convert the tag into XML.&amp;nbsp; This is an example of a technique I call error redirection.&lt;/P&gt;
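&lt;P&gt;To show the shape of that technique outside of Get-MarkupTag, here’s a small sketch.&amp;nbsp; ConvertTo-XmlQuietly is a hypothetical helper made up for this example, not a function from the attached script:&lt;/P&gt;&lt;I&gt;
&lt;BLOCKQUOTE&gt;&lt;PRE&gt;function ConvertTo-XmlQuietly([string]$text) {
    trap {
        # Send the error to the verbose stream instead of the error stream
        Write-Verbose "Could not coerce into XML: $_"
        continue
    }
    [xml]$text
}
# Quoted attribute: converts cleanly
$good = ConvertTo-XmlQuietly "&amp;lt;a href='x'&amp;gt;link&amp;lt;/a&amp;gt;"
# Unquoted attribute: the cast fails, the trap logs it, nothing is returned
$bad = ConvertTo-XmlQuietly "&amp;lt;a href=x&amp;gt;link&amp;lt;/a&amp;gt;"&lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;/I&gt;
&lt;P&gt;With the default $verbosePreference the second call is silent and $bad is empty; set $verbosePreference to “continue” first and the parse error shows up as a verbose message.&lt;/P&gt;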
&lt;P&gt;One of the kind of cool things you can do with Get-MarkupTag is extract the individual rows from ConvertTo-HTML:&lt;/P&gt;&lt;I&gt;
&lt;BLOCKQUOTE&gt;&lt;PRE&gt;    $text = Get-ChildItem | Select Name, LastWriteTime | ConvertTo-HTML | Out-String 
    Get-MarkupTag "tr" $text
        &lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;/I&gt;
&lt;P&gt;It's also possible to extract out all of the pages &lt;A href="http://www.microsoft.com/" mce_href="http://www.microsoft.com"&gt;www.microsoft.com&lt;/A&gt; links to:&lt;/P&gt;&lt;I&gt;
&lt;BLOCKQUOTE&gt;&lt;PRE&gt;    $microsoft= (New-Object Net.Webclient).DownloadString("http://www.microsoft.com/")
    Get-MarkupTag "a" $microsoft | % { $_.Xml.Href }        &lt;/PRE&gt;&lt;/BLOCKQUOTE&gt;&lt;/I&gt;
&lt;P&gt;What this quick and dirty approach to extracting HTML gives you is a fairly simple way to start giving the data form. With XML you get the syntactic sugar PowerShell pours on XML, and you also get the power of XPath, and from that, you can start turning the HTML into something more interesting.&lt;/P&gt;
&lt;P&gt;In the next post, I’ll look at some of the things you can do with this new toy.&lt;/P&gt;
&lt;P&gt;Try it out on a few sites.&lt;/P&gt;
&lt;P&gt;Hope this helps,&lt;/P&gt;
&lt;P&gt;James Brundage [MSFT]&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9183927" width="1" height="1"&gt;</description><enclosure url="http://blogs.msdn.com/mediaandmicrocode/attachment/9183927.ashx" length="11118" type="application/octet-stream" /><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/PowerShell/default.aspx">PowerShell</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Microcode/default.aspx">Microcode</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Get-MarkupTag/default.aspx">Get-MarkupTag</category><category domain="http://blogs.msdn.com/mediaandmicrocode/archive/tags/Scripting+The+Web/default.aspx">Scripting The Web</category></item></channel></rss>