Get-MarkupTag


On my personal blog (Media And Microcode), I've been posting a series called "Scripting the Web", which introduced a function called Get-MarkupTag. Get-MarkupTag is a handy little function that pulls individual tags out of a web page and coerces them into XML, so you can scrape data from the page.

I've updated Get-MarkupTag a tiny bit for CTP3: the -tag parameter is now marked ValueFromPipeline, so I can pipe in several tag names and get multiple tag types from a single document. I'm posting it again here so that a wider audience can make use of it, and so that I can use it in some later blog posts. Since I've already got inline help for the function, I simply used Write-CommandBlogPost to output its documentation and post it again here.
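
If pipeline binding is new to you, here is a minimal sketch of what that attribute does. Show-TagName is a made-up demo function, not part of Get-MarkupTag, and the HTML string is just a placeholder. It only shows that the process block runs once per piped tag name while -html stays the same.

```powershell
# Hypothetical demo function (not part of Get-MarkupTag): a parameter marked
# ValueFromPipeline is bound once per piped item, and the process block
# runs once for each of those items.
function Show-TagName {
    param(
        [Parameter(ValueFromPipeline=$true, Position=0)]$tag,
        [Parameter(Position=1)][string]$html
    )
    process {
        # $tag holds the current pipeline item; $html is identical every time
        "Looking for <$tag> in $($html.Length) characters"
    }
}

# Two tag names are piped in, so the process block runs twice
$results = "a", "div" | Show-TagName -html "<div><a href='#'>x</a></div>"
$results
```

Get-MarkupTag uses the same shape: tag names come down the pipeline and the document is passed by name.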

Enjoy!


Synopsis:

Extracts out a markup language (HTML or XML) tag from within a document


Syntax:

Get-MarkupTag [[-tag] [<Object>]] [[-html] [<String>]] [-Verbose] [-Debug] [-ErrorAction [<ActionPreference>]] [-WarningAction [<ActionPreference>]] [-ErrorVariable [<String>]] [-WarningVariable [<String>]] [-OutVariable [<String>]] [-OutBuffer [<Int32>]] [<CommonParameters>]


Detailed Description:

Extracts out a markup language (HTML or XML) tag from within a document.
Returns the tag, the text within the tag, and, if possible, the tag converted
to XML


Examples:

    -------------------------- EXAMPLE 1 --------------------------





# Download the Microsoft front page and extract out links and div tags
$microsoft = (New-Object Net.Webclient).DownloadString("http://www.microsoft.com/")
"a", "div" | Get-MarkupTag -html $microsoft
    
    -------------------------- EXAMPLE 2 --------------------------





# Extract the rows from ConvertTo-HTML
$text = Get-ChildItem | Select Name, LastWriteTime | ConvertTo-HTML | Out-String 
Get-MarkupTag "tr" $text
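
Each object that comes back has a Tag property (the raw text of the tag) and an Xml property (the tag cast to XML, when the cast succeeds). Here is a rough sketch of why the XML form is useful. The $tagText literal below is a hand-written stand-in, not real output from the function.

```powershell
# A single well-formed tag, similar to what the Tag property might hold.
# (Illustrative assumption only; not captured from a real page.)
$tagText = "<a href='http://www.microsoft.com/' title='Home'>Microsoft</a>"

# Casting to [xml] gives an XmlDocument; .a selects the root <a> element,
# which is the same thing the function does with ([xml]$t).$tag
$link = ([xml]$tagText).a

$link.href       # the href attribute
$link.title      # the title attribute
$link.'#text'    # the text inside the tag
```

Attributes show up as ordinary properties, so filtering links by href becomes a plain Where-Object call.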
    


Command Parameters:

Name    Description
tag     The tag to extract, e.g. "a", "div"
html    The text to extract the tag from


Here's Get-MarkupTag:

function Get-MarkupTag {
           
    #.Synopsis
    #   Extracts out a markup language (HTML or XML) tag from within a document
    #.Description
    #   Extracts out a markup language (HTML or XML) tag from within a document.
    #   Returns the tag, the text within the tag, and, if possible, the tag converted
    #   to XML
    #.Parameter tag 
    #   The tag to extract, e.g. "a", "div"
    #.Parameter html
    #   The text to extract the tag from
    #.Example
    #   # Download the Microsoft front page and extract out links and div tags
    #   $microsoft = (New-Object Net.Webclient).DownloadString("http://www.microsoft.com/")
    #   "a", "div" | Get-MarkupTag -html $microsoft
    #.Example
    #   # Extract the rows from ConvertTo-HTML
    #   $text = Get-ChildItem | Select Name, LastWriteTime | ConvertTo-HTML | Out-String 
    #   Get-MarkupTag "tr" $text
    param(
        [Parameter(ValueFromPipeline=$true,Position=0)]$tag,
        [Parameter(Position=1)][string]$html)
begin {
    
        $replacements = @{
            "<BR>" = "<BR />"
            "<HR>" = "<HR />"
            "&nbsp;" = " "
            '&macr;'='¯'
            '&ETH;'='Ð'
            '&para;'='¶'
            '&yen;'='¥'
            '&ordm;'='º'
            '&sup1;'='¹'
            '&ordf;'='ª'
            '&shy;'='­'
            '&sup2;'='²'
            '&Ccedil;'='Ç'
            '&Icirc;'='Î'
            '&curren;'='¤'
            '&frac12;'='½'
            '&sect;'='§'
            '&Acirc;'='Â'
            '&Ucirc;'='Û'
            '&plusmn;'='±'
            '&reg;'='®'
            '&acute;'='´'
            '&Otilde;'='Õ'
            '&brvbar;'='¦'
            '&pound;'='£'
            '&Iacute;'='Í'
            '&middot;'='·'
            '&Ocirc;'='Ô'
            '&frac14;'='¼'
            '&uml;'='¨'
            '&Oacute;'='Ó'
            '&deg;'='°'
            '&Yacute;'='Ý'
            '&Agrave;'='À'
            '&Ouml;'='Ö'
            '&quot;'='"'
            '&Atilde;'='Ã'
            '&THORN;'='Þ'
            '&frac34;'='¾'
            '&iquest;'='¿'
            '&times;'='×'
            '&Oslash;'='Ø'
            '&divide;'='÷'
            '&iexcl;'='¡'
            '&sup3;'='³'
            '&Iuml;'='Ï'
            '&cent;'='¢'
            '&copy;'='©'
            '&Auml;'='Ä'
            '&Ograve;'='Ò'
            '&Aring;'='Å'
            '&Egrave;'='È'
            '&Uuml;'='Ü'
            '&Aacute;'='Á'
            '&Igrave;'='Ì'
            '&Ntilde;'='Ñ'
            '&Ecirc;'='Ê'
            '&cedil;'='¸'
            '&Ugrave;'='Ù'
            '&szlig;'='ß'
            '&raquo;'='»'
            '&euml;'='ë'
            '&Eacute;'='É'
            '&micro;'='µ'
            '&not;'='¬'
            '&Uacute;'='Ú'
            '&AElig;'='Æ'
            '&euro;'= "€"        
        }
    
}
process {

        foreach ($r in $replacements.GetEnumerator()) {
            $l = 0 
            do {
                $l = $html.IndexOf($r.Key, $l, [StringComparison]"CurrentCultureIgnoreCase")
                if ($l -ne -1) {
                    $html = $html.Remove($l, $r.Key.Length)
                    $html = $html.Insert($l, $r.Value)
                }
            } while ($l -ne -1)         
        }
     
        $r = New-Object Text.RegularExpressions.Regex ('</' + $tag + '>'), ("Singleline", "IgnoreCase")
        $endTags = @($r.Matches($html))
        # \b keeps a short tag name such as "a" from also matching <abbr> or <area>
        $r = New-Object Text.RegularExpressions.Regex ('<' + $tag + '\b[^>]*>'), ("Singleline", "IgnoreCase")
        $startTags = @($r.Matches($html))
        $tagText = @()
        if ($startTags.Count -eq $endTags.Count) {
            $allTags = $startTags + $endTags | Sort-Object Index   
            $startTags = New-Object Collections.Stack
            foreach($t in $allTags) {
                if (-not $t) { continue } 
                if ($t.Value -like "<$tag*") {
                    $startTags.Push($t)
                } else {
                    $start = $startTags.Pop()
                    $tagText+=($html.Substring($start.Index, $t.Index + $t.Length - $start.Index))
                }
            }
        } else {
            # Unbalanced document, use start tags only and make sure that the tag is self-enclosed
            $startTags | Foreach-Object {
                $t = "$($_.Value)"
                if ($t -notlike "*/>") {
                    $t = $t.Insert($t.Length - 1, "/")
                }
                $tagText+=$t
            } 
        }
        foreach ($t in $tagText) {
            if (-not $t) {continue }
            # Correct HTML which doesn't quote the attributes so it can be coerced into XML
            $inTag = $false
            for ($i = 0; $i -lt $t.Length; $i++) {
                if ($t[$i] -eq "<") {
                    $inTag = $true
                } else {
                    if ($t[$i] -eq ">") {
                        $inTag = $false
                    }
                }
                if ($inTag -and ($t[$i] -eq "=")) {
                    if ($t[$i + 1] -notmatch '[''|"]') {
                        $endQuoteSpot = $t.IndexOfAny(" >", $i + 1)
                        # Find the end of the attribute, then quote
                        $t = $t.Insert($i + 1, "'")
                        $t = $t.Insert($endQuoteSpot + 1, "'")                    
                        $i = $endQuoteSpot
                    } else {
                        # Make sure the quotes are correctly formatted, otherwise,
                        # end the quotes manually
                        $whichQuote = "$($Matches.Values)"
                        $endQuoteSpot = $t.IndexOf($whichQuote, $i + 2)
                        $i = $endQuoteSpot
                    }
                }
            }        
            $t | Select-Object @{
                Name='Tag'
                Expression={$_}
            }, @{
                Name='Xml'
                Expression= {
                    ([xml]$t).$tag      
                    trap {
                        Write-Verbose ($_ | Out-String) 
                        continue
                    }
                }
            }    
        }
    
}

}
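
The balanced-document branch is the part that handles nested tags, and it's easier to see in isolation. The following is a simplified sketch of that stack-based pairing, not the function itself: the sample $html string is made up, and entity replacement and attribute quoting are left out.

```powershell
# Made-up sample document with one <div> nested inside another
$html = "<div>outer <div>inner</div></div>"
$tag  = "div"

# Find every start tag and every end tag for the requested tag name
$starts = [regex]::Matches($html, ('<' + $tag + '\b[^>]*>'), "IgnoreCase")
$ends   = [regex]::Matches($html, ('</' + $tag + '>'), "IgnoreCase")

# Walk all matches in document order
$all = @($starts) + @($ends) | Sort-Object Index

$open  = New-Object Collections.Stack
$found = @()
foreach ($m in $all) {
    if ($m.Value -like "</*") {
        # An end tag closes the most recently opened start tag
        $start = $open.Pop()
        $found += $html.Substring($start.Index, $m.Index + $m.Length - $start.Index)
    } else {
        # A start tag waits on the stack until its end tag shows up
        $open.Push($m)
    }
}

$found
```

Because each end tag pops the most recently pushed start tag, inner elements are emitted before the elements that contain them.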
    


Automatically generated with Write-CommandBlogPost

  • Geez James you been very busy :)

    Chris

  • You might also be interested in the alternative of using the HTML AgilityPack (http://www.codeplex.com/htmlagilitypack) to convert HTML to XML, so that you can then use all the PowerShell and .Net XML goodness on any old web page.

    I compiled my own HTML AgilityPack dll and include code to load it locally. I also assume that there's a $source parameter or variable.

    ## load the HtmlAgilityPack and read the HTML as XML
    $hapLocation = join-path $pwd.Path 'HtmlAgilityPack.dll'
    [Reflection.Assembly]::LoadFile($hapLocation)
    [HtmlAgilityPack.HtmlDocument]$doc = New-Object -TypeName HtmlAgilityPack.HtmlDocument
    $doc.OptionOutputAsXml = $true
    ## read the file as one string rather than an array of lines
    $content = gc $source | Out-String
    $doc.LoadHtml($content)

    ## turn HTML into XML in one line!
    [xml]$xml  = $doc.DocumentNode.OuterHtml

    ## now do something with it...
