The first post about scripting the web was a lot of waxing philosophical, but it said little about how to extract data and give it form.  There are several approaches, with various difficulties.  I could build a full HTML parser and walk through object models, or I could use the object model of IE.  Since I personally like to minimize dependencies, I’ve chosen to use System.Net.WebClient to download the webpage as text rather than use Internet Explorer and get it through a proper object model.  This means I will either have to write an HTML parser, or I’ll have to nudge HTML into something more useful to avoid writing a full parser.

HTML is precariously close to XML, and XML is something that PowerShell supports quite well, so I decided to try to nudge HTML into XML.  I wrote a function, Get-MarkupTag, which will extract the text for a markup tag (e.g. <a>) and attempt to coerce it into XML.

It’s not a perfect approach, because there are several things that are legal in HTML but not in XML.  Here’s an inventory of every curveball I’ve hit so far.

Unclosed tags like <IMG>:

Some tags in HTML, like <IMG>, are unmatched.  I first have to identify all of the unmatched tags and coerce them into properly closed XML (e.g. ensure that <IMG ...> becomes <IMG ... />).  For things like <BR> or <HR> this is easy, because attributes are rare, but for IMG, getting this right is critically important, because its attributes are how you find the images to download.
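As a rough illustration of this step (a sketch only, not the code from Get-MarkupTag), a single regular expression can self-close the tags that never take an end tag.  The function name Close-VoidTag and the default tag list are assumptions made for this example, and the pattern assumes no attribute value contains a > character.

```powershell
# Hypothetical helper: rewrite <IMG ...> as <IMG ... /> so an XML parser accepts it.
function Close-VoidTag {
    param(
        [string]$Html,
        # Assumed list of tags that never have an end tag; extend as needed.
        [string[]]$TagName = @('img', 'br', 'hr', 'input', 'meta', 'link')
    )
    $names = $TagName -join '|'
    # Group 1 is the tag name, group 2 the attributes.
    # An existing trailing slash is consumed so <br/> is not closed twice.
    $pattern = "<($names)\b([^>]*?)\s*/?>"
    $ignoreCase = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
    [regex]::Replace($Html, $pattern, '<$1$2 />', $ignoreCase)
}

$fixed = Close-VoidTag '<p>Hi<BR><IMG src="a.png"><hr/></p>'
$fixed
# The result should now survive a cast to XML.
$xml = [xml]$fixed
```

The trade-off of a regex like this is that it knows nothing about context, so a > inside a quoted attribute value would fool it.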

HTML Escape Sequences end up as XML unrecognized entities:

There are a lot of ways to embed special characters within HTML that an XML parser rejects as unrecognized entities.  The most common example is &nbsp; (the non-breaking space), but foreign currencies often show up in this format as well.  I found a complete list online that I could extract all of the sequences from, and I embedded a hashtable within Get-MarkupTag to hold each item.  Then I walked through the hashtable and replaced the escape sequences with their real value.  However, since the entity is legal in any case (e.g. &nbsp; is the same as &NBSP; to an HTML parser), I had to write a quick case-insensitive replace.
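The hashtable walk might look something like the sketch below.  The function name Expand-HtmlEntity is made up for this example and the table holds only four entries for illustration; the real list is far longer.  It leans on the fact that the -replace operator is case-insensitive by default.

```powershell
# Hypothetical helper: swap named HTML entities for the characters they stand for.
function Expand-HtmlEntity {
    param([string]$Html)
    # A tiny illustrative subset of the entity list.
    # XML's own entities (amp, lt, gt, quot, apos) are deliberately left out.
    $entities = @{
        'nbsp'  = [char]160
        'copy'  = [char]169
        'pound' = [char]163
        'euro'  = [char]0x20AC
    }
    foreach ($name in $entities.Keys) {
        $value = [string]$entities[$name]
        # -replace ignores case, so &NBSP; and &nbsp; are both replaced.
        $Html = $Html -replace "&$name;", $value
    }
    $Html
}

Expand-HtmlEntity 'Price:&NBSP;&pound;5 &amp; up'
```

Leaving &amp; and friends alone matters, because those are the entities XML already understands.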

Unquoted Attributes:

Most browsers will accept HTML attributes without quotes, e.g. <a href=www.microsoft.com>, but XML can’t handle this.  Once I’ve figured out the text in each tag, I need to check each attribute within that tag to ensure that its value is quoted.
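One way to sketch the quoting step is a regex that finds name=value pairs whose value does not start with a quote.  Add-AttributeQuote is a hypothetical name, and the pattern is an assumption rather than what Get-MarkupTag does; in particular it can be fooled by text that looks like name=value inside an already quoted value.

```powershell
# Hypothetical helper: wrap bare attribute values in double quotes.
function Add-AttributeQuote {
    param([string]$Tag)
    # Group 1: whitespace plus the attribute name.
    # Group 2: a value that does not begin with ' or " and runs to whitespace or '>'.
    $pattern = '(\s[\w:-]+)\s*=\s*(?!["''])([^\s>]+)'
    [regex]::Replace($Tag, $pattern, '$1="$2"')
}

Add-AttributeQuote '<a href=www.microsoft.com target="_blank" id=x1>'
```

Attributes that are already quoted are skipped by the negative lookahead, so the function is safe to run over tags that are partly clean.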

Nested tags:

One of the more interesting parts of parsing HTML was nested tags.  In order to match nested tags, I used two regular expressions, one for start tags and one for end tags, to identify all of the matches.  While PowerShell has a -match operator, -match only reports the first match, which makes it a poor fit for finding every tag in a multiline string.  So I used New-Object to create the regular expression for extracting a tag and then stored the result in a variable.
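Here is a small sketch of that approach, using a made-up snippet of HTML and patterns written for this example (they are not the ones in Get-MarkupTag).  The Matches method returns every match along with its position in the string.

```powershell
# Sample multiline input, invented for this example.
$html = "<ul>`n<LI>one</LI>`n<li class=x>two`nlines</li>`n</ul>"

# Options are passed as the second constructor argument.
$options = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase, Singleline'
$startTag = New-Object System.Text.RegularExpressions.Regex '<li\b[^>]*>', $options
$endTag   = New-Object System.Text.RegularExpressions.Regex '</li\s*>', $options

# Matches() returns all matches, not just the first one.
$starts = $startTag.Matches($html)
$ends   = $endTag.Matches($html)

# Each match records where it was found (Index) and what it matched (Value).
foreach ($m in $starts) { "start tag at $($m.Index): $($m.Value)" }
foreach ($m in $ends)   { "end tag at $($m.Index): $($m.Value)" }
```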

If the number of start tags is equal to the number of end tags, I assume that the tags are balanced.  If the counts differ, I only deal with the start tags instead.  To determine which start tag matches which end tag, I need a data structure called a stack.  Luckily, .NET comes with one (System.Collections.Stack).  A stack in code is just like a stack in real life.  You can put something on the top of the stack (Push), see what’s on top of the stack (Peek), and pull off the top of the stack (Pop).

If I put start tags and end tags in one list, and sort them by the order that they occurred (in PowerShell, this is $list1 + $list2 | Sort-Object PropertyName), then I can simply walk through the list, pushing start tags onto the stack and popping the most recent start tag each time I encounter an end tag.
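The walk described above can be sketched as follows.  The sample HTML and variable names are invented for the example, and it assumes the tags are balanced; it is an illustration of the stack technique, not the code from Get-MarkupTag.

```powershell
# Two nested <div> tags, made up for this example.
$html = '<div>outer <div>inner</div> tail</div>'

$starts = [regex]::Matches($html, '<div\b[^>]*>')
$ends   = [regex]::Matches($html, '</div\s*>')

# One list of start and end tags, in the order they occur in the text.
$all = @($starts) + @($ends) | Sort-Object Index

$stack = New-Object System.Collections.Stack
$pairs = foreach ($m in $all) {
    if ($m.Value.StartsWith('</')) {
        if ($stack.Count -gt 0) {
            # The most recently pushed start tag belongs to this end tag.
            $open = $stack.Pop()
            # Emit everything from the start tag through the end tag.
            $html.Substring($open.Index, $m.Index + $m.Length - $open.Index)
        }
    } else {
        $stack.Push($m)
    }
}

$pairs
```

Because inner tags close first, the innermost element comes out of the loop before the element that contains it.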

I don’t believe that I’ve taken care of all of the possible pains that exist (for instance, I know that geographic coordinates on Wikipedia do not coerce into XML at this point).  However, it’s a good start and it can yield some quite interesting results.

Since Get-MarkupTag involves a lot of escape sequences, which leads to very annoying blog formatting, I’m going to attach the script instead of embedding it.  You can download it here.

If you look at the code in Get-MarkupTag, close to the end, you’ll see a trap statement.  This means that any error coercing the chunk of HTML into XML will be swallowed into the verbose log.  If you want to examine why you can’t use the XML, simply set $VerbosePreference to “Continue” and you will be able to see the errors produced while trying to convert the tag into XML.  This is an example of a technique I call error redirection.
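For readers who have not used trap this way, here is a minimal sketch of the pattern.  The function name ConvertTo-XmlOrNothing and the message text are assumptions for illustration; the actual trap in Get-MarkupTag may differ.

```powershell
# Hypothetical helper: try to cast text to XML, and log failures to the verbose stream.
function ConvertTo-XmlOrNothing {
    param([string]$Text)
    trap {
        # $_ is the error that was raised; write it where it only shows on request.
        Write-Verbose "Could not convert to XML: $($_.Exception.Message)"
        # Resume after the statement that failed instead of surfacing the error.
        continue
    }
    [xml]$Text
}

# Mismatched tags: nothing is returned, and nothing is shown by default.
$bad = ConvertTo-XmlOrNothing '<a><b></a>'

# Well-formed input comes back as an XML document.
$good = ConvertTo-XmlOrNothing '<a href="x">hi</a>'
$good.a.href
```

Setting $VerbosePreference = 'Continue' before the first call would show the parser's complaint about the mismatched tags.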

One of the cool things you can do with it is extract the individual rows from the output of ConvertTo-HTML:

    $text = Get-ChildItem | Select Name, LastWriteTime | ConvertTo-HTML | Out-String 
    Get-MarkupTag "tr" $text
        

It's also possible to extract out all of the pages www.microsoft.com links to:

    $microsoft= (New-Object Net.Webclient).DownloadString("http://www.microsoft.com/")
    Get-MarkupTag "a" $microsoft | % { $_.Xml.Href }        

What this quick and dirty approach to extracting HTML gives you is a fairly simple way to start giving the data form.  With XML you get the syntactic sugar PowerShell pours on XML, and you also get the power of XPath, and from that, you can start turning the HTML into something more interesting.
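To make that concrete, here is a small sketch using a hand-written table row (the file name and date are invented sample data).  It shows both the property syntax and an XPath query against the same document.

```powershell
# A made-up row, shaped like one that ConvertTo-HTML might produce.
[xml]$row = '<tr><td>notes.txt</td><td>1/1/2009</td></tr>'

# Syntactic sugar: elements show up as properties.
$cells = $row.tr.td
$cells[0]

# XPath: pick out the second cell directly.
$second = $row.SelectSingleNode('/tr/td[2]')
$second.InnerText
```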

In the next post, I’ll look at some of the things you can do with this new toy.

Try it out on a few sites.

Hope this helps,

James Brundage [MSFT]