There was an interesting article about using VB's XML support for generating HTML: http://www.infoq.com/news/2009/02/MVC-VB.

I've been using VB and XML for the reverse purpose -- scraping web pages to retreive information. I enjoy sailing, and I wanted to find historical data on windspeeds to know when would be the best time of year to set out on a long trip. (Answer: March and April have the best winds around Seattle).

I found an excellent site, to scrape from, http://www.almanac.com/, which has historical weather data for many places around the country. The first step in scraping is copyright law. Facts alone are not copyrightable, but the act of selecting and compiling facts is a creative work and so the compilation is copyrightable. Hence, for instance, a telephone directory is protected by copyright. So too is almanac.com's compilation. And that's why I only scraped their pages for my own personal use.

The almanac has URLs like this: http://www.almanac.com/weatherhistory/oneday.php?number=994014&wban=99999&day=1&month=4&year=2008&searchtype=. It's easy to see what the format is, and generate similar URLs myself.

 

The code to parse XHTML:

I looked at the HTML source code of a page from the almanac in Notepad, figured out its structure, and wrote some simple XML queries to dig into it. (Note: the function "Fetch" fetches HTML pages from the web, but converts them into XHTML ready for VB XML queries. More on that later). Here's the VB code. I highlighted the XML queries.

 

Option Strict On

Imports System.Net

Imports System.IO

Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml">

 

 

Module Module1

 

    Dim Places As Integer() = {994014}

    Dim Years As Integer() = {2008}

    Dim Months As Integer() = {4, 5}

 

    Sub Main()

        Console.WriteLine("{1}{0}{2}{0}{3}{0}{4}{0}{5}{0}{6}{0}{7}{0}{8}", vbTab, "Date (Y/M/D)", "Location", "Temp (^F)", "Precipitation (in)", "Visibility (miles)", "Wind Mean (mph)", "Wind Sustained (mph)", "Wind Gust (mph)")

        For Each year As Integer In Years

            For Each month As Integer In Months

                Dim d = New DateTime(year, month, 1)

                Dim dnm = New DateTime(If(d.Month = 12, d.Year + 1, d.Year), If(d.Month = 12, 1, d.Month + 1), d.Day)

                Dim lastDay = CInt((dnm - d).TotalDays)

                For day As Integer = 1 To lastDay

                    For Each place As Integer In Places

                        Dim url = String.Format("http://www.almanac.com/weatherhistory/oneday.php?number={0}&wban=99999&day={1}&month={2}&year={3}&searchtype=", place, day, month, year)

                        Dim fn = Fetch(url)

                        Dim xml = XElement.Load(fn)

                        Dim body = (From i In xml...<xhtml:div> Where i.GetAttr("class") = "yui-u first").FirstOrDefault

                        If body Is Nothing Then Continue For

                        Dim title = body.<xhtml:h2>.Value.ToString.Replace(",", " ")

                        If title.ToLower.StartsWith("no data") Then Continue For

                        Dim temp, precipitation, visibility, windMean, windSustained, windGust As Double?

                        Dim data = From i In body...<xhtml:td>

                        For Each td In data

                            Dim text = td.<xhtml:p>.FirstOrDefault

                            If text Is Nothing Then Continue For

                            Dim category = text.Value.Replace(vbCrLf, " ").Replace(vbCr, " ").Replace(vbLf, " ").ToLower

                            text = td.<xhtml:b>.FirstOrDefault

                            If text Is Nothing Then Continue For

                            Dim svalue = text.Value.Replace(vbCrLf, " ").Replace(vbCr, " ").Replace(vbLf, " ").ToLower

                            Dim value = 0.0 : If Not Double.TryParse(svalue, value) Then Continue For

                            If category Like "mean temperature" Then temp = value

                            If category Like "total precipitation" Then precipitation = value

                            If category Like "visibility" Then visibility = value

                            If category Like "mean wind speed" Then windMean = value

                            If category Like "maximum sustained" Then windSustained = value

                            If category Like "maximum gust" Then windGust = value

                        Next

                        Dim s = String.Format("{0:0000}/{1:00}/{2:00}", year, month, day)

                        Console.WriteLine("{1}{0}{2}{0}{3}{0}{4}{0}{5}{0}{6}{0}{7}{0}{8}", vbTab, s, title, temp, precipitation, visibility, windMean, windSustained, windGust)

                    Next

                Next

            Next

        Next

    End Sub

End Module

 

Fetching pages: HTML into XHTML

Goal: to use VB's XML support for reading the web page. That's because VB has such nice syntax (I find it easier than xpath, or beautiful soup, or the alternatives). The problem is that most web-pages are written in a sloppy kind of HTML that might render okay but certainly can't be loaded into XElement.Load.

Solution: download Tidy, an awesome open-source library and executable for, well, tidying HTML into proper XHTML. I downloaded "tidy.exe" and put it into my windows directory, so I could execute it without messing around with the path.

The above code calls a function "Fetch". This is the one that fetches pages, and invokes "tidy" to clean up the html. Here is the implementation of Fetch. It uses a function "InputAndOutputToEnd" to redirect input and output of tidy.exe when it runs it. I wrote about InputAndOutputToEnd last month.

 

Module Helpers

 

    ''' <summary>

    ''' GetAttr: x.GetAttr("attr") is equivalent to x.@attr. It's here to work around a MONO bug: MONO

    ''' will throw an exception on x.@attr if the attribute is absent; the CLR doesn't. This function

    ''' also doesn't throw.

    ''' </summary>

    <System.Runtime.CompilerServices.Extension()> Function GetAttr(ByVal e As XElement, ByVal attr As String) As String

        If e Is Nothing Then Return ""

        For Each a In e.Attributes

            If String.Compare(attr, a.Name.LocalName, True) = 0 Then Return a.Value

        Next

        Return ""

    End Function

 

    ''' <summary>

    ''' Fetch: this function fetches the given Url and saves it into a cache in a temporary directory.

    ''' It returns the filename. If the Url had given back "text/html", then this function invokes

    ''' "tidy.exe" (from http://tidy.sourceforge.net/) to turn the html into valid XHTML such as can

    ''' be read with XElement.Load. The function will throw an exception if anything bad happened,

    ''' e.g. WebException or BadUriException. If asked to fetch a url but this url had already been downloaded

    ''' previously, and the previous download was no more than "CacheAtLeastDays" old and hadn't

    ''' been deleted, then the previous download is used. The idea is that our program might well hammer

    ''' web-services, and we don't want to be too cruel on them, so even if they didn't specify caching

    ''' for a page then we might still want to cache it. (If the webservice specified a cache longer than

    ''' CacheAtLeastDays, then any number of internet proxies along the way might cache it, and so

    ''' CacheAtLeastDays is a minimum rather than a maximum.) This function is not protected against

    ''' multiple threads calling it. There might be contention if multiple threads call it and try to

    ''' download and write to the same file. Note: in the cache, URLs are escaped then truncated to 240

    ''' characters. So if they were longer than that (e.g. long query strings) then there'll be cache

    ''' conflicts and the wrong data might be returned.

    ''' </summary>

    Function Fetch(ByVal Url As String, Optional ByVal CacheAtLeastDays As Double = 7) As String

        Dim dir = IO.Path.GetTempPath & My.Application.Info.AssemblyName & "\fetch"

        If Not Directory.Exists(dir) Then Directory.CreateDirectory(dir)

        ' Note: if the directory already existed, then CreateDirectory just proceeds silently without fuss.

 

        Dim fn = dir & "\" & Uri.EscapeDataString(Url.Replace("http://", "").Replace("/", "_")).Replace("%", "#")

        ' MONO: If you try to XElement.Load(fn) where fn includes %escapes, then it tries to unescape them.

        ' So we make sure there are no %escapes in the filename.  (CLR doesn't have this quirk.)

 

        fn = fn.Substring(0, Math.Min(240, fn.Length))

        ' MONO on unix: is fine so long as every directory/filename component is <=240 characters.

        ' CLR on windows: requires the entire path "fn" to be <=240 characters.

        ' http://blogs.msdn.com/bclteam/archive/2007/02/13/long-paths-in-net-part-1-of-3-kim-hamilton.aspx

 

        If File.Exists(fn) Then

            Dim age = DateTime.Now - File.GetLastWriteTime(fn)

            If age.TotalDays <= CacheAtLeastDays Then Return fn

            File.Delete(fn)

        End If

 

        Dim x = WebRequest.Create(Url)

        Using r = x.GetResponse

            Dim t = ""

            Using rs As New StreamReader(r.GetResponseStream)

                t = rs.ReadToEnd

            End Using

            If Not r.ContentType.StartsWith("text/html") Then

                My.Computer.FileSystem.WriteAllText(fn, t, False, Text.Encoding.UTF8)

                Return fn

            End If

            Using tidy As New System.Diagnostics.Process

                Dim cmd = "tidy"

                Dim args = "-asxml -numeric -quiet --doctype omit"

                ' MONO: XElement.Load throws an exception if DOCTYPE is present. CLR doesn't. Hence we omit the DOCTYPE.

                tidy.StartInfo.FileName = cmd

                tidy.StartInfo.Arguments = args

                tidy.StartInfo.UseShellExecute = False

                tidy.StartInfo.RedirectStandardInput = True

                tidy.StartInfo.RedirectStandardOutput = True

                tidy.StartInfo.RedirectStandardError = True

                tidy.Start()

                Dim err = "", op = ""

                tidy.InputAndOutputToEnd(t, op, err)

                tidy.WaitForExit(5000)

                If tidy.HasExited Then

                    ' We had already asked ("-numeric") for tidy to escape non-ascii characters. But

                    ' nonetheless, XElement.Load will throw an exception if there are any, and we really

                    ' don't want that, so we'll do belt-and-braces here:

                    Dim op2 As New Text.StringBuilder(op.Length)

                    For i = 0 To op.Length - 1

                        Dim c = AscW(op(i))

                        If (c >= 32 AndAlso c < 127) OrElse c = 13 OrElse c = 10 OrElse c = 9 Then

                            op2.Append(op(i))

                        End If

                    Next

                    My.Computer.FileSystem.WriteAllText(fn, op2.ToString, False, Text.Encoding.ASCII)

                    Return fn

                End If

                tidy.Kill()

                tidy.WaitForExit(2000)

            End Using

        End Using

        Return ""

    End Function

 

 

    ''' <summary>

    ''' InputAndOutputToEnd: Given a started process, this lets you supply a string as input if you want,

    ''' and will read all output and error to the end. This function has no timeout: if we give it an input string

    ''' but the process fails to read it to completion, or if we ask for standard-output/error but the process

    ''' fails to close these streams, then the function will block indefinitely. The function will throw

    ''' an exception if there was an error reading from the streams. The caller is expected to have started

    ''' the process before calling the function, and the caller is expected to wait for the process to close

    ''' and to dispose of it afterwards. If the caller uses this function, then the caller should do no

    ''' other input/output to the process.

    ''' </summary>

    <Runtime.CompilerServices.Extension()> Sub InputAndOutputToEnd(ByVal p As Diagnostics.Process, ByVal StandardInput As String, ByRef StandardOutput As String, ByRef StandardError As String)

        If p Is Nothing Then Throw New ArgumentException("process must be non-null", "p")

        ' Assume p has started. Alas there's no way to check.

        If p.StartInfo.UseShellExecute Then Throw New ArgumentException("Set StartInfo.UseShellExecute to false")

        If (p.StartInfo.RedirectStandardInput <> (StandardInput IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Input only when StartInfo.RedirectStandardInput")

        If (p.StartInfo.RedirectStandardOutput <> (StandardOutput IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Output only when StartInfo.RedirectStandardOutput")

        If (p.StartInfo.RedirectStandardError <> (StandardError IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Error only when StartInfo.RedirectStandardError")

        '

        ' MSDN notes, http://msdn.microsoft.com/en-us/library/system.diagnostics.processstartinfo.redirectstandardoutput.aspx,

        ' that "Synchronous read operations introduce a dependency between the caller reading from the StandardOutput stream

        ' and the child process writing to that stream. These dependencies can cause deadlock conditions." We avoid the deadlock

        ' by running in a separate thread.

        '

        Dim outputData As New InputAndOutputToEndData

        Dim errorData As New InputAndOutputToEndData

        '

        If p.StartInfo.RedirectStandardOutput Then

            outputData.Stream = p.StandardOutput

            outputData.Thread = New Threading.Thread(AddressOf InputAndOutputToEndProc)

            outputData.Thread.Start(outputData)

        End If

        If p.StartInfo.RedirectStandardError Then

            errorData.Stream = p.StandardError

            errorData.Thread = New Threading.Thread(AddressOf InputAndOutputToEndProc)

            errorData.Thread.Start(errorData)

        End If

        '

        If p.StartInfo.RedirectStandardInput Then

            p.StandardInput.Write(StandardInput)

            p.StandardInput.Close()

        End If

        '

        If p.StartInfo.RedirectStandardOutput Then outputData.Thread.Join() : StandardOutput = outputData.Output

        If p.StartInfo.RedirectStandardError Then errorData.Thread.Join() : StandardError = errorData.Output

        If outputData.Exception IsNot Nothing Then Throw outputData.Exception

        If errorData.Exception IsNot Nothing Then Throw errorData.Exception

    End Sub

 

    Private Class InputAndOutputToEndData

        Public Thread As Threading.Thread

        Public Stream As IO.StreamReader

        Public Output As String

        Public Exception As Exception

    End Class

 

    Private Sub InputAndOutputToEndProc(ByVal data_ As Object)

        Dim data = DirectCast(data_, InputAndOutputToEndData)

        Try : data.Output = data.Stream.ReadToEnd : Catch e As Exception : data.Exception = e : End Try

    End Sub

 

End Module