There was an interesting article about using VB's XML support for generating HTML: http://www.infoq.com/news/2009/02/MVC-VB.
I've been using VB and XML for the reverse purpose -- scraping web pages to retreive information. I enjoy sailing, and I wanted to find historical data on windspeeds to know when would be the best time of year to set out on a long trip. (Answer: March and April have the best winds around Seattle).
I found an excellent site, to scrape from, http://www.almanac.com/, which has historical weather data for many places around the country. The first step in scraping is copyright law. Facts alone are not copyrightable, but the act of selecting and compiling facts is a creative work and so the compilation is copyrightable. Hence, for instance, a telephone directory is protected by copyright. So too is almanac.com's compilation. And that's why I only scraped their pages for my own personal use.
The almanac has URLs like this: http://www.almanac.com/weatherhistory/oneday.php?number=994014&wban=99999&day=1&month=4&year=2008&searchtype=. It's easy to see what the format is, and generate similar URLs myself.
The code to parse XHTML:
I looked at the HTML source code of a page from the almanac in Notepad, figured out its structure, and wrote some simple XML queries to dig into it. (Note: the function "Fetch" fetches HTML pages from the web, but converts them into XHTML ready for VB XML queries. More on that later). Here's the VB code. I highlighted the XML queries.
Option Strict On
Imports System.Net
Imports System.IO
Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml">
Module Module1
Dim Places As Integer() = {994014}
Dim Years As Integer() = {2008}
Dim Months As Integer() = {4, 5}
Sub Main()
Console.WriteLine("{1}{0}{2}{0}{3}{0}{4}{0}{5}{0}{6}{0}{7}{0}{8}", vbTab, "Date (Y/M/D)", "Location", "Temp (^F)", "Precipitation (in)", "Visibility (miles)", "Wind Mean (mph)", "Wind Sustained (mph)", "Wind Gust (mph)")
For Each year As Integer In Years
For Each month As Integer In Months
Dim d = New DateTime(year, month, 1)
Dim dnm = New DateTime(If(d.Month = 12, d.Year + 1, d.Year), If(d.Month = 12, 1, d.Month + 1), d.Day)
Dim lastDay = CInt((dnm - d).TotalDays)
For day As Integer = 1 To lastDay
For Each place As Integer In Places
Dim url = String.Format("http://www.almanac.com/weatherhistory/oneday.php?number={0}&wban=99999&day={1}&month={2}&year={3}&searchtype=", place, day, month, year)
Dim fn = Fetch(url)
Dim xml = XElement.Load(fn)
Dim body = (From i In xml...<xhtml:div> Where i.GetAttr("class") = "yui-u first").FirstOrDefault
If body Is Nothing Then Continue For
Dim title = body.<xhtml:h2>.Value.ToString.Replace(",", " ")
If title.ToLower.StartsWith("no data") Then Continue For
Dim temp, precipitation, visibility, windMean, windSustained, windGust As Double?
Dim data = From i In body...<xhtml:td>
For Each td In data
Dim text = td.<xhtml:p>.FirstOrDefault
If text Is Nothing Then Continue For
Dim category = text.Value.Replace(vbCrLf, " ").Replace(vbCr, " ").Replace(vbLf, " ").ToLower
text = td.<xhtml:b>.FirstOrDefault
Dim svalue = text.Value.Replace(vbCrLf, " ").Replace(vbCr, " ").Replace(vbLf, " ").ToLower
Dim value = 0.0 : If Not Double.TryParse(svalue, value) Then Continue For
If category Like "mean temperature" Then temp = value
If category Like "total precipitation" Then precipitation = value
If category Like "visibility" Then visibility = value
If category Like "mean wind speed" Then windMean = value
If category Like "maximum sustained" Then windSustained = value
If category Like "maximum gust" Then windGust = value
Next
Dim s = String.Format("{0:0000}/{1:00}/{2:00}", year, month, day)
Console.WriteLine("{1}{0}{2}{0}{3}{0}{4}{0}{5}{0}{6}{0}{7}{0}{8}", vbTab, s, title, temp, precipitation, visibility, windMean, windSustained, windGust)
End Sub
End Module
Fetching pages: HTML into XHTML
Goal: to use VB's XML support for reading the web page. That's because VB has such nice syntax (I find it easier than xpath, or beautiful soup, or the alternatives). The problem is that most web-pages are written in a sloppy kind of HTML that might render okay but certainly can't be loaded into XElement.Load.
Solution: download Tidy, an awesome open-source library and executable for, well, tidying HTML into proper XHTML. I downloaded "tidy.exe" and put it into my windows directory, so I could execute it without messing around with the path.
The above code calls a function "Fetch". This is the one that fetches pages, and invokes "tidy" to clean up the html. Here is the implementation of Fetch. It uses a function "InputAndOutputToEnd" to redirect input and output of tidy.exe when it runs it. I wrote about InputAndOutputToEnd last month.
Module Helpers
''' <summary>
''' GetAttr: x.GetAttr("attr") is equivalent to x.@attr. It's here to work around a MONO bug: MONO
''' will throw an exception on x.@attr if the attribute is absent; the CLR doesn't. This function
''' also doesn't throw.
''' </summary>
<System.Runtime.CompilerServices.Extension()> Function GetAttr(ByVal e As XElement, ByVal attr As String) As String
If e Is Nothing Then Return ""
For Each a In e.Attributes
If String.Compare(attr, a.Name.LocalName, True) = 0 Then Return a.Value
Return ""
End Function
''' Fetch: this function fetches the given Url and saves it into a cache in a temporary directory.
''' It returns the filename. If the Url had given back "text/html", then this function invokes
''' "tidy.exe" (from http://tidy.sourceforge.net/) to turn the html into valid XHTML such as can
''' be read with XElement.Load. The function will throw an exception if anything bad happened,
''' e.g. WebException or BadUriException. If asked to fetch a url but this url had already been downloaded
''' previously, and the previous download was no more than "CacheAtLeastDays" old and hadn't
''' been deleted, then the previous download is used. The idea is that our program might well hammer
''' web-services, and we don't want to be too cruel on them, so even if they didn't specify caching
''' for a page then we might still want to cache it. (If the webservice specified a cache longer than
''' CacheAtLeastDays, then any number of internet proxies along the way might cache it, and so
''' CacheAtLeastDays is a minimum rather than a maximum.) This function is not protected against
''' multiple threads calling it. There might be contention if multiple threads call it and try to
''' download and write to the same file. Note: in the cache, URLs are escaped then truncated to 240
''' characters. So if they were longer than that (e.g. long query strings) then there'll be cache
''' conflicts and the wrong data might be returned.
Function Fetch(ByVal Url As String, Optional ByVal CacheAtLeastDays As Double = 7) As String
Dim dir = IO.Path.GetTempPath & My.Application.Info.AssemblyName & "\fetch"
If Not Directory.Exists(dir) Then Directory.CreateDirectory(dir)
' Note: if the directory already existed, then CreateDirectory just proceeds silently without fuss.
Dim fn = dir & "\" & Uri.EscapeDataString(Url.Replace("http://", "").Replace("/", "_")).Replace("%", "#")
' MONO: If you try to XElement.Load(fn) where fn includes %escapes, then it tries to unescape them.
' So we make sure there are no %escapes in the filename. (CLR doesn't have this quirk.)
fn = fn.Substring(0, Math.Min(240, fn.Length))
' MONO on unix: is fine so long as every directory/filename component is <=240 characters.
' CLR on windows: requires the entire path "fn" to be <=240 characters.
' http://blogs.msdn.com/bclteam/archive/2007/02/13/long-paths-in-net-part-1-of-3-kim-hamilton.aspx
If File.Exists(fn) Then
Dim age = DateTime.Now - File.GetLastWriteTime(fn)
If age.TotalDays <= CacheAtLeastDays Then Return fn
File.Delete(fn)
End If
Dim x = WebRequest.Create(Url)
Using r = x.GetResponse
Dim t = ""
Using rs As New StreamReader(r.GetResponseStream)
t = rs.ReadToEnd
End Using
If Not r.ContentType.StartsWith("text/html") Then
My.Computer.FileSystem.WriteAllText(fn, t, False, Text.Encoding.UTF8)
Return fn
Using tidy As New System.Diagnostics.Process
Dim cmd = "tidy"
Dim args = "-asxml -numeric -quiet --doctype omit"
' MONO: XElement.Load throws an exception if DOCTYPE is present. CLR doesn't. Hence we omit the DOCTYPE.
tidy.StartInfo.FileName = cmd
tidy.StartInfo.Arguments = args
tidy.StartInfo.UseShellExecute = False
tidy.StartInfo.RedirectStandardInput = True
tidy.StartInfo.RedirectStandardOutput = True
tidy.StartInfo.RedirectStandardError = True
tidy.Start()
Dim err = "", op = ""
tidy.InputAndOutputToEnd(t, op, err)
tidy.WaitForExit(5000)
If tidy.HasExited Then
' We had already asked ("-numeric") for tidy to escape non-ascii characters. But
' nonetheless, XElement.Load will throw an exception if there are any, and we really
' don't want that, so we'll do belt-and-braces here:
Dim op2 As New Text.StringBuilder(op.Length)
For i = 0 To op.Length - 1
Dim c = AscW(op(i))
If (c >= 32 AndAlso c < 127) OrElse c = 13 OrElse c = 10 OrElse c = 9 Then
op2.Append(op(i))
My.Computer.FileSystem.WriteAllText(fn, op2.ToString, False, Text.Encoding.ASCII)
tidy.Kill()
tidy.WaitForExit(2000)
''' InputAndOutputToEnd: Given a started process, this lets you supply a string as input if you want,
''' and will read all output and error to the end. This function has no timeout: if we give it an input string
''' but the process fails to read it to completion, or if we ask for standard-output/error but the process
''' fails to close these streams, then the function will block indefinitely. The function will throw
''' an exception if there was an error reading from the streams. The caller is expected to have started
''' the process before calling the function, and the caller is expected to wait for the process to close
''' and to dispose of it afterwards. If the caller uses this function, then the caller should do no
''' other input/output to the process.
<Runtime.CompilerServices.Extension()> Sub InputAndOutputToEnd(ByVal p As Diagnostics.Process, ByVal StandardInput As String, ByRef StandardOutput As String, ByRef StandardError As String)
If p Is Nothing Then Throw New ArgumentException("process must be non-null", "p")
' Assume p has started. Alas there's no way to check.
If p.StartInfo.UseShellExecute Then Throw New ArgumentException("Set StartInfo.UseShellExecute to false")
If (p.StartInfo.RedirectStandardInput <> (StandardInput IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Input only when StartInfo.RedirectStandardInput")
If (p.StartInfo.RedirectStandardOutput <> (StandardOutput IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Output only when StartInfo.RedirectStandardOutput")
If (p.StartInfo.RedirectStandardError <> (StandardError IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Error only when StartInfo.RedirectStandardError")
'
' MSDN notes, http://msdn.microsoft.com/en-us/library/system.diagnostics.processstartinfo.redirectstandardoutput.aspx,
' that "Synchronous read operations introduce a dependency between the caller reading from the StandardOutput stream
' and the child process writing to that stream. These dependencies can cause deadlock conditions." We avoid the deadlock
' by running in a separate thread.
Dim outputData As New InputAndOutputToEndData
Dim errorData As New InputAndOutputToEndData
If p.StartInfo.RedirectStandardOutput Then
outputData.Stream = p.StandardOutput
outputData.Thread = New Threading.Thread(AddressOf InputAndOutputToEndProc)
outputData.Thread.Start(outputData)
If p.StartInfo.RedirectStandardError Then
errorData.Stream = p.StandardError
errorData.Thread = New Threading.Thread(AddressOf InputAndOutputToEndProc)
errorData.Thread.Start(errorData)
If p.StartInfo.RedirectStandardInput Then
p.StandardInput.Write(StandardInput)
p.StandardInput.Close()
If p.StartInfo.RedirectStandardOutput Then outputData.Thread.Join() : StandardOutput = outputData.Output
If p.StartInfo.RedirectStandardError Then errorData.Thread.Join() : StandardError = errorData.Output
If outputData.Exception IsNot Nothing Then Throw outputData.Exception
If errorData.Exception IsNot Nothing Then Throw errorData.Exception
Private Class InputAndOutputToEndData
Public Thread As Threading.Thread
Public Stream As IO.StreamReader
Public Output As String
Public Exception As Exception
End Class
Private Sub InputAndOutputToEndProc(ByVal data_ As Object)
Dim data = DirectCast(data_, InputAndOutputToEndData)
Try : data.Output = data.Stream.ReadToEnd : Catch e As Exception : data.Exception = e : End Try