[UPDATE on 2007-09-29]

I've updated the sample code to parse more of the output, which will make it easier to re-use the data. I still need to account for the time zone.

 

Inspired by a recent post about Adding the NHL Schedule to Your Outlook Calendar, I thought it would be useful to explore how C# 3.0, LINQ, and extension methods can simplify screen scraping.

The basic task is to transform this web page:

http://www.nhl.com/nhl/app?service=page&page=TeamSchedule&team=ALL&season=20072008

And programmatically move it into something more structured like this ...

 

9/17/2007 | Thrashers | Islanders | 4 | 3 | 7:00 PM |  |  |
9/17/2007 | Avalanche | Coyotes | 4 | 3 | 10:00 PM |  |  |
9/17/2007 | Penguins | Canadiens | 2 | 3 | 7:30 PM |  |  |
9/18/2007 | Red Wings | Wild | 6 | 1 | 8:00 PM |  |  |
9/18/2007 | Maple Leafs | Oilers | 2 | 3 | 9:00 PM |  | LEAFS TV |
9/18/2007 | Panthers | Flames | 3 | 2 | 9:00 PM |  |  |
9/18/2007 | Blackhawks | Blue Jackets | 4 | 3 | 7:00 PM |  |  |
9/18/2007 | Senators | Flyers | 4 | 0 | 7:00 PM |  | ROGERS 22 |

I've attached the source code to this post. When run, it generates a 1 MB text file that contains the game schedule and results.

 

This task is vastly simplified when we use the Html Agility Pack and take advantage of LINQ and extension methods.

 

Here's the core function from the project that illustrates how simple this is:

 

public void get_schedule()
{
    string schedule_url = "http://www.nhl.com/nhl/app?service=page&page=TeamSchedule&team=ALL&season=20072008";
    string local_fname = "nhlschedule20072008.html";
    if (!System.IO.File.Exists(local_fname))
    {
        this.download_file(schedule_url, local_fname);
    }

    HtmlAgilityPack.HtmlDocument schedule_doc = new HtmlAgilityPack.HtmlDocument();
    schedule_doc.Load(local_fname);

    // identify all the td nodes that directly contain the text "Date"
    var td_nodes = schedule_doc.DocumentNode.EnumSelectNodesContainingText("//td", "Date");

    // for each of those td nodes look up and find the corresponding table
    var table_nodes = td_nodes.Select(n => n.ParentNode.ParentNode).Where(n => n.Name == "table");

    List<SchedRec> dates = new List<SchedRec>();

    // collect all the records
    foreach (var cur_table_node in table_nodes )
    {
        var rows = cur_table_node.EnumSelectNodes( "tr" ).ToArray();
        foreach (var row in rows)
        {
            string[] pieces = row.EnumSelectNodes("td").Select(x => this.clean_text( x.InnerText ) ).ToArray();

            if (pieces.Length == 0 || pieces[0] == "Date") { continue; } // skip empty rows and header rows

            if (pieces.Length >= 9 )
            {
                var rec = new SchedRec();
                rec.raw_date = pieces[0];
                rec.raw_visitor = pieces[1];
                rec.raw_home = pieces[2];
                rec.raw_score = pieces[3];
                rec.raw_dec = pieces[4];
                rec.raw_time = pieces[5];
                rec.raw_tv_national = pieces[6];
                rec.raw_tv_local_analog = pieces[7];
                rec.raw_tv_local_hd = pieces[8];
                dates.Add(rec);
            }

        }
    }

    foreach (var d in dates)
    {
        con.WriteLine("{0} | {1} | {2} | {3} | {4} | {5} | {6} | {7} | {8}", d.raw_date, d.raw_visitor , d.raw_home, d.raw_score , d.raw_dec, d.raw_time , d.raw_tv_national , d.raw_tv_local_analog , d.raw_tv_local_hd );
    }
}
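
The function above calls EnumSelectNodes and EnumSelectNodesContainingText, which are not part of the Html Agility Pack itself; they are helper extension methods from the attached project. Their source is not shown in this post, so the following is only a guess at what they might look like, assuming a current Html Agility Pack release where HtmlNode.SelectNodes returns null when nothing matches. The class name HtmlNodeExtensions is made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical reconstruction of the helpers used in get_schedule();
// the real versions live in the attached project and may differ.
public static class HtmlNodeExtensions
{
    // Wraps SelectNodes so callers always get a sequence, never null,
    // which keeps LINQ chains and foreach loops safe.
    public static IEnumerable<HtmlNode> EnumSelectNodes(this HtmlNode node, string xpath)
    {
        HtmlNodeCollection found = node.SelectNodes(xpath);
        if (found == null)
        {
            return Enumerable.Empty<HtmlNode>();
        }
        return found;
    }

    // Keeps only the matches that have a direct child text node containing
    // the given text (text nested deeper in the subtree does not count).
    public static IEnumerable<HtmlNode> EnumSelectNodesContainingText(
        this HtmlNode node, string xpath, string text)
    {
        return node.EnumSelectNodes(xpath)
                   .Where(n => n.ChildNodes.Any(
                       c => c.NodeType == HtmlNodeType.Text && c.InnerText.Contains(text)));
    }
}
```

Returning an empty sequence instead of null is the main design choice here: it lets the calling code chain Select, Where, and ToArray without a null check at every step.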

I leave it as an exercise for the reader to move the data into XML, push it to Outlook, or visualize it.
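
As a starting point for the XML part of that exercise, here is a minimal sketch using LINQ to XML. It is not taken from the attached project: the SchedRec class below is a cut-down stand-in with only four of the fields, and ScheduleXml, ParseStart, ToXml, and the element and attribute names are all invented for illustration. It assumes the date and time strings look like the sample output ("9/17/2007" and "7:00 PM") and makes no attempt to handle the time zone.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical stand-in for the project's SchedRec, reduced to four fields.
public class SchedRec
{
    public string raw_date;
    public string raw_visitor;
    public string raw_home;
    public string raw_time;
}

public static class ScheduleXml
{
    // Combines the raw date and time strings into one DateTime.
    // Returns null when the text does not match the expected format.
    // The result has no time zone information attached.
    public static DateTime? ParseStart(SchedRec rec)
    {
        DateTime parsed;
        bool ok = DateTime.TryParseExact(
            rec.raw_date + " " + rec.raw_time,
            "M/d/yyyy h:mm tt",
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out parsed);
        return ok ? parsed : (DateTime?)null;
    }

    // Builds a <schedule> element with one <game> child per record.
    public static XElement ToXml(IEnumerable<SchedRec> recs)
    {
        return new XElement("schedule",
            recs.Select(r =>
            {
                DateTime? start = ParseStart(r);
                return new XElement("game",
                    new XAttribute("visitor", r.raw_visitor),
                    new XAttribute("home", r.raw_home),
                    // "s" is the sortable ISO 8601 style format; empty if parsing failed
                    new XAttribute("start", start.HasValue ? start.Value.ToString("s") : ""));
            }));
    }
}
```

Calling ScheduleXml.ToXml(dates).Save("schedule.xml") would then write the whole list to disk, assuming the records carry the fields shown here.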