For kicks, I started writing a tool that uses internet searches to automatically catch plagiarism. The first thing I needed was a way to easily execute a search (I'll use MSN search) from a C# API. Ideally, it would be a function that takes in a search string and then returns a list of URLs for the search results.

However, it looks like MSN Search does not have such an API yet for online searches. So I wrote my own primitive version (called MsnSearch.SearchString in the function below). It has problems, but it's more than good enough for my purposes. The code is at the bottom of this blog entry. It has a tiny Main() function that searches for a predefined string and then just prints out the URLs of the search results.

        // Helper to test a search.
        static void Main()
        {
            IList<Uri> results = MsnSearch.SearchString(@"""ICorDebug is just a set of interfaces; and a debugger has various ways of creating an ICorDebug object that implements those interfaces:""");
            foreach (Uri s in results)
            {
                Console.WriteLine(s.ToString());
            }
        }


Now that's the sort of code I want a client to be able to write to get the job done!

Compile and run it and you get:
C:\temp>csc t.cs /debug+ /r:System.Web.dll
Microsoft (R) Visual C# 2005 Compiler version 8.00.50215.44
for Microsoft (R) Windows (R) 2005 Framework version 2.0.50215
Copyright (C) Microsoft Corporation 2001-2005. All rights reserved.


C:\temp>t.exe
http://blogs.msdn.com/jmstall/
http://blogs.msdn.com/jmstall/rss.aspx?CategoryID=7367
http://209.34.241.67/jmstall/archive/category/7367.aspx

Compare that to the actual MSN search results.  (As a tangent, once the search robots discover this blog entry, it too should show up on the search list.)

How does it work?
It's just brute force:
[Update] I've used CThota's advice to include 'format=xml' in the query string so that I get back an XML document instead of an HTML document. XML is easier and more reliable to extract data from. I used Dominic Cooney's advice to use System.Web.HttpUtility.UrlEncode instead of hand-rolling my own encoder. The instructions and sample code are updated, and I've removed some comments that are no longer relevant.

1. Construct an MSN search query string. I couldn't find a spec for the MSN search query string, but its encoding looks close enough to what System.Web.HttpUtility.UrlEncode produces. Yes, this is very fragile. But it works well enough for my immediate purposes. Include 'format=xml' in the query string to get back the results as an XML page instead of HTML.
2. Use System.Net.WebRequest to get the web page for the query string. MSN Search will execute the query and return the results as an XML page.
3. The URLs are conveniently encoded in <url>...</url> tags in the resulting XML page. Use a regular expression to scrape these results. (Originally, when I was scraping HTML, I was scraping for <H3> tags.)
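
As an alternative to the regular expression in step 3, the <url> elements could be pulled out with an XML parser. The following is only a minimal sketch: the UrlExtractor and ExtractDemo names and the sample response are made up for illustration, and the only thing assumed about the real MSN response is that it contains <url> elements. One side benefit is that the parser decodes XML entities such as &amp; for you.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper (not part of the original sample).
static class UrlExtractor
{
    // Collects the text of every <url> element, wherever it appears in the document.
    public static IList<Uri> ExtractUrls(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        List<Uri> list = new List<Uri>();
        foreach (XmlNode node in doc.GetElementsByTagName("url"))
        {
            // InnerText is already entity-decoded (e.g. "&amp;" becomes "&").
            list.Add(new Uri(node.InnerText.Trim()));
        }
        return list;
    }
}

static class ExtractDemo
{
    static void Main()
    {
        // Made-up response shape; only the <url> tags come from the post.
        string sample =
            "<results>" +
            "<result><url>http://blogs.msdn.com/jmstall/</url></result>" +
            "<result><url>http://blogs.msdn.com/jmstall/rss.aspx?CategoryID=7367&amp;x=1</url></result>" +
            "</results>";

        foreach (Uri u in UrlExtractor.ExtractUrls(sample))
        {
            Console.WriteLine(u);
        }
    }
}
```

This needs a reference to System.Xml.dll. Note that LoadXml throws an XmlException if the response is not well-formed XML, whereas the regular expression would silently return whatever it could match.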

So it will be cool once the MSN Search team comes out with a managed API to use their search engine. In the meantime, my collection of hacks below enables me to play around with some internet-search centric tools.

Some other notes:
I notice that FxCop wants me to use System.Uri instead of System.String to describe URLs. I have mixed feelings. System.Uri is not a value type, so it adds an extra heap object on top of the System.String it wraps, which feels like pure overhead.
There's also lots of opportunity for cleanup and expansion. For example, you could have an internet search interface "IInternetSearch" and then have multiple implementations for different search engines. You could also have more complex query string encoding to support a search engine's advanced options and keywords (e.g., the "AND", "OR", and "LINK" keywords). You could even take in a search via a data structure more type safe than just a string.
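
To make the "IInternetSearch" idea concrete, here is a rough sketch of what such an interface might look like. Only the interface name comes from the text above; the method signature, the FakeSearch stand-in, and the CountHits helper are all hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical interface: one method shared by every search engine backend.
interface IInternetSearch
{
    IList<Uri> SearchString(string input);
}

// Stand-in implementation that returns canned results, so callers can be
// exercised without touching the network. A real implementation would wrap
// a specific search engine.
class FakeSearch : IInternetSearch
{
    private readonly IList<Uri> results;

    public FakeSearch(IList<Uri> results) { this.results = results; }

    public IList<Uri> SearchString(string input) { return results; }
}

static class SearchDemo
{
    // Client code depends only on the interface, not on a particular engine.
    public static int CountHits(IInternetSearch engine, string phrase)
    {
        // Quote the phrase to ask the engine for an exact match.
        return engine.SearchString("\"" + phrase + "\"").Count;
    }

    static void Main()
    {
        List<Uri> canned = new List<Uri>();
        canned.Add(new Uri("http://blogs.msdn.com/jmstall/"));

        IInternetSearch engine = new FakeSearch(canned);
        Console.WriteLine(CountHits(engine, "ICorDebug is just a set of interfaces"));
    }
}
```

Because MsnSearch.SearchString in the listing below is static, it could not implement this interface directly; it would need to become an instance method or be wrapped in a small adapter class.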

Here's the full code:


//-----------------------------------------------------------------------------
// Test harness to expose MSN search via a simple C# API.
// Author: Mike Stall.  http://blogs.msdn.com/jmstall
// Thanks to: Cthota (http://blogs.msdn.com/cthota/) and 
// Dominic Cooney (http://www.dcooney.com/) for suggestions.
//-----------------------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace Web2
{
    // Class to get Search results using http://search.msn.com.
    // We want to get a list of Urls from a given search string.
    // It currently does an HTTP query with an embedded query string to retrieve the result as XML.
    // It then extracts the URLs from the XML (which are conveniently in <url> tags).
    // If MSN Search ever comes out with a real API for internet searches, we should use that instead of this.
    class MsnSearch
    {
        // Helper to get the Search result for an exact string.
        // This will escape the input string. The MSN encoding appears to use the same encoding as HttpUtility.UrlEncode.
        // This also does not account for search keywords (like "AND").
        // If there's a spec for the query string, we should find and use it.
        static Uri GetMSNSearchURL(string input)
        {
            // The 'FORMAT=XML' part requests the results back as XML, which will be easier to parse than HTML.
            StringBuilder sb = new StringBuilder(@"http://search.msn.com/results.aspx?FORMAT=XML&q=");
            sb.Append(System.Web.HttpUtility.UrlEncode(input)); // requires ref to System.Web.dll
            return new System.Uri(sb.ToString());
        }

        // Return a list of URLs for the search results against a string input.
        // This currently does not recognize any search keywords.
        // For an exact search, place the input string in quotes.
        // Note that these searches are not exact. For example, the search engine may have used a cached 
        // webpage and the URL may have changed since then. Or the search engine may take some
        // liberties about what constitutes an "exact" match.
        public static IList<Uri> SearchString(string input)
        {
            Uri url = GetMSNSearchURL(input);
            WebRequest request = WebRequest.Create(url);

            // Dispose the response and reader when done so the underlying connection is released.
            string x;
            using (WebResponse response = request.GetResponse())
            using (StreamReader s = new StreamReader(response.GetResponseStream()))
            {
                x = s.ReadToEnd();
            }

            List<Uri> list = new List<Uri>();

            // In the XML format, the URLs are conveniently in URL tags. We could use a full XmlReader / XPathQuery
            // to find them, or we can just grab them with a regular expression.
            Regex r = new Regex("<url>(.+?)</url>", RegexOptions.Singleline);

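            for (Match m = r.Match(x); m.Success; m = m.NextMatch())
            {
                // The captured text is still XML-escaped (e.g. "&amp;"), so decode it before building the Uri.
                list.Add(new Uri(System.Web.HttpUtility.HtmlDecode(m.Groups[1].Value)));
            }
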
            return list;
        }
    }

    class ProgramTest
    {
        // Helper to test a search.
        static void Main()
        {
            IList<Uri> results = MsnSearch.SearchString(@"""ICorDebug is just a set of interfaces; and a debugger has various ways of creating an ICorDebug object that implements those interfaces:""");
            foreach (Uri s in results)
            {
                Console.WriteLine(s.ToString());
            }
        }
    }
}