Visual Studio's IDE copies code as RTF (Rich Text Format). Web browsers want HTML. So posting code from Visual Studio to a blog requires a decent RTF-to-HTML conversion. And having a technical blog means posting code. So I needed to solve this conversion problem.

The sad tale:
At first I tried Word, but Word had a heart attack trying to convert RTF to HTML and generated heavily mangled HTML (even its allegedly "filtered" HTML is still garbled), which in turn gave Community Server (which runs my blog) a heart attack. That almost killed my blogging days, until I switched to FrontPage. But FrontPage 2003 can't properly convert RTF to HTML either (flabbergasting!), so I eventually wrote my own converter.

"How can I post Visual Studio code on my blog" was actually a very popular question on our internal blog alias. It took a while to get good answers. Several folks wrote their own tools. (Here's an example of Shawn's. His puts a pretty box around the code.) I think Community Server's support for this improved over time too.

The right way:
There are some great tools out there that solve this properly, like a VS plug-in that copies code as HTML (http://blogs.msdn.com/powertoys/archive/2004/10/21/245850.aspx ). Anybody who actually wants a reasonable working solution should use that.

There are also sample RTF 2 HTML converters all around, including some nice web-based ones. Just a search away.

What I did:
It was easier to just write an RTF to HTML converter than to deal with these other apps.  And more fun.

I've had a few people ask about it, so I wanted to throw it up on my blog.

This takes RTF in from the clipboard, and then dumps it out as an HTML file called "out.html" in the current directory.

  1. RTF is just a text file with embedded control sequences. Check out the RTF spec on MSDN. Or create an RTF file with WordPad and then open it with Notepad.
  2. I wanted the input to be via the clipboard, as opposed to a file, since I was copying the text from Visual Studio. That was part of my motivation for writing http://blogs.msdn.com/jmstall/archive/2005/08/22/Clipboard_tools.aspx
  3. The output HTML is very straightforward. No CSS. The fanciest thing it has are <span> tags.
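
To make the "text with embedded control sequences" point concrete, here is a small sketch that splits an RTF fragment into control words and plain text runs. It is not part of the converter; the class name RtfPeek and the method Tokens are made up for illustration, and it assumes the simple shape VS emits (no nested groups, no escaped characters, no negative parameters).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class RtfPeek
{
    // Splits an RTF fragment into control words (like "\cf2 ") and text runs.
    // Assumes no groups ('{' ... '}') and no escaped characters in the input.
    public static List<string> Tokens(string rtf)
    {
        List<string> tokens = new List<string>();
        // Either: backslash, letters, optional digits, optional single delimiter space,
        // or: a run of characters that contains no backslash.
        foreach (Match m in Regex.Matches(rtf, @"\\[a-zA-Z]+[0-9]* ?|[^\\]+"))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    static void Main()
    {
        // Same sample line as in the comments of the converter's Format() method.
        foreach (string t in Tokens(@"\fs20 \cf2 using\cf0  System;"))
        {
            Console.WriteLine("[" + t + "]");
        }
    }
}
```

On that sample it yields five tokens: the control words \fs20, \cf2 and \cf0 (each with its trailing delimiter space), plus the text runs "using" and " System;". The converter's job is then just to turn each \cfN into a <span> and pass the text through.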

It's only about 150 lines of C#, so I took a few shortcuts.

  1. It's by no means a complete RTF converter. It just handles the subset of RTF that VS2005's IDE produces. That's all I needed.
  2. It's hard-coded to use the color table matching VS's default C# color scheme.
  3. It doesn't handle tabs. You should be using spaces anyways.
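
If the hard-coded color table ever becomes a problem (say, you use a custom color scheme), one way around it is to read the colors out of the {\colortbl ...} group instead. The following is only a sketch, not something the converter does: ColorTableSketch and Parse are hypothetical names, and it assumes the table starts with the usual empty "auto" entry so that index 0 is the default color and real colors start at index 1, the same layout as the hard-coded table.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class ColorTableSketch
{
    // Returns HTML colors indexed the way \cfN indexes them.
    // Assumption: entry 0 is the empty "auto" entry, mapped to black here.
    public static List<string> Parse(string rtf)
    {
        List<string> colors = new List<string>();
        colors.Add("#000000");

        int start = rtf.IndexOf(@"{\colortbl");
        if (start < 0) return colors;          // no color table: default only
        int end = rtf.IndexOf('}', start);
        if (end < 0) return colors;            // unterminated group: default only
        string table = rtf.Substring(start, end - start);

        // Each entry looks like \redN\greenN\blueN;
        foreach (Match m in Regex.Matches(table, @"\\red(\d+)\s*\\green(\d+)\s*\\blue(\d+)\s*;"))
        {
            int r = int.Parse(m.Groups[1].Value);
            int g = int.Parse(m.Groups[2].Value);
            int b = int.Parse(m.Groups[3].Value);
            colors.Add(string.Format("#{0:X2}{1:X2}{2:X2}", r, g, b));
        }
        return colors;
    }

    static void Main()
    {
        string rtf = @"{\rtf1{\colortbl;\red0\green0\blue0;\red0\green0\blue255;}\cf2 using}";
        foreach (string c in Parse(rtf))
        {
            Console.WriteLine(c);
        }
    }
}
```

For the sample in Main this gives #000000, #000000, #0000FF, which lines up with the first three entries of the hard-coded table. The result could stand in for m_colorTable, at the cost of a few more lines than the shortcut.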

 

The comparison:

Here's a comparison of Word and FrontPage trying to convert the RTF to HTML on a simple snippet.

What it should be:


                // check for RTF escape characters. According to the spec, these are the only escaped chars.
                char chNext = rtf[idx];
                if (chNext == '{' || chNext == '}' || chNext == '\\')
                {
                    // Escaped char
                    tw.Write(chNext);
                    idx++;
                    continue;
                }

------------------------------------------------------

Word 2003: It's got all these Mso class tags and extra <p> tags. And in my browser, it's got extra newlines.

                // check for RTF escape characters. According to the spec, these are the only escaped chars.

                char chNext = rtf[idx];

                if (chNext == '{' || chNext == '}' || chNext == '\\')

                {

                    // Escaped char

                    tw.Write(chNext);

                    idx++;

                    continue;

                }

------------------------------------------------------
FrontPage 2003: It loses the indenting and the font.

// check for RTF escape characters. According to the spec, these are the only escaped chars.

char chNext = rtf[idx];

if (chNext == '{' || chNext == '}' || chNext == '\\')

{

// Escaped char

tw.Write(chNext);

idx++;

continue;

}

------------------------------------------------------

 

The code:

Here's the code. In good dogfooding fashion, I got the HTML for it by running it on itself. If you find it useful or entertaining, great.

[update: missing & check] 
[update: added ';' in Escape]

// Very primitive RTF 2 HTML reader 
// Converts tiny subset of RTF (from VS IDE) into html.
// Author: Mike Stall (http://blogs.msdn.com/jmstall)
// Gets input RTF from clipboard.
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;
using System.IO;

namespace ClipBoard1
{
    class Program
    {
        [STAThread()]
        static void Main(string[] args)
        {
            Console.WriteLine("Get RTF from the clipboard.");
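            IDataObject iData = Clipboard.GetDataObject();
            // GetDataObject returns null for an empty clipboard, and GetData returns null
            // when the clipboard holds no RTF (e.g. the text was copied from Notepad).
            string rtf = (iData == null) ? null : (string)iData.GetData(DataFormats.Rtf);
            if (rtf == null) throw new ArgumentException("No RTF on the clipboard.");

            Console.WriteLine(iData.GetData(DataFormats.Text));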

            // We assume the color table and font table are a standard preset used by VS.
            // Avoids hassle of parsing them.
            // Skip past {\colortbl.*;} and to the start of the real data
            // @todo - regular expression would be good here.
            int i1 = rtf.IndexOf(@"{\colortbl");
            if (i1 < 0) throw new ArgumentException("Bad input RTF."); // IndexOf returns -1 when not found
            int i2 = rtf.IndexOf(";}", i1);
            if (i2 < 0) throw new ArgumentException("Bad input RTF.");
            // The -1 drops the last character, assumed to be the closing '}' of the RTF document.
            string data = rtf.Substring(i2 + 2, rtf.Length - (i2 + 2) - 1);

            // 'using' closes (and flushes) the file even if Format throws.
            using (TextWriter tw = new StreamWriter("out.html"))
            {
                Format(tw, data);
            }
        }

        // Default color table used by VS's IDE.
        static string[] m_colorTable = new string[] 
            {
               // rrGGbb
                "#000000", // default, starts at index 0
                "#000000", // real color table starts at index 1
                "#0000FF",
                "#00ffFF",
                "#00FF00",
                "#FF00FF",
                "#FF0000",
                "#FFFF00",
                "#FFffFF",
                "#000080",
                "#008080",
                "#008000",
                "#800080",
                "#800000",
                "#808000",
                "#808080",
                "#c0c0c0"
            };


        // Escape HTML chars
        static string Escape(string st)
        {
            st = st.Replace("&", "&amp;");
            st = st.Replace("<", "&lt;");
            st = st.Replace(">", "&gt;");            
            return st;
        }
        // Convert the RTF data into an HTML stream.
        // This rtf snippet is past the font + color tables, so we're just translating control words now.
        // Write out HTML to the text writer.        
        static void Format(TextWriter tw, string rtf)
        {
            tw.Write("<html><pre>");
            tw.Write("<span style=\"color: black\">"); // <span> has no 'color' attribute; use inline style
            // Example: \fs20 \cf2 using\cf0  System;
            // root --> ('text' '\' ('control word' | 'escaped char'))+
            // 'control word'  --> (alpha)+ (numeric*) space?
            // 'escaped char' = 'x'. Some characters \, {, } are escaped: '\x' --> 'x'
            // @todo - handle embedded groups (begin with '{')

            int idx = 0;
            while (idx < rtf.Length)
            {
                // Get any text up to a '\'. 
                Regex r1 = new Regex(@"(.*?)\\", RegexOptions.Singleline | RegexOptions.IgnoreCase);
                Match m = r1.Match(rtf, idx);
                if (m.Length == 0) break;

                // text will be empty if we have adjacent control words
                string stText = m.Groups[1].ToString();
                tw.Write(Escape(stText));
                idx += m.Length;

                // check for RTF escape characters. According to the spec, these are the only escaped chars.
                char chNext = rtf[idx];
                if (chNext == '{' || chNext == '}' || chNext == '\\')
                {
                    // Escaped char
                    tw.Write(chNext);
                    idx++;
                    continue;
                }

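                // Must be a control word. @todo- delimiter includes more than just space, right?
                // \G anchors the match at idx, so we parse the control word right after the '\'
                // rather than whatever matching text happens to come later in the input.
                Regex r2 = new Regex(@"\G([\{a-z]+)([0-9]*) ", RegexOptions.Singleline | RegexOptions.IgnoreCase);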
                m = r2.Match(rtf, idx);
                string stCtrlWord = m.Groups[1].ToString();
                string stCtrlParam = m.Groups[2].ToString();

                if (stCtrlWord == "cf")
                {
                    // Set font color.
                    int iColor = Int32.Parse(stCtrlParam);
                    tw.Write("</span>"); // close previous span, and start a new one for the given color.                    
                    tw.Write("<span style=\"color: " + m_colorTable[iColor] + "\">");
                }
                else if (stCtrlWord == "fs")
                {
                    // Sets font size. ignore
                }
                else if (stCtrlWord == "par")
                {
                    // This is a newline. ignore
                    // @todo- I think the only reason we can ignore this is because the \par in our input are always followed by
                    // a '\r\n' and we're accidentally writing that.
                }
                else
                {
                    throw new ArgumentException("Unrecognized control word '" + stCtrlWord + stCtrlParam + "' after: " + stText);
                }
                idx += m.Length;
            }
            tw.Write(Escape(rtf.Substring(idx))); // rest of string

            tw.Write("</pre></html>");
        } // end Format()
    }
}