HTML Import in OneNote 2003
Copy/paste is one of those invisible features that you never really think about or notice until something goes wrong. It should just work, right? Especially for OneNote (which is supposed to be a sort of “data well” [1]), good clipboard integration is vital, and that turns out to actually be kind of hard. We spent as much time and effort on it as any other major feature. Here's why.
The primary interchange format in Office, including OneNote, is HTML. When you paste content from one Office application, or the web, into another Office application, you're moving HTML around. The source app has to take the content you selected and transform it from its own internal data format into HTML. Then, the destination app has to take this HTML and transform it into its internal data format. Choosing HTML as the lowest common denominator for this handoff has an obvious advantage - if you're writing code to read and write HTML anyway (because it's circa 1995 and that's the sort of thing you do now [2]), then you might as well use it to exchange content between apps as well. And since it's an endlessly pliable format, it's easy to load it with Office-specific goo to make the exchange appropriately richer between Office apps than with other (“downlevel”) apps without having to use a different format. Brilliant!
But the flexible, general-purpose nature of HTML is also what makes writing a really good (that is, invisible) importer for it maddeningly difficult. To accommodate the needs of all these billions of web pages, HTML has evolved into an electronic publishing format that is as idiosyncratic as it is rich. Consuming content from the web means being prepared to deal with any weird glob of HTML that the web designer, via your user, may choose to hurl at you.
Now, WYSIWYG fidelity when pasting external content was never the design goal - without a general-purpose HTML layout engine at our core, that was impossible anyway. Rather, the goal was to turn that content into great OneNote outlines. And here we run into a very basic problem: there are a lot of things you can express in HTML that don't have any meaning in OneNote.
For example, OneNote doesn't have tables. You can nest headings in an outline to produce table-like layout, but that's it. This is actually a bigger deal with respect to HTML import than it might first appear, because a lot of web designers use the <table> tag to lay out content on the page, not just to display "tables" in the traditional sense. If we created nested headings whenever we saw a <table> in HTML, the output would, frankly, be a mess most of the time [3]. So we decided to do this only when importing from other Office apps (where we have a reasonable expectation that a <table> tag actually corresponds to something that looks like a table to the user - Excel being the prime example), and to ignore <table> tags in general HTML from the web. That's why it's not uncommon to select a bit of harmless-looking text on a web page and have it show up linearized in some unexpected way when it's pasted into OneNote - chances are the content was chopped up into table cells in the source HTML.
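To make that policy concrete, here's a minimal sketch in Python. This is not OneNote's importer: the `from_office` flag, the `TableImporter` class, and the `(level, text)` output shape are all made up for illustration. It also assumes the first cell in a row becomes the heading and the remaining cells nest under it, and it doesn't try to handle nested tables.

```python
from html.parser import HTMLParser


class TableImporter(HTMLParser):
    """Collects (indent_level, text) outline items from an HTML fragment."""

    def __init__(self, from_office):
        super().__init__()
        self.from_office = from_office  # hypothetical flag supplied by the caller
        self.items = []                 # list of (level, text) tuples
        self.in_cell = False
        self.cell_index = 0
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell_index = 0
        elif tag in ("td", "th"):
            self.in_cell = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            text = " ".join("".join(self.buffer).split())
            if text:
                # Office source: first cell heads the row, later cells nest under it.
                # Web source: ignore the table structure and emit flat paragraphs.
                if self.from_office and self.cell_index > 0:
                    level = 1
                else:
                    level = 0
                self.items.append((level, text))
            self.cell_index += 1
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.buffer.append(data)
        elif data.strip():
            # Text outside any table cell becomes an ordinary top-level paragraph.
            self.items.append((0, " ".join(data.split())))


def import_fragment(html, from_office):
    parser = TableImporter(from_office)
    parser.feed(html)
    parser.close()
    return parser.items


sample = "<table><tr><td>Name</td><td>Qty</td></tr><tr><td>Apples</td><td>3</td></tr></table>"
print(import_fragment(sample, from_office=True))
print(import_fragment(sample, from_office=False))
```

With `from_office=False` every cell comes out at the same level, which is the "linearized" result you see when pasting from a table-heavy web page.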
We also run into problems when HTML can express something at a finer granularity than we can. For example, we attempt to figure out what each pasted paragraph's "indent" on the page should be, so that we can preserve any outline-like structure that may have existed in the source content. But outline elements in OneNote can only be indented in half-inch increments, so we have to snap each imported element to the next half-inch indent level, which can cause outline elements that were at different indents in the source to land at the same level in OneNote. Argh.
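Here's a rough Python sketch of why snapping loses information. It is an illustration, not the actual OneNote code. The function name is invented, indents are assumed to arrive already converted to inches, and "next" is assumed to mean rounding up.

```python
import math

INDENT_STEP_IN = 0.5  # half-inch outline increment


def snap_indent(indent_in):
    """Return the outline level for a source indent given in inches.

    Assumption: indents round up to the next half-inch step.
    """
    if indent_in <= 0:
        return 0
    return math.ceil(indent_in / INDENT_STEP_IN)


source_indents = [0.0, 0.5, 0.6, 0.9, 1.0]
levels = [snap_indent(x) for x in source_indents]
print(levels)  # [0, 1, 2, 2, 2]
```

The paragraphs at 0.6, 0.9 and 1.0 inches were visibly distinct in the source, but all three land on level 2 after snapping.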
The truth is, complex content pasted from the web or other apps will probably always require some amount of cleanup before you're happy with it. But I think we've made it as painless as possible given the constraints.
1: “NoteWell” was an actual name we considered for the product at one point, though I'm not entirely sure whether “Well” was supposed to be an adverb or a noun. Maybe that was the point. Someone also proposed the Latin equivalent of the adverb form, “Nota Bene,” but a) apparently the company has a rule that product names have to either be English or completely made up (e.g. “Encarta”), and b) it's already taken anyway. You can read more about the OneNote naming process in Chris Pratley's blog.
2: Chris has some background on this.
3: In the original OneNote 2003 release, our plaintext import (used when HTML isn't available) created nested headings when it saw inline TABs - i.e. TABs that are not at the beginning of a line. A lot of text pasted from Notepad and other non-HTML-emitting apps didn't show up very nicely when factored into a nested heading outline like this, so we dropped it in SP1 and now try to preserve the whitespace within a line as well as we can.
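A small Python sketch of the two plaintext behaviors described in this footnote. The function name, the `nest_on_inline_tabs` switch and the `(level, text)` output are hypothetical, and treating leading TABs as indent levels is an assumption rather than something stated above.

```python
def import_line(line, nest_on_inline_tabs):
    """Turn one line of plain text into (level, text) outline items."""
    stripped = line.lstrip("\t")
    leading = len(line) - len(stripped)  # assumption: leading TABs set the indent
    if nest_on_inline_tabs and "\t" in stripped:
        # Original-release style: first segment is the heading, the rest nest under it.
        head, *rest = stripped.split("\t")
        return [(leading, head)] + [(leading + 1, cell) for cell in rest if cell]
    # SP1 style: keep the line whole, inline whitespace included.
    return [(leading, stripped)]


print(import_line("Name\tQty\tPrice", nest_on_inline_tabs=True))
print(import_line("Name\tQty\tPrice", nest_on_inline_tabs=False))
```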