HTML Import in OneNote 2003

Copy/paste is one of those invisible features that you never really think about or notice until something goes wrong.  It should just work, right?  Especially for OneNote (which is supposed to be a sort of “data well” [1]), good clipboard integration is vital, and that turns out to actually be kind of hard.  We spent as much time and effort on it as any other major feature.  Here's why.

The primary interchange format in Office, including OneNote, is HTML.  When you paste content from one Office application, or the web, into another Office application, you're moving HTML around.  The source app has to take the content you selected and transform it from its own internal data format into HTML.  Then, the destination app has to take this HTML and transform it into its internal data format.  Choosing HTML as the lowest common denominator for this handoff has an obvious advantage - if you're writing code to read and write HTML anyway (because it's circa 1995 and that's the sort of thing you do now [2]), then you might as well use it to exchange content between apps as well.  And since it's an endlessly pliable format, it's easy to load it with Office-specific goo to make the exchange appropriately richer between Office apps than with other (“downlevel”) apps without having to use a different format.  Brilliant!

But the flexible, general-purpose nature of HTML is also what makes writing a really good (that is, invisible) importer for it maddeningly difficult.  To accommodate the needs of all these billions of web pages, HTML has evolved into an electronic publishing format that is as idiosyncratic as it is rich.  Consuming content from the web means being prepared to deal with any weird glob of HTML that the web designer, via your user, may choose to hurl at you.

Now, WYSIWYG fidelity when pasting external content was never the design goal - without a general-purpose HTML layout engine at our core, that was impossible anyway.  Rather, the goal was to turn that content into great OneNote outlines.  And here we run into a very basic problem: there are a lot of things you can express in HTML that don't have any meaning in OneNote.

For example, OneNote doesn't have tables.  You can nest headings in an outline to produce table-like layout, but that's it.  This is actually a bigger deal with respect to HTML import than it might first appear, because a lot of web designers use the <table> tag to lay out content on the page, not just to display "tables" in the traditional sense.  If we created nested headings whenever we saw a <table> in HTML, the output would, frankly, be a mess most of the time [3].  So we decided to do this only when importing from other Office apps (where we have a reasonable expectation that a <table> tag actually corresponds to something that looks like a table to the user - Excel being the prime example), and to ignore <table> tags in general HTML from the web.  That's why it's not uncommon to select a bit of harmless-looking text on a web page and have it show up linearized in some unexpected way when it's pasted into OneNote - chances are the content was chopped up into table cells in the source HTML.
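To make that source-dependent decision concrete, here is a minimal sketch in Python. It is not OneNote's importer: the `FragmentImporter` class, the `import_fragment` function, and the `from_office` flag are all hypothetical names invented for illustration, and the real heuristics are certainly more involved. It only shows the basic idea: honor row structure when the source is trusted to mean "table", and otherwise let the cell text fall out in document order.

```python
from html.parser import HTMLParser


class FragmentImporter(HTMLParser):
    """Hypothetical importer: groups cell text by row only when asked to."""

    def __init__(self, honor_tables):
        super().__init__()
        self.honor_tables = honor_tables
        self.items = []          # strings (loose text) or lists (table rows)
        self.current_row = None  # the row being filled, if any

    def handle_starttag(self, tag, attrs):
        # Only trusted (Office) sources get their <tr> rows turned into groups.
        if self.honor_tables and tag == "tr":
            self.current_row = []
            self.items.append(self.current_row)

    def handle_endtag(self, tag):
        if tag == "tr":
            self.current_row = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_row is not None:
            self.current_row.append(text)
        else:
            # Table tags are ignored, so the text is simply linearized.
            self.items.append(text)


def import_fragment(html, from_office):
    parser = FragmentImporter(honor_tables=from_office)
    parser.feed(html)
    parser.close()
    return parser.items


html = ("<p>Menu</p><table><tr><td>Tea</td><td>3</td></tr>"
        "<tr><td>Coffee</td><td>4</td></tr></table>")
print(import_fragment(html, from_office=True))
print(import_fragment(html, from_office=False))
```

With `from_office=True` the rows survive as groups that could become nested headings; with `from_office=False` the same cells come out as one flat run of text, which is the "linearized" result described above.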

We also run into problems when HTML can express something at a finer granularity than we can.  For example, we attempt to figure out what each pasted paragraph's "indent" on the page should be, so that we can preserve any outline-like structure that may have existed in the source content.  But outline elements in OneNote can only be indented in half-inch increments, so we have to snap each imported element to the next half-inch indent level, which can cause outline elements that were at different indents in the source to land at the same level in OneNote.  Argh.
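The collision is easy to see in a small sketch. This assumes that "next" means rounding up, and that indents are measured in whole points (72 to the inch, so a half inch is 36 points); both are assumptions for illustration, and `snap_indent_level` is a hypothetical name, not anything from the OneNote code.

```python
HALF_INCH_PT = 36  # assumption: 72 points per inch, so one level is 36 pt


def snap_indent_level(indent_pt):
    """Round a source indent (in whole points) up to the next half-inch level."""
    if indent_pt <= 0:
        return 0
    return -(-indent_pt // HALF_INCH_PT)  # integer ceiling division


source_indents_pt = [0, 20, 30, 40, 72]
levels = [snap_indent_level(p) for p in source_indents_pt]
print(levels)  # [0, 1, 1, 2, 2]
```

The 20 pt and 30 pt paragraphs were visibly different in the source but both land on level 1, and likewise 40 pt and 72 pt both land on level 2.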

The truth is, complex content pasted from the web or other apps will probably always require some amount of cleanup before you're happy with it.  But I think we've made it as painless as possible given the constraints.

1: “NoteWell” was an actual name we considered for the product at one point, though I'm not entirely sure whether “Well” was supposed to be an adverb or a noun.  Maybe that was the point.  Someone also proposed the Latin equivalent of the adverb form, “Nota Bene,” but a) apparently the company has a rule that product names have to either be English or completely made up (e.g. “Encarta”), and b) it's already taken anyway.  You can read more about the OneNote naming process in Chris Pratley's blog.

2: Chris has some background on this.

3: In the original OneNote 2003 release, our plaintext import (used when HTML isn't available) created nested headings when it saw inline TABs - i.e. TABs that are not at the beginning of a line.  A lot of text pasted from Notepad and other non-HTML-emitting apps didn't show up very nicely when factored into a nested heading outline like this, so we dropped that behavior in SP1 and now try to preserve the whitespace within a line as well as we can.

Published Monday, May 03, 2004 8:53 PM by pbaer

Comments

# re: HTML Import

This must be a real time consumer. Trying to make one format match up against another one (especially when going from a markup language to the approximate electronic equivalent of free-form print layout) has got to be a nearly impossible task.

Random thought: Not sure how much I like the Office "goo" in HTML generated by Office apps, but hey, that's another story.

Still curious about the text-to-graphics conversions that happen sometimes - what are the triggers that cause OneNote to do the conversion? I think I mentioned elsewhere it would be nice to have a toggle switch to turn that feature off or on as desired. :)

Glad to see you're blogging now! :)

- Greg
Monday, May 03, 2004 10:32 PM by Greg Hughes

# re: HTML Import

IMHO, the subtle differences between WinIE, Word, and OneNote make life maddening when copy/pasting. Try this. Copy some text (a lot of simple text broken into paragraphs) from IE and paste it into OneNote and WinWord (or WordMail). Notice that all your paragraph spacing is messed up. It's really not round-trippable and quite frustrating to deal with. I really wish this would be fixed for Office v.Next.
Monday, May 03, 2004 11:04 PM by Omar Shahine

# re: HTML Import

post more
Thursday, August 05, 2004 6:04 AM by ssdsfdgd

# re: HTML Import

what
Thursday, August 05, 2004 6:04 AM by dsdsgr
