As a web developer, one of my biggest gripes with Office used to be the bloated HTML it would save out.

I mean, give me a break. I type two words into a Word document, and I expect something like this:

<html>
 <head>
  <title>Wayne Kao</title>
 </head>
 <body>Wayne Kao</body>
</html>

Instead, Word produces this huge file that barely resembles HTML, with all this extra gunk. Is this the best that Microsoft developers can do?

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 11">
<meta name=Originator content="Microsoft Word 11">
<link rel=File-List href="Wayne%20Kao_files/filelist.xml">
<title>Wayne Kao</title>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Wayne Kao</o:Author>
  <o:LastAuthor>Wayne Kao</o:LastAuthor>
  <o:Revision>1</o:Revision>
  <o:TotalTime>2</o:TotalTime>
  <o:Created>2004-03-25T08:18:00Z</o:Created>
  <o:LastSaved>2004-03-25T08:20:00Z</o:LastSaved>
  <o:Pages>1</o:Pages>
  <o:Words>1</o:Words>
  <o:Characters>9</o:Characters>
  <o:Company>Microsoft Corporation</o:Company>
  <o:Lines>1</o:Lines>
  <o:Paragraphs>1</o:Paragraphs>
  <o:CharactersWithSpaces>9</o:CharactersWithSpaces>
  <o:Version>11.5703</o:Version>
 </o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:SpellingState>Clean</w:SpellingState>
  <w:GrammarState>Clean</w:GrammarState>
  <w:ValidateAgainstSchemas/>
  <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
  <w:IgnoreMixedContent>false</w:IgnoreMixedContent>
  <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
  <w:Compatibility>
   <w:BreakWrappedTables/>
   <w:SnapToGridInCell/>
   <w:WrapTextWithPunct/>
   <w:UseAsianBreakRules/>
   <w:UseWord2002TableStyleRules/>
  </w:Compatibility>
  <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
 </w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:LatentStyles DefLockedState="false" LatentStyleCount="156">
 </w:LatentStyles>
</xml><![endif]-->
<style>
<!--
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
	{mso-style-parent:"";
	margin:0in;
	margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	font-size:12.0pt;
	font-family:"Times New Roman";
	mso-fareast-font-family:"Times New Roman";}
@page Section1
	{size:8.5in 11.0in;
	margin:1.0in 1.25in 1.0in 1.25in;
	mso-header-margin:.5in;
	mso-footer-margin:.5in;
	mso-paper-source:0;}
div.Section1
	{page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
 /* Style Definitions */
 table.MsoNormalTable
	{mso-style-name:"Table Normal";
	mso-tstyle-rowband-size:0;
	mso-tstyle-colband-size:0;
	mso-style-noshow:yes;
	mso-style-parent:"";
	mso-padding-alt:0in 5.4pt 0in 5.4pt;
	mso-para-margin:0in;
	mso-para-margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	font-size:10.0pt;
	font-family:"Times New Roman";
	mso-ansi-language:#0400;
	mso-fareast-language:#0400;
	mso-bidi-language:#0400;}
</style>
<![endif]-->
</head>

<body lang=EN-US style='tab-interval:.5in'>

<div class=Section1>

<p class=MsoNormal>Wayne Kao</p>

</div>

</body>

</html>

Actually, there's a very good reason Word saves out all this stuff. While it's definitely not optimized for the web, customers expect to be able to open the Word HTML back up without any data loss. After all, they said, if Word can save it, it should be able to open it.

Therefore, the HTML needed to be as expressive as the normal binary .doc format. Word has added a good number of features over the years, so round-tripping means saving out a lot of information, which explains the file size above.
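To get a feel for where that extra information goes, here is a rough Python sketch. In the sample above, most of the Word-only data sits in two places: conditional comments of the form <!--[if gte mso 9]> ... <![endif]-->, which browsers treat as ordinary comments, and CSS declarations whose names start with mso-, which browsers skip as unknown properties. The function name office_only_summary and the regular expressions are my own illustration, not anything Word ships, and they assume markup shaped like the sample.

```python
import re

# Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL = re.compile(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL)
# CSS declarations whose property name starts with "mso-"
MSO_DECL = re.compile(r"mso-[\w-]+\s*:\s*[^;}]*;?")


def office_only_summary(html: str) -> dict:
    """Count the Office-only pieces in a Word-saved HTML string (hypothetical helper)."""
    blocks = CONDITIONAL.findall(html)
    # Look for mso-* declarations only outside the conditional comments,
    # so nothing is counted twice.
    remainder = CONDITIONAL.sub("", html)
    decls = MSO_DECL.findall(remainder)
    return {
        "conditional_blocks": len(blocks),
        "conditional_chars": sum(len(b) for b in blocks),
        "mso_declarations": len(decls),
    }


# A shortened excerpt in the style of the Word output above.
sample = """<head>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Wayne Kao</o:Author>
 </o:DocumentProperties>
</xml><![endif]-->
<style>
 p.MsoNormal
    {mso-style-parent:"";
    margin:0in;
    mso-pagination:widow-orphan;}
</style>
</head>"""

summary = office_only_summary(sample)
print(summary)
```

Run against the full file instead of the excerpt, this should give a rough idea of how much of the page exists only so Word can read it back in.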

On the bright side, if you don't care about reopening in Word and just want some lean, mean HTML, Word now also lets you save out filtered HTML.


<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 11 (filtered)">
<title>Wayne Kao</title>
<style>
<!--
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0in;
	margin-bottom:.0001pt;
	font-size:12.0pt;
	font-family:"Times New Roman";}
@page Section1
	{size:8.5in 11.0in;
	margin:1.0in 1.25in 1.0in 1.25in;}
div.Section1
	{page:Section1;}
-->
</style>

</head>

<body lang=EN-US>

<div class=Section1>

<p class=MsoNormal>Wayne Kao</p>

</div>

</body>

</html>

Not perfect, but not too shabby for a word processor. If you're super hardcore, you can strip it down more by hand or with FrontPage.
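
If "by hand" sounds tedious, the same cleanup can be scripted. The sketch below is only an illustration under some assumptions: strip_word_markup is a made-up name, and the regular expressions rely on the predictable shape of Word's filtered output, so this is not a general-purpose HTML cleaner. It drops the style block, the Generator meta tag, the class attributes, and the wrapper div, and it keeps the charset meta tag because the browser still needs that one.

```python
import re


def strip_word_markup(html: str) -> str:
    """Trim Word's filtered HTML further (hypothetical helper, regex-based)."""
    flags = re.DOTALL | re.IGNORECASE
    # Drop the whole <style> block, including the MsoNormal and @page rules.
    html = re.sub(r"<style\b.*?</style>", "", html, flags=flags)
    # Drop only the Generator meta tag; the charset declaration stays.
    html = re.sub(r"<meta\s+name=[\"']?Generator\b[^>]*>", "", html, flags=flags)
    # Drop class attributes, quoted or unquoted (class=MsoNormal, class=Section1).
    html = re.sub(r"\s+class=(\"[^\"]*\"|'[^']*'|[^\s>]+)", "", html, flags=flags)
    # Unwrap <div> elements but keep whatever is inside them.
    html = re.sub(r"</?div\b[^>]*>", "", html, flags=flags)
    # Collapse the blank lines left behind.
    html = re.sub(r"\n\s*\n+", "\n", html)
    return html.strip()


# A shortened copy of the filtered output shown above.
filtered = """<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 11 (filtered)">
<title>Wayne Kao</title>
<style>
<!--
 p.MsoNormal
    {margin:0in;}
-->
</style>

</head>

<body lang=EN-US>

<div class=Section1>

<p class=MsoNormal>Wayne Kao</p>

</div>

</body>

</html>"""

cleaned = strip_word_markup(filtered)
print(cleaned)
```

Keep in mind that anything stripped this way is gone for good, so Word will no longer see its own styles if you open the cleaned file again.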