<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Web-scraping with VB's XML support</title><link>http://blogs.msdn.com/lucian/archive/2009/02/21/web-scraping-with-vb-s-xml-support.aspx</link><description>There was an interesting article about using VB's XML support for generating HTML: http://www.infoq.com/news/2009/02/MVC-VB . I've been using VB and XML for the reverse purpose -- scraping web pages to retrieve information. I enjoy sailing, and I wanted</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>re: Web-scraping with VB's XML support</title><link>http://blogs.msdn.com/lucian/archive/2009/02/21/web-scraping-with-vb-s-xml-support.aspx#9472092</link><pubDate>Thu, 12 Mar 2009 22:27:10 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9472092</guid><dc:creator>Pavel Minaev [MSFT]</dc:creator><description>&lt;p&gt;Don't trust Tidy too much - its HTML parser is far from perfect, and when it fails, you do not get valid XHTML as output, even if you asked for it. One example of something it can't handle is the MS Office HTML extensions (such as VML), along with the non-standards-compliant way Office declares namespaces in the HTML documents it produces. And quite a few pages on the Web were made using &amp;quot;Save as HTML&amp;quot; in Word.&lt;/p&gt;
&lt;p&gt;Apart from writing an HTML parser from scratch, the only other reasonable option is to use IE (or rather, the HTMLDocument coclass and the IHTMLDocument interface) to parse the page, and then walk the resulting DOM. Along these lines:&lt;/p&gt;
&lt;p&gt;&lt;a rel="nofollow" target="_new" href="http://www.codeproject.com/KB/IP/parse_html.aspx"&gt;http://www.codeproject.com/KB/IP/parse_html.aspx&lt;/a&gt;&lt;/p&gt;
</description></item></channel></rss>