<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Sample code for Plagiarism Searcher tool</title><link>http://blogs.msdn.com/b/jmstall/archive/2005/08/22/plagiarism-searcher-sample.aspx</link><description>Here's my sample code for a tool to catch blog plagiarism that I described earlier. In retrospect, it was pretty easy to write (under 400 lines!). And edit-and-continue in C# and interceptable exceptions made my development time a lot faster! 
 The</description><dc:language>en-US</dc:language><generator>Telligent Evolution Platform Developer Build (Build: 5.6.50428.7875)</generator><item><title>re: Sample code for Plagiarism Searcher tool</title><link>http://blogs.msdn.com/b/jmstall/archive/2005/08/22/plagiarism-searcher-sample.aspx#455669</link><pubDate>Wed, 24 Aug 2005 20:36:44 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:455669</guid><dc:creator>Mike Stall - MSFT</dc:creator><description>I apologize; I didn't intend to accuse tagcloud of plagiarising - I'll update my original post to be very clear about that. &lt;br&gt;I explicitly called out tagcloud as an example of a false positive:&lt;br&gt;&amp;quot;It turns out the 2% one is a false positive: the search engine cache found the copied content, but then the page had completely changed since then (it's an &amp;quot;Under Construction page&amp;quot; now).&amp;quot;&lt;br&gt;&lt;br&gt;The problem is that the tool uses MSN Search (which uses a cache) to find candidates, but then uses live page access to look for credits to the author. It will search both the raw HTML and the stripped HTML, so credits inside of tags (such as an &amp;lt;A&amp;gt;) will show up.&lt;br&gt;&lt;br&gt;The problem is that if the cached page has a credit, but the site is down (or replaced with an &amp;quot;I'm under construction&amp;quot; notice), the tool won't see it and will report a false positive.&lt;br&gt;This would also be a problem for blog homepages where the entry that credits the author is no longer on the homepage.&lt;br&gt;&lt;br&gt;This is a shortcoming of the tool. 
The tool needs to find candidates and check for crediting using the same copy of the HTML.&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=455669" width="1" height="1"&gt;</description></item><item><title>re: Sample code for Plagiarism Searcher tool</title><link>http://blogs.msdn.com/b/jmstall/archive/2005/08/22/plagiarism-searcher-sample.aspx#455492</link><pubDate>Wed, 24 Aug 2005 11:14:57 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:455492</guid><dc:creator>Grant</dc:creator><description>Hi Mike,&lt;br&gt;&lt;br&gt;I was interested to see why your tool picked up the tagcloud.com URL for the tag &amp;quot;Readify&amp;quot;&lt;br&gt;&lt;br&gt;Tagcloud shows extracts of blogs for particular tags, not unlike technorati.com.&lt;br&gt;&lt;br&gt;Seeing as the site is down at the moment, this is an example of what your scanner would have seen (although it's been updated since)&lt;br&gt;&lt;br&gt;&lt;a rel="nofollow" target="_new" href="http://cc.msnscache.com/cache.aspx?q=2145254345000&amp;amp;lang=en-US&amp;amp;FORM=CVRE"&gt;http://cc.msnscache.com/cache.aspx?q=2145254345000&amp;amp;lang=en-US&amp;amp;FORM=CVRE&lt;/a&gt;&lt;br&gt;&lt;br&gt;So your scanner would've seen the tagcloud of Readify, and of course Matthew's post.&lt;br&gt;&lt;br&gt;It looks as though the summary strips the HTML so that's why there's no link back to your post.&lt;br&gt;&lt;br&gt;Just to be safe, I've removed the Readify tagcloud so that it doesn't get accused of plagiarising again.
&lt;br&gt;&lt;br&gt;Grant&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=455492" width="1" height="1"&gt;</description></item><item><title>Interesting Finds</title><link>http://blogs.msdn.com/b/jmstall/archive/2005/08/22/plagiarism-searcher-sample.aspx#455457</link><pubDate>Wed, 24 Aug 2005 08:12:12 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:455457</guid><dc:creator>Jason Haley</dc:creator><description>&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=455457" width="1" height="1"&gt;</description></item><item><title>re: Sample code for Plagiarism Searcher tool</title><link>http://blogs.msdn.com/b/jmstall/archive/2005/08/22/plagiarism-searcher-sample.aspx#455108</link><pubDate>Tue, 23 Aug 2005 17:33:04 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:455108</guid><dc:creator>Mike Stall - MSFT</dc:creator><description>I reran the tool and the Matthew's original page no longer shows up (since he added the credits). That's actually a great full-circle demo. I've updated the entry to reflect the new search.
&lt;br&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=455108" width="1" height="1"&gt;</description></item><item><title>re: Sample code for Plagiarism Searcher tool</title><link>http://blogs.msdn.com/b/jmstall/archive/2005/08/22/plagiarism-searcher-sample.aspx#454977</link><pubDate>Tue, 23 Aug 2005 09:43:57 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:454977</guid><dc:creator>Jonathan de Halleux</dc:creator><description>Michael,&lt;br&gt;&lt;br&gt;Once you have extracted the &amp;quot;raw&amp;quot; data from MSN, you can apply this (&lt;a rel="nofollow" target="_new" href="http://research.microsoft.com/research/sv/PageTurner/similarity.htm"&gt;http://research.microsoft.com/research/sv/PageTurner/similarity.htm&lt;/a&gt;) to get a better idea of the degree of plagiarism involved.&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=454977" width="1" height="1"&gt;</description></item><item><title>re: Sample code for Plagiarism Searcher tool</title><link>http://blogs.msdn.com/b/jmstall/archive/2005/08/22/plagiarism-searcher-sample.aspx#454935</link><pubDate>Tue, 23 Aug 2005 07:19:55 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:454935</guid><dc:creator>Mike Stall - MSFT</dc:creator><description>Matthew - No problem! I'm glad you found the content useful, and I had a lot of fun writing the searcher tool. 
:)&lt;br&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=454935" width="1" height="1"&gt;</description></item><item><title>re: Sample code for Plagiarism Searcher tool</title><link>http://blogs.msdn.com/b/jmstall/archive/2005/08/22/plagiarism-searcher-sample.aspx#454889</link><pubDate>Tue, 23 Aug 2005 04:59:49 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:454889</guid><dc:creator>Matthew Cosier</dc:creator><description>Hi Mike,&lt;br&gt;&lt;br&gt;Regarding your tool's findings for the URL above:&lt;br&gt;&lt;a rel="nofollow" target="_new" href="http://mcosier.blogspot.com/2005_06_01_mcosier_archive.html"&gt;http://mcosier.blogspot.com/2005_06_01_mcosier_archive.html&lt;/a&gt;&lt;br&gt;&lt;br&gt;If you take a look at the original post, you will see a white image (a box, representing the end of a quote). For some reason that I have only just realised, blogger doesn't actually provide a link to the original, but instead creates an image with an embedded reference to &lt;a rel="nofollow" target="_new" href="http://blogs.msdn.com/aggbug.aspx?PostID=POSTID"&gt;http://blogs.msdn.com/aggbug.aspx?PostID=POSTID&lt;/a&gt;&lt;br&gt;&lt;br&gt;Unfortunately I didn't realise this until now, when I was forced to open the HTML. I generally post using the blogger web client, and then leave it at that without checking the resultant view in a browser.&lt;br&gt;&lt;br&gt;So perhaps either a) blogger needs to fix their stuff, or b) you should check for a reference to that URL as well in your app...(not that I know anything about what aggbug.aspx does...)&lt;br&gt;&lt;br&gt;Having said this, I will no longer rely on the blogger web client, and will never post with it again. As a matter of fact, I will probably start hosting my own very soon. I have fixed all references so that they are now direct links to your original.
&lt;br&gt;&lt;br&gt;I appologise for not checking my posts more thoughoughly... if you look at the post you will also see that I ended my part of the post with the colon:&lt;br&gt;I did not try and take credit for this, it was made clear that it wasnt my post.&lt;br&gt;I just wish blogger had of put a proper link in... gah.&lt;br&gt;&lt;br&gt;Thanks,&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=454889" width="1" height="1"&gt;</description></item></channel></rss>