
Tool to catch plagiarism

If you copy somebody else's blog entry verbatim, credit the original author and link back to the original post.

Sometimes I'll google my own topics to learn more about what other people have to say about them. While doing that, I stumbled across some blatant plagiarism. That was annoying, but the cool thing was that it made me realize you could write a tool to search for blog plagiarism:

1.) Have some tool which reads through a blog feed. For each entry in the feed:
2.) Use a search engine to search for a large part of the entry's text. Perhaps search a paragraph at a time, since there's a higher chance of somebody copying a single paragraph than the whole document. Since a whole paragraph is a pretty specific search, you'd expect only a few matches.
3.) Scan each search result (skipping the ones for the original post, of course!) for a hyperlink back to the original blog or for the author's name. If there is no such reference, the search result is likely plagiarizing the blog entry.
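
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the tool itself: `search` and `fetch` are hypothetical callables that the caller supplies (no particular search engine API is assumed), and the helper names and the `min_words` cutoff are made up for the example.

```python
import re
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags and the text content of a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def split_paragraphs(entry_text, min_words=20):
    """Step 2: keep only paragraphs long enough to make a specific query."""
    paragraphs = re.split(r"\n\s*\n", entry_text)
    return [p.strip() for p in paragraphs if len(p.split()) >= min_words]


def credits_author(page_html, blog_url, author_name):
    """Step 3: True if the page links to the blog or mentions the author."""
    parser = _LinkCollector()
    parser.feed(page_html)
    links_back = any(blog_url.lower() in href.lower() for href in parser.hrefs)
    names_author = author_name.lower() in " ".join(parser.text_parts).lower()
    return links_back or names_author


def find_suspects(entry_text, blog_url, author_name, search, fetch):
    """Return result URLs that match a paragraph but never credit the author.

    search(query) -> list of URLs and fetch(url) -> HTML string are
    hypothetical callables (assumptions of this sketch, not a real API).
    """
    suspects = set()
    for paragraph in split_paragraphs(entry_text):
        for url in search('"%s"' % paragraph):
            if blog_url.lower() in url.lower():
                continue  # skip the original post itself
            if not credits_author(fetch(url), blog_url, author_name):
                suspects.add(url)
    return sorted(suspects)
```

Passing the search and fetch functions in as parameters keeps the check itself independent of whichever search engine ends up being used, and makes it easy to try out on canned pages.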

It seems like it should be pretty straightforward. It's mostly glue around an RSS reader and a search engine API. (Actually, it sounds so simple, I bet such a tool is already out there. I expect this is a common problem with schools and student papers.)
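
For the RSS reader half of that glue, something like the sketch below might be enough. It assumes a plain RSS 2.0 feed with `<item>` elements and does not handle Atom; `read_rss_entries` is a hypothetical helper name chosen for the example.

```python
import xml.etree.ElementTree as ET


def read_rss_entries(rss_xml):
    """Yield (title, link, description) for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        yield (
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        )
```

Each description could then be fed, a paragraph at a time, to whatever search API is available.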
As a sanity check, I tried this method out by hand with an example search using MSN Search on my post about 0xFeeFee sequence points. At the time of writing (8/20/05), there are only 3 different matches: my original post, this, and this. (For each match, there's actually a blog entry and an archived blog entry, so there were 6 total matches.) When I pull up the source HTML for each of the results, I can see the 2nd one does not include any reference (either my name or blog URL) back to me, whereas the 3rd one does. So the tool could automatically flag the 2nd one as plagiarizing.

Offhand, I don't know how to automate the search APIs. If I do end up writing such a tool, I'll be sure to post back.  (Update: I wrote the tool and it's available here)

 

Published Sunday, August 21, 2005 12:24 AM by jmstall

Comments

# re: Tool to catch plagiarism

Sunday, August 21, 2005 2:53 AM by Edge
Interesting, and sad.. I hope plagiarism doesn't become a big problem for bloggers.

# re: Tool to catch plagiarism

Sunday, August 21, 2005 3:25 AM by jmstall
I wrote up some preliminary stuff and it's interesting and promising. Some issues I observe:
- It would be easy for this to generate false positives. For example, if both your blog and blog X quote article Y, blog X may not link to you. The tool needs to be smart.
- The RSS / atom feed is a great way of getting the input data, but it only works for recent entries.

My suspicion is that you could write a tool that could present you with some reasonable plagiarism candidates - but it will be difficult to make it more than 95% sure.

# re: Tool to catch plagiarism

Sunday, August 21, 2005 6:35 AM by Miki Watts
There is something called Simian (http://www.redhillconsulting.com.au/products/simian/) which is supposed to search for duplicate text. Maybe it can be adapted to this?

# Sample code to execute MSN searches from C#

Sunday, August 21, 2005 1:05 PM by Mike Stall's .NET Debugging Blog
For kicks, I started writing a tool to use internet searches to automatically catch plagiarism. The first...

# re: Tool to catch plagiarism

Sunday, August 21, 2005 5:03 PM by Mike Weller
There is indeed a tool that does just this...

I can't for the life of me remember what it's called or where it is... I'll try and find it

# re: Tool to catch plagiarism

Sunday, August 21, 2005 5:05 PM by Mike Weller
Here it is... not quite how I remembered it, but it might help

http://copyscape.com/

# Sample code to execute MSN searches from C#

Sunday, August 21, 2005 5:19 PM by Mike Stall's .NET Debugging Blog
For kicks, I started writing a tool to use internet searches to automatically catch plagiarism. The first...

# re: Tool to catch plagiarism

Monday, August 22, 2005 4:02 PM by Richard Ashcroft
I am a small-time webmaster with several webpages I maintain simply for my own satisfaction and for the enjoyment of other people like me. Recently, I found a corporate site that has, in my opinion, plagiarised my site. It was very frustrating.

# Sample code for Plagiarism Searcher tool

Monday, August 22, 2005 5:46 PM by Mike Stall's .NET Debugging Blog
Here's my sample code for a tool to catch blog plagiarism that I described earlier. In retrospect, it...

# re: Tool to catch plagiarism

Wednesday, August 24, 2005 1:37 AM by jmstall
Richard - I'm sorry to hear about that. Surely there must be something you can do. Did they take something concrete (like code or text) or something more abstract (like a "style" or color scheme)?

# Sample code for Plagiarism Searcher tool

Wednesday, August 24, 2005 1:41 PM by Mike Stall's .NET Debugging Blog
Here's my sample code for a tool to catch blog plagiarism that I described earlier. In retrospect, it...

# Mike Stall: Tool to catch Blog Plagiarism

There is even such a tool out there to catch blog plagiarism. LOL! http://blogs.msdn.com/jmstall/archive/2005/08/21/plagiarism_tool.aspx Something...