Brad Smith's blog

Search, SharePoint, Stuff (SSS)

How to download Wikipedia



So you're looking for some dummy data? Well, how about downloading Wikipedia?!

There are over 2 million pages on Wikipedia. Don't try to crawl the site; it won't let you. No robots allowed!

Go to http://download.wikipedia.org and you'll see a list of all the databases. If you're looking for the English one, it's "enwiki". From there you can download a whole bunch of stuff, but the file you generally want is "pages-articles.xml.bz2". It contains the current versions of article content and is the archive most mirror sites will probably want. The latest version at the time of writing is 1.7 GB.
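Once the archive is on disk, you don't have to unpack it to get at the articles. Here's a rough sketch of streaming pages straight out of the compressed file with Python's standard library. The helper names (local_name, iter_pages) are made up for illustration, and the sketch assumes the dump uses the usual MediaWiki export layout, with page elements that contain a title and a revision holding the text. Check a few pages of your own dump before relying on it.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an XML tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path):
    """Yield (title, text) for each <page> in a bz2-compressed XML dump.

    Assumes the MediaWiki export layout: <page> elements with a <title>
    and a <revision> that holds a <text> element.
    """
    # bz2.open decompresses on the fly, so the archive is never unpacked to disk.
    with bz2.open(dump_path, "rb") as stream:
        # iterparse reports each element as soon as its closing tag is read.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = None, ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            # Release this page's content once the caller has consumed it.
            elem.clear()
```

To try it, loop over iter_pages("enwiki-pages-articles.xml.bz2") (use whatever name your downloaded file actually has) and push each title and text into your indexer. Decompressing on the fly is slower than reading a plain XML file, but it saves you the disk space of the unpacked dump.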

Now you can run some decent content through your search engine or proof-of-concept application!

 

Comments
  • Thank you for highlighting this, this is so cool!

  • Not a problem Hannes :)

  • Very cool. Don't forget you can use DataDude to generate data too... I believe it's now out as CTP 7

  • Update: DataDude is now RTM 1.0 :)

  • i like it
