<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Wenming's Big Data and Big Compute Blog</title><link>http://blogs.msdn.com/b/hpctrekker/</link><description>All about running HPC and Big Data workload on the Microsoft Windows Azure platform.  I have been and will continue to cover a variety of topics on Hadoop, HPC, Migration from / Interops with Unix, Tips and Tricks for running HPC and Big Data.</description><dc:language>en-US</dc:language><generator>Telligent Evolution Platform Developer Build (Build: 5.6.50428.7875)</generator><item><title>Data Science in a Box using IPython: Scipy and Scikit-Learn (3/4)</title><link>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/17/data-science-in-a-box-using-ipython-data-science-packages-3-4.aspx</link><pubDate>Wed, 17 Apr 2013 22:43:25 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10412069</guid><dc:creator>HPC Trekker</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/hpctrekker/rsscomments.aspx?WeblogPostID=10412069</wfw:commentRss><comments>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/17/data-science-in-a-box-using-ipython-data-science-packages-3-4.aspx#comments</comments><description>&lt;p&gt;In the first two blogs of this series, we installed the IPython notebook using the minimum requirement.&amp;#160;&amp;#160;&amp;#160; &lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;&lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/04/09/data-science-in-a-box-using-ipython-creating-a-linux-vm-on-windows-azure-1-4.aspx"&gt;&lt;font size="2"&gt;Creating a Linux VM on Windows Azure (1/4)&lt;/font&gt;&lt;/a&gt;&lt;/li&gt;   &lt;font size="2"&gt;&lt;/font&gt;    &lt;li&gt;&lt;a 
href="http://blogs.msdn.com/b/hpctrekker/archive/2013/04/09/data-science-in-a-box-using-ipython-installing-ipython-notebook-2-4.aspx"&gt;&lt;font size="2"&gt;Installing IPython notebook (2/4)&lt;/font&gt;&lt;/a&gt;&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;This third post walks you through some of the common packages used for Data Science.&amp;#160; &lt;/p&gt;  &lt;p&gt;The &lt;strong&gt;SciPy/NumPy&lt;/strong&gt; packages are usually mentioned together.&amp;#160; At this point, we have not installed SciPy.&amp;#160; &lt;a href="http://docs.scipy.org/doc/scipy-dev/reference/tutorial/index.html"&gt;SciPy&lt;/a&gt; is a collection of numerical packages, including the linear solvers that we used in a previous post, &lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/04/07/enter-the-big-data-matrix-analyzing-meanings-and-relations-of-everything-2-2.aspx"&gt;Enter the Big Data Matrix: analyzing meanings and relations of everything (2/2)&lt;/a&gt;.&amp;#160; &lt;/p&gt;  &lt;p&gt;To install the package, type: &lt;strong&gt;sudo apt-get install python-scipy&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Scikit Learn&lt;/strong&gt; is a fantastic Python-based machine learning package that includes algorithms for both supervised and unsupervised learning.&amp;#160; It also includes sample datasets, data import tools, and model evaluation tools. &lt;/p&gt;  &lt;p&gt;Scikit Learn is available from the Ubuntu package repositories, but that version is about two releases behind.&amp;#160; The best way to install Scikit Learn is to use pip.&lt;/p&gt;  &lt;p&gt;Type: &lt;strong&gt;pip install scikit-learn&lt;/strong&gt; (if pip reports a permissions error, run it with sudo).&lt;/p&gt;  &lt;p&gt;The installation builds many components from source, since much of the code base is written in C. Check the output for errors. You can verify the install by looking for a new sklearn directory in /usr/local/lib/python2.7/dist-packages. 
&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0777.image_5F00_7642E9D0.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7142.image_5F00_thumb_5F00_1A6AD511.png" width="806" height="78" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;h2&gt;Getting samples&lt;/h2&gt;  &lt;p&gt;It is easy to find samples and run them in IPython Notebook.&amp;#160; You can get them from various websites and tutorials.&amp;#160; To save you time, I’ve made a small collection at:&amp;#160; &lt;a title="https://github.com/wenming/BigDataSamples" href="https://github.com/wenming/BigDataSamples"&gt;https://github.com/wenming/BigDataSamples&lt;/a&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;Get the package by typing:&amp;#160; wget &lt;a title="https://github.com/wenming/BigDataSamples/archive/master.zip" href="https://github.com/wenming/BigDataSamples/archive/master.zip"&gt;https://github.com/wenming/BigDataSamples/archive/master.zip&lt;/a&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;On your Ubuntu box, you might have to install unzip by typing:&amp;#160; &lt;strong&gt;sudo apt-get install unzip&amp;#160; &lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Unzip master.zip, then copy the contents of BigDataSamples-master/ipythonMLsamples&amp;#160; into your IPython directory.&amp;#160; A sample command may look like:&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;cp /home/azureuser/samples/BigDataSamples-master/ipythonMLsamples/* /home/azureuser/.ipython/&lt;/p&gt;  &lt;p&gt;Check to make sure the files have been copied. 
&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6153.image_5F00_49BC4A9B.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0876.image_5F00_thumb_5F00_074C3916.png" width="810" height="185" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;h2&gt;Running the samples&lt;/h2&gt;  &lt;p&gt;Go back to the website for IPython, log in, and the files listed should show up in the root directory.&amp;#160; Click on K-Means clustering on the handwritten digits data. &lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2352.image_5F00_1FDE6717.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6175.image_5F00_thumb_5F00_0A834260.png" width="824" height="407" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Click on the Play button to run the machine learning sample.&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5001.image_5F00_23157061.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5187.image_5F00_thumb_5F00_5919EF6E.png" width="822" height="639" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;The code uses the K-means algorithm with 3 different types of 
initialization, then plots the results.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0310.image_5F00_3F4849F0.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5826.image_5F00_thumb_5F00_76252EE7.png" width="827" height="271" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;The code for making the color scatter plot:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7266.image_5F00_210023AB.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8168.image_5F00_thumb_5F00_14FE3377.png" width="839" height="300" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;Additional samples to explore&lt;/h2&gt;  &lt;p&gt;These samples are also included. Feel free to explore them on your own.&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;A demo of K-Means clustering on the handwritten digits data&lt;/li&gt;   &lt;li&gt;A demo of structured Ward hierarchical clustering on the Lena image&lt;/li&gt;   &lt;li&gt;Faces dataset decompositions&lt;/li&gt;   &lt;li&gt;Gaussian Processes regression&lt;/li&gt;   &lt;li&gt;Manifold learning&lt;/li&gt;   &lt;li&gt;Non-linear SVM&lt;/li&gt;   &lt;li&gt;Handwriting recognition using SVM&lt;/li&gt;   &lt;li&gt;Hierarchical clustering: structured vs. unstructured Ward&lt;/li&gt;   &lt;li&gt;A second demo of the K-Means clustering algorithm&lt;/li&gt;   &lt;li&gt;Weighted SVM&lt;/li&gt;   &lt;li&gt;Visualizing the stock market structure&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  
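&lt;p&gt;If you would rather see the idea in code than in screenshots, the sketch below approximates what the K-Means digits sample does.&amp;#160; It is not the notebook from the repository: it assumes a reasonably recent scikit-learn, assumes the three initializations are k-means++, random, and PCA-based (as in the upstream scikit-learn example), and uses illustrative variable names.&amp;#160; It prints each model's inertia instead of plotting.&lt;/p&gt;  &lt;pre&gt;import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Load the 8x8 handwritten digit images and standardize each feature.
digits = load_digits()
data = scale(digits.data)
n_digits = len(np.unique(digits.target))  # one cluster per digit class

# PCA components serve as deterministic starting centers for the third run.
pca = PCA(n_components=n_digits).fit(data)

# (label, estimator) pairs: illustrative names, not taken from the sample.
estimators = [
    ("k-means++", KMeans(n_clusters=n_digits, init="k-means++", n_init=10)),
    ("random", KMeans(n_clusters=n_digits, init="random", n_init=10)),
    ("PCA-based", KMeans(n_clusters=n_digits, init=pca.components_, n_init=1)),
]

for name, estimator in estimators:
    estimator.fit(data)
    # Inertia is the within-cluster sum of squared distances (lower is tighter).
    print("%-10s inertia: %.0f" % (name, estimator.inertia_))&lt;/pre&gt;  &lt;p&gt;Paste it into a notebook cell and press Shift + Enter.&amp;#160; The PCA-based run uses a single initialization because its starting centers are fixed.&lt;/p&gt;  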
&lt;h2&gt;Conclusion &lt;/h2&gt;  &lt;p&gt;IPython Notebook gives us a quick and easy way to share compute resources through the web-based IPython notebook interface.&amp;#160; Scikit-Learn, NumPy, and Scipy all simply work out of the box for IPython notebook.&amp;#160; The simple, yet powerful combination lets users focus on learning and getting the data analysis done.&lt;/p&gt;  &lt;p&gt;In the next blog, we will introduce additional packages in Python that can be used for Data analysis including scaling out using clustering. &lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10412069" width="1" height="1"&gt;</description></item><item><title>Data Science in a Box using IPython: Installing IPython notebook (2/4)</title><link>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/09/data-science-in-a-box-using-ipython-installing-ipython-notebook-2-4.aspx</link><pubDate>Tue, 09 Apr 2013 21:45:41 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10409808</guid><dc:creator>HPC Trekker</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/hpctrekker/rsscomments.aspx?WeblogPostID=10409808</wfw:commentRss><comments>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/09/data-science-in-a-box-using-ipython-installing-ipython-notebook-2-4.aspx#comments</comments><description>&lt;p&gt;In the previous blog, we demonstrated how to create a &lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/04/09/data-science-in-a-box-using-ipython-creating-a-linux-vm-on-windows-azure-1-4.aspx"&gt;Windows Azure Linux VM in detail&lt;/a&gt;. We will continue the installation process for the IPython notebook and related packages. 
&lt;/p&gt;&lt;h2&gt;Python 2.7 or 3.3&lt;/h2&gt;&lt;p&gt;One of the discussions at the Python in Finance conference was which version of Python you should use.&amp;nbsp; My personal opinion is that unless you have a special need, you should stick with Python 2.7.&amp;nbsp; 2.7 comes as the default on most of the latest Linux distros.&amp;nbsp; Until 3.3 becomes the default Python interpreter on your OS, it is better to use 2.7.&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;h2&gt;The Basics of package management for Python&lt;/h2&gt;&lt;p&gt;There are several ways you can get Python packages installed.&amp;nbsp; The easiest is probably the OS package installer, but it may not offer the latest version of a package for the Linux release you are running.&amp;nbsp; For Ubuntu,&lt;strong&gt; apt-get is the installer for the OS.&amp;nbsp; apt-get &lt;/strong&gt;will install your packages in &lt;code&gt;/usr/lib/python2.7/dist-packages&lt;/code&gt;.&lt;/p&gt;&lt;p&gt; Another option is to use &lt;strong&gt;easy_install.&amp;nbsp; &lt;a href="http://pythonhosted.org/distribute/easy_install.html"&gt;Easy install&lt;/a&gt; &lt;/strong&gt;ships with the setuptools package rather than with the Ubuntu OS itself, so we need to install python-setuptools using apt-get before we can use it.&amp;nbsp; If you use easy_install, your packages will end up in &lt;code&gt;/usr/local/lib/python2.7/dist-packages&lt;/code&gt; instead.&lt;/p&gt;&lt;p&gt;&lt;code&gt;Type: &lt;strong&gt;sudo apt-get install python-setuptools&lt;/strong&gt;&lt;/code&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5706.image_5F00_3FE59E3A.png"&gt;&lt;img width="560" height="301" title="image" style="display: inline; background-image: none;" alt="image" 
src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3058.image_5F00_thumb_5F00_68840A41.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;code&gt;&lt;font face="Segoe UI"&gt;&lt;a href="https://pypi.python.org/pypi/pip"&gt;PIP&lt;/a&gt; is another tool for installing and managing Python packages, and it is recommended over easy_install.&amp;nbsp; For our purposes, we will simply use whichever of these tools installs our packages easily and correctly.&lt;/font&gt;&lt;/code&gt;&lt;/p&gt;&lt;p&gt;&lt;code&gt;&lt;font face="Segoe UI"&gt;Type: &lt;strong&gt;sudo apt-get install python-pip&lt;/strong&gt;.&amp;nbsp; This might take a few minutes, as pip has many dependencies that must be installed.&lt;/font&gt;&lt;/code&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2068.image_5F00_18ADE5B6.png"&gt;&lt;img width="563" height="531" title="image" style="display: inline; background-image: none;" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6685.image_5F00_thumb_5F00_5F2AD5BE.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&lt;code&gt;&lt;font face="Segoe UI"&gt;&lt;/font&gt;&lt;/code&gt;&amp;nbsp;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;&lt;code&gt;&lt;font face="Segoe UI"&gt;Installing IPython, Tornado web server, Matplotlib and other packages&lt;/font&gt;&lt;/code&gt;&lt;/h2&gt;&lt;h2&gt;&lt;code&gt;&lt;font face="Segoe UI"&gt;&lt;/font&gt;&lt;/code&gt;&amp;nbsp;&lt;/h2&gt;&lt;h2&gt;&lt;code&gt;&lt;font face="Segoe UI" size="2"&gt;&lt;a href="http://matplotlib.org/"&gt;Matplotlib&lt;/a&gt; is a popular 2D plotting library and one of IPython notebook’s components.&amp;nbsp; As you interact with the Notebook, plots are generated on the server using matplotlib and sent for displaying in your web 
browser.&lt;/font&gt;&lt;/code&gt;&lt;/h2&gt;&lt;p&gt;&lt;a href="http://matplotlib.org/users/screenshots.html"&gt;&lt;img width="557" height="136" align="middle" alt="screenshots" src="http://matplotlib.org/_static/logo_sidebar_horiz.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;To install, type:&amp;nbsp; &lt;strong&gt;sudo apt-get install python-matplotlib &lt;/strong&gt;&lt;/p&gt;&lt;p&gt;IPython notebook is browser based and uses the &lt;strong&gt;Tornado webserver&lt;/strong&gt;.&amp;nbsp; The Python-based Tornado webserver supports web sockets for interactive and efficient communication between the webserver and the browser.&amp;nbsp; &lt;/p&gt;&lt;p&gt;To install, type:&amp;nbsp; &lt;strong&gt;sudo apt-get install python-tornado&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Once that completes, we will install IPython itself.&amp;nbsp; The IPython team recommends installing through easy_install to get the latest package from their website.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;sudo easy_install &lt;a href="http://github.com/ipython/ipython/tarball/master"&gt;http://github.com/ipython/ipython/tarball/master&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This should install the 1.0 development version of IPython.&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3630.image_5F00_75ECADF8.png"&gt;&lt;img width="657" height="592" title="image" style="display: inline; background-image: none;" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2084.image_5F00_thumb_5F00_236DCDBC.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;We also need to install a package called PyZMQ.&amp;nbsp; ZeroMQ is a very fast networking library that IPython uses for its clustered configuration.&amp;nbsp; IPython is capable of interactively controlling a cluster of machines and running 
massively parallel&amp;nbsp; Big Compute and Big Data applications.&lt;/p&gt;&lt;p&gt;Type:&amp;nbsp; &lt;strong&gt;sudo apt-get install python-zmq&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Finally&lt;strong&gt;, Jinja2,&lt;/strong&gt; a fast, modern and designer-friendly templating language for Python, is now required for IPython notebook.&lt;/p&gt;&lt;p&gt;Type:&amp;nbsp; &lt;strong&gt;sudo apt-get install python-jinja2&lt;/strong&gt;&lt;br&gt;&amp;nbsp; &lt;/p&gt;&lt;h2&gt;Configuring IPython notebook&lt;/h2&gt;&lt;p&gt;Type:&lt;strong&gt; ipython profile create nbserver&lt;/strong&gt;&amp;nbsp; to create a profile.&amp;nbsp; The command generates a default configuration file in your home directory under .ipython/profile_nbserver/ipython_config.py.&amp;nbsp;&amp;nbsp;&amp;nbsp; Note that any directory whose name starts with a “.” is a hidden directory in Linux. You must type&lt;strong&gt; ls -al&lt;/strong&gt; to see it.&lt;/p&gt;&lt;p&gt;The .ipython directory is shown below in blue.&lt;/p&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0815.image_5F00_7750D0CA.png"&gt;&lt;img width="673" height="213" title="image" style="display: inline; background-image: none;" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8764.image_5F00_thumb_5F00_59DE7FBE.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Once we’ve created a profile, the next step is to create an SSL certificate and generate a password to protect the notebook webpage.&lt;/p&gt;&lt;p&gt;Type: &lt;strong&gt;cd ~/.ipython/profile_nbserver&lt;/strong&gt;&amp;nbsp; to switch into the profile directory we just created.&lt;/p&gt;&lt;p&gt;Then, type: &lt;strong&gt;openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem&amp;nbsp;&amp;nbsp; &lt;/strong&gt;to create a certificate. Below is a sample session we used to create the certificate. 
&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;/strong&gt;&amp;nbsp;&lt;/p&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0827.image_5F00_1531E57D.png"&gt;&lt;img width="678" height="386" title="image" style="display: inline; background-image: none;" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7183.image_5F00_thumb_5F00_66D85FCF.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Since this is a self-signed certificate, your browser will give you a security warning when you open the notebook. For long-term production use, you will want to use a properly signed certificate associated with your organization. Since certificate management is beyond the scope of this demo, we will stick to a self-signed certificate for now.&lt;/p&gt;&lt;p&gt;The next step is to create a password to protect your notebook.&lt;/p&gt;&lt;p&gt;Type: &lt;strong&gt; python -c "import IPython;print IPython.lib.passwd()"&amp;nbsp;&amp;nbsp; # password generation&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5125.image_5F00_3DD05184.png"&gt;&lt;img width="680" height="127" title="image" style="display: inline; background-image: none;" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2068.image_5F00_thumb_5F00_5D7F2B4C.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Next, we will edit the profile's configuration file, the &lt;code&gt;ipython_config.py&lt;/code&gt; file in the profile directory you are in. This file has a number of fields, and by default all of them are commented out. 
You can open this file with any text editor you like, such as the Unix vi editor or nano (which is easier for beginners), and you should ensure that it has at least the following content.&amp;nbsp; &lt;p&gt;Make sure you make a copy of the &lt;font style="background-color: rgb(255, 255, 0);"&gt;sha1:c70c9b9671ef:43cf678c8dcae580fb87b2d18055abd084d0e2ad&lt;/font&gt;&amp;nbsp; string you got from the Python password generator command above. &lt;p&gt;&amp;nbsp;&lt;p&gt;Type: &lt;strong&gt;nano ipython_config.py &lt;/strong&gt;&lt;p&gt;This opens the file in the editor. Copy the lines below into it.&amp;nbsp; Note that # is the comment sign in Python.&lt;pre&gt;c = get_config()
   
# Always start with inline matplotlib plotting support
&lt;font style="background-color: rgb(255, 255, 0);"&gt;c.IPKernelApp.pylab = 'inline'&lt;/font&gt;

# You must give the path to the certificate file.

# If using a Linux VM:
c.NotebookApp.certfile = &lt;font style="background-color: rgb(255, 255, 0);"&gt;u'/home/azureuser/.ipython/profile_nbserver/mycert.pem'&lt;/font&gt;

# Create your own password as indicated above
c.NotebookApp.password = &lt;font style="background-color: rgb(255, 255, 0);"&gt;u'sha1:c70c9b9671ef:43cf678c8dcae580fb87b2d18055abd084d0e2ad'&lt;/font&gt; &lt;strong&gt;&lt;font style="background-color: rgb(255, 255, 0);"&gt;#use your own&lt;/font&gt;&lt;/strong&gt;

# Network and browser details. We use a fixed port (8888) so it matches
# our Windows Azure setup, where we've allowed traffic on that port

&lt;font style="background-color: rgb(255, 255, 0);"&gt;c.NotebookApp.ip = '*'
c.NotebookApp.port = 8888
c.NotebookApp.open_browser = False&lt;/font&gt;&lt;/pre&gt;&lt;pre&gt;&lt;font style="background-color: rgb(255, 255, 0);"&gt;&lt;/font&gt;&amp;nbsp;&lt;/pre&gt;&lt;pre&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2480.image_5F00_38ED9DC8.png"&gt;&lt;img width="686" height="279" title="image" style="display: inline; background-image: none;" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6281.image_5F00_thumb_5F00_34771D01.png" border="0"&gt;&lt;/a&gt;&lt;/pre&gt;&lt;pre&gt;&lt;strong&gt;Press Ctrl+X to exit nano and press Y to save the file.&lt;/strong&gt;&lt;/pre&gt;&lt;pre&gt;&amp;nbsp;&lt;/pre&gt;&lt;h2&gt;Configure the Windows Azure Virtual Machines Firewall&lt;/h2&gt;&lt;pre&gt;This was done in Post 1 of this blog series.  Please see the Create your first Linux Virtual Machine section of the &lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/04/09/data-science-in-a-box-using-ipython-creating-a-linux-vm-on-windows-azure-1-4.aspx"&gt;&lt;strong&gt;blog&lt;/strong&gt;&lt;/a&gt;.&lt;/pre&gt;&lt;h2&gt;&amp;nbsp;&lt;/h2&gt;&lt;h2&gt;Run the IPython Notebook&lt;/h2&gt;&lt;p&gt;At this point we are ready to start the IPython Notebook. 
To do this, navigate to the directory you want to store notebooks in and start the IPython Notebook Server:&lt;pre&gt;Type: &lt;strong&gt;ipython notebook --profile=nbserver&lt;/strong&gt;&lt;/pre&gt;&lt;p&gt;You should now be able to access your IPython Notebook at the address &lt;code&gt;https://[Your Chosen Name Here].cloudapp.net&lt;/code&gt;.&lt;p&gt;In our case it is:&amp;nbsp; &lt;a href="https://ipythonvm.cloudapp.net"&gt;https://ipythonvm.cloudapp.net&lt;/a&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1488.image_5F00_2CEBAD94.png"&gt;&lt;img width="693" height="87" title="image" style="display: inline; background-image: none;" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4456.image_5F00_thumb_5F00_767D8C42.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;p&gt;Type in the Password you set when you ran the &lt;strong&gt;python -c "import IPython;print IPython.lib.passwd()"&lt;/strong&gt; command.&lt;/p&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1401.image_5F00_03E39F49.png"&gt;&lt;img width="692" height="207" title="image" style="display: inline; background-image: none;" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6786.image_5F00_thumb_5F00_5F5211C4.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Once logged in, you should see an empty directory.&amp;nbsp; Click on &lt;strong&gt;“New NoteBook”&lt;/strong&gt; to start.&lt;/p&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6523.image_5F00_77E1AF14.png"&gt;&lt;img width="690" height="227" title="image" style="display: inline; background-image: 
none;" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5633.image_5F00_thumb_5F00_53502190.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;To reward your hard work, we’ll have IPython notebook plot a few donuts for us.&amp;nbsp; You can copy and paste the code from: &lt;a title="http://matplotlib.org/examples/api/donut_demo.html" href="http://matplotlib.org/examples/api/donut_demo.html"&gt;http://matplotlib.org/examples/api/donut_demo.html&lt;/a&gt;&amp;nbsp; Place your cursor at the end of the last line and press Shift + Enter to run the code.&amp;nbsp; If all goes well, you should see a set of 4 chocolate donuts almost instantly.&amp;nbsp; &lt;/p&gt;&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2500.image_5F00_47BA6451.png"&gt;&lt;img width="689" height="503" title="image" style="display: inline; background-image: none;" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1423.image_5F00_thumb_5F00_30FB1CC8.png" border="0"&gt;&lt;/a&gt;&lt;/p&gt;&lt;h2&gt;Conclusion&lt;/h2&gt;&lt;p&gt;In the second part of this blog series, we showed you the minimum steps to install the IPython notebook inside a Windows Azure VM running Ubuntu Linux 12.10.&amp;nbsp; In the next blog, we’ll take a look at a few popular, common packages for machine learning,&amp;nbsp; data analysis, and scientific computing.&amp;nbsp; If you have questions, please contact me at @wenmingye on Twitter. 
&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10409808" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/HPC/">HPC</category><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/azure/">azure</category><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/Big+Data/">Big Data</category><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/python/">python</category></item><item><title>Data Science in a Box using IPython: Creating a Linux VM on Windows Azure (1/4)</title><link>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/09/data-science-in-a-box-using-ipython-creating-a-linux-vm-on-windows-azure-1-4.aspx</link><pubDate>Tue, 09 Apr 2013 17:22:54 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10409713</guid><dc:creator>HPC Trekker</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/hpctrekker/rsscomments.aspx?WeblogPostID=10409713</wfw:commentRss><comments>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/09/data-science-in-a-box-using-ipython-creating-a-linux-vm-on-windows-azure-1-4.aspx#comments</comments><description>&lt;p&gt;I just returned from the &lt;a href="http://pythoninfinancenyc2013.eventbrite.com/"&gt;Python in Finance Conference&lt;/a&gt; in New York.&amp;#160; I would like to thank Bank of America and Andrew Shepped for organizing the event.&amp;#160; It was not difficult to see the popularity of Python in the financial community; the event quickly sold out, with over 400 attendees.&amp;#160; I gave a 35 minute talk on Python and Windows Azure, and was pleasantly surprised by the amount of interest from the audience, both at the event and afterwards.&amp;#160; The purpose of this tutorial series is to help you get IPython notebook installed and&lt;strong&gt; &lt;/strong&gt;start playing&lt;strong&gt; with machine 
learning, and other data science packages in Python.&lt;/strong&gt; &lt;/p&gt;  &lt;h2&gt;IPython: Convenience leads to mainstream popularity &lt;/h2&gt;  &lt;p&gt;One Python package that really stood out at the conference was the IPython notebook.&amp;#160; Almost every single presenter mentioned the greatness of &lt;a href="http://ipython.org/"&gt;IPython notebook&lt;/a&gt;.&amp;#160;&amp;#160; It is a web based Python environment that makes sharing Python code/projects that much easier.&amp;#160; IPython was developed by my former colleagues from Tech-X corp and alumni, Brian Granger and Fernando Perez from the CU Boulder Physics dept.&amp;#160; Over the years, I have collaborated, and helped to fund some of the work for the project to get IPython running smoothly, especially on &lt;a href="http://ipython.org/ipython-doc/stable/api/generated/IPython.parallel.apps.launcher.html?highlight=windows%20hpc#IPython.parallel.apps.launcher.WindowsHPCControllerLauncher"&gt;Windows HPC Server&lt;/a&gt; and on &lt;a href="http://www.windowsazure.com/en-us/develop/python/tutorials/ipython-notebook/"&gt;Windows Azure cluster&lt;/a&gt;.&amp;#160; It is good to see these investments have paid off and benefited the Python community greatly.&amp;#160; Most recently, Microsoft External Research has made a sizable donation to the IPython foundation to further support the community,&amp;#160; the announcement was made at PyCon this year.&lt;/p&gt;  &lt;p&gt;Due to high demand from recent conferences, we’ll do a walk through of the installation process with more details for those who are new to either IPython or Windows Azure.&amp;#160; The &lt;a href="http://www.windowsazure.com/en-us/develop/python/tutorials/ipython-notebook/"&gt;original instructions&lt;/a&gt; can be found on the official site of Windows Azure.&amp;#160; &lt;/p&gt;  &lt;p&gt;&lt;a 
href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5850.image_5F00_72BB4309.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4774.image_5F00_thumb_5F00_2E0EA8C8.png" width="582" height="471" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;h2&gt;Windows Azure free trial sign up&lt;/h2&gt;  &lt;p&gt;Windows Azure is Microsoft’s Cloud platform, and it supports both Windows and Linux VMs. The free trial gives you 3 months free with 750 core hours each month, 70 GB free storage and so on.&amp;#160; The sign-up process is quick and completely risk free: your credit card will NOT be charged until you specifically instruct Azure to do so. &lt;strong&gt; You will need a &lt;a href="https://login.live.com/login.srf?wa=wsignin1.0&amp;amp;rpsnv=11&amp;amp;ct=1365520004&amp;amp;rver=6.1.6206.0&amp;amp;wp=MBI_SSL_SHARED&amp;amp;wreply=https:%2F%2Fmail.live.com%2Fdefault.aspx%3Frru%3Dhome%26livecom%3D1&amp;amp;lc=1033&amp;amp;id=64855&amp;amp;mkt=en-us&amp;amp;cbcxt=mai"&gt;liveID&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Sign up link:&amp;#160; &lt;a title="http://www.windowsazure.com/en-us/pricing/free-trial/?WT.mc_id=directtoaccount_control" href="http://www.windowsazure.com/en-us/pricing/free-trial/?WT.mc_id=directtoaccount_control"&gt;http://www.windowsazure.com/en-us/pricing/free-trial/?WT.mc_id=directtoaccount_control&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2625.image_5F00_4227C551.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" 
src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1157.image_5F00_thumb_5F00_6B32644D.png" width="581" height="560" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;Login and sign up for the Virtual Machine Preview Feature&lt;/h2&gt;  &lt;p&gt;Since Windows Azure Virtual Machines, our IaaS (infrastructure as a service) offering, is still in preview, you will need to log in through the portal and then enable the preview feature at:&amp;#160; &lt;a title="https://account.windowsazure.com/PreviewFeatures" href="https://account.windowsazure.com/PreviewFeatures"&gt;https://account.windowsazure.com/PreviewFeatures&lt;/a&gt;&amp;#160; &lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Click on Try it now to enable the preview feature.&amp;#160; &lt;/strong&gt;You will get queued for approval.&amp;#160; This process may take anywhere from a few minutes to a day depending on availability. For us, it became available as soon as we went back to the Windows Azure dashboard and refreshed it.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3386.image_5F00_7421F68C.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8463.image_5F00_thumb_5F00_1A83D9D8.png" width="585" height="195" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2703.image_5F00_1CC06294.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1541.image_5F00_thumb_5F00_5F3304CA.png" width="586" height="442" /&gt;&lt;/a&gt;&lt;/p&gt;  
&lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;Once you have signed up for the VM preview feature, the Virtual Machines menu item appears in the dashboard.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1122.image_5F00_33823ACE.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6318.image_5F00_thumb_5F00_07D170D2.png" width="591" height="293" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;Create your first Linux Virtual Machine&lt;/h2&gt;  &lt;p&gt;IPython works really well on both Windows and Linux instances.&amp;#160; In this tutorial, I would like to take the opportunity to show the majority of readers here, who are Windows users, how to get up and running on Linux, as I believe that a good developer should be tool and platform agnostic.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Click on&lt;/strong&gt; &lt;strong&gt;+NEW, then select Compute and Virtual Machine&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4353.image_5F00_0A0DF98E.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3364.image_5F00_thumb_5F00_229D96DE.png" width="596" height="284" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;Use the &lt;strong&gt;QUICK CREATE&lt;/strong&gt; option.&amp;#160; &lt;strong&gt;Fill out the fields&lt;/strong&gt;, starting with &lt;strong&gt;DNS Name&lt;/strong&gt;, which is the name of your machine.&amp;#160; I picked &lt;strong&gt;Ubuntu 12.10&lt;/strong&gt;, which is a preferred VM image on the IPython development team.&amp;#160; You may want to 
pick a smaller VM size for the trial, as it may run out much quicker with the&lt;strong&gt; Extra large&lt;/strong&gt;.&amp;#160; Pick a secure password.&amp;#160; It is also recommended that you &lt;strong&gt;pick a data center close&lt;/strong&gt; to where you are.&amp;#160; &lt;strong&gt;Click on Create A virtual Machine&lt;/strong&gt;.&amp;#160;&amp;#160; A virtual machine along with a storage account will be automatically created for you.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8547.image_5F00_2BF95C12.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4834.image_5F00_thumb_5F00_14CDE194.png" width="598" height="326" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;To understand how IaaS Virtual Machines work, please take a look at the diagram below.&amp;#160; Windows Azure virtual machines are much more advanced than simple machine hosting.&amp;#160; When we normally buy a server box, we use its disks for keeping the OS and data, and if a disk dies it has to be replaced.&amp;#160; If the server dies, we have to get a new server.&amp;#160; With a Windows Azure Virtual Machine, a user no longer has to worry about such hardware failures or downtime.&amp;#160; If there is a hardware failure on the physical host that hosts your VM,&amp;#160; your VM can be moved onto a different host.&amp;#160; In order to do this, the VM does not use a local physical hard drive; instead it uses virtual drives stored remotely on Windows Azure Storage.&amp;#160; Windows Azure Storage keeps 3 copies of your image in case of physical drive failures on Windows Azure Storage itself.&amp;#160; Such an architecture gives us flexibility, reliability and a great service level for preventing downtime.&amp;#160; You can 
also attach multiple drives to the VM depending on its size.&amp;#160; For an extra large instance, we can attach up to 16 drives at 1TB each.&amp;#160; You can read more about &lt;a href="http://www.windowsazure.com/en-us/manage/windows/"&gt;Windows virtual machines here.&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7484.image_5F00_2D5D7EE4.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6404.image_5F00_thumb_5F00_21C7C1A5.png" width="600" height="343" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;It only takes a few minutes to provision a Windows Azure Virtual Machine.&amp;#160; IPythonVM’s status is now running.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1373.image_5F00_7D363420.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0121.image_5F00_thumb_5F00_3FA8D657.png" width="602" height="154" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;h2&gt;Configuring your VM for log in&lt;/h2&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5482.image_5F00_13F80C5B.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2678.image_5F00_thumb_5F00_1A3EE2E9.png" width="614" height="487" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;SSH details or the default way of logging into a 
Linux machine are at the bottom of the Dashboard page.&amp;#160; In case you want to change the port to its default 22 instead of randomly selected port 50390 listed here, you will need to do that on the End points Tab at the top of the page. &lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6076.image_5F00_316CEE18.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0216.image_5F00_thumb_5F00_511BC7E0.png" width="622" height="140" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1185.image_5F00_7E9CE7A3.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3240.image_5F00_thumb_5F00_1E4BC16C.png" width="329" height="351" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Click on Edit the endpoint at the bottom and change the public port to 22 from&lt;/strong&gt; 50390&lt;strong&gt;.&amp;#160; &lt;/strong&gt;This may take a few seconds for the changes to reflect.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2337.image_5F00_60BE63A2.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1680.image_5F00_thumb_5F00_350D99A6.png" width="635" height="130" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;To expose the IPython notebook webserver, we need to 
add an additional end point.&amp;#160; We will be running the web server internally at port 8888, and expose it at 443 as the public end point.&amp;#160; &lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Click on Add Endpoint&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4812.image_5F00_4BD935E6.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4428.image_5F00_thumb_5F00_795A55A9.png" width="535" height="378" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;Port 443 has been created for the IPython VM.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5582.image_5F00_0DDFA528.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2746.image_5F00_thumb_5F00_7477A1ED.png" width="651" height="212" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;Log into Your Windows Azure Linux VM&lt;/h2&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Download &lt;/strong&gt;&lt;a href="http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html"&gt;&lt;strong&gt;Putty&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt; or your favorite SSH client to login.&amp;#160; Use the full hostname displayed on the dashboard for your VM.&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1106.image_5F00_77803BDC.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" 
src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1777.image_5F00_thumb_5F00_4926B62F.png" width="406" height="389" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8117.image_5F00_0722D79F.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2350.image_5F00_thumb_5F00_58C951F1.png" width="412" height="298" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;Accept the remote SSH key, then type in your user name and password to log in.&amp;#160; By default the user name is &lt;strong&gt;azureuser&lt;/strong&gt; and the password is the one you created.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6457.image_5F00_141CB7B0.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0601.image_5F00_thumb_5F00_6F8B2A2B.png" width="641" height="405" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;Security Updates and patches&lt;/h2&gt;  &lt;p&gt;&lt;strong&gt;Linux machines that are not secure are the primary attack targets on the internet&lt;/strong&gt;, so it is advised that you immediately and frequently update your VM with security patches.&amp;#160; The commands are simple:&amp;#160; &lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;&lt;strong&gt;sudo apt-get update&lt;/strong&gt;&amp;#160; // note that sudo allows you to run a command as the super user (root); you will need to type in your own password.&lt;/li&gt;    &lt;li&gt;&lt;strong&gt;sudo apt-get upgrade&lt;/strong&gt;&amp;#160; // once in a while 
you may want to upgrade your packages too.&lt;/li&gt;    &lt;li&gt;&lt;strong&gt;adduser&lt;/strong&gt; allows you to add additional users.&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8611.image_5F00_58CBE2A2.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6052.image_5F00_thumb_5F00_3844A2F0.png" width="645" height="407" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;The results of the update command are shown above.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4382.image_5F00_4F067B2A.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2742.image_5F00_thumb_5F00_15836B33.png" width="644" height="412" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;The upgrade command may ask for user input, and will take a while to complete.&lt;/p&gt;  &lt;h2&gt;Conclusion&lt;/h2&gt;  &lt;p&gt;This is the first in a blog series that shows you how to turn a Windows Azure VM into a powerful IPython-based machine learning in a box solution.&amp;#160; If you have questions, please contact me via @wenmingye on Twitter.&amp;#160; In the next tutorial we will be ready to get all the Python packages installed.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>Enter the Big Data Matrix: analyzing meanings and relations of everything 
(2/2)</title><link>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/07/enter-the-big-data-matrix-analyzing-meanings-and-relations-of-everything-2-2.aspx</link><pubDate>Mon, 08 Apr 2013 02:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10409224</guid><dc:creator>HPC Trekker</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/hpctrekker/rsscomments.aspx?WeblogPostID=10409224</wfw:commentRss><comments>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/07/enter-the-big-data-matrix-analyzing-meanings-and-relations-of-everything-2-2.aspx#comments</comments><description>&lt;h2&gt;Running the Python example step by step:&lt;/h2&gt;
&lt;p&gt;We explained the basic idea behind LSA or latent semantic analysis in the &lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/04/07/enter-the-big-data-matrix-analyzing-meanings-and-relations-of-everything-1-2.aspx"&gt;first part of this blog&lt;/a&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;We built a matrix by counting the words in each document.&amp;nbsp; Each row of the matrix corresponds to a word and each column to a document.&lt;/li&gt;
&lt;li&gt;Then we applied SVD (singular value decomposition) to the matrix and kept a smaller number of its singular values (loosely called Eigen values in this post) to shrink the three newly decomposed matrices.&amp;nbsp; We keep these three smaller matrices as the basis for our data model.&lt;/li&gt;
&lt;li&gt;Any new document can be mapped into the model with a formula rather than by running SVD again.&lt;/li&gt;
&lt;li&gt;We can then use the new matrix to compute the similarities of word vectors.&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;
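&lt;p&gt;The formula in step 3 is not spelled out in this post.&amp;nbsp; A minimal sketch, assuming it refers to the standard LSA folding-in projection (multiply the new document's word counts by the transposed reduced U matrix, then divide by the kept singular values), could look like the block below.&amp;nbsp; The function name fold_in and the tiny 4 x 3 matrix are hypothetical and only serve as an illustration.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

def fold_in(document_counts, U_k, sigma_k):
    """Project a new document's word counts into the k-dimensional LSA space."""
    # U_k.T maps word counts onto the k kept directions; dividing by
    # sigma_k rescales each coordinate by its singular value.
    return np.dot(U_k.T, document_counts) / sigma_k

# Hypothetical 4-word x 3-document count matrix, for illustration only.
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 0.],
              [0., 0., 2.]])

k = 2
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
U_k, sigma_k = U[:, :k], sigma[:k]

# A new document that uses the first two words once each.
new_document = np.array([1., 1., 0., 0.])
print(fold_in(new_document, U_k, sigma_k))
```
&lt;/pre&gt;
&lt;p&gt;Folding in a column that is already part of the matrix returns that document's existing coordinates in Vt, which is a quick sanity check for the projection.&lt;/p&gt;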
&lt;p&gt;In the second part of this tutorial, we&amp;rsquo;ll go through a simple example with real code.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4024.image_5F00_76C455AD.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7266.image_5F00_thumb_5F00_3217BB6C.png" alt="image" width="491" height="209" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is the original corpus.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3857.image_5F00_58799EB7.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0724.image_5F00_thumb_5F00_3A9B1AB6.png" alt="image" width="489" height="259" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Word counting gives us a matrix indexed as Matrix[word row, document column]: each row is a word and each column is a document.&lt;/p&gt;
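&lt;p&gt;As a rough illustration of how such a matrix can be built, here is a small pure-Python sketch.&amp;nbsp; The three short documents and the function name term_document_matrix are made up for this example (the real corpus appears only in the screenshot above), and the sketch does not filter out common words.&lt;/p&gt;
&lt;pre&gt;
```python
# Hypothetical mini-corpus; not the corpus used in this post.
docs = [
    "human machine interface",
    "user interface system",
    "graph of trees",
]

def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word i in document j."""
    tokenized = [doc.lower().split() for doc in documents]
    # One row per distinct word, in alphabetical order.
    vocabulary = sorted(set(word for tokens in tokenized for word in tokens))
    matrix = [[float(tokens.count(word)) for tokens in tokenized]
              for word in vocabulary]
    return vocabulary, matrix

vocabulary, matrix = term_document_matrix(docs)
for word, row in zip(vocabulary, matrix):
    print(word, row)
```
&lt;/pre&gt;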
&lt;h2&gt;&amp;nbsp;&lt;/h2&gt;
&lt;h2&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4431.image_5F00_0EEA50BA.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8231.image_5F00_thumb_5F00_184615EE.png" alt="image" width="488" height="367" border="0" /&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;Compute SVD using NumPy&lt;/h2&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s run the actual code.&amp;nbsp; You can find the Python code &lt;a href="https://github.com/wenming/BigDataSamples/tree/master/svdsample"&gt;here&lt;/a&gt; on GitHub.&amp;nbsp; A &lt;a href="https://github.com/wenming/BigDataSamples/blob/master/svdsample/exampleMatlab.txt"&gt;MATLAB&lt;/a&gt; script is also available.&lt;/p&gt;
&lt;p&gt;First, start the PyLab console Window.&amp;nbsp; If you have not downloaded Enthought Python Distribution, please get it from the &lt;a href="http://epd-free.enthought.com/epd_free-7.3-2-win-x86.msi"&gt;download link&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;from numpy import *&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;from numpy.linalg import svd&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8737.image_5F00_3EA7F939.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3343.image_5F00_thumb_5F00_011A9B70.png" alt="image" width="494" height="290" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The SVD solver is part of the NumPy package.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The next step is to enter the word matrix that represents all the sentences and the word occurrences in them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A = array ([( 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 1.,&amp;nbsp; 0.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 1.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 0.,&amp;nbsp; 1.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 0.,&amp;nbsp; 1.,&amp;nbsp; 1.,&amp;nbsp; 2.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 0.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 0.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 0.,&amp;nbsp; 0.,&amp;nbsp; 1.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 0.,&amp;nbsp; 1.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 1.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 1.,&amp;nbsp; 1.,&amp;nbsp; 1.,&amp;nbsp; 0.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 
0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 1.,&amp;nbsp; 1.,&amp;nbsp; 1.), &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ( 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 0.,&amp;nbsp; 1.,&amp;nbsp; 1.)])&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Now apply SVD to decompose the matrix into 3:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;U,sigma,V = svd(A)&amp;nbsp;&amp;nbsp;&amp;nbsp; # NumPy built-in function; the returned V is already transposed (Vt).&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6012.image_5F00_7584DE30.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8737.image_5F00_thumb_5F00_1533B7F9.png" alt="image" width="487" height="258" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The decomposition returns three results: U, sigma (the W matrix) and V.&amp;nbsp; Let me explain each one of them.&lt;/p&gt;
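&lt;p&gt;To see what NumPy actually hands back, the following sketch repeats the decomposition on the same 12 x 9 matrix and multiplies the factors back together.&amp;nbsp; It uses the np. prefix and full_matrices=False (the compact form), which are choices made for this illustration rather than what the post's own script does.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

# Word-by-document count matrix from this post (12 words x 9 documents).
A = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

# Compact SVD: U is 12 x 9, sigma holds 9 singular values, Vt is 9 x 9.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, sigma.shape, Vt.shape)

# sigma is a 1-D array sorted from largest to smallest, and the third
# result is V already transposed, so the product below rebuilds A.
A_rebuilt = np.dot(U, np.dot(np.diag(sigma), Vt))
print(np.allclose(A, A_rebuilt))
```
&lt;/pre&gt;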
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1803.image_5F00_2299CAFF.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7242.image_5F00_thumb_5F00_72064D46.png" alt="image" width="489" height="791" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3&gt;The U Matrix&lt;/h3&gt;
&lt;p&gt;You are looking at the U matrix.&amp;nbsp; Its values are basically coefficients applied to the original vectors.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3365.image_5F00_56645201.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6661.image_5F00_thumb_5F00_2AB38805.png" alt="image" width="492" height="365" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;The sigma Matrix (W matrix)&lt;/h3&gt;
&lt;p&gt;This is the matrix that contains all the singular values (the Eigen values mentioned earlier) on its diagonal.&amp;nbsp; We will use only 2 of these values for our work.&amp;nbsp; Why 2?&amp;nbsp; That is purely by experimentation.&amp;nbsp; In general you should pick about 300 for a really large matrix, and 2-3 for a really small matrix like this one.&amp;nbsp; Your job as the data scientist is to play with the model and find the optimal number of singular values for this particular dataset.&lt;/p&gt;
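&lt;p&gt;The number of values is picked by experimentation in this post.&amp;nbsp; As a possible starting point for that experimentation, one common heuristic (not the method used here) is to keep the smallest number of values that covers a target share of the total squared magnitude.&amp;nbsp; The function name rank_for_energy, the target shares and the example values below are all hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

def rank_for_energy(singular_values, target=0.9):
    """Smallest k whose leading values hold `target` of the total squared magnitude."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # searchsorted finds the first position whose cumulative share reaches target.
    k = int(np.searchsorted(cumulative, target)) + 1
    # Guard against rounding pushing k past the number of values.
    return min(k, len(energy))

# Hypothetical singular values, for illustration only.
example = [4.0, 2.0, 1.0, 1.0]
print(rank_for_energy(example, target=0.7))
print(rank_for_energy(example, target=0.9))
```
&lt;/pre&gt;
&lt;p&gt;The result is only a first guess; comparing the similarities produced by a few nearby values is still the real test.&lt;/p&gt;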
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0211.image_5F00_7F02BE08.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5516.image_5F00_thumb_5F00_5A713084.png" alt="image" width="494" height="372" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;The V Transposed Matrix&lt;/h3&gt;
&lt;p&gt;This is the Vt matrix.&amp;nbsp; Because the rows of A are words and its columns are documents, each column of Vt corresponds to a document.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5187.image_5F00_20EE208D.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5584.image_5F00_thumb_5F00_381C2BBC.png" alt="image" width="495" height="372" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Dimension Reduction&lt;/h2&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5100.image_5F00_0C6B61C0.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1616.image_5F00_thumb_5F00_73035E85.png" alt="image" width="498" height="375" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As discussed earlier, the next step is to do a dimension reduction.&amp;nbsp; In our case, we pick n=2.&amp;nbsp; In this step we compute the 3 reduced matrices: the U matrix should have 2 columns, the W (Sigma) matrix should be 2x2, and the Vt matrix should have 2 rows.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;In the code, we build a new diagonal matrix from the singular values and then keep 2 of them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sigma = zeros_like(A)&amp;nbsp; # A temporary matrix with the same shape as A, filled with 0.&amp;nbsp; Same as&amp;nbsp; &lt;tt&gt;Sigma = A.copy(); Sigma.fill(0)&lt;/tt&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1200.image_5F00_155AF3FF.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4530.image_5F00_thumb_5F00_6E20AAC9.png" alt="image" width="496" height="196" border="0" /&gt;&lt;/a&gt;&lt;strong&gt; &lt;br /&gt; n = min(A.shape)&amp;nbsp; # returns 9: the original A matrix is 12 x 9, so the shorter dimension is 9.&amp;nbsp; &lt;br /&gt; Sigma[:n,:n] = diag(sigma)&amp;nbsp; # Fill the diagonal with the singular values. &lt;br /&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0211.image_5F00_498F1D45.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0310.image_5F00_thumb_5F00_4FD5F3D3.png" alt="image" width="499" height="375" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;U2=U[:,:2]&amp;nbsp;&amp;nbsp;&amp;nbsp; # take 2 columns of the decomposed U matrix. &lt;br /&gt; S2=Sigma[:2,:2]&amp;nbsp;&amp;nbsp;&amp;nbsp; # take 2 singular values for the new W matrix, the (N x N) one in the middle.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;V=transpose(V)&amp;nbsp;&amp;nbsp;&amp;nbsp; # svd returned V transposed, so this recovers V. &lt;br /&gt; V2=V[:,:2]&amp;nbsp;&amp;nbsp;&amp;nbsp; # take two columns of V (two rows of V transposed); the matrix to the right.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8750.image_5F00_46E8F245.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0714.image_5F00_thumb_5F00_7F939C52.png" alt="image" width="434" height="405" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3463.image_5F00_4FD88484.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2451.image_5F00_thumb_5F00_6F875E4C.png" alt="image" width="439" height="66" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Multiply them back:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A2=dot(U2,dot(S2,transpose(V2)))&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6406.image_5F00_6B10DD85.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1307.image_5F00_thumb_5F00_51A8DA4B.png" alt="image" width="431" height="270" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
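&lt;p&gt;The slicing and multiplication steps can also be folded into one small helper.&amp;nbsp; This is a sketch rather than the post's own script: the name rank_k_approximation and the 3 x 2 test matrix are hypothetical, and it scales the kept columns of U by the singular values directly instead of building the full Sigma matrix first.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

def rank_k_approximation(matrix, k):
    """Rebuild `matrix` from its k largest singular values only."""
    U, sigma, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Broadcasting multiplies column j of U by sigma[j], which is the
    # same as multiplying by a k x k diagonal matrix.
    return np.dot(U[:, :k] * sigma[:k], Vt[:k, :])

# Hypothetical 3 x 2 matrix with singular values 2 and 1.
M = np.array([[2., 0.],
              [0., 1.],
              [0., 0.]])

print(rank_k_approximation(M, 1))  # keeps only the strongest direction
print(rank_k_approximation(M, 2))  # keeps everything, so M comes back
```
&lt;/pre&gt;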
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;The New Matrix ready for computing meaning of words&lt;/h2&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6406.image_5F00_0CFC400A.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2451.image_5F00_thumb_5F00_08198C4E.png" alt="image" width="437" height="134" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This is the new matrix that we can now use to compute the relations.&amp;nbsp; Notice that we compare the word vectors (rows of the new matrix) using cosine similarity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;print "relation between human and user: ",&amp;nbsp;&amp;nbsp; dot(A2[0,:],A2[ 3,:])/linalg.norm(A2[0,:])/linalg.norm(A2[ 3,:])&amp;nbsp;&amp;nbsp;&amp;nbsp; # human is close to user (row 0, row 3) &lt;br /&gt; print "relation between human and minors: ", dot(A2[0,:],A2[11,:])/linalg.norm(A2[0,:])/linalg.norm(A2[11,:])&amp;nbsp; &lt;/strong&gt;&lt;strong&gt;# minors in this context have nothing to do with human (row 0, row 11)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4431.image_5F00_70EE11CF.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5584.image_5F00_thumb_5F00_109CEB98.png" alt="image" width="485" height="78" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Human and user are similar at 0.8878; human and minors are not, at &amp;ndash;0.275.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;print "relation between tree and graph: ", dot(A2[9,:],A2[10,:])/linalg.norm(A2[9,:])/linalg.norm(A2[10,:])&lt;/strong&gt;&amp;nbsp; &lt;strong&gt;#(row 9, row 10)&amp;nbsp; &lt;/strong&gt; &lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4034.image_5F00_7734E85D.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0317.image_5F00_thumb_5F00_39A78A94.png" alt="image" width="487" height="35" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
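&lt;p&gt;The cosine comparisons above all repeat the same expression, so it can be wrapped in a small helper. This is only a sketch: the function name and the three short vectors are hypothetical and are not rows of the real A2.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over the norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word vectors (rows of a reduced matrix), for illustration only.
rows = np.array([[0.2, 0.4, 0.1],
                 [0.4, 0.8, 0.2],    # same direction as row 0
                 [-0.4, 0.2, 0.0]])  # at a right angle to row 0

print(cosine(rows[0], rows[1]))
print(cosine(rows[0], rows[2]))
```
&lt;/pre&gt;
&lt;p&gt;Vectors that point the same way score close to 1 no matter how long they are, and unrelated (perpendicular) vectors score close to 0.&lt;/p&gt;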
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;&lt;span style="font-family: Segoe UI Light;"&gt;Conclusion&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="font-family: Segoe UI Light;"&gt;This example shows you how advanced math models can be applied to text to figure out the relations between words.&amp;nbsp; The same technique can be applied to just about anything that can be turned into a feature vector.&amp;nbsp; Data scientists can fine-tune these models to discover interesting things and ask new questions that normal BI cannot help answer. SVD is very useful in big data computation and predictive analysis, and you will encounter it often in data analysis.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-family: Segoe UI Light;"&gt;Questions or comments?&amp;nbsp; follow me on twitter @wenmingye&lt;/span&gt;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10409224" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/HPC/">HPC</category><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/Big+Data/">Big Data</category><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/Big+Compute/">Big Compute</category><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/python/">python</category></item><item><title>Enter the Big Data Matrix: analyzing meanings and relations of everything (1/2)</title><link>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/07/enter-the-big-data-matrix-analyzing-meanings-and-relations-of-everything-1-2.aspx</link><pubDate>Sun, 07 Apr 2013 23:58:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10409203</guid><dc:creator>HPC Trekker</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/hpctrekker/rsscomments.aspx?WeblogPostID=10409203</wfw:commentRss><comments>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/07/enter-the-big-data-matrix-analyzing-meanings-and-relations-of-everything-1-2.aspx#comments</comments><description>&lt;h2&gt;Data Science is compute and labor intensive&lt;/h2&gt;
&lt;p&gt;In the &lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/04/01/make-another-small-step-with-the-javascript-console-pig-in-hdinsight.aspx"&gt;previous blogs&lt;/a&gt;, we showed you how to find a dataset, clean it, and run a simple MapReduce job and a sort on it.&amp;nbsp; They were meant to give you a flavor of what data science is all about, and I also wanted to expose Big Data&amp;rsquo;s rather labor-intensive nature.&amp;nbsp; It takes real processing power and thought to work on even just 10 gigabytes of data.&amp;nbsp; Now you can imagine why it takes a large team at large web 2.0 companies to deal with terabytes of data; maintaining good service levels on those live services takes serious processing and manpower.&amp;nbsp; Back to our word count examples: one of the most common questions I get is what exactly people do after the word count is done, and why you would do a word count in the first place.&amp;nbsp; This blog has two parts: the first explains how things work in detail, and &lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/04/07/enter-the-big-data-matrix-analyzing-meanings-and-relations-of-everything-2-2.aspx"&gt;the second goes through an example.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Meaning of word count&lt;/h2&gt;
&lt;p&gt;Word count is a foundation of &lt;strong&gt;natural language processing&lt;/strong&gt; (NLP), which is fundamental to applications such as speech recognition, search engines, machine translation, and just about any application that needs to derive meaning from human input.&amp;nbsp; By counting words and putting the counts into a mathematical model, you can derive the meaning of a sentence from the words that appear in it; at the same time, the contexts in which a word appears help determine the meaning of that word.&amp;nbsp; The simplest example is a word cloud.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4477.image_5F00_48174516.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3225.image_5F00_thumb_5F00_6759EBE9.png" alt="image" width="386" height="260" border="0" /&gt;&lt;/a&gt;&lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/04/01/new-breakthrough-in-big-data-technologies-the-nullsql-paradigm-shift.aspx"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6443.image5_5F00_2010A7C5.png" alt="image" width="384" height="259" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;We will go through a relatively complex example. Even if you do not understand the mathematics behind it, please follow along to see how English words are turned into a mathematical model, and how relationships between words can be derived from that model.&amp;nbsp; &lt;strong&gt;These mathematical techniques can be applied to just about anything that can be turned into such a model, including your Facebook post behaviors, your online purchases, and even your travel patterns.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Python:&amp;nbsp; the Data Modeling Language&lt;/h2&gt;
&lt;p&gt;If you are working with mathematical models and scientific computing, you are likely to encounter tools such as R, Python, Matlab, SAS, etc.&amp;nbsp; Python is becoming more popular by the day due to its user base and the availability of packages.&amp;nbsp; Another important reason we are using Python here is that many data scientists use Python to get a &amp;ldquo;feel&amp;rdquo; for their data; ask any of them and they will tell you that they use &amp;ldquo;scripting&amp;rdquo; to &amp;ldquo;feel&amp;rdquo; their data, and that is here to stay.&amp;nbsp; We&amp;rsquo;ll have a discussion on&lt;strong&gt; how Python and Hadoop fit together&lt;/strong&gt; in a later blog.&amp;nbsp; In short, you need to use the right tool for the right job.&amp;nbsp; Hadoop is an amazing and proven tool for scaling out data analysis to the petabyte scale, while Python is great for prototyping and for analytics on small amounts of data.&amp;nbsp; In production, big data scientists may port or modify their Python code to run on Hadoop.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;It should take you just a couple of days, and perhaps only a few hours, to get acquainted with it.&amp;nbsp; For this example, we encourage you to install Python with a few numerical packages.&amp;nbsp; Luckily for Windows users, you can simply &lt;a href="http://www.enthought.com/products/epd.php"&gt;&lt;strong&gt;download EPD&lt;/strong&gt;&lt;/a&gt; from Enthought, our friends in Austin; it comes with batteries included.&amp;nbsp; Their distribution provides scientists with a set of tools, including over 100 libraries, to perform data analysis and visualization.&amp;nbsp; Some of the most common ones are &lt;a href="http://www.scipy.org/"&gt;SciPy&lt;/a&gt;, &lt;a href="http://www.numpy.org/"&gt;NumPy&lt;/a&gt; and &lt;a href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt;.&amp;nbsp;&amp;nbsp; &lt;a href="http://epd-free.enthought.com/epd_free-7.3-2-win-x86.msi"&gt;&lt;strong&gt;Link to the binary download&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Once you have it installed, run a simple test with PyLab.&amp;nbsp; If both import statements ran fine, you have it installed correctly, and those are the packages we need.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2806.image_5F00_6A7FED2D.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3554.image_5F00_thumb_5F00_5117E9F3.png" alt="image" width="612" height="313" border="0" /&gt;&lt;/a&gt; &lt;/p&gt;
&lt;h2&gt;Semantic Matrix Model&lt;/h2&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7380.image_5F00_5A73AF27.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5633.image_5F00_thumb_5F00_131E5935.png" alt="image" width="658" height="279" border="0" /&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The basic idea of this example is that the meaning of a document is determined by the words that appear in it, and the meaning of a word is determined by the documents in which it appears.&amp;nbsp; Consider the following piece of text: each sentence consists of an array of words, and the words themselves appear in multiple sentences.&amp;nbsp; In different contexts (sentences), words may have different meanings.&amp;nbsp; In this case, the word &amp;ldquo;tree&amp;rdquo; in m2 refers to a computer science tree.&amp;nbsp; &lt;strong&gt;Graph&lt;/strong&gt; and &lt;strong&gt;Trees&lt;/strong&gt; are very much the same thing in sentence m2.&amp;nbsp; You&amp;rsquo;ll also notice that while minors are human in our everyday context, the word &amp;ldquo;minor&amp;rdquo; that appears in m3 has nothing to do with the word &amp;ldquo;human&amp;rdquo;, while the word &amp;ldquo;user&amp;rdquo; does refer to a human in these sentences.&amp;nbsp; Will a mathematical model, or a semantic matrix analysis, be able to tell them apart?&amp;nbsp; Let us explore.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The first step is, you guessed it, word count.&amp;nbsp; In the semantic matrix below, you can find a count for each of the words.&amp;nbsp; &amp;ldquo;User&amp;rdquo; appears in sentences c2, c3 and c5 one time each, and we mark all the other contexts (sentences) that it didn&amp;rsquo;t appear in as 0.&amp;nbsp; The word &amp;ldquo;system&amp;rdquo; appears in c4 two times.&amp;nbsp;&amp;nbsp; As you can see, we are simply counting how often each word occurs in each sentence.&amp;nbsp; In the case of the entire internet, this could become a really, really large matrix: perhaps many billions of documents vs. millions of words.&amp;nbsp; Word counting is the simplest case of N-gram counting: counting single words gives 1-grams, while we could also count 2-grams and, in general, N-grams.&amp;nbsp; A couple of examples of 2-grams above are &amp;ldquo;response time&amp;rdquo; and &amp;ldquo;human system.&amp;rdquo;&amp;nbsp; With those we end up with an even larger matrix.&amp;nbsp; There are tools in Python that let you deal with this, such as &lt;a href="http://nltk.org/"&gt;&lt;strong&gt;NLTK&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The important thing to note here is that &lt;strong&gt;we turned the original corpus into a matrix, each document into a document vector, and each word into a word vector.&lt;/strong&gt;&amp;nbsp; You can do the same with just about anything whose relationships you want to compare, as explained in the &amp;ldquo;Meaning of word count&amp;rdquo; section above.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1004.image_5F00_2E56B236.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8357.image_5F00_thumb_5F00_777C5DEF.png" alt="image" width="661" height="267" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
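&lt;p&gt;To make the counting step concrete, here is a minimal sketch of building such a matrix in Python. The four short "documents" are made up for illustration (they are not the sentences in the figure), and putting words in rows and documents in columns is an assumption about the layout.&lt;/p&gt;
&lt;pre&gt;
```python
from collections import Counter
import numpy as np

# Hypothetical mini corpus; each string is one context (document).
docs = ["user interface system",
        "system user system",
        "graph trees",
        "graph minors trees"]

# Vocabulary in a fixed (sorted) order so that the rows are reproducible.
vocab = sorted(set(word for doc in docs for word in doc.split()))

# One row per word, one column per document, each cell a word count.
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        A[vocab.index(word), j] = count

print(vocab)
print(A)
```
&lt;/pre&gt;
&lt;p&gt;Each row of A is then a word vector and each column a document vector. Most cells are 0, which is why real-world versions of this matrix are stored as sparse matrices.&lt;/p&gt;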
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;To express these vectors in equations:&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1321.image_5F00_319FF371.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5037.image_5F00_thumb_5F00_462542EF.png" alt="image" width="663" height="139" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The meaning of a sentence is the sum of the meanings of all its words, each multiplied by a coefficient; in other words, it is a linear combination of word meanings. The coefficients, the a[i, j], are what we need to solve for in this matrix.&amp;nbsp; The word &amp;amp; document vectors themselves don&amp;rsquo;t change; we are only looking for the coefficients.&lt;/p&gt;
&lt;h2&gt;SVD:&amp;nbsp; Singular Value Decomposition&lt;/h2&gt;
&lt;p&gt;Singular value decomposition is a widely used technique for working with matrix transformations.&amp;nbsp; It is the basis of a number of vector-based methods such as Latent Semantic Analysis (LSA), Principal Component Analysis (PCA) and Independent Component Analysis (ICA).&amp;nbsp; When you work with big data, you will encounter these methods rather frequently, especially &lt;a href="http://en.wikipedia.org/wiki/Principal_components_analysis"&gt;PCA&lt;/a&gt;. More recently, SVD has been used in advanced &lt;a href="http://www.windowsazure.com/en-us/manage/services/hdinsight/recommendation-engine-using-mahout/"&gt;recommendation engines&lt;/a&gt;, among other practical uses.&amp;nbsp; Transforming the matrix in different ways using SVD allows us to look at the meaning of words from different angles, or to tune the knobs to reveal different properties of the matrix.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s go through the process of running SVD on a matrix.&amp;nbsp; If you would like to know more details, consult your linear algebra book. Basically, a matrix can be decomposed into the product of 3 matrices. A is the original matrix, and U and V are the new bases: the same vectors, but with different coefficients. The rows of U are the document vectors, now with different coefficients. W (the NxN matrix in the middle) is diagonal and holds only the singular values.&amp;nbsp; The columns of VT (V transposed) are the word vectors.&lt;/p&gt;
&lt;p&gt;You do need a sparse matrix solver to compute the decomposition, which is a topic people in HPC have been working on for 50 years. There are efficient math libraries that compute these 3 matrices for you.&amp;nbsp; You can find such solvers in PETSc, Trilinos, Python, or Matlab.&amp;nbsp; Your best bet is the NumPy and SciPy libraries in Python that you just installed. There are also higher-level tools you can use, such as the clustering API in NLTK, the Natural Language Toolkit.&amp;nbsp; We will have a lecture on using NLTK with Hadoop in the future.&amp;nbsp; Another package on Hadoop is the &lt;strong&gt;Mahout&lt;/strong&gt; machine learning package, which also has an SVD solver that you can use.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1401.image_5F00_54BB2632.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7762.image_5F00_thumb_5F00_694075B0.png" alt="image" width="632" height="475" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
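&lt;p&gt;As a quick illustration of the decomposition, the sketch below runs NumPy's dense SVD routine on a tiny made-up matrix and multiplies the three pieces back together. The matrix values are hypothetical, and a dense solver is used only because the example is small; a real corpus would call for a sparse one.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

# Hypothetical 3x4 count matrix: 3 documents (rows) by 4 words (columns).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 2.0]])

# numpy returns U, the singular values as a 1-D array, and V already transposed.
U, w, Vt = np.linalg.svd(A, full_matrices=False)
W = np.diag(w)                    # put the singular values on a diagonal

print(w)                          # sorted from largest to smallest
print(np.allclose(A, np.dot(U, np.dot(W, Vt))))
```
&lt;/pre&gt;
&lt;p&gt;The last line checks that multiplying U, W and VT together gives back the original matrix, up to rounding.&lt;/p&gt;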
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;Dimension reduction: Smaller matrix but retain the most important part&lt;/h2&gt;
&lt;p&gt;As we mentioned earlier, the matrix can get rather large when you go to internet scale.&amp;nbsp; Perhaps SVD isn&amp;rsquo;t really the right solution there, but for modest amounts of data, SVD can easily handle the computation of a few hundred million cells in a large sparse matrix.&amp;nbsp; Even so, we will need to do a dimension reduction by shrinking the size of the matrix.&amp;nbsp; The singular values are in decreasing order in the W matrix.&amp;nbsp; We don&amp;rsquo;t have to keep all of them.&amp;nbsp; We can keep a much smaller N and still have a good approximation of the original matrix.&amp;nbsp; In language applications, keeping about 300 is common even when the documents number in the &amp;ldquo;millions.&amp;rdquo;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;In other words, you are reducing the dimensionality by many orders of magnitude.&amp;nbsp; Visually, it is represented by a skinnier U matrix that&amp;rsquo;s as tall as the original one and a flatter Vt matrix that is as wide as the original one. As we recall from matrix multiplication, when you multiply them together, you still get the M x N dimensions back!&amp;nbsp; Now, that is Math Magic by approximation.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4532.image_5F00_6F874C3E.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6278.image_5F00_thumb_5F00_18FE1E30.png" alt="image" width="630" height="474" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
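&lt;p&gt;The shapes involved are easy to check in code. The sketch below uses a randomly generated count matrix (hypothetical data with a fixed seed) and keeps k = 5 singular values, a number chosen only for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

rng = np.random.RandomState(0)                      # fixed seed, hypothetical data
A = rng.poisson(0.3, size=(50, 20)).astype(float)   # 50 documents x 20 words

U, w, Vt = np.linalg.svd(A, full_matrices=False)

k = 5                                      # keep only the k largest singular values
Uk, Wk, Vtk = U[:, :k], np.diag(w[:k]), Vt[:k, :]
Ak = np.dot(Uk, np.dot(Wk, Vtk))           # approximation, same shape as A

print(Uk.shape, Wk.shape, Vtk.shape, Ak.shape)
# Relative error of the approximation (Frobenius norm).
print(np.linalg.norm(A - Ak) / np.linalg.norm(A))
```
&lt;/pre&gt;
&lt;p&gt;The skinny U, small W and flat Vt still multiply out to a 50 x 20 matrix. Raising k lowers the error, at the cost of storing more numbers.&lt;/p&gt;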
&lt;h2&gt;&amp;nbsp;&lt;/h2&gt;
&lt;h2&gt;SVD only needs to be done once&lt;/h2&gt;
&lt;p&gt;If you are adding new documents, there is no need to redo the entire process.&amp;nbsp;&amp;nbsp; The red dots on the right side of the equation (the last row of the U matrix) can be calculated easily from the Au vector, that is, the last row of the original matrix, AKA the newly added document.&lt;/p&gt;
&lt;p&gt;Uu = Au * V * W inverse&lt;/p&gt;
&lt;p&gt;Au is the new document, which is a vector; V is the smaller (reduced) matrix; and W inverse is the inverse of the reduced W.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4048.image_5F00_1F44F4BE.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7840.image_5F00_thumb_5F00_33CA443C.png" alt="image" width="636" height="480" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
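&lt;p&gt;Here is a minimal sketch of that formula, assuming documents are rows and words are columns. The matrix, the new document and the helper name fold_in are all hypothetical. Because W is diagonal, multiplying by its inverse is just a division by the singular values.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

# Hypothetical 4x4 count matrix: 4 documents (rows) by 4 words (columns).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [1.0, 0.0, 0.0, 1.0]])

U, w, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, wk, Vk = U[:, :k], w[:k], Vt[:k, :].T

def fold_in(Au, Vk, wk):
    """Uu = Au * V * W^-1; W is diagonal, so its inverse is a division."""
    return np.dot(Au, Vk) / wk

# Sanity check: folding in an existing document reproduces its row of U.
print(np.allclose(fold_in(A[0], Vk, wk), Uk[0]))

# A new document that was not part of the decomposition.
Au = np.array([0.0, 1.0, 1.0, 1.0])
print(fold_in(Au, Vk, wk))
```
&lt;/pre&gt;
&lt;p&gt;The new row is only an estimate in the existing reduced space. If many new documents with new vocabulary arrive, recomputing the SVD from time to time is the safer choice.&lt;/p&gt;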
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;Distances and Relations&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1884.image_5F00_6387ECBB.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0410.image_5F00_thumb_5F00_5F116BF4.png" alt="image" width="638" height="467" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;As we recall,&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dot products:&lt;/strong&gt; A dot product is simply the projection of one vector onto the other (scaled by the length of the second vector); imagine you have two pencils at an angle to each other: the similarity can be computed by projecting the shadow of the first pencil onto the second one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cosine:&lt;/strong&gt; the normalized dot product; you don&amp;rsquo;t care about the length, just the direction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Euclidean distance:&lt;/strong&gt; take the difference between the two vectors in each coordinate, square the differences, add them together, and take the square root.&lt;/p&gt;
&lt;p&gt;In any case, we are looking for the distance/similarity of vectors; we can calculate the meaning of words by comparing their vectors in the new matrix. The calculation examples continue in the second half of this blog.&lt;/p&gt;
&lt;p&gt;&lt;span style="color: #a5a5a5;"&gt;One note is that Euclidean distance and the angle (the inverse cosine) are metrics. The dot product and the cosine are NOT metrics. A metric should be zero when the two points coincide, it should always be non-negative, and it should obey the triangle inequality, so that the distance A-to-B + B-to-C is always greater than or equal to the distance A-to-C (meaning that a roundabout path cannot be shorter than the direct route). That's not true for cosines; cosines will tell you that some roundabout routes are shorter than the direct one.&lt;/span&gt;&amp;nbsp;&lt;/p&gt;
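&lt;p&gt;The three measures, and the point about metrics, can be checked with a few lines of Python. This is a sketch with three made-up 2-D vectors; treating "1 - cosine" as the cosine-based distance is an assumption made here for the comparison.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean(u, v):
    return np.linalg.norm(u - v)

def angle(u, v):
    # clip guards against rounding slightly outside [-1, 1]
    return np.arccos(np.clip(cosine(u, v), -1.0, 1.0))

print(np.dot(a, b), cosine(a, b), euclidean(a, b), angle(a, b))

# Triangle inequality holds for Euclidean distance and the angle...
print(euclidean(a, b) + euclidean(b, c), euclidean(a, c))
print(angle(a, b) + angle(b, c), angle(a, c))
# ...but "1 - cosine" used as a distance breaks it going a to b to c.
print((1 - cosine(a, b)) + (1 - cosine(b, c)), 1 - cosine(a, c))
```
&lt;/pre&gt;
&lt;p&gt;In the last line the detour through b comes out shorter than the direct route from a to c, which is exactly the behavior that disqualifies the cosine as a metric.&lt;/p&gt;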
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;SVD is one of the more advanced topics and you may not have fully understood it in this blog without a math background.&amp;nbsp; However, the things to remember here are:&lt;/p&gt;
&lt;p&gt;1. &lt;strong&gt;Data that we collect can be turned into a mathematical model. &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Documents, or individual behavior data can be turned into a feature vector.&amp;nbsp; &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. These vectors can be magically transformed and factorized to reveal important characteristics of the matrix and the relations between the various feature vectors.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The example using Python will be walked through in detail in the &lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/04/07/enter-the-big-data-matrix-analyzing-meanings-and-relations-of-everything-2-2.aspx"&gt;2nd part of this blog&lt;/a&gt;, where we will demonstrate mathematically how this computer model can understand that USER = HUMAN.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10409203" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/hadoop/">hadoop</category><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/Big+Data/">Big Data</category><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/python/">python</category></item><item><title>New Breakthrough in Big Data Technologies: the NullSQL Paradigm shift</title><link>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/01/new-breakthrough-in-big-data-technologies-the-nullsql-paradigm-shift.aspx</link><pubDate>Mon, 01 Apr 2013 07:32:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10406643</guid><dc:creator>HPC Trekker</dc:creator><slash:comments>1</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/hpctrekker/rsscomments.aspx?WeblogPostID=10406643</wfw:commentRss><comments>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/01/new-breakthrough-in-big-data-technologies-the-nullsql-paradigm-shift.aspx#comments</comments><description>&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0564.image_5F00_41CAA42B.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4774.image_5F00_thumb_5F00_588C7C65.png" alt="image" width="687" height="388" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: large;"&gt;Mammoth the NullSQL tool&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: small;"&gt;Most of us by now understand the properties of big data.&amp;nbsp; Many of us are already working with big data tools, or NoSQL tools such as Hadoop.&amp;nbsp;&amp;nbsp;I've spent a bit&amp;nbsp;of my spare time in the last 2 months&amp;nbsp;working on prototypes of a new set of tools that can help the big data industry accelerate its progress in handling large amounts of data.&amp;nbsp; The goal of the first version, &lt;strong&gt;code named&lt;/strong&gt; &lt;strong&gt;Mammoth&lt;/strong&gt;, is to get the new distributed file system fully implemented and tested.&amp;nbsp; The NullFS, or the Null File System, contains a name node which keeps track of a set of data nodes in a cluster formation. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: small;"&gt;Each of these data nodes contains a &lt;strong&gt;data storage device&lt;/strong&gt; that is &lt;strong&gt;write-only&lt;/strong&gt;; the write-only device has the advantage of being able to hold any amount of data you write onto it.&amp;nbsp; This device is the equivalent of achieving quantum computing in Big Compute. It is the underlying breakthrough for the NullSQL paradigm.&amp;nbsp; I have tested my implementation with over 1 exabyte of real data, including live data from my inbox.&amp;nbsp; The results are promising.&amp;nbsp; In a cluster of 16 nodes, I was able to process more than 1 exabyte of data in minutes. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size: small;"&gt;The architecture of &lt;strong&gt;Mammoth&lt;/strong&gt; is similar to Hadoop: it is portable,&amp;nbsp;distributed, fault tolerant, and scalable.&amp;nbsp; More importantly, it does satisfy all 3 aspects of the CAP theorem: Consistency, Availability, and Partition tolerance.&amp;nbsp;It has two layers, the NullFS layer and the MapRemove layer. The &lt;strong&gt;MapRemove&lt;/strong&gt; layer will optimally copy data to the right device that is free.&amp;nbsp; We've fully implemented the device drivers for both Windows and Unix:&amp;nbsp;/dev/null (UNIX) and&amp;nbsp;NUL (Windows).&amp;nbsp; The job tracker and task tracker can optimally handle batch processing jobs as well as jobs submitted by different users. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6837.image_5F00_012AE86D.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6835.image_5F00_thumb_5F00_20D9C235.png" alt="image" width="848" height="479" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;span style="font-size: small;"&gt;Here&amp;rsquo;s an example of a simple MapRemove job: each data node runs the Map process, writes results to /dev/null, and sends leftover data to the global NullContext; the Remove process simply checks that there are no more bits left to be processed.&amp;nbsp; This simple programming pattern allows constant scaling, big O(16). With constant scaling, you will never need a cluster larger than 16 nodes, as the processing time stays the same regardless of how many nodes you use beyond 16 or how much data you need to process. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5857.image_5F00_75952B2D.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4375.image_5F00_thumb_5F00_03677129.png" alt="image" width="852" height="481" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;NullSQL&lt;/h2&gt;
&lt;p&gt;&lt;span style="font-size: small;"&gt;Mammoth will be available on 2/29 of next year, but pre-alpha versions are already available.&amp;nbsp; The NullSQL paradigm with NullFS and MapRemove allows users to store &lt;strong&gt;any amount of data&lt;/strong&gt;, &lt;strong&gt;write-only&lt;/strong&gt;, on a cluster of fewer than 16 compute nodes at constant speed.&amp;nbsp; I believe it will soon be a valuable tool for many organizations that need to store their data write-only economically.&amp;nbsp; It also opens up new opportunities for cloud companies to provide this as a secure storage service, as write-only is inherently secure.&amp;nbsp; &lt;/span&gt;&lt;/p&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Happy&amp;nbsp; 4/1.&amp;nbsp;&amp;nbsp; @wenmingye&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10406643" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/hadoop/">hadoop</category><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/Big+Data/">Big Data</category><category domain="http://blogs.msdn.com/b/hpctrekker/archive/tags/HDInsight/">HDInsight</category></item><item><title>Make another small step, with the JavaScript Console Pig in HDInsight</title><link>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/01/make-another-small-step-with-the-javascript-console-pig-in-hdinsight.aspx</link><pubDate>Mon, 01 Apr 2013 06:06:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10406635</guid><dc:creator>HPC Trekker</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/hpctrekker/rsscomments.aspx?WeblogPostID=10406635</wfw:commentRss><comments>http://blogs.msdn.com/b/hpctrekker/archive/2013/04/01/make-another-small-step-with-the-javascript-console-pig-in-hdinsight.aspx#comments</comments><description>&lt;p&gt;Our previous blog, &lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/mapreduce-on-27-000-books-using-multiple-storage-account-and-hdinsight.aspx"&gt;MapReduce on 27,000 books using multiple storage accounts and HDInsight&lt;/a&gt;, showed you how to run the Java version of the MapReduce code against the Gutenberg dataset we uploaded to blob storage.&amp;nbsp; We also explained how you can add multiple storage accounts and access them from your HDInsight cluster.&amp;nbsp; In this blog, we&amp;rsquo;ll take a smaller step and show you how this works with the JavaScript example, and see if it can operate on a real dataset.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The JavaScript Console gives you simpler syntax and a convenient web interface.&amp;nbsp; You can do quick tasks such as running a query and checking on your data without having to RDP into your HDInsight cluster head node.&amp;nbsp;&amp;nbsp; It is for convenience only; it is not meant for complex workflows.&amp;nbsp; The JavaScript Console has a few features built in, including HDFS commands such as ls, mkdir, and file copy.&amp;nbsp; Moreover, it allows you to invoke Pig commands.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Let's go through the process of running the Pig script against the entire Gutenberg collection. We first uploaded the MapReduce word count file, WordCount.js&amp;nbsp;&amp;nbsp; &lt;a href="https://github.com/wenming/BigDataSamples/raw/master/gutenberg/WordCount.js"&gt;[link]&lt;/a&gt;, by typing fs.put(), which brings up a dialog box for you to upload the WordCount.js file.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5658.image_5F00_7AF82F04.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5672.image_5F00_thumb_5F00_169CBAFB.png" alt="image" width="782" height="403" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Next, you can verify that the WordCount.js file has been uploaded properly by typing #cat /user/admin/WordCount.js.&amp;nbsp; As you may have noticed, HDFS commands that normally look like&amp;nbsp; &lt;span style="color: #c0504d;"&gt;hdfs dfs -&lt;/span&gt;ls have been abstracted to &lt;span style="color: #c0504d;"&gt;#&lt;/span&gt;&lt;span style="color: #000000;"&gt;ls.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We then ran a Pig command to kick off a set of map reduce operations.&amp;nbsp; The JavaScript below is compiled into Pig Latin and then executed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;pig.from("asv://textfiles@gutenbergstore.blob.core.windows.net/").mapReduce("/user/admin/WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("DaVinciTop10")&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Load files from ASV storage. Notice the format: asv://container@storageAccountURL.&lt;/li&gt;
&lt;li&gt;Run MapReduce on the dataset using WordCount.js. The results are key-value pairs of word and count.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Sort the key-value pairs by descending count.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Copy the top 10 values to the DaVinciTop10 directory in the default HDFS.&lt;/li&gt;
&lt;/ol&gt;
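&lt;p&gt;To make the sort and take steps concrete, here is a minimal plain-JavaScript sketch that mimics them on a small in-memory array. It is not the HDInsight pig API: the sample counts and the topN helper are made up for illustration, and the real job runs these steps as distributed Pig operations.&lt;/p&gt;
&lt;pre&gt;

```javascript
// Hypothetical stand-in for the output of the mapReduce step:
// one { word, count } record per distinct word (sample values are made up).
var counts = [
    { word: "the", count: 120 },
    { word: "whale", count: 7 },
    { word: "and", count: 85 },
    { word: "of", count: 90 }
];

// Mimics orderBy("count DESC").take(n): sort a copy by descending
// count, then keep the first n records.
function topN(records, n) {
    return records
        .slice()
        .sort(function (a, b) { return b.count - a.count; })
        .slice(0, n);
}

var top2 = topN(counts, 2);
console.log(top2.map(function (r) { return r.word + "\t" + r.count; }).join("\n"));
```

&lt;/pre&gt;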
&lt;p&gt;This process may take tens of minutes to complete, since the dataset is rather large.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4604.image_5F00_61902BCA.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5265.image_5F00_thumb_5F00_1D4FC47E.png" alt="image" width="783" height="296" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The View Log link provides detailed progress logs.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1067.image_5F00_7D34B7C0.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8311.image_5F00_thumb_5F00_6AEBF0FE.png" alt="image" width="795" height="215" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;You can also check the progress by connecting to the head node over RDP, which gives you more detailed progress than the &amp;ldquo;View Log&amp;rdquo; link on the JavaScript Console.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0871.image_5F00_7CC884CB.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8308.image_5F00_thumb_5F00_1F8C4D3A.png" alt="image" width="787" height="617" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Click on the Reduce link in the table above to check on the reduce job; notice that there are shuffle and sort processes.&amp;nbsp; Shuffle is basically the process in which each reducer is fed the output from all the mappers that it needs to process.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0878.image_5F00_4A64B14C.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7725.image_5F00_thumb_5F00_5836F747.png" alt="image" width="794" height="185" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Click on the Counters link; a significant amount of data is read and written in this process.&amp;nbsp; The nice thing about Map Reduce jobs is that you can speed up the process by adding more compute resources. The mapping phase can be sped up significantly by running more processes in parallel.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5672.image_5F00_66093D42.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6242.image_5F00_thumb_5F00_45EE3085.png" alt="image" width="537" height="434" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When everything finishes, the summary page tells us that the Pig script really consisted of 5 different jobs, 07 &amp;ndash; 11.&amp;nbsp; For learning purposes, I&amp;rsquo;ve posted my results at:&amp;nbsp; &lt;a title="https://github.com/wenming/BigDataSamples/blob/master/gutenberg/results.txt" href="https://github.com/wenming/BigDataSamples/blob/master/gutenberg/results.txt"&gt;https://github.com/wenming/BigDataSamples/blob/master/gutenberg/results.txt&lt;/a&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0407.image_5F00_25D323C8.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0572.image_5F00_thumb_5F00_138A5D06.png" alt="image" width="883" height="451" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The JavaScript Console also provides you with simple graph functions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;file = fs.read("DaVinciTop10")&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;data = parse(file.data, "word, count:long")&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;graph.bar(data)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;graph.pie(data)&lt;/strong&gt;&lt;/p&gt;
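&lt;p&gt;The parse call above turns the raw file text into records according to the schema string "word, count:long". As a rough illustration of that idea only, the sketch below parses tab-separated lines in plain JavaScript. The parseRows function is hypothetical and is not the console's own implementation, and the tab-separated layout of the stored results is an assumption.&lt;/p&gt;
&lt;pre&gt;

```javascript
// Hypothetical schema-driven parser, assuming the stored results are
// tab-separated lines such as "the\t120".
function parseRows(text, schema) {
    // "word, count:long" becomes two field descriptors:
    // { name: "word", type: undefined } and { name: "count", type: "long" }.
    var fields = schema.split(",").map(function (f) {
        var parts = f.trim().split(":");
        return { name: parts[0], type: parts[1] };
    });
    return text.split("\n")
        .filter(function (line) { return line.trim() !== ""; })
        .map(function (line) {
            var cols = line.split("\t");
            var row = {};
            fields.forEach(function (field, i) {
                // Fields typed as "long" are converted to integers.
                row[field.name] = field.type === "long" ? parseInt(cols[i], 10) : cols[i];
            });
            return row;
        });
}

var rows = parseRows("the\t120\nof\t90\n", "word, count:long");
console.log(rows);
```

&lt;/pre&gt;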
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8715.image_5F00_01419644.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0804.image_5F00_thumb_5F00_280FAC84.png" alt="image" width="883" height="688" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;When we compare the entire Gutenberg collection with just the Davinci.txt file, there&amp;rsquo;s a significant difference.&amp;nbsp; With our new data, we can certainly estimate the occurrences of these top words in the English language more accurately than by looking through just one book.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0882.image_5F00_20F0700C.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6761.image_5F00_thumb_5F00_3575BF8A.png" alt="image" width="360" height="344" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;More data always gives us more confidence; that&amp;rsquo;s why big data processing is so important.&amp;nbsp; When it comes to processing large amounts of data, parallel big data processing tools such as HDInsight (Hadoop) can deliver results faster than running them on single workstations.&amp;nbsp; Map Reduce is like the assembly language of Big Data; higher-level languages such as Pig Latin can be decomposed into a series of map reduce jobs for us.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10406635" width="1" height="1"&gt;</description></item><item><title>MapReduce on 27,000 books using multiple storage accounts and HDInsight</title><link>http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/mapreduce-on-27-000-books-using-multiple-storage-account-and-hdinsight.aspx</link><pubDate>Sun, 31 Mar 2013 04:36:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10406529</guid><dc:creator>HPC Trekker</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/hpctrekker/rsscomments.aspx?WeblogPostID=10406529</wfw:commentRss><comments>http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/mapreduce-on-27-000-books-using-multiple-storage-account-and-hdinsight.aspx#comments</comments><description>&lt;p&gt;In our previous blog, &lt;a title="Preparing and uploading datasets for HDInsight" href="http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/preparing-and-uploading-datasets-for-hdinsight.aspx"&gt;Preparing and uploading datasets for HDInsight&lt;/a&gt;, we showed you some of the important utilities that are used on the Unix platform for data processing.&amp;nbsp; These include GNU Parallel, find, split, and AzCopy for uploading large amounts of data reliably.&amp;nbsp; In this blog, we&amp;rsquo;ll use an HDInsight cluster to operate on the data we have uploaded.&amp;nbsp; Just to review, 
here&amp;rsquo;s what we have done so far:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/finding-and-pre-processing-datasets-for-use-with-hdinsight.aspx"&gt;Downloaded ISO Image from Gutenberg, copied the content to a local dir.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/finding-and-pre-processing-datasets-for-use-with-hdinsight.aspx"&gt;Crawled the INDEXES pages and copied English only books (zips) using a custom Python script.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/preparing-and-uploading-datasets-for-hdinsight.aspx"&gt;Unzipped all the zip files using GNU Parallel, then combined all the text files and split them into 256 MB chunks using find and split.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/preparing-and-uploading-datasets-for-hdinsight.aspx"&gt;Uploaded the files in parallel using the AzCopy Utility.&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;Map Reduce&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5224.image_5F00_1EE8F119.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0830.image_5F00_thumb_5F00_45B70759.png" alt="image" width="825" height="466" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Map reduce is the programming pattern for HDInsight, or Hadoop.&amp;nbsp; It has two functions, Map and Reduce.&amp;nbsp; Map takes the code to each of the nodes that contain the data to run computation in parallel, while Reduce summarizes the results from the map functions to do a global reduction.&lt;/p&gt;
&lt;p&gt;In this JavaScript word count example, the map function below simply splits a text document into an array of words and writes each word to a global context. The map function takes three parameters: key, value, and a global context object. Keys in this case are individual files, while the value is the actual content of the documents. The map function is called on every compute node in parallel.&amp;nbsp; As you may have noticed, it writes a key-value pair out to the global context, with the word as the key and 1 as the value, since each occurrence counts once. Obviously, the output from the mapper could contain many duplicate keys (words).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;// Map Reduce function in JavaScript &lt;br /&gt;var map = function (key, value, context) { &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; var words = value.split(/[^a-zA-Z]/); &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (var i = 0; i &amp;lt; words.length; i++) { &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; if (words[i] !== "") &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; context.write(words[i].toLowerCase(), 1); &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; } &lt;br /&gt;}; &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The reduce function also takes key, values, and context parameters and is called when the map function completes. In this case, it takes the output from all the mappers and sums up all the values for a particular key. In the end you get word:count key-value pairs.&amp;nbsp; This gives you a good feel for how map reduce works.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;var reduce = function (key, values, context) { &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; var sum = 0; &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; while (values.hasNext()) { &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; sum += parseInt(values.next()); &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; } &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; context.write(key, sum); &lt;br /&gt; }; &lt;/p&gt;
&lt;/blockquote&gt;
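&lt;p&gt;To see how the two functions fit together without a cluster, here is a minimal local sketch. It restates map and reduce in the same shape as above and adds a hypothetical runLocally driver that groups the mapper output by key (a stand-in for the shuffle) before calling reduce once per key. The driver, the context objects, and the sample documents are assumptions for illustration, not how Hadoop schedules the work.&lt;/p&gt;
&lt;pre&gt;

```javascript
// Word count map and reduce, in the same shape as the functions above.
var map = function (key, value, context) {
    value.split(/[^a-zA-Z]/).forEach(function (word) {
        if (word !== "") {
            context.write(word.toLowerCase(), 1);
        }
    });
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local driver: runs map over each document, groups the
// emitted values by key (the "shuffle"), then calls reduce once per key.
function runLocally(documents) {
    var groups = Object.create(null);
    Object.keys(documents).forEach(function (name) {
        map(name, documents[name], {
            write: function (k, v) {
                if (!groups[k]) { groups[k] = []; }
                groups[k].push(v);
            }
        });
    });
    var results = {};
    Object.keys(groups).forEach(function (k) {
        var i = 0;
        // Minimal iterator exposing the hasNext/next calls that reduce expects.
        var iterator = {
            hasNext: function () { return i !== groups[k].length; },
            next: function () { var v = groups[k][i]; i += 1; return v; }
        };
        reduce(k, iterator, { write: function (key, sum) { results[key] = sum; } });
    });
    return results;
}

var result = runLocally({ "a.txt": "The cat and the hat", "b.txt": "the end" });
console.log(result);
```

&lt;/pre&gt;
&lt;p&gt;Note that the duplicate keys emitted by the mappers only collapse into a single count per word after the grouping step, which is why the reduce function receives an iterator of values rather than a single value.&lt;/p&gt;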
&lt;p&gt;To run map reduce against the dataset we have uploaded, we have to add the blob container in the cluster&amp;rsquo;s configuration page.&amp;nbsp; If you are trying to learn how to create a new cluster, please take a look at this video:&amp;nbsp; &lt;a title="Creating your first HDInsight cluster and run samples" href="http://channel9.msdn.com/Series/Getting-started-with-Windows-Azure-HDInsight-Service/Creating-your-first-HDInsight-cluster-and-run-samples"&gt;Creating your first HDInsight cluster and run samples&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;HDInsight&amp;rsquo;s Default Storage Account: Windows Azure Blob Storage&lt;/h2&gt;
&lt;p&gt;The diagram below explains the difference between HDFS, the distributed file system native to Hadoop, and Azure blob storage. Our engineering team had to do extra work to make the Azure blob storage system work with Hadoop.&lt;/p&gt;
&lt;p&gt;The original HDFS uses many local disks on the cluster, while Azure blob storage is a remote storage system for all the compute nodes in the cluster. For beginners, all you have to know is that the HDInsight team has abstracted both systems for you through the HDFS tool. You should use Azure blob storage as the default, since all your files will still persist in the remote storage system when you tear down the cluster.&lt;/p&gt;
&lt;p&gt;On the other hand, when you tear down a cluster, the content you store in HDFS on the cluster will disappear with it. So, only store temp data that you don&amp;rsquo;t mind losing in HDFS, or copy it to your blob storage account before you tear down the cluster.&lt;/p&gt;
&lt;p&gt;You can explicitly reference HDFS (local) by using hdfs://, and use asv:/// to reference files in the blob storage system (the default).&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0743.image_5F00_3A214A1A.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8360.image_5F00_thumb_5F00_47F39015.png" alt="image" width="829" height="468" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
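&lt;p&gt;For a non-default storage account, this post addresses a container explicitly in the form asv://container@account.blob.core.windows.net/path. The sketch below assembles and takes apart such URIs in plain JavaScript. The asvUri and parseAsvUri helpers are hypothetical and exist only to illustrate the layout; they are not part of HDInsight.&lt;/p&gt;
&lt;pre&gt;

```javascript
// Hypothetical helper: builds an asv:// URI in the
// asv://container@account.blob.core.windows.net/path form used in this post.
function asvUri(container, account, path) {
    return "asv://" + container + "@" + account + ".blob.core.windows.net/" + (path || "");
}

// Hypothetical helper: splits an explicit asv:// URI back into its parts.
// Returns null for anything else, including the default-container form asv:///path.
function parseAsvUri(uri) {
    var match = /^asv:\/\/([^@\/]+)@([^\/]+)\/(.*)$/.exec(uri);
    if (match === null) { return null; }
    return { container: match[1], host: match[2], path: match[3] };
}

console.log(asvUri("textfiles", "gutenbergstore"));
console.log(parseAsvUri("asv://textfiles@gutenbergstore.blob.core.windows.net/"));
```

&lt;/pre&gt;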
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;Adding an Additional Azure Blob Storage Container to Your HDInsight Cluster&lt;/h2&gt;
&lt;p&gt;On the head node of the HDInsight cluster, in C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml, you need to add:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;lt;property&amp;gt; &lt;br /&gt;&amp;lt;name&amp;gt;fs.azure.account.key.[account_name].blob.core.windows.net&amp;lt;/name&amp;gt; &lt;br /&gt;&amp;lt;value&amp;gt;[account-key]&amp;lt;/value&amp;gt; &lt;br /&gt;&amp;lt;/property&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For example, in my account, I simply copied the default property and added the new name/key pair.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5226.image_5F00_20B946E0.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6786.image_5F00_thumb_5F00_158FBC96.png" alt="image" width="814" height="102" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In the RDP session, using the Hadoop command-line console, we can verify that the new storage can be accessed.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3005.image_5F00_0E70801E.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4456.image_5F00_thumb_5F00_6E557360.png" alt="image" width="585" height="396" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In the JavaScript Console, it works just the same.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0755.image_5F00_22F5CF9C.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4150.image_5F00_thumb_5F00_02DAC2DF.png" alt="image" width="586" height="353" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2&gt;Deploy and Run word count against the second storage&lt;/h2&gt;
&lt;p&gt;Go to the Samples page in the HDInsight Console.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3482.image_5F00_7025C927.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1348.image_5F00_thumb_5F00_60198B21.png" alt="image" width="593" height="313" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Deploy the Word Count sample.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1832.image_5F00_3FFE7E64.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5226.image_5F00_thumb_5F00_46B187E7.png" alt="image" width="593" height="274" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;Modify Parameter 1 to:&amp;nbsp; &lt;strong&gt;asv://textfiles@gutenbergstore.blob.core.windows.net/&lt;/strong&gt;&amp;nbsp; asv:///DaVinciAllTopWords&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7215.image_5F00_6D7F9E27.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2500.image_5F00_thumb_5F00_423B0720.png" alt="image" width="597" height="377" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Navigate all the way back to the main page and click on Job History, then find the job that you just started running.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3404.image_5F00_500D4D1B.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6523.image_5F00_thumb_5F00_48EE10A3.png" alt="image" width="598" height="321" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0755.image_5F00_7AE5B12D.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5545.image_5F00_thumb_5F00_6FBC26E3.png" alt="image" width="71" height="100" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You may also check more detailed progress in the RDP session.&amp;nbsp; Recall that we have 40 files, and there are 16 mappers in total (16 cores) running in parallel.&amp;nbsp; The current status is: 16 complete, 16 running, 8 pending.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6116.image_5F00_36A549E1.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8204.image_5F00_thumb_5F00_2F860D69.png" alt="image" width="643" height="462" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3580.image_5F00_565423A9.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6644.image_5F00_thumb_5F00_1213BC5D.png" alt="image" width="641" height="290" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
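&lt;p&gt;The status numbers add up as follows. This is a small arithmetic sketch using the figures from this post, under the assumption that each of the 40 files is handled by exactly one map task; the variable names are made up for illustration.&lt;/p&gt;
&lt;pre&gt;

```javascript
// Numbers taken from the post: 40 input files and 16 map slots (one per core).
// Assumption: each file is processed by exactly one map task.
var totalTasks = 40;
var slots = 16;

// Number of scheduling "waves" needed if every slot runs one task at a time.
var waves = Math.ceil(totalTasks / slots);

// Status reported in the post: 16 complete, 16 running, 8 pending.
var status = { complete: 16, running: 16, pending: 8 };
var accountedFor = status.complete + status.running + status.pending;

console.log("waves: " + waves);                      // 3
console.log("tasks accounted for: " + accountedFor); // 40
```

&lt;/pre&gt;
&lt;p&gt;In other words, the snapshot was taken during the second of three waves, with the last 8 tasks waiting for a free slot.&lt;/p&gt;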
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The job completed within about 10 minutes, and the results are stored in the DaVinciAllTopWords directory.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3005.image_5F00_18C6C5E0.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0743.image_5F00_thumb_5F00_3F94DC20.png" alt="image" width="640" height="401" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The result is about 256 MB.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4075.image_5F00_21B6581F.png"&gt;&lt;img style="display: inline; background-image: none;" title="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1832.image_5F00_thumb_5F00_53ADF8A9.png" alt="image" width="649" height="109" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We showed you how to configure additional ASV storage on your HDInsight cluster and run Map Reduce jobs against it.&amp;nbsp; This concludes our three-part blog series.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10406529" width="1" height="1"&gt;</description></item><item><title>Preparing and uploading datasets for HDInsight</title><link>http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/preparing-and-uploading-datasets-for-hdinsight.aspx</link><pubDate>Sun, 31 Mar 2013 01:04:40 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10406517</guid><dc:creator>HPC Trekker</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/hpctrekker/rsscomments.aspx?WeblogPostID=10406517</wfw:commentRss><comments>http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/preparing-and-uploading-datasets-for-hdinsight.aspx#comments</comments><description>&lt;p&gt;In the previous blog&amp;#160; &lt;a title="http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/finding-and-pre-processing-datasets-for-use-with-hdinsight.aspx" href="http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/finding-and-pre-processing-datasets-for-use-with-hdinsight.aspx"&gt;http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/finding-and-pre-processing-datasets-for-use-with-hdinsight.aspx&lt;/a&gt;&amp;#160; we went over how to get English-only documents from the Gutenberg DVD.&amp;#160; We showed you the Cygwin Unix emulation environment and also some simple Python code.&amp;#160; The script takes about 15 minutes to run.&amp;#160; Async or some simple task scheduling would probably have saved us some time in copying about 25,000 files.&amp;#160; We will continue to explore some of the Unix utilities that are useful for data processing.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a 
href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2248.image_5F00_32E3BF5E.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3821.image_5F00_thumb_5F00_39995992.png" width="609" height="332" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;After the script finished running, we went into the zips directory and counted the total number of files:&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2678.image_5F00_2AF13E5E.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8535.image_5F00_thumb_5F00_5A42B3E8.png" width="610" height="332" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;Some of these files are in zip format and some are text files.&amp;#160; We could simply upload these files to HDInsight, but in general it is a bad idea to send lots of small files for Hadoop to process.&amp;#160; The best way to do this is to combine all the text files into a smaller set of larger files.&amp;#160; To do this, we need to unzip these files first, and then combine them into similarly sized chunks of about 250 MB each.&amp;#160; 64 MB – 256 MB are common block sizes for Hadoop file systems.&amp;#160; &lt;/p&gt;  &lt;h2&gt;Installing and using Gnu Parallel&lt;/h2&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;As we saw earlier, it was painfully slow to process 25,000 files serially.&amp;#160; Parallelism in this case might help when it comes to unzipping the files, as long as the extraction process is CPU bound.&amp;#160; GNU Parallel is a 
set of Perl scripts that can help you do simple task-based parallelism.&amp;#160; Download them from:&amp;#160; &lt;a title="ftp://ftp.gnu.org/gnu/parallel/" href="ftp://ftp.gnu.org/gnu/parallel/"&gt;ftp://ftp.gnu.org/gnu/parallel/&lt;/a&gt;&amp;#160; You might want to get a more recent version from 2013.&amp;#160; &lt;/p&gt;  &lt;p&gt;Building GNU Parallel is like building any Unix package from source.&amp;#160; You need to unpack, configure, and then run make.&amp;#160; Please make sure these utilities are installed under Cygwin.&amp;#160; If not, you can simply re-run setup.exe from &lt;a title="http://cygwin.com/" href="http://cygwin.com/"&gt;http://cygwin.com/&lt;/a&gt; .&amp;#160;&amp;#160; You should get “make” and “wget” as well.&amp;#160; Unfortunately, GNU Parallel is not part of the Cygwin pre-built packages.&lt;/p&gt;  &lt;p&gt;Open the GNU Parallel archive with WinRAR, copy the directory with Ctrl+C, and paste it into a temp location.&amp;#160; In this case, e:\temp.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1106.image_5F00_40DD415F.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4338.image_5F00_thumb_5F00_0447D373.png" width="609" height="218" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;In the Cygwin shell, navigate to the parallel package directory, and type ./configure;make&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3252.image_5F00_020DDB68.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" 
src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4331.image_5F00_thumb_5F00_2AAC476F.png" width="610" height="332" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;Finally, run make install.&amp;#160; By default, most Unix source packages install everything under /usr/local/&amp;#160; &lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8551.image_5F00_7D4FD300.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8562.image_5F00_thumb_5F00_1AC4B4BE.png" width="611" height="333" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;You can verify this by typing &lt;strong&gt;which parallel&lt;/strong&gt;; it will show you that parallel is in /usr/local/bin (binary).&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2625.image_5F00_55AE7838.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8407.image_5F00_thumb_5F00_0E592246.png" width="621" height="71" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;When we have a large number of files (26,808 total), unzip *.zip does not even work properly.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8883.image_5F00_772DA7C7.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" 
src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3113.image_5F00_thumb_5F00_14A28985.png" width="623" height="138" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;Instead, we can list the files and pipe the list to GNU Parallel.&amp;#160; But first, let’s run&lt;strong&gt; chmod 555 *.zip&lt;/strong&gt; to make sure unzip can read the zip files. &lt;/p&gt;  &lt;p&gt;Then run the command:&amp;#160; &lt;strong&gt;ls *.zip | parallel unzip&lt;/strong&gt;&amp;#160; &lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0160.image_5F00_56A8F8C6.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8103.image_5F00_thumb_5F00_3F7D7E48.png" width="590" height="321" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;We’ll have other chances to use parallel in the future for processing files.&amp;#160; By default, GNU Parallel spawns as many processes as the CPU has cores, in this case 4 parallel processes or tasks.&amp;#160; How much GNU Parallel speeds up the processing depends on the actual workload:&amp;#160; if the work is disk bound, multiple processes won’t help, but if it is CPU bound, they will.&amp;#160; The best way to find out is to run a benchmark with and without GNU Parallel.&amp;#160; You can find more examples here: &lt;a title="http://en.wikipedia.org/wiki/GNU_parallel" href="http://en.wikipedia.org/wiki/GNU_parallel"&gt;http://en.wikipedia.org/wiki/GNU_parallel&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;The Unix “Find” Command&lt;/h2&gt;  &lt;p&gt;The Unix find command is extremely powerful, and you can use it in conjunction with GNU Parallel.&amp;#160; In this step 
of the tutorial, we will need to combine the roughly 27,000 text files we just extracted and prepare them as 250MB blocks to upload to Windows Azure Blob Storage.&amp;#160; Some of these text files are in the root directory, and some are in subdirectories. The easiest way is to combine all the text files into one giant file, and then split it up.&amp;#160; With “find”, we can accomplish this with a single command:&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;find ./ -name "*.txt" -exec cat {} &amp;gt;&amp;gt; giant_file.txt \;&amp;#160;&amp;#160; &lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Note that {} is replaced by the name of each file that find matches, and &amp;gt;&amp;gt; means “append”.&amp;#160; &lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5466.image_5F00_47973B4E.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7433.image_5F00_thumb_5F00_7BCB6494.png" width="584" height="318" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;The text file is about 10GB in size. &lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5873.image_5F00_72DE6306.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2555.image_5F00_thumb_5F00_2B890D14.png" width="584" height="102" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;The Unix Split Command &lt;/h2&gt;  &lt;p&gt;We often need to split a larger file into smaller chunks, and split is a very useful command for doing this with data files.&amp;#160; You can split by number of lines, or simply by total number of chunks. 
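&lt;/p&gt;  &lt;p&gt;To make splitting by chunk count concrete, here is a minimal Python 3 sketch that cuts a file into a fixed number of byte ranges.&amp;#160; It only illustrates the idea and is not how split itself is implemented; the function name split_into_chunks and the chunk_000 style output names are made up for this example.&lt;/p&gt;  &lt;pre&gt;
```python
import os


def split_into_chunks(path, num_chunks, out_dir):
    """Split the file at path into at most num_chunks pieces of nearly equal byte size."""
    total = os.path.getsize(path)
    # Ceiling division, so that num_chunks pieces always cover the whole file.
    chunk_size = max(1, -(-total // num_chunks))
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(path, "rb") as src:
        for index in range(num_chunks):
            data = src.read(chunk_size)
            if not data:
                break  # nothing left to read
            name = os.path.join(out_dir, "chunk_%03d" % index)
            with open(name, "wb") as dst:
                dst.write(data)
            written.append(name)
    return written
```
&lt;/pre&gt;  &lt;p&gt;This sketch cuts at byte boundaries, so a line of text can end up divided between two chunks.&amp;#160; Each chunk is read into memory in one go, which is fine for a sketch but worth changing for very large chunks.&lt;/p&gt;  &lt;p&gt;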
In our case, we need to split the 10GB file into 40 files to get approximately 250MB each.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;mkdir split_files; cd split_files&lt;/strong&gt;&amp;#160;&amp;#160;&amp;#160; creates a directory in which we keep the split files separately, and changes into it.&lt;/p&gt;  &lt;p&gt;In the split_files directory, type&lt;strong&gt; split -n 40 ./giant_file.txt&lt;/strong&gt; to start the process.&amp;#160; &lt;/p&gt;  &lt;p&gt;A list of files with auto-generated file names appears in the split_files directory. The next step is to upload these files to Blob Storage.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3644.image_5F00_6918FB8E.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1581.image_5F00_thumb_5F00_155AA3D5.png" width="635" height="251" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;The Windows Azure AzCopy Utility&lt;/h2&gt;  &lt;p&gt;Download the Windows Azure Blob copy utility from &lt;a title="Github" href="https://github.com/downloads/WindowsAzure/azure-sdk-downloads/AzCopy.zip"&gt;Github&lt;/a&gt;, then copy it to c:\cygwin\usr\local\bin&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4188.image_5F00_61B4C827.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6330.image_5F00_thumb_5F00_5AE403BB.png" width="704" height="509" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;The AzCopy utility can copy files to and from blob storage; you can learn more about it 
at:&amp;#160; &lt;a title="http://blogs.msdn.com/b/windowsazurestorage/archive/2012/12/03/azcopy-uploading-downloading-files-for-windows-azure-blobs.aspx" href="http://blogs.msdn.com/b/windowsazurestorage/archive/2012/12/03/azcopy-uploading-downloading-files-for-windows-azure-blobs.aspx"&gt;http://blogs.msdn.com/b/windowsazurestorage/archive/2012/12/03/azcopy-uploading-downloading-files-for-windows-azure-blobs.aspx&lt;/a&gt;&amp;#160; Or you can get help by typing AzCopy /?&lt;/p&gt;  &lt;p&gt;In our case, we are copying files to a blob container that we created in an Azure blob storage account.&amp;#160; To do so, you need to create a new storage account.&amp;#160; It is recommended that you create it in the East data center, or in whichever data center your HDInsight cluster will be created.&amp;#160; &lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4617.image_5F00_5E1B0D05.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4606.image_5F00_thumb_5F00_00066F8A.png" width="437" height="349" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/7750.image_5F00_27D19709.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2465.image_5F00_thumb_5F00_2B724297.png" width="702" height="368" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;I recommend turning off geo-replication to keep costs low.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a 
href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0815.image_5F00_487AF15F.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2577.image_5F00_thumb_5F00_35C88859.png" width="702" height="190" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;You can add or remove containers by selecting the Containers tab in the gutenbergstore storage account.&amp;#160; &lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3566.image_5F00_4B485865.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4075.image_5F00_thumb_5F00_71ACCC61.png" width="708" height="350" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;We will be using the command below to upload all the files to blob storage.&amp;#160; Please note that you should supply your own storage account key with the /destKey switch.&amp;#160; The storage account keys are available from “manage keys” in the bottom tool bar.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3414.image_5F00_6F06A161.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3414.image_5F00_thumb_5F00_17A50D69.png" width="713" height="382" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;wenmingy@LPA-MSFT /cygdrive/e/temp/INDEXES/zips/split_files    &lt;br /&gt; $ &lt;strong&gt;AzCopy ./ &lt;/strong&gt;&lt;a 
href="https://gutenbergstore.blob.core.windows.net/textfiles/"&gt;&lt;strong&gt;https://gutenbergstore.blob.core.windows.net/textfiles/&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;&amp;#160; &lt;/strong&gt;&lt;strong&gt;/destKey:yourKey&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Note that ./ means the current directory, and the destination is the URL of the blob container. &lt;/strong&gt;    &lt;br /&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6153.image_5F00_3850F870.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2844.image_5F00_thumb_5F00_53895171.png" width="723" height="82" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;The AzCopy utility achieves excellent transfer speeds with the default configuration.&amp;#160; We sustained 18Mbps over our Wi-Fi network.&amp;#160; By default, AzCopy uses 8 threads per core for transfers.&amp;#160; &lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2055.image_5F00_72625600.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0218.image_5F00_thumb_5F00_48EE14C0.png" width="725" height="437" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;Once done, you should find the files in Blob storage by inspecting the container using &lt;a href="http://azurestorageexplorer.codeplex.com/"&gt;Azure Storage Explorer&lt;/a&gt;.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8540.image_5F00_4DDAD57E.png"&gt;&lt;img title="image" 
style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8156.image_5F00_thumb_5F00_19CB578D.png" width="560" height="449" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;h2&gt;Conclusion&lt;/h2&gt;  &lt;p&gt;In this part of the tutorial, we learned how to use a few useful Unix tools, including GNU Parallel, find, and split, along with the AzCopy command, to deal efficiently with large files in the Cygwin environment.&amp;#160; It should also give you a flavor of what data scientists do on a daily basis when it comes to dealing with data files.&amp;#160; We have not touched on how to use HDInsight to do some of these tasks in parallel.&amp;#160;&amp;#160; In the next blog, we will cover MapReduce in detail by doing a word count on the Gutenberg files we just uploaded. &lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>Finding and pre-processing datasets for use with HDInsight</title><link>http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/finding-and-pre-processing-datasets-for-use-with-hdinsight.aspx</link><pubDate>Sat, 30 Mar 2013 22:02:29 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10406500</guid><dc:creator>HPC Trekker</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.msdn.com/b/hpctrekker/rsscomments.aspx?WeblogPostID=10406500</wfw:commentRss><comments>http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/finding-and-pre-processing-datasets-for-use-with-hdinsight.aspx#comments</comments><description>&lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;Free datasets&lt;/h2&gt;  &lt;p&gt;There are many difficult aspects associated with Big Data, and getting a good, clean, well-tagged dataset is the first barrier.&amp;#160; After all, you cannot 
really do much data processing without data!&amp;#160; Many companies have yet to realize and discover the value of their data.&amp;#160; For those of you who want to play with HDInsight, the good news is that there are plenty of free datasets on the internet.&amp;#160; Here are some examples:&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;The Million Song Dataset&lt;/strong&gt;:&amp;#160; A freely available collection of audio features and metadata for a million contemporary popular music tracks.&amp;#160; You can get up to a few terabytes' worth.&amp;#160; This is a very interesting dataset for a recommendation engine.&amp;#160; I have a simple tutorial that walks you through it on WindowsAzure.com at&lt;b&gt; &lt;/b&gt;&lt;a href="http://www.windowsazure.com/en-us/manage/services/hdinsight/recommendation-engine-using-mahout/"&gt;http://www.windowsazure.com/en-us/manage/services/hdinsight/recommendation-engine-using-mahout/&lt;/a&gt;&amp;#160; &lt;strong&gt;&lt;font color="#ffc000"&gt;Example&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Freebase&lt;/strong&gt;: A community-curated database of well-known people, places, and things &lt;a href="http://www.freebase.com/"&gt;http://www.freebase.com/&lt;/a&gt; Freebase stores its data in a graph structure to describe relationships among different topics and objects.&amp;#160; APIs are provided for running &lt;a href="http://wiki.freebase.com/wiki/Mql"&gt;MQL&lt;/a&gt; queries against the graph dataset. Freebase is very useful for data mashups: you can make looking up facts in Freebase part of your big data processing. 
&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Data.Gov&lt;/strong&gt;:&amp;#160; One of my favorite places to find data is &lt;a href="https://explore.data.gov/Other/Data-gov-Catalog/pyv4-fkgv"&gt;&lt;strong&gt;Data.gov&lt;/strong&gt;&lt;/a&gt;.&amp;#160; It contains datasets ranging from census to earthquake to economic data.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Wikipedia&lt;/strong&gt; and its sub projects: &lt;a title="http://en.wikipedia.org/wiki/Wikipedia:Database_download" href="http://en.wikipedia.org/wiki/Wikipedia:Database_download"&gt;http://en.wikipedia.org/wiki/Wikipedia:Database_download&lt;/a&gt;&amp;#160; You can download the entire Wikipedia if you want to.&amp;#160; I am personally guilty of downloading hundreds of gigabytes from Wikipedia for parsing projects a few years ago.&amp;#160; &lt;/p&gt;  &lt;p&gt;These are just a few examples of free datasets you can get from the internet.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;DataMarket&lt;/strong&gt;: Microsoft has its own &lt;a href="https://datamarket.azure.com/"&gt;data market&lt;/a&gt; on Windows Azure, whose datasets you can access through OData.&amp;#160; I have written a &lt;strong&gt;hands on lab&lt;/strong&gt; on how to get data into HDInsight here: &lt;a href="https://github.com/WindowsAzure-TrainingKit/HOL-WindowsAzureHDInsight/blob/master/HOL.md"&gt;https://github.com/WindowsAzure-TrainingKit/HOL-WindowsAzureHDInsight/blob/master/HOL.md&lt;/a&gt;&amp;#160;&amp;#160; &lt;strong&gt;&lt;font color="#ffc000"&gt;Example&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;h2&gt;An example: Getting the Gutenberg free books dataset&lt;/h2&gt;  &lt;p&gt;If we want a large, well-structured dataset for a simple word count example, where would we go to find one?&amp;#160; We might want to crawl the web, or search our own computer for text documents, but that can be quite involved.&amp;#160; Luckily,&lt;strong&gt;&amp;#160;&lt;/strong&gt;&lt;a 
href="http://www.gutenberg.org/"&gt;&lt;strong&gt;Gutenberg.org&lt;/strong&gt;&lt;/a&gt; has a huge collection of free books. The largest and latest dataset can be downloaded; here’s the link to one of their DVD ISO images that contain much of the archive: &lt;a href="ftp://snowy.arsc.alaska.edu/mirrors/gutenberg-iso"&gt;ftp://snowy.arsc.alaska.edu/mirrors/gutenberg-iso&lt;/a&gt; I recommend the ISO image pgdvd042010.iso, which is about 8 GB.&amp;#160; Once you have downloaded the ISO, take a look inside.&amp;#160; If you have an older version of Windows, you might want to use WinRAR to peek inside; Windows 8 mounts it for you as a drive automatically.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4062.image_5F00_4A4ADC04.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8357.image_5F00_thumb_5F00_7118F244.png" width="431" height="164" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;The ISO contains a web interface, INDEX.HTML, and many books in various formats. Some books have images, so they are in HTML or RTF format; some are simply text without any illustrations. There are also books in different languages, encoded in various formats.&amp;#160; As you can see, it becomes a big job to even prepare this modest dataset. 
In our case, we would like to get all the English text files, and for that we will have to use the metadata from the index pages as much as we can.&amp;#160; Click on the DVD and open the INDEX.HTML page by right-clicking it and choosing Open With Internet Explorer.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6204.image_5F00_7EEB383F.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5226.image_5F00_thumb_5F00_0CBD7E3B.png" width="521" height="118" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;Now feel free to browse the index pages.&amp;#160; You will soon discover that the pages actually contain a good clue as to whether a book is in English or not.&amp;#160; &lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3005.image_5F00_1A8FC436.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0456.image_5F00_thumb_5F00_0C514B46.png" width="569" height="362" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;Click on Browse by title:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6523.image_5F00_331F6186.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/2821.image_5F00_thumb_5F00_130454C9.png" width="573" height="334" 
/&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;The HTML page for each individual file:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6116.image_5F00_4EC3ED7C.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0743.image_5F00_thumb_5F00_47A4B104.png" width="570" height="417" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;The best way to get all the English text is to write a simple crawler and copy all the English documents into a directory for further processing.&amp;#160; The code is relatively straightforward, and I spent a couple of hours writing it in Python.&amp;#160; You are welcome to use your favorite language.&amp;#160; For those of you who are not familiar with the Python programming language, it is one of the most popular scripting languages for data processing.&amp;#160; The syntax is simple but may not be as intuitive to C and C# developers.&amp;#160; Full source is available on GitHub: &lt;a title="https://github.com/wenming/BigDataSamples/blob/master/gutenberg/gutenbergcrawl.py" href="https://github.com/wenming/BigDataSamples/blob/master/gutenberg/gutenbergcrawl.py"&gt;https://github.com/wenming/BigDataSamples/blob/master/gutenberg/gutenbergcrawl.py&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;First section of the code:&amp;#160; &lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4075.image_5F00_6E72C744.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1832.image_5F00_thumb_5F00_355BEA42.png" width="472" 
height="144" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;We need an HTML parser (HTMLParser, htmlentitydefs), a crawler or HTTP client (urllib), and commands for copying files.&amp;#160; Please note we are using Python 2.7, not 3.0.&amp;#160; Python 2.7 is still the most popular version at this time.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4578.image_5F00_432E303D.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1030.image_5F00_thumb_5F00_69FC467D.png" width="498" height="379" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;Let’s look at the main function first.&amp;#160; It makes a directory called zips, which is where all the final ZIP files will go if they qualify as English text. &lt;/p&gt;  &lt;p&gt;Then it gets a list of title pages (a-z, other).&amp;#160; These HTML index pages reside in the INDEXES directory, and for each of these files we unleash our crawler by running the function get_english_only_urls().&amp;#160; The TITLES_*HTML pages list each book, its language, and the location of the book’s property HTML file, which in turn contains the URL of the book’s actual text file.&lt;/p&gt;    &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8875.image_5F00_29C62D03.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3404.image_5F00_thumb_5F00_1E9CA2B9.png" width="773" height="357" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;The next step is to open each of the TITLES*.HTML pages, find the URLs for each of the book’s 
HTML page, and then process them one by one.&amp;#160; Notice TitleFilesHTMLParser is called here.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/3808.image_5F00_177D6641.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4478.image_5F00_thumb_5F00_497506CB.png" width="775" height="338" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;When the title page parser is called, it only adds books that have an (English) tag next to them for processing.&amp;#160; &lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1325.image_5F00_05349F7F.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/4062.image_5F00_thumb_5F00_651992C1.png" width="757" height="532" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;Now we parse the HTML page for each individual book.&amp;#160; Notice we are calling BookPropertyHTMLParser() on each of the HTML pages (one per book).&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8204.image_5F00_0BE7A902.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/8765.image_5F00_thumb_5F00_3DDF498C.png" width="947" height="225" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;The 
BookPropertyHTMLParser simply finds the &amp;lt;a href=”…..*.zip&amp;gt; tags and copies the matching zip files to the zips directory.&amp;#160; The code is a stub for future modification in case image, HTML, and PDF files need to be processed. &lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6622.image_5F00_799EE23F.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0358.image_5F00_thumb_5F00_5983D582.png" width="749" height="426" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;    &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;Running the sample:&lt;/h2&gt;  &lt;p&gt;Extract the ISO content into a temp directory.&amp;#160; On Windows 8 you can simply select everything (Ctrl+A), then copy and paste into a temp directory.&amp;#160;&amp;#160; On other versions, you can use WinRAR for the extraction.&amp;#160; &lt;/p&gt;  &lt;p&gt;Install Cygwin:&amp;#160; &lt;a title="http://cygwin.org/" href="http://cygwin.org/"&gt;http://cygwin.org/&lt;/a&gt;&amp;#160; Make sure Python 2.7 is installed, and also make sure unzip is installed.&amp;#160; It certainly helps to learn some basic Unix commands if you are dealing with lots of text files and raw data files.&lt;/p&gt;  &lt;p&gt;Get the code:&amp;#160; &lt;a title="https://raw.github.com/wenming/BigDataSamples/master/gutenberg/gutenbergcrawl.py" href="https://raw.github.com/wenming/BigDataSamples/master/gutenberg/gutenbergcrawl.py"&gt;https://raw.github.com/wenming/BigDataSamples/master/gutenberg/gutenbergcrawl.py&lt;/a&gt;,&amp;#160; and put the file in the root/INDEXES directory of the Gutenberg files. 
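&lt;/p&gt;  &lt;p&gt;Before running the crawler, it can help to confirm that the script really sits next to the title index pages.&amp;#160; A minimal Python 3 sketch, assuming the title pages match the pattern TITLES_*.HTML (my reading of the file names mentioned above); find_title_pages is a hypothetical helper and is not taken from gutenbergcrawl.py:&lt;/p&gt;  &lt;pre&gt;
```python
import glob
import os


def find_title_pages(indexes_dir):
    """Return the sorted title index pages found in indexes_dir.

    The TITLES_*.HTML pattern is an assumption about the file names on the DVD.
    """
    pattern = os.path.join(indexes_dir, "TITLES_*.HTML")
    return sorted(glob.glob(pattern))


# Report how many title pages are visible from the current directory.
print("Found %d title pages" % len(find_title_pages(".")))
```
&lt;/pre&gt;  &lt;p&gt;If this reports zero pages when run from the INDEXES directory, the pattern or the directory is probably not what the crawler expects. 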
&lt;/p&gt;  &lt;p&gt;Open the Cygwin terminal and change directory to where you extracted your files.&amp;#160; In my case I stored the data in E:\temp.&amp;#160; In Cygwin, you’ll see your drive in a special format:&amp;#160; /cygdrive/e/temp/INDEXES (Unix-style directory). &lt;/p&gt;  &lt;p&gt;Use cd e: to change drive letters in the Cygwin command window.&amp;#160; &lt;/p&gt;  &lt;p&gt;Note also that your Python script needs execute permission.&amp;#160; In a Unix shell, you change it by setting the permission bits, which come in the order read, write, execute (rwx).&amp;#160; We want this file to be rwx for yourself, so the bits are 111, and r-x (101) for group and world.&amp;#160; The full pattern is therefore:&amp;#160;&amp;#160; 111 (you/user)&amp;#160; 101 (group)&amp;#160;&amp;#160; 101 (everyone).&amp;#160;&amp;#160; Read as octal digits, that is 755, hence the chmod 755 command. &lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/1030.image_5F00_0051EBC3.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/0842.image_5F00_thumb_5F00_6036DF05.png" width="668" height="364" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;Now, we are simply going to run gutenbergcrawl.py by executing &lt;strong&gt;./gutenbergcrawl.py&lt;/strong&gt;&amp;#160; Why the ./ in front?&amp;#160; It specifies that you want to run the command from the current directory.&amp;#160; By default, bash does not search the current directory for commands, for security reasons; otherwise you might accidentally run a script that you don’t intend to. 
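&lt;/p&gt;  &lt;p&gt;As a rough illustration of the permission arithmetic described above, the following Python 3 sketch converts an rwx string into the octal digits that chmod expects.&amp;#160; It is not part of gutenbergcrawl.py, and the helper name rwx_to_octal is made up for this example.&lt;/p&gt;  &lt;pre&gt;
```python
def rwx_to_octal(perms):
    """Convert a 9-character string such as 'rwxr-xr-x' to octal digits such as '755'."""
    if len(perms) != 9:
        raise ValueError("expected 9 characters, e.g. 'rwxr-xr-x'")
    digits = []
    for start in (0, 3, 6):  # user, group, everyone
        triplet = perms[start:start + 3]
        # A letter means the bit is 1, a dash means it is 0.
        bits = "".join("0" if ch == "-" else "1" for ch in triplet)
        digits.append(str(int(bits, 2)))
    return "".join(digits)


print(rwx_to_octal("rwxr-xr-x"))  # prints 755
```
&lt;/pre&gt;  &lt;p&gt;The same helper gives 555 for r-xr-xr-x, which is the mode used with chmod on the zip files in the follow-up post. 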
&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/5460.image_5F00_0704F546.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-03-28-metablogapi/6153.image_5F00_thumb_5F00_7FE5B8CD.png" width="668" height="361" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;h2&gt;Conclusion&lt;/h2&gt;  &lt;p&gt;In this tutorial, we showed you how to find datasets on the internet for demos and examples.&amp;#160; We went through a detailed example of pre-processing Gutenberg datasets.&amp;#160; In this example, we used Cygwin, which is a Unix emulation environment for Windows, and we showed you some simple Python code for parsing HTML pages to extract meta data.&amp;#160;&amp;#160; This gives you a flavor of what data scientists do when it comes to getting and pre-processing datasets.&lt;/p&gt;  &lt;p&gt;In later blogs, we will explore how we can use this dataset to run word count on HDInsight. &lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item></channel></rss>