How to Install Hadoop on a Linux-based Windows Azure Virtual Machine

 


Introduction
The purpose of this blog is to create a very cost-effective, single-node Hadoop cluster for testing and development purposes. This allows you to test and develop without needing to provision a large, expensive cluster. Once some testing and development is complete, it then makes sense to move to a product like HDInsight, a service offered by Microsoft that greatly simplifies the provisioning of a Hadoop cluster and the execution of Hadoop jobs.

The developer who is creating algorithms for the first time typically works with test data just to experiment with logic and code. Once most of the issues have been resolved, it makes sense to go to a full-blown cluster scenario.

Special Thanks to Lance for much of the guidance here: http://lancegatlin.org/tech/centos-6-install-hadoop-from-cloudera-cdh

I've added some extra steps for Java 1.6 to clarify Java setup issues.
Creating a new virtual machine

The assumption is that you have a valid Azure account. You can create a single-node Hadoop cluster using a virtual machine.




Choose Quick Create. Next, you will choose a specific flavor of Linux, called CentOS.

Specify the details. Notice that we need to provide a DNS name, which will need to be globally unique. We will also specify the image type to be OpenLogic CentOS. A small core will work just fine. Finally, provide a password and a location closest to you.



Click on the virtual machine so we can get its URL, which will be needed to connect to it. Then select DASHBOARD. You can see that, in my case, the DNS name for my server is centos-hadoop.cloudapp.net. You can remote in from Putty.
Now we are ready to log into the VM to set it up.

Start by downloading Putty.


Download Putty



Click Open, and you can now log into the remote Linux machine.
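If you are not on Windows, or simply prefer a command-line client, any standard SSH client can make the same connection. A minimal sketch, assuming the DNS name from above and a hypothetical account name of azureuser (use whatever account you created when provisioning the VM):

#Connect over SSH instead of Putty
ssh azureuser@centos-hadoop.cloudapp.net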

Log in, then use the following command to become the mighty and powerful "root" user.

sudo -s

You will need to get Java version 6u26. It used to be easy, but Oracle has made it difficult. wget allows you to download a binary file to the local Virtual Machine (VM).

wget http://[some location from oracle]/jdk-6u26-linux-x64-rpm.bin

chmod makes the binary executable. The second command actually executes the binary file to install Java.

#Once downloaded, you can install Java:
chmod +x jdk-6u26-linux-x64-rpm.bin
./jdk-6u26-linux-x64-rpm.bin

We will need to create a file that will set the proper Java environment variables.

vim /etc/profile.d/java.sh

This should be the contents of java.sh

export JRE_HOME=/usr/java/jdk1.6.0_26/jre
export PATH=$PATH:$JRE_HOME/bin
export JAVA_HOME=/usr/java/jdk1.6.0_26
export JAVA_PATH=$JAVA_HOME
export PATH=$PATH:$JAVA_HOME/bin

Reboot the VM so that the new environment variables are picked up. You will probably need to go back to the Windows Azure Portal to do a proper reboot.

reboot
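After the reboot, it is worth a quick sanity check that Java is on the path and that JAVA_HOME points at the JDK we installed. A minimal check, assuming the paths from java.sh above:

#Verify the Java install
java -version
echo $JAVA_HOME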

Now that Java is configured, let's turn our attention to Hadoop.

Cloudera offers some binaries that we can work with. The following commands download Hadoop and then install it.

The first command registers the Cloudera CDH4 repository with yum so that the Hadoop packages can be found.

#Download binaries to install hadoop
sudo wget -O /etc/yum.repos.d/cloudera-cdh4.repo http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/cloudera-cdh4.repo

yum install then performs the actual installation on CentOS.

#Perform the install
sudo yum install hadoop-0.20-conf-pseudo

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

#Format the name node
sudo -u hdfs hdfs namenode -format

A DataNode stores data in the Hadoop file system (HDFS). A functional filesystem has more than one DataNode, with data replicated across them.

On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.

# Start the namenode and data node services
sudo service hadoop-hdfs-namenode start
sudo service hadoop-hdfs-secondarynamenode start
sudo service hadoop-hdfs-datanode start
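If you want to confirm that the daemons actually came up, you can ask the init script for the NameNode's status, and the NameNode also serves a status page on port 50070 by default. A quick, optional check:

#Confirm the HDFS daemons are running
sudo service hadoop-hdfs-namenode status
curl http://localhost:50070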

This ensures that Hadoop starts up with the OS.

#Make sure that they will start on boot
sudo chkconfig hadoop-hdfs-namenode on
sudo chkconfig hadoop-hdfs-secondarynamenode on
sudo chkconfig hadoop-hdfs-datanode on

Create some directories with the proper permissions.

#Create some directories
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir /user

MapReduce needs some staging directories in HDFS for processing data.

#Create directories for map reduce
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

The JobTracker and TaskTracker are fundamental Hadoop services and need to be started.

#Start map reduce services
sudo service hadoop-0.20-mapreduce-jobtracker start
sudo service hadoop-0.20-mapreduce-tasktracker start
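As with HDFS, you can optionally confirm that MapReduce is up; in this MRv1 setup the JobTracker serves a status page on port 50030 by default:

#Confirm the JobTracker is responding
curl http://localhost:50030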

Make sure the JobTracker and TaskTracker start on boot with the OS.

#Start map reduce on boot
sudo chkconfig hadoop-0.20-mapreduce-jobtracker on
sudo chkconfig hadoop-0.20-mapreduce-tasktracker on

Create a home folder in HDFS for the user who will run jobs.

#Create some home folders
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
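With a home folder in place, a quick smoke test is to copy a small local file into HDFS and list it back. A minimal sketch (/etc/hosts is just a convenient small file; any file will do):

#Copy a small file into HDFS and list it
hadoop fs -put /etc/hosts /user/$USER/hosts
hadoop fs -ls /user/$USER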

Create a shell script that runs at login. It will set up the proper environment variables needed by Hadoop.

#Export hadoop home folders. Or use "nano" as your editor.
vi /etc/profile.d/hadoop.sh

Make sure hadoop.sh looks like this:

#Create hadoop.sh and add this line to hadoop.sh
export HADOOP_HOME=/usr/lib/hadoop

source runs hadoop.sh without requiring a reboot. The variables are now available for use.

#Load into session
source /etc/profile.d/hadoop.sh
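A quick way to confirm the variable is set for the current session:

#Confirm the variable is set
echo $HADOOP_HOME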

Some Hadoop commands to test things out. The last command runs a PI estimation algorithm; the JAR file was installed with Hadoop earlier.

#First test to make sure hadoop is loaded
sudo -u hdfs hadoop fs -ls -R /
#Lets run a hadoop job to validate everything, we will estimate PI
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 10 1000
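If the PI job completes, you can try another job from the same examples JAR, such as a word count. A hypothetical follow-on run, assuming a small input file has already been copied into /user/$USER (adjust the paths to match your own files):

#Run a word count over a file already in HDFS
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount /user/$USER/hosts /user/$USER/wordcount-out
hadoop fs -cat /user/$USER/wordcount-out/part-*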

Conclusion

You can now start and shut down your Azure Virtual Machine as needed. There is no need to accrue expensive charges for keeping your single-node Hadoop cluster up and running.

The portal offers shutdown and start commands, so you only pay for what you use. I've even figured out how to install Hive and a few other Hadoop tools. Happy big data coding!
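If you prefer scripting over the portal for the start and shutdown steps, the cross-platform Azure command-line tool of this era exposed start and shutdown verbs for classic virtual machines. A sketch, assuming the VM is named centos-hadoop and that the classic (service management) tooling is installed; newer tooling uses different commands:

#Stop and start the VM from the command line
azure vm shutdown centos-hadoop
azure vm start centos-hadoop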
