In my previous post on Hadoop I showed how you could easily deploy a cluster to run on Azure. What was missing was a way to efficiently use the cluster. You could always remote desktop to the Job Tracker and kick off a job but there are better ways.
This post is about actually using the cluster once it has been deployed to Azure. I chose the theme of a Development Cluster to justify making a few changes to how I previously configured the cluster and show some new techniques.
As a developer I expect easy access to the development cluster. The goal is to allow developers to safely connect to the cluster to deploy and debug their map/reduce jobs. SSH provides all the necessary tools for this – secure connection and tunneling. SSH not only allows developers to establish a secure session with the cluster in Azure but it also allows for full integration with IDEs making the typical development tasks a breeze.
I will be referring to two of my previous posts on Setting Up a Hadoop Cluster and Configuring SSH. It would be helpful to have them open as you follow the setup instructions below.
In this scenario of a development cluster I will use a single host to run both the Name Node and the Job Tracker. This is obviously not true for every development cluster but suffices for this demo. The number of slaves is initially set to 3. You can dynamically change the cluster size as I demonstrated in my previous post. If you are going to try it you might also want to adjust the VM size to meet your needs.
The procedure for deploying a Hadoop cluster has not changed, but the dependencies have. First is the Hadoop version: I had previously used 0.21, which many development tools do not support because it’s an unstable release. I reverted to the stable line and ended up using 0.20.2. At the time of this writing 0.20.203.0rc1 was out but did not work on Windows. Second, Cygwin needs the OpenSSH package installed to provide the SSH Windows Service (instructions in the SSH post). Finally there is YAJSW. It didn’t technically need to be updated; I just grabbed the latest drop, which is Beta-10.8.
Just follow the instructions from my previous post using the updated dependencies and grab this cluster configuration template and this Visual Studio 2010 project instead. You should have the following files in a container in your storage account:
You should be able to deploy your development cluster by publishing the HadoopAzure project directly from Visual Studio.
A developer needs to connect to the cluster using SSH. I demonstrated how to do that using PuTTY; the only difference here is that we will need to set up a couple of tunnels. This screen shows the two tunnels required to access the Name Node and Job Tracker.
Once you connect and log in you can minimize the PuTTY window. We won’t be using it directly, but it must stay open for the tunnels to remain up.
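If you prefer a command-line OpenSSH client to PuTTY, the same kind of tunnels can be expressed as -L port forwards. Here is a minimal sketch in Python that only builds the command line. The host name, user name and the two port numbers are placeholders I made up for illustration, so substitute the ports your own Name Node and Job Tracker listen on.

```python
import shlex


def build_tunnel_command(user, host, forwards, ssh_port=22):
    """Return an OpenSSH argument list with one -L forward per entry.

    forwards maps a local port to a (remote_host, remote_port) pair,
    where remote_host is resolved from the SSH server's point of view.
    """
    # -N: do not run a remote command, just keep the forwards open.
    cmd = ["ssh", "-N", "-p", str(ssh_port)]
    for local_port, (remote_host, remote_port) in sorted(forwards.items()):
        cmd += ["-L", f"{local_port}:{remote_host}:{remote_port}"]
    cmd.append(f"{user}@{host}")
    return cmd


# Hypothetical values: adjust to match your own cluster configuration.
FORWARDS = {
    8020: ("localhost", 8020),  # assumed Name Node port
    8021: ("localhost", 8021),  # assumed Job Tracker port
}

if __name__ == "__main__":
    command = build_tunnel_command("hadoop", "mycluster.cloudapp.net", FORWARDS)
    print(shlex.join(command))
```

As with PuTTY, the resulting ssh process has to keep running for the tunnels to stay up.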
With the tunnels open to the development cluster you can use it as if it were local.
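Before pointing an IDE at the cluster it can help to confirm that the local ends of the tunnels are actually listening. The following is a rough sketch of such a check. The port numbers are assumptions for illustration, not values taken from the cluster configuration.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_tunnels(ports, host="127.0.0.1"):
    """Map each tunnel name to whether its local port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    # Hypothetical local ports: use whatever you configured for the tunnels.
    tunnels = {"Name Node": 8020, "Job Tracker": 8021}
    for name, ok in check_tunnels(tunnels).items():
        print(f"{name}: {'reachable' if ok else 'not reachable'}")
```

Keep in mind that this only shows the SSH client is listening locally. It does not prove that the Hadoop service on the far side of the tunnel is up.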
I followed the steps in Setting Up a Hadoop Cluster, plus the updated dependencies in this post, but I got an error binding to the endpoints. I tried both InternalEndpoint 8020 and InputEndpoint 50070 but I got the same error:
ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to SunnyvaleHadoop1.cloudapp.net/220.127.116.11:8020 : Cannot assign requested address: bind
ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to SunnyvaleHadoop1.cloudapp.net/18.104.22.168:50070 : Cannot assign requested address: bind
Any advice for the Hadoop binding to endpoint problem above would be appreciated. We'd like to have an early look at Hadoop in Azure.
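"Cannot assign requested address" usually means the address being bound is not assigned to any network interface on that machine. A port that is already in use would normally be reported as "Address already in use" instead. The cloudapp.net name most likely resolves to the deployment's public address, which the instance itself does not own, so check which host name or address the Name Node is configured to bind to.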