The Microsoft distribution of Apache Hadoop comes with direct connectivity to cloud storage, i.e. Windows Azure Blob Storage or Amazon S3. Here we will learn how to access Windows Azure Storage directly from your Hadoop cluster.

 

To learn how to connect your Hadoop cluster to Windows Azure Storage, please read the following blog first:

 

After reading the above blog, set up your Hadoop configuration to connect with Azure Storage and verify that the connection is working. Before running a Hadoop job, be sure you understand the correct asv:// format, as described below:
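For reference, the Azure Storage credentials for Hadoop normally live in core-site.xml. The sketch below uses the property name from the later hadoop-azure (WASB) support; the CTP-era distribution described in this post may use different property names, so treat this purely as an illustration and follow the blog above for the exact steps. The account name and key here are placeholders:

<!-- core-site.xml (sketch only): maps the storage account to its access key -->
<property>
  <name>fs.azure.account.key.<your_account>.blob.core.windows.net</name>
  <value><your_storage_account_key></value>
</property>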

 

When specifying an input or output path on Azure Storage, you must use the following format:

Input

      asv://<container_name>/<symbolic_folder_name>

      Example: asv://hadoop/input

Output

      asv://<container_name>/<symbolic_folder_name>

      Example: asv://hadoop/output

Note: If you use asv://<container_name> alone, without a folder, the job will return an error.
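Because a bad asv:// path only surfaces as an error once the job is submitted, it can be worth sanity-checking the path against the Hadoop FileSystem API first. Here is a minimal sketch; the AsvPathCheck class is ours, written for illustration, and the container and folder names match this walkthrough's example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AsvPathCheck {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml, where the Azure Storage account/key are configured
        Configuration conf = new Configuration();
        // Correct form: asv://<container_name>/<symbolic_folder_name>
        Path input = new Path("asv://hadoop/input");
        FileSystem fs = FileSystem.get(input.toUri(), conf);
        // Listing the folder confirms the path resolves before any job is submitted
        for (FileStatus status : fs.listStatus(input)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
    }
}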

 

Let’s verify in Azure Storage that we do have data in the proper location:
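The same check also works from the cluster’s Hadoop command prompt; assuming the configuration above is in place, a listing along these lines should show the input file:

call hadoop.cmd fs -ls asv://hadoop/input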

 

 

The contents of the file HelloWorldBlob.txt are as below:

This is Hello World

I like Hello World

Hello Country

Hello World

Love World

World is Love

Hello World

 

 

Now let’s run the simple WordCount MapReduce job, using HelloWorldBlob.txt as the input file and storing the results in Azure Storage as well.

 

 

Job Command:

call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar wordcount asv://hadoop/input asv://hadoop/output
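The wordcount class invoked above ships in the Hadoop examples jar, so nothing needs to be written to follow this post. For readers curious what it actually does, here is a minimal WordCount sketch in the style of the stock example (not the exact shipped source):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every whitespace-separated token in each line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts for each word; also used as the combiner
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // asv:// paths work here exactly like hdfs:// paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}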

 

Once the job completes, the following screenshot shows the output:

 

Opening part-r-00000 shows the results as below (note the byte-order sort, which places uppercase words such as "World" ahead of lowercase ones such as "is"):

  • Country 1
  • Hello 5
  • I 1
  • Love 2
  • This 1
  • World 6
  • is 2
  • like 1
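The part file can also be read directly from the cluster prompt instead of through the portal; assuming the default part file name shown above, a command like this prints it:

call hadoop.cmd fs -cat asv://hadoop/output/part-r-00000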

 

Finally, the Azure HeadNode WebApp shows the following final output for the Hadoop job:

 

WordCount Example


Job Info

Status: Completed Successfully
Type: jar
Start time: 1/5/2012 5:53:49 AM
End time: 1/5/2012 5:55:52 AM
Exit code: 0

Command

call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar wordcount asv://hadoop/input asv://hadoop/output

Output (stdout)

 

Errors (stderr)


12/01/05 05:53:59 INFO mapred.JobClient: Running job: job_201201042206_0001
12/01/05 05:54:00 INFO mapred.JobClient: map 0% reduce 0%
12/01/05 05:54:39 INFO mapred.JobClient: map 100% reduce 0%
12/01/05 05:55:00 INFO mapred.JobClient: map 100% reduce 66%
12/01/05 05:55:30 INFO mapred.JobClient: map 100% reduce 100%
12/01/05 05:55:51 INFO mapred.JobClient: Job complete: job_201201042206_0001
12/01/05 05:55:51 INFO mapred.JobClient: Counters: 25
12/01/05 05:55:51 INFO mapred.JobClient: Job Counters
12/01/05 05:55:51 INFO mapred.JobClient: Launched reduce tasks=1
12/01/05 05:55:51 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=40856
12/01/05 05:55:51 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/05 05:55:51 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/01/05 05:55:51 INFO mapred.JobClient: Rack-local map tasks=1
12/01/05 05:55:51 INFO mapred.JobClient: Launched map tasks=1
12/01/05 05:55:51 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=48433
12/01/05 05:55:51 INFO mapred.JobClient: File Output Format Counters
12/01/05 05:55:51 INFO mapred.JobClient: Bytes Written=56
12/01/05 05:55:51 INFO mapred.JobClient: FileSystemCounters
12/01/05 05:55:51 INFO mapred.JobClient: FILE_BYTES_READ=1134
12/01/05 05:55:51 INFO mapred.JobClient: HDFS_BYTES_READ=102
12/01/05 05:55:51 INFO mapred.JobClient: ASV_BYTES_WRITTEN=56
12/01/05 05:55:51 INFO mapred.JobClient: FILE_BYTES_WRITTEN=44949
12/01/05 05:55:51 INFO mapred.JobClient: File Input Format Counters
12/01/05 05:55:51 INFO mapred.JobClient: Bytes Read=0
12/01/05 05:55:51 INFO mapred.JobClient: Map-Reduce Framework
12/01/05 05:55:51 INFO mapred.JobClient: Reduce input groups=8
12/01/05 05:55:51 INFO mapred.JobClient: Map output materialized bytes=94
12/01/05 05:55:51 INFO mapred.JobClient: Combine output records=8
12/01/05 05:55:51 INFO mapred.JobClient: Map input records=7
12/01/05 05:55:51 INFO mapred.JobClient: Reduce shuffle bytes=0
12/01/05 05:55:51 INFO mapred.JobClient: Reduce output records=8
12/01/05 05:55:51 INFO mapred.JobClient: Spilled Records=16
12/01/05 05:55:51 INFO mapred.JobClient: Map output bytes=178
12/01/05 05:55:51 INFO mapred.JobClient: Combine input records=19
12/01/05 05:55:51 INFO mapred.JobClient: Map output records=19
12/01/05 05:55:51 INFO mapred.JobClient: SPLIT_RAW_BYTES=102
12/01/05 05:55:51 INFO mapred.JobClient: Reduce input records=8

Note how the counters line up with our input: Map input records=7 matches the seven lines of HelloWorldBlob.txt, Map output records=19 is the total number of words, and Reduce output records=8 matches the eight distinct words in the results above.

Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce