The Microsoft distribution of Apache Hadoop comes with direct connectivity to cloud storage, i.e. Windows Azure Blob storage or Amazon S3. Here we will learn how to connect to your Windows Azure Storage directly from your Hadoop cluster.
To learn how to connect your Hadoop cluster to Windows Azure Storage, please read the following blog first:
After reading the above blog, please set up your Hadoop configuration to connect to Azure Storage and verify that the connection is working. Then, before running a Hadoop job, be sure you understand the correct asv:// format, described below:
When an input or output path points to Azure Storage, you must use the following format:
Input
asv://<container_name>/<symbolic_folder_name>
Example: asv://hadoop/input
Output
asv://<container_name>/<symbolic_folder_name>
Example: asv://hadoop/output
Note: If you use only asv://<only_container_name>, without a folder name, the job will return an error.
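As a quick illustration of the format rule above, here is a minimal Python sketch that checks a path before it is handed to a job. The function name check_asv_path is hypothetical and not part of Hadoop or any Azure library; the sketch only assumes the asv://<container_name>/<symbolic_folder_name> shape described in this post.

```python
from urllib.parse import urlsplit


def check_asv_path(uri):
    """Split an asv:// URI into (container, folder).

    Hypothetical helper: raises ValueError when the URI does not follow
    asv://<container_name>/<symbolic_folder_name>.
    """
    parts = urlsplit(uri)
    if parts.scheme != "asv":
        raise ValueError(f"expected an asv:// URI, got: {uri!r}")
    container = parts.netloc
    folder = parts.path.strip("/")
    # A container on its own (asv://hadoop) is the case that makes the job fail.
    if not container or not folder:
        raise ValueError(
            f"need asv://<container_name>/<symbolic_folder_name>, got: {uri!r}"
        )
    return container, folder


for candidate in ("asv://hadoop/input", "asv://hadoop/output", "asv://hadoop"):
    try:
        print(candidate, "->", check_asv_path(candidate))
    except ValueError as err:
        print(candidate, "-> rejected:", err)
```

Running it accepts the two example paths and rejects the container-only one, which is the same mistake the note above warns about.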
Let’s verify in Azure Storage that we have some data in the proper location:
The contents of the file HelloWorldBlob.txt are as below:
This is Hello World
I like Hello World
Hello Country
Hello World
Love World
World is Love
Now let’s run a simple WordCount Map/Reduce job, using HelloWorldBlob.txt as the input file and storing the results in Azure Storage as well.
Job Command:
call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar wordcount asv://hadoop/input asv://hadoop/output
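To make concrete what the WordCount example computes, here is a small local Python sketch of the same idea. It is not the Hadoop job itself: it assumes whitespace tokenization and runs only over the six lines listed above for the sample file, with no cluster or Azure Storage involved.

```python
from collections import Counter

# The six lines listed above as the contents of the sample file.
SAMPLE_LINES = [
    "This is Hello World",
    "I like Hello World",
    "Hello Country",
    "Hello World",
    "Love World",
    "World is Love",
]


def word_count(lines):
    """Map step: split each line on whitespace.
    Reduce step: sum the occurrences of each token."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


counts = word_count(SAMPLE_LINES)

# Print tab-separated "word<TAB>count" pairs, sorted by key.
for word in sorted(counts):
    print(f"{word}\t{counts[word]}")
```

The sketch finds eight distinct words, which agrees with the "Reduce output records=8" counter in the job log. The log also reports 7 map input records and 19 map output records, while the six listed lines hold 17 words, so the actual blob may have contained a little more text than shown here and the per-word counts in part-r-00000 may differ slightly.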
Once the job completes, the following screenshot shows the output:
Opening part-r-00000 shows the results as below:
Finally, the Azure HeadNode WebApp shows the following final output for the Hadoop job:
WordCount Example
Job Info
Status: Completed Sucessfully
Type: jar
Start time: 1/5/2012 5:53:49 AM
End time: 1/5/2012 5:55:52 AM
Exit code: 0
Command
Output (stdout)
Errors (stderr)
12/01/05 05:53:59 INFO mapred.JobClient: Running job: job_201201042206_0001
12/01/05 05:54:00 INFO mapred.JobClient: map 0% reduce 0%
12/01/05 05:54:39 INFO mapred.JobClient: map 100% reduce 0%
12/01/05 05:55:00 INFO mapred.JobClient: map 100% reduce 66%
12/01/05 05:55:30 INFO mapred.JobClient: map 100% reduce 100%
12/01/05 05:55:51 INFO mapred.JobClient: Job complete: job_201201042206_0001
12/01/05 05:55:51 INFO mapred.JobClient: Counters: 25
12/01/05 05:55:51 INFO mapred.JobClient: Job Counters
12/01/05 05:55:51 INFO mapred.JobClient: Launched reduce tasks=1
12/01/05 05:55:51 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=40856
12/01/05 05:55:51 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/05 05:55:51 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/01/05 05:55:51 INFO mapred.JobClient: Rack-local map tasks=1
12/01/05 05:55:51 INFO mapred.JobClient: Launched map tasks=1
12/01/05 05:55:51 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=48433
12/01/05 05:55:51 INFO mapred.JobClient: File Output Format Counters
12/01/05 05:55:51 INFO mapred.JobClient: Bytes Written=56
12/01/05 05:55:51 INFO mapred.JobClient: FileSystemCounters
12/01/05 05:55:51 INFO mapred.JobClient: FILE_BYTES_READ=1134
12/01/05 05:55:51 INFO mapred.JobClient: HDFS_BYTES_READ=102
12/01/05 05:55:51 INFO mapred.JobClient: ASV_BYTES_WRITTEN=56
12/01/05 05:55:51 INFO mapred.JobClient: FILE_BYTES_WRITTEN=44949
12/01/05 05:55:51 INFO mapred.JobClient: File Input Format Counters
12/01/05 05:55:51 INFO mapred.JobClient: Bytes Read=0
12/01/05 05:55:51 INFO mapred.JobClient: Map-Reduce Framework
12/01/05 05:55:51 INFO mapred.JobClient: Reduce input groups=8
12/01/05 05:55:51 INFO mapred.JobClient: Map output materialized bytes=94
12/01/05 05:55:51 INFO mapred.JobClient: Combine output records=8
12/01/05 05:55:51 INFO mapred.JobClient: Map input records=7
12/01/05 05:55:51 INFO mapred.JobClient: Reduce shuffle bytes=0
12/01/05 05:55:51 INFO mapred.JobClient: Reduce output records=8
12/01/05 05:55:51 INFO mapred.JobClient: Spilled Records=16
12/01/05 05:55:51 INFO mapred.JobClient: Map output bytes=178
12/01/05 05:55:51 INFO mapred.JobClient: Combine input records=19
12/01/05 05:55:51 INFO mapred.JobClient: Map output records=19
12/01/05 05:55:51 INFO mapred.JobClient: SPLIT_RAW_BYTES=102
12/01/05 05:55:51 INFO mapred.JobClient: Reduce input records=8
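If you want to check counters such as ASV_BYTES_WRITTEN without reading the whole log by eye, a short Python sketch along these lines can pull the name=value pairs out of JobClient log lines. The regular expression and the parse_counters helper are my own assumptions about the line shape seen in this log, not a Hadoop API, and the sketch expects one log entry per line.

```python
import re

# Assumed line shape: "<date> <time> INFO mapred.JobClient: <name>=<integer>".
COUNTER_RE = re.compile(r"mapred\.JobClient:\s+(.+?)=(\d+)\s*$")


def parse_counters(log_lines):
    """Return {counter name: integer value} for lines that carry a counter."""
    counters = {}
    for line in log_lines:
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


# A few entries copied from the job log above.
sample_log = [
    "12/01/05 05:55:51 INFO mapred.JobClient: Job Counters",
    "12/01/05 05:55:51 INFO mapred.JobClient: ASV_BYTES_WRITTEN=56",
    "12/01/05 05:55:51 INFO mapred.JobClient: Map input records=7",
    "12/01/05 05:55:51 INFO mapred.JobClient: Total time spent by all reduces "
    "waiting after reserving slots (ms)=0",
]

counters = parse_counters(sample_log)
for name, value in counters.items():
    print(f"{name}: {value}")
```

Group headings such as "Job Counters" and progress lines such as "map 100% reduce 0%" contain no name=value pair, so they are skipped. A non-zero ASV_BYTES_WRITTEN is a handy sign that the job output really went to Azure Storage.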
Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce