Avkash Chauhan's Blog

Windows Azure, Windows 8, Cloud Computing, Big Data and Hadoop: All together at one place.. One problem, One solution at One time...

Apache Hadoop on Windows Azure Part 3 - Creating a Word Count Hadoop Job with a few twists

Apache Hadoop on Windows Azure Part 3 - Creating a Word Count Hadoop Job with a few twists

Rate This
  • Comments 2

In this example I am starting a new Hadoop Job with few intentional errors to understand the processing better. You can go to Samples and deploy the Wordcount sample job to your cluster. Verify all the parameters and then you can start the job as below:

 

Note: There are to error in above steps:

  1. Intensely I haven’t uploaded the davinci.txt file to the cluster yet
  2. I have given the wrong parameter

 

Once the Job will start very soon you will hit this error:

 

As you can see above the class name was wrong which resulted into an error. Now you can change the correct parameter name as “wordcount” and restart the job.

Now you will hit another error as below:

 

To solve this problem lets upload the txt file name davinci.txt to cluster. (Please see the wordcount sample page for more info about this step)

To upload the file, we will launch Interactive JavaScript console as below:

 

When Interactive JavaScript console is open you can use fs.put() command to select the txt file from local machine and upload to desired folder at HDFS file system in cluster.

 

Once file upload is completed you will get the result message:

 

 

Let’s run the job again and now you will see the expected results as below. To solve this problem you just need to pass the second parameter with new output directory name.

 

WordCount Example

Job Info

Status: Completed Successfully
Type: jar
Start time: 12/29/2011 5:33:00 PM
End time: 12/29/2011 5:33:58 PM
Exit code: 0

Command

call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar wordcount /example/data/davinci.txt DaVinciAllTopWords

Output (stdout)

 

Errors (stderr)

11/12/29 17:33:02 INFO input.FileInputFormat: Total input paths to process : 1
11/12/29 17:33:03 INFO mapred.JobClient: Running job: job_201112290558_0003
11/12/29 17:33:04 INFO mapred.JobClient: map 0% reduce 0%
11/12/29 17:33:29 INFO mapred.JobClient: map 100% reduce 0%
11/12/29 17:33:47 INFO mapred.JobClient: map 100% reduce 100%
11/12/29 17:33:58 INFO mapred.JobClient: Job complete: job_201112290558_0003
11/12/29 17:33:58 INFO mapred.JobClient: Counters: 25
11/12/29 17:33:58 INFO mapred.JobClient: Job Counters
11/12/29 17:33:58 INFO mapred.JobClient: Launched reduce tasks=1
11/12/29 17:33:58 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=29185
11/12/29 17:33:58 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/29 17:33:58 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/29 17:33:58 INFO mapred.JobClient: Rack-local map tasks=1
11/12/29 17:33:58 INFO mapred.JobClient: Launched map tasks=1
11/12/29 17:33:58 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15671
11/12/29 17:33:58 INFO mapred.JobClient: File Output Format Counters
11/12/29 17:33:58 INFO mapred.JobClient: Bytes Written=337623
11/12/29 17:33:58 INFO mapred.JobClient: FileSystemCounters
11/12/29 17:33:58 INFO mapred.JobClient: FILE_BYTES_READ=467151
11/12/29 17:33:58 INFO mapred.JobClient: HDFS_BYTES_READ=1427899
11/12/29 17:33:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=977063
11/12/29 17:33:58 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=337623
11/12/29 17:33:58 INFO mapred.JobClient: File Input Format Counters
11/12/29 17:33:58 INFO mapred.JobClient: Bytes Read=1427785
11/12/29 17:33:58 INFO mapred.JobClient: Map-Reduce Framework
11/12/29 17:33:58 INFO mapred.JobClient: Reduce input groups=32956
11/12/29 17:33:58 INFO mapred.JobClient: Map output materialized bytes=466761
11/12/29 17:33:58 INFO mapred.JobClient: Combine output records=32956
11/12/29 17:33:58 INFO mapred.JobClient: Map input records=32118
11/12/29 17:33:58 INFO mapred.JobClient: Reduce shuffle bytes=466761
11/12/29 17:33:58 INFO mapred.JobClient: Reduce output records=32956
11/12/29 17:33:58 INFO mapred.JobClient: Spilled Records=65912
11/12/29 17:33:58 INFO mapred.JobClient: Map output bytes=2387798
11/12/29 17:33:58 INFO mapred.JobClient: Combine input records=251357
11/12/29 17:33:58 INFO mapred.JobClient: Map output records=251357
11/12/29 17:33:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=114
11/12/29 17:33:58 INFO mapred.JobClient: Reduce input records=32956

 

If you run the same Job again you will see the following results:

 

WordCount Example

•••••

Job Info

Status: Failed
Type: jar
Start time: 12/29/2011 5:46:11 PM
End time: 12/29/2011 5:46:13 PM
Exit code: -1

Command

call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar wordcount /example/data/davinci.txt DaVinciAllTopWords

Output (stdout)

 

Errors (stderr)


org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory DaVinciAllTopWords already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:134)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:830)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 

 Keywords: Azure, Hadoop, Apache, BigData, Cloud, MapReduce

Leave a Comment
  • Please add 6 and 7 and type the answer here:
  • Post
  • Hi Avkash,

    I tried following the Hadoop Example on word count by following the Docs. Although it worked at first it is now throwing me errors regarding SubscriptionName?! I have the free Subscription and the errors started after i set the parameters. Can you Please help?

    -Apurva

  • Hi Avkash,

    I tried following the Hadoop Example on word count by following the Docs. Although it worked at first it is now throwing me errors regarding SubscriptionName?! I have the free Subscription and the errors started after i set the parameters. Can you Please help? Also....I do have the free subscripition which shows as the subscriptioname when i use Get-AzureSubscription  

    -Apurva

Page 1 of 1 (2 items)