Avkash Chauhan's Blog

Windows Azure, Windows 8, Cloud Computing, Big Data and Hadoop: all together in one place... One problem, one solution at one time...

Apache Hadoop on Windows Azure Part 7 – Writing your very own WordCount Hadoop Job in Java and deploying to Windows Azure Cluster

In this article, I will help you write your own WordCount Hadoop job and then deploy it to a Windows Azure cluster for processing.

 

Let’s create a Java source file named “AvkashWordCount.java” as shown below:

 

package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvkashWordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the per-word counts. Note that in the new
    // org.apache.hadoop.mapreduce API the reduce method takes an
    // Iterable (not an Iterator); with Iterator the method would not
    // override Reducer.reduce, and the job would silently behave as an
    // identity reducer.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(AvkashWordCount.class);
        job.setJobName("avkashwordcountjob");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(AvkashWordCount.Map.class);
        // The reducer also serves as the combiner, which is safe here
        // because summing counts is associative and commutative.
        job.setCombinerClass(AvkashWordCount.Reduce.class);
        job.setReducerClass(AvkashWordCount.Reduce.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
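To make the data flow concrete, here is a hypothetical example (not taken from the original post): given an input file containing the single line below, the job emits one count per distinct word, sorted by key:

the quick brown fox jumps over the lazy dog

brown	1
dog	1
fox	1
jumps	1
lazy	1
over	1
quick	1
the	2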

 

Let’s compile the Java code first. You must have Hadoop 0.20 or above installed on your machine to compile this code:

 

C:\Azure\Java>C:\Apps\java\openjdk7\bin\javac -classpath c:\Apps\dist\hadoop-core-0.20.203.1-SNAPSHOT.jar -d . AvkashWordCount.java

 

Now let’s create the JAR file:

C:\Azure\Java>C:\Apps\java\openjdk7\bin\jar -cvf AvkashWordCount.jar org

 added manifest

adding: org/(in = 0) (out= 0)(stored 0%)

adding: org/myorg/(in = 0) (out= 0)(stored 0%)

adding: org/myorg/AvkashWordCount$Map.class(in = 1893) (out= 792)(deflated 58%)

adding: org/myorg/AvkashWordCount$Reduce.class(in = 1378) (out= 596)(deflated 56%)

adding: org/myorg/AvkashWordCount.class(in = 1399) (out= 754)(deflated 46%)
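Before deploying, you can optionally verify the package layout; listing the JAR contents should show the three classes compiled above (a quick sanity check, not a required step):

C:\Azure\Java>C:\Apps\java\openjdk7\bin\jar -tf AvkashWordCount.jar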

 

Once the JAR is created, deploy it to your Windows Azure Hadoop cluster as shown below:

 

On the job submission page, follow these steps:

  • Step 1: Click Browse to select your "AvkashWordCount.jar" file
  • Step 2: Enter the job name as defined in the source code ("avkashwordcountjob")
  • Step 3: Add the fully qualified class name, org.myorg.AvkashWordCount, as the first parameter
  • Step 4: Add the input folder name from which the files to be word-counted will be read
  • Step 5: Add the output folder name where the results will be stored
  • Step 6: Start the job

 

 

 

Note: Be sure to have some data in your input folder. (I am using /user/avkash/inputfolder, which contains a text file with lots of words to be used as the word count input.)
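If your input folder does not exist yet, you can create it and load a local text file into it with the standard HDFS shell commands from the cluster’s Hadoop command prompt (a minimal sketch; the local file name words.txt here is just an illustration):

hadoop fs -mkdir /user/avkash/inputfolder

hadoop fs -put words.txt /user/avkash/inputfolder/

hadoop fs -ls /user/avkash/inputfolder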

Once the job is started, you will see the results as below:

 

avkashwordcountjob


Job Info

Status: Completed Successfully
Type: jar
Start time: 12/31/2011 4:06:51 PM
End time: 12/31/2011 4:07:53 PM
Exit code: 0

Command

call hadoop.cmd jar AvkashWordCount.jar org.myorg.AvkashWordCount /user/avkash/inputfolder /user/avkash/outputfolder

Output (stdout)

 

Errors (stderr)


11/12/31 16:06:53 INFO input.FileInputFormat: Total input paths to process : 1
11/12/31 16:06:54 INFO mapred.JobClient: Running job: job_201112310614_0001
11/12/31 16:06:55 INFO mapred.JobClient: map 0% reduce 0%
11/12/31 16:07:20 INFO mapred.JobClient: map 100% reduce 0%
11/12/31 16:07:42 INFO mapred.JobClient: map 100% reduce 100%
11/12/31 16:07:53 INFO mapred.JobClient: Job complete: job_201112310614_0001
11/12/31 16:07:53 INFO mapred.JobClient: Counters: 25
11/12/31 16:07:53 INFO mapred.JobClient: Job Counters
11/12/31 16:07:53 INFO mapred.JobClient: Launched reduce tasks=1
11/12/31 16:07:53 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=29029
11/12/31 16:07:53 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/31 16:07:53 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/31 16:07:53 INFO mapred.JobClient: Launched map tasks=1
11/12/31 16:07:53 INFO mapred.JobClient: Data-local map tasks=1
11/12/31 16:07:53 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=18764
11/12/31 16:07:53 INFO mapred.JobClient: File Output Format Counters
11/12/31 16:07:53 INFO mapred.JobClient: Bytes Written=123
11/12/31 16:07:53 INFO mapred.JobClient: FileSystemCounters
11/12/31 16:07:53 INFO mapred.JobClient: FILE_BYTES_READ=709
11/12/31 16:07:53 INFO mapred.JobClient: HDFS_BYTES_READ=234
11/12/31 16:07:53 INFO mapred.JobClient: FILE_BYTES_WRITTEN=43709
11/12/31 16:07:53 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=123
11/12/31 16:07:53 INFO mapred.JobClient: File Input Format Counters
11/12/31 16:07:53 INFO mapred.JobClient: Bytes Read=108
11/12/31 16:07:53 INFO mapred.JobClient: Map-Reduce Framework
11/12/31 16:07:53 INFO mapred.JobClient: Reduce input groups=7
11/12/31 16:07:53 INFO mapred.JobClient: Map output materialized bytes=189
11/12/31 16:07:53 INFO mapred.JobClient: Combine output records=15
11/12/31 16:07:53 INFO mapred.JobClient: Map input records=15
11/12/31 16:07:53 INFO mapred.JobClient: Reduce shuffle bytes=0
11/12/31 16:07:53 INFO mapred.JobClient: Reduce output records=15
11/12/31 16:07:53 INFO mapred.JobClient: Spilled Records=30
11/12/31 16:07:53 INFO mapred.JobClient: Map output bytes=153
11/12/31 16:07:53 INFO mapred.JobClient: Combine input records=15
11/12/31 16:07:53 INFO mapred.JobClient: Map output records=15
11/12/31 16:07:53 INFO mapred.JobClient: SPLIT_RAW_BYTES=126
11/12/31 16:07:53 INFO mapred.JobClient: Reduce input records=15

 

 

Finally, you can open the output folder /user/avkash/outputfolder and read the word count results.
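From the Hadoop command prompt, a minimal way to inspect the result is the HDFS shell; with a single reducer under the new API, the output lands in a part-r-00000 file:

hadoop fs -ls /user/avkash/outputfolder

hadoop fs -cat /user/avkash/outputfolder/part-r-00000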

Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce

Comments
  • Hi Avkash,

    You compiled the Java file against the Hadoop core version 0.20 JAR, as shown below:

    C:\Azure\Java>C:\Apps\java\openjdk7\bin\javac -classpath c:\Apps\dist\hadoop-core-0.20.203.1-SNAPSHOT.jar -d . AvkashWordCount.java

    We have the JARs from Mahout version 0.4, which we want to test on Azure Hadoop. Are those JARs compatible with the platform in the same way as the above?

    Thanks
