Let's go through the process of running the Pig script against the entire Gutenberg collection. We first upload the MapReduce word count file, WordCount.js [link], by typing fs.put(), which brings up a dialog box for uploading the WordCount.js file.
Next, you can verify that the WordCount.js file was uploaded properly by typing #cat /user/admin/WordCount.js. As you may have noticed, HDFS commands that normally look like hdfs dfs -ls have been abstracted to #ls. With the file in place, run the following Pig query from the console:
pig.from("asv://firstname.lastname@example.org/").mapReduce("/user/admin/WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("DaVinciTop10")
This job may take tens of minutes to complete, since the dataset is rather large.
The View Log link provides detailed progress logs.
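To make the chained Pig call above more concrete, here is a minimal local sketch in plain JavaScript of what it computes: a word-count map and reduce, followed by a descending sort on count and a cut to the top entries. It is not the contents of WordCount.js and it does not run on the cluster. The function names (mapLine, reduceCounts, topWords), the tokenizing rule, and the sample lines are all assumptions made for illustration.

```javascript
// Local illustration only: mimics mapReduce(...).orderBy("count DESC").take(n).
// All names here are hypothetical; none of them belong to the console API.

// Map step: emit a [word, 1] pair for every word on a line.
// Assumption: a word is a run of letters, compared case-insensitively.
function mapLine(line) {
  const words = line.toLowerCase().match(/[a-z]+/g) || [];
  return words.map(function (word) { return [word, 1]; });
}

// Reduce step: sum the counts emitted for each word.
function reduceCounts(pairs) {
  const totals = new Map();
  pairs.forEach(function (pair) {
    totals.set(pair[0], (totals.get(pair[0]) || 0) + pair[1]);
  });
  return totals;
}

// Equivalent of orderBy("count DESC") followed by take(n).
function topWords(lines, n) {
  const pairs = lines.flatMap(mapLine);
  return Array.from(reduceCounts(pairs), function (entry) {
    return { word: entry[0], count: entry[1] };
  })
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, n);
}

// Made-up sample input, not taken from the Gutenberg data.
const sampleLines = ["The cat and the dog", "the bird"];
console.log(topWords(sampleLines, 2));
```

On the cluster, the map and reduce steps run as distributed jobs over many files; the sketch only shows the shape of the computation on a few in-memory lines.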
Click on the Reduce link in the table above to check on the Reduce job, and notice the shuffle and sort processes. Shuffle is the process by which each reducer is fed the portion of every mapper's output that it needs to process.
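As a rough picture of shuffle and sort, the sketch below groups the pairs produced by two mappers by key and orders the keys, which is the shape of input a reducer consumes. This is a single-process illustration, not how Hadoop implements the step; the name shuffleAndSort and the sample pairs are hypothetical.

```javascript
// Local illustration of shuffle and sort; names are hypothetical.
// Input: one array of [key, value] pairs per mapper.
// Output: one [key, values] group per key, in sorted key order.
function shuffleAndSort(mapperOutputs) {
  const groups = new Map();
  mapperOutputs.forEach(function (pairs) {
    pairs.forEach(function (pair) {
      if (!groups.has(pair[0])) {
        groups.set(pair[0], []);
      }
      groups.get(pair[0]).push(pair[1]);
    });
  });
  return Array.from(groups.keys()).sort().map(function (key) {
    return [key, groups.get(key)];
  });
}

// Made-up mapper outputs for a word count.
const mapperA = [["the", 1], ["cat", 1]];
const mapperB = [["the", 1], ["dog", 1]];
console.log(JSON.stringify(shuffleAndSort([mapperA, mapperB])));
// prints [["cat",[1]],["dog",[1]],["the",[1,1]]]
```

A word-count reducer then only has to sum the values in each group.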
Click into the Counters link and you will see that a significant amount of data is read and written in this process. The nice thing about MapReduce jobs is that you can speed them up by adding more compute resources. The mapping phase in particular can be sped up significantly by running more processes in parallel.
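The map phase parallelizes well because each input split can be processed without knowing anything about the others. The local sketch below, with hypothetical names (countWords, mergeCounts) and a toy input, shows that counting two splits separately and merging the partial results gives the same totals as counting everything in one pass.

```javascript
// Local illustration only; names are hypothetical.
// Count words in one input split. Each split is independent of the others,
// so on a cluster each call could run on a different node.
function countWords(lines) {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const counts = Object.create(null);
  lines.forEach(function (line) {
    (line.toLowerCase().match(/[a-z]+/g) || []).forEach(function (word) {
      counts[word] = (counts[word] || 0) + 1;
    });
  });
  return counts;
}

// Merge the per-split counts into a single total.
function mergeCounts(partials) {
  const total = Object.create(null);
  partials.forEach(function (partial) {
    Object.keys(partial).forEach(function (word) {
      total[word] = (total[word] || 0) + partial[word];
    });
  });
  return total;
}

// Made-up input, split into two "files".
const splitLines = ["a rose is a rose", "is a rose"];
const whole = countWords(splitLines);
const merged = mergeCounts([
  countWords([splitLines[0]]),
  countWords([splitLines[1]])
]);
console.log(whole, merged);
```

In this sketch the splits are still processed one after another; the point is only that nothing in countWords depends on the other split, which is what lets a cluster run the calls at the same time.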
When everything finishes, the summary page tells us that the Pig script was really about 5 different jobs, 07 through 11. For learning purposes, I've posted my results at: https://github.com/wenming/BigDataSamples/blob/master/gutenberg/results.txt. To load the job output back into the console:
file = fs.read("DaVinciTop10")
data = parse(file.data, "word, count:long")
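The second argument to parse is a schema string like the one passed to mapReduce earlier. As a rough picture of what such a call does, the sketch below turns delimited text into typed records. The helper name parseRows, the tab delimiter, the default string type, and the treatment of long as an integer are assumptions for illustration, not the console's actual implementation.

```javascript
// Local illustration only; parseRows is a hypothetical stand-in for the console's parse().
// Assumptions: one record per line, tab-separated fields, and a schema made of
// comma-separated "name" or "name:type" entries where "long" means integer.
function parseRows(text, schema) {
  const fields = schema.split(",").map(function (entry) {
    const parts = entry.trim().split(":");
    return { name: parts[0], type: parts[1] || "chararray" };
  });
  return text
    .split("\n")
    .filter(function (line) { return line.trim() !== ""; })
    .map(function (line) {
      const values = line.split("\t");
      const row = {};
      fields.forEach(function (field, i) {
        row[field.name] =
          field.type === "long" ? parseInt(values[i], 10) : values[i];
      });
      return row;
    });
}

// Made-up sample text, not actual job output.
const sampleText = "the\t3\ncat\t1\n";
console.log(parseRows(sampleText, "word, count:long"));
```

With records in this shape, each row exposes a word string and a numeric count, which is convenient for sorting or charting.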
When we compare the results for the entire Gutenberg collection with those for the Davinci.txt file alone, there's a significant difference. With the new data, we can estimate the occurrences of these top words in the English language more accurately than by looking through just one book.
More data generally gives us more confidence, which is why big data processing is so important. When it comes to processing large amounts of data, parallel big data processing tools such as HDInsight (Hadoop) can deliver results faster than a single workstation. MapReduce is like the assembly language of big data: higher-level languages such as Pig Latin are decomposed into a series of MapReduce jobs for us.