Avkash Chauhan's Blog

Windows Azure, Windows 8, Cloud Computing, Big Data and Hadoop: All together at one place.. One problem, One solution at One time...

Running Apache Mahout at Hadoop on Windows Azure (www.hadooponazure.com)


Once you have enabled access to Hadoop on Windows Azure, you can run any Mahout sample on the head node. Here I am running the original Apache Mahout (http://mahout.apache.org/) sample, derived from the clustering example on Mahout's website (https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data).

Step 1: RDP to your head node and open the Hadoop command line window.
Here you can launch MAHOUT by itself to see what happens.


Step 2: Download the necessary data file from the Internet:

Please download the synthetic control data from http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data and place it at c:\apps\dist\mahout\examples\bin\work\synthetic_control.data

Step 3: Go to the folder c:\apps\dist\mahout\examples\bin, run the command "build-cluster-syntheticcontrol.cmd", and select the desired clustering algorithm from the driver script.

c:\Apps\dist\mahout\examples\bin>build-cluster-syntheticcontrol.cmd
"Please select a number to choose the corresponding clustering algorithm"
"1. canopy clustering"
"2. kmeans clustering"
"3. fuzzykmeans clustering"
"4. dirichlet clustering"
"5. meanshift clustering"
Enter your choice:1
"ok. You chose 1 and we'll use canopy Clustering"
"DFS is healthy... "
"Uploading Synthetic control data to HDFS"
rmr: cannot remove testdata: No such file or directory.
"Successfully Uploaded Synthetic control data to HDFS "
"Running on hadoop, using HADOOP_HOME=c:\Apps\dist"
c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver org.apache.mahout.clustering.syntheticcontrol.canopy.Job
12/03/06 00:50:10 WARN driver.MahoutDriver: No org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props found on classpath, will use command-line arguments only
12/03/06 00:50:10 INFO canopy.Job: Running with default arguments
12/03/06 00:50:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/06 00:50:18 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 00:50:20 INFO mapred.JobClient: Running job: job_201203052259_0001
12/03/06 00:50:21 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 00:51:00 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 00:51:11 INFO mapred.JobClient: Job complete: job_201203052259_0001
12/03/06 00:51:11 INFO mapred.JobClient: Counters: 16
12/03/06 00:51:11 INFO mapred.JobClient: Job Counters
12/03/06 00:51:11 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=33969
12/03/06 00:51:11 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 00:51:11 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 00:51:11 INFO mapred.JobClient: Launched map tasks=1
12/03/06 00:51:11 INFO mapred.JobClient: Data-local map tasks=1
12/03/06 00:51:11 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/03/06 00:51:11 INFO mapred.JobClient: File Output Format Counters
12/03/06 00:51:11 INFO mapred.JobClient: Bytes Written=335470
12/03/06 00:51:11 INFO mapred.JobClient: FileSystemCounters
12/03/06 00:51:11 INFO mapred.JobClient: FILE_BYTES_READ=130
12/03/06 00:51:11 INFO mapred.JobClient: HDFS_BYTES_READ=288508
12/03/06 00:51:11 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21557
12/03/06 00:51:11 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=335470
12/03/06 00:51:11 INFO mapred.JobClient: File Input Format Counters
12/03/06 00:51:11 INFO mapred.JobClient: Bytes Read=288374
12/03/06 00:51:11 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 00:51:11 INFO mapred.JobClient: Map input records=600
12/03/06 00:51:11 INFO mapred.JobClient: Spilled Records=0
12/03/06 00:51:11 INFO mapred.JobClient: Map output records=600
12/03/06 00:51:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=134
12/03/06 00:51:11 INFO canopy.CanopyDriver: Build Clusters Input: output/data Out: output Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure@1997c1d8 t1: 80.0 t2: 55.0
12/03/06 00:51:11 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/06 00:51:12 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 00:51:13 INFO mapred.JobClient: Running job: job_201203052259_0002
12/03/06 00:51:14 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 00:51:58 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 00:52:16 INFO mapred.JobClient: map 100% reduce 100%
12/03/06 00:52:27 INFO mapred.JobClient: Job complete: job_201203052259_0002
12/03/06 00:52:27 INFO mapred.JobClient: Counters: 25
12/03/06 00:52:27 INFO mapred.JobClient: Job Counters
12/03/06 00:52:27 INFO mapred.JobClient: Launched reduce tasks=1
12/03/06 00:52:27 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=30345
12/03/06 00:52:27 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 00:52:27 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 00:52:27 INFO mapred.JobClient: Launched map tasks=1
12/03/06 00:52:27 INFO mapred.JobClient: Data-local map tasks=1
12/03/06 00:52:27 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15968
12/03/06 00:52:27 INFO mapred.JobClient: File Output Format Counters
12/03/06 00:52:27 INFO mapred.JobClient: Bytes Written=6615
12/03/06 00:52:27 INFO mapred.JobClient: FileSystemCounters
12/03/06 00:52:27 INFO mapred.JobClient: FILE_BYTES_READ=14296
12/03/06 00:52:27 INFO mapred.JobClient: HDFS_BYTES_READ=335597
12/03/06 00:52:27 INFO mapred.JobClient: FILE_BYTES_WRITTEN=73063
12/03/06 00:52:27 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=6615
12/03/06 00:52:27 INFO mapred.JobClient: File Input Format Counters
12/03/06 00:52:27 INFO mapred.JobClient: Bytes Read=335470
12/03/06 00:52:27 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 00:52:27 INFO mapred.JobClient: Reduce input groups=1
12/03/06 00:52:27 INFO mapred.JobClient: Map output materialized bytes=13906
12/03/06 00:52:27 INFO mapred.JobClient: Combine output records=0
12/03/06 00:52:27 INFO mapred.JobClient: Map input records=600
12/03/06 00:52:27 INFO mapred.JobClient: Reduce shuffle bytes=0
12/03/06 00:52:27 INFO mapred.JobClient: Reduce output records=6
12/03/06 00:52:27 INFO mapred.JobClient: Spilled Records=50
12/03/06 00:52:27 INFO mapred.JobClient: Map output bytes=13800
12/03/06 00:52:27 INFO mapred.JobClient: Combine input records=0
12/03/06 00:52:27 INFO mapred.JobClient: Map output records=25
12/03/06 00:52:27 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
12/03/06 00:52:27 INFO mapred.JobClient: Reduce input records=25
12/03/06 00:52:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/06 00:52:27 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 00:52:28 INFO mapred.JobClient: Running job: job_201203052259_0003
12/03/06 00:52:29 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 00:53:46 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 00:58:20 INFO mapred.JobClient: Job complete: job_201203052259_0003
12/03/06 00:58:20 INFO mapred.JobClient: Counters: 16
12/03/06 00:58:20 INFO mapred.JobClient: Job Counters
12/03/06 00:58:20 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=30407
12/03/06 00:58:20 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 00:58:20 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 00:58:20 INFO mapred.JobClient: Rack-local map tasks=1
12/03/06 00:58:20 INFO mapred.JobClient: Launched map tasks=1
12/03/06 00:58:20 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/03/06 00:58:20 INFO mapred.JobClient: File Output Format Counters
12/03/06 00:58:20 INFO mapred.JobClient: Bytes Written=340891
12/03/06 00:58:20 INFO mapred.JobClient: FileSystemCounters
12/03/06 00:58:20 INFO mapred.JobClient: FILE_BYTES_READ=130
12/03/06 00:58:21 INFO mapred.JobClient: HDFS_BYTES_READ=342212
12/03/06 00:58:21 INFO mapred.JobClient: FILE_BYTES_WRITTEN=22251
12/03/06 00:58:21 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=340891
12/03/06 00:58:21 INFO mapred.JobClient: File Input Format Counters
12/03/06 00:58:21 INFO mapred.JobClient: Bytes Read=335470
12/03/06 00:58:21 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 00:58:21 INFO mapred.JobClient: Map input records=600
12/03/06 00:58:21 INFO mapred.JobClient: Spilled Records=0
12/03/06 00:58:21 INFO mapred.JobClient: Map output records=600
12/03/06 00:58:21 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
C-0{n=21 c=[29.552, 33.073, 35.876, 36.375, 35.118, 32.761, 29.566, 26.983, 25.272, 24.967, 25.691, 28.252, 30.994, 33.088, 34.015, 34.349, 32.826, 31.053, 29.116, 27.975, 27.879, 28.103, 28.775, 30.585, 31.049, 31.652, 31.956, 31.278, 30.719, 29.901, 29.545, 30.207, 30.672, 31.366, 31.032, 31.567, 30.610, 30.204, 29.266, 29.753, 29.296, 29.930, 31.207, 31.191, 31.474, 32.154, 31.746, 30.771, 30.250, 29.807, 29.543, 29.397, 29.838, 30.489, 30.705, 31.503, 31.360, 30.827, 30.426, 30.399] r=[0.979, 3.352, 5.334, 5.851, 4.868, 3.000, 3.376, 4.812, 5.159, 5.596, 4.940, 4.793, 5.415, 5.014, 5.155, 4.262, 4.891, 5.475, 6.626, 5.691, 5.240, 4.385, 5.767, 7.035, 6.238, 6.349, 5.587, 6.006, 6.282, 7.483, 6.872, 6.952, 7.374, 8.077, 8.676, 8.636, 8.697, 9.066, 9.835, 10.148, 10.091, 10.175, 9.929, 10.241, 9.824, 10.128, 10.595, 9.799, 10.306, 10.036, 10.069, 10.058, 10.008, 10.335, 10.160, 10.249, 10.222, 10.081, 10.274, 10.145]}
Weight: Point:
…

1.0: [27.414, 25.397, 26.460, 31.978, 26.125, 27.463, 30.489, 34.929, 27.558, 30.686, 27.511, 32.269, 32.834, 27.129, 24.991, 32.610, 25.387, 32.674, 34.607, 33.519, 29.012, 28.705, 32.116, 29.121, 26.424, 33.452, 33.623, 29.457, 35.025, 26.607, 34.442, 34.847, 28.897, 34.439, 32.011, 34.816, 27.773, 11.549, 20.219, 19.678, 14.715, 14.384, 15.556, 9.573, 10.636, 16.639, 17.236, 19.643, 18.317, 15.323, 19.106, 11.455, 16.888, 18.269, 11.583, 1
12/03/06 00:58:24 INFO driver.MahoutDriver: Program took 493470 ms
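The log above shows the canopy job ran with EuclideanDistanceMeasure and thresholds t1: 80.0, t2: 55.0. As a rough single-machine illustration of what canopy clustering does with two such thresholds (this is a simplified sketch, not the MapReduce implementation Mahout actually runs):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2):
    """Single-pass canopy clustering sketch (requires t1 > t2).

    Each remaining point in turn becomes a canopy center; points within
    t1 of the center join the canopy, and points within t2 are removed
    from further consideration as centers.
    """
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = euclidean(center, p)
            if d < t1:
                members.append(p)       # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # far enough to seed another canopy
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

In the real job, the canopy centers produced this way seed the subsequent clustering pass; the sketch only shows the grouping step over in-memory points.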

After the Mahout job completed, the output was stored as below:

js> #ls

Found 3 items

drwxr-xr-x   - avkash supergroup          0 2012-03-06 01:05 /user/avkash/.oink

drwxr-xr-x   - avkash supergroup          0 2012-03-06 00:52 /user/avkash/output

drwxr-xr-x   - avkash supergroup          0 2012-03-06 00:49 /user/avkash/testdata

js> #ls /user/avkash/output

Found 3 items

drwxr-xr-x   - avkash supergroup          0 2012-03-06 00:53 /user/avkash/output/clusteredPoints

drwxr-xr-x   - avkash supergroup          0 2012-03-06 00:52 /user/avkash/output/clusters-0

drwxr-xr-x   - avkash supergroup          0 2012-03-06 00:51 /user/avkash/output/data

 

Now let’s analyze the Mahout cluster output using the clusterdump utility:

 

The clusterdump utility takes 3 parameters:

  1. --seqFileDir – the path to the folder containing the cluster sequence files (in this case output/clusters-0)
  2. --pointsDir – the path to the folder containing the clustered points (in this case output/clusteredPoints)
  3. --output – the path where you want the analysis result created.
    1. Note that this parameter writes the analysis result as a text file on the local machine, not on HDFS.

Running the command as below:

c:\Apps\dist\mahout\examples\bin>mahout clusterdump --seqFileDir output\clusters-0 --pointsDir output\clusteredPoints --output clusteranalyze.txt

"Running on hadoop, using HADOOP_HOME=c:\Apps\dist"
c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver clusterdump --seqFileDir output\clusters-0 --pointsDir output\clusteredPoints --output clusteranalyze.txt
12/03/06 21:05:53 WARN driver.MahoutDriver: No clusterdump.props found on classpath, will use command-line arguments only
12/03/06 21:05:53 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=clusteranalyze.txt, --pointsDir=output\clusteredPoints, --seqFileDir=output\clusters-0, --startPhase=0, --tempDir=temp}
12/03/06 21:05:55 INFO driver.MahoutDriver: Program took 2031 ms
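Each cluster in the dump is printed as an identifier followed by the point count, center vector, and radius vector, in the form C-0{n=21 c=[…] r=[…]} seen earlier. Assuming that line format, a small sketch can pull those fields out of clusteranalyze.txt for further analysis:

```python
import re

# Matches cluster records like: C-0{n=21 c=[1.0, 2.0] r=[0.5, 0.3]}
CLUSTER_RE = re.compile(r"(\S+)\{n=(\d+) c=\[([^\]]*)\] r=\[([^\]]*)\]\}")

def parse_clusters(text):
    """Yield (cluster_id, n, center, radius) for each cluster record found."""
    for m in CLUSTER_RE.finditer(text):
        cid, n, c, r = m.groups()
        center = [float(v) for v in c.split(",") if v.strip()]
        radius = [float(v) for v in r.split(",") if v.strip()]
        yield cid, int(n), center, radius
```

This makes it easy to, say, sort clusters by size or compare radii across the algorithms the driver script offers.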


 

Now if you open the folder on your machine, you will find “clusteranalyze.txt” as below:

 

Opening clusteranalyze.txt shows the data as below:

 

Cluster Dumper Reference:

 

 


Comments:
  • I run into the following when trying to run Mahout on my Azure environment. I don't have much experience with Windows shell scripting, so please forgive me if it's something obvious:

    c:\apps\dist\mahout\bin>mahout

    Running here: c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop jar c:\apps\dist\mahout\bin\..\\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver

    Usage: java [-options] class [args...]

              (to execute a class)

      or  java [-options] -jar jarfile [args...]

              (to execute a jar file)

    where options include:

       -d32          use a 32-bit data model if available

       -d64          use a 64-bit data model if available

       -server       to select the "server" VM

       -hotspot      is a synonym for the "server" VM  [deprecated]

                     The default VM is server.

       -cp <class search path of directories and zip/jar files>

       -classpath <class search path of directories and zip/jar files>

                     A ; separated list of directories, JAR archives,

                     and ZIP archives to search for class files.

       -D<name>=<value>

                     set a system property

       -verbose[:class|gc|jni]

                     enable verbose output

       -version      print product version and exit

       -version:<value>

                     require the specified version to run

       -showversion  print product version and continue

       -jre-restrict-search | -no-jre-restrict-search

                     include/exclude user private JREs in the version search

       -? -help      print this help message

       -X            print help on non-standard options

       -ea[:<packagename>...|:<classname>]

       -enableassertions[:<packagename>...|:<classname>]

                     enable assertions with specified granularity

       -da[:<packagename>...|:<classname>]

       -disableassertions[:<packagename>...|:<classname>]

                     disable assertions with specified granularity

       -esa | -enablesystemassertions

                     enable system assertions

       -dsa | -disablesystemassertions

                     disable system assertions

       -agentlib:<libname>[=<options>]

                     load native agent library <libname>, e.g. -agentlib:hprof

                     see also, -agentlib:jdwp=help and -agentlib:hprof=help

       -agentpath:<pathname>[=<options>]

                     load native agent library by full pathname

       -javaagent:<jarpath>[=<options>]

                     load Java programming language agent, see java.lang.instrument

       -splash:<imagepath>

                     show splash screen with specified image

    See java.sun.com/.../reference for more details.
