In a Hadoop cluster, the namenode communicates with all the other nodes. Apache Hadoop on Windows Azure includes the following XML configuration files, which hold the primary settings for Hadoop:

 

C:\Apps\Dist\conf\hdfs-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.name.dir</name> <!-- this is the NAME node data directory -->
    <value>c:\hdfs\nn</value>
  </property>
  <property>
    <name>dfs.data.dir</name> <!-- this is the DATA node data directory -->
    <value>c:\hdfs\dn</value>
  </property>
</configuration>
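
Once the cluster is running, you can verify that these HDFS settings took effect with the dfsadmin and fsck clients shown later in this post (a quick, optional sanity check):

c:\apps\dist>hadoop dfsadmin -report
c:\apps\dist>hadoop fsck / -files -blocks

The report lists each datanode along with its configured capacity, and fsck prints the replication state of the blocks under the given path.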

 

 

C:\Apps\Dist\conf\core-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.26.104.45:9000</value> <!-- after the role starts, the VM gets an IP address, which is then included here -->
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
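
Because fs.default.name is the default filesystem URI, a plain hadoop fs command is equivalent to addressing the namenode explicitly (the IP address is just the illustrative one from the file above):

c:\apps\dist>hadoop fs -ls /
c:\apps\dist>hadoop fs -ls hdfs://10.26.104.45:9000/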

 

C:\Apps\Dist\conf\mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.26.104.45:9010</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/hdfs/mapred/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapreduce.client.tasklog.timeout</name>
    <value>6000000</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>6000000</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.connect.timeout</name>
    <value>600000</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.read.timeout</name>
    <value>600000</value>
  </property>
</configuration>
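
With mapred.job.tracker pointing at the jobtracker, a quick end-to-end test is to submit one of the sample jobs bundled with the distribution and then list the running jobs (a sketch; the examples jar name varies by Hadoop version, so hadoop-examples.jar below is a placeholder):

c:\apps\dist>hadoop jar hadoop-examples.jar pi 2 10
c:\apps\dist>hadoop job -list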

 

You can certainly make changes to the settings above; however, some changes require follow-up action. For most property changes, restarting the affected daemons is enough; changing the namenode's on-disk layout (for example, dfs.name.dir) requires reformatting the DFS filesystem, which erases the existing HDFS metadata:

  • C:\Apps\Dist> hadoop namenode -format
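
If you only changed property values, you can stop the daemons and run them again directly from the same launcher (a minimal sketch, assuming each daemon gets its own command window):

c:\apps\dist>hadoop namenode
c:\apps\dist>hadoop datanode
c:\apps\dist>hadoop jobtracker
c:\apps\dist>hadoop tasktracker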

 

For more commands, you can check the Hadoop command-line help:

c:\apps\dist>hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME <src>* <dest> create a hadoop archive
  daemonlog            get/set the log level for each daemon
or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
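
For example, invoking hadoop fs with no parameters prints the full list of filesystem operations, and a first smoke test of HDFS can be as simple as (paths are illustrative):

c:\apps\dist>hadoop fs -mkdir /user/test
c:\apps\dist>hadoop fs -ls /user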

 

 

You can also modify the following Java logging configuration; however, you would then need to relaunch the Java process for the changes to take effect:

C:\Apps\Dist\conf\log4j.properties:

hadoop.log.file=hadoop.log
log4j.rootLogger=${hadoop.root.logger}, EventCounter
log4j.threshhold=ALL

#
# TaskLog Appender
#

#Default values
hadoop.tasklog.taskid=null
hadoop.tasklog.noKeepSplits=4
hadoop.tasklog.totalLogFileSize=100
hadoop.tasklog.purgeLogSplits=true
hadoop.tasklog.logsRetainHours=12

log4j.appender.TLA=org.apache.hadoop.mapred.TaskLogAppender
log4j.appender.TLA.taskId=${hadoop.tasklog.taskid}
log4j.appender.TLA.totalLogFileSize=${hadoop.tasklog.totalLogFileSize}

log4j.appender.TLA.layout=org.apache.log4j.PatternLayout
log4j.appender.TLA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

# FSNamesystem Audit logging
log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN

# Custom Logging levels
#log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
#log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
#log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG

# Jets3t library
log4j.logger.org.jets3t.service.impl.rest.httpclient.RestS3Service=ERROR

# Event Counter Appender
# Sends counts of logging messages at different severity levels to Hadoop Metrics.
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
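
If you only need a temporary change, the daemonlog command from the list above can adjust a logger's level at runtime without relaunching the process (a sketch, assuming the namenode web UI listens on its stock default port, 50070):

c:\apps\dist>hadoop daemonlog -getlevel 10.26.104.45:50070 org.apache.hadoop.fs.FSNamesystem
c:\apps\dist>hadoop daemonlog -setlevel 10.26.104.45:50070 org.apache.hadoop.fs.FSNamesystem DEBUG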

 

Resources:

http://hadoop.apache.org/common/docs/current/cluster_setup.html

http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/