If you have been using the “Generics based Framework for .Net Hadoop MapReduce Job Submission”, you may want to download the latest version of the code.

The previous version of the code, when processing XML and Binary files, depended on a custom streaming JAR that contained the necessary reader classes. This was not an ideal solution. I have now changed the code to remove this dependency: it uses the standard streaming jar file and supplies the reader classes through the libjars generic option.

The libjars option lets you specify the jar files to include in the classpath for the streaming job. Using this option, an example of submitting a job for XML processing would be:

  -input "stores/demographics" -output "stores/banking"
  -mapper "MSDN.Hadoop.MapReduceFSharp.StoreXmlElementMapper, MSDN.Hadoop.MapReduceFSharp"
  -reducer "MSDN.Hadoop.MapReduceFSharp.StoreXmlElementReducer, MSDN.Hadoop.MapReduceFSharp"
  -file "%HOMEPATH%\Projects\MSDN.Hadoop.MapReduce\Samples\MSDN.Hadoop.MapReduceFSharp.dll"
  -nodename Store -format Xml -numberKeys 2

This leads to the following job submission:

%HADOOP_HOME%\bin\hadoop.cmd jar %HADOOP_HOME%\lib\hadoop-streaming.jar
  -libjars "file:///C:/Projects/MSDN.Hadoop.MapReduce/Release/msdn.hadoop.readers.jar"
  "-D xmlinput.element=Store" "-D stream.num.map.output.key.fields=2"
  -input "stores/demographics" -output "stores/banking"
  -mapper "..\..\jars\MSDN.Hadoop.MapperXml.exe"
  -reducer "..\..\jars\MSDN.Hadoop.Reducer.exe"
  -file "C:\Projects\MSDN.Hadoop.MapReduce\Release\MSDN.Hadoop.MapperXml.exe"
  -file "C:\Projects\MSDN.Hadoop.MapReduce\Release\MSDN.Hadoop.Reducer.exe"
  -file "C:\Projects\MSDN.Hadoop.MapReduce\Release\MSDN.Hadoop.Core.dll"
  -file "C:\Projects\MSDN.Hadoop.MapReduce\Release\MSDN.Hadoop.MapReduceBase.dll"
  -file "C:\Projects\MSDN.Hadoop.MapReduce\Samples\MSDN.Hadoop.MapReduceFSharp.dll"
  -file "C:\Temp\f3f0159ea97c4be08fe3ac07e0607b1a\MSDN.Hadoop.MapperXml.exe.config"
  -file "C:\Temp\f3f0159ea97c4be08fe3ac07e0607b1a\MSDN.Hadoop.Reducer.exe.config"
  -inputformat org.apache.mahout.classifier.bayes.XmlElementStreamingInputFormat
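
The way the submission options translate into the streaming command can be sketched as follows. This is a minimal illustration in Python, not the framework's actual code (which is .NET): the function names `to_file_uri` and `build_streaming_args` and their parameters are hypothetical, and only the generic options and input/output arguments are covered. It shows the jar path being turned into a file:/// URI for libjars, the nodename and numberKeys values becoming -D properties, and the generic options being placed ahead of the streaming options, as Hadoop streaming requires.

```python
# Hypothetical sketch: names and structure are assumptions, not the framework's API.
XML_INPUT_FORMAT = "org.apache.mahout.classifier.bayes.XmlElementStreamingInputFormat"


def to_file_uri(windows_path):
    """Turn an absolute Windows path into the file:/// form passed to -libjars."""
    return "file:///" + windows_path.replace("\\", "/")


def build_streaming_args(readers_jar, node_name, number_keys, input_dir, output_dir):
    """Build the argument list that follows 'hadoop jar hadoop-streaming.jar'."""
    args = []
    # Generic options (-libjars, -D) go first, before any streaming options.
    args += ["-libjars", to_file_uri(readers_jar)]
    args += ["-D", "xmlinput.element=" + node_name]
    args += ["-D", "stream.num.map.output.key.fields=" + str(number_keys)]
    # Streaming options follow the generic options.
    args += ["-input", input_dir, "-output", output_dir]
    args += ["-inputformat", XML_INPUT_FORMAT]
    return args


if __name__ == "__main__":
    jar = r"C:\Projects\MSDN.Hadoop.MapReduce\Release\msdn.hadoop.readers.jar"
    for arg in build_streaming_args(jar, "Store", 2, "stores/demographics", "stores/banking"):
        print(arg)
```

The sketch emits each -D flag and its property as separate arguments; the listing above passes each pair as a single quoted argument instead. The -mapper, -reducer, and -file arguments are left out for brevity.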

The msdn.hadoop.readers.jar file needed for processing XML and Binary files has already been built, and it is available in the Release folder once the library is built.

Along with this major change, there are some bug fixes for handling blank input lines (necessary for word-count examples), the samples have been moved into a separate solution, and the generation of the necessary app.config files has been standardized.