Carl Nolan’s ramblings on development
To better support configuring the Streaming environment when running .Net Streaming jobs, I have made a change to the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code.
I have fixed a few bugs around setting job configuration options that were being controlled by the submission code. More importantly, I have added support for two additional command-line submission options, -jobconf and -cmdenv.
The full set of options for the command line submission is now:
-help           (Required=false) : Display Help Text
-input          (Required=true)  : Input Directory or Files
-output         (Required=true)  : Output Directory
-mapper         (Required=true)  : Mapper Class
-reducer        (Required=true)  : Reducer Class
-combiner       (Required=false) : Combiner Class (Optional)
-format         (Required=false) : Input Format |Text(Default)|Binary|Xml|
-numberReducers (Required=false) : Number of Reduce Tasks (Optional)
-numberKeys     (Required=false) : Number of MapReduce Keys (Optional)
-outputFormat   (Required=false) : Reducer Output Format |Text|Binary| (Optional)
-file           (Required=true)  : Processing Files (Must include Map and Reduce Class files)
-nodename       (Required=false) : XML Processing Nodename (Optional)
-cmdenv         (Required=false) : List of Environment Variables for Job (Optional)
-jobconf        (Required=false) : List of Job Configuration Parameters (Optional)
-debug          (Required=false) : Turns on Debugging Options
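The Required flags in the list above amount to a simple pre-flight check before a job is submitted. As a rough illustration only (written in Python rather than taken from the framework's .Net code, and using made-up paths and class names), such a check might look like this:

```python
# Options marked Required=true in the list above.
REQUIRED_OPTIONS = ("-input", "-output", "-mapper", "-reducer", "-file")


def missing_required(args):
    """Return the required options that do not appear in the argument list."""
    present = {arg for arg in args if arg.startswith("-")}
    return [opt for opt in REQUIRED_OPTIONS if opt not in present]


# Hypothetical submission arguments; the paths and class names are invented
# for this example and do not come from the framework.
example_args = [
    "-input", "mobile/data",
    "-output", "mobile/querytimes",
    "-mapper", "Sample.MobileMapper",
    "-file", "Sample.MapReduce.dll",
]

# Reports the required options that were left out of example_args.
print(missing_required(example_args))
```

Here the reducer has been left out on purpose, so the check reports -reducer as missing.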
The -jobconf option accepts standard Hadoop job configuration parameters, one name=value pair per flag. As an example, to set the name of a job and compress the Map output, which could improve performance, one would add:
-jobconf mapred.compress.map.output=true -jobconf mapred.job.name=MobileDataDebug
For the complete list of available options, see the Hadoop documentation.
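Since every setting is given with its own -jobconf flag, the flag is simply repeated on the command line. The sketch below shows one way such repeated name=value pairs could be gathered into a dictionary. It is written in Python purely for illustration, and the helper name collect_pairs is hypothetical rather than part of the framework:

```python
def collect_pairs(args, flag):
    """Collect every 'name=value' that directly follows the given flag."""
    pairs = {}
    # Stop one short of the end so args[i + 1] always exists.
    for i, arg in enumerate(args[:-1]):
        if arg == flag:
            name, sep, value = args[i + 1].partition("=")
            if not sep:
                raise ValueError(
                    f"expected name=value after {flag}, got {args[i + 1]!r}"
                )
            pairs[name] = value
    return pairs


# The two settings from the example above, each with its own flag.
args = [
    "-jobconf", "mapred.compress.map.output=true",
    "-jobconf", "mapred.job.name=MobileDataDebug",
]

print(collect_pairs(args, "-jobconf"))
```

The same helper would work for any other repeated name=value option by passing a different flag.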
The -cmdenv option sets environment variables that are accessible to the Streaming process. Note, however, that it will replace non-alphanumeric characters with an underscore “_” character. As an example, to pass a filter into the Streaming process one would add:
-cmdenv DATA_FILTER=Windows
The value would then be available inside the Streaming process as the DATA_FILTER environment variable.
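To make this concrete, here is a minimal sketch of a Streaming-style filter that reads DATA_FILTER from its environment. It is written in Python for illustration and is not the framework's .Net code. The function names and sample records are hypothetical, and env_safe assumes that the underscore substitution described above applies to the variable name:

```python
import re


def env_safe(name):
    """Replace every non-alphanumeric character with an underscore.

    Assumption: the substitution applies to the variable name.
    """
    return re.sub(r"[^A-Za-z0-9]", "_", name)


def run(lines, environ):
    """Yield only the lines containing the DATA_FILTER text.

    If the variable is not set, every line is passed through unchanged.
    """
    data_filter = environ.get(env_safe("DATA_FILTER"), "")
    for line in lines:
        if not data_filter or data_filter in line:
            yield line


# Hypothetical input records; a real Streaming mapper would instead pass
# sys.stdin and os.environ to run().
sample = ["Windows Phone 7\t12\n", "Android\t7\n"]
for line in run(sample, {"DATA_FILTER": "Windows"}):
    print(line, end="")
```

Taking the environment as a parameter keeps the filter easy to exercise outside Hadoop; only the entry point needs to know about the real process environment.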
To provide better feedback to the Hadoop running environment, I have added a new static class to the code, named Context. The Context class contains the original FormatKeys() and GetKeys() operations, along with the following additions:
The code contained in this Context object, although simple, will hopefully provide some abstraction over the idiosyncrasies of using the Streaming interface.