Slightly off-topic, but a friend of mine posted a video on Facebook that struck me as being very relevant to our space.
At the same time I saw this, another friend and co-worker was having a fairly crummy day with lots of crazy meetings and requests from co-workers, but that day was made a lot better by sitting down and solving a fun little deployment script problem. The deployment script is neither here nor there, it was more that so many of the folks in this industry love solving a problem, whether that is getting a query right, chasing down a pesky bug, or getting that a-ha moment when you are trying to design something. That’s “painting” for a lot of us.
Every day is a good day when you paint.
[and working auto-tuned Bob Ross into a blog post was just too good to pass up]
I had a little bit of time on my hand and wanted to whip up a quick sample using PowerShell for a Hadoop job.
This uses the Hadoop streaming capability, which essentially allows for mappers and reducers to be written as arbitrary executables that operate on standard input and output.
The .ps1 scripts are pretty simple, these operate over a set of airline data that looks like this:
The schema here is a comma separated set of US flights with delays.
The goal of my job is to pull out the airlines, the number of flights, and then some very basic (min and max) statistics on the arrival and departure delays.
1: function Map-Airline
3: [Console]::Error.WriteLine( "reporter:counter:powershell,invocations,1")
4: $line = [Console]::ReadLine()
5: while ($line -ne $null)
7: [Console]::WriteLine($line.Split(",") + "`t" + $line)
9: $line = [Console]::ReadLine()
1: function Reduce-Airlines
3: $line = ""
4: $oldLine = "<initial invalid row value>"
5: $count = 0
6: $minArrDelay = 10000
7: $maxArrDelay = 0
8: $minDepDelay = 10000
9: $maxDepDelay = 0
11: $line = [Console]::ReadLine()
13: while ($line -ne $null)
15: if (($oldLine -eq $line.Split("`t")) -or ($oldLine -eq "<initial invalid row value>"))
17: $flightRecord = $line.Split("`t").Split(',')
18: if ([Int32]::Parse($flightRecord) -ne 0)
20: $minArrDelay = [Math]::Min($minArrDelay, $flightRecord)
22: if ([Int32]::Parse($flightRecord) -ne 0)
24: $minDepDelay = [Math]::Min($minDepDelay, [Int32]::Parse($flightRecord))
26: $maxArrDelay = [Math]::Max($maxArrDelay, $flightRecord)
27: $maxDepDelay = [Math]::Max($maxDepDelay, $flightRecord)
28: $count = $count+ 1
29: [Console]::Error.WriteLine( "reporter:counter:powershell,"+$oldLine + ",1")
33: [Console]::WriteLine($oldLine + "`t" + $count + "," + $minArrDelay + "," +$maxArrDelay + "," + $minDepDelay +","+ $maxDepDelay)
34: $count = 1
35: $minArrDelay = 10000
36: $maxArrDelay = 0
37: $minDepDelay = 10000
38: $maxDepDelay = 0
39: [Console]::Error.WriteLine("reporter:counter:powershell,"+$oldLine + ",1")
41: $oldLine = $line.Split("`t")
42: $line = [Console]::ReadLine()
44: [Console]::WriteLine($oldLine + "`t" + $count + "," + $minArrDelay + "," +$maxArrDelay + "," + $minDepDelay +","+ $maxDepDelay)
One thing to note on the reducer is that we use the $oldLine variable in order to keep tabs on when our group of results is moving to the next one. When using Java, your reduce function will be invoked once per group and so you can reset the state at the beginning of each of those. With streaming, you will never have groups that split reducers, but your executable will only be spun up once per reducer (which, in the sample here, is one). You can also see that I’m writing out to STDERR in order to get a few counters recorded as well.
The next trick is to get these to execute. The process spawned by the Streaming job does not know about .ps1 files, it’s basically just cmd.exe. To get around that we will also create a small driver .cmd file and upload the file with the –file directive from the command line.
@call c:\windows\system32\WindowsPowerShell\v1.0\powershell -file ..\..\jars\AirlineMapper.ps1
@call c:\windows\system32\WindowsPowerShell\v1.0\powershell -file ..\..\jars\AirlineReducer.ps1
The ..\..\jars directory is where the –file directive will place the files when they execute
And now we execute:
hadoop jar %HADOOP_HOME%\lib\hadoop-streaming.jar
And we get our results.
There is still some work to be done here, I’d like to make it a little easier to get these running (so possibly wrapping submission in a script which takes care of the wrapping for me). Also, on Azure, we either need to sign the scripts, or log into each of the machines and allow script execution. As I get that wrapped up, I’ll drop it along side our other samples. We’ll also work to make this easier to get up and running on Azure if that’s interesting for folks.