Today we announced the general availability of the HDInsight Service for Windows Azure. HDInsight is a Hadoop-based service from Microsoft that brings a 100 percent Apache Hadoop solution to the cloud.
An HDInsight cluster can be created from the Windows Azure Management Portal by clicking the New button and selecting HDInsight from the Data Services menu. To create an HDInsight cluster, specify a name for the cluster, the size of the cluster (in number of data nodes), and a password for logging in.
A cluster must have at least one storage account associated with it, which serves as the permanent storage mechanism for that cluster, and the cluster is always created in the same region as the chosen storage account. At general availability, the storage account must reside in West US, East US, or North Europe to be associated with an HDInsight cluster. Additional storage accounts can be associated with a cluster using the custom create option.
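If you prefer scripting, a cluster can also be provisioned from PowerShell once the HDInsight tools described below are installed. A minimal sketch, assuming a storage account named "mystorageaccount" already exists in West US; the account, container, and cluster names here are placeholders:

```powershell
# Look up the primary key for the existing storage account (placeholder name)
$storageKey = Get-AzureStorageKey "mystorageaccount" | ForEach-Object { $_.Primary }

# Provision a 4-node cluster; Get-Credential prompts for the cluster login
New-AzureHDInsightCluster -Name "HadoopIsAwesome" -Location "West US" `
    -DefaultStorageAccountName "mystorageaccount.blob.core.windows.net" `
    -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName "hadoopisawesome" `
    -Credential (Get-Credential) `
    -ClusterSizeInNodes 4
```

Provisioning from PowerShell is useful for automating cluster creation and teardown, since you pay for the cluster only while it exists.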
It will take a few minutes for the cluster to be deployed and configured. Once it is ready, you will be presented with a Getting Started screen that provides links to additional help content, as well as some sample code to run your first Hadoop job using HDInsight.
If you select the dashboard tab on the HDInsight page for your cluster, you will see a screen that provides some basic information on the current status of your cluster, including usage in number of cores, job history, and linked storage accounts.
Submitting Your First MapReduce Job
Before you submit your first job, you must prepare your development environment to use the HDInsight PowerShell cmdlets. The cmdlets require two main components to be installed and configured: Windows Azure PowerShell and the HDInsight PowerShell tools. Follow the links in step 1 of the Getting Started screen to set up your environment.
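Once both components are installed, one way to connect PowerShell to your subscription is with a publish settings file. A sketch; the local file path below is a placeholder for wherever your browser saved the download:

```powershell
# Opens a browser window to download the .publishsettings file for your subscription
Get-AzurePublishSettingsFile

# Import the downloaded file to configure your subscription (placeholder path)
Import-AzurePublishSettingsFile "C:\Downloads\MySubscription.publishsettings"

# Verify the subscription is now available
Get-AzureSubscription -Current
```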
The Getting Started page shows sample commands for submitting either a Hive or a MapReduce job; we will start by submitting a MapReduce job.
Run the sample using these commands to create the job definition. The job definition contains all the information for your job: for example, which mapper and reducer to use, which data to use as input, and where to store the output. In this example we use a sample MapReduce program and sample file that are included with the cluster, and we create an output directory under the samples directory to store the results.
$jarFile = "/example/jars/hadoop-examples.jar"
$className = "wordcount"
$statusDirectory = "/samples/wordcount/status"
$outputDirectory = "/samples/wordcount/output"
$inputDirectory = "/example/data/gutenberg"
$wordCount = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile `
    -ClassName $className -Arguments $inputDirectory, $outputDirectory `
    -StatusFolder $statusDirectory
Run these commands to get your subscription information and start execution of the MapReduce program. MapReduce jobs are typically long-running, so this example shows how to use the asynchronous commands to kick off execution of the job.
$subscriptionId = (Get-AzureSubscription -Current).SubscriptionId
$wordCountJob = $wordCount | Start-AzureHDInsightJob -Cluster HadoopIsAwesome `
    -Subscription $subscriptionId | Wait-AzureHDInsightJob -Subscription $subscriptionId
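If you would rather not block in Wait-AzureHDInsightJob, you can poll the job yourself with Get-AzureHDInsightJob. A sketch, assuming the same cluster name; the exact value of the job's State property as it runs is an assumption here:

```powershell
# Kick off the job without waiting for it to finish
$wordCountJob = $wordCount | Start-AzureHDInsightJob -Cluster HadoopIsAwesome `
    -Subscription $subscriptionId

# Poll every 10 seconds until the job is no longer running
do {
    Start-Sleep -Seconds 10
    $wordCountJob = Get-AzureHDInsightJob -Cluster HadoopIsAwesome `
        -Subscription $subscriptionId -JobId $wordCountJob.JobId
} while ($wordCountJob.State -eq "Running")
```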
Finally, run this command to retrieve the results of execution and display those on the PowerShell command line.
Get-AzureHDInsightJobOutput -Subscription (Get-AzureSubscription -Current).SubscriptionId `
    -Cluster HadoopIsAwesome -JobId $wordCountJob.JobId -StandardError
The result displayed here is information on the execution of the job itself, not the word counts; those are written to the output directory.
The output of the job was placed in the "/samples/wordcount/output" directory of your storage account. Open the storage viewer in the Windows Azure Portal and navigate to this directory to download and view the output file.
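You can also download the output file directly from PowerShell with the Azure Storage cmdlets. A sketch, assuming placeholder account and container names matching the cluster's default storage, and Hadoop's conventional "part-r-00000" name for the first reducer's output file:

```powershell
# Placeholder storage account and container backing the cluster
$storageAccount = "mystorageaccount"
$containerName  = "hadoopisawesome"

# Build a storage context from the account's primary key
$storageKey = Get-AzureStorageKey $storageAccount | ForEach-Object { $_.Primary }
$ctx = New-AzureStorageContext -StorageAccountName $storageAccount `
    -StorageAccountKey $storageKey

# Download the first reducer's output file to the local disk
Get-AzureStorageBlobContent -Container $containerName `
    -Blob "samples/wordcount/output/part-r-00000" `
    -Destination "C:\wordcount-output.txt" -Context $ctx
```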
Submitting Your First Hive Job
The Getting Started page also has a screen that shows some sample commands for connecting to your cluster and submitting a Hive job. Click the Hive button in the Job type section to see the sample.
Run this sample now by first executing this command in PowerShell to connect to your cluster.
Use-AzureHDInsightCluster HadoopIsAwesome (Get-AzureSubscription -Current).SubscriptionID
Next, run this command to submit a HiveQL statement to the cluster. The statement queries a sample Hive table that is set up on the cluster by default when it is created.
Invoke-Hive "select country, state, count(*) as records from hivesampletable group by country, state order by records desc limit 5"
The query is a fairly simple select/group-by; when it completes, the results are displayed on the PowerShell command line.
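Invoke-Hive is convenient for interactive queries, but Hive jobs can also be submitted with the same asynchronous job pattern used for MapReduce above. A sketch of the same query as a job definition, assuming $subscriptionId is set as in the earlier example:

```powershell
# Define the Hive job from a HiveQL query string
$hiveJob = New-AzureHDInsightHiveJobDefinition -Query `
    "select country, state, count(*) as records from hivesampletable group by country, state order by records desc limit 5"

# Submit it and wait for completion, just like the MapReduce job
$hiveJob | Start-AzureHDInsightJob -Cluster HadoopIsAwesome `
    -Subscription $subscriptionId | Wait-AzureHDInsightJob -Subscription $subscriptionId
```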
In this blog post we showed just how easy it is to get up and running with an HDInsight cluster and begin analyzing your data. There is a lot more you can do and learn with HDInsight, such as uploading your own data sets, running sophisticated jobs, and analyzing your results. For more details on using HDInsight, visit the HDInsight documentation page or use the following links to access help articles directly.
For details on pricing visit the HDInsight pricing details page.