Managing Hive Job Submissions With PowerShell

In my previous post, I talked about “Managing Your HDInsight Cluster with PowerShell”, but made no mention of using Hive. I hope to redress the balance here by talking specifically about how you can submit Hive jobs from the same local management console.

As before, all the scripts mentioned in this and the previous post can be downloaded from here.

To run the samples mentioned in this post I am using the following Hive table definition that points to the sample data provided with the service:

CREATE EXTERNAL TABLE IF NOT EXISTS HiveSampleTable (
    ClientId string,
    QueryTime string,
    Market string,
    DevicePlatform string,
    DeviceMake string,
    DeviceModel string,
    State string,
    Country string,
    QueryDwellTime double,
    SessionId bigint,
    SessionPageViewOrder bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION
'/hive/warehouse/hivesampletable';

The process for Hive submission is basically the same as for MapReduce jobs. However, there is an additional requirement: preserving the Hive metadata. If you define Hive tables, views, and so on, you want those definitions to persist between cluster creations.

To achieve this you will need to create a customized cluster. Once you have done so, you can use Hive for your Elastic job submissions.

Provision Customized Cluster

You can also provision a cluster and configure it to connect to more than one Azure Blob storage account, or to custom Hive and Oozie metastores.

This advanced feature allows you to separate the lifetime of your data and metadata from the lifetime of the cluster. This not only includes job submission details but also allows one to retain Hive metadata, and hence table definitions, as long as Hive External tables are used.

Using the same principle of defining the italicized variables first, the first step is to get the subscription details and the storage keys of the storage accounts you want to connect your cluster to; in this case two storage accounts:

$subid = Get-AzureSubscription -Current | %{ $_.SubscriptionId }
$cert = Get-AzureSubscription -Current | %{ $_.Certificate }

$key1 = Get-AzureStorageKey $myStoreAcct1 | %{ $_.Primary }
$key2 = Get-AzureStorageKey $myStoreAcct2 | %{ $_.Primary }

The next step is to create your custom SQL Azure Database and Server, following this blog post, or to get credentials for an existing one.

To create a SQL Database one can use the management portal.
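Alternatively, here is a hedged sketch of creating the server and the two metastore databases with the Azure SQL Database cmdlets of the time (treat the exact parameters as assumptions and verify them against your installed module version):

# Create a SQL Azure server and the Hive and Oozie metastore databases.
$sqlAzureServer = New-AzureSqlDatabaseServer -Location "East US" -AdministratorLogin "SqlAdmin" -AdministratorLoginPassword "Password"
New-AzureSqlDatabase -ServerName $sqlAzureServer.ServerName -DatabaseName "HiveDatabaseName"
New-AzureSqlDatabase -ServerName $sqlAzureServer.ServerName -DatabaseName "OozieDatabaseName"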

Using this connectivity information you can create a cluster config object, pipe in the additional storage and metastore configuration, and then finally pipe this config object into the New-AzureHDInsightCluster cmdlet to create a cluster based on this custom configuration.

Write-Host "Creating '$numbernodes' Node Cluster named: $clustername" -f yellow

New-AzureHDInsightConfig -ClusterSizeInNodes $numbernodes |
    Set-AzureHDInsightDefaultStorage -StorageAccountName $blobstorage1 -StorageAccountKey $key1 -StorageContainerName $containerdefault |
    Add-AzureHDInsightStorage -StorageAccountName $blobstorage2 -StorageAccountKey $key2 |
    Add-AzureHDInsightMetastore -SqlAzureServerName $sqlserver -DatabaseName $databaseoozie -UserName $sqlusername -Password $sqlpassword -MetastoreType OozieMetastore |
    Add-AzureHDInsightMetastore -SqlAzureServerName $sqlserver -DatabaseName $databasehive -UserName $sqlusername -Password $sqlpassword -MetastoreType HiveMetastore |
    New-AzureHDInsightCluster -SubscriptionId $subid -Certificate $cert -UserName $username -Password $password -Name $clustername -Location $location

As before, a sample set of variable settings could be:

$mySubName = "Windows Azure Subscription Name"
$myStoreAcct1 = "storageaccount"
$myStoreAcct2 = "storageaccountbck"

$blobstorage1 = "storageaccount.blob.core.windows.net"
$blobstorage2 = "storageaccountbck.blob.core.windows.net"

$containerdefault = "hadooproot"

$clustername = $Cluster
$location = "East US"

$numbernodes = $Hosts

$username = "Admin"
$password = "Password"

$sqlserver = "sqlservername.database.windows.net"
$sqlusername = "SqlAdmin"
$sqlpassword = "Password"

$databasehive = "HiveDatabaseName"
$databaseoozie = "OozieDatabaseName"

As before, one can use the PowerShell Param option to create a script that enables passing in parameters, such as the number of hosts and the cluster name.
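As a minimal sketch, the top of such a ClusterCreateComplex.ps1 script might look like the following (the parameter names simply mirror the variables used above; the remainder of the script body is as already shown):

# Hypothetical parameter block; the defaults mirror the sample
# variable settings shown earlier.
Param(
    [int] $Hosts = 4,
    [string] $Cluster = $(throw "Cluster Name Required.")
)

$numbernodes = $Hosts
$clustername = $Cluster

With such a parameter block in place, one could create a cluster using the following command: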

. "C:\Users\Me\Scripts\ClusterCreateComplex.ps1" -Hosts 4 -Cluster "clustername"

In this case I have used a secondary storage account with a “bck” suffix. Other than this, along with the additional SQL parameters, the Azure storage configuration is the same as in the previous case.

Hive Job Submission

The process for running Hive jobs is a little different from that for MapReduce. In this instance one has to use the web client to create and submit the Hive job. As an aside, the process for running a Pig script is very similar.

For Hive, you have two options: you can provide the Hive command itself, or you can provide the location of a script file accessible from the cluster.

Thus, assuming you have created the cluster as specified in section “Provision Customized Cluster”, the script to submit a Hive command or file would be:

Param(
    $Cluster = "myclustername",
    [string] $OutputPath = $(throw "Output Path Required."),
    [string] $Hql = $(throw "Hive Query File Required."),
    $HqlType = "Command"
)

# Import the management module and DLLS
$basePath = "C:\Users\Me"
$dllPath = $basePath + "\Dlls"
Add-Type -Path ($dllPath + "\microsoft.hadoop.client.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.webclient.dll")
Add-Type -Path ($dllPath + "\System.Net.Http.dll")
Add-Type -Path ($dllPath + "\System.Net.Http.Formatting.dll")
Add-Type -Path ($dllPath + "\System.Net.Http.WebRequest.dll")

# Define the JObject type
$newtonDll = $dllPath + "\Newtonsoft.Json.dll"
$submitterAssembly = [System.Reflection.Assembly]::LoadFile($newtonDll)
[Type] $jobjectType = $submitterAssembly.GetType("Newtonsoft.Json.Linq.JObject", $true)

# Define the connection properties
$clusterName = "https://$Cluster.azurehdinsight.net:563"
[Uri] $clusterUri = New-Object System.Uri $clusterName

$mySubName = "Windows Azure Subscription Name"
$clusterUsername = "Admin"
$clusterPassword = "Password"

# Get the Hive Query from the specified File or Command
[string] $hiveCommand = ""
[string[]]$hiveFiles = @()

if ($HqlType -eq "File") {
    $hiveFiles = $hiveFiles + $Hql
    $hiveCommand = $Null
} elseif ($HqlType -eq "Command") {
    $hiveFiles = $Null
    $hiveCommand = $Hql
} else {
    throw "Hql Type must be 'File' or 'Command' and '$HqlType' was entered."
}

# Connect and Run the Hive Query
$httpClient = New-Object Microsoft.Hadoop.WebHCat.Protocol.WebHCatHttpClient ($clusterUri, $clusterUsername, $clusterPassword, $clusterUsername)

Write-Host "Submitting Hive $HqlType : $Hql" -f yellow
Write-Host "Working Directory '$OutputPath'" -f yellow
Write-Host "On Cluster '$clusterName'" -f yellow
   
$hiveJob = $httpClient.CreateHiveJob($hiveCommand, $hiveFiles, $Null, $OutputPath, $Null)
if (-Not $hiveJob.IsCompleted) { $hiveJob.Wait() }

if ($hiveJob.Exception) {
    Write-Host "Job Submission Error" -f yellow -b red
    Write-Host $hiveJob.Exception -f red
}
else
{
    $response = $hiveJob.Result   
    $output = [System.Net.Http.HttpContentExtensions]::ReadAsAsync($response.Content, $jobjectType)
    $output.Wait()

    $response.EnsureSuccessStatusCode()

    $jobId = $output.Result.GetValue("id").ToString()
    $httpClient.WaitForJobToCompleteAsync($jobId).Wait()

    Write-Host "Job Run Information" -f yellow
    Write-Host "Job Id: " $jobId -f yellow
    Write-Host $response
}

 

One has to remember that this script assumes the location of the file to be executed is local to the cluster (that is, in storage the cluster can reach). Also, when providing a string command there is a limit of about 2K of text.
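If you want the submission to fail fast rather than hit that limit, a small guard could be added to the script before the job is created (a sketch; the 2K figure is approximate):

# Optional guard: reject over-long inline Hive commands up front.
if ($HqlType -eq "Command" -and $Hql.Length -gt 2000) {
    throw "Hive command is $($Hql.Length) characters; save it to a file and use -HqlType 'File' instead."
}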

Examples of running this script would be:

. "C:\Users\Me\Scripts\ClusterHiveScriptSubmission.ps1" -Cluster "myclustername" -OutputPath "/hive/output" -Hql "/hive/sample/queries/HiveSampleExtractFile.hql" -HqlType "File"

. "C:\Users\Me\Scripts\ClusterHiveScriptSubmission.ps1" -Cluster "myclustername" -OutputPath "/hive/output" -Hql "INSERT OVERWRITE DIRECTORY '/hive/sample/output' SELECT DevicePlatform, DeviceMake, SUM(QueryDwellTime) FROM HiveSampleTable GROUP BY DevicePlatform, DeviceMake HAVING DevicePlatform <> 'Unknown'"

In this instance the paths specified are actually ASV relative paths based on the default storage container.
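For example, under the configuration above a relative path would resolve into the default container along these lines (treat the exact scheme syntax as an assumption, as it varied across HDInsight preview releases):

/hive/output  ->  asv://hadooproot@storageaccount.blob.core.windows.net/hive/output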

The output path will contain the “exit”, “stderr”, and “stdout” data from the job execution.
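As a hedged example of retrieving those results afterwards, assuming the Azure storage cmdlets are available, the current storage account has been set as shown earlier, and the blob path matches the -OutputPath used above:

# Download the job's stdout from the default container to a local folder.
Get-AzureStorageBlobContent -Container "hadooproot" -Blob "hive/output/stdout" -Destination "C:\Users\Me\Output\"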

In running this script you will need to point it to the DLLs required for the web client assemblies. If you create a .NET MapReduce job, the necessary DLLs will be fetched from NuGet.
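If you prefer to pull the assemblies down yourself, a hedged sketch using the NuGet command line would be (the package ID is an assumption based on the Hadoop SDK naming; check the SDK site for the definitive list):

nuget install Microsoft.Hadoop.Client -OutputDirectory C:\Users\Me\Dlls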

Elastic Hive Submissions

As long as one has created a customized cluster, as defined above, one can recreate a cluster and still have access to the previously created Hive definitions. This option allows you to:

  • Create a cluster
  • Define Hive metadata (tables, views, etc.)
  • Delete the cluster

When required, you can then:

  • Create the cluster
  • Run Hive scripts against previously defined objects
  • Delete the cluster

Such a scenario allows for the usage of an Elastic service even when utilizing Hive.

Note. The following script assumes the default storage account is used as the basis for the Hadoop root. This is defined when performing the workstation configuration, as follows:

$mySubID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
$certThumbprint = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
$myCert = Get-Item cert:\CurrentUser\My\$certThumbprint
$mySubName = "Windows Azure Subscription Name"
$myStoreAcct = "defaultstorageaccount"

Set-AzureSubscription -SubscriptionName $mySubName -Certificate $myCert -SubscriptionID $mySubID

# Set the current default subscription to be the one being managed along with the default storage account
Set-AzureSubscription -DefaultSubscription $mySubName
Set-AzureSubscription -SubscriptionName $mySubName -CurrentStorageAccount $myStoreAcct

In defining a script to perform these operations, one not only needs to define the job and cluster properties but also the SQL Server properties, so that the cluster can be pointed back to the original metadata. As such, database credentials are needed along with the Hive and Oozie database names.

The following script will thus allow you to specify the HDInsight, Database, and Hive job properties:

Param(
    $Hosts = 4,
    $Cluster = "myclustername",
    [string] $SqlServerName = $(throw "Sql Server Name Required."),
    [string] $HiveDatabase = $(throw "Hive Database Name Required."),
    [string] $OozieDatabase = $(throw "Oozie Database Name Required."),
    [string] $SqlUsername = $(throw "Sql Username Required."),
    [string] $SqlPassword = $(throw "Sql Password Required."),
    [string] $OutputPath = $(throw "Output Path Required."),
    [string] $Hql = $(throw "Hive Query File Required."),
    $HqlType = "Command"
)

$basePath = "C:\Users\Me"

# Import the management module and DLLS
$dllPath = $basePath + "\Dlls"
$cmdletsPath = $basePath + "\Cmdlets"

Import-Module ($cmdletsPath + "\Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.client.dll")
Add-Type -Path ($dllPath + "\microsoft.hadoop.webclient.dll")
Add-Type -Path ($dllPath + "\System.Net.Http.dll")
Add-Type -Path ($dllPath + "\System.Net.Http.Formatting.dll")
Add-Type -Path ($dllPath + "\System.Net.Http.WebRequest.dll")

# Define the JObject type
$newtonDll = $dllPath + "\Newtonsoft.Json.dll"
$submitterAssembly = [System.Reflection.Assembly]::LoadFile($newtonDll)
[Type] $jobjectType = $submitterAssembly.GetType("Newtonsoft.Json.Linq.JObject", $true)

# Get the subscription information and set variables
$subscriptionInfo = Get-AzureSubscription -Current

$subName = $subscriptionInfo | %{ $_.SubscriptionName }
$subId = $subscriptionInfo | %{ $_.SubscriptionId }
$cert = $subscriptionInfo | %{ $_.Certificate }

$storeAccount = $subscriptionInfo | %{ $_.CurrentStorageAccount }
$storeAccountBck = $storeAccount + "bck"

$key = Get-AzureStorageKey $storeAccount | %{ $_.Primary }
$keyBck = Get-AzureStorageKey $storeAccountBck | %{ $_.Primary }

$storageAccountInfo = Get-AzureStorageAccount $storeAccount
$location = $storageAccountInfo | %{ $_.Location }

$clusterUsername = "Admin"                           
$clusterPassword = [System.Web.Security.Membership]::GeneratePassword(20, 5)
$clusterName = $Cluster
$numberNodes = $Hosts
$containerDefault = "hadooproot"

$clusterHttp = "https://$clusterName.azurehdinsight.net:563"
$blobStorage = "$storeAccount.blob.core.windows.net"
$blobStorageBck = "$storeAccountBck.blob.core.windows.net"

$sqlServer = "$SqlServerName.database.windows.net"

# Create the cluster
Write-Host "Creating '$numberNodes' Node Cluster named: $clusterName" -f yellow
Write-Host "Storage Account '$storeAccount' and Container '$containerDefault'" -f yellow
Write-Host "User '$clusterUsername' Password '$clusterPassword'" -f green
New-AzureHDInsightConfig -ClusterSizeInNodes $numberNodes |
    Set-AzureHDInsightDefaultStorage -StorageAccountName $blobStorage -StorageAccountKey $key -StorageContainerName $containerDefault |
    Add-AzureHDInsightStorage -StorageAccountName $blobStorageBck -StorageAccountKey $keyBck |
    Add-AzureHDInsightMetastore -SqlAzureServerName $sqlServer -DatabaseName $OozieDatabase -UserName $SqlUsername -Password $SqlPassword -MetastoreType OozieMetastore |
    Add-AzureHDInsightMetastore -SqlAzureServerName $sqlServer -DatabaseName $HiveDatabase -UserName $SqlUsername -Password $SqlPassword -MetastoreType HiveMetastore |
    New-AzureHDInsightCluster -SubscriptionId $subId -Certificate $cert -UserName $clusterUsername -Password $clusterPassword -Name $clusterName -Location $location
Write-Host "Created '$numberNodes' Node Cluster: $clusterName" -f yellow

# Get the Hive Query from the specified File or Command
[string] $hiveCommand = ""
[string[]] $hiveFiles = @()

if ($HqlType -eq "File") {
    $hiveFiles = $hiveFiles + $Hql
    $hiveCommand = $Null
} elseif ($HqlType -eq "Command") {
    $hiveFiles = $Null
    $hiveCommand = $Hql
} else {
    throw "Hql Type must be 'File' or 'Command' and '$HqlType' was entered."
}

# Connect and Run the Hive Query
[Uri] $clusterUri = New-Object System.Uri $clusterHttp
$httpClient = New-Object Microsoft.Hadoop.WebHCat.Protocol.WebHCatHttpClient ($clusterUri, $clusterUsername, $clusterPassword, $clusterUsername)

Write-Host "Submitting Hive $HqlType : $Hql" -f yellow
Write-Host "Working Directory '$OutputPath'" -f yellow
Write-Host "On Cluster '$clusterName'" -f yellow
   
$hiveJob = $httpClient.CreateHiveJob($hiveCommand, $hiveFiles, $Null, $OutputPath, $Null)
if (-Not $hiveJob.IsCompleted) { $hiveJob.Wait() }

if ($hiveJob.Exception) {
    Write-Host "Job Submission Error" -f yellow -b red
    Write-Host $hiveJob.Exception -f red
}
else {
    $response = $hiveJob.Result   
    $output = [System.Net.Http.HttpContentExtensions]::ReadAsAsync($response.Content, $jobjectType)
    $output.Wait()

    $response.EnsureSuccessStatusCode()

    $jobId = $output.Result.GetValue("id").ToString()
    $httpClient.WaitForJobToCompleteAsync($jobId).Wait()

    Write-Host "Job Run Information" -f yellow
    Write-Host "Job Id: " $jobId -f yellow
    Write-Host $response
}

# Finally delete the cluster
Write-Host "Deleting Cluster named: $clusterName" -f yellow
Remove-AzureHDInsightCluster $clusterName -SubscriptionId $subId -Certificate $cert
Write-Host "Cluster $clusterName Deleted" -f yellow

 

This script, however, does make a few assumptions:

  1. The default container is set to "hadooproot"
  2. The subscription's current storage account is the one used for the cluster
  3. The backup storage account is the default account name with a "bck" suffix

Of course one could optionally add these as additional parameters, as sketched below.
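A hedged sketch of such additions (the parameter names are illustrative):

# Additional entries one might append to the script's Param block to
# remove the assumptions listed above.
Param(
    [string] $ContainerDefault = "hadooproot",
    [string] $BackupSuffix = "bck"
)

# ...and later in the script, replace the hard-coded values:
$containerDefault = $ContainerDefault
$storeAccountBck = $storeAccount + $BackupSuffix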

To run this script to execute a Hive file saved in the cluster's storage, one can execute the following command:

. "C:\Users\Me\Scripts\ClusterHiveElasticScriptSubmissionAuto.ps1" -Hosts 4 -Cluster "myclustername" -SqlServerName "myservername" -SqlUsername "sqlusername" -SqlPassword "sqlpassword" -HiveDatabase "HiveDatabaseName" -OozieDatabase "OozieDatabaseName" -OutputPath "/hive/output" -Hql "/hive/sample/queries/HiveSampleExtractFile.hql" -HqlType "File"

Or when running a Hive command directly:

. "C:\Users\Me\Scripts\ClusterHiveElasticScriptSubmissionAuto.ps1" -Hosts 4 -Cluster "myclustername" -SqlServerName "sqlservername" -SqlUsername "sqlusername" -SqlPassword "sqlpassword" -HiveDatabase "HiveDatabaseName" -OozieDatabase "OozieDatabaseName" -OutputPath "/hive/output" -Hql "INSERT OVERWRITE DIRECTORY '/hive/sample/output' SELECT DevicePlatform, DeviceMake, SUM(QueryDwellTime) FROM HiveSampleTable GROUP BY DevicePlatform, DeviceMake HAVING DevicePlatform <> 'Unknown'"

Of course one can also specify a Hive Command as in the previous Hive job submission example.

As before, the script also generates a password, which is displayed in case the cluster is not successfully deleted after the job execution.
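Should that deletion step fail, the cluster can be removed manually with the same cmdlet the script uses:

Remove-AzureHDInsightCluster "myclustername" -SubscriptionId $subid -Certificate $cert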

Of course, other parameters could be included in this model to make the PowerShell script applicable to more job types, and possibly to remove or change any of the assumptions I have made in creating the script.

Comments
  • Where can the following .dlls be found?

    # Add submission file references

    Add-Type -Path ($dllPath + "\microsoft.hadoop.client.dll")

    Add-Type -Path ($dllPath + "\microsoft.hadoop.mapreduce.dll")

    Add-Type -Path ($dllPath + "\microsoft.hadoop.webclient.dll")

  • They are part of the Microsoft .NET SDK For Hadoop: http://hadoopsdk.codeplex.com/

  • Great post!

    QQ:

    What are the advantages of using WebHCatHttpClient instead of New-AzureHDInsightHiveJobDefinition, as described here: www.windowsazure.com/.../using-hive-with-hdinsight?

    Since this is one of the first posts to show up in the search results, people might think that they must use the more complicated WebHCatHttpClient for submitting Hive jobs via PowerShell. I suggest calling out the simpler alternative to avoid such confusion.

    Thanks
