Use Additional Storage Accounts with HDInsight Hive

When you create an HDInsight Hadoop cluster, you pass in one or more storage accounts and their associated keys. This lets the cluster access the files on all of those storage accounts. If you want to use public storage that wasn’t passed in at create time, that’s easy: simply supply the storage account name each time you run a job. But how do you access data on private storage accounts that need an access key?

The steps are laid out in this wiki by Eric Hanson: Using an HDInsight Cluster with Alternate Storage Accounts and Metastores

https://social.technet.microsoft.com/wiki/contents/articles/23256.using-an-hdinsight-cluster-with-alternate-storage-accounts-and-metastores.aspx 

I am providing a variable-based variation of the PowerShell sample for Hive. To set up PowerShell for use with Azure, see Getting Started with Azure PowerShell Cmdlets–Subscription Management.

First, set some values for your environment. If you use your default subscription you don’t need to pass in the subscription name and select it, but you always need to specify the HDInsight cluster name. In this example $undefinedStorageAccount is the name of an account that you want to access from the cluster but did not define when you created the cluster. Every reference to storage must also name a container, so you need to define $undefinedContainer as well. If the storage account belongs to the current subscription you can ask Azure to return the key (the line commented out with # in the example below); otherwise paste in the key that someone has given you.

 $subscriptionName = "LocalAzureSubscriptionName"
 $clusterName = "HDInsightClusterName"
 $undefinedStorageAccount = "AdditionalStorageAccount"
 $undefinedContainer = "ContainerOnAdditionalStorageAccount"
 #$undefinedStorageKey = Get-AzureStorageKey $undefinedStorageAccount | %{ $_.Primary }
 $undefinedStorageKey = "YourActualAccessKeyFromAzurePortal"

Now choose which of your locally defined subscriptions to use:

 Select-AzureSubscription -SubscriptionName $subscriptionName

Set the context of the cluster you want to use:

 Use-AzureHDInsightCluster $clusterName

Now let’s check your HDInsight cluster properties.

 $defaultStorageAccount  = (Get-AzureHDInsightCluster -Name $clusterName).DefaultStorageAccount.StorageAccountName #default storage account (every cluster has one)
 $defaultContainerName   = (Get-AzureHDInsightCluster -Name $clusterName).DefaultStorageAccount.StorageContainerName #container on the default storage account
 $definedStorageAccounts = (Get-AzureHDInsightCluster -Name $clusterName).StorageAccounts #additional accounts; if none are associated, no value is returned

Let’s check the values and verify that the storage account you want to use is not listed as either the DefaultStorageAccount (every cluster has one) or as one of the additional known storage accounts configured during provisioning (you may have zero, one, or many).

 write-host "===Default storage account"
 $defaultStorageAccount
 write-host "===Default container name"
 $defaultContainerName
 write-host "===Other defined storage accounts for this cluster"
 $definedStorageAccounts
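
If you would rather not compare the names by eye, the check can be scripted. The following is only a sketch: Test-StorageAccountKnown is a hypothetical helper (not part of the Azure cmdlets), and it assumes account names may appear either in short form or with the .blob.core.windows.net suffix.

```powershell
# Hypothetical helper: returns $true when $AccountName matches one of the
# names in $KnownAccounts, with or without the .blob.core.windows.net suffix.
function Test-StorageAccountKnown {
    param(
        [string]   $AccountName,
        [string[]] $KnownAccounts
    )
    $suffix = ".blob.core.windows.net"
    # Reduce every name to its short form so "abc" and
    # "abc.blob.core.windows.net" are treated as the same account.
    $short = $AccountName.ToLower().Replace($suffix, "")
    foreach ($known in $KnownAccounts) {
        # Skip empty entries (for example when no additional accounts exist).
        if ($known -and ($known.ToLower().Replace($suffix, "") -eq $short)) {
            return $true
        }
    }
    return $false
}

# Example with made-up account names
Test-StorageAccountKnown -AccountName "abc" -KnownAccounts @("main.blob.core.windows.net", "abc.blob.core.windows.net")
```

To use it against a real cluster, pass $undefinedStorageAccount together with the account names printed above. A result of False means the account is not yet known to the cluster and needs the steps that follow.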

Next we’ll get a non-recursive listing of the files in the default location:

 invoke-hive "dfs -ls wasb://$defaultContainerName@$defaultStorageAccount/;" #default storage
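
The wasb:// address always has the shape container@account/path. As a small sketch of how that address is put together, here is a hypothetical helper, Get-WasbUri (my own name, not an Azure cmdlet), that appends the .blob.core.windows.net suffix when only the short account name is given:

```powershell
# Hypothetical helper: builds a wasb:// URI from its parts.
function Get-WasbUri {
    param(
        [string] $Container,
        [string] $Account,
        [string] $Path = ""
    )
    $suffix = ".blob.core.windows.net"
    # Accept either the short account name or the fully qualified one.
    if (-not $Account.EndsWith($suffix)) { $Account = "$Account$suffix" }
    # Drop any leading slash on the path so the URI has exactly one.
    return "wasb://$Container@$Account/" + $Path.TrimStart("/")
}

# Example with made-up names
Get-WasbUri -Container "data" -Account "abc" -Path "/logs/2014"
# returns wasb://data@abc.blob.core.windows.net/logs/2014
```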

And then try to get a listing for the private storage account that we have not associated with the cluster:

 invoke-hive "dfs -ls wasb://$undefinedContainer@$undefinedStorageAccount.blob.core.windows.net/;" #not associated, errors

Because the storage account access key is not yet known, you will see an error similar to this one:

 Logging initialized using configuration in file:/C:/apps/dist/hive-0.12.0.2.0.7.0-1559/conf/hive-log4j.properties
 ls: org.apache.hadoop.fs.azure.AzureException: Unable to access container xyz in account abc using anonymous credentials, 
 and no credentials found for them  in the configuration.
 Command failed with exit code = 1

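But we can fix this! From PowerShell we can pass in “defines” to change configuration values, add libraries, etc. Here the value to define is fs.azure.account.key.<account>.blob.core.windows.net, which supplies the access key for that storage account.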

 $defines = @{}
 $defines.Add("fs.azure.account.key.$undefinedStorageAccount.blob.core.windows.net", $undefinedStorageKey)
 Invoke-Hive -Defines $defines -Query "dfs -ls wasb://$undefinedContainer@$undefinedStorageAccount.blob.core.windows.net/;"
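
The same pattern extends to more than one private account. As a rough sketch, the hypothetical helper below (New-StorageKeyDefines is my own name, not an Azure cmdlet) turns a hashtable of short account names and keys into a defines hashtable using the same configuration name as above. The account names and keys shown are placeholders.

```powershell
# Hypothetical helper: builds a -Defines hashtable with one
# fs.azure.account.key.<account>.blob.core.windows.net entry per account.
function New-StorageKeyDefines {
    param([hashtable] $AccountKeys)
    $defines = @{}
    foreach ($account in $AccountKeys.Keys) {
        # Only $account is expanded; the rest of the string is literal text.
        $defines.Add("fs.azure.account.key.$account.blob.core.windows.net", $AccountKeys[$account])
    }
    return $defines
}

# Example with placeholder account names and keys
$defines = New-StorageKeyDefines @{
    "accountone" = "KeyForAccountOne"
    "accounttwo" = "KeyForAccountTwo"
}
$defines.Keys
```

The resulting $defines can then be passed to Invoke-Hive -Defines exactly as in the sample above.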

The access key is only available to this Hive query, but now that the variables are set I can pass $defines in to other queries as well. Happy Hiving!

I hope you enjoyed this small bite of Big Data!