HDInsight was optimized from the start to be able to quickly analyze data on Azure's blob storage service using Hadoop by using the WASB file system to expose the data there as a native Hadoop file system. But the spirit of Hadoop has always been to be able to analyze data wherever it is, so in this post I'll show you how to analyze data residing the more structured Azure table storage service. This would enable you to analyze tables that have millions/billions of rows in them using Hadoop/Hive to distribute that analysis to many compute nodes in Azure.
To follow along in this post, you'll need to:
You can skip this step if you already have a table you want to analyze (and hey if you're reading this chances are good that this is the case). But just to get a quick-and-dirty example going, here's is a simple PowerShell script to create an example table for us to analyze:
function InsertRow($table, [String]$partitionKey, [String]$rowKey, [int]$intValue)
$entity = New-Object "Microsoft.WindowsAzure.Storage.Table.DynamicTableEntity" $partitionKey, $rowKey
$insertResult = $table.CloudTable.Execute(`
$exampleTable = New-AzureStorageTable tobehadooped
for ($p = 1; $p -le 10; $p++)
for ($r = 1; $r -le 100; $r++)
InsertRow $exampleTable "P$p" "R$r" $r
Now that we have data we want to analyze, let's setup an HDInsight cluster to analyze it. In this tutorial I'll be using the extensions in my github project azure-tables-hadoop to be able to access Azure tables from Hive, so we'll need to get it and package it:
Update (6/27/2014): Since writing this post HDInsight and Hive have advanced forward in a not very backwards-compatible way. I've updated my git project so now it works with HDInsight 3.1 (version 0.0.3 of my jar). If you want the HDInsight 2.1-compatible version please git checkout hdp1compat before the mvn package step below. And if you want the HDInsight 3.0-compatible version please git checkout Hive12 instead.
git checkout hdp1compat
git checkout Hive12
git clone https://github.com/mooso/azure-tables-hadoop.git
After that, we need to create a cluster with this JAR accessible to Hive. To do that, we'll use the AdditionalLibraries option when creating the cluster that's documented here. So we'll need to create a container to put this JAR in, put it there, then create the cluster that points to it:
# Please customize the lines below to choose your own user name, password and cluster name
$creds = New-Object System.Management.Automation.PSCredential 'myUser',`
('Please ch00se a different password :)' | ConvertTo-SecureString -force -asplaintext)
$clusterName = "mycluster"
$hiveLibsContainer = New-AzureStorageContainer hiveontableslib
$azureJar = ".\target\microsoft-hadoop-azure-0.0.1.jar"
$azureJarBlob = Set-AzureStorageBlobContent -Container $hiveLibsContainer.Name `
-Blob $(Split-Path $azureJar -Leaf) -File $azureJar -Force
$hiveConf = New-Object $('Microsoft.WindowsAzure.Management.' + `
$hiveConf.AdditionalLibraries = New-Object $('Microsoft.WindowsAzure' + `
$storageAccount = $(Get-AzureSubscription -Current).CurrentStorageAccountName
$hiveConf.AdditionalLibraries.StorageAccountKey = (Get-AzureStorageKey $storageAccount).Primary
$hiveConf.AdditionalLibraries.StorageContainerName = $hiveLibsContainer.Name
$clusterContainer = New-AzureStorageContainer $clusterName
$clusterConf = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 1 |
Set-AzureHDInsightDefaultStorage -StorageAccountName $storageAccount `
-StorageAccountKey $(Get-AzureStorageKey $storageAccount).Primary -StorageContainerName $clusterContainer.Name
$clusterConf = $clusterConf | Add-AzureHDInsightConfigValues -Hive $hiveConf
$cluster = $clusterConf | New-AzureHDInsightCluster -Credential $creds `
-Name $clusterName -Location $(Get-AzureStorageAccount $storageAccount).Location
If all goes well, this should take around 10 minutes then you'll have a newly minted single-node HDInsight cluster to play with.
The easiest way to explore this data is to use Hive with its SQL-like query language (you can also use Map-Reduce directly or use Pig or other Hadoop ecosystem components, but I'll ignore those for this post). The main trick we use to expose Azure tables to Hive is to have a custom Storage Handler for it, so we need to use the STORED BY clause to specify that the table data resides in an Azure table. Then we use the TBLPROPERTIES clause to specify the information about the table.
$hiveQuery = "CREATE EXTERNAL TABLE ExampleTable(IntValue int)
STORED BY 'com.microsoft.hadoop.azure.hive.AzureTableHiveStorageHandler'
Out-File -FilePath .\HiveCreateTable.q -InputObject $hiveQuery -Encoding ascii
$hiveQueryBlob = Set-AzureStorageBlobContent -File .\HiveCreateTable.q -Blob "queries/HiveCreateTable.q" `
-Container $clusterContainer.Name -Force
$createTableJobDefinition = New-AzureHDInsightHiveJobDefinition -QueryFile /queries/HiveCreateTable.q
$job = Start-AzureHDInsightJob -JobDefinition $createTableJobDefinition -Cluster $cluster.Name
Wait-AzureHDInsightJob -Job $job
This should create the Hive table for us, so now we can finally explore it! For example, to get the average of the IntValue column:
$avgJobDefinition = New-AzureHDInsightHiveJobDefinition -Query "SELECT AVG(IntValue) FROM ExampleTable"
$job = Start-AzureHDInsightJob -JobDefinition $avgJobDefinition -Cluster $cluster.Name
Wait-AzureHDInsightJob -Job $job
Get-AzureHDInsightJobOutput -JobId $job.JobId -Cluster $cluster.Name
Let me air out the dirty laundry and explain how this all works. The main way I expose Azure tables data to Hadoop is by having an InputFormat defined for it. An InputFormat for Hadoop has two main responsibilities:
The second responsibility for our purposes is just grunt work: we just use the excellent Azure Storage SDK to read the data and pass it on as WritableEntity objects (in Hive, the StorageHandler understands those and maps them to columns). The first one though is a bit trickier: there is no universally good way to partition data to get the most out of our cluster. Ideally, if you have a 100 nodes (each with 4 mapper slots) processing your data, you want at least 400 splits among your mappers. Also, you want each mapper's data to fit in memory, so you don't want too many rows per mapper (also because if a mapper runs for an hour then fails it'll have to redo that all over again). But you don't want each mapper to have too little to do either: then your job run time will be dominated by overhead from starting/stopping all these mappers. So in short: it's an art to tune this partitioning step, and I certainly don't have it right by default for most real-world situations. By default: I query for all the partition keys in the table (one-by-one, so it's a slow process if there are many keys), then split the table by assigning each mapper to one of those values. If you want to customize the partitioning for your data, you can do that by using the configuration knob 'azure.table.partitioner.class' to specify a custom partitioner that you can provide.
Hope that helps! Please leave comments here and/or create issues/pull requests on the project in github if you have any suggestions/feedback.
Interesting stuff. Haven't had a go at this yet (I will do soon) but in the meantime, do you know any way of INSERTing data into Azure table Storage using Hive?
Hi Jamie - thanks for your interest. Hive extensibility does allow INSERT into externally stored data like this, but unfortunately it's still a TODO in the Azure tables code I wrote up so it won't work with that (and I don't personally know of others who implemented INSERT functionality). Hopefully I'll get around to implementing it soon, but if you feel up to it feel free to dig in, code it up and send a pull request.
Cool, thank you Mostafa.
This is really an interesting article. Just wondering how HDInsight team adds the support of Azure Storage to HIVE. Is it also through Storage Handler?
@Justin: Thank you! Azure Blob Storage is exposed as a file system within Hadoop, so Hive can just use it like it uses any other Hadoop file system. See this article for details on how it works: azure.microsoft.com/.../hdinsight-use-blob-storage
I have been having some difficulty setting this up to query my Azure Table. I keep getting: HiveException: Error in loading storage handler.com.microsoft.hadoop.azure.hive.AzureTableHiveStorageHandler
I set up a new HDInsight Cluster (2.2), got the GIT code and Packaged it. It created the: microsoft-hadoop-azure-0.0.2.jar
I then copied this JAR file everywhere in my shared blob storage for Hive.
I then remote to the Azure HDInsight Cluster server. Copied the JAR file to the local file server in Azure and modified the hive-site.xml file to add the file path property reference. (But I don't know how to restart hive to pick up these settings?)
Any help in figuring out why Hive can't find the JAR?
I haven't even touched this yet, but great article to get my head around HDInsight and its place with Azure tables (I actually understand this one).
works well with hdinsight 3.0 but not working with hdinsight 3.1
JVM throw method not found on entitypropertity inspector
@Tom: this is probably because I've since updated the project to HDI 3.0 (and now to HDI 3.1, see my response to Huan below)...
@Huan: thanks for pointing it out! It turns out there were some minor API breaking changes between Hive 0.12 (in HDI 3.0) and Hive 0.13 (in HDI 3.1). I've updated my code to comply with the latest changes, and created a tag in git, Hive12, for anyone who wants the HDI 3.0 version. I'll update the post to reflect all that...
Thank you so much, this worked great. I was using 3.0 HDINsight, and I didn't notice a tag or a branch for it like you mentioned, but I was able to get the code at a specific version and compile it and go
how hard would it be to add in support for having the paritionkey and rowkey in the external table as well? currently only properties are able to be in the external table.
@rob: glad it worked for you. Adding the PartitionKey and RowKey was pretty easy so I added it (commit d57064e). Please sync to the newer version and try it out - or if you want it on the older HDInsight versions just diff that commit and apply it to the older version.
Nice article :)
Just curious to know how much difference we see in terms of performance while reading data from blob storage(assuming its written in delimited compressed format) vs azure table storage? As in case of blob storage, we are reading a file system but with table storage handler, we are still reading 1000 records in one call and will be receiving complete data using continuation token. It may affect the data read performance?