Hi, my name is Dharshana and I work on the Big Data Support Team at Microsoft. As covered in the earlier post by Dan from our team, HDInsight provides a very easy to use interface to provision a Hadoop cluster with a few clicks and interact with the cluster programmatically. In this blog post, we will explore Hive and how it can be used with HDInsight.
Hive is very popular among data analysts/scientists as it provides a friendly SQL-like interface to interact with the Hadoop cluster. However, behind the scenes, Hive uses the MapReduce framework to submit jobs and get results back. When an HDInsight cluster is provisioned, HiveServer and HiveServer2 services are installed on the head node of the cluster. While HDInsight provides the option to interact with both the services, HiveServer2 is the service of choice since it addresses the concurrency limitation imposed by HiveServer.
You can find the version of HDInsight deployed with Hive here. The key point to note is that Hive is deployed automatically when provisioning the HDInsight cluster and several interfaces are available to interact with Hive. The backend data for Hive tables need to be uploaded into the Windows Azure BLOB Storage container.
Fig 1. Hive Architecture on HDInsight
Let us break the workflow illustrated in Fig 1. Above into greater detail to understand this.
You may have noticed that with GA of Azure HDInsight, there are two versions available for provisioning your HDInsight Cluster – 1.6 and 2.1. We recommend that you install the latest version if you are planning on building your cluster for the first time! However, if you have an existing cluster on a version earlier than 2.1, say 1.6, then you have the option to point to the Hive Metastore associated with the 1.6 cluster when provisioning your Version 2.1 cluster.
The Azure HDInsight implementation detaches storage from the compute cluster, so when you tear down your HDInsight cluster, Storage Containers and Metastore databases are still left intact such that they can be reused.
A picture is worth a thousand words! Let us walk through the steps for an initial provisioning of an HDInsight Cluster. I am consciously choosing the older version of HDInsight – Version 1.6 – for this illustration, so we can upgrade this cluster to Version 2.1 on a later section.
Fig 2. Hive Metastore Database
Fig 3. Hive Metastore Database
Did you notice the pattern so far? See how the SQL Azure databases and the storage accounts we created so far are all on the same datacenter and an additional requirement is that the storage account can NOT belong to any affinity group.
Using one of the cloud storage explorers, such as Azure Storage Explorer, you can pre-provision the storage containers that you need for the HDInsight Cluster. This gives you some control over how you want to name your containers. The screen clip below shows how I have the containers configured –
Fig 4. Storage Containers
Fig 5. New HDInsight Cluster
I am choosing Custom Create Option here and specifying the Cluster Name, Version of 1.6, Number of Data Nodes and the Subscription Name. Only the storage accounts and SQL Azure databases available on the same region as the HDInsight Cluster, are presented at the time of provisioning the cluster. So, on this specific GUI screen, please make sure you specify the same region as on Steps 1 and 2 above.
Fig 6. New HDInsight Cluster – Custom Create
The next screen lets you configure the cluster user and the SQL Azure Metastore Information –
Fig 7. Configure Cluster User – Custom Create
At this step, only the SQL Azure Databases available on the same datacenter as the cluster are presented. This step also validates your access into the SQL Azure database that is specified. Note that, when you use programmatic interfaces to provision the cluster, it is a good idea to validate connectivity through SQL Azure Management Portal from your client machine as a pre-validation step.
The next screen lets you specify the primary storage account and shows the containers that already exist. All the samples and default JAR files are stored on the primary storage container that is specified on this step. If no containers exist on the selected storage account, you have the option to create a default container, which will have the same name as the storage account. You can also specify the number of additional storage accounts that you want to configure for this cluster here and it will display that many configuration pages. As an example, in the following screen clip, since I specified two additional storage accounts, two additional pages – pages 4 and 5 – are created.
Fig 8. Configure Storage Account – Custom Create
The process to configure the other storage accounts is very similar! You would need to choose "Use Storage From Another Subscription" option for Storage Account, to point to a storage account and container from another Subscription. You need to key in the storage account name and key to point to any storage account from another subscription.
Fig 9. Configure Storage Account from another Subscription– Custom Create
Once you have configured all the storage accounts, the cluster creation process starts and typically within about 15 minutes, you have a cluster ready to use!
Fig 10. HDInsight Cluster Creation Complete – Custom Create
Now that we have a new HDInsight Cluster, let us upload some sample data into the primary and secondary storage container, create Hive tables and execute a Hive Query! You can use any storage explorer tool, like Azure Storage Explorer, to upload data into your configured containers. I am using the sample data from here:
http://data.worldbank.org/indicator/GB.XPD.RSDV.GD.ZS/countries. I am working with the "gb.xpd.rsdv.gd.zs_Indicator_en_csv_v2.zip" dataset and in specific, the file - "gb.xpd.rsdv.gd.zs_Indicator_MetaData_Country_en_csv_v2.csv". The data snippet from this file is below:
Fig 11. HDInsight Cluster Creation Complete – Custom Create
Let us remove the header row from the CSV file and upload this file into the Storage Container: "factdata" on the Secondary Storage Account: "mysecondarystorage". If header row is not removed, extra filtering logic needs to be added to Hive queries to ignore the header row. There is currently not an easy way to let Hive ignore the header row when loading data into an external table. There is no concept of directories on Azure BLOB storage, but you can achieve a logical notion of organizing files into directory by naming them with <Directory>/<Filename>. For our file, let us go ahead and name it as "incomegroupbycountry/gb.xpd.rsdv.gd.zs_indicator_MetaData_Country_en_csv_v2.csv". All related files can be named with a prefix of "IncomeGroupByCountry" to have a logical binding.
Fig 12. Data file on Storage Container: "factdata"
Discussion on Hive External and Internal tables is out of scope for this discussion, however if you are interested in learning more about it, you can read more about it on this post. I am creating an external table for the purpose of our exercise. I put in the code on an .hql file and uploaded it to the storage container "hivecustomlibs" on the storage account "myhivelibs".
Fig 13. HQL File on storage container: "hivecustomlibs"
The HQL can be executed from PowerShell. Detailed instructions on setting up the PowerShell Environment for HDInsight is available here
$subscriptionName = "Your Subscription Name" $clusterName = "HDI16" $queryString = "select * from IncomeGroupByCountry limit 10;" # this query selects 10 rows from the newly created table
Select-AzureSubscription -SubscriptionName $subscriptionName Use-AzureHDInsightCluster $clusterName -Subscription (Get-AzureSubscription -Current).SubscriptionId
Invoke-Hive –File "wasb://firstname.lastname@example.org/CreateTableIncomeGroup.hql" Invoke-Hive $queryString
The HQL file contains the following commands:
drop table if exists IncomeGroupByCountry; create external table IncomeGroupByCountry(CountryName STRING, CountryCode STRING, Region STRING, IncomeGroup STRING, SpecialNotes STRING) row format delimited fields terminated by ',' stored as textfile location 'wasb://email@example.com/incomegroupbycountry';
The above HQL drops the table IncomeGroupByCountry if it already exists, and creates the table and points it to the secondary storage account location. Then an Invoke-Hive cmdlet to select 10 rows from table IncomeGroupByCountry is executed. If everything is working, you should see 10 rows being printed on the console now!
One nice thing about HDInsight is the ease of upgrade. HDInsight implementation model makes it very easy to tear down the compute cluster and rebuild a new version of the cluster and point it to the existing BLOB containers and SQL Azure metadata. If the new cluster has a higher version of Hive than the older cluster, then the existing SQL Azure Metastore from the older cluster is automatically upgraded at the time of provisioning!
Let us upgrade our HDInsight 1.6 cluster to HDInsight 2.1. Note that on this illustration, we will test the Metastore upgrade by selecting from the IncomeGroupByCountry table that we created on the previous step. When creating the new cluster, any customization that were added like Hive JARs or configuration options, need to be added back. If you interested to see a blog post on this, please let us know! :)
Fig 14. New Version 2.1 Cluster – Custom Create
Fig 15. Configure Metastore Database – Custom Create
Fig 16. Configure Storage Account – Custom Create
Hope you enjoyed the post as much as I did writing this! Thanks for reading through this and please let us know if you have any questions on this content or feedback.
I had stored one query to a string and copied into *.hql file and stored into storage container. When i execute , get the following error.
Logging initialized using configuration in file:/C:/apps/dist/hive-0.11.0.1.3.2.0-05/conf/hive-log4j.properties
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
FAILED: ParseException line 1:0 cannot recognize input near '$' 'querystring' '='
Any settings i need to do at hive side ?
Hi Sara, this looks to be a parse error on the HQL file. Can you please submit the question on this free forum and I will respond over there…
I have an HDInsight cluster 2.1 working fine with a sql azure hive meta store. I'm trying to upgrade the cluster to 3.0 but the any hive queries throw an error:
"org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset" when I point to the hive meta store.
If I create the 3.0 cluster with no hive meta store, I can create my tables from scratch and everything works fine.
any idea how I can upgrade the old hive meta store?