Hi Folks,

I'm Jason from the Microsoft Big Data Support team. Thanks for reading our blog, and for trying out HDInsight in your own business.

I want to share some new articles that Microsoft just published to help you get started with HDInsight. For folks who are not so familiar with the Apache projects, I'll briefly compare each one to an existing Microsoft product or SQL Server feature you may already know, to help ease the transition.

For even more articles, check out the main HDInsight documentation front page at http://www.windowsazure.com/en-us/documentation/services/hdinsight/

Happy Hadooping!
Jason Howell


1. Monitor HDInsight clusters using Ambari API



  • Analogous to System Center: If you are familiar with using Microsoft System Center to deploy applications, and SCOM management packs to monitor those systems, you will feel at home with the concepts in Apache Ambari.


  • What is Ambari? The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. http://ambari.apache.org/
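
To make the "RESTful APIs" part a little more concrete, here is a minimal Python sketch of what a monitoring call could look like. It only builds the request and does not send it. The base URL, user name, and password are placeholders I made up, and the exact path under which your cluster exposes Ambari is an assumption, so check the tutorial for the real endpoint. The only part taken from Ambari itself is the `/api/v1/clusters` resource.

```python
import base64
import urllib.request


def build_ambari_request(base_url, username, password):
    """Build (but do not send) a GET request for Ambari's cluster list."""
    # base_url is a placeholder: wherever your cluster exposes Ambari.
    url = base_url.rstrip("/") + "/api/v1/clusters"
    # HTTP Basic authentication: base64 of "user:password".
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical values, for illustration only.
req = build_ambari_request("https://mycluster.example.net/ambari", "admin", "secret")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it and return a JSON document that you can parse with the standard `json` module.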

2. Use Sqoop with HDInsight



  • Analogous to BCP: If you are familiar with SQL Server BCP.exe, then Sqoop will be an easy tool for you to learn for Hadoop and HDInsight.


  • What is Sqoop? Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. http://sqoop.apache.org/

    See our blog posts on Sqoop as well
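
If you are used to assembling BCP command lines, a Sqoop export looks similar in spirit. Below is a rough Python sketch that builds the argument list for a `sqoop export` run without executing it. The JDBC URL, table name, directory, and user name are invented placeholders, and the way you actually submit the command to an HDInsight cluster is covered in the tutorial, not here.

```python
import shlex


def build_sqoop_export(jdbc_url, table, export_dir, username, mappers=1):
    """Return the argument list for a `sqoop export` run (not executed here)."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,        # JDBC connection string of the target database
        "--username", username,
        "-P",                         # prompt for the password instead of embedding it
        "--table", table,             # destination table, which must already exist
        "--export-dir", export_dir,   # directory in Hadoop storage that holds the data
        "--num-mappers", str(mappers),
    ]


# Hypothetical values, for illustration only.
args = build_sqoop_export(
    "jdbc:sqlserver://myserver.example.net:1433;database=mydb",
    "AvgDelays",
    "/example/output",
    "sqluser",
)
print(shlex.join(args))
```

Keeping the arguments in a list rather than one long string avoids quoting problems if you later hand them to `subprocess.run`.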

3. Analyze Twitter data with HDInsight



  • In this tutorial, you will connect to the Twitter web service to get some Tweets using the Twitter streaming API, and then you will use Hive to get a list of the Twitter users who sent the most Tweets containing a certain word.
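
To give a feel for what the Hive query in that tutorial computes, here is the same idea in plain Python on a tiny in-memory list. This is not the tutorial's code. The field names `user` and `text` and the sample Tweets are made up for illustration.

```python
from collections import Counter


def top_tweeters(tweets, word, limit=10):
    """Count Tweets per user whose text contains `word`, busiest users first."""
    needle = word.lower()
    counts = Counter(
        tweet["user"] for tweet in tweets if needle in tweet["text"].lower()
    )
    return counts.most_common(limit)


# Made-up sample data, for illustration only.
sample = [
    {"user": "alice", "text": "Learning Azure today"},
    {"user": "bob", "text": "Hadoop on Azure"},
    {"user": "alice", "text": "azure again"},
    {"user": "carol", "text": "nothing relevant here"},
]
print(top_tweeters(sample, "azure"))  # [('alice', 2), ('bob', 1)]
```

In Hive the same result would come from a filter, a group by on the user, and a descending sort on the count, with the cluster doing the work in parallel.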



By the way, shameless plug! Follow us on Twitter as @MSBigDataSupp at https://twitter.com/MSBigDataSupp

4. Analyze flight delay data with HDInsight



  • This tutorial shows you how to use Hive to calculate average delays among airports, and how to use Sqoop to export the results to SQL Database.
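
The calculation itself is a simple grouped average. As a rough illustration only (the airport codes and delay values below are invented, and the tutorial does this in Hive rather than Python):

```python
from collections import defaultdict


def average_delay_by_airport(flights):
    """flights is an iterable of (airport_code, delay_minutes) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # airport -> [sum of delays, flight count]
    for airport, delay_minutes in flights:
        totals[airport][0] += delay_minutes
        totals[airport][1] += 1
    return {airport: total / count for airport, (total, count) in totals.items()}


# Made-up sample data, for illustration only.
print(average_delay_by_airport([("SEA", 10), ("SEA", 20), ("ORD", 30)]))
# {'SEA': 15.0, 'ORD': 30.0}
```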

5. Use Oozie with HDInsight



  • Analogous to SQL Agent: If you are familiar with SQL Server Agent for job scheduling, and SQL Server Integration Services (SSIS) Control flow tasks, Oozie might be a good tool for you to try on Hadoop and HDInsight.


  • What is Oozie? Apache Oozie (TM) is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and extensible system. http://oozie.apache.org/
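
The "directed acyclic graph" idea is worth a quick illustration: each action runs only after the actions it depends on, and no chain of dependencies may loop back on itself. The Python sketch below (it needs Python 3.9 or later for `graphlib`) orders a made-up three-step workflow and rejects a cyclic one. It is not how Oozie is implemented, and the action names are hypothetical. Real Oozie workflows are defined in XML.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return a valid run order for a workflow.

    `dependencies` maps each action to the set of actions it must wait for.
    Raises CycleError if the actions do not form a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical workflow: load data, aggregate with Hive, export with Sqoop.
workflow = {
    "load-data": set(),
    "hive-aggregate": {"load-data"},
    "sqoop-export": {"hive-aggregate"},
}
print(execution_order(workflow))  # ['load-data', 'hive-aggregate', 'sqoop-export']

try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("not a valid workflow: the actions form a cycle")
```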