Windows Azure HDInsight provides the capability to dynamically provision clusters running Apache Hadoop to process Big Data. You can find more information in the initial blog post for this series, and you can click here to get started with HDInsight in the Windows Azure portal. This post enumerates the different ways a developer can interact with HDInsight, first by discussing the key scenarios and then diving into the variety of capabilities HDInsight offers. Because HDInsight is built on top of Apache Hadoop, there is a broad and rich ecosystem of tools and capabilities that you can leverage.
In working with customers, we've seen two distinct scenarios: authoring jobs, where you use the tool to process big data, and integrating HDInsight with your application, where the input and output of jobs are incorporated into a larger application architecture. One key design aspect of HDInsight is its integration with Windows Azure Blob Storage as the default file system. This means that you can use existing tools and APIs for blob storage to interact with your data. This blog post goes into more detail on our use of Blob Storage.
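Because Blob Storage is the default file system, job inputs and outputs are simply blob URIs. The sketch below, in Python, shows the shape of such a URI; the `asv://` scheme reflects the current preview, and the account, container, and path names are hypothetical:

```python
def blob_path(container, account, path):
    """Build an HDInsight file-system URI that points at Azure Blob Storage.

    Because blobs are the default file system, a job can read and write
    the same data that ordinary blob-storage tools and APIs see.
    """
    return "asv://{0}@{1}.blob.core.windows.net/{2}".format(
        container, account, path.lstrip("/"))

# Hypothetical input location for a job:
input_uri = blob_path("data", "mystorageaccount", "logs/2013/01/input.txt")
```

Any tool that can upload a blob into that container has, in effect, staged input for a Hadoop job.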
As HDInsight delivers Apache Hadoop via the Hortonworks Data Platform, there is a high degree of fidelity with the Hadoop ecosystem, and many capabilities work “as-is”: investments and knowledge in any of the following tools carry over directly to HDInsight. Clusters are created with the following Apache projects for distributed processing:
You can find an updated list of Hadoop components here. The table below lists the versions included in the current preview:
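As a concrete example of a job that runs unchanged on any Hadoop distribution, here is a minimal word count written for Hadoop Streaming, which ships with Hadoop and lets you write mappers and reducers in any language that reads stdin and writes stdout. This is a sketch only; the function names are ours, and under Streaming the mapper and reducer would normally run as separate scripts:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit a (word, 1) pair for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Sum the counts for each word. Pairs must arrive sorted by word,
    which is what Hadoop's shuffle-and-sort phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Run directly, this simulates the whole map -> sort -> reduce
    # pipeline locally; under Hadoop Streaming, the mapper and reducer
    # run as separate processes connected by the framework.
    for word, count in reducer(sorted(mapper(sys.stdin))):
        sys.stdout.write("%s\t%d\n" % (word, count))
```

The same script can be tested locally against a small file and then submitted, unmodified, as a Streaming job against data in Blob Storage.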
Additionally, other projects in the Hadoop ecosystem, such as Mahout (see this sample) or Cascading, can easily be used on top of HDInsight. We will be publishing additional blog posts on these topics in the future.
We're working to build out a portfolio of tools that let developers apply their .NET skills and investments to Hadoop. These projects are hosted on CodePlex, with packages available from NuGet for authoring jobs to run on HDInsight. For instructions, please see the getting-started pages on the CodePlex site.
To run any of these jobs, you have a few options:
To provide a simple surface for client applications to integrate with, we've worked to ensure that all capabilities of a cluster are exposed through a set of secured REST APIs.
We currently provide .NET clients for these APIs, available here, and you can just as easily build clients using the HTTP stacks of other languages.
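For example, a client in any language can submit a Hive query over this REST surface. The sketch below, in Python, only assembles the pieces of such a request; the WebHCat-style endpoint path, cluster name, and credentials are assumptions for illustration, not a documented contract:

```python
import base64

def build_hive_request(cluster, user, password, query):
    """Assemble the URL, headers, and body of an HTTPS request that
    submits a Hive query to a cluster's REST endpoint (WebHCat-style
    path; hypothetical). Any HTTP stack can then send the request."""
    url = "https://%s.azurehdinsight.net/templeton/v1/hive" % cluster
    credentials = base64.b64encode(
        ("%s:%s" % (user, password)).encode()).decode()
    headers = {"Authorization": "Basic " + credentials}
    body = {"execute": query, "user.name": user}
    return url, headers, body

# Hypothetical cluster and credentials:
url, headers, body = build_hive_request(
    "mycluster", "admin", "secret",
    "SELECT COUNT(*) FROM hivesampletable")
```

Because the surface is plain HTTPS with standard authentication, the same request can be issued from curl, a browser-based app, or any language runtime.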
By leveraging the ODBC client (instructions here), you can easily integrate existing applications, such as Excel, with data stored in Hive tables in HDInsight.
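Programmatic access follows the same pattern as Excel's: connect through a configured ODBC data source and issue HiveQL through a normal cursor. A minimal sketch in Python, assuming a DSN has been set up per the instructions; the DSN name, credentials, and table are hypothetical:

```python
def hive_dsn_connection_string(dsn, user, password):
    """Build an ODBC connection string for a preconfigured Hive DSN.
    With the Hive ODBC driver installed, this string can be handed to
    any ODBC-capable library or tool."""
    return "DSN=%s;UID=%s;PWD=%s" % (dsn, user, password)

conn_str = hive_dsn_connection_string("HDInsightHive", "admin", "secret")
# With a library such as pyodbc installed (not shown here), usage
# would look like:
#   cursor = pyodbc.connect(conn_str, autocommit=True).cursor()
#   cursor.execute("SELECT * FROM hivesampletable LIMIT 10")
```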
To support working disconnected from a cluster running in Azure, we provide the HDInsight Developer Preview, a one-box setup that is easily installed from the Web Platform Installer. You can use it to experiment, debug, and test all of the technologies above on a smaller set of data, and then deploy the artifacts to Azure to run against your big data in Blob Storage. To install it, simply search for HDInsight inside the Web Platform Installer, or click here to install directly from the web.
The final post in our 5-part series on HDInsight will explore how to analyze data from HDInsight with Excel. Stay tuned!