HDInsight Services for Windows Azure is a service that deploys and provisions Apache™ Hadoop™ clusters in the cloud, providing a software framework designed to manage, analyze and report on big data.

Data is described as "big data" to indicate that it is being collected in ever escalating volumes, at increasingly high velocities, and for a widening variety of unstructured formats and variable semantic contexts.

Big data collection does not provide value to an enterprise on its own. For big data to provide value in the form of actionable intelligence or insight, it must be accessible, cleaned, analyzed, and then presented in a useful way, often in combination with data from various other sources.

Apache Hadoop is a software framework that facilitates big data management and analysis. Apache Hadoop core provides reliable data storage with the Hadoop Distributed File System (HDFS), and a simple MapReduce programming model to process and analyze in parallel the data stored in this distributed system. HDFS uses data replication to address hardware failure issues that arise when deploying such highly distributed systems.

To simplify the complexities of analyzing unstructured data from various sources, the MapReduce programming model provides a core abstraction that provides closure for map and reduce operations. The MapReduce programming model views all of its jobs as computations over key-value pair datasets. So both input and output files must contain such key-value pair datasets.

Other Hadoop-related projects such as Pig and Hive are built on top of HDFS and the MapReduce framework, providing higher abstraction levels such as data flow control and querying, as well as additional functionality such as warehousing and mining, required to integrate big data analysis and end-to-end management.

HDInsight Services for Windows Azure makes Apache Hadoop avaliable as a service in the cloud. It makes the HDFS/MapReduce software framework and related projects available in a simpler, more scalable, and cost efficient environment.

The main scenarios for using Hadoop on Windows Azure

Big data: volume, velocity, variety, and variability

The Hadoop big data solution is a response to two divergent trends. On the one hand, because the capacity of hard drives has continued to increase dramatically over the last 20 years, vast amounts of new data generated by web sites and by new device and instrumentation generations connected to the Internet, can be stored. In addition, there is automated tracking of everyone's online behavior. On the other hand, data access speeds on these larger capacity drives have not kept pace, so reading from and writing to very large disks is too slow.

The solution for this network bandwidth bottleneck has two principal features. First, HDFS provides a type of distributed architecture that stores data on multiple disks with enabled parallel disk reading. Second, move any data processing computational requirements to the data-storing node, enabling access to the data as local as possible. The enhanced MapReduce performance depends on this design principle known as data locality. The idea saves bandwidth by moving programs to the data, rather than data to programs, resulting in the MapReduce programming model scaling linearly with the data set size. For an increase in the cluster size proportionately with the data processed volume, the job executes in more or less the same amount of time.

The rate at which data is becoming available to organizations has followed a trend very similar to the previously described escalating volume of data, and is being driven by increased ecommerce clickstream consumer behavior logging and by data associated social networking such as Facebook and Twitter. Smartphones and tablets device proliferation has dramatically increased the online data generation rate. Online gaming ,scientific health instrumentation are also generating streams of data at velocities with which traditional RDBMS are not able to cope. Insuring a competitive advantage in commercial and gaming activities requires quick responses as well as quick data analysis results. These high velocity data streams with tight feedback loops require a NoSQL approach like Hadoop's optimized for fast storage and retrieval.

Most generated data is messy. Diverse data sources do not provide a static structure enabling traditional RDBMS timely management. Social networking data, for example, is typically text-based taking a wide variety of forms that may not remain fixed over time. Data from images and sensors feeds present similar challenges. This sort of unstructured data requires a flexible NoSQL system like Hadoop that enables providing sufficient structure to incoming data, storing it without requiring an exact schema. Cleaning up unstructured data is a significant processing part required to prepare unstructured data for use in an application. To make clean high-quality data more readily available, data marketplaces are competing and specializing in providing this service.

Larger issues in the interpretation of big data can also arise. The term variability when applied to big data tends to refer specifically to the wide possible variance in meaning that can be encountered. Finding the most appropriate semantic context within which to interpret unstructured data can introduce significant complexities into the analysis.

So what is HDInsight?

HDInsight is a service from Microsoft that brings an Apache Hadoop-based solution to the cloud. HDInsight is a cloud-based data platform that manages data whether structured or unstructured, and of any size, HDInsight makes it possible for you to gain the full value of big data.

HDInsight offers the following benefits:

• Insights with familiar tools: Through deep integration with Microsoft BI tools such as PowerPivot and Power View, HDInsight enables you to easily analyze data for insights. Seamlessly combine data from several sources, including HDInsight, with Microsoft Power Query for Excel. Easily map your data with new Power Map, a three-dimensional mapping Excel 2013 add-in.

• Enterprise-ready capabilities: HDInsight offers enterprise-class security, scalability, and manageability. Thanks to a dedicated secure node, HDInsight helps you secure your cluster. You can also take full advantage of the elastic scalability of Windows Azure. In addition, we simplify manageability of your cluster through extensive support for Windows PowerShell scripting.

• Rich developer experience: HDInsight offers powerful programming capabilities with a choice of technologies including Microsoft .NET, Java and other languages. .NET developers can exploit the full power of Language-Integrated Query with LINQ to Hive. And database developers can use existing skills to query and transform data through Hive.

To get started with HDInsight:

  1.   Visit the Windows Azure Management Portal.

  2.   In the bottom left, click New.

  3.   Under Data Services, click HDInsight.

  4.   Click Quick Create, and then follow the quick-start steps.


HDInsight pricing consists of two components: a head node and one or more compute nodes.

Head node – The head node is available in the Extra Large instance size and will continue to be billed at the preview price of £0.2037 per hour per cluster through November 30, 2013. On December 1, 2013, the new price will be £0.4073 per hour per cluster.

Compute node – Compute nodes are available in the Large instance size and will continue to be billed at the preview price of £0.1019 per hour per instance deployed through November 30, 2013. On December 1, 2013, the new price will be £0.2037 per hour per instance deployed.

For additional pricing information, please see the HDInsight Pricing Details webpage. For more information about the service, visit the HDInsight Service webpage.