Familiar. Collaborative. Managed.
With Strata less than a week away, we are kicking off a Big Data blogging series that will highlight Microsoft’s Big Data technologies. Today’s guest post is by Torsten Grabs (torsteng(at)microsoft(dot)com), Principal Program Manager Lead for StreamInsight™ at Microsoft.
At the PASS Summit last October Microsoft announced its roadmap and approach to Big Data. Our goal is to provide insights to all users from all their structured and unstructured data of any size. We will achieve this by delivering a comprehensive Big Data solution that includes an Enterprise-ready Hadoop-based distribution for Windows Server and Windows Azure, insights for everyone through the use of familiar tools, connection to the world’s data for better discovery and collaboration, and an open and flexible platform with full compatibility with Apache Hadoop. Since PASS, I am frequently asked about the interplay between Hadoop, MapReduce, and StreamInsight™. The first question that I typically ask in return is “Why are you asking?”. Well, as it turns out there are many reasons why you might wish to look into a combination of these technologies. Just looking into any of the definitions used to explain Big Data these days is insightful to answer the question. For our discussion here, let’s pick the one proposed by Gartner, but other definitions will lead to a similar result. Gartner characterizes Big Data as challenges in three distinct areas, characterized by the 3Vs:
- Volume: The volume dimension describes the challenges an organization faces because of the large and increasing amounts of data that need to be stored or analyzed.
- Velocity: The velocity dimension captures the speed at which the data needs to be processed and analyzed so that results are available in a timely manner for an organization to act on the data.
- Variety: The variety dimension finally looks at the different kinds of data that need to be processed and analyzed, ranging from tabular data in relational databases to multimedia content like text, audio or video.
For our discussion here, the first two dimensions, i.e., volume and velocity, are the most interesting dimensions. To cover these dimensions in a Big Data solution, an intuitive approach will lead you to a picture like the one illustrated in Figure 1. It shows the dimensions volume and velocity and overlays it with technologies such as MapReduce or Event Processing.
Figure 1: Covering Gartner's Volume and Velocity dimension
MapReduce technologies such as Hadoop are well-suited to plow through large volumes of data quickly and to parallelize the analysis using MapReduce. The map phase splits the input into different partitions and each partition is processed concurrently before the results are collected by the reducer(s). With vanilla MapReduce, the reducers run independently of each other and can be distributed across different machines. Depending on the size of the data and the speed at which you need the processing done, you can adjust the number of partitions and correspondingly the number of machines that you throw at the problem. This is a great way to process huge data volumes and scale out the processing while at the same time reducing overall end-to-end processing time for a batch of input data. That makes MapReduce a great fit to address the volume dimension of Big Data. While Hadoop is great at batch processing, it is not suitable for analyzing streaming data. Let’s now take a look at the velocity dimension.
Event-processing technologies like Microsoft StreamInsight™ in turn are a good fit to address challenges from the velocity dimension in Big Data. The obvious benefit with StreamInsight™ is that the processing is performed in an event-driven fashion. That means that processing of the next result is triggered as soon as a new event arrives. If you render your input data as a stream of events you can process these events in a non-batched way, i.e., event by event. If data is continuously arriving in your system, this gives you the ability to react quickly to each incoming event. This is the best way to drive the processing if you need to provide continuous analytics processing over a continuous data feed.
Let’s now take a look at the area in Figure 1 where the two technologies intersect, indicated with a question mark in the figure.
A particularly powerful combination of technologies is when MapReduce is used to run scaled out reducers performing complex event processing over the data partitions. Figure 2 illustrates this scenario using Hadoop for MapReduce and StreamInsight™ for complex event processing. Note how the pace of the overall processing is governed by the batches being fed into Hadoop. Each input batch results into a set of output batches where each output batch corresponds to the result produced by one of the reducers. The reducers are now performing complex event processing in parallel over the data from the input batch. While this does not allow for the velocity we get with purely event-driven processing scheme, it does allow us to greatly simplify the coding needed for the reducers to perform complex event processing.
Figure 2: Combining Complex Event Processing with Map Reduce
As it turns out, a lot of scenarios in Big Data can benefit from complex event processing in the reduction phase of MapReduce. For instance, consider all the pre-processing that needs to happen before you can send data into a machine learning algorithm. You may wish to augment time-stamped raw input data with aggregate information like the average value observed in an input field over a week-long moving average window. Another example is to work with trends or slopes and their changes over time as input data to the mining or learning phase. Again, these trends and slopes need to be calculated from the raw data where the built-in temporal processing capabilities of complex event processing would help.
Example: Consider a log of sensor data that we wish to use to predict equipment failures. To prepare the data, we need to augment the sensor values for each week with the average sensor value for that week. To calculate the averages, weeks are from Monday through Sunday and averages are calculated per piece of equipment. Data preparation steps like these are oftentimes required when preparing your data for use in regression. The following table shows some sample input data:
The output with the weekly averages then looks as follows:
The weekly averages in the previous example illustrate how the temporal processing capabilities in complex event processing systems such as StreamInsight™ become crucial steps in the processing pipeline for Big Data applications.
Another prominent example is the detection of complex events like failure conditions or patterns in the input data across input streams or over time. Again, most complex event processing systems today provide powerful capabilities to declaratively express these conditions and patterns, to check for them in the input data, and to compose complex events in the output when a pattern is matched or a condition has been observed. Similar processing has applications in supervised learning where one needs to prepare training data with the classification results, the failure occurrences or derived values that are later used for prediction when executing the model in production. Again, complex event processing systems help here: one can express the processing in a declarative way instead of writing lots of procedural code in reducers.
Example: Consider the following input data.
Given this input data, we want to find complex events that satisfy all of the following conditions:
- An event of kind ‘A’ is followed by an event of kind ‘B;
- Events ‘A’ and ‘B’ occur within a week (i.e. 7 days) from each other;
- Events ‘A’ and ‘B’ occur on the same piece of equipment;
- The payload of ‘A’ is at least 10% larger than the one of ‘B’.
The original events are not re-produced in the output. The output only has a complex event for each occurrence of the pattern in the input. The complex event carries the timestamp of the input event ‘B’ completing the pattern and it carries both payloads. With the example input data from above, note how the second ‘A’ – ‘B’ pair in the input does not qualify since it does not satisfy the predicate over the payloads. Given the above input, we get the following output:
To perform complex event processing over large data volumes for the scenarios and examples outlined in the previous paragraphs, the use of MapReduce is compelling as it allows parallelizing the processing and reducing the amount of time spent waiting for the result. The main requirement is that the input data and calculation can be partitioned so that there are no dependencies between different reducers. In the previous examples, a simple way to do the partitioning is to split by equipment ID. In these examples, all processing is done on a per equipment basis. Similar properties can be found in many Big Data processing scenarios. If the partitioning condition holds one can instantiate the picture from Figure 2 and rely on familiar Hadoop tools to scale the processing and to manage your cluster while expressing the logic of the reducer step in a declarative language. On the Microsoft stack, the combination of Microsoft StreamInsight™ with Microsoft’s Hadoop-based Services for Windows makes it easy to follow this approach.
Microsoft’s software stack provides compelling benefits for Big Data processing along both the volume and velocity dimension of Big Data:
- For batch-oriented processing, Microsoft’s Hadoop-based services provide the MapReduce technology for scale-out across many machines in order to quickly process large volumes of data.
- For event-driven processing, Microsoft StreamInsight™ provides the capabilities to perform rich and expressive analysis over large amounts of continuously arriving data in a highly efficient incremental way.
- For complex event processing at scale, Microsoft StreamInsight™ can run in the reducer step of Hadoop MapReduce jobs on Windows Server clusters to detect patterns or calculate trends over time.
We’re thrilled to be a top-tier Elite sponsor of the upcoming Strata Conference on February 28th to March 1st 2012 held in Santa Clara California. Learn more about our presence and sessions at Strata as well as our Big Data Solution here.
Following Strata, on March 7th, 2012 we will be hosting an online event that will allow you to immerse yourself in the exciting New World of Data with SQL Server 2012. Learn more here.
Finally, for more information on Microsoft Big Data go to http://www.microsoft.com/bigdata.