LinkedIn | FaceBook | Twitter
(As with all of these types of posts, check the date of the latest update I’ve made here. Anything older than 6 months is probably out of date, given the speed with which we release new features into Windows and SQL Azure)
I don’t normally like to discuss things in terms of tools. I find that whenever you start with a given tool (or even a tool stack) it’s too easy to fit the problem to the tool(s), rather than the other way around as it should be.
That being said, it’s often useful to have an example to work through to better understand a concept. But like many ideas in Computer Science, “Big Data” is too broad a term in use to show a single example that brings out the multiple processes, use-cases and patterns you can use it for.
So we turn to a description of the tools you can use to analyze large data sets. “Big Data” is a term used lately to describe data sets that have the “Four V’s” as a characteristic, but I have a simpler definition I like to use:
Big Data involves a data set too large to process in a reasonable period of time
I realize that’s a bit broad, but in my mind it answers the question and is fairly future-proof. The general idea is that you want to analyze some data, and using whatever current methods, storage, compute and so on that you have at hand it doesn’t allow you to finish processing it in a time period that you are comfortable with. I’ll explain some new tools you can use for this processing.
Yes, this post is Microsoft-centric. There are probably posts from other vendors and open-source that cover this process in the way they best see fit. And of course you can always “mix and match”, meaning using Microsoft for one or more parts of the process and other vendors or open-source for another. I never advise that you use any one vendor blindly - educate yourself, examine the facts, perform some tests and choose whatever mix of technologies best solves your problem.
At the risk of being vendor-specific, and probably incomplete, I use the following short list of tools Microsoft has for working with “Big Data”. There is no single package that performs all phases of analysis. These tools are what I use; they should not be taken as a Microsoft authoritative testament to the toolset we’ll finalize for a given problem-space. In fact, that’s the key: find the problem and then fit the tools to that.
I break up the analysis of the data into two process types. The first is examining and processing the data in-line, meaning as the data passes through some process. The second is a store-analyze-present process.
Processing data in-line means that the data doesn’t have a destination - it remains in the source system. But as it moves from an input or is routed to storage within the source system, various methods are available to examine the data as it passes, and either trigger some action or create some analysis.
You might not think of this as “Big Data”, but in fact it can be. Organizations have huge amounts of data stored in multiple systems. Many times the data from these systems do not end up in a database for evaluation. There are options, however, to evaluate that data real-time and either act on the data or perhaps copy or stream it to another process for evaluation.
The advantage of an in-stream data analysis is that you don’t necessarily have to store the data again to work with it. That’s also a disadvantage - depending on how you architect the solution, you might not retain a historical record. One method of dealing with this requirement is to trigger a rollup collection or a more detailed collection based on the event.
StreamInsight - StreamInsight is Microsoft’s “Complex Event Processing” or CEP engine. This product, hooked into SQL Server 2008R2, has multiple ways of interacting with a data flow. You can create adapters to talk with systems, and then examine the data mid-stream and create triggers to do something with it. You can read more about StreamInsight here: http://msdn.microsoft.com/en-us/library/ee391416(v=sql.110).aspx
BizTalk - When there is more latency available between the initiation of the data and its processing, you can use Microsoft BizTalk. This is a message-passing and Service Bus oriented tool, and it can also be used to join system’s data together than normally does not have a direct link, for instance a Mainframe system to SQL Server. You can learn more about BizTalk here: http://www.microsoft.com/biztalk/en/us/overview.aspx
.NET and the Windows Azure Service Bus - Along the same lines as BizTalk but with a more programming-oriented design are the Windows and Windows Azure Service Bus tools. The Service Bus allows you to pass messages as well, and opens up web interactions and even inter-company routing. BizTalk can do this as well, but the Service Bus tools use an API approach for designing the flow and interfaces you want. The Service Bus offerings are also intended as near real-time, not as a streaming interface. You can learn more about the Windows Azure Service Bus here: http://www.windowsazure.com/en-us/home/tour/service-bus/ and more about the Event Processing side here: http://msdn.microsoft.com/en-us/magazine/dd569756.aspx
A more traditional approach with an organization’s data is to store the data and analyze it out-of-band. This began with simply running code over a data store, but as locking and blocking became an issue on a file system, Relational Database Management Systems (RDBMs) were created. Over time a distinction was made between data used in an online processing system, meant to be highly available for writing data (OLTP) and systems designed for analytical and reporting purposes (OLAP).
Later the data grew larger than these systems were designed for, primarily due to consistency requirements. In analysis, however, consistency isn’t always a requirement, and so file-based systems for that analysis were re-introduced from the Mainframe concepts, with new technology layered in for speed and size.
I normally break up the process of analyzing large data sets into four phases:
You’ll notice in this area, along with those that follow, Microsoft is adopting not only its own technologies but those within open-source. This is a positive sign, and means that you will have a best-of-breed, supported set of tools to move the data from one location to another. Traditional file-copy, File Transfer Protocol and more are certainly options, but do not normally deal with moving datasets.
I’ve already mentioned the ability of a streaming tool to push data into a store-analyze-present model, so I’ll follow up that discussion with the tools that can extract data from one source and place it in another.
SQL Server Integration Services (SSIS)/SQL Server Bulk Copy Program (BCP) - SSIS is a SQL Server tool used to move data from one location to another, and optionally perform transform or other processes as it does so. You are not limited to working with SQL Server data - in fact, almost any modern source of data from text to various database platforms is available to move to various systems. It is also extremely fast and has a rich development environment. You can learn more about SSIS here: http://msdn.microsoft.com/en-us/library/ms141026.aspx BCP is a tool that has been used with SQL Server data since the first releases; it has multiple sources and destinations as well. It is a command-line utility,and has some limited transform capabilities. You can learn more about BCP here: http://msdn.microsoft.com/en-us/library/ms162802.aspx
Sqoop - Tied to Microsoft’s latest announcements with Hadoop on Windows and Windows Azure, Sqoop is a tool that is used to move data between SQL Server 2008R2 (and higher) and Hadoop, quickly and efficiently. You can read more about that in the Readme file here: http://www.microsoft.com/download/en/details.aspx?id=27584
Application Programming Interfaces - API’s exist in most every major language that can connect to one data source, access data, optionally transforming it and storing it in another system. Most every dialect of the .NET-based languages contain methods to perform this task.
Data at rest is normally used for historical analysis. In some cases this analysis is performed near real-time, and in others historical data is analyzed periodically. Systems that handle data at rest range from simple storage to active management engines.
SQL Server - Microsoft’s flagship RDBMS can indeed store massive amounts of complex data. I am familiar with a two systems in excess of 300 Terabytes of federated data, and the Pan-Starrs project is designed to handle 1+ Petabyte of data. The theoretical limit of SQL Server DataCenter edition is 540 Petabytes. SQL Server is an engine, so the data access and storage is handled in an abstract layer that also handles concurrency for ACID properties. You can learn more about SQL Server here: http://www.microsoft.com/sqlserver/en/us/product-info/compare.aspx
SQL Azure Federations - SQL Azure is a database service from Microsoft associated with the Windows Azure platform. Database Servers are multi-tenant, but are shared across a “fabric” that moves active databases for redundancy and performance. Copies of all databases are kept triple-redundant with a consistent commitment model. Databases are (at this writing - check http://WindowsAzure.com for the latest) capped at a 150 GB size limit per database. However, Microsoft released a “Federation” technology, allowing you to query a head node and have the data federated out to multiple databases. This improves both size and performance. You can read more about SQL Azure Federations here: http://social.technet.microsoft.com/wiki/contents/articles/2281.federations-building-scalable-elastic-and-multi-tenant-database-solutions-with-sql-azure.aspx
Analysis Services - The Business Intelligence engine within SQL Server, called Analysis Services, can also handle extremely large data systems. In addition to traditional BI data store layouts (ROLAP, MOLAP and HOLAP), the latest version of SQL Server introduces the Vertipaq column-storage technology allowing more direct access to data and a different level of compression. You can read more about Analysis Services here: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/analysis-services.aspx and more about Vertipaq here: http://msdn.microsoft.com/en-us/library/hh212945(v=SQL.110).aspx
Parallel Data Warehouse - The Parallel Data Warehouse (PDW) offering from Microsoft is largely described by the title. Accessed in multiple ways including using Transact-SQL (the Microsoft dialect of the Structured Query Language), This is an MPP appliance scaling in parallel to extremely large datasets. It is a hardware and software offering - you can learn more about it here: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/pdw.aspx
HPC Server - Microsoft’s High-Performance Computing version of Windows Server deals not only with large data sets, but with extremely complicated computing requirements. A scale-out architecture and inter-operation with Linux systems, as well as dozens of applications pre-written to work with this server make this a capable “Big Data” system. It is a mature offering, with a long track record of success in scientific, financial and other areas of data processing. It is available both on premises and in Windows Azure, and also in a hybrid of both models, allowing you to “rent” a super-computer when needed. You can read more about it here: http://www.microsoft.com/hpc/en/us/product/cluster-computing.aspx
Hadoop - Pairing up with Hortonworks, Microsoft has released the Hadoop Open-Source system - including HDFS and a Map/Reduce standardized software, Hive and Pig - on Windows and the Windows Azure platform. This is not a customized version; off-the-shelf concepts and queries work well here. You can read more about Hadoop here: http://hadoop.apache.org/common/docs/current/ and you can read more about Microsoft’s offerings here: http://hortonworks.com/partners/microsoft/ and here: http://social.technet.microsoft.com/wiki/contents/articles/6204.hadoop-based-services-for-windows.aspx
Windows and Azure Storage - Although not an engine - other than a triple-redundant, immediately consistent commit - Windows Azure can hold terabytes of information and make it available to everything from the R programming language to the Hadoop offering. Binary storage (Blobs) and Table storage (Key-Value Pair) data can be queried across a distributed environment. You can learn more about Windows Azure storage here: http://msdn.microsoft.com/en-us/library/windowsazure/gg433040.aspx
In a “Big Data” environment, it’s not unusual to have a specialized set of tasks for analyzing and even interpreting the data. This is a new field called “data Science”, with a requirement not only for computing, but also a heavy emphasis on math.
Transact-SQL - T-SQL is the dialect of the Structured Query Language used by Microsoft. It includes not only robust selection, updating and manipulating of data, but also analytical and domain-level interrogation as well. It can be used on SQL Server, PDW and ODBC data sources. You can read more about T-SQL here: http://msdn.microsoft.com/en-us/library/bb510741.aspx
Multidimensional Expressions and Data Analysis Expressions - The MDX and DAX languages allow you to query multidimensional data models that do not fit well with typical two-plane query languages. Pivots, aggregations and more are available within these constructs to query and work with data in Analysis Services. You can read more about MDX here: http://msdn.microsoft.com/en-us/library/ms145506(v=sql.110).aspx and more about DAX here: http://www.microsoft.com/download/en/details.aspx?id=28572
HPC Jobs and Tasks - Work submitted to the Windows HPC Server has a particular job - essentially a reservation request for resources. Within a job you can submit tasks, such as parametric sweeps and more. You can learn more about Jobs and Tasks here: http://technet.microsoft.com/en-us/library/cc719020(v=ws.10).aspx
HiveQL - HiveQL is the language used to query a Hive object running on Hadoop. You can see a tutorial on that process here: http://social.technet.microsoft.com/wiki/contents/articles/6628.aspx
Piglatin - Piglatin is the submission language for the Pig implementation on Hadoop. An example of that process is here: http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/10/running-apache-pig-pig-latin-at-apache-hadoop-on-windows-azure.aspx
Application Programming Interfaces - Almost all of the analysis offerings have associated API’s - of special note is Microsoft Research’s Infer.NET, a new language construct for framework for running Bayesian inference in graphical models, as well as probabilistic programming. You can read more about Infer.NET here: http://research.microsoft.com/en-us/um/cambridge/projects/infernet/
Lots of tools work in presenting the data once you have done the primary analysis. In fact, there’s a great video of a comparison of various tools here: http://msbiacademy.com/Lesson.aspx?id=73 Primarily focused on Business Intelligence. That term itself is now not as completely defined, but the tools I’ll show below can be used in multiple ways - not just traditional Business Intelligence scenarios. Application Programming Interfaces (API’s) can also be used for presentation; but I’ll focus here on “out of the box” tools.
Excel - Microsoft’s Excel can be used not only for single-desk analysis of data sets, but with larger datasets as well. It has interfaces into SQL Server, Analysis Services, can be connected to the PDW, and is a first-class job submission system for the Windows HPC Server. You can watch a video about Excel and big data here: http://www.microsoft.com/en-us/showcase/details.aspx?uuid=e20b7482-11c9-4965-b8f0-7fb6ac7a769f and you can also connect Excel to Hadoop: http://social.technet.microsoft.com/wiki/contents/articles/how-to-connect-excel-to-hadoop-on-azure-via-hiveodbc.aspx
Reporting Services - Reporting Services is a SQL Server tool that can query and show data from multiple sources, all at once. It can also be used with Analysis Services. You can read more about Reporting Services here: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/reporting-services.aspx
Power View - Power View is a “Self-Service” Business Intelligence reporting tool, which can work with on-premises data in addition to SQL Azure and other data. You can read more about it and see videos of Power View in action here: http://www.microsoft.com/sqlserver/en/us/future-editions/business-intelligence/SQL-Server-2012-reporting-services.aspx
SharePoint Services - Microsoft has rolled several capable tools in SharePoint as “Services”. This has the advantage of being able to integrate into the working environment of many companies. You can read more about lots of these reporting and analytic presentation tools here: http://technet.microsoft.com/en-us/sharepoint/ee692578
This is by no means an exhaustive list - more capabilities are added all the time to Microsoft’s products, and things will surely shift and merge as time goes on. Expect today’s “Big Data” to be tomorrow’s “Laptop Environment”.
A couple of other tools to consider on the open source analysis side: R for statistical - http://www.r-project.org/ and Mahout for data mining - http://mahout.apache.org/.
Found another tool for analyzing graphs called Pegasus - www.cs.cmu.edu/~pegasus. This tool works on top of Hadoop.