In February 2014, tens of thousands of attendees met at the Strata Conference in Santa Clara, California, to learn about advances in big data. The conference was hosted by Microsoft and dozens of other companies in the big data arena. What I learned, as a participant in the Microsoft booth at Strata, is that Hadoop is ubiquitous in Silicon Valley. Nearly every person I talked to across the retail, manufacturing, startup, government, and high-tech verticals described how they were leveraging (or trying to leverage) Hadoop today to solve business problems.
Before we get into what big data is, however, let’s cover what it is not. Big data is not a panacea; it will not magically make all of your business problems go away. Big data is not a replacement for relational database management systems—at least, not today; maybe in a few years, when query performance radically improves (or new approaches, e.g., Stinger, mature). Big data solutions are not simple; spinning up an HDInsight cluster is a breeze, but then what? By its very definition, you are dealing with vast amounts of data, most of which is unstructured. There is a lot to wade through to find the nuggets of relevant data, and there is no easy way around that.
So what is big data? While there are several answers to that (rhetorical) question, the most common is similar to Wikipedia’s entry: big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
At Microsoft, we classify big data by the “four Vs”: volume, velocity, variety, and variability.
Figure 1 – The Internet of Things
Enter Hadoop

In 2007, the Apache Software Foundation released an open-source platform library known as “Hadoop.” From Apache:
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.”
In 2008, Yahoo, Inc. deployed the first enterprise-scale Hadoop implementation to manage the volume, velocity, variety, and variability of data associated with its web content. By building large-scale clusters running the Hadoop Distributed File System (HDFS) and MapReduce, Yahoo was able to distribute storage and processing horizontally on cost-effective hardware, rather than making massive capital investments in scale-up solutions (supercomputers and SAN). Distribution of the load is key.
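The map/shuffle/reduce flow that Yahoo distributed across thousands of machines can be sketched on a single machine. This toy word count (the function names and sample data are mine, not Hadoop’s) mirrors the three phases the framework runs on separate nodes at scale:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "hadoop handles big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # prints 3
```

In a real cluster, mappers and reducers run on different machines against HDFS blocks, and the shuffle moves data over the network; the logic per phase, however, stays this simple, which is the point of the programming model.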
Hadoop, at its core, is a distributed system with a set of tools and functions for storing, retrieving, managing, and analyzing information. I’ll cover each of these tools and functions in future blog posts; for now, let’s start with the fundamentals.
Microsoft has partnered with Hortonworks, a leader in the big data space, to implement a 100-percent Apache-compliant Hadoop. As of today, there are four methods to build Hadoop solutions in the Microsoft ecosystem:
Figure 2 – Azure Virtual Machine Images
Microsoft differentiates itself in the Hadoop space by supporting any HDFS implementation architecture: in-cloud as platform-as-a-service (PaaS), in-cloud as infrastructure-as-a-service (IaaS), on-premises as a physical cluster, on-premises as a virtual cluster, or any combination thereof. We have designed our tools to natively support these architectures via WebHDFS (REST), the native provider (HDI), or Open Database Connectivity (ODBC)/Java Database Connectivity (JDBC). For movement into and out of the Microsoft SQL Server platform, we have developed a SQL Server Sqoop adapter, which is now built into Apache Sqoop (as of v1.4).
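The WebHDFS (REST) access path mentioned above is plain HTTP against the NameNode. As a hedged sketch (the hostname and path here are hypothetical, and the endpoint shape follows the public WebHDFS v1 REST API), a request URL can be composed like this:

```python
from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=50070, **params):
    # Build a WebHDFS v1 REST URL; 50070 was the default NameNode HTTP
    # port in Hadoop 1.x. Host and path are placeholders for illustration.
    query = urlencode({"op": op, **params})
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, path, query)

# List a directory (the REST equivalent of an HDFS 'ls'):
url = webhdfs_url("namenode.example.com", "/user/pat", "LISTSTATUS")
print(url)
# http://namenode.example.com:50070/webhdfs/v1/user/pat?op=LISTSTATUS
```

Because it is just HTTP, any client stack—PowerShell, .NET, curl, or the Python above—can reach the cluster the same way, which is what makes the mix-and-match architecture story work.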
Stay tuned for future posts on this topic!
Pat Sheehan works at the Microsoft Technology Center (MTC) in Silicon Valley, California, as a Data Platform Architect. Prior to the MTC, Pat was with Microsoft Services as an engineer, architect, and enterprise strategist, specializing in SQL Server, Business Intelligence, and Enterprise/Solutions Architecture. He has a background in application and database development, with a strong emphasis on enterprise-scale solutions.
Great post, Pat!
This expanded a bit on my knowledge of Big Data and MS Hadoop solutions.
I would love to learn more about the specific tools. I keep hearing about MapReduce, Pig, Hive, HDFS, etc. I'm guessing a lot of folks are like me, wondering how to get started "playing with" these technologies to feel them out. I am currently at a 100% Microsoft shop, with the exception of the virtual environment, which runs on VMware. (We are 100% virtualized.)
Also, use cases are of big interest. I currently work for an organization with a small volume and velocity of data. But it is my understanding that there are a lot of free, public big data sources available that can be of value to businesses (e.g., how many mentions my company got on Twitter last month). I would like to know more about what valuable public data is available.
As someone with a thorough understanding of traditional methods of data modeling and ETL, I am interested in enhancing my knowledge and skills around big data and figuring out which tools and concepts I should go after. I am not used to Business Intelligence developers actually writing Java code; perhaps working with Pig or Hive will be something my existing skill set migrates toward more easily. With all of these new technologies emerging, a lot of us with traditional data warehousing skills are trying to figure out where to begin!