MTC Silicon Valley: An Introduction to Microsoft Big Data


In February 2014, tens of thousands of attendees met at the Strata Conference in Santa Clara, California, to learn about advances in big data. The conference was sponsored by Microsoft and dozens of other companies in the big data arena. What I learned, as a participant in the Microsoft booth at Strata, is that Hadoop is ubiquitous in Silicon Valley. Nearly every person I talked to across the retail, manufacturing, startup, government, and high-tech verticals described how they were leveraging (or trying to leverage) Hadoop today to solve business problems.

Before we get into what big data is, however, let’s cover what it is not. Big data is not a panacea; it will not magically make all of your business problems go away. Big data is not a replacement for relational database management systems—at least, not today … maybe in a few years when query performance radically improves (or new methods arise, e.g. Stinger). Big data solutions are not simple; spinning up an HDInsight cluster is a breeze, but then what? By its very definition, you are dealing with vast amounts of data, most of which is unstructured. There is a lot to wade through to find the nuggets of relevant data; there is no easy way to perform this operation.

So what is big data? While there are several answers to that (rhetorical) question, the most common is similar to Wikipedia’s entry: big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

At Microsoft, we classify big data by the “four Vs”:

  • Volume: the amount of data. A “traditional” database is measured on the gigabyte-to-terabyte scale; a big data system, however, works at the petabyte (PB) to exabyte (EB) scale.
  • Velocity: how quickly the data arrives from its origin. Streaming data is a huge part of the “Internet of Things” (Figure 1). We need methods not only to accept and store this high-velocity information but also to analyze and act upon it while it is “in flight.” Streaming data (e.g. telemetry, logging, sentiment) has a very different set of characteristics than transactional data, which is traditionally collected and batch-loaded.
  • Variety: the diversity of the data, e.g. semi-structured records (HTML, XML, JSON, etc.), formatted records (CSV, XLS, etc.), unstructured records (DOC, PDF, etc.), and other binaries (JPG, MP4, MOV, WMV, etc.).
  • Variability: variances within the information itself. For example, a Word document may contain information in different sections of the file with completely different formats that, when combined, provide the information the consumer wants or needs.

Figure 1 – The Internet of Things
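To make the “variety” dimension concrete, here is a minimal Python sketch (the record and field names are illustrative, not from the post) showing the same logical record arriving as semi-structured JSON versus formatted CSV. A big data pipeline has to normalize both shapes before analysis:

```python
import csv
import io
import json

# The same logical record arriving in two of the formats named above.
json_record = '{"id": 1, "name": "widget", "price": 9.99}'
csv_record = "id,name,price\n1,widget,9.99\n"

from_json = json.loads(json_record)
from_csv = next(csv.DictReader(io.StringIO(csv_record)))

# JSON preserves types; CSV carries none, so every field parses as a string.
print(from_json["price"], type(from_json["price"]).__name__)  # 9.99 float
print(from_csv["price"], type(from_csv["price"]).__name__)    # 9.99 str
```

Even in this trivial case the consumer must reconcile types and schemas across formats — multiply that by binaries and free text and the “variety” problem becomes clear.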

Enter Hadoop
In 2006, the Apache Software Foundation released an open-source platform library known as “Hadoop.” From Apache:

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.”

In 2008, Yahoo, Inc. deployed the first enterprise-scale Hadoop implementation to manage the volume, velocity, variety, and variability of data associated with its web content. By building large-scale clusters running the Hadoop Distributed File System (HDFS) and MapReduce, Yahoo was able to distribute storage and processing horizontally on cost-effective hardware, rather than making massive capital investments in scale-up solutions (supercomputers and SAN). Distribution of the load is key.

Hadoop, at its core, is a distributed system with a set of tools and functions for storing, retrieving, managing, and analyzing information. I’ll cover each of the tools and functions in future blog posts. For now, let’s cover the fundamentals:

  • HDFS: the storage layer, deployed on “nodes” in a cluster. Nodes are commodity hardware, e.g. low-end servers. A cluster is a grouping of networked servers for the purposes of high availability and, in some instances, scalability (distributed workloads provide scalability).
  • MapReduce: procedures to retrieve, sort, and summarize data stored in HDFS. You use Java to create JAR files that define the actions you wish to take (the canonical example is WordCount).
  • Pig: a high-level scripting language (Pig Latin) whose scripts compile down to a series of MapReduce jobs that retrieve data from HDFS.
  • Hive: a SQL-like query language (HiveQL) for querying data stored in HDFS.
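The MapReduce model above can be illustrated without a cluster. This Python sketch mimics the classic WordCount job locally — the function names and sample input are mine, not Hadoop’s — and on a real cluster the map and reduce steps would run as distributed tasks (e.g. via a Java JAR or Hadoop Streaming):

```python
import itertools

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in a line of input."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce step: sum all the counts emitted for a single word."""
    return word, sum(counts)

def word_count(lines):
    """Simulate the MapReduce flow locally: map, shuffle/sort by key, reduce."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in itertools.groupby(pairs, key=lambda kv: kv[0])
    )

print(word_count(["the quick brown fox", "the lazy dog"]))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The sort-then-group step stands in for Hadoop’s shuffle phase, which routes all pairs with the same key to the same reducer — that routing is what lets the work spread across thousands of nodes.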

Microsoft has partnered with Hortonworks, a leader in the big data space, to implement a 100-percent Apache-compliant Hadoop. As of today, there are four methods to build Hadoop solutions in the Microsoft ecosystem:

  • Hortonworks Data Platform (HDP) on Windows
  • HDInsight, the Microsoft cloud Hadoop service (built upon HDP; more blog posts to come on this topic)
  • HDInsight Region within the Microsoft Parallel Data Warehouse (PDW) appliance
  • Hadoop on Linux in a Microsoft Azure or Hyper-V virtual machine; Azure is the Microsoft public cloud platform (Figure 2)

Figure 2 – Azure Virtual Machine Images

Microsoft differentiates itself in the Hadoop space by supporting any HDFS implementation architecture: in-cloud as platform-as-a-service (PaaS), in-cloud as infrastructure-as-a-service (IaaS), on-premises as a physical cluster, on-premises as a virtual cluster, or any combination thereof. We have designed our tools to natively support these architectures via WebHDFS (REST), the native HDInsight (HDI) provider, or Open Database Connectivity (ODBC)/Java Database Connectivity (JDBC). For moving data into and out of the Microsoft SQL Server platform, we developed a SQL Server connector for Sqoop, which is now built into Apache Sqoop (as of v1.4).
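As a sketch of the WebHDFS (REST) access path mentioned above: every HDFS file-system operation maps to an HTTP request against the /webhdfs/v1/ namespace. The helper below just builds such a URL — the host name is hypothetical, and 50070 is the NameNode’s default WebHDFS port on Hadoop 2.x-era clusters (your cluster may differ):

```python
from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS REST URL for the given HDFS path and operation.

    Operations such as LISTSTATUS, OPEN, and GETFILESTATUS are issued
    as plain HTTP GETs against the NameNode's /webhdfs/v1/ namespace.
    """
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# List a directory (roughly equivalent to `hdfs dfs -ls /data`):
url = webhdfs_url("namenode.example.com", "/data", "LISTSTATUS")
print(url)
# http://namenode.example.com:50070/webhdfs/v1/data?op=LISTSTATUS
```

Because the protocol is plain HTTP, any tool or language with an HTTP client can reach HDFS the same way — which is exactly what makes it a convenient common denominator across the PaaS, IaaS, and on-premises architectures listed above.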

Stay tuned for future posts on this topic!


Pat Sheehan works at the Microsoft Technology Center (MTC) in Silicon Valley, California, as Data Platform Architect. Prior to the MTC, Pat was with Microsoft Services as an engineer, architect, and enterprise strategist, specializing in SQL Server, Business Intelligence, and Enterprise/Solutions Architecture. He has a background in application and database development, with a strong emphasis on enterprise-scale solutions.

Comments
  • Great post Pat!

    This expanded a bit on my knowledge of Big Data and MS Hadoop solutions.

    I would love to learn more about the specific tools. I keep hearing about MapReduce, Pig, Hive, HDFS, etc. I'm guessing a lot of folks are like me, wondering how to get started "playing with" these technologies to feel them out. I am currently at a 100% Microsoft shop, with the exception of the virtual environment, which is driven off of VMware. (We are 100% virtualized.)

    Also, use cases are of big interest. I currently work for an organization with a small volume and velocity of data. But it is my understanding that there are a lot of free, public big data sources available that can be of value to businesses (e.g. how many mentions my company got on Twitter last month). I would like to know more about what valuable public data is available.

    As someone with a thorough understanding of traditional data modeling and ETL methods, I am interested in enhancing my knowledge and skills around big data and figuring out which tools and concepts to go after. I am not used to Business Intelligence developers actually writing Java code; perhaps working with Pig or Hive is something my existing skill set will migrate toward more easily. With all of these new technologies emerging, a lot of us with traditional data warehousing skills are trying to figure out where to begin!
