Last week Microsoft announced several new offerings for “Big Data” - and since I’m a stickler for definitions, I wanted to make sure I understood what that really means. What is “Big Data”? What size hard drive is that? After all, my laptop has 1TB of storage - is my laptop “Big Data”?
There are actually a few definitions for this term, most notably those involving the “Four V’s”: Volume, Velocity, Variety and Variability. Others disagree with this definition. I tend to try to get things into their simplest form, so I’m using this definition for myself:
Big data is defined as a large set of computationally expensive data that is worked on simultaneously.
Let me flesh that out a little. To be sure, “Big Data” has a larger size than, say, a few megabytes. The reason this is important is that it takes special hardware to be able to move large sets of data around, store them, process them and so on. (large set)
If you store a LOT of data, but only use a small portion of it at a time, that really isn’t super-hard to do. It’s mainly a storage issue at that point. But, if you do need to work with a large portion of the data at one time, then the memory, CPU and transfer components of the system have to adapt to be responsive - new ways to work with that data (game theory, knot-algorithms, map-reduce, etc.) need to be brought into play. (computationally expensive)
Once that data is loaded into the processing area (memory or whatever other mechanism is used) it must be worked on in parallel to come back in a reasonable time. You have two options here - you can scale the system up with more internal hardware (CPUs, memory and so on) or you can scale it out to have multiple systems work on it at the same time using paradigms such as map/reduce. When you lay this out in an architecture diagram, scaling up or out doesn’t change the logical structure of the process - in scale out the network becomes the bus, and the nodes become more RAM and computing power. Of course, there are changes in code for how you stitch the workload back together. (worked on simultaneously)
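To make the scale-out idea a little more concrete, here is a minimal sketch of the map/reduce pattern in Python. It is only an illustration under some assumptions of mine: the word-count task, the function names (`split_into_chunks`, `map_chunk`, `merge_counts`, `word_count`) and the use of threads as stand-ins for cluster nodes are all hypothetical, and none of it is how Hadoop or any Microsoft offering is actually implemented.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Partition the data set so each worker gets its own slice."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: each worker counts words in its own slice only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: stitch two partial results back together."""
    return left + right


def word_count(records, workers=4):
    """Count words across all records using a split / map / reduce flow."""
    chunks = split_into_chunks(records, workers)
    # Threads stand in for the nodes of a cluster in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())
```

Calling `word_count(lines, workers=1)` and `word_count(lines, workers=8)` should give the same totals, which is the point made above: the logical structure stays the same, and only the number of workers and the code that merges the partial results change.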
So back to the original question. Is Big Data, as I have defined it here, a workload for Windows and SQL Azure? Absolutely! In fact, it’s probably one of the main workloads, and I believe it represents the latest, and perhaps also the earliest frontier of computing. Jim Gray, a former researcher here at Microsoft and a hero of mine, was working on this very topic. I believe as he did - all computing is simply an interface over data.
Microsoft has multiple offerings on the topic of Big Data. In posts that follow from me and my co-workers, we’ll explore when and where you use each one. Whether you are a data professional or a developer, this is the new frontier - don’t wait to educate yourself on how to leverage Big Data for your organization.
Hadoop on Windows Azure and SQL Server - Microsoft’s partnership to include Hadoop workloads on Windows Azure and SQL Server/Parallel Data Warehouse (PDW)
LINQ to HPC - Microsoft’s High-Performance Computing (HPC) SKU is now in Azure
Windows Azure Table Storage - A key/value pair type storage with full partitioning that is immediately consistent, able to handle huge loads of data and works with any REST-compatible language
Other offerings - Including the new Data Explorer, Project Daytona (with a Big Data Toolkit for Scientists and researchers), Power View and more.
The era of Big Data is here. And you can use Windows and SQL Azure to bring it to your organization.
I'm trying to understand how LINQ to HPC differs from the Hadoop-on-Windows effort that's going on now. Why would I choose one over the other, and why is MSFT funding both simultaneously? From everything that I've read/heard they tackle the same big data scenarios.
Chris - great question. You'll often see Microsoft use multiple technologies to answer the same question. We have a huge installed base of HPC, and of course Hadoop has different advantages. Look for us to constantly explore multiple ways of handling large data processing sets.
As far as the choice goes, like most any other, lay out your requirements first. Then evaluate each technology offering to see what meets your needs best. While that may be a bit vague, it works for me every time. :)
Actually, seems that MSFT announced the end of LINQ to HPC several days prior to our posts.
Chris - you're correct. When I posted I wasn't sure if that was public yet or not. :) We're focusing now on Hadoop - embracing open source and using it to help!