More than perhaps any other computing discipline, Data Science lends itself to Cloud Computing in general, and to Windows Azure in particular. That's a big claim, but before I offer some evidence, I need to explain what I mean by "Data Science". I've written before on Data Science (http://blogs.msdn.com/b/buckwoody/archive/2012/10/16/is-data-science-science.aspx, and https://www.simple-talk.com/cloud/data-science/data-science-laboratory-system---keyvalue-pair-systems/ ), but since it's an evolving field, here's what I've observed as the areas that a Data Scientist focuses on:
There are of course other aspects of data science, but I believe this list covers the majority of skills I've seen in individuals with the Data Scientist title. And it is normally an individual, or at least a very limited group of people. As you examine the list above, you can see that this person requires a fairly extensive technical background, and the domain knowledge area in particular takes a long time to build. That isn't to say a very bright person couldn't ramp up on these areas, just that having all of that in your portfolio takes time.
Given that these are the skillsets, why is cloud computing well suited to assisting in the data science function?
Cloud computing allows the data scientist to access data stored in Windows Azure (Blobs, Tables, Queues, and RDBMSs as a service such as SQL Server and MySQL) as well as in IaaS systems that can run full RDBMSs such as SQL Server, Oracle, PostgreSQL and others. In addition, the Windows Azure Marketplace contains "Data as a Service" offerings, with free and fee-based data that you can include in a single application.
The Windows Azure Service Bus lets you architect a Complex Event Processing (CEP) system, and SQL Server adds the StreamInsight feature for the same purpose. Such a system can communicate with on-premises systems, Windows Azure IaaS and PaaS, and other data sources.
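To make the CEP idea a little more concrete, here is a minimal sketch in plain Python of the kind of windowed query such a system runs: keep a sliding time window over an event stream and raise an alert when the windowed average crosses a threshold. This is not the StreamInsight or Service Bus API. The `Event` type, the `SlidingWindowAverage` class, and the `detect_spikes` function are hypothetical names I made up for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A hypothetical event shape: when it happened, where, and a reading."""
    timestamp: float  # seconds since some arbitrary start
    source: str
    value: float


class SlidingWindowAverage:
    """Running average over the events of the last `window_seconds`."""

    def __init__(self, window_seconds: float):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self._events = deque()
        self._total = 0.0

    def push(self, event: Event) -> float:
        """Add an event, evict expired ones, return the current average."""
        self._events.append(event)
        self._total += event.value
        cutoff = event.timestamp - self.window_seconds
        # Assumes in-order arrival, so expired events sit at the left end.
        while self._events[0].timestamp <= cutoff:
            expired = self._events.popleft()
            self._total -= expired.value
        return self._total / len(self._events)


def detect_spikes(events, window_seconds: float, threshold: float):
    """Return (timestamp, average) pairs where the windowed average
    exceeds the threshold."""
    window = SlidingWindowAverage(window_seconds)
    alerts = []
    for event in events:
        average = window.push(event)
        if average > threshold:
            alerts.append((event.timestamp, average))
    return alerts


if __name__ == "__main__":
    stream = [
        Event(0, "sensor-a", 10),
        Event(1, "sensor-a", 20),
        Event(2, "sensor-a", 90),
        Event(10, "sensor-a", 5),
    ]
    print(detect_spikes(stream, window_seconds=5, threshold=30))
```

In a real deployment the event source would be a queue or topic rather than an in-memory list, and a CEP engine would handle out-of-order arrival, which this sketch deliberately ignores.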
For data storage and computing, Windows Azure allows everything from the traditional RDBMSs described above to any NoSQL system in IaaS, on both Windows and Linux operating systems. Statistical packages such as "R" are also supported. The elasticity allows the data scientist to spin up huge clusters, such as Hadoop or other NoSQL offerings, perform some analysis, and then stop the process when complete, saving cost and bypassing the internal IT systems (which may carry its own dangers, to be sure). Windows Azure also offers the High Performance Computing (HPC) version of Windows Server, for large-scale massively parallel data processing in constant and "burst" modes.
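The cost argument for elasticity comes down to simple arithmetic, which the short Python sketch below works through. Every number in it is an assumption chosen for illustration (the node count, the hourly rate, the hours of analysis per month), not actual Windows Azure pricing, and `cluster_cost` and `elastic_savings` are hypothetical helper names.

```python
def cluster_cost(nodes: int, hourly_rate_per_node: float, hours: float) -> float:
    """Cost of running `nodes` machines for `hours` at a flat hourly rate."""
    return nodes * hourly_rate_per_node * hours


def elastic_savings(
    nodes: int,
    hourly_rate_per_node: float,
    job_hours_per_month: float,
    hours_in_month: float = 720,  # assumes a 30-day month
) -> float:
    """Difference between keeping the cluster up all month and
    running it only while the analysis job is active."""
    always_on = cluster_cost(nodes, hourly_rate_per_node, hours_in_month)
    on_demand = cluster_cost(nodes, hourly_rate_per_node, job_hours_per_month)
    return always_on - on_demand


if __name__ == "__main__":
    # Made-up scenario: 32 nodes at 0.50 per node-hour, 20 hours of analysis.
    nodes, rate, job_hours = 32, 0.50, 20
    print("always on:", cluster_cost(nodes, rate, 720))
    print("on demand:", cluster_cost(nodes, rate, job_hours))
    print("savings:  ", elastic_savings(nodes, rate, job_hours))
```

The sketch leaves out storage, data transfer, and the time it takes to start the cluster, all of which would reduce the savings in practice.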
In addition, Windows Azure has many services, such as the HDInsight Service (Hadoop on demand) and other analysis offerings, that don't even require the data scientist to stand up and manage a Virtual Machine in IaaS. For visualization, Microsoft has included the ability to use Excel with the HDInsight Service, which of course works with all Microsoft Business Intelligence functions, and there are several other data visualization tools such as Power View. You can enter the tools you have in the Microsoft stack into this tool (http://www.microsoft.com/en-us/bi/Products/bi-solution-builder.aspx) for more on the visualization options available to you. The data scientist can also build visualizations in web pages, on iPhone, Android or Windows mobile devices, or in full client-code installations.
Because of the need for elasticity, multiple operating systems, and changing landscapes for data and processing, data science is well served by cloud computing, and by Windows Azure in particular because of the services and features it offers, not only on Microsoft Windows but also on Open Source platforms.
Great article! I would add one more area that a Data Scientist should focus on: Semantics -- which includes ontologies, taxonomies, folksonomies, context metadata, RDF triplestores, and SPARQL -- for knowledge representation, preservation, sharing, and reuse. Being able to extract context (and knowledge) from Big Data, then representing and storing that information for later reuse, and finally applying context in recommendation engines (and in other "Learning From Data" applications) is really hot stuff for cool Data Scientists. This is especially true in social data as well as scientific data applications.