by Terry Room


There is a lot of buzz around Big Data at the moment. This is both exciting and confusing in equal measure. With this post I've tried to distil some of the key thoughts and learnings gleaned from a number of interesting discussions on the subject.

So how big does your data have to be to be BIG?

Not too long ago a terabyte database required highly skilled practitioners to deliver and operate. Assisted by developments in storage technology, CPU architectures and systems management tools, multi-TB databases are now far more commonplace. Cast your mind back a little further and enterprises gainfully employed people to punch holes in cards which formed the base instruction sets for the first mainframe computers. My Windows Phone now carries more CPU capacity than those machines, and generates more data.

Traditional data management approaches have scaled to handle increasing business transaction volumes up to this point. So what's different now?

Latencies are changing

The opportunities brought by the global economy have come with fierce competition. This environment is more like a weather system - difficult to predict beyond short-term time horizons, and one where seismic events can massively disrupt. This brings the need for more agility - tactically, operationally, and even strategically. And this in turn places a higher demand on the execution of business processes and on low-latency BI 'feedback loops'. Air fleets now need to be re-planned in days, not weeks; trading strategies need to change in minutes and seconds; web sites need to be seconds behind the browsing behaviour of their would-be customers, not minutes, hours or days. A minimal sketch of such a feedback loop follows below.
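To make the 'feedback loop' idea concrete, here is a minimal Python sketch (an illustration only) of a tumbling-window aggregation over a stream of browsing events - the sort of seconds-level aggregate a pricing or recommendation process could act on. The event shape, product names and window length are assumptions for the example, not any particular product's API.

import time
from collections import Counter

WINDOW_SECONDS = 2  # illustrative window length

def browsing_stream():
    # Stand-in for a real event feed (clickstream, message bus, CEP input).
    sample = ["flights", "hotels", "flights", "cars", "flights", "hotels"]
    for product in sample:
        yield {"product": product, "ts": time.time()}
        time.sleep(0.5)  # simulate events arriving over time

def run_feedback_loop():
    counts = Counter()
    window_start = time.time()
    for event in browsing_stream():
        counts[event["product"]] += 1
        if event["ts"] - window_start >= WINDOW_SECONDS:
            # The 'feedback': fresh aggregates every few seconds that a
            # downstream process could act on immediately.
            print(f"window ending {time.ctime(event['ts'])}: {dict(counts)}")
            counts.clear()
            window_start = event["ts"]
    if counts:
        print(f"final window: {dict(counts)}")

if __name__ == "__main__":
    run_feedback_loop()

The point is the latency: aggregates are refreshed every few seconds as events arrive, rather than waiting for an overnight batch load.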

The New Data and the ‘need to correlate’

But there is something much more - the 'New Data'. Think of the data points which your mobile phone is constantly producing. Think of the big social networks such as Facebook and Twitter and the phenomenal volume of data they generate. Aggregations of 'likes' and 'dislikes', correlated with demographics, location information or email - there is potential value in all of this data which needs to be unlocked. Then there are large simulations (Monte Carlo, financial risk, particle physics and other 'Big Science' experiments), and the emergence of new technologies such as Hadoop which are driving down the cost/value ratio of processing such data (see the sketch below). To complicate things further, these sources come in different formats (streams, tables, files, documents). For enterprises this raises two key questions: where is the value, and how can I quantify it?
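As a flavour of the kind of processing involved, here is a minimal, purely local Python sketch of the map/reduce pattern that platforms such as Hadoop apply at scale: counting 'likes' per item, split by a demographic attribute. The records, field names and age bands are illustrative assumptions rather than any real feed.

from collections import defaultdict

likes = [  # raw 'like' events, as they might arrive from a social feed (hypothetical)
    {"item": "product_a", "age_band": "18-24", "location": "UK"},
    {"item": "product_a", "age_band": "25-34", "location": "UK"},
    {"item": "product_b", "age_band": "18-24", "location": "US"},
    {"item": "product_a", "age_band": "18-24", "location": "US"},
]

def map_phase(records):
    # Map step: emit (key, 1) pairs, keyed by (item, demographic).
    for rec in records:
        yield (rec["item"], rec["age_band"]), 1

def reduce_phase(pairs):
    # Reduce step: sum the counts for each key.
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return totals

for (item, age_band), total in sorted(reduce_phase(map_phase(likes)).items()):
    print(f"{item} liked {total} time(s) by the {age_band} group")

At Hadoop scale the same two steps run in parallel across many machines; the shape of the computation stays this simple, which is exactly what drives the cost of processing down.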

Rise of the Data Science Department - 'Questions, Answers, Questions'

Before answering these questions, let's look at another key factor.

An Enterprise BI project typically starts with a 'formulation of the questions we want to ask of the data'. This leads to the development of data models and ETL processes, and then the project is delivered, usually taking many months to complete.

Upon delivery the questions can be asked of the data, and only then is the extent of the value the data yields against those questions known. The desire for an 'answers to questions leading to more questions' approach has led to the emergence of many iterative BI methodologies. This partly helps, but there is still a need for 'project iterations'.

Couple this with some of the factors already mentioned (volume, latency, variety of format) and the need emerges for a complementary approach, based upon statistical modelling, data mining and algorithms run against 'all of the raw data'. That is, there is a pre-stage of experimentation, whereby statisticians look to find 'interesting patterns' in the new data - ones which may (or may not) be worthy of further investigation.
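As a sketch of what that pre-stage of experimentation might look like, the following Python snippet scans every pair of numeric fields in a (hypothetical) raw dataset for strong correlations and flags them as candidate patterns worth a closer look. The data, field names and the 'interesting' threshold are assumptions for illustration; in practice this would run over far larger raw data with more robust statistics.

from itertools import combinations
from statistics import mean

raw_data = {  # columns of raw observations (hypothetical)
    "site_visits":   [10, 20, 30, 40, 50],
    "orders":        [1, 2, 3, 5, 5],
    "support_calls": [9, 7, 8, 6, 7],
}

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length series.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

THRESHOLD = 0.8  # arbitrary cut-off for 'interesting'

for a, b in combinations(raw_data, 2):
    r = pearson(raw_data[a], raw_data[b])
    if abs(r) >= THRESHOLD:
        print(f"candidate pattern: {a} vs {b} (r = {r:.2f})")

Patterns surfaced this way are hypotheses rather than answers - the promising ones then become candidates for further investigation.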

In many cases this may naturally lead to ‘Data as a Service’ architectures in Enterprises, whereby Enterprise IT are responsible for data hosting, and business group analysts and data scientists are responsible for ‘making best use of it’.

But where is the value?

Of course this is all interesting. But Enterprises need to start with the 'value question' which will underpin their 'Big Data Strategy'. What is the value in this new data? How will it enable more satisfied customers, better products or services, reduced costs or new markets? How can value be expressed in tangible (non-subjective) terms, and how can this value be used as the foundation of a business case for investment in the new technologies and skills required to deliver upon the strategy?

Challenges on the ‘Path to Value’

There are challenges which will need to be overcome on the 'path to value'. Privacy, ownership, liability (i.e. who is responsible for decisions made on bad data), access control and sharing (IP) rights, as well as mixed technology adoption and the growth of new skills (not everyone has a data scientist at the moment!) are a few examples. The extent to which these can be overcome will be key.

Like any big disruptive force, the dust will settle on the hype of Big Data in time. The technology, and having people to deliver it, is incredibly important - be it CEP, Hadoop, MPP Data Warehouses, OLAP or OLTP. But to start this journey, don't worry about whether your data is big enough; worry about where you can get the most value from it. And just don't take too long establishing this or you may get left behind!

Terry Room
Solutions Architect, Microsoft

Terry is a Solutions Architect in Microsoft Services. Having held engineering, design and operational management roles on a number of large IT improvement projects through his career to date, he understands what it takes to deliver Mission Critical Solutions with Microsoft Technology. Terry has broad technology and industry experience spanning a number of industry sectors including Legal, Retail and Manufacturing, Financial Services and Capital Markets, and a number of solution areas including large scale transaction processing, business intelligence, integration, collaboration, web, business productivity and large scale infrastructure projects.