Information, tips, news and announcements about SQL Server Analysis Services and PowerPivot directly from the product team.
Over the past few weeks, I had an opportunity to work with colleague Denny Lee and David P Mariani (VP of Engineering, Klout) on a very interesting Big Data story about how Klout.com takes advantage of Analysis Services. Klout analyzes massive amounts of social network user data to measure the impact of opinions, links, and recommendations and identify influential individuals on the social web. Their data warehouse holds more than 800 terabytes of data. The signal collectors score hundreds of millions of profiles and process over 12 billion data points every day—and with Analysis Services Klout achieves query response times of less than 10 seconds on 1 trillion rows of data on a single server with 24 cores and 128 GB of memory.
It’s very clever how Klout connected a multidimensional Analysis Services database to Hadoop/Hive. The core idea is to use a linked server and distributed queries via the relational SQL Server engine. This approach helps Klout reduce data latencies, eliminate maintenance overhead and costs, move aggregation processing to Hadoop, and shorten development cycles. If you are searching for an efficient way to get similar benefits in your Big Data analytics solution, look no further: read the white paper SQL Server Analysis Services to Hive - A Klout Case Study at http://msdn.microsoft.com/en-us/library/jj710329.aspx. I hope you’ll enjoy it.
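As a rough sketch of the pattern the white paper describes, a linked server to Hive can be defined over the Hive ODBC driver and then queried with OPENQUERY, so the inner statement is passed through and the aggregation runs in Hadoop. The DSN, server, table, and column names below are hypothetical placeholders, not values from the Klout solution:

```sql
-- Create a linked server over a system ODBC DSN ("HiveDSN" is a
-- placeholder) that points to the Hive ODBC driver, using the
-- Microsoft OLE DB Provider for ODBC (MSDASQL).
EXEC master.dbo.sp_addlinkedserver
    @server     = N'HIVE',
    @srvproduct = N'HIVE',
    @provider   = N'MSDASQL',
    @datasrc    = N'HiveDSN';

-- OPENQUERY sends the inner statement to Hive verbatim, so the
-- GROUP BY is evaluated inside Hadoop; only the aggregated result
-- set travels back to SQL Server.
SELECT *
FROM OPENQUERY(HIVE,
    'SELECT user_id, COUNT(*) AS signal_count
     FROM   daily_signals
     GROUP BY user_id');
```

A view wrapped around such an OPENQUERY statement can then serve as the data source view table for the Analysis Services partitions, which is what keeps the heavy aggregation work on the Hadoop side.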
For a distinct count measure group, if my understanding is correct, SSAS will issue a query with an ORDER BY clause. That will make SQL Server fetch all the data from Hadoop and then sort it inside SQL Server, which might be time consuming. Is it possible to fetch the data from Hadoop already in order, so that SQL Server doesn't need to sort it again? Is this doable?
I would not try to do this via Hive, because Analysis Services needs to receive the records in a specific order to ensure accurate counting. Collation mismatches between the data source and Analysis Services, for example, can lead to wrong distinct count values in the cube.
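For context, the processing query Analysis Services generates for a distinct count partition is roughly of this shape, with the ORDER BY on the distinct count column so the engine can count distinct values in a single pass (the table and column names here are hypothetical):

```sql
-- Approximate shape of the partition processing query for a
-- distinct count measure group: the rows must arrive sorted by
-- the distinct count column (user_id in this sketch).
SELECT [user_id], [date_key], [network_key]
FROM   [dbo].[FactSignals]
ORDER BY [user_id];
```

Pushing this sort into Hive would only be safe if Hive's sort order matched the SQL Server collation exactly, which is precisely the mismatch risk mentioned above.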
Hope this helps.
We are trying to configure SQL Server against IBM BigInsights. The case study didn't mention which linked server properties govern the use of OPENQUERY. Can anyone advise on this, as it is very important for performance?
Very cool idea, but what about surrogate keys? Type 2 SCDs? How were these addressed in the Klout solution? Maybe Type 2 SCDs weren't needed, but surrogate keys are a fundamental part of a dimensionally modeled database.