Yesterday I attended the Seattle Technique Forum: Case of Big Data. I'd like to post my takeaways here for your reference.
I asked her one question about data scientist as a career: since big data is so hot and so many people are interested in this area (this was the first time the meeting room at Bellevue Hall was full, with more than 10 people standing), what is your advice for them?
She answered my question with three key points:
She also touched on another topic: how to get people to see the value of data and make decisions based on it. She said it is very hard; she succeeds at a roughly 10:1 ratio when convincing people to make data-driven decisions. On average it took 10 to 18 months for a manager to fully embrace data-driven decision making (you can imagine how many factors influence the way people make decisions today, so it is not easy to get there). And by the time you realize this, it is often too late.
My takeaway is that speed is one of the most important factors: get your data out as soon as possible. If we focus too much on building a complete data-analytics solution, it may be useless by the time it finally arrives. Start small: ask the questions you want to answer, generate some reports from your data, use them to drive actions from your manager, and keep improving from there.
Jim talked about Google's Dremel query engine and BigQuery. Unlike MapReduce, which is aimed at batch processing over large data sets, Dremel aims to answer complex ad-hoc queries in a very short time. The presenter showed a live demo that queried all Wikipedia documents for the top pages whose description contains G*g??l?; the query returned in 30 seconds after scanning 600 GB of data. Dremel uses many traditional RDBMS techniques, including a column store, which I think some of the DBMS folks at Microsoft might find interesting.
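To get an intuition for why a column store helps with queries like that demo, here is a toy sketch (my own illustration, not Dremel's actual code or data): when the query filters on a single field, a columnar layout lets the engine scan just that one column instead of reading every field of every record.

```python
# Toy illustration of a column-store scan (not Dremel's real implementation).
# The table data and titles below are made up for the example.
import fnmatch

# The same tiny "table" of Wikipedia-like pages, stored column by column:
# one Python list per column instead of one dict per row.
columns = {
    "title": ["Google", "Goggles", "Aviation"],
    "description": ["Gogoolo search engine", "eye protection", "history of flight"],
    "views": [100, 5, 7],
}

def match_columnar(pattern):
    """Scan ONLY the 'description' column for a glob pattern,
    then fetch the matching titles by row index."""
    hits = [i for i, d in enumerate(columns["description"])
            if fnmatch.fnmatchcase(d, "*" + pattern + "*")]
    return [columns["title"][i] for i in hits]

print(match_columnar("G*g??l?"))  # → ['Google']
```

In a row store, answering this query means touching `title`, `description`, and `views` for every record; here the scan reads only the `description` list, which is why Dremel-style engines can chew through hundreds of gigabytes of a wide table so quickly when the query touches few columns.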