I'm still re-reading the "Fourth Paradigm" book by Microsoft Research, and one section continues to intrigue me. There's a part where the book explains database design and proposes that the most important thing when you're designing large data sets is to find out the "Top Twenty Questions" the database has to answer. The quote is this:
"Most selections involving human choices follow a 'long tail,' or so-called 1/f distribution, it is clear that the relative information in the queries ranked by importance is logarithmic, so the gain realized by going from approximately 20 (2^4.5) to 100 (2^6.5) is quite modest."
I find this fascinating - it just doesn't seem to make "common" sense. Surely you have to ask a lot more questions than that to "get" the shape of the data? I researched the mathematical concept the quote describes (http://www.scholarpedia.org/article/1/f_noise), and I'll try some experiments here. I'll let you know what I uncover!
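As a first rough experiment, here is a minimal Python sketch of the arithmetic behind the quote. It is my own illustration, not anything from the book: it assumes the importance of the question ranked r is proportional to 1/r, and it assumes a hypothetical pool of 1,000 possible questions. The function names harmonic and coverage are made up for this example.

```python
import math

def harmonic(k):
    """Total weight of the top k questions when rank r has weight 1/r (assumed)."""
    return sum(1.0 / r for r in range(1, k + 1))

def coverage(k, total_questions):
    """Share of all weight captured by the top k of total_questions questions."""
    return harmonic(k) / harmonic(total_questions)

if __name__ == "__main__":
    pool = 1000  # hypothetical number of possible questions, chosen for illustration
    for k in (20, 100):
        print(
            f"top {k:>3}: log2(k) = {math.log2(k):.2f}, "
            f"weight = {harmonic(k):.2f}, "
            f"share of pool = {coverage(k, pool):.0%}"
        )
```

Under these assumptions the weight of the top k questions grows roughly like the logarithm of k, so asking five times as many questions (100 instead of 20) adds only about 1.4 times the weight. The exact share depends entirely on the assumed pool size, so treat the percentages as illustrative only.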
Here's the link for the book if you want to read it:
I'm thinking you have to ask more than 20 questions in order to figure out what the "top" twenty questions would be.
I have to agree with ShellyNoll about having to ask a lot more than 20 questions in order to figure out which 20 of the questions you ask are the most important. Unless, of course, the experts at Microsoft Research have a universally "most important" top 20 for us to ask.