In the book “The Fourth Paradigm”, Jim Gray describes the process he came up with for modeling large databases. He was working with scientists, researchers and others, and the data sets they were generating were huge – terabytes of data at a time. Even coming up with an entity model was a daunting task, so he devised something he called “The 20 Queries”. He asked each researcher for the 20 most important queries they would ask of the data, and then he worked from that.

What caught my eye was the quote about the number 20:

“Most selections involving human choices follow a “long tail,” or so-called 1/f distribution, it is clear that the relative information in the queries ranked by importance is logarithmic, so the gain realized by going from approximately 20 (2^4.5) to 100 (2^6.5) is quite modest.”

What? They do? So I began my own research – it’s one of the many notes I took from the book. The 1/f distribution, or 1/f noise, is all over the place. I won’t go into the formula here since it’s a bit complicated, but it shows up in everything from signal generation in stereo and other speakers to the probability distributions in the stock market. It’s another one of those “magic formulas” like the golden mean that you see everywhere in nature – and yes, it even works for databases.
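For the curious, the arithmetic behind the quote can be sketched in a few lines of Python. This is only an illustration under my own assumptions: I treat the query ranked r as having importance 1/r, and I pick a pool of 1,000 possible queries out of thin air. Neither number comes from the book, and `top_k_share` is a name I made up for this sketch.

```python
import math


def harmonic(n: int) -> float:
    """Sum of 1/rank for rank 1..n, i.e. the total 1/f weight of n ranked items."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def top_k_share(k: int, total: int) -> float:
    """Share of the total 1/rank weight carried by the k highest-ranked queries."""
    return harmonic(k) / harmonic(total)


# Assumed size of the full query pool; chosen purely for illustration.
TOTAL_QUERIES = 1000

for k in (20, 100):
    share = top_k_share(k, TOTAL_QUERIES)
    print(f"top {k:>3} queries: log2(k) = {math.log2(k):.2f}, share of weight = {share:.0%}")
```

Under these assumptions the top 20 queries carry roughly 48% of the total weight and the top 100 roughly 69%, so asking for five times as many queries buys well under twice the coverage. That is the "quite modest" gain the quote is talking about.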

Here’s the way I use this:

  1. I have the users define their entities – the nouns in their organization or what they want to track.
  2. I then have them tell me about the verbs that define what those nouns do, and the relationships between them.
  3. Then I ask, “What would be your top 20 queries – in English – for each of those nouns and verbs?”
  4. From there, design is a breeze.
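
If you want to keep the results of those steps somewhere more durable than a whiteboard, here is one possible way to record them, as a minimal Python sketch. The class names (`Query`, `Workshop`) and the sample customers-and-orders data are hypothetical, invented only for this example. The idea is to tally which nouns the top queries actually touch, so you can see where the design effort should go first.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Query:
    text: str            # the question, in plain English
    touches: list[str]   # the entities (nouns) the question needs


@dataclass
class Workshop:
    entities: set[str] = field(default_factory=set)
    # Each relationship is a (noun, verb, noun) triple.
    relationships: list[tuple[str, str, str]] = field(default_factory=list)
    queries: list[Query] = field(default_factory=list)

    def entity_usage(self) -> Counter:
        """Count how many queries touch each entity (once per query)."""
        return Counter(name for q in self.queries for name in set(q.touches))

    def unused_entities(self) -> set[str]:
        """Entities the users named but no top query ever asks about."""
        return self.entities - set(self.entity_usage())


# Hypothetical sample data, not from any real project.
ws = Workshop(
    entities={"Customer", "Order", "Product", "Warehouse"},
    relationships=[
        ("Customer", "places", "Order"),
        ("Order", "contains", "Product"),
    ],
    queries=[
        Query("Which customers placed an order last month?", ["Customer", "Order"]),
        Query("Which products sell best?", ["Product", "Order"]),
    ],
)

print(ws.entity_usage().most_common())
print(ws.unused_entities())
```

In this made-up example every query touches Order, and nothing asks about Warehouse, which is a hint about what to model carefully and what can wait.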

Isn’t math cool?