Quite a while back I promised to tell everyone about the subtle new feature in DMX in SQL Server 2005 SP2. Well, that day has finally arrived, true believers – your patience has paid off! J
One of the nice improvements we made to SQL Server 2005 DMX was the ability to predict the probability of a particular state. That is, in SQL Server 2000, I could get the probability of the predicted state using the following query:
SQL Server 2000:SELECT PredictProbability([Gender]) FROM Customers PREDICTION JOIN …
This query gives me the probability of whatever state is predicted for Gender for each row of the input. However if I wanted to get the probability that a customer was ‘Female’, for example, I would have to write a query like this:
SQL Server 2000:SELECT FLATTENED (SELECT $Probability FROM PredictHistogram([Gender]) WHERE [Gender]='Female') FROM Customers PREDICTION JOIN …
Which, to put in polite terms, is a tad bit cumbersome. Not only that, but you get a column called “Expression.$PROBABILITY” as your output – not nice.
Luckily in SQL Server 2005 Data Mining, we added new syntax that allowed you to simply indicate the state for which you wanted the probability directly in the function call, like this:
SQL Server 2005:SELECT PredictProbability([Gender], 'Female') FROM Customers PREDICTION JOIN …
And life was good. We even made it nicer by allowing the specification of a parameter for the state, so you could write queries like the following:
SQL Server 2005 SP2:SELECT PredictProbability([Gender], @Gender) FROM Customers PREDICTION JOIN …
And specify the desired state at run-time. Wonderful! However, when we were implementing other features we ran into the one ommission. It turns out we even allowed arbitrary expressions like ‘Fe’+’male’ there, but we left out the ability to determine what the probability of a known state from the input. That means that if I wanted to determine what the probability was of a state and the state happened to be stored in the database, I would have to first issue a query to get the state and then issue another query to get the probability of that state. Not fun.
In SQL Server 2005 SP2 Data Mining we generalized the language so that essentially, every where you can place a parameter, you can also place a column reference. That means that I can solve the above problem with a query like this:
SELECT PredictProbability([Gender], t.Gender) FROM Customers NATURAL PREDICTION JOINOPENQUERY([Movie Click], 'SELECT * FROM CUSTOMERS') AS t
Now, where is this useful? I’m sure all of the creative DMXers out there will find innumerable uses for this new flexibility, but the most obvious application (for me at least) is for accuracy and lift calculations. Now, in one query, I can generate lift figures for my entire testing set without caching the data, allowing, for example, the generation of lift reports in Reporting Services!
Enjoy this new found friend in DMX, and I’d enjoy hearing how you put it to use in your applications!