Welcome to MSDN Blogs Sign in | Join | Help

Happy New Year with a big bang!

It's been over 10 years that I've been working on data mining for SQL Server and the new year brings big changes for me.  It's been a very exciting time for me to learn this technology and industry, and with the new year, I'm leaving Microsoft to become the founding CTO of Predixion Software.  It's too early for me to say exactly what we will be creating over there at Predixion, but it's safe to say that SQL Server Data Mining will feature prominently in our business.  I don't have an "official" landing page at the moment, but when there's news to be delivered, I'll post either here - if possible -  or at http://sqldataminingbook.com.   My contact info on this blog has been updated as well, so if you write me from here, I'll get it at my new location.

 Cheers and a happy new year to all!

 

-Jamie

Posted by JamieMac | 0 Comments
Filed under:

The amazing flexibility of DMX Table Valued Parameters

When most people think of “Table Valued Parameters” they think of “possibly the most anticipated T-SQL feature of SQL Server 2008.”  However, little may you know, that the Data Mining team added table valued parameters and table valued functions to DMX in SQL Server 2005.  Not only has this feature been in shipping code for more than four years as of this writing, the DMX implementation is considerably easier to use as well.

This blog entry is one of those that has been requested so many times, I look back on my past posts and am simply amazed that I haven’t written it yet.  The thing is, we made the feature so easy to use, that, as the development team, we didn’t think it needed more explanation.  Of course, as the development team we think all our features our so easy to use, so it may not say much.  Since last week the question came up yet again and since I’m working from home this morning prior to a parent-teacher conference, I decided to buck down and knock this topic out.

SQL Server Data Mining – use data from anywhere

When we set out to make data mining accessible from anywhere on any data we really meant it.  Independent of where the data resides you should always be able to use that data with SQL Server Data Mining to train models or perform predictions on already existing models.  And it should be hard.  One part of this functionality is XML for Analysis (XMLA) which allows you to connect to and communicate with a SQL Server Analysis Services server using SOAP.  I’m not going to get into that now, but it means that everything that can be done using any API can be done from any platform using raw XML, if need be.

But it needn’t be, because we thought it should be easier, so we added Table Valued Parameters to DMX in SQL Server 2005 – we didn’t call it that, but it is what it is.  It means that you can write a query like this:

INSERT INTO (<columns>) @InputRowset

And pass the training data as the rowset parameter.  So, this begs the question, how do you send a rowset as a parameter?  Well, unlike other implementations we tried to make this as easy as possible using the API’s you are currently using.  Of one thing we made sure – there is no “preparatory” work that is necessary for you to make this happen.  For example, you do not need to make any user-defined types or do any work on the server to get this to work.  Another thing we made happen, is that if you think that the object you have in your application is logically a rowset, you should be able to send it to us the way you have it.  This means that if you use our OLE DB provider and you have a pointer to an IRowset, you can pass the IRowset as the query parameter.  If you use our ADOMD.Net provider you can pass a DataTable or an IDataReader as the parameter.  Our providers do the work of converting the contents of the objects into the formats required by the protocol so you don’t have to.  Anyway that probably sounds more simple to me than it does, so let’s just go ahead with an example.

The Example

First, lets just create an arbitrary DataTable in memory like this:

DataTable table = new DataTable();
DataColumn column;
column = table.Columns.Add("Column1");
column.DataType = System.Type.GetType("System.Int32");
column = table.Columns.Add("Column2");
column.DataType = System.Type.GetType("System.String");
column = table.Columns.Add("Column3");
column.DataType = System.Type.GetType("System.String");

table.Rows.Add(1, "A", "B");
table.Rows.Add(2, "A", "B");
table.Rows.Add(3, "B", "C");
table.Rows.Add(4, "B", "C");

See – nothing special.  Of course, in your case you would populate the table with meaningful values, but that’s besides the point of the example.  Next I’ll use ADOMD.Net to create and train a model using DMX.  I’m using CREATE MINING MODEL just to shorten the number of steps.  Of course you can use CREATE MINING STRUCTURE followed by an ALTER MINING STRUCTURE statement, but, again, the point of the example is the table valued parameter.  Anyway, this is the code for creating the model.

 AdomdConnection conn = new AdomdConnection("DataSource=localhost");
AdomdCommand cmd = new AdomdCommand();
conn.Open();
cmd.Connection = conn;
// Create mining model
String createMiningModel =
                "CREATE MINING MODEL foo " +
                "( " +
                "   [rowid] LONG KEY, " +
                "   [A] TEXT DISCRETE, " +
                "   [B] TEXT DISCRETE " +
                ") USING Microsoft_Clustering " +
                "WITH DRILLTHROUGH";
cmd.CommandText = createMiningModel;
cmd.ExecuteNonQuery();
This creates a simple cluster model with three columns that match those in my table.  I added drillthrough so I can retrieve the cases for later, but other than that – totally straightforward.

Next, let’s look at the code that lets me take the table and use that data to train the model.

String insertInto =
                "INSERT INTO foo([rowid],[A],[B]) " +
                "@TrainingRowset";
cmd.CommandText = insertInto;
cmd.Parameters.Add("TrainingRowset", table);
cmd.ExecuteNonQuery();
See!  It really is that simple.  Just create your query, load your data into a table, add the table as a parameter and voila!  Table valued parameters sent to the server to process a model.

So I wanted to show an example of how to do this with a prediction query, but it’s just so easy it didn’t make sense.  However, you may ask “What if my data is external?  Do I have to load it into a data table before I can send it to the server?”  The answer is a profound NO.  If you can get any sort of IDataReader on the data, the ADOMD.Net provider will consume the data reader and stream the data to the server.

Let’s take another example using a data reader.  In this example I’m going to pull the cases out of the model using a <model>.CASES, since it’s convenient (and I added DRILLTHROUGH to the model definition), but any implementation of IDataReader will do.  In fact, this is kind of silly as DMX allows me to directly join against the model cases, but hey, it’s just for example’s sake.

Here’s the code to get the reader – I put it in a function just to exemplify that the calling code is simply getting an IDataReader and has nothing to do with ADOMD.Net:

private IDataReader GetReader()
{
    AdomdConnection conn = new AdomdConnection("DataSource=localhost");
    AdomdCommand cmd = new AdomdCommand();
    conn.Open();
    cmd.Connection = conn;
    cmd.CommandText = "SELECT [A],[B] FROM foo.CASES";
    return cmd.ExecuteReader();
}

And here’s the how you would take that reader and perform an arbitrary prediction join against the model.

cmd.Parameters.Clear();  // clear parameters from previous usage
String queryModel =
                "SELECT Cluster(), t.[A], t.[B] " +
                "FROM foo " +
                "NATURAL PREDICTION JOIN " +
                "@InputRowset AS t";
cmd.CommandText = queryModel;
cmd.Parameters.Add("InputRowset",GetReader());
AdomdDataReader reader = cmd.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine(String.Format("Cluster {0}, Input Values {1}, {2}",
            reader[0], reader[1], reader[2]));
}

Again, virtually no work on your part to accomplish pushing data from an arbitrary data provider on the client to the server to perform, in this case, predictions on data.

Summary

Table Valued Parameters in SQL Server Data Mining give you pretty much unlimited flexibility in the data sources you can use to perform data mining.  Even if appropriate data providers are not available on the server to use traditional data access methods such as OPENQUERY or model development in BI Dev Studio – or even if appropriate data providers don’t exist at all – you can still perform data mining operations on any data that you yourself can access or generate.

I hope this article opens some doors to some great new data mining applications!

Posted by JamieMac | 0 Comments
Filed under: ,

More Data Mining in the Cloud?

Since we released the Tech Preview of the Cloud Data Mining Addins last year, I've heard about and talked with Michael Zeller of Zementis.  Zementis has been very active in shaping the PMML (Predictive Modeling Markup Language) standard for representing learned knowledge in XML format.  They have a product called ADAPA that allows you to score, or predict from, arbitrary data sets using arbitrary PMML models.

Recently they released a version of ADAPA for the Amazon Cloud service and it has been very popular for them.  I just recieved a note from Michael showing how you can integrate ADAPA with SQL Server Integration Services.  Basically ADAPA exposes each model as a web mothod that you can call from the SSIS Web Services task.  It's up to you to populate SSIS variables from data and subsequently take the results from a variable and archive them, but it does give you the ability to take models generated in other data mining tools, such as SAS and SPSS and integrate them into your SQL Server workflows.

Hopefully Zementis will release an SSIS transform version of their service to make the usage more natural with SQL Server Integration Services, or better yet, offer a SQL Server Data Mining PlugIn version of their PMML reader so that you can use DMX against your SAS or SPSS models for integration into applications/reporting services/Excel Addins/etc.

 Let me know what you think and I'll pass the info on to Michael, or if you happen to be going to Paris this summer, you can catch him at the SIG KDD Data Mining Conference at the end of June where he'll be running a panel on Emerging Trends in Open Standards and Cloud Computing for Data Mining

 

New Cloud Data Mining Addin Usability Feature (well kind of)

Bogdan did an awesome job putting together the modified Data Mining Addin for the cloud.  We saw quite a few downloads of his new addin, but surprisingly and unfortunately, didn't see a lot of usage!  Strange?  Maybe.  Bogdan thought about the issue for a while and from a few anecdotal comments determined that once installed, the addin can be pretty hard to find.

So, the new usability feature is a full page writeup on how to find the addin at http://www.sqlserverdatamining.com/cloud/ once you click on the Download link.  (Basically, you have to make a table to get the table tools!)

 If you've downloaded the addin and "lost" it - go try again!  If you haven't yet, here's your chance!

Posted by JamieMac | 1 Comments
Filed under: ,

In Germany this April?

If so – check out the European PASS Conference, if not – maybe you should be!

This conference will have a great BI focus including a pre-conference session on using SQL Server BI tools to monitor SQL Server BI tools and at least two data mining sessions, including one from Steve Simon of State Street Corporation talking about how they personally leverage SQL Server Data Mining to manage their risk exposure.  Not another talking head from the product team (aka “me”) spouting the general awesomeness of the product, but user solving real-world user problems with the (imHo) best data mining product in the world.

If you can, check it out it will be a great chance to network with other users and see real-world implementations rather than MSFT demos!

Posted by JamieMac | 1 Comments

Querying the Dependency Net

OK, ok, so just yesterday I posted that it was easy to determine what queries were being used by the browsers to get the data underlying the view.  Of course it’s easy to get them, but without a teensy weensy bit of documentation, it’s not necessarily easy to understand what the parameters mean or possibly what the results mean.

For instance, take the dependency net – everybody loves the dependency net – it’s cool and shows that “high-level information” that everybody craves.  For example, this picture showing how “Die Hard” is the nexus linking Beverly Hills Cop with the Lethal Weapon family of movies.

image

Whatever.

Anyway, to get this information we call the deceptively simple ARGetNodeGraph function, that is, for Association Rules models, like this:

CALL System.ARGetNodeGraph('Associate Movies', 60)

With the first parameter being the name of the model and the second being the number of nodes to return.  The function chooses nodes using a heuristic that considers the popularity of the node and balances between inputs and outputs in order to produce a nice result.

That’s easy enough – let’s take a look at the output (truncated for space)

image

Whoa!  That’s a bit more complex.

Basically the output is divided into two sections indicated by the "NODE_TYPE” column.  The “NODE_TYPE” column actually has nothing to do  with a node type and (if I had to guess, which I don’t) I would say that NODE_TYPE was used to reuse names from the MINING_CONTENT schema rowset rather than be the most accurate moniker for the column itself.  NODE_TYPE is actually the ROW type, and has the values 1 or 2.  If the NODE_TYPE is 1, then the row represents a NODE in the graph.  If the NODE_TYPE is 2, the row represents an EDGE in the graph.  All of the other column interpretations depend on the type of row.

For NODE rows (NODE_TYPE=1, for those readers with serious short-term memory issues), the columns describe a node like this:

  • NODE_UNIQUE_NAME_1 – This is an ID used to identify the node in the edges section.
  • NODE_UNIQUE_NAME_2 – The default label for the node
  • MSOLAP_NODE_SCORE – Unused
  • ATTRIBUTE_NAME – The attribute name part of the default label, used to compose and decompose different options for node displays
  • ATTRIBUTE_LONG_NAME – The full attribute name – i.e. if the attribute came from a nested table, it has the name of the nested table, plus the specific attribute name in parenthesis.  Again used for label composition.
  • ATTRIBUTE_VALUE – The value name part of the default label, again used for label composition.

For EDGE rows (NODE_TYPE=2) the columns describe a directed edge like this:

  • NODE_UNIQUE_NAME_1 – The ID (from the NODE rows) of the source of the edge
  • NODE_UNIQUE_NAME_2 – The ID (from the NODE rows) of the destination of the edge
  • MSOLAP_NODE_SCORE – The strength of the edge – I don’t believe there is any specific range
  • Other columns - unused

So, that’s how the dependency net gets created – initially.  There are actually many additional functions used by the dependency net to, as you may say, fill out the graph.

For example, if you click the “Find Node” button in the dependency net browser, the browser issues this call:

CALL System.ARGetNodes('Associate Movies')

This call returns a result set like the NODE section of the ARGetNodeGraph, except without the NODE_TYPE column, with a row for every possible node – not just the top 60.  The only parameter is the name of the model.

If you select a node that is not already in the graph, this is where it gets a bit interesting.  The browser issues a call like this:

CALL System.ARAddNodes('Associate Movies', '600', 
     '604;726;648;733;630;700;719;718;130;387;386;712;621;727;670;154;337')
 

ARAddNodes has the following parameters:

  • strModel – the model name, but you already new this
  • strNodesToAdd – a semi-colon delimited list of node ids to add to the graph
  • strNodesInGraph – a semi-colon delimited list of node ids already in the graph

The result set looks like the EDGE section of the result of ARGetNodeGraph without the unused columns and contains only the edges between the nodes identified in strNodesToAdd and those identified in strNodesInGraph.  Note that the node id’s that are used are only those returned from ARGetNodeGraph or ARGetNodes and are not node id’s from the model content schema rowset.

NB When you see the function calls in SQLProfiler, you will get the fully qualified function name, e.g. System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.ARAddNodes.  You can eliminate all the intermediate namespaces and just call System.<function name>.

NB2 There are a set of equivalent stored procedures for Decision Trees, that you can probe by browsing a tree model’s dependency network

NB3 Nope, you won’t find a Naive Bayes version by browsing a NB model’s dep net – that browser was never “updated” to use stored procedures to get dependency network information.  You can use the Visio Data Mining Template and see what functions are called….but their different…..

Posted by JamieMac | 1 Comments
Filed under: ,

Data Mining in SQL Server 2008 Book Review

I just read a great review of our book from Richard Lees in Australia.  Richard is one of the early adopters of Analysis Service and Data Mining, so he has a lot of experience in this area (we actually reference some of his samples in the book!).  You should check out his blog anyway, just to learn from a master and expand your BI expertise!

Posted by JamieMac | 1 Comments

Querying like the Data Mining viewers do

It happens all the time.  You see some cool user interface trick and think “how can I do that?”  Or, maybe more likely, you think “gee, that’s useful, where’s that documented?”  In any case, if you have ever wondered about how the DM viewers get the data to display on the screen, then this is the post for you.

In many cases, what is displayed in the data mining viewers is the result of built-in stored procedures which allow the processing required for the view to be done on the server without requiring all of the model content to be brought to the server.  In this post, I’ll give one example of those stored procedures, and then I’ll add some information on how you con figure out the rest.  Not everything in the viewers is calculated on the server, but many of them are, and getting access directly to the data is useful (I presume, since everyone always asks…. :) )

Let’s start with an easy one – the Naive Bayes attribute discrimination view.  This view shows how the differences in the input attributes across the states of an output attribute.  It generally looks like this:

image

The viewer doesn’t download all of the correlations in the Naive Bayes content, rather it calls the stored procedure GetAttributeDiscrimination like this:

 

CALL System.GetAttributeDiscrimination
  ('Classify CollegePlans NB', 
   '100000005', 
   'Plans to attend', 
   1, 
   'All other states', 
   2, 
   0.0005, 
   true)

The not-so-obvious parameters are, in order strModel, strPredictableNode, strValue1, iValType1, strValue2, iValType2, dThreshold, and bNormalize.  Let’s go through these parameters…

strModel – the name of the model, of course!

strPredictableNode – this one is a bit difficult, as it takes the Node Unique Name of the target attribute instead of just the string you see in the viewer.  The Node Unique Name identifies the attribute in the content rowset generated by the model.  You can get the list of predictable attributes and their Node Unique Names by calling another stored procedure – like this CALL System.GetPredictableAttributes('ModelName').  This stored procedure returns two columns – one for the attribute name and one for the Node Unique Name.

strValue1 – The name of the value you want to compare on the left hand side.  The usage of this parameter depends on the value of the next parameter, which is….

iValType1 – This parameter indicates how to treat strValue1.  It can have values 0,1, or 2.  If this parameter is a 1, the value in strValue1 is the actual state of the attribute.  However, if this parameter is a 0 or 2, the value in strValue1 is ignored.  If the value is 0, the left-hand value is considered to be the “missing state”.  If the value is 2, the left hand value is considered to be “all other states.”  In the example above, “All other states” is specified only because it looks nice (and it’s easier to just drop the combo box value into the function call even if it will just be ignored….)

strValue2 – Like strValue1, but for the right hand side.

iValType2 – Like iValType2, but for the right hand side.

dThreshold – A threshold value used to filter results, such that small correlations don’t come back in the results.    Usually you set it to a really small number like 0.0005 in the example above.

bNormalize  - Whether or not the result is normalized.  If this value is true, the results are normalized to a maximum absolute value of 100, giving a possible range of –100 to 100.  All this does is take the largest absolute value in the result and divide that into 100, and then multiple all the other numbers by that amount.  If set to false, the numbers are whatever they are and you can figure it out yourself – it’s up to you, but we always set this to true.

The results

Calling this routine returns a row for every differentiating attribute/value pair with a score higher than the specified threshold.  The row contains the differentiating pair along with the score and some other columns and looks somewhat like this:

image

The score column is the “important” one and is best explained as if you did something like a c language compare routine e.g int Compare(int v1,int v2) { return v1-v2; } .  That is, if the value is positive it favors value1 and if the value is negative it favors value2.  I’m not going to go in depth on the other columns other than to say that they are the actual counts of the correlations of the discriminator against the inputs.  The best way to understand them is to look at the Mining Legend as you browse the model and click on rows.  For example if you clicked on the first row of the result above (in either picture), the Mining Legend would look like this:

image

Of course, once you have the result set you can use it wherever you want – in Reporting Services, Integration Services, or in you custom program.

How to get the function calls

So, how do you find this laundry list of undocumented stored procedures.  Well, some are documented in my book, but you can get them all for free just by looking in the right place.  You can run the SQL Server Profiler to see what functions are being called by the viewers.  Here’s how you do it.

First, run SQL Server Profiler:

image

Then, start a New Trace from the File Menu and connect to Analysis Services

image

You can leave all the defaults on for the Trace Properties dialog that appears

image

Then go to any data mining viewer and browse a model!  That’s it!  You will get trace output that looks like this:

image

You can get the query text in the bottom pane of the trace window by finding and selecting rows with event types of ‘Query Begin’ or ‘Query End’.

Oh, and make sure you stop the trace before you shut down SQL Profiler - ‘cuz it will keep going.

Let me know if you find any particularly tricky stored procedures that you need help with – best place to ask is in the MSDN forums!

Posted by JamieMac | 1 Comments

Those kids won't eat anything!

I did my BI Power Hour demo at PASS 2008 yesterday and it featured my twin boys Bowen and Logan. IMG_0450sm

Logan (right) has an ASD (Autism Spectrum Disorder) that limits his diet (no milk products, gluten, or soy) and Bowen has some sensory issues, but that doesn't explain their extreme pickiness with food.  For example, I made a rice-yogurt-blueberry smoothie for Logan and he just looked at it and said "yucky!".  Ugh.

So, I decided to make a worksheet listing foods they these kids will eat and won't eat.  I used attributes of Color, Type, and Processed, along with a column indicating whether or not they will actually eat the food.  Of course, I had to answer to myself disturbing questions such as "what color are hot dogs?", but I got through it.

image

Then I used the Prediction Calculator from the new Table Analysis Tools Excel addin for SQL Server 2008.  The Prediction Calculator creates a little widget in Excel that allows you to enter in input values and based on your costs.  Running the Prediction Calculator is as simple as selecting your table, clicking the Prediction Calculator button on the Table Analyze ribbon, and then choosing the column and value you want to predict.

imageimage

There's actually a little more work to do after you run the tool, and that is specifying your costs and profits.  Your costs are the cost you incur for getting the answer wrong, and a profit is the profit you make when you get the answer right.  The grid below is in Prediction Calculator Report that is created after running the tool.  In this case, I figured that if I guessed that my kids would eat some food and was wrong, it would cost me the 5 bucks for the food which would be wasted, therefore I set the False Positive Cost to 5.  Furthermore, I figured that if I guessed correctly that they wouldn't eat a food, I saved the money and the 5 bucks would still be in my pocket, so I set the True Negative Profit to 5 as well.

image

Doing so, gave me a profit chart that looks like below - which is a problem.  Basically, what a "always rising" chart says is that you should always say "no" to achieve the highest profit - which makes sense since I can only lose money by saying "yes" and only gain money by saying "no".  Essentially my laptop analysis tells me that my kids are simply too picky and I should just make them starve!  Hah!  My laptop apparently has never had kids!

image

Anyway, I decided that there's some nominal value for my kids eating, so I changed the parameters a bit.  I figured that my kids complaining that I didn't get a food that they wanted causes me the psychological cost of 1 dollar (or maybe the real cost of going back to the store of a dollar, however you want to see it), and I set the False Negative Cost to 1.  Also I decided the value of my kids not getting a sugar imbalance and (literally) bouncing off the walls is a "peace of mind" profit of a dollar, so I set the True Positive Profit to 1 as well.  This gives me a better behaved profit chart with a peak like below.

image

image

Once I've set my costs, I can go to the Prediction Calculator sheet that was created and select my inputs and see if my kids will actually eat the food.  In this case, "Yellow, unprocessed, grains and nuts" doesn't exceed the threshold of 642, so the answer is no.  Yay!  I saved 5 bucks because my kids won't eat corn :(.

image

Of course, this isn't very useful for me in this format - it's not like I'm going to lug my laptop around the grocery store plugging in values for every product I see on the shelf.  So to get around this I use the new, experimental Cloud Data Mining ServiceThe web interface contains many of the Table Analysis Tools, including the Prediction Calculator, and you can access data from CSV files, SQL Data Services, or another way which is not entirely obvious and not documented by simply pasting your data from Excel to the web.

image

Once you've pasted your data, I run the Prediction Calculator just like I did in Excel.  There's one small difference, however, in the result is that I have the HTML fragment for the calculator itself.  Therefore I can make my own web site with the calculator embedded inside.

image

Once I have my website (which in this case is at 

http://www.sqlserverdatamining.com/PASS2008BIPowerHourDemo.htm) I can access the Prediction Calculator from any web-enabled device - like my phone, which I can carry around the grocery store and determine that my kids will eat .... brown.....processed.....meat.... oh yay....

Untitled 3

Posted by JamieMac | 1 Comments
Filed under: , ,

Time's running out for your free data mining book!

The survey is now closed.  Thank you. 

Tomorrow is the last day you can fill out this survey for a chance to win one of ten copies of Data Mining with SQL Server 2008.  I used the the Data Exploration tool in the SQL Server 2008 Data Mining Client for Excel, and saw that it takes most people less than 15 minutes to fill it out.

image

(time to take survey in seconds)

Regarding the book, just today I received my sample copies and I was surprised at how much bigger it is than the 2005 book!  It rounds out at 636 pages - I remember last version we were running up against publisher defined page limits and we cut back material to make it fit.  This time, we just wrote what needed to be written and the publisher agreed to let us, the authors, make the decisions on how long the book should be.  I'm really happy about how the text turned out this time - we still don't have a review on Amazon, so hopefully the public will agree!

Posted by JamieMac | 1 Comments
Filed under: , ,

See me at PASS in Seattle next week

I'll be at the PASS 2008 conference in Seattle most of next week.  Currently (subject to change) I'll be presenting at the BI Power Hour on Wednesday at 1:45 and will be in the Ask the Experts area on Wednesday after the Power Hour session and Thursday from 11 to 2.

Stop by if you have any questions or just want to chit-chat.  I'll be happy to sign any book you bring by!  (Even if I'm not the author!)

Posted by JamieMac | 1 Comments

Get a FREE AUTOGRAPHED copy of Data Mining with SQL Server 2008

 The survey is now closed.  Thank you

 I checked on Amazon today and Data Mining with SQL Server 2008 is finally in stock! 

However, if you are currently using SQL Server Data Mining, before you rush out and buy a copy, please take this opportunity to win a FREE COPY AUTOGRAPHED BY BOGDAN AND ME by filling out a middling short survey on how you use the product.  I will be using the survey responses to determine the variety of creative ways customers are implementing SQL Server Data Mining solutions, and, of course, as the information churns in our brains, it means that you could have an impact on what happens in the product down the road.  So not only do you have a chance to win a book, your input could change history!

In any case, I'm not giving away just one, but TEN copies of the book to random responders by November 21.  So, if you are a SQL Server Data Mining user, please fill out the survey and get your name in the drawing.

Oh, and feel free to go ahead and buy the book, if you win, you'll have an extra loaner copy!! :)

Posted by JamieMac | 1 Comments
Filed under: ,

Time Series Reporting Stored Procedure - Part 4 of 3

OK, OK, I know it's just not right to do a part 4 of a three part series, but I've gotten enough demand to expand these stored procedures just a little bit.  I casually left out a class of time series models that contain nested tables since it made things a bit more complicated to explain, but a bunch of people reminded me ever so kindly that the only way to create Time Series models on OLAP cubes (using BIDS at least) is to use nested tables!

So here I am back again into my three part series on time series reporting.  To simplify the code I just made a new function which you would call like this

CALL TSSprocs.TimeSeriesReportNested('Store Forecasting','Store Cost','Store 1',10,5)

With the parameters being, in order, the model name, the column to forecast, the name of the series to forecast, the number of historical points (0 -> all), and the number of forecasted points.  In this version it is assumed that the dimension containing the names of the series are on the case level and Time and the measures are in the nested table.  The series name is required for this version (since in the OLAP scenario it always should be there).

I'm not going to go through the function line-by-line, but I will show you the queries I put together in order to fetch the historical and predicted data.  To fetch a limited amount of historical data from a nested table required a bit of a different query.  In this case, I had to use the TopCount function to retrieve the last rows and then use the same ORDER BY trick I used in the non-nested scenario to reverse the row order.  Note also the syntax required to reference the nested column in the outer select - I had to put my nested alias along with the nested column name inside the brackets for it to be properly referenced.

SELECT * FROM 
   (SELECT FLATTENED 
       (SELECT [Year Month],[Store Cost]
            FROM  TopCount([Time],[Year Month],5)) 
        AS ex 
    FROM [Store Forecasting].CASES 
    WHERE [Store]='Store 1') AS t
ORDER BY [ex.Year Month]

The case where you aren't selecting a subset of the data is fairly trivial, as is the actual prediction query - you just need to be aware of the subselect from the nested tables.

// Fetch all historical data for a series
// with nested tables
SELECT FLATTENED 
  (SELECT [Year Month],[Store Cost] FROM  [Time]) 
FROM [Store Forecasting].CASES WHERE Store='Store 1'

// Predict a series with nested tables
SELECT FLATTENED 
  (SELECT PredictTimeSeries([Store Cost],5) FROM Time) 
FROM [Store Forecasting] WHERE [Store] = 'Store 1'

In any case, the code for all of the stored procedures is attached, so you can replace the existing TSSprocs.cs with this one and start creating OLAP Mining Model Time Series reports!

Posted by JamieMac | 0 Comments

Attachment(s): TSSprocs.cs

Support Vector Machines for SQL Server Data Mining

Many have requested that we implement Support Vector Machines (SVM's) for SQL Server 2008, and for a wide variety of reasons, we just couldn't get to it.  Luckily the community has come to the rescue for those needing an SVM implementation today!  Joris Valkonet of Avanade Netherlands along with colleague Thanh Luc have implemented an SVM plug-in algorithm and viewer.  Not only that, but Joris has released the plug-in along with all of the source code at CodePlex so you can customize the algorithm for your own purposes or at least get another example of how algorithms and viewers are implemented.

Below is a screenshot from the viewer showing cancer classification split across two selectable axes with green and blue indicating correctly classified benign and malignant tumors respectively and red indicating misclassifications.

The plug-in and code can be found at http://www.codeplex.com/svmplugin

Viewer_WDBC.jpg

Posted by JamieMac | 1 Comments
Filed under: , ,

Getting Data Mining results into SQL Tables

I've been seeing lots of questions about how to get data mining results into SQL tables.  Most times the answers are "use the prediction query builder save button" or "build an SSIS package."  Both of these have issues among them being that the former is really only suited to single-use, small jobs, and the latter has a lot of overhead (not to mention that if you want to use the Data Mining Query Transform you have to have Enterprise Edition).

Luckily, there happens to be a much easier way - and it's one of those "Doh!" moments when you learn about it, because it's that easy.

The way to do it is to simply use linked servers.  Anyone who uses DMX knows to connect to SQL data with OPENQUERY.  There's no reason you can't simply connect to Data Mining data using the same mechanism.

For example, use a query like this to establish a link to an AS server:

EXEC sp_addlinkedserver 
@server='LINKED_AS', -- local SQL name given to the linked server
@srvproduct='', -- not used 
@provider='MSOLAP', -- OLE DB provider 
@datasrc='localhost', -- analysis server name (machine name) 
@catalog='MovieClick' -- default catalog/database 

Then you can select data using OPENQUERY like this:

SELECT * FROM 
OPENQUERY(LINKED_AS, 
  'SELECT Cluster() AS [Cluster], ClusterProbability() AS [Prob] 
   FROM [Customers - Clustering]
   NATURAL PREDICTION JOIN
   OPENQUERY([Movie Click],''SELECT * FROM Customers'') AS t')

Then, of course, you can do all kinds of manipulations on it, like finding the average cluster probability of each cluster, right?  Well, almost, the data type returned by the Cluster function is actual text or ntext or something that GROUP BY chokes on, so you have to do some casting first.  Therefore if you want to do that trick, use a query like this:

SELECT Cluster, AVG(Prob) FROM
(SELECT CAST(Cluster AS Char(30)) AS Cluster, Prob FROM OPENQUERY(LINKED_AS, 
  'SELECT Cluster() AS [Cluster], 
      ClusterProbability() AS [Prob] FROM [Customers - Clustering]
   NATURAL PREDICTION JOIN
   OPENQUERY([Movie Click],''SELECT * FROM Customers'') AS t')
   ) AS t
GROUP BY Cluster

That will give you a nice result showing you, in a way, the affinity of each cluster based on the input set.  That is, if you ran such a query against the training data, you could say that the clusters with a higher probability are "tighter" than the ones with low probabilities.  Anyway, that's besides the point of this post.

In any case, remember to double your single quotes and flatten any nested results and this technique should work just great for getting DMX into SQL.

-J

Posted by JamieMac | 1 Comments
Filed under: ,
More Posts Next page »
 
Page view tracker