Welcome to MSDN Blogs Sign in | Join | Help

More Data Mining in the Cloud?

Since we released the Tech Preview of the Cloud Data Mining Addins last year, I've heard about and talked with Michael Zeller of Zementis.  Zementis has been very active in shaping the PMML (Predictive Modeling Markup Language) standard for representing learned knowledge in XML format.  They have a product called ADAPA that allows you to score, or predict from, arbitrary data sets using arbitrary PMML models.

Recently they released a version of ADAPA for the Amazon Cloud service and it has been very popular for them.  I just recieved a note from Michael showing how you can integrate ADAPA with SQL Server Integration Services.  Basically ADAPA exposes each model as a web mothod that you can call from the SSIS Web Services task.  It's up to you to populate SSIS variables from data and subsequently take the results from a variable and archive them, but it does give you the ability to take models generated in other data mining tools, such as SAS and SPSS and integrate them into your SQL Server workflows.

Hopefully Zementis will release an SSIS transform version of their service to make the usage more natural with SQL Server Integration Services, or better yet, offer a SQL Server Data Mining PlugIn version of their PMML reader so that you can use DMX against your SAS or SPSS models for integration into applications/reporting services/Excel Addins/etc.

 Let me know what you think and I'll pass the info on to Michael, or if you happen to be going to Paris this summer, you can catch him at the SIG KDD Data Mining Conference at the end of June where he'll be running a panel on Emerging Trends in Open Standards and Cloud Computing for Data Mining

 

New Cloud Data Mining Addin Usability Feature (well kind of)

Bogdan did an awesome job putting together the modified Data Mining Addin for the cloud.  We saw quite a few downloads of his new addin, but surprisingly and unfortunately, didn't see a lot of usage!  Strange?  Maybe.  Bogdan thought about the issue for a while and from a few anecdotal comments determined that once installed, the addin can be pretty hard to find.

So, the new usability feature is a full page writeup on how to find the addin at http://www.sqlserverdatamining.com/cloud/ once you click on the Download link.  (Basically, you have to make a table to get the table tools!)

 If you've downloaded the addin and "lost" it - go try again!  If you haven't yet, here's your chance!

Posted by JamieMac | 1 Comments
Filed under: ,

In Germany this April?

If so – check out the European PASS Conference, if not – maybe you should be!

This conference will have a great BI focus including a pre-conference session on using SQL Server BI tools to monitor SQL Server BI tools and at least two data mining sessions, including one from Steve Simon of State Street Corporation talking about how they personally leverage SQL Server Data Mining to manage their risk exposure.  Not another talking head from the product team (aka “me”) spouting the general awesomeness of the product, but user solving real-world user problems with the (imHo) best data mining product in the world.

If you can, check it out it will be a great chance to network with other users and see real-world implementations rather than MSFT demos!

Posted by JamieMac | 1 Comments

Querying the Dependency Net

OK, ok, so just yesterday I posted that it was easy to determine what queries were being used by the browsers to get the data underlying the view.  Of course it’s easy to get them, but without a teensy weensy bit of documentation, it’s not necessarily easy to understand what the parameters mean or possibly what the results mean.

For instance, take the dependency net – everybody loves the dependency net – it’s cool and shows that “high-level information” that everybody craves.  For example, this picture showing how “Die Hard” is the nexus linking Beverly Hills Cop with the Lethal Weapon family of movies.

image

Whatever.

Anyway, to get this information we call the deceptively simple ARGetNodeGraph function, that is, for Association Rules models, like this:

CALL System.ARGetNodeGraph('Associate Movies', 60)

With the first parameter being the name of the model and the second being the number of nodes to return.  The function chooses nodes using a heuristic that considers the popularity of the node and balances between inputs and outputs in order to produce a nice result.

That’s easy enough – let’s take a look at the output (truncated for space)

image

Whoa!  That’s a bit more complex.

Basically the output is divided into two sections indicated by the "NODE_TYPE” column.  The “NODE_TYPE” column actually has nothing to do  with a node type and (if I had to guess, which I don’t) I would say that NODE_TYPE was used to reuse names from the MINING_CONTENT schema rowset rather than be the most accurate moniker for the column itself.  NODE_TYPE is actually the ROW type, and has the values 1 or 2.  If the NODE_TYPE is 1, then the row represents a NODE in the graph.  If the NODE_TYPE is 2, the row represents an EDGE in the graph.  All of the other column interpretations depend on the type of row.

For NODE rows (NODE_TYPE=1, for those readers with serious short-term memory issues), the columns describe a node like this:

  • NODE_UNIQUE_NAME_1 – This is an ID used to identify the node in the edges section.
  • NODE_UNIQUE_NAME_2 – The default label for the node
  • MSOLAP_NODE_SCORE – Unused
  • ATTRIBUTE_NAME – The attribute name part of the default label, used to compose and decompose different options for node displays
  • ATTRIBUTE_LONG_NAME – The full attribute name – i.e. if the attribute came from a nested table, it has the name of the nested table, plus the specific attribute name in parenthesis.  Again used for label composition.
  • ATTRIBUTE_VALUE – The value name part of the default label, again used for label composition.

For EDGE rows (NODE_TYPE=2) the columns describe a directed edge like this:

  • NODE_UNIQUE_NAME_1 – The ID (from the NODE rows) of the source of the edge
  • NODE_UNIQUE_NAME_2 – The ID (from the NODE rows) of the destination of the edge
  • MSOLAP_NODE_SCORE – The strength of the edge – I don’t believe there is any specific range
  • Other columns - unused

So, that’s how the dependency net gets created – initially.  There are actually many additional functions used by the dependency net to, as you may say, fill out the graph.

For example, if you click the “Find Node” button in the dependency net browser, the browser issues this call:

CALL System.ARGetNodes('Associate Movies')

This call returns a result set like the NODE section of the ARGetNodeGraph, except without the NODE_TYPE column, with a row for every possible node – not just the top 60.  The only parameter is the name of the model.

If you select a node that is not already in the graph, this is where it gets a bit interesting.  The browser issues a call like this:

CALL System.ARAddNodes('Associate Movies', '600', 
     '604;726;648;733;630;700;719;718;130;387;386;712;621;727;670;154;337')
 

ARAddNodes has the following parameters:

  • strModel – the model name, but you already new this
  • strNodesToAdd – a semi-colon delimited list of node ids to add to the graph
  • strNodesInGraph – a semi-colon delimited list of node ids already in the graph

The result set looks like the EDGE section of the result of ARGetNodeGraph without the unused columns and contains only the edges between the nodes identified in strNodesToAdd and those identified in strNodesInGraph.  Note that the node id’s that are used are only those returned from ARGetNodeGraph or ARGetNodes and are not node id’s from the model content schema rowset.

NB When you see the function calls in SQLProfiler, you will get the fully qualified function name, e.g. System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.ARAddNodes.  You can eliminate all the intermediate namespaces and just call System.<function name>.

NB2 There are a set of equivalent stored procedures for Decision Trees, that you can probe by browsing a tree model’s dependency network

NB3 Nope, you won’t find a Naive Bayes version by browsing a NB model’s dep net – that browser was never “updated” to use stored procedures to get dependency network information.  You can use the Visio Data Mining Template and see what functions are called….but their different…..

Posted by JamieMac | 1 Comments

Data Mining in SQL Server 2008 Book Review

I just read a great review of our book from Richard Lees in Australia.  Richard is one of the early adopters of Analysis Service and Data Mining, so he has a lot of experience in this area (we actually reference some of his samples in the book!).  You should check out his blog anyway, just to learn from a master and expand your BI expertise!

Posted by JamieMac | 1 Comments

Querying like the Data Mining viewers do

It happens all the time.  You see some cool user interface trick and think “how can I do that?”  Or, maybe more likely, you think “gee, that’s useful, where’s that documented?”  In any case, if you have ever wondered about how the DM viewers get the data to display on the screen, then this is the post for you.

In many cases, what is displayed in the data mining viewers is the result of built-in stored procedures which allow the processing required for the view to be done on the server without requiring all of the model content to be brought to the server.  In this post, I’ll give one example of those stored procedures, and then I’ll add some information on how you con figure out the rest.  Not everything in the viewers is calculated on the server, but many of them are, and getting access directly to the data is useful (I presume, since everyone always asks…. :) )

Let’s start with an easy one – the Naive Bayes attribute discrimination view.  This view shows how the differences in the input attributes across the states of an output attribute.  It generally looks like this:

image

The viewer doesn’t download all of the correlations in the Naive Bayes content, rather it calls the stored procedure GetAttributeDiscrimination like this:

 

CALL System.GetAttributeDiscrimination
  ('Classify CollegePlans NB', 
   '100000005', 
   'Plans to attend', 
   1, 
   'All other states', 
   2, 
   0.0005, 
   true)

The not-so-obvious parameters are, in order strModel, strPredictableNode, strValue1, iValType1, strValue2, iValType2, dThreshold, and bNormalize.  Let’s go through these parameters…

strModel – the name of the model, of course!

strPredictableNode – this one is a bit difficult, as it takes the Node Unique Name of the target attribute instead of just the string you see in the viewer.  The Node Unique Name identifies the attribute in the content rowset generated by the model.  You can get the list of predictable attributes and their Node Unique Names by calling another stored procedure – like this CALL System.GetPredictableAttributes('ModelName').  This stored procedure returns two columns – one for the attribute name and one for the Node Unique Name.

strValue1 – The name of the value you want to compare on the left hand side.  The usage of this parameter depends on the value of the next parameter, which is….

iValType1 – This parameter indicates how to treat strValue1.  It can have values 0,1, or 2.  If this parameter is a 1, the value in strValue1 is the actual state of the attribute.  However, if this parameter is a 0 or 2, the value in strValue1 is ignored.  If the value is 0, the left-hand value is considered to be the “missing state”.  If the value is 2, the left hand value is considered to be “all other states.”  In the example above, “All other states” is specified only because it looks nice (and it’s easier to just drop the combo box value into the function call even if it will just be ignored….)

strValue2 – Like strValue1, but for the right hand side.

iValType2 – Like iValType2, but for the right hand side.

dThreshold – A threshold value used to filter results, such that small correlations don’t come back in the results.    Usually you set it to a really small number like 0.0005 in the example above.

bNormalize  - Whether or not the result is normalized.  If this value is true, the results are normalized to a maximum absolute value of 100, giving a possible range of –100 to 100.  All this does is take the largest absolute value in the result and divide that into 100, and then multiple all the other numbers by that amount.  If set to false, the numbers are whatever they are and you can figure it out yourself – it’s up to you, but we always set this to true.

The results

Calling this routine returns a row for every differentiating attribute/value pair with a score higher than the specified threshold.  The row contains the differentiating pair along with the score and some other columns and looks somewhat like this:

image

The score column is the “important” one and is best explained as if you did something like a c language compare routine e.g int Compare(int v1,int v2) { return v1-v2; } .  That is, if the value is positive it favors value1 and if the value is negative it favors value2.  I’m not going to go in depth on the other columns other than to say that they are the actual counts of the correlations of the discriminator against the inputs.  The best way to understand them is to look at the Mining Legend as you browse the model and click on rows.  For example if you clicked on the first row of the result above (in either picture), the Mining Legend would look like this:

image

Of course, once you have the result set you can use it wherever you want – in Reporting Services, Integration Services, or in you custom program.

How to get the function calls

So, how do you find this laundry list of undocumented stored procedures.  Well, some are documented in my book, but you can get them all for free just by looking in the right place.  You can run the SQL Server Profiler to see what functions are being called by the viewers.  Here’s how you do it.

First, run SQL Server Profiler:

image

Then, start a New Trace from the File Menu and connect to Analysis Services

image

You can leave all the defaults on for the Trace Properties dialog that appears

image

Then go to any data mining viewer and browse a model!  That’s it!  You will get trace output that looks like this:

image

You can get the query text in the bottom pane of the trace window by finding and selecting rows with event types of ‘Query Begin’ or ‘Query End’.

Oh, and make sure you stop the trace before you shut down SQL Profiler - ‘cuz it will keep going.

Let me know if you find any particularly tricky stored procedures that you need help with – best place to ask is in the MSDN forums!

Posted by JamieMac | 1 Comments

Those kids won't eat anything!

I did my BI Power Hour demo at PASS 2008 yesterday and it featured my twin boys Bowen and Logan. IMG_0450sm

Logan (right) has an ASD (Autism Spectrum Disorder) that limits his diet (no milk products, gluten, or soy) and Bowen has some sensory issues, but that doesn't explain their extreme pickiness with food.  For example, I made a rice-yogurt-blueberry smoothie for Logan and he just looked at it and said "yucky!".  Ugh.

So, I decided to make a worksheet listing foods they these kids will eat and won't eat.  I used attributes of Color, Type, and Processed, along with a column indicating whether or not they will actually eat the food.  Of course, I had to answer to myself disturbing questions such as "what color are hot dogs?", but I got through it.

image

Then I used the Prediction Calculator from the new Table Analysis Tools Excel addin for SQL Server 2008.  The Prediction Calculator creates a little widget in Excel that allows you to enter in input values and based on your costs.  Running the Prediction Calculator is as simple as selecting your table, clicking the Prediction Calculator button on the Table Analyze ribbon, and then choosing the column and value you want to predict.

imageimage

There's actually a little more work to do after you run the tool, and that is specifying your costs and profits.  Your costs are the cost you incur for getting the answer wrong, and a profit is the profit you make when you get the answer right.  The grid below is in Prediction Calculator Report that is created after running the tool.  In this case, I figured that if I guessed that my kids would eat some food and was wrong, it would cost me the 5 bucks for the food which would be wasted, therefore I set the False Positive Cost to 5.  Furthermore, I figured that if I guessed correctly that they wouldn't eat a food, I saved the money and the 5 bucks would still be in my pocket, so I set the True Negative Profit to 5 as well.

image

Doing so, gave me a profit chart that looks like below - which is a problem.  Basically, what a "always rising" chart says is that you should always say "no" to achieve the highest profit - which makes sense since I can only lose money by saying "yes" and only gain money by saying "no".  Essentially my laptop analysis tells me that my kids are simply too picky and I should just make them starve!  Hah!  My laptop apparently has never had kids!

image

Anyway, I decided that there's some nominal value for my kids eating, so I changed the parameters a bit.  I figured that my kids complaining that I didn't get a food that they wanted causes me the psychological cost of 1 dollar (or maybe the real cost of going back to the store of a dollar, however you want to see it), and I set the False Negative Cost to 1.  Also I decided the value of my kids not getting a sugar imbalance and (literally) bouncing off the walls is a "peace of mind" profit of a dollar, so I set the True Positive Profit to 1 as well.  This gives me a better behaved profit chart with a peak like below.

image

image

Once I've set my costs, I can go to the Prediction Calculator sheet that was created and select my inputs and see if my kids will actually eat the food.  In this case, "Yellow, unprocessed, grains and nuts" doesn't exceed the threshold of 642, so the answer is no.  Yay!  I saved 5 bucks because my kids won't eat corn :(.

image

Of course, this isn't very useful for me in this format - it's not like I'm going to lug my laptop around the grocery store plugging in values for every product I see on the shelf.  So to get around this I use the new, experimental Cloud Data Mining ServiceThe web interface contains many of the Table Analysis Tools, including the Prediction Calculator, and you can access data from CSV files, SQL Data Services, or another way which is not entirely obvious and not documented by simply pasting your data from Excel to the web.

image

Once you've pasted your data, I run the Prediction Calculator just like I did in Excel.  There's one small difference, however, in the result is that I have the HTML fragment for the calculator itself.  Therefore I can make my own web site with the calculator embedded inside.

image

Once I have my website (which in this case is at 

http://www.sqlserverdatamining.com/PASS2008BIPowerHourDemo.htm) I can access the Prediction Calculator from any web-enabled device - like my phone, which I can carry around the grocery store and determine that my kids will eat .... brown.....processed.....meat.... oh yay....

Untitled 3

Posted by JamieMac | 1 Comments
Filed under: , ,

Time's running out for your free data mining book!

The survey is now closed.  Thank you. 

Tomorrow is the last day you can fill out this survey for a chance to win one of ten copies of Data Mining with SQL Server 2008.  I used the the Data Exploration tool in the SQL Server 2008 Data Mining Client for Excel, and saw that it takes most people less than 15 minutes to fill it out.

image

(time to take survey in seconds)

Regarding the book, just today I received my sample copies and I was surprised at how much bigger it is than the 2005 book!  It rounds out at 636 pages - I remember last version we were running up against publisher defined page limits and we cut back material to make it fit.  This time, we just wrote what needed to be written and the publisher agreed to let us, the authors, make the decisions on how long the book should be.  I'm really happy about how the text turned out this time - we still don't have a review on Amazon, so hopefully the public will agree!

Posted by JamieMac | 1 Comments
Filed under: , ,

See me at PASS in Seattle next week

I'll be at the PASS 2008 conference in Seattle most of next week.  Currently (subject to change) I'll be presenting at the BI Power Hour on Wednesday at 1:45 and will be in the Ask the Experts area on Wednesday after the Power Hour session and Thursday from 11 to 2.

Stop by if you have any questions or just want to chit-chat.  I'll be happy to sign any book you bring by!  (Even if I'm not the author!)

Posted by JamieMac | 1 Comments

Get a FREE AUTOGRAPHED copy of Data Mining with SQL Server 2008

 The survey is now closed.  Thank you

 I checked on Amazon today and Data Mining with SQL Server 2008 is finally in stock! 

However, if you are currently using SQL Server Data Mining, before you rush out and buy a copy, please take this opportunity to win a FREE COPY AUTOGRAPHED BY BOGDAN AND ME by filling out a middling short survey on how you use the product.  I will be using the survey responses to determine the variety of creative ways customers are implementing SQL Server Data Mining solutions, and, of course, as the information churns in our brains, it means that you could have an impact on what happens in the product down the road.  So not only do you have a chance to win a book, your input could change history!

In any case, I'm not giving away just one, but TEN copies of the book to random responders by November 21.  So, if you are a SQL Server Data Mining user, please fill out the survey and get your name in the drawing.

Oh, and feel free to go ahead and buy the book, if you win, you'll have an extra loaner copy!! :)

Posted by JamieMac | 1 Comments
Filed under: ,

Time Series Reporting Stored Procedure - Part 4 of 3

OK, OK, I know it's just not right to do a part 4 of a three part series, but I've gotten enough demand to expand these stored procedures just a little bit.  I casually left out a class of time series models that contain nested tables since it made things a bit more complicated to explain, but a bunch of people reminded me ever so kindly that the only way to create Time Series models on OLAP cubes (using BIDS at least) is to use nested tables!

So here I am back again into my three part series on time series reporting.  To simplify the code I just made a new function which you would call like this

CALL TSSprocs.TimeSeriesReportNested('Store Forecasting','Store Cost','Store 1',10,5)

With the parameters being, in order, the model name, the column to forecast, the name of the series to forecast, the number of historical points (0 -> all), and the number of forecasted points.  In this version it is assumed that the dimension containing the names of the series are on the case level and Time and the measures are in the nested table.  The series name is required for this version (since in the OLAP scenario it always should be there).

I'm not going to go through the function line-by-line, but I will show you the queries I put together in order to fetch the historical and predicted data.  To fetch a limited amount of historical data from a nested table required a bit of a different query.  In this case, I had to use the TopCount function to retrieve the last rows and then use the same ORDER BY trick I used in the non-nested scenario to reverse the row order.  Note also the syntax required to reference the nested column in the outer select - I had to put my nested alias along with the nested column name inside the brackets for it to be properly referenced.

SELECT * FROM 
   (SELECT FLATTENED 
       (SELECT [Year Month],[Store Cost]
            FROM  TopCount([Time],[Year Month],5)) 
        AS ex 
    FROM [Store Forecasting].CASES 
    WHERE [Store]='Store 1') AS t
ORDER BY [ex.Year Month]

The case where you aren't selecting a subset of the data is fairly trivial, as is the actual prediction query - you just need to be aware of the subselect from the nested tables.

// Fetch all historical data for a series
// with nested tables
SELECT FLATTENED 
  (SELECT [Year Month],[Store Cost] FROM  [Time]) 
FROM [Store Forecasting].CASES WHERE Store='Store 1'

// Predict a series with nested tables
SELECT FLATTENED 
  (SELECT PredictTimeSeries([Store Cost],5) FROM Time) 
FROM [Store Forecasting] WHERE [Store] = 'Store 1'

In any case, the code for all of the stored procedures is attached, so you can replace the existing TSSprocs.cs with this one and start creating OLAP Mining Model Time Series reports!

Posted by JamieMac | 0 Comments

Attachment(s): TSSprocs.cs

Support Vector Machines for SQL Server Data Mining

Many have requested that we implement Support Vector Machines (SVM's) for SQL Server 2008, and for a wide variety of reasons, we just couldn't get to it.  Luckily the community has come to the rescue for those needing an SVM implementation today!  Joris Valkonet of Avanade Netherlands along with colleague Thanh Luc have implemented an SVM plug-in algorithm and viewer.  Not only that, but Joris has released the plug-in along with all of the source code at CodePlex so you can customize the algorithm for your own purposes or at least get another example of how algorithms and viewers are implemented.

Below is a screenshot from the viewer showing cancer classification split across two selectable axes with green and blue indicating correctly classified benign and malignant tumors respectively and red indicating misclassifications.

The plug-in and code can be found at http://www.codeplex.com/svmplugin

Viewer_WDBC.jpg

Posted by JamieMac | 1 Comments
Filed under: , ,

Getting Data Mining results into SQL Tables

I've been seeing lots of questions about how to get data mining results into SQL tables.  Most times the answers are "use the prediction query builder save button" or "build an SSIS package."  Both of these have issues among them being that the former is really only suited to single-use, small jobs, and the latter has a lot of overhead (not to mention that if you want to use the Data Mining Query Transform you have to have Enterprise Edition).

Luckily, there happens to be a much easier way - and it's one of those "Doh!" moments when you learn about it, because it's that easy.

The way to do it is to simply use linked servers.  Anyone who uses DMX knows to connect to SQL data with OPENQUERY.  There's no reason you can't simply connect to Data Mining data using the same mechanism.

For example, use a query like this to establish a link to an AS server:

EXEC sp_addlinkedserver 
@server='LINKED_AS', -- local SQL name given to the linked server
@srvproduct='', -- not used 
@provider='MSOLAP', -- OLE DB provider 
@datasrc='localhost', -- analysis server name (machine name) 
@catalog='MovieClick' -- default catalog/database 

Then you can select data using OPENQUERY like this:

SELECT * FROM 
OPENQUERY(LINKED_AS, 
  'SELECT Cluster() AS [Cluster], ClusterProbability() AS [Prob] 
   FROM [Customers - Clustering]
   NATURAL PREDICTION JOIN
   OPENQUERY([Movie Click],''SELECT * FROM Customers'') AS t')

Then, of course, you can do all kinds of manipulations on it, like finding the average cluster probability of each cluster, right?  Well, almost, the data type returned by the Cluster function is actual text or ntext or something that GROUP BY chokes on, so you have to do some casting first.  Therefore if you want to do that trick, use a query like this:

SELECT Cluster, AVG(Prob) FROM
(SELECT CAST(Cluster AS Char(30)) AS Cluster, Prob FROM OPENQUERY(LINKED_AS, 
  'SELECT Cluster() AS [Cluster], 
      ClusterProbability() AS [Prob] FROM [Customers - Clustering]
   NATURAL PREDICTION JOIN
   OPENQUERY([Movie Click],''SELECT * FROM Customers'') AS t')
   ) AS t
GROUP BY Cluster

That will give you a nice result showing you, in a way, the affinity of each cluster based on the input set.  That is, if you ran such a query against the training data, you could say that the clusters with a higher probability are "tighter" than the ones with low probabilities.  Anyway, that's besides the point of this post.

In any case, remember to double your single quotes and flatten any nested results and this technique should work just great for getting DMX into SQL.

-J

Posted by JamieMac | 1 Comments
Filed under: ,

Data Mining with SQL Server 2008

Last week Bogdan and I went through the final steps in completing the SQL (sic) to the wildly popular "Data Mining with SQL Server 2005", cleverly titled "Data Mining with SQL Server 2008".  We went through the final PDF's and signed off that the companion web content is good to go.  I am definitely pleased with how this book turned out.  I listened to the feedback on the first book and pressed the authors and the editors to ensure that this book has much higher "initial code quality" as you might call it. 

Writing a book isn't really all that much different than writing software, particularly when the book is about software, I suppose.  The authors are the developers and the editors are the testers, and I have to say that the editors took us to task ensuring that every phrase turned correctly and no word was used wastefully.  It was great help that, unlike the first edition, we were able to have the same editors throughout, and even though at times it seemed they were being a bit picky :), they were able to catch similar phrases used differently across multiple chapters and made a huge impact on the final text.  Also, Shuvro, from the DM team here, manually verified each and every query and line of code that is included on the companion site.  (I suppose that it helps that the book was written after the product was completed, so that API's and language constructs didn't change in flight)

So, readers of the first book will probably want to know what's new and different about this one.  Overall, the book has generally the same structure as the first - so it's evolutionary rather than revolutionary.  That being said, I would guess that at over half the content is new or heavily modified.  For example, I rewrote most of the introduction, there are two new chapters covering the Data Mining Addins for Office, and the "OLE DB for Data Mining" chapter was completely reimagined and rewritten as "Data Mining Concepts and DMX."  Each of the algorithm chapters was updated with respect to new features and comments from forums and newsgroups.  Even where there were no significant changes to the algorithm between 2005 and 2008, the chapters were reorganized to focus on the practical application first and leave the technical implementation details to the latter portion of the chapter.  Also, Bogdan, borrowing from his 50-page epic whitepaper SQL Server Data Mining Programmability, took over and greatly enhanced the Architecture and API chapters.

According to Amazon, the book will be released on November 10th and you can pre-order now.  Since the "read inside" feature isn't available yet (likely since the book is being printed as I write this), I've taken the liberty and pasted the table of contents to make this probably the longest blog post I've ever had the pleasure of pasting.  Anyway, I hope this gives you a good idea of what's coming in the book so you can preorder with confidence! :)

Foreword xxix

Introduction xxxi

Chapter 1 Introduction to Data Mining in SQL Server 2008 1

Business Problems for Data Mining 4

Data Mining Tasks 6

Classification 6

Clustering 6

Association 7

Regression 8

Forecasting 8

Sequence Analysis 9

Deviation Analysis 9

Data Mining Project Cycle 9

Business Problem Formation 10

Data Collection 10

Data Cleaning and Transformation 10

Model Building 12

Model Assessment 12

Reporting and Prediction 12

Application Integration 13

Model Management 13

Summary 13

Chapter 2 Applied Data Mining Using Microsoft Excel 2007 15

Setting Up the Table Analysis Tools 16

Configuring Analysis Services with Administrative Privileges 17

Configuring Analysis Services without Administrative

Privileges 18

What the Add-Ins Expect 19

What to Do If You Need Help 22

The Analyze Key Influencers Tool 22

The Main Influencers Report 24

The Discrimination Report 26

Summary of the Analyze Key Influencers Task 28

The Detect Categories Tool 28

Launching the Tool 29

The Categories Report 30

Categories and the Number of Rows in Each 30

Characteristics of Each Category 31

The Category Profiles Chart 32

Summary of the Detect Categories Tool 34

The Fill From Example Tool 35

Running the Tool and Interpreting the Results 36

Refining the Results 38

Summary of the Fill From Example Tool 39

The Forecasting Tool 39

Launching the Tool and Specifying Options 40

Interpreting the Results 42

Summary of the Forecast Tool 44

The Highlight Exceptions Tool 44

Using the Tool 45

More Complex Interactions 48

Limitations and Troubleshooting 50

Summary of the Highlight Exceptions Tool 51

The Scenario Analysis Tool 51

The Goal Seek Tool 53

Using Goal Seek for a Numeric Goal 56

Using Goal Seek for the Whole Table 57

TheWhat-If Tool 58

UsingWhat-If for the Whole Table 61

Summary of the Scenario Analysis Tool 62

The Prediction Calculator Tool 62

Running the Tool 64

The Prediction Calculator Spreadsheet 65

The Printable Calculator Spreadsheet 67

Refining the Results 68

Using the Results 73

Summary of the Prediction Calculator Tool 73

The Shopping Basket Analysis Tool 74

Using the Tool 75

The Bundled Item Report 76

The Recommendations Report 77

Tweaking the Tool 79

Summary of the Shopping Basket Analysis Tool 81

Technical Overview of the Table Analysis Tools 81

Summary 82

Chapter 3 Data Mining Concepts and DMX 83

History of DMX 83

Why DMX? 84

The Data Mining Process 85

Key Concepts 86

Attribute 86

State 87

Case 88

Keys 89

Inputs and Outputs 91

DMX Objects 93

Mining Structure 93

Mining Model 94

DMX Query Syntax 95

Creating Mining Structures 96

Discretized Columns 97

Nested Tables 98

Partitioning into Testing and Training Sets 99

Creating Mining Models 100

Nested Tables 101

Complex Nesting Scenarios 104

Filters 107

Populating Mining Structures 108

Populating Nested Tables 110

Querying Structure Data 112

Querying Model Data 112

Prediction 115

Prediction Join 116

Prediction Query Syntax 116

Nested Source Data 117

Real-Time Prediction 118

Degenerate Predictions 119

Prediction Functions 120

PredictNodeID 122

External and User-Defined Functions 123

Predictions on Nested Tables 123

Predicting Nested Value Columns 124

Summary 125

Chapter 4 Using SQL Server Data Mining 127

Introducing the Business Intelligence Development Studio 128

Understanding the User Interface 128

Offline Mode and Immediate Mode 130

Immediate Mode 131

Getting Started in Immediate Mode 131

Offline Mode 132

Getting Started in Offline Mode 133

Switching Project Modes 135

Creating Data Mining Objects 135

Setting Up Your Data Sources 135

Understanding Data Sources 136

Creating the MovieClick Data Source 137

Using the Data Source View 137

Creating the MovieClick Data Source View 138

Working with Named Calculations 140

Creating a Named Calculation on the Customers Table 142

Working with Named Queries 142

Creating a Named Query Based on the Customers Table 143

Organizing the DSV 144

Exploring Data 145

Creating and Editing Models 148

Structures and Models 148

Using the Data Mining Wizard 148

Creating the MovieClick Mining Structure and Model 155

Using Data Mining Designer 157

Working with the Mining Structure Editor 157

Adding the Genre Column to the Movies Nested Table 159

Working with the Mining Models Editor 160

Creating and Modifying Additional Models 163

Processing 164

Processing the MovieClick Mining Structure 165

Using Your Models 166

Understanding the Model Viewers 166

Using the Mining Accuracy Chart 167

Selecting Test Data 168

Understanding the Accuracy Charts 169

Using the Profit Chart 172

Multiple Target Accuracy Charts 172

Using the Classification Matrix 173

Scatter Accuracy Charts 173

Creating a Lift Chart on MovieClick 174

Using CrossValidation 174

Using the Mining Model Prediction Builder 178

Executing a Query on the MovieClick Model 179

Creating Data Mining Reports 180

Using SQL Server Management Studio 181

Understanding the Management Studio User Interface 182

Using Server Explorer 182

Using Object Explorer 183

Using the Query Editor 184

Summary 185

Chapter 5 Implementing a Data Mining Process Using Office 2007 187

Importing Data using the Data Mining Client 189

Data Exploration and Preparation 190

Discretizing Data with the Explore Data Tool 191

Chopping Off the Long Tail 191

Consolidating Meaning 192

Eliminating Spurious Values 194

Rebalancing Data 195

Modeling 196

Task-Based Modeling 196

Introduction 198

Select Data 198

Select Columns and Options 198

Split Data 200

Finishing the Task 200

Advanced Modeling in the Data Mining Client 200

Accuracy and Validation 203

Model Usage 204

Browsing Models 204

Viewing Models with Visio 205

Querying Models 208

QueryWizard 208

Data Mining Cell Functions 211

DMPREDICT 211

DMPREDICTTABLEROW 212

DMCONTENTQUERY 212

Model Management 213

Trace 213

Summary 213

Chapter 6 Microsoft Naıve Bayes 215

Introducing the Naıve Bayes Algorithm 216

Using the Naıve Bayes Algorithm 216

Creating a Predictive Model 217

Data Exploration 219

Analysis of Key Influencers 219

Document Classification 220

DMX 222

Drill-Through 222

Understanding Naıve Bayes Content 223

Exploring a Naıve Bayes Model 225

Dependency Network 225

Attribute Profiles 226

Attribute Characteristics 227

Attribute Discrimination 228

Understanding Naıve Bayes Principles 229

Limitations of the Naıve Bayes Algorithm 231

Naıve Bayes Parameters 233

MAXIMUM INPUT ATTRIBUTES 233

MAXIMUM OUTPUT ATTRIBUTES 233

MAXIMUM STATES 233

MINIMUM DEPENDENCY PROBABILITY 234

Summary 234

Chapter 7 Microsoft Decision Trees Algorithm 235

Introducing Decision Trees 236

Using Decision Trees 237

Creating a Decision Tree Model 237

DMX Queries 237

Classification Model 237

Regression Model 239

Association 241

Model Content 244

Interpreting the Model 244

Decision Tree Principles 248

Basic Concepts of Tree Growth 248

Working with Many States in an Attribute 251

Avoiding Overtraining 252

Incorporating Prior Knowledge 252

Feature Selection 253

Using Continuous Inputs 253

Regression 254

Association Analysis with Microsoft Decision Trees 255

Parameters 256

COMPLEXITY PENALTY 257

MINIMUM SUPPORT 257

SCORE METHOD 257

SPLIT METHOD 258

MAXIMUM INPUT ATTRIBUTES 258

MAXIMUM OUTPUT ATTRIBUTES 258

FORCE REGRESSOR 258

Stored Procedures 259

Summary 260

Chapter 8 Microsoft Time Series Algorithm 263

Overview 264

Usage 265

Time Series Scenarios 267

Performing a Simple Forecast 267

Predicting Interdependent Series 268

Understanding Your Time Series 268

What-If Scenarios 269

Predicting New Series 269

DMX 270

Model Creation 270

Model Processing 272

Forecasting 274

Returning Supplemental Statistics 275

Changing the Future —Executing a What-If Forecast 276

Forecasting with Little Data —Applying Models to New

Data 277

Drill-Through 280

Principles of Time Series 280

Autoregression 281

Periodicity 281

Autoregression Trees 282

Prediction 284

Parameters 285

MISSING VALUE SUBSTITUTION 285

PERIODICITY HINT 286

AUTO DETECT PERIODICITY 286

MINIMUM and MAXIMUM SERIES VALUE 286

FORECAST METHOD 286

PREDICTION SMOOTHING 287

INSTABILITY SENSITIVITY 287

HISTORIC MODEL COUNT and HISTORIC MODEL GAP 287

COMPLEXITY PENALTY and MINIMUM SUPPORT 288

Model Content 289

Summary 289

Chapter 9 Microsoft Clustering 291

Overview 292

Usage of Clustering 294

Performing a Clustering 295

Clustering as an Analytical Step 297

Anomaly Detection Using Clustering 297

DMX 299

Model Creation 300

Drill-Through 301

Cluster 301

ClusterProbability 301

PredictHistogram 302

PredictCaseLikelihood 302

Model Content 303

Understanding Your Cluster Models 304

Get a High-Level Overview 305

Pick a Cluster and Determine How It Is Different from the

General Population 307

Determine How the Cluster Is Different from Nearby

Clusters 308

Verify that Your Assertions Are True 309

Label the Cluster 309

Principles of Clustering 309

Hard Clustering versus Soft Clustering 311

Discrete Clustering 312

Scalable Clustering 313

Clustering Prediction 314

Parameters 314

CLUSTERING METHOD 314

CLUSTER COUNT 315

MINIMUM CLUSTER CASES 315

MODELLING CARDINALITY 316

STOPPING TOLERANCE 316

SAMPLE SIZE 316

CLUSTER SEED 317

MAXIMUM INPUT ATTRIBUTES 317

MAXIMUM STATES 318

Summary 318

Chapter 10 Microsoft Sequence Clustering 319

Introducing the Microsoft Sequence Clustering Algorithm 320

Using the Microsoft Sequence Clustering Algorithm 320

Creating a Sequence Clustering Model 321

DMX Queries 322

Executing Cluster Predictions 323

Executing Sequence Predictions 323

Extracting the Probability for the Sequence Predictions 325

Using the Histogram of the Sequence Predictions 326

Detecting Unusual Sequence Patterns 329

Interpreting the Model 329

Cluster Diagram 330

Cluster Profiles 331

Cluster Characteristics 331

Cluster Discrimination 333

State Transitions 333

Microsoft Sequence Clustering Algorithm Principles 334

Understanding a Markov Chain 334

Order of a Markov Chain 335

State Transition Matrix 336

Clustering with a Markov Chain 337

Cluster Decomposition 339

Model Content 339

Algorithm Parameters 340

CLUSTER COUNT 340

MINIMUM SUPPORT 340

MAXIMUM STATES 341

MAXIMUM SEQUENCE STATES 341

Summary 341

Chapter 11 Microsoft Association Rules 343

Introducing Microsoft Association Rules 344

Using the Association Rules Algorithm 344

Data Exploration Models 345

A Simple Recommendation Engine 346

Advanced Cross-Sales Analysis 349

DMX 351

Model Content 355

Interpreting the Model 357

Association Algorithm Principles 359

Understanding Basic Association Algorithm Terms and

Concepts 359

Itemset 360

Support 360

Probability (Confidence) 361

Importance 361

Finding Frequent Itemsets 363

Generating Association Rules 366

Prediction 367

Algorithm Parameters 368

MINIMUM SUPPORT 368

MAXIMUM SUPPORT 368

MINIMUM PROBABILITY 368

MINIMUM IMPORTANCE 368

MAXIMUM ITEMSET SIZE 369

MINIMUM ITEMSET SIZE 369

MAXIMUM ITEMSET COUNT 369

OPTIMIZED PREDICTION COUNT 369

AUTODETECT MINIMUM SUPPORT 369

Summary 370

Chapter 12 Microsoft Neural Network and Logistic Regression 371

Same Principle, Two Algorithms 372

Using the Microsoft Neural Network 373

Text Classification Models 373

Utility Models 378

DMX Queries 378

Model Content 381

Interpreting the Model 382

Principles of the Microsoft Neural Network Algorithm 384

What Is a Neural Network? 385

Combination and Activation 387

Backpropagation, Error Function, and Conjugate Gradient 389

A Simple Example of Processing a Neural Network 390

Normalization and Mapping 392

Topology of the Network 393

Training the Ending Condition 394

Nonlinearly Separable Classes 395

Algorithm Parameters 396

MAXIMUM INPUT ATTRIBUTES 396

MAXIMUM OUTPUT ATTRIBUTES 396

MAXIMUM STATES 396

HOLDOUT PERCENTAGE 397

HOLDOUT SEED 397

HIDDEN NODE RATIO 397

SAMPLE SIZE 397

Summary 397

Chapter 13 Mining OLAP Cubes 399

Introducing OLAP 400

Understanding Star and Snowflake Schemas 401

Understanding Dimension and Hierarchy 402

Understanding Measures and Measure Groups 404

Understanding Cube Processing and Storage 404

Using Proactive Caching 405

Querying a Cube 406

Performing Calculations 407

Browsing a Cube 408

Understanding Unified Dimension Modeling 408

Understanding the Relationship between OLAP and Data

Mining 413

Mining Aggregated Data 414

OLAP Pattern Discovery Needs 415

OLAP Mining versus Relational Mining 415

Building OLAP Mining Models Using Wizards and Editors 417

Using the Data Mining Wizard 417

Building the Customer Segmentation Model 417

Creating a Market Basket Model 420

Creating a Sales Forecast Model 424

Using the Data Mining Designer 428

Understanding Data Mining Dimensions 429

Using MDX within DMX Queries 432

Using Analysis Management Objects for the OLAP Mining

Model 434

Summary 438

Chapter 14 Data Mining with SQL Server Integration Services 439

An Overview of SSIS 440

Understanding SSIS Packages 442

Task Flow 442

Standard Tasks in SSIS 442

Containers 443

Debugging 444

Exploring a Control Flow Example 444

Data Flow 444

Transformations 445

Viewers 446

Exploring a Data Flow Example 447

Working with SSIS in Data Mining 447

Data Mining Tasks 448

Data Mining Query Task 449

Analysis Services Processing Task 452

Analysis Services Execute DDL Task 453

Data Mining Transformations 455

Data Mining Model Training Destination 455

Data Mining Query Transformation 458

Example Data Flows 462

Using Non-Predictive Data Mining Queries in an

Integration Services Pipeline 463

Text Mining Transformations 464

Term Extraction Transformation 465

Term Lookup Transformation 467

More Details on the Text Mining Process 470

Summary 472

Chapter 15 SQL Server Data Mining Architecture 475

Introducing Analysis Services Architecture 476

XML for Analysis 476

XMLA APIs 477

Discover 478

Execute 479

XMLA and Analysis Services 480

Processing Architecture 482

Predictions 486

Data Mining Administration 487

Server Configuration 488

Data Mining Security 489

Security Requirements for Creating and Training Mining

Objects 491

Security for Various Deployment Scenarios 491

Local Database and Analysis Services 492

Local Analysis Services and a Remote Database 493

Intranet Analysis Services and Databases on the Same

Server 493

Analysis Services and Databases behind an HTTP

Endpoint in an Internet Deployment 494

Configuring Analysis Services for Use with Data Mining

Excel Add-Ins over HTTP 495

Summary 496

Chapter 16 Programming Sql Server Data Mining 497

Data Mining APIs 498

ADO 498

ADO.NET 500

ADOMD.NET 501

Server ADOMD.NET 501

AMO 501

Using Analysis Services APIs 502

Using Microsoft.AnalysisServices to Create and Manage

Mining Models 502

AMO Basics 503

AMO Applications and Security 505

Object Creation 506

Creating Data Access Objects 507

Creating the Mining Structure 510

Creating the Mining Models 512

Processing Mining Models 513

Deploying Mining Models 515

Setting Mining Permissions 516

Browsing and Querying Mining Models 517

Predicting with ADOMD.NET 517

More on Table-Valued Parameters in ADOMD.NET 522

Browsing Models 525

Stored Procedures 527

Writing Stored Procedures 529

Stored Procedures and Prepare Invocations 530

A Stored Procedure Example 530

Executing Queries inside Stored Procedures 533

Returning Data Sets from Stored Procedures 534

Deploying and Debugging Stored Procedure Assemblies 537

Summary 538

Chapter 17 Extending SQL Server Data Mining 541

Plug-in Algorithms 542

Plug-in Algorithm Framework 543

Lifetime of a Plug-in Algorithm Instance 543

Conceptual Overview 545

Model Creation and Processing 547

Prediction 553

Content Navigation 554

Custom Functions 555

PMML 557

Managed vs. Native Plug-ins 557

Installing Plug-in Algorithms 558

Where to Find Out More about Plug-in Algorithms 558

Data Mining Viewers 558

Interfaces to Be Implemented 559

Rendering the Information 559

Retrieving Information from Analysis Services 560

Registering the Viewer 561

Where to Find Out Mode about Plug-in Viewers 561

Summary 562

Chapter 18 Implementing a Web Cross-Selling Application 563

Source Data Description 564

Building Your Model 564

Identifying the Data Mining Task 564

Using Decision Trees for Association 565

Using the Association Rules Algorithm 567

Comparing the Two Models 568

Making Predictions 570

Making Batch Prediction Queries 570

Using Singleton Prediction Queries 572

Integrating Predictions with Web Applications 573

UnderstandingWeb Application Architecture 573

Setting the Permissions 574

Examining Sample Code for the Web Recommendation

Application 575

Summary 578

Chapter 19 Conclusion and Additional Resources 581

Recapping the Highlights of SQL Server 2008 Data Mining 581

State-of-the-Art Algorithms 582

Easy-to-Use Tools 583

Simple-Yet-Powerful API 584

Integration with Sibling BI Technologies 584

Exploring New Data Mining Frontiers and Opportunities 585

Further Reference 586

Microsoft Data Mining 586

General Data Mining 586

Appendix A Data Sets 589

Appendix B Supported Functions 595

Index 607

Posted by JamieMac | 2 Comments
Filed under:

Time Series Reporting Stored Procedure - part 3 of 3

This post continues part 2 of the series.  In this part of the series I will demonstrate how to create a report using the stored procedure and also how to call the stored procedure from Excel using the Data Mining Addins for Office 2007.

In creating a report, the first thing I did was to use BI Dev Studio to create a new Report Server Wizard Project.  I then, of course, gave the wizard a connection string to my Analysis Services database that contains my Time Series model.

After all that housekeeping, the wizard puts you on the rather ominous "Design the Query" page - which has a big giant text box for a query which is disabled so you can't type there, and a button labeled Query Builder...  Given the lack of options, I guess I'll just click that big button instead of futilely trying to type into the disabled text box!

Clicking on the Query Builder button launches the (what else?) Query Designer for Analysis Services.  First potential gotcha happens here.  If you have cubes and mining models in your Analysis Services database, the Query Designer will default to the MDX query builder which returns MDX Cellsets.  We don't want that - we want DMX Rowsets.  So if that's the case, you need to click the little "pickaxe" button to switch the designer to DMX mode.

image

Which makes the designer look like this:

image

And you say "Great!  Now all I have to do is type in my stored procedure call and I'm done!"  So you go and you click on the "Edit as Text" button right in the upper right-hand corner.  And you click on it, and click and click and click and click - clickity click click click.....clik....clk.....  But alas, nothing happens, because it's disabled.

However, as you mumble frikkafrakkafrikkafrakka under your breath and move that mouse pointer over to the cancel button to give up on the whole endeavor - WAIT!  All is not lost!  There's another way!  This little button image  switches from "Design Mode" to some unnamed mode I guess you might want to call "Text Edit Mode"  yay!  Hooray for small victories!  Now you can type in your stored procedure call something like this

CALL TSSprocs.TimeSeriesReport("Reds Interleaved", "Sales", "Red", 25, 10)

And even preview the results if you want before clicking OK and sending that string to the uneditable text box on the main wizard.

On the next couple of pages, it doesn't matter what you really do - just make some selections to get the wizard to finish.  For example, choose a tabular report, add the sales to the details, pick a style - that kind of stuff.

Once you get to edit the actual report, the first order of business is to simply delete the table that was created!  OK, you could probably do this without the wizard, but hey, that's how people usually get started, no?

Next you go to the toolbox and drag out a chart to the report surface.  Make it a line chart and, since it's going to be the only thing on the report, MAKE IT BIG.

When you click on the chart you will have drop areas for series, data, and category.  The "Ordinal" column from the dataset will be the category (so go ahead and drop it there), and the two series columns will be the data (yes, the data) so drop them there.

There is still a problem that needs changing, as the data sets are automatically aggregated by Count and we need them to just be the values, or at least something that makes sense.  "Sum" makes sense for this data, so right click on each of the data fields, select "Series Properties" and change the Value field from Count(Sales) to Sum(Sales).  When you're done, your chart setup will look something like this:

image

At this point, you can just run your report and you're done!  The result (depending on your data, of course) will look like this

image

and you can make it as pretty as you want.

What's cool is that you can easily parameterize the stored procedure to select the models, columns, and series that you want to predict.  For example, in this case, my series are in the "Series" column in my model, so I can populate a drop down using the query

SELECT DISTINCT [Series] FROM [Reds Interleaved]

and parameterize my query with the result to get an interactive report like this:

image

All done!

Now lets do the same in Excel (using those Data Mining Addins)

First, you have to have the Addins installed - it's not going to work otherwise - when you install the addins, make sure you check the "Data Mining Client for Excel" option.

Next you will need to use this part of the data mining ribbon

image

Particularly the "Data Mining" tab (to get here) and the connection and query buttons.  Before executing the query, of course, you have to click the connection button to connect to your Analysis Services database.

Next you click on the Query button to bring up the query wizard.  Actually nothing in the "wizard" part is going to help you out here, so as soon as you see it, click on this button

image

The advanced query interface has an incredible interactive query designer that would be just awesome to use anywhere, but again, we're doing something "special" so it's not going to help us out either, so just click on

image

And then select and delete all the text (Now that's what I'm talkin' bout - an editable text box!)

You can then type/copy/otherwise enter your stored procedure call - you remember, this one:

CALL TSSprocs.TimeSeriesReport("Reds Interleaved", "Sales", "Red", 25, 10)

Click finish, specify where you want the data to be dropped and viola!  You have your time series prediction results all set up for you in Excel and it's even formatted nicely for you as a table!

image

Then you just Insert a line chart and.....almost!  Inserting a line chart actually adds the "Ordinal" column as a series so you have to fix that up by manually removing the series and setting the category labels to the Ordinal column, but then - a chart!

image

And now I'm finished with this ridiculously long post.  Of course there are many other things you can do - you can change the stored procedure to return deviations and plot those, you can change it to return all the series in a model, etc, etc, but this series of posts should lay the foundation for doing more.

Let me know if you like this series, if the level of detail is way too much or needs to fill in some more gaps, if the snark level is too high or too low, or even if the content is useful - always appreciate the feedback.

Posted by JamieMac | 1 Comments
More Posts Next page »
 
Page view tracker