Welcome to MSDN Blogs Sign in | Join | Help

Data Mining Addins on Channel 9

Wee Hyong, a SQL Server MVP in Singapore discusses with Channel 9 how he improved the event experience for the Heroes Community launch using the Data Mining Addins.  The first half of the video is the interview and then he presents a demo of the event data using the SQL Server 2008 Data Mining Addins for Office


SQL Data Mining

Posted by JamieMac | 0 Comments
Filed under: ,

Data Mining Case Studies

As I'm in the mood for finally responding to frequent requests, here is a list of case studies referring to SQL Server Data Mining on Microsoft.com.  Note that these are "marketing" case studies and are notoriously sparse on details, and that's aggravated by marketing focusing more on the relational engine than the data mining engine (in fact many of them refer to data mining without any details, but so is life).  In any case, you can use these whenever you get a response such as "but how do we know if SQL Server Data Mining is enterprise ready?" or "But does anyone like us use SQL Server Data Mining?" or similar questions.

Location Industry Case Study
Philippines Media & Entertainment Wireless Services Firm Doubles Response Rates with SQL Server 2005 Data Mining 10/23/05
Canada Professional Services Microsoft Partner Helps Internet-Monitoring Company Crawl 750 Million Pages Monthly 01/17/07
France Professional Services E-Commerce Provider Boosts Competitiveness with Microsoft Technology and Partnership 02/04/07
France Retail French Retailer Deploys Powerful Solution to Handle 1 Million Orders a Day 09/04/06
Israel Healthcare Microsoft SQL Server 2005 Data Mining Helps Clalit Preserve Health and Save Lives 11/03/05
France Manufacturing New Business Intelligence Solution Reduces Customer Churn by 25 to 30 Percent 03/16/06
United States Government The Illinois Department of Transportation Saves Lives with Microsoft Business Intelligence Solution 21-Apr-08
Germany Transportation & Logistics Global Air Freight Company Analyzes Mission Critical Logistics Data with SQL Server 2005 02/01/06
United Kingdom Media & Entertainment Scalable Data Warehouse Helps U.K. Lifestyle Group Consolidate Multiple Data Sources in Just Three Months 11/07/06
Italy Financial Services New Information System Increases Efficiency and Helps Double Profits at Italian Bank 02/16/07
United Kingdom Education Analytical Tools Help Research Group Provide New Insight Into How Darwin Developed His Theory of Evolution 11/07/06
United Kingdom Education Research Group Unlocks the Secrets of Darwinian Research with Mapping and Database Software 18-Mar-08
United States Manufacturing Xerox Develops Scalable, Hosted Solution to Optimize Global Print Fleet Management 11/11/05
United States Professional Services Database Products Help Real Estate Service Evaluate Millions of Homes Daily, Cut Costs 08/24/06
United States Financial Services Zurich in North American Anticipates U.S.$25 Million Gain with SQL Server 2005 Solution 01/22/07
Posted by JamieMac | 0 Comments
Filed under:

FREE Data Mining training by Rafal Lukawiecki

Many people ask me fore pointers to data mining and this fell into my lap over the past few days.  I've blogged about Rafal before when he first took up speaking on data mining - now he's created a lot of new material and presented to audiences around the world.  Below is excerpted from a mail I received about Rafal's presentations grouped up for training purposes.   It's not the deeply specific training that some of you have asked for, but listening to Rafal is always inspirational, so if there are other audiences you need to capture, have them watch these videos!

Some additional info about Rafal:

· Rafal Lukawiecki is consistently one of the top rated speakers at Microsoft Events

· These sessions were presented to 20 Countries across the EMEA region reaching more than 3,000 IT Pros who rated the content at  8.2 out of 9!

In Data Mining and Business Intelligence for Enterprises, Rafal Lukawiecki aims to show IT Professionals how data mining can be used in IT infrastructure to support real business scenarios demystifying the perception that Data Mining is complex, untested or only for specialists. This has become possible since Microsoft has taken the technology to new levels making it accessible to all. In four easy-to-understand yet packed with practical information sessions you will learn about what Data Mining and Business Intelligence can do for you, how to deploy and manage it, how to use it, and how to make it available to other parts of your IT environment.

Visit TechNet Spotlight: www.microsoft.com/technetspotlight

Video on Demand, Video Downloads, PowerPoint Presentations, Audio and more

Data Mining and Business Intelligence for Enterprises

Available Videos

clip_image004clip_image005clip_image006clip_image007

Data Mining and Business Intelligence for Enterprises:

http://www.microsoft.com/emea/spotlight/event.aspx?id=99

Links to Session Videos:

Ø Introduction to Data Mining- Rafal Lukawiecki, Strategic Consultant, Project Botticelli Ltd

Ø Working with Data Mining- Rafal Lukawiecki, Strategic Consultant, Project Botticelli Ltd

Ø Using Data Mining in your IT Systems (Part 1)- Rafal Lukawiecki, Strategic Consultant, Project Botticelli Ltd

Ø Using Data Mining in your IT Systems (Part 2), Rafal Lukawiecki, Strategic Consultant, Project Botticelli Ltd

Posted by JamieMac | 0 Comments
Filed under:

Visual Numerics integration into SQL Server Data Mining

Our friends at Visual Numerics, Inc have created an excellent white paper covering the creation of SQL Server Data Mining plug-in algorithms in C#.  In addition to documenting all of the architecture and concepts required, they have created a step-by-step tutorial demonstrating how to integrate their own K-Means clustering algorithm into SQL Server Data Mining using the C# interfaces.  The entire data mining team here in SQL Server reviewed the paper - we couldn't have done a better job ourselves!  Kudos to VNI for such an accomplishment!

The paper includes FULL SOURCE CODE for their plug-in algorithm and is available at http://www.vni.com/company/whitepapers/MicrosoftBIwithNumericalLibraries.pdf.  Of course you will need to get a copy of their IMSL C# Numerical Library, for which an eval copy is available upon request to VNI.

As a teaser, here is the TOC for their paper:

Audience ..................................................................................................... 4
Rationale ..................................................................................................... 4
Background ................................................................................................. 5
Plug‐in Architecture .................................................................................... 8
Managed Plug‐in Development .................................................................. 9
IMSL C# Library: ClusterKMeans Integration........................................... 9
Starting up.............................................................................................. 10
Metadata Changes (Metadata.cs) ......................................................... 10
Algorithm Changes (Algorithm.cs) ......................................................... 11
Training and Persistence of patterns..................................................... 11
Persistence of Patterns .......................................................................... 13
Prediction............................................................................................... 13
Algorithm Navigator Changes (AlgorithmNavigator.cs) ........................ 13
Registering the Algorithm with Analysis Services.................................. 14
Debugging .............................................................................................. 15
Other Default Features for Third‐Party Mining Algorithm Developers.... 16
The User Experience ................................................................................. 16
Excel 2007 .............................................................................................. 19
Conclusion................................................................................................. 21
About the Author...................................................................................... 21
References ................................................................................................ 22
Appendix A: Code Files ............................................................................. 23

Posted by JamieMac | 0 Comments
Filed under: ,

What's a book without a cover - redux

Three years ago next month I posted on our first book and now we can peek at what's to come!

I guess with the price of gas, Wiley decided that black gold was more interesting than the yellow stuff.  Readers of the last edition will notice a new addition to the author's list - Bogdan has contributed greatly to this version of the book and has helped make it far superior to the previous version.  Many chapters have been completely rewritten or greatly extended based on reader feedback, new features, and also how we see the technology has evolved since we drafted the first edition.  Another big change, depending on how the publisher pulls it off, most chapters have DMX and project files on the "related web site."  Here's a draft excerpt from the "front matter" explaining the chapter list:

 

Chapter 1 - Introduction - The chapter you’re reading not only introduces the book, but also the technology. It continues with a detailed definition of what exactly is meant by the term “data mining” and discusses what kinds of problems are addressed by this technology.

Chapter 2 – Applied Data Mining Using Office 2007 – This chapter provides an overview of the Table Analysis Tools for Office 2007 – a rich set of tools for Excel that are usable by any information worker. This chapter explains how and why you use these tools and provides guidance on how to get the best results.

Chapter 3 – Data Mining Concepts and DMX – This chapter is critical to your understanding of the SQL Server Data Mining platform. It explains the underlying concepts of how you think about a data mining problem plus it details the Data Mining Extensions for SQL (DMX) in a learn-by-example framework.

Chapter 4 – Using the SQL Server Data Mining Toolset – This chapter introduces you to building data mining solutions using the Business Intelligence Development Studio (BIDS). Beyond a basic overview, it provides a wide range of tips and tricks that can make the difference between a successful project and a failed one. This chapter also covers using SQL Server Management Studio to access and secure data mining objects also includes a discussion on how to expose your data mining models through SQL Server Reporting Services.

Chapter 5 – Implementing a Data Mining Process Using Office 2007 – This chapter goes beyond the Table Analysis Tools described in chapter 2 and provides an alternative framework for implementing data mining solutions than BIDS/SQL Server Management Studio. This chapter explores the remaining addins of the Data Mining Addins for Office 2007 showing how they provide additional functionality than BIDS and SQL Server Management Studio while having limitations preventing them from exposing the full functionality of SQL Server Data Mining. In any case, this chapter will allow you to best take advantage of the Microsoft Office tools for data mining.

Chapters 6 – 12 The Algorithm Chapters – Each of these chapters is devoted to one or more of the algorithms included with SQL Server Data Mining. In each of the chapters you will find a basic description of the algorithm, followed by usage scenarios that will help you understand how, when, and where you apply each algorithm. Each chapter describes how you create, train, interpret, and apply models using the specified algorithms. In the end, for those who hunger for the details, the chapters wrap up with a deeper technical dive into how the algorithms work.

Chapter 13 – Mining OLAP Cubes – This chapter provides a brief introduction to OLAP and the OLAP functionality of SQL Server Analysis Services followed by an explanation of how and when you perform data mining on OLAP cubes. This chapter includes details on how to implement popular OLAP mining scenarios.

Chapter 14 – Data Mining with SQL Server Integration Services – This chapter introduces SQL Server Integration Services (SSIS) and describes its various components. It then details the tasks and transforms that you use to implement data mining solutions in your data integration packages. This chapter also introduces the text mining components that are used to prepare unstructured data for use in data mining scenarios.

Chapter 15 – Data Mining Architecture – This is the first chapter that moves away from tools and concepts and starts to delve in to the programming and administration aspects of SQL Server Data Mining. This chapter discusses the architecture of a server-based data mining system including introducing the XML for Analysis (XMLA) protocol that underlies all client-server communication. This chapter also discusses administering a data mining server including server properties that are important for SQL Server Data Mining and data mining security roles.

Chapter 16 – Programming SQL Server Data Mining – This chapter details the programming interfaces for SQL Server Data Mining and includes many examples on the programmatic creation, training, and application of data mining objects.

Chapter 17 – Extending SQL Server Data Mining – This chapter shows how you can extend SQL Server Data Mining with your own functionality. It describes how you can create stored procedures to add additional operations to DMX, how you can implement your own data mining algorithms to plug into SQL Server Data Mining and exploit all of the SQL Server Data Mining features and integration, and how you can write your own data mining visualizations to display patterns in either the supplied algorithms or your own algorithm implementations and embed them in BIDS and SQL Server Management Studio.

Chapter 18 Implementing a Web Cross-Selling Application – This chapter walks you through a common data mining scenario – implementing a recommendation engine and integrating it into a retail web site. It includes sample queries and code to get you started.

The book is available for preorder at Amazon - order 10 copies!

Posted by JamieMac | 1 Comments

Screencasts, more screencasts, plus some scripts.

Oddly enough, this week was a banner week for data mining screencast references.  I took some internal strategy training, and was asked by several attendees for links to the screencasts that I made (on the download page) and received two (maybe three) independent requests for the scripts for those screencasts.

Additionally I was in an e-mail conversation with Professor Emeritus Bill Burrows from University of Washington and discovered coincidentally that he has created his own excellent set of screencasts for the addins!  Wacky!  I must say that his voice and style is much more pleasant than my own, plus it's not delivered from the perspective of the development team, so you should check them out or send colleagues that way to get introduced to the topic.

And, if you're interested, the scripts from my screencasts are attached - plus a bonus script that I never actually got around to recording!

Posted by JamieMac | 1 Comments
Filed under: ,

Attachment(s): Scripts.zip

"Supercrunchers" and Microsoft Data Mining

Ian Ayres, Yale Law school professor and author of SuperCrunchers, talks about data driven decisions and the Data Mining Addins in this Gartner podcast.  He also mentions some other company, but we won't talk about them :)  Give it a listen!

Posted by JamieMac | 0 Comments
Filed under: , ,

Unwinding MDX Flattening Semantics with DMX

When issuing an MDX statement in code, you have the option of requesting either a multidimensional set or a relational rowset as the result set.  If you are familiar with MDX and multidimensional concepts, the multidimensional result is the obvious choice.  However, if your application is more suited to dealing with tabular data, such as Reporting Services, it may be more convenient to request the rowset result.  One issue is that it's not always obvious how the MDX flattening semantics will translate the multidimensional set into a flattened rowset.

You can see exactly how MDX flattens results right inside SQL Server Management Studio by exploiting a little trick using DMX.

For example, take the simple Adventure Works query below:

SELECT
       {Measures.[Measures].[Reseller Sales Amount] * 
          [Product].[Category].MEMBERS} ON Columns,
       {[Date].[Month]}  ON ROWS
FROM [Adventure Works]
WHERE [Date].[FY 2002]

Which produces a result like this:

image

How many columns do you think the flattened result will have?  It would seem that you could represent the results in the six columns that of the multidimensional result.  However, the result set actually contains the entire hierarchy for the month.  You can exploit DMX's ability to accept raw MDX statements as input to extract the rowset result.  Since DMX always returns a rowset, if you select the input columns, you will retrieve the flattened result of the MDX statement. 

The next question, of course, is how to you create a model such that you can access the MDX?  The answer, you don't really have to.  You can use any mining model that happens to be there, or just create any arbitrary mining model for the purpose.  This trick takes advantage of the NATURAL PREDICTION JOIN construct with matches input columns to model columns by name and simply ignores any columns that don't match up - in this case, likely all of them.

Therefore, the DMX statement

SELECT t.* FROM [Customer Tree]
NATURAL PREDICTION JOIN
(SELECT {Measures.[Measures].[Reseller Sales Amount] *
            [Product].[Category].MEMBERS} ON COLUMNS,
       {[Date].[Month]}  ON ROWS
FROM [Adventure Works]) AS t

Essentially ignores the model and simple returns the flattened result of the MDX query, which in this case returns nine columns, like this:

image

This trick can be used against any dimensionality MDX query to extract the flattened results.  From a data mining perspective, this is particularly useful to determine the column layout of an MDX query in order to train mining models with INSERT INTO.

Posted by JamieMac | 3 Comments
Filed under: ,

A data mining survey

Karl Rexer of Rexer analytics has taken up a hobby of implementing a somewhat unscientific survey of data mining users to determine trends in the industry - see what kinds of software people are using - all that kind of stuff.  In Karl's own words:

"We would greatly appreciate it if you would:

1)       Participate in this survey, and

2)       Tell other data miners about the survey (forward this email to them)."

 

Since I don't have everyone's email address, the next best thing is posting on my blog.  You can participate in the survey here   and use access code GSU74.  The survet is anonymous, but there's a place at the end to leave your e-mail if you want to get a copy of the results.

 

If you have any questions about the survey you can ask Karl himself at krexer@RexerAnalytics.com. 

 

Posted by JamieMac | 1 Comments
Filed under:

Silverlight Viewers, Data Mining Road Show, and TechEd

A enthusiast in China took a suggestion all the way and has implemented at least a decision tree viewer in Silverlight - I don't read Chinese, but you can see it here

Rafal Lukawiecki - consistantly one of the top TechEd Europe speakers - has been traveling around doing an extensive data mining roadshow.  Here is a review of his show in Israel, but you don't have to hear just the critics - you can also hear Rafal's point of view in this interview or even see last year's TechEd presentation.  I was there and was amazed - I had provided a lot of the material to Rafal and presented it for a "good" response, but hearing Rafal deliver it was inspirational! 

You don't have to stop there, in April, Rafal's show is going to London, and rumor tells me you'll have the chance to see him at TechEd in Orlando this year in both the developer week and the IT week speaking both on Data Mining and Security.

 

Posted by JamieMac | 1 Comments
Filed under:

DMX Queries - the DataSource hole

So many people have told me how much they like DMX - the fact that they can use a query language to create, train, and query mining models is simply unprecedented in the data mining world.  And they love it.

 Until.....

Until, that is, they decide to go whole hog on DMX and implement their entire solution using ONLY DMX.  Wow!  What a great idea!  I'll create a new database just for mining, and then I'll use DMX to create a mining structure, then use ALTER MINING STRUCTURE to add all my models, and then train using INSERT INTO.

 Oh, frizznet!  I can't INSERT INTO - I don't have a datasource.

How do you create a datasource with DMX?

YOU DON'T!

DMX lacks any syntax for creating or managing datasources, even though the language depends on datasources for all external data access (unless you want to open that giant security hole that is OPENROWSET, but that's a story for another blog post...).

However, not all is lost.  It is fairly trivial to create a stored procedure that will take a datasource definition and create it for you from a command interface.  In fact, it's so easy, because all you have to do is copy and paste the source code below.

 [SafeToPrepare(true)]
public void CreateDataSource(string DataSourceName, string ConnectionString,
                         string sImpersonationMode, string Account, string Password)
{
      if (Context.ExecuteForPrepare)
              return;

      // Connect to the calling session
      Microsoft.AnalysisServices.Server svr = new Microsoft.AnalysisServices.Server();
      svr.Connect(
"*");

      // Get the current database
      Database db = svr.Databases.GetByName(Context.CurrentDatabaseName);

      // Create a new datasource and set connection string
      DataSource ds = db.DataSources.AddNew(DataSourceName, DataSourceName);
      ds.ConnectionString = ConnectionString;
      
      // Set the impersonation info
      bool FoundMode = false;
      foreach (ImpersonationMode mode in Enum.GetValues(typeof(ImpersonationMode)))
      {
          if (mode.ToString() == sImpersonationMode)
          {
              FoundMode =
true;
              ds.ImpersonationInfo =
new ImpersonationInfo(mode, Account, Password);
              break;
          }
       }
      if (!FoundMode)
          throw (new SystemException("Invalid Impersonation Mode"));

      // Update the server with the new datasource
      ds.Update();
}

When you build a project with this code, you need to include AMO as well as Server ADOMD.Net references.  When deploying the assembly to the server, you need to make the assembly access "Unrestricted" or it will throw exceptions when trying to execute the AMO code.

After you have successfully built and deployed the stored procedure you can call it like this

CALL MySproc.CreateDataSource('MyDataSourceName','
                   Provider=SQLNCLI;Location=localhost;Initial Catalog=MyDatabase','ImpersonateCurrentUser','','')

And there you have it - closed the datasouce hole in DMX in 100 lines of code or less.  Of course, if you're adventurous, you can add other management functions to add permissions to datasources, delete them, whatever you want!

Enjoy!

Posted by JamieMac | 2 Comments
Filed under: , ,

To (a) or not to (a), that is the question?

While looking for content for the next edition of my book (newsflash! I'm currently working on the next edition of my book!) I went hunting around for that trick using decision trees to predict the states of a single column independently rather than all together.  Turns out - I never wrote it!  So, in case you don' t want to wait for it (or it doesn't make it into the book!), here it is now.

The classifier-type algorithms in SQL Server Data Mining (notably decision trees, naive bayes, neural networks) can all predict multinomial outputs - that is, output attributes with multiple states - 3,4, 10, 20, whatever.  However, in reality, classifiers in general prefer the ying and yang of things, the black and white, the yes-ness and the no-ness.  In short, they really are better at seperating between the states of a binomial attribute rather than those of a multinomial attribute - and so are you, actually.  If I gave you ten things to look at and said what is the factor that most cleanly divides all ten of these things, you'd have a hard time, but if I gave you two things instead, you might not have a problem.  You would be more accurate, and your model may be more accurate as well, if you only had binomials.

Additionally, with multinomials, your model - particularly with decision trees - is harder to interpret.  Say you have the marital status states of "Married", "Single", "Seperated", "Divorced" and "Widowed".  When I look at the dependency net and it shows that "Number of Children" is predictive of "Marital Status" - what does it mean?   Which aspect of marital status is it talking about - all of them?  One of them?  Impossible to tell.

 So what do you do when you have a multinomial output?  The first choice is obvious - see if you can reduce it to two states - e.g. can they be changed into "Married" and "Not Married".  If that's not an option, i.e. if the states are important, another option is to turn them into a series of binomials - e.g. "Marriend" and "Not Married", "Single" and "Not Single", etc.

Now we run into an additional problem - not only is transposing our data like that a royal pain in the butt, it increases the number of columns and we may exceed the maximum row length of SQL Server.

Luckily with SQL Server Data Mining, we have a way out.  By using a trick with nested tables, we can create a model that treats each state as a binomial attribute without changing any of our data.  Assume our original model looked like this:

CREATE MINING MODEL MyMultinomialModel
(
    CustomerID     LONG   KEY,
    Age            LONG   CONTINUOUS,
    Gender         TEXT   DISCRETE,
    NumChildren    LONG   CONTINUOUS,
    MaritalStatus  TEXT   DISCRETE     PREDICT
) USING Microsoft_Decision_Trees

we can transform the MaritalStatus field into a nested table like this:

CREATE MINING MODEL MyBinomialModel
(
    CustomerID     LONG   KEY,
    Age            LONG   CONTINUOUS,
    Gender         TEXT   DISCRETE,
    NumChildren    LONG   CONTINUOUS,
    MaritalStatus  TABLE  PREDICT ONLY
    (
         Status    TEXT   KEY
    )
) USING Microsoft_Decision_Trees

OK, so now that we've created the model, how do we train it - we only have a single source table right?  Well, you're right, but that doesn't really matter.  Using the tools in BI Dev Studio, you can select a table as both a Case and Nested table, and you can do the same thing using DMX, like this

INSERT INTO MyBinomialModel(CustomerID,
                            Age, Gender, NumChildren,
                            MaritalStatus(SKIP, Status))
SHAPE{OPENQUERY(MyDataSource,
                'SELECT CustID, Age, Gender, NumChildren
                FROM Customers ORDER BY CustID') }
      APPEND
      {OPENQUERY(MyDataSource,'SELECT CustID, MaritalStatus
                              FROM Customers ORDER BY CustID') 
      RELATE CustID to CustID} AS MaritalStatus

So why does this work?  Since we're using the same table as the case and nested tables, we're guaranteed that each case will have one and only one "Status" in the "MaritalStatus" table.  The MaritalStatus table is PREDICT ONLY, so there's no cross-confusion between the states.  A decision tree model in this case will build five trees - one for each state of MaritalStatus.  Each attribute is a binomial variable with two possibilities "This state exists" or "This state doesn't exist".

This also helps in prediction if you need to predict a particular state.  You will get a much better prediction of whether or not a customer is or is not a particular state than if you did this the traditional way.  You can find out how likely by using subselects with your prediction statements, e.g.

SELECT FLATTENED
   (SELECT * FROM
      PREDICT(MaritalStatus, INCLUDE_STATISTICS)
    WHERE Status='Single')
FROM MyBinomialModel PREDICTION JOIN ...

The above query will give you the probability and support of being single for each customer in the input set.

Additionally, in the dependency network, each state of MaritalStatus will appear as its own node, so you will finally be able to see whether Number Of Children is more predictive of "Married" or of "Divorced" J

And for those wanting more information on the book, I’m working once again with Zhaohui Tang on the book and this time we’ve added Bogdan Crivat as a new author.  It will probably have some spiffy name like “Data Mining with SQL Server 2008” and will be released sometime around when the product ships – no comment on that!

 (note: the author is happily married with four children, regardless of what your dependency net says)

 

 

Posted by JamieMac | 1 Comments
Filed under: ,

Tools Can Be So Much Fun PLUS Cluster Accuracy

For a project I was working on I had to create a lot of models and extract one piece of information from each of them for charting.  Particularly, I wanted to see the impact of changing the cluster count of a clustering model vis-a-vis the model "score" which is stored in the content column "MSOLAP_NODE_SCORE".

To do this "the traditional way" - i.e. using the UI - i.e. the way I started - is arduous.  Basically the process is like this

  1. Create a project, data source, data source view
  2. Create mining structure and initial model
  3. Add model to structure
  4. Name model
  5. Select clustering algorithm
  6. Close dialog
  7. Open algorithm parameters
  8. Set the CLUSTER_COUNT parameter
  9. Close dialog
  10. Rinse
  11. Repeat

What a pain!  So after creating two of these - I stopped, took a breather and reapproached the problem.

I started over and created the project items in BI Dev Studio up to and including a Mining Structure with a cluster model which I called "Cluster0" and set the parameter CLUSTER_COUNT to 0.  I then deployed and processed the model to my server from the project.

Next I used a trick I previously posted using the query editor in SQL Server Management studio to send multiple DMX commands.  Using this technique I was able to quickly create multiple ALTER MINING STRUCTURE commands in the text editor and fix up the names and parameter numbers.

In the end I had a document that looked like this

ALTER MINING STRUCTURE [WBCD] ADD MINING MODEL Cluster2
USING Microsoft_Clustering(CLUSTER_COUNT=2)

GO

ALTER MINING STRUCTURE [WBCD] ADD MINING MODEL Cluster3
USING Microsoft_Clustering(CLUSTER_COUNT=3)

GO

ALTER MINING STRUCTURE [WBCD] ADD MINING MODEL Cluster4
USING Microsoft_Clustering(CLUSTER_COUNT=4)

GO

...

I actually did this up to Cluster26 - why?  Because that's how many times I pasted - that's why.

Next, I had to train the models, so I issued the command:

INSERT INTO MINING STRUCTURE [WBCD]

This statement executes a "Process Default" on the structure, which, since the structure was already processed, processes all of the unprocessed models within the structure.  If you have Enterprise Edition, it also parallelizes the processing to speed up overall processing time.  Actually, in practice, I put the INSERT INTO statement after the last ALTER to create and process the models in a single execution.

Finally I needed to get the bit of information from each model.  The overall score for a cluster model is stored in the first row of the content so I just used the query

SELECT TOP 1 MSOLAP_NODE_SCORE FROM Cluster2.CONTENT

Simple enough - so then I used the "GO" trick as above to execute the statement for each model.  However this resulted in 25 windows being created in SSMS with no way to select results from all the queries.  Yuck!  It looked liked this (after I manually resized all of the panes)

image

Needless to say, not very useful.  So again time for reapproaching.  So I though of those undocumented features of the Data Mining Addins I posted a while back (to be documented and supported in SQL Server 2008!) and made a plan.

I ran Excel with the Addins loaded and created a connection to my database.  Then I typed "Cluster2" in a cell, and "Cluster3" in the cell below.  I selected the cells and dragged down until I got to "Cluster26" (why?  because that's how many times I pasted - remember?).

Next I needed to use the DMCONTENTQUERY function.  The DMCONTENTQUERY function has the signature

DMCONTENTQUERY(<connection>,<model>,<content column>, [<where-clause>])

It has the semantic that it returns the first value the query returns - essentially creating a TOP 1 query like I issued from SSMS.  So I entered in the neighboring column

=DMCONTENTQUERY("",A2,"MSOLAP_NODE_SCORE")

Leaving the connection name empty implies that it should use the current connection - since I was already connected, I was lazy and left it blank (many times I mistyped the connection since the form is "<database> (<server>)" and I always forgot the server - ugh!).  "A2" is the cell that said "Cluster2".

I dragged that down the column and voila!  I had a list of scores that I could easily plot!  Mission accomplished!

And for those of you patient to read this far - here's the resulting chart showing good models ranging from 4-10 clusters and then starting to drop off with an odd little spike at 17 clusters.

image

Posted by JamieMac | 0 Comments

New Server at SQLServerDataMining.com

After three years the data mining team has upgraded the server for SQLServerDataMining.com.  The previous server with a 1 GB server with (IIRC) a 40 GB hard drive that ran SQL Server, IIS, Analysis Services and Reporting Services and a beta version of the toolkit we were using.  Now we have a faster server with much larger capacity leveraging a new framework.  Thanks to all the people who were patient when the server ran out of disk space and couldn't resolve their request (and thanks more to those of you who told us about it!) 

Added bonus - we're running the latest CTP of SQL Server 2008 Analysis Services so whenever you run any of the LiveSamples, you're hitting the brand new code base.

Soon we will be adding some new samples leveraging  the new 2008 feature set, so come back often to check it out!

Posted by JamieMac | 1 Comments

How can we mine? Let me count the ways...

Recently I received some customer feedback that SQL Server Data Mining "doesn't have enough algorithms."  More specifically, the comment was that we have the same capabilities are other Data Mining providers, we just "hide" many facilities as algorithm parameters rather than separating out each as a named algorithm.  So let's count the Microsoft algorithms a few different ways to work this out.

First - let's go by the box.  This is the list of algorithms as specified in Books Online

  1. Microsoft Decision Trees
  2. Microsoft Clustering
  3. Microsoft Naive Bayes
  4. Microsoft Association Rules
  5. Microsoft Neural Networks
  6. Microsoft Time Series
  7. Microsoft Sequence Clustering
  8. Microsoft Linear Regression
  9. Microsoft Logistic Regression

So that's nine - count 'em nine algorithms.   But that's just one way.  If you look at my book, Data Mining with SQL Server 2005 written with Zhaohui Tang, there are only seven algorithms!  What?  You say!  How can it be?

Let me explain.  During the development of SQL Server 2005, we realized a couple of tricks; 1) linear regression was the same as our tree algorithm,  just forced to not split; and 2) logistic regression was the same as our Neural Nets, just with zero hidden layers.  However, we got similar feedback - people want more algorithms, and specifically these ones, so we set up two "new algorithms" by forcibly setting parameters on the Decision Tree and Neural Network algorithms and voila! we shipped with nine named algorithms.  It would have been hard to fill up two entire chapters explaining that last sentence, so Zhaohui and I decided just to stick to the seven core algorithms.

Anyway, this posting isn't really about how to count less algorithms, I really wanted to show you how to count more.  When we set about designing SQL Server Data Mining, we really and truly tried to make data mining operations simpler.  We thought at the time, rightly or wrongly, that the more options end users have, the more complicated and difficult the product would be to use.  Therefore, we tried to determine the best behavior in a class, and make more advanced options available through parameters.

For example, take our clustering algorithm.  We assumed that if people wanted clustering, most likely didn't care about the details of the algorithm, they just wanted to get the job done, and that those people who wanted more would look for it (the design principal - make the simple things simple, and the complex things possible).  So we bundled up different flavors of clustering into a single package that many vendors would have broken apart.  So let's start counting with clustering.

1

Our default clustering behavior is EM (Expectation Maximization) clustering using the Bradley-Fayyad scalable framework

2

Setting a parameter changes that to a K-Means clustering implementation using the same framework

3+4

Setting the same parameter another way, provides non-scalable versions of the two clustering varieties.  (I know it's hard to swallow that the non-scalable versions count as separate algorithms, but if you started with the vanilla versions and added scalability, then of course you would consider those versions as new algorithms - I'm just working backwards here.

5

Let's move to our Decision Tree algorithm and we will consider our classification tree as one algorithm.

6

But our Decision Tree also predicts continuous and counts as a regression tree, so we will count that as another algorithm.

7

Oops!  Our Decision Tree also creates full linear regressions at each of the leaf nodes.  To get the typical regression tree behavior you need to make sure that none of the continuous inputs have the REGRESSOR flag and you get yet another algorithm.

8

Oh yeah, our trees allow for multiple targets in each model, allowing the discovery and display of interrelated patterns through our dependency net.  I've seen other vendors advertise such functionality as an "algorithm" so there's our #8.

9

How about collaborative filtering with Trees - just slap a PREDICT flag on a nested table, and you have a complete recommendation system.  Let's call it Associative Trees

10

Naive Bayes.

11+12

If we're going to count Associative Trees, we also have "Associative Bayes".  I guess the multiple target interrelated pattern thing counts here as well.

13

Association Rules.  A-priori style

14

It seems odd to count association rules twice since we can do predictions with it, but nobody else does it (or didn't before - correct me if I'm wrong), so Predictive Association Rules makes the cut.

15+16+17+18

Well if we're going to go and call predictive association an algorithm, we had better do the same for our clustering algorithm.  Granted, clustering doesn't make a great classifier or estimator, but the great Highlight Exceptions functionality of the Data Mining addins comes from this ability.  Yes, we can do nested table prediction as well with clustering, but I wouldn't recommend it to my mom, so I won't take another four for that.

19+20+21+22+23

Neural Networks, Sequence Clustering, Time Series, Linear Regression and Logistic Regression.  Yeah, yeah, I could get into varieties here, but I think you get the point.

 

So by that count, and not being too creative (trust me, I can do more) we're looking at 23 algorithms in SQL Server 2005 Data Mining.  There are a few more options coming up in SQL Server 2008 that are worth discussing as well.

24

The time series of SQL Server 2007 uses the ARTXP algorithm - "Auto Regression Trees with Cross Predict".  In 2008, we're adding ARIMA as well, for algorithm #24.

25

And yet again with Time Series, the default mode of operation is to blend ARTXP and ARIMA results in an intelligent way to maximize accuracy and stability for #25.

 

Arbitrarily there are 23 algorithms in SQL 2005 and 25 in SQL 2008, with the option of teasing out even more varieties depending on how you apply parameters and flags to the base nine (or seven - depending on how you count!).   Next time someone quips that SQL Server only has "nine" algorithms, tell them that's just the packaging - each of those nine provides a wealth of value in each box.

More Posts Next page »
 
Page view tracker