<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Jamie's Junk : Decision Trees</title><link>http://blogs.msdn.com/jamiemac/archive/tags/Decision+Trees/default.aspx</link><description>Tags: Decision Trees</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>To (a) or not to (a), that is the question?</title><link>http://blogs.msdn.com/jamiemac/archive/2008/01/13/to-a-or-not-to-a-that-is-the-question.aspx</link><pubDate>Mon, 14 Jan 2008 08:36:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:7105158</guid><dc:creator>JamieMac</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/jamiemac/comments/7105158.aspx</comments><wfw:commentRss>http://blogs.msdn.com/jamiemac/commentrss.aspx?PostID=7105158</wfw:commentRss><description>&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;While looking for content for the next edition of my book (newsflash! I'm currently working on the next edition of my book!) I went hunting around for that trick using decision trees to predict the states of a single column independently rather than all together.&amp;nbsp; Turns out - I never wrote it!&amp;nbsp; So, in case you don't want to wait for it (or it doesn't make it into the book!), here it is now.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;The classifier-type algorithms in SQL Server Data Mining (notably decision trees, naive Bayes, and neural networks) can all predict multinomial outputs - that is, output attributes with multiple states - 3, 4, 10, 20, whatever.&amp;nbsp; However, in reality, classifiers in general prefer the yin and yang of things, the black and white, the yes-ness and the no-ness.&amp;nbsp; In short, they really are better at separating the states of a binomial attribute than those of a multinomial attribute - and so are you, actually.&amp;nbsp; If I gave you ten things to look at and asked which factor most cleanly divides all ten of them, you'd have a hard time, but if I gave you two things instead, you might not have a problem.&amp;nbsp; You would be more accurate, and your model may be more accurate as well, if you only had binomials.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;Additionally, with multinomials, your model - particularly with decision trees - is harder to interpret.&amp;nbsp; Say you have the marital status states of "Married", "Single", "Separated", "Divorced" and "Widowed".&amp;nbsp; When I look at the dependency net and it shows that "Number of Children" is predictive of "Marital Status" - what does it mean?&amp;nbsp; Which aspect of marital status is it talking about - all of them?&amp;nbsp; One of them?&amp;nbsp; Impossible to tell.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;So what do you do when you have a multinomial output?&amp;nbsp; The first choice is obvious - see if you can reduce it to two states, e.g. can they be collapsed into "Married" and "Not Married"?&amp;nbsp; If that's not an option, i.e. if the individual states are important, another option is to turn them into a series of binomials - e.g. "Married" and "Not Married", "Single" and "Not Single", etc.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;Now we run into an additional problem - not only is transposing our data like that a royal pain in the butt, it increases the number of columns and we may exceed the maximum row length of SQL Server.&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;Luckily with SQL Server Data Mining, we have a way out.&amp;nbsp; By using a trick with nested tables, we can create a model that treats each state as a binomial attribute without changing any of our data.&amp;nbsp; Assume our original model looked like this:&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Courier New'"&gt;CREATE MINING MODEL MyMultinomialModel&lt;BR&gt;(&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; CustomerID&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; LONG&amp;nbsp;&amp;nbsp; KEY,&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Age&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; LONG&amp;nbsp;&amp;nbsp; CONTINUOUS,&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Gender&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; TEXT&amp;nbsp;&amp;nbsp;&amp;nbsp;DISCRETE,&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; NumChildren&amp;nbsp;&amp;nbsp; &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;LONG&amp;nbsp;&amp;nbsp;&amp;nbsp;CONTINUOUS,&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; MaritalStatus&amp;nbsp; TEXT&amp;nbsp;&amp;nbsp;&amp;nbsp;DISCRETE&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; PREDICT&lt;BR&gt;) USING Microsoft_Decision_Trees&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;we can transform the MaritalStatus field into a nested table like this:&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Courier New'"&gt;CREATE MINING MODEL MyBinomialModel&lt;BR&gt;(&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; CustomerID&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; LONG&amp;nbsp;&amp;nbsp; KEY,&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Age&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;LONG&amp;nbsp;&amp;nbsp; CONTINUOUS,&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Gender&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;TEXT&amp;nbsp;&amp;nbsp;&amp;nbsp;DISCRETE,&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; NumChildren&amp;nbsp;&amp;nbsp; &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;LONG&amp;nbsp;&amp;nbsp;&amp;nbsp;CONTINUOUS,&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; MaritalStatus&amp;nbsp; TABLE&amp;nbsp; PREDICT_ONLY&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; (&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Status&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;TEXT&amp;nbsp;&amp;nbsp;&amp;nbsp;KEY&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; )&lt;BR&gt;) USING Microsoft_Decision_Trees&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;OK, so now that we've created the model, how do we train it?&amp;nbsp; We only have a single source table, right?&amp;nbsp; Well, you're right, but that doesn't really matter.&amp;nbsp; Using the tools in BI Dev Studio, you can select the same table as both the case table and the nested table, and you can do the same thing using DMX, like this:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Courier New'"&gt;INSERT INTO MyBinomialModel(CustomerID, &lt;BR&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;Age, Gender, NumChildren, &lt;BR&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;MaritalStatus(SKIP, Status))&lt;BR&gt;SHAPE{OPENQUERY(MyDataSource, &lt;BR&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;'SELECT CustID, Age, Gender, NumChildren &lt;BR&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;FROM Customers ORDER BY CustID') }&lt;BR&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;APPEND&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{OPENQUERY(MyDataSource,'SELECT CustID, MaritalStatus &lt;BR&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;FROM Customers ORDER BY 
CustID')&amp;nbsp;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;RELATE CustID to CustID} AS MaritalStatus&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;So why does this work?&amp;nbsp; Since we're using the same table as the case and nested tables, we're guaranteed that each case will have one and only one "Status" in the "MaritalStatus" table.&amp;nbsp; The MaritalStatus table is PREDICT_ONLY, so there's no cross-confusion between the states.&amp;nbsp; A decision tree model in this case will build five trees - one for each state of MaritalStatus.&amp;nbsp; Each attribute is a binomial variable with two possibilities: "This state exists" or "This state doesn't exist".&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;This also helps in prediction if you need to predict a particular state.&amp;nbsp; You will get a much better prediction of whether or not a customer is in a particular state than if you did this the traditional way.&amp;nbsp; You can find out how likely a state is by using subselects with your prediction statements, e.g.&lt;/SPAN&gt;&lt;/P&gt;
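&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;If you want to check the one-tree-per-state claim for yourself, a content query along these lines should list the root of each tree in the model.&amp;nbsp; This is only a minimal sketch, assuming MyBinomialModel has already been trained as described; I'd expect one row per state, with attribute names along the lines of "MaritalStatus(Single)", but treat the exact names and the row count as assumptions about your data rather than a promise.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Courier New'"&gt;-- Sketch: in a decision tree model, NODE_TYPE = 2 is the root node of a tree&lt;BR&gt;SELECT NODE_NAME, NODE_CAPTION, ATTRIBUTE_NAME, NODE_SUPPORT&lt;BR&gt;FROM MyBinomialModel.CONTENT&lt;BR&gt;WHERE NODE_TYPE = 2&lt;/SPAN&gt;&lt;/P&gt;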
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Courier New'"&gt;SELECT FLATTENED&lt;BR&gt;&amp;nbsp;&amp;nbsp; (SELECT * FROM &lt;BR&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;PREDICT(MaritalStatus, INCLUDE_STATISTICS) &lt;BR&gt;&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;WHERE Status='Single') &lt;BR&gt;FROM MyBinomialModel PREDICTION JOIN ...&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;The above query will give you the probability and support of being single for each customer in the input set.&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
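&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;For a quick test without an input table, the same subselect can be run as a singleton query.&amp;nbsp; This is a sketch only: the input values (a 35-year-old with two children, and 'Female' as a Gender value) are made up for illustration and are not from any real data set, so substitute values that actually occur in your own columns.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Courier New'"&gt;-- Hypothetical single case; the column aliases must match the model column names&lt;BR&gt;SELECT FLATTENED&lt;BR&gt;&amp;nbsp;&amp;nbsp; (SELECT * FROM&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; PREDICT(MaritalStatus, INCLUDE_STATISTICS)&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; WHERE Status='Single')&lt;BR&gt;FROM MyBinomialModel&lt;BR&gt;NATURAL PREDICTION JOIN&lt;BR&gt;&amp;nbsp;&amp;nbsp; (SELECT 35 AS Age, 'Female' AS Gender, 2 AS NumChildren) AS t&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;NATURAL PREDICTION JOIN matches the input columns to the model columns by name, which saves writing an ON clause for a one-off case like this.&lt;/SPAN&gt;&lt;/P&gt;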
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;Additionally, in the dependency network, each state of MaritalStatus will appear as its own node, so you will finally be able to see whether Number Of Children is more predictive of "&lt;I style="mso-bidi-font-style: normal"&gt;Married&lt;/I&gt;" or of "&lt;I style="mso-bidi-font-style: normal"&gt;Divorced&lt;/I&gt;" :)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;And for those wanting more information on the book, I’m working once again with Zhaohui Tang on the book and this time we’ve added Bogdan Crivat as a new author.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;It will probably have some spiffy name like “Data Mining with SQL Server 2008” and will be released sometime around when the product ships – no comment on that!&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Arial','sans-serif'"&gt;&amp;nbsp;&lt;EM&gt;&lt;SPAN style="FONT-FAMILY: 'Arial','sans-serif'"&gt;(note: the author is happily married with four children, regardless of what your dependency net says)&lt;o:p&gt;&lt;/o:p&gt;&lt;/SPAN&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=7105158" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/jamiemac/archive/tags/DMX/default.aspx">DMX</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Decision+Trees/default.aspx">Decision Trees</category></item><item><title>How can we mine?  Let me count the ways...</title><link>http://blogs.msdn.com/jamiemac/archive/2007/11/19/how-can-we-mine-let-me-count-the-ways.aspx</link><pubDate>Tue, 20 Nov 2007 00:18:10 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:6410880</guid><dc:creator>JamieMac</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/jamiemac/comments/6410880.aspx</comments><wfw:commentRss>http://blogs.msdn.com/jamiemac/commentrss.aspx?PostID=6410880</wfw:commentRss><description>&lt;p&gt;Recently I received some customer feedback that SQL Server Data Mining &amp;quot;doesn't have enough algorithms.&amp;quot;&amp;#160; More specifically, the comment was that we have the same capabilities as other Data Mining providers; we just &amp;quot;hide&amp;quot; many facilities as algorithm parameters rather than separating out each as a named algorithm.&amp;#160; So let's count the Microsoft algorithms a few different ways to work this out.&lt;/p&gt;  &lt;p&gt;First - let's go by the box.&amp;#160; This is the list of algorithms as specified in Books Online:&lt;/p&gt;  &lt;ol&gt;   &lt;li&gt;Microsoft Decision Trees&lt;/li&gt;    &lt;li&gt;Microsoft Clustering&lt;/li&gt;    &lt;li&gt;Microsoft Naive Bayes&lt;/li&gt;    &lt;li&gt;Microsoft Association Rules&lt;/li&gt;    &lt;li&gt;Microsoft Neural Networks&lt;/li&gt;    &lt;li&gt;Microsoft Time Series&lt;/li&gt;    &lt;li&gt;Microsoft Sequence Clustering&lt;/li&gt;    &lt;li&gt;Microsoft Linear Regression&lt;/li&gt;    
&lt;li&gt;Microsoft Logistic Regression&lt;/li&gt; &lt;/ol&gt;  &lt;p&gt;So that's nine - count 'em &lt;em&gt;nine&lt;/em&gt; algorithms.&amp;#160;&amp;#160; But that's just one way.&amp;#160; If you look at my book, Data Mining with SQL Server 2005 written with Zhaohui Tang, there are only &lt;em&gt;seven &lt;/em&gt;algorithms!&amp;#160; What?&amp;#160; You say!&amp;#160; How can it be?&lt;/p&gt;  &lt;p&gt;Let me explain.&amp;#160; During the development of SQL Server 2005, we realized a couple of tricks: 1) linear regression was the same as our tree algorithm,&amp;#160; just forced to not split; and 2) logistic regression was the same as our Neural Nets, just with zero hidden layers.&amp;#160; However, we got similar feedback - people want &lt;em&gt;more algorithms&lt;/em&gt;, and specifically these ones, so we set up two &amp;quot;new algorithms&amp;quot; by forcibly setting parameters on the Decision Tree and Neural Network algorithms and voila! we shipped with nine named algorithms.&amp;#160; It would have been hard to fill up two entire chapters explaining that last sentence, so Zhaohui and I decided just to stick to the seven core algorithms.&lt;/p&gt;  &lt;p&gt;Anyway, this posting isn't really about how to count &lt;em&gt;fewer&lt;/em&gt; algorithms, I really wanted to show you how to count &lt;em&gt;more.&amp;#160; &lt;/em&gt;When we set about designing SQL Server Data Mining, we really and truly tried to make data mining operations simpler.&amp;#160; We thought at the time, rightly or wrongly, that the more options end users have, the more complicated and difficult the product would be to use.&amp;#160; Therefore, we tried to determine the best behavior in a class, and make more advanced options available through parameters.&lt;/p&gt;  &lt;p&gt;For example, take our clustering algorithm.&amp;#160; We assumed that if people wanted clustering, they most likely didn't care about the details of the algorithm, they just wanted to get the job done, and that those 
people who wanted more would look for it (the design principle - make the simple things simple, and the complex things possible).&amp;#160; So we bundled up different flavors of clustering into a single package that many vendors would have broken apart.&amp;#160; So let's start counting with clustering.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;1&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Our default clustering behavior is &lt;strong&gt;EM (Expectation Maximization) clustering&lt;/strong&gt; using the Bradley-Fayyad scalable framework.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;2&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Setting a parameter changes that to a &lt;strong&gt;K-Means clustering &lt;/strong&gt;implementation using the same framework.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;3+4&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Setting the same parameter another way provides &lt;strong&gt;non-scalable&lt;/strong&gt; versions of the two clustering varieties.&amp;#160; (I know it's hard to swallow that the non-scalable versions count as separate algorithms, but if you &lt;em&gt;started&lt;/em&gt; with the vanilla versions and &lt;em&gt;added&lt;/em&gt; scalability, then &lt;em&gt;of course&lt;/em&gt; you would consider those versions as new algorithms - I'm just working backwards here.)&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;5&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Let's move to our Decision Tree algorithm and we will consider our classification tree as one algorithm.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;6&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;But our Decision Tree also predicts continuous values and counts as a &lt;em&gt;regression&lt;/em&gt; tree, so we will count that as another algorithm.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;7&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  
&lt;p&gt;Oops!&amp;#160; Our Decision Tree &lt;em&gt;also &lt;/em&gt;creates full linear regressions at each of the leaf nodes.&amp;#160; To get the typical regression tree behavior you need to make sure that none of the continuous inputs have the REGRESSOR flag and you get yet another algorithm.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;8&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Oh yeah, our trees allow for multiple targets in each model, allowing the discovery and display of interrelated patterns through our dependency net.&amp;#160; I've seen other vendors advertise such functionality as an &amp;quot;algorithm&amp;quot; so there's our #8.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;9&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;How about collaborative filtering with Trees - just slap a PREDICT flag on a nested table, and you have a complete recommendation system.&amp;#160; Let's call it Associative Trees&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;10&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Naive Bayes.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;11+12&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;If we're going to count Associative Trees, we also have &amp;quot;Associative Bayes&amp;quot;.&amp;#160; I guess the multiple target interrelated pattern thing counts here as well.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;13&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Association Rules.&amp;#160; A-priori style&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;14&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;It seems odd to count association rules twice since we can do predictions with it, but nobody else does it (or didn't before - correct me if I'm wrong), so Predictive Association Rules makes the cut.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;15+16+17+18&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Well if we're going to go and call predictive association an 
algorithm, we had better do the same for our clustering algorithm.&amp;#160; Granted, clustering doesn't make a great classifier or estimator, but the great Highlight Exceptions functionality of the Data Mining add-ins comes from this ability.&amp;#160; Yes, we can do nested table prediction as well with clustering, but I wouldn't recommend it to my mom, so I won't take another four for that.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;19+20+21+22+23&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Neural Networks, Sequence Clustering, Time Series, Linear Regression and Logistic Regression.&amp;#160; Yeah, yeah, I could get into varieties here, but I think you get the point.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;So by that count, and not being &lt;em&gt;too &lt;/em&gt;creative (trust me, I can do more) we're looking at &lt;font size="5"&gt;&lt;strong&gt;23 &lt;/strong&gt;&lt;/font&gt;algorithms in SQL Server 2005 Data Mining.&amp;#160; There are a few more options coming up in SQL Server 2008 that are worth discussing as well.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;24&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;The time series of SQL Server 2005 uses the ARTXP algorithm - &amp;quot;Auto Regression Trees with Cross Predict&amp;quot;.&amp;#160; In 2008, we're adding ARIMA as well, for algorithm #24.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;25&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;And yet again with Time Series, the default mode of operation is to blend ARTXP and ARIMA results in an intelligent way to maximize accuracy and stability for #25.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;Arbitrarily there are 23 algorithms in SQL 2005 and 25 in SQL 2008, with the option of teasing out even more varieties depending on how you apply parameters and flags to the base nine (or seven - depending on how you count!).&amp;#160;&amp;#160; Next time someone quips that SQL Server only has 
&amp;quot;nine&amp;quot; algorithms, tell them that's just the packaging - each of those nine provides a wealth of value in each box.&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=6410880" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Decision+Trees/default.aspx">Decision Trees</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Clustering/default.aspx">Clustering</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Association+Rules/default.aspx">Association Rules</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Time+Series/default.aspx">Time Series</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Algorithms/default.aspx">Algorithms</category></item><item><title>Tree Utilities in Analysis Services Stored Procedures</title><link>http://blogs.msdn.com/jamiemac/archive/2007/03/02/tree-utilities-in-analysis-services-stored-procedures.aspx</link><pubDate>Fri, 02 Mar 2007 11:32:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:1785956</guid><dc:creator>JamieMac</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/jamiemac/comments/1785956.aspx</comments><wfw:commentRss>http://blogs.msdn.com/jamiemac/commentrss.aspx?PostID=1785956</wfw:commentRss><description>&lt;P&gt;This past week I was helping out a customer that wanted to reduce the length of a questionnaire by using data mining to determine which of the 300+ questions were actually necessary for them to get the understanding they required.&amp;nbsp; By using a tree model, and playing with the COMPLEXITY_PENALTY parameter, I was able to build a model that was reasonably accurate and only required 10-17 questions.&amp;nbsp; (I was able to do the whole project using the Data Mining Add-ins as well!)&lt;/P&gt;
&lt;P&gt;In the process, I created some stored procedures that helped me easily extract the information I needed from the tree - information such as which paths are the shortest, the longest, that contain the minimum or maximum values, etc.&lt;/P&gt;
&lt;P&gt;Attached is the source code for those utilities.&amp;nbsp; You can copy and paste them into Visual Studio - add a reference to ADOMDServer.NET, build and add as an assembly to your Analysis Services server.&lt;/P&gt;
&lt;P&gt;The utilities give you the following functions:&lt;/P&gt;
&lt;P&gt;For all trees:&lt;/P&gt;&lt;FONT size=2&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT size=2&gt;List out the shortest paths from root to leaf&lt;BR&gt;CALL TreeUtils.ShortestPaths("Model Name", "Tree Name")&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;List out the longest paths from root to leaf&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;BR&gt;CALL TreeUtils.LongestPaths("Model Name", "Tree Name")&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Provide information about the shortest paths (e.g. value, probability, etc)&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;BR&gt;CALL TreeUtils.ShortestPathStatistics("Model Name", "Tree Name")&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;Provide information about the longest paths (e.g. value, probability, etc)&lt;FONT size=2&gt;&lt;BR&gt;&lt;/FONT&gt;CALL TreeUtils.LongestPathStatistics("Model Name", "Tree Name")&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;For Regression Trees:&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Return the path to the leaf node containing the minimum value&lt;BR&gt;CALL TreeUtils.MinimumPath("Model Name", "Tree Name")&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;P&gt;Return information about the path containing the minimum value (e.g. depth, value, etc)&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;BR&gt;CALL TreeUtils.MinimumPathStatistics("Model Name", "Tree Name")&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;Return the path to the leaf node containing the maximum value&lt;BR&gt;CALL TreeUtils.MaximumPath("Model Name", "Tree Name")&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Return information about the path containing the maximum value (e.g. depth, value, etc)&lt;FONT size=2&gt;&lt;BR&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;CALL TreeUtils.MaximumPathStatistics("Model Name", "Tree Name")&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;For Classification Trees&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT size=2&gt;Return the path leading to the leaf with the lowest likelihood of the specified state&lt;BR&gt;CALL TreeUtils.MinimumPath("Model Name", "Tree Name", "State")&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;Return the path leading to the leaf with the highest likelihood of the specified state&lt;BR&gt;CALL TreeUtils.MaximumPath("Model Name", "Tree Name", "State")&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;Return information about the path with the lowest likelihood of the specified state (e.g. depth, probability, etc)&lt;FONT size=2&gt;&lt;BR&gt;&lt;/FONT&gt;CALL TreeUtils.MinimumPathStatistics("Model Name", "Tree Name", "State")&lt;/P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;
&lt;P&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT color=#000000&gt;Return information about the path with the highest likelihood of the specified state (e.g. depth, probability, etc)&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;BR&gt;&lt;/FONT&gt;CALL TreeUtils.MaximumPathStatistics("Model Name", "Tree Name", "State")&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;/FONT&gt;
&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=1785956" width="1" height="1"&gt;</description><enclosure url="http://blogs.msdn.com/jamiemac/attachment/1785956.ashx" length="28804" type="text/plain" /><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Decision+Trees/default.aspx">Decision Trees</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Code/default.aspx">Code</category></item><item><title>Wisconsin Breast Cancer Dataset available</title><link>http://blogs.msdn.com/jamiemac/archive/2007/02/01/wisconsin-breast-cancer-dataset-available.aspx</link><pubDate>Fri, 02 Feb 2007 01:50:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:1580365</guid><dc:creator>JamieMac</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/jamiemac/comments/1580365.aspx</comments><wfw:commentRss>http://blogs.msdn.com/jamiemac/commentrss.aspx?PostID=1580365</wfw:commentRss><description>&lt;P&gt;Frequently I use the Wisconsin Breast Cancer Dataset for demonstrating the Data Mining&amp;nbsp;Addins for Office&amp;nbsp;- enough people asked, so I made it available as an &lt;A class="" href="http://www.sqlserverdatamining.com/DMCommunity/_Downloads/4390.aspx" mce_href="http://www.sqlserverdatamining.com/DMCommunity/_Downloads/4390.aspx"&gt;Excel 2007 file&lt;/A&gt;&amp;nbsp;(free login required).&amp;nbsp; For purists, the original data is available at the &lt;A class="" href="http://www.ics.uci.edu/~mlearn/MLRepository.html" mce_href="http://www.ics.uci.edu/~mlearn/MLRepository.html"&gt;Machine Learning repository&lt;/A&gt;, which is a great location for many sample datasets.&lt;/P&gt;
&lt;P&gt;Here are some screenshots of the data mining add-ins applied to this dataset:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 1:&amp;nbsp; Key Factor Analysis showing differences between benign and malignant tumors&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;IMG title="Key factors discriminating malignant and benign tumors" alt="Key factors discriminating malignant and benign tumors" src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_kfa.jpg" mce_src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_kfa.jpg"&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 2: Detect categories showing malignancy across detected groups.&amp;nbsp; Note two purely malignant categories suggesting differing classes of malignant tumors.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;IMG title="Malignancy across categories detected by Table Analysis Tools" style="WIDTH: 1054px; HEIGHT: 785px" height=785 alt="Malignancy across categories detected by Table Analysis Tools" src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_dc.jpg" width=1054 mce_src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_dc.jpg"&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 3: Decision tree to predict diagnosis, with nodes shaded based on likelihood of malignancy.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;IMG title="Diagnosis Decision Tree" style="WIDTH: 1073px; HEIGHT: 556px" height=556 alt="Diagnosis Decision Tree" src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_tree.jpg" width=1073 mce_src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_tree.jpg"&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=1580365" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Excel/default.aspx">Excel</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Decision+Trees/default.aspx">Decision Trees</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Clustering/default.aspx">Clustering</category></item></channel></rss>