<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Jamie's Junk : Algorithms</title><link>http://blogs.msdn.com/jamiemac/archive/tags/Algorithms/default.aspx</link><description>Tags: Algorithms</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Support Vector Machines for SQL Server Data Mining</title><link>http://blogs.msdn.com/jamiemac/archive/2008/10/14/support-vector-machines-for-sql-server-data-mining.aspx</link><pubDate>Tue, 14 Oct 2008 20:43:17 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8999848</guid><dc:creator>JamieMac</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/jamiemac/comments/8999848.aspx</comments><wfw:commentRss>http://blogs.msdn.com/jamiemac/commentrss.aspx?PostID=8999848</wfw:commentRss><description>&lt;p&gt;Many have requested that we implement Support Vector Machines (SVMs) for SQL Server 2008, but for a wide variety of reasons we just couldn't get to it.&amp;#160; Luckily, the community has come to the rescue for those needing an SVM implementation today!&amp;#160; Joris Valkonet of &lt;a href="http://www.avanade.nl" target="_blank"&gt;Avanade Netherlands&lt;/a&gt;, along with colleague Thanh Luc, has implemented an SVM plug-in algorithm and viewer.&amp;#160; Not only that, but Joris has released the plug-in along with all of the source code at &lt;a href="http://www.codeplex.com/svmplugin" target="_blank"&gt;CodePlex&lt;/a&gt; so you can customize the algorithm for your own purposes or at least get another example of how algorithms and viewers are implemented.&lt;/p&gt;  &lt;p&gt;Below is a screenshot from the viewer showing cancer classification split across two selectable 
axes with green and blue indicating correctly classified benign and malignant tumors, respectively, and red indicating misclassifications.&lt;/p&gt;  &lt;p&gt;The plug-in and code can be found at &lt;a title="http://www.codeplex.com/svmplugin" href="http://www.codeplex.com/svmplugin"&gt;http://www.codeplex.com/svmplugin&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&lt;img height="421" alt="Viewer_WDBC.jpg" src="http://i3.codeplex.com/Project/Download/FileDownload.aspx?ProjectName=svmplugin&amp;amp;DownloadId=45974" width="377" /&gt;&lt;/p&gt;</description><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Code/default.aspx">Code</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Algorithms/default.aspx">Algorithms</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Plug-ins/default.aspx">Plug-ins</category></item><item><title>How can we mine?  Let me count the ways...</title><link>http://blogs.msdn.com/jamiemac/archive/2007/11/19/how-can-we-mine-let-me-count-the-ways.aspx</link><pubDate>Tue, 20 Nov 2007 00:18:10 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:6410880</guid><dc:creator>JamieMac</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/jamiemac/comments/6410880.aspx</comments><wfw:commentRss>http://blogs.msdn.com/jamiemac/commentrss.aspx?PostID=6410880</wfw:commentRss><description>&lt;p&gt;Recently I received some customer feedback that SQL Server Data Mining &amp;quot;doesn't have enough algorithms.&amp;quot;&amp;#160; More specifically, the comment was that we have the same capabilities as other Data Mining providers; we just &amp;quot;hide&amp;quot; many facilities as algorithm parameters rather than separating out each as a named algorithm.&amp;#160; So let's count the Microsoft algorithms a few different ways to work this out.&lt;/p&gt;  &lt;p&gt;First - let's go by the 
box.&amp;#160; This is the list of algorithms as specified in Books Online:&lt;/p&gt;  &lt;ol&gt;   &lt;li&gt;Microsoft Decision Trees&lt;/li&gt;    &lt;li&gt;Microsoft Clustering&lt;/li&gt;    &lt;li&gt;Microsoft Naive Bayes&lt;/li&gt;    &lt;li&gt;Microsoft Association Rules&lt;/li&gt;    &lt;li&gt;Microsoft Neural Networks&lt;/li&gt;    &lt;li&gt;Microsoft Time Series&lt;/li&gt;    &lt;li&gt;Microsoft Sequence Clustering&lt;/li&gt;    &lt;li&gt;Microsoft Linear Regression&lt;/li&gt;    &lt;li&gt;Microsoft Logistic Regression&lt;/li&gt; &lt;/ol&gt;  &lt;p&gt;So that's nine - count 'em &lt;em&gt;nine&lt;/em&gt; algorithms.&amp;#160;&amp;#160; But that's just one way.&amp;#160; If you look at my book, Data Mining with SQL Server 2005, written with Zhaohui Tang, there are only &lt;em&gt;seven &lt;/em&gt;algorithms!&amp;#160; What, you say?&amp;#160; How can it be?&lt;/p&gt;  &lt;p&gt;Let me explain.&amp;#160; During the development of SQL Server 2005, we realized a couple of tricks: 1) linear regression was the same as our tree algorithm,&amp;#160; just forced to not split; and 2) logistic regression was the same as our Neural Nets, just with zero hidden layers.&amp;#160; However, we got similar feedback - people want &lt;em&gt;more algorithms&lt;/em&gt;, and specifically these two, so we set up two &amp;quot;new algorithms&amp;quot; by forcibly setting parameters on the Decision Tree and Neural Network algorithms and voila! 
we shipped with nine named algorithms.&amp;#160; It would have been hard to fill up two entire chapters explaining that last sentence, so Zhaohui and I decided just to stick to the seven core algorithms.&lt;/p&gt;  &lt;p&gt;Anyway, this posting isn't really about how to count &lt;em&gt;fewer&lt;/em&gt; algorithms; I really wanted to show you how to count &lt;em&gt;more.&amp;#160; &lt;/em&gt;When we set about designing SQL Server Data Mining, we really and truly tried to make data mining operations simpler.&amp;#160; We thought at the time, rightly or wrongly, that the more options end users have, the more complicated and difficult the product would be to use.&amp;#160; Therefore, we tried to determine the best behavior in a class, and make more advanced options available through parameters.&lt;/p&gt;  &lt;p&gt;For example, take our clustering algorithm.&amp;#160; We assumed that if people wanted clustering, they most likely didn't care about the details of the algorithm, they just wanted to get the job done, and that those people who wanted more would look for it (the design principle - make the simple things simple, and the complex things possible).&amp;#160; So we bundled up different flavors of clustering into a single package that many vendors would have broken apart.&amp;#160; So let's start counting with clustering.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;1&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Our default clustering behavior is &lt;strong&gt;EM (Expectation Maximization) clustering&lt;/strong&gt; using the Bradley-Fayyad scalable framework.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;2&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Setting a parameter changes that to a &lt;strong&gt;K-Means clustering &lt;/strong&gt;implementation using the same framework.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;3+4&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  
&lt;p&gt;Setting the same parameter another way provides &lt;strong&gt;non-scalable&lt;/strong&gt; versions of the two clustering varieties.&amp;#160; (I know it's hard to swallow that the non-scalable versions count as separate algorithms, but if you &lt;em&gt;started&lt;/em&gt; with the vanilla versions and &lt;em&gt;added&lt;/em&gt; scalability, then &lt;em&gt;of course&lt;/em&gt; you would consider those versions as new algorithms - I'm just working backwards here.)&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;5&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Let's move to our Decision Tree algorithm and we will consider our classification tree as one algorithm.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;6&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;But our Decision Tree also predicts continuous values and counts as a &lt;em&gt;regression&lt;/em&gt; tree, so we will count that as another algorithm.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;7&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Oops!&amp;#160; Our Decision Tree &lt;em&gt;also &lt;/em&gt;creates full linear regressions at each of the leaf nodes.&amp;#160; To get the typical regression tree behavior you need to make sure that none of the continuous inputs have the REGRESSOR flag, and you get yet another algorithm.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;8&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Oh yeah, our trees allow for multiple targets in each model, allowing the discovery and display of interrelated patterns through our dependency net.&amp;#160; I've seen other vendors advertise such functionality as an &amp;quot;algorithm&amp;quot; so there's our #8.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;9&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;How about collaborative filtering with Trees - just slap a PREDICT flag on a nested table, and you have a complete recommendation system.&amp;#160; Let's call it Associative Trees.&lt;/p&gt;  
&lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;10&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Naive Bayes.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;11+12&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;If we're going to count Associative Trees, we also have &amp;quot;Associative Bayes&amp;quot;.&amp;#160; I guess the multiple-target interrelated pattern thing counts here as well.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;13&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Association Rules.&amp;#160; Apriori style.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;14&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;It seems odd to count association rules twice just because we can do predictions with them, but nobody else does that (or didn't before - correct me if I'm wrong), so Predictive Association Rules makes the cut.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;15+16+17+18&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Well, if we're going to go and call predictive association an algorithm, we had better do the same for our clustering algorithm.&amp;#160; Granted, clustering doesn't make a great classifier or estimator, but the great Highlight Exceptions functionality of the Data Mining add-ins comes from this ability.&amp;#160; Yes, we can do nested table prediction as well with clustering, but I wouldn't recommend it to my mom, so I won't take another four for that.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;19+20+21+22+23&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Neural Networks, Sequence Clustering, Time Series, Linear Regression and Logistic Regression.&amp;#160; Yeah, yeah, I could get into varieties here, but I think you get the point.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;So by that count, and not being &lt;em&gt;too &lt;/em&gt;creative (trust me, I can do more), we're looking at &lt;font size="5"&gt;&lt;strong&gt;23 &lt;/strong&gt;&lt;/font&gt;algorithms in SQL Server 2005 Data 
Mining.&amp;#160; There are a few more options coming up in SQL Server 2008 that are worth discussing as well.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;24&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;The time series of SQL Server 2005 uses the ARTXP algorithm - &amp;quot;Auto Regression Trees with Cross Predict&amp;quot;.&amp;#160; In 2008, we're adding ARIMA as well, for algorithm #24.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;25&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;And yet again with Time Series, the default mode of operation is to blend ARTXP and ARIMA results in an intelligent way to maximize accuracy and stability for #25.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;Arbitrarily, there are 23 algorithms in SQL 2005 and 25 in SQL 2008, with the option of teasing out even more varieties depending on how you apply parameters and flags to the base nine (or seven - depending on how you count!).&amp;#160;&amp;#160; Next time someone quips that SQL Server only has &amp;quot;nine&amp;quot; algorithms, tell them that's just the packaging - each of those nine provides a wealth of value in each box.&lt;/p&gt;</description><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Decision+Trees/default.aspx">Decision Trees</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Clustering/default.aspx">Clustering</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Association+Rules/default.aspx">Association Rules</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Time+Series/default.aspx">Time Series</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Algorithms/default.aspx">Algorithms</category></item></channel></rss>