<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Jamie's Junk : Clustering</title><link>http://blogs.msdn.com/jamiemac/archive/tags/Clustering/default.aspx</link><description>Tags: Clustering</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Getting Data Mining results into SQL Tables</title><link>http://blogs.msdn.com/jamiemac/archive/2008/10/07/getting-data-mining-results-into-sql-tables.aspx</link><pubDate>Wed, 08 Oct 2008 09:35:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:8990858</guid><dc:creator>JamieMac</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/jamiemac/comments/8990858.aspx</comments><wfw:commentRss>http://blogs.msdn.com/jamiemac/commentrss.aspx?PostID=8990858</wfw:commentRss><description>&lt;P&gt;I've been seeing lots of questions about how to get data mining results into SQL tables.&amp;nbsp; Most of the time the answers are "use the prediction query builder save button" or "build an SSIS package."&amp;nbsp; Both of these have issues: the former is really only suited to single-use, small jobs, and the latter has a lot of overhead (not to mention that if you want to use the Data Mining Query Transform you have to have Enterprise Edition).&lt;/P&gt;
&lt;P&gt;Luckily, there happens to be a much easier way - and it's one of those "Doh!" moments when you learn about it, because it's that easy.&lt;/P&gt;
&lt;P&gt;The way to do it is to simply use linked servers.&amp;nbsp; Anyone who uses DMX knows to connect to SQL data with OPENQUERY.&amp;nbsp; There's no reason you can't simply connect to Data Mining data using the same mechanism.&lt;/P&gt;
&lt;P&gt;For example, use a query like this to establish a link to an AS server:&lt;/P&gt;&lt;PRE class=csharpcode&gt;&lt;SPAN class=kwrd&gt;EXEC&lt;/SPAN&gt; sp_addlinkedserver 
@server=&lt;SPAN class=str&gt;'LINKED_AS'&lt;/SPAN&gt;, &lt;SPAN class=rem&gt;-- local SQL name given to the linked server&lt;/SPAN&gt;
@srvproduct=&lt;SPAN class=str&gt;''&lt;/SPAN&gt;, &lt;SPAN class=rem&gt;-- not used &lt;/SPAN&gt;
@provider=&lt;SPAN class=str&gt;'MSOLAP'&lt;/SPAN&gt;, &lt;SPAN class=rem&gt;-- OLE DB provider &lt;/SPAN&gt;
@datasrc=&lt;SPAN class=str&gt;'localhost'&lt;/SPAN&gt;, &lt;SPAN class=rem&gt;-- analysis server name (machine name) &lt;/SPAN&gt;
@catalog=&lt;SPAN class=str&gt;'MovieClick'&lt;/SPAN&gt; &lt;SPAN class=rem&gt;-- default catalog/database&lt;/SPAN&gt; &lt;/PRE&gt;
&lt;P&gt;
&lt;STYLE type=text/css&gt;
.csharpcode, .csharpcode pre
{
	font-size: small;
	color: black;
	font-family: consolas, "Courier New", courier, monospace;
	background-color: #ffffff;
	/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt 
{
	background-color: #f4f4f4;
	width: 100%;
	margin: 0em;
}
.csharpcode .lnum { color: #606060; }&lt;/STYLE&gt;
Then you can select data using OPENQUERY like this:&lt;/P&gt;&lt;PRE class=csharpcode&gt;&lt;SPAN class=kwrd&gt;SELECT&lt;/SPAN&gt; * &lt;SPAN class=kwrd&gt;FROM&lt;/SPAN&gt; 
&lt;SPAN class=kwrd&gt;OPENQUERY&lt;/SPAN&gt;(LINKED_AS, 
  &lt;SPAN class=str&gt;'SELECT Cluster() AS [Cluster], ClusterProbability() AS [Prob] 
   FROM [Customers - Clustering]
   NATURAL PREDICTION JOIN
   OPENQUERY([Movie Click],'&lt;/SPAN&gt;&lt;SPAN class=str&gt;'SELECT * FROM Customers'&lt;/SPAN&gt;&lt;SPAN class=str&gt;') AS t'&lt;/SPAN&gt;)&lt;/PRE&gt;

&lt;P&gt;Then, of course, you can do all kinds of manipulations on it, like finding the average cluster probability of each cluster, right?&amp;nbsp; Well, almost.&amp;nbsp; The data type returned by the Cluster function is actually text or ntext or something that GROUP BY chokes on, so you have to do some casting first.&amp;nbsp; Therefore, if you want to do that trick, use a query like this:&lt;/P&gt;&lt;PRE class=csharpcode&gt;&lt;SPAN class=kwrd&gt;SELECT&lt;/SPAN&gt; Cluster, &lt;SPAN class=kwrd&gt;AVG&lt;/SPAN&gt;(Prob) &lt;SPAN class=kwrd&gt;FROM&lt;/SPAN&gt;
(&lt;SPAN class=kwrd&gt;SELECT&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;CAST&lt;/SPAN&gt;(Cluster &lt;SPAN class=kwrd&gt;AS&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;Char&lt;/SPAN&gt;(30)) &lt;SPAN class=kwrd&gt;AS&lt;/SPAN&gt; Cluster, Prob &lt;SPAN class=kwrd&gt;FROM&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;OPENQUERY&lt;/SPAN&gt;(LINKED_AS, 
  &lt;SPAN class=str&gt;'SELECT Cluster() AS [Cluster], 
      ClusterProbability() AS [Prob] FROM [Customers - Clustering]
   NATURAL PREDICTION JOIN
   OPENQUERY([Movie Click],'&lt;/SPAN&gt;&lt;SPAN class=str&gt;'SELECT * FROM Customers'&lt;/SPAN&gt;&lt;SPAN class=str&gt;') AS t'&lt;/SPAN&gt;)
   ) &lt;SPAN class=kwrd&gt;AS&lt;/SPAN&gt; t
&lt;SPAN class=kwrd&gt;GROUP&lt;/SPAN&gt; &lt;SPAN class=kwrd&gt;BY&lt;/SPAN&gt; Cluster&lt;/PRE&gt;

&lt;P&gt;That will give you a nice result showing you, in a way, the affinity of each cluster based on the input set.&amp;nbsp; That is, if you ran such a query against the training data, you could say that the clusters with a higher probability are "tighter" than the ones with low probabilities.&amp;nbsp; Anyway, that's beside the point of this post.&lt;/P&gt;
&lt;P&gt;In any case, remember to double your single quotes and flatten any nested results and this technique should work just great for getting DMX into SQL.&lt;/P&gt;
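&lt;P&gt;To actually land the results in a SQL table, one minimal sketch is to wrap the same OPENQUERY in a SELECT ... INTO.&amp;nbsp; The table name dbo.CustomerClusters and the varchar(30) length are assumptions for illustration only; LINKED_AS, the mining model and the data source are the ones used in the queries above.&lt;/P&gt;&lt;PRE class=csharpcode&gt;-- Assumption: dbo.CustomerClusters is a hypothetical table that does not exist yet
SELECT CAST(Cluster AS varchar(30)) AS Cluster, -- cast the text result so it can be grouped and indexed later
       Prob
INTO dbo.CustomerClusters
FROM OPENQUERY(LINKED_AS, 
  'SELECT Cluster() AS [Cluster], ClusterProbability() AS [Prob] 
   FROM [Customers - Clustering]
   NATURAL PREDICTION JOIN
   OPENQUERY([Movie Click],''SELECT * FROM Customers'') AS t')&lt;/PRE&gt;
&lt;P&gt;SELECT ... INTO creates the table on the fly; if the table already exists, an INSERT INTO ... SELECT over the same OPENQUERY should do the job instead.&lt;/P&gt;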
&lt;P&gt;-J&lt;/P&gt;</description><category domain="http://blogs.msdn.com/jamiemac/archive/tags/DMX/default.aspx">DMX</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Clustering/default.aspx">Clustering</category></item><item><title>How can we mine?  Let me count the ways...</title><link>http://blogs.msdn.com/jamiemac/archive/2007/11/19/how-can-we-mine-let-me-count-the-ways.aspx</link><pubDate>Tue, 20 Nov 2007 00:18:10 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:6410880</guid><dc:creator>JamieMac</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/jamiemac/comments/6410880.aspx</comments><wfw:commentRss>http://blogs.msdn.com/jamiemac/commentrss.aspx?PostID=6410880</wfw:commentRss><description>&lt;p&gt;Recently I received some customer feedback that SQL Server Data Mining &amp;quot;doesn't have enough algorithms.&amp;quot;&amp;#160; More specifically, the comment was that we have the same capabilities as other Data Mining providers; we just &amp;quot;hide&amp;quot; many facilities as algorithm parameters rather than separating out each as a named algorithm.&amp;#160; So let's count the Microsoft algorithms a few different ways to work this out.&lt;/p&gt;  &lt;p&gt;First - let's go by the box.&amp;#160; This is the list of algorithms as specified in Books Online:&lt;/p&gt;  &lt;ol&gt;   &lt;li&gt;Microsoft Decision Trees&lt;/li&gt;    &lt;li&gt;Microsoft Clustering&lt;/li&gt;    &lt;li&gt;Microsoft Naive Bayes&lt;/li&gt;    &lt;li&gt;Microsoft Association Rules&lt;/li&gt;    &lt;li&gt;Microsoft Neural Networks&lt;/li&gt;    &lt;li&gt;Microsoft Time Series&lt;/li&gt;    &lt;li&gt;Microsoft Sequence Clustering&lt;/li&gt;    &lt;li&gt;Microsoft Linear Regression&lt;/li&gt;    &lt;li&gt;Microsoft Logistic Regression&lt;/li&gt; &lt;/ol&gt;  &lt;p&gt;So that's nine - count 'em &lt;em&gt;nine&lt;/em&gt; 
algorithms.&amp;#160;&amp;#160; But that's just one way.&amp;#160; If you look at my book, Data Mining with SQL Server 2005, written with Zhaohui Tang, there are only &lt;em&gt;seven &lt;/em&gt;algorithms!&amp;#160; What, you say?&amp;#160; How can it be?&lt;/p&gt;  &lt;p&gt;Let me explain.&amp;#160; During the development of SQL Server 2005, we realized a couple of tricks: 1) linear regression was the same as our tree algorithm, just forced to not split; and 2) logistic regression was the same as our Neural Nets, just with zero hidden layers.&amp;#160; However, we got similar feedback - people want &lt;em&gt;more algorithms&lt;/em&gt;, and specifically these ones, so we set up two &amp;quot;new algorithms&amp;quot; by forcibly setting parameters on the Decision Tree and Neural Network algorithms and voila! we shipped with nine named algorithms.&amp;#160; It would have been hard to fill up two entire chapters explaining that last sentence, so Zhaohui and I decided just to stick to the seven core algorithms.&lt;/p&gt;  &lt;p&gt;Anyway, this posting isn't really about how to count &lt;em&gt;fewer&lt;/em&gt; algorithms; I really wanted to show you how to count &lt;em&gt;more.&amp;#160; &lt;/em&gt;When we set about designing SQL Server Data Mining, we really and truly tried to make data mining operations simpler.&amp;#160; We thought at the time, rightly or wrongly, that the more options end users have, the more complicated and difficult the product would be to use.&amp;#160; Therefore, we tried to determine the best behavior in a class, and make more advanced options available through parameters.&lt;/p&gt;  &lt;p&gt;For example, take our clustering algorithm.&amp;#160; We assumed that if people wanted clustering, they most likely didn't care about the details of the algorithm and just wanted to get the job done, and that those people who wanted more would look for it (the design principle - make the simple things simple, and the complex things 
possible).&amp;#160; So we bundled up different flavors of clustering into a single package that many vendors would have broken apart.&amp;#160; So let's start counting with clustering.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;1&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Our default clustering behavior is &lt;strong&gt;EM (Expectation Maximization) clustering&lt;/strong&gt; using the Bradley-Fayyad scalable framework.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;2&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Setting a parameter (CLUSTERING_METHOD) changes that to a &lt;strong&gt;K-Means clustering &lt;/strong&gt;implementation using the same framework.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;3+4&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Setting the same parameter another way provides &lt;strong&gt;non-scalable&lt;/strong&gt; versions of the two clustering varieties.&amp;#160; (I know it's hard to swallow that the non-scalable versions count as separate algorithms, but if you &lt;em&gt;started&lt;/em&gt; with the vanilla versions and &lt;em&gt;added&lt;/em&gt; scalability, then &lt;em&gt;of course&lt;/em&gt; you would consider those versions as new algorithms - I'm just working backwards here.)&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;5&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Let's move to our Decision Tree algorithm and we will consider our classification tree as one algorithm.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;6&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;But our Decision Tree also predicts continuous values and counts as a &lt;em&gt;regression&lt;/em&gt; tree, so we will count that as another algorithm.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;7&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Oops!&amp;#160; Our Decision Tree &lt;em&gt;also &lt;/em&gt;creates full linear regressions at each of the leaf 
nodes.&amp;#160; To get the typical regression tree behavior you need to make sure that none of the continuous inputs have the REGRESSOR flag and you get yet another algorithm.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;8&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Oh yeah, our trees allow for multiple targets in each model, allowing the discovery and display of interrelated patterns through our dependency net.&amp;#160; I've seen other vendors advertise such functionality as an &amp;quot;algorithm&amp;quot; so there's our #8.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;9&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;How about collaborative filtering with Trees - just slap a PREDICT flag on a nested table, and you have a complete recommendation system.&amp;#160; Let's call it Associative Trees&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;10&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Naive Bayes.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;11+12&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;If we're going to count Associative Trees, we also have &amp;quot;Associative Bayes&amp;quot;.&amp;#160; I guess the multiple target interrelated pattern thing counts here as well.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;13&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Association Rules.&amp;#160; A-priori style&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;14&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;It seems odd to count association rules twice since we can do predictions with it, but nobody else does it (or didn't before - correct me if I'm wrong), so Predictive Association Rules makes the cut.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;15+16+17+18&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Well if we're going to go and call predictive association an algorithm, we had better do the same for our clustering algorithm.&amp;#160; Granted, clustering doesn't make a great 
classifier or estimator, but the great Highlight Exceptions functionality of the Data Mining addins comes from this ability.&amp;#160; Yes, we can do nested table prediction as well with clustering, but I wouldn't recommend it to my mom, so I won't take another four for that.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;19+20+21+22+23&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;Neural Networks, Sequence Clustering, Time Series, Linear Regression and Logistic Regression.&amp;#160; Yeah, yeah, I could get into varieties here, but I think you get the point.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;So by that count, and not being &lt;em&gt;too &lt;/em&gt;creative (trust me, I can do more), we're looking at &lt;font size="5"&gt;&lt;strong&gt;23 &lt;/strong&gt;&lt;/font&gt;algorithms in SQL Server 2005 Data Mining.&amp;#160; There are a few more options coming up in SQL Server 2008 that are worth discussing as well.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;24&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;The Time Series algorithm in SQL Server 2005 uses the ARTXP algorithm - &amp;quot;Auto Regression Trees with Cross Predict&amp;quot;.&amp;#160; In 2008, we're adding ARIMA as well, for algorithm #24.&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;&lt;font size="5"&gt;25&lt;/font&gt;&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;And yet again with Time Series, the default mode of operation is to blend ARTXP and ARIMA results in an intelligent way to maximize accuracy and stability for #25.&lt;/p&gt;  &lt;p&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;Arbitrarily, there are 23 algorithms in SQL 2005 and 25 in SQL 2008, with the option of teasing out even more varieties depending on how you apply parameters and flags to the base nine (or seven - depending on how you count!).&amp;#160;&amp;#160; Next time someone quips that SQL Server only has &amp;quot;nine&amp;quot; algorithms, tell them that's just the packaging - each of those nine provides a wealth of value in each 
box.&lt;/p&gt;</description><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Decision+Trees/default.aspx">Decision Trees</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Clustering/default.aspx">Clustering</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Association+Rules/default.aspx">Association Rules</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Time+Series/default.aspx">Time Series</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Algorithms/default.aspx">Algorithms</category></item><item><title>Automatically Label Clusters using Analysis Services Stored Procedures</title><link>http://blogs.msdn.com/jamiemac/archive/2007/02/21/automatically-label-clusters-using-analysis-services-stored-procedures.aspx</link><pubDate>Thu, 22 Feb 2007 01:06:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:1737534</guid><dc:creator>JamieMac</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/jamiemac/comments/1737534.aspx</comments><wfw:commentRss>http://blogs.msdn.com/jamiemac/commentrss.aspx?PostID=1737534</wfw:commentRss><description>I just found this awesome &lt;A class="" href="http://www.codeplex.com/ASStoredProcedures/Wiki/View.aspx?title=ClusterNaming&amp;amp;referringTitle=Home" mce_href="http://www.codeplex.com/ASStoredProcedures/Wiki/View.aspx?title=ClusterNaming&amp;amp;referringTitle=Home"&gt;posting on CodePlex&lt;/A&gt; by furmangg to automatically label the clusters generated by the clustering algorithm.&amp;nbsp; It examines the cluster content to generate human-readable descriptions and then changes the labels appropriately.&amp;nbsp; Check it out - it might just save you a bunch of time figuring out what all the clusters mean.</description><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Code/default.aspx">Code</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Clustering/default.aspx">Clustering</category></item>
<item><title>Wisconsin Breast Cancer Dataset available</title><link>http://blogs.msdn.com/jamiemac/archive/2007/02/01/wisconsin-breast-cancer-dataset-available.aspx</link><pubDate>Fri, 02 Feb 2007 01:50:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:1580365</guid><dc:creator>JamieMac</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/jamiemac/comments/1580365.aspx</comments><wfw:commentRss>http://blogs.msdn.com/jamiemac/commentrss.aspx?PostID=1580365</wfw:commentRss><description>&lt;P&gt;Frequently I use the Wisconsin Breast Cancer Dataset for demonstrating the Data Mining&amp;nbsp;Addins for Office&amp;nbsp;- enough people asked, so I made it available as an &lt;A class="" href="http://www.sqlserverdatamining.com/DMCommunity/_Downloads/4390.aspx" mce_href="http://www.sqlserverdatamining.com/DMCommunity/_Downloads/4390.aspx"&gt;Excel 2007 file&lt;/A&gt;&amp;nbsp;(free login required).&amp;nbsp; For purists, the original data is available at the &lt;A class="" href="http://www.ics.uci.edu/~mlearn/MLRepository.html" mce_href="http://www.ics.uci.edu/~mlearn/MLRepository.html"&gt;Machine Learning repository&lt;/A&gt;, which is a great location for many sample datasets.&lt;/P&gt;
&lt;P&gt;Here are some screenshots of the data mining add-ins applied to this dataset:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 1:&amp;nbsp; Key Factor Analysis showing differences between benign and malignant tumors&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;IMG title="Key factors discriminating malignant and benign tumors" alt="Key factors discriminating malignant and benign tumors" src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_kfa.jpg" mce_src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_kfa.jpg"&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 2: Detect categories showing malignancy across detected groups.&amp;nbsp; Note two purely malignant categories suggesting differing classes of malignant tumors.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;IMG title="Malignancy across categories detected by Table Analysis Tools" style="WIDTH: 1054px; HEIGHT: 785px" height=785 alt="Malignancy across categories detected by Table Analysis Tools" src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_dc.jpg" width=1054 mce_src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_dc.jpg"&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Figure 3: Decision tree to predict diagnosis, with nodes shaded based on likelihood of malignancy.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;IMG title="Diagnosis Decision Tree" style="WIDTH: 1073px; HEIGHT: 556px" height=556 alt="Diagnosis Decision Tree" src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_tree.jpg" width=1073 mce_src="http://www.sqlserverdatamining.com/images/wbcd/wbcd_tree.jpg"&gt;&lt;/P&gt;</description><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Excel/default.aspx">Excel</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Decision+Trees/default.aspx">Decision Trees</category><category domain="http://blogs.msdn.com/jamiemac/archive/tags/Clustering/default.aspx">Clustering</category></item></channel></rss>