<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/atom.xsl" media="screen"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US"><title type="html">Andreas De Ruiter&amp;#39;s BI blog</title><subtitle type="html">Best practices on designing business intelligence solutions using SSAS, SSIS and other Microsoft BI tools</subtitle><id>http://blogs.msdn.com/b/andreasderuiter/atom.aspx</id><link rel="alternate" type="text/html" href="http://blogs.msdn.com/b/andreasderuiter/" /><link rel="self" type="application/atom+xml" href="http://blogs.msdn.com/b/andreasderuiter/atom.aspx" /><generator uri="http://telligent.com" version="5.6.50428.7875">Telligent Evolution Platform Developer Build (Build: 5.6.50428.7875)</generator><updated>2012-12-05T16:27:02Z</updated><entry><title>Tracking changes to tables in your data warehouse using snapshot-based versioning</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/b/andreasderuiter/archive/2013/03/19/tracking-changes-to-tables-in-your-data-warehouse-using-snapshot-based-versioning.aspx" /><id>http://blogs.msdn.com/b/andreasderuiter/archive/2013/03/19/tracking-changes-to-tables-in-your-data-warehouse-using-snapshot-based-versioning.aspx</id><published>2013-03-19T14:20:38Z</published><updated>2013-03-19T14:20:38Z</updated><content type="html">&lt;p&gt;Today I will explain a way to track historic changes in a data warehouse and share a little program I wrote that will help you do this in minutes. It’s based on a method I’ve use many times and which is generic enough to cover many different scenarios. I’ll use an example to clarify how this works. In this example, I have a data warehouse database called MyDW, and this database contains a table called ImportTable which I’ll assume to be a table which is imported from another database by the data warehouse’s ETL process. For the purpose of this example I’ve kept thus table very simple:&lt;/p&gt;  &lt;div id="codeSnippetWrapper"&gt;   &lt;pre id="codeSnippet" style="margin: 0em; padding: 0px; width: 100%; text-align: left; color: black; line-height: 12pt; overflow: visible; font-family: &amp;quot;Courier New&amp;quot;, courier, monospace; font-size: 8pt; direction: ltr; background-color: rgb(244, 244, 244);"&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;CREATE&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;TABLE&lt;/span&gt; [dbo].[ImportTable]( &lt;br /&gt;    [ID] [&lt;span style="color: rgb(0, 0, 255);"&gt;int&lt;/span&gt;] &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt;, &lt;br /&gt;    [Name] [nvarchar](100) &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt;, &lt;br /&gt;    [Country] [nvarchar](100) &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt; &lt;br /&gt; ) &lt;span style="color: rgb(0, 0, 255);"&gt;ON&lt;/span&gt; [&lt;span style="color: rgb(0, 0, 255);"&gt;PRIMARY&lt;/span&gt;]&lt;/pre&gt;

  &lt;br /&gt;It’s important to understand that the ETL process periodically refreshes this table by fetching fresh data from an external database. Records might be added, removed or changed. The ETL process might simply update it by truncating the existing table and reloading it entirely. The ID field is the primary key which is imported from the external database (for this example we don’t add our own ID in the data warehouse).&lt;/div&gt;

&lt;p&gt;I called this method of tracking history “snapshot-based versioning”. It does &lt;em&gt;not &lt;/em&gt;rely on database triggers or on CDC. While those features might be useful for other scenarios, I’ve found them too limiting for data warehouse scenarios. For example, these features don’t work well if your ETL process refreshes the data by truncating the existing table and then reloading it.&lt;/p&gt;

&lt;p&gt;Snapshot-based versioning instead relies on TSQL’s &lt;a href="http://technet.microsoft.com/en-us/library/bb510625.aspx"&gt;MERGE&lt;/a&gt; statement. The MERGE statement is supported in SQL Server 2008 or higher. A limitation is that you cannot track changes to fields of type image, ntext and text. This is usually not a blocker because in real-life scenarios most tables have a “modified” field that is updated whenever a change to the record is changed.&lt;/p&gt;

&lt;p&gt;The idea is to have a second table called H_ImportTable which has the same fields as the original ImportTable. This second table, which I’ll refer to as the “history table”, has a versioned copy of the data in H_ImportTable. In other words, it works like a log which appends a row for any change to ImportTable. Obviously, if the table in the external database has changed multiple times between two ETL runs, this will appear only as one change in H_ImportTable, hence the name “snapshot-based versioning”.&lt;/p&gt;

&lt;p&gt;You can create H_ImportTable as follows:&lt;/p&gt;

&lt;div id="codeSnippetWrapper"&gt;
  &lt;pre id="codeSnippet" style="margin: 0em; padding: 0px; width: 100%; text-align: left; color: black; line-height: 12pt; overflow: visible; font-family: &amp;quot;Courier New&amp;quot;, courier, monospace; font-size: 8pt; direction: ltr; background-color: rgb(244, 244, 244);"&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;CREATE&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;TABLE&lt;/span&gt; [dbo].[H_ImportTable] ( &lt;br /&gt;    [_ACTION] [&lt;span style="color: rgb(0, 0, 255);"&gt;char&lt;/span&gt;](1) &lt;span style="color: rgb(0, 0, 255);"&gt;NOT&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt;, &lt;br /&gt;    [_KEY] [&lt;span style="color: rgb(0, 0, 255);"&gt;int&lt;/span&gt;] &lt;span style="color: rgb(0, 0, 255);"&gt;IDENTITY&lt;/span&gt;(1,1) &lt;span style="color: rgb(0, 0, 255);"&gt;NOT&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt;, &lt;br /&gt;    [_EFF_FROM] [datetime] &lt;span style="color: rgb(0, 0, 255);"&gt;NOT&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt;, &lt;br /&gt;    [_EFF_TO] [datetime] &lt;span style="color: rgb(0, 0, 255);"&gt;NOT&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt;, &lt;br /&gt;    [ID] [&lt;span style="color: rgb(0, 0, 255);"&gt;int&lt;/span&gt;] &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt;, &lt;br /&gt;    [Name] [nvarchar](100) &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt;, &lt;br /&gt;    [Country] [nvarchar](100) &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt; &lt;br /&gt; )&lt;/pre&gt;

  &lt;br /&gt;As you see there are four additional fields in this table:&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;_ACTION indicates whether a row has been added (‘A’), changed (‘C’), deleted (‘D’) or undeleted (‘U’).&lt;/li&gt;

  &lt;li&gt;_KEY is a unique row identifier. This field is only useful in specific scenarios&lt;/li&gt;

  &lt;li&gt;_EFF_FROM indicates the date/time from which this row is effective&lt;/li&gt;

  &lt;li&gt;_EFF_TO indicates the date/time until which this row is active. In case the row is effective now, this field is set to 9999-12-31.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After refreshing ImportTable, the ETL process should update H_ImportTable using the following SQL statement:&lt;/p&gt;

&lt;div id="codeSnippetWrapper" style="margin: 20px 0px 10px; padding: 4px; border: 1px solid silver; width: 97.45%; height: 310px; text-align: left; line-height: 12pt; overflow: auto; font-family: &amp;quot;Courier New&amp;quot;, courier, monospace; font-size: 8pt; cursor: text; direction: ltr; max-height: 200px; background-color: rgb(244, 244, 244);"&gt;
  &lt;pre id="codeSnippet" style="margin: 0em; padding: 0px; width: 100%; height: 331px; text-align: left; color: black; line-height: 12pt; overflow: visible; font-family: &amp;quot;Courier New&amp;quot;, courier, monospace; font-size: 8pt; direction: ltr; background-color: rgb(244, 244, 244);"&gt;INSERT &lt;span style="color: rgb(0, 0, 255);"&gt;INTO&lt;/span&gt; &lt;u&gt;[dbo].[H_ImportTable]&lt;/u&gt; &lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;SELECT&lt;/span&gt; _ACTION, SYSDATETIME() &lt;span style="color: rgb(0, 0, 255);"&gt;AS&lt;/span&gt; _EFF_FROM, &lt;span style="color: rgb(0, 96, 128);"&gt;'9999-12-31'&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;AS&lt;/span&gt; _EFF_TO,&lt;u&gt;[ID],[Name],[Country]&lt;/u&gt; &lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;FROM&lt;/span&gt; &lt;br /&gt; ( &lt;br /&gt;  MERGE &lt;u&gt;[dbo].[H_ImportTable]&lt;/u&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;AS&lt;/span&gt; T &lt;br /&gt;  &lt;span style="color: rgb(0, 0, 255);"&gt;USING&lt;/span&gt; &lt;u&gt;[dbo].[ImportTable]&lt;/u&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;AS&lt;/span&gt; S &lt;br /&gt;  &lt;span style="color: rgb(0, 0, 255);"&gt;ON&lt;/span&gt; (S.&lt;u&gt;[ID]&lt;/u&gt;=T.&lt;u&gt;[ID]&lt;/u&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;AND&lt;/span&gt; T._EFF_TO=&lt;span style="color: rgb(0, 96, 128);"&gt;'9999-12-31'&lt;/span&gt;) &lt;br /&gt;  &lt;span style="color: rgb(0, 0, 255);"&gt;WHEN&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;NOT&lt;/span&gt; MATCHED &lt;span style="color: rgb(0, 0, 255);"&gt;BY&lt;/span&gt; TARGET &lt;span style="color: rgb(0, 0, 255);"&gt;THEN&lt;/span&gt; &lt;br /&gt;    INSERT(_ACTION,_EFF_FROM,_EFF_TO,&lt;u&gt;[ID],[Name],[Country]&lt;/u&gt;) &lt;br /&gt;    &lt;span style="color: rgb(0, 0, 255);"&gt;VALUES&lt;/span&gt;(&lt;span style="color: rgb(0, 96, 128);"&gt;'A'&lt;/span&gt;,SYSDATETIME(),&lt;span style="color: rgb(0, 96, 128);"&gt;'9999-12-31'&lt;/span&gt;,&lt;u&gt;[S].[ID],[S].[Name],[S].[Country]&lt;/u&gt;) &lt;br /&gt;  &lt;span style="color: rgb(0, 0, 255);"&gt;WHEN&lt;/span&gt; MATCHED &lt;span style="color: rgb(0, 0, 255);"&gt;AND&lt;/span&gt; T._EFF_TO=&lt;span style="color: rgb(0, 96, 128);"&gt;'9999-12-31'&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;AND&lt;/span&gt; (T._ACTION=&lt;span style="color: rgb(0, 96, 128);"&gt;'D'&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;OR&lt;/span&gt; (&lt;u&gt;[S].[Name]&amp;lt;&amp;gt;[T].[Name] &lt;span style="color: rgb(0, 0, 255);"&gt;OR&lt;/span&gt; [S].[Country]&amp;lt;&amp;gt;[T].[Country])&lt;/u&gt;) &lt;span style="color: rgb(0, 0, 255);"&gt;THEN&lt;/span&gt; &lt;br /&gt;    &lt;span style="color: rgb(0, 0, 255);"&gt;UPDATE&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;SET&lt;/span&gt; T._EFF_TO=SYSDATETIME() &lt;br /&gt;  &lt;span style="color: rgb(0, 0, 255);"&gt;WHEN&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;NOT&lt;/span&gt; MATCHED &lt;span style="color: rgb(0, 0, 255);"&gt;BY&lt;/span&gt; SOURCE &lt;span style="color: rgb(0, 0, 255);"&gt;AND&lt;/span&gt; T._EFF_TO=&lt;span style="color: rgb(0, 96, 128);"&gt;'9999-12-31'&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;AND&lt;/span&gt; T._ACTION&amp;lt;&amp;gt;&lt;span style="color: rgb(0, 96, 128);"&gt;'D'&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;THEN&lt;/span&gt; &lt;br /&gt;    &lt;span style="color: rgb(0, 0, 255);"&gt;UPDATE&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;SET&lt;/span&gt; T._EFF_TO=SYSDATETIME() &lt;br /&gt;  &lt;span style="color: rgb(0, 0, 255);"&gt;OUTPUT&lt;/span&gt; $&lt;span style="color: rgb(0, 0, 255);"&gt;Action&lt;/span&gt; Action_Out &lt;br /&gt;         ,&lt;span style="color: rgb(0, 0, 255);"&gt;CASE&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;WHEN&lt;/span&gt; S.&lt;u&gt;[ID]&lt;/u&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;IS&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;NULL&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;THEN&lt;/span&gt; &lt;span style="color: rgb(0, 96, 128);"&gt;'D'&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;WHEN&lt;/span&gt; Inserted._ACTION=&lt;span style="color: rgb(0, 96, 128);"&gt;'D'&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;THEN&lt;/span&gt; &lt;span style="color: rgb(0, 96, 128);"&gt;'U'&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;ELSE&lt;/span&gt; &lt;span style="color: rgb(0, 96, 128);"&gt;'C'&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;END&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;AS&lt;/span&gt; _ACTION &lt;br /&gt;         ,ISNULL(S.&lt;u&gt;[ID]&lt;/u&gt;,Deleted.&lt;u&gt;[ID]&lt;/u&gt;) &lt;span style="color: rgb(0, 0, 255);"&gt;AS&lt;/span&gt; &lt;u&gt;ID,[S].[Name],[S].[Country]&lt;/u&gt; &lt;br /&gt; ) &lt;span style="color: rgb(0, 0, 255);"&gt;AS&lt;/span&gt; MERGE_OUT &lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;WHERE&lt;/span&gt; MERGE_OUT.Action_Out = &lt;span style="color: rgb(0, 96, 128);"&gt;'UPDATE'&lt;/span&gt; &lt;br /&gt;&lt;/pre&gt;

  &lt;br /&gt;&lt;/div&gt;

&lt;div&gt;&amp;#160;&amp;#160;&amp;#160; &lt;br /&gt;The statement above does all of the heavy lifting to update the history table H_ImportTable. Because MERGE is just a single SQL statement, it is very efficient compared to using multiple SELECT, INSERT and UPDATE statements to accomplish the same. The statement makes the assumption that the ID field is unique in the source table and you’re advised to create a unique index on that field in ImportTable. Obviously, in the history table H_ImportTable this field will no longer be unique. &lt;/div&gt;

&lt;div&gt;&amp;#160;&lt;/div&gt;

&lt;div&gt;For performance you’ll want to create the following indexes on the history table H_ImprotTable: ID, _ACTION, _EFF_FROM and _EFF_TO.&lt;/div&gt;

&lt;p&gt;You can adjust this SQL code to work with the tables in your data warehouse. To do so, adjust all the parts that are underlined. If you need to do this for many tables, making these changes manually is laborious and error-prone, This is why I wrote the program “Table Merge Scripts” which you can &lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11/2055.Table-Merge-Scripts.zip"&gt;download here&lt;/a&gt;. This program generates the SQL code based on an existing table in the database. You can see the UI below:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/3022.image_5F00_75C89363.png"&gt;&lt;img title="image" style="border: 0px currentcolor; display: inline;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/5670.image_5F00_thumb_5F00_6D64BE0C.png" width="466" height="345" /&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;First, edit the connection string by entering the correct server name and database name. Then click Connect.&lt;/p&gt;

&lt;p&gt;Next select the table and the field that is the primary key.&lt;/p&gt;

&lt;p&gt;The first button in the bottom half of the screen copies the SQL code to create the history table to the clipboard. The new table will have the same name as the selected source table, prefixed with “H_”. The code also creates indexes for _ACTION, _EFF_FROM, _EFF_TO, _KEY and the field which you designated as the primary key in the source table.&lt;/p&gt;

&lt;p&gt;The next button copies the SQL code to update the history table to the clipboard. This consists of a MERGE command similar to the example shown above.&lt;/p&gt;

&lt;p&gt;The final button copies the same SQL code as the second table, but wrapped in a stored procedure. &lt;/p&gt;

&lt;p&gt;Good luck!&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10403489" width="1" height="1"&gt;</content><author><name>Andreas de Ruiter (Microsoft)</name><uri>http://blogs.msdn.com/mobileiq_4000_outlook.com/ProfileUrlRedirect.ashx</uri></author><category term="ETL" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/ETL/" /><category term="data warehouse" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/data+warehouse/" /><category term="SQL" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/SQL/" /></entry><entry><title>SSMS 2012 freezes when being launched</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/18/ssms-2012-freezes-when-being-launched.aspx" /><id>http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/18/ssms-2012-freezes-when-being-launched.aspx</id><published>2012-12-18T14:24:53Z</published><updated>2012-12-18T14:24:53Z</updated><content type="html">&lt;p&gt;I just spent 30 minutes on SSMS which hanged on me every time I tried to launch it. I searched the Internet but couldn’t find a solution. The SMSS process was heavily using the CPU and slowing down the machine, but after the splash screen the SSMS application window wouldn’t appear.&lt;/p&gt;  &lt;p&gt;Then I downloaded &lt;a href="http://technet.microsoft.com/en-us/sysinternals/bb896645"&gt;Process Monitor&lt;/a&gt;, saw that the process was continuously using a file called C:\Users\&lt;em&gt;username&lt;/em&gt;\AppData\Local\Microsoft\SQL Server Management Studio\11.0\1033\Ssms.CTM.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/0550.image_5F00_7D60006A.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/1121.image_5F00_thumb_5F00_400F25D6.png" width="644" height="263" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;I killed the SSMS process, renamed the Ssms.CTM file, and launched SSMS again. Problem solved!&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10379040" width="1" height="1"&gt;</content><author><name>Andreas de Ruiter (Microsoft)</name><uri>http://blogs.msdn.com/mobileiq_4000_outlook.com/ProfileUrlRedirect.ashx</uri></author><category term="SSMS" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/SSMS/" /></entry><entry><title>Solving “You cannot activate the relationship because a set of active relationships already exists between tables…” in PowerPivot and tabular cubes</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/12/solving-you-cannot-activate-the-relationship-because-a-set-of-active-relationships-already-exists-between-tables-in-powerpivot-and-tabular-cubes.aspx" /><id>http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/12/solving-you-cannot-activate-the-relationship-because-a-set-of-active-relationships-already-exists-between-tables-in-powerpivot-and-tabular-cubes.aspx</id><published>2012-12-12T15:05:00Z</published><updated>2012-12-12T15:05:00Z</updated><content type="html">&lt;p&gt;When you create relationships between tables in PowerPivot (or in a SSAS tabular cube), you may run into the following message:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/3173.image_5F00_3642643F.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/1411.image_5F00_thumb_5F00_4E65CE9A.png" width="644" height="133" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;Below I’ll explain a way to solve this.&lt;/p&gt;  &lt;p&gt;By setting relationships in the designer you tell PowerPivot how the tables are related. This is important because that’s how PowerPivot knows how to slice and dice data from one table by attributes from other tables. PowerPivot needs to know unambiguously which records in one table are related to records in another table. Therefore, if you follow the relationships in the direction of the arrows, there may only be one direct or indirect path from any table to any other table. &lt;/p&gt;  &lt;p&gt;When PowerPivot encounters two conflicting paths, it will automatically deactivate one of them. In the PowerPivot designer this path then appears as a dashed arrow. If you wish to reactivate a relationship, you first must deactivate another conflicting relationship. Unfortunately, in real live scenarios you’ll often have a business need for both conflicting relationships, so by changing which relationship is deactivated you’re only moving the problem. Fortunately, there is a way to solve this!&lt;/p&gt;  &lt;p&gt;Let’s take a simple example:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/5621.image_5F00_4601F943.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/8838.image_5F00_thumb_5F00_72AAB31C.png" width="504" height="348" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;In this example we have the three tables:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;Location – this is a table with addresses (for simplicity we only included ZipCode above)&lt;/li&gt;    &lt;li&gt;Customer – this is a table of customers. The customer table has a reference to the Location table for the customer’s home address.&lt;/li&gt;    &lt;li&gt;Sales – this is a table of sales orders. The table has a reference to the customer table as well as a reference to the Location table for the address of the store where the sales was made.&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;Also, note that the Customer and the Sales table have calculated fields. These are calculated aggregates (not calculated columns), which is important as you’ll see shortly. &lt;/p&gt;  &lt;p&gt;Assume you have the following business requirements:&lt;/p&gt;  &lt;ol&gt;   &lt;li&gt;You should be able to show sales by store location&lt;/li&gt;    &lt;li&gt;You should be able to show sales by household size&lt;/li&gt;    &lt;li&gt;You should be able to show average household size by ZipCode&lt;/li&gt; &lt;/ol&gt;  &lt;p&gt;As you see in the diagram above, one of the relationships was deactivated because there were multiple paths from Sales to Location. Because the relationship from Sales to Customer has been deactivated, requirement 2 cannot be satisfied. You can try to solve this by first deactivating another relationship, however, that will lead to one of the other business requirement not being satisfied.&lt;/p&gt;  &lt;p&gt;You can solve this by importing certain tables two times and re-routing relationships. The $64.000 question is: which tables should you load twice and where to put the relationships? I’ll first explain the steps and then the rational behind them.&lt;/p&gt;  &lt;h1&gt;Steps to solve the problem&lt;/h1&gt;  &lt;ol&gt;   &lt;li&gt;Rename all tables which have one or more calculated aggregates so that their friendly names in PowerPivot are prefixed with “Fact”. In our example the Sales table is renamed to FactSales and Customer is renamed to “FactCustomer”.&lt;/li&gt;    &lt;li&gt;Import all tables which you just renamed a second time into the model. In our example we’ll import Sales and Customer a second time (keeping their friendly names unchanged). Do not create calculated aggregates on these tables.&lt;/li&gt;    &lt;li&gt;Remove all relationships and recreate them such that all relationships originate from a fact table and their destination is a non-fact table. Don’t forget to also create relationships from the fact tables to their non-fact sibling tables.&lt;/li&gt; &lt;/ol&gt;  &lt;p&gt;The diagram for the example project now looks like this:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/4113.image_5F00_430C9490.png"&gt;&lt;img title="image" style="display: inline; background-image: none;" border="0" alt="image" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/2553.image_5F00_thumb_5F00_0F642832.png" width="644" height="356" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;All relationships are active and all of the business requirements can now be satisfied!&lt;/p&gt;  &lt;p&gt;Finally there’s some cleaning up we can do. The fact and non-fact tables have a lot of duplicate fields. My advice is to hide all the columns on the fact tables in PowerPivot so only the calculated aggregates from these tables are exposed to Excel or another OLAP client.&lt;/p&gt;  &lt;p&gt;Once everything is working you can also remove unneeded columns from the data model in order to minimize the amount of data loaded into PowerPivot or the tabular cube.&lt;/p&gt;  &lt;h1&gt;Rational behind this method&lt;/h1&gt;  &lt;p&gt;The ordinary way many BI experts model data in OLAP solutions is through“dimensional modeling”. Dimensional modeling implies that there are basically two types of tables: dimension tables and fact tables. Dimension tables typically hold text fields that are used for slicing and dicing (e.g. customer name, product name, country) whereas fact tables have numeric fields which can be aggregated (e.g. sales amount). Fact tables also have foreign key columns which link them to dimension tables. This leads to “star-schema” diagrams where each fact table is linked via a 1:many relationship to multiple dimensions tables. (There’s a variant known as “snowflake-schema” where dimension tables can also link to other dimension tables.)&lt;/p&gt;  &lt;p&gt;PowerPivot’s tabular model is simpler and more intuitive to business users since it doesn’t force them to learn dimensional modeling. The downside is that you easily run into problems like the one we covered here. The method I described in this post uses the principles of dimensional modeling to solve the problem. &lt;/p&gt;  &lt;p&gt;I tried to explain dimension modeling in a couple of sentences, but it would take a few hundred pages to discuss it thoroughly. Even though PowerPivot does not enforce dimensional modeling, understanding the concepts and principles of it helps you design better solutions. If you want to learn more about dimensional modeling I recommend reading Ralph Kimball’s book “The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling”. &lt;/p&gt;  &lt;p&gt;If you’ve found another (better?) way of solving conflicting relationships in PowerPivot model, please let me know!&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10376595" width="1" height="1"&gt;</content><author><name>Andreas de Ruiter (Microsoft)</name><uri>http://blogs.msdn.com/mobileiq_4000_outlook.com/ProfileUrlRedirect.ashx</uri></author><category term="tabular cube" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/tabular+cube/" /><category term="PowerPivot" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/PowerPivot/" /><category term="dimensional modeling" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/dimensional+modeling/" /></entry><entry><title>Should you use a data warehouse with a tabular cube?</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/10/should-you-use-a-data-warehouse-with-a-tabular-cube.aspx" /><id>http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/10/should-you-use-a-data-warehouse-with-a-tabular-cube.aspx</id><published>2012-12-10T09:15:00Z</published><updated>2012-12-10T09:15:00Z</updated><content type="html">&lt;p&gt;With SQL Server 2012, Microsoft introduced SSAS “tabular mode”. Tabular mode essentially is a server-hosted version of PowerPivot. In fact, it’s very easy to take an Excel workbook with a PowerPivot model and import that to SSAS as a tabular cube. &lt;/p&gt;  &lt;p&gt;When you convert a PowerPivot model to a tabular cube, the new cube fetches data from external data sources just like the original PowerPivot model in Excel did. In other words, you don’t need to have a data warehouse database “in front” of your tabular cube. The question I’ll address below is whether it’s a good idea to have a data warehouse in combination with a tabular cube anyway.&lt;/p&gt;  &lt;p&gt;You can build your BI solution with varying degrees of sophistication. More sophisticated is not necessarily better, and you should determine which best fits your needs. I’ll describe the three main scenarios as I see them below.&lt;/p&gt;  &lt;h4&gt;Scenario 1: No data warehouse&lt;/h4&gt;  &lt;p&gt;As I mentioned above, you can have a tabular cube without having a data warehouse database. A benefit of this approach is that the only skills you need to learn is how to deploy a PowerPivot model to a server-based tabular cube, and how to create a job that will process this cube according to a certain time schedule. Another advantage is that you can continue developing your solution in Excel, without needing to learn and install the Visual Studio-based tools. &lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/4113.clip_5F00_image002_5F00_67D6B750.jpg"&gt;&lt;img title="clip_image002" style="border: 0px currentcolor; display: inline; background-image: none;" border="0" alt="clip_image002" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/0871.clip_5F00_image002_5F00_thumb_5F00_51A09BFE.jpg" width="625" height="281" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;The nice thing of a server-based tabular cube over using PowerPivot in Excel is that you no longer get gigantic Excel files, sometimes larger than 100MB. Your Excel files can have all of the PowerPivot functionality, but since the raw data is stored in the cube instead of the Excel files, your workbooks become very small, often smaller than 100KB. &lt;/p&gt;  &lt;p&gt;You can also more easily share the Excel files with peers. As long as you grant them access, they can refresh the Excel file by clicking Refresh All in the data ribbon. Furthermore, the tabular cube enables web-based reports.&lt;/p&gt;  &lt;p&gt;The bottom line is that there are lots of benefits of converting your PowerPivot solution to tabular cubes when you need to share or publish the data.&lt;/p&gt;  &lt;h4&gt;Scenario 2: Data warehouse as staging area only&lt;/h4&gt;  &lt;p&gt;The next step up the ladder of sophistication is to maintain a data warehouse database for staging data. Using a simple ETL process, presumably built using an Integration Services project in Visual Studio, the external data is retrieved from the external data sources and stored unmodified in staging tables in the data warehouse. The cube is reconfigured to pull data from the staging tables instead of the external data sources.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/6170.clip_5F00_image004_5F00_654D8592.jpg"&gt;&lt;img title="clip_image004" style="border: 0px currentcolor; display: inline; background-image: none;" border="0" alt="clip_image004" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/2860.clip_5F00_image004_5F00_thumb_5F00_75E58080.jpg" width="624" height="287" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;If you’re a business user without BI development experience building this can challenging. You’ll need to learn how to create a SQL database for the data warehouse with the required staging tables, and how to develop an ETL process that copies data from the external sources into the staging tables. This will require you to learn how to use SQL Server Management Studio and some of the BI design tools in Visual Studio. &lt;/p&gt;  &lt;p&gt;One of the main benefits over having no data warehouse is that you’re now no longer limited to the types of external data sources supported by PowerPivot. Also,&amp;#160; you can make your BI system more resilient whereby the cube can be processed even though some of the external data sources cannot be refreshed, for example because they are offline.&lt;/p&gt;  &lt;h4&gt;Scenario 3: Data warehouse for staging and transforming data&lt;/h4&gt;  &lt;p&gt;The most supplicated approach I’ll discuss here, the one I commonly use in my BI projects, is to use the data warehouse for both staging raw data as well as transforming that data to an OLAP-friendly data model. This is what I called the “using staging tables” approach to data warehousing in my blog post &lt;a href="http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/05/designing-an-etl-process-with-ssis-two-approaches-to-extracting-and-transforming-data.aspx"&gt;Designing an ETL process with SSIS: two approaches to extracting and transforming data&lt;/a&gt;. The cube gets data from the pre-transformed tables/views in the data warehouse database instead of pulling it from the staging tables as in the previous scenario. Now that you use views and stored procedures to model the data, you no longer need to do those transformations in the tabular cube by renaming and filtering columns etc. &lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/7750.clip_5F00_image006_5F00_1B6EFDE2.jpg"&gt;&lt;img title="clip_image006" style="border: 0px currentcolor; display: inline; background-image: none;" border="0" alt="clip_image006" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/6663.clip_5F00_image006_5F00_thumb_5F00_40F87B43.jpg" width="624" height="281" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;The OLAP-friendly data model provides an extra layer of abstraction which is beneficial for many reasons. When changes are made to the data sources (and therefore also the staging tables), breaking changes can be shielded from the cube and from the reports that depend of them. Also, business users can access the same data they know from the cube using SQL queries, which can be very helpful in certain scenarios. Finally, the transformations in data warehouse can be much more powerful than what it possible in the cube. For example, you can combine multiple tables into one using the SQL UNION statement, which would be impossible to do inside the tabular cube.&lt;/p&gt;  &lt;p&gt;Hence, this scenario enables very powerful solutions.&lt;/p&gt;  &lt;h3&gt;Which scenario to choose?&lt;/h3&gt;  &lt;p&gt;If you’re a business analyst and not a software developer, the “No Data warehouse” method will be achievable and you’ll be able to better share and publish reports. &lt;/p&gt;  &lt;p&gt;If you are a software/BI developer, building a data warehouse will make your solutions more robust and powerful. The benefits of including transformations in the data warehouse (3&lt;sup&gt;rd&lt;/sup&gt; scenario) outweighs the additional amount of work compared to the 2&lt;sup&gt;nd&lt;/sup&gt; scenario, which is why I always choose include transformations in the data warehouse. &lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10375620" width="1" height="1"&gt;</content><author><name>Andreas de Ruiter (Microsoft)</name><uri>http://blogs.msdn.com/mobileiq_4000_outlook.com/ProfileUrlRedirect.ashx</uri></author><category term="data warehouse" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/data+warehouse/" /><category term="tabular cube" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/tabular+cube/" /></entry><entry><title>Designing an ETL process with SSIS: two approaches to extracting and transforming data</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/05/designing-an-etl-process-with-ssis-two-approaches-to-extracting-and-transforming-data.aspx" /><id>http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/05/designing-an-etl-process-with-ssis-two-approaches-to-extracting-and-transforming-data.aspx</id><published>2012-12-05T16:27:02Z</published><updated>2012-12-05T16:27:02Z</updated><content type="html">&lt;p&gt;On the Internet you find a lot of articles explaining the concepts like the data warehouse and ETL. There is also lots of information about products and tools such as SSIS. But, when I first starting building BI solutions, I was hardly able to find any architectural guidance for best practices for building BI solutions based Microsoft’s tools. Microsoft has a wonderful set of tools to create powerful BI solutions, nevertheless the learning curve is pretty steep. After learning a lot of lessons the hard way I decided to share my best practices in this blog.&lt;/p&gt;  &lt;p&gt;In this post I’ll compare the two approaches on how to extract and transform (the ‘E’ and ‘T’ in ‘ETL’) data from external databases using SSIS and SQL Server. Understanding the options will help you avoid common pitfalls. &lt;/p&gt;  &lt;p&gt;The ‘T’ is ETL stands for transformation. The goal of transformation is to convert raw input data to an OLAP-friendly data model. This is also known as dimensional modeling. If you’re not familiar with this I highly recommend reading Ralph Kimball’s book “The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling”. Understanding dimensional modeling is essential for building BI systems. &lt;/p&gt;  &lt;p&gt;The first approach is what I call the “obvious approach”. Many tutorials about SSIS are written from the standpoint that you obviously extract and transform data using data flows in SSIS. I’ve run into many roadblocks using the obvious approach, which led me to an alternative approach which relies on using staging tables. This staging table approach not only solves many of the issues I had before, it also provides a simple and robust way to design and build ETL processes in SSIS. Once your familiar with this approach, you’ll be able to create a basic BI solutions hours instead of days. &lt;/p&gt;  &lt;p&gt;Using staging tables will also set you up for something I call snapshot-based history tracking, which is an exciting design pattern I’ll cover in a future blog.&lt;/p&gt;  &lt;h3&gt;What’s ETL again?&lt;/h3&gt;  &lt;p&gt;When designing the ETL process it’s good to think about the three fundamental things it needs to do:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;&lt;b&gt;Extract&lt;/b&gt; data from the external data sources such as line-of-business systems, CRM systems, relational databases, web services, and SharePoint lists.&lt;/li&gt;    &lt;li&gt;&lt;strong&gt;Transform&lt;/strong&gt; the data. This includes cleansing the data and converting it to a OLAP-friendly data model. The OLAP-friendly data model traditionally consists of dimension and fact tables in a star or snowflake schema and closely maps SSAS’s dimensional model (SSAS stands for SQL Server Analyses Services ). &lt;/li&gt;    &lt;li&gt;&lt;b&gt;Load&lt;/b&gt; the data so that it can be quickly accessed by querying tools such as reports. In practice this implies processing SSAS cubes.&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;An ETL process is a program that periodically runs on a server and orchestrates the refresh of the data in the BI system. SQL Server Integration Services (SSIS) is a development tool and runtime that is optimized for building ETL processes. Learning SSIS involves a steep learning curve and if you have a software development background like I do, you might first be inclined to build your ETL program from scratch using a general purpose programming language such as C#. However, once you master SSIS you’ll be able to write very efficient ETL processes much more quickly. This is because SSIS lets you design ETL processes in a graphical way (but if needed you can write parts using VB or C#). The SSIS components are highly optimized for ETL type tasks and the SSIS run-time executes independent tasks in parallel where possible. If you’re a programmer you’ll find it amazingly difficult to write your own ETL process using a general purpose language and make it run more efficient than one developed in SSIS.&lt;/p&gt;  &lt;h3&gt;The obvious approach: using SSIS data flows to transform the data&lt;/h3&gt;  &lt;p&gt;In SSIS you can design your ETL process using control flows and data flows. Data flows in SSIS are a type of control flow that allow you to extract data from an external data sources, flow that data through a number of transformations such as sorting, filtering, merging it with other data and converting data types, and finally store the result at a destination, usually a table in the data warehouse. This is very powerful and data flows seem to lend themselves very well for integrating the extract and transformation tasks within them. This is why I call this the “obvious” approach and many tutorials about SSIS follow this approach. The obvious approach seems especially attractive because it is very efficient and there’s no need to store intermediate results. The figure below illustrates this process:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/5621.clip_5F00_image001_5F00_6D0203FB.jpg"&gt;&lt;img title="clip_image001" style="border: 0px currentcolor; display: inline; background-image: none;" border="0" alt="clip_image001" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/0841.clip_5F00_image001_5F00_thumb_5F00_32A68E1A.jpg" width="500" height="242" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;The top level control flow in the Integration Services project may look like this:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/2061.clip_5F00_image003_5F00_1C7072C8.jpg"&gt;&lt;img title="clip_image003" style="display: inline; background-image: none;" border="0" alt="clip_image003" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/3566.clip_5F00_image003_5F00_thumb_5F00_094F461C.jpg" width="504" height="499" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;The “Extract and Transform” box is a sequence container that holds a data flow for each of the tables that will be refreshed in the data warehouse. In this example there is one fact table and there are three dimension tables. SSIS will execute the data flows in parallel, and when all of them have successfully completed the cube will be processed.&lt;/p&gt;  &lt;p&gt;The transformation of data takes place in the data flows. The transformations needed in each of the data flows would typically look something like this:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/7774.clip_5F00_image005_5F00_73192AC9.gif"&gt;&lt;img title="clip_image005" style="display: inline; background-image: none;" border="0" alt="clip_image005" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/6278.clip_5F00_image005_5F00_thumb_5F00_6D5E1123.gif" width="504" height="494" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;h3&gt;What’s wrong with the obvious approach?&lt;/h3&gt;  &lt;p&gt;There’s nothing necessarily wrong with obvious approach, it works fine and when the transformations are designed well it’s very efficient. But there are several reasons why I dislike using this approach:&lt;/p&gt;  &lt;p&gt;· I have found that developing and debugging transformations within data flows is very time-consuming. You’re working on a very low abstract level, as if you’re building a complex machine from scratch using basic components such as nuts and bolts.&lt;/p&gt;  &lt;p&gt;· Every time you test run a data flow it will need to fetch data from the external data source. This slows down testing and debugging even more because the external data source might have millions of rows of data. Moreover, you are putting a high burden on the external data source, which is a production database.    &lt;br /&gt;Of course you can work around this by having a test external database on your local development machine, but then you need to spend time creating it and filling it with a subset of the data. &lt;/p&gt;  &lt;p&gt;· When you update an existing data flow that already is in production, you must be very careful, especially in case the data flow maintains state such as with a slowly moving dimension with historic attributes. Your changed data flow may no longer be compatible with the stored state, causing lots of headaches.&lt;/p&gt;  &lt;h3&gt;Using staging tables&lt;/h3&gt;  &lt;p&gt;A much better approach is to keep extraction and transformation as two strictly separated steps. First you extract data from the external data source and store a “raw” copy of the data in staging tables in the data warehouse. With “raw” I mean that you keep the column names the same as in the source database and you don’t convert data, calculate new data fields, etc. You may however filter unneeded rows and columns as you extract data so that you don’t waste resources on unneeded data. That being said, if size and performance are not an issue it’s more convenient to just load the entire source tables.&lt;/p&gt;  &lt;p&gt;As with the obvious approach, you use data flow components to pull the data from the data sources. However, since they are now only used to extract data, the flows will be much simpler. Instead of using data flows in SSIS to transform the data to dimension and fact tables, you can now use database views to convert the raw data in the staging tables to dimension and fact views. At some point before deploying to production you will likely replace views by tables and stored procedures. I’ll come back to that later.&lt;/p&gt;  &lt;p&gt;Tip: There’s a quick and easy way to create staging tables from within SSIS. When you add a destination data source to a data flow, connect it to incoming data and then edit the destination’s properties, you will be asked to select a destination table. At that point you also have the option of creating a new table. Doing so will create a database table with all the right columns.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/5621.clip_5F00_image006_5F00_2BE35ECA.jpg"&gt;&lt;img title="clip_image006" style="border: 0px currentcolor; display: inline; background-image: none;" border="0" alt="clip_image006" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/6278.clip_5F00_image006_5F00_thumb_5F00_0A83B92E.jpg" width="500" height="233" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;The top level control flow now looks like this:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/7840.clip_5F00_image008_5F00_0D496E21.gif"&gt;&lt;img title="clip_image008" style="display: inline; background-image: none;" border="0" alt="clip_image008" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/2072.clip_5F00_image008_5F00_thumb_5F00_60C03E3A.gif" width="504" height="606" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;Instead of combining Extract and Transform, they have now become separate steps in the top level control flow. &lt;/p&gt;  &lt;p&gt;When you look at the Extract sequence container, you’ll there are now more data flows than in the obvious approach because we now have a data flow per staging table instead of one per dimension/fact table. If the source data is highly normalized as in this example, the number of staging tables will typically be bigger than the number of dimension and fact tables.&lt;/p&gt;  &lt;p&gt;For each data source we now have two operations: truncate the table and then retrieve data through a data flow task. This is because this example uses a very simple approach to refresh the data in our data warehouse: first empty the table (which contains data from the previous ETL run) using an SQL TRUNCATE TABLE statement, and then reload the entire table from the data source using a data flow. &lt;/p&gt;  &lt;p&gt;Let’s take a look at one of the data flows, SalesOrderHeader:&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/4213.clip_5F00_image010_5F00_71583928.gif"&gt;&lt;img title="clip_image010" style="display: inline; background-image: none;" border="0" alt="clip_image010" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-57-11-metablogapi/2061.clip_5F00_image010_5F00_thumb_5F00_7732DCC1.gif" width="228" height="186" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;As you see the data flow for retrieving the data is very straightforward compared to the data flow in the obvious approach: it simply gets the data from the external database and then stores it in the staging table in the data warehouse. The other data flows are also this simple.&lt;/p&gt;  &lt;p&gt;Once the data is extracted, the ETL process will continue with the transformation step. Since we implement transformations as database views instead of using data flow transformations in SSIS, we don’t need to do anything special during the transformation phase in the SSIS package (for now at least). &lt;/p&gt;  &lt;p&gt;For example, the transformation for the sales transaction fact table could look something like:&lt;/p&gt;  &lt;pre class="csharpcode"&gt;&lt;span class="kwrd"&gt;CREATE&lt;/span&gt; &lt;span class="kwrd"&gt;VIEW&lt;/span&gt; [Fact sale &lt;span class="kwrd"&gt;transaction&lt;/span&gt;] &lt;span class="kwrd"&gt;AS&lt;/span&gt;
&lt;span class="kwrd"&gt;SELECT&lt;/span&gt; d.[SalesOrderID] &lt;span class="kwrd"&gt;AS&lt;/span&gt; [Sales &lt;span class="kwrd"&gt;order&lt;/span&gt; &lt;span class="kwrd"&gt;key&lt;/span&gt;]
,d.[ProductID] &lt;span class="kwrd"&gt;AS&lt;/span&gt; [Product &lt;span class="kwrd"&gt;key&lt;/span&gt;]
,[TerritoryID] &lt;span class="kwrd"&gt;AS&lt;/span&gt; [Territory &lt;span class="kwrd"&gt;key&lt;/span&gt;]
,dbo.DateToDateKey(h.[OrderDate]) &lt;span class="kwrd"&gt;AS&lt;/span&gt; [&lt;span class="kwrd"&gt;Date&lt;/span&gt; &lt;span class="kwrd"&gt;key&lt;/span&gt;]
,d.[OrderQty] &lt;span class="kwrd"&gt;AS&lt;/span&gt; [&lt;span class="kwrd"&gt;Order&lt;/span&gt; quantity]
,d.[UnitPrice] &lt;span class="kwrd"&gt;AS&lt;/span&gt; [Unit price]
,d.[UnitPriceDiscount] &lt;span class="kwrd"&gt;AS&lt;/span&gt; [Unit price discount]
,d.[LineTotal] &lt;span class="kwrd"&gt;AS&lt;/span&gt; [Line total]
&lt;span class="kwrd"&gt;FROM&lt;/span&gt; [Staging].[SalesOrderDetail] d
&lt;span class="kwrd"&gt;JOIN&lt;/span&gt; [Staging].[SalesOrderHeader] h &lt;span class="kwrd"&gt;ON&lt;/span&gt; h.[SalesOrderID] = d.[SalesOrderID]
&lt;/pre&gt;


&lt;p&gt;A couple of things to point out in this view:&lt;/p&gt;

&lt;p&gt;· Each of the columns are made “business user friendly” by giving meaningful column names and including spaces. The idea is that this table is targeted towards business users, not software developers.&lt;/p&gt;

&lt;p&gt;· ID fields are renamed to “key” fields and these will later be used to join fact tables to dimension tables. Many BI specialis recommend introducing your own key fields instead of relying on ID fields in the source database. I often ignore that advice and can’t remember every running into trouble because of it. If we would need to introduce our own row identifiers, it’s not difficult to do so later.&lt;/p&gt;

&lt;p&gt;· You’ll often need to join multiple staging tables in order to produce the right results for the dimension and fact tables. In this example most fields are from the order detail table, but the order date needs to come from the order header table. This is an example of how OLAP data models are less normalized than the highly normalized data models which are typical for transactional databases.&lt;/p&gt;

&lt;h3&gt;How is using staging tables better than the obvious approach?&lt;/h3&gt;

&lt;p&gt;Remember the disadvantages of the obvious approach:&lt;/p&gt;

&lt;p&gt;In the obvious approach transformations are very time-consuming to develop and debug. With staging tables, transformations are implemented as database views, which are much simpler and less time consuming to develop because you can do it interactively in SSMS. If your SQL skills are a bit rusty, it’s definitely worth investing some time on improving those skills that instead of learning how to use all the different data transformations in SSIS.&lt;/p&gt;

&lt;p&gt;In the obvious approach, in order to test the transformations you need to run the data flow which each time pulls the data from the external data sources. With staging tables, you only need to run the Extract data flows once in order to fill the staging tables with test data. From that point onwards you can simply write and test SQL queries within your data warehouse database. &lt;/p&gt;

&lt;p&gt;Perhaps you may still want to use fake databases to mimic the external databases. However, I find I can often get away using the production databases during development because I hardly ever need to re-run the extract part of the ETL process. (I do not recommend using the production data warehouse database during development! You should have your own copy of the data warehouse database on your development machine.)&lt;/p&gt;

&lt;p&gt;In the obvious approach you needed to be very careful when updating transformations. With staging tables the transformation is just a database view and you won’t corrupt any data by changing the view.&lt;/p&gt;

&lt;h3&gt;Further improving the approach with staging tables&lt;/h3&gt;

&lt;p&gt;There are several ways we can further improve to the approach with staging tables.&lt;/p&gt;

&lt;p&gt;First of all, although the use of database views for implementing transformations simplifies development, it’s not very efficient from a runtime performance perspective. This may not be a problem if the amount of data is small and when the cube is the only “client” of the database views, but otherwise the views can quickly become a bottleneck. &lt;/p&gt;

&lt;p&gt;The way to solve this is to replace the database views by tables which are filled by stored procedures. For example, instead of having the Fact sale transaction view, you would create a Fact sale transaction table and a stored procedure which runs a similar query as the original view and stores the results in the table. &lt;/p&gt;

&lt;p&gt;The nice thing about this approach is that you can start developing the system by creating views and then, before deploying to production, replace the views by tables and stored procedures. The stored procedures are called within the Transform step in the ETL process.&lt;/p&gt;

&lt;p&gt;A downside compared to the obvious approach is that the data warehouse database will require more disk space because data is stored twice (first at the staging level and then at the dim/fact level). Depending how efficient the transformations are made, the obvious approach may also perform better. Therefore, if you work with big datasets (hundreds of millions of rows or more) you may still want to consider using the obvious approach instead of staging tables.&lt;/p&gt;

&lt;p&gt;A second area for improvement is the way staging tables are being refreshed using the flush-and-reload method. When the data flow fails, for example because the source database is offline, we end up with a couple of empty staging, dimension and fact tables in the data warehouse and the result is that we cannot process the cube. It’s better to be more resilient by processing the cube even though some of the data sources have failed. For the failed data sources, the cube should use the data from the previous ETL run. There are several solutions to solving this problem, but that’s outside the scope of this post.&lt;/p&gt;

&lt;h3&gt;One more way to transform data&lt;/h3&gt;

&lt;p&gt;Microsoft’s BI stack provides multiple ways to transform data, and that’s why it’s not evident for inexperienced BI developers which best way is best. I explained two ways of doing transformations, using either data flows and or SQL views. &lt;/p&gt;

&lt;p&gt;To complicate things further, you can also transform data inside of SSAS which allows you to rename and filter columns, introduce new calculated columns, etc. Therefore there is a third approach which we did not yet discuss, which is to create staging tables in the data warehouse and then do all the transformations in SSAS.&lt;/p&gt;

&lt;p&gt;Relying on SSAS for transformations is bad for a couple of reasons. One issue is that when you’re new to SSAS you’ll spend countless days trying to figure out the cryptic error messages that are thrown at you. Like SSIS, SSAS also has a steep learning curve. &lt;/p&gt;

&lt;p&gt;The bottom line is that transformations are probably the most complex part of the ETL process, and the SQL language provides the simplest and fastest way to deal with them.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10374909" width="1" height="1"&gt;</content><author><name>Andreas de Ruiter (Microsoft)</name><uri>http://blogs.msdn.com/mobileiq_4000_outlook.com/ProfileUrlRedirect.ashx</uri></author><category term="ETL" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/ETL/" /><category term="SSIS" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/SSIS/" /><category term="SSAS" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/SSAS/" /><category term="data warehouse" scheme="http://blogs.msdn.com/b/andreasderuiter/archive/tags/data+warehouse/" /></entry></feed>