<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Microsoft SQL Server Development Customer Advisory Team : Integration Services</title><link>http://blogs.msdn.com/sqlcat/archive/tags/Integration+Services/default.aspx</link><description>Tags: Integration Services</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Assigning surrogate keys to early arriving facts using Integration Services</title><link>http://blogs.msdn.com/sqlcat/archive/2009/05/13/assigning-surrogate-keys-to-early-arriving-facts-using-integration-services.aspx</link><pubDate>Wed, 13 May 2009 17:26:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9609343</guid><dc:creator>tkejser</dc:creator><slash:comments>5</slash:comments><comments>http://blogs.msdn.com/sqlcat/comments/9609343.aspx</comments><wfw:commentRss>http://blogs.msdn.com/sqlcat/commentrss.aspx?PostID=9609343</wfw:commentRss><description>&lt;P&gt;In data warehouses, it is quite common that fact records arrive with a source system key that has not yet been loaded in the dimension tables. This phenomenon is known as “late arriving dimensions” or “early arriving facts” in Kimball terminology.&lt;/P&gt;
&lt;P&gt;When you see a fact record that cannot be resolved to a dimension surrogate key, the typical solution is this:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Create a dummy member in the dimension table using the source system key &lt;/LI&gt;
&lt;LI&gt;Assign a surrogate key to this dummy member &lt;/LI&gt;
&lt;LI&gt;Assign the newly created surrogate key to the fact record &lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;If you use T-SQL to load the data warehouse, you have to pass over the input fact rows twice. First, you have to discover which keys are not present in the dimension (and create surrogates for them). Second, you have to read the input data again and use the newly generated surrogate keys to load the fact table.&lt;/P&gt;
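&lt;P&gt;As a rough sketch of that two-pass approach, and assuming the &lt;B&gt;Stage_Fact&lt;/B&gt;, &lt;B&gt;Dim_A&lt;/B&gt; and &lt;B&gt;Fact&lt;/B&gt; tables defined further down in this post, the load could look something like this (one of several ways to write it):&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;&lt;FONT color=#008000&gt;/* Pass 1: create dummy members for keys missing from the dimension */&lt;/FONT&gt; &lt;BR&gt;INSERT Dim_A (NK_A) &lt;BR&gt;SELECT DISTINCT s.NK_A &lt;BR&gt;FROM Stage_Fact s &lt;BR&gt;WHERE NOT EXISTS (SELECT 1 FROM Dim_A d WHERE d.NK_A = s.NK_A)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;&lt;FONT color=#008000&gt;/* Pass 2: read the staging rows again and resolve every surrogate key */&lt;/FONT&gt; &lt;BR&gt;INSERT Fact WITH (TABLOCK) (SK_A) &lt;BR&gt;SELECT d.SK_A &lt;BR&gt;FROM Stage_Fact s &lt;BR&gt;JOIN Dim_A d ON d.NK_A = s.NK_A&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;Both statements scan &lt;B&gt;Stage_Fact&lt;/B&gt; in full, which is the extra read that the rest of this post tries to avoid.&lt;/P&gt;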
&lt;P&gt;Using Integration Services, early arriving facts can be populated with just one pass over the source rows, which means fewer read I/O operations. Nice!&lt;/P&gt;
&lt;P&gt;In &lt;A href="http://www.microsoft.com/downloads/details.aspx?FamilyID=B61A37B6-5852-4018-BBA9-795A34123ED0&amp;amp;displaylang=en" mce_href="http://www.microsoft.com/downloads/details.aspx?FamilyID=B61A37B6-5852-4018-BBA9-795A34123ED0&amp;amp;displaylang=en"&gt;Project REAL&lt;/A&gt;, a script component is used to achieve this effect. If many of your dimensions have early arriving facts, this creates a lot of copy/paste code. There is a cleaner solution that does not use script components.&lt;/P&gt;
&lt;P&gt;It is best illustrated with an example. Let us create these three tables:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;FONT color=#008000&gt;/* The input table */&lt;/FONT&gt; &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;CREATE TABLE Stage_Fact &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;( &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;&amp;nbsp; NK_A CHAR(10) NOT NULL /* The late arriving source system key */ &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;&lt;FONT color=#008000&gt;/* The late arriving dimension table */&lt;/FONT&gt; &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;CREATE TABLE Dim_A &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;( &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;&amp;nbsp; SK_A INT PRIMARY KEY IDENTITY(1,1) /* The surrogate key*/ &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;&amp;nbsp; , NK_A CHAR(10) NOT NULL /* The natural, source system key */ &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color=#008000 size=2 face="Courier New"&gt;&lt;STRONG&gt;/* The final destination table */ &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;CREATE TABLE Fact &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;( &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;&amp;nbsp; SK_A INT NOT NULL /* Surrogate key from dimension */ &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size=2 face="Courier New"&gt;&lt;STRONG&gt;)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;Now, use this script to generate 16M rows in the input table and create a 9000 row dimension table:&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;&lt;FONT color=#008000&gt;/* Create some staging data */&lt;/FONT&gt; &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;INSERT Stage_Fact WITH (TABLOCK) &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;SELECT RIGHT(REPLICATE('0', 10) + CAST(K AS VARCHAR(10)), 10) AS NK_A &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;FROM (SELECT ABS(binary_checksum(*) % 10000) AS K&amp;nbsp; &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;FROM sys.trace_event_bindings eb1 &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;CROSS JOIN sys.trace_event_bindings eb2) AS stuff&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;&lt;FONT color=#008000&gt;/* Populate Dim_A with 90% of the keys from the fact table */&lt;/FONT&gt; &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;INSERT Dim_A WITH (TABLOCK) (NK_A) &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;SELECT DISTINCT NK_A FROM Stage_Fact &lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;WHERE NK_A &amp;lt; '0000009000'&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
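&lt;P&gt;If you want to check what the scripts produced on your own server, a query along these lines counts the distinct keys in &lt;B&gt;Stage_Fact&lt;/B&gt; that have no match in &lt;B&gt;Dim_A&lt;/B&gt;, and the number of fact rows that carry them. The column aliases are only illustrative, and the exact row count depends on how many rows &lt;B&gt;sys.trace_event_bindings&lt;/B&gt; holds on your instance:&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;&lt;FONT color=#008000&gt;/* Count the early arriving keys and the fact rows that use them */&lt;/FONT&gt; &lt;BR&gt;SELECT COUNT(DISTINCT s.NK_A) AS Missing_Keys &lt;BR&gt;&amp;nbsp; , COUNT(*) AS Affected_Rows &lt;BR&gt;FROM Stage_Fact s &lt;BR&gt;WHERE NOT EXISTS (SELECT 1 FROM Dim_A d WHERE d.NK_A = s.NK_A)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;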
&lt;P&gt;With this data, there will be 1000 late arriving dimension keys in &lt;B&gt;Stage_Fact&lt;/B&gt;, spread over around 1.8M rows. When we see a non-matched key in &lt;B&gt;Stage_Fact&lt;/B&gt;, we want to generate a new surrogate key in &lt;B&gt;Dim_A&lt;/B&gt;. But here is the catch: we only want to generate the surrogate once, and we do NOT want to do a roundtrip to the database the second time we see the same key.&lt;/P&gt;
&lt;P&gt;Project REAL uses a .NET hash table to track the generated keys and perform quick lookups the next time we see the key. But we already have a fine hash table available without using script components: the lookup transformation. Let us see how we solve the early arriving fact problem with Integration Services, au naturel:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image002_2.jpg" mce_href="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image002_2.jpg"&gt;&lt;IMG style="BORDER-RIGHT-WIDTH: 0px; DISPLAY: inline; BORDER-TOP-WIDTH: 0px; BORDER-BOTTOM-WIDTH: 0px; BORDER-LEFT-WIDTH: 0px" title=clip_image002 border=0 alt=clip_image002 src="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image002_thumb.jpg" width=478 height=425 mce_src="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image002_thumb.jpg"&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;The non-matched rows from &lt;B&gt;Lookup SK_A&lt;/B&gt; go into the second lookup (&lt;B&gt;New SK_A Cache&lt;/B&gt;). &lt;B&gt;New SK_A Cache&lt;/B&gt; is where we want to handle the early arriving facts.&lt;/P&gt;
&lt;P&gt;First, configure &lt;B&gt;New SK_A Cache&lt;/B&gt; as a partial cache:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image004_2.jpg" mce_href="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image004_2.jpg"&gt;&lt;IMG style="BORDER-RIGHT-WIDTH: 0px; DISPLAY: inline; BORDER-TOP-WIDTH: 0px; BORDER-BOTTOM-WIDTH: 0px; BORDER-LEFT-WIDTH: 0px" title=clip_image004 border=0 alt=clip_image004 src="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image004_thumb.jpg" width=372 height=144 mce_src="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image004_thumb.jpg"&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Now, we play a clever trick: whenever a partial cache lookup receives a row whose key is not yet in its cache, it runs a SQL statement against the database to fetch the matching row and add it to the cache. The default is a SELECT statement, but it does not &lt;I&gt;have&lt;/I&gt; to be a SELECT statement. We could replace it with a stored procedure that returns the same result as the SELECT. Actually, let us do exactly that:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image006_2.jpg" mce_href="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image006_2.jpg"&gt;&lt;IMG style="BORDER-BOTTOM: 0px; BORDER-LEFT: 0px; DISPLAY: inline; BORDER-TOP: 0px; BORDER-RIGHT: 0px" title=clip_image006 border=0 alt=clip_image006 src="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image006_thumb.jpg" width=511 height=357 mce_src="http://blogs.msdn.com/blogfiles/sqlcat/WindowsLiveWriter/Assigningsurrogatekeystoearlyarrivingfac_E674/clip_image006_thumb.jpg"&gt;&lt;/A&gt;&lt;/P&gt;
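&lt;P&gt;In text form, the change amounts to something like the following. This is only a sketch: the exact statement that the lookup generates by default depends on your SSIS version and on the reference query you configured, so treat the first statement as an approximation. The &lt;B&gt;?&lt;/B&gt; is the parameter marker that the lookup binds to an input column:&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;&lt;FONT color=#008000&gt;/* Roughly what a partial cache lookup on Dim_A runs by default */&lt;/FONT&gt; &lt;BR&gt;select * from (SELECT SK_A, NK_A FROM Dim_A) [refTable] &lt;BR&gt;where [refTable].[NK_A] = ?&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;&lt;FONT color=#008000&gt;/* The replacement: call the stored procedure with the same single parameter */&lt;/FONT&gt; &lt;BR&gt;EXEC Generate_SK_A ?&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;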
&lt;P&gt;Now, the FIRST time the partial lookup cache sees an early arriving fact, it will call &lt;B&gt;Generate_SK_A&lt;/B&gt;. I have mapped the &lt;B&gt;NK_A&lt;/B&gt; (the source system, natural key) column to the input parameter. To finish the trick, we just have to create a simple stored procedure that uses &lt;B&gt;NK_A&lt;/B&gt; to look up &lt;B&gt;SK_A&lt;/B&gt; (the surrogate key) and, if it is not found, creates a new key:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;CREATE PROCEDURE Generate_SK_A &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;&amp;nbsp; @NK_A CHAR(10) &lt;FONT color=#008000&gt;/* The key to find a surrogate for */&lt;/FONT&gt; &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;AS &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;SET NOCOUNT ON&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;FONT color=#008000 face="Courier New"&gt;/* Prevent race conditions: the SELECT and the INSERT must run in one transaction, otherwise the SERIALIZABLE range locks are released between the two statements */&lt;/FONT&gt; &lt;BR&gt;&lt;FONT face="Courier New"&gt;SET TRANSACTION ISOLATION LEVEL SERIALIZABLE &lt;BR&gt;BEGIN TRANSACTION&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;&lt;FONT color=#008000&gt;/* Check if we already have the key (procedure is idempotent) */&lt;/FONT&gt; &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;DECLARE @SK_A INT &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;SELECT @SK_A = SK_A &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;FROM Dim_A &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;WHERE NK_A = @NK_A&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;&lt;FONT color=#008000&gt;/* The natural key was not found, generate a new surrogate key */&lt;/FONT&gt; &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;IF @SK_A IS NULL BEGIN &lt;BR&gt;&amp;nbsp; &lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;INSERT Dim_A (NK_A) VALUES (@NK_A) &lt;BR&gt;&amp;nbsp; &lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;SET @SK_A = SCOPE_IDENTITY() &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;END &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;COMMIT TRANSACTION&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;FONT color=#008000 face="Courier New"&gt;/* Return the result.&amp;nbsp; &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT color=#008000 face="Courier New"&gt;&amp;nbsp; IMPORTANT: must return the same format as the SELECT statement we replaced &lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT color=#008000 face="Courier New"&gt;*/ &lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;FONT face="Courier New"&gt;SELECT @SK_A AS SK_A, @NK_A AS NK_A&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
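&lt;P&gt;A quick way to try the procedure outside the package is to call it twice with the same key. The key value below is just an example; with the generated test data it is a key that was not loaded into &lt;B&gt;Dim_A&lt;/B&gt;. The first call should insert the dummy member and both calls should return the same &lt;B&gt;SK_A&lt;/B&gt;:&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Courier New"&gt;&lt;STRONG&gt;&lt;FONT color=#008000&gt;/* First call: key is unknown, so a new dimension row is created */&lt;/FONT&gt; &lt;BR&gt;EXEC Generate_SK_A @NK_A = '0000009999' &lt;BR&gt;&lt;FONT color=#008000&gt;/* Second call: key is found, nothing is inserted */&lt;/FONT&gt; &lt;BR&gt;EXEC Generate_SK_A @NK_A = '0000009999' &lt;BR&gt;&lt;FONT color=#008000&gt;/* There should still be only one row for the key */&lt;/FONT&gt; &lt;BR&gt;SELECT COUNT(*) AS Row_Count FROM Dim_A WHERE NK_A = '0000009999'&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;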
&lt;P&gt;Simple, isn’t it? No need to use any .NET script components here. Have a look at the attached files to study the technique further and you will be handling early arriving facts elegantly in no time.&lt;/P&gt;</description><enclosure url="http://blogs.msdn.com/sqlcat/attachment/9609343.ashx" length="9939" type="application/x-zip-compressed" /><category domain="http://blogs.msdn.com/sqlcat/archive/tags/Development+_2600_amp_3B00_+Programming/default.aspx">Development &amp;amp; Programming</category><category domain="http://blogs.msdn.com/sqlcat/archive/tags/ETL/default.aspx">ETL</category><category domain="http://blogs.msdn.com/sqlcat/archive/tags/SSIS/default.aspx">SSIS</category><category domain="http://blogs.msdn.com/sqlcat/archive/tags/BI/default.aspx">BI</category><category domain="http://blogs.msdn.com/sqlcat/archive/tags/Data+Warehouse/default.aspx">Data Warehouse</category><category domain="http://blogs.msdn.com/sqlcat/archive/tags/Integration+Services/default.aspx">Integration Services</category></item></channel></rss>