Assigning surrogate keys to early arriving facts using Integration Services

Assigning surrogate keys to early arriving facts using Integration Services

Rate This
  • Comments 12

In data warehouses, it is quite common that fact records arrive with a source system key that has not yet been loaded in the dimension tables. This phenomena is known as “late arriving dimensions” or “early arriving facts” in Kimball terminology.

When you see a fact record that cannot be resolved to a dimension surrogate key, the typical solution is this:

  • Create a dummy member in the dimension table using the source system key
  • Assign a surrogate key to this dummy member
  • Use the newly create surrogate key and assign it to the fact record

If you use T-SQL to load the data warehouse, it means you have to pass over the input fact rows twice. First, you have to discover which keys are not present in the dimension (and create surrogates for them). Second, you will have to look at the input data again and use the newly generated surrogate keys to load the the fact table.

Using Integration Services, early arriving facts can be populated with just one pass over the source rows, which means less read I/O operations. Nice!

In project Project REAL, a script component is used to achieve this effect. If many of your dimension have early arriving facts, this creates a lot of copy/paste code. There is a cleaner solution that does not use script components.

There is a way handle early arriving facts without relying on script components. It is best illustrated with an example. Let us create these three tables:

/* The input table */
CREATE TABLE Stage_Fact
(
  NK_A CHAR(10) NOT NULL /* The late arriving source system key */
)

/* The late arriving dimension table */
CREATE TABLE Dim_A
(
  SK_A INT PRIMARY KEY IDENTITY(1,1) /* The surrogate key*/
  , NK_A CHAR(10) NOT NULL /* The natural, source system key */
)

/* The final destination table */
CREATE TABLE Fact
(
  SK_A INT NOT NULL /* Surrogate key from dimension */
)

Now, use this script to generate 16M rows in the input table and create a 9000 row dimension table:

/* Create some staging data */
INSERT Stage_Fact WITH (TABLOCK)
SELECT RIGHT(REPLICATE('0', 10) + CAST(K AS VARCHAR(10)), 10) AS NK_A
FROM (SELECT ABS(binary_checksum(*) % 10000) AS K 
FROM sys.trace_event_bindings eb1
CROSS JOIN sys.trace_event_bindings eb2) AS stuff

/* Populate Dim_A with 90% of the keys from the fact table */
INSERT Dim_A WITH (TABLOCK) (NK_A)
SELECT DISTINCT NK_A FROM Stage_Fact
WHERE NK_A < '0000009000'

With this data, there will be 1000 late arriving dimension keys in Stage_Fact in around 1.8M rows. When we see a non-matched key in Stage_Fact, we want to generate a new surrogate key in Dim_A. But here is the catch: We only want to generate the surrogate once, and we do NOT want to do a roundtrip to the database the second time we see the same key.

Project Real uses a .NET hash table to track the generated keys and perform quick lookups the next time we see the key. But, we already have a fine hash table available without using script components: the lookup transformation. Let us see how we solve the early arriving fact problem with Integration Services, au natural:

clip_image002

The non-matched rows from Lookup SK_A go into the second lookup (New SK_A Cache). New SK_A Cache is where we want to handle the early arriving facts.

First, configure New SK_A Cache as a partial cache:

clip_image004

Now, we play a clever trick: Whenever a partial lookup cache first receives a non-matched row, it will call a SQL statement and fetch data to populate the lookup  cache. The default is a SELECT statement, but it does not have to be a SELECT statement. We could replace it with a stored procedure that returns the same result as the SELECT. Actually, let us do exactly that:

clip_image006

Now, the FIRST time the partial lookup cache sees a early arriving fact, it will call Generate_SK_A. I have mapped the NK_A (the source system, natural key) column to the input parameter. To finish the trick, we just have to create a simple stored procedure that uses NK_A to lookup SK_A (the Surrogate Key), and if not found, create a new key:

CREATE PROCEDURE Generate_SK_A
  @NK_A CHAR(10) /* The key to find a surrogate for */
AS
SET NOCOUNT ON

/* Prevent race conditions */
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

/* Check if we already have the key (procedure is idempotent) */
DECLARE @SK_A INT
SELECT @SK_A = SK_A
FROM Dim_A
WHERE NK_A = @NK_A

/* The natural key was not found, generate a new one */
IF @SK_A IS NULL BEGIN
 
INSERT Dim_A (NK_A) VALUES (@NK_A)
 
SET @SK_A = SCOPE_IDENTITY()
END

/* Return the result. 
  IMPORTANT: must return same format is the SELECT statement we replaced */
SELECT @SK_A AS SK_A, @NK_A AS NK_A

Simple isn’t it?... No need to use any .NET script components here. Have a look at the attached files to study the technique further and you will be handling early arriving facts elegantly in no time.

Attachment: SK_Example.zip
Leave a Comment
  • Please add 2 and 6 and type the answer here:
  • Post
  • To get the attached ZIP file click on the title of this post to bring it up in full screen. At the bottom of the page, there is a link to SK_Example.zip.

    Thanks

    Thomas Kejser

  • You can even save the first lookup and just use the second one, when define your partial cache big enough to hold all your dimension elements. Then you also don't need the union all element and it looks very smart.

  • Marcel: If you try to profile your solution, you will see that even the KNOWN keys result in roundtrip to the database. If you have a 10.000 row dimension (like this example), then you will get 10.000 roundtrips, even though only 1000 keys are early arriving.

    By having the first lookup catch most rows with a FULL cache, you save 9000 roundtrips. A performance gain worth taking I think :-)

  • You are right, because the partial cache starts empty and then you have all those round trips over the StoredProcedure. But anyway a nice solution. :)

  • Everybody who is familiar with Kimballs data warehouse design (Kimball Group) also knows about the early arriving facts problem. In SSIS 2008 the solution for achieving inferred members is quite a little bit easier. Thomas Kejser wrote a nice blog entry

  • How did you add a second output to your first Lookup?   Is this a new feature in 2008?  Will this work in 2005 using the Lookup Error output?

  • TabAlleman: The second output is indeed a new features in SQL Server 2008.

    In 2005 you have would have to use the error output and distinguish between lookup errors and truncation errors

  • Anyone actually tried this in 2005?  I did and my second lookup (the one with the partial cache and sproc that creates new SKs) fails with an error:  "Row yielded no match during lookup."

    I have tested my sproc, and it always returns a matching row, creating a new SK when none exists.  It's as though my lookup is ignoring my Caching SQL Statement where I use the sproc, and is using the straight lookup instead.  

    Maybe the Lookup component in 2005 couldn't really work this way?   But if anybody has gotten this technique to work in 2005, maybe I'm doing something wrong.

  • TabAlleman - I have got a similar process working in 2005, but using an oledb command to call a stored procedure which does the lookup/insert/scope identity to return the new key.  I have yet to test if this can be done with a lookup as outlined above.

  • TabAlleman - I have got a similar process working in 2005, but using an oledb command to call a stored procedure which does the lookup/insert/scope identity to return the new key.  I have yet to test if this can be done with a lookup as outlined above.

  • @TabAlleman

    Hi! For me it is working within the 2005 enviroment.If you like I can send you an example in order to help you.

  • @TabAlleman

    Hi! For me it is working within the 2005 enviroment.If you like I can send you an example in order to help you.

Page 1 of 1 (12 items)