Lookup Pattern: Range Lookups

Lookup Pattern: Range Lookups

Rate This
  • Comments 27

Performing range lookups (i.e. to find a key for a given range) is a common ETL operation in data warehousing scenarios. It's especially for historical loads and late arriving fact situations, where you're using type 2 dimensions and you need to locate the key which represents the dimension value for a given point in time.

This blog post outlines three separate approaches for doing range lookups in SSIS:

  1. Using the Lookup Transform
  2. Merge Join + Conditional Split
  3. Script Component

All of our scenarios will use the AdventureWorksDW2008 sample database (DimProduct table) as the dimension, and take its fact data from AdventureWorks2008 (SalesOrderHeader and SalesOrderDetail tables). The "ProductNumber" column from the SalesOrderDetail table maps to the natural key of the DimProduct dimension (ProductAlternateKey column). In all cases we want to lookup the key (ProductKey) for the product which was valid (identified by StartDate and EndDate) for the given OrderDate.

One last thing to note is that the Merge Join and Script Component solutions assume that a valid range exists for each incoming value. The Lookup Transform approach is the only one that will identify rows that have no matches (although the Script Component solution could be modified to do so as well).

Lookup Transform

The Lookup Transform was designed to handle 1:1 key matching, but it can also be used in the range lookup scenario by using a partial cache mode, and tweaking the query on the Advanced Settings page. However, the Lookup doesn't cache the range itself, and will end up going to the database very often - it will only detect a match in its cache if all of the parameters are the same (i.e. same product purchased on the same date).

We can use the following query to have the lookup transform perform our range lookup:

select [ProductKey], [ProductAlternateKey], 
     [StartDate], [EndDate]
from [dbo].[DimProduct]
where [ProductAlternateKey] = ?
and   [StartDate] <= ?
and (
    [EndDate] is null or 
    [EndDate] > ?
)

On the query parameters page, we map 0 -> ProductNumber, 1 and 2 -> OrderDate.

image

This approach is effective and easy to setup, but it is pretty slow when dealing with a large number of rows, as most lookups will be going to the database.

Merge Join and Conditional Split

This approach doesn't use the Lookup Transform. Instead we use a Merge Join Transform to do an inner join on our dimension table. This will give us more rows coming out than we had coming in (you'll get a row for every repeated ProductAlternateKey). We use the conditional split to do the actual range check, and take only the rows that fall into the right range.

image 

For example, a row coming in from our source would contain an OrderDate and ProductNumber, like this:

table1

From the DimProduct source, we take three additional columns - ProductKey (what we're after), StartDate and EndDate. The DimProduct dimension contains three entries for the "LJ-0192-L" product (as its information, like unit price, has changed over time). After going through the Merge Join, the single row becomes three rows.

table2

We use the Conditional Split to do the range lookup, and take the single row we want. Here is our expression (remember, in our case an EndDate value of NULL indicates that it's the most current row):

StartDate <= OrderDate && (OrderDate < EndDate || ISNULL(EndDate))

table3

This approach is a little more complicated, but performs a lot better than using the Lookup Transform.

Script Component

The third approach uses a custom script component to perform the lookup. I wrote the script in two ways - one that simulates a "Full Cache" type lookup, and one that is similar to partial cache except it pulls back all values for a given natural key, instead of just the one for the given date range. The caching behavior is controlled by the PreCache boolean package variable.

Conclusion

I ran the three packages using the following environment (my laptop):

  • Dual core Intel 1.8ghz
  • 3gb of RAM
  • AdventureWorks2008 and AdventureWorksDW2008
    • ~120,000 order rows (SalesOrderDetail)
    • ~600 reference rows (DimProduct)

Here are the results, in rows per second (larger being better):

image 

At 120k+ rows per second, we can see that the custom script (or better yet, a custom transform) is the best alternative here. We can also see that even though the Lookup approach was by far the slowest (3639 rows / second), it is still a viable choice when you're processing a small number of rows.

There's a couple of reasons that the Lookup Transform performs poorly here. First, because it's not able to pre-cache any of the reference data, it has to go to the database often. Second, it matches only on actual parameter values - it doesn't have a concept of ranges. Since it will only find a cache hit if all parameters are the same, it ends up hitting the database for almost every row (120k times). By comparison, the script component will only query once per unique ProductNumber (~600 times max).

So there you have three different approaches for doing range lookups in SSIS. I'm hoping we'll be able to either enhance the Lookup component to support this functionality in the future, or perhaps provide a new transform to handle this case.

In the mean time, please feel free to post / email any alternative approaches you might have.

I've attached the packages used in this post incase you want to try out the different options for yourself.

 

Leave a Comment
  • Please add 4 and 5 and type the answer here:
  • Post
  • From the Lookup presentation I put together for the MS BI conference in October, here is a series of

  • This is just a place holder. Please see the post on different approaches of doing Range Lookups in SSIS

  • The SQL Server Integration Services team added valuable new caching options (and scalability) to the

  • Hi,

    Can anybody explain in brief the code used in the script component?

    I have a scenario to implement the same.

    Any help would be greatly appreciated. Thanks

  • Which part do you need explained?

  • Basically the part within the while loop.

    Why is StartDate not considered in the code?

    I am a starter, so please bear with me.

    Thanks for your help.

  • An optimization. It assumes that everything up to the next end date belongs to the previous record.

    ex.

    Key 1, EndDate 2001/01/01

    Key 20, EndDate 2003/01/01

    Key 40, EndDate 2008/01/01

    Key 60, EndDate NULL

    If a record with a transaction date of 1999 comes in, we can determine it belongs to Key 1. If a record with a transaction date of 2004 comes in, we know it belongs to Key 40.

    Note, this will only work if there are no gaps in your history. It assumes that your data has been cleaned before hand.

  • Hi,

    How do i redirect the lookup no match output in the script component if there is no dimension natural key available?

    Can you please help me out with the code?

    Thanks in advance for the help.

  • How do i redirect the lookup no match output in the script component if there is no dimension natural key available?

    Even if redirection is not possible i would like to replace the corresponding key of the record like 1, which is the key for default record in my dimension?

    waiting for your response...

  • You can redirect with the script by adding another output to the component.

    http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/66136/

    Alternatively, you can replace the key value with 1 instead of throwing an exception like the script does now.

  • Hi Masson,

    Thanks for the reply.

    I tried replacing the key for the corresponding row to 1 instead of throwing the exception as you said. But it fails again. I don't want the script component to fail at any cost, just replace the key with 1 and proceed.

    Below is the change I made to the code.

     if (ranges == null)

     {

               Row.ProductKey = 1;

    //commented

    //throw new NullReferenceException("Couldn't find //value for product number " + productNumber);

     }

    Am I missing something here or should I do anything else.. Please have a look at it. It is very critical for me.

  • There is solution based on the third-party commercial CozyRoc SSIS+ library. CozyRoc has implemented data flow destination script, which creates memory-efficient range dictionary object. The dictionary object can then be used in CozyRoc Lookup Plus component. For more information and demonstration how to use the script, check here:

    http://www.cozyroc.com/script/range-dictionary-destination

  • Hi, I'd really like to try out the downloadable solutions, but it appears they are in SSIS 2008 format.

    Any chance a SSIS 2005 version can be made available too?

    Thanks in advance.

  • Is it possible to get a VB.Net version of the script component's code? Thanks in advance.

  • I second that -- any chance for a VB.Net version of this code?

Page 1 of 2 (27 items) 12