I had a great time at PASS last week, and had a chance to talk to a lot of different SSIS users. One of the big topics seemed to be Slowly Changing Dimensions – I had a number of people ask for various improvements to the current Slowly Changing Dimension Transform in SSIS, and also ask for recommended alternatives in the meantime. I thought I’d summarize some of the more popular approaches I’ve seen, and see if anyone else has some alternatives.
You might have already tried the Slowly Changing Dimension Wizard that comes with SSIS 2005 and 2008 (and there are a number of good tutorials out there if you haven’t).
The SCD Wizard has a few things going for it - it’s quick and easy to implement, it handles most SCD scenarios out of the box, and its multi-component approach means you can customize it with the functionality you need.
It does have some pretty big limitations, however, which end up being a deal breaker for a lot of people.
A major inhibitor is the performance of the transform. It doesn’t perform that well for a couple of different reasons:
Another downside to the transform is the “one way” nature of the wizard – running it again (to change columns, for example) means you’ll lose any customizations you might have made to the other transforms.
I recommend using the wizard for simple dimensions, where you’re not processing a lot of data. If performance is a concern, consider one of the following approaches.
I came across this tip from the Kimball Group when I was putting together my Merge & CDC talk last year.
Using the SQL MERGE Statement for Slowly Changing Dimension Processing
In this approach, you write all of your incoming data to a staging table, and then use Execute SQL Tasks to run MERGE statements (you actually have to do two passes – one for Type 1 changes, and one for Type 2 – see the details in the tip above). I posted the sample packages and code I used in a previous blog post.
The performance in this approach is very good (although it moves the bulk of the work to the database machine, which might not be what you want). I recommend it if you don’t mind staging the data, writing custom SQL, or can’t use a 3rd party component in your environment.
I’ve heard great things about Todd McDermid’s custom SCD Transform. Instead of doing row by row lookups, this transform takes in the dimension as an input. This makes the comparison much faster than the stock SSIS version. It wraps up all of the functionality into a single transform, which is great if you’re following the Kimball methodology.
I had the chance to meet with the SQLBI.EU guys at PASS, and they mentioned their Table Difference component. I haven’t tried it out myself, but I remembered an email from one of the SQL Rangers (Binh Cao) that suggested this component for SCD processing. I’ve included his write-up here:
Table difference is an SSIS custom component designed to simplify the management of slowly changing dimensions and – in general – to check the differences between two tables or data flow with SSIS.
The component receives input from two sorted sources and generates different outputs for unchanged, new, deleted or updated rows.
The inputs MUST be sorted and have a collection of fields (keys) that let the component decide when two rows from the inputs represent the same row, but this is easily accomplished by SQL Server with a simple “order by” and a convenient index; moreover the SCD do normally maintain an index by the business key, so the sorting requirement is easily accomplished and do not represent a problem.
Clicking on TableDifference control gives the following window.
TableDifference analyzes all the columns in both inputs and compares their names. If the name of two columns and their corresponding types are identical, TableDifference adds them to the available columns to manage.
If the flows are sorted, their sort columns will be marked as key fields, using the same order in which they appear in the sort.
All other columns are assigned a standard Update ID of 10 and are managed as comparable columns.
Using the component editor, you need to provide the following information for the columns:
Check Option: you can choose the column type between:
KeyOrder: If a column is of type “Key Field” it is the order under which the field appear under the “order by” clause of your query. Beware that the component do not check for the correct sorting sequence, it is up to you to provide this information.
Update ID: Each different UpdateID creates a different output flow. If you need to detect when a change appears in some column you can use different update ID. Beware that the lowest update ID wins, i.e. if AccountNumber has update id of 10 and AddressLine1 has update id of 20, then Accountnumber will be checked first and if a change is detected, the row will go to update output 10, no matter if AddressLine has a difference.
Outputs Panel gives option to choose which output to enable as well as to name and describe the output.
Output Details allows selection of the columns for each output. Here columns that are not needed for an output can be disabled. The picture shows an example of the DELETED output which only have the Customer Key column in its output. The less columns in the output, the better the performance of the component.
In the Misc Options tab, string comparisons definition can be defined:
In the Warnings panel, it will list any unused column from the input. As you might recall, if two columns are not identical regarding name and type, TableDifference will ignore them. This might be an error but the decision is up to you. By checking the warnings panel you can see if TableDifference is working with all the columns you need it to compare.
Do you have any other SCD processing approaches you’d like to recommend?
using a checksum works well also, I blogged about it here http://blog.stevienova.com/2008/11/22/ssis-slowly-changing-dimensions-with-checksum/
Thanks for the shout-out Matt,
I've just released v1.5 of the Kimball Method SCD component - get the full details here: http://toddmcdermid.blogspot.com/2010/01/kimball-method-slowly-changing.html
An 8-minute overview video is included...
Thanks a lot. We have been experiencing a lot of performance degradation issues with native SCD T/f and this blog helped me. As a Teach Lead, I faced the heat a couple of times from Client.
Shared this important information with my Team Members, implemented and we are all SUCCESS!!
70% performance improvement. Thanks again!!
I tend to use the check-sum method. Another great tool is www.pragmaticworks.com/.../UpsertDestination(BatchUpdateOrInsert).aspx but I found it too pricey.