It's typical in data warehouse loads to have lookups that go against dimension tables. If the key doesn't exist, either a new one is created, or a default value is used.

There're two ways to do this:

1. Lookup is configured to redirect rows that have no match in the reference table to a separate output (error output), then use a derived column to specify a default value, and finally merge both the lookup success output and the output of the derived column using a union all transform.

2. Lookup is configured to ignore lookup failures and pass the row out with null values for reference data. A derived column downstream of the lookup then checks for null reference data using 'ISNULL' and replaces the value with a default value.

As you may know Union All is an async transform due to implementation issues. This means its going to create a new type of buffer on the output and copy the input buffer data to the output.

Intuitively, #2 has been the way to go because it doesn't incur the cost of a memcopy, but I didn't get a chance to quantify the difference until now. Today, I was finally able to.

The test harness includes a script component that feeds 100M rows to the lookup, and the output of the union all in option #1 and derived column in #2 gets consumed by a script component that doesn't do anything with the rows. The AMD guys have given us a 64bit dual core machine with a couple of procs to play with and that's what I ran this on. There's enough (32GB) of memory on the box so that we can focus on just the perf of this isolated scenario. Since I was only interested in this specific scenario, the lookup statement itself is 'select * from whatevertable where 1=0' which means all keys fail lookups.

Average timings were:

Option #1: 104 seconds.

Option #2: 83 seconds.

For folks that have 12-13 dimension lookups, using approach #2 might significantly aid performance.

The other interesting fact here is that if there're lots of lookups, the execution tree that contains the lookups will be busy. Since there's one thread (currently) per execution tree, that thread will be doing a lot of work. Therefore, there will be a point where having X lookups on one thread would be slower than having X/2 lookups on it and a union all to introduce a new execution tree. I didn't get a chance to do that investigation as yet, hope to in the future.