Over the past couple of months I've been putting together a presentation on the Lookup Transform. I presented most of it as a Chalk Talk at the MS BI Conference last week, and from the evaluation scores, it seems like it was pretty well received. I'll be splitting up some of its content into a series of blog posts over the next little while. If you're interested in seeing the whole talk, it will also be shown at the SSWUG Virtual Conference in November.
The most important setting of the Lookup Transform is the Cache Mode - it can greatly impact your data flow performance, and affects overall package design. Because of its importance, we made it the first thing you see in the new 2008 Lookup UI. I feel this is a great improvement over 2005, where the cache mode was abstracted away - see Michael Entin's post for more details.
This blog post describes the three cache modes, how they work, and best practices around using them. Note that these cache modes apply when you're using an OLE DB connection manager - using the new Cache connection manager is similar to using a Full Cache mode.
The default cache mode for the lookup is Full cache. In this mode, the database is queried once during the pre-execute phase of the data flow. The entire reference set is pulled into memory. This approach uses the most memory, and adds additional startup time for your data flow, as all of the caching takes place before any rows are read from the data flow source(s). The trade off is that the lookup operations will be very fast during execution. One thing to note is that the lookup will not swap memory out to disk, so your data flow will fail if you run out of memory.
In this mode, the lookup cache starts off empty at the beginning of the data flow. When a new row comes in, the lookup transform checks its cache for the matching values. If no match is found, it queries the database. If the match is found at the database, the values are cached so they can be used the next time a matching row comes in.
Since no caching is done during the pre-execute phase, the startup time using a partial cache mode is less than it would be for a full cache. However, your lookup operations would be slower, as you will most likely be hitting the database more often.
When running in partial cache mode, you can configure the maximum size of the cache. This setting can be found on the Advanced Options page of the lookup UI. There are actually two separate values - one for 32bit execution, and one for 64bit. If the cache gets filled up, the lookup transform will start dropping the least seen rows from the cache to make room for the new ones.
In 2008 there is a new Miss Cache feature that allows you to allocate a certain percentage of your cache to remembering rows that had no match in the database. This is useful in a lot of situations, as it prevents the transform from querying the database multiple times for values that don't exist. However, there are cases where you don't want to remember the misses - for example, if your data flow is adding new rows to your reference table. The Miss Cache is disabled by default.
As the name implies, in this mode the lookup transform doesn't maintain a lookup cache (actually, not quite true - we keep the last match around, as the memory has already been allocated). In most situations, this means that you'll be hitting the database for every row.
To find out more on how to implement the look up transform, please see these books online entries:
Good news - a couple of Information log events where added to the Lookup Transform in SQL 2008 to help
Well, you saved a ton of money, but missed a lot of fascinating presentations. Each hour brought another
As discussed before if you have a type 2 dimension and late arriving facts you are in a messy situation
Using the Cache Connection Manager (CCM) is a new option for the Lookup transform in SSIS 2008. The CCM
The cascading lookup pattern uses two lookup transforms with different cache modes . A common use of
The Lookup Transform does case sensitive string comparisons. This means that you need to a little bit
The SQL Server Integration Services team added valuable new caching options (and scalability) to the
What's difference between 'Reference set' and 'Reference table'
Because you mentioned
•When you're accessing a large portion of your reference set
•When you have a small reference table
Are they same ?
Whats the solution if the lookup table has million records (they dont fit in a full cache) and you dont have OLEDB driver (but ado.net instead) ?
when you say large refernce table,does tha contain large number of rows or it wide in column
I am using three lookups(FULL CACHE MODE) in one data flow task and (SSIS 2012). The first lookup is able to find the match but the other two are not. Is there a limit on using llokups in FULL CACHE MODE.
I am 100 percent sure that there is data that should match, and if i use only one lookup in full cache mode then it works fine.
I have tested this on many other packages in my solution , still the same result.
Can any one help me on understanding this.
I was testing this out as a means to optimize my package design in SQL 2012 environment (trying to figure out the best option: Merge Join which requires the use of sort component vs. Lookup)
Here is my situation:
I have 1million rows in reference table (APS )
My pipeline is using a CDC component to get net changes (so smaller changes, about 2k - 10k rows per hour)
My package will be executed hourly
My Destination is in APS (distributed with clustered columnstore index)
I executed the packages in VS and below are the results:
Using the Merge Join --- it processed 61k rows in 2:34 minutes
Using the Lookup with Partial Cache -- Over an hour before I finally stopped it (Lookup was processing about 100 rows per minute )
Using Lookup with No Cache --- Over an hour before I finally stopped it (Lookup was processing about 248 rows per minute)
Using Lookup with Full Cache -- it processed 61k rows in 2:20
For this test, I loaded my reference table with 1 million rows as a sample, the actual table have over 40 million rows.
Reading your post, I was expecting the Partial Cache mode to be the best option however base on this numbers, it doesn't seem that way.
For my case, I am considering the Merge Join Option because I can use parameters in the my dataset to limit the source pull.