Most users get good performance from SQL Server Integration Services (SSIS) packages with out-of-the-box configurations and little tuning. Sometimes, though, tuning is necessary to get optimal performance. The Lookup transformation is one of the most commonly used transformations, and there are several techniques for getting optimal performance from it. A few of these are surprising the first time you see them. This article discusses tuning techniques that we have found to be important in different situations.
A common need in data warehousing is to look up a surrogate key based on one or more business keys before a new row is inserted into the warehouse. The screenshot below shows a highly simplified example: an arriving order record contains a customer name and phone number. Before the order can be inserted in the database, the customer name and phone number must be looked up to find the correct customer key. This simplified example is the basis for the discussion in this article.
Most designers would use the obvious default behavior of the Lookup transform, selecting a table or view to look up in, as shown below.
The example scenario showed selecting data from a table. After naming the table, the designer also chooses what columns will be used in the lookup and what columns will be returned as a result of the lookup. This is done in the Columns tab of the Lookup Transformation Editor. The figure below shows that the columns C_NAME and C_PHONE will be referenced from the table, and the data from C_CUSTKEY will be returned.
Only three columns from the lookup table are used in this example. However, choosing a table is interpreted as “SELECT *”, which causes the transformation to fetch unneeded data from the other columns that do not participate in the lookup operation. That is a waste of effort and memory. We recommend choosing “Use results of an SQL query” instead of naming a table, and having the query select only the columns that are used.
This in turn changes the column mappings. Now it is clear that only the desired columns have been selected. In our experience this can lead to a substantial gain in Lookup performance. In one case we saw recently, using this technique resulted in a doubling of the package speed.
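As a minimal sketch, the query entered under “Use results of an SQL query” for this example might look like the following. The table name dbo.CUSTOMER is an assumption inferred from the C_ column prefix; substitute the actual lookup table and schema.

```sql
-- Return only the columns the Lookup needs:
--   C_NAME, C_PHONE : business keys matched against the incoming order row
--   C_CUSTKEY       : surrogate key returned by the Lookup
-- dbo.CUSTOMER is an assumed table name.
SELECT C_CUSTKEY, C_NAME, C_PHONE
FROM dbo.CUSTOMER;
```

Every other column in the table is left out of the result set, so it is neither transferred from the database nor held in the Lookup cache.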
Reducing the data returned by the Lookup has two advantages: less work is required to return the data, and SSIS uses less memory. Memory usage is the next area for discussion. Being more selective about the columns returned is one technique for saving memory; another is to enable memory restriction in the Lookup transform.
Two things happen when memory restriction is enabled on a Lookup: First, the amount of memory that the Lookup is allowed to use is limited. A cache policy is used and new rows are added to the cache on demand. Second, the new rows are added to the cache individually. In other words, SSIS will query the relational database whenever data for a row cannot be located in the internal cache kept by SSIS. These are singleton (single-row) queries, unlike the large set-based table queries that occur when memory restriction is not enabled. The next figure shows where in the Lookup Transformation Editor the setting is made. In the example, 20 megabytes are allowed for the Lookup cache. This is user-settable and must be adjusted to the needs of each lookup scenario.
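To make the singleton queries concrete, the statement issued on a cache miss has roughly the shape sketched below: the reference query wrapped in a filter on the join columns, with one parameter per column. The exact text SSIS generates may differ, and dbo.CUSTOMER is an assumed table name.

```sql
-- Sketch of a per-row (singleton) lookup query.
-- Each ? is a parameter filled from the incoming row's
-- customer name and phone number.
SELECT *
FROM (SELECT C_CUSTKEY, C_NAME, C_PHONE
      FROM dbo.CUSTOMER) AS refTable
WHERE refTable.C_NAME = ?
  AND refTable.C_PHONE = ?;
```

Because a statement of this kind runs once for every row that is not already in the cache, its individual cost largely determines how well a memory-restricted Lookup performs.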
The benefit of enabling memory restriction is that it makes possible large lookups that might not be possible otherwise. The cost is that singleton queries are used, which collectively are usually slower than a single table query. However, we have found that this performance cost is not always as bad as one might expect, provided that appropriate indexes are present on the lookup table. In one recent experiment, we ran a package with and without memory restriction in a lookup. The lookup table for this package had millions of rows, and about 10% of them were actually referenced by the incoming data. When there was an index on the lookup table, the package with memory restriction ran only 25% longer than the package without memory restriction. When there was no index on the lookup table, the package with memory restriction had still not completed after 24 hours.
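For the running example, an index that supports the singleton queries could be created as sketched here. This is an illustration, not the index used in the experiment above; the index name and dbo.CUSTOMER are hypothetical.

```sql
-- Key the index on the columns used to match incoming rows so that
-- each singleton query becomes an index seek rather than a table scan.
-- INCLUDE stores the returned surrogate key in the index, so the
-- lookup can be answered without touching the base table.
CREATE NONCLUSTERED INDEX IX_CUSTOMER_NAME_PHONE
    ON dbo.CUSTOMER (C_NAME, C_PHONE)
    INCLUDE (C_CUSTKEY);
```

Whether the INCLUDE clause is worthwhile depends on the table; if C_CUSTKEY is already the clustered index key, it is carried in the nonclustered index anyway.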
Consider a case where incoming records must be looked up against a large table, but the incoming records match only a few of the records in the lookup table. It might actually be faster to enable memory restriction, thus avoiding a load of the entire table, and letting the few records be fetched individually. A decision to use memory restriction for this reason must be considered in the specific context of each application.
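One way to inform that decision is to compare the size of the lookup table with the number of distinct keys in a representative batch of incoming data. The sketch below assumes the incoming rows are available in a staging table; staging.INCOMING_ORDERS, its column names, and dbo.CUSTOMER are all hypothetical.

```sql
-- Compare the lookup table size with the number of distinct keys
-- the incoming data will ask for. The distinct-key count is an upper
-- bound on the number of lookup rows that will actually be referenced.
SELECT
    (SELECT COUNT(*)
     FROM dbo.CUSTOMER) AS lookup_rows,
    (SELECT COUNT(*)
     FROM (SELECT DISTINCT CUSTOMER_NAME, CUSTOMER_PHONE
           FROM staging.INCOMING_ORDERS) AS k) AS distinct_incoming_keys;
```

If distinct_incoming_keys is a small fraction of lookup_rows, fetching rows on demand may be cheaper than loading the whole table, but the result should be confirmed by timing the package both ways.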
Most of the time you will get good performance from Lookup transforms with default settings. Sometimes, though, a lookup becomes a significant factor in the performance of an SSIS package. This article has given a few tips that we have found make the biggest difference in lookup performance. We hope you find them useful in your package designs.
- Runying Mao and Len Wyatt