The Cache Transform isn't currently able to update (i.e. append to) an existing persistent cache file. This pattern presents a way to incrementally build up your lookup cache if your data flow process is responsible for adding new rows to your reference table. For an alternative approach, please see the Other Resources section at the end of this post.
The process is split up across multiple packages. Because a Cache Connection Manager reads its cache file once, and keeps the cache in memory until package execution is complete, you can't read and update a cache file in the same package.
This is the control package, which uses a script task to check whether the cache file already exists. If it doesn't, it executes the "Create Cache" package. Once the cache is there, the main data flow package, which performs your general ETL logic, is executed. The "Data Flow" package writes out any new rows for the reference cache to a temporary file, and updates a variable in the control package to indicate how many rows were written. The parent checks if any rows were written, and if so, runs the "Update Cache" package to update the persistent cache file.
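The existence check in the script task can be a one-liner that sets a boolean package variable, which the precedence constraints then use to decide whether to run "Create Cache". This is a minimal sketch; the variable names (`User::CacheFilePath`, `User::CacheExists`) are illustrative, not from the sample packages, and both must be listed in the task's ReadOnlyVariables/ReadWriteVariables.

```csharp
public void Main()
{
    // Path to the persistent cache file, read from a package variable
    // (variable names here are assumptions for the sketch)
    string cachePath = Dts.Variables["User::CacheFilePath"].Value.ToString();

    // Flag consumed by the precedence constraints that decide
    // whether the "Create Cache" package needs to run
    Dts.Variables["User::CacheExists"].Value = System.IO.File.Exists(cachePath);

    Dts.TaskResult = (int)ScriptResults.Success;
}
```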
The cache creation package reads the source reference data, and then writes it to two separate locations. First it uses the Cache Transform to create a persistent cache, and then it writes the same data out to a RAW file. This RAW file will be incrementally updated by the Data Flow package to store the latest data for our reference set.
In the main ETL package, we use a key generation pattern to generate and insert new values into our reference table. Instead of rejoining the main flow right away, we multicast it out to a Row Count, followed by a RAW File Destination.
The Row Count writes to a local variable. After this data flow, we'll use a script task to copy this variable's value over to a variable in the parent package. You can see Jamie Thomson's post for more about how this works.
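The variable copy relies on the fact that when a package is executed by an Execute Package Task, the parent's variables are visible to the child by name at runtime. A minimal sketch of the child package's script task (variable names are assumptions; add them to the task's ReadOnlyVariables/ReadWriteVariables lists):

```csharp
public void Main()
{
    // Running as a child of the control package, so the parent's
    // variables can be resolved by name at runtime.
    // Copy the count captured by the Row Count transform up to the parent.
    int rowsWritten = (int)Dts.Variables["User::LocalRowCount"].Value;
    Dts.Variables["User::ParentRowCount"].Value = rowsWritten;

    Dts.TaskResult = (int)ScriptResults.Success;
}
```

Note that the parent variable won't resolve at design time (the parent isn't running), so validation errors on `User::ParentRowCount` only clear when the packages execute together.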
The RAW File Destination is able to append to an existing RAW File. Here we use it to add the new rows to the "incremental update" file we created in the "Create Cache" package.
If the main ETL package wrote any new rows to the incremental update RAW file, the control package runs the Update Cache package. This package simply reads the incremental update file (which contains the entire reference set, since new rows are appended to it on each run) and recreates the persistent cache file.
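In the control package, this decision can be expressed on the precedence constraint leading into the Execute Package Task for "Update Cache", with the evaluation operation set to Expression (the variable name is an assumption, matching whichever parent variable receives the row count):

```
@[User::ParentRowCount] > 0
```

If the expression evaluates to false, the Update Cache package is skipped and the existing cache file is left as-is.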
The sample packages used in this post can be downloaded from my SkyDrive share.
Thanks for this wonderful article.
I think this is one of the best articles I have read in a long time.
But I would not say this is a complete incremental in-memory cache update; it's more like you just keep adding new rows into your cache.
I would like to know: is it possible to keep adding rows incrementally within the same run?
What I mean is: say I have a number of files that arrive daily, say 24, one per hour, and then I load them into my DW.
I have more than 20 million rows, so if I followed your method I would have to keep adding to the reference cache, which in turn keeps adding to my reference table.
I really don't want that, because I believe it will keep increasing the cache size.
Now, while the package is running and iterating through each file in a single run, I would like to keep adding rows as I go, so that once a row has been added to the reference cache and the same key arrives in the next hour's file, it is treated as an update.
Within the same package run, is it possible to incrementally add to the cache in memory, instead of running the package once per source file, as I am doing now?
Is it possible to achieve this within a single run of the package with full cache mode?