Most of this content comes from my Performance Design Patterns series. If you’re interested in learning more about SSIS Design Patterns, I’ll be doing full day pre-conference training sessions at the PASS Summit and SQLBits.
Whenever I’m presenting a session about SSIS performance, I like to start off with the following slide:
SSIS is a pretty powerful tool which can be used in a variety of scenarios. Like most tools, it is more suited for some scenarios than others. It is a great choice if you’re pulling in data from multiple sources, or splitting it up and sending it to a number of places. It’s also good if your data needs to go through a series of transforms, or you’re merging multiple sources of data. Finally, the package designer in BIDS lets you visually layout your workflow, which many people prefer over writing custom .NET or T-SQL code.
You’ll want to be careful about using SSIS if your design requires you to do trickle feed or real-time ETL type operations. SSIS can do them, but it was really designed for bulk data loads. Our data pipeline is really fast, but the runtime that loads and hosts it can be slow to startup at times. When you’re moving large amounts of data, you don’t notice this startup cost, but you will if you’re running your package every 15 seconds, or only moving a row or two at a time.
One of the first big customer issues I worked on, they had a single set of packages to do their bulk loads, and their incremental feeds. I say incremental, but it was more like a trickle feed – they had a web process that would kick off all of the packages in the project to move one to five rows of customer data. Their solution was big – close to 30 packages. They’d run through these really complex data flows that worked great when they were moving their entire data set, but was overkill when moving just a couple of rows.
Finally, there are a couple of scenarios where SSIS isn’t the optimal tool to use.
First, if your source and destination databases are on the same server, you’ll probably want to do everything using SQL. Remember that SSIS is an “in memory ETL engine” – it will take the data from the source server, bring it into memory, and push it to the destination server. If all of your data is already in the same SQL Server instance, It’s usually more efficient to move it around with an INSERT INTO … SELECT statement.
Another scenario would be if you’re doing a straight file to database load, without applying any transformations or control flow type logic to it. This does work fine with SSIS, but it might be overkill for simple loads. You can use SSIS if you want the graphical design experience, or want to add additional control flow logic, but you can achieve the same performance performance, or maybe even a little better, if you use a BULK INSERT statement, or BCP.
Here are some additional links for more information:
So what is the right tool for real-time ETL across multiple datasources?
The video link I posted (Data Integration at Microsoft) should cover some of that. It will depend on a couple of things - how much data are you moving, and how much latency is acceptable. It usually comes down to using SQL Replication, Service Broker, or even StreamInsight (if you only need to be pushing out a subset of the incoming data).
Jamie Thomson recently blogged about using Service Broker for a real-time ETL job he was working on. sqlblog.com/.../service-broker-not-etl.aspx
"the runtime that loads and hosts it can be slow to startup at times"
- what could be causing to be slow to startup? also is the runtime different from DTEXEC?