With first Katmai (SQL Server 2008) CTP out, I think it is time to blog about some performance and scalability improvements in this release.

I'll assume readers are familiar with SSIS data flow performance concepts, if not make sure you've read these two articles:
http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx
http://www.simple-talk.com/sql/sql-server-2005/sql-server-2005-ssis-tuning-the-dataflow-task/

The most important concepts for today are:
Asynchronous components - components that create new rows and thus new data flow buffers (compare to synchronous components that modify data in the incoming buffers, but can't create or remove rows). Both blocking and partially blocking components (in terms of ssisperf whitepaper linked above) are asynchronous.
Execution trees - the tree describing the path a data buffer takes through the components, from start to end. The tree starts with a source or async component and usually ends with a destination or sometimes with a transform. It is a tree (not a simple list) because some transforms, e.g. Multicast "split" the data flow into multiple logical paths, but avoid buffer duplication - all sub-trees use the same "physical" buffer.

In SQL 2005 each execution tree was assigned a single OS thread ("worker thread"), and under some conditions (complex packages with small data flow EngineThreads property value) several execution trees could share a thread. One benefit of this approach is that all thread scheduling was done in advance during pre-execution phase, so the data flow did not have to spend any time to assign threads at runtime.

But there were several drawbacks: since the data flow did not know relative amount of work per execution tree, the scheduling could be suboptimal. Also, if you had simple package with just one or two execution trees, you would only use one or two processors, and the package might not benefit from high-end multiprocessor machine. You may have a lot of synchronous components, but if they share the same execution tree, they used to share the thread too. Even if you logically split the data flow using Multicast (synchronous transform), all output paths of Multicast belong to the same execution tree, and thus are executed serially by Yukon data flow task.

To achieve high level of parallelism on a multiprocessor machine, you had to split an execution tree, either by splitting the tree into independent paths, or by inserting async transforms to create a new tree. UnionAll could be used for later, as described in the ssisperf whitepaper. If you insert UnionAll with one input, it does not change the outputs, but splits the execution tree into two new trees - each can be executed on its own processor. The drawback is that the UnionAll is async transform, and thus has to copy data, so it might have a noticeable performance overhead. You should only use this trick if you checked that your package benefits from this extra parallelism. Usually packages already have multiple data flows, or multiple execution trees inside single data flow, so they don't need to do this. But if you have a very high-end machine and a pipeline that takes too long and does not use all the processors, you should try it (make a backup copy of original package and compare the performance before commiting to this change).

 

Now the good news: you don't need to worry about any of this or use the UnionAll trick anymore! In Katmai, the data flow was redesigned to do dynamic scheduling and can now execute multiple components in parallel, even if they belong to the same execution tree. The overhead of dynamic scheduling is very small. We sometimes saw a very small (~1%) performance loss on single processor machines, but on multiprocessor machines you'll usually see performance improvement, especially if you had to use UnionAll trick to introduce more parallelism and remove it when you move to Katmai.

As most server machines now have two or more processors or at least Hyper Threading, we think most users will see the performance improvements from this change. And you get the best performance automatically, no need to think if you need to introduce more parallelism by adding UnionAll transform, one less bullet to worry about.