I was helping a friend of mine to debug performance issue with his SSIS package. The package was pretty simple – he needed to read data, shard it using some partitioning scheme, then insert the results into 8 SQL destinations. The hashing algorithm used for partitioning was rather complex for conditional split, so he created a simple custom transform component that calculates the partition ID and adds it as a new column; the package then used this column in conditional split and finally there were OLEDB destinations for each shard.

But performance was not good – it took about 10 minutes to push ~32mln rows.

I started with simplest debugging technique – remove destination to check if it is the cause, then remove conditional split, etc. We quickly found the problem is caused by the custom component. But it is very simple:

public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
    while( buffer.NextRow() )
    {
byte[] guid = buffer.GetBytes(m_inputColumn);
int partition = Partition(guid);
buffer.SetInt32(m_outputColumn, partition); buffer.DirectRow(ComponentMetaData.OutputCollection[0].ID); } }

Can you guess why is it so slow?

There are many potential inefficiencies here, e.g. due to the buffer API we’ve to allocate and copy byte array for every row. But the code that took most of the overall execution time was

    buffer.DirectRow(ComponentMetaData.OutputCollection[0].ID);

Why so slow? You see, the PipelineBuffer object is optimized for performance, but ComponentMetaData object and various collections of inputs/outputs/columns/etc are mostly for design time and were not optimized for performance at all. And there is a good reason for this: these collections do not change during runtime, so one should not call them at runtime. So we moved their usage to PreExecute and then used the cached output ID:

public override void PreExecute ()
{
m_outputID = ComponentMetaData.OutputCollection[0].ID;
}

and somewhere in ProcessInput(..) -

    buffer.DirectRow(m_outputID);

Actually, this particular component did not need to call DirectRow at all, and using default exclusion group was enough, so we finally removed it. But if you do need to call it – cache the output ID in advance. Our performance has improved about 20 times by this simple change.