Welcome to MSDN Blogs Sign in | Join | Help

Handling Slowly Changing Dimensions in SSIS

I had a great time at PASS last week, and had a chance to talk to a lot of different SSIS users. One of the big topics seemed to be Slowly Changing Dimensions – I had a number of people ask for various improvements to the current Slowly Changing Dimension Transform in SSIS, and also ask for recommended alternatives in the meantime. I thought I’d summarize some of the more popular approaches I’ve seen, and see if anyone else has some alternatives.

Slowly Changing Dimension Wizard

You might have already tried the Slowly Changing Dimension Wizard that comes with SSIS 2005 and 2008 (and there are a number of good tutorials out there if you haven’t).

Outputs from the Slowly Changing Dimension Wizard

The SCD Wizard has a few things going for it - it’s quick and easy to implement, it handles most SCD scenarios out of the box, and its multi-component approach means you can customize it with the functionality you need.

It does have some pretty big limitations, however, which end up being a deal breaker for a lot of people.

A major inhibitor is the performance of the transform. It doesn’t perform that well for a couple of different reasons:

  1. The data lookups are not cached – each row results in a SQL Query
  2. OLE DB Command does row by row updates
  3. OLE DB Destination added by the wizard doesn’t use FastLoad by default (which can be easily changed in the OLE DB Destination editor UI)

Another downside to the transform is the “one way” nature of the wizard – running it again (to change columns, for example) means you’ll lose any customizations you might have made to the other transforms.

I recommend using the wizard for simple dimensions, where you’re not processing a lot of data. If performance is a concern, consider one of the following approaches.

Using MERGE

I came across this tip from the Kimball Group when I was putting together my Merge & CDC talk last year.

Using the SQL MERGE Statement for Slowly Changing Dimension Processing

In this approach, you write all of your incoming data to a staging table, and then use Execute SQL Tasks to run MERGE statements (you actually have to do two passes – one for Type 1 changes, and one for Type 2 – see the details in the tip above). I posted the sample packages and code I used in a previous blog post.

The performance in this approach is very good (although it moves the bulk of the work to the database machine, which might not be what you want). I recommend it if you don’t mind staging the data, writing custom SQL, or can’t use a 3rd party component in your environment.

Kimball Method SSIS Slowly Changing Dimension Component

I’ve heard great things about Todd McDermid’s custom SCD Transform. Instead of doing row by row lookups, this transform takes in the dimension as an input. This makes the comparison much faster than the stock SSIS version. It wraps up all of the functionality into a single transform, which is great if you’re following the Kimball methodology.

Table Difference Component

I had the chance to meet with the SQLBI.EU guys at PASS, and they mentioned their Table Difference component. I haven’t tried it out myself, but I remembered an email from one of the SQL Rangers (Binh Cao) that suggested this component for SCD processing. I’ve included his write-up here:

Table difference is an SSIS custom component designed to simplify the management of slowly changing dimensions and – in general – to check the differences between two tables or data flow with SSIS.

The component receives input from two sorted sources and generates different outputs for unchanged, new, deleted or updated rows.

clip_image001

  • Unchanged rows (are the data rows that are the same in both inputs)
  • Deleted rows (are the data rows that appear in old source but not in new source)
  • New rows (are the data rows that appear in new source but not in old source)
  • Updated rows (are the data rows that appear in both flows but something is changed)

The inputs MUST be sorted and have a collection of fields (keys) that let the component decide when two rows from the inputs represent the same row, but this is easily accomplished by SQL Server with a simple “order by” and a convenient index; moreover the SCD do normally maintain an index by the business key, so the sorting requirement is easily accomplished and do not represent a problem.

Clicking on TableDifference control gives the following window.

clip_image002

TableDifference analyzes all the columns in both inputs and compares their names. If the name of two columns and their corresponding types are identical, TableDifference adds them to the available columns to manage.

If the flows are sorted, their sort columns will be marked as key fields, using the same order in which they appear in the sort.

All other columns are assigned a standard Update ID of 10 and are managed as comparable columns.

Using the component editor, you need to provide the following information for the columns:

Check Option: you can choose the column type between:

  • Key field:  column will be used to detect when two rows from the inputs represent the same row. Beware that the inputs must be sorted by those columns
  • Compare:  column will be compared one by one to detect differences
  • Prefer NEW: columns will be copied from the NEW input directly into the output, no check
  • Prefer OLD: columns will be copied from the OLD input directly into the output, no check

KeyOrder: If a column is of type “Key Field” it is the order under which the field appear under the “order by” clause of your query. Beware that the component do not check for the correct sorting sequence, it is up to you to provide this information.

Update ID: Each different UpdateID creates a different output flow. If you need to detect when a change appears in some column you can use different update ID. Beware that the lowest update ID wins, i.e. if AccountNumber has update id of 10 and AddressLine1 has update id of 20, then Accountnumber will be checked first and if a change is detected, the row will go to update output 10, no matter if AddressLine has a difference.

clip_image003

Outputs Panel gives option to choose which output to enable as well as to name and describe the output.

clip_image004

Output Details allows selection of the columns for each output.  Here columns that are not needed for an output can be disabled.  The picture shows an example of the DELETED output which only have the Customer Key column in its output.  The less columns in the output, the better the performance of the component.

clip_image005

In the Misc Options tab, string comparisons definition can be defined:

  • The culture ID to use to perform the comparison. If not specified TableDifference will use the culture ID of the running task. The default is “empty”.
  • If you want it to ignore casing during string comparisons. The default is unchecked so TableDifference will perform comparison considering case.

In the Warnings panel, it will list any unused column from the input. As you might recall, if two columns are not identical regarding name and type, TableDifference will ignore them. This might be an error but the decision is up to you. By checking the warnings panel you can see if TableDifference is working with all the columns you need it to compare.

----

Do you have any other SCD processing approaches you’d like to recommend?

Something new for SSIS in SQL Server 2008 R2 November CTP

Ok, so it turns my previous post about R2 wasn’t entirely correct.

SSIS does have one new feature which appears in the latest SQL Server 2008 R2 CTP. If you open up the ADO.NET Destination UI, you’ll notice a new “Use Bulk Insert when Possible” check box.

image

Previously, the ADO.NET Destination did all of its inserts row by row (we do some batching internally, which is why the component has a BatchSize property, but the underlying ADO.NET provider will always do single row inserts). With this new feature enabled, SSIS will use a bulk insert interface (like enabling “FastLoad” for OLEDB Destination). Unfortunately, there isn’t a generic Bulk Load interface for ADO.NET, so this functionality is currently only supported by SQL Server (through the SqlBulkCopy API). Hopefully we can extend support to other ADO.NET providers in the future.

The main reason for implementing this functionality was to improve our support for SQL Azure. As you might already know, ADO.NET is the primary way to communicate with SQL Azure, and adding support for SqlBulkCopy greatly increased the transfer speed. It also speeds up things when you’re working with regular SQL Server systems -- although OLEDB with FastLoad is still the preferred way of doing SQL data loads.

Be sure to try it out and provide feedback before the final release.

Posted by mmasson | 0 Comments
Filed under: , , ,

2009 PASS Summit next week

I’m really looking forward to attending PASS next week!

I’ll be presenting two separate talks this time around:

Maximize Your SSIS Investment with Tuning Tricks and Tips

Session Details

Maximize the performance of your packages with this set of best practices and data flow tuning techniques. We'll show you how to improve the operational efficiency of various Integration Services connectors, and take a look at how to efficiently load your data warehouse solutions.

This is basically a talk about how to get good performance from SSIS without having to invest too much time and effort. I’ll talk about “smart package design”, how to effectively introduce parallelism, and then jump into some of the easier tuning tweaks you can do in your packages.

Originally this talk was scheduled as one of the first sessions on Tuesday, which made it a very good segue for Thomas Kejser’s SQLCAT chalk talk on Tuning ETL and ELT. It looks like my session was moved to Thursday in the latest PASS schedule, which means I’ll be presenting after he does. I think both sessions are a good compliment to each other - While mine focuses on package design, Thomas takes a more scientific approach to performance tuning, with a pretty deep dive on SQL Server optimization. If you’re interested in performance, I highly recommend attending both sessions (there shouldn’t be much overlap).

My other session is a Chalk Talk that I’ll be co-presenting with John Welch.

Metadata Driven ETL - Creating Dynamic Packages with Integration Services

Session Details

Schema changes got you down? Tired of hand coding your SSIS packages for a ton of different data sources? Come see this session where we'll focus on the requirements for metadata driven ETL, and how you can build your own dynamic package generation framework using the SSIS object model. We'll talk about some of SSIS package frameworks currently used in-house at Microsoft, how to automate and test your packages, and about the open-source resources available online.

I’ll be at the SSIS booth and Ask the Experts areas throughout the conference, so be sure to come by and say hello.

Posted by mmasson | 0 Comments
Filed under:

No new features in for Integration Services in SQL Server 2008 R2

Update 2009-11-12: One minor feature change announced.

Have you been holding your breath waiting for Microsoft to announce the new SSIS features for SQL Server 2008 R2?

The answer is pretty simple – nothing new is being added for SSIS in R2. There will be some small bug fixes, but these will also be available in a Cumulative Update / Service Pack release for 2008. R2 is a target release for the Self-Service BI market, with some other engine and tools enhancements included. From a SQL BI point of view, the release is mostly about Project Gemini, Master Data Management, and some cool enhancements to Reporting Services.

Instead of R2, we’ve been focused on the next major SQL Server release - we decided early on that splitting our effort to deliver some new features in R2 would end up diluting the overall value of what we eventually ship. We’ve been hard at work the past little while, and in fact, we’re currently wrapping up our first wave of improvements. It’s a bit too early for us to publically start talking about what we have planned, but I can say that it’s pretty big, and the feedback we’ve received so far is very promising.

I’m really looking forward to when we can start giving demos of the new stuff at conferences... I’ll be sharing all of the details here as soon as I can!

Posted by mmasson | 5 Comments

SQL PASS 2009

I was just looking at the sessions for the SQL PASS 2009 Summit and was happy to see so many SSIS presentations on the list!

Andy Leonard Applied SSIS Design Patterns
Steve Simon SQL Server Integration Services and the modern financial institution
Davide Mauri Instrumenting, Monitoring and Auditing of SSIS ETL Solutions
Brian Knight Loading a Data Warehouse with SSIS
Dave Fackler ETL from the Trenches: Using SSIS in the Real World
Joy Mundy ETL: The Linchpin for the Complete Data Warehouse
Erik Veerman Overcoming SSIS Deployment and Configuration Challenges
Data Profiling and Cleansing with Integration Services 2008

In addition to these SSIS specific topics, it looked like there were a lot of other data warehousing / BI best practices talks that looked like they touched on SSIS as well.

I’ll be presenting a session entitled Maximize Your SSIS Investment With Tuning Tricks and Tips, and I found out earlier this week that I might also get the chance to host a chalk talk on metadata driven ETL. And of course, members of the SSIS team will be in the pavilion / Ask the Experts area throughout the conference, so be sure to come by and say hello.

Posted by mmasson | 0 Comments

Looking up SSIS HResult / COMException ErrorCode

The method I describe in my post about handling COMExceptions during package generation works if you have control over the package generation code, but sometimes you’ll be using third party libraries, or debugging after the fact.

Note, the error codes, symbolic names, and descriptions for all of the SSIS HResults can all be found on the Integration Services Error and Message Reference page in Books Online. This should be your first stop if you want to quickly (and manually) lookup an SSIS error code.

However, if you want to do this programmatically…

The symbolic name for each SSIS error is slightly more useful than the hex error code value, and will sometimes be enough for you to isolate your problem right away. You can programmatically determine the symbolic name by comparing the error code value against the members of the HResults class.

public static string GetSymbolicName(int errorCode)
{
    string symbolicName = string.Empty;
    HResults hresults = new HResults();

    foreach (FieldInfo fieldInfo in hresults.GetType().GetFields())
    {
        if ((int)fieldInfo.GetValue(hresults) == errorCode)
        {
            symbolicName = fieldInfo.Name;
            break;
        }
    }

    return symbolicName;
}

I could use this function in a try/catch block to give the user additional information as to why the error occurred.

catch (COMException ex)
{
    string symbolicName = GetSymbolicName(ex.ErrorCode);
    Console.WriteLine("Symbolic Name: {0}", symbolicName);
}

Retrieving the actual error message for a given SSIS HResult is a little more involved. We can use the FormatMessage API to pull the message directly out of the dtsmsg100.dll (in 2005, it’s dtsmsg.dll). To do this, we’ll need to expose a couple of native methods so we can pinvoke them.

static class NativeMethods
{
    [DllImport("kernel32.dll")]
    public static extern IntPtr LoadLibrary(string dllToLoad);

    [DllImport("kernel32.dll")]
    public static extern bool FreeLibrary(IntPtr hModule);

    [DllImport("Kernel32.dll", SetLastError = true)]
    public static extern uint FormatMessage(uint dwFlags, IntPtr lpSource,
       uint dwMessageId, uint dwLanguageId, ref IntPtr lpBuffer,
       uint nSize, IntPtr pArguments);

    public const int FORMAT_MESSAGE_ALLOCATE_BUFFER = 256;
    public const int FORMAT_MESSAGE_IGNORE_INSERTS = 512;
    public const int FORMAT_MESSAGE_FROM_STRING = 1024;
    public const int FORMAT_MESSAGE_FROM_HMODULE = 2048;
    public const int FORMAT_MESSAGE_FROM_SYSTEM = 4096;
    public const int FORMAT_MESSAGE_ARGUMENT_ARRAY = 8192;
    public const int FORMAT_MESSAGE_MAX_WIDTH_MASK = 255;
}

First we’ll get a handle to dtsmsg100.dll using LoadLibrary:

IntPtr handle = NativeMethods.LoadLibrary("dtsmsg100.dll");
if (handle == IntPtr.Zero)
{
    throw new Exception("Couldn't open library dtsmsg100.dll");
}

Then we’ll call FormatMessage. The first parameter is a set of flags – we’ll want the native method to do the allocation, the messages to be loaded from a specific DLL (dtsmsg), and we don’t want to do any parameter substitution.

const int FormatMessageFlags = NativeMethods.FORMAT_MESSAGE_ALLOCATE_BUFFER | // FormatMessage will allocate buffer
                               NativeMethods.FORMAT_MESSAGE_FROM_HMODULE |    // Messages are loaded from specified DLL
                               NativeMethods.FORMAT_MESSAGE_IGNORE_INSERTS;   // Don't perform place holder substitutions
IntPtr lpMsgBuf = IntPtr.Zero;
uint dwChars = NativeMethods.FormatMessage(
                    FormatMessageFlags,
                    handle,                 // handle to error message dll
                    errorCode,              // error code we're looking up
                    0,                      // 0 for default language
                    ref lpMsgBuf,           // message buffer
                    2048,                   // max size of message buffer
                    IntPtr.Zero);           // substitution arguments

if (dwChars == 0)
{
    // FormatMessage will set LastError
    throw new Win32Exception();
}

string sMessage = Marshal.PtrToStringAnsi(lpMsgBuf);

// Free the buffer
Marshal.FreeHGlobal(lpMsgBuf);

Once we have the error message, we can release the handle to dtsmsg using FreeLibrary.

Using the same scenario from my previous post, getting a COMException off a call to AcquireConnection() gives me an HResult of 0xC020801C. Programmatically retrieving the error message gives me this:

SSIS Error Code DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER.  The AcquireConnection method call to the connection manager "%1" failed with error code 0x%2!8.8X!.  There may be error messages posted before this with more information on why the AcquireConnection method call failed.

Notice the format placeholders (%1 and %2!8.8X!) weren’t filled in. This is because we used the FORMAT_MESSAGE_IGNORE_INSERTS flag on FormatMessage – we wouldn’t have known what values to put in there, since we weren’t the ones that raised the original exception.

I’ve encapsulated all of this logic in the HResultsHelper class. I’ve included the full code listing (including a CreatePackage method from previous API posts) here.

-------

 

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using System.Runtime.InteropServices;
using System.ComponentModel;
using System.Globalization;
using System.Reflection;

namespace ErrorMsgFromHResult
{
    public class HResultHelper : IDisposable
    {
        const string MessageDLL = @"dtsmsg100.dll";

        private IntPtr handle = IntPtr.Zero;
        bool disposed;
        const int FormatMessageFlags = NativeMethods.FORMAT_MESSAGE_ALLOCATE_BUFFER | // FormatMessage will allocate buffer
                                       NativeMethods.FORMAT_MESSAGE_FROM_HMODULE |    // Messages are loaded from specified DLL
                                       NativeMethods.FORMAT_MESSAGE_IGNORE_INSERTS;   // Don't perform place holder substitutions

        public HResultHelper()
        {
            // Load the DLL we'll get the error messages from
            handle = NativeMethods.LoadLibrary(MessageDLL);
            if (handle == IntPtr.Zero)
            {
                throw new Exception(string.Format(CultureInfo.CurrentCulture, "Couldn't open library: {0}", MessageDLL));
            }
        }

        public string GetErrorMessage(COMException comException)
        {
            return GetErrorMessage((uint)comException.ErrorCode);
        }

        public string GetErrorMessage(uint errorCode)
        {
            IntPtr lpMsgBuf = IntPtr.Zero;
            uint dwChars = NativeMethods.FormatMessage(
                                FormatMessageFlags,
                                handle,                 // handle to error message dll
                                errorCode,              // error code we're looking up
                                0,                      // 0 for default language
                                ref lpMsgBuf,           // message buffer
                                2048,                   // max size of message buffer
                                IntPtr.Zero);           // substitution arguments

            if (dwChars == 0)
            {
                // FormatMessage will set LastError
                throw new Win32Exception();
            }

            string sRet = Marshal.PtrToStringAnsi(lpMsgBuf);

            // Free the buffer
            Marshal.FreeHGlobal(lpMsgBuf);

            return sRet;
        }

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);
        }

        private void Dispose(bool disposing)
        {
            if(!this.disposed)
            {
                NativeMethods.FreeLibrary(handle);
                handle = IntPtr.Zero;
                disposed = true;
            }
        }

        ~HResultHelper()
        {
            Dispose(false);
        }

        static class NativeMethods
        {
            [DllImport("kernel32.dll")]
            public static extern IntPtr LoadLibrary(string dllToLoad);

            [DllImport("kernel32.dll")]
            public static extern bool FreeLibrary(IntPtr hModule);

            [DllImport("Kernel32.dll", SetLastError = true)]
            public static extern uint FormatMessage(uint dwFlags, IntPtr lpSource,
               uint dwMessageId, uint dwLanguageId, ref IntPtr lpBuffer,
               uint nSize, IntPtr pArguments);

            public const int FORMAT_MESSAGE_ALLOCATE_BUFFER = 256;
            public const int FORMAT_MESSAGE_IGNORE_INSERTS = 512;
            public const int FORMAT_MESSAGE_FROM_STRING = 1024;
            public const int FORMAT_MESSAGE_FROM_HMODULE = 2048;
            public const int FORMAT_MESSAGE_FROM_SYSTEM = 4096;
            public const int FORMAT_MESSAGE_ARGUMENT_ARRAY = 8192;
            public const int FORMAT_MESSAGE_MAX_WIDTH_MASK = 255;
        }
    }


    class Program
    {
        static void Main(string[] args)
        {
            HResultHelper helper = new HResultHelper();

            try
            {
                CreatePackage();
            }
            catch (COMException ex)
            {
                string errorMessage = helper.GetErrorMessage(ex);
                Console.WriteLine(errorMessage);
            }

            helper.Dispose();
        }

        static Package CreatePackage()
        {
            Package package = new Package();

            // Add Data Flow Task
            Executable dataFlowTask = package.Executables.Add("STOCK:PipelineTask");

            // Set the name (otherwise it will be a random GUID value)
            TaskHost taskHost = dataFlowTask as TaskHost;
            taskHost.Name = "Data Flow Task";

            // We need a reference to the InnerObject to add items to the data flow
            MainPipe pipeline = taskHost.InnerObject as MainPipe;

            // Set the IDTSComponentEvent handler to capture the details from any 
            // COMExceptions raised during package generation
            ComponentEventHandler events = new ComponentEventHandler();
            pipeline.Events = DtsConvert.GetExtendedInterface(events as IDTSComponentEvents);

            //
            // Add connection manager
            //

            ConnectionManager connection = package.Connections.Add("OLEDB");
            connection.Name = "localhost";
            connection.ConnectionString = "Data Source=localhost;Initial Catalog=AdventureWorksDW2008;Provider=SQLNCLI10.1;Integrated Security=SSPI;Auto Translate=False;Connect Timeout=1";

            //
            // Add OLEDB Source
            //

            IDTSComponentMetaData100 srcComponent = pipeline.ComponentMetaDataCollection.New();
            srcComponent.ComponentClassID = "DTSAdapter.OleDbSource";
            srcComponent.ValidateExternalMetadata = true;
            IDTSDesigntimeComponent100 srcDesignTimeComponent = srcComponent.Instantiate();
            srcDesignTimeComponent.ProvideComponentProperties();
            srcComponent.Name = "OleDb Source";

            // Configure it to read from the given table
            srcDesignTimeComponent.SetComponentProperty("AccessMode", 0);
            srcDesignTimeComponent.SetComponentProperty("OpenRowset", "[DimCustomer]");

            // Set the connection manager
            srcComponent.RuntimeConnectionCollection[0].ConnectionManager = DtsConvert.GetExtendedInterface(connection);
            srcComponent.RuntimeConnectionCollection[0].ConnectionManagerID = connection.ID;

            // Retrieve the column metadata
            srcDesignTimeComponent.AcquireConnections(null);
            srcDesignTimeComponent.ReinitializeMetaData();
            srcDesignTimeComponent.ReleaseConnections();

            //
            // Add OLEDB Destination
            //

            IDTSComponentMetaData100 destComponent = pipeline.ComponentMetaDataCollection.New();
            destComponent.ComponentClassID = "DTSAdapter.OleDbDestination";
            destComponent.ValidateExternalMetadata = true;

            IDTSDesigntimeComponent100 destDesignTimeComponent = destComponent.Instantiate();
            destDesignTimeComponent.ProvideComponentProperties();
            destComponent.Name = "OleDb Destination";

            destDesignTimeComponent.SetComponentProperty("AccessMode", 3);
            destDesignTimeComponent.SetComponentProperty("OpenRowset", "[DimCustomer_Copy]");

            // set connection
            destComponent.RuntimeConnectionCollection[0].ConnectionManager = DtsConvert.GetExtendedInterface(connection);
            destComponent.RuntimeConnectionCollection[0].ConnectionManagerID = connection.ID;

            // get metadata
            destDesignTimeComponent.AcquireConnections(null);
            destDesignTimeComponent.ReinitializeMetaData();
            destDesignTimeComponent.ReleaseConnections();

            //
            // Connect source and destination
            //

            IDTSPath100 path = pipeline.PathCollection.New();
            path.AttachPathAndPropagateNotifications(srcComponent.OutputCollection[0], destComponent.InputCollection[0]);

            //
            // Configure the destination
            // 

            IDTSInput100 destInput = destComponent.InputCollection[0];
            IDTSVirtualInput100 destVirInput = destInput.GetVirtualInput();
            IDTSInputColumnCollection100 destInputCols = destInput.InputColumnCollection;
            IDTSExternalMetadataColumnCollection100 destExtCols = destInput.ExternalMetadataColumnCollection;
            IDTSOutputColumnCollection100 sourceColumns = srcComponent.OutputCollection[0].OutputColumnCollection;

            // The OLEDB destination requires you to hook up the external columns
            foreach (IDTSOutputColumn100 outputCol in sourceColumns)
            {
                // Get the external column id
                IDTSExternalMetadataColumn100 extCol = (IDTSExternalMetadataColumn100)destExtCols[outputCol.Name];
                if (extCol != null)
                {
                    // Create an input column from an output col of previous component.
                    destVirInput.SetUsageType(outputCol.ID, DTSUsageType.UT_READONLY);
                    IDTSInputColumn100 inputCol = destInputCols.GetInputColumnByLineageID(outputCol.ID);
                    if (inputCol != null)
                    {
                        // map the input column with an external metadata column
                        destDesignTimeComponent.MapInputColumn(destInput.ID, inputCol.ID, extCol.ID);
                    }
                }
            }

            return package;
        }

        class ComponentEventHandler : IDTSComponentEvents
        {
            private void HandleEvent(string type, string subComponent, string description)
            {
                Console.WriteLine("[{0}] {1}: {2}", type, subComponent, description);
            }

            #region IDTSComponentEvents Members

            public void FireBreakpointHit(BreakpointTarget breakpointTarget)
            {
            }

            public void FireCustomEvent(string eventName, string eventText, ref object[] arguments, string subComponent, ref bool fireAgain)
            {
            }

            public bool FireError(int errorCode, string subComponent, string description, string helpFile, int helpContext)
            {
                HandleEvent("Error", subComponent, description);
                return true;
            }

            public void FireInformation(int informationCode, string subComponent, string description, string helpFile, int helpContext, ref bool fireAgain)
            {
                HandleEvent("Information", subComponent, description);
            }

            public void FireProgress(string progressDescription, int percentComplete, int progressCountLow, int progressCountHigh, string subComponent, ref bool fireAgain)
            {
            }

            public bool FireQueryCancel()
            {
                return true;
            }

            public void FireWarning(int warningCode, string subComponent, string description, string helpFile, int helpContext)
            {
                HandleEvent("Warning", subComponent, description);
            }

            #endregion
        }
    }
}
Posted by mmasson | 0 Comments
Filed under:

Debugging a COMException during package generation

You’ll occasionally receive a COMException when you’re programmatically generating SSIS package – typically when dealing with Data Flow components. These COM Exceptions will provide you with an HRESULT (ErrorCode), and no additional information.

System.Runtime.InteropServices.COMException (0xC020801C): Exception from HRESULT: 0xC020801C
   at Microsoft.SqlServer.Dts.Pipeline.Wrapper.CManagedComponentWrapperClass.AcquireConnections(Object pTransaction)
   at MyApp.Program.CreatePackage()
   at MyApp.Program.Main(String[] args)

Ideally the exception you receive would have the underlying error message included. Unfortunately, the SSIS pipeline API doesn’t provide the IErrorInfo support needed for the .NET runtime to determine the error details, but there is a way for you to receive the error(s) using an event handler.

For this to work, you’ll need to be able to modify the package generation code (or at least control the TaskHost and MainPipe of the data flow objects). You’ll receive detailed error messages by supplying an IDTSComponentEvents handler for the IDTSPipeline100.Events property. The FireError event will be raised right before any COMException is thrown, proving you the error details.

The first step is to create a class which implements the IDTSComponentEvents interface. This class is simply going to report all errors and warnings out to the console as they occur.

class ComponentEventHandler : IDTSComponentEvents
{
    public bool FireError(int errorCode, string subComponent, string description, string helpFile, int helpContext)
    {
        Console.WriteLine("[Error] {0}: {2}", subComponent, description);
        return true;
    }
[…]
}

Next, we’ll hook up an instance of this class to the Events property of our MainPipe (Data Flow) object.

Package package = new Package();

// Add Data Flow Task
Executable dataFlowTask = package.Executables.Add("STOCK:PipelineTask");

// Set the name (otherwise it will be a random GUID value)
TaskHost taskHost = dataFlowTask as TaskHost;
taskHost.Name = "Data Flow Task";

// We need a reference to the InnerObject to add items to the data flow
MainPipe pipeline = taskHost.InnerObject as MainPipe;

// Set the IDTSComponentEvent handler to capture the details from any 
// COMExceptions raised during package generation
ComponentEventHandler events = new ComponentEventHandler();
pipeline.Events = DtsConvert.GetExtendedInterface(events as IDTSComponentEvents);

 

After this, any COMExceptions raised by the pipeline will be displayed on the console.

For example, let’s look at the COM Exception I listed at the top of this post (HRESULT 0xC020801C). Looking up the error code on the Integration Services Error and Message Reference page tells me that the HRESULT maps to DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER, so there’s a good chance one of my calls to AcquireConnection() is causing the exception. If my package contained a number of data sources and connection managers, it could make it very difficult to debug.

After hooking up the event handler, I see the following on the command line before receiving the exception:

[Error] OleDb Source [1]: SSIS Error Code DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER.  The AcquireConnection method call to the connection manager "localhost" failed with error code 0xC0202009.  There may be error messages posted before this with more information on why the AcquireConnection method call failed.

Now I know which source is failing (OleDb Source [1]), and which connection manager (localhost) is causing the problem. It could still be better, but now at least I know where to start debugging (ie. make sure that the database on localhost is up and running).

Here is a full implementation of IDTSComponentEvents incase you want to try it out yourself.

--------

class ComponentEventHandler : IDTSComponentEvents
{
    private void HandleEvent(string type, string subComponent, string description)
    {
        Console.WriteLine("[{0}] {1}: {2}", type, subComponent, description);
    }

    #region IDTSComponentEvents Members

    public void FireBreakpointHit(BreakpointTarget breakpointTarget)
    {
    }

    public void FireCustomEvent(string eventName, string eventText, ref object[] arguments, string subComponent, ref bool fireAgain)
    {
    }

    public bool FireError(int errorCode, string subComponent, string description, string helpFile, int helpContext)
    {
        HandleEvent("Error", subComponent, description);
        return true;
    }

    public void FireInformation(int informationCode, string subComponent, string description, string helpFile, int helpContext, ref bool fireAgain)
    {
        HandleEvent("Information", subComponent, description);
    }

    public void FireProgress(string progressDescription, int percentComplete, int progressCountLow, int progressCountHigh, string subComponent, ref bool fireAgain)
    {
    }

    public bool FireQueryCancel()
    {
        return true;
    }

    public void FireWarning(int warningCode, string subComponent, string description, string helpFile, int helpContext)
    {
        HandleEvent("Warning", subComponent, description);    
    }

    #endregion
}
Posted by mmasson | 0 Comments
Filed under:

Speaking for the Vancouver BI PASS Chapter

I’ll be presenting for the Vancouver BI PASS Chapter next Friday, July 10th, at the Microsoft Canada offices. I really enjoyed meeting the Pacific Northwest SQL Server User Group a couple of weeks ago, and I’m looking forward to sharing SSIS with my fellow Canadians in Vancouver. I’ll be going over some common SSIS pitfalls, and it seems like I’ll have plenty of time to go over any other topics that come up (perhaps there will be interest in CDC & Merge?)

Hope to see you there!

Friday, July 10th 2009 
8:30 - 11:30 AM

Microsoft Canada
1111 W. Georgia, Suite 1100 (11th floor) 
Vancouver, British Columbia 
V6E 4M3

Posted by mmasson | 0 Comments
Filed under:

Scripts and assembly caching in 2005

A recent discussion with MVP Andy Leonard about a new property which showed up on the Script Task in SQL Server 2005 SP3 reminded me that I’ve been meaning to blog about how the script task and script component cache their precompiled code at runtime. This behavior has changed in every SP since 2005 RTM, and while there are a couple of KB articles that describe the issues that prompted the changes, I haven’t seen anyone document the behavior itself.

Note: The behavior in SQL Server 2008 is different from 2005 because of the switch from VSA to VSTA. 2008 does not have any of the problems or limitations listed below.

Script Assemblies

First a bit of background information on how the script task and script component work. I think the BOL entry for the script task does a good job of explaining the compilation behavior:

VSA scripts must be compiled before the package runs. The Script task includes the option to precompile script into binary code when the package is saved. When script is precompiled, the language engine is not loaded at run time and the package runs more quickly; however, compiled binary files consume significant disk space. If you do not use the precompile option, the script is compiled at run time, which slows package execution but consumes less disk space. If storing large packages is not a problem, you should precompile your VSA scripts. Also, only precompiled script can be run in a 64-bit environment.

Whether you’re using the PreCompile option or not, the script will eventually be compiled into a .NET assembly. Loading the assembly takes time, uses memory, and requires a separate AppDomain

2005 RTM

In the initial 2005 release, SSIS did not cache these Script Assemblies. This meant that if you were executing a script task inside of a loop, a new AppDomain would be created for each iteration, and the assembly is reloaded. This wasn’t good for performance, and because we can’t unload/destroy an AppDomain until the process finishes, could result in memory issues.

2005 SP1

Script assembly caching was introduced in SP1 to address the performance and memory issues. This functionality made sure that the script assembly was only loaded once, and then reused on subsequent executions. However, each script was associated with a GUID… if you copy/paste the .DTSX file, the GUID for the script task or script component stays the same (even if you later go in and modify the script code). This led to unexpected behavior – if the script code was completely different, but came from a copy/pasted script task, you’d end up reusing one of the script assemblies for both tasks. The related KB article describes the problem in more detail.

2005 SP2

This behavior was fixed in SP2 by changing the GUID every time the script is modified. This meant that if the script code was the same, and the name of the script task or component matched, the script assembly would only be loaded once.

2005 SP3

There was one remaining side effect of script caching that hadn’t turned up yet. This KB article describes the problem you might run into, but doesn’t give much detail (other than that it’s related to caching).

The issue was that if the same script is executed in parallel (multiple execute package tasks running the same child package in process), they might share the same Dts runtime object (which has collections for variables, and connection managers). So if a script in the first instance sets variable Foo to “A”, and the second instance sets it to “B”, the first instance would end up with a value of “B” as well.

You can see why this would be a problem (and extremely confusing).

Unfortunately this problem isn’t easily fixed without entirely disabling script caching. Because we don’t want to go back to the RTM behavior (no caching at all), we introduced a new property to the Script Task / Component to control caching - OptimizeScriptExecution. If you’re running a child package with scripts multiple times in parallel, be sure to set this value to False, which disables the cache and prevents the scripts from sharing a runtime object.

I hope that helps!

Posted by mmasson | 2 Comments
Filed under:

Speaking at the PNW SSUG

I’ll be speaking at the Pacific Northwest SQL Server User Group meeting next Wednesday, June 10th. I’ll be giving an updated version of the Avoiding Common Pitfalls in SSIS talk I gave at TechEd a couple of weeks ago. It will be my first time attending a user group meeting, and I’m really looking forward to meeting some more SQL people from the area.

See you there!

Posted by mmasson | 0 Comments
Filed under:

Tutorial: SSIS Performance Videos

This post is coming a little late (these have already been announced here and here), but incase you haven’t seen these before, I wanted to bring attention to four new performance related created by the SQLCAT and SSIS teams. They mostly focus on larger scale/Enterprise level package deployments, but there are good tips for all levels of SSIS usage.

From the SQLCAT post:

Measuring and Understanding the Performance of Your SSIS Packages in the Enterprise (SQL Server Video)
Link | Watch this video

Author: Denny Lee, Microsoft Corporation

This video demonstrates how to measure and understand the performance of packages, based on lessons learned from enterprise customers. In this video, you will learn the following guidelines for improving performance:

  • How the limits of the source system affect performance.
  • Why disk I/O is important.
  • Why you should establish a package performance baseline

Tuning Your SSIS Package Data Flow in the Enterprise (SQL Server Video)
Link | Watch this video

Author: David Noor, Microsoft Corporation

This video demonstrates how to improve the performance of the data flow in an Integration Services package. In this video, you will learn how to tune the following phases of the data flow:

  • Extraction
  • Transformation
  • Loading

You can apply these performance tuning tips when you design, develop, and run the data flow.

Understanding SSIS Data Flow Buffers (SQL Server Video)
Link | Watch this video

Author: Bob Bojanic, Microsoft Corporation

This video looks at the memory buffers that transfer data to the data flow of an Integration Services package. The video includes the following items:

  • Demonstration of a simple package that shows you how Integration Services divides the data along the data flow paths and how buffers carry data along those paths.
    This simple package performs the following operations:
    • Data extraction
    • Character mapping
    • Creation of new columns by using a synchronous transformation
    • Multicasting
    • Sorting by using a blocking (asynchronous) transformation.
  • Design guidelines for building and maintaining the data flow of a package.

Designing Your SSIS Packages for Parallelism (SQL Server Video)
Link | Watch this video

Author: Matt Carroll, Microsoft Corporation

This video demonstrates how to increase the performance of your Integration Services packages by designing for parallelism.

------

These four videos join the growing list of SSIS tutorial videos in the SQL Server Video series. Others that might interest you:

Avoiding Common Pitfalls in SSIS

I’ve uploaded the slides for the Chalk Talk I gave at TechEd last week. I always appreciate feedback (good or bad), and I’ve gotten some great comments so far. If you attended the session (or just like the slides), feel free to comment below.

Posted by mmasson | 2 Comments
Filed under:

Samples for the CDC & Merge demo

I’ve put together the samples I used in the Incremental Data Warehousing talk I did at the SSWUG.Org Virtual Conference. You can download them from my sky drive, here:

The zip has a readme.txt file that should explain how to setup and run the samples. Please let me know if you have any questions.

For those keeping track – these are the same demos I did at the last PASS Conference in Seattle. I updated the talk a bit for the SSWUG conference, but the demos are essentially the same.

Posted by mmasson | 0 Comments

Sample videos posted for the SSWUG.ORG Virtual Conference

They’ve posted 10 minute previews for some of the talks at the upcoming SSWUG.ORG Virtual Conference. Here are some of the SSIS related topics:

Matt Masson - Incremental Data Warehouse Loads with Merge & CDC

John Welch -  Doing More (ETL) With Less (Effort) by Automating SSIS

Brian Knight - Loading a Data Warehouse With SSIS

Anthony D’Angelo - Deep Dive: Extending SSIS with .NET Development

They also have a free community event coming up on April 17th. If you’re not sure how the whole virtual conference thing works, this event is a great way to try it out.

Posted by mmasson | 0 Comments
More Posts Next page »
 
Page view tracker