Welcome to MSDN Blogs Sign in | Join | Help

Accessing a data profile programmatically

The new Data Profiling Task in 2008 generates an XML output file. The output is easy to read, and could be used to make decisions within a control flow. For example, you could check whether or not to process your most recent data set based on criteria in the profile (are any values in certain columns null? Do my values fall within expected ranges?)

While you can parse this XML file in traditional ways, those of you who aren't afraid of using undocumented, unsupported APIs can make use of the classes in the Microsoft.SqlServer.DataProfiler assembly which we use to load the XML profile internally.

Let's use a simple scenario: We want to check one of the columns (AddressLine1) in our staging table to make sure that it contains no NULL values. If it is clean, we process it in a data flow task. If NULL values are found, we want to send an email to the DBA.

Here's what the control flow looks like:

image

We're first running our data profile, then processing the results in a script task. If we match our criteria (i.e. no nulls in the AddresLine1 column), the script task returns success and we run the "Process" data flow. If the criteria doesn't match, we fail the task, and run the send mail task instead.

Note, we can store the results of the data profiling task in a variable (as a string) instead of saving it out to disk.

image

Now the interesting part - the script task.

First, add a reference to the Microsoft.SqlServer.DataProfiler DLL. It can be found under %ProgramFiles%\Microsoft SQL Server\100\DTS\Binn\Microsoft.SqlServer.DataProfiler.dll (and should also be in the GAC).

The following code loads the data profile XML from the package variable, de-serializes it into a DataProfile object, and cycles through the profiles until it finds the one its looking for.

using Microsoft.DataDebugger.DataProfiling;
 
const string ColumnName = "AddressLine1";
readonly long Threshold = 0;
 
public void Main()
{
    Dts.TaskResult = (int)ScriptResults.Success;
 
    // Retrieve the profile from the package variable
    string dataProfleXML = Dts.Variables["User::DataProfile"].Value.ToString();
 
    // Deserialize
    DataProfileXmlSerializer serializer = new DataProfileXmlSerializer();
    DataProfile profile = serializer.Deserialize(new System.IO.StringReader(dataProfleXML));
 
    // Cycle through the profiles to find the one we're looking for
    foreach (Profile p in profile.DataProfileOutput.Profiles)
    {
        // Check the profile type
        if (p is ColumnNullRatioProfile)
        {
            // Match the column name
            ColumnNullRatioProfile nullProfile = p as ColumnNullRatioProfile;
            if (nullProfile.Column.Name.Equals(ColumnName, StringComparison.InvariantCultureIgnoreCase))
            {
                // Make sure it's within our threshold
                if (nullProfile.NullCount > Threshold)
                {
                    // Fail the task
                    Dts.TaskResult = (int)ScriptResults.Failure;
                }
 
                break;
            }
        }
    }           
}

Note, as I mentioned above, this API is for internal use, and is subject to change (it already has between CTP5 and CTP6, and will change again in the upcoming CTP-Refresh). I just thought some people out there might find this interesting.

Published Wednesday, March 12, 2008 11:03 PM by mmasson
Filed under:

Comments

# MSDN Blog Postings » Accessing a data profile programmatically

# re: Accessing a data profile programmatically

Yeah, very interesting. I plan to put a post together outlining a supported way to do this -haven't got round to it yet- but if and when I do I'll definitely link back here.

-Jamie

Thursday, March 13, 2008 9:27 AM by jamiet

# re: Accessing a data profile programmatically

I am trying to use 2008's data profiler XML output data in SSRS report (I am under the impression my SSRS data source needs to be a SQL table).  Is it possible for me to convert the XML profiler ouput to a SQL table using the API mentioned above?  Regards,

Tuesday, December 09, 2008 1:29 AM by n0012

# re: Accessing a data profile programmatically

There's nothing in the API that lets you automatically load the profile XML document into a sql table, however, you could do the mapping yourself with a little code.

I believe that SSRS supports a number of data source types, including XML and custom .NET objects. If you search the books online, you should be able to find a couple of different ways to access the data.

http://msdn.microsoft.com/en-us/library/aa964129.aspx

Tuesday, December 09, 2008 12:04 PM by mmasson

# re: Accessing a data profile programmatically

Thank you for the feedback.  Since our reports already run off of SQL tables, I think I will try to convert the XML.

I tried following the steps mentioned in your post, but run into a problem due to a SSIS string variable memory limit error.  (When I save the data profile results to an XML file, the sizes can get quite large for value frequency distros, which is the component I am using.  One example I tried was over 200mb – is there a varchar(max) like variable in SSIS?)  I spoke with a coworker who has a bit more XML experience, and he mentioned being able to bulk load XML files into a sql table using a schema mapping file.  

What are your thoughts on going that route?

I created in VS a XSD file based on the data profiler XML output, but am getting stuck on adding the sql relationship code snippets for bulk loading.

The examples I have been trying to follow are here: http://msdn.microsoft.com/en-us/library/aa225754.aspx

I can email you the XML, XSD, and create table statement if that would be useful.

Thanks for helping out a XML novice!  Any feedback you have regarding the best approach would be much appreciated.

Wednesday, December 10, 2008 10:32 PM by n0012

# re: Accessing a data profile programmatically

I made a bit of progress using the API.  I'm stuck at the point where I try to access the freqdist values and counts - I believe some kind of collection / array is being used.  If I can figure that out, plus solve the memory error issue, I think I'm close to having a solution.  Regards,

       using System;

       using System.Collections;

       using System.Data;

       using System.Data.SqlClient;

       using Microsoft.SqlServer.Dts.Runtime.VSTAProxy;

       using System.Windows.Forms;

       using Microsoft.DataDebugger.DataProfiling;

       string  DataSource;

       string  DataBase;

       string  Schema;

       string  Table;

       long    RowCount;

       string  ColumnName;

       long    DistinctValues;

       string  Value;

       long    Count;

       public void Main()

       {

           Dts.TaskResult = (int)ScriptResults.Success;

           //SqlConnection sqlConn = new SqlConnection("DSN");

           //sqlConn.Open();

           // Retrieve the profile from the package variable

           string dataProfleXML = Dts.Variables["User::xmlResults"].Value.ToString();

           // Deserialize

           DataProfileXmlSerializer serializer = new DataProfileXmlSerializer();

           DataProfile profile = serializer.Deserialize(new System.IO.StringReader(dataProfleXML));

           // Cycle through the profiles to find the one we're looking for    

           foreach (Profile p in profile.DataProfileOutput.Profiles)

           {

               if (p is ColumnValueDistributionProfile)

               {

                   ColumnValueDistributionProfile ValueDist = p as ColumnValueDistributionProfile;

                   DataSource = ValueDist.Table.DataSource;

                   MessageBox.Show(DataSource,"DataSource");

                   DataBase = ValueDist.Table.Database;

                   MessageBox.Show(DataBase,"Database");

                   Schema = ValueDist.Table.Schema;

                   MessageBox.Show(Schema, "Schema");

                   Table = ValueDist.Table.Table;

                   MessageBox.Show(Table, "Table");

                   RowCount = ValueDist.Table.RowCount;

                   MessageBox.Show(RowCount.ToString(), "RowCount");

                   ColumnName = ValueDist.Column.Name;

                   MessageBox.Show(ColumnName, "ColumnName");

                   DistinctValues = ValueDist.NumDistinctValues;

                   MessageBox.Show(DistinctValues.ToString(), "DistinctValues");

                   // how do i access the ValueDistributionItem details (values and counts) ??

                   // the below does not seem to work

                   IEnumerator MyEnum = ValueDist.ValueDistribution.GetEnumerator();

                   while (MyEnum.MoveNext())

                   {

                       MessageBox.Show(MyEnum.Current.ToString());

                   }

                   // insert variables into sql table using ado connection

               }

           }

           //sqlConn.Close();

       }

Thursday, December 11, 2008 3:41 AM by n0012

# Learn more about Data Profiling in SQL 2008: A compilation

Microsoft Research contributed the Data Profiling Task and Data Profile Viewer to Integration Services

Tuesday, January 20, 2009 6:47 PM by Douglas Laudenschlager

# Extending the Data Profiling Task

Extending the Data Profiling Task

Tuesday, March 31, 2009 10:20 AM by Sid Atkinson Jr.

# re: Accessing a data profile programmatically

Is there anyway I can save the results to a SQL table?

I'm currently trying to use an ExecuteSqlTask to access the results from a package user variable and update the table but I don't know how this task can access package variables.

I would use a ScriptTask component but I need to be able to do this programatically and as far as I am aware you cannot add a ScriptTask component programatically.

Wednesday, April 01, 2009 9:32 AM by LiamWalsh

# re: Accessing a data profile programmatically

Hi Liam,

It is possible to add a script task programmatically - see this post here:

http://dougbert.com/blogs/dougbert/archive/2008/05/24/adding-a-vsta-script-task-programmatically.aspx

You can use variables with the Execute SQL Task by binding variables to query parameters. You could create a stored procedure which takes in an XML document (which is how the Data Profiling Task will store the data), and then either store the value in a table, or perform some sort of XQuery on it.

I haven't tried out this approach myself, but theoretically it should work. :) I'd recommend using the ADO.NET connection type if you do - it should have the best handling for XML types.

Wednesday, April 01, 2009 10:11 AM by mmasson

# re: Accessing a data profile programmatically

Thanks mmasson. I managed to get this working with the ExecuteSQLTask. Works a treat :)

Monday, April 06, 2009 11:30 AM by LiamWalsh
Anonymous comments are disabled
 
Page view tracker