Accessing a data profile programmatically

Accessing a data profile programmatically

  • Comments 14

The new Data Profiling Task in 2008 generates an XML output file. The output is easy to read, and could be used to make decisions within a control flow. For example, you could check whether or not to process your most recent data set based on criteria in the profile (are any values in certain columns null? Do my values fall within expected ranges?)

While you can parse this XML file in traditional ways, those of you who aren't afraid of using undocumented, unsupported APIs can make use of the classes in the Microsoft.SqlServer.DataProfiler assembly which we use to load the XML profile internally.

Let's use a simple scenario: We want to check one of the columns (AddressLine1) in our staging table to make sure that it contains no NULL values. If it is clean, we process it in a data flow task. If NULL values are found, we want to send an email to the DBA.

Here's what the control flow looks like:

image

We're first running our data profile, then processing the results in a script task. If we match our criteria (i.e. no nulls in the AddresLine1 column), the script task returns success and we run the "Process" data flow. If the criteria doesn't match, we fail the task, and run the send mail task instead.

Note, we can store the results of the data profiling task in a variable (as a string) instead of saving it out to disk.

image

Now the interesting part - the script task.

First, add a reference to the Microsoft.SqlServer.DataProfiler DLL. It can be found under %ProgramFiles%\Microsoft SQL Server\100\DTS\Binn\Microsoft.SqlServer.DataProfiler.dll (and should also be in the GAC).

The following code loads the data profile XML from the package variable, de-serializes it into a DataProfile object, and cycles through the profiles until it finds the one its looking for.

using Microsoft.DataDebugger.DataProfiling;
 
const string ColumnName = "AddressLine1";
readonly long Threshold = 0;
 
public void Main()
{
    Dts.TaskResult = (int)ScriptResults.Success;
 
    // Retrieve the profile from the package variable
    string dataProfleXML = Dts.Variables["User::DataProfile"].Value.ToString();
 
    // Deserialize
    DataProfileXmlSerializer serializer = new DataProfileXmlSerializer();
    DataProfile profile = serializer.Deserialize(new System.IO.StringReader(dataProfleXML));
 
    // Cycle through the profiles to find the one we're looking for
    foreach (Profile p in profile.DataProfileOutput.Profiles)
    {
        // Check the profile type
        if (p is ColumnNullRatioProfile)
        {
            // Match the column name
            ColumnNullRatioProfile nullProfile = p as ColumnNullRatioProfile;
            if (nullProfile.Column.Name.Equals(ColumnName, StringComparison.InvariantCultureIgnoreCase))
            {
                // Make sure it's within our threshold
                if (nullProfile.NullCount > Threshold)
                {
                    // Fail the task
                    Dts.TaskResult = (int)ScriptResults.Failure;
                }
 
                break;
            }
        }
    }           
}

Note, as I mentioned above, this API is for internal use, and is subject to change (it already has between CTP5 and CTP6, and will change again in the upcoming CTP-Refresh). I just thought some people out there might find this interesting.

Leave a Comment
  • Please add 7 and 3 and type the answer here:
  • Post
  • PingBack from http://msdnrss.thecoderblogs.com/2008/03/13/accessing-a-data-profile-programmatically/

  • Yeah, very interesting. I plan to put a post together outlining a supported way to do this -haven't got round to it yet- but if and when I do I'll definitely link back here.

    -Jamie

  • I am trying to use 2008's data profiler XML output data in SSRS report (I am under the impression my SSRS data source needs to be a SQL table).  Is it possible for me to convert the XML profiler ouput to a SQL table using the API mentioned above?  Regards,

  • There's nothing in the API that lets you automatically load the profile XML document into a sql table, however, you could do the mapping yourself with a little code.

    I believe that SSRS supports a number of data source types, including XML and custom .NET objects. If you search the books online, you should be able to find a couple of different ways to access the data.

    http://msdn.microsoft.com/en-us/library/aa964129.aspx

  • Thank you for the feedback.  Since our reports already run off of SQL tables, I think I will try to convert the XML.

    I tried following the steps mentioned in your post, but run into a problem due to a SSIS string variable memory limit error.  (When I save the data profile results to an XML file, the sizes can get quite large for value frequency distros, which is the component I am using.  One example I tried was over 200mb – is there a varchar(max) like variable in SSIS?)  I spoke with a coworker who has a bit more XML experience, and he mentioned being able to bulk load XML files into a sql table using a schema mapping file.  

    What are your thoughts on going that route?

    I created in VS a XSD file based on the data profiler XML output, but am getting stuck on adding the sql relationship code snippets for bulk loading.

    The examples I have been trying to follow are here: http://msdn.microsoft.com/en-us/library/aa225754.aspx

    I can email you the XML, XSD, and create table statement if that would be useful.

    Thanks for helping out a XML novice!  Any feedback you have regarding the best approach would be much appreciated.

  • I made a bit of progress using the API.  I'm stuck at the point where I try to access the freqdist values and counts - I believe some kind of collection / array is being used.  If I can figure that out, plus solve the memory error issue, I think I'm close to having a solution.  Regards,

           using System;

           using System.Collections;

           using System.Data;

           using System.Data.SqlClient;

           using Microsoft.SqlServer.Dts.Runtime.VSTAProxy;

           using System.Windows.Forms;

           using Microsoft.DataDebugger.DataProfiling;

           string  DataSource;

           string  DataBase;

           string  Schema;

           string  Table;

           long    RowCount;

           string  ColumnName;

           long    DistinctValues;

           string  Value;

           long    Count;

           public void Main()

           {

               Dts.TaskResult = (int)ScriptResults.Success;

               //SqlConnection sqlConn = new SqlConnection("DSN");

               //sqlConn.Open();

               // Retrieve the profile from the package variable

               string dataProfleXML = Dts.Variables["User::xmlResults"].Value.ToString();

               // Deserialize

               DataProfileXmlSerializer serializer = new DataProfileXmlSerializer();

               DataProfile profile = serializer.Deserialize(new System.IO.StringReader(dataProfleXML));

               // Cycle through the profiles to find the one we're looking for    

               foreach (Profile p in profile.DataProfileOutput.Profiles)

               {

                   if (p is ColumnValueDistributionProfile)

                   {

                       ColumnValueDistributionProfile ValueDist = p as ColumnValueDistributionProfile;

                       DataSource = ValueDist.Table.DataSource;

                       MessageBox.Show(DataSource,"DataSource");

                       DataBase = ValueDist.Table.Database;

                       MessageBox.Show(DataBase,"Database");

                       Schema = ValueDist.Table.Schema;

                       MessageBox.Show(Schema, "Schema");

                       Table = ValueDist.Table.Table;

                       MessageBox.Show(Table, "Table");

                       RowCount = ValueDist.Table.RowCount;

                       MessageBox.Show(RowCount.ToString(), "RowCount");

                       ColumnName = ValueDist.Column.Name;

                       MessageBox.Show(ColumnName, "ColumnName");

                       DistinctValues = ValueDist.NumDistinctValues;

                       MessageBox.Show(DistinctValues.ToString(), "DistinctValues");

                       // how do i access the ValueDistributionItem details (values and counts) ??

                       // the below does not seem to work

                       IEnumerator MyEnum = ValueDist.ValueDistribution.GetEnumerator();

                       while (MyEnum.MoveNext())

                       {

                           MessageBox.Show(MyEnum.Current.ToString());

                       }

                       // insert variables into sql table using ado connection

                   }

               }

               //sqlConn.Close();

           }

  • Microsoft Research contributed the Data Profiling Task and Data Profile Viewer to Integration Services

  • Extending the Data Profiling Task

  • Is there anyway I can save the results to a SQL table?

    I'm currently trying to use an ExecuteSqlTask to access the results from a package user variable and update the table but I don't know how this task can access package variables.

    I would use a ScriptTask component but I need to be able to do this programatically and as far as I am aware you cannot add a ScriptTask component programatically.

  • Hi Liam,

    It is possible to add a script task programmatically - see this post here:

    http://dougbert.com/blogs/dougbert/archive/2008/05/24/adding-a-vsta-script-task-programmatically.aspx

    You can use variables with the Execute SQL Task by binding variables to query parameters. You could create a stored procedure which takes in an XML document (which is how the Data Profiling Task will store the data), and then either store the value in a table, or perform some sort of XQuery on it.

    I haven't tried out this approach myself, but theoretically it should work. :) I'd recommend using the ADO.NET connection type if you do - it should have the best handling for XML types.

  • Thanks mmasson. I managed to get this working with the ExecuteSQLTask. Works a treat :)

  • We have made some progress in SSIS data Profiling see Video creating a Metadata Mart via #SSIS #DataQuality , plus #DataProfiling Analytics via #Excel http://youtu.be/3FQNO8AQCAY #MelissaData

    www.melissadata.com/.../data-quality-components-for-ssis.htm

  • Hi,

    Output of the data profiling task (xml data) load into SQL Server

    In my source table how many number of the null values is therer I want to find out,for that one I'm using data Profiling(Column null ratio profile)  task.Now I want to load that null values into sql table.Can any one help to me regarding to this issue.

  • Microsoft.DataDebugger.DataProfiling.dll is missing. Where can I get it? I am using Sql Server 2008 R2

Page 1 of 1 (14 items)