Automating the data matching process in SQL Server Data Quality Services (DQS)

Automating the data matching process in SQL Server Data Quality Services (DQS)

Rate This
  • Comments 21

OH22 data has just released a free SSIS transform task for SQL Server Data Quality Services (DQS) matching. The solution is published on codeplex. With this custom transform task, you can use the Knowledge Base (KB) created in DQS to automate data matching through SSIS.  

 

In this article, I am going to walk through using the new SSIS transform to automate matching project which I described in the previous article: Matching related and duplicate Customer Records using SQL Server Data Quality Services (DQS).

 

To summarize, we have the following data set which we would like to identify duplicate and related records:

 

 

In the previous article, we created a DQS Knowledge Base, MyCustomerKB,  with the following matching rules:

  1. Match By Customer Name: Matching any records with similar CustomerName value.
  2. Match By City State: Matching any records with similar City (50% weight) and State (50% weight)

We were using Data Quality Client tool to (manually) execute a matching project using MyCustomerKB against the data input above. In this article, we will be automating the process using the new SSIS transform task.

 

The following are the steps to automate:

 

A. Install DQS Matching codeplex project

B. Create SSIS project and configure DQS matching transform task

C. Execute the SSIS project and review the results

 

A. Install DQS Matching codeplex project

 

  1. Run Windows Installer Package (msi) on the SQL Server 2012 (with Integration Services)  server

 

B. Create SSIS project and configure DQS matching transform task

 

  1. Create a new Integration Services project using SQL Server Data Tools 
  2. On the Data Flow tab, create a new Data Flow task 
  3. Drag and drop Source Assistant and configure to connect to the customer data you would like to de-duplicate 
  4. Drag and drop DQS Matching transform task 

 

  1. Connect the output line from the source to the DQS Matching transform task 
  2. Double click DQS Matching to open the configuration editor window
    • Under the Connection Manager tab, connect to the SQL Server Data Quality Service server by clicking New:

    • Select the Data Quality Knowledge Base containing the matching policy - the available matching rules and domains are displayed for you to review
    • Click the Mapping tab to map the columns in your data source with the Knowledge Base domain

      • Note:

        • You must map all domains used by matching rules in your domain. The following error message appears if you click OK without mapping all domains: "Not all domains have been assigned or has been assigned twice. Please assign all domains properly"

        • When you use composite domain as part of your matching rule, you can either map the composite domain or all of the single domains that make up the composite domain. If you map a column to a composite domain, then the single domains removed as drop down options. Conversely, if you select one of the single domains that are part of the composite domain, then the composite domain disappear from the drop down option. You can redo selection by un-selecting the column checkbox from the top section.

        • Optionally, you can also go to the Advanced tab to specify the minimal matching rule threshold. By default, DQS match only records with matching score of 80% and above. You can raise the minimal score by changing it from the advanced tab.

 

  1. Create destination table. The SSIS DQS Matching produce 2 outputs: Matched Output and Unmatched Output. The
    Unmatched Output will create the same output schema as your input data source. The Matched output add the following columns: 

    • [RecordId] nvarchar(255)

    • [ClusterId] nvarchar(255)

    • [ClusterRecordRelationId] nvarchar(255)

    • [MatchingScore] nvarchar(255)

    • [RuleId] nvarchar(255)

    • [IsPivot] nvarchar(255)

    • [Status] nvarchar(255)

    • [PairId] nvarchar(255)

    • [SiblingId] nvarchar(255)

    • [PivotId] nvarchar(255)

You do not need to include all the metadata above in your matched output table. I suggest to include at least the following metadata:

    • RecordID : Unique record identifier for the data set 
    • ClusterID : Identifier for the matched record; any records with the same ClusterID value are considered matched and may be potential duplicate 
    • RuleID: Identifier for the rule used for the matching 
    • SiblingID : Identifier for another record which was compared to.  
    • Matching Score: The comparison score between the record and the sibling 

 The scripts to create the matched and unmatched table based on the data source we previously used:

CREATE TABLE MyCustomers_Matched
(
 CustomerID INT,
 CustomerName NVARCHAR(255),
 City NVARCHAR(32),
 Province NVARCHAR(32),
 LastUpdate DATETIME,
 [RecordId] nvarchar(255),   
 [ClusterId] nvarchar(255),  
 [RuleId] nvarchar(255),
 [SiblingId] nvarchar(255),  
 [MatchingScore] nvarchar(255)
)

CREATE TABLE MyCustomers_UnMatched
(
 CustomerID INT,
 CustomerName NVARCHAR(255),
 City NVARCHAR(32),
 Province NVARCHAR(32),
 LastUpdate DATETIME
)

 The final data flow task should look like below:

 

 

C. Execute the SSIS project and review the results

 

  1. Execute the SSIS project
  2. Open SQL Server Management Studio and review the MyCustomersMatched table:

 

 Note:

  • The ClusterID indicates the group of matched records. In our example, CustomerID=1 and CustomerID=2 has the same clusterID of 1000000
  • SiblingId indicates the corresponding record that is being compared for a given MatchingScore, for example: CustomerID = 6 has SiblingId of 1000004 (correspond to CustomerID=5), so the matching score of 87.5 is between CustomerID=6 & CustomerID=5
  • The output table also include RuleID, in order to get the Rule name, you first run a query to get an ID for the Knowledge Base you used then use the ID to query for the matching rule :

 

 

In this article, I describe how to use the SSIS DQS Matching transform task developed by OH22. Using this transform task, you can automate the DQS record matching using the Knowledge Base created from DQS client. You can download the codeplex project from https://ssisdqsmatching.codeplex.com/.

 

Leave a Comment
  • Please add 6 and 8 and type the answer here:
  • Post
  • This is a great walk-through. Good job!

  • This component potentially saves a lot of programming effort especially when matching may not be done using database facilities or RegEx. I only wish it could have appeared long ago.

    A big thank you!

  • How can we update the KB with the results of the match or is that pointless as they are all auto-approved?

  • DQS does not update KB from matching project, you can only update KB from cleansing project

  • Subscribe to our Data Matching services and avail all the benefits and increased ROI.. visit www.freedomprospects.com for more information

  • Great tips. On sql for detailed DBMS  prospects email lists  <a href="www.optinlistsgroup.com/.../dbms_prospects_email_list.php">click here </a>

  • Informative... we provide different data services that enhances your email marketing at <a href="b2bemaillistz.com/">Data Services at B2B email Listz</a>

  • Its very nice article.

    while i tried to add data matching to data flow i am getting below error

    The component has detected potential metadata corruption during validation.

    Error at relate [DQS Matching [181]]: System.MissingMethodException: Method not found: 'Void Microsoft.Ssdqs.Component.Common.Utilities.ComponentUtility.FireError(Microsoft.SqlServer.Dts.Pipeline.Wrapper.IDTSComponentMetaData100, Microsoft.Ssdqs.Component.Common.Messages.ComponentMessage, System.Object[])'.

      at oh22is.SqlServer.DQS.Matching.Validate()

      at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostValidate(IDTSManagedComponentWrapper100 wrapper)

    It may not be possible to recover to a valid state using a component-specific editor. Do you want to use the Advanced Editor dialog box for editing this component?  

  • Is this implementation or a similar one to be included in SQL Server 2014?

  • Great Blog,,,, Data matching services have provided huge database related information to improve business in different scenarios. for more information on data matching services. please visit:<a href="http://www.b2bemaillistz.com">click here</a>

  • Informative.. Thanks.. For trustworthy Data Quality Services and Data Related Solutions Visit http://freedomlists.com/

  • Excellent illustration! Thanks a bunch! Makes it a lot easier to map an example to a real life situation.

  • Agree...Automating Data Matching process is key in any industry..... To know more about Data Services visit http://www.freedomlists.com

  • Yes Automation in any process is must in this modern era of development of online services.... upgrade your business to newer heights at www.sagaciousinfosystems.com

  • Very nice and Informative, we provide different Email marketing services please visit our site

Page 1 of 2 (21 items) 12