Cleansing Customer Data Using Data Quality Services (DQS)

Cleansing Customer Data Using Data Quality Services (DQS)

Rate This
  • Comments 13

SQL Server 2012 Data Quality Services (DQS) enables you to cleanse data using your own Knowledge Base. In this blog article, I am going to show how to create a simple Knowledge Base which you can use to cleanse your customer data.

The steps for using DQS to cleanse data is as follows:

A. Create DQS Knowledge Base
B. Create a DQS project to cleanse your data using the Knowledge Base

 For this example, I am using the following sample data - you can run the following script on your SQL Server database :

 CREATE TABLE MyCustomers
  (
       CustomerID INT,
       CustomerName NVARCHAR(255),
       City NVARCHAR(32),
       Province NVARCHAR(32),
       LastUpdate DATETIME
  )
  
 INSERT INTO MyCustomers
  VALUES      (1, 'Consolidate Co Ltd', 'Miami', 'FL','2013-01-01'),
                     (2, 'Consolidation Company Ltd', 'New York', 'NY','2013-01-01'),
                     (3, N'什锦的件', 'LA', 'CA','2013-01-01'),
                     (4, 'Chop-suey Chinese', 'Los Angeles', 'CA', '2013-03-03'),
                     (5, 'Big Cheese, The', 'Redmond', 'WA', '2013-02-02'),
                     (6, 'THE BIG CHEESE', 'Chicago', 'Il','2013-02-02'),
                     (7, 'To Be Filled Later', 'Redmond', 'Wash.', '2013-01-01')

 

A. Create DQS Knowledge Base

  1. Start Data Quality Client. For information about this step, refer to Run the Data Quality Client Application.
  2. In the Data Quality Client home screen, under Knowledge Base Management, click New Knowledge Base.
  3. Name your new Knowledge Base (e.g. MyCustomerKB), make sure Domain Management activity is selected and click Next

  1. Click Create Domain Icon.

  1. From the pop-up window, name your domain (e.g. CustomerName). As an option, you can select to change the format of the output to Capitalize.

  1. Repeat the steps and create domains for City and State. 
  2. Select City domain and click on domain values tab. In this screen, you can enter all the list of valid and invalid values for the domain.
    • Click on Add new domain value icon. Enter a correct value (e.g. Los Angeles) then press enter. 
    • Click on Add new domain value icon. Enter an invalid value for the city (e.g. United States), click the down arrow in the Type column and select the invalid type (yellow triangle) then press enter. (There are 3 supported type: correct, invalid, and error; invalid suggests that the value is not valid for the given domain, but may be valid in a different domain. For example, United States is not a valid value for city domain but is a valid value for country domain; error suggests that the value is globally incorrect). For invalid or error value, you can optionally specify correct to value for DQS to automatically correct. Otherwise, records with invalid or error value will be flag as invalid during cleansing.
    • Click on Add new domain value icon. Enter a synonym value for the city (e.g. LA) then press enter. Highlight both "LA" and "Los Angeles" values, right click then select Set as Synonyms.

Note: by default, DQS include DQS_NULL as valid value for the domain, you can change the type to invalid if you would like the record with missing value to be flagged

    • Update the leading value by right clicking Los Angeles and select Set as Leading during cleansing.

    • The final city domain values should look as follows:

  1. Select State domain and click on domain values tab. In this step, we will import the values from a spreadsheet.
    • Obtain the list of US states, its standard abbreviation and postal abbreviation from here. Copy the first 3 columns into excel and save as csv file.
    • Click the down arrow next to import values icon on the state domain management and select import valid values from excel. From the pop-up window, browse to the location of the csv file (make sure to select file type as csv) and update Use First row as header checkbox according to your file (if you do not include header when you prepare the file, then you can leave the default option unchecked.

    • You should see the State domain values screen populated as below - notice that when you import list of values with more than one column, DQS automatically set the first column as leading value and the subsequent column value as its synonyms.

  1. Select CustomerName domain and click on Term-Based Relations tab. DQS allows you to define terms within your domain value and standardize them into standard term. For example, business name often include abbreviation such as "corp", etc. When there are variation for the use of the terms (e.g. one record may use abbreviation such as "Microsoft Corp" while another record may have values for "Microsoft Corporation"), you can use DQS to standardize the use of terms within your domain values. In this step, we will define 2 term based relations for our CustomerName domain:
    • Click Add new relation icon
    • Add "co" as value and "Company" as correct to. Press enter to continue.
    • Add "ltd" as value and "Limited" as correct to. 
    • You should see the CompanyName Term-Based Relations screen to be populated as below. Click finish to continue.

  1. Click Publish Knowledge Base button to continue.

 

Creating your own DQS Knowledge Base sometimes requires a lot of effort. For things such as address cleansing, phone number cleansing, creating your own complete list of all valid and invalid values can be a huge effort. DQS supports integration with third party service provider to cleanse your data through DQS. Refer to my blog article on how to cleanse Customer Data using Dun & Bradstreet for more information.

B. Create a DQS project to cleanse your data using the Knowledge Base

  1. In the Data Quality Client home screen, under Data Quality Projects, click New Data Quality Project.
  2. Name your new Project (e.g. MyCustomer Cleansing Project), make sure you select Knowledge Base created in the previous step (e.g. MyCustomerKB) then click Next to continue.
  3. In the Map screen:
    • Select data source, database, and table for your data (e.g. MyCustomers table and data generated from the script at the beginning of this article)
    • Select columns to be cleansed and map to the domains in your knowledge base
    • You should have your datasource columns mapped as below. Click Next to continue.

  1. In the Cleanse screen, click Start to begin the process. Click Next when the process finishes.
  2. In the Manage and View results screen:
    • Select City domain and click Corrected tab. You should see "LA" corrected to "Los Angeles"

 

    • Select City domain and click New tab. You should see the list of city not defined in the knowledge base domain value definition listed here. Click Approve all terms icon to continue (Refer to this msdn article on how do import all approved project values into DQS knowledge base). The approved values are now appear under Correct tab.

    • Select CustomerName domain and click New tab. You should see Customer Names are updated to use standardized terms. Click Next to continue.

  1. In the Manage and View results screen, you can export the results to SQL Server table or Excel.

In this blog article, I discussed how to create a Knowledge Base to cleanse customer data, including use of domain values and term based relations. You can then refer to the Knowledge Base to create Data Quality Projects to cleanse your data. You can use the same knowledge base to perform cleansing on many Data Quality Projects. You can also automate the cleansing using SQL Server 2012 Integration Services. Matt Mason wrote a nice article : Overview of DQS Transform that describes the SSIS DQS Cleansing transform.

You may also notice that there appears to be duplicate records in the sample data I used. In the next article, I will describe on how to enhance your knowledge base by adding matching policy and identify duplicate and related records in your dataset.

Leave a Comment
  • Please add 7 and 8 and type the answer here:
  • Post
  • Indeed a very nice post. I am also associated with imaginationworksla, Foam board printing los angeles, vehicle wraps los angeles, car wraps los angeles, banner printing los Angeles, banner printing san fernando valley, coroplast printing los Angeles Thanks for writing such good posts and as I have subscribed to your blog.

  • Thanks, this is really useful. I'm working with the DQS Client at the moment. If I open an existing (completed) project it will open at stage 4 - Export. I can use the BACK button to get to Stage 3, but it then becomes greyed out.

    How can I get back to Stages 1 and 2 to refiew the Mapping/Cleansing stages?

  • Richard. You can close the project without clicking finish to allow you to change/restart your project. However, after you click finish, then you can't go back beyond reviewing the results.

  • great blog,excelent way to cleanse our data and to replace the old unreliable data with the up-to-date  data. For more details visit us at <a href="www.optinlistsgroup.com/.../data_cleansing.php">click here.</a>

  • great blog,excelent way to cleanse our data and to replace the old unreliable data with the up-to-date  data. For more details visit us at <a href="www.optinlistsgroup.com/.../data_cleansing.php">click here.</a>

  • Informative. Data cleansing plays a vital role.. for clean Data cleansing service contact us at <a href="b2bemaillistz.com/data-matching.php">Data Cleansing</a>

  • The DQS toolset is at the moment very inmature and unusable in large enterprise environments. I've been working on it for the past 5 months and I finally had to remove it completely and bastardize MDS & SQL to perform this function. The main obstacles I've found are:

    - The SSIS task is a joke. It cannot be migrated between environments as it cannot be parameterized. It has and awfully slow performance, even with the suggestions to split the data-flow and work with multiple items at the same time it would take long minutes to map anything beyond trivial amounts of data (less than 1000 rows)

    - The project files that the SSIS task generates are also useless as they omit all the contextual information about the rows and concentrate just on the values to clean. Any data steward would have a hard time figuring out the source of any individual row that has to be cleaned.

    - Domain editing features are too cumbersome for large domains.

    - Importing and exporting values from the client does not support standard formats like csv or excel, just the propietary xml format of the tool.

    - Integration with MDS is in its infancy as you cannot use MDS as a reference service for DQS

    - The APIs are closed so the product is not integrable with other toolsets.

    Hopefully some resources have been allocated towards improving this in the upcoming SQL 2014.

  • Great Blog,,,, Agree with you, data cleansing has played a key role to maintain updated data. for further details on data cleansing up to date information ,please visit:<a href="http://www.b2bemaillistz.com">click here</a>

  • Excellent Article, Cleansing Customer Data Using Data Quality Services  is really very informative to get Great deals on Data Cleansing services visit<a href="www.pioneerlists.com/.../index"> Pioneerlists.com</a>

  • Helpful Post on data cleansing .Data cleansing play very important role in email marketing .To Know more about data cleansing visit <a href="www.b2bmarketingpartners.com/.../Data-Cleansing">Data Cleansing</a>

  • Impressive blog, data cleansing service helps to maintain updated information which really helps email marketing. For more details on data cleansing visit <a href=” www.b2bmarketingarchives.com/.../Data-Cleansing “> b2bmarketing archives </a>

  • Creative blog, the article is well described to make effective use of quality data service, for more information on Data cleansing visit <a href="http://www.b2boptinlists.com">B2Boptinlists.com</a>

  • At Create DQS Knowledge Base, step 8, my 'Import values' option is disabled. How do I turn it on? Thx.

Page 1 of 1 (13 items)