Using the SSIS DQS Cleansing Component

Using the SSIS DQS Cleansing Component

Rate This
  • Comments 7

 

As part of our DQS CTP3 offering, we are releasing a new DQS SSIS component. This component incorporates the DQS cleansing functionality into an SSIS data flow.

 

When should I use the SSIS DQS Cleansing Component?

The DQS Cleansing component can add value when:

1.       Cleansing should be performed as a batch process.

2.       The cleansing functionality is used as part of a larger data integration scenario.

3.       The cleansing process has to be automated, or run periodically.

 

Sounds great. I want to cleanse my data with the DQS cleansing component. What do I do now?

DQS is a knowledge driven data quality product. It means that first you have to create a knowledge base (KB) that is relevant to the data sources which you would like to cleanse. You build this KB by acquiring knowledge from samples of your data (a process we call “knowledge discovery”), by configuring the KB to use external knowledge from Windows Azure Data Market, or by manually adding knowledge to the KB. The knowledge is stored in the context of data entities that we call data domains. Some examples for domains include City, State, Email Address, etc. The knowledge in the domains consists of good values, bad values, relations between values, validation rules, etc.

Creating and managing KBs are done through the DQS client. Click here for additional information on how to create a KB.

 

So I Have built a good KB, what now?

Once you feel that the KB contains sufficient relevant knowledge for your cleansing tasks, you can create an SSIS data flow that contains the DQS Cleansing component, configure it to work with the prepared KB, and run the package.

Sometimes, it can help to run an interactive cleansing project through the client before running a batch project through SSIS. In this way, you can conveniently evaluate the results through the DQS client UI and decide whether you are ready to perform batch cleansing, or whether you still need to do some work to enhance and improve your KB.

 

How do I configure and use the DQS Cleansing component?

In general, the DQS Cleansing component is a standard SSIS transformation component, so familiarity with SSIS is required. To read about this component specific configuration and usage, please refer to this post in the SSIS team blog.

Note that while the DQS Cleansing component is installed as part of SSIS, it is a part of the DQS product and requires the DQS Server installation to function.

 

Please note that the actual DQ work is done in the DQS server, so from DQS point of view, the component is served as a client. You can install SSIS and DQS server on the same machine, but you can also operate the DQS Cleansing component with a remote DQS server.

If your domains are attached to external reference data services (RDS), your records will be further sent to the cloud. A post about how to use RDS will be published soon. 

The following diagram shows how all the components and entities that were mentioned above relate to each other:

 

 

Best Practice: using DQS Cleansing in conjunction with Conditional Split for optimized handling of results

The DQS Cleansing component takes input records, sends them to a DQS server, and gets them back corrected. The component can output not only the corrected data, but also additional columns that may be useful for you. For example - the status columns. There is one status column for each mapped field, and another one that aggregated the status for the whole record. This record status column can be very useful in some scenarios, especially when records are further processed in different ways depending on their status. Is such cases, it is recommended to use a Conditional Split component below the DQS Cleansing component, and configure it to split the records to groups based on the record status (or based on other columns such as specific field status).

By using this best practice, “good” records can be immediately sent for further processing downstream, while “bad” records can be isolated or redirected for appropriate handling (automatic or manual).

Watch this video in order to understand how this best practice works in reality.

 

This is for now - feel free to contact us with comments and feedback.

The DQS Team

Leave a Comment
  • Please add 7 and 2 and type the answer here:
  • Post
  • I am looking for a bit of help.

    After some tests and trials of the SSIS Cleansing component, I find that all of my attempts remain as "Active" activites for the KB.

    Forever.

    I can find no way of changing the status.

    This means that I can't update my KB at all?

    Any suggestions

  • When I open SSIS project (with Data Cleansing Component) in DQS client, it doesn't show all the columns available in SSIS. It only shows the Domain related columns. It is not possible to link the correctd data (through DQS) back to actual data.

  • When testing the DQS Cleansing Component in SSIS, I notice 2 puzzling things. Firstly invalid data gets a  Status of "New" rather than "Invalid". This happens in the DQS Client as well, which is fair enough as you can then manually decide if it is genuinely valid or not. However in cleansing the data in a batch (eg SSIS) you can't intervene, so "Invalid" seems to me to be more accurate than "New". Is there a way to force it to display "Invalid" rather than "New"? I know that in my subsequent Conditional Split component I can search for "New" rather than "Invalid", but that brings me to my next issue.

    If a record has multiple fields, and one gets a Status of "Correct" (or "Corrected") and another has "New" (ie Invalid), the overall Record Status comes back as "Correct" (or "Corrected"). This seems incorrect to me. I know I can search on each individual field Status, but if I'm cleansing multiple fields, I'd much prefer to look at just the Record Status. Is the Record Status correct in the case where one field returns a Status of "New"? And is there a way to make the Record Status reflect the "worst" of all the field statuses?

  • Hi I'm new in using DQS. When I tried to use it in SSIS, after creating new DQ connection manager, it displays "the data quality knowledge base is empty. Specify a valid coonection."  DQS Client is on the same machine. Please help. Thanks.

  • Did anyone find the source of the error message  "data quality knowledge base is empty. Specify a valid Knowledge Base?"

  • I want to create a SSIS package that takes the parameter at run time and then finds this value in DQS and if this value match to any record then send the output row in matched table. I have created the SSIS but when I send the value in DQS KB then it returns the unmatched output although the value exists in table. Please help me resolve this task.

  • I have one excel file with 5 columns like date, customer id, account no, location, customer name.

    I have designed the package.

    If we get bad records like wrong date or non-numeric customer id, etc then we put bad records into another excel with extra column reason for bad record

    And we have to send email whenever bad record occur.

    I done for every column like if there is bad date then mail will be sent.

    If customer id is non numeric, again mail will be sent and so on.

    But the problem is – If we get 2 bad records or more than 2 then email will be sent each time. I want to send only one email for one bad row whether it contains 2 bad columns or more with reason.

    Kindly help me in this if u have any idea.

Page 1 of 1 (7 items)