Data Quality Services (DQS) is a new product in SQL Server "Denali" which provides data cleansing capabilities. This post describes how to make use of the DQS Cleansing transform, which is part of the default set of SSIS data flow components.
The DQS Cleansing transform can be found in the Data Flow Toolbox.
The first thing we need to set in the DQS Cleansing Transformation Editor is the Data Quality Connection Manager. This will point to our DQS installation residing on a SQL instance.
Once the connection manager has been created, you select the Knowledge Base (KB) you want to use. The current CTP comes with a default “DQS Data” KB, which contains domains that you’d use for address validation (at TechEd they mentioned that the final release may include additional domains out of the box). Selecting the Knowledge Base you want to use will bring up its list of domains.
There are two types of domains in this list; regular Domains (ex. City, State, Zip), and Composite Domains (also called CDs), which are made up of two or more regular domains. Composite Domains allow you to validate multiple fields as a single unit. For example, the “Company – US” composite domain in the default DQS Data KB is made up of Company Name, Address Line, City, State and Country. This lets you validate that “Microsoft Corporation” (Company Name) exists at “One Redmond Way” (Address Line), “Redmond” (City), “WA” (State), “USA” (Country) (given that you have the appropriate CD rules in your knowledge base, or the CD is attached to an online reference data provider). DQS would flag the entry as incorrect if you had “Seattle” as the City – even though Seattle is a valid city name, that’s not where the Microsoft office is.
There are two ways to make use of Composite Domains in SSIS:

1. Merge all of the fields into a single string column (for example, "123 Main St., New York, NY"), and map that column to the CD.
2. Map individual columns to each of the domains that make up the CD. The CD cleansing logic is used once all of the CD's domains have been mapped.
The Mapping tab allows you to select the columns you want to cleanse, and map them to domains in your knowledge base.
Note that the Domain drop-down will automatically filter out domains with incompatible data types (for example, it won't show domains with a String data type if you are mapping a DT_I4 column).
The Advanced tab has a number of different options, most of which control the columns that will be included in the output.
After mapping one of my input columns (StateProvinceName) to one of the domains (State), I check the Confidence and Reason boxes on the Advanced tab, and click OK to save the changes. If I connect the path to another transform and look at its metadata (right click on the path, Edit … and click the Metadata tab), I can see that four columns were added for each input column I chose to cleanse – Corrected_<column>, Status_<column>, Confidence_<column> and Reason_<column>. The first two show up by default – the last two are there because of the options I selected on the Advanced tab.
The Corrected column will contain the value for the field, whether or not it was corrected by DQS. This is most likely the field you’ll want to use later on in your data flow.
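For example, a Derived Column transform placed after the DQS Cleansing transform can overwrite the original column with the cleansed value. A sketch, using the StateProvinceName mapping from this post (the "Replace" behavior is chosen in the Derived Column Name drop-down):

```
Derived Column:  Replace 'StateProvinceName'
Expression:      [Corrected_StateProvinceName]
```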
The Status column will contain the result of the correction operation. Possible values are:

- Correct – the value was already correct, and was not changed.
- Corrected – the value was wrong, and DQS was confident enough to correct it automatically.
- Auto suggest – DQS has a suggested replacement, but wasn't confident enough to auto-correct.
- New – the value wasn't found in the domain, and DQS has no suggestions for it.
- Invalid – the value was marked as invalid in the domain.
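Once rows have a status, a Conditional Split transform can route them accordingly – for example, sending clean rows onward and everything else to a review table. A sketch, using the Status_StateProvinceName column from this post's example (the output names are illustrative):

```
Output Name   Condition
Passed        [Status_StateProvinceName] == "Correct" || [Status_StateProvinceName] == "Corrected"
Review        [Status_StateProvinceName] == "Auto suggest" || [Status_StateProvinceName] == "New"
Failed        [Status_StateProvinceName] == "Invalid"
```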
Some things to note about the DQS Cleansing transform:

- It is installed by default with SQL Server Denali, and is not available for previous versions of SQL Server.
- It runs as an asynchronous transform, sending rows to the DQS server in batches.
I think this feature will be praised by developers. It also resembles what I saw in MelissaData (www.melissadata.com).
Question: will a developer be able to compose custom KBs or modify the logic in the canned ones?
Yes - you can create your own KBs, and modify the domains in the default KB. You can also make a copy of a domain and reuse it across one or more KBs. You have full control.
Through DQS, you can actually subscribe to MelissaData as an online reference data service (through the SQL Azure Data Market). You can see it in action in the DQS and SSIS presentations from TechEd.
How do I include the matching component of DQS in an SSIS package? We have the DQS Cleansing component in SSIS; similarly, how can I make use of the matching component (as in DQS) in an SSIS package as well?
The DQS Cleansing transform is installed by default with SQL Server Denali. It is not available for previous versions of SQL Server. The component currently does not support matching, but that functionality is available by using the DQS client (it's also available as part of the MDS + DQS integration).
Matt, I can't map multiple single domains to a composite domain (i.e. if I have Composite Domain "C" made up of fields "A" and "B", I can't map more than one column to Composite Domain "C" as you imply in point 2. Is this an RC0 bug or am I doing it wrong?
Hi James - there are two ways to use a Composite Domain (CD) from the SSIS Cleansing transform. You can merge all of the values into a single string ("123 Main St., New York, NY") and map it to the CD, or map individual columns to the domains that make up the CD. The CD cleansing logic will be used once you've mapped all of the CD's domains. (That UI needs a bit of work - we should have a way to indicate that the CD validation logic is being mapped / used.)
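The first approach can be sketched with a Derived Column transform placed before the DQS Cleansing transform, concatenating the individual fields into one new column that is then mapped to the CD (the column and output names here are illustrative, not from the post):

```
Derived Column:  FullAddress
Expression:      [AddressLine] + ", " + [City] + ", " + [State]
```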
Thanks Matt. As it stands, mapping the columns to the domains doesn't seem to work, or at least makes no sense from a UI perspective. Building the CD into a single string is fine.
For example, I have a composite Domain called "Shape and Sides", made up of two domains "Shapes" and "Sides".
I can map either the "Shapes" or the "Sides" column to the composite domain "Shape and Sides (CD)", but not both.
I can also map the "Shapes" column to the "Shapes" domain, but "Sides" domain isn't available in the dropdowns (this may be an unrelated bug).
I'm assuming CTP bugginess in this regard. Ping me an email if you want me to raise a Connect item.
Hi Matt, thanks for sharing such useful information.
You say in your post that the DQS Cleansing component works in SSIS as an asynchronous data flow component, and that's what the output in SSDT seems to tell me when executing DQS cleansing packages. But the DQSPerformanceBestPractices whitepaper says that "Each DQS Cleansing Component in a SSIS Dataflow Task utilizes only a single CPU core on Data Quality Server and runs synchronously in the SSIS dataflow". Are they wrong?
That being said, is there any means to directly or indirectly control the batch size for the component? At the moment, my package is sending at most 50 records per batch (the DQS Cleansing component sent 33 records to the DQS server!), and with about one million records it will take days. I think this can be improved, because in other packages at another location the batch size goes up to 1000 records.
How can we parameterize the DQS connection and not lose the DQS Cleansing task when changing from a Development to a QA environment?
>How can we parameterize the DQS connection and not lose the DQS cleansing task
>when changing from a Development to QA environments?
It's a very big issue for us, since we are planning to go to production and want to automate the deployment process. This is a blocker for us. I wonder if anybody at Microsoft cares about it?
Echoing Ulises' and Oleksii's comments: how do you make SSIS DQS cleansing portable? I have exactly the same problem; when changing the DQS server connection (i.e. moving from Dev to Production), all mappings to the knowledge base and the domains are lost, even though the exact same KB exists on the production server. We have to redo the mapping on our live server every time we deploy.
Not good, Microsoft! Is there a workaround? A patch? A response would be good, as I feel this is a major flaw in the DQS Cleansing component.
Hi, I am using the DQS Cleansing component in SSIS. In the mapping window there is no check box next to each input column to select it and map it to a domain. There is only one check box at the top, which maps all of the input columns!
Does anyone have any idea?
Just a little helper, as this caught me out: the record status for a Suggestion is actually "Auto suggest":
[Record Status] == "Auto suggest"
How can I improve the performance of DQS? I have more than 3.2 million records to load.