This blog provides information, news, tips, and announcements about the SQL Server Data Quality Services (DQS) feature introduced in SQL Server 2012.
It's all about your data...
Whether you are a small shop, mid-size or large organization, one of your most important asset is your data; organizations often spend hundreds of thousands or millions of dollars, a large portions of their information technology (IT) budgets on implementing sophisticated applications, data synchronization systems and data warehouses.Making large investments of money, time and resources to collect large volumes of data without ensuring the quality of the data is futile and will almost certainly lead to a substantial waste of valuable resources. However, if high-quality data can be achieved and maintained, the business value of these IT investments can truly be exploited, allowing, for example:
One of the most damaging data problems is duplicate data records. Duplication occurs when the same record entity (product, customer, etc.) is represented by multiple similar, though not identical,records in a company’s databases. These differences could be due to typos, differences in how different employees perform data entry, differences in terminology or naming standards used among departments, differences in language and so forth. Whatever the cause, the results are highly detrimental to the successful function of the company, as the following examples illustrate.
What are the challenges of data matching?
In the business world, effective matching requires more than identifying identical records (or records containing unique identifiers such as SKU or social security number), but also similar records which, in fact, refer to the same entity. The type of similarity varies tremendously between records and applications. Human beings are good at this task, but the amounts of data involved make this time- and cost-prohibitive.
Developing software to accurately ascertain similarity by means of “conceptual closeness” among unstructured data records is a great challenge, whether in the world of corporate data, web search engines, military intelligence or any other field in which matching between similar informational entities is critical. To a human, being familiar with the context of a particular data set, identifying very similar records is not very difficult. For example, it is not difficult for a person to identify that the same person lives in two different addresses based on the person’s identifiers; however, developing computer software with the “intelligence” to compare pieces of unstructured data and to ascertain whether or not they are essentially the same item is quite challenging.
This challenge is made easier with Data Quality Services where you can create and train you’re matching rules in an easy and interactive process and use this knowledge along with other knowledge stored in the knowledge base such as data domains values in a matching project for identifying similar records.
So how do we make sure that our data sources are duplicates free?
To successfully automate the identification of duplicate records, a sophisticated technological solution is necessary. To achieve this purpose, it is necessary to correctly identify duplicated records (i.e., those which actually represent the same physical entity) and then “de-duplicate” all redundant records.
How can Microsoft’s Data Quality Services help?
Data Quality Services (DQS) performs data matching by comparing each row in the source data to every other row, using the matching policy defined in the knowledge base, and producing a probability that the rows are a match. Matching is one of the major steps in a data quality project, it is best performed after data cleansing, so that the data to be matched is free from error.
DQS provides functionality to reduce data duplication and improve data accuracy in a data source. The DQS matching process has the following benefits:
A matching project is a 3 step task and is available in the Data Quality Client application. The following figure illustrates the Matching steps, showing results of a matchingprocess.
Stay tuned for more posts about matching and start removing duplicates today :)
The DQS team
Interesting guys, nice to see you adding these kind of developments.
Do you have any video footage of an actual scenario?
Our members would find that useful.
Editor - Data Quality Pro.com
I'm looking forward to finally have Data Quality Services in SSIS.
Is it possible to use defined matching policies in a SSIS package?
I saw that DQS Cleansing Transformation performs cleansing activities.
Can also performs matching activities based on KB defined matching policies?
Or will Fuzzy Group Component now connect to DQS Server?
Thank you for your post and interest in the DQS technology. You can find videos about creating a matching policy and matching project on Technet:
technet.microsoft.com/.../denali_resource_center.aspx (under Videos -> Data Quality Services).
It would be great to get your feedback.
DQS in SSIS is indeed exciting, you will find many ways of utilizing a knowledge base with the DQS Cleansing data flow component in SSIS.
We do not provide a Matching component for SSIS in this release, and the existing matching components (Fuzzy Lookup and Fuzzy Grouping) are not integrated with the DQS server therefore they cannot leverage a DQS knowledge base; however you can use the DQS Cleansing SSIS component to cleanse your data, then use this output as your input to a Fuzzy Lookup or Fuzzy Grouping components.
One suggestion to the Microsoft DQS team: You show the score for matching records, It would be really beneficial to see the score for even non-matched records. For example, I have a simple matching policy where the SSN is a prerequisite, 70% weight for similar First Name, 10% weight for similar Last Name and 20% wait for similar phone number. I have duplicate records with everything identical except first name. First name is similar, except duplicate records have one less character. My matching policy does not pick up this duplicate and I am scratching my head as to why. Hence my request to the DQS team to show the score even for unmatched records.