Happy new year everyone! As we enter into 2008 I thought I would start us off on a few musings that I have around data quality in relation to business intelligence and data warehouse projects. A few recent projects that I have worked on and witnessed had me thinking about the topic and some thoughts to relate to you all.
First, why is data quality important in BI projects? A very basic question to ask, indeed, but a valid starting point because I hear this question too often. Doesn't matter if you are using SQL Server with SSIS, custom ETL, any data integration whether that be with Oracle or specialized tools like Informatica ... data quality is crucial to you BI application project. In the end, not ensuring data quality will likely result in poor decisions being made by the business from faulty statistics or incorrect data. When your end users are looking at BI reports, they should be looking at actionable knowledge. The analytics must be the result of vetted quality data.
Second, profile your data. This is very important and can help you identify common sources of data quality problems such as duplicate data, missing data and suspicious data. I would also recommend having a look at the new Excel data mining add-in cpabilities to examine source data files so that you can identify outliers in your data sources. Missing data and duplicate data are common sources of downstream business intelligence flaws. I promise to follow up on this post with a few words about finding proper entity sources and other data integration concepts including MDM very shortly to address those concerns.
A nice source of data quality strategies utilizing SQL Server is available here: http://msdn2.microsoft.com/en-us/library/aa964137.aspx.
That's all for now, folks. Until next time ... BR, Mark