Business Intelligence Network Business Intelligence Resources

Blog: Krish Krishnan

« Does data have a lifecycle? Part II | Main | A Scalable Architecture »

Why you need Data profiling

An interesting problem that often surfaces in data warehousing and business intelligence activities is the content within the different attributes.

Take a scenario of a simple data warehouse solution consisting of customer, product, time,location and transactions. This data model has to accommodate multiple locations and their transactions in a unified presentation to the end business user, as mandated by the business requirements. All of this is fine and dandy.

The data model is approved by the business users in a data governance meeting and metadata definitions are agreed upon and the physical database has been created. Now you load the data warehouse, then you build your aggregates and summary data and declare that it is ready for QA and UAT.

A harried report user calls out an error in the calculations for certain locations. This sets of a chain of investigations and after spending hours of time from various team members (not to forget the starbucks coffee and krispy kreme donuts) it is determined that the value of the data as reported by these locations for sales is at a different level than the rest of the locations.

Your first instinct is to start looking at data mapping from source to target, look at end user training notes, data model reviews etc. Even after combing with a fine tooth comb you cannot determine how this occurred. Net-net is that all the data loaded for these locations have to be scrubbed and data has to be reloaded, this is not bad if you have the source data available else it is a far worse problem.

How could you mitigate these issues? what processes need to be adopted to mitigate the risk, well a few simple steps can help mitigate the problem to a large extent

1. Confirm the business requirements gathered with sample data.
2. Setup and execute a data profiling activity, tools are available for under $1000 to use. Profile the data from each input to confirm that values within the data attributes are consistent for type, length and usage.
3. If your results from the steps above show problems with the data content in attributes, classify the attributes in question and present the problem to the business and data governance teams.
4. Do all these activities parallel to the data modeling processes, this way you can make changes to the data model if needed, before anything goes to a physical database.

Whatever maybe the steps executed, they should be done in a proactive fashion. This will alleviate the risks and minimize the need to revisit the issue at a later point where any mitigation strategy will be expensive.

  Posted by kkrishnan on August 14, 2007 1:52 PM |

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)