Blog: Krish Krishnan

Why you need data profiling

An interesting problem that often surfaces in data warehousing and business intelligence work is the actual content of the different attributes. Take a simple data warehouse solution consisting of customer, product, time, location and transactions. As mandated by the business requirements, the data model has to accommodate multiple locations and their transactions in a unified presentation to the end business user. All of this is fine and dandy. The business users approve the data model in a data governance meeting, metadata definitions are agreed upon, and the physical database is created. You load the data warehouse, build your aggregates and summary data, and declare it ready for QA and UAT.

A harried report user calls out an error in the calculations for certain locations. This sets off a chain of investigations, and after hours of effort from various team members (not to forget the Starbucks coffee and Krispy Kreme donuts) it is determined that the sales data reported by these locations is at a different level than the data from the rest of the locations. Your first instinct is to review the source-to-target data mapping, the end user training notes, the data model reviews and so on. Even after going through them with a fine-tooth comb, you cannot determine how this occurred. The net result is that all the data loaded for these locations has to be scrubbed and reloaded. This is not too bad if the source data is still available; otherwise it is a far worse problem.

How could you mitigate these issues? What processes need to be adopted to reduce the risk? A few simple steps can help mitigate the problem to a large extent:

1. Confirm the gathered business requirements against sample data.

Whatever steps are taken, they should be carried out proactively.
This will alleviate the risks and minimize the need to revisit the issue at a later point, when any mitigation strategy will be expensive.
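
To make the idea of checking requirements against sample data concrete, here is a minimal sketch of a per-location profile in Python. Everything in it is an assumption made for illustration: the function name `profile_sales_by_location`, the location codes, the sample amounts and the tenfold threshold are all hypothetical, and a real profiling exercise would cover many more attributes than a single sales amount. It simply summarizes the sales values per location and flags any location whose median is far out of line with the others, which is the kind of mismatch described above.

```python
from collections import defaultdict
from statistics import median


def profile_sales_by_location(rows, ratio_threshold=10.0):
    """Summarize sales amounts per location and flag suspect locations.

    rows: iterable of (location, amount) pairs.
    A location is marked suspect when its median amount differs from the
    median of all location medians by more than ratio_threshold times
    (an arbitrary, illustrative cutoff).
    """
    by_location = defaultdict(list)
    for location, amount in rows:
        by_location[location].append(amount)

    medians = {loc: median(vals) for loc, vals in by_location.items()}
    overall = median(medians.values())

    report = {}
    for loc, vals in by_location.items():
        # Guard against division by zero; treat that case as suspect.
        ratio = medians[loc] / overall if overall else float("inf")
        report[loc] = {
            "count": len(vals),
            "min": min(vals),
            "max": max(vals),
            "median": medians[loc],
            "suspect": ratio > ratio_threshold or ratio < 1 / ratio_threshold,
        }
    return report


# Hypothetical sample extract: three locations report sales in dollars,
# while "SF" appears to report in thousands of dollars.
sample_rows = [
    ("NYC", 1200.0), ("NYC", 1500.0), ("NYC", 1100.0),
    ("CHI", 1300.0), ("CHI", 1250.0), ("CHI", 1400.0),
    ("LA", 1150.0), ("LA", 1350.0), ("LA", 1450.0),
    ("SF", 1.2), ("SF", 1.4), ("SF", 1.3),
]

report = profile_sales_by_location(sample_rows)
for loc, stats in sorted(report.items()):
    print(loc, stats)
```

Running a check like this on a sample extract before the first full load costs minutes, whereas finding the same discrepancy after aggregates are built means scrubbing and reloading.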