What to Do with Bad Data? When and How to Dispose of It

Originally published September 23, 2009

Everybody hates poor quality data. But let’s face it, in most organizations you will run into data quality issues, at least from time to time. So instead of arguing against poor data quality, a more useful question is: how do you deal with it?

To make business intelligence (BI) applications add value to the corporation, some level of data integration invariably takes place. This might be a data warehouse (DW), operational data store (ODS), enterprise resource planning (ERP), customer relationship management (CRM) application or another application you might have. We’ll assume that the data quality problems originate in upstream (primary) systems that are generating your source data. In this article, I’ll focus exclusively on DW solutions.

There are two fundamentally different ways of dealing with bad data: either you load all of the data “as is” (and deal with the errors later) or you clean/scrub the data on the way in to the DW. The former is the approach advocated by Data Vault architects, the latter I will label the “Ralph Kimball” approach, in honor of his extensive writing on this subject. This article focuses on the pros and cons of both approaches.

Two Perspectives: Clean on the Way in or out of the DW

I have mentioned two fundamental architectural options for dealing with bad data. You either clean the data before it enters the DW, or alternatively, you load the bad data and subsequently deal with the errors on the way out of the DW. Note that both approaches will typically use some kind of data staging prior to loading data in the DW.

The “traditional” way has been to intercept "bad" data on the way in, and deal with it during the extract, transform and load (ETL) phase. This approach I will call the Kimball approach. In all fairness, several authors have written about this, but none so clearly and conclusively as Ralph Kimball. His book The Data Warehouse ETL Toolkit1 (2004) and a whitepaper (2007) titled, An Architecture for Data Quality2 are two excellent examples.

I’ll label the alternative approach a data vault architecture, as put forward (mainly) by Dan Linstedt in his book, The Business of Data Vault Modeling3. Under this architecture you load 100% of the data, all the time. After the initial DW load, only new records are added (deltas), but again, all of them. Note that an explicit choice is made to load data that is known to be of poor quality. So the DW houses the good, the bad and the ugly data. Only when data is moved to data marts (DMs) do you separate the “good” (which should appear in corporate reports) from the “bad” data (which is recorded in error marts). As data is accumulating in the DW, the entire history of erroneous records is preserved there.

What is a Data Vault ?

A data vault is a hybrid data modeling approach for enterprise data warehouses (EDWs). It combines the best of third normal form and star schema modeling, specifically catered to EDWs. The benefits are greater flexibility and agility and, therefore, superior resilience in the face of changing requirements. It (uniquely) provides auditability of data, which is dearly needed in this era of compliance and governance.

A data vault is a hybrid data modeling approach for enterprise data warehouses (EDWs). It combines the best of third normal form and star schema modeling, specifically catered to EDWs. The benefits are greater flexibility and agility and, therefore, superior resilience in the face of changing requirements. It (uniquely) provides auditability of data, which is dearly needed in this era of compliance and governance.

A data vault differs from traditional approaches (among others) in that known bad data is deliberately loaded into the DW. This is one of the requirements for auditability. If you change (or delete) data on the way in to the DW, the relationship with source systems is lost and, hence, auditability goes out the door. But more importantly, the philosophy behind this practice is that there is genuine value in storing bad data (I’ll discuss an example later) and that the perspective of what constitutes bad data is likely to evolve over time. Therefore, you need to load all data "as is" to maintain an integral view of an authentic history of the facts. What the business considers to be the "truth" changes over time, and a history of business rules that determines what is and is not good data may separately contain tremendous value.

Cleansing of data, separating what data should and should not appear in corporate reports, takes place when data is moved from the EDW to DMs. Consequently, end users do not query the EDW directly using ad hoc queries. They are provided with access to data either through DMs or standardized (and optimized) queries on the EDW. This is how you ensure that information that shows up in reports matches management’s opinion (perception) of what is considered the truth.

Please note there is considerably more to a data vault, but for the purpose of this article, I’ll limit the dsicussion to aspects as they pertain to dealing with data quality.

How Kimball Deals with Data Quality

In a Data Vault architecture, the EDW contains all the erroneous data. This is in marked contrast with the Kimball approach to dealing with data quality problems which cleans data on the way in to the DW. You flag which records are suspect/changed and set up business rules to determine the "best" replacement value. For this data imputation you might well use advanced statistical models. A separate error dimension contains the details pertaining to erroneous and replaced records. Faulty data need to be replaced because every star in a DW, supporting some business process, might be queried though a link to other star schemas using conformed dimensions. This DW modeling approach works on the assumption that all the data in the DW can be trusted, all the time.

Of course, despite your best efforts, sometimes bad data may still inadvertently enter a Kimball style DW. If analysis later reveals there were errors left in the DW, this is seen as a shortcoming of the ETL phase. Hopefully feedback about errors will help to resolve these problems in the future. The DW setup, from both a technical as well as a process standpoint is to only add data and basically never remove or overwrite data in a Kimball DW. If you would overwrite existing data this will cause inconsistencies in corporate reports, some of which may already have been (widely) distributed. Altering data after it has been committed to the DW is indeed a very rare occurrence. It is inherently "impossible" to change your mind about which data is good or bad and is presented as such to the end-user community.

Why Store Bad Data?

The Data Vault practice of deliberately loading "bad" data sometimes feels counter intuitive. Yet the reasons are very compelling. To begin with, bad data is rarely a purely technical problem. In most cases, bad data is a sign of something, somewhere going wrong. This is what Michael Hammer referred to in Reengineering the Corporation4: seemingly small data quality issues are, in reality, important indications of broken business processes. Maybe an interface doesn’t provide employees some necessary option and therefore they resort to rigging the system. We have all stood in line, listening to a clerk saying, “Hang on, the system doesn’t allow me to do this.” Front-office staff will make it happen, even if circumventing the system is required in some way, such as manipulating data to “force” a transaction to be completed. The relation between data quality errors and broken business processes can come in myriad forms and shapes.

The second reason for deliberately storing bad data is more subtle. When you overwrite (or delete) erroneous records on the way in to the DW, you need to make that call then and there (in a Kimball DW). When you store bad data as is, you are in a position to reconsider what constitutes good or bad data, and these definitions can (and often will) evolve over time.

One company I worked with set their DW processes so that a central CRM system would be the first and sole place where customer IDs were to be assigned. Since all systems and business process were geared that way, this worked remarkably well and led to very high standards of data quality. However, a very small percentage of records got lost every month because they had no customer IDs, so these were dismissed. Further analysis revealed these were exception prospects that entered through a so far unacknowledged sales channel.

When these records without a customer ID were dropped, the company literally “dropped the ball.” These customers were effectively ignored, as if they did not even exist. When the source of these errors was found, the company managed to capitalize on this as a business opportunity and grew it into a considerable revenue stream. This is a perfect example of how supposedly bad data holds value. It can be like a diamond in the rough, waiting to be discovered and exploited. The organization first discovered a broken business process (prospects that were not getting follow-ups) and fixed that. Later this proved to be a viable new business venture that came from “nowhere”: from prospects they did not know existed and that supposedly were not allowed to exist.

Bad data is here to stay for at least the foreseeable future. Because errors often only become apparent when you try to integrate data, a DW is particularly susceptible to run into these kinds of challenges. Installing robust processes to deal with this reality are therefore a necessity  a key to success with your DW.

I have distinguished two fundamental approaches to dealing with poor quality data. Either you delete or overwrite bad data on the way in to the DW (the Kimball style), or you load bad data “as is” and apply transformations (and filters) when you move data to DMs (Data Vault style). Both approaches have pros and cons.

An advantage of dealing with data quality errors on the way in to the DW is that your architecture requires one less layer of storage, and the EDW data is available to see (and presumably be trusted) by everyone in the corporation. The disadvantage is that you are stuck with any choice you make with regard to how you treat erroneous records. And you cannot mine how business rules that transform bad data into good data evolve over time.

The advantage of the data vault architecture is that all value inherent in storing bad data remains available as a source of learning and improvement. You retain maximum flexibility in how you choose to treat bad data. As your insights evolve and mature, you can constantly update your transformation (and selection) rules from the data warehouse to the data marts. Additionally, your DW is and remains forever fully auditable, and can, therefore, become the authoritative single source of the facts.

References

1. Ralph Kimball & Joe Caserta. The Data Warehouse ETL Toolkit. 2004. ISBN# 0764578578.

2. Ralph Kimball. An Architecture for Data Quality. Whitepaper Kimball Group. 2007.

3. Dan Linstedt, Kent Graziano & Hans Hultgren. The Business of Data Vault Modeling. 2008. ISBN# 9781435719149.

4. Michael Hammer. Reengineering the Corporation.  1994. ISBN# 1857880560.

  • Tom BreurTom Breur
    Tom Breur, Principal with XLNT Consulting, has a background in database management and market research. For the past 10 years, he has specialized in how companies can make better use of their data. He is an accomplished teacher at universities, MBA programs and for the Certified Business Intelligence Professional (CBIP) program. He is a regular keynoter at international conferences.  Currently,he is a member of the editorial board of the Journal of Targeting, the Journal of Financial Services Management and Banking Review. He acts as an advisor for The Council of Financial Competition and the Business Banking Board and was cited among others in Harvard Management Update about state-of-the-art data analytics. His company, XLNT Consulting, helps companies align their IT resources with corporate strategy, or in plain English, he helps companies make more money with their data. For more information you can email him at tombreur@xlntconsulting.com or call +31646346875.

     


Related TechTarget Editorial Content


 

Comments

Want to post a comment? Login or become a member today!

Posted September 24, 2009 by tombreur@xlntconsulting.com

Hi Ed,

I wholeheartedly agree that data governance is not a "project" nor a "focused activity" of some enlightened bunch. Just like "quality" can not be produced by a centralized task force. The entire organization needs to be imbued with it.

Interestingly most of my other papers have not been on "technical" but rather on organisational aspects of data quality. This topic is near and dear to my heart.

Is this comment inappropriate? Click here to flag this comment.

Posted September 24, 2009 by Ed Gillespie

Excellent analysis, Tom.  You've done a great job capturing the technology alternatives.  Perhaps, though, it's also important to develop a "data quality mindset" within organizations.  Navin Sharma at PBBI recently wrote "data governance and accountability need to lie with every employee and business partner and not just one or two individuals who would otherwise end up fighting a losing battle".  Interesting idea.  If you want to read his full comments, can check out http://ebs.pbbiblogs.com/2009/08/03/data-governance-its-everbodys-business/

Is this comment inappropriate? Click here to flag this comment.