

Six Sigma Data Quality Processes

Originally published May 2, 2006

In a previous article, The Partnership of Six Sigma and Data Certification, we discussed how Six Sigma concepts, methodologies, and practices can naturally be applied to data certification improvement programs. We also reviewed how companies, by applying Six Sigma techniques and metrics, can realize multiple benefits: reduced costs, reduced risks, increased revenues, improved margins and improved regulatory compliance.

In this article we will explore methods for controlling data quality throughout the data certification production process.  We will also illustrate how two major banking institutions successfully accomplished this goal.

As a quick review of previously presented concepts:

What is Six Sigma?
The narrowest definition of Six Sigma is a statistical one: controlling a process to limit output defects to 3.4 defects per million opportunities. A defect is anything outside a customer’s requirement specification; an opportunity is any chance of a defect occurring. Most companies use Six Sigma more broadly, though, to refer to the goal of achieving near-ideal quality for a product or service through new or improved processes and tools, as well as to the mind-set of satisfying customer requirements.

The Link Between Six Sigma and Certified Data
Certified data is data that has been subjected to a structured quality process to ensure that it meets or exceeds the standards established by its intended consumers. Such “standards” are typically documented via service level agreements (SLAs) and administered by an organized data governance structure. Six Sigma promotes measurable standards that will impact quality and/or productivity and also promotes measurable corporate results.

Since Six Sigma and data quality improvement share the same goal of reducing defects, data certification improvement programs are natural candidates for the application of Six Sigma methodologies.

Measuring Data Quality
Central to Six Sigma and data certification is the ability to measure data quality throughout the entire process and compare the actual outputs to the desired, required or expected outputs. The ability to certify data is determined by how closely the data produced reflects the data that was required or expected. Some typical metrics used to certify data include:

  • Accuracy/precision
  • Completeness
  • Reliability
  • Availability
  • Timeliness/freshness
  • Consistency
  • Uniqueness
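Several of these metrics can be computed directly against a data set. As a minimal illustration only, the following Python sketch (using pandas; the table and column names are hypothetical) scores completeness and uniqueness for a small customer extract.

    # A minimal sketch: completeness and uniqueness scores for a hypothetical
    # customer extract. Column names and sample values are assumptions.
    import pandas as pd

    def completeness(df: pd.DataFrame, column: str) -> float:
        """Fraction of rows with a non-null value in the given column."""
        return df[column].notna().mean()

    def uniqueness(df: pd.DataFrame, column: str) -> float:
        """Fraction of non-null values that appear exactly once."""
        values = df[column].dropna()
        if values.empty:
            return 1.0
        return (~values.duplicated(keep=False)).mean()

    customers = pd.DataFrame({
        "customer_id": [101, 102, 102, 104],                # 102 is duplicated
        "postal_code": ["30303", None, "30305", "30306"],   # one missing value
    })

    print(f"customer_id uniqueness:   {uniqueness(customers, 'customer_id'):.2f}")   # 0.50
    print(f"postal_code completeness: {completeness(customers, 'postal_code'):.2f}")  # 0.75

Equivalent checks can be written for the remaining metrics, such as timeliness against a load timestamp or consistency against reference data.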

Controlling Data Quality
At a high level, controlling data quality is all about wrapping a process around the tasks of sourcing, transforming and publishing data that enables data quality/certification.  Six Sigma provides a framework or structure around the collection, analysis and control of these processes to improve the level of data quality/certification.

Two sets of interdependent processes are used to accomplish these data quality (DQ) objectives: off-line and in-line. The off-line DQ process is run outside of the certified data production process, while the in-line DQ process is run in synchronization with the certified data production process. The relationship between the two DQ processes is shown in Figure 1.

 

Figure 1: Data Quality Process Cycle

Off-Line DQ Process
The off-line DQ process is used to perform the initial data quality assessment of the input sources and is then executed periodically thereafter. This process is executed off-line or outside of the data production process because it is usually heavily resource (CPU and disk) intensive and because it is a semi-automated process involving manual inspection and analysis of the results.

During this process, the input data sources are analyzed with respect to structural integrity and consistency, valid numeric values and ranges, statistical anomalies, duplicate values, missing values and other data errors that violate business rules or expected behavior. An example of a business rule may be that a sales account representative is assigned to only one market segment. The corresponding data error or violation would be a sales transaction recorded for a sales representative outside of his or her market segment.
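As a minimal sketch of that business rule, using hypothetical assignment and transaction tables, the check below joins each sales transaction to its representative’s assigned segment and reports the mismatches as candidate data errors.

    # A minimal sketch of the business-rule check described above. Table and
    # column names are assumptions chosen for illustration.
    import pandas as pd

    assignments = pd.DataFrame({
        "rep_id": ["R1", "R2"],
        "assigned_segment": ["retail", "commercial"],
    })

    transactions = pd.DataFrame({
        "txn_id": [1, 2, 3],
        "rep_id": ["R1", "R1", "R2"],
        "segment": ["retail", "commercial", "commercial"],
    })

    # Join each transaction to its rep's assigned segment and keep the mismatches.
    joined = transactions.merge(assignments, on="rep_id", how="left")
    violations = joined[joined["segment"] != joined["assigned_segment"]]

    # txn_id 2: rep R1 (retail) booked a commercial-segment sale -> candidate error
    print(violations[["txn_id", "rep_id", "segment", "assigned_segment"]])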

As shown in Figure 2, the process outputs include a “profile” of the source data sets and a list of the candidate data errors. Oftentimes, the input sources are provided by other organizations, processes and/or third-party vendors outside of your direct control. In these instances, it is important to establish service level agreements (SLAs) with the data source provider.

The process also outputs a set of profile rules and remediation rules that are then used by the in-line DQ process.

 

Figure 2: Off-Line DQ Process

In-Line DQ Process
The in-line DQ process, shown in Figure 3, is run in synchronization with the certified data production process and consists of three types of data quality control checks.

Figure 3: In-Line DQ and Data Certification Processes


Profile Checks
Profile checks, using the profile rules generated by the off-line DQ process, are used to measure and control data quality of the input sources. Since these are the raw inputs to the certified data production process, it is vital that the data quality of the sources be as high as possible. Defects introduced by the sources will be costly to fix in the production processes. If they can’t be fixed, these defects will be passed through the production processes, thereby degrading the quality of the final certified data.

For example, a major credit card company develops multiple weekly marketing campaigns. It relies on a third-party vendor to provide a list of customers and associated addresses fitting its segmentation and risk profiles. The list comprises millions of customers. Offering the wrong cards to the wrong customers could lead to lost acquisition opportunities or potentially high default costs.
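In practice, a profile check can be as simple as applying the rules produced by the off-line process to each incoming batch before it enters the production stream. The Python sketch below is illustrative only; the rule structure, column names and tolerances are assumptions rather than any particular product’s implementation.

    # A minimal sketch of an in-line profile check: range and null-rate rules
    # (assumed to have been generated off-line) are applied to an incoming batch.
    import pandas as pd

    profile_rules = [
        {"column": "balance",   "min": 0.0,  "max": 10_000_000.0, "max_null_rate": 0.0},
        {"column": "open_date", "min": None, "max": None,         "max_null_rate": 0.01},
    ]

    def run_profile_checks(batch: pd.DataFrame, rules: list) -> list:
        """Return a list of human-readable faults; an empty list means the batch passes."""
        faults = []
        for rule in rules:
            col = batch[rule["column"]]
            null_rate = col.isna().mean()
            if null_rate > rule["max_null_rate"]:
                faults.append(f"{rule['column']}: null rate {null_rate:.2%} exceeds tolerance")
            if rule["min"] is not None and (col.dropna() < rule["min"]).any():
                faults.append(f"{rule['column']}: value below allowed minimum {rule['min']}")
            if rule["max"] is not None and (col.dropna() > rule["max"]).any():
                faults.append(f"{rule['column']}: value above allowed maximum {rule['max']}")
        return faults

    # Usage: faults = run_profile_checks(incoming_batch, profile_rules)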

Process Checks
Converting the input sources to certified data involves a sequence and combination of extraction, transformation, staging and loading operations. During any one of these operations, defects or faults may be introduced due to dropped records, interrupted processing or logic exceptions. Process checks use a set of rules and thresholds to detect, measure and report these types of faults.
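One common process check is a record-count reconciliation between consecutive operations, as in the hypothetical sketch below (the step names and the 0.1% loss threshold are assumptions).

    # A minimal sketch of a process check: compare record counts between
    # consecutive production steps and flag a fault when the loss exceeds
    # a threshold. Step names and the 0.1% threshold are assumptions.

    def check_record_counts(counts_by_step: dict, max_loss_rate: float = 0.001) -> list:
        """Compare counts step to step and report any drop larger than max_loss_rate."""
        faults = []
        steps = list(counts_by_step.items())
        for (prev_step, prev_count), (step, count) in zip(steps, steps[1:]):
            if prev_count == 0:
                continue
            loss_rate = (prev_count - count) / prev_count
            if loss_rate > max_loss_rate:
                faults.append(
                    f"{prev_step} -> {step}: lost {prev_count - count} records "
                    f"({loss_rate:.3%}), above the {max_loss_rate:.3%} threshold"
                )
        return faults

    # The extract -> transform step drops 0.25% of the records, so one fault is reported.
    print(check_record_counts({"extract": 1_000_000, "transform": 997_500, "load": 997_480}))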

User-Defined Checks
Certifying data – especially certifying data for specific uses – relies heavily on user/business-defined metrics.  The actual certification is performed in the transformation step where data is converted, combined, calculated and aggregated to form new data elements and values. To monitor and control these transformation operations, users need to define metrics and corresponding business rules that establish selected monitoring probe points in the data production process. An example used by a major bank as a probe point is the computation of customer average daily balance after performing a cumulative aggregation. An example used by a credit card company is the computation of customer average monthly credit.
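As a minimal sketch of such a probe point, assuming a hypothetical table of daily balances, the check below recomputes the average daily balance after aggregation and flags customers whose produced values deviate from the recomputed values by more than a tolerance.

    # A minimal sketch of a user-defined probe point: recompute the average
    # daily balance and compare it to the value the production process produced.
    # Table, column names and the 1% tolerance are assumptions.
    import pandas as pd

    def average_daily_balance(daily_balances: pd.DataFrame) -> pd.Series:
        """Average daily balance per customer over the statement period."""
        return daily_balances.groupby("customer_id")["balance"].mean()

    def probe_average_daily_balance(daily_balances: pd.DataFrame,
                                    produced: pd.Series,
                                    tolerance: float = 0.01) -> pd.Index:
        """Return the customers whose produced value deviates from the recomputed one."""
        expected = average_daily_balance(daily_balances)
        deviation = (produced - expected).abs() / expected.abs().clip(lower=1.0)
        return deviation[deviation > tolerance].index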

Taking Action
The whole goal of Six Sigma is improvement: improvement of the processes or, in the case of this article, improvement in the quality/certification of data. So far, we have explored the core of the Six Sigma data effort: measurement.  Now that we are able to collect data both off-line and in-line, what do we do with what we have found?

Two processes should be executed once data is collected either in-line or off-line:

  1. Scoring: This process focuses on evaluating the captured metric data in order to provide a measurement (score) of the degree of data quality.  This score is published with the data and is available for use in reporting so the end data consumer can understand the degree of confidence that can be placed in the data.  Scoring is more complex and will be the subject of a future article.

  2. Monitoring and control: This process focuses on acting on the metric data captured during the four measurement processes discussed earlier in this article. The emphasis here is data quality and process improvement.  At each of the four points where data quality metrics are collected (see Figures 1, 2 and 3), the following monitoring and control process should be implemented:
    • Off-line profiling: structural, numerical, statistical and rule analysis
    • In-line profile check
    • In-line process check
    • In-line user-defined rule check

Monitoring and control is a straightforward process for determining a course of action to take based on a set of parameters and rules. During this process, the following sequential steps are executed (see Figure 4):

    • Collect: The monitored data points are collected and stored. The storage may be temporary or persistent.

    • Classify: The monitored data points are classified and categorized based on the type of check performed, the priority of the data quality check, and user-selected data quality attributes.

    • Detect: Rules are executed based on the classification of the data quality data points. If a data quality fault is detected, an action is taken.

    • Act/control: If a fault is detected, a sequence of one or more actions is initiated. These may include providing e-mail notification, fixing the fault, aborting the data quality job stream or continuing with the job stream while noting exceptions.

    • Log: The detected fault and the resulting actions are stored in a log file that can be used for auditing or analysis.

Figure 4: Monitoring and Control Process
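A minimal sketch of this cycle, with the classification categories, rule shape and fault actions chosen purely for illustration, might look like the following.

    # A minimal sketch of the collect/classify/detect/act/log cycle described
    # above. The priority labels, range-style rule and fault actions are assumptions.
    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("dq.monitor")

    def monitor(check_type: str, name: str, value: float, rules: dict) -> bool:
        """Run one data point through the cycle; return False if the job stream should stop."""
        # Collect: capture the monitored data point.
        data_point = {"check_type": check_type, "name": name, "value": value}

        # Classify: attach a priority based on the type and importance of the check.
        priority = rules.get("priority", "normal")

        # Detect: apply the rule for this data point (here, a simple range rule).
        fault = not (rules["lower"] <= value <= rules["upper"])

        # Act/control: on a fault, take the configured action
        # (notify, abort the job stream, or continue while noting the exception).
        action = rules.get("on_fault", "notify") if fault else "none"
        if fault:
            log.warning("DQ fault (%s priority) on %s: value %s; action=%s",
                        priority, name, value, action)

        # Log: record the data point, the fault and the resulting action for auditing.
        log.info("collected=%s fault=%s action=%s", data_point, fault, action)
        return not (fault and action == "abort")

    # Example: a high-priority profile check whose null-rate threshold is breached.
    keep_going = monitor("profile", "balance_null_rate", 0.03,
                         {"lower": 0.0, "upper": 0.01, "priority": "high", "on_fault": "abort"})
    # keep_going is False, so the job stream would be halted pending resolution.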

In summary, we have identified a process with four points of data collection that comprise the foundation for a Six Sigma data quality process.  As metrics are collected at each of the four points, they are acted upon, through monitoring and control, in a manner consistent with Six Sigma objectives.  The summary diagram in Figure 5 illustrates how all of the parts fit together.

 

Figure 5: Foundation for a Six Sigma Data Quality Process

Real-World Examples
How an organization begins implementing a data quality/certification program to accomplish Six Sigma goals is key to the program’s ultimate success.  Starting too small makes it difficult to prove the value.  Starting too large makes it very likely that the project will become overwhelmed.  Here are some real-world examples of how two of the largest banking institutions began their journey.

Institution A
A major U.S. financial institution is well on its way to implementing a data quality/certification process across all of its enterprise data, which currently comprises 80 sources.  The first step in this process was driven by the bank’s regulatory compliance requirements.  The bank needed to supply Sarbanes-Oxley-compliant demand deposit transaction data to its finance data warehouse, where data is aggregated, analyzed and used for reports to management, regulators and investors.  As part of the assurance process, this data is processed through a data hub where the mainframe-supplied demand deposit transactions are extracted, converted and transformed into a form usable by the bank’s financial accounting system.  In addition, a robust monitoring and control process was used to implement the tight rules and thresholds required by the bank for detecting potential faults during both profile and process checks:

    • Collect: The numbers of records and bytes are captured after key lookup and aggregation steps. A user-defined check of total average monthly balance is also calculated and monitored.

    • Classify: These data quality checks are classified as high priority/alert checks.

    • Detect: Values are compared to a moving average of previous months’ values. If a value deviates from that average by more than 5%, a fault alarm is raised (see the sketch following this list).

    • Act/control: If a fault is detected, an alarm is raised and a message is sent to both system operators and data analysts familiar with demand deposit data transactions. Further processing is stopped until a resolution is reached, which may be a decision to continue processing or to correct errors and re-initiate the transformation process.

    • Log: The data quality check point values, the fault and any subsequent actions are logged.
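A minimal sketch of that detection rule, with the moving-average window and sample values chosen purely for illustration, might look like this.

    # A minimal sketch of the detect step above: compare the current value to a
    # trailing moving average of previous months and flag a fault when it
    # deviates by more than 5%. Window length and sample values are assumptions.

    def deviates_from_moving_average(history: list, current: float,
                                     window: int = 3, threshold: float = 0.05) -> bool:
        """True when current deviates from the trailing moving average by more than threshold."""
        recent = history[-window:]
        baseline = sum(recent) / len(recent)
        return abs(current - baseline) / baseline > threshold

    monthly_record_counts = [1_020_000, 1_015_000, 1_030_000]
    print(deviates_from_moving_average(monthly_record_counts, 955_000))    # True: ~6.5% drop
    print(deviates_from_moving_average(monthly_record_counts, 1_025_000))  # False: within 5%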

Institution B
A second major financial institution has focused on the need to develop and implement a data quality/certification program that will span the entire enterprise.  This is, by any measure, a monumental task, and developing such a program and implementing it as a “big bang” is fraught with risk and expense.  They have elected to start by focusing on a single data store that is heavily used across all of the major lines of business within the consumer bank, implementing a profiling and scoring process that will enable them to certify large sections of the data and prove the value of the process to the LOBs before rolling it out more broadly.  Their solution is shown in Figure 6.

Figure 6: Institution B Solution

Achieving Six Sigma Goals
In addition to defining data quality metrics, requirements and goals, institutions must include a data quality monitoring and control infrastructure as an inherent part of their certified data production processes. Only then can companies improve their levels of data certification and achieve their Six Sigma goals.

 

  • Duffie Brunson

    Duffie is a Senior Principal for Financial Services at Knightsbridge Solutions. With more than 30 years of experience in financial institutions as both a banker and consultant, he has been involved with leading-edge developments within the industry, including the creation of the automated clearinghouse, the debit card, in-home transaction services, co-branded credit cards, electronic payment networks, financial advisory/planning services and integrated customer data warehouses.

    Duffie holds an undergraduate degree from the University of Virginia and an MBA from Georgia State University. He is a graduate of the Seidman Auditing School at the University of Wisconsin, and the Stonier School of Banking at Rutgers University. He has served as a member of the ABA's Operations and Automation Quality Council, as a faculty member of the Graduate School of Banking at Colorado, and as a lecturer at the Universita' Cattolica del Sacro Cuore in Milan, Italy.

    Duffie can be reached at dbrunson@knightsbridge.com.


  • Sid Frank

    Sid is a senior principal for Financial and Government Services at Knightsbridge Solutions, HP’s new Information Management practice. Sid’s expertise includes practice management and systems development. At Knightsbridge, Sid manages data management assessment and development projects. He is a former senior manager with PricewaterhouseCoopers and Naviant. At both firms, Sid focused on the practice of designing and developing business-critical decision support systems, knowledge management systems, and competitive intelligence systems for the financial, telecommunications, and retail industries. At GE, he was responsible for managing both development and R&D programs. Sid holds an executive MBA from Temple University, a Masters of Systems Engineering from the University of Pennsylvania, and a Bachelor's degree in Electrical Engineering from CUNY.

    Sid has written “An Introduction to Six Sigma Pricing” and “What’s in a Price: Losing Earnings Through Price Confusion.” He coauthored “The Partnership of Six Sigma and Data Certification,” “Six Sigma Data Quality Processes” and “Managing to Yield: A Path to Increased Earnings.” Sid can be reached at sfrank@knightsbridge.com.
