Originally published May 18, 2010

Information quality (IQ) management, as with manufacturing quality management, focuses on statistical quality control to ensure that information products produced and delivered to information consumers meet or exceed their expectations and knowledge requirements.

Chapter 5 in my new book, Information Quality Applied: Best Practices for Business Information, Processes and Systems, describes an effective step-by-step approach to measure accurately the extent of information production process failure in important “critical-to-quality” IQ characteristics that will get and sustain the attention of your executive leadership team.

Key definitions include:

Control Chart: A graphical chart for reporting process performance over time used to monitor whether a process is in control and producing consistent quality.

Statistical Process Control (SPC): The application of statistical methods, including control charts, to measure, monitor, and analyze a process or its output in order to improve the stability and capability of the process, a major component of statistical quality control.

Statistical Quality Control (SQC): The application of statistics and statistical methods to assure quality, using processes and methods to measure process performance, identify unacceptable variance, and apply preventive actions to maintain process control that meets all customer expectations. SQC consists of process measurement, acceptance sampling, and process improvement.

Steps for implementing statistical process control for information processes:

- Initial Planning: Identify the critical-to-quality (CTQ) IQ characteristics that your executives need to watch to ensure enterprise success. See pages 180-186 for a comprehensive list of IQ characteristics. Note that different information consumer groups may have different IQ characteristics they require. Create a short list of those CTQ IQ characteristics for each information consumer group, such as accuracy, completeness, timeliness, information accessibility, and information presentation clarity.
- TIQM Process P2.1: Document the information group to be assessed and the production processes in the CTQ core business value circles that create and deliver a CTQ information group.
- TIQM Process P2.2: Plan your IQ objectives, the characteristics for measuring quality, and the specific tests for statistical quality control. Note that this measures process effectiveness – not just quality in a database. Ensure that your measurement of the information is not biased.
- TIQM Process P2.3: Identify the business value circle producing the CTQ information required by the knowledge workers. Document all process steps that create, update, and deliver information. Use the SIPOC (Supplier-Input-Process-Output-Customer) charts to identify CTQ information consumer requirements.
- TIQM Process P2.4: Determine the processes to assess. These include the process or process steps that create, update, calculate, retrieve and deliver information.
- TIQM Process P2.5: Identify the accuracy verification sources against which to measure accuracy. Accuracy is one of the most important inherent characteristics of information quality.
- TIQM Process P2.6: Extract a statistically valid sample of records. This keeps the costs of sampling and measurement down without compromising the quality of the IQ assessments of each CTQ IQ characteristic.
- TIQM Process P2.6.1: Identify the total number of records in the full population of records. For statistical process control, count the number of records created in a typical cycle of time to calculate an average number of records created in a day (week or month, depending on your natural cycle).
- TIQM Process P2.6.2: Estimate the standard deviation of the total population of records by extracting a small sample (50 or 100 records) that will be used to calculate your sample size for this process control measurement. To calculate the standard deviation of the sample, use the formula:

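s = the standard deviation of a sample:

$$ s = \sqrt{\frac{\sum d^{2}}{n - 1}}, \qquad d = x - \bar{x} $$

where x is the number of errors in a record, x̄ is the mean number of errors per record, d is the deviation of a record from that mean, and n is the number of records in the sample.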

Note: Beware of those who push you to create a single figure representing an aggregated “score” of information quality. If you do this, you will not be able to identify which quality characteristics are in control or not, nor can you easily identify root cause and process improvements that could prevent the recurrence of information defects.

Steps to calculate the standard deviation:

- Count the number of records in the sample
- Count the number of data elements in all records that contain a defect of the IQ characteristic measured
- Calculate the mean (x̄), or average number of errors per record, by dividing the total number of errors by the number of sampled records
- Calculate the deviation (d) of each record by subtracting the mean number of errors from the actual number of errors in the record
- Calculate the deviation squared (d²) for each record by multiplying the deviation by itself
- Calculate the sum of the deviations squared (Σd²) by adding all of the deviations squared together
- Calculate the standard deviation of the data sample (s) by dividing the sum of the deviations squared (Σd²) by one less than the number of records in the sample (n-1) and taking the square root of the result
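The steps above can be sketched in Python. This is a minimal illustration rather than the book's worked example: Table 1 is not reproduced here, so the way the 23 errors are spread across the 10 defective records is a hypothetical assumption, and the names `errors_per_record` and `sample_standard_deviation` are illustrative.

```python
import math

# Hypothetical sample of 50 records: 40 records with no errors and
# 10 records whose error counts sum to 23 (the per-record split is assumed).
errors_per_record = [0] * 40 + [1, 1, 2, 2, 2, 2, 3, 3, 3, 4]


def sample_standard_deviation(values):
    """Return (mean, s) following the listed steps, using the n - 1 divisor."""
    n = len(values)                              # number of records in the sample
    mean = sum(values) / n                       # mean errors per record
    deviations = [v - mean for v in values]      # deviation d of each record
    sum_sq = sum(d * d for d in deviations)      # sum of the deviations squared
    return mean, math.sqrt(sum_sq / (n - 1))     # standard deviation s


mean, s = sample_standard_deviation(errors_per_record)
print(f"mean = {mean:.4f} errors per record, s = {s:.4f}")
```

With these assumed counts the mean is 0.46 errors per record; the standard deviation depends on how the errors cluster, so a real sample will give a different s.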

Table 1: Sample of 50 Records with 23 NOT Accurate Data Values Marked “X” in 10 Records with One or More Defects

Figure 1: Calculation of Standard Deviation

Calculate Sample Size

The formula for determining a statistical sample size is:

$$ n = \left( \frac{z \times s}{B} \right)^{2} $$

where:

n = the number of records to extract.

z = a constant representing the confidence level you desire: how confident you are that the measurement of the sample is within some specified variation of the actual state of the data population. The confidence level is the degree of certainty, expressed as a percentage, about the estimate of the mean. For example, a 95 percent confidence level indicates that if you took 100 samples, the mean of the total population would be expected to fall within the confidence interval (sample mean plus or minus the bound) in about 95 of the 100 samples taken. Statistical tables list these constants; the three most-used confidence levels and their constants are shown in Figure 2.

Figure 2: z Constant Values for Confidence Level desired
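For confidence levels not listed in a table, the z constant can be computed from the standard normal distribution. The sketch below uses Python's standard-library `statistics.NormalDist` and assumes a two-sided interval; the function name `z_for_confidence` is illustrative, and the values it prints may differ in the last decimal place from a printed table.

```python
from statistics import NormalDist


def z_for_confidence(confidence_level):
    """Two-sided z constant for a confidence level given as a fraction (e.g. 0.95)."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be between 0 and 1")
    # Split the excluded probability equally between the two tails.
    upper_tail = (1.0 - confidence_level) / 2.0
    return NormalDist().inv_cdf(1.0 - upper_tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.3f}")
```

For 90%, 95% and 99% this gives approximately 1.645, 1.960 and 2.576.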

s = an estimate of the standard deviation of the data population being measured. In the sample size formula, n grows with the square of s: for a given confidence level and bound, the larger the variation of errors within the data population, the larger the sample required to get an accurate picture of the entire population, and the smaller the variation, the smaller the sample required. Rare defects are a separate consideration: the fewer the errors, the more records must be sampled to find the defective records, so a bound set as a fraction of a small mean still calls for a large sample.

B = the bound or the precision of the measurement. This represents the variation from the sample mean within which the mean of the total data population is expected to fall given the sample size, confidence level, and standard deviation. If a sample has a mean of 0.4600 errors per record, and a bound of 0.0460, the mean of the total data population is expected to fall within a range of 0.4600 ± 0.0460, or from 0.4140 to 0.5060 errors per record, given the sample size, confidence level, and standard deviation.
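The bound and the sample size formula are two views of the same relationship. As a sketch, assuming the usual large-sample approximation in which the sample mean has a standard error of s divided by the square root of n, the bound is the half-width of the confidence interval, and solving for n gives the sample size formula above:

$$ B = z \times \frac{s}{\sqrt{n}} \;\Longrightarrow\; \sqrt{n} = \frac{z \times s}{B} \;\Longrightarrow\; n = \left( \frac{z \times s}{B} \right)^{2} $$

Read this way, halving the bound quadruples the required sample size.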

NOTE

If records of a given type, such as Order, are captured in separate locations, such as call centers or stores, you have a distributed population. Your sample should have a proportional representation of records from the distributed populations that together make a union of a single enterprise population. Your subsequent analysis may yield different patterns of error in the different strata of distributed sources.
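One way to get proportional representation is to split the total sample across the sources according to their record counts. The sketch below is an assumption about how that could be done, using the largest-remainder method so the allocations add up exactly; the source names and counts are hypothetical.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to their record counts.

    Uses the largest-remainder method so the allocations sum exactly to
    total_sample.
    """
    population = sum(stratum_sizes.values())
    if population == 0 or total_sample > population:
        raise ValueError("sample must not exceed a non-empty population")
    quotas = {name: total_sample * size / population
              for name, size in stratum_sizes.items()}
    allocation = {name: int(quota) for name, quota in quotas.items()}
    leftover = total_sample - sum(allocation.values())
    # Hand the remaining records to the strata with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda name: quotas[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical daily Order counts per capture location.
orders_by_source = {"call_center": 1200, "store_east": 600, "store_west": 400}
print(proportional_allocation(orders_by_source, 300))
```

Each source is then sampled at random for its allotted number of records, which also lets you compare error patterns between sources.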

NOTE

For ease of statistical process control, select a sample quantity above the calculated minimum sample size and use that same quantity for every measurement. A constant sample size keeps the upper and lower control limits on your quality control charts the same from one period to the next.

Steps to calculate the sample size:

- Define the confidence level you desire for the assessment (99%, 95%, 90% or other) and put the corresponding z constant, from above, in the sample size formula.

NOTE

An alternative to determining the bound is to set a fixed number of records to sample, such as 300, 500, 1,000 or 2,000, and a desired confidence level, and then let the bound be calculated from the sample size formula.

- Calculate or select the sample size for the assessed information group:

$$ n = \left( \frac{z \times s}{B} \right)^{2} $$

2.1 Determine the confidence level (99%, 95%, 90%). I recommend 95% CL.

2.2 Determine the bound ± variation from the mean of a sample

2.3 Determine the standard deviation from a quick sample or from a previous calculation

2.4 Multiply the z constant times the standard deviation (s)

2.5 Divide the result by the bound

2.6 Square the result

2.7 Round up to the next whole number. This is the minimum sample size needed to achieve the chosen confidence level and bound.

Figure 3: Example of a Calculation of Sample Size
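Steps 2.4 through 2.7 translate directly into a few lines of Python. The sketch below is illustrative only: the input values are assumptions chosen for easy arithmetic, not the values in Figure 3, and `bound_for_sample_size` is simply the same formula rearranged for the case where you fix the number of records and let the bound be calculated.

```python
import math


def sample_size(z, s, bound):
    """Minimum n = ((z * s) / bound) ** 2, rounded up to a whole record."""
    if bound <= 0:
        raise ValueError("bound must be positive")
    return math.ceil(((z * s) / bound) ** 2)


def bound_for_sample_size(z, s, n):
    """Rearranged formula: the bound achieved by a fixed sample of n records."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * s / math.sqrt(n)


# Illustrative inputs only: 95% confidence (z = 1.96), s = 1.0, bound = 0.1.
n = sample_size(z=1.96, s=1.0, bound=0.1)
print("minimum sample size:", n)
print("bound for 500 records:", round(bound_for_sample_size(1.96, 1.0, 500), 4))
```

With these inputs the minimum sample size is 385 records, and a fixed sample of 500 records gives a bound of about 0.0877 errors per record.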

P2.6.4: Implement a mechanism to execute the random number generator in the sampling process.

TIP

If you wish to extract a given percentage of records from a population, generate a random number r between 0 and 1 for each record and select the record when r is at or below your sampling fraction, where a fraction of 0 means no sampling and 1 means 100% sampling. If your percentage is five percent of the records, set the fraction to 0.05 (5% of 1) and select each record where r ≤ 0.05.

TIP

If you wish to extract a given number of records from a population, determine the number of records to be extracted and calculate it as a percent of the total population. If, for example, you wish to select 300 records out of a population of 2,200 records (say, the average number of orders per day), then 300 / 2,200 = 13.64%. Generate a random number r between 0 and 1 for each record, set the sampling fraction to 0.1364 (13.64% of 1), and select each record where r ≤ 0.1364. This is especially important if you want to collect about the same number of records in each sample. There will be some variation in the number sampled, both because the selection is random and because the number of orders taken varies from day to day.
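Both tips can be sketched with Python's standard `random` module. This is a minimal illustration under stated assumptions: the order IDs are hypothetical, the fixed seed is there only to make the sketch repeatable, and the comparison is written as a strict `<` so that a fraction of 0 selects nothing. The second function shows an alternative, drawing an exact count without replacement, for cases where the whole population is available at once.

```python
import random


def sample_by_fraction(records, fraction, rng):
    """Keep each record independently when its random draw falls below fraction."""
    # rng.random() returns a float in [0.0, 1.0); the strict comparison makes
    # fraction 0 select nothing and fraction 1 select every record.
    return [record for record in records if rng.random() < fraction]


def sample_fixed_count(records, count, rng):
    """Draw exactly count records without replacement."""
    return rng.sample(list(records), count)


rng = random.Random(2010)            # fixed seed so the sketch is repeatable
orders = list(range(1, 2201))        # hypothetical day of 2,200 order IDs

by_fraction = sample_by_fraction(orders, 300 / 2200, rng)
fixed = sample_fixed_count(orders, 300, rng)
print(len(by_fraction), "records selected by fraction;", len(fixed), "by fixed count")
```

Selection by fraction works record by record as data arrives, at the cost of a sample size that varies around the target; the fixed-count draw gives exactly 300 records but needs the full day's records in hand.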

- For originating application programs that create data into a relational database with trigger capability, set a random number generator routine that will calculate whether a record will be selected or not on record insertion.
- For paper documents where information is first created, count and number each paper document individually. Based on the number of documents in the total population, calculate your sample size. Develop a random number generation routine or use sampling software to calculate a number that will achieve your calculated sample size

P2.6.5: Isolate the sampled records in a controlled data store, or, for paper documents, isolate the paper record or a photocopied image of the source creation document.

- For electronic records, write them to the sample database before the record is confirmed and updates can be made.
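As a rough sketch of trigger-based sampling into an isolated store, the example below uses Python's built-in `sqlite3` module with an in-memory database. Everything here is an assumption for illustration: the table and trigger names are hypothetical, SQLite stands in for whatever relational database you use, and the read-only triggers are only one possible control (they do not stop a privileged user from dropping them).

```python
import sqlite3

SAMPLING_FRACTION = 0.1364                        # 300 of 2,200 records
threshold = round(SAMPLING_FRACTION * 10_000)     # compared against a draw in 0..9999

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
CREATE TABLE order_sample (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);

-- Copy a randomly chosen share of new rows into the sample store on insert.
CREATE TRIGGER sample_on_insert AFTER INSERT ON orders
WHEN abs(random() % 10000) < {threshold}
BEGIN
    INSERT INTO order_sample VALUES (NEW.order_id, NEW.customer, NEW.amount);
END;

-- Reject later changes to sampled rows.
CREATE TRIGGER sample_no_update BEFORE UPDATE ON order_sample
BEGIN
    SELECT RAISE(ABORT, 'sampled records are read-only');
END;
CREATE TRIGGER sample_no_delete BEFORE DELETE ON order_sample
BEGIN
    SELECT RAISE(ABORT, 'sampled records are read-only');
END;
""")

with conn:
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [(f"customer-{i}", 10.0 + i) for i in range(2200)],
    )

sampled = conn.execute("SELECT COUNT(*) FROM order_sample").fetchone()[0]
print(sampled, "of 2200 records written to the sample store")
```

Because the copy is made inside the insert itself, the sampled row reflects the record as it was first created, and later updates to `orders` do not touch it.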

For integrity of the assessment process, the chain of custody of the sampled records or paper documents must be controlled to prevent tampering with or altering them. For electronic records, you must prevent updates to any of the sampled records. For paper records, you must prevent alteration of the information collected.
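One possible control for electronic records, offered here as an assumption rather than a method prescribed by TIQM, is to record a cryptographic fingerprint of each sampled record at extraction time and recheck it before assessment. The sketch uses Python's standard `hashlib` and `json` modules; the record layout and function names are hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    """SHA-256 digest of a record serialized with sorted keys."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def altered_records(records, sealed):
    """Return the IDs whose current fingerprint differs from the sealed one."""
    return [r["order_id"] for r in records
            if fingerprint(r) != sealed[r["order_id"]]]


# Hypothetical sampled records, sealed at extraction time.
sample = [
    {"order_id": 1, "customer": "A. Smith", "amount": 120.50},
    {"order_id": 2, "customer": "B. Jones", "amount": 75.00},
]
sealed = {r["order_id"]: fingerprint(r) for r in sample}

sample[1]["amount"] = 57.00            # simulate an update after extraction
print("altered:", altered_records(sample, sealed))
```

This detects alteration rather than preventing it, so the sealed fingerprints themselves need to be stored where the people handling the sample cannot change them.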

P2.6.6: Document the controls in the chain of custody of the sampled records to be assessed that prevent alteration or updates to the data once it has been extracted for assessment.

For original electronically created data, you must assure that:

- The information producers are not aware of the timing of the sampling for IQ assessment. Knowing when sampling occurs would bias their information production activities.
- The sampling process assures a statistically random sample (each record has an equal likelihood of being selected) that yields at least the minimum number of records required, given the standard deviation of a quick sample or previous assessment sample.
- The process has not been modified in any way from the currently defined procedures, equipment and personnel.
- The records have not been altered since record creation.
- The records sampling process did not introduce errors into the sampled records.
- The data store into which the sampled electronic records are stored did not corrupt the data, such as by truncating data values, mapping data to incompatible data types (such as variable to fixed size or numeric to alphanumeric or scientific notation), having inconsistent format with the source database, or treating missing values differently.

For original paper documents, you must assure that:

- The records could not have been altered by someone since record creation. Erasures or strike-outs are evidence of alteration. Seek to find out the original values and reason for change. If that cannot be determined, reject the record.
- You treat erasures or strike-outs as corrections to an original value.
- The records sampling process did not introduce errors into the sampled records, such as pulling a paper record that was not selected by the random number generator.
- Any photocopying made of the actual documents did not introduce errors into the copy, such as missing parts of the data at the edges of the document or being too light, failing to reveal data or erasures, or too dark, obscuring the readability of the data.

NOTE

Without the assurance of sound statistical methods to sample data and without assurance of the integrity of the chain of custody of the sampled data, any assessment results can lose credibility.

If you follow these simple step-by-step guidelines, you will get the attention of your executive leadership team to begin the culture transformation to a high IQ enterprise that will become world class!

**Recent articles by Larry English**
