Print
Email
ADVERTISEMENT
Business Intelligence Resources
Enterprise Data Asset Inventory and Profiling
Understanding Duplicate Data
Published: July 13, 2006
This article explores the various concepts of data duplication and gives examples of their differences.

Reminder: In this article, when the term “duplicate” is used, the principles described apply to triplicate or even four or more copies of data.

Many Files, Common Format
There is a profound difference between data formats and data. The file description (FD) and the file are not the same thing, nor should they be confused (although they have been confused in the design of some CASE tools). However, many analysts of modest experience often confuse them. Such analysts (or programmers who received social promotions) think that if they know the names of the fields, they know what is in the data. They confuse the description for the content. Big mistake!

Many old COBOL systems (especially systems more than 20 years old) passed data through a cascading series of files, sometimes all with the same format. In Figure 1, we see a tape of new invoices coming into a simple process (job step) by which a week-to-date invoice history file is created. Last night’s week-to-date tape is merged with today’s activity to create a new week-to-date tape.

In the evolution of classic mainframe systems, when direct access storage devices (DASDs) became cheap enough, these tape processes were modified (only in the job control language, not in the program) to take in disk files and write out sequential disk files. The effect is still cascading data from job to job, step to step, file to file.

The point of Figure 1 is that all these files have the same format, but they are not the same data. Even if the merge job IV201 is merely a merge/sort utility, the output is not the same as the input. Hence, while we may see four identical file descriptions in the program (if not a vendor-supplied merge utility program), they should have different names, and they certainly have different scopes of data.

Figure 1: Daily Invoice Processing Merge Into Week-to-Date Invoices

So we can see that scope of a data file is a very important differentiating factor in its identity. Format is important, of course, but so is scope. The scope can be in time (daily, weekly, week-to-date, monthly, month-to-date, etc.), or it can be geographical or organizational (one division, all divisions). In one system I saw, five files came together from five different systems. They all had the same layout, but their content was quite different.

These legacy, cascading file applications are still around in many major corporation and government bureaucracies.

Parallel Application Situations
Many modern enterprises have distributed their processing onto multiple servers, often scattered around the organization. Frequently, each server has software identical to its peers. A good example of this is where a warehouse management system is installed in each of some regional warehouses.

We may reasonably expect (but we need to confirm by testing or observation) that the software and the database design for each installation is identical and of the same release. It might be prudent to ensure that no changes have been made to the operating system, application, DBMS or job control language of one installation and not the others.

Thus, in the situation shown in Figure 2, the software in warehouse management system in Denver is identical to the software in Chicago, etc.

Figure 2: Parallel Installations of a Warehouse Management System

However, the data may not be the same. Here, we may find two kinds of data files or tables: that data which is unique to the location, and that data (such as the product master file) which should probably be common across all installations.

In Figure 2, we illustrate what should be obvious – that the current inventory table will contain different records for each installation. (Note: If this is not the case, then the alternative design may be difficult to manage – namely, that the current inventory table on each installation contains all the inventory across all locations). That may not be a good design, if it is to be kept current, up-to-the-second, because a depletion of inventory in any one location (for example Denver) must be communicated to the current inventory tables in all the other locations. This can be done, but it is rather taxing for the network. However, in taking our data inventory, we want to know what the scope of each physical table is – global or local.

Some tables, generally those which describe more universal abstractions, may appropriately be maintained in a duplicate state across installations. For example, it would be a reasonable design to populate the product master file the same across all locations (as shown in Figure 3). Doing so allows a relatively simple update process – send the same update information to each installation.

Figure 3: Potentially Duplicate Tables Across Parallel Installations of Software

A maintenance programmer responsible for the inventory management system may not care about the difference in data among the various instances. For most programming issues, the file formats are all the same, and generally the data will behave in a similar manner.

For astute data management, however, one may wish to verify that peer tables (e.g., the current inventory tables across the various installations) do contain similar kinds of subject entities and similar subtypes.

If one contains an anomaly which is not found on the others, then this anomaly would have significance for programming and system maintenance. A programmer who made a change to the code might test on the data coming from Boston and not realize that Denver has a bunch of anomalous records which may cause abends when the new program code encounters them. This can happen when rushed programmers do not design and run enterprise-wide tests before putting the changes into production – but it is also an easy assumption to make (“If you have seen the Boston data, you have seen it all”) and an easy error to experience.

So we can see that the scope of data may or may not be unique; hence, our definition of duplicate data differs from duplicate file formats.

Duplicate Content, Different Format
The converse of this situation is when two files may be written from the same process and have exactly the same number of records covering the same scope of reality.

Figure 4: Column Subsets of Data Coming from Customer Master File

Today in an RDBMS environment, we would simply create a VIEW of the customer master table containing the subset of desired columns. But in some legacy mainframe situations, a variety of “derivative” files may have to be created for specific purposes or specific destinations.

Is this duplicate data? In one sense, yes it is. It may be appropriate to confirm that no logical editing of the output data is done (either truncated or Boolean manipulation) for the outputs. And we would want to know, and document, how the record count (i.e., business scope) may vary among the various derivative files.

Entity Life Cycle
Before we get into seeking duplicate records for a table, we need to understand what the life cycle is for a subject entity represented by a table. In general, there will probably be two major kinds of entities: slowly changing entities (people, departments, customers, products, etc.) and event-oriented entities (customer orders, G/L postings, invoices, shipments and other transactions). We want to know what the life cycle is for each.

Are the instances (in reality) represented by the table mutually exclusive (and obviously so?), and are the records in the file mutually exclusive?

Slowly Changing Entities
Many tables represent subject entities, which are relatively stable. Some attributes may change, from time to time, but the entity instances tend to have a long-term life. One obvious question is: are the instances of such entity clearly and obviously mutually exclusive?

A product of a manufacturing company may go through substantial changes over its life. It may be reformulated, redesigned, etc. Sometimes, when the product changes substantially, the retail system (i.e., distributors, retailers) requires that a new identifier be used (thus, requiring a new record in the product master file). But even when the UPC (universal product code) is not changed as the product “morphs,” the characteristics may change so substantially that several records describing the product must be maintained.

A big question is this: Is it important to retain the historical, critical characteristics of the entity? We can find many examples of this. We retain the address history of patients of a healthcare system. It may be important to know where they lived when they showed certain symptoms and received certain medical care. A manufacturer may need to retain the formulation (and also, as a result, the shipping weight) of a food product over time, especially if products containing the old formulation are still somewhere in inventory or still being used by customers.

Personnel Life Cycle
In a human resources example, we can ask: Can there be two records for one person? Could a former employee reapply years after he left a company, and, if he is hired, is he given the same employee number or a new one? This is a simple and typical example of a common business phenomenon. If a new record is created (often because of immature database design) because there can be only one START_DATE value, then (1) the record in that table represents an employment period, not just an employee, and (2) there could be triplicate or more records for a given employee.

Hence, analysis of the concept of “duplication” of data must consider business policy. And one may find that business policy was not thoroughly designed (or designed “on the fly” by a programmer when a glitch came up). A former employee returning to work again may be an anomalous event that original business policy (or application) designers did not anticipate.

How that anomalous event is handled by the application and its database should be examined. In the employee file example, we may ask if records are handled consistently across all such “returning” individuals.

Transactional Entities
Business events may occur at a precise point in time (in which case, the attributes of that event should be unambiguous portrayals of what happened). Or, events may really have a longer life cycle, during which changes to certain attributes might be made.

A customer order for products may be posted and then amended as the customer sees his needs changing. In some vendors’ business systems, the unit value on an order line item could be raised, or canceled, any time up to the moment the order is shipped. In other systems, because of the flow of the data through the system, the original order must be canceled or reversed and a totally new order posted.

When transaction-ish events can be restated after they are originally posted, then it becomes critical to understand the rules (coupled with the process design of the business application software) by which those events could be modified. In other words, we are back to understanding a life cycle. What is the complete life cycle…with all its alternate routes for anomalies?

Some legacy order-processing applications, for example, pass data from step to step (as did old tape systems through cascading files). When this happens, how is an order, once posted, later amended? Is a new physical record created to flow through the system? It may be marked as an amendment or restatement. Or, is a new record created and a reversal of the original order also created? Thus, three records could describe a single customer order. In this case, it is important to be able to distinguish between them.

Another important question to ask is this: Where an amended record is created, does it carry all the attributes of the original record or just the changed values (with the unchanged attributes left blank or null)?

In the following example, a field repair log has an original record with all the fields populated.

However, an amendment is later posted. “Oh, we actually arrived at that work site much earlier.”

In this case, an additional amendment record was created. However, the fields that did not change were not populated in the amendment record. This seems a strange practice, perhaps, but some systems were designed this way.

So this aspect of the entity life cycle has an impact upon how you view and understand the meaning of your data.

Thus, another way to ask this question is: Can a transaction be changed after it is created? If so, how does that happen? Are additional records created? How are they distinguished?

Who said that data analysis doesn’t consider process? We must understand business process.

Perfect Duplicate Records
There may be different meanings to duplicate records. We need to be precise in our language and discussion.

A fundamental question about any table is this: Is there only one record in the table for any instance in reality, or could there be more than one?

Duplicate Identifiers
There may be situations where a single identifier value may occur in multiple records, each describing a different instance in reality. The converse is that two physical records in a file describe a single instance in reality.

The first situation may be this: Are there multiple records for any key? The presence of multiple records for any key is quite easy to test. Here is a useful query in SQL:

…we can guess that no key value occurs in more than one record.

However, occasionally, we may see results that look like this:

Here, we see that three records in this file (table) have a common CUST_KEY value – namely 4127. This may come as a surprise to programmers who had assumed that CUST_KEY was unique.

Let us say, for the sake of argument, that we inspect these records with duplicate keys and find the following data behavior:

Then these appear to be truly duplicate records. If absolutely every field-pair has identical values in the two records, then eliminating one of them should be easy (although we may wish to know how the second record got onto the table in the first place).

That situation was easy. However, if a formatted dump displays these two records as follows…

…then we clearly see these are not duplicate records, although they have identical customer numbers. In this case, a single key value attempts to identify two different instances in the real world.

This may be an application fault or a database design fault. In either case, it should not be tolerated, and it can be costly to repair. One of the customers is going to have to experience a customer number change. That can be very complex. For example, my cell phone carrier has undergone several acquisitions, resulting in my having a really long customer number (needlessly long). This is, in part, the result of appending high-order digits to my original customer number to keep the merged customer list unique. Renumbering the customer base must have been daunting to even this sophisticated company.

Multiple Records for a Single Instance
More difficult to detect are pairs of records which describe the same instance in reality, but have different identifier values. An example appears below.

It appears we have two records for Mary Jones, each with a different customer number. How would you detect a situation like this? One way is to simply count the records doing a GROUP-BY and ORDER-BY on all the fields at first, and then remove one field at a time (starting with the CUST_NO). Then, from that, select only the records (or actually, the records containing common values in non-key fields) that have a count greater than one.

However, that technique would probably not detect the following “near-duplicate” situation:

In this scenario, one name has a middle initial in it; the other does not. And there has been a transposition in the last two digits of the zip code. So the GROUP_BY and ORDER_BY summation in SQL would probably not reveal this near-duplicate situation.

But this is a common situation and a common need – to de-dupe name and address data. In this situation, given the common address and identical date of birth, we can feel quite confident that these two records describe the same person. However, if there were subtle variations in spelling in other fields, our confidence in a match may decline.

There are a number of software products (and services) on the market which will detect and even fix these problems. You want to be sure what they mean by “fix” before purchasing them. The processes and criteria they use should not be obscure to you, the buyer. There should be no “magic” inside the box.

Unless there is a reasonable business reason for having both these records, then this data behavior probably reveals a problem in the application design. We must ask why the system did not detect the first record when the second record was created. 

However, there are legitimate business reasons for quasi-duplicate records. Here, in an employment master file, we find two records for the same individual.

In the business policy of this company, if a former employee returns to apply for employment, they must create a new record on the employee master file. In this design, a single record on this file does not represent an employee subject entity, but rather an “employee in employment episode” subject entity. Not elegant design, but it happens.

Lateral Duplication – Duplicate Attributes
After spending a lot of time thinking of the vertical duplication (i.e., records), we may now wish to think about lateral duplication of data attributes or columns. Assume a customer master file with the following layout:

A prudent initial survey of the contents of this file would search for any perfect duplication of the contents of any two fields. Some data quality tools on the market will do this automatically.

Let us say that as we do our survey of the observed domains of each of these fields (which would generally be done somewhat later in the flow of tasks in an enterprise data inventory), we might find the following distribution of values for INDUSTRY_CODE:

And, curiously enough, we found the following domain distribution for the field DIVISION_CODE (which represents a sales division).

Astute readers will notice that while the codes have different formats, they have identical distributions. A good, cynical analyst would then run a cross-tabulation between the two fields.

Here, we find a perfect correlation between two fields:

Whoa! While the formats and observed values found in these fields are different, they do have a perfect correlation. Do we call this duplicate data? Good question. We need to do some asking around and research to find the business reason for having these two fields.

A slightly different result actually happened at a large manufacturing company. The cross-tabulation between division code and industry code appeared as follows:

Yes, this actually happened. A first blush analysis would suggest that these two fields were perfectly correlated in a business sense, and the apparent diversity resulted from keying errors or clerical mistakes. That is a reasonable first assessment. However, we need to dig deeper and ask around the business community.

In this real situation, when carrying this data forward into the data warehouse, we had to carry both fields; because while they were very closely related, they had distinctly different business meaning for legal purposes.

Be Specific
When anyone talks about “duplicate data,” you should now have a sense of how ambiguous that expression is. We know that non-duplicate files can have identical formats. Even files with identical formats and labels can be different in timing or scope. We know that the concept of duplicate records may mean perfect duplicates or duplicate identifiers (describing different instances), or two records (with different identifiers) describing the same instance. We also know that data in different columns within a table could appear to be duplicate, but may indeed have a significant difference in business meaning.

So when someone complains about “duplicate data,” ask them to be specific. Some duplication is normal.


Recent articles by Michael Scofield

Michael Scofield -

Michael is a widely known speaker and author on data quality and semantic data integration. He has held data architecture and data quality management positions in banking, finance and education. He has taught workshops for numerous organizations including the information quality conferences, numerous DAMA chapters, The Data Warehousing Institute, the Institute of Internal Auditors, chapters of the Quality Assurance Association, the Enterprise Data Forum, European Meta-data Conferences, Association of Computing Machinery, numerous DBMS user groups, and business intelligence tool conferences. His articles appear in numerous professional journals, and he writes occasional humor for the Los Angeles Times and other magazines. Michael can be reached at NMScofield@aol.com.

 

showing all