Originally published February 6, 2008
The enormous interest in master data management (MDM) that has appeared in the past couple of years has not yet generated a great deal of methodological progress. Hopefully, as data professionals, consultants, and vendors grapple with the complex issues involved, the situation will improve. A central problem, however, is that there is little agreement about what master data is. It is usually defined by examples, like product, customer, or account, as if to say “I know it when I see it”. Alternatively, master data is defined by generalities: that it is simply highly shared data, or that it is data used by an application but not produced by that application.
Definitions do matter. They tell us something fundamental about what is being defined. In the case of master data, there is a special need for a greater understanding because MDM is still at an early level of maturity. For several years, I have been using an approach to categorizing data that provides a detailed definition of master data. I have found this approach useful in that it can be practically applied to master data management problems.
A fundamental question about data is whether it is homogeneous. In other words, are the boxes we see in a data model, or the tables contained in a physical database, all the same in terms of their properties, behaviors, and management needs as data? The fact that we are even talking about master data management indicates that there are qualitative differences among entities (at the logical level) or tables (at the physical level). There is, in fact, strong evidence that we can categorize data within a taxonomy that recognizes the different roles that data plays in the operational transactions of the enterprise.
Figure 1 shows a taxonomy that separates data according to its management needs, viewed from the perspective of how the data is used in operational transactions. It divides data into six distinct categories.
Figure 1: The Six Layers of Data
The first category of data in this scheme is metadata. What is meant by this is the metadata that truly describes data. For a logical data model, this will be the descriptive information about entities, attributes, and relationships. For a physically implemented database, this will be information about tables and columns. The latter is found in the system catalog of a database, but it is increasingly being materialized as tables in databases too.
Metadata, as the term is used here, is important because it has semantic content that needs to be managed. Tables and columns have meanings. The metadata has to be ready before a database can be implemented and should remain unchanged for the lifespan of the database. If it has to change, there is likely to be significant impact. For instance, if the length of Customer Last Name has to be increased from Char(20) to Char(40), then many programs, screens, and reports will be affected.
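As a small illustration of metadata being held in a catalog and read like any other data, the sketch below uses Python's built-in sqlite3 module. The customer table and its columns are hypothetical, and SQLite's PRAGMA table_info is only one example of a catalog interface; other databases expose this information differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table, created only so that there is some metadata to inspect.
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name CHAR(20))"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
column_metadata = {
    row[1]: row[2] for row in conn.execute("PRAGMA table_info(customer)")
}

for name, declared_type in column_metadata.items():
    print(name, declared_type)
```

Any program, screen, or report that assumes the declared length of last_name depends on this metadata, which is why a change to it has such a wide impact.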
Below metadata in the hierarchy shown in Figure 1 is reference data. “Reference data” is used to mean many things today, but in the sense used here, it describes what are usually termed “code tables”. These are also called “lookup tables” and “domain values”. Reference data tables usually consist of a code column and a description column. Typically, these tables have just a few rows in them. In general, the data in these tables changes infrequently. Because of this apparent structural simplicity, low volume, and slow rate of change, these tables get very little respect. However, they can represent anywhere from 20% to 50% of the tables in an implemented database. Also, although they receive little attention, IT professionals fear changing the values in them.
Reference data tables share something with metadata – their physical values have semantic content. For instance, a customer preferred status of “bronze” may mean that a customer with this status has 30 days to pay their bills and can only be extended $1,000 of credit. No other kind of data in a database has this property. The semantic property is why this data is used to drive business rules. If business rule logic refers to actual data values, it is a near certainty that these values will come from reference data tables. Reference data can be defined as follows:
Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.
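The way reference data values drive business rules can be sketched as follows. This is a minimal illustration, not a recommended design: the code "BRZ", the table names, and the function are assumptions, and only the bronze terms (30 days to pay, $1,000 of credit) come from the example above.

```python
# Code table: code -> description. The code "BRZ" is a hypothetical value.
CUSTOMER_PREFERRED_STATUS = {"BRZ": "bronze"}

# Business rule keyed directly on the code value, using the terms
# from the bronze example in the text.
CREDIT_TERMS_BY_STATUS = {
    "BRZ": {"payment_days": 30, "credit_limit": 1000},
}


def credit_terms(status_code):
    """Return the credit terms that apply to a customer preferred status code."""
    if status_code not in CUSTOMER_PREFERRED_STATUS:
        raise ValueError(f"unknown status code: {status_code!r}")
    return CREDIT_TERMS_BY_STATUS[status_code]
```

Because the rule is keyed on the code value itself, changing or reusing that value silently changes business behavior, which helps explain why IT professionals are wary of touching these tables.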
Next in the hierarchy of Figure 1 is enterprise structure data. This is data that allows us to report business activity by business responsibility. Examples are Chart of Accounts and Organization Structure. One of the main issues with this kind of data is managing hierarchies, which may be incomplete or “ragged”. Additionally, this category of data is notoriously difficult to manage when it comes to change. For instance, a product line may be reassigned from one line of business to another. Inevitably, historical reports have to be produced from the perspective of the product line being the responsibility of either line of business. One example would be the need to see the performance of the recently assigned line of business as if it had been responsible for the product line for the past 5 years.
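The reassignment problem can be sketched in a few lines. All names and figures below are invented for illustration: a product line called Widgets moves from a Consumer line of business to an Industrial one in 2007, and the same sales are reported either as responsibility stood at the time of each sale or restated as if the structure of a chosen year had always applied.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, product_line, revenue)
sales = [
    (2006, "Widgets", 100),
    (2007, "Widgets", 120),
    (2007, "Gadgets", 50),
]

# Hypothetical assignment history:
# product_line -> [(effective_from_year, line_of_business), ...]
assignments = {
    "Widgets": [(2000, "Consumer"), (2007, "Industrial")],
    "Gadgets": [(2000, "Consumer")],
}


def owner_in(product_line, year):
    """Line of business responsible for a product line in a given year."""
    owner = None
    for effective_from, line_of_business in sorted(assignments[product_line]):
        if effective_from <= year:
            owner = line_of_business
    return owner


def revenue_by_line_of_business(as_of_year=None):
    """Total revenue per line of business.

    With as_of_year=None, each sale is credited to whoever was responsible
    when it happened. Otherwise all history is restated using the structure
    in force in as_of_year.
    """
    totals = defaultdict(int)
    for year, product_line, revenue in sales:
        structure_year = year if as_of_year is None else as_of_year
        totals[owner_in(product_line, structure_year)] += revenue
    return dict(totals)
```

Calling revenue_by_line_of_business() gives the historical view, while revenue_by_line_of_business(2007) credits all Widgets revenue to Industrial, as if it had always been responsible. Both views are only possible because the assignment history is kept rather than overwritten.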
Operational transactions always have parties to them. These are the things that have to be present for a transaction to occur, and are represented in Figure 1 by the transaction structure data layer. The most common entities given as examples of this category of data are product and customer. It can be defined as follows:
Transaction structure data is data that represents the direct participants in a transaction, and which must be present before a transaction executes.
Thus, we have to know something about a product and a customer before we can actually sell the product to the customer.
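One common way to enforce this precondition is with foreign keys, sketched here using Python's built-in sqlite3 module. The table and column names are hypothetical, and this is only one possible enforcement mechanism; the point is that the sale cannot be recorded until its customer and product exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        product_id  INTEGER NOT NULL REFERENCES product (product_id)
    );
    """
)

# Hypothetical transaction structure data, loaded before any sale happens.
conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.execute("INSERT INTO product VALUES (10, 'Widget')")
conn.commit()


def record_sale(customer_id, product_id):
    """Try to record a sale; return False if a participant is unknown."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sale (customer_id, product_id) VALUES (?, ?)",
                (customer_id, product_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Here record_sale(1, 10) succeeds, while record_sale(2, 10) is rejected because customer 2 has not been set up.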
Transaction structure data typically consists of entities with large numbers of attributes, which makes them very easy to spot in data models. This class of data inevitably has problems of identity management. This is easy to appreciate for customers, whose names may be captured incorrectly or may change. Yet even products can change their identifiers as they pass through their life cycle or are rebranded. Standardization of identity is extremely difficult to achieve for this class of data, even though many initiatives have attempted it.
Another characteristic of transaction structure data is the fact that it is usually implemented as single tables that contain hidden subtypes. Certain columns in a product table, for example, will only apply to certain kinds of products, or to products at a certain point in their life cycle, or to some kind of externally imposed grouping such as dangerous products. Sorting out what columns are relevant to a particular product record is a difficult and frequently neglected MDM challenge.
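One way to make hidden subtypes explicit is to record which columns apply to which kind of product, then check records against that mapping. The sketch below assumes a product_type column and invents the subtype and column names; real product tables rarely carry such a clean discriminator, which is part of the difficulty.

```python
# Columns that apply to every product (hypothetical names).
COMMON_COLUMNS = {"product_id", "name", "product_type"}

# Columns that apply only to particular subtypes (hypothetical names).
SUBTYPE_COLUMNS = {
    "dangerous": {"hazard_class", "handling_instructions"},
    "perishable": {"shelf_life_days"},
}


def relevant_columns(record):
    """Columns that are meaningful for this record's subtype."""
    return COMMON_COLUMNS | SUBTYPE_COLUMNS.get(record["product_type"], set())


def misplaced_values(record):
    """Columns holding a value even though they do not apply to the subtype."""
    allowed = relevant_columns(record)
    return sorted(
        column
        for column, value in record.items()
        if value is not None and column not in allowed
    )
```

For a dangerous product that also carries a shelf_life_days value, misplaced_values would report that column as one to investigate.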
Transaction activity data, the fifth layer in Figure 1, is the normal “event” data that we see in operational transactions in an enterprise. It has been the focus of IT from the early days of automation. Transaction audit data, the final layer in Figure 1, tracks the state changes in transaction activity data. It is what is usually found in transaction logs, although this kind of table is also frequently seen in databases.
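The relationship between these two layers can be sketched as follows. The Order class and its status values are hypothetical: the order itself stands for transaction activity data, and the trail of state changes attached to it stands for transaction audit data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Order:
    """Transaction activity data: a single hypothetical order."""

    order_id: int
    status: str = "new"
    # Transaction audit data: one entry per state change,
    # stored as (changed_at, old_status, new_status).
    audit_trail: list = field(default_factory=list)

    def change_status(self, new_status):
        """Record the state change in the audit trail, then apply it."""
        self.audit_trail.append(
            (datetime.now(timezone.utc), self.status, new_status)
        )
        self.status = new_status
```

The order holds only its current state, while the audit trail preserves how it got there.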
At this point, a definition of master data can be provided. It is the aggregation of reference data, enterprise structure data, and transaction structure data. As has been shown, each of these is rather different in its properties, behaviors and management needs. However, they do form a group that is distinct from the other three layers in Figure 1.
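This definition can be written down compactly. The sketch below encodes the six layers in the order they are described for Figure 1 and marks the three that make up master data; the Python names are illustrative choices, not established terminology.

```python
from enum import Enum


class DataLayer(Enum):
    """The six layers, numbered in the order the article describes them."""

    METADATA = 1
    REFERENCE = 2
    ENTERPRISE_STRUCTURE = 3
    TRANSACTION_STRUCTURE = 4
    TRANSACTION_ACTIVITY = 5
    TRANSACTION_AUDIT = 6


# Master data is the aggregation of these three layers.
MASTER_DATA_LAYERS = frozenset(
    {
        DataLayer.REFERENCE,
        DataLayer.ENTERPRISE_STRUCTURE,
        DataLayer.TRANSACTION_STRUCTURE,
    }
)


def is_master_data(layer):
    """True if the given layer belongs to master data."""
    return layer in MASTER_DATA_LAYERS
```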
Accepting that there are different kinds of data with different management needs is important. It means that “one-size-fits-all” approaches to MDM are likely to be unsatisfactory. It also means that the perspective that there is nothing special about master data, and that MDM is just the application of the same old data management techniques, is wrong. Both the “one-size-fits-all” and the “same-old, same-old” views still enjoy considerable acceptance. This is true even among MDM vendors and consultants, although, for obvious reasons, they tend to only express these views in private.
What the taxonomy in Figure 1 shows is that there really are different categories of data, and that it really does make sense to think of master data as different from other kinds of data and as having specific management requirements. The case for MDM is thus a genuine one.
Recent articles by Malcolm Chisholm