Originally published May 22, 2006
Recently I participated in a tutorial about master data management (MDM); my contribution focused on the behavioral and management aspects of the data governance associated with MDM. As the sheer volume of collateral material and sales pitches surrounding the concept shows, MDM is a big deal these days, and it is widely expected to get bigger, with projected spending in the billions of dollars within the next few years.
What Constitutes Master Data?
At the MDM tutorial, there were many good questions, although one that I attempted to navigate around was “What constitutes master data?” This is actually an extremely good question, and my hesitance in answering it is due in part to my belief that while there may be some object categories that seem to fall naturally into the realm of “master data,” it is not entirely clear that those categories are always relevant for any specific organization. My technique for skirting the question typically involves mumbling the phrase “…things in the enterprise that we care about…,” but this is sort of a wimpy way of ignoring the harder issue of defining the characteristics of a master data set.
I have done some searching on the Web, looking for definitions, and have found that definitions of master data are generally based on examples, such as “master data is a set of data elements along with their associated attributes, such as customer, product, employee, vendor, etc.” This may get the message across, but as a consultant in the area of standards definition, I find that both the inexactness of the various definitions and the variance among them leave me a bit unsettled, especially because a large part of master data management involves collaboration to reach agreement on semantics.
Characteristics of Master Data
So let’s think a little about the features or characteristics of master data. First, the need for a master copy indicates that there may be copies of the same or similar data objects used in contexts where a lack of synchrony between copies leads to inconsistency across the applications that depend on those copies. Second, the desire to subject the data to management (especially in the MDM sense of the word) indicates a willingness of the stakeholders to collaborate on centralized governance over the master copies. Third, master data objects are both the subject of transactions (as part of operational systems) and of analysis (as part of analytical systems). Fourth, the concept of “master” implies that all application uses are subsidiary to a single core repository (although I am not yet ready to commit to a single physical copy, since a virtual approach might still satisfy the coordination requirements). Fifth, a master object can be assigned a unique identifier within the enterprise.
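A small sketch may make the first and fifth characteristics concrete. The Python snippet below is purely illustrative (the class name, the matching key, and the attributes are my own inventions, not anything prescribed by an MDM product): duplicate copies of the same entity, submitted by different applications, resolve to a single governed master record with one enterprise-unique identifier.

```python
# Illustrative sketch of a master registry: each core entity gets one
# enterprise-unique identifier, and application copies resolve to it.
import itertools


class MasterRegistry:
    """Maps a matching key (e.g., normalized name + postal code) to one master ID."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._by_key = {}   # matching key -> master ID
        self._records = {}  # master ID -> governed master copy

    def register(self, key, attributes):
        # If two applications submit the same entity, both receive the same
        # identifier: one synchronized master copy, one unique ID.
        if key not in self._by_key:
            master_id = next(self._ids)
            self._by_key[key] = master_id
            self._records[master_id] = dict(attributes)
        return self._by_key[key]


# Two applications referring to the same customer resolve to one master ID.
registry = MasterRegistry()
id_crm = registry.register(("jane doe", "10001"), {"name": "Jane Doe"})
id_billing = registry.register(("jane doe", "10001"), {"name": "J. Doe"})
assert id_crm == id_billing
```

In a real implementation the matching key would come from record-linkage logic rather than a literal tuple, but the essential point survives: applications hold references, while the master copy is governed in one place.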
Master Data Definition
Here is an adaptation of a definition I used in a recent white paper: master data sets are synchronized copies of core business entities, along with their associated metadata, attributes, definitions, roles, connections and taxonomies, used in transactional or analytical applications across the organization and subjected to enterprise governance policies. This covers all the traditional master data sets: customers, products, employees, vendors, parts, policies and activities. It also extends the realm of possibility to incorporate data sets that might not fit the standard mold.
For example, we consider master objects to be those that are the subjects of transactions or analysis. However, transactions themselves can be the subjects of transactions and analysis; transactions may be composed into workflows, which, in turn, are also subject to transactions and analysis. Therefore, transactions could be represented as master objects, allocated unique system identifiers, and then subjected to various productivity and performance analyses.
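The idea above can be sketched in a few lines. In this toy example (the function name, transaction kinds, and timing field are all assumptions made for illustration), each transaction receives a unique system identifier, and the workflow composed from those transactions becomes the subject of a simple performance analysis:

```python
# Illustrative sketch: transactions treated as master objects, each with a
# unique system identifier, composed into a workflow that is itself analyzed.
import itertools

_tx_ids = itertools.count(1000)


def new_transaction(kind, elapsed_ms):
    # Each transaction gets a unique system identifier at creation time.
    return {"tx_id": next(_tx_ids), "kind": kind, "elapsed_ms": elapsed_ms}


workflow = [
    new_transaction("create_order", 12),
    new_transaction("reserve_stock", 40),
    new_transaction("charge_card", 85),
]

# Performance analysis over the workflow: total time and slowest step.
total_ms = sum(t["elapsed_ms"] for t in workflow)
slowest = max(workflow, key=lambda t: t["elapsed_ms"])
print(total_ms, slowest["kind"])  # prints: 137 charge_card
```

Once transactions carry stable identifiers, any downstream system can refer to them the same way it refers to customers or products, which is exactly what makes them candidates for master treatment.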
This discussion reminds me of some conversations I had about 10 years ago with a colleague, with whom I was working on what today would be called a customer data integration project (then we called it “accounts renovation”). While we were exploring alternate approaches to data representation, we came up with a bizarre idea: assign a unique integer to every single object used in all the enterprise applications, and replace every reference to that object with its unique ID. Then, create a single thin reference table consisting of a simple mapping: unique identifier to description. The intention was to reduce the storage requirements for every managed table, since each record would be transformed into a collection of integers. Adding a timestamp to each record would ensure that every record was unique as well, so that records themselves could be added to that single thin reference table. Ultimately, all the data in the enterprise could be reduced to entries in that very long reference table, which would also take on some interesting mathematical properties based on set and category theory.
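A toy version of that scheme can be written in a few lines of Python; the names are mine and the example is deliberately simplified. Every distinct value is interned once into the thin reference table, and each record becomes nothing more than a tuple of integer IDs:

```python
# Toy version of the "assign every distinct value a unique integer" idea:
# one thin reference table maps integer IDs to descriptions, and every
# record becomes a tuple of integers.
ref_table = {}  # unique ID -> description
lookup = {}     # description -> unique ID


def intern_value(value):
    # Assign a new ID the first time a value is seen; reuse it afterward.
    if value not in lookup:
        new_id = len(ref_table) + 1
        lookup[value] = new_id
        ref_table[new_id] = value
    return lookup[value]


def encode_record(*values):
    # Every field reference is replaced by its unique integer ID.
    return tuple(intern_value(v) for v in values)


# The timestamp field makes each record unique, as described above.
r1 = encode_record("Jane Doe", "Checking", "2006-05-22T09:00:00")
r2 = encode_record("John Roe", "Checking", "2006-05-22T09:01:00")
# "Checking" is stored once in the reference table and shared by both records.
assert r1[1] == r2[1]
```

This is essentially dictionary encoding: repeated values are stored once and referenced by integer, which is where the storage savings come from.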
The problem with this approach? Two things: first, this change would have impacted all of the bank’s production application code, which would have been a logistical and political nightmare. Second, the flattening of the data would have diffused the object semantics – by eliminating the model, you lose the inheritance characteristics that naturally fall out of a relational design. However, the benefits of the approach are probably achievable today through a well-architected MDM implementation, without those two unbearable drawbacks.
Is the definition I provided complete? I doubt it, and I look forward to suggestions for improving it!