Using Data Quality Tools for Identification of Master Data

Originally published October 23, 2006

What makes data “master data”? In a recent article, I defined master data as the “data objects that are core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.” Master data objects are those “things” that we care about – the things that are logged in our transaction systems, measured and reported on in our reporting systems, and analyzed in our analytical systems. Examples include customer, vendor, product, policy, etc.

While we seem to intuitively understand that master data objects exist across the enterprise, there are two challenges when actually implementing a master data management program. The first is discovery: which objects in our organization qualify as master data? The second is assessment: how do we track down and standardize both the model for and the content of a master data collection? These challenges may seem relatively benign, but when each of your organization’s lines of business has its own set of tables and applications, the ability to distinguish master data starts to look like a daunting task.

In this article, we will look at discovery. One approach uses data quality tools as the utilities by which master data is both discovered and standardized. For example, combining data profiling techniques with manual assessment of the results will jump-start the isolation of master data sources.

Aside from the aforementioned definition, master data objects share certain characteristics:

  • They are referenced in both transaction and analytic system records.

  • They may require specialized application functions to create new instances.

  • They are likely to have models reflected across multiple applications.

  • They may be embedded in legacy data structure models.

That being said, one method is to completely document the metadata associated with all data objects across the enterprise and use that information to guide analysts seeking out master data. We can use data profiling as the tool of choice because of its ability to apply both statistical and analytical algorithms for characterizing data sets and driving the exposure of embedded data models and dependencies.
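
To give a flavor of the statistical side of that characterization, the following minimal sketch (in Python, with hypothetical column values) computes a few of the column-level statistics – counts, null rates, cardinality and value patterns – that a profiling tool would gather on a much larger scale.

```python
from collections import Counter
import re

def profile_column(values):
    """Compute simple profiling statistics for one column's value set."""
    non_null = [v for v in values if v not in (None, "")]
    # Reduce each value to a pattern: digits -> 9, letters -> A
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in non_null
    )
    return {
        "count": len(values),
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "cardinality": len(set(non_null)),
        "top_patterns": patterns.most_common(3),
    }

# Hypothetical identifier column drawn from one application's table
print(profile_column(["B-1001", "B-1002", "B-1003", None, "B-1005"]))
```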

Our first step is to characterize the data set associated with each column in each table. At a conceptual level, we would assign one of these designations to each value set:

  1. Boolean or Flag – there are only two valid values, one representing true and one representing false.

  2. Time/Date Stamp – a value that represents a point in time.

  3. Magnitude – a numeric value on a continuous range, such as a quantity or an amount.

  4. Code Enumeration – a small set of values, either used directly (e.g., using the colors red and blue) or mapped as a numeric enumeration (e.g., 1 = red, 2 = blue).

  5. Handle – a character string with limited duplication across the set that may be used as part of an object description (e.g., name or address_line_1 fields contain handle information).

  6. Cross-reference – an identifier that either is uniquely assigned to the record or provides a reference to that identifier in another data set.
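
A minimal sketch of how these six designations might be assigned automatically appears below; the date formats, thresholds and sample values are illustrative assumptions, and a real profiling tool would weigh many more signals before settling on a classification.

```python
from datetime import datetime

def classify_value_set(values):
    """Assign a rough designation (Boolean/Flag, Time/Date Stamp, Magnitude,
    Code Enumeration, Handle, or Cross-reference) to one column's value set."""
    vals = [str(v).strip() for v in values if v not in (None, "")]
    distinct = set(vals)

    if len(distinct) <= 2:
        return "Boolean/Flag"

    def is_date(s):
        # Only two illustrative formats; a real tool would recognize many more
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                datetime.strptime(s, fmt)
                return True
            except ValueError:
                pass
        return False

    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            return False

    if all(is_date(v) for v in vals):
        return "Time/Date Stamp"

    if all(is_number(v) for v in vals):
        # A small numeric domain suggests an enumeration; a wide one, a magnitude
        return "Code Enumeration" if len(distinct) <= 20 else "Magnitude"

    if len(distinct) / len(vals) > 0.95:
        return "Cross-reference"       # nearly unique identifiers
    if len(distinct) <= 20:
        return "Code Enumeration"      # small value domain (e.g., red/blue codes)
    return "Handle"                    # descriptive strings with limited duplication

print(classify_value_set(["red", "blue", "red", "green"]))              # Code Enumeration
print(classify_value_set(["2006-10-23", "2006-10-24", "2006-10-25"]))   # Time/Date Stamp
```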

Next, we would construct a model of each table based on the characteristics of its contained columns and distinguish between tables that capture transaction information and those that capture reference information. For those tables that capture transaction information, we would then want to determine if reference data is embedded within the model. While the records in transaction tables are likely to contain date/timestamps, quantities and magnitudes along with the other classifications, the reference tables are more likely to look like catalog entries – descriptive information captured within handle attributes, with some attributes containing the other data types, but often accompanied by identifiers that are unique across the data set. By characterizing and accumulating the reference data sets, we can suggest to the analysts where to start reviewing how those data sets reflect the master data characteristics.
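
Under the assumption that each column has already received one of the designations above, the transaction-versus-reference distinction could be roughed out as follows; the scoring heuristic is purely illustrative.

```python
def classify_table(column_designations):
    """Guess whether a table looks transactional or reference-like, given a
    mapping of column name -> designation produced in the previous step."""
    counts = {}
    for designation in column_designations.values():
        counts[designation] = counts.get(designation, 0) + 1

    # Transaction tables tend to be dominated by time stamps and magnitudes;
    # reference tables by handles, enumerations and unique identifiers.
    transaction_score = counts.get("Time/Date Stamp", 0) + counts.get("Magnitude", 0)
    reference_score = (counts.get("Handle", 0)
                       + counts.get("Code Enumeration", 0)
                       + counts.get("Cross-reference", 0))
    return "transaction" if transaction_score > reference_score else "reference"

# Hypothetical customer-style table: mostly handles plus a unique identifier
print(classify_table({
    "CUSTNUM": "Cross-reference",
    "FIRSTNAME": "Handle",
    "LASTNAME": "Handle",
    "STATE": "Code Enumeration",
}))  # -> reference
```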

Also, consider the column names. In any environment, there is a strong likelihood that similar objects will have similar names. Think about it: there are always column names such as ACCOUNT, ACCT, ACCOUNT_NUM, or ACCT_NUM to represent account numbers. Even if the names are not exact, we can employ pattern matching, standardization and linkage tools to cluster attribute names together. In this case, they might all represent an account number. If we can identify columns with similar names, we can then review the values embedded within and use the cross-column analysis functionality of the data profiling tool to determine if the value sets have a large degree of overlap. If so, we can propose the two columns as representing the same master data attribute, and we can assign a standardized name to the attribute.
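
As a sketch of that name-clustering and value-overlap check – substituting simple standard-library matching for commercial standardization and linkage tools – something like the following could flag candidate pairs; the similarity and overlap thresholds shown are assumptions to be tuned.

```python
from difflib import SequenceMatcher

def standardize(name):
    # Very light standardization; real tools apply far richer pattern matching
    return name.upper().replace("_", "").replace("NUMBER", "NUM")

def name_similarity(a, b):
    """Crude attribute-name similarity after standardization."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()

def value_overlap(values_a, values_b):
    """Jaccard overlap between two columns' distinct value sets."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical columns from two different applications
col_a = ("ACCT_NUM", ["1001", "1002", "1003"])
col_b = ("ACCOUNT_NUM", ["1002", "1003", "1004"])

if name_similarity(col_a[0], col_b[0]) > 0.7 and value_overlap(col_a[1], col_b[1]) > 0.3:
    print("Candidate: both columns may represent the same master data attribute")
```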

Another aspect to review is the metadata shared between tables. Two tables that represent similar data will share many of the same attributes, so the next step is to look for tables that share similar attributes. We can again apply our data quality tools by creating a signature for each table that consists of its name and an alphabetized list of its assigned attribute names. Our search and match tools can then be used to provide similarity scores between pairs of tables, providing some suggestions as to where there is content overlap between data sets. For example, a sales database with a table named CUSTOMER with columns named CUSTNUM, FIRSTNAME, LASTNAME, TELEPHONE, STREET, CITY, STATE, and ZIP will have a high degree of similarity to an accounts database with a table named ACCOUNT with columns named CUSTOMER_NUM, FIRST_NAME, LAST_NAME, TEL, ADDRESS1, CITY, STATE, and ZIP.
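
The signature comparison can be prototyped in the same spirit, again with standard-library matching standing in for a dedicated search-and-match tool; the 0.6 review threshold is an assumed cutoff, not a recommendation.

```python
from difflib import SequenceMatcher

def table_signature(table_name, column_names):
    """Signature = table name plus an alphabetized list of its column names."""
    return table_name.upper() + "|" + ",".join(sorted(c.upper() for c in column_names))

def signature_similarity(sig_a, sig_b):
    return SequenceMatcher(None, sig_a, sig_b).ratio()

customer = table_signature("CUSTOMER",
    ["CUSTNUM", "FIRSTNAME", "LASTNAME", "TELEPHONE", "STREET", "CITY", "STATE", "ZIP"])
account = table_signature("ACCOUNT",
    ["CUSTOMER_NUM", "FIRST_NAME", "LAST_NAME", "TEL", "ADDRESS1", "CITY", "STATE", "ZIP"])

score = signature_similarity(customer, account)
print(f"similarity = {score:.2f}")
if score > 0.6:  # assumed cutoff for flagging a pair to the analysts
    print("Candidate pair for analyst review: CUSTOMER vs. ACCOUNT")
```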

These are just a few techniques that can be applied to accumulate intelligence about potential master data sets across the enterprise. Even though master data discovery is ultimately performed by analysts, data quality tools can gather the metadata and analytical hints that make discovering master data objects far more tractable.
