Data Governance and Data Architecture: Initial Thoughts
Originally published June 24, 2010
Over the past few years, we have been evolving a capability model for best practices in data management as a way to assess organizational preparedness for enterprise data initiatives such as ERP or master data management. The desire for a more formal review of corporate maturity is often motivated by one or more failed attempts at data integration and consolidation, and it is driven by challenges associated with two core concepts: structure and meaning.
These issues often emerge as a by-product of the absence of historical standards and a lack of oversight over the ways that different stakeholders model their core data concepts. As an example, consider these five situations:
- Structural differences at the data element level – This occurs when the same conceptual data element is represented using variant data element lengths and data types. An example is a conceptual data element for a postal code, with the following structural variations:
  - One representation presumes only storing 5-digit US postal codes, using a length 5 character string;
  - One representation presumes a US ZIP+4, allocating a length 9 character string;
  - One representation presumes a US ZIP+4, instead allocating a length 10 character string that holds a hyphen to separate the first 5 digits from the last 4 digits;
  - One representation allows US, Canadian, and UK postal codes, using a length-6 character string;
  - One representation is not constrained, and uses a length 10 character string with no format constraints.
- Structural differences at the entity level – This happens when the models used for different entity representations require different data element attribution. An example is two applications that both represent a customer, but one model contains data elements for first name, last name, and telephone number, while the other contains data elements for first name, middle name, last name, street, city, state, postal code, and telephone number. There may be overlap between the two representations, but the number of data elements differs. In some situations, the data elements may map at a conceptual level yet disagree at the data element level, such as when the same data element concepts are represented using different sizes and types.
- Structural differences at the relationship level – In this situation, the same business relationships are modeled differently. As an example, in one customer data set there are tables for parties, addresses, and telephone numbers, linked together via foreign keys, while another customer data set models the same information compressed into a party table mapped via a foreign key to a contact mechanism table.
- Semantic differences at the data element level – Here the same names are used for data elements but actually mean different things. As an example, in some government databases, state refers to the fifty states of the United States, while in others state refers to the fifty states as well as territories and administered areas such as Puerto Rico, Guam, US Virgin Islands, and American Samoa.
- Semantic differences at the entity level – In this situation, similar entity concepts actually have variant meanings. For an example of this, ask around your company for a definition of “customer.”
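To make the postal code example concrete, here is a minimal Python sketch of reconciling the three US-only representations into a single form. The function name `normalize_us_postal_code` and the choice of the hyphenated ZIP+4 as the canonical form are assumptions made for illustration, not part of any standard discussed here.

```python
import re

# Patterns for the three US-only representations described in the list above.
_ZIP5 = re.compile(r"[0-9]{5}")                 # length 5 string
_ZIP9 = re.compile(r"[0-9]{9}")                 # length 9 string, no hyphen
_ZIP9_HYPHEN = re.compile(r"[0-9]{5}-[0-9]{4}")  # length 10 string with hyphen


def normalize_us_postal_code(raw):
    """Return a canonical US postal code ("12345" or "12345-6789").

    Hypothetical helper: returns None for anything it does not recognize,
    rather than truncating or padding the value to make it fit.
    """
    value = raw.strip()
    if _ZIP5.fullmatch(value):
        return value
    if _ZIP9.fullmatch(value):
        # Insert the hyphen that the length 9 representation omits.
        return value[:5] + "-" + value[5:]
    if _ZIP9_HYPHEN.fullmatch(value):
        return value
    return None
```

Returning `None` for unrecognized values, such as a Canadian code stored in the unconstrained representation, is a deliberate choice in this sketch: it surfaces the variance for review instead of silently forcing the value into the target format.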
When the data sets are used solely for the original purpose for which they were designed, these types of variances are largely irrelevant. Because the data sets were mostly developed to support specific transactional or operational needs, they are engineered to satisfy the immediate requirements without any consideration for longer-term downstream consumption. However, once the data sets are under consideration for centralization, even slight structural and semantic variances can inadvertently wreak havoc for downstream consumers, especially after a series of data transformations is applied to force data sets to merge into a target representation.
So today, with growing interest in enterprise-wide data reuse, siloed modeling and metadata management cannot be performed in a vacuum. Rather, some oversight at the organizational level must be imposed to establish standard practices for enterprise data design, sharing, and reuse. This suggests the need for specific policies for data governance associated with different aspects of data architecture, with the intention of establishing a high level of maturity and capability.
As just one example, consider establishing enterprise-wide metadata management. When the data management practitioners within the organization understand the ramifications of slight variations, they strive to attain a high level of “metadata maturity.” This means that a metadata management strategy is clearly defined and communicated to all developers and consumers, and there are centralized tools and techniques integrated as part of the enterprise development framework. A single metadata repository is used to document data element concepts, their instantiations, any structural variances, and where the conceptual data elements are touched across multiple business applications, providing a means for analyzing the impact of adjustments to any underlying or dependent data element definitions.
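As a rough illustration of what such a repository records, the following Python sketch maps each conceptual data element to its instantiations and answers two questions: which structural variants exist, and which applications would be touched by a change. The class names, fields, and sample applications are all hypothetical; a real metadata repository would carry far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instantiation:
    """One physical representation of a conceptual data element (hypothetical)."""
    application: str
    column: str
    data_type: str
    length: int


class MetadataRepository:
    """Toy in-memory registry of data element concepts and their instantiations."""

    def __init__(self):
        self._by_concept = {}

    def register(self, concept, instantiation):
        self._by_concept.setdefault(concept, []).append(instantiation)

    def structural_variants(self, concept):
        """Distinct (data_type, length) pairs in use for a concept."""
        items = self._by_concept.get(concept, [])
        return sorted({(i.data_type, i.length) for i in items})

    def impacted_applications(self, concept):
        """Applications that touch a concept, for impact analysis."""
        items = self._by_concept.get(concept, [])
        return sorted({i.application for i in items})


repo = MetadataRepository()
repo.register("postal_code", Instantiation("billing", "ZIP", "CHAR", 5))
repo.register("postal_code", Instantiation("crm", "POSTCODE", "VARCHAR", 10))
repo.register("postal_code", Instantiation("shipping", "ZIP_CD", "CHAR", 5))

print(repo.structural_variants("postal_code"))
print(repo.impacted_applications("postal_code"))
```

More than one entry in the result of `structural_variants` signals that the same concept is stored in different shapes, and `impacted_applications` lists the systems to review before a definition changes.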
Similar considerations can be made for other areas of data architecture. These may include defining data standards, establishing protocols for enterprise data modeling, and instituting processes for data model review and acceptance by the members of a data governance board. Instituting organizational standards, along with the data governance processes that oversee observance of those standards, is the first step in resolving the challenges inherent in wholesale data consolidation.
Recent articles by David Loshin