Data Analysis: A Pivotal Activity for Operational Business Intelligence, Part 1

Originally published February 1, 2007

One could almost say, “Show me the types and quality of your data and I will show you how good your decision making is going to be.” But good data types and data quality alone will not satisfy the need for outstanding decision making. Without analyzing that data, decision making will not get anywhere. Data analysis, which leads to data modeling, provides the insights into the data that go a long way toward supporting good decision making.

Top-down logical data modeling discusses how to create project-specific logical data models and merge them into an integrated enterprise logical data model. It also elaborates on the data-specific business metadata components gathered during the logical data modeling process. Bottom-up source data analysis describes how to apply three sets of transformation rules to the source data: technical data conversion rules, business data domain rules and business data integrity rules. To arrive at good quality data, one must also perform data cleansing, which encompasses the responsibilities for data archeology, data cleansing and data quality enforcement, and which requires triaging (prioritizing) data cleansing activities.

A number of questions need to be answered before effective data analysis can be performed:

General questions regarding the source data:

  • Do we know where the source data resides? Which systems hold this data? Which files or databases have this data?
  • Is this data stored in multiple places? (The answer is usually an unequivocal “yes.” General observation suggests that, on average, the same data is stored in at least five places – and even that is a generous number!)
  • Has the requested source data already been modeled?
  • One of the more political questions is going to be: Do we know who the data owners are? And it is not sufficient to know only that. You must also know the answer to the question: Who has the authority to update that data? And the even more loaded question is going to be: Who is responsible if the data is stored incorrectly?
  • Do we already know what data is critical, what is important and what is insignificant (for data cleansing triage purposes)?
  • Is there any other type of documentation available for the requested source data? Is it up to date and complete?

General questions regarding the quality of data:

  • How clean is the data?
  • How clean does the data have to be? Who decides?
  • To work on the cleanliness of the data, one has to understand the business rules pertaining to it. The question then is: Where do we get the business rules for the data? From the data owners? From the business representatives on the project?

General questions regarding data cleansing:

  • To avoid reinventing the wheel, it is a good idea to find out whether the data errors have already been documented by other project teams. If the answer is “yes,” where is that documentation?
  • Who in the company would know what the known data errors are?
  • Are there codes inside the operational programs that are being translated? In which programs? Is there a code translation book for the encoded fields?  

A very typical problem of existing operational systems is as follows. Operational systems are developed as stovepipe automation solutions for individual business units and not for supporting the executive decision making process. Therefore, operational systems are not designed to integrate or reconcile with each other in order to provide a consistent cross-organizational view. Business intelligence (BI) applications, on the other hand, are designed to do just that – provide integrated and reconciled business data to the business community.

Business-Focused Data Analysis
For many organizations, the BI decision support initiative is the first attempt at bringing business data together from multiple sources in order to make it available across different parts of the organization. Organizations that use a traditional system development methodology on their BI projects usually run into severe source data problems when they try to implement their extract, transform and load (ETL) process. The reason is that traditional development methodologies do not have steps for analyzing data domains (content) early in the development process. They have – at best – a systems analysis phase for the application functions, but not a business-focused data analysis phase for the underlying data.  

Note: The business-focused data analysis step is the most critical cross-organizational step in the Business Intelligence Roadmap.

The data analysis step in the Business Intelligence Roadmap is different from a systems analysis phase in a traditional methodology. The activities traditionally performed during systems analysis are geared toward producing a design decision for the system to be built. The activities performed during data analysis are geared toward understanding and correcting the existing discrepancies in the business data, irrespective of any system design or implementation method. Data analysis is therefore a business-focused activity, not a system-focused activity.  

Figure 1 points out that two complementary methods are required to perform rigorous data analysis:

  1. Top-down logical data modeling for integration and consistency, and
  2. Bottom-up source data analysis for standardization and quality.


Figure 1: Complementary Data Analysis Techniques

Top-Down Logical Data Modeling

Project-Specific Logical Data Model
The most effective technique for discovering and documenting the single cross-organizationally integrated and reconciled view of business data is entity-relationship (E-R) modeling, also known as logical data modeling. The normalization rules of E-R modeling are applied during top-down data modeling as well as during bottom-up source data analysis. Using normalization rules, along with other data administration principles, assures that each data element within the scope of the BI project is uniquely identified, correctly named and properly defined – and that its domain (content) is validated for all businesspeople who will be accessing this data. That means that the normalized project-specific logical data model yields a formal representation of the data exactly as it exists in the real world, without redundancy and without ambiguity.

This formal representation of data follows another normalization rule: process independence. The main reason is that processes may change – and they had better change to keep up with changes in the real external world. Therefore, by definition, a logical data model, which is based on normalization rules, is also process independent. That means that the structure and content of the logical data model are not influenced by any type of database, access path, design, program, tool or hardware. Figure 2 highlights the process independence of a logical data model with the crossed-out (X) process components.


Figure 2: Process Independence of Logical Data Models

Because of its process independence, a logical data model is a business view, not a database view and not an application view. Therefore, a unique piece of data, which exists only once in the real business world, also exists only once in a logical data model.
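
To make this concrete, the following is a minimal sketch, written in Python with hypothetical entity and element names, of how a process-independent, project-specific logical data model might be captured as plain data structures: entities, their data elements and their relationships only, with no tables, access paths or program logic, and with each data element appearing exactly once.

  from dataclasses import dataclass, field
  from typing import List

  @dataclass
  class DataElement:
      name: str        # unique, formally constructed data name
      definition: str  # business definition, independent of any system

  @dataclass
  class DataObject:
      name: str
      identifier: str                               # business data identifier
      elements: List[DataElement] = field(default_factory=list)

  @dataclass
  class Relationship:
      parent: str
      child: str
      rule: str                                     # business rule behind the association

  # A tiny project-specific model: no databases, indexes, access paths or
  # program logic appear anywhere (process independence).
  customer = DataObject("Customer", "Customer_Number", [
      DataElement("Customer_Name", "The legal name of a customer."),
      DataElement("Customer_Birth_Date", "The date on which a customer was born."),
  ])
  account = DataObject("Account", "Account_Number", [
      DataElement("Account_Open_Date", "The date on which an account was opened."),
  ])
  model = [customer, account]
  relationships = [Relationship("Customer", "Account",
                                "A customer may hold zero or more accounts.")]

  # A unique piece of business data exists only once in the logical model:
  # every data element name must be unique across all data objects.
  names = [e.name for obj in model for e in obj.elements]
  assert len(names) == len(set(names)), "duplicate data element names found"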

Enterprise Logical Data Model
It is the responsibility of an enterprise architecture (EA) group, or of data administration if the organization does not have an EA group, to merge the project-specific logical data models into an integrated and standardized enterprise logical data model, as illustrated in Figure 3.


Figure 3: Creating an Enterprise Logical Data Model

This enterprise logical data model, also known as enterprise information architecture, is very seldom constructed all at once, nor is it a prerequisite for BI projects to have a complete one. Instead, the enterprise logical data model evolves over time and may never be completed – and does not need to be completed because the objective of this process is not to produce a finished model, but to discover and resolve data discrepancies.

These data discrepancies exist en masse among stovepipe operational systems and are the root causes for the current inability to provide integrated and consistent cross-organizational information to the business community. The discovery of these discrepancies should be embraced and celebrated by the BI project team, and especially by the business community, because poor quality data is finally being addressed and resolved. Gaining control over the existing data chaos is, after all, one major function of any BI decision support initiative.
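
The sketch below illustrates, under simplified assumptions, how an EA or data administration group might merge project-specific models into an evolving enterprise model and surface data discrepancies rather than hide them; the element names, definitions and the simple name-collision check are illustrative only.

  # Two hypothetical project-specific models, reduced to data name -> definition.
  project_a = {
      "Customer_Status_Code": "Code indicating whether a customer is active or inactive.",
      "Customer_Name": "The legal name of a customer.",
  }
  project_b = {
      "Customer_Status_Code": "Code indicating a customer's credit standing.",  # homonym candidate
      "Account_Balance_Amount": "The current balance of an account.",
  }

  enterprise_model = {}
  discrepancies = []
  for project in (project_a, project_b):
      for name, definition in project.items():
          if name in enterprise_model and enterprise_model[name] != definition:
              # Same name, different meaning: a discrepancy to be resolved by
              # the data owners, not silently merged away.
              discrepancies.append(name)
          else:
              enterprise_model[name] = definition

  print("Data discrepancies to reconcile:", discrepancies)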

Logical Data Modeling Participants
Logical data modeling sessions are typically facilitated and led by a data administrator who has a solid business background. If the data administrator does not have a good understanding of the business, a subject-matter expert must assist him or her in this task.

The business representative and the subject matter expert assigned to the BI project are active participants during the modeling sessions. If the data is being extracted from several different operational systems, multiple data owners may have to participate on the BI project because each operational system may be under the governance of a different owner. Data owners are those business individuals who have authority to establish business rules and set business policies for those pieces of data that are originated by their department. When data discrepancies are discovered, it is the data owners' responsibility to sort out the various business views and to approve the legitimate usage of their data. This data reconciliation process is – and should be – a business function, not an IT function, although the data administrators – who usually work for IT – facilitate the discovery process.

Systems analysts, developers and DBAs should also be available to participate in some of the modeling sessions on an as-needed basis. These IT specialists maintain the organization’s applications and data structures, and they are often the most knowledgeable about the data – how and where it is stored, how it is processed, and ultimately how it is used by the business. In addition, they often have in-depth knowledge of the accuracy of the data, how it relates to other data, the history of its use, and how the content and meaning of the data has changed over time. It is important to obtain the commitment from these IT resources since they are often busy “fighting fires” and working on enhancements to the operational systems.

Standardized Business Metadata
A logical data model, representing a single cross-organizational business view of the data, is composed of an E-R diagram and supporting business metadata. Business metadata comprises information about business data objects, their data elements, and the relationships among them. Business metadata, together with the technical metadata added during the design and construction steps, ensures data consistency and enhances the understanding and interpretation of the data in the BI decision support environment. A common subset of business metadata components as they apply to data (as opposed to processes) is listed in Figure 4.


Figure 4: Data-Specific Business Metadata Components

Data name is an official label developed from a formal data naming taxonomy, and should be composed of a prime word, a class word and qualifiers. Each data name uniquely identifies one piece of data within the logical data model. No synonyms and no homonyms should exist.
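
A naming taxonomy of this kind can be enforced programmatically; the short Python sketch below assumes hypothetical prime-word and class-word lists and an underscore naming convention, neither of which is prescribed by the Business Intelligence Roadmap itself.

  # Hypothetical controlled vocabularies for the naming taxonomy.
  PRIME_WORDS = {"Customer", "Account", "Product"}
  CLASS_WORDS = {"Name", "Date", "Amount", "Code", "Number"}

  registered_names = set()

  def build_data_name(prime, qualifiers, class_word):
      # A data name is composed of a prime word, qualifiers and a class word.
      if prime not in PRIME_WORDS:
          raise ValueError("unknown prime word: " + prime)
      if class_word not in CLASS_WORDS:
          raise ValueError("unknown class word: " + class_word)
      return "_".join([prime] + list(qualifiers) + [class_word])

  def register(name):
      # No synonyms and no homonyms: each data name is registered exactly once.
      if name in registered_names:
          raise ValueError("duplicate data name: " + name)
      registered_names.add(name)

  register(build_data_name("Customer", ["Birth"], "Date"))  # Customer_Birth_Date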

Data definition is a brief one- or two-sentence comprehensive dictionary-like description of a data object or a data element. If a data object has many subtypes, each subtype should have its own unique data definition. A data definition explains the meaning of the data object or data element (similar to definitions in a language dictionary). It does not include who created the object, when it was last updated, what system originates it, what values it contains, etc. That information is stored in other metadata components (e.g., data owner, data content).

Data relationship is a business association among data occurrences in a business activity. Every data relationship is based on business rules and business policies for the associated data occurrences under each business activity.

Data identifier uniquely identifies an occurrence of a data object. A data identifier should be known to the business staff. It should be minimal (as succinct as possible) and non-intelligent (have no embedded logic such as account number 0765587654, where 0765 is an embedded branch number).
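
The difference between an intelligent and a non-intelligent identifier can be shown in a few lines; the surrogate-key approach below is one illustrative option, not a recommendation from the book, and it reuses the account number example from the text.

  import uuid

  # Intelligent identifier (discouraged): the first four digits secretly
  # encode branch 0765, so programs come to depend on the key's structure.
  intelligent_id = "0765587654"
  embedded_branch = intelligent_id[:4]

  # Non-intelligent identifier: minimal and free of embedded business meaning.
  non_intelligent_id = uuid.uuid4().hex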

Data type describes the structure of a data element. It categorizes the type of values (character, number, decimal, date) allowed to be stored in a data element.

Data length specifies the size of a data element for its particular data type. For example, a decimal can be an amount field with two digits after the decimal point or a rate field with three digits after the decimal point.

Data content (domain) identifies the actual allowable values for a data element specific to its data type and data length. A domain may be expressed as a range of values, a list of allowable values, a generic business rule, or a dependency rule between two or more data elements.

Data rule is a constraint on a data object or a data element. A data constraint can also apply to a data relationship. A data constraint can be in the form of a business rule or a dependency rule between data objects or data elements. For example: the ceiling interest rate must be higher than the floor interest rate.

Data policy governs the content and behavior of a data object or a data element. It is usually expressed as an organizational policy or a government regulation. For example: patients on Medicare must be at least 65 years old.
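
The domain, data rule and data policy examples above translate naturally into validation checks; the following sketch uses the interest rate and Medicare examples from the text, while the field names, allowable values and overall validation style are assumptions made for illustration.

  from datetime import date

  def validate_interest_rates(floor_rate, ceiling_rate):
      # Data rule (dependency between two data elements): the ceiling
      # interest rate must be higher than the floor interest rate.
      if ceiling_rate <= floor_rate:
          raise ValueError("ceiling rate must be higher than floor rate")

  def validate_medicare_patient(birth_date, as_of):
      # Data policy (government regulation): patients on Medicare must be
      # at least 65 years old.
      age = as_of.year - birth_date.year - (
          (as_of.month, as_of.day) < (birth_date.month, birth_date.day))
      if age < 65:
          raise ValueError("Medicare patients must be at least 65 years old")

  def validate_state_code(code):
      # Data content (domain) expressed as a list of allowable values,
      # constrained by data type (character) and data length (2).
      allowed = {"NY", "CA", "TX"}  # illustrative subset
      if len(code) != 2 or code not in allowed:
          raise ValueError("invalid state code: " + code)

  validate_interest_rates(floor_rate=3.250, ceiling_rate=7.500)
  validate_medicare_patient(date(1940, 6, 1), as_of=date(2007, 2, 1))
  validate_state_code("NY")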

Data ownership identifies the persons who have the authority to establish and approve the business metadata for the data objects and data elements under their control. Although logical data models are extremely stable, some of these business metadata components, such as data content, data rules, data policy and data ownership, occasionally change for legitimate reasons. It is important to track these changes in the metadata repository.      

We will cover the role of the bottom-up source data analysis in Part 2 of this article.

(Source: Moss, Larissa and Atre, Shaku. Business Intelligence Roadmap – The Complete Project Lifecycle for Decision-Support Applications, Addison Wesley Professional, 2003.)

Author’s Note: The Business Intelligence Roadmap includes a set of all major activities and tasks that are appropriate for BI projects. Not every BI project will have to perform every single activity in every step. To receive a complimentary copy of the Business Intelligence Navigator, designed to help chart the business intelligence journey, please visit http://www.atre.com/bi_navigator/navigator.html.

  • Shaku Atre

    Shaku is a world-renowned expert on business intelligence and on database technology. Shaku is the Founder and President of the Atre Group, Inc., and has written six very successful books on database-related topics that have been translated into many languages. She frequently speaks at conferences around the world, has been widely quoted in the trade press and has served as a columnist for many leading technology publications. Previously, Shaku was a partner with PricewaterhouseCoopers. She also worked at IBM for 14 years, where she taught at its prestigious Systems Research Institute and held various technology-related and management positions in Europe and in the United States. Her award-winning book on database management systems, Data Base: Structured Techniques for Design, Performance and Management, has become a definitive reference on the subject. Most recently, Shaku co-authored Business Intelligence Roadmap – The Complete Project Lifecycle for Decision Support Applications.
