Part 1 of this series, which covers top-down logical data modeling, was published in the February 1, 2007 issue of the Bill Inmon Newsletter.
Bottom-Up Source Data Analysis
Data analysis cannot stop after top-down logical data modeling because the source data often does not follow the business rules and policies
captured during the data modeling sessions. If bottom-up source data analysis is not performed, the data problems and business rule violations would not be discovered until the extract, transform
and load (ETL) process. Some data quality problems would not be discovered at all until after implementation, and then only if somebody complained about them. As Figure 1 shows, source data mapping
must adhere not only to the usual technical data conversion rules but more so to the business data domain rules and to the business data integrity rules.
Figure 1: Source Data Mapping Rules
Technical Data Conversion Rules
The following technical rules should be observed for any type of data conversion:
Business Data Domain Rules
Business data domain rules are rules about the semantics (meaning and interpretation) of data contents. As a result, business data domain rules require much
more effort in preparing the data for conversion. The business community considers these rules to be more important than the technical data conversion rules. A source data element can meet all
technical data conversion rules but its contents can still be wrong. The business data domain rules are used to identify and correct data violations similar to those listed in Figure 2.
|
Data Domain Violations |
|
1. Missing data values (big issue on BI projects) |
|
2. Default values; for example 0, 999, FF, blank |
|
3. Intelligent "dummy" values, which are specific default (or dummy) values that actually have a meaning; for example, Social Security Number of 888-88-8888 being used to indicate that the person is a non-resident alien |
|
4. Logic embedded in a data value; for example: using lower-valued ZIP codes (postal codes) to indicate a state on the east coast, such as 07456 in New Jersey, and higher-valued ZIP codes to indicate a state on the west coast, such as 91024 in California |
|
5. Cryptic and overused data content; for example: the values "A, B, C, D" of a data element define type of customer, while the values "E, F, G, H" of the same data element define type of promotion, and the values "I, J, K, L" define type of location |
|
6. Multipurpose data elements – that is programmatically and purposely redefined data content; the most obvious example being the "redefines" clause in COBOL statements |
|
7. Multiple data elements embedded in or concatenated across, or wrapped around free-form text fields; for example: Address Lines 1 through 5 containing name and address data elements:
Address Line 1 Brokovicz, Meyers, and Co |
Figure 2: Data Domain Violations
Business Data Integrity Rules
Similar to business data domain rules, business data integrity rules are much more important to improving information quality than are the technical data
conversion rules. Some examples of violations to business data integrity rules are listed in Figure 3.
|
Data Integrity Violations |
|
1. Contradicting data content between two or more data elements; for example: Boston, CA (instead of MA) |
|
2. Business rule violation; for example: Date of Birth = 05/02/1985 and Date of Death for the same person = 11/09/1971 |
|
3. Reused primary key (same key value used for multiple object instances); for example: two employees with the same employee number |
|
4. No unique primary key (multiple key values for the same object instance); for example: one customer with multiple customer numbers |
|
5. Objects without their dependent parent object; for example: job assignment points to employee 3321, but there is no employee 3321 in the employee database |
|
6. A real-world relationship between two data objects that cannot be built in the database due to a gap in business knowledge among operational systems |
Figure 3: Data Integrity Violations
Every critical and important data element must be examined for these defects, and a decision must be made whether and how to correct them. The information consumers (businesspeople who will be using those data elements to make business decisions) and data owners should make that decision after discussing the impact of the cleansing effort with the business sponsor, the project manager and the core team.
Data Cleansing
One of the goals stated most frequently for business intelligence (BI) applications is to deliver clean, integrated and reconciled data to the business
community. Unless all three sets of data mapping rules, as discussed previously, are addressed, this goal cannot be achieved. Many organizations will find a much higher percentage of dirty data in
their source systems than they expected, and their challenge will be to decide how much of it to cleanse.
Data Quality Responsibility
Data archeology (the process of finding bad data), data cleansing (the process of correcting bad data), and data quality enforcement (the process
of preventing data defects from happening now and in future at the source) are all business responsibilities – not IT responsibilities. That means that business representatives –
information consumers as well as data owners – must be involved with the data analysis activities and familiar with the source to target data mapping rules.
Because data owners originate the data and establish business rules and policies over the data, they are directly responsible to the downstream information consumers (knowledge workers, business analysts, business management) who need to use that data. If downstream information consumers base their business decisions on poor quality data and suffer financial losses because of it, the data owners must be held accountable. In the past, this accountability has been absent from stovepipe systems. Data quality accountability is neither temporary nor BI-specific, and the business community must make the commitment to accept these responsibilities permanently.
The challenge for IT and for the business sponsor on a BI project is to enforce the inescapable tasks of data archaeology, data cleansing and data enforcement to meet the quality goals of the BI applications.
Note: The data analysis step may be time intensive since many “battles” may rage among the business representatives as to the valid meaning and domain of data.
Although data cleansing tools can assist in the data archeology process, developing data cleansing specifications is mainly a manual process. IT managers, business managers, and data owners, who have never been through a data quality assessment and data cleansing initiatives, are unfamiliar with the analysis process. They often underestimate the time and effort required of their staff by a factor of four or more.
Source Data Selection Process
Poor quality data is such an overwhelming problem that most organizations will not be able to correct all of the discrepancies. When selecting the data
for the BI application, consider the general steps shown in Figure 4.
Figure 4: Source Data Selection Process
Step 1:Identify data of interest and the significance of this data. Data cleansing is a collaborative effort between business representatives who are familiar with the semantics of the data, and data quality analysts who know the program-specific meanings of the data (e.g., use and meaning of a “flag” value, redefined record layouts).
Step 2:Analyze the data for content, meaning, and importance. Many organizations have accumulated massive amounts of data in files and databases. This data constitutes a prospective gold mine of valuable business knowledge and is potentially a good source for data mining. However, the quality of the data content must be assessed first, since mining dirty data is of little value.
Step 3:Determine which data to include in the BI application. Select only the data that will meet core business requirements. Even with automated tools, the cost of assuring data quality for an all-inclusive BI decision support environment becomes prohibitive for most organizations. Some questions to consider when selecting data are:
Step 4: Prepare the data cleansing specifications. The IT staff, working with business representatives, will get to know the necessary business rules, which are needed to write the data cleansing specifications. In essence, this is a source data reengineering process.
Step 5:Select ETL and cleansing tools. Determine whether it is appropriate and cost-effective to acquire an ETL tool, a cleansing tool, or both. Examine the suitability and effectiveness of those tools. Some data cleansing specifications can be very complicated. Be sure the tools are capable of executing them.
Note:Automated tools do not eliminate the manual labor of source data analysis. Human beings can use the inference drawing capability during the process of data analysis, which the automated tools don’t have.
Data Selection Key Points
When identifying and selecting the operational data to be used to populate the BI target databases, some key points should be considered. Applying
the source data selection criteria listed in Figure 5 will minimize the need and effort for data cleansing.
Figure 5: Source Data Selection Criteria
Data integrity – How internally consistent is the data? This is the most important criterion.
Data precision – This is the next important criterion.
Data accuracy – How correct is the data?
Data reliability – How old is it?
Data format – The closer the data is to the destination data format, the less the conversion requirements will be. From highest to lowest, the format priorities are:
Note: Source data quality will only be as good as the enforcement of quality processes in the operational systems. Mandatory quality processes should include data entry rules and edit checks in programs. If those processes are not enforced, or don’t exist, data usually gets corrupted, regardless of whether the data is in a relational database or in an old VSAM file.
To Cleanse or Not To Cleanse...
Many organizations struggle with this question. Data cleansing research indicates that some organizations downplay data cleansing to achieve short-term
goals. The consequences of not addressing poor quality data usually hit home when their business ventures fail or are adversely affected because of inaccurate data.
It is important to recognize that data cleansing is a labor-intensive, time-consuming and expensive process. Cleansing all of the data is usually not cost-justified – nor practical; but cleansing none of it is equally unacceptable. It is therefore important to analyze the source data carefully and to classify the data elements as being critical, important or insignificant to the business. Concentrate on cleansing all of the critical data elements, keeping in mind that not all data is equally critical to all businesspeople. Then, cleanse as many of the important data elements as time allows, and move the insignificant data elements as-is. In other words, you do not need to cleanse all of the data, and not all at once.
Cleansing Operational Systems
When the selected data is cleansed, standardized and moved into the BI target databases, a question to consider is whether the source files and source
databases should also be cleansed. Management may ask, “Why not spend a little extra money and time to cleanse the source files and databases so that the data is consistent in the source as
well as in the target?” This is a valid question, and this option should definitely be pursued if the corrective action on the source system is as simple as adding an edit check to the data
entry program.
If the corrective action requires changing the file structure, which means modifying (if not rewriting) most of the programs that access that file, the cost for such an invasive corrective action on the operational system is probably not justifiable – especially if the bad data is not interfering with the operational needs of that system. Let’s not forget that companies didn't even want to make such drastic changes for the now infamous Y2K problem – they only made those changes when it was clear that their survival was at stake. Certainly, a misused code field does not put an organization’s survival at stake. Hence, the chances that operational systems will be fixed are bleak.
Data Analysis Activities
Data analysis activities do not need to be performed linearly. Figure 6 indicates which activities can be performed concurrently.
Figure 6: Data Analysis Activities
Deliverables Resulting from These Activities
Normalized and fully attributed logical data model: This project-specific logical data model is a fully normalized entity-relationship diagram showing kernel entities, associative entities, characteristic entities, cardinality, optionality, unique identifiers and all attributes.
Business metadata: The business entities and attributes from the logical data model must be described with metadata. Data-specific business metadata components include data names, data definitions, data relationships, unique identifiers, data types, data lengths, domains, business rules, policies and data ownership. These are usually captured in the tool repository of the CASE tool.
Data cleansing specifications: This document describes the cleansing logic that must be applied to the source data in order to bring it into compliance with the technical data conversion rules, the business data domain rules and the business data integrity rules. This document will be used to create the transformation specifications on the source to target mapping document.
Expanded enterprise logical data model: This deliverable is produced behind the scenes by data administration or the EA group when they merge the project-specific logical data model into the enterprise logical data model. Any rejected entities or attributes and any discrepancies between the models will be presented to the BI project team for resolution.
(Source: Moss, Larissa and Atre, Shaku. Business Intelligence Roadmap – The Complete Project Lifecycle for Decision-Support Applications, Addison Wesley Professional, 2003.)
Author’s Note: TheBusiness Intelligence Roadmapincludes a set of all major activities and tasks that are appropriate for BI projects. Not every BI project will have to perform every single activity in every step. To receive a complimentary copy of the Business Intelligence Navigator, designed to help chart the business intelligence journey, please visithttp://www.atre.com/bi_navigator/navigator.html.
Recent articles by Shaku Atre
Shaku is a world-renowned expert on business intelligence. Shaku is the Founder and President of the Atre Group, and has written six books on database-related topics. She frequently speaks at conferences around the world, has been widely quoted in the trade press and has served as a columnist for many leading technology publications. Previously, Shaku was a partner with PricewaterhouseCoopers. She also worked at IBM for 14 years, where she taught at their prestigious Systems Research Institute. Her award-winning book on database management systems, Data Base: Structured Techniques for Design, Performance and Management has become a definitive reference on the subject. Most recently, Shaku co-authored Business Intelligence Roadmap—The Complete Project Lifecycle for Decision Support Applications.