

Preparing Data for Business Analytics

Originally published October 27, 2011

Aside from ensuring that the quality of data in a business intelligence environment meets the collective expectations of its consumers, one of the more vexing issues that has plagued those implementing reporting and analytics systems is the organization of the information to be subjected to scrutiny. This article looks at two specific aspects of information organization: structural organization and conceptual organization.

We can make three presumptions about the state of the data as it flows into a data warehouse or other analytical environment:

  1. Because the data originates from different sources, there will probably be little coordination regarding the structure of individual data elements.

  2. Because different modelers envisioned the data architectures for the different sources, the ways that the data elements are organized by entity or table are likely to vary.

  3. The ways that the collections of data tables are related to each other are likely to be different.

Here are some corresponding examples:

  1. At the data element level, postal codes are represented differently: In data set A, postal code data elements support 6-character alphanumeric codes (such as those used in Canada), while data set B maintains ZIP code data elements and supports 5-character codes limited to numeric digits.

  2. At the model level, data set A’s customer table incorporates a street address, city, province, postal code, and home telephone number, while perhaps data set B’s customer contact information is broken up into two tables, one for addresses (including a street address, city, state, and ZIP code) and one for telephone numbers.

  3. At the relational level, data set A’s customer contact information is embedded within a customer record, while data set B has one customer table that is linked by a foreign key to both the address table and the telephone table.
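The data element-level differences in the first example can be resolved by mapping each source value to a single canonical representation. Here is a minimal sketch in Python; the function name, the canonical form (uppercase, no internal spaces, tagged with the country whose format it matches), and the sample values are all illustrative assumptions, not part of any particular integration toolkit:

```python
import re

def standardize_postal_code(raw):
    """Map a source postal/ZIP value to an assumed canonical representation."""
    code = re.sub(r"\s+", "", str(raw)).upper()      # strip spaces, normalize case
    if re.fullmatch(r"[A-Z]\d[A-Z]\d[A-Z]\d", code):  # Canadian-style alphanumeric code
        return {"country": "CA", "postal_code": code}
    if re.fullmatch(r"\d{5}", code):                  # 5-digit US ZIP code
        return {"country": "US", "postal_code": code}
    return {"country": None, "postal_code": code}     # unrecognized: flag for review

# Values from data set A (Canadian-style) and data set B (US ZIP) both
# land in the same canonical structure:
standardize_postal_code("k1a 0b1")  # {'country': 'CA', 'postal_code': 'K1A0B1'}
standardize_postal_code("10001")    # {'country': 'US', 'postal_code': '10001'}
```

Routing unrecognized values to a review queue, rather than silently guessing, is what preserves the "managed lineage" back to the original source.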

Even if all the data values are complete, accurate, consistent and current – in other words, meet all the data quality requirements – the business intelligence infrastructure has to be arranged to accommodate the source data in a cohesive way that maintains the link back to the original sources (“managed lineage”) and is “value-normalized” across the sources (“structurally consistent”). These tasks are generally absorbed as part of the data integration layer in which data is transformed from the original format into a standardized representation within the data warehouse environment.

That standardized representation is intended to address all three presumptions. A set of canonical data element representations provides the target for data element-level standardization, and these representations are managed within a metadata management system. The data model for analysis addresses more than just the canonicalization of the relationships across concepts (such as “customer,” “telephone number,” “location,” etc.). In contrast with the originating application data models, which are organized for transaction processing, the analytical data model is organized to support the types of interactions that analysts perform.

Here is the difference: Transaction processing systems focus on the information necessary to execute the transaction, so the arrangement of information must provide for rapid access to all the data when needed as part of the transaction execution flow. On the other hand, analytical systems tend to support collections of information: aggregations such as sums and averages, hierarchical designs for drilling down and looking at intermediate aggregates, and trending over large numbers of transactions. Analytical data models have to support rapid access to large subsets of collections of transactions. To address this, data warehouse models represent collections of transactions in a denormalized manner that can support a mixed workload of queries and analyses. The discussion of the model differences is beyond the scope of this article, but we’ll revisit that in a future article.
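To make the denormalization step concrete, here is a small sketch, assuming a hypothetical normalized source in the style of data set B: customer, address, and telephone tables linked by a `customer_id` foreign key. The table contents and column names are invented for illustration:

```python
# Hypothetical normalized source tables, keyed by customer_id.
customers  = {101: {"customer_id": 101, "name": "Pat Smith"}}
addresses  = {101: {"street": "12 Main St", "city": "Albany",
                    "state": "NY", "zip": "12207"}}
telephones = {101: {"home_phone": "518-555-0100"}}

def denormalize(customer_id):
    """Join the three tables into one wide record suitable for
    aggregation and drill-down in the analytical environment."""
    record = dict(customers[customer_id])           # start from the customer row
    record.update(addresses.get(customer_id, {}))   # fold in the address columns
    record.update(telephones.get(customer_id, {}))  # fold in the telephone columns
    return record

denormalize(101)["city"]  # 'Albany'
```

The join is performed once, during integration, so that analytical queries over large collections of such records do not pay the cost of reassembling each customer at query time.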

Meanwhile, we have yet to address our other issue: conceptual organization. As the data sets are broken down, then joined and denormalized for the analytical applications and algorithms to work, we start to see some pulling at the seams when it comes to standardizing the data values. For example, the thing we refer to as an “automobile” might not only be referred to using terms like “auto,” “car,” “van,” “minivan,” “SUV,” “truck,” etc., but also the items themselves have characteristics that allow them to be grouped and aligned in different ways. To continue the example, automobiles can be classified along manufacturing lines (make and model), but they can also be organized in relation to their attributes (such as paint color, type of transmission, brake type, etc.).
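The automobile example can be sketched as two distinct steps: rolling synonymous source terms up to one canonical concept, and then exposing the alternative groupings side by side. The synonym map and attribute names below are illustrative assumptions:

```python
# Hypothetical synonym map: varied source terms roll up to one concept.
VEHICLE_SYNONYMS = {
    "auto": "automobile", "car": "automobile", "van": "automobile",
    "minivan": "automobile", "suv": "automobile", "truck": "automobile",
}

def classify(item):
    """Attach the canonical concept plus two alternative groupings:
    manufacturing lines (make/model) and physical attributes."""
    concept = VEHICLE_SYNONYMS.get(item["term"].lower(), item["term"])
    return {
        "concept": concept,
        "by_manufacturing": (item["make"], item["model"]),
        "by_attribute": (item["color"], item["transmission"]),
    }

classify({"term": "SUV", "make": "Honda", "model": "CR-V",
          "color": "blue", "transmission": "automatic"})
# {'concept': 'automobile', 'by_manufacturing': ('Honda', 'CR-V'),
#  'by_attribute': ('blue', 'automatic')}
```

Which of these groupings becomes a dimension in the analytical model is exactly the kind of decision the information consumers need to weigh in on.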

The consumers of information from the analytical system must help in setting the parameters for preparing data sets for analysis. This includes helping to define the concept hierarchies that are relevant for slicing and dicing, and this might also mean additional transformations or data reductions.

For example, data attributes with values drawn from a continuous value range (such as a customer’s age) might be mapped into a smaller number of “age buckets” such as {under 18, 18-34, 35-50, 51-65, over 65}. Supporting the analysis of customers by age range requires multiple transformation steps: first transforming a birth date into an age, then mapping the calculated age into the “quantized” bucket, and then assigning the bucket to the customer’s record. One more consideration: while this supports analysis by age bucket, the calculated bucket cannot be stored as a statically assigned value, because customers keep aging, and their bucket assignments change along with them.
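The bucketing steps above can be sketched as follows. Computing the bucket from the stored birth date at query time, rather than persisting it, is what keeps the assignment current; the boundaries mirror the example ranges, and the function name is an assumption:

```python
from datetime import date

def age_bucket(birth_date, as_of=None):
    """Derive the age bucket from the birth date as of a given date,
    so the assignment stays correct as the customer ages."""
    as_of = as_of or date.today()
    # Age in whole years, subtracting one if this year's birthday hasn't passed.
    age = as_of.year - birth_date.year - (
        (as_of.month, as_of.day) < (birth_date.month, birth_date.day))
    if age < 18:
        return "under 18"
    if age <= 34:
        return "18-34"
    if age <= 50:
        return "35-50"
    if age <= 65:
        return "51-65"
    return "over 65"

# The same birth date falls into different buckets on different dates:
age_bucket(date(1990, 6, 1), as_of=date(2011, 10, 27))  # '18-34'
age_bucket(date(1990, 6, 1), as_of=date(2030, 10, 27))  # '35-50'
```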

These are a few examples of thought processes that you must consider when preparing data for analysis. But having an initial awareness of the potential issues enables the business intelligence professional to incorporate these tasks into the design, instead of being alerted to the issues after a system has been put into production.

Recent articles by David Loshin

 
