We use cookies and other similar technologies (Cookies) to enhance your experience and to provide you with relevant content and ads. By using our website, you are agreeing to the use of Cookies. You can change your settings at any time. Cookie Policy.


Data Models for Analytics

Originally published April 26, 2012

Practically all modern business applications that employ a data subsystem represent their data sets using data models. A data model is a discrete structured data representation of a real-world set of entities related to one another. Each entity (most often represented using a table) carries a set of characteristic attributes (described as data elements). Yet over time, our understanding of the ways that real-world objects and events are captured within a structured representation must adapt to the context in which the data sets are used and the ways the information satisfies the needs of the business processes.

Transaction processing systems need to have visibility into all aspects of a limited number of data instances at a time (such as a single customer buying a particular product). Analytical systems may need to support rapid aggregation to quickly respond to ad hoc queries relating to many data instances (such as “How many customers bought this specific product?”). In other words, there is bound to be a significant difference between how we use data in an operational/tactical manner (i.e., to “run the business”) and the ways we use data in a strategic manner (i.e., to “improve the business”).

The traditional modeling technique for operational systems revolves around the entity-relationship (E/R) model. Unfortunately, reporting and analytical applications are generally less well-suited to utilizing data structured in the entity-relational form (although there are emerging environments that make this less true). The alternative is to restructure transaction or event data using what is called “dimensional modeling” that is better organized to provide rapid responses to different types of queries and analyses.

The challenge in using the standard entity-relationship model for reporting and analysis lies in the interconnectedness of the entities and the corresponding complexity in accumulating the information necessary for hierarchical aggregations. The alternative dimensional modeling technique captures the basic unit of representation as a single multi-keyed entry in a slender fact table, with each key exploiting the relational model to refer to the different dimensions associated with those facts.

A maintained table of facts, each of which is related to a set of dimensions, is a much more efficient representation for data in a data warehouse and allows for information to be represented in a way that is more suitable to high-performance access. This is due to the ability to efficiently create aggregations and extractions of data specific to particular dimensional constraints quickly while being able to aggregate information.

Just the Facts, Ma’am

The representation of a dimensional model is straightforward in that each row in the fact table represents a unique observable transaction or event within a specific business context. For example, a sales transaction would be represented as a specific fact record capturing all details of the event:

 Time  Customer  Quantity  Item  Clerk  Unit Price
 Total  Promotion  Location
                 

Table 1:
Sales Transaction Fact Record

This representation captures both entity data and quantifiable data. The entity data items, such as customer or location are not the actual values but instead are references (or foreign keys) to the dimension tables. The quantifiable items (such as quantity or unit price) are specific pieces of information relevant to the fact and are captured in the fact record. This data is typically numeric so that it is amenable to aggregate functions (sum, max, min, etc.). Each fact represents the total quantity of a product sold to a specific customer at a particular point-of-sales location at a particular point in time. A rough visualization of this sales fact table model is shown in Figure 1:



Figure 1: Sales Fact Table Model


This model contains six dimensions: Time, Customer, Item, Promotion, Location, and Clerk. When you look at the picture of the model in the figure, you can see that the relationships between the fact table and the dimensions resemble a star, which is why this model layout is referred to as a star schema.

This model easily adapts to support aggregation by essentially sorting and accumulating within the set of dimensions under investigation. So for example, to derive information about the sales of any particular item by sales location, you would sort the fact table records by sales location and then by product. At that point you could group the total counts and amounts by product within each sales location. With the right set of indexes introduced, reporting these aggregates is simply done using SQL queries.

The fact table is related to dimensions in a star schema. Each entry in a dimension represents a description of the individual entities within that dimension. In our sales example, the item dimension contains information associated with each item that could be sold, with descriptions of the attributes and characteristics of the item, such as item category, description, manufacturer, SKU number, size, unit of measure, package type, and package size, among a myriad of other relevant pieces of reference information.

Using the Dimensional Model for Business Intelligence

This dimensional model has become the de facto standard for representing and managing data in a data warehouse for a number of reasons, such as:

  • Simplicity: There is a certain elegance in the predictability of the model, since it simplifies the process of satisfying requests for reporting via the variety of knowledge delivery and reporting tools. In fact, there is a generic process for extracting information that relies on the star schema: Create a join between the fact table and the desired dimensions and then group by dimension. By virtue of the key relationship between the fact table and the dimensions, this join is basically a single pass through the fact table.

  • Lack of Bias: Even with multiple dimensions, there is no inherent bias lent to any individual dimension. This means that as data consumers change their activity or behavior associated with the kinds of analyses or reports they desire, no specific action needs be taken to rebalance the data to improve performance. In other words, the performance characteristics are not related to the data layout.

  • Extensibility: Because the dimensional model is easily modified, changes to the model can be handled gracefully without disrupting the operation of the data warehouse. For example, adding new values to a dimension (such as adding a new customer or a new item) involves simply adding new rows to that dimension. This can be aligned with a master data registry or repository; each enables greater consistency across systems. Adding a new dimension is done by creating the new dimension table and modifying the key values in the fact table to incorporate the references back to the new dimension. Adding new attributes to dimension values is done by altering the tables and adding the new attribute values.

Dimensional modeling is not limited to customer and sales data. The same approach can be taken to model any collection of transactions or events, such as transportation and logistics, telecommunications transactions (call detail data), insurance claims, Web log transactions, social media postings, non-profit donations, medical encounter data, etc.

There are variations on the star schema that involve breaking out additional dimension information associated with a preexisting dimension, and this allows for additional hierarchical grouping. An example would be customer categories that are applied at the highest level and then having the different customers enumerated within dimensions of that dimension. In either event, the general star schema is a powerful representational abstraction that is ubiquitous in building models for capturing data to supplement reporting and analytics.

Recent articles by David Loshin

 

Comments

Want to post a comment? Login or become a member today!

Be the first to comment!