Data Models for Analytics
by David Loshin
Originally published April 26, 2012
Practically all modern business applications that employ a data subsystem represent their data sets using data models. A data model is a discrete structured data representation of a real-world set of entities related to one another. Each entity (most often represented using a table) carries a set of characteristic attributes (described as data elements). Yet over time, our understanding of the ways that real-world objects and events are captured within a structured representation must adapt to the context in which the data sets are used and the ways the information satisfies the needs of the business processes.
Just the Facts, Ma’amThe representation of a dimensional model is straightforward in that each row in the fact table represents a unique observable transaction or event within a specific business context. For example, a sales transaction would be represented as a specific fact record capturing all details of the event:
Table 1: Sales Transaction Fact Record
This representation captures both entity data and quantifiable data. The entity data items, such as customer or location are not the actual values but instead are references (or foreign keys) to the dimension tables. The quantifiable items (such as quantity or unit price) are specific pieces of information relevant to the fact and are captured in the fact record. This data is typically numeric so that it is amenable to aggregate functions (sum, max, min, etc.). Each fact represents the total quantity of a product sold to a specific customer at a particular point-of-sales location at a particular point in time. A rough visualization of this sales fact table model is shown in Figure 1:
Figure 1: Sales Fact Table Model
This model contains six dimensions: Time, Customer, Item, Promotion, Location, and Clerk. When you look at the picture of the model in the figure, you can see that the relationships between the fact table and the dimensions resemble a star, which is why this model layout is referred to as a star schema.
This model easily adapts to support aggregation by essentially sorting and accumulating within the set of dimensions under investigation. So for example, to derive information about the sales of any particular item by sales location, you would sort the fact table records by sales location and then by product. At that point you could group the total counts and amounts by product within each sales location. With the right set of indexes introduced, reporting these aggregates is simply done using SQL queries.
The fact table is related to dimensions in a star schema. Each entry in a dimension represents a description of the individual entities within that dimension. In our sales example, the item dimension contains information associated with each item that could be sold, with descriptions of the attributes and characteristics of the item, such as item category, description, manufacturer, SKU number, size, unit of measure, package type, and package size, among a myriad of other relevant pieces of reference information.
Using the Dimensional Model for Business IntelligenceThis dimensional model has become the de facto standard for representing and managing data in a data warehouse for a number of reasons, such as:
Dimensional modeling is not limited to customer and sales data. The same approach can be taken to model any collection of transactions or events, such as transportation and logistics, telecommunications transactions (call detail data), insurance claims, Web log transactions, social media postings, non-profit donations, medical encounter data, etc.
There are variations on the star schema that involve breaking out additional dimension information associated with a preexisting dimension, and this allows for additional hierarchical grouping. An example would be customer categories that are applied at the highest level and then having the different customers enumerated within dimensions of that dimension. In either event, the general star schema is a powerful representational abstraction that is ubiquitous in building models for capturing data to supplement reporting and analytics.
SOURCE: Data Models for Analytics
Recent articles by David Loshin
Copyright 2004 — 2020. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC