We use cookies and other similar technologies (Cookies) to enhance your experience and to provide you with relevant content and ads. By using our website, you are agreeing to the use of Cookies. You can change your settings at any time. Cookie Policy.


A Practical Approach to Managing Slowly Changing Dimensions Using the tSQL MERGE Statement in SQL Server 2008

Originally published February 17, 2011

This article will demonstrate one approach to managing a slowly changing dimension in a SQL Server 2008 environment. I will explain what a dimension is and how it is used, what slowly changing dimensions are and how they are used, and finally we will go through a technology approach to managing them using a new functionality in SQL Server 2008—tSQL MERGE statement. We will go through a coding example and decompose the various tables and elements, as well as the MERGE statement. Advanced technologists may find this article a little plain; however, if you are new to SQL Server 2008, data warehousing or data marts, you might find great value in this article. There are various ways to manage a slowly changing dimension through various coding methods, and this article will focus on managing a type 2 slowly changing dimension with SQL Server 2008 transact SQL technologies.

Describing a Dimension in Data Warehousing/Data Marts

Dimension tables are made up of fields and attributes that people use to query a database. For example, the CARRIER table contains carrier names and addresses; the PATIENT table contains patients’ names, addresses, phone numbers, and other contact information; and the TIME table contains day, month, quarter, and year entries. Dimension tables typically do not contain transaction data or measurable data such as costs, revenue, or other data that would belong in a FACT table. A FACT table is typically made up of transaction data that also contains related surrogate keys from the defined dimension tables. Having dimensions defined separately from your FACT tables will enable your users to slice and dice transaction data. For example, all health claims relating to patients living in Florida, during Q1 of 2009.

All dimension tables should contain a primary key that consists of one-to-many columns. The records in a dimension table are uniquely identified by its primary key value. There are many types of primary keys and in our example we will use a surrogate key. What is a surrogate key?  A surrogate key is usually a system -generated number, GUID, or other unique value that has absolutely no business meaning in relationship to the dimension attributes it identifies. In SQL Server 2008 a surrogate key can be implemented on a numeric column, usually an integer, that auto-generates a sequential value. (There are many ways to implement a surrogate key, and in SQL Server 2008 you can define a field as an IDENTITY column, but that is out of the scope of this article.) However, try to use a surrogate key that is built on an interval of more than 1. For example, use a seed value of 101 that intervals by 2, resulting in 101, 103, 105, 107…so forth. Also, try to use an integer that can be both positive and negative. You will thank me one day if your system scales to a very large database (VLDB) or a replication between different systems are involved. But again, this is another topic outside of the scope of this document.

Dimensions usually get their data from source business transaction systems. Data in those source systems will change over time and thus impact the records that are in the dimension. Extraction, transformation, and load jobs will transfer data from source systems and into the dimension tables. For example, Jane Smith has a record entered in the PATIENT dimension table. There will be certain things about Jane Smith that probably won’t change: social security number, birthday, and other personal details. However, Jane Smith could get married and change her last name, she could move to a different city or state, or her other contact information could change. The business could have specific reasons to maintain a history of changes to Jane Smith’s records. For example,  Jane Smith may have a worker’s compensation claim transaction dated 1/12/2007, while she was working for an employer in Ohio. She then  moves to Michigan on 10/1/2008, suffers another injury and files a worker’s compensation claim on 3/2/2009. If the dimension does not store historic changes to Jane Smith’s record, the only record that will be related to both worker’s compensation claims will be Jane Smith and her Michigan address. What is described here is a slowly changing dimension.

Describing a Slowly Changing Dimension (SCD)

The example in the previous paragraph explains what a slowly changing dimension is and how it can impact a data mart or data warehouse. Simply put, a slowly changing dimension is when certain attributes for a record are altered over time. The example above describes a type 1 slowly changing dimension. A type 1 slowly changing dimension overwrites the data in the table and does not maintain a history of those changes. This section of the article will describe the different types of slowly changing dimensions and explain an approach using the SQL Server 2008 t-SQL MERGE statement and a coding example.

The most common slowly changing dimensions are types 0, 1, and 2. A type 0 is where the dimension is maintained only at the primary load. An example of a type 0 would be a TIME dimension. A type 1 SCD is where the records of a dimension are overwritten and a type 2 is where the history of records is maintained via a versioning process. In our Jane Smith example mentioned above, there would be two records in the dimension table: one for Jane Smith that lived in Ohio—version 1—and a second record when Jane Smith moved to Michigan—version 2. Instead of a version, an alternative and popular way of managing a type 2 dimension would be to have two date fields on the PATIENT dimension table: START_DATE and END_DATE. Jane Smith would again have two separate records that would like the following table:

 KEY  NAME  STATE  START_DATE  END_DATE
 1  Jane Smith
 OH  1/1/2006  10/1/2008
 2  Jane Smith
 MI  10/2/2008  1/1/2599

Notice that the Jane Smith record with KEY=2 has an END_DATE of 1/1/2599. This value denotes that this record is the most current record for any transaction dated on or after 10/2/2008. Our coding example in this document will provide a technical example of how to manage this using one SQL statement. This article will not explain all the various forms of slowly changing dimensions as there is a lot of information available on the web and in books. Let’s take a look at how to manage a slowly changing dimension using the t-SQL MERGE statement.

SQL Server 2008 t-SQL MERGE Statement

The t-SQL MERGE statement is added functionality that was built into Microsoft SQL Server 2008. From a minimalist point of view it allows the programmer to write one statement that allows you to INSERT new records, UPDATE existing records, or DELETE existing records in a target table based on a relationship with a source table. Simply stated, it allows the programmer to write one statement rather than many statements to perform maintenance on a SQL Server table in a database. It can be used with any table regardless if it is an online analytical processing (OLAP)database or an online transaction processing (OLTP) database.

I have provided some sample code, and you can access the .zip file containing this code here. In the example provided in the source code, we will insert records that do not exist and overwrite an existing record. If you open the compressed folder, you will see the following files:

  • tSQLMerge.SQL—File contains the MERGE statement relating to the article.

  • Stage_Carrier.SQL—File contains the table definition of the source table.

  • Carrier_SCD.SQL—File contains the table definition of the target table.

The tSQL MERGE statement tries to match rows from the source (STAGE_CARRIER_MASTER) and target (RDS_DIMENSION_CARRIER_SCD) tables using the fields defined as the matching key sets, and performs one of the following: (Open up the file: tSQLMerge.sql)

  • If the source list has a record that matches a record in the target table, it will update that record accordingly. (SECTION A)

SECTION A: WHEN NOT MATCHED THEN INSERT

 

  • If the source list has a record that doesn't match any record in the target table, then we want to insert a new record to the table. We also want to check for the END_DATE = 12/31/2199 (Section B). This field tells us that this is the current record for the dimension. We will want to update that field to the date it was changed by the merge statement (Section C)

 SECTION B: WHEN MATCHED  AND DCS.END_DATE = '12/31/2199' AND

 DCS.SOURCE_UPDATE_DATE < > CAST(SCM.ROW_UPDATE-TS AS DATETIME)

 THEN UPDATE


SECTION C: SET DCS.END_DATE = GETDATE()-2



SECTION C of the code enables you to maintain the record history and surrogate key in the dimension that can be accurately matched to related transaction fact tables.

There is obviously more to the tSQL Merge statement, and this article does not provide a full description of its capabilities. However, I use very similar code in many real world implementations today.

  • John ThumaJohn Thuma

    John Thuma has 20 years of experience in systems integration, large scale project management, and business intelligence. He has worked with various vertical industries, including pharmaceuticals, retail, manufacturing, healthcare, and banking/financial services.

Recent articles by John Thuma

 

Comments

Want to post a comment? Login or become a member today!

Be the first to comment!