We use cookies and other similar technologies (Cookies) to enhance your experience and to provide you with relevant content and ads. By using our website, you are agreeing to the use of Cookies. You can change your settings at any time. Cookie Policy.


Technical Planning for Data Warehouse Migration

Originally published May 6, 2010

An evolving technology road map, end of license/support contract and existing operational challenges are some of the factors that create the need for data warehouse (DW) enhancements. Such enhancements are the drivers for data warehouse migration. Generally, DW migrations are driven by the technology group of the enterprise. As the business is not directly involved, it demands additional planning to make the migration transparent (or near transparent) to business activities.
 
This article highlights and discusses some of the important technical aspects of data warehouse migration. It assumes a state where the decision for migration has already been made and the solutions identified. The article focuses on dimensionally modeled data warehouse solutions hosted on enterprise infrastructure and software.
 
Data warehouse migration consists of migrating the data integration solution and/or the database solution. In the rest of this article, we will discuss these scenarios and some important considerations related to them.

Migration of the Data Integration Solution

With respect to the existing ETL tool, this change could either be a simple version upgrade or a total replacement with a new solution/tool. We will discuss these one by one.
 
ETL Version Upgrade
 
ETL version upgrades are generally in-place upgrades (in the existing environment), and tool vendors typically provide the upgrade process and support. However, the project should include plans for the following:
  1. ETL Rollback – In case the new version installation fails, it is important to have a quick restoration of the AS-IS environment. Determine if the vendor provides a rollback mechanism to the older (AS-IS) version. If not, such a mechanism needs to be devised in house before the actual migration.
  2. Database Rollback – In the case of implementation failure, the data warehouse data may have already been altered in some way. Using the database and/or ETL features, a rollback path should be defined in house.
  3. Schedule Changes – In terms of performance, the to-be solution will most likely be better than the existing solution. Hence, the existing job schedules should be reviewed for suitable changes.
ETL Replacement
 
ETL replacement is generally not done in place, but rather in a different production environment. Such a migration generally requires a mix of migration and development activities. In addition to the normal migration and development planning, a parallel run phase should be defined. Parallel run is discussed in detail in a later section of this article.

Migration of the Database Solution

Database Version Upgrade
 
Database version upgrade can be done in place or in a new environment.
 
Database vendors generally provide and support the upgrade path for in-place migration. Also, such a migration saves the actual data migration from the as-is to the to-be environment. However, aspects like downtime, backup and failover need to be thoroughly planned.
 
If the challenges of in-place migration overwhelm its benefits, the upgrade should be planned as a fresh database installation followed by complete data migration. Data migration aspects are discussed in detail in a later section of this article.
 
Database Replacement
 
Database replacement will always be implemented as a separate installation followed by complete data migration. Here, we will not be covering the generic aspects of database migration. Aspects specific to data warehouse data migration are discussed in detail in the section below. 

Actual Data Migration

Enterprise data warehouses and data marts contain years of detailed and summarized data. Given the sheer amount of data, the migration could be a time-consuming activity. Some data warehouses have a daily ETL load that runs from the close of the business day all the way until the start of next business day. This makes it difficult to accomplish the data migration without impacting the normal data warehouse processing. In such a scenario, it is worth exploring whether the data can be migrated in a phased manner and hence in a smaller time window.
 
Phased Data Migration Approach
 
The following should be explored while migrating the data in such scenarios:
  1. For daily fact tables, check for date-based partitions. If daily partitions exist, it may suffice to migrate only a few months’ worth of data in the first shot.
  2. For aggregated fact tables, the data volumes should be reasonable and hence can be migrated in one shot. However, if the underlying daily facts are not completely migrated, this may break the aggregate-to-detail drill down reports in the BI layer. Hence, the partitions of aggregate facts that are migrated should be synchronized with the corresponding daily facts.
  3. For migrating dimensions, the following should be considered:

    1. If the dimension is of slowly changing dimension Type 1 (SCD-1) or SCD-3, migrate it completely in one shot. These tables are generally not huge.
    2. If the dimension is implemented as SCD-2 and is really a slowly changing dimension, then migrate it completely in one shot. These tables are not big either.
    3. If the dimension is implemented as SCD-2 but over time has become a fast changing dimension, then the tables can be huge. Most probably these dimension tables will be partitioned. In such a case, in the first shot, pick the partitions corresponding to the date range of the facts to which the dimension is linked. Even if the dimension table is not partitioned, the relevant dimension records can be obtained based on the effective start date of the records.
    4. All other tables manifested as helper tables, bridge tables, etc., should be migrated corresponding to the date range dimensions/facts that they are linking or referring to. 
A lot depends on the calendar when the migration is planned. For example, during financial year end, decision support groups may need more history, thereby limiting the scope of phased data migration. Thus, thoughtfully planning the migration date could be of great help.
 
Handling Surrogate Keys

Surrogate keys1 are of immense use in data warehousing. But they can complicate a migration if not handled carefully. During the data migration, one time ETL jobs may be required for the copying of the dimensions and facts and for reconciling the surrogate keys.

Approach for Parallel Run

The parallel run is a hyper-care state post deployment and is generally done for critical and complex projects. The following points are worth considering:
  1. Duration of the parallel run – As a guiding principle, the parallel run should cover at least one cycle of the daily, weekly and monthly ETL run. So a parallel run can span from a month to a quarter of a year. Stretching the parallel run in excess of a quarter could be prohibitively expensive and hence is discouraged.
  2. A few execution and reconciliation questions should be carefully thought while designing the parallel run:

    1. Should the dimensions be built separately or shared between the two environments?
    2. In case of data discrepancy, how will the dimensions and facts across the environments be reconciled?
    3. How much offset should be planned between the ETL load windows?
    4. Which environment will the reporting layer point to, and how will it failover in case of faults?

  3. Change management – Business critical change requests generally can’t afford a moratorium for migration of an existing data warehouse. These change requests in most cases will require updating existing dimensions and/or facts. Managing such changes during the parallel run could be very tricky. A sound communication and change management strategy for such scenarios should be worked up in advance.

Archival and Decommissioning

Finally, once the data warehouse has been successfully migrated, the old infrastructure, old ETL solution and few redundant tables are left behind. The infrastructure should be planned for decommissioning and software licenses should be freed and made available for re-use. The redundant tables can either be archived in a low cost storage medium or purged with due approvals.

Summary

In many ways, a data warehouse migration is different for a normal application or data migration. A pragmatic approach to identify and design for the data warehouse specific challenges can make the migration successful.

Reference:
  1. Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), John Wiley & Sons, ISBN 0-471-20024-7.
  • Binayak DuttaBinayak Dutta
    Binayak Dutta (CBIP, CDMP) is a Technology Architect (DW/BI) with Infosys Technologies Limited. He is a Computer Science engineering graduate with 10 years of work experience in information technology. For last 6 years, he has worked extensively in DW/BI space. His areas of specialization are dimensional modeling, data warehousing and business intelligence. Additionally, he takes interest in exploring the changes in DW/BI landscape forced by evolving technology, open source and vendor consolidation.  Binayak can be reached at binayak_dutta@infosys.com.
 

Comments

Want to post a comment? Login or become a member today!

Posted May 6, 2010 by Kurma Deepak

Hi Binayak,

The Article published by you is really very informative. Thanks man

Is this comment inappropriate? Click here to flag this comment.