Originally published October 12, 2004
The staging area is a component of the Corporate Information Factory that’s increasingly overlooked, ignored, or taken for granted. Though out of the spotlight, the staging area has evolved into a significant support function for overall ETL processing and the ultimate quality of the data warehouse. It’s worth a closer look.
According to my dictionary, a stage is “a step, level, or period in the development or progress of something”. Originally a “staging area” was a military term for a place where soldiers and military equipment were gathered for final organization and training before deployment. These definitions give you the impression that the stage or staging area was temporary in nature, torn down or removed once the “something” was created or the military operation was underway.
Because the construction of a BI environment is not a one time project but rather consists of many coordinated and related projects, the staging area becomes “persistent”. In other words, certain parts of it (e.g., reference tables, error conditions, conversion tables) endure from project to project in a state of permanence.
The staging area serves some very important pre-processing functions such as:
Most BI implementation teams create the staging area out of necessity, or serendipitously, rather than as part of planned and managed strategy from the beginning. These teams discovered that it was necessary to pre-process data before the ETL processing began. Unfortunately, these humble beginnings lead to staging areas that grow in a haphazard fashions leading to inefficiency or ineffective processing.
It’s best if the staging area is designed and that the design result from the ETL tool you have, the state of your operational systems, and the specific pre-processing requirements needed before ETL processing. You have two choices of modeling the staging area – 1) to optimize the extraction and beginning integrations steps between the operational systems and the staging area or 2) to optimize data flow from the staging area to the data warehouse. If you are lucky, your operational systems are already in relational technologies and are similar in the design to your warehouse. Then the design of the staging area supports both halves getting data into it from operations and getting data from it for the data warehouse.
If your operational systems are a combination of technologies and database designs, you will have to make a choice of which half to support. If you find that it is more important to support capture of data from the operational systems, then your staging area design will be somewhat different from what is in the operational environment. It will not be quite what you find in your data warehouse. The goal is a design that supports “getting data into” the staging area faster, easier and more efficiently. It is more important to capture and hold the data for further processing than it is to get started on the integration and transformation. This puts a heavy burden on both the data warehouse and ETL processing to ensure that the data is properly integrated and transformed before it is loaded into the data warehouse.
Alternatively, it may be that it is more important to lighten the load of the ETL processing by doing data cleansing and light integration/transformation before turning the data over to the ETL processing for the rest of the heavy-duty integration and transformation. In this case, it is necessary to keep the detailed operational data for longer periods of time thus requiring audit trails and significant storage capabilities. Much data cleansing, key matching, and beginning data integration can occur. As data from other systems flows in, the cleansing and integration continues until almost a complete record is formed.
As a final guideline for designing your staging area, remember that the staging area is the programmer’s work area. It is not accessed by any business users EVER. Therefore, the design of this area should support a flexible ETL process with the ultimate goal of the loading of data into your warehouse. For these reasons and others, the model for the staging area should strive to be as similar to the data warehouse’s relational (normalized) model as possible, thus forcing a fair bit of integration to occur in either case.
Getting started with a staging area is also similar to the construction of a data warehouse. You will build the staging area one piece at a time. The incremental build follows the business requirements for data for the ultimate end point, the data mart or analytical applications requested.
This is true with one significant exception. Many people believe that the staging area should also be the storage area for any future data requirements. This means that if you touch an operational system to capture some needed data, you take all of its data and store it in the staging area until there is a requirement to bring it into your BI environment. The argument is that the staging area is designed for the future BI environment not just for today. All well and good but maybe excessively expensive or difficult if we are talking about enormous amounts of data.
My advice is to start conservatively, store what you need and only what you need until you really understand the staging area. If you begin to bring in data on a “just because” basis, then you should also look into near-line or offline data storage mechanisms (tape drive, optical storage, etc.). You may also want to examine some of the data appliance technologies like Netezza, Calpont, and Datallegro to see if these can help with staging area storage and processing needs.
A well-thought out staging area can greatly enhance your overall BI environment. As your data exits the stage and moves into the real spotlight of analysis and reporting, it will be properly prepped and ready for action.
Recent articles by Claudia Imhoff