All the (BI) World’s a Stage…

Originally published October 12, 2004

The staging area is a component of the Corporate Information Factory that’s increasingly overlooked, ignored, or taken for granted.  Though out of the spotlight, the staging area has evolved into a significant support function for overall ETL processing and the ultimate quality of the data warehouse.  It’s worth a closer look.

According to my dictionary, a stage is “a step, level, or period in the development or progress of something”. Originally a “staging area” was a military term for a place where soldiers and military equipment were gathered for final organization and training before deployment. These definitions give you the impression that the stage or staging area was temporary in nature, torn down or removed once the “something” was created or the military operation was underway.

Because the construction of a BI environment is not a one time project but rather consists of many coordinated and related projects, the staging area becomes “persistent”. In other words, certain parts of it (e.g., reference tables, error conditions, conversion tables) endure from project to project in a state of permanence.

The staging area serves some very important pre-processing functions such as:

  • Providing a collection point for common data arriving at different times from different sources. This ensures that extraction can be performed based on the source system’s operating schedule rather than the data warehouse’s loading schedule. Disconnecting these two schedules ensures minimal impact on the operational systems while guaranteeing that ETL processing will occur in a timely fashion.
  • Supplying a storage place for persistent key and code conversion tables. A large part of staging is simply getting the operational data to line-up with the proper set of keys or identifiers. If nothing else is done to the data but matching up the operational records to the proper data warehouse key, you have still greatly improved the overall ETL processing.
  • Incorporating data quality verification and data cleansing preprocessing before ultimately loading the data into the data warehouse or operational data store. This goes along with the key matching. Any type of data clean-up done before ETL processing is a boon to the overall data acquisition process. A staging area can be very useful in setting up  the data hygiene routines.
  • Providing a holding area to facilitate loading the data into the warehouse or operational data store. Sorting the data before invoking the load utilities is another way to increase the efficiency and performance of your overall data acquisition processing.
  • Storing invalid records that could reside in a suspense file. The staging area is a triage center as well for data that simply can’t be fixed through normal channels. Once fixed, the data can then rejoin the data acquisition processing.

Most BI implementation teams create the staging area out of necessity, or serendipitously, rather than as part of planned and managed strategy from the beginning. These teams discovered that it was necessary to pre-process data before the ETL processing began. Unfortunately, these humble beginnings lead to staging areas that grow in a haphazard fashions leading to inefficiency or ineffective processing.

It’s best if the staging area is designed and that the design result from the ETL tool you have, the state of your operational systems, and the specific pre-processing requirements needed before ETL processing. You have two choices of modeling the staging area – 1) to optimize the extraction and beginning integrations steps between the operational systems and the staging area or 2) to optimize data flow from the staging area to the data warehouse. If you are lucky, your operational systems are already in relational technologies and are similar in the design to your warehouse. Then the design of the staging area supports both halves getting data into it from operations and getting data from it for the data warehouse.

If your operational systems are a combination of technologies and database designs, you will have to make a choice of which half to support. If you find that it is more important to support capture of data from the operational systems, then your staging area design will be somewhat different from what is in the operational environment. It will not be quite what you find in your data warehouse. The goal is a design that supports “getting data into” the staging area faster, easier and more efficiently. It is more important to capture and hold the data for further processing than it is to get started on the integration and transformation. This puts a heavy burden on both the data warehouse and ETL processing to ensure that the data is properly integrated and transformed before it is loaded into the data warehouse.

Alternatively, it may be that it is more important to lighten the load of the ETL processing by doing data cleansing and light integration/transformation before turning the data over to the ETL processing for the rest of the heavy-duty integration and transformation. In this case, it is necessary to keep the detailed operational data for longer periods of time thus requiring audit trails and significant storage capabilities. Much data cleansing, key matching, and beginning data integration can occur. As data from other systems flows in, the cleansing and integration continues until almost a complete record is formed.

As a final guideline for designing your staging area, remember that the staging area is the programmer’s work area. It is not accessed by any business users EVER. Therefore, the design of this area should support a flexible ETL process with the ultimate goal of the loading of data into your warehouse. For these reasons and others, the model for the staging area should strive to be as similar to the data warehouse’s relational (normalized) model as possible, thus forcing a fair bit of integration to occur in either case.

Getting started with a staging area is also similar to the construction of a data warehouse. You will build the staging area one piece at a time. The incremental build follows the business requirements for data for the ultimate end point, the data mart or analytical applications requested.

This is true with one significant exception. Many people believe that the staging area should also be the storage area for any future data requirements. This means that if you touch an operational system to capture some needed data, you take all of its data and store it in the staging area until there is a requirement to bring it into your BI environment. The argument is that the staging area is designed for the future BI environment not just for today. All well and good but maybe excessively expensive or difficult if we are talking about enormous amounts of data.

My advice is to start conservatively, store what you need and only what you need until you really understand the staging area. If you begin to bring in data on a “just because” basis, then you should also look into near-line or offline data storage mechanisms (tape drive, optical storage, etc.). You may also want to examine some of the data appliance technologies like Netezza, Calpont, and Datallegro to see if these can help with staging area storage and processing needs.

A well-thought out staging area can greatly enhance your overall BI environment. As your data exits the stage and moves into the real spotlight of analysis and reporting, it will be properly prepped and ready for action.

  • Claudia ImhoffClaudia Imhoff
    A thought leader, visionary, and practitioner, Claudia Imhoff, Ph.D., is an internationally recognized expert on analytics, business intelligence, and the architectures to support these initiatives. Dr. Imhoff has co-authored five books on these subjects and writes articles (totaling more than 150) for technical and business magazines.

    She is also the Founder of the Boulder BI Brain Trust, a consortium of independent analysts and consultants (www.BBBT.us). You can follow them on Twitter at #BBBT

    Editor's Note:
    More articles and resources are available in Claudia's BeyeNETWORK Expert Channel. Be sure to visit today!

     

Recent articles by Claudia Imhoff

 

Comments

Want to post a comment? Login or become a member today!

Be the first to comment!