Blog: William McKnight

Triaging Source Data and a Persistent Staging Area

One of the most difficult things to do in data warehousing is to engage a new source system. Learning which fields the system can offer the data warehouse, when they are populated, how “clean” they are, and when your extract job can get at them can be daunting. Then, after building the extract jobs, scheduling them, and starting the load cycles, you would want to be set for a while.

Not so fast. Usually 1 day to 2 weeks after putting a data warehouse – any iteration – into production (or prototype), users who previously communicated requirements in abstract terms are now seeing the results and requiring changes. New fields and new transformations are not unheard of at this point.

Although data warehousing is very dynamic, a practitioner can think beyond the initial, spoken requirements and “prime the pump” by bringing additional fields into the ETL process. This concept, known as “triage”, works very well if you have a staging area where the initial load from the source is “dropped” before most of the transformations run.

With triage, the staging area can contain many more fields than are moved forward to the actual data warehouse. Then, if a new field is needed in the warehouse, the source extracts are unaffected (and there is no accompanying disruption of source operations or negotiation with the source system team).
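As a rough illustration of triage, here is a minimal sketch using Python's built-in sqlite3 module. The table names (`stg_customer`, `dw_customer`), the column lists, and the helper functions are all hypothetical and chosen only for the example; a real ETL tool would look different. The point is that the extract into staging carries every triaged field, while the forward load picks only what users have asked for.

```python
import sqlite3

# Hypothetical column lists: staging lands everything the source offers,
# the warehouse starts with only the fields users have requested.
STAGING_COLUMNS = ["customer_id", "name", "region", "credit_limit", "signup_channel"]
WAREHOUSE_COLUMNS = ["customer_id", "name", "region"]


def load_staging(conn, rows):
    """Drop the raw extract into staging, keeping all triaged fields."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS stg_customer ({', '.join(STAGING_COLUMNS)})"
    )
    placeholders = ", ".join("?" for _ in STAGING_COLUMNS)
    conn.executemany(f"INSERT INTO stg_customer VALUES ({placeholders})", rows)


def load_warehouse(conn, columns):
    """Rebuild the warehouse table from staging using only the given columns."""
    conn.execute("DROP TABLE IF EXISTS dw_customer")
    conn.execute(
        f"CREATE TABLE dw_customer AS SELECT {', '.join(columns)} FROM stg_customer"
    )


def warehouse_columns(conn):
    """Return the column names currently present in the warehouse table."""
    return [row[1] for row in conn.execute("PRAGMA table_info(dw_customer)")]
```

When a new requirement arrives, say for `credit_limit`, only the forward load changes: calling `load_warehouse(conn, WAREHOUSE_COLUMNS + ["credit_limit"])` picks the field up from staging, and `load_staging` and the source extract behind it stay exactly as they were.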

But wait, you say. "What about the historical data that usually accompanies such new data sourcing?"

The concept of the persistent staging area is to keep all data in the staging area, both from a “triaged” perspective (described above) and a historical one. That way, when requirements change post-production, you not only have the ETL “primed”, you also have the historical data primed in the persistent staging area and ready to be moved forward to the warehouse.
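To make the historical side concrete, here is a small sketch, again with Python's sqlite3 module and purely hypothetical names (`pstg_order`, `dw_order`, `promo_code`, and the three helpers). It assumes each extract is appended to staging with a load date and never truncated, so a field that was triaged but not initially requested can later be backfilled into the warehouse for every past load.

```python
import sqlite3


def append_extract(conn, load_date, rows):
    """Append one extract to the persistent staging table; nothing is deleted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pstg_order "
        "(load_date TEXT, order_id INTEGER, amount REAL, promo_code TEXT)"
    )
    conn.executemany(
        "INSERT INTO pstg_order VALUES (?, ?, ?, ?)",
        [(load_date, *row) for row in rows],
    )


def load_warehouse(conn):
    """Initial warehouse load: only the fields requested at go-live."""
    conn.execute("DROP TABLE IF EXISTS dw_order")
    conn.execute(
        "CREATE TABLE dw_order AS SELECT load_date, order_id, amount FROM pstg_order"
    )


def backfill(conn, column):
    """Add a triaged column to the warehouse and fill it for all history.

    The column name is interpolated directly, which is acceptable only
    because this is a sketch with trusted input.
    """
    conn.execute(f"ALTER TABLE dw_order ADD COLUMN {column}")
    conn.execute(
        f"UPDATE dw_order SET {column} = ("
        f"SELECT s.{column} FROM pstg_order AS s "
        "WHERE s.load_date = dw_order.load_date AND s.order_id = dw_order.order_id)"
    )
```

After several calls to `append_extract` and one `load_warehouse`, a later `backfill(conn, "promo_code")` populates the new warehouse column for every historical row without going back to the source system. The sketch rebuilds the warehouse in one pass for brevity; a production load would more likely be incremental.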

Persistent staging areas almost always require a separate DBMS instance from the data warehouse DBMS due to the volume that will accumulate in them.

Since historical data is also kept in the warehouse, what distinguishes the persistent staging area is that it captures triaged data, ready for historical loading when that data is required post-implementation. It will be bigger than the warehouse itself.

Although I do not usually use this technique in my data warehouses, it would be very applicable if there were a high likelihood that requirements would be very dynamic after production and disk cost were not an issue.

Posted by William McKnight on July 4, 2007 10:49 AM