In the article “The State of Healthcare BI,” we outlined the top ten things healthcare organizations need to do in order to invest in business intelligence (BI). One of those things was extract, transform & load (ETL). Very few functions of business intelligence are as important as ETL. It will, without question, make or break your BI program. When people refer to "garbage in, garbage out," they are really referring to ETL, although they likely don’t realize it. ETL is the mechanism most BI efforts use to get data out of the source systems and into an integrated data warehouse. More complex than a simple “dump and load,” which many of us do, good ETL offers us the opportunity to not only apply business rules (the "T" in ETL), but also introduce quality checks into the data.
In this article, we will review the current state of ETL; and just as in the previous article, we will outline the top ten things you need to consider as you create your ETL function. The reason that we called out ETL as its own article is simple: In almost any data warehouse project the ETL portion of the project consists of 80% of the effort, and usually introduces the most risk.
The Current State of ETL in Healthcare
There is a fair amount of variation in the degree of ETL that healthcare organizations do. Similar to the current state of healthcare business intelligence, healthcare payers are ahead of the game. That is because the business rules for claims data are a bit easier to tackle. For their part, providers (hospitals as well as clinics and physician networks) do some nominal ETL. The primary barrier for these organizations is that they don’t invest in the IT resources required to do this level of work. When they do, their goal is to get the data out of systems, with little thought to actual integration. Falling under the category of "dump & load," organizations’ databases, views, tables, etc. become a kluge of large piles of data with few relationships. One hospital we worked with last year had struggled under the weight of their data structures. Each time IT got a request for a report, they created another view of the data. When we got there, they had more than 2,000 views, with an incredible amount of replication and very few relationships. It’s easy to understand how this happens. IT in hospitals, if it exists, exists to support the business-critical systems. Business-critical takes on a whole new meaning for hospitals. Getting data out of source systems seems a relatively low priority versus ensuring that the electronic health record (EHR) system is up and running, along with all the other technology devices that exist in every hospital room today.
Larger health systems and hospitals have been doing business intelligence and data integration for a while. A health system in Minnesota has spent the better part of three years integrating their financial and clinical data – no small feat. But for the rest of us who have pressures that have nothing to do with business intelligence, how do we tackle this work? Now, in the face of Meaningful Use, with 2012 measures requiring data repositories to track outcomes, the motivations have to change.
For the rest of this article, we will outline the top ten things specific to ETL that a healthcare organization should consider as they begin their BI journey. Not all of these will apply to everyone, but they will help guide your efforts.
- ETL Best Practices
Just like any other best practices, we need to follow software development lifecycle best practices for ETL as well for code reusability, robustness, flexibility and manageability. Standards that should be created and followed for any successful ETL project include: naming conventions, error handling and notification, reusability, metadata management and failure and recovery processes.
- Enterprise Level Data Integration View
The ETL project team should have exposure to the enterprise-wide data and integration initiatives so that they can design conforming dimensions and conforming facts to leverage the knowledge assets created. Without this enterprise-wide view, your design may miss an integral piece of the puzzle for future development and therefore require rework. In order to ensure that this doesn’t happen, identify a resource on the ETL team that can interact with the project team to ensure all design requirements are heard firsthand (a good practice from Agile methods).
- Data Profiling and Data Quality
A clean data source will require minimal transformations, while a “noisy” data source will require extensive data transformations. “Noisy” data is data that doesn’t abide by the data quality standards set by the organization. For example, in a field that has the standard of four-character numeric value with a decimal point in the third place, you find a value of 9.999999999. When this occurs often, it is considered “noisy” data.
It is very important to determine what level of data quality is expected from the source system. By applying data profiling, the ETL team may reveal noisy data sources that cannot be supported by an ETL process and need some additional ETL processes to handle the requirements. By performing data profiling, the ETL team may determine some additional data quality ETL processes to overcome the noisy data sources to fulfill customers' requirements.
- Utilize a Data Staging Area
Store data temporarily in a staging area so that it mimics the sources system for data controls, quick/frequent look-ups and insulates the data warehouse repository from source system changes. This staging area also helps with audit controls and recovery times if there is a job failure.
- Encourage Reusable Transformations
A transformation is a key building block of an ETL process, so defining adequate standards and encouraging the ETL developers to design reusable transformations will save time, money, and avoid errors.
- Data Latency
Here, data latency refers to how quickly the source data must be delivered for end-user consumption. Consequently, it is a good idea to determine this well in advance to avoid latency issues that ultimately may lead to poor end-user adoption of the data warehouse.
There has long been debate among BI practitioners in healthcare about data latency. Many question the value of real-time data in healthcare since so much data transformation has to take place in order for the data to be useful to the average person. “Zero latency” is a popular concept. It allows data to be appropriately transformed but gets it to the data warehouse in an expeditious manner, rather than waiting for the next overnight batch job.
- Gracefully Handling ETL Errors
There is no question that errors will happen. When they do, it is critical for the ETL architects to have created a design that handles ETL errors upon initiation of an ETL job and allows for a restart automatically when errors occur. A process for addressing these errors is necessary. The records that fail the business rules should be written to a file for the business owners or data stewards to investigate. This should prompt a change of business rules to account for the new information and for future reprocessing.
- ETL Testing
We know how critical the data in a data warehouse is when it integrates data from different sources using ETL. Thus, it is very important to involve the testing team along with the development team as early as during the requirements gathering stage so that they are in a position to define the high-level test approach, test estimation, test plan creation, test case creation and test case execution.
- Knowledge of ETL Tools
Using an ETL tool without understanding its capabilities and ETL processes is a major issue we have seen in many data warehousing and business intelligence projects. To be successful, it is very important to have experienced and knowledgeable ETL architects, designers and developers on the team. Constantly review the ETL team's training needs and provide the ETL tool knowledge appropriately.
- Compliance Requirements
Healthcare has some of the most stringent regulations of any industry. The proposed ETL solution must plan for the compliance requirements placed on it by state and federal regulations, industry best practices, standards and guidelines. Ensure that you have a method for audit controls, which require the ability to track the data to the original source and the original state of the data.
ETL is critical to the success of any data warehouse project, particularly for healthcare, because so many transformations are needed to ensure that the data is usable for end users. If you take the time to follow these ten best practices for ETL for healthcare, you will be set for success.
Recent articles by Laura Madsen, Rao Nemani, Ph.D.