The Trial-and-Error Method for Data Integration by Rick Sherman - BeyeNETWORK



The Trial-and-Error Method for Data Integration

Originally published October 6, 2009

Implementing a data integration program is not a task to be taken lightly. Let’s take a quick look at some areas where companies tend to go wrong in data integration or extract, transform and load (ETL) processing.

Not Developing an Overall Architecture and Workflow

The usual development approach for data integration is to gather the data requirements, determine what data is needed from source systems, create the target databases such as a data warehouse, and then code. This is an incomplete, bottom-up approach. It needs to be coupled with a top-down approach that emphasizes an overall data integration architecture and workflow.

Some of the design considerations often lost with a bottom-up approach include:

  • How and where do you implement a refresh (replacing all your data) rather than a change data capture (CDC) approach?

  • How do you process dimensional data, such as products and customers, in relation to facts, such as business transactions?

  • When do you filter and aggregate data?

  • When do you use staging tables, and are they persistent or transient?

  • How do you handle data quality?

  • Do you reject, suspend or flag data with perceived data quality problems?

  • How do you handle the changes and gaps in historical data?

Failing to address these considerations in the beginning can delay your project, increase costs, reduce perceived data quality and cause business users to question the value of your business intelligence (BI)/data warehousing (DW) efforts.
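The refresh-versus-CDC question in the first bullet can be sketched in a few lines. This is a minimal illustration, not a production pattern: the table names, keys and snapshot-comparison approach are invented for the example, and real CDC implementations usually read database logs or change tables rather than diffing full snapshots.

```python
# Illustrative sketch: full refresh vs. change data capture (CDC) by
# snapshot comparison. Keys and values are hypothetical.

def full_refresh(target: dict, source: dict) -> dict:
    """Replace all target rows with the current source rows."""
    return dict(source)

def cdc_delta(previous: dict, current: dict):
    """Derive inserts, updates and deletes by comparing two snapshots
    keyed on the business key."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = [k for k in previous if k not in current]
    return inserts, updates, deletes

prev = {"P1": ("Widget", 9.99), "P2": ("Gadget", 4.50)}
curr = {"P1": ("Widget", 10.49), "P3": ("Gizmo", 2.00)}

ins, upd, dels = cdc_delta(prev, curr)
# One insert (P3), one update (P1's price), one delete (P2)
```

A full refresh is simpler but reprocesses everything on every load; CDC touches only what changed, which is why deciding where each applies belongs in the overall architecture rather than in individual jobs.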

Thinking that Data Quality is a Product Rather than a Process

People often assume that data quality problems are simply data errors or inconsistencies in the transactional systems that can be fixed with data quality products. They overlook, and therefore don't try to prevent, the problems that arise when you integrate data from disparate source systems into a data warehouse.

They're not seeing that many data quality problems are really data consistency and integrity issues that arise when you integrate data from multiple transaction systems. The differences in dimensional data, such as product (part IDs, code and hierarchy), customers (business and/or people), suppliers, partners and employees, become an issue on the enterprise level, which is precisely the target for DW and performance management reporting and analysis.
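The dimensional differences described above are typically resolved by conforming each source system's identifiers to a single surrogate key. The sketch below is a simplified illustration of that idea; the system names, product IDs and cross-reference structure are all hypothetical, and real implementations rely on matching rules or data stewards to decide which source records represent the same entity.

```python
# Hypothetical sketch of conforming product IDs from two source systems
# into one surrogate key via a cross-reference table.

xref = {}        # (source_system, source_id) -> surrogate key
_next = [1]

def conform(source_system: str, source_id: str) -> int:
    """Return a stable surrogate key for a source-system product ID,
    assigning a new key the first time an ID is seen."""
    k = (source_system, source_id)
    if k not in xref:
        xref[k] = _next[0]
        _next[0] += 1
    return xref[k]

# The same physical product carries different IDs in each system.
sk = conform("ERP", "PART-0042")
# A matching rule (or a data steward) ties the CRM ID to the same key:
xref[("CRM", "42-A")] = sk
# Both source IDs now resolve to one conformed product
```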

Even when data quality problems are not your fault, you still need to take responsibility to proactively measure, monitor and report on data quality metrics as you load your DW and data marts. You might not own fixing the problems, but you certainly own measuring them. The business should not make decisions using enterprise data of unknown quality.

Measure data quality by obtaining data quality requirements in your initial business requirements phase, incorporating data quality metrics into your architecture, monitoring those metrics in all your data integration processes and reporting on data quality so the business users understand the data on which they are basing their decisions.
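The monitoring step above can be as simple as profiling each batch as it loads. The following sketch shows two basic metrics, null rates on required fields and duplicate business keys; the field names and sample rows are invented for illustration.

```python
# Illustrative data quality profiling during a load; fields and rows
# are hypothetical.

def profile(rows, key_field: str, required_fields):
    """Compute simple data quality metrics for a batch of rows."""
    total = len(rows)
    null_counts = {f: sum(1 for r in rows if not r.get(f))
                   for f in required_fields}
    keys = [r.get(key_field) for r in rows]
    duplicate_keys = total - len(set(keys))
    return {
        "row_count": total,
        "null_rate": {f: c / total for f, c in null_counts.items()},
        "duplicate_keys": duplicate_keys,
    }

rows = [
    {"customer_id": "C1", "name": "Acme", "country": "US"},
    {"customer_id": "C2", "name": "", "country": "DE"},
    {"customer_id": "C2", "name": "Globex", "country": None},
]
metrics = profile(rows, "customer_id", ["name", "country"])
# Flags one duplicate customer_id and a one-in-three null rate on
# name and country
```

Publishing metrics like these alongside each load gives business users the visibility into data quality that the paragraph above calls for.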

Assuming Custom Coding is Faster than ETL Development

While most large enterprises have embraced ETL development as a best practice, the reality is that custom coding is still prevalent, especially in the following areas:

  • The small-to-medium business market is not using ETL tools as extensively as their larger brethren.

  • Companies that have used database vendors' ETL tools (particularly older versions of these tools, such as Microsoft DTS and Oracle Warehouse Builder) may be using them just to run SQL scripts or stored procedures. Although they are technically using ETL tools, in reality they are writing custom code and just using the ETL tool to run it.

  • Many BI applications need to build summary or aggregation tables (these should be data marts, but many times they are not). These reporting tables are often built using SQL scripts or stored procedures. The BI developers know SQL, feel comfortable in creating tables with it and do not feel the need to use an ETL tool.

  • Business groups are using custom coding to create countless data shadow systems (also known as spreadmarts) in large and small enterprises. They may have asked for a flat file extract from IT, or they used a BI tool to create their own flat file extract. They then use a combination of Microsoft Access and Microsoft Excel to perform their ETL work.

In all the cases just mentioned, I get the usual feedback: "Rick, how can it be faster to code with an ETL tool, and how can the ETL code perform as well as the code I write?" Without debating the merits of custom coding versus ETL development, let's just say that when I see a large block of custom code, I see a great opportunity for change and improvement. But custom coding often stays below the radar because most people simply assume that everyone is using ETL tools.

In general, people tend to take an oversimplified view of data integration. By not seeing it as a process to be managed from a top-down perspective, they tend to run into problems and learn the hard way - through trial and error.


  • Rick Sherman

    Rick has more than 20 years of business intelligence (BI), data warehousing (DW) and data integration experience. He is the founder of Athena IT Solutions, a Boston-based consulting firm that provides DW/BI consulting, training and vendor services; prior to that, he was a director/practice leader at PricewaterhouseCoopers. Sherman has published more than 50 articles, is an industry speaker and has been quoted in CFO and BusinessWeek. He also teaches data warehousing at Northeastern University's graduate school of engineering. You can reach him at rsherman@athena-solutions.com and follow him on Twitter at https://twitter.com/rpsherman.





Posted October 11, 2009 by Ed Gillespie

Excellent piece, Rick.   Recently read a PBBI article that cited the cost of poor data quality at $8.2 million a year for the average company.   Definitely worth looking at this as a process.  If interested, can read the full piece at http://ebs.pbbiblogs.com/2009/09/08/the-business-case-for-data-quality/


Posted October 6, 2009 by Marc Paradis

An excellent article Rick - I couldn't agree more that integration efforts begin and end at the center of gravity between bottom-up and top-down considerations.  Integration efforts, especially enterprise integration efforts, should be undertaken with the rigor and structure of good Enterprise Architecture.  The challenge, of course, with any enterprise implementation is to maintain agility - especially in today's economic climate, there must be a near-term positive ROI/TCO/TEI, and preferably one that also satisfies an end-user need/pain point/job-to-be-done.
