Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

I've been working recently with the new release of a popular ETL tool that now handles ELT. It is an interesting approach, but one that bears discussion. The systems we are working with are very large in terms of data sets. I'm not talking about large historical data (although that's one issue); I'm talking about large daily data feeds. This customer has terabytes of information to deal with on a daily basis. So if you're like me, and you've used ETL for years and want to know more about ELT, read on. This is a technical entry in my blog. I'll get back to MDM very shortly.

ETL has its place in the data migration and integration world. ETL = extract, transform (in stream), then load to the target database. However, ETL is proving to have issues with super-huge volume sets (like 800 million rows in a single table): once the data has reached the target database and still needs transformation, what do you do? Simple math says that if you can run at a maximum of 50,000 or 100,000 rows per second, it will still take 4.4 hours or 2.2 hours (respectively) to process, and this is just ONE of the 50 or so tables to be processed.
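
If you want to run the same back-of-the-envelope check against your own load window, the arithmetic is trivial. Here is a minimal sketch in Python, using only the row count and throughput figures from the example above (they are illustrative numbers, not benchmarks):

# Back-of-the-envelope load-window check: one 800-million-row table
# at two assumed sustained throughput rates (figures from the example above).
ROWS = 800_000_000

for rows_per_sec in (50_000, 100_000):
    hours = ROWS / rows_per_sec / 3600
    print(f"{rows_per_sec:,} rows/sec -> {hours:.1f} hours for this one table")

# Prints roughly 4.4 hours at 50,000 rows/sec and 2.2 hours at 100,000 rows/sec,
# and that is just one of the ~50 tables in the nightly window.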

When we talk about processing volumes like this, time is critical. It's not to say that ETL can't handle the load. On a large enough machine, with enough processing power, it certainly can be done. But extracting all that data (that you just loaded), then transforming it, then re-inserting it back into the database can be a bit unnerving.

Enter ELT - the ability to perform extremely strong, parallel SQL and transformations within the RDBMS itself. Only certain RDBMS engines, and certain configurations (mostly in the MPP space or the extremely large SMP space), can handle this kind of load in single SQL statements. In an ETL format we try to push the data through an optimized, parallel set of processes, and break the data stream up such that much of the decode, aggregation, filtering and expression logic is held in RAM. In ELT the goal is different: simplify the SQL statements as much as possible, formulate intermediate tables along the way (there could be many of them), and use the power of join operations within the database to resolve all the calculations.

If we are to use tools to gain the functionality of ELT, our goal is very simple: extremely small processes - but lots of them. Each step along the way takes care of one small piece of the transformation logic. Multiple passes over the data are the secret to making ELT a huge success.
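
To make the pattern concrete, here is a minimal sketch in Python of what "lots of small steps" can look like. The table names and transformation logic are invented for illustration, and the standard-library sqlite3 module stands in for the target RDBMS purely so the example runs; on a real MPP engine you would submit the same kind of small, set-based statements through that engine's own client.

import sqlite3

# In-memory SQLite stands in for the target database so the sketch is runnable;
# the point is the shape of the work: many tiny, set-based SQL steps.
conn = sqlite3.connect(":memory:")

# Minimal source, reference and target tables so the example is self-contained.
conn.executescript("""
    CREATE TABLE raw_customer_feed (cust_id INTEGER, cust_name TEXT, region_code TEXT);
    CREATE TABLE ref_region        (region_code TEXT, region_name TEXT);
    CREATE TABLE dim_customer      (cust_id INTEGER, cust_name TEXT, region_name TEXT);
    INSERT INTO raw_customer_feed VALUES (1, '  Acme Corp ', 'NE'), (2, 'Globex  ', 'SW');
    INSERT INTO ref_region        VALUES ('NE', 'Northeast'), ('SW', 'Southwest');
""")

# Each ELT "step" is one small SQL statement that lands an intermediate table;
# the database engine, not a middle tier, does the heavy lifting.
STEPS = [
    # Step 1: keep only the columns we need, with light cleansing and nothing else.
    """CREATE TABLE stg_customer AS
       SELECT cust_id, TRIM(cust_name) AS cust_name, region_code
       FROM raw_customer_feed""",
    # Step 2: resolve the region code with one simple join, nothing else.
    """CREATE TABLE stg_customer_region AS
       SELECT c.cust_id, c.cust_name, r.region_name
       FROM stg_customer AS c
       JOIN ref_region AS r ON r.region_code = c.region_code""",
    # Step 3: one simple set-based insert into the target.
    """INSERT INTO dim_customer (cust_id, cust_name, region_name)
       SELECT cust_id, cust_name, region_name FROM stg_customer_region""",
]

for i, sql in enumerate(STEPS, start=1):
    conn.execute(sql)   # one small pass over the data per step
    conn.commit()

print(conn.execute("SELECT * FROM dim_customer").fetchall())

Each step is trivially simple on its own; the engine can parallelize every set-based statement, and adding, reordering or monitoring steps stays easy because no single step tries to do everything at once.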

This denotes an architectural change in the way we code our "transformation" logic. ETL focused on complex mapping designs split into parallel processes; ELT focuses on small incremental steps - very simple, with extremely fast results. Trying to do too much in ELT will hurt performance and will move too much data over the MPP nodes within the environment. Complexity is the enemy of a single step within the ELT environment.

For instance, in ETL it may take 5 processes to completely cleanse and transform customer data. In ELT that same logic may take 32 processes (steps), but each one operates at a completely different performance level. In ETL those 5 steps may operate at 30,000 rows per second; in ELT, they operate at 100,000 rows per second or better, depending on the hardware and the dispersion of the data sets across the MPP environment.

The point is: if you're beginning to use ELT instead of ETL, you have to think differently - there are different standards, a different architecture, and a different approach to take.

If you've been in the ETL and ELT worlds, I'd love to hear your comments about what you've had to change to make it work.

Thanks,
Dan Linstedt
CTO, Myers-Holum, Inc


Posted September 29, 2006 5:24 AM

4 Comments

Dan,

Coincidentally, I am currently working on a project that was in the QA cycle, and we pulled it back to the design stage and adopted this approach of splitting the process into many subprocesses. Each subprocess does a small piece of work, but is very optimized to handle that logic with the large data volume. It also becomes easy to monitor and support this in production, because the several pieces of complex business rules don't overlap each other. Each subprocess only uses the data necessary for it, rather than churning the whole database to try and complete the process in ONE PASS.

Dan,

I recently implemented an enterprise data warehouse solution using Netezza (MPP) for a UK telco, and to me, with the obviously huge amounts of data around, ELT made much more sense than ETL.

Not only did it speed things up tremendously, but it gave several other large benefits:
- no need to invest in middle-tier servers with lots of memory, or in expensive ETL tools
- made support much easier, as it removes the 'black box' of ETL
- provides a middle layer of data we called the ODS (operational data store): raw, untransformed data, copied from source, all in the same place, that could be traced back to source and checked/audited
- allows traditional SQL transformation skills to be applied, rather than relying on more expensive specialist ETL contract resources

I think this shift in thinking will continue for those who are involved in large-volume data processing...

Dan,

I have been working on a DWH implementation for the last couple of years. I don't see a DWH as a project that finishes; I see a DWH, or better an EDW, as an ongoing, integral part of an organisation that has a continuous need for information.

My experience in this area, and more specifically on the ETL/ELT side of things, is that the infrastructure you use for your ETL work matters a great deal, alongside the RDBMS that is used. Most often these investments are made over a longer period of time, so they must be used as efficiently as possible.
I'm also convinced that looking purely at the tools and the infrastructure is not enough to determine what is best for your specific organisation. Looking at the available skill set, and at the ease of maintenance of the tools used, is also important when evaluating the total solution for keeping an EDW up and running and extending it.

But coming back to my findings and personal ideas on whether to use ETL or ELT, here's my opinion:

  • I'd recommend an ETL tool that can handle both batch and real-time processing, as most DWHs start in a batch-only mode but evolve to combining batch with real-time processing jobs.
  • I'd also recommend a scalable ETL tool for all processing needs (extract, profiling, cleansing, transformation and DB load) from the source systems into the RDBMS.
  • Once data is loaded into the RDBMS (in 3NF, to prevent loss of detailed data), it should either be used by end users by means of views on top of the data, or (when needed to hide too much complexity from end users, or in some cases for performance reasons) be transformed again into its next physical data structure on top of your 3NF. This last in-database transformation is, as said, best done within the database itself, especially when you have an MPP engine that can handle bulk transformations extremely fast thanks to its parallel architecture; see the sketch after this list.


    The above use of two tools (one for real ETL and one for ELT within the database, the ELT 'tool' being the database itself) does raise the subject of needing two different skill sets to develop the ETL and the ELT environments.
  • This last statement brings me back to where I started off. You need to look at the overall picture to decide on the best approach when choosing between ETL and ELT. And should there be any doubt, I'd recommend a combination, taking the best of both worlds.
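
    To illustrate the point about views on top of the 3NF layer, here is a minimal sketch in Python of the 'views on 3NF, materialise only when needed' idea. The table and view names are invented, and the standard-library sqlite3 module only stands in for the real RDBMS so the sketch runs; on an MPP platform the same statements would simply be in-database SQL.

    import sqlite3

    # 3NF landing tables, then a view layer for end users; a physical structure
    # is only materialised when complexity-hiding or performance requires it.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- 3NF landing area
        CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT, region_id INTEGER);
        CREATE TABLE region   (region_id INTEGER PRIMARY KEY, region_name TEXT);
        INSERT INTO region   VALUES (1, 'Northeast'), (2, 'Southwest');
        INSERT INTO customer VALUES (10, 'Acme Corp', 1), (11, 'Globex', 2);

        -- Option 1: a view hides the join complexity from end users.
        CREATE VIEW v_customer AS
            SELECT c.cust_id, c.cust_name, r.region_name
            FROM customer c JOIN region r ON r.region_id = c.region_id;

        -- Option 2: when needed, materialise the same shape as a physical table
        -- with one in-database, set-based statement.
        CREATE TABLE dim_customer AS SELECT * FROM v_customer;
    """)
    print(conn.execute("SELECT * FROM v_customer").fetchall())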

    Happy to discuss this in more detail.

    High-volume, low-complexity transformations perform well in ELT, but high-complexity transformations perform better in ETL. One must also consider the location of the source and target: if they are both on the same database server, then you have an advantage using ELT. One disadvantage of ELT is that it uses up processing power that is meant for users; OLAP computation, ad hoc user queries, and ELT all compete for the same resources. I think that in order to meet today's demanding data integration requirements you need both ETL and ELT working together. Building ETL functionality into the DBMS is bridging that gap.
