I've been working recently with the new release of a popular ETL tool that now handles ELT. It's an interesting approach, and one that bears discussion. The systems we are working with are very large in terms of data sets. I'm not talking about large historical data (although that's one issue); I'm talking about large daily data feeds. This customer has terabytes of information to deal with on a daily basis. So if you're like me, and you've used ETL for years and want to know more about ELT, read on. This is a technical entry in my blog; I'll get back to MDM very shortly.
ETL has its place in the data migration and integration world. ETL = extract, transform (in stream), then load to the target database. However, ETL is proving to have issues with super-huge volume sets (say, 800 million rows in a single table). Once the data has reached the target database and still needs transformation, what do you do? Simple math says that if you can run at a maximum of 50,000 or 100,000 rows per second, it will still take 4.4 hours or 2.2 hours (respectively) to process, and that's just ONE of the 50 or so tables to be processed.
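To spell out that arithmetic:

  800,000,000 rows / 50,000 rows per second = 16,000 seconds, or about 4.4 hours
  800,000,000 rows / 100,000 rows per second = 8,000 seconds, or about 2.2 hours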
When we talk about processing volumes like this, time is critical. It's not to say that ETL can't handle the load. On a large enough machine, with enough processing power, it certainly can be done. But extracting all that data (that you just loaded), then transforming it, then re-inserting it back into the database can be a bit unnerving.
Enter ELT: the ability to perform extremely strong, parallel SQL and transformation logic within the RDBMS. Only certain RDBMS engines, in certain configurations (mostly in the MPP space or extremely large SMP space), can handle this kind of load in single SQL statements. In ETL we try to push the data through an optimized parallel set of processes, breaking the data stream up so that much of the decoding, aggregation, filtering, and expression work is held in RAM. In ELT the goal is different: simplify the SQL statements as much as possible, formulate intermediate tables along the way (there could be many of them), and use the power of the join operations within the database to resolve all the calculations.
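Here's a minimal sketch of the pattern. The table and column names are hypothetical, and the syntax assumes a database that supports CREATE TABLE AS SELECT (your engine may want a slightly different form):

  -- The data is already loaded (the "L" happens before the "T"), so the
  -- transformation is done in simple, set-based SQL inside the database.
  -- Step 1: resolve customer keys with one simple join into an intermediate table.
  CREATE TABLE stg_orders_keyed AS
  SELECT o.order_id,
         o.order_amount,
         c.customer_key
  FROM   raw_orders o
  JOIN   dim_customer c
    ON   c.customer_source_id = o.customer_source_id;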
If we use tools to gain the functionality of ELT, our goal is very simple: extremely small processes, but lots of them. Each step along the way takes care of one small piece of the transformation logic. Making multiple passes over the data is the secret to making ELT a huge success.
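Continuing the hypothetical sketch above, each pass stays trivially simple and feeds the next:

  -- Step 2: filter in one small pass of its own.
  CREATE TABLE stg_orders_clean AS
  SELECT order_id,
         customer_key,
         order_amount
  FROM   stg_orders_keyed
  WHERE  order_amount IS NOT NULL;

  -- Step 3: aggregate in another small pass.
  CREATE TABLE stg_customer_totals AS
  SELECT customer_key,
         SUM(order_amount) AS total_amount
  FROM   stg_orders_clean
  GROUP BY customer_key;

Each statement is simple enough for the database optimizer to parallelize fully across the nodes, which is the whole point.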
This denotes an architectural change in the way we code our "transformation" logic. ETL focused on complex mapping designs split into parallel processes; ELT focuses on small incremental steps, each very simple, with extremely fast results. Trying to do too much in a single ELT step will hurt performance and move too much data across the MPP nodes in the environment. Complexity is the enemy of a single step in the ELT environment.
For instance, in ETL it may take 5 processes to completely cleanse and transform customer data. In ELT that same logic may take 32 processes (steps), but each one operates at a completely different performance level. In ETL those 5 steps may operate at 30,000 rows per second; in ELT they operate at 100,000 rows per second or better, depending on the hardware and the distribution of the data sets across the MPP environment. The extra steps pay off because each one is simple enough to run fully parallel inside the database, and later passes typically operate on smaller, already-reduced sets.
The point is: if you're beginning to use ELT instead of ETL, you have to think differently - there are different standards, a different architecture, and a different approach to take.
If you've been in the ETL and ELT worlds, I'd love to hear your comments about what you've had to change to make it work.
CTO, Myers-Holum, Inc.
Posted September 29, 2006 5:24 AM