ETL (extract transform & load), and ELT (extract load and transform) have been getting interesting rubs lately from different people in the market space. The issue is that most of what I read / see doesn't include dealing with super-large volumes of data. For example: every DFD (data flow diagram) I build is architected to deal with a minimum of 80Million rows per source, and is considered to be a small load. The medium sized loads deal with 150 Million rows per source, and the large loads deal with 300 million and up (to 1 to 2 billion rows per load). In this entry we explore the nature of both ETL and ELT as they relate to this size of data set. We'll cover problems, issues, and architectural changes that need to happen.
We'll also attempt to address the changing nature of ETL, ELT and what really needs to be achieved in the market space (something called ETLT where CHOICE is dependent on the task and size/performance of available hardware resources). With that, let's dive in.
These pros and cons are limited to items which are not necessarily available in both paradigms at the same time. Things like Security, Version Control, Code Optimization, and dependency analysis are available in both paradigms, and thus will not be listed.
For further reading, see my previous entry: http://www.b-eye-network.com/blogs/linstedt/archives/2006/12/etlt_or_elt_eit.php
* Can handle transformation midstream, outside the database walls
* Is tied to the scale and size of the mid-stream transformation hardware it sits on
* Can be spread across a computing grid without the high cost of "DBMS grid technology"
* Is usually database agnostic (as far as the transformations are concerned)
* Usually can manage metadata (Technical / Process Metadata)
* Generally is GUI driven
* Can interface with EAI, and EII along with other component layers like web-services and reporting tools
* Most of the time doesn't require a heavy knowledge of SQL (however, in VLDW situations, it is highly preferable to have deep SQL knowledge).
* Is limited by the platform it sits on
* Is limited by the source and target throughput speeds (including network and dbms systems)
* Must bring the data off the RDBMS engine into the transformation engine in order to tie it into the transform stream.
* Can lead inexperienced developers to setup highly complex mappings (which drive performance down, down, down)
* Cannot take advantage of hardware or DBMS scalability within the database engine platform
* Can require a proprietary "transformation" engine to be run in a middle tier
* Usually houses it's metadata in a non-standard architecture (another database schema/model as opposed to directly in the RDBMS header or system tables).
* Has startup and shutdown time
* Can utilize the partitioning and performance strategies set forth by the RDBMS and table structures
* Can utilize RDBMS specific feature sets
* Keeps up to date with the RDBMS releases
* Can process all data within the RDBMS caches and buffers
* Can process near-real time data with ease
* Is not dependent on network speeds, nor client connectivity (cache sizing)
* Requires 98% SQL Select statements (raw code)
* If it accesses stored procs / functions / triggers, then it tends to be in danger of not executing in parallel at all.
* Can lead to highly complex, difficult to maintain SQL statements.
* Leads to scripts to manage "strings of serially executing SQL statements"
* Makes use of many "temporary tables" in the database, requiring a lot more storage and a lot more horsepower within the DB platform.
* Can "lock up" the RDBMS engine from doing any alternative work
* Usually does not have a lot of accompanying metadata (like process metadata)
ETL's performance is reliant on many different moving parts, while ELT's performance is reliant on just the RDBMS and it's ability to scale. The larger issue is as follows: usually when I tune ETL platforms, there is a lot of performance we can get from the system before any hardware needs to be upgraded to handle additional load. When I tune ELT, usually we hit the need (very quickly) for increasing the RAM caches, and making use of bigger hardware.
In other words, when I tune an 80 million row load map on ETL, we can optimize the ETL engine, network, disk speeds, and database buffers to get anywhere from 80,000 to 120,000 rows per second throughput really quickly.
When I tune the same functionality of ELT (SQL Scripts), we get much higher performance - but require twice the amount of RAM, or CPU, or DISK or all three in order to make it work. If the hardware is scaled properly, ELT can run in the ranges of 120,000 to 250,000 rows per second. The statement? Running batch through ELT in a database can require significant amounts of hardware to make it go, most clients have under-powered database platforms, this can cost a LOT of money to scale up, where-as throwing in another Windows / Linux box for ETL engines might be half or less the comparable cost.
Where's the market today?
Today, most engines are getting smart and offering ETLT (designerâ€™s choice). This is the best of both worlds, as it incorporates the GUI building, metadata management, integration features - as well as an ability to tune performance where necessary. In other words, loads that are acceptable (via user requirements) can take place in the ETL engine, where loads that require an extra performance boost, or particular tuning can take advantage of the ELT components.
What's the recommendation for when to use what?
We suggest that ELT only be used for a) small near-real time transactions or b) very large batches needing CDC in the target - but for this, you MUST be able to scale the hardware of your database engine.
We suggest that ETL be used for everything else, from process flow, to metadata management, to GUI development, re-usability and so on. Apply where appropriate. I've always found that by achieving a balance across both architectures, I can achieve an optimal performance setup - of course I end up having to tune the platforms as well.
What about JAVA ETL / ELT engines can they work for us?
Most open-source engines are written in Java, and yes - as of Java 1.5.x they are fast, efficient, and can be run to solve problems. Are they capable of handling the scale of throughput that we suggest at the top? Not sure, I've never run that kind of load through any of these systems (Pentaho, Talend, etc..) If you've got experience, we welcome the information.
These Java engines are actually quite advanced, and work very well.
Is there anything new on the horizon?
Yes, there is always something new.. But in this case, the nature of Transformation and the fact that it seems to "need" to be done, will stick around for a long long time. For as long as the computer has been around, people have needed it to "manage and transform" data. The new components (as discussed previously) should show up in the form of hardware cards, plug & play pieces that sit right inside the RDBMS platform, and offer specialized transformation - possibly even "plugging" directly into the RDBMS platform as an extension to the database system.
These cards will house their own CPU's, their own memory, their own high-speed busses - and will make use of connecting directly to disk as well as the database engines.
It will be something to watch, the expected performance gain of this kind of solution should be anywhere from 10x to 200x faster than both ETL and ELT platforms that exist today.
Do you want a Masters of Science in Business Intelligence?
Get one from Denver University: http://www.COBICC.org
Posted July 19, 2007 7:17 AM
Permalink | No Comments |