Business Intelligence Network business intelligence resources

Blog: Dan E. Linstedt

« High Volume, Low Performance - Can CDC Help? | Main | Data Integration - Performance Numbers »

ETLT or ELT - Either way, pull back the sheets.

Ok, I've said it before in previous entries, I've discussed ELT and ETL and loss of metadata here on the blog before. I've worked in both situations, I've worked in VLDW for 8 years, I've worked in ADW (Active or real-time data warehousing) for the past 3 to 4 years, I've been involved with non-data warehousing data integration projects using ETL and massive volumes. Now I'll say it again - there's a difference, a time a process, or set of processes, and singular points of architecture where everything converges. If you've seen me present VLDW, you've seen my Pyramid diagram that shows the impact of volume and latency on the number of ways to "execute". This entry redresses ELT or a term I wrote about in Teradata Magazine over 5 years ago called ETLT.

Warning: read on if you like a good "one-sided" rant... I lay it on the table here, after all my years of experience, I just need to share. As always, I'm open to other sides of the argument - and invite anyone with alternative views to comment on this entry. If you agree, I'd love to hear from you as well...

I've had the pleasure today of sitting through a two hour architecture meeting. The client we are working for at the moment has to move (consistently) 1 Billion rows during a single load in a single process, the maximum growth rate for this single target is potentially 1 Billion rows a month. Of course, this is the maximum rates. 80% of our data sits on average around 100 Million a day; we have maybe 10 to 50 source tables to work with at these volumes.

What would you do in this situation? We are working with a particular ETL tool which has ventured into ELT land recently. We are also working with the worlds leading VLDW database engine (ranked according to Gartner Magic Quadrant). By the way, if you think this is large - you should try dealing with Telco Switch Data - same problem, same size of data, different data, and different requirements - same issues.

Our requirements are to take "micro-batches", and real-time data feeds in the future, along side the standard volume batch updates. There's no question that the architecture must be accommodating to this volume, that the data model must be accommodating to the overall architecture and data arrival timings.

What am I saying when I state: the larger volume, and/or the lower the latency, the less architecture choices you have to solve the problem?

Well, here's an example: if I ask you to load 500 rows of data and tell you, you have 1 hour to get it done. You can probably tell me, 1000 different ways (architectures / designs) that will work - from the data model to the coding solution, to the architecture of the data flow (ETL or ELT - it won't matter). And I'll say great: make it happen. Now, if a year goes by, all's well, and I tell you, wait: you now have 5 minutes to load 500 rows - you might say... Hmmmm our current process runs in 8 minutes - we need to re-architect. Or maybe you'd say: we need a larger box / faster box to get it done. I might say: ok - make it happen.

Now if I tell you instead: you now have to load 100 million rows, oh - and you still only have 1 hour to get it done. You might say.... "You’re crazy" or "we have to re-architect", there might only be 5 ways instead of 1000 ways to get it done. Now if I change the scope and say (6 months later): 100 M rows must be loaded in 15 minutes... You might tell me, we have to change the data model, change the base-architecture, change the loading paradigm, and change the hardware and the tool sets we are using to get it done.

This is at the crux of what I say when I talk about the fact that "real-time" is really "right-time", and that the cost of VLDW (to implement) increases exponentially when your time factor is reduced below a 8 to 10 minute refresh rate, and your data set stays consistently large. But back to the point of this blog - how do you solve this problem?

Well, everything, and I mean everything starts and ends with architecture - first and foremost data model / data architecture, second in line is hardware _architecture_ i.e.: MPP, SMP, Clusters, Grid, etc... Third, is data processing architecture - everything converges on a single point: performance, performance, performance. All roads lead to a single architectural solution: ELT, with MPP, with a SINGLE SPECIFIC NORMALIZED DATA MODEL - no R.I.!! Not at the Billion+ row mark... no way.

Surprised? Maybe, maybe not.

My point: ETL is changing folks, to ETLT (ETL & ELT) mixed workload. Why? Because there's a COST involved in going 100% ELT, there's a loss of metadata as transformation pieces are broken up into smaller execution steps, and run as ELT (disappears into RDBMS complex SQL logic). The kicker? There's still a place for ETL, to take advantage of development and maintenance speed. There's still a place for where ETL is good enough, where Metadata in-stream is good enough, and massive volumes simply don't exist.

Now, let's look at this from a STANDARDS side of the house:
In order to maintain ETL and ELT separately, you quickly realize that this breeds dual-standards, and a divergence of execution states. Pretty soon it becomes clear that maintaining BOTH (regardless of volume) simply costs too much. But wait a minute; the opposite is the COST of executing one or the other methodology. The cost of maintaining ETL at this level is: * Performance (high cost) - the cost of maintaining ELT at this level is usually: * DISK SPACE, and HARDWARE to run the MPP, loss of Metadata - in some cases.

So you see, achieving a balance across the standards (due to cost) is vital in this time of "changing lines" and shifting sands. In the future ELT will be better supported by the remainng vendors, and it will become easier to "flip a switch" and let the traditional ETL tool make the decisions as to what to execute.

Back to standards: what do we really want with an architecture?
* Repeatability
* Consistency
* Reliability (up-time)
* Ease of maintenance
* Ease of build-out/development
* Performance (optimized - according to SEI CMMI level 5)

Ok - these are conflicting goals today. ELT provides most of these at a cost, ETL on the other hand also provides most of these (again at a cost), but they both lead to different architectural implementations. So why can't they co-exist? Because ETL runs on data that's cached within the ETL server - breaking coherency with "real-time" feeds that exist within a true active data warehouse (the cache is extremely difficult - if not impossible) to synchronize to the real-time changes occurring within the RDBMS. That's right - the closer to right-time the warehouse gets, the more it becomes operational in nature, the more operational, the more it requires transaction consistency.

ELT on the other hand requires tons of staged/cached data steps along the way within the RDBMS - particularly to execute against this level of volume.

Both can be created to be repeatable, and restartable, and consistent, and optimized. However what we learn (quickly) by mixing the two is that ETL has a different set of technical issues - mixing a cache with a right-time data warehouse is just one of the problems we see. The other is restartability begins to erode at a certain level. Jobs that were once restartable (in one form or the other) are now reliant on major checkpoints to re-establish caches, or to reset "staging tables." It's a bad mix...

So what should I do?
You need to weigh the costs and benefits of each approach against the volumes and latency time frames that your business can / will sustain within the next year - then decidedly document and issue decrees about how the architecture will work going forward. At some point volume and low latency push us all over the edge (today) to an ELT solution. Maybe in the future an "appliance" of sorts will address these issues and it will all be hidden in the background.

What are you're thoughts / Comments?

Thanks,
Dan Linstedt
CTO, Myers-Holum, Inc
http://www.MyersHolum.com

  Posted by Dan Linstedt on December 6, 2006 3:04 PM |

Comments

Dan,

I agree with you about the relevance of ELT when dealing with large data volumes and high data
freshness. However I do have some reservations regarding your comments on the "loss' of meta data quality in ELT.
Maybe you can elaborate more on this point later in your blog.

"ELT" is something a lot of people (including myself) refer as the new 3-G ETL architecture, to differentiate it from
the popular 2-G ETL tools represented by those from Informatica and Ascential that dominate the market today. The first generation
ETL tools are the legacy code generators like ETI. The second generation ETL tools require the deployment of
a proprietary ETL engine, implemented on a separate physical ETL server. As data volumes become large, and we know the will,
that physical ETL server often becomes a major bottleneck, severely impacting the overall performance of
an ETL process. Based on my personal experience when working on one of the largest EDW projects in the world, a
popular 2-G ETL tool we used then simply breaks down when the incoming batch of new data needs to interact with the
huge EDW history data to perform some functions such as history duplicate checking and "fact-to-fact" orphan processing.

My personal view is that 3-G "ELT" tools that can generate highly optimized standard SQL codes which can fully
take advantage of the back-end RDBMS (Teradata, Oracle and DB2) built-in parallel processing capabilities
will become mainstream in the next five years, replacing the currently popular 2-G ETL tools on the market today.

Regards,
Po Hong
Appteq International

Post a comment