
I had some good questions come in recently, thank you. In this entry I'll share my experiences with ETL and ELT with regard to metadata, and I'll try to elaborate on when it is right to use which type of technology. This also goes back to my original articles in Teradata Magazine on ETLT.

When to use which technology.
As it turns out (and this is a good thing), not all systems have the super-huge volumes that I've been discussing. Bill recently documented a specific case where too much granularity is a bad thing, and I agree. There's a time and place for asking the question: what's the value of this data set? Is the value increased or decreased by going to a finer grain of data? But that's for another blog.

Bottom line: ETL still has its uses, and will continue to be useful moving forward. Here are some typical cases and reasons why you might still want to use ETL (the sketch after this list turns them into a rough decision helper):
* When volumes are low to medium. In my estimation, a single feed (single file) of under 180M rows can easily be processed by ETL when architected properly.
* When the timeline for loading the data is not super critical. Even a 500M-row feed may be fine for ETL if it's a monthly feed and I have 2 or 3 days to load it.
* When the source systems provide data slower than 60,000 rows per second, or the networks are restricting data flow. In these cases you really have no choice; the slow sources give you the time to do transformation in-stream, so why not take advantage of it?
* When the target system won't perform at more than 50,000 or 60,000 rows per second, even if it's been tuned (see the previous bullet).
* When code and reference tables are needed to populate default values or provide simple data replacement through key matching.
* When small micro-bursts of data are arriving in real time and we need to perform transformation in-line between systems, or between queues.
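
To make these rules of thumb concrete, here is a minimal Python sketch that encodes them as a rough decision helper. The thresholds (180M rows, 60,000 rows per second, multi-day load windows) come from the bullets above; the function and parameter names are purely illustrative.

# Rough decision helper for "is plain ETL likely to be good enough?"
def suggest_etl(rows: int,
                source_rows_per_sec: float,
                target_rows_per_sec: float,
                load_window_hours: float,
                is_micro_burst: bool = False) -> bool:
    """Return True when plain ETL is likely a reasonable fit."""
    if is_micro_burst:
        return True   # small real-time bursts: transform in-line between systems/queues
    if rows <= 180_000_000:
        return True   # medium/low volume single feed
    if source_rows_per_sec < 60_000 or target_rows_per_sec < 60_000:
        return True   # the source or the target is the bottleneck anyway
    # generous load window, e.g. a 500M-row monthly feed with 2-3 days to load
    if rows / max(source_rows_per_sec, 1.0) <= load_window_hours * 3600:
        return True
    return False

# Example: a 500M-row monthly feed, slow source, 48-hour window
print(suggest_etl(500_000_000, 10_000, 80_000, 48))   # True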

In other circumstances we use the ETL engine to provide EL (extract and load only) in parallel, partitioned jobs, providing maximum throughput with minimum development time.
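
As a rough illustration of that EL-only pattern, the Python sketch below partitions the source by key range and runs plain extract-and-load workers in parallel, deferring all transformation to the target. The partition ranges and the copy_partition helper are hypothetical stand-ins for real bulk-copy calls.

from concurrent.futures import ThreadPoolExecutor

PARTITIONS = [(0, 5_000_000), (5_000_000, 10_000_000),
              (10_000_000, 15_000_000), (15_000_000, 20_000_000)]

def copy_partition(bounds):
    """Bulk-copy one key range from source to target staging; no transformations."""
    lo, hi = bounds
    # e.g. SELECT * FROM src.orders WHERE order_id >= lo AND order_id < hi,
    # followed by a bulk load into stage.orders on the target
    return hi - lo   # placeholder for the real row count copied

with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
    totals = list(pool.map(copy_partition, PARTITIONS))

print(f"loaded {sum(totals):,} rows across {len(PARTITIONS)} partitions")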

So what does ETL provide in the metadata space?
Generally, ETL metadata repositories have been around a bit longer than the ELT tool vendors' repositories, and the ETL tool sets now dabbling in ELT are just growing into this area.

If we look at ELT from a scripting point of view, without a tool to execute it, we lose all kinds of metadata: process flow, business logic, full traceability. This is particularly visible when scripting Teradata BTEQ, Oracle SQL*Loader and stored procedures, the SQL Server command line, DB2 UDB functions, and so on. Any time we return to a code base for ELT, we're back to deciphering scripts to figure out the metadata, and much of it stays hidden. Unfortunately, most of the ELT I run into today is still hand-coded, although this is set to change in the near future.
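
A contrived example of what that hand-coded ELT looks like in practice: the business rules and the table-to-table lineage live only inside a SQL string in the script, so nothing outside the script can report them as metadata. The table and column names below are made up for illustration.

HAND_CODED_ELT = """
INSERT INTO dw.customer_dim (customer_key, customer_name, region_code)
SELECT s.cust_id,
       UPPER(TRIM(s.cust_nm)),            -- cleansing rule hidden in code
       COALESCE(r.region_code, 'UNK')     -- default value hidden in code
FROM   stage.customers s
LEFT   JOIN ref.regions r ON r.region_id = s.region_id
"""

# With scripted loaders (BTEQ, SQL*Loader plus procedures, command-line SQL),
# recovering lineage means reading strings like the one above by hand.
print(HAND_CODED_ELT)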

This is where people stop me and say: well, what about Sunopsis? They were 100% ELT... True, but their metadata repository left a lot to be desired; they were young in the field and hadn't quite grown up yet.

Ok, so what about Informatica, Ascential, Ab-Initio, Business Objects Data Mover, and so on?
They grew up in the ETL space and have strong metadata components; however, they are just now completing their transition to ELT, or in some cases they've completed it, but it's not 100% compliant with all the database features available, as far as I can see.

So then this argument kicks in: well, ok, but what about when I override the SQL in these tools? That gets me 100% compliance with ELT and database feature sets... Yes, but some tools simply don't have SQL de-composition routines built into their engines yet. They can't pick the overridden SQL apart for metadata and achieve end-to-end lineage.
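
Here's a naive sketch of the kind of SQL de-composition those engines need: pull the target and source tables out of an overridden INSERT ... SELECT so at least coarse, table-level lineage survives. A real product needs a full SQL parser; this regex only handles the simple statement shape shown and is purely illustrative.

import re

def table_lineage(sql: str) -> dict:
    """Very rough table-level lineage for a single INSERT ... SELECT statement."""
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return {"target": target.group(1) if target else None,
            "sources": sorted(set(sources))}

overridden_sql = """INSERT INTO dw.customer_dim
                    SELECT s.cust_id, s.cust_nm, r.region_code
                    FROM stage.customers s
                    LEFT JOIN ref.regions r ON r.region_id = s.region_id"""
print(table_lineage(overridden_sql))
# {'target': 'dw.customer_dim', 'sources': ['ref.regions', 'stage.customers']}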

For those tools that do offer end-to-end lineage with SQL, the next thing I find is this:
Some database engines do not surface the number of rows operated on by the SQL statements executed inside the database, or they record it but the tool sets don't pick it up. Other database engines record this information, for every SQL statement, in metadata tables in the RDBMS - as it should be!
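
When the engine doesn't surface row counts for pushed-down SQL (or the tool doesn't read its metadata tables), the database driver usually can: the DB-API cursor exposes rowcount after each statement. The sketch below uses sqlite3 purely as a runnable stand-in for the real RDBMS.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stage_orders (id INTEGER, amount REAL);
    CREATE TABLE dw_orders    (id INTEGER, amount REAL);
    INSERT INTO stage_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
""")

cur = conn.execute("INSERT INTO dw_orders SELECT id, amount FROM stage_orders")
conn.commit()

# Record the count next to the statement so the operational metadata is kept
# even when the database's own catalog does not expose it to the tool set.
print({"step": "stage_orders -> dw_orders", "rows": cur.rowcount})   # rows: 3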

Finally, when it comes down to it, ELT requires much simpler processes, which means many more steps and a lot more "temporary" tables during execution cycles, so metadata is bound to be lost along the way (today). This too is due to change, and it will get better as the tools find the gaps and fix them.
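
A minimal sketch of that execution pattern: many simple, set-based steps, each landing its output in an intermediate table. The step list and table names are invented; in a real chain each statement would run in the target RDBMS, and logging the chain explicitly is one way to keep the per-step metadata from disappearing.

# Each step is deliberately simple; the price is intermediate tables and
# per-step lineage that lives only in the code unless it is logged.
ELT_STEPS = [
    ("stage_orders", "CREATE TABLE tmp_clean  AS SELECT id, amount FROM stage_orders WHERE amount > 0"),
    ("tmp_clean",    "CREATE TABLE tmp_joined AS SELECT c.id, c.amount, r.region FROM tmp_clean c JOIN ref_regions r ON r.id = c.id"),
    ("tmp_joined",   "INSERT INTO dw_orders SELECT id, amount, region FROM tmp_joined"),
]

for source_table, sql in ELT_STEPS:
    # a real run would execute each statement in-database; here we only log the chain
    target_table = sql.split()[2]
    print(f"{source_table:12s} -> {target_table}")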

Now what are some of the pros and cons of ELT?
Pros:
* ELT leverages RDBMS engine hardware for scalability
* ELT keeps all data in the RDBMS all the time
* ELT is parallelized according to the data set, and disk I/O is usually optimized at the engine level for faster throughput.
* ELT scales as long as the hardware and RDBMS engine can continue to scale.
* ELT can achieve 3x to 4x the throughput of ETL on an appropriately tuned MPP RDBMS platform.

Cons:
* ELT relies on proper database tuning and a proper, normalized data model architecture
* ELT relies on MPP hardware
* ELT can easily eat 100% of the hardware resources available for complex and huge operations
* ELT can't balance the workload
* ELT can't reach out to alternate systems (all data must exist in the RDBMS BEFORE ELT operations take place)
* ELT easily doubles, triples, and quadruples disk storage requirements (more processes, each simpler, each requiring intermediate temporary tables).
* ELT (sometimes) is not 100% metadata lineage traceable.
* ELT can take longer to design and implement: more steps, each less complicated, but usually more custom SQL code (sometimes this is where metadata is lost).

Pros and Cons of ETL
Pros:
* ETL can balance the workload / share the workload with the RDBMS
* ETL can perform more complex operations in single data flow diagrams (data maps)
* ETL can scale with separate hardware.
* ETL can handle partitioning and parallelism independent of the data model, database layout, and source data model architecture.
* ETL can process data in-stream, as it transfers from source to target
* ETL does not require co-location of data sets in order to do its work.
* ETL captures huge amounts of metadata lineage today.
* ETL can run on SMP or MPP hardware

Cons:
* ETL requires separate and equally powerful hardware in order to scale.
* ETL can "bounce" data to and from the target database, requires separate caching mechanisms which sometimes don't scale to the magnitude of the data set - this can result in scalability and performance issues.
* ETL cannot perform as fast as ELT without twice the size of hardware (usually for RAM and CPU resources).

So, you see, it all depends on how you look at the picture and what you want. Ultimately the nirvana is that no metadata is lost, regardless of the database-specific SQL or the ETL functions/transformations utilized. The other nirvana is that you design an "ETL" stream, and when it no longer performs, you flip a switch on the job that says: turn this whole thing into ELT for me, without re-designing. That is the day when ELT will have grown up on the job and become our best friend.

Of course, it all depends on where you can afford to put your system load, and where you can spend your money to achieve scalability....

Hope this helps,
Dan Linstedt
CTO, http://www.MyersHolum.com


Posted December 15, 2006 10:53 AM

3 Comments

Hi Dan,


Is it just my perception, or is the above list of PROs for ETL longer, and that of CONs shorter?


Is there a strong restriction in the ELT approach to only set-oriented operations (due to the nature of SQL)?


You have not mentioned complex derivation algorithms that don't work well on sets, but might require multiple cursors in the ELT - severely limiting your performance.


Also, repartitioning in the transformation process is a piece of cake in the ETL tools, whereas in a database (and DB2 is the only one I know of that supports true data partitioning; not sure about Teradata) it involves creating lots of temp tables with various partitions to enhance the transformation speed. Do the new tools/enhancements automate the creation and management of these temp tables?


Raul

Hello Dan,
A few comments: it would probably be more precise to state the extract speed in gigabytes or megabytes per second rather than in rows per second, because rows can differ in size.
Another one: the extract method from the database can differ. If the tool performs a direct-path extract, like wisdomforce fastreader, then the extract speed can match the I/O speed and easily reach 15 gigabytes per minute. In that case it does not make sense to use ELT. What do you say?

Marco Pellicciotta:

ELT tools are growing fast, and I think in a few years they will take the market.
