I had some good questions come in recently, thank you. In this entry I'll share my experiences with ETL and ELT with regard to metadata; I'll also try to elaborate on when it is right to use which type of technology. This also goes back to my original articles in Teradata Magazine on ETLT.
When to use which technology.
As it turns out (and this is a good thing), not all systems have the super-huge volumes that I've been discussing. Bill recently documented a specific case where too much granularity is a bad thing - which I agree with. There's a time and place for asking the question: what's the value of this data set? Is the value increased or decreased by going to a finer grain of data? But that's for another blog.
Bottom line: ETL still has its uses, and will continue to be useful moving forward. Here are some typical cases and reasons why you might still want to use ETL:
* When volumes are low to medium sized. In my estimation, a single feed (single file) of under 180M rows can easily be processed by ETL when the jobs are architected properly.
* When the time-line for loading the data is not super critical. In other words, even if I have 500M rows but 2 or 3 days to load them (because it's a monthly feed), then maybe this is Ok too.
* When the source systems provide data slower than 60,000 rows per second - or the networks are restricting data flow. In these cases, you really have no choice. You now have the time (due to slow sources) to do transformation in-stream, so why not take advantage of this?
* When the target system won't perform at more than 50,000 or 60,000 rows per second, even if it's been tuned. (see my last bullet)
* When code and reference tables are needed to populate default values or provide simple data replacement through key matching.
* When small micro-bursts of data are arriving in real-time, and we need to perform transformation in-line between systems, or between queues.
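To make the code/reference table case concrete, here's a minimal sketch of ETL-style in-stream transformation. The reference table, field names, and default value are all illustrative, not from any real feed - the point is that rows are transformed as they flow between systems, without landing in the RDBMS first:

```python
# Hypothetical code/reference table for key matching; "Unknown" is an
# illustrative default value for codes with no match.
REF_CODES = {"US": "United States", "DE": "Germany"}
DEFAULT_COUNTRY = "Unknown"

def transform(row):
    """Simple data replacement through key matching, with a default value."""
    out = dict(row)
    out["country_name"] = REF_CODES.get(row.get("country_code"), DEFAULT_COUNTRY)
    return out

def etl_stream(source_rows):
    # Generator: transforms rows in-stream, between source and target
    # (or between queues), one micro-burst at a time.
    for row in source_rows:
        yield transform(row)

source = [{"id": 1, "country_code": "US"}, {"id": 2, "country_code": "FR"}]
loaded = list(etl_stream(source))
```

The lookup and the default substitution happen in the engine's memory, which is exactly the kind of lightweight transformation that doesn't justify a round-trip into the database.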
In other circumstances we will use the ETL engine to provide EL in parallel and partitioned jobs, providing maximum throughput with minimum development time.
So what does ETL provide in the metadata space?
Generally ETL's metadata repositories have been around for a little bit longer than the ELT tool vendors' repositories, and those ETL tool sets now dabbling in ELT are just growing into this area.
If we look at ELT from a scripting point of view, not using a tool to execute, we lose all kinds of metadata - from process flow, to business logic, to full traceability. This is particularly visible when scripting Teradata BTEQ, Oracle SQL Loader and Stored Procedures, SQLServer command line, DB2 UDB Functions, and so on... Any time we return to a CODE base for ELT, we're back to deciphering scripts to figure out metadata, and much of it is hidden. Unfortunately (today) most of the ELT that I run into is still hand-coded, although this is set to change in the near future.
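Here's a minimal sketch of what hand-coded ELT looks like, using Python's sqlite3 as a stand-in for the target RDBMS (table names and the transformation rule are made up for illustration). Notice where the business logic lives - inside a SQL string - which is exactly why a metadata repository can't see it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# E + L: extract and land the raw data in the database, untransformed.
cur.execute("CREATE TABLE stg_orders (id INTEGER, amount INTEGER)")
cur.executemany("INSERT INTO stg_orders VALUES (?, ?)", [(1, 100), (2, 250)])

# T: the transformation runs inside the engine as SQL. The business rule
# (here, an illustrative doubling of the amount) is buried in this string,
# invisible to process-flow and lineage metadata.
cur.execute("CREATE TABLE orders AS SELECT id, amount * 2 AS amount FROM stg_orders")
conn.commit()

doubled = cur.execute("SELECT amount FROM orders WHERE id = 1").fetchone()[0]
```

Deciphering that CREATE TABLE ... AS SELECT is the "deciphering scripts to figure out metadata" problem in miniature - multiply it by hundreds of scripts and you see the traceability gap.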
This is where people stop me and say: well, what about Sunopsis? They were 100% ELT... True - but their metadata repository left a LOT to be desired, they were young in the field, and hadn't quite grown up yet.
Ok, so what about Informatica, Ascential, Ab-Initio, Business Objects Data Mover, and so on...
They grew up in the ETL space and have strong metadata components; however, they are just now completing their transition to ELT - or in some cases, they've completed it, but it's not 100% compliant with all the database features available, as far as I can see.
So then this argument kicks in: Well, ok - but what about when I override the SQL in these tools? That gets me 100% compliance with ELT and database feature sets... Yes, but then some tools simply don't have the SQL de-composition routines built into their engine yet. They can't pick these apart for the metadata lineage and achieve end-to-end lineage.
For those tools that offer end-to-end lineage with SQL, the next thing that I find is:
Some database engines do not surface the number of "rows" which were operated on by the SQL statements executed within the database - or rather, the counts are produced at execution time, but the tool sets don't pick them up. Other database engines record this information (for every SQL statement) in metadata tables in the RDBMS - as it should be!
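The row counts are usually right there for the taking at execution time. A sketch with sqlite3 (again standing in for a real RDBMS - the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# The engine reports how many rows the statement operated on;
# a tool set only has to read cursor.rowcount and store it as lineage metadata.
cur.execute("UPDATE t SET x = x + 1 WHERE x >= 2")
rows_affected = cur.rowcount
conn.commit()
```

If a hand-coded ELT script throws that count away instead of logging it, the operational metadata is gone - which is the gap I keep running into.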
Finally, when it comes down to it, ELT requires much simpler processes - which means many more steps, and a lot more "temporary" tables during execution cycles. Metadata is bound to be lost through this process (today). This too is due to change, and will get better as the tools find the gaps and fix them.
Now what are some of the pros and cons of ELT?
* ELT leverages RDBMS engine hardware for scalability
* ELT keeps all data in the RDBMS all the time
* ELT is parallelized according to the data set, and disk I/O is usually optimized at the engine level for faster throughput.
* ELT Scales as long as the hardware and RDBMS engine can continue to scale.
* ELT can achieve 3x to 4x the throughput rates of ETL on an appropriately tuned MPP RDBMS platform.
* ELT relies on proper database tuning and proper (usually normalized) data model architecture
* ELT relies on MPP hardware
* ELT can easily eat 100% of the hardware resources available for complex and huge operations
* ELT can't balance the workload
* ELT can't reach out to alternate systems (all data must exist in the RDBMS BEFORE ELT operations take place)
* ELT easily doubles, triples, and quadruples disk storage requirements (more processes, each simpler, each requiring intermediate temporary tables).
* ELT (sometimes) is not 100% metadata lineage traceable.
* ELT can take longer to design and implement: more steps, each less complicated, but usually more custom SQL code (sometimes this is where metadata is lost).
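The disk-storage multiplication in the cons above is easy to demonstrate. In this hedged sketch (sqlite3 again standing in for the target engine, with made-up table names), three simple ELT steps leave full copies of the data behind:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (id INTEGER, val INTEGER)")
cur.executemany("INSERT INTO src VALUES (?, ?)", [(i, i) for i in range(5)])

# Each ELT step is simple, but each one writes a full intermediate copy.
cur.execute("CREATE TABLE tmp1 AS SELECT id, val * 2 AS val FROM src")
cur.execute("CREATE TABLE tmp2 AS SELECT id, val + 1 AS val FROM tmp1")
cur.execute("CREATE TABLE tgt  AS SELECT id, val FROM tmp2")

# The same 5 rows now live in four tables - roughly 4x the storage,
# until the temporary tables are dropped.
table_count = cur.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type='table'").fetchone()[0]
```

Dropping tmp1 and tmp2 after the run reclaims the space, but during the execution cycle the storage requirement really is multiplied - and each temporary table is another spot where lineage can be lost.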
Pros and Cons of ETL
* ETL can balance the workload / share the workload with the RDBMS
* ETL can perform more complex operations in single data flow diagrams (data maps)
* ETL can scale with separate hardware.
* ETL can handle Partitioning and parallelism independent of the data model, database layout, and source data model architecture.
* ETL can process data in-stream, as it transfers from source to target
* ETL does not require co-location of data sets in order to do its work.
* ETL captures huge amounts of metadata lineage today.
* ETL can run on SMP or MPP hardware
* ETL requires separate and equally powerful hardware in order to scale.
* ETL can "bounce" data to and from the target database, and requires separate caching mechanisms which sometimes don't scale to the magnitude of the data set - this can result in scalability and performance issues.
* ETL cannot perform as fast as ELT without twice the size of hardware (usually for RAM and CPU resources).
So, you see - it all depends on how you look at the picture and what you want. Ultimately the nirvana is that no metadata will be lost, regardless of database-specific SQL or ETL "functions / transformations" utilized. The other nirvana is that you will ultimately design an "ETL" stream, and when it no longer performs, flip a switch on a job that says: turn this whole thing into ELT for me, without a re-design; that is the day when ELT will have grown up on the job and become our best friend.
Of course, it all depends on where you can afford to put your system load, and where you can spend your money to achieve scalability....
Hope this helps,
Posted December 15, 2006 10:53 AM