Blog: Dan E. Linstedt« My Holiday Wish List for BI of Tomorrow | Main | RFID tracking for Individuals needs to go away » Performance of ETL From an Architecture PerspectiveIt seems these days that many people have similar problems with performance and tuning of their ETL routines (in another blog entry I'll discuss performance and tuning of ELT). ETL may be the "old-horse" in the stables, but it will exist for a very long time to come, as it serves many different purposes (such as sharing or balancing the workload) between the Transformation Engine and the Database Engine. Particularly where ELT is 100% database engine based, and puts some serious strain on the RDBMS (especially in huge volumes). So where does that leave ETL? What are some of the top suggestions for getting ETL to perform? I've been teaching performance and tuning for the past 7 years, and working on systems analysis, design, performance and tuning architectures for over 10 years. I started life as an 8080 CPM assembly level programmer where I re-wrote the Digital BIOS to read MS-DOS disks, and then proceeded to re-write the compiler and linker because I only had 64kb of RAM on the machine, and the compiler wouldn't compile the BIOS, and then the linker couldn't link it (too many modules, not enough RAM). So if there's one thing I understand, its speed of a machine and execution cycles. I frequent clients where their performance of their ETL routines (Data Stage and Informatica, and Java ETL) starts at 800 to 12,500 rows per second - with an average row size of 1500 bytes per row, do the math: (800 rps)(1500 bytes) = 1,200,000 bytes per second = 1.2MB per second. Usually the IT staff considers this "fast." This couldn't be further from the truth. In this blog I will disclose some of the things you need to look at to get higher performance, but if you want to know how to accomplish those tasks - well that involves consulting, and you'll have to contact me. Typically my customers see anywhere from 400% to 4000% performance improvements by implementing my recommendations. If that's not "fast" then what is "fast"? Consider this: on my HP Pavilion, AMD 64 bit CPU, 2 GB RAM, single internal 80GB disk @ 7200 rpm, with ETL engine and Database co-located. I'm reading from a flat file of 2M rows, and inserting to the database (non-empty table) with a single primary key index, and receiving between 40,000 rps and 60,000 rps (best case: 60,000 rps x 1500 bytes per row = 90MB per second). For updates I receive 12,500 to 20,000 rps x 1500 bytes per row = 30Mb per second), for Deletes it varies by key selection (range or singular). Hint 1: Hint 2: Hint 3: Hint 4: Keeping the data flowing through the transformation objects rather than branching around them is always preferable for performance. But watch out!! The more you tune, the more standards and best practices you break, the more metadata is lost (often times). ONLY TUNE WHAT IS TRULY BROKEN AND SIGNED OFF AS SUCH WITH AN SLA AND THE BUSINESS USERS. Hint 5: Beware of overloading; don't believe the hype that always adding parallelism will give you performance boosts. Hope this helps, Feel free to contact me directly with your performance issues. |
Comments
Its interesting reading this article today and comparing it to your article of 2005 where you declared "ETL is dead"! What changed?
Posted by: M Routh | March 11, 2007 6:41 PM
Hi M. Routh,
Yes, it is interesting... What's changed is that ETL apparently still has room for growth, but beyond just the data warehousing market. What else has changed is the notion of ETL, and what it's used for. It's always been about integration, but the database engines haven't come as far in 2007 as I thought they would (in 2005 - I thought they'd do a much better job of EL-T than they have done). I thought that a GUI on top of RDBMS with integrated metadata, and the "transformations" being buried in the database would take-over... that just hasn't happened.
Thanks for the comment,
Dan L
Posted by: Dan Linstedt | March 12, 2007 4:57 PM
This is very interesting and while developing an ETL tool for Sun, I had to explain to people many times, that its not the tool only which can improve the performance. Designing it correctly and making good use of of the tool can improve performance dramatically. But ETL developer tend to believe the tool should do that optimization; so we did came up with a strategy based optimization to cater this, see strategy based execution that Sun ETL Integrator support in our wiki page:
http://www.glassfishwiki.org/jbiwiki/Wiki.jsp?page=ETLExecutionStrategies
Posted by: Ahi | April 18, 2007 1:23 AM