Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW, and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

It seems these days that many people have similar problems with performance and tuning of their ETL routines (in another blog entry I'll discuss performance and tuning of ELT). ETL may be the "old horse" in the stables, but it will exist for a very long time to come, as it serves many different purposes, such as sharing or balancing the workload between the transformation engine and the database engine. This matters particularly because ELT is 100% database-engine based and puts serious strain on the RDBMS (especially at huge volumes). So where does that leave ETL? What are some of the top suggestions for getting ETL to perform?

I've been teaching performance and tuning for the past 7 years, and working on systems analysis, design, and performance and tuning architectures for over 10 years. I started life as an 8080 CP/M assembly-level programmer, where I re-wrote the Digital BIOS to read MS-DOS disks, and then proceeded to re-write the compiler and linker because I only had 64KB of RAM on the machine: the compiler wouldn't compile the BIOS, and then the linker couldn't link it (too many modules, not enough RAM). So if there's one thing I understand, it's the speed of a machine and execution cycles.

I frequently visit clients whose ETL routines (DataStage, Informatica, and Java-based ETL) start at 800 to 12,500 rows per second. With an average row size of 1500 bytes, do the math: (800 rps)(1500 bytes) = 1,200,000 bytes per second = 1.2MB per second.

Usually the IT staff considers this "fast." That couldn't be further from the truth. In this blog I'll disclose some of the things you need to look at to get higher performance; if you want to know how to accomplish those tasks, well, that involves consulting, and you'll have to contact me. Typically my customers see anywhere from 400% to 4000% performance improvement by implementing my recommendations.

If that's not "fast," then what is? Consider this: on my HP Pavilion (AMD 64-bit CPU, 2 GB RAM, single internal 80GB disk at 7200 rpm, with the ETL engine and database co-located), I'm reading from a flat file of 2M rows and inserting into the database (a non-empty table with a single primary key index), and receiving between 40,000 rps and 60,000 rps (best case: 60,000 rps x 1500 bytes per row = 90MB per second). For updates I receive 12,500 to 20,000 rps (20,000 rps x 1500 bytes per row = 30MB per second); for deletes it varies by key selection (range or singular).
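To sanity-check numbers like these on your own system, the rows-per-second to MB-per-second conversion is simple arithmetic. Here's a minimal Python sketch; the row size and rates are just the example figures quoted above, not constants of any particular tool:

# Convert ETL throughput from rows per second to MB per second.
# Row size and rates are the example figures from this post.
ROW_BYTES = 1500

def mb_per_second(rows_per_second, row_bytes=ROW_BYTES):
    return rows_per_second * row_bytes / 1_000_000  # decimal MB/s

for label, rps in [("slow client feed", 800),
                   ("insert best case", 60_000),
                   ("update best case", 20_000)]:
    print(f"{label}: {rps:,} rps -> {mb_per_second(rps):.1f} MB/s")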

Hint 1:
These are the numbers you should be shooting for without using parallel objects and without partitioning the data set. That way, when parallelism and partitioning are applied, they act as a multiplier on these baseline numbers.

Hint 2:
If you're running too many instances of an RDBMS engine on a single machine, you can easily overrun your available hardware. CONSOLIDATE ALL DIFFERENT INSTANCES to a SINGLE instance of the engine, tune that instance, and you'll almost certainly see better performance. For instance, an 8-CPU machine with 8 GB of RAM can handle, at most, 2 instances of a DBMS engine, IF each one is tuned and limited to use only 4 CPUs and 4 GB of RAM, max.
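As a back-of-the-envelope check on whether a box is oversubscribed, compare the sum of the per-instance caps against the physical hardware. This is a hypothetical sketch of that arithmetic in Python; the instance names and caps are made up for illustration, not a vendor-specific tuning script:

# Hypothetical example: flag a machine whose DBMS instances are collectively
# allowed more CPU/RAM than the box physically has.
machine = {"cpus": 8, "ram_gb": 8}

instances = [
    {"name": "dwh_prod", "cpus": 4, "ram_gb": 4},   # illustrative per-instance caps
    {"name": "staging",  "cpus": 4, "ram_gb": 4},
]

used_cpus = sum(i["cpus"] for i in instances)
used_ram = sum(i["ram_gb"] for i in instances)

if used_cpus > machine["cpus"] or used_ram > machine["ram_gb"]:
    print("Oversubscribed: consolidate instances or lower the per-instance caps.")
else:
    print(f"OK: {used_cpus}/{machine['cpus']} CPUs and {used_ram}/{machine['ram_gb']} GB allocated.")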

Hint 3:
Rule of thumb with ETL: always, always, always separate your inserts from your updates from your deletes. Running mixed-mode (inserts and updates) within the same stream causes slowdowns of orders of magnitude. Trust me on this one. A data flow (mapping) that contains both inserts and updates may run at 12,500 rows per second (1500-byte rows), whereas, when split apart, it sees the performance gains mentioned above.
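To make the split concrete, here is a minimal sketch in Python using SQLite and hypothetical table and column names. The point is that the incoming batch is divided into a pure-insert stream and a pure-update stream, each executed as its own bulk statement, instead of deciding insert-vs-update row by row in one mixed flow:

import sqlite3

# Hypothetical target table; any RDBMS with a bulk/batched API works the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "old-A"), (2, "old-B")])

incoming = [(1, "new-A"), (3, "C"), (4, "D")]  # mixed batch arriving from the source

# Split the batch: rows whose key already exists become updates, the rest inserts.
existing = {row[0] for row in conn.execute("SELECT id FROM customer")}
inserts = [r for r in incoming if r[0] not in existing]
updates = [(name, cid) for cid, name in incoming if cid in existing]

# Run each stream as its own bulk statement rather than mixed-mode, row-by-row logic.
conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", inserts)
conn.executemany("UPDATE customer SET name = ? WHERE id = ?", updates)
conn.commit()
print(conn.execute("SELECT * FROM customer ORDER BY id").fetchall())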

Hint 4:
What is the shortest distance between two points? A straight line, right? Well, the same goes for ETL data flows: the more splits across transformation objects, the less performance you usually see. * Note: this is NOT true for Ab-Initio, because Ab-Initio runs optimization algorithms (highly sophisticated mathematics) to remove and eliminate bottlenecks in the mapping/graph. What you design in Ab-Initio is not always what runs under the covers.

Keeping the data flowing through the transformation objects rather than branching around them is always preferable for performance. But watch out!! The more you tune, the more standards and best practices you break, and (often) the more metadata is lost. ONLY TUNE WHAT IS TRULY BROKEN AND SIGNED OFF AS SUCH WITH AN SLA AND THE BUSINESS USERS.

Hint 5:
I run into too many underpowered hardware engines - or the converse: too much parallelism. People try to do too much all at the same time. Balancing the load cycle is much better than overloading the hardware, and will almost always yield faster performance across the board. For example, I know of places that run 400 ETL jobs in parallel; the longest job in that parallel group runs at 800 rps and takes about 2 hours (5.76M rows), and on average they all run slowly. When we split the workload into two parallel groups of 200, with each group run sequentially, job speed increased to 20,000 rps and the running time of the longest job dropped to 4.8 minutes!! Both groups, end to end, ran in under 15 minutes. That's a two-hour run time reduced, in total, to 15 minutes!!
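The arithmetic behind that example is easy to verify; a quick Python check using only the figures from the paragraph above:

# Verify the Hint 5 numbers: same row count, two different throughputs.
rows = 800 * 2 * 3600                 # 800 rps for 2 hours = 5,760,000 rows
before_minutes = rows / 800 / 60      # 120 minutes at 800 rps
after_minutes = rows / 20_000 / 60    # 4.8 minutes at 20,000 rps
print(f"{rows:,} rows: {before_minutes:.0f} min before, {after_minutes:.1f} min after")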

Beware of overloading; don't believe the hype that adding parallelism always gives you a performance boost.

Hope this helps. Feel free to contact me directly with your performance issues.
Dan L
http://www.COBICC.org


Posted January 12, 2007 5:52 AM

3 Comments

It's interesting reading this article today and comparing it to your article of 2005 where you declared "ETL is dead"! What changed?

Hi M. Routh,

Yes, it is interesting... What's changed is that ETL apparently still has room for growth, but beyond just the data warehousing market. What else has changed is the notion of ETL and what it's used for. It's always been about integration, but the database engines haven't come as far by 2007 as I thought they would (in 2005 I thought they'd do a much better job of EL-T than they have). I thought that a GUI on top of the RDBMS with integrated metadata, with the "transformations" buried in the database, would take over... that just hasn't happened.

Thanks for the comment,
Dan L

This is very interesting. While developing an ETL tool for Sun, I had to explain to people many times that it's not only the tool that can improve performance. Designing the job correctly and making good use of the tool can improve performance dramatically. But ETL developers tend to believe the tool should do that optimization, so we came up with a strategy-based optimization to cater to this; see the strategy-based execution that Sun ETL Integrator supports on our wiki page:

http://www.glassfishwiki.org/jbiwiki/Wiki.jsp?page=ETLExecutionStrategies


    