It seems these days that many people have similar problems with the performance and tuning of their ETL routines (in another blog entry I'll discuss performance and tuning of ELT). ETL may be the "old horse" in the stables, but it will exist for a very long time to come, as it serves many different purposes, such as sharing or balancing the workload between the Transformation Engine and the Database Engine. This matters particularly because ELT is 100% database-engine based, and puts some serious strain on the RDBMS (especially at huge volumes). So where does that leave ETL? What are some of the top suggestions for getting ETL to perform?
I've been teaching performance and tuning for the past 7 years, and working on systems analysis, design, and performance and tuning architectures for over 10 years. I started life as an 8080 CP/M assembly-level programmer, where I re-wrote the Digital BIOS to read MS-DOS disks, and then proceeded to re-write the compiler and linker because I only had 64KB of RAM on the machine: the compiler wouldn't compile the BIOS, and then the linker couldn't link it (too many modules, not enough RAM). So if there's one thing I understand, it's the speed of a machine and its execution cycles.
I frequently visit clients whose ETL routines (DataStage, Informatica, and Java ETL) perform at 800 to 12,500 rows per second. With an average row size of 1,500 bytes per row, do the math: (800 rps)(1,500 bytes) = 1,200,000 bytes per second = 1.2MB per second.
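The rows-per-second figures above are easiest to judge once converted into raw throughput. A minimal sketch of that conversion (the function name is mine; the 1,500-byte row size and the rps ranges are the ones quoted in this post):

```python
# Throughput sanity check: convert rows/second into MB/second so a
# figure like "800 rows per second" can be judged against what the
# disks and network are actually capable of.

def throughput_mb_per_sec(rows_per_sec: int, row_bytes: int = 1500) -> float:
    """Return throughput in megabytes per second (1 MB = 1,000,000 bytes)."""
    return rows_per_sec * row_bytes / 1_000_000

for rps in (800, 12_500, 60_000):
    print(f"{rps:>6} rps -> {throughput_mb_per_sec(rps):5.1f} MB/s")
```

At 800 rps this yields the 1.2MB/s in the text; at 60,000 rps it yields 90MB/s, the single-disk best case discussed below.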
Usually the IT staff considers this "fast." This couldn't be further from the truth. In this blog I will disclose some of the things you need to look at to get higher performance, but if you want to know how to accomplish those tasks - well that involves consulting, and you'll have to contact me. Typically my customers see anywhere from 400% to 4000% performance improvements by implementing my recommendations.
If that's not "fast," then what is "fast"? Consider this: on my HP Pavilion (AMD 64-bit CPU, 2GB RAM, a single internal 80GB disk at 7,200 rpm, with the ETL engine and database co-located), I'm reading from a flat file of 2M rows and inserting into a non-empty table with a single primary-key index, and receiving between 40,000 rps and 60,000 rps (best case: 60,000 rps x 1,500 bytes per row = 90MB per second). For updates I receive 12,500 to 20,000 rps (best case: 20,000 rps x 1,500 bytes per row = 30MB per second). For deletes, it varies by key selection (range or singular).
These are the numbers you should be shooting for without using parallel objects, and without partitioning the data set. This way, when parallelism and partitioning are applied you gain a multiplier of these numbers.
If you're running too many instances of an RDBMS engine on a single machine, you can easily over-run your available hardware. CONSOLIDATE ALL DIFFERENT INSTANCES to a SINGLE instance of the engine, tune that instance, and you'll almost certainly see better performance. For instance, a machine with 8 CPUs and 8GB RAM can handle at most 2 instances of a DBMS engine, IF each one is tuned and limited to use only 4 CPUs and 4GB of RAM MAX.
Rule of thumb with ETL: Always, always, always separate your inserts from your updates from your deletes. Running mixed-mode (inserts and updates) within the same stream causes performance slowdowns by orders of magnitude. Trust me on this one. A data flow (mapping) that contains both inserts and updates may run at 12,500 rows per second (1,500-byte rows), whereas when split apart, each stream sees the performance gains mentioned above.
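One minimal sketch of the split: partition the change stream by operation type up front, then run each batch as its own dedicated pass. The row format (a dict carrying an "op" flag) and the load order are hypothetical stand-ins for whatever your ETL tool or database driver actually provides:

```python
# Separate a mixed-mode change stream into dedicated insert/update/delete
# batches before loading, instead of interleaving them row by row.
from collections import defaultdict

def split_by_operation(rows):
    """Group change rows by their DML operation so each batch can run
    in its own dedicated, bulk-friendly pass."""
    batches = defaultdict(list)
    for row in rows:
        batches[row["op"]].append(row)
    return batches

changes = [
    {"op": "insert", "id": 1, "name": "a"},
    {"op": "update", "id": 2, "name": "b"},
    {"op": "insert", "id": 3, "name": "c"},
    {"op": "delete", "id": 4},
]
batches = split_by_operation(changes)

# Each operation gets its own pass; inserts can then use the bulk-load
# path without updates forcing row-by-row processing.
for op in ("insert", "update", "delete"):
    print(op, len(batches[op]))
```

The payoff is that the insert stream can use the engine's bulk path while updates and deletes run with their own, differently-tuned plans.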
What is the shortest distance between two points? A straight line, right? Well, the same goes for ETL data flows: the more splits across transformation objects, the less performance is usually seen. (Note: this is NOT true for Ab-Initio, because Ab-Initio runs optimization algorithms, highly sophisticated mathematics, to eliminate bottlenecks in the mapping/graph. What you design in Ab-Initio is not always what is run under the covers.)
Keeping the data flowing through the transformation objects rather than branching around them is always preferable for performance. But watch out!! The more you tune, and the more standards and best practices you break, the more metadata is (oftentimes) lost. ONLY TUNE WHAT IS TRULY BROKEN AND SIGNED OFF AS SUCH, WITH AN SLA, BY THE BUSINESS USERS.
I run into too many underpowered hardware engines - or the converse: too much parallelism. People try to do too much all at the same time. Balancing the load cycle is much better than overloading the hardware, and will almost always yield faster performance across the board. For example: I know of places that run 400 ETL jobs in parallel; the longest job in that parallel group runs at 800 rps for about 2 hours (5.76M rows), and on average they all run slowly. When we split them into two parallel groups of 200, each group run sequentially, job speeds increased to 20,000 rps, and the running time of the longest job dropped to 4.8 minutes!! Both groups, end to end, ran in under 15 minutes. That's a two-hour run time reduced in total to 15 minutes!!
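The arithmetic behind that example checks out, and is worth doing for your own batch window. A back-of-the-envelope sketch using only the rates quoted above:

```python
# Back-of-the-envelope check: 400 jobs at once vs. two sequential waves
# of 200. The rates (800 rps overloaded, 20,000 rps staged) are the
# figures quoted in the text for the longest-running job.
ROWS = 800 * 7200                     # 800 rps for 2 hours = 5,760,000 rows

overloaded_minutes = ROWS / 800 / 60      # 400-way parallel: 120 minutes
staged_minutes = ROWS / 20_000 / 60       # 200-way waves: 4.8 minutes

print(f"overloaded: {overloaded_minutes:.0f} min, "
      f"staged: {staged_minutes:.1f} min per wave")
```

Even with two waves run back to back, the staged schedule finishes in a fraction of the overloaded one, which is exactly the under-15-minutes result described above.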
Beware of overloading; don't believe the hype that adding parallelism will always give you a performance boost.
Hope this helps. Feel free to contact me directly with your performance issues.
Posted January 12, 2007 5:52 AM