Business Intelligence Network business intelligence resources

Blog: Dan E. Linstedt

« ETLT or ELT - Either way, pull back the sheets. | Main | IBM - DB2 UDB 9.x - hot new technology »

Data Integration - Performance Numbers

I've been teaching and consulting on performance and tuning of systems architectures for 10+ years. I've seen the increases in performance across the board from many different vendors - hardware, software, network to disk to RAM to CPU and so on. This entry does not mention particular vendor names, but rather discusses the _nature_ of performance and tuning at the core of Very Large Systems - whether you're doing Business Intelligence / Data Warehousing, or simply data movement / data integration - these numbers (hopefully) will make sense to you. When I say very large environments, I'm talking about 1 billion rows+ _per file_ - handled within a single batch run, 1B rows per year of history loaded, with 5 years of history to load, that means the second through 5 years of history must manage "update detection" against an increasing set of rows on the target, on the order of billions. So what kind of performance do you want from your systems?

With DW2.0 around the corner, and unstructured data creeping in - and near-line, active, and historical storage coming on-board, it's more important than ever before to get your systems in top shape to handle massive volumes.

Here's a run-down of what I see in the market place - again, no vendor names will be mentioned. These are numbers that I see before tuning architectures:

We're talking an average row size of 1250 bytes, same data in Unicode: 2500 bytes.

Usually without parallelism, without partitioning, without tuning:
RDBMS numbers (before tuning)
1. RDBMS A: 8,000 to 10,000 rows per second handled
2. RDBMS B: 12,000 to 15,000 rows per second
3. RDBMS C: 6,500 to 8,500 rows per second.
4. RDBMS D & E & F cannot be stated, they always run partitioning and parallelism by default.

Major data integration engines on the market
1. ETL / ELT Engine #1: 6,000 to 8,000 rows per second
2. ETL / ELT Engine #2: 1250 to 4,000 rows per second
3. ETL / ELT Engine #3: 18,000 to 25,000 rows per second out of the box.

* PLEASE NOTE: I CANNOT AND WILL NOT DISCLOSE THE VENDOR NAMES, IN A PUBLIC VENUE, IT IS A LEGAL ISSUE. My point is solely to give you (the reader) ideas as to what the best practice is, what performance numbers you should be shooting for regardless if vendor.

Where are the common problems hidden within the architecture?
Here are the top 10 issues I run into ALL the time, I'd be happy to consult you and your customer virtually to dig into each area, and help you gain massive performance from your existing systems by disclosing HOW to implement these issues, please contact me directly. Please note: there are over 50 such criteria I use that pinpoint all the bottlenecks in the systems and data architectures.

1. Common problem #1: Updates and Deletes and Inserts are commonly mixed into a single load stream - this can slow ETL / ELT loading processes by a factor of 4x to 12x. Using SET LOGIC and statistics alone, separate the data upstream as soon as possible (another reason why CDC has such a HUGE POSITIVE IMPACT on performance). In 95% of the customers I visit, out of billions of rows, usually only 10% to 20% of the incoming data sets are updates. Only 5% to 15% are deletes. The most costly and most complex operation in any database system and ETL / ELT is an UPDATE, by separating the logic and separating the data set, one can stream-line the updates and inserts and tune appropriately.
2. Common Problem #2: Database Engines are tuned for OLTP, not for ADW / EDW.
3. Common Problem #3: Disk I/O is a huge source of pain (never fast enough)
4. Common Problem #4: Network is shared, is not VPN between the ETL server and the DBMS server.
5. Common Problem #5: Data Modeling Architecture in the ADW / EDW is flawed.
6. Common Problem #6: Too many serial processes / dependencies inside the ETL / ELT stream.
7. Common Problem #7: Too many instances of a Database Engine on a single machine
8. Common Problem #8: Not enough hardware to support the RDBMS ADW / EDW or ETL engines.
9. Common Problem #9: Block sizing is wrong in the RDBMS
10. Common Problem #10: Too complex ETL / ELT processes, complexity drives dependencies, mixed updates and inserts and deletes in a single stream drives complexity. Mathematics and set logic always provide some of the largest instant gains in this situation.

So what are the magic numbers? What do we want to shoot for? What kind of performance are we after?
Well, here's the thing - what kind of performance is "livable" for your organization, and the next question is: for how long will that number be acceptable? Is your SLA to end processing at 6:30am every morning, and you can't start until 1am that morning? Or is your SLA that you have a 12 hour load window, but you need it to load in 5 hours? Or are you trickle feeding your warehouse in conjunction with batch processing? It's really about RIGHT-TIME for the business.

Keep in mind that EVERY TIME we tune something, we begin to specialize the system further. Which means (in general terms) For every further tuning effort, we are customizing. When we customize we break standards, usually when we break standards we end up with "exception cases". When we have exception cases, we are increasing overal maintenance complexity, and our TCO increases - it's bound to happen. Another angle is: the cost of faster hardware and more bandwidth get's exponentially steeper, so picking the right time, balancing standards with the 80/20 rule (when good is good enough / fast is fast enough) is a KEY to our continuing success. Only tune what the end-users say is "broken" - in other words, notify them of the COST of tuning, and if they sign the SLA and justify the cost, then you're good to go.

What do I look for?
I look first, to a non-partitioned, non-parallel stream for performance number gains. This way, once I've tuned the non-partitioned, non parallel stream - I can get additional multiplied performance gains by using the parallelism and partitioning up to the hardware limitations. Why? because the hardware limitations are a FIXED resource.

1. ETL / ELT systems: Well, in systems (standard batch loads) of 100M per process or less, I look for and usually can live with 40,000 rows per second to 80,000 rows per second for inserts. I look for 20,000 to 30,000 rows per second for updates. For Deletes, I look for 10,000 to 100,000 rows per second (depending on the RDBMS). What's the math look like? 80% inserts of 100M rows = 80M rows of inserts, running at 80,000 rows per second = total run time of: 16.67 minutes. A livable piece, considering I would be running 10 to 20 of these types of loads concurrently, and considering I haven't even begun to partition yet. 18% updates: 100M Rows = 18M rows of updates, running at (slowly): 15,000 rows per second = 20 minutes. Deletes: 2% 2M rows / 10,000 rows per second = 3.3 minutes. If I ran all three (Inserts, updates, then deletes) in SERIAL: my total run time is: 16.67 minutes + 20 minutes + 3.3 minutes = 39.97 minutes (just over a half hour).

If the ORIGINAL MIXED WORKLOAD data flow ran at: 8,000 rows per second (contains updates and inserts and deletes), then the total time would be: 100M rows / 8,000 rows per second = 208.33 minutes or 3.4 hours

Now, 3.4 hours reduced to 39.97 minutes is almost an 80% performance increase.

2. With RDBMS Systems, I look for the same numbers as above - the RDBMS system must perform at the same rates as the ETL / ELT systems - or none of this works.

What do I look for from DBMS systems when I employ ELT?
I shoot for performance rates of 100,000 rows per second to 400,000 rows per second - an order of magnitude difference, however most RDBMS engines can't come close to this, there are only a few that can. These kinds of rates make it possible to deal with 1B rows+ of incoming data on a single stream.

For example: Suppose this: 5 sequential process in ETL move 1B rows to an RDBMS at: 50,000 rows per second. That’s: 3B rows / 50,000 rows per second = 16.67 hours total run time.

Now suppose I have 42 steps of ELT for the same operational logic, that's 42B rows / 336,000 rows per second = 34.72 hours to process. Of course it's not always this way, because each "step" of the 42 steps reduces the data set, and not all of the 42 steps process all 42B rows, this is a worst case scenario. Oh and by the way, this is just ONE of the 10 batch processes that have to run concurrently. By the way, the RDBMS supporting this is not as large as you would think - it's about 8 dual core CPU's, and 12GB of RAM - that's all I can say about that...

These are real numbers from real customer sites that I deal with. If you want to play in the big-data leagues, you've got to be able to produce in a mixed work load environment. Please let us know if we can help you with your performance and tuning, we'd be glad to take a look at your systems.

Vendors take heed: these are not "numbers to shoot for", these are numbers that must be met within the next year or two - Especially with the unstructured data sets on the rise, and massive amounts of XML data being passed around and integrated. Vendors: I can work with you as well, as I sit on a number of technical advisory boards.

Come see me the next time I teach VLDW at TDWI, I'd be happy to talk to you, or contact me at my company to schedule a performance review of your systems.

What kinds of volumes are you dealing with today? Please let me know.

Thank-you,
Dan Linstedt
CTO, Myers-Holum
http://www.MyersHolum.com

  Posted by Dan Linstedt on December 8, 2006 5:46 AM |

Post a comment