Business Intelligence Network business intelligence resources

Blog: Dan E. Linstedt

« ETL Engines: VLDW & Loading / Transforming | Main | ETL/ELT: VLDW and Multi-passing data »

ETL/ELT: VLDW & Data Integration Details

I've been blogging on this topic now for at least the past week. I've gotten some good feedback from a few of you out there in blog-land (thank-you). I've seen a couple questions on "how" to reduce the data set, and what I mean by "multi-passing" the data, and so on. Keep in mind that it's much easier to teach these subjects in classes, than it is to blog on them and get all the needed details put in place for everyone to read. In this blog I provide additional information on how to make performance work for large data sets and data integration.

How do I reduce the data set?
In my last entry I made mention of "rows being too wide." The comments I see from good people are: If you need the data, you need the data, how can you "reduce" the data set?

Mathematics provide the answer.
As I stated early on in performance and Tuning, and in every class I teach, I discuss this: there are only two / maybe three ways to improve performance: 1) reduce the data set, 2) increase parallelism, 3) combination of the two - really there's nothing else that can be done. Ok, improving the architecture really falls under #2 (providing the means within the architecture to improve parallelism), most architectures and data flow diagram designs are too restrictive and too complex too allow improved parallelism.

So, that said: Suppose you have 150 columns, or even 360 columns (SAP / R3 feed for example).... Worse yet, suppose the row is 5 columns... What matters? The byte length of the row - that's what matters, not the number of columns. The first incorrect conclusion that experts tend to make, is simply to look at the number of columns. However, as a rule of thumb, USUALLY the higher the number of columns, the larger the row size.

Ok, that said: suppose I have 150 columns that amount to a 4k row size. This is a WIDE ROW, and will cause performance issues to no end. Problems in the network, CPU, RAM, Disk, I/O, Swapping, Database, Application - you name it, it happens...

For simplicity sake, if we focus on the database for a minute... Most database vendors install their database engines at a default 8k block size... Simple math says: I can fit 2 rows per block, that means: 2 rows per I/O on disk (simple-mans math, because I'm a simple man)... Now, suppose that row size is 4500 bytes... What happens to the rest of row 2 when inserting to the database?? It carries over to another block... Now, 2 rows = 2 I/O's (at a minimum). In some databases this is referred to as "block chaining" or "row chaining". This is a problem... But back to the focus...

How large are most TCP/IP packets? In fact, what's the maximum size of a raw IP packet? Anyone? You might be surprised to know it's close to 1024 bytes, that's right: 1k. (Hint: this is one of the reasons that Network drives are slower than DASD, or Fibre Channel attached drives)... How many I/O's are needed to get 2 rows of 4500 bytes of data across to the disk? Again, in simple math: 9000 bytes = 9x1k packets = 9 I/O's over the network. Ouch.

I/O's cost time, cost money, and can quickly and easily overflow the hardware... but wait - there's another monkey wrench in the works... Lack of parallelism (serialized operation).

If your application (ETL tool?) sends the "block" of two rows over the network, most likely it does it in serial... HEY NETWORK: I HAVE 9k WORTH OF DATA, SEND IT TO THIS DISK AT THIS ADDRESS... Network: OK, 9 I/O's in serial... Processed. HEY DATABASE: COMMIT THESE ROWS TO THE DISK... Database: Ok... 2 I/O's on disk, that'll be 2 blocks please...

And so on.

Now, what can we do? We still need the 9k of data right? Right... But we have one piece left at our disposal: increase parallelism...

How do we do that?
1) Split the data flow in to two components, or three, or four... Whatever it takes to NORMALIZE the structures, INCLUDING IN THE DATABASE.
2) Run the Primary Key with Flow 2, Flow 3, Flow 4... (let's say we split 4500 bytes into an average of 1k each with 4 data flows...)
3) Run all 4 data flows from the SAME SOURCE to the NORMALIZED TARGETS in parallel.

Voila, we just increased the parallelism - wait a minute!! We ALSO reduced the data set! We receive a huge multiplier in performance _every time_ I guarantee it, as long as the hardware can handle the parallelism...

Now wait a minute! Won't this run "slower" because now it has to read the same data 4 times? What about the amount of work over TCP/IP - isn't it the same number of packets? What about the target disk and database? Don't they have the same amount of work to do?

Ok, here's the low-down:
1) Systems (including databases) have been optimized over the years to "Cache" data from disk into RAM the first time it's requested, more often than not - running all 4 data flows at the same time, only 1 of them at any given point will pay the Disk I/O price, leaving the rest to "grab data from RAM"
2) Systems have also been optimized to handle small fast transactions in parallel, it's what they DO!!! So normalizing the data allows the hardware to be utilized the way it was originally intended. If you want DENORMALIZED DATA, then buy an appliance - that's what they specialize in: big blocks, huge wide rows, etc... (more on this later).
3) Applications "usually" by default (like ETL engines) open single connections to the database per TARGET. Databases (at the end of the day) STILL use round-robin style processing against ALL their connections, and give more weight to system processes. Suppose your connections given 3 milliseconds to handle 4500 byte rows of data - no matter how you slice the data you still have 1 connection = 3 milliseconds of execution time. Now, if you "normalize" the data, you now have 4 connections simultaneously attached, now the round-robin scheduler in the database gives you 12 milliseconds to process the same amount of data IN PARALLEL. More Time + less Data = Faster Execution, guaranteed.

What else am I saying?
DENORMALIZATION OF DATA FOR PERFORMANCE IS A MYTH! Trust me, it works great to eliminate "joins" in the database, but once you've blown the magic point at which the ratio of: Number Of Rows per block to I/O's it takes to read that number of rows, you've doubled, tripled, quadrupled your Disk I/O activity alone - then add in all the serialization to deal with those rows across the network, CPU, RAM, and you've got yourself a resource hog.

I've seen it time and time again.... Database engines these days are TUNED to take advantage of parallelism, the more we Denormalize, the LESS parallelism the engines can engage in. If Joins are a "problem" for performance, then either: a) the database isn't tuned properly or b) the hardware isn't scaled properly or c) both - to take advantage of parallelism.

I've hinted at performance numbers in my blog entry that I'm receiving on a 1TB (raw data) SQLServer2005 EDW across a 4 / 5 table join in a 32 bit Windows environment. I hope this helps.

The problem again, is look at the mathematics of wide rows, they are in direct correlation to slow performance (because they exponentially increase the number of I/O's required to handle the data set). Do not believe the hype that denormalization will solve the performance problems, it doesn't. It is a stop-gap measure for another deeper issue: usually architecturally driven or incorrect tuning.

Love to have your thoughts/comments...

Thanks,
Dan Linstedt
DanL@DanLinstedt.com

  Posted by Dan Linstedt on September 24, 2007 10:16 AM |

Post a comment