I've blogged a little bit about this before. In this entry we'll explore the mathematics behind normalization and denormalization in the data model. Of course, this entry is specifically targeted at very large data sets; on smaller systems the mathematics matter less, and performance degrades only mildly. There's a rumor out there, an un-truth, a fallacy that states: if you're having performance problems in your database, simply denormalize (flatten the tables, remove the joins). This works, but only to a point.
There is a physical upper limit to denormalization where mathematically it simply stops working, and performance degrades on an exponential scale. That magical point is where the width of the row increases so much that rows must be chained across multiple blocks to meet the needs of queries.
That said: VLDW (very large data warehouse) databases should ALWAYS have 64k block sizes (bigger if the hardware can handle it), especially if the rows are 4k wide each. If you're dealing with Unicode data, you have wide rows (2 bytes per character), which makes it difficult to survive without 64k block sizes.
Let's take a look at the mathematics:
64k block sizes
4k per row
16 rows per block
This is a nice even number, but nonetheless, if you are dealing with 60 million, or maybe 500 million rows, that's a lot of I/O's. Let's explore the I/O counts (simple man's math) based purely on disk retrieval (ignoring network, RAM, caching, etc.) for a specific query.
This query wants 10 million rows out of 500 million; there is no parallelism, and no partitioning on the table (the one DB where this assumption doesn't hold is Teradata - it's 100% parallel all the time).
Here's the math on I/O's:
16 rows per block
10M rows / 16 rows per block = 625,000 blocks (assume 1 I/O per block) = 625,000 disk reads.
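The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model only: it assumes one physical read per block and no caching, and it uses the article's round numbers (64k treated as 64,000 bytes, 4k as 4,000).

```python
# Denormalized case: how many disk reads does a 10M-row result cost?
# Round numbers as in the article: 64k = 64,000 bytes, 4k = 4,000 bytes.
BLOCK_SIZE = 64_000      # 64k block size
ROW_WIDTH = 4_000        # 4k per row
ROWS_NEEDED = 10_000_000 # rows the query must return

rows_per_block = BLOCK_SIZE // ROW_WIDTH     # 16 rows per block
blocks_read = ROWS_NEEDED // rows_per_block  # one I/O per block
print(rows_per_block, blocks_read)           # 16 625000
```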
Now, just for simplicity's sake, suppose we really needed only 800 bytes of each row. Would it make a difference in the I/O's in a denormalized solution? NO. The 800 bytes are completely buried within each 4k row, so even with the 10M row return, the I/O's will hit the disk 625k times.
Now, suppose you normalize the table and break it into three groups of data: 800 bytes, 955 bytes, and 2245 bytes. Suppose you need only the first table to answer the query... Here's the math:
64k block size / 800 bytes = 80 rows per block
10M rows / 80 rows per block = 125,000 blocks, or 125,000 disk reads.
I don't know about you, but in my book 125,000 is a lot less than 625,000 to get the same data - a 5x reduction in disk reads, a real multiplier on performance.
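Here is the normalized case run through the same toy model (same assumptions as before: one read per block, the article's round numbers):

```python
# Normalized case: the query is answered from the table holding only the
# 800 bytes it needs. Round numbers as in the article: 64k = 64,000.
BLOCK_SIZE = 64_000
ROW_WIDTH = 800
ROWS_NEEDED = 10_000_000

rows_per_block = BLOCK_SIZE // ROW_WIDTH     # 80 rows per block
blocks_read = ROWS_NEEDED // rows_per_block  # 125,000 reads vs 625,000 before
print(rows_per_block, blocks_read)           # 80 125000
```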
Now suppose you needed ALL 4k to put the row back together, how does this work?
Well, without parallelism there would only be a speed/performance loss. If the database reads ONE TABLE at a time (800 bytes, then 955 bytes, then 2245 bytes) to put the rows back together, then the I/O's are (logically) about 3x higher than a single-table scan. Why? Because the database now has to read 30M rows (10M from each separate table) and then join them to answer the query, and the join operation introduces extra I/O's because it's not done in parallel either.
Now, suppose I introduce a) parallel query, b) parallel join. How does the timing work?
The 800-byte rows read at the same time as the 955-byte rows and the 2245-byte rows, and all three scans happen at the same time as the join operation (as the rows are fed into RAM, they are joined together).
Parallelism doesn't necessarily reduce the I/O count; it does two different things: 1) it runs the I/O's at the same time, and 2) it reduces the amount of TIME to execute.
So what happens when I denormalize? Why is this wrong?
1. Most database block sizes default (on install) to 4k or 8k, causing I/O's to go through the roof.
2. Most EDW row sizes are between 2k and 4k themselves (just look at your average conformed dimension for this)... causing 1 or 2 rows per block, or worse yet, causing "chained blocks" where a single row has half its data in one block and half its data in another. THIS is where denormalization BREAKS THE PERFORMANCE BANK.
When chained rows are introduced, the row width has outgrown what a single block can hold. This causes double, then triple, then quadruple the I/O's compared to a normalized solution with parallel joins. Now denormalization has broken the performance curve, and the only way to get it back is to normalize the data AND introduce parallelism.
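The two failure modes above (too few rows per block, then chaining) can be put into one hypothetical read-count model. It keeps the same assumption as the rest of the article: one physical read per block touched.

```python
# Read count for a 10M-row scan as the block size shrinks relative to the row.
def reads(block_size: int, row_width: int, rows: int) -> int:
    if row_width <= block_size:
        return rows // (block_size // row_width)  # whole rows fit per block
    blocks_per_row = -(-row_width // block_size)  # ceiling: row chains across blocks
    return rows * blocks_per_row

print(reads(64_000, 4_000, 10_000_000))  # 625,000 reads: 16 rows per block
print(reads(8_000, 4_000, 10_000_000))   # 5,000,000 reads: only 2 rows per block
print(reads(4_000, 6_000, 10_000_000))   # 20,000,000 reads: every row chains across 2 blocks
```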
Without parallelism, yes - performance of normalized data sets IS usually slower than denormalized data sets, but with parallelism - it will blow away any denormalized set at nearly any time - especially with a large set of data.
Keep in mind, these mathematics only become a problem with larger sets of data. Usually anything under 10 million rows per table is not an issue (as a guideline, this varies from DB to DB, engine to engine). That is IF the block size is 8k, or even 16k.
In LARGE systems, there are only two ways to improve performance: a) reduce the data set (in this case, normalize), b) increase parallelism (done through tuning and hardware capability). By decreasing the amount of data that each parallel thread has to deal with, we improve performance by a multiplicative factor.
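Combining the two levers, here is a rough sketch of the end-to-end effect; the thread count and per-read cost are illustrative assumptions.

```python
# Two levers: shrink the data each thread touches (normalize) and raise the
# thread count. Assumptions: 5 ms per block read, 8-way parallelism.
MS_PER_READ = 5  # illustrative assumption

def elapsed_ms(blocks: int, threads: int) -> int:
    # Blocks split evenly across threads; wall clock follows the busiest thread.
    return -(-blocks // threads) * MS_PER_READ

denormalized_serial = elapsed_ms(625_000, 1)  # wide rows, one scan stream
normalized_parallel = elapsed_ms(125_000, 8)  # narrow rows, 8 scan streams
print(denormalized_serial, normalized_parallel)  # 3125000 78125
```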
Also be aware that indexes are smaller and more nimble in normalized tables, and that indexes play by the same mathematics as the data set itself.
NOTES OF INTEREST:
In terms of appliances (so to speak): there are some appliances in the marketplace that show marked performance gains on DENORMALIZED data sets only, but they too have their upper limits in terms of the sheer volume of data they can handle effectively.
There are other "appliances" and software entering the marketplace today that are based on column normalization, or column storage. This is a technique that was pushed to the public market by Sybase IQ many years ago, but IQ had other limitations.
Their secret sauce is in the hardware optimizations, and the disk layouts. The one problem (again) with denormalization, is mathematically there is _always_ an upper limit (to the amount of data) before performance degrades exponentially.
Now, if we MODEL our data sets incorrectly in a normalized data warehouse, performance will also be adversely affected. There are some systems that can handle huge volumes extremely well (MPP systems, that is), but the data model that is put in place can severely limit the parallelism and increase the width of the rows unnecessarily. Data models are the next important key to performance success at HUGE volumes. I urge you to check out the Data Vault (Common Foundational Integrated Data Modeling Architecture), which is built for petabytes. It's an architectural solution that is FREE (just like 3rd normal form or star-schema) that solves performance and flexibility problems in data warehousing. You can find it at: http://www.DanLinstedt.com
Posted September 29, 2007 6:54 AM