Blog: Dan E. Linstedt

Dan Linstedt


I've blogged a little bit about this before. In this entry we'll explore the mathematics behind normalization and denormalization in the data model. This entry is specifically targeted at very large data sets; on smaller systems the same math applies, but performance only degrades to a certain degree. There's a rumor out there, an untruth, a fallacy, that states: if you're having performance problems in your database, simply denormalize (flatten the tables, remove the joins). This works, but only to a point.

There is a physical upper limit to denormalization where mathematically it simply stops working, and performance degrades on an exponential scale. That magical point is where the width of the row increases so much that it forces rows to span multiple blocks (chained together) to meet the needs of queries.

That said: VLDW databases should ALWAYS have 64k block sizes (bigger if the hardware can handle it), especially if the rows are 4k wide each. If you're dealing with Unicode data, then you have wide rows (2 bytes per character, so a 2,000-character row is already 4k wide), which makes it difficult to survive without 64k block sizes.

Let's take a look at the mathematics:
64k block size
÷ 4k per row
---------------
16 rows per block

This is a nice even number, but nonetheless, if you are dealing with 60 million, or maybe 500 million rows, that's a lot of I/Os. Let's explore the I/O counts (simple man's math) based on disk retrieval alone (ignoring network, RAM, caching, etc.) for a specific query.

This query wants 10 million rows out of 500 million; there is no parallelism and no partitioning on the table (the one DB where these assumptions don't apply is Teradata - it's 100% parallel all the time).

Here's the math on I/Os:
10 million rows
÷ 16 rows per block
-------------------------
625,000 blocks (assume 1 I/O per block) = 625,000 disk reads.
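A quick sketch of that arithmetic (a hypothetical helper, not any particular database's formula; it assumes one I/O per block and rows packed whole into blocks):

```python
import math

def disk_reads(rows_needed, row_bytes, block_bytes=64 * 1024):
    """Estimate block reads for a serial, non-partitioned retrieval."""
    rows_per_block = block_bytes // row_bytes       # 64k / 4k = 16 rows per block
    return math.ceil(rows_needed / rows_per_block)  # blocks touched = disk reads

print(disk_reads(10_000_000, 4 * 1024))   # 625,000 reads for the full 4k row
```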

Now, just for simplicity's sake, suppose we really needed only 800 bytes of each row. Would it make a difference in the I/Os in a denormalized solution? NO. The 800 bytes are completely buried within each 4k row, so even though only a fraction of each row is needed for the 10M-row return, the I/Os will still hit the disk 625k times.

Now, suppose you normalize the table and break it into three groups of data: 800 bytes, 955 bytes, and 2245 bytes. Suppose you need only the first table now to answer the query... Here's the math:
64k block size / 800 bytes = 80 rows per block
10M rows / 80 rows per block = 125,000 blocks, or 125,000 disk reads.
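Extending the same hypothetical helper makes the comparison explicit (the code gets 81 rows per block from a 65,536-byte block; the post rounds to 80 rows per block, which is where the 125,000 figure comes from):

```python
import math

def disk_reads(rows_needed, row_bytes, block_bytes=64 * 1024):
    rows_per_block = block_bytes // row_bytes
    return math.ceil(rows_needed / rows_per_block)

denormalized = disk_reads(10_000_000, 4 * 1024)  # 625,000 reads (full 4k row)
normalized = disk_reads(10_000_000, 800)         # ~123,500 reads (~125,000 at 80 rows/block)
print(denormalized, normalized, round(denormalized / normalized, 1))  # roughly a 5x reduction
```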

I don't know about you, but in my book 125,000 is a lot less than 625,000 to get the same data... That's a 5x reduction in disk reads - a real multiplier on performance.

Now suppose you needed ALL 4k to put the row back together. How does this work?
Well, without parallelism there is only a speed/performance loss. If the database reads ONE TABLE at a time (the 800-byte table, then the 955-byte table, then the 2245-byte table) to put the rows back together, the work is (logically) 3x that of a single table scan. Why? Because the database now has to read 30M rows (10M from each separate table) and then join them to answer the query. The join operation introduces extra I/Os on top of that, because it isn't done in parallel either.

Now, suppose I introduce a) parallel query, b) parallel join. How does the timing work?
The 800-byte rows are read at the same time as the 955-byte rows and the 2245-byte rows, and all three scans happen at the same time as the join operation (as the rows are fed into RAM, they are joined together).

Parallelism doesn't necessarily reduce the I/O count; it does two different things: 1) it runs the I/Os at the same time, and 2) it reduces the amount of TIME it takes to execute.
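A toy illustration of that timing effect (pure simulation: the table names and scan durations are made up, and time.sleep stands in for I/O):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def scan(table_name, seconds):
    """Stand-in for a table scan; 'seconds' models its I/O time."""
    time.sleep(seconds)
    return table_name

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    # hypothetical scan times for the 800-, 955-, and 2245-byte tables
    results = list(pool.map(scan, ["t_800", "t_955", "t_2245"], [0.5, 0.6, 1.4]))
elapsed = time.perf_counter() - start

print(results)            # all three scans completed
print(round(elapsed, 1))  # ~1.4s (the slowest scan), not 0.5 + 0.6 + 1.4 = 2.5s
```

The I/Os still happen; they just overlap, so the elapsed time tracks the slowest stream instead of the sum of all three.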

So what happens when I denormalize? Why is this wrong?
1. Most database block sizes default (on install) to 4k or 8k, causing I/Os to go through the roof.
2. Most EDW row sizes are between 2k and 4k themselves (just look at your average conformed dimension), causing 1 or 2 rows per block - or worse yet, causing "chained rows" where a single row has half its data in one block and half in another. THIS is where denormalization BREAKS THE PERFORMANCE BANK.

When chained rows are introduced, the data has been denormalized beyond what the I/O subsystem can service in reasonable time. Each row now costs double, then triple, then quadruple the I/Os compared to a normalized solution with parallel joins. Now denormalization has broken the performance curve, and the only way to get it back is to normalize the data AND introduce parallelism.
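A rough sketch of how chaining multiplies the cost (same simplifications as before: one I/O per block, no caching; the 10k row width is a made-up example):

```python
import math

def reads_with_chaining(rows_needed, row_bytes, block_bytes=8 * 1024):
    """Estimate reads when a row may no longer fit cleanly inside one block."""
    if row_bytes >= block_bytes:
        blocks_per_row = math.ceil(row_bytes / block_bytes)  # chained pieces per row
        return rows_needed * blocks_per_row
    rows_per_block = block_bytes // row_bytes
    return math.ceil(rows_needed / rows_per_block)

# 4k rows in an 8k block: 2 rows per block -> 5,000,000 reads for 10M rows
print(reads_with_chaining(10_000_000, 4 * 1024, 8 * 1024))
# a 10k denormalized row in an 8k block: 2 chained blocks per row -> 20,000,000 reads
print(reads_with_chaining(10_000_000, 10 * 1024, 8 * 1024))
```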

Without parallelism, yes - a normalized data set IS usually slower to query than a denormalized one. But with parallelism it will blow away any denormalized set nearly every time, especially with a large set of data.

Keep in mind, these mathematics only become a problem with larger sets of data. As a guideline, anything under 10 million rows per table is usually not an issue (this varies from DB to DB and engine to engine) - and that is IF the block size is 8k, or even 16k.

In LARGE systems, there are only two ways to improve performance: a) reduce the data set (in this case, normalize), or b) increase parallelism (done through tuning and hardware capability). By decreasing the amount of data that each parallel thread has to deal with, we improve performance by a multiplicative factor.
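Here's a back-of-the-envelope model of both levers together (the 5 ms per read and the stream counts are invented for illustration; real engines overlap I/O and cache aggressively, so treat this only as a way to see the shape of the curve):

```python
import math

def scan_seconds(rows, row_bytes, block_bytes=64 * 1024,
                 parallel_streams=1, seconds_per_read=0.005):
    """Very rough elapsed-time model: (block reads / streams) * read latency."""
    rows_per_block = max(block_bytes // row_bytes, 1)
    reads = math.ceil(rows / rows_per_block)
    return (reads / parallel_streams) * seconds_per_read

print(scan_seconds(10_000_000, 4 * 1024))                 # denormalized, serial
print(scan_seconds(10_000_000, 800))                      # normalized, serial
print(scan_seconds(10_000_000, 800, parallel_streams=8))  # normalized + parallel
```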

Also be aware that indexes are smaller and more nimble on normalized tables, and that indexes play by the same block-and-row mathematics as the data set itself.

NOTES OF INTEREST:
In terms of appliances (so to speak): there are some appliances in the marketplace that show marked performance gains on DENORMALIZED data sets only, but they too have their upper limits in terms of the sheer volume of data they can handle effectively.
There are other "appliances" and software entering the marketplace today that are based on column normalization, or column storage. This is a technique that was pushed to the public market by Sybase IQ many years ago, but IQ had other limitations.

Their secret sauce is in the hardware optimizations and the disk layouts. The one problem (again) with denormalization is that, mathematically, there is _always_ an upper limit to the amount of data before performance degrades exponentially.

Now, if we MODEL our data sets incorrectly in a normalized data warehouse, performance will also suffer. There are some systems (MPP systems, that is) that can handle huge volumes extremely well, but a poorly chosen data model severely limits the parallelism and also increases the width of the rows unnecessarily. Data models are the next important key to performance success at HUGE volumes. I urge you to check out the Data Vault (Common Foundational Integrated Data Modeling Architecture), which is built for petabytes. It's an architectural solution that is FREE (just like 3rd normal form or star schema) and that solves performance and flexibility problems in data warehousing. You can find it at: http://www.DanLinstedt.com

Cheers,
Dan Linstedt


Posted September 29, 2007 6:54 AM

2 Comments

There's no problem if there are fewer joins when engineers want only specific data.

However, my engineers, on a daily basis, want approximately 80% of all the columns available in the DB.

That being said, without parallelism (which I don't have), I have to do multiple joins of 70 million rows.

(Did I mention that this is through 2 RAID1 500GB 7200rpm IDE drives?)

I'm sorry you see it this way.

1) Your disks are too slow; you need at least 10,000 or 15,000 RPM drives.
2) You have two disks, but how many I/O channel cards? Are these SCSI disks? Is this a NAS, a SAN, or DASD? Is it internal or external?
3) I can join two tables with 500 million rows each in a SQL Server 2005 environment on a 2-CPU dual-core machine (also blogged here) and return 10,000 rows in under 5 seconds.

The system you are using requires more tuning.

Thanks,
Dan Linstedt
