Blog: Dan E. Linstedt« Mistaken Nano-Identities (just for fun) | Main | Context and Perspective » Over-Normalization: VLDW and performance of queriesJust like there is a danger in over-denormalization (overrunning the block sizes, causing chained rows, and a multiplier to reading the data), there is a danger in over-normalizing... Or is there? Lately there has been renewed discussion about column-based-solutions coming in to play (but that's for another blog). In this blog entry I discuss the dangers of over-normalizing data on a traditional row based database system, especially as it relates to VLDW and MPP. The math that works in FAVOR of Normalization, also works AGAINST normalization if we over-normalize, and re-introduce too many new joins. For instance going from 3rd normal form to 4th normal or even 5th normal form in our architectures can cause significant I/O traffic, even in a parallel environment. What we want (like always with Performance and Tuning) is a balance. In Oracle 10g on a BIG-IRON SMP machine (32 CPU's, 48GB RAM) you can usually achieve between 5 and 15 joins, depending on the disk and I/O configuration, and sizes of tables. Now keep in mind that these are "averaged" numbers, on an "average" VLDW system, and are executed when there is "average load" on those systems to begin with, unfortunately I cannot publicly disclose actual numbers, nor performance of these types of queries, except to say: when properly tuned for parallelism, they run FAST, and except to say that the average table size contains 150 million rows at about 1k each... We are pulling punches here, because some of the tuning that has been done includes things like Join Indexes, or materialized views, or IOT (Index Only Tables) in the cases where row size is 12 to 20 bytes long... We've also used partitioning, database compression, and turned up the parallelism of the indexes available to the optimizers. Keep in mind that these large and very large systems have to be tuned in one form or another in order to get these joins working well. But I wander.... Back to the point. If we overly normalize (reduce down to 1 or 2 columns per table), and then we "double" the I/O's needed to get the data back, not to mention the work that has to be done in Memory... But we DON'T increase the hardware to handle it, then we may very well end up with "too much parallelism" that overwhelms the existing hardware, causing multiple threads to "wait" on the I/Os of one or more disk, or wait on the availability of computing resources. So is it the over-normalization that's the problem? No, not necessarily, in this case its the inability to process everything in parallel all the time, this is why we are seeing a resurgance in column based appliances - they overcome some of these problems at the firm-ware level, and at the data sharing level, and at the processing level. In effect they are "joining" every single column to every other column, in parallel, to reconstruct the rows. Would I recommend this in your standard RDBMS today? The sweet spot will change based on the BLOCK SIZE selected for the database. As a rule of thumb, for "wider" rows in a normalized table, I shoot for a minimum of 50 to 100 rows (block size of 64k). In "smaller" rows in a normalized table, I shoot for a minimum of 400 to 500 rows per block (more if I can get it). In databases that can only handle 8k block sizes, I focus on the parallelism of the query optimizer. Here's something I can share about SQLServer2005: Upgrade from 2000 (even in the 32 bit) to get the performance gains in the optimizer. But, Foreign keys actually perform faster than separate indexes on both tables... It seems the optimization has been greately improved, especially if the foreign key includes a CLUSTERED column (usually a surrogate sequence number). Another note about SQLServer2005: if you have an Index (non-unique), and you have a clustered primary sequence number, then add the primary sequence number to the non-unique index (at the end), make the index unique, and then join.... It will also be faster in joins. By the way, I normally don't use clustered data sets, except when it comes to surrogate ever increasing sequences. Hope this helps, |