Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for Masters students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

I've read a recent comment in another blog about my original posting on SQL Server 2005 and SSIS. I've also been talking with Microsoft about their product line and the improvements that are on the way. While I can't disclose what Microsoft is working on, I can talk about the type of experiment I've been writing about and why I staged it the way I did.

Before I launch into the explanations, let me just say that I've been working with performance and tuning of overall systems for over 15 years. The systems I work on these days run from 50 terabytes to multiple petabytes of information - usually with massive numbers of CPUs, massive numbers of high-speed I/O channels, MPP shared-nothing architectures with SMP units underneath, and so on.

The blog I'm replying to is here: http://bencraigo.wordpress.com/2007/04/01/trench-report-ssis-and-sql-server-2005/

What I'm doing with table scans is intentional. I am familiar with the hints he suggested; however, I have a few comments about those hints.

First, when I deal with transactional data, I like clustered indexes. However, when I deal with time-based data warehouse information, clustered indexing can "kill" what would otherwise be a good query plan, regardless of the column chosen for clustering.
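To make the contrast concrete, here is a minimal T-SQL sketch (the table and column names are hypothetical, not from any real project): one transactional table where a clustered primary key serves point lookups well, and one time-based warehouse table left as a heap because the queries against it read most of the rows anyway.

-- Transactional (OLTP) table: a clustered index on the key supports
-- frequent single-row lookups and updates.
CREATE TABLE dbo.CustomerOrder (
    OrderID    INT           NOT NULL,
    CustomerID INT           NOT NULL,
    OrderDate  DATETIME      NOT NULL,
    Amount     DECIMAL(12,2) NOT NULL,
    CONSTRAINT PK_CustomerOrder PRIMARY KEY CLUSTERED (OrderID)
);

-- Time-based warehouse table: left as a heap (no clustered index),
-- since scan-heavy queries gain little from clustering and bulk loads
-- would pay for maintaining the clustered order on every load.
CREATE TABLE dbo.FactOrderHistory (
    OrderKey    INT           NOT NULL,
    CustomerKey INT           NOT NULL,
    LoadDate    DATETIME      NOT NULL,
    Amount      DECIMAL(12,2) NOT NULL
);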

Second, most of the queries I'm using require 80% of the data in the table or more, and it's been proven time and time again that table scans are most efficient when 80% (or more) of the data set is needed to answer the query.
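For illustration, here is the kind of query I mean (hypothetical names, assuming the heap table sketched above): the date range covers the vast majority of the rows, so the optimizer will normally choose a table scan, and forcing an index would only add per-row lookups on top of reading most of the table anyway.

-- Aggregate over roughly 80% or more of the fact rows: a full table
-- scan reads the pages sequentially once, which is cheaper than
-- driving millions of individual lookups through an index.
SET STATISTICS IO ON;   -- compare logical reads between plans

SELECT CustomerKey, SUM(Amount) AS TotalAmount
FROM dbo.FactOrderHistory
WHERE LoadDate >= '2002-01-01'   -- range covers most of the table
GROUP BY CustomerKey;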

Third, the type of data sets I deal with (in volume) is something SQL Server can't even begin to touch (at least on a laptop). I typically work with 50 terabytes to 1.5 petabytes or more; obviously this type of information lives on a highly scalable platform with 48 to 64 CPUs, 64 GB to 200 GB of RAM, and multiple disk connections (independently switched fibre channels).

Fourth, in my discussions with Microsoft directly, according to them they "don't want to play in the super-big (VLDB) arena"; however, if someone "wants" to scale their SQL Server, they will do everything in their power to help out. Their objective is to be on everyone's desktop at a reasonable price. What I'm suggesting is that if this is their goal, then they need to adjust the performance for "everyone's laptop" - so that a single disk channel, dual RAM bank, and single CPU work in congruence with some level of standard partitioning and performance metrics.

There are additional problems with running SQL Server on a laptop. I'm also running Win32, in which PageFile.sys is a single-strung read/write (I/O) mechanism - unfortunately this greatly hinders all threaded performance at the I/O level (which is a true bottleneck). I realize that 64-bit Windows with 64-bit SQL Server may solve several of these issues, but unless I have the horsepower to let it work effectively, this solution will "slow me down". In other words, the average laptop doesn't run a 64-bit OS and 64-bit SQL Server, and even if it did, not every laptop has 6 to 8 GB of RAM and dual quad-core CPUs to run it effectively. Even then, there still remains the question: is PageFile.sys still single-strung when the I/O calls to the file are made?

Regarding index FillFactor: I use this technique extensively to reduce the I/O reads and writes in the staging areas of a warehouse. However - again - the goal was to replicate "average usage" by an "average user" on an "average" or default configuration. If the real help is a FillFactor of 100% for a "truncate and re-load" table, then the database wizard should help identify this when setting up a particular table structure.
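As a hypothetical example of the staging-area pattern (the names are mine): a table that is truncated and bulk re-loaded each cycle, and never updated in place, can take a 100% fill factor so the pages are packed full and subsequent reads touch as few pages as possible.

-- Staging table is truncated and re-loaded every run; rows are never
-- updated in place, so no free space needs to be reserved on the pages.
CREATE NONCLUSTERED INDEX IX_StageOrder_LoadDate
    ON dbo.StageOrder (LoadDate)
    WITH (FILLFACTOR = 100);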

About the column values and sizing: the problem with not using an INT or a DECIMAL(12,0) is as follows. There are proven techniques for data warehousing performance, and joining across millions of rows on a varchar or nvarchar of 12 characters or more kills performance far worse than joining on an INT in the table. Furthermore, the joins must be exact in the data warehousing world; we cannot risk a "full coverage" field value whose repeatability is such that it returns 20% of the data. Now, in regards to a full table scan, we needed 80% of the data out of the table anyhow, and numerous tests have shown that table scans returning this amount of data are much faster than any indexing scheme.
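A hypothetical sketch of the difference (table and column names are mine): the warehouse-style join below runs on a 4-byte integer surrogate key, while the commented-out alternative joins the same volumes on a 12-plus character business key, which widens every value the join has to compare and hash.

-- Join on narrow INT surrogate keys: small, exact, and cheap to hash
-- or merge across millions of rows.
SELECT d.CustomerName, SUM(f.Amount) AS TotalAmount
FROM dbo.FactOrderHistory AS f
JOIN dbo.DimCustomer AS d
    ON d.CustomerKey = f.CustomerKey      -- INT surrogate key
GROUP BY d.CustomerName;

-- The pattern argued against: the same join on a wide character key.
--  ON d.CustomerCode = f.CustomerCode    -- VARCHAR(12) or wider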

There is an excellent book on the market which I thoroughly enjoy. It covers a variety of SQL environments and the performance and tuning of SQL. Its ISBN is 0-201-79169-2, title: "SQL Performance Tuning" by Peter Gulutzan and Trudy Pelzer. Now, this isn't to say you can believe everything you read, but I will say that I had over 10 years of performance and tuning experience before picking up this book, and it validated nearly every one of my experiences in the field.

Back to the focus: if Microsoft wants to "be on everyone's desktop," then they need to have tuning features "for the average desktop" out there. I was trying to mimic the standard desktop that one might have running SQL Server 2005 and SSIS. Microsoft has also been kind enough to let me know there are lots of tuning knobs and tweaks in SSIS that help with performance, but admitted that PageFile.sys is still a performance issue, and that a 32-bit OS will always (regardless of the application) put half of the requested memory into PageFile.sys and half into real RAM (if available). This goes for SQL Server as well, which forces all queries to "cache" from disk back to disk.

They told me that by upgrading to a 64-bit OS and 64-bit SQL Server we could allocate "all" real RAM, which would certainly help performance. But again, I will see a slowdown without the proper hardware to handle the extra overhead of code and RAM caching.

About Visio: I typically use CA AllFusion ERwin or Embarcadero ER/Studio for my data modeling work. However, what I was suggesting (again) was that there are features currently in Visio that work better than the "Data Diagramming" offered within MS SQL Server. Ultimately, what Microsoft needs to do is build a real data modeling tool that is plugged into the database, the design environment, SSIS, and the metadata - something that goes way beyond Visio's capabilities. I was looking at the short-term answer of how to improve the product immediately.

Thank you for the wonderful comments and blog post; I hope you find this interesting.
Dan Linstedt
See our Academic Advisory Board Members at: http://www.COBICC.org


Posted April 8, 2007 8:14 AM

3 Comments

Great post, Dan... and thanks for expanding on the details of your original post. One thing that is clear, and what I alluded to in my comments, is that different solutions have different requirements and challenges. I knew that I didn't know enough about what you were working on, or what your outcomes were, to really comment on it. Apparently that didn't stop me.

Most of my work has been with data warehouses/marts much smaller than what you've worked with (right around a terabyte), with querying requirements that focus on slivers of that data. So from my perspective, table scans in an execution plan always raised the flag that the indexes weren't working. It would be awesome to be able to work with multi-PB systems, so I am completely jealous.

On the column values, I hope you didn't think I was suggesting varchar or nvarchar. I'm a huge fan of surrogate keys and keeping them as tinyint, smallint, or int depending on the size of your dimensions. So I'm with you there.

To really get the most out of performance, I found that the best results come from maxing out the size of the indexes, partitioning the data, tuning the hardware and storage configuration, and digging in deep with the execution plans of the queries to make sure they're as optimized as possible. It also helps to understand how the information will be queried, so the design can take advantage of how users will hit the system.
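As a rough illustration of the partitioning piece (hypothetical names, SQL Server 2005 syntax): a range partition function and scheme let queries that filter on date touch only the relevant slices of a fact table, and make load or archive maintenance a per-partition operation rather than a per-table one.

-- Partition a fact table by month so date-filtered queries touch
-- only the partitions they need.
CREATE PARTITION FUNCTION pfSaleMonth (DATETIME)
    AS RANGE RIGHT FOR VALUES ('2007-01-01', '2007-02-01', '2007-03-01');

CREATE PARTITION SCHEME psSaleMonth
    AS PARTITION pfSaleMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSales (
    SalesKey INT           NOT NULL,
    SaleDate DATETIME      NOT NULL,
    Amount   DECIMAL(12,2) NOT NULL
) ON psSaleMonth (SaleDate);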

My post was a bit superficial, so I really appreciate all the added context around your original post and my comments. Great blog, and thanks again for your comments.

Hi Dan,

I was lucky to find your posts about SQL Server performance, as my company is restructuring its BI. The company is at the end of a merger and is now deciding the architecture for its "new BI", using the experience from both teams. Company A has the following BI architecture:
- Database: Oracle 9i
- ETL tool: PowerCenter
- OLAP tool: Business Objects
- Volume: less than 1 terabyte

Company B has:
- Database: SQL Server 2000
- ETL tool: none; they use SQL Server stored procedures
- OLAP tool: Excel (?), Reporting Services
- Volume: less than 0.5 terabyte

The "new BI" size will be close to 1 Terabyte, so should we continue with Company A's architecture or should we look for a low cost option as Company B? How do we persuade our boss to pay a little more for (what you think is) a better solution?

PS: Sorry for my poor English

I cannot recommend (publicly) a solution for you. We have to take this off-line in order to discuss this.

What I can say is this: SQL Server 2005 and SSIS work wonderfully well past 20 TB, in particular if they are tuned properly.

SSIS also has a learning curve, but that can be surmounted by taking training - advanced training.

Feel free to contact me directly.

Thanks,
Dan L
DanL@RapidACE.com
