I've read a recent comment on another blog about my original post on SQL Server 2005 and SSIS. I've also been talking with Microsoft about their product line and the improvements that are on the way. While I can't disclose what Microsoft is working on, I can talk about the type of experiment I'm writing about, and why I staged it the way I did.
Before I launch into the explanations, let me just say that I've been working on performance and tuning of overall systems for over 15 years. The systems I work on these days range from 50 terabytes to multiple petabytes of information - usually with massive numbers of CPUs, massive numbers of high-speed I/O channels, and MPP shared-nothing architectures with SMP units underneath.
The blog I'm replying to is here: http://bencraigo.wordpress.com/2007/04/01/trench-report-ssis-and-sql-server-2005/
What I'm doing is intentional as far as table scans go. I am familiar with the hints he has suggested; however, I have a few comments about them.
First, when I deal with transactional data I like clustered indexes, but when I deal with time-based data warehouse information, clustered indexing can "kill" what otherwise would be a good query plan - regardless of the column chosen for clustering.
Second, most of the queries I'm using require 80% or more of the data in the table, and it's been shown time and again that a table scan is the most efficient access path when 80% (or more) of the data set is needed to answer the query.
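To make the point concrete, here is a minimal T-SQL sketch (the table and column names are hypothetical, not from my actual tests): when the predicate qualifies most of the table, the INDEX(0) table hint forces the scan that the optimizer would usually pick anyway in this situation.

```sql
-- Hypothetical fact table; names are illustrative only.
-- When a query needs ~80% of the rows, a sequential scan beats
-- millions of individual index seeks and lookups.
SELECT f.customer_id, SUM(f.sale_amount) AS total_sales
FROM   dbo.FactSales AS f WITH (INDEX(0))  -- INDEX(0) forces a table/clustered scan
WHERE  f.sale_date >= '2006-01-01'         -- predicate covers most of the table
GROUP BY f.customer_id;
```

The point is not that you should routinely force scans, but that for this class of query the scan is already the right plan, so an index hint buys nothing.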
Third, SQL Server (at least on a laptop) can't even begin to touch the data volumes I usually deal with. I typically work with 50 terabytes to 1.5 petabytes or more; obviously this kind of information lives on a highly scalable platform with 48 to 64 CPUs, 64 GB to 200 GB of RAM, and multiple disk connections (independently switched Fibre Channel).
Fourth, in my discussions with Microsoft directly, according to them they "don't want to play in the super-big (VLDB) arena"; however, if someone "wants" to scale their SQL Server, they will do everything in their power to help out. Their objective is to be on everyone's desktop at a reasonable price. What I'm suggesting is that if this is their goal, then they need to tune performance for "everyone's laptop" - so that a single disk channel, dual RAM banks, and a single CPU work in concert with some level of standard partitioning and performance metrics.
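For reference, the kind of "standard partitioning" I mean does exist in the product today - here is a sketch of monthly range partitioning as SQL Server 2005 supports it (Enterprise Edition only, and all names here are hypothetical); my suggestion is that something like this should be reachable with sensible defaults on a single-disk machine:

```sql
-- Sketch only: monthly range partitioning in SQL Server 2005.
-- On a laptop all partitions land on one filegroup, but the
-- optimizer can still eliminate partitions at query time.
CREATE PARTITION FUNCTION pf_MonthRange (datetime)
AS RANGE RIGHT FOR VALUES ('2007-01-01', '2007-02-01', '2007-03-01');

CREATE PARTITION SCHEME ps_MonthRange
AS PARTITION pf_MonthRange ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSales
(
    sale_date   datetime      NOT NULL,
    customer_id int           NOT NULL,
    sale_amount decimal(12,2) NOT NULL
) ON ps_MonthRange (sale_date);
```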
There are additional problems with running SQL Server on a laptop. I'm also running Win32, in which the PageFile.sys read/write (I/O) mechanism is single-strung - unfortunately this greatly hinders all threaded performance at the I/O level (which is the true bottleneck). I realize that 64-bit Windows with 64-bit SQL Server may solve several of these issues, but unless I have the horsepower to let it work effectively, this solution will "slow me down". In other words, the average laptop doesn't run a 64-bit OS and 64-bit SQL Server, and even if it did, not every laptop has the 6 to 8 GB of RAM and dual quad-core CPUs to run them effectively. Even then, the question remains: is PageFile.sys still single-strung when the I/O calls to the file are made?
Regarding index fill factor: I use this technique extensively to reduce I/O reads and writes in the staging areas of a warehouse. Again, though, the goal here was to replicate "average usage" by an "average user" on an "average" (default) configuration. If the real help is a fill factor of 100% for a "truncate and re-load" table, then the database wizard should help identify this when setting up a particular table structure.
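For anyone unfamiliar with the technique, this is what it looks like in T-SQL (the table and index names are made up for illustration):

```sql
-- Sketch: for a staging table that is truncated and bulk re-loaded,
-- a 100% fill factor packs pages completely instead of reserving
-- free space for random inserts the table will never receive.
CREATE INDEX ix_StageSales_batch
ON dbo.StageSales (batch_id)
WITH (FILLFACTOR = 100);
```

My point is that an "average user" shouldn't have to know this knob exists - the tooling should suggest it when the load pattern makes it the obvious choice.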
About the column values and sizing: the problem with not using an INT or a decimal(12,0) is as follows. There are proven techniques for data warehousing performance, and joining across millions of rows on a varchar or nvarchar of 12 characters or more kills performance far worse than joining on an INT. Furthermore, joins must be exact in the data warehousing world; we cannot risk a "full coverage" field value whose repeatability is such that it returns 20% of the data. As for the full table scan: we needed 80% of the data out of the table anyhow, and numerous tests have shown that a table scan returning that much data is much faster than any indexing scheme.
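The surrogate-key point can be sketched in two lines of T-SQL (all table and column names here are hypothetical):

```sql
-- Slower: joining millions of rows on a wide character business key.
SELECT f.sale_amount, c.customer_name
FROM   dbo.FactSales   AS f
JOIN   dbo.DimCustomer AS c
       ON c.customer_code = f.customer_code;  -- varchar(20) comparison per row

-- Faster: the same join on a 4-byte INT surrogate key assigned at load time.
SELECT f.sale_amount, c.customer_name
FROM   dbo.FactSales   AS f
JOIN   dbo.DimCustomer AS c
       ON c.customer_key = f.customer_key;    -- int comparison per row
```

The narrow integer key means smaller indexes, more join keys per page, and cheaper comparisons - which is exactly where the time goes at warehouse volumes.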
There is an excellent book on the market which I thoroughly enjoyed. It covers a variety of SQL environments and the performance and tuning of SQL: "SQL Performance Tuning" by Peter Gulutzan and Trudy Pelzer (ISBN 0-201-79169-2). Now, this isn't to say you can believe everything you read, but I had over 10 years of performance and tuning experience before picking up this book, and it validated nearly every one of my experiences in the field.
Back to the focus: if Microsoft wants to "be on everyone's desktop", then they need tuning features "for the average desktop" out there. I was trying to mimic the standard desktop one might have running SQL Server 2005 and SSIS. Microsoft has also been kind enough to let me know there are lots of tuning knobs and tweaks in SSIS that help with performance, but admitted that PageFile.sys is still a performance issue, and that a 32-bit OS will always (regardless of the application) put half of the requested memory into PageFile.sys and half into real RAM (if available). This goes for SQL Server as well, which forces all queries to "cache" from disk back to disk.
They told me that by upgrading to a 64-bit OS and 64-bit SQL Server we could allocate "all" real RAM, which would certainly help performance - but again, I will see a slowdown without the proper hardware to handle the extra overhead of code and RAM caching.
About Visio: I typically use CA AllFusion (ERwin) or Embarcadero ER/Studio for my data modeling work. What I was suggesting (again) was that there are features currently in Visio that work better than the "Data Diagramming" offered within MS SQL Server. Ultimately what Microsoft needs to do is build a real data modeling tool that is plugged into the database, the design environment, SSIS, and the metadata - something that goes way beyond Visio's capabilities. I was looking at the short-term answer of how to improve the product immediately.
Thank you for the wonderful comments and blog post; I hope you find this interesting.
See our Academic Advisory Board Members at: http://www.COBICC.org
Posted April 8, 2007 8:14 AM