An interesting thing happened on the way to the data bank.... Well, I just couldn't help myself. This is a discussion on performance, notions of derivations, product improvements and directions in SQLServer2005, and some annotations on SSIS and where it has gone, where it needs to go, and so forth. Please remember, this is for 35 Million row tables on a 1 CPU-2GB RAM-1 Disk I/O Channel laptop, so it's the bare-minimum. I'd hope to see exponential (or at least linear) performance gains by going to a larger machine (but this doesn't happen either).
I've recently been working with Daniels College of Business, Denver University on a State Tax Data Initiative, laying out and integrating information by year, by county, and so on. We have a very small amount of information, some 35 million assessed properties over 8 years of effort. The point of which is to say that I'm working with this information in SQLServer2005, and SSIS data integration components.
When running this information on my Single CPU AMD Athlon, 64 bit processor it runs fairly slowly. When I run a query on all 35 million rows (table scan based), no aggregation, it runs for about 10 to 20 minutes, depending on how many rows it returns (between 100,000 rows and 535,000 rows). 2 GB RAM. But when I run the same query on a brand new laptop with 2GB RAM, and Intel Centrino Duo (dual-core) it runs between 1 minute and 3 minutes to return data.
A huge difference in performance or I/O to disk. SQLServer2005 with Table Scans and no joins works extremely well, and I'm very happy with that aspect of the performance on a laptop. However, entering queries which require joins (after adding the appropriate indexes) slow the query down exponentially, due to the significant increase in DISK I/O. Especially when I set it up to insert into a secondary target table.
Now here's an interesting thought / difference I noticed. When I run the query as a view through "import / export" data wizard, it runs very quickly until it gets to the final commit. The final commit takes forever, especially if the target table has foreign keys, primary keys, and is set to auto-shrink, auto-statistics update, and has traditional logging and so on. Now, if I turn off auto-shrink, auto-stats, and set db_options for select into/bulk-copy, and trunc. log on chkpt, then drop the foreign keys leaving only the default constraints and a sequence identifier, the import/export wizard screams. I can move 35 Million rows from one table to the next (on the new laptop) inside of 5 minutes. Not bad for a single disk, single table copy.
However, I noticed something else: if I use a view (join across not more than two tables), and then load data using SSIS, all of the sudden performance jumps. Trying to execute a join/view inside of Import/Export wizard gives me a different result (slower).
Anyhow, SSIS is proving to be extremely powerful when you have access databases to load, OLE/DB sources, ODBC sources, or are all SQLServer based. I'm impressed with the technology that Microsoft has put together. SSIS is NOT the answer (in my opinion) for non-microsoft shops, as it has a hard time connecting to non-microsoft operating systems, and non-microsoft systems (like SAP, Oracle Financials, PeopleSoft, etc..)
The transformations within SSIS are powerful, and include fuzzy logic grouping, fuzzy logic matching, and neural net learning capacities (very helpful indeed). Finally, A.I. has made it to the transformation main-stream. Yes, I'm aware that certain RDBMS engines have had this functionality for a couple years now, buried in the SQL layers, but to see it inside an ETL type engine is great, particularly now a part of drag-and-drop GUI.
Where does SSIS and SQLServer need to go?
These are my thoughts, I would like to see the following focus areas improved upon in future products, also - in order to compete, it may be wise for Microsoft to implement some of these ideas going forward:
1. Expandable Data Flow Views (so that columns can be seen in a standard view format, and so that I don't have to double click to see the column views exploded).
2. Additional connectors / adapters for other sources of data, maybe even unstructured data, word docs, and the like.
3. Separated notion of "connection" from "table definition", today I have to import the table definition every time I setup a new data flow, upon selecting an existing connector to use.
4. Conditional aggregation functionality (conditional aggregate functions for each column)
5. performance numbers (rows per second throughput)
6. Row width information
7. CPU/DISK and I/O performance information (easily accessible from SSIS)
8. Better "stop job" capabilities, I recently noticed SSIS outputting messages (warnings about low virtual memory). I had to watch for these, then physically kill the job, change the virtual mem, reboot, then restart the job. This is fine, except I'd rather have a job setting that says: STOP ANY JOB when virtual or physical RAM is running low.
1. Easier partitioning, while partitioning is there, it can be difficult to setup and maintain.
2. Improved Query Optimizer, the query optimizer begins to execute some things in parallel (whcih is nice), but it has a hard time optimizing more than 2 or 3 table joins even with the proper indexes in place - particularly when the data set is 35 million rows or more in each table that is being joined. Keep in mind, we have 1 disk, 1 CPU, and 1 I/O channel - so this is a huge limitation, but none-the-less, the query optimizer should be able to make more efficient use of block reads and data organization.
3. Better data modeling capabilities. I find the "data model diagramming" engine to be severly lagging behind the times. Let's get some serious data modeling capabilities built in to SQLServer, now that Microsoft owns Visio, I'm surprised they haven't built the Visio data modeling engine into SQLServer (it's a little better in some areas, but worse in others). We need physical and logical data modeling capabilities, metadata definition capabilities, and so on.
What-ever you do, don't cluster the SQLServers together (unless you're willing to pay the maintenance and upgrade prices). Maintenance of clustered SQLServers is tough at best, but you can see some significant performance improvements. Me? I prefer the big-iron boxes to clustering. In other words: take a mainframe box (p-Series/z-Series, etc..) run a logical partition / virtual partition with Windows OS, and SQLServer2005 - this should scream performance....
Do you have Suggestions for SQLServer? Are you running SQLServer2005? Questions? Comments?
Want to recruit students with an MS-BI? Talk to us: http://www.COBICC.com
Posted March 21, 2007 5:14 AM
Permalink | 6 Comments |