Blog: Dan E. Linstedt« Defining Unstructured Data & DW2.0 | Main | I.T. Profitability - a follow on » Microsoft SSIS and SQLServer2005An interesting thing happened on the way to the data bank.... Well, I just couldn't help myself. This is a discussion on performance, notions of derivations, product improvements and directions in SQLServer2005, and some annotations on SSIS and where it has gone, where it needs to go, and so forth. Please remember, this is for 35 Million row tables on a 1 CPU-2GB RAM-1 Disk I/O Channel laptop, so it's the bare-minimum. I'd hope to see exponential (or at least linear) performance gains by going to a larger machine (but this doesn't happen either). I've recently been working with Daniels College of Business, Denver University on a State Tax Data Initiative, laying out and integrating information by year, by county, and so on. We have a very small amount of information, some 35 million assessed properties over 8 years of effort. The point of which is to say that I'm working with this information in SQLServer2005, and SSIS data integration components. When running this information on my Single CPU AMD Athlon, 64 bit processor it runs fairly slowly. When I run a query on all 35 million rows (table scan based), no aggregation, it runs for about 10 to 20 minutes, depending on how many rows it returns (between 100,000 rows and 535,000 rows). 2 GB RAM. But when I run the same query on a brand new laptop with 2GB RAM, and Intel Centrino Duo (dual-core) it runs between 1 minute and 3 minutes to return data. A huge difference in performance or I/O to disk. SQLServer2005 with Table Scans and no joins works extremely well, and I'm very happy with that aspect of the performance on a laptop. However, entering queries which require joins (after adding the appropriate indexes) slow the query down exponentially, due to the significant increase in DISK I/O. Especially when I set it up to insert into a secondary target table. Now here's an interesting thought / difference I noticed. When I run the query as a view through "import / export" data wizard, it runs very quickly until it gets to the final commit. The final commit takes forever, especially if the target table has foreign keys, primary keys, and is set to auto-shrink, auto-statistics update, and has traditional logging and so on. Now, if I turn off auto-shrink, auto-stats, and set db_options for select into/bulk-copy, and trunc. log on chkpt, then drop the foreign keys leaving only the default constraints and a sequence identifier, the import/export wizard screams. I can move 35 Million rows from one table to the next (on the new laptop) inside of 5 minutes. Not bad for a single disk, single table copy. However, I noticed something else: if I use a view (join across not more than two tables), and then load data using SSIS, all of the sudden performance jumps. Trying to execute a join/view inside of Import/Export wizard gives me a different result (slower). Anyhow, SSIS is proving to be extremely powerful when you have access databases to load, OLE/DB sources, ODBC sources, or are all SQLServer based. I'm impressed with the technology that Microsoft has put together. SSIS is NOT the answer (in my opinion) for non-microsoft shops, as it has a hard time connecting to non-microsoft operating systems, and non-microsoft systems (like SAP, Oracle Financials, PeopleSoft, etc..) The transformations within SSIS are powerful, and include fuzzy logic grouping, fuzzy logic matching, and neural net learning capacities (very helpful indeed). Finally, A.I. has made it to the transformation main-stream. Yes, I'm aware that certain RDBMS engines have had this functionality for a couple years now, buried in the SQL layers, but to see it inside an ETL type engine is great, particularly now a part of drag-and-drop GUI. Where does SSIS and SQLServer need to go? SSIS: SQLServer2005: What-ever you do, don't cluster the SQLServers together (unless you're willing to pay the maintenance and upgrade prices). Maintenance of clustered SQLServers is tough at best, but you can see some significant performance improvements. Me? I prefer the big-iron boxes to clustering. In other words: take a mainframe box (p-Series/z-Series, etc..) run a logical partition / virtual partition with Windows OS, and SQLServer2005 - this should scream performance.... Do you have Suggestions for SQLServer? Are you running SQLServer2005? Questions? Comments? Thanks, |
Comments
Dan,
ETI is a strategic partner of Microsoft for SQL Server 2005, our value proposition is enabling SSIS to extend beyond the MSFT based environments. ETI provides pre-built high performace connectors for shuttling data in and out of SSIS to Teradata, Oracle and DB2 achieving 8-20 times the performance of the typical OLeDB and ODBC providers. We also generate Built-to-Order Connectivty for SSIS to legacy systems on the mainframe, Unix or other proprietary systems that are specific to a customers needs.
Laurie Crossett
Posted by: Laurie Crossett | March 22, 2007 9:19 AM
Hi Dan,
I’m a program manager in SQL Server Integration Services team, and just read your article
You have some really good feedback at the end, and I wanted to say that we are really working on fixing many of those issues.
Two things I wanted to mention:
First one is around performance and adjusting the engine parameters to better take advantage of the computing environment. SSIS has knobs around threads and buffer/row sizes, which enables the engine to specifically fit into the environment better. i.e. if the CPU(s) can deal with multiple threads, you can increase the # threads per engine…or if the bottleneck is the memory, you can right-size the buffer, so that SSIS engine adjusts the # of rows being sourced to avoid spooling, which costs a lot of I/O. For instance, a study from Unisys actually proved that SSIS can source ~0.5TB/hour on a single box with a super high-end (ES7000) Unisys machine.
Second one is around connectivity. We do support un/semi-structured data sources out of the box, such as XML, FF, or CSV files. Since SP2, we also support the new ACE provider for Office 2007 connectivity to support sourcing data from MS Office data sources. It’s true that out-of-the-box connectivity options are pretty limited, and Microsoft has connectors to Oracle, DB2, and SAP. However, our extensible platform enables our partners to fill in the gaps for other connectivity options. In fact, SSIS can connect to more than 20 popular data sources including Teradata, Sybase, Informix, and other legacy systems. For a full list, please take a look at the table in our connectivity wiki : http://ssis.wik.is/Data_Sources
Many of these features are not well known, and I really like to engage with our customers to better communicate what SSIS is capable of.
Would you be interested in more information which might perhaps change some of your findings around SSIS? I can also connect you with Unisys to talk about their studies and findings around performance. Would you then be open to update your blog factoring in all these new data points?
Please let me know,
Deniz Erkan
Program Manager – SQL Server Integration Services
Posted by: Deniz Erkan | March 22, 2007 3:29 PM
Hi Dan,
I read about your articles on other posts also like Informatica development etc.
My concern right now is that I use SSIS as my ETL tool and wanted to know how can you really extract data from PeopleSoft and SAP etc. Expandable databases.
I cannot find anything in SSIS so far which can do taht.Also you didnot mention this as an improvement suggestion for MSFT.
Thanks
Surendra
Posted by: Surendra | April 30, 2007 9:32 AM
Hi, Dan.
I'm kind of late to this party, but I thought I'd share some of my experiences with SSIS.
When I first looked at it, I said, "Wow, this looks powerful but complicated." A year and a half later I still feel pretty much the same way.
We have a mixed environment with SQL Server and Oracle databases. The bulk of our need for SSIS is for loading files into either SQL Server or Oracle.
My experience is that it is fine for loading into SQL Server, but radically slower than the Oracle tools for loading into Oracle.
I've also found it overly complex for our most typical scenario: load a file, run a stored procedure.
I find the GUI frustrating and limiting, but I get the sense that it was designed for non-programmers to allow them to achieve programmatic work without actually programming.
It is possible to use the SSIS API in .Net code and by-pass the GUI. I have mixed feelings on that since it then limits maintenance to a .Net programmer. That leaves me with a weird, contradictory position on using SSIS.
It is obvious that Microsoft has put lots of time and thought into this tool, and I think that it can be useful, but I don't think it is quite there yet.
I tried using SSIS for a relatively simple automation process a month or so ago. There were some unusual wrinkles involving FTPing files around and loading them into a clustered database where I cannot run SSIS, then executing a stored procedure. That meant I had to run SSIS on a different SQL Server instance.
Long story short I spent two weeks trying to resolve things like file names that change daily and executing a stored procedure on a machine OTHER than the one running SSIS. At the end I couldn't quite get it to work. I dropped back to a .Net program called by the System Scheduler and had the whole thing working in 2 hours...
So far everyone I have talked too who advocated SSIS has only looked at it from the 100,000 ft level. At that level you just see the powerful potential. The problem comes (for me) when you get down to ground level and try to work with messy details...
Does anyone have any suggestions for good books or other materials that could help me bridge the gap?
Thanks,
jeffa
Posted by: Jeff Ammons | May 11, 2007 10:58 AM
To all,
Thank-you so much for your comments. I've had several discussions with Deniz, and a few other users here.
Let me clarify:
1. It is possible to connect to just about anything using SSIS (but not with the out-of the box solution), you need to seek a 3rd party like ETI custom connectors, which are high-speed and extremely flexible. They can also reach to Teradata, AS/400, and other alternative legacy systems (on a native access basis).
2. Regarding performance, I spoke to Deniz stating the following: I was trying to replicate the "average" Microsoft customer, they will not have the high-powered Unisys machine that he mentions. He conceeded that this is true. In order to compete in the market space that Microsoft likes to "live in" and sell to, they MUST improve performance on standard power boxes (dual core, dual quad core), etc..
3. Regarding the details below 10,000 foot level on SSIS: I work with SSIS hands on, and yes at first it can be complicated to learn and navigate through (such as when do you use a regular connector versus a bulk-load connector), and how do you manage changes to the structure, and so on. But once you figure out how to get around the technical environment it can become very easy to create new components quickly.
On the other hand, there's something to be said for coding and not coding. Hand-coding of highly specialized routines *MAY* be ok, but remember, that whenever we hand code we're back to increasing maintenance costs, and difficulty in maintaining standards. Less than 1% of the data integration routines should be hand-coded in reality.
I grew up an assembly level coder on a Z80/8080 CPM chip, and have coded in Assembly, Pascal, C/C++, Perl, PHP, and Java. I've written BIOS, compilers, linkers, memory management technology, and interrupt device drivers.
I can tell you that without a doubt it is easier and faster for me to use a GUI development environment to build data integration routines. It's also cheaper to maintain and stick to standards.
That said, I don't know of any good books that will take you deeper, however I am about to announce a class on SSIS building/maintaining, contact me for more details.
Cheers,
Dan Linstedt
Posted by: Dan Linstedt | May 15, 2007 7:03 AM
Hello Dan,
I thought it is worth mentioning we just released a control flow task called "Database Partitions for SSIS". It is elegant solution to the problem you mentioned in your article about "1. Easier partitioning, while partitioning is there, it can be difficult to setup and maintain."
Regards
Ivan
Posted by: Ivan Peev | May 20, 2007 1:53 PM