Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW, and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor at The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

An interesting thing happened on the way to the data bank.... Well, I just couldn't help myself. This is a discussion of performance, notions of derivations, product improvements, and directions in SQLServer2005, along with some annotations on SSIS: where it has gone, where it needs to go, and so forth. Please remember, this is for 35 million row tables on a 1 CPU, 2 GB RAM, 1 disk I/O channel laptop, so it's the bare minimum. I'd hope to see exponential (or at least linear) performance gains by moving to a larger machine (though in practice that doesn't happen either).

I've recently been working with the Daniels College of Business at Denver University on a State Tax Data Initiative, laying out and integrating information by year, by county, and so on. We have a relatively small amount of information: some 35 million assessed-property records spanning 8 years. The point is that I'm working with this information in SQLServer2005 and the SSIS data integration components.

When I run this information on my single-CPU, 64-bit AMD Athlon with 2 GB RAM, it runs fairly slowly: a query across all 35 million rows (table-scan based), with no aggregation, runs for about 10 to 20 minutes, depending on how many rows it returns (between 100,000 and 535,000 rows). But when I run the same query on a brand-new laptop with 2 GB RAM and an Intel Centrino Duo (dual-core), it returns data in 1 to 3 minutes.
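
For reference, the shape of that query is roughly the following; the table and column names are hypothetical stand-ins for the actual tax schema:

    -- Full scan of all 35 million rows, no joins, no aggregation.
    -- Hypothetical table and column names.
    SELECT parcel_id, county_name, tax_year, assessed_value
    FROM dbo.AssessedProperty
    WHERE assessed_value > 500000;   -- returns roughly 100,000 to 535,000 rows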

That is a huge difference in performance and disk I/O. SQLServer2005 handles table scans with no joins extremely well, and I'm very happy with that aspect of the performance on a laptop. However, queries that require joins (even after adding the appropriate indexes) slow down dramatically due to the significant increase in disk I/O, especially when I set the query up to insert into a secondary target table.
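
If you want to see that disk I/O for yourself, SQLServer will report reads per table. A minimal sketch, again with hypothetical names:

    -- Report logical/physical reads per table for the statements that follow.
    SET STATISTICS IO ON;

    SELECT p.parcel_id, p.assessed_value, c.county_name
    FROM dbo.AssessedProperty AS p
    JOIN dbo.County AS c ON c.county_id = p.county_id;  -- indexed, but still I/O bound

    SET STATISTICS IO OFF;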

Now here's an interesting difference I noticed. When I run the query as a view through the import/export data wizard, it runs very quickly until it gets to the final commit. The final commit takes forever, especially if the target table has foreign keys and primary keys and is set to auto-shrink and auto-update statistics, with traditional logging and so on. But if I turn off auto-shrink and auto-stats, set the database options for select into/bulk-copy and trunc. log on chkpt., and then drop the foreign keys, leaving only the default constraints and a sequence identifier, the import/export wizard screams. I can move 35 million rows from one table to the next (on the new laptop) inside of 5 minutes. Not bad for a single-disk, single-table copy.
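
For anyone who wants to replicate this, here is a minimal T-SQL sketch of those settings. The database, table, and constraint names are hypothetical, and note that in SQLServer2005 the old select into/bulk-copy and trunc. log on chkpt. options map onto the recovery model:

    -- Hypothetical database name; adjust for your environment.
    ALTER DATABASE TaxData SET AUTO_SHRINK OFF;
    ALTER DATABASE TaxData SET AUTO_UPDATE_STATISTICS OFF;

    -- BULK_LOGGED (or SIMPLE) is the SQLServer2005 equivalent of the old
    -- "select into/bulk copy" and "trunc. log on chkpt." db_options.
    ALTER DATABASE TaxData SET RECOVERY BULK_LOGGED;

    -- Drop the foreign key on the target, keeping only the default
    -- constraints and the identity column (hypothetical names).
    ALTER TABLE dbo.AssessedProperty_Copy
        DROP CONSTRAINT FK_AssessedPropertyCopy_County;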

However, I noticed something else: if I use a view (a join across no more than two tables) and then load the data using SSIS, all of a sudden performance jumps. Executing the same join/view inside the import/export wizard gives me a different (slower) result.
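
A sketch of that kind of view, with hypothetical names; SSIS then sources directly from the view rather than being handed the raw join:

    -- Two-table join wrapped in a view; SSIS reads from the view.
    CREATE VIEW dbo.vPropertyWithCounty
    AS
    SELECT p.parcel_id, p.tax_year, p.assessed_value, c.county_name
    FROM dbo.AssessedProperty AS p
    JOIN dbo.County AS c ON c.county_id = p.county_id;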

Anyhow, SSIS is proving to be extremely powerful when you have Access databases, OLE DB sources, or ODBC sources to load, or when you are all-SQLServer based. I'm impressed with the technology that Microsoft has put together. SSIS is NOT the answer (in my opinion) for non-Microsoft shops, as it has a hard time connecting to non-Microsoft operating systems and non-Microsoft applications (like SAP, Oracle Financials, PeopleSoft, etc.).

The transformations within SSIS are powerful, and include fuzzy-logic grouping, fuzzy-logic matching, and neural-net learning capabilities (very helpful indeed). Finally, A.I. has made it to the transformation mainstream. Yes, I'm aware that certain RDBMS engines have had this functionality for a couple of years now, buried in the SQL layers, but seeing it inside an ETL-type engine is great, particularly now that it's part of a drag-and-drop GUI.

Where do SSIS and SQLServer need to go?
These are my thoughts on the focus areas I would like to see improved in future products; to stay competitive, it may be wise for Microsoft to implement some of these ideas going forward:

SSIS:
1. Expandable data flow views (so that columns can be seen in a standard view format, and so that I don't have to double-click to see the column views exploded).
2. Additional connectors/adapters for other sources of data, maybe even unstructured data, Word docs, and the like.
3. A separated notion of "connection" versus "table definition"; today I have to import the table definition every time I set up a new data flow, even when I select an existing connector.
4. Conditional aggregation functionality (conditional aggregate functions for each column; see the SQL workaround sketched just after this list).
5. Performance numbers (rows-per-second throughput).
6. Row width information.
7. CPU, disk, and I/O performance information (easily accessible from SSIS).
8. Better "stop job" capabilities. I recently noticed SSIS outputting warnings about low virtual memory; I had to watch for these, then manually kill the job, change the virtual memory, reboot, and restart the job. That works, but I'd rather have a job setting that says: STOP ANY JOB when virtual or physical RAM is running low.

SQLServer2005:
1. Easier partitioning. Partitioning is there, but it can be difficult to set up and maintain (see the sketch just after this list).
2. An improved query optimizer. The optimizer begins to execute some things in parallel (which is nice), but it has a hard time optimizing more than 2 or 3 table joins even with the proper indexes in place, particularly when the data set is 35 million rows or more in each table being joined. Keep in mind we have 1 disk, 1 CPU, and 1 I/O channel, so that is a huge limitation; nonetheless, the query optimizer should be able to make more efficient use of block reads and data organization.
3. Better data modeling capabilities. I find the "data model diagramming" engine to be severely lagging behind the times. Let's get some serious data modeling capabilities built into SQLServer. Now that Microsoft owns Visio, I'm surprised they haven't built the Visio data modeling engine into SQLServer (it's a little better in some areas, but worse in others). We need physical and logical data modeling capabilities, metadata definition capabilities, and so on.
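
On the partitioning point, the plumbing does exist in SQLServer2005 (an Enterprise Edition feature); it is just verbose. A minimal sketch of partitioning the property table by tax year, with hypothetical names and boundary values:

    -- Partition function: one partition per tax-year boundary.
    CREATE PARTITION FUNCTION pfTaxYear (int)
        AS RANGE RIGHT FOR VALUES (2000, 2001, 2002, 2003, 2004, 2005, 2006);

    -- Map every partition to PRIMARY (a single-disk laptop has no choice).
    CREATE PARTITION SCHEME psTaxYear
        AS PARTITION pfTaxYear ALL TO ([PRIMARY]);

    CREATE TABLE dbo.AssessedProperty_Partitioned (
        parcel_id      bigint NOT NULL,
        tax_year       int    NOT NULL,
        assessed_value money  NOT NULL
    ) ON psTaxYear (tax_year);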

Whatever you do, don't cluster the SQLServers together (unless you're willing to pay the maintenance and upgrade prices). Maintenance of clustered SQLServers is tough at best, even though you can see some significant performance improvements. Me? I prefer big-iron boxes to clustering. In other words: take a mainframe-class box (p-Series, z-Series, etc.), run a logical/virtual partition with a Windows OS and SQLServer2005, and it should scream....

Do you have Suggestions for SQLServer? Are you running SQLServer2005? Questions? Comments?

Thanks,
Dan Linstedt
Want to recruit students with an MS-BI? Talk to us: http://www.COBICC.com


Posted March 21, 2007 5:14 AM

6 Comments

Dan,

ETI is a strategic partner of Microsoft for SQL Server 2005; our value proposition is enabling SSIS to extend beyond MSFT-based environments. ETI provides pre-built, high-performance connectors for shuttling data in and out of SSIS to Teradata, Oracle, and DB2, achieving 8-20 times the performance of the typical OLE DB and ODBC providers. We also generate built-to-order connectivity for SSIS to legacy systems on the mainframe, Unix, or other proprietary systems specific to a customer's needs.

Laurie Crossett

Hi Dan,

I'm a program manager on the SQL Server Integration Services team, and I just read your article.

You have some really good feedback at the end, and I wanted to say that we are really working on fixing many of those issues.

Two things I wanted to mention:

The first is around performance and adjusting the engine parameters to better take advantage of the computing environment. SSIS has knobs around threads and buffer/row sizes, which let the engine fit the environment better: if the CPU(s) can handle multiple threads, you can increase the number of threads per engine, or, if the bottleneck is memory, you can right-size the buffer so that the SSIS engine adjusts the number of rows being sourced and avoids spooling, which costs a lot of I/O. For instance, a study from Unisys showed that SSIS can source roughly 0.5 TB/hour on a single, super-high-end Unisys (ES7000) machine.
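
To make that concrete: the knobs live on the Data Flow Task. The three properties to look at are listed below; the values shown are purely illustrative, not recommendations, and the right numbers depend on your rows and hardware:

    EngineThreads        = 4          -- worker threads available to the data flow
    DefaultBufferMaxRows = 50000      -- maximum rows per pipeline buffer
    DefaultBufferSize    = 20971520   -- buffer size in bytes (20 MB here)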

The second is around connectivity. We do support unstructured and semi-structured data sources out of the box, such as XML, flat files, and CSV files. Since SP2, we also support the new ACE provider for Office 2007 connectivity, to support sourcing data from MS Office data sources. It's true that out-of-the-box connectivity options are fairly limited, but Microsoft has connectors to Oracle, DB2, and SAP, and our extensible platform enables our partners to fill in the gaps for other connectivity options. In fact, SSIS can connect to more than 20 popular data sources, including Teradata, Sybase, Informix, and other legacy systems. For a full list, please take a look at the table in our connectivity wiki: http://ssis.wik.is/Data_Sources

Many of these features are not well known, and I really want to engage with our customers to better communicate what SSIS is capable of.

Would you be interested in more information that might change some of your findings around SSIS? I can also connect you with Unisys to talk about their studies and findings around performance. Would you then be open to updating your blog to factor in these new data points?

Please let me know,

Deniz Erkan
Program Manager – SQL Server Integration Services

Hi Dan,
I've read your articles in other posts as well, like the ones on Informatica development.
My concern right now is that I use SSIS as my ETL tool, and I want to know how you can really extract data from expandable packaged systems like PeopleSoft and SAP.
I cannot find anything in SSIS so far which can do that. Also, you did not mention this as an improvement suggestion for MSFT.
Thanks
Surendra

Hi, Dan.

I'm kind of late to this party, but I thought I'd share some of my experiences with SSIS.

When I first looked at it, I said, "Wow, this looks powerful but complicated." A year and a half later I still feel pretty much the same way.

We have a mixed environment with SQL Server and Oracle databases. The bulk of our need for SSIS is for loading files into either SQL Server or Oracle.

My experience is that it is fine for loading into SQL Server, but radically slower than the Oracle tools for loading into Oracle.

I've also found it overly complex for our most typical scenario: load a file, run a stored procedure.
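
For what it's worth, that whole scenario boils down to something like the following two T-SQL statements (the file, table, and procedure names here are made up), which is part of why the SSIS package feels heavyweight to me:

    -- Load the file into a staging table, then run the proc.
    BULK INSERT dbo.DailyStaging
    FROM 'C:\feeds\daily_extract.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

    EXEC dbo.usp_ProcessDailyLoad;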

I find the GUI frustrating and limiting, but I get the sense that it was designed for non-programmers to allow them to achieve programmatic work without actually programming.

It is possible to use the SSIS API from .Net code and bypass the GUI. I have mixed feelings about that, since it then limits maintenance to a .Net programmer. That leaves me in a weird, contradictory position on using SSIS.

It is obvious that Microsoft has put lots of time and thought into this tool, and I think that it can be useful, but I don't think it is quite there yet.

I tried using SSIS for a relatively simple automation process a month or so ago. There were some unusual wrinkles involving FTPing files around and loading them into a clustered database where I cannot run SSIS, then executing a stored procedure. That meant I had to run SSIS on a different SQL Server instance.

Long story short, I spent two weeks trying to resolve things like file names that change daily and executing a stored procedure on a machine OTHER than the one running SSIS. In the end I couldn't quite get it to work. I dropped back to a .Net program called by the system scheduler and had the whole thing working in 2 hours...

So far, everyone I have talked to who advocated SSIS has only looked at it from the 100,000-foot level. At that level you just see the powerful potential. The problem comes (for me) when you get down to ground level and try to work with the messy details...

Does anyone have any suggestions for good books or other materials that could help me bridge the gap?

Thanks,
jeffa

To all,

Thank you so much for your comments. I've had several discussions with Deniz and a few other users here.

Let me clarify:
1. It is possible to connect to just about anything using SSIS, but not with the out-of-the-box solution; you need a 3rd party like ETI, whose custom connectors are high-speed and extremely flexible. They can also reach Teradata, AS/400, and other legacy systems (on a native-access basis).
2. Regarding performance: I told Deniz that I was trying to replicate the "average" Microsoft customer, who will not have the high-powered Unisys machine he mentions. He conceded that this is true. In order to compete in the market space that Microsoft likes to "live in" and sell to, they MUST improve performance on standard-power boxes (dual-core, dual quad-core, etc.).
3. Regarding the details below the 10,000-foot level on SSIS: I work with SSIS hands-on, and yes, at first it can be complicated to learn and navigate (such as when to use a regular connector versus a bulk-load connector, or how to manage changes to the structure, and so on). But once you figure out how to get around the technical environment, it becomes very easy to create new components quickly.

On the other hand, there's something to be said for coding versus not coding. Hand-coding of highly specialized routines *MAY* be OK, but remember: whenever we hand-code, we're back to increasing maintenance costs and difficulty in maintaining standards. In reality, less than 1% of data integration routines should be hand-coded.

I grew up an assembly-level coder on Z80/8080 chips under CP/M, and have coded in assembly, Pascal, C/C++, Perl, PHP, and Java. I've written BIOS code, compilers, linkers, memory management technology, and interrupt device drivers.

I can tell you without a doubt that it is easier and faster for me to use a GUI development environment to build data integration routines. It's also cheaper to maintain and easier to keep to standards.

That said, I don't know of any good books that will take you deeper; however, I am about to announce a class on building and maintaining SSIS. Contact me for more details.

Cheers,
Dan Linstedt

Hello Dan,

I thought it worth mentioning that we just released a control flow task called "Database Partitions for SSIS". It is an elegant solution to the problem you mentioned in your article about easier partitioning being difficult to set up and maintain.

Regards
Ivan
