Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for Master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

I recently had the chance to work with the beta program of SQLServer2005 and Integration Services. Let me tell you, it was amazing. There are several highlights in the technology that I'd like to share, and this entry will walk through them. There are only a couple of things I'd like to see improved. My basic thought: this will give customers who are tight on budget, or not moving huge volumes per load cycle, an opportunity to really begin building out a true data warehousing/BI solution.

SQLServer2005:

* Database/Table partitioning. At long last, SQLServer has implemented range partitioning. This will be a HUGE boost to performance on larger data sets, as long as the partitions are architected properly across the existing disk sets. If you're not familiar with partitioning, now's your chance to read up on it - do all the reading you can before you upgrade. Partitioning is not something you simply want to "turn on". The payoff: SQL queries that use clustered indexes can finally avoid full table scans when the partitions come into play. Why? See the next feature, and the T-SQL sketch at the end of this list. Other types of partitioning (such as hash partitioning and sub-partitions) are said to be in the works for a future release.

* Query Parallelism. Query parallelism has been implemented to take advantage of the database and table partitioning. Queries can and will be broken into their respective components to take advantage of the range partitions, and they also benefit from the other performance improvements in SQLServer2005 (again, see the sketch below).

* Performance. SQLServer2005 has extended its performance by optimizing around 8k block sizes, hooking into the Windows operating system, and sharing RAM caches with I/O buffers. They've streamlined the I/O path from SQLServer2005 to the disk and, as I understand it, bypassed the page file to bring data into RAM (although the page file is still used for locking and swapping). The performance of SQLServer2005 is much higher than 2000, and it can process millions of rows per minute, even without clustering.

* Clustering. There are new mechanisms for clustering SQLServer2005, and as I understand it, it can take the shape of multiple clusters across a network of nodes (SMP under an MPP look and feel). This brings new meaning to low-cost, high-power computing. It is still challenging to set up with SQLServer2005, but it will get easier as Microsoft continues to evolve the product.
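
To make the partitioning and parallelism points concrete, here is a minimal T-SQL sketch of what range partitioning looks like in SQLServer2005. The fact table, filegroup names, and date boundaries below are hypothetical examples made up for illustration, not taken from any particular project.

-- Hypothetical example: range-partition a fact table by month.
-- Assumes filegroups FG2006M01..FG2006M04 have already been added to the database.

-- 1. Partition function: defines the boundary values (RANGE RIGHT means each
--    boundary value belongs to the partition on its right).
CREATE PARTITION FUNCTION pfSalesByMonth (datetime)
AS RANGE RIGHT FOR VALUES ('20060201', '20060301', '20060401');

-- 2. Partition scheme: maps each partition to a filegroup. Spread these across
--    separate disk sets to get the performance benefit mentioned above.
CREATE PARTITION SCHEME psSalesByMonth
AS PARTITION pfSalesByMonth
TO (FG2006M01, FG2006M02, FG2006M03, FG2006M04);

-- 3. Create the table and its clustered index on the partition scheme.
CREATE TABLE dbo.FactSales
(
    SaleDate   datetime NOT NULL,
    CustomerID int      NOT NULL,
    Amount     money    NOT NULL
) ON psSalesByMonth (SaleDate);

CREATE CLUSTERED INDEX ix_FactSales_SaleDate
    ON dbo.FactSales (SaleDate)
    ON psSalesByMonth (SaleDate);

-- 4. A query that filters on the partitioning column only touches the relevant
--    partition(s) instead of scanning the whole table, and the optimizer can
--    work the qualifying partitions in parallel (MAXDOP caps the degree).
SELECT CustomerID, SUM(Amount) AS TotalAmount
FROM dbo.FactSales
WHERE SaleDate >= '20060201' AND SaleDate < '20060301'
GROUP BY CustomerID
OPTION (MAXDOP 4);

Spread those filegroups across separate physical disk sets and the engine can skip the partitions the WHERE clause rules out, then work the remaining ones in parallel.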

Integration Services:
* Fast, fast, fast is all I can say. I saw a Windows platform process 2 million rows in 2 minutes from a flat file into SQLServer2005 through Integration Services, including a data mining transformation that provided name and address cleansing (and learned as it went) - it was based on a neural network you could tune. This was done on a single-CPU laptop running all of the software. The engine has been tuned to run flat file and SQLServer connections in parallel. Finally, direct connections gain parallel utilization and catch up with other DBMS vendors.

* Tons of transformations. Lots of transformations come out of the box - as I mentioned a minute ago, one of them is a neural network. The engine also exposes parallel processing properties for splitting data (in parallel) down multiple transformation paths. There is also the notion of an error output and error target, and you can control the rules for what gets routed there.

* Workflow-style transformations - all transformations are workflow style, meaning they are "runnable designs". You can watch the row count increase as the data passes through the transformations along the way.

Improvements I'd like to see:
* Metadata. I had heard that they don't have a real "metadata repository"; apparently they are working something out with Meta Integration (you'll hear about them shortly). They need to get into the OPEN metadata game.

* Import/export of transformation logic via XML. There is apparently no import/export of the transformation designs, so there is currently no capability for sharing transformation logic across servers.

* Reusable transformation "workflows" - they do not have the concept of reusability down yet. I've heard they are working on this.

* Biggest one: connectivity options. Today, if you want Oracle, DB2 UDB, DB2 AS/400, Teradata, Sybase, MySQL, or Informix data, you have three options: use bridge software provided by the other vendor (which is slow); use ODBC and OLE DB drivers from Microsoft (some are fast, some are slow - see the sketch below); or write your own native connectivity using their SDK.
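
As a rough illustration of the middle option (Microsoft's ODBC/OLE DB drivers), here is a hedged T-SQL sketch of reaching Oracle data through a linked server. The linked server name, TNS alias, login, and remote table are all hypothetical placeholders.

-- Hypothetical example: reach Oracle from SQLServer2005 via a linked server
-- using the Microsoft OLE DB Provider for Oracle. Names and credentials below
-- are placeholders.
EXEC sp_addlinkedserver
    @server     = 'ORA_SRC',          -- local name for the linked server
    @srvproduct = 'Oracle',
    @provider   = 'MSDAORA',          -- Microsoft OLE DB Provider for Oracle
    @datasrc    = 'ORCL_TNS_ALIAS';   -- TNS alias from tnsnames.ora

EXEC sp_addlinkedsrvlogin
    @rmtsrvname  = 'ORA_SRC',
    @useself     = 'false',
    @rmtuser     = 'etl_user',
    @rmtpassword = 'etl_password';

-- Pass-through query: the statement inside OPENQUERY runs on Oracle, and only
-- the result set comes back over the wire.
SELECT *
FROM OPENQUERY(ORA_SRC, 'SELECT customer_id, customer_name FROM customers');

This is fine for modest volumes, but as noted above, the generic providers can be slow on large result sets - for serious volume you still want the source vendor's native or bulk interfaces.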

Bottom line: if you're a Microsoft-only shop, SQLServer2005 and Integration Services are a HUGE boost to productivity over DTS and SQLServer2000 - well worth the upgrade cost. If you're a small shop that deals mostly with flat-file transfers from other databases, it will help as well. If you're into volume - and native connectivity - I suggest looking elsewhere and letting Microsoft grow into this space.

More to come, did I miss something? Post a reply!

Cheers,
Dan Linstedt


Posted April 26, 2006 5:45 AM

1 Comment

Glad you appreciate the product, Dan - we're certainly excited to have Integration Services out there, shaking up the market.

Your comments are, as always, insightful.

We do have a rather good metadata application - interestingly we made it a shared source download so that users can modify it if they like to - or need to. See http://www.microsoft.com/downloads/details.aspx?FamilyID=11daa4d1-196d-4f2a-b18f-891579c364f4&DisplayLang=en

It would be great to see what you think of this approach.

There is also a whitepaper here: http://www.microsoft.com/downloads/details.aspx?FamilyID=182bd330-0189-450c-a2fe-df5c132d9da9&DisplayLang=en
