Blog: Dan E. Linstedt Subscribe to this blog's RSS feed!

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author >

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including: IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata.  He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology, and the Data Vault Data modeling architecture. He has built expert training courses, and trained hundreds of industry professionals, and is the voice of Bill Inmons' Blog on http://www.b-eye-network.com/blogs/linstedt/.

If you're like the rest of the world these days you've got an ever growing data set, and at the same time an ever shrinking processing window. This is not something you want to treat lightly. In most cases, you are also experiencing severe performance problems and don't know how to deal with it, or haven't been able to solve these issues. Well, there are ways and means in which performance can be improved - I've been teaching, and consulting on performance of VLDW integration systems for 10+ years, there are techniques and manners in which your performance can improve. The catch? You have to be willing to swallow the blue pill (from the movie: The Matrix). Let's just see how far done the rabbit hole goes...

So you're caught in performance issues, be it ETL, ELT, EII, EAI, or worse: RDBMS system. You have big data to move or maybe lots of data in very short time windows. Maybe you have huge data to bounce it against (look up, match, consolidate, quality cleanse, etc). Maybe you're pulling together 200M rows for one feed to a dimension. You're source is 20 to 40 tables, and you've got a window to worry about. How do you handle this situation and others like it? What do you look for?

There are many different facets to examine in performance and tuning of your systems and your architecture. The major facets that I examine are:
* Hardware: RAM, CPU, Network, Disk Speed and number, I/O Throughput
* Software Concepts: Co-location, over the network, block sizing, parallelism, partitioning abilities, caching, data sharing, piping, and data layout.
* Software: ETL, EAI, ELT, EII, and the big one: RDBMS.

Of course, we also consider where the data is coming from: ERP, CRM, BPM, Financials, and other applications that lay up-stream, and whether or not the systems are delivering data in real-time, batch, or both.

If there's one thing I rely on in many of these situations it's the numbers. The throughput numbers usually tell the story as to where the problems are originating, and then through orders of scale - what we can do about it. The numbers tell me how much, how fast, and give me the ability to re-factor the architecture where the pain points exist.

In plain English please...
What this means is: I look for the following (for example): Number of instances of an RDBMS on a single machine, balanced against the number of CPU's, amount of RAM, speed of Disk, and number of disk controllers. "If you ain't got that "I/O" you aint got a thing...." (modified from Duke Ellington, sorry Duke). Now the problem is: most people see I/O and immediately think "disk" - well that's only one area. I/O stands for Input / Output, no-where does it have the word "disk". I/O represents the balanced speed of all devices and systems that we are trying to measure and improve. When we treat I/O as a constant, the problems become clear very fast.

I go through more than 100+ points in an architecture to pinpoint the top 25 that are causes to the performance problems in a system (too numerous to mention here), but I'll give you a taste of what I usually look at in the RDBMS side of the house. By the way, the customers who go through my performance and tuning assessment have seen on average, anywhere between 400% and 4000% performance improvements - taking run times from 48 hours down to 8 hours, from 8 hours to 2 hours, from 6 hours to 45 minutes, from 56 hours to just under 12 hours and so on. But it means the customers were willing to take my suggestions to heart and implement them.

In the RDBMS world here are some things I measure by:
1. Number of instances on a single box
2. Amount of parallelism setup for the engine
3. Number of Partitions setup for the largest of data sets
4. Data Layout of different tables across disk
5. Number of Disk Controllers, type of controllers, and database layout across the I/O channels
6. Amount of RAM cache available.
7. Percentage of Index kept in RAM from call to call.
8. Number of read-ahead buffers.
9. Size of blocks within the database
10. Cache hit to cache miss ratios
11. Positioning of the Log and Temp areas.
12. Amount of swap space on the machine where the RDBMS resides
13. Amount of RAM the RDBMS is limited to
14. Number of CPU's the RDBMS is limited to.

And so on... If you have an interest in a particular area, please post a question or comment, and I'll try to blog on that going forward. In the mean-time, please be aware that we offer these assessments with fantastic results.

Thank-you,
Dan Linstedt
CTO, Myers-Holum, Inc
http://www.MyersHolum.com


Posted November 28, 2006 6:23 AM
Permalink | 1 Comment |

1 Comment

Hi Dan,
Yours was a interesting reading.I am an ETL Developer myself, and I found your blog very informative and fascinating, I would like to know more about designing data ware house in ETL Architect perspective

Leave a comment

    
Search this blog
Categories ›
Archives ›
Recent Entries ›