Blog: William McKnight

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and to connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope the entries can be of some modest benefit to that goal. Please share your thoughts and input on these topics.

About the author

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist, a frequent best-practices judge, has authored hundreds of articles and white papers, and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.


Recently in Design Category

Here's another "what I've learned." When we have a deadline or a business-critical issue, we tend to underemphasize the business requirements, due diligence, experience and, sometimes, the spending that will be necessary.

One area, just to pick something out, is Dun & Bradstreet (DNB) numbers. DNB is practically ubiquitous among my clientele, yet matching to it is always harder than initially believed. Another truth about DNB is that not every company has a DNB number. Plan for it!

I have a collection of data warehousing “famous last words.” Here’s one: “We’ll just link our customers to DNB tomorrow and it’ll be ready for you on Thursday.”


Posted September 11, 2007 7:22 AM

Many systems have struggled to add maximum value to their organizations as an indirect result of lacking classical OSS architectural standards, such as:
• Reasonably scaled and separate test, development, and QA platforms
• Change management
• Business continuance and disaster recovery planning
• Robust backup management
• Physical isolation or thoughtful co-location based on shared characteristics
• User care measures such as support, training and built-in descriptive information
• Proactive performance planning
• Business roles and responsibilities
• Ongoing business tuning of program direction

Such systems therefore may not seem stable enough for business users to depend on or for continued development.

However, there are several aspects of technology that enable the progression of these standards:

• The continued reduction in hardware costs, especially in disk technology, makes it economically possible to have separate development, QA, and recovery systems.
• Improvements in database technology now make it possible to insert massive amounts of data while the system is being accessed, enabling occasional (near) real-time data feeds.
• Sophisticated ETL tools and “hub and spoke” systems and standards facilitate the establishment and operation of systems that move data from source systems to targets, and vice versa.

The main inhibitor, however, will probably continue to be the perception of such systems as strictly technical in nature, when in reality their fit into the business situation is just as important as the technical aspects.


Posted July 27, 2007 8:37 AM

One of the most difficult things to do in data warehousing is to engage a new source system. Learning about the fields the system has to offer the data warehouse, when they are populated, how “clean” the fields are and when you can get after them with your extract job can be daunting. Then, after going through the process of attaching the extract jobs, scheduling and beginning the cycles, you would want to be set for a while.

Not so fast. Usually one day to two weeks after putting a data warehouse – any iteration – into production (or prototype), users who previously communicated requirements in abstract terms begin seeing the results and requesting changes. New fields and new transformations are not unheard of at this point.

Although data warehousing is very dynamic, it is possible for a practitioner to think beyond the initial, spoken requirements and “prime the pump” by bringing additional fields into the ETL process. This concept, known as “triage,” works very well if you have a staging area where the initial load from source is “dropped” prior to the majority of the transformations.

With triage and a staging area, the staging area can contain many more fields than are moved forward to the actual data warehouse. Then, if a new field is needed in the warehouse, there is no effect on the source extracts (and no accompanying disruption of source operations or negotiation with the source system team).
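To make the idea concrete, here is a minimal T-SQL sketch. The table and column names (Stg_Customer, Customer_D and their fields) are hypothetical and only illustrate the pattern: the staging table lands more columns than the warehouse uses today, so a newly requested field can later be promoted without touching the source extract.

-- Hypothetical triage staging table: lands more columns than the warehouse needs today
create table Stg_Customer (
Customer_Id int not null,
Customer_Name varchar(100) null,
Customer_Segment varchar(30) null, -- required by the warehouse today
Credit_Rating varchar(10) null, -- "triaged": landed but not yet moved forward
Region_Code char(4) null, -- "triaged": landed but not yet moved forward
Load_Dt datetime not null
)

-- Today's load moves only the requested fields forward
insert into Customer_D (Customer_Id, Customer_Name, Customer_Segment)
select Customer_Id, Customer_Name, Customer_Segment
from Stg_Customer

-- When users later ask for Credit_Rating, only this statement changes; the source
-- extract and the staging load are untouched (assumes Credit_Rating has been added to Customer_D)
update d
set d.Credit_Rating = s.Credit_Rating
from Customer_D d
join Stg_Customer s on s.Customer_Id = d.Customer_Id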

But wait, you say. "What about the historical data that usually accompanies such new data sourcing?"

The concept of the persistent staging area is to keep all data in the staging area, from both a “triaged” (see yesterday’s tip) and a historical perspective. That way, when requirements change post-production (again, see yesterday’s tip), you not only have the ETL “primed,” you also have the historical data primed in the persistent staging area, ready to be moved forward to the warehouse.

Persistent staging areas almost always require a separate DBMS instance from the data warehouse DBMS due to the volume that will accumulate in them.

Since historical data is also kept in the warehouse, what distinguishes the persistent staging area is that it captures triaged data, ready for historical loading of newly required fields post-implementation. It will be bigger than the warehouse itself.
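Continuing the hypothetical Stg_Customer example above, a persistent staging table keeps every load (here keyed by Load_Dt) rather than being truncated, so backfilling history for a newly promoted field is a single set-based statement against data already on hand. The table names and cutoff date are illustrative only.

-- Backfill history for a newly promoted column from the persistent staging area;
-- Stg_Customer_Persistent retains every prior load rather than being truncated
insert into Customer_Credit_History (Customer_Id, Credit_Rating, Load_Dt)
select Customer_Id, Credit_Rating, Load_Dt
from Stg_Customer_Persistent
where Load_Dt < '2007-07-01' -- loads captured before the new requirement surfaced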

Although I usually do not use this technique in my data warehouses, it would be very applicable if there were a high likelihood that requirements would be very dynamic after production and disk cost were not an issue.


Posted July 4, 2007 10:49 AM

Here is some code I’ve been dragging around in my toolbox for many years. Some data warehouses populated their date dimension only through 2007 or 2008 because that seemed a long way off when the data warehouse project began. Well, now it’s time to add a few more years of data to the date dimension. However, with the auto-population capabilities for date dimensions in current releases of database management systems, manually populating the date dimension has become a lost art. So, whether it’s useful or just a stroll into the past, this code populates Year, Quarter, Month, Week, Day and Date dimensions.

The tables and their columns are as follows (SK = surrogate key):

Year – Year_D
Year_Key - SK
The_Year - Int

Quarter – Quarter_D
Quarter_Key - SK
The_Quarter - Int

Month – Month_D
Month_Key - SK
Quarter_Key - Int [Ref: Quarter_D.Quarter_key]
The_Month - Int

Week – Week_D
Week_Key - SK
The_Week - Int

Day – Day_D
Day_Key - SK
The_Day - Int

Date – Date_D
Date_Key - SK
Year_Key - Int [Ref: Year_D.Year_Key]
Month_Key - Int [Ref: Month_D.Month_Key]
Week_Key - Int [Ref: Week_D.Week_Key]
Day_Key - Int [Ref: Day_D.Day_Key]
The_Date - Date
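The procedure below assumes these tables already exist. A minimal set of SQL Server DDL statements matching the layout above might look like the following; the identity-based primary keys are my assumption for the "SK" (surrogate key) columns, and The_Date is declared datetime to match the procedure's variables.

-- Assumed DDL for the dimension tables (surrogate keys implemented as identity columns)
create table Year_D (Year_Key int identity(1,1) primary key, The_Year int)
create table Quarter_D (Quarter_Key int identity(1,1) primary key, The_Quarter int)
create table Month_D (Month_Key int identity(1,1) primary key,
Quarter_Key int references Quarter_D(Quarter_Key), The_Month int)
create table Week_D (Week_Key int identity(1,1) primary key, The_Week int)
create table Day_D (Day_Key int identity(1,1) primary key, The_Day int)
create table Date_D (Date_Key int identity(1,1) primary key,
Year_Key int references Year_D(Year_Key),
Month_Key int references Month_D(Month_Key),
Week_Key int references Week_D(Week_Key),
Day_Key int references Day_D(Day_Key),
The_Date datetime)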

Code:

create procedure Load_Date_Dimension_Sp @year_start int, @year_end int
as
begin

-- Populate the year dimension for the requested range
declare @yr int
set @yr = @year_start
while (@yr <= @year_end)
begin
insert into year_d (the_year) values (@yr)
set @yr = @yr + 1
end

-- Populate the four quarters
declare @qt int
set @qt = 1
while (@qt <= 4)
begin
insert into quarter_d (the_quarter) values (@qt)
set @qt = @qt + 1
end

-- Populate the twelve months
declare @mo int
set @mo = 1
while (@mo <= 12)
begin
insert into month_d (the_month) values (@mo)
set @mo = @mo + 1
end

-- Assign each month to its quarter
update month_d
set quarter_key = (select quarter_key from quarter_d where the_quarter = 1)
where the_month in (1,2,3)

update month_d
set quarter_key = (select quarter_key from quarter_d where the_quarter = 2)
where the_month in (4,5,6)

update month_d
set quarter_key = (select quarter_key from quarter_d where the_quarter = 3)
where the_month in (7,8,9)

update month_d
set quarter_key = (select quarter_key from quarter_d where the_quarter = 4)
where the_month in (10,11,12)

-- Populate the weeks; datepart(wk, ...) can return 53 late in the year, so load 53 weeks
declare @wk int
set @wk = 1
while (@wk <= 53)
begin
insert into week_d (the_week) values (@wk)
set @wk = @wk + 1
end

-- Populate the days of the month (1 through 31)
declare @dt1 int
set @dt1 = 1
while (@dt1 <= 31)
begin
insert into day_d (the_day) values (@dt1)
set @dt1 = @dt1 + 1
end

-- Walk every date in the range and insert it with its dimension keys
declare @dt datetime
declare @dt_end datetime
declare @day_k int
declare @week_k int
declare @month_k int
declare @year_k int

set @dt_end = '12/31/' + cast(@year_end as varchar(4))
set @dt = '01/01/' + cast(@year_start as varchar(4))
while (@dt <= @dt_end)
begin
select @year_k = year_key from year_d where the_year = year(@dt)
select @month_k = month_key from month_d where the_month = month(@dt)
select @day_k = day_key from day_d where the_day = day(@dt)
select @week_k = week_key from week_d where the_week = datepart(wk, @dt)
insert into date_d (the_date, day_key, week_key, month_key, year_key)
values (@dt, @day_k, @week_k, @month_k, @year_k)
set @dt = @dt + 1 -- adding 1 to a datetime advances one day
end

end

The procedure can be executed as:

Exec Load_Date_Dimension_Sp 1985, 2030

…which will populate the date dimension for the years 1985 through 2030.

All syntax is SQL Server T-SQL.

Technorati tags: Data warehouse, Dimensional Model, Date Dimension


Posted June 15, 2007 8:51 AM

I’ve been putting some thought into data warehouse design issues in my recent three (and one more to go) DM Review columns, but today I was thinking about what inhibits good data modeling. In other words, what practices lead to poor data warehouse data modeling?

The main ones really all have to do with the same thing – not doing a logical model before the physical model. The major reasons for this are:


• Thinking physical too early - The hardest part of modeling is getting the grain of each entity and the relationships right. It is difficult to tell the grain and “real” relationships (versus those added for performance) when physical constructs are thrown in too early.

• Thinking usage over relationships too early - Sure, there are those target queries you will test the model with. But if you model to them, you can easily make it very difficult for the model to accommodate future requirements – those sitting below the peak of the proverbial iceberg.

• Thinking denormalization before the normalization is done - Anticipating downstream query performance shortcomings and addressing them too early in the modeling process is a reason models fail. By the way, I'm using normalization here generically to mean normalizing or dimensional modeling - whichever you are doing.

As a rule, you should try to stay close to the logical model during implementation. Physical modeling should be iterative and based on performance testing. It’s not something you necessarily do with your modeling tool. If you “over-spec” the machine a little bit, arm your toolbelt with the ability to index, cluster and partition (none of which strays from the logical model), build cubes and summaries, and push complexities into a mart or at least outside the core model, what you gain in model understandability and manageability usually outweighs the added complexity of numerous rule exceptions, like denormalizations, in the model.
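As a quick, hedged illustration of physical tuning that leaves the logical model untouched, here is a T-SQL sketch; the fact table, columns and summary view (Sales_F, its keys, Sales_By_Month_V) are hypothetical, not from any specific design.

-- Physical-only tuning: none of this changes the logical model (names are illustrative)

-- An index to serve a known access path
create nonclustered index IX_Sales_F_Date on dbo.Sales_F (Date_Key) include (Sales_Amt)
go

-- A summary kept outside the core model, here as an indexed view
-- (assumes Sales_Amt is declared not null, a requirement for SUM in an indexed view)
create view dbo.Sales_By_Month_V with schemabinding
as
select d.Month_Key, sum(f.Sales_Amt) as Sales_Amt, count_big(*) as Row_Cnt
from dbo.Sales_F f
join dbo.Date_D d on d.Date_Key = f.Date_Key
group by d.Month_Key
go

create unique clustered index IX_Sales_By_Month_V on dbo.Sales_By_Month_V (Month_Key)
go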

Technorati tags: Data Modeling, Data Warehouse, Normalization


Posted April 13, 2007 2:41 PM
