Blog: William McKnightSeptember 11, 2007Famous Last WordsHere's another "what I've learned." When we have a deadline or a business-critical issue, we can tend to underemphasize the business requirements, due diligence, experience and sometimes some spending that is going to be necessary. One area, just to pick something out, is Dun&Bradstreet numbers. DNB is practically ubiquitous in my clientele yet it’s always harder to match than is initially believed. Another truth about DNB is that not every company has a DNB number. Plan for it! I have a collection of data warehousing “famous last words.” Here’s one: “We’ll just link our customers to DNB tomorrow and it’ll be ready for you on Thursday.” Continue reading "Famous Last Words" » July 27, 2007Information Architectural StandardsMany systems have struggled to add maximum value to their organizations as an indirect result of lacking classical OSS architectural standards, such as: They therefore may not seem stable enough for business user dependence and continued development. However, there are several aspects of technology that enable the progression of these standards: Continue reading "Information Architectural Standards" » July 4, 2007Triaging Source Data and a Persistent Staging AreaOne of the most difficult things to do in data warehousing is to engage a new source system. Learning about the fields the system has to offer the data warehouse, when they are populated, how “clean” the fields are and when you can get after them with your extract job can be daunting. Then, after going through the process of attaching the extract jobs, scheduling and beginning the cycles, you would want to be set for a while. Not so fast. Usually 1 day to 2 weeks after putting a data warehouse – any iteration – into production (or prototype), users who previously communicated requirements in abstract terms are now seeing the results and requiring changes. New fields and new transformations are not unheard of at this point. Although data warehousing is very dynamic, it is possible for a practitioner to think beyond initial, spoken requirements and “prime the pump” by bringing additional fields into the ETL process. This concept, known as “triage” works very well if you have a staging area where initial loading from source is “dropped” prior to the majority of the transformations. With triage and a staging area, the staging area can contain many more fields than are moved forward to the actual data warehouse. Then, if a new field is needed in the warehouse, there is no effect on the source extracts (and no accompanying disruption of source operation and negotiation with the source system team). But wait, you say. "What about the historical data that usually accompanies such new data sourcing?" Continue reading "Triaging Source Data and a Persistent Staging Area" » June 15, 2007Code to populate a date dimensionHere is some code I’ve been dragging around in my toolbox for many years. Some data warehouses populated their date dimension through only 2007 or 2008 because that seemed to be a long time from when the data warehouse project began. Well, now it’s time to add on a few more years of data to the date dimension. However, with the auto-population capabilities with current releases of database management systems for date dimensions, manually populating the date dimension has become a lost art. So, whether it’s useful or a stroll into the past, this code populates Year, Quarter, Month, Week, Day and Date dimensions. The table names are as follows: Year – Year_D Quarter – Quarter_D Month – Month_D Week – Week_D Day – Day_D Date – Date_D Code: Continue reading "Code to populate a date dimension" » April 13, 2007Logical Modeling and Physical ModelingI’ve been putting some thought into data warehouse design issues in my recent three (and one more to go) DM Review column, but today I was thinking about what inhibits good data modeling. In other words, what are the practices leading to poor data warehouse data modeling? The main ones really all have to do with the same thing – not doing a logical model before the physical model. The major reasons for this are: Continue reading "Logical Modeling and Physical Modeling" » April 9, 2006Her Cheatin' Heart Made A Null Out Of MeYou can learn a lot about a data warehouse program by analyzing how it uses nulls. As most of you know, null means "unknown" or "irrelevant". The nullability, or ability to take on the null value, is a dimension of every column in the data warehouse - or any database for that matter. Most columns should not be nullable, but a few should be. Nulls do not equal zero or spaces. They actually have an entire different meaning than either of those. There may be actual values like "not supplied" or "invalid value" that should be used in place of null (or zeroes or spaces). These descriptive terms are actually more explanatory about the field than null. So nulls get overused sometimes. But mostly nulls are underused, taking a backseat to zeros and spaces. Nulls don't equal other nulls. And the manner in which nulls participate in aggregate functions like SUM, AVG and COUNT is very logical, but can be tricky. You also can't join on nulls. These basic facts discourage many from using nulls in data warehouses at all. But a little investment in knowledge of nulls can go a long way and afford your program the power that nulls bring. Thinking about nulls makes me think about... country music of course. Here are some of my favorite songs when looking at the effective use of nulls in data warehousing: Stand by Your Null December 23, 2005Archiving strategies and data life cycle concerns just became very importantE-week reports that the European Union has passed a contentious data-retention directive that requires all telephone and Internet traffic to be logged and stored for between six months and two years in order to help combat organized crime and terrorism. Data to be retained include both incoming and outgoing phone numbers, how long calls last, and the location of calls, for both successful calls and those that get dropped. Also covered are IP addresses for SMS and Internet activity, as well as login and logoff times.
November 15, 2005Focus: Disaster RecoveryI remember the day, not many years ago, when data warehouse programs would annually undergo a disaster recovery plan test. This usually involved a weekend and usually involved some important revelations in terms of readiness! I've noticed in the past year that those tests are seldom done anymore. Not being a server expert, I assumed that there were some important new built-in capabilities to servers or more failover environments in place that removed the need for the exercise. So, I began to make it a point to find out more and, while surely the servers have improved in this area and there are more failover arrangements in place, the primary reason data warehouse disaster recovery plans aren't done as much seems to be that the exercise is out-prioritized. This is not a great reason. Of course, averting a disaster is always a tough justification in light of the many clearly progressive things we like to do for our data warehouse environments. But consider these simple things that form the basis of a sound disaster recovery plan, as given to me by a systems expert... Continue reading "Focus: Disaster Recovery" » November 13, 2005Source System impact on DW/BI data modelToday, I'm thinking about building the DW/BI data model and the various approaches builders will take on the relationship to the source system(s). This is one of about 35 different aspects of a DW/BI methodology that are often effected in a shop in de-facto fashion, and often suboptimally. With this entry, I'd like to point out the importance, not only of this decision, but of making all these decisions in a heads-up fashion. Theoretically, the source system(s) should have little impact on the DW/BI data model. However, let’s take a look at real world data warehouse model building strategies. Data warehouse data models are built with one of 3 strategies in mind... Continue reading "Source System impact on DW/BI data model" » |