Blog: Dan E. Linstedt« July 2007 | Main | September 2007 » August 31, 2007Necessary Shifts in the IndustryFor too long the industry has preached: load quality data into your warehouse, cleanse the data, manipulate it, and then load it to the warehouse. The mantra has been "never release bad data to the end users." There are hundreds of articles written about this, and probably quite a few groups carrying this mantra forward. But I have to ask, where did this mantra come from? How on earth did it get "written in stone?" What really, is the true value here? When we implement this kind of paradigm do we RISK getting the warehouse "wrong?" Are we integrating away problems which are causing the business to hemorrhage money? I would argue that this paradigm is partially wrong, that a data warehouse should be a statement of integrated fact, NOT a "single version of the truth." Why? Because there are diamonds in the rough my friend, what I mean is: there is gold in exposing to the business the GAPS between what the source systems say the data IS, and what the Business BELIEVES (in how it's operating and collecting information). These gaps lead to money loss, sometimes significant money loss and inefficient business processes. Unless we (I.T.) can expose the gaps between "what-is" and "what-they-truly-think" we cannot begin to help them. We must learn to expose the good, bad, and ugly information in the system, bounce this against the business processes and see where the business is broken. This is one of the true jobs of a good data warehouse. Armed with this information, businesses can begin to investigate why it's broken, or if it's even that important to fix. They can begin taking ownership and accountability for their own "bad-data" and "bad systems processes" that collect that information. In one case, upon exposing "bad-data", the business was able to find and fix a mis-billing calculation that had been hidden in a financial billing report (operational report) for over 15 years. The business always "rounded away" the error because they couldn't find it. Now, if we (as a data warehousing / BI team) had "fixed" or "cleansed" the data going in to the warehouse we 1) never would have passed the financial audit when the warehouse was blamed for being wrong, and 2) never would have been able to expose the error to the business in the first place. Which means the business would have demanded that the same "rounding error" be coded into the data warehousing report.
This is just flat wrong. Now, am I saying there isn't value in cleansing, and quality checking the data? No, quite the contrary - I'm saying that in integrating the data going in to the EDW, it should be RAW grain, no changes to the data set, so that the errors and patterns of errors can be seen. I'm suggesting that post-warehouse rather than pre-warehouse is a better place for doing quality and cleansing, in other words: put a filtered lense on the warehouse data on the _way_ to the data marts, not on the way in to the EDW. Then construct something called an ERROR MART where "bad data" (those without keys, those that break "today’s" business rules, etc...) can be funnelled. This means a fundamental shift in the way we look at data warehousing as a practice. I've been building EDW's this way for 12+ years, with tremendous success.. many (if not all) of these EDW's are still around today, growing, and passing compliance audits. What I'm saying is the paradigm needs to change, it just so happens that DW2.0 offers a rare opportunity to execute the change going forward, without a major impact to what's already in place. I'm also suggesting that the new paradigm be built going forward whether or not you're engaged in a DW2.0 build-out. Remove the complexities of executing business logic "up-stream" (on the way IN to the warehouse), move these business computations and complex calculations "down-stream" to on the way OUT of the data warehouse, on the way to true data marts. This is the proper way to build and scale an enterprise data warehousing vision. Now, those of you who have built a "staging area with history at a raw level grain" have the basics of this component, in fact it is an EDW whether you like it or not, even though it goes from "staging" into "federated data marts." However, if you find yourself changing, re-rolling, or re-stating the historical data every time the business has a fundamental change, then you have compromised the compliance and audit ability, and haven't executed the paradigm shift I'm discussing here. If you are re-stating, re-rolling, or changing historical data to represent business changes, then you may be experiencing the pain of growth, high costs and huge impacts to business change, and basically a "melt-down" of the federated star schema approach (as an EDW only, not as a data mart delivery solution). If changing a conformed dimension to meet business needs has a HUGE impact list and a high cost, then you have issues around business logic "up-stream" that can be resolved by changing the paradigm. BUT you need a good data model architecture in order to implement the paradigm I'm discussing above, you can use a star-schema data model (IF there are hundreds of small dimensions, and IF the fact tables are lowest level of grain, and IF the data is NOT altered on the way in, and IF the dimensions are not conformed) - however, there is a data modeling architecture (recently endorsed by Bill Inmon) that I've spent 10 years designing (based on mathematical principles), and the last couple years releasing. The modeling architecture is called "Data Vault Data Modeling Architecture", Bill has stated that this is the "optimal choice for modeling the DW2.0". The architecture in and of itself helps to enforce the paradigm shift I'm discussing, but the approach surrounding the implementation of the architecture is where the magic lies. The Data Vault architecture is freely available, just like 3rd normal form, and star-schema. It is quite simplistic and returns to business modeling for it's foundational build-out, specifically business keys. But I digress. My points are: a) the business needs to be shown "what-is" versus "what-they-think-is", a gap analysis if you will b) the data in the EDW needs to be accountable, auditable, lowest level of grain and never "re-stated" never "re-rolled" or altered, c) I.T. and Business are running a huge risk when "cleansing and quality" are executed on the way IN to the EDW, instead of on the way OUT. To that end, data quality processing is full of interpretation. Even if a "quality engine" is used, it can change/assimilate/aggregate data in ways that are incorrect. If there's no base-data or raw data to check it against, we can lose a compliance audit (and often do). When I.T. implements a process "on-the-way-in" to the EDW that changes data, then they are subject to the business stating "what you / your algorithms produced is wrong, now change it." Then, business & I.T. go to war, because the business can't agree on how to define it, and the requirements continually change. We (I.T.) are left trying to fit a round peg in a square hole, just to make it fit - or meet the needs of the many. When a compliant and scalable EDW architecture is built (regardless of the data model), and star schemas for delivery are NOT federated together for the enterprise, then single stars can do what they do REALLY WELL: provide subject oriented answers to specific business user groups rapidly. This means we can source the EDW many times for different stars, execute the business rules to put the data into specific context, and load individual star schemas for different uses. One more note about data quality / information quality: If it is done, it should be done AT THE SOURCE SYSTEM LEVEL, where the system of record exists, this way, the original and "cleansed" data can be audited at the source, and fed into the raw warehouse at the source. These "cleansed" values should be governed by a governance process and a data management specialist. On the flip side, quality can be executed when loading data from the EDW to the star-schema, and what I recommend there is: I.T. build "error marts" for broken data, and move the data that doesn't meet "todays version of the truth" into the error marts (note: my definition of mart, is any data that is customer or user facing - I am not implying any particular modeling methods to the word MART). Hope this helps, August 14, 2007Welcome to Thought ExperimentsEvery once in a great while I have a small idea bubbling around in my head. These ideas are things that seem like interesting points of view and make themsleves known either in the twighlight hours, or just before I'm about to fall asleep. Anyhow they are interesting ideas, and I am choosing to share them as they occur. They could be anything from observations to idiosyncratic comments, some may be useful while others will (hopefully) be forgotten. I ask you now, that if you like an idea that you spur the conversation by offering a comment or two, who knows? Maybe, just maybe it will be good.... Let's engage the warp drive past the horizon. Appliances for Business Intelligence and Data WarehousingThere's a movement afoot in the appliance world. Appliances are growing up. They are getting faster, smaller, cheaper, and yes: more specialized. I had it in my mind that the appliance market would combine on a single platform, and provide common plug & play hardware interfaces, well that just doesn't seem to be the case (maybe in the future, but then again maybe not). In this entry we will explore the different classes of appliances that are available, and what they do. We will also take a look at where they may go within the next 12 to 18 months. There are classes of appliances for DW and BI these days, and there's still some debate about what "appliances" really are. But all of that aside, if we take a general definition of the "Appliance" for BI / EDW space - then it might look something like this: A basic plug and play black box with a programmable interface, and an embedded data management engine (might be a database, might be an indexing engine, or something else). In any case, it manages "data in, and data out" at high rates of speed, mostly through network traffic, and a listener of some sort. Most of these appliance boxes come as autonomous network aware components with self-contained hardware. The "magic" is usually buried somewhere in the firmware algorithms, the high speed data stores, the caching mechanisms, and the internal data placement. Some would go so far as to say: if you could predict the context of a query, and match it with the context of the data before executing the query, you'd have the world’s fastest data retrieval and data placement engine. The problem is: the context changes when the query / question changes. The other problem is: data by itself is not deterministic of context; therefore logical context groupings do not make sense in the storage patterns. This is the age-old problem of Random Access File Systems on Physical Disk (Hard drives). Some of this can be solved with RAM disk, USB Flash Drives, and so on - but I ramble. So let's see what we can find about appliances. I used a web search with the following terms to dig up information: Term: "BI Appliance" Term: "DW Appliance" And let's not forget: "Data Appliance" Ok, so what does all this mean? This brings me to the following conclusion: A data appliance simply manages the data access and retrieval, regardless of the type of data, the source or target of the data and the functionality. Within "Data Appliances" their might be specific needs (like backup / restore), high speed text / unstructured access, or database "file" access. The question here is: what then, differentiates a "Data Appliance" from something like an EMC, Hitatchi, or Fujitsu Smart SAN/NASD array? At first glance, not much. However, when we pull back the covers we begin to see some levels of specialized functions: like managing database files, or managing unstructured text documents, or specifically backup/restore as if it were a hot-swap drive within an existing RAID array. This is a hard market to compete in, producing differentiators will be critical to the success of specific vendors like Dataupia. They are already making a splash in the BI world, but they'll have to go a few steps further (which they are already investigating according to their partner’s page). The BI appliance can mean all kinds of things, but one thing (in my mind) it certainly DOESN'T mean is: "I don't need a data warehouse anymore." That couldn't be further from the truth. There are some vendors out there touting Business Intelligence without a Data Warehouse. That may be, and yes, you can get Operational Business Intelligence (a phrase coined by Claudia Imhoff and Colin White) without a Data Warehouse, but Claudia and I agree: to get the analytics, historical trends and patterns, and provide true data mining capabilities, a data warehouse must be part of the picture. So, where does that leave the BI APPLIANCE? The BI Appliance can become the Data Mart Appliance, or act as an Operational BI Appliance, possibly incorporating the physical ODS (operational data store) within it's hardware. The BI Appliance can speed up queries, understand data access patterns and focus on sharing data across web services (transactional data that is). But this is an Operational BI Appliance. What are you talking about? So what, why should I care? How do they lower the cost? Ok, so you get the picture - there's a whole lot more to this than I've discussed, but it is clear that this really is an optimal functionality. Who wants or needs to maintain "separate hardware, separate databases, separate functionality" anymore? Why not have it bundled and working together, pre-packaged and already performance tuned? What does the future look like? For now, there will be an appliance for reporting, one for databases, one for OLAP cubes, one for Web Services, one for ETL, one for Quality/Cleansing, hopefully one for Metadata and so on. The race goes on, and we will continue to see different vendors enter and leave this space. I also think this space is ripe for acquisitions and consolidations. Cheers, August 3, 2007Dynamic Nature of SystemsI've been out for a while; getting new companies started can be hard to do... But I think we're getting there. Anyhow, as everyone knows - I've got an interest in the real nature of things, how they work, how they change and adapt themselves, and how automated computing facilities can bring these systems closer and closer to the business users. Ultimately we'd like the business users to manage, and suggest change to the back-ends through uncomplicated interfaces. We have a lot of terms popping up on the internet (even just over the past two years) which I'd like to explore, and then we'll dive in to where I think the industry should be going. In particular, one term is driving me to sleepless nights: "Dynamic" It seems to start and end with the term Dynamic. Throw the phrase on to the front of existing terms (like decision support, BI, DW, web services, etc...) and we have a whole new ball game. Just run a search in an engine like http://www.SurfWax.com - which uses a neural network to really find what you might be looking for... and see the hundreds of results that now pop up as a result. I've been following some really interesting writing in ACM Communications journal, and IEEE on visualization, business tactics, and in their latest print volume, Dynamic Decision Support (caught my eye). I'll try to summarize for you what they were getting at: As you might imagine, there would be security around this entire process to keep them from blowing the larger business flows out of the water. However, they all live in this "world" and play with different levels of "process flow metadata", to help streamline and optimize the business itself. In other words, in their eyes, DSS is no longer just a reporting interface, nor is it actually developing dashboards, but an interactive process enriched with metadata, process analysis, and business workflows that manage dependencies, definitions, ontologies, and so on. The end result is getting the data out to a platform like "excel" or your favorite reporting tool - in an automated fashion. Of course they also state that this direction is several years off, there's a lot of engineering to be done to ensure the quality, consistency, and usability of such interfaces. There is also a level of business education that must take place, along side of fundamental shifts in the way we do business, but another point they make is: that business must stay nimble, and in order to compete - businesses (even large ones) must be flexible and employees who do their daily jobs are the ones armed with the knowledge on how best to change the business for the future. What does this have to do with DYNAMIC? But what does it _really_ mean? I think this is a paradigm shift, and there are different levels of getting to a Dynamic Data Warehouse, or a Dynamic Operational System, or even a Dynamic Enterprise Architecture. I think the levels are already built by standard ontological definitions of terms. For instance: Dynamic Data Sets, Dynamic Information (a bit higher level), Dynamic Hardware, and so on. There is a difference in my mind between "active" and "dynamic." Again, I would take Dynamic Data Warehouse to encompass or include all the following components: These are the general components included at an architectural level in a non-dynamic or traditional Data Warehouse. They should (in my opinion) be defined as part of a Dynamic Data Warehouse. The same can be said for the term: "Real-Time" which I'd prefer to call "Right-Time", and the term "Active". How do we define these components? Dynamic Data Warehouse - if we accept the fact that this term is a "complex system" which includes architecture, design, data, execution, software, and hardware, then we should also expect to see each of these components within the system to be "dynamic" and autonomous in their nature to change with the surrounding environment. A "Dynamic Data Warehouse" is _not_ (in my opinion) simply a Hardware Platform with a Database that accepts right-time data that might have a DSS thrown in. There's a difference between providing Dynamic Access to right-time data, and Dynamic Data Warehousing in my opinion. In my definition, the structure (metadata) must adapt, the business rules must be flexible and truly at the fingertips of the business users, those same rules must be part of the governance that helps the automated changes take place in the back-end, and so on. But if we look at a very simple example, let's see what a "Dynamic Data Model" might be: Suppose I have an SOA in my organization, and at it's heart is a common data model, surrounded by a layer of common data services. These services "listen" to the outside world for transactions. When the transaction is ready, some of these services are smart enough to check the metadata (the version of the sending service, and the definition of the transaction itself). In doing so, the service discovers a new "XML" element; a new piece of information has been added. Now, in order to be a dynamic service, it must check the common model it has - and decide "what" to do with the new data, where does it put it? How does it adapt the model? Can it adapt the model? What’s the impact of the change if it were to make one? Does the service simply "stop" in mid-stream and fire an alert? Well, the easiest answer is it takes the XML as a blob, and sticks it into an "XML aware" database object, theoretically it would send a workflow notification to the business users that this new element arrived. It would then be up to the business users to "adapt" or change their processes, and queries to make use of the element. This is a very crude interpretation of a Dynamic Services component. What's missing is the automated adaptation of the element through MODEL DISCOVERY, META DATA MINING, and ARCHITECTURAL MINING with scoring, and "approximation" of context for where the new element fits, and where it impacts the process flows downstream. Not to mention the automated CHANGE to the process flows (when appropriate). This begins to touch some of the components in a Dynamic System - but doesn't get to a Dynamic Warehouse. These are daunting tasks - and yes, we need to take baby steps along the way (and if someone can make a giant leap, all the better - if the market is ready). I would urge you to think of "Dynamic" in a new light, let's be more specific about WHAT it is we are creating that is dynamic rather than just throwing the label on an entire system. I'm curious to hear your thoughts, as always - I like to take this as far beyond the horizon as possible. This is just one way to look at "Dynamic Systems" and "Dynamic Data Warehousing", if you have other ideas, please let me know. Have a wonderful day, |