Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW, and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

August 2007 Archives

For too long the industry has preached that only quality data should go into the warehouse: cleanse the data, manipulate it, and then load it. The mantra has been "never release bad data to the end users." There are hundreds of articles written about this, and probably quite a few groups carrying this mantra forward. But I have to ask: where did this mantra come from? How on earth did it get "written in stone?" What, really, is the true value here? When we implement this kind of paradigm, do we RISK getting the warehouse "wrong?" Are we integrating away problems that are causing the business to hemorrhage money?

I would argue that this paradigm is partially wrong: a data warehouse should be a statement of integrated fact, NOT a "single version of the truth." Why? Because there are diamonds in the rough, my friend. What I mean is: there is gold in exposing to the business the GAPS between what the source systems say the data IS and what the business BELIEVES about how it operates and collects information.

These gaps lead to lost money, sometimes significant amounts, and to inefficient business processes. Unless we (I.T.) can expose the gaps between "what-is" and "what-they-truly-think," we cannot begin to help. We must learn to expose the good, bad, and ugly information in the system, bounce it against the business processes, and see where the business is broken. This is one of the true jobs of a good data warehouse. Armed with this information, businesses can begin to investigate why something is broken, or whether it's even important enough to fix. They can begin taking ownership of, and accountability for, their own "bad data" and the "bad system processes" that collect that information.

In one case, upon exposing "bad data," the business was able to find and fix a mis-billing calculation that had been hidden in a financial billing report (an operational report) for over 15 years. The business always "rounded away" the error because they couldn't find it. Now, if we (as a data warehousing / BI team) had "fixed" or "cleansed" the data going into the warehouse, we 1) never would have passed the financial audit when the warehouse was blamed for being wrong, and 2) never would have been able to expose the error to the business in the first place; the business would simply have demanded that the same "rounding error" be coded into the data warehouse report.

[Figure: business processing shift (BusnProcessingShift.jpg)]

This is just flat wrong. Now, am I saying there isn't value in cleansing and quality-checking the data? No, quite the contrary. I'm saying that when integrating the data going into the EDW, it should be RAW grain, with no changes to the data set, so that the errors and the patterns of errors can be seen. I'm suggesting that post-warehouse, rather than pre-warehouse, is the better place for quality and cleansing; in other words, put a filtered lens on the warehouse data on the _way_ to the data marts, not on the way in to the EDW. Then construct something called an ERROR MART into which "bad data" (rows without keys, rows that break "today's" business rules, etc.) can be funneled.
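To make the flow concrete, here is a minimal sketch of the idea in Python. The table names, the rule, and the row layout are my own illustrative assumptions, not a prescription for any particular toolset: data lands in the EDW untouched, and "today's" rules route rows to either a user-facing mart or an error mart only on the way out.

```python
RAW_EDW = []        # enterprise warehouse: raw-grain rows, loaded unaltered
SALES_MART = []     # user-facing mart: only rows that pass today's rules
ERROR_MART = []     # "error mart": rows that break today's business rules

def load_edw(source_rows):
    """Load the EDW exactly as the source system stated it -- no cleansing."""
    RAW_EDW.extend(source_rows)

def passes_todays_rules(row):
    """An example business rule applied downstream; the rule can change
    later without ever re-stating the raw history held in the EDW."""
    return row.get("customer_key") is not None and row.get("amount", 0) >= 0

def load_marts():
    """Apply quality and cleansing on the way OUT of the EDW, not on the way in."""
    for row in RAW_EDW:
        (SALES_MART if passes_todays_rules(row) else ERROR_MART).append(row)

load_edw([
    {"customer_key": 101, "amount": 250.00},
    {"customer_key": None, "amount": 75.00},    # broken key -> error mart
    {"customer_key": 102, "amount": -10.00},    # breaks today's rule -> error mart
])
load_marts()
print(len(SALES_MART), len(ERROR_MART))         # prints: 1 2
```

Notice that nothing about the raw history changes when the rule changes; only the routing on the way out does.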

This means a fundamental shift in the way we look at data warehousing as a practice. I've been building EDWs this way for 12+ years, with tremendous success; many (if not all) of these EDWs are still around today, growing, and passing compliance audits. What I'm saying is that the paradigm needs to change, and it just so happens that DW2.0 offers a rare opportunity to execute the change going forward without a major impact to what's already in place. I'm also suggesting that the new paradigm be built going forward whether or not you're engaged in a DW2.0 build-out.

Remove the complexities of executing business logic "up-stream" (on the way IN to the warehouse); move these business computations and complex calculations "down-stream," to the way OUT of the data warehouse, on the way to true data marts. This is the proper way to build and scale an enterprise data warehousing vision. Now, those of you who have built a "staging area with history at a raw level of grain" have the basics of this component; in fact it is an EDW whether you like it or not, even though it goes from "staging" into "federated data marts." However, if you find yourself changing, re-rolling, or re-stating the historical data every time the business has a fundamental change, then you have compromised compliance and auditability, and haven't executed the paradigm shift I'm discussing here.

If you are re-stating, re-rolling, or changing historical data to represent business changes, then you may be experiencing the pain of growth, high costs and huge impacts to business change, and basically a "melt-down" of the federated star schema approach (as an EDW only, not as a data mart delivery solution). If changing a conformed dimension to meet business needs has a HUGE impact list and a high cost, then you have issues around business logic "up-stream" that can be resolved by changing the paradigm.

BUT you need a good data model architecture in order to implement the paradigm I'm discussing above. You can use a star-schema data model (IF there are hundreds of small dimensions, IF the fact tables are at the lowest level of grain, IF the data is NOT altered on the way in, and IF the dimensions are not conformed). However, there is a data modeling architecture (recently endorsed by Bill Inmon) that I've spent 10 years designing (based on mathematical principles) and the last couple of years releasing. The modeling architecture is called the "Data Vault Data Modeling Architecture"; Bill has stated that this is the "optimal choice for modeling the DW2.0".

The architecture in and of itself helps to enforce the paradigm shift I'm discussing, but the approach surrounding the implementation of the architecture is where the magic lies. The Data Vault architecture is freely available, just like third normal form and the star schema. It is quite simple and returns to business modeling for its foundational build-out, specifically business keys. But I digress.
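For readers who haven't seen the Data Vault constructs, here is a minimal, illustrative sketch of the three core table types: hubs hold business keys, links hold relationships between those keys, and satellites hold descriptive attributes over time. The specific field names are my own assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class HubCustomer:           # hub: the business key, exactly as the source stated it
    customer_key: str
    load_date: datetime
    record_source: str

@dataclass
class SatCustomerDetail:     # satellite: descriptive attributes, full history kept
    customer_key: str        # ties back to the hub's business key
    load_date: datetime      # every change arrives as a new satellite row
    record_source: str
    name: str
    region: str

@dataclass
class LinkCustomerOrder:     # link: a relationship between two business keys
    customer_key: str
    order_key: str
    load_date: datetime
    record_source: str

# Example row, captured as the source system stated it:
hub_row = HubCustomer("CUST-1001", datetime(2007, 8, 31), "billing_system")
```

Because history is only ever appended to satellites, business rule changes never force a re-statement of the raw data, which is exactly the audit property discussed above.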

My points are: a) the business needs to be shown "what-is" versus "what-they-think-is," a gap analysis if you will; b) the data in the EDW needs to be accountable, auditable, at the lowest level of grain, and never "re-stated," "re-rolled," or otherwise altered; c) I.T. and the business are running a huge risk when "cleansing and quality" are executed on the way IN to the EDW instead of on the way OUT.

To that end, data quality processing is full of interpretation. Even if a "quality engine" is used, it can change/assimilate/aggregate data in ways that are incorrect. If there's no base-data or raw data to check it against, we can lose a compliance audit (and often do). When I.T. implements a process "on-the-way-in" to the EDW that changes data, then they are subject to the business stating "what you / your algorithms produced is wrong, now change it." Then, business & I.T. go to war, because the business can't agree on how to define it, and the requirements continually change. We (I.T.) are left trying to fit a round peg in a square hole, just to make it fit - or meet the needs of the many.

When a compliant and scalable EDW architecture is built (regardless of the data model), and star schemas for delivery are NOT federated together for the enterprise, then single stars can do what they do REALLY WELL: provide subject-oriented answers to specific business user groups rapidly. This means we can source the EDW many times for different stars, execute the business rules to put the data into specific context, and load individual star schemas for different uses.
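A small sketch of that idea, continuing the Python examples above; the rule sets and mart names are illustrative assumptions. The same raw EDW rows are sourced twice, each time under a different business-rule context:

```python
raw_orders = [   # raw-grain rows already sitting in the EDW, unaltered
    {"order_key": 1, "region": "EU", "amount": 120.0, "status": "shipped"},
    {"order_key": 2, "region": "US", "amount": -15.0, "status": "returned"},
    {"order_key": 3, "region": "US", "amount": 300.0, "status": "shipped"},
]

def build_mart(rows, rules):
    """Each star schema applies its own rule context on the way out of the EDW."""
    return [r for r in rows if all(rule(r) for rule in rules)]

# Finance star: only positive, shipped revenue counts.
finance_star = build_mart(raw_orders, [lambda r: r["amount"] > 0,
                                       lambda r: r["status"] == "shipped"])

# Returns-analysis star: the "bad" rows are exactly what this audience wants to see.
returns_star = build_mart(raw_orders, [lambda r: r["status"] == "returned"])

print(len(finance_star), len(returns_star))   # prints: 2 1
```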

One more note about data quality / information quality: if it is done, it should be done AT THE SOURCE SYSTEM LEVEL, where the system of record exists. This way, both the original and the "cleansed" data can be audited at the source and fed into the raw warehouse from the source. These "cleansed" values should be governed by a governance process and a data management specialist. On the flip side, quality can be executed when loading data from the EDW to the star schema, and what I recommend there is that I.T. build "error marts" for broken data and move the data that doesn't meet "today's version of the truth" into them (note: my definition of a mart is any data that is customer- or user-facing; I am not implying any particular modeling method by the word MART).

Hope this helps,
Dan Linstedt
PS: Love to hear your thoughts, comments and experiences.


Posted August 31, 2007 6:40 AM

Every once in a great while I have a small idea bubbling around in my head. These ideas seem like interesting points of view and make themselves known either in the twilight hours or just before I'm about to fall asleep. Anyhow, they are interesting ideas, and I am choosing to share them as they occur. They could be anything from observations to idiosyncratic comments; some may be useful while others will (hopefully) be forgotten. I ask you now: if you like an idea, spur the conversation by offering a comment or two. Who knows? Maybe, just maybe, it will be good....

Let's engage the warp drive past the horizon.


Posted August 14, 2007 10:26 PM

There's a movement afoot in the appliance world. Appliances are growing up. They are getting faster, smaller, cheaper, and yes: more specialized. I had it in my mind that the appliance market would converge on a single platform and provide common plug-and-play hardware interfaces; well, that just doesn't seem to be the case (maybe in the future, but then again maybe not). In this entry we will explore the different classes of appliances that are available and what they do. We will also take a look at where they may go within the next 12 to 18 months.

There are classes of appliances for DW and BI these days, and there's still some debate about what "appliances" really are. But all of that aside, if we take a general definition of the "Appliance" for the BI / EDW space, then it might look something like this:

A basic plug-and-play black box with a programmable interface and an embedded data management engine (it might be a database, an indexing engine, or something else). In any case, it manages "data in, and data out" at high rates of speed, mostly through network traffic and a listener of some sort. Most of these appliance boxes come as autonomous, network-aware components with self-contained hardware. The "magic" is usually buried somewhere in the firmware algorithms, the high-speed data stores, the caching mechanisms, and the internal data placement.

Some would go so far as to say: if you could predict the context of a query, and match it with the context of the data before executing the query, you'd have the world’s fastest data retrieval and data placement engine. The problem is: the context changes when the query / question changes. The other problem is: data by itself is not deterministic of context; therefore logical context groupings do not make sense in the storage patterns.

This is the age-old problem of Random Access File Systems on Physical Disk (Hard drives). Some of this can be solved with RAM disk, USB Flash Drives, and so on - but I ramble.

So let's see what we can find about appliances. I used a web search with the following terms to dig up information:

Term: "BI Appliance"
IBM MidMarket Appliance entry
Celequest Operational BI Appliance
Business Objects and their Open Appliance Initiative
HP Launches BI Appliance
Business Objects forms alliance with Netezza
Ingres and JasperSoft prepare BI Appliance

Term: "DW Appliance"
Sun & Greenplum DW Appliance
DATAllegro Appliance
Active-Base discusses Hybrid DW Appliance

And let's not forget: "Data Appliance"
Dataupia Product Overview
ArcGIS Data Appliance: http://www.esri.com/software/arcgis/arcgisonline/about/data-appliance.html
Data Mining Appliance: http://news.thomasnet.com/fullstory/482588/2585
Data Backup and Restore Appliance from Dataedge

Ok, so what does all this mean?
Well, I like to think that it means several things:
1) By reading through these articles, news releases, and other information, it is easy to tell that the water is muddy (the term "appliance" can mean almost anything). I even found a post from Oracle, from 1999, about a "Data Appliance" that Oracle proclaimed to produce.
2) There seems to be a distinction being made between "Data Appliance," "BI Appliance," and "Data Warehouse/Warehousing Appliance."

This brings me to the following conclusion: a data appliance simply manages data access and retrieval, regardless of the type of data, the source or target of the data, and the functionality. Within "Data Appliances" there might be specific needs (like backup/restore), high-speed text/unstructured access, or database "file" access. The question here is: what, then, differentiates a "Data Appliance" from something like an EMC, Hitachi, or Fujitsu smart SAN/NASD array? At first glance, not much. However, when we pull back the covers we begin to see some levels of specialized function: managing database files, managing unstructured text documents, or specifically backup/restore as if it were a hot-swap drive within an existing RAID array.

This is a hard market to compete in; producing differentiators will be critical to the success of specific vendors like Dataupia. They are already making a splash in the BI world, but they'll have to go a few steps further (which they are already investigating, according to their partner's page).

The BI appliance can mean all kinds of things, but one thing (in my mind) it certainly DOESN'T mean is: "I don't need a data warehouse anymore." That couldn't be further from the truth. There are some vendors out there touting Business Intelligence without a Data Warehouse. That may be, and yes, you can get Operational Business Intelligence (a phrase coined by Claudia Imhoff and Colin White) without a Data Warehouse, but Claudia and I agree: to get the analytics, historical trends and patterns, and provide true data mining capabilities, a data warehouse must be part of the picture.

So, where does that leave the BI APPLIANCE?
Out in the rain... (sorry, just kidding). Really, they are extremely useful and very valuable; they can replace "data mart" solutions (I define a data mart as any architected/governed/managed data delivery solution where data is rolled up, cleansed, aggregated, and altered from its original state). Remember, I define a Data Warehouse as RAW data... but that's a topic for another time.

The BI Appliance can become the Data Mart Appliance, or act as an Operational BI Appliance, possibly incorporating the physical ODS (operational data store) within its hardware. The BI Appliance can speed up queries, understand data access patterns, and focus on sharing data (transactional data, that is) across web services. But this is an Operational BI Appliance.

What are you talking about?
The term "BI" (business intelligence) really refers to people, architecture, data, databases, hardware, ETL, ELT, data delivery, cleansing, quality, metadata, management, reports, and so on. Business Intelligence includes all components and PEOPLE who make the decisions - it isn't simply the "REPORTING" engine as we've all been lead to believe. Step back for a minute, Business Intelligence has been around since the inception of competitive business... Ok - enough soap box, you get the picture, BI includes data warehousing.

So what, why should I care?
Probably because the appliances are coming to an I.T. department near you -- and for $9.95 (*actual price may vary, today only) you can get a set of GINSU knives too! Just kidding. In reality, appliances have a VERY attractive proposition: they LOWER the total cost of ownership (TCO).

How do they lower the cost?
1. The knowledge needed to manage, tweak, and maintain a database engine can (in certain circumstances) be eliminated. For instance, when a Netezza Appliance is purchased (it is a really good replacement for star schemas that are currently housed in traditional DBMS systems), the system admin doesn't need to know anything about indexing, block sizes, performance and tuning, and so on. In fact, fewer DBA resources are needed to manage it, so the cost drops. But does it work with operational data or normalized data warehouse architectures? NO. This is where it falls down.
2. Performance increases through engineering buried in the hardware, the firmware, and partnerships.
3. Self-enclosed, self-healing devices; just hot-swap the disk if there's a failure. I'm still waiting for hot-swappable RAM, hmmmm... RAID 5 for RAM? Interesting.
4. Network pluggable and self-configurable (for the most part).
5. A single management console, no matter how many "devices/racks/stacks" are put into the network.
6. Embedded functionality. Dataupia and Celequest offer functionality embedded within their appliances - things like ETL, ELT, OLAP cube support, and so on. Some of these are done at the firmware level, others in the partners' software, optimized for high-speed data access.

Ok, so you get the picture - there's a whole lot more to this than I've discussed, but it is clear that this is a compelling bundle of functionality. Who wants or needs to maintain "separate hardware, separate databases, separate functionality" anymore? Why not have it all bundled and working together, pre-packaged and already performance-tuned?

What does the future look like?
I think we'll see a continued pendulum swing toward specialized, lower-cost appliances that are highly focused on solving specific tasks, before we see someone produce a "hardware grid appliance" that attempts to standardize the management of all the appliances plugged into it (this may take a couple of years).

For now, there will be an appliance for reporting, one for databases, one for OLAP cubes, one for Web Services, one for ETL, one for Quality/Cleansing, hopefully one for Metadata and so on.

The race goes on, and we will continue to see different vendors enter and leave this space. I also think this space is ripe for acquisitions and consolidations.

Cheers,
Dan L
Get your master's degree in BI from Denver University
See our board members at: http://www.COBICC.org


Posted August 14, 2007 8:26 AM

I've been out for a while; getting new companies started can be hard to do... But I think we're getting there. Anyhow, as everyone knows, I've got an interest in the real nature of things: how they work, how they change and adapt themselves, and how automated computing facilities can bring these systems closer and closer to the business users. Ultimately we'd like the business users to manage, and suggest changes to, the back ends through uncomplicated interfaces. A lot of terms have popped up on the internet (even just over the past two years) which I'd like to explore, and then we'll dive into where I think the industry should be going. In particular, one term is driving me to sleepless nights: "Dynamic."

It seems to start and end with the term Dynamic. Throw the word onto the front of existing terms (like decision support, BI, DW, web services, etc.) and we have a whole new ball game. Just run a search in an engine like http://www.SurfWax.com - which uses a neural network to really find what you might be looking for - and see the hundreds of results that pop up.

I've been following some really interesting writing on visualization and business tactics in Communications of the ACM and the IEEE journals, and in the latest print volume, Dynamic Decision Support caught my eye.

I'll try to summarize for you what they were getting at:
Dynamic Decision Support (according to them) includes the notion of tying business process workflow to a visual design system on the front end, which allows the users full drag-and-drop access to their data, their metadata, and the lineage - all connected within ontologies and sub-ontologies defined for each "workflow component." The business users would then design different workflows, and thereby re-organize the data sets underneath to produce their own reports and their own "processes." In another article in this magazine, they go on to discuss how the changes to these processes can be measured and managed in terms of impacts.

As you might imagine, there would be security around this entire process to keep users from blowing the larger business flows out of the water. However, they all live in this "world" and play with different levels of "process flow metadata" to help streamline and optimize the business itself. In other words, in their eyes, DSS is no longer just a reporting interface, nor is it simply dashboard development, but an interactive process enriched with metadata, process analysis, and business workflows that manage dependencies, definitions, ontologies, and so on. The end result is getting the data out to a platform like Excel or your favorite reporting tool, in an automated fashion.

Of course, they also state that this direction is several years off; there's a lot of engineering to be done to ensure the quality, consistency, and usability of such interfaces. There is also a level of business education that must take place, alongside fundamental shifts in the way we do business. But another point they make is that business must stay nimble: in order to compete, businesses (even large ones) must be flexible, and the employees who do the daily jobs are the ones armed with the knowledge of how best to change the business for the future.

What does this have to do with DYNAMIC?
Everything. Too often this term is abused on the web - it's like the new buzzword stuck in front of everything else. For instance: I went to lunch and ordered a dynamic pizza, or a dynamic menu that changes based on my dress style, or a dynamic presentation, or a dynamic speaker, or a dynamic blog entry... You get the point; it's been tossed in front of everything: Dynamic Data Warehouse, Dynamic Decision Support, Dynamic Reporting, Dynamic ScoreCards, Dynamic ETL, Dynamic Data Integration, Dynamic Web Services, Dynamic SOA.

But what does it _really_ mean?
In my humble opinion, the word "dynamic" should not be used lightly. Yes, there are standard definitions dating back to the Greeks, but I am of the opinion that when we use the word Dynamic in front of today's technologies we really are saying that structure (parts of the metadata) changes with the data and the processes. Well, at least that's what I am suggesting. I contend that the nature of "dynamic systems" should expand to include "adaptive systems"; to me, a dynamic system is one that enhances the user experience and adapts to the user experience in a holistic fashion.

I think this is a paradigm shift, and there are different levels of getting to a Dynamic Data Warehouse, or a Dynamic Operational System, or even a Dynamic Enterprise Architecture. I think the levels are already built by standard ontological definitions of terms.

For instance: Dynamic Data Sets, Dynamic Information (a bit higher level), Dynamic Hardware, and so on. There is a difference in my mind between "active" and "dynamic." Again, I would take Dynamic Data Warehouse to encompass or include all the following components:
* Dynamic Data Sets
* Dynamic Hardware
* Dynamic Metadata
* Dynamic Structure
* Dynamic Technical Metadata
* Dynamic Business Metadata
* Dynamic Process Flows
* Dynamic Architecture
* Dynamic Reporting
* Dynamic Data Mining

These are the general components included at an architectural level in a non-dynamic or traditional Data Warehouse. They should (in my opinion) be defined as part of a Dynamic Data Warehouse. The same can be said for the term: "Real-Time" which I'd prefer to call "Right-Time", and the term "Active".

How do we define these components?
By defining each of the terms laid out above with further definitional content, I think we can walk further down the path. Dynamic, to me, means capable of adapting to change in an automated fashion. Some of these terms are self-explanatory, but here are a few that I think have been missed by the industry:

Dynamic Data Warehouse - if we accept that this term describes a "complex system" which includes architecture, design, data, execution, software, and hardware, then we should also expect each of these components within the system to be "dynamic" and autonomous in its ability to change with the surrounding environment. A "Dynamic Data Warehouse" is _not_ (in my opinion) simply a hardware platform with a database that accepts right-time data and might have a DSS thrown in.

There's a difference between providing Dynamic Access to right-time data, and Dynamic Data Warehousing in my opinion.

In my definition, the structure (metadata) must adapt, the business rules must be flexible and truly at the fingertips of the business users, those same rules must be part of the governance that helps the automated changes take place in the back-end, and so on. But if we look at a very simple example, let's see what a "Dynamic Data Model" might be:

Suppose I have an SOA in my organization, and at its heart is a common data model surrounded by a layer of common data services. These services "listen" to the outside world for transactions. When a transaction is ready, some of these services are smart enough to check the metadata (the version of the sending service and the definition of the transaction itself). In doing so, the service discovers a new "XML" element; a new piece of information has been added. Now, in order to be a dynamic service, it must check the common model it has and decide "what" to do with the new data: where does it put it? How does it adapt the model? Can it adapt the model? What's the impact of the change if it were to make one? Does the service simply "stop" in mid-stream and fire an alert?

Well, the easiest answer is that it takes the XML as a blob and sticks it into an "XML aware" database object; theoretically it would then send a workflow notification to the business users that this new element has arrived. It would then be up to the business users to "adapt" or change their processes and queries to make use of the element.
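Here is a crude sketch of that behavior in Python, using only the standard library; the element names, the known-model set, and the notification queue are illustrative assumptions. The service stores the payload untouched and simply tells the business when something new shows up:

```python
import xml.etree.ElementTree as ET

KNOWN_ELEMENTS = {"customer_id", "order_id", "amount"}   # today's common model
XML_STORE = []                                           # "XML aware" storage
NOTIFICATIONS = []                                       # workflow queue to the business

def receive_transaction(xml_payload: str):
    """Compare an incoming transaction against the known model; never alter it."""
    root = ET.fromstring(xml_payload)
    incoming = {child.tag for child in root}
    new_elements = incoming - KNOWN_ELEMENTS

    XML_STORE.append(xml_payload)            # keep the raw payload either way
    if new_elements:
        # The service does not guess context; it tells the business instead.
        NOTIFICATIONS.append(f"New element(s) arrived: {sorted(new_elements)}")

receive_transaction(
    "<txn><customer_id>7</customer_id><amount>19.95</amount>"
    "<loyalty_tier>gold</loyalty_tier></txn>"
)
print(NOTIFICATIONS)   # ["New element(s) arrived: ['loyalty_tier']"]
```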

This is a very crude interpretation of a Dynamic Services component. What's missing is the automated adaptation of the element through MODEL DISCOVERY, META DATA MINING, and ARCHITECTURAL MINING with scoring, and "approximation" of context for where the new element fits, and where it impacts the process flows downstream. Not to mention the automated CHANGE to the process flows (when appropriate).

This begins to touch some of the components in a Dynamic System, but it doesn't get to a Dynamic Warehouse. These are daunting tasks - and yes, we need to take baby steps along the way (and if someone can make a giant leap, all the better - if the market is ready). I would urge you to think of "Dynamic" in a new light; let's be more specific about WHAT it is we are creating that is dynamic, rather than just throwing the label on an entire system.

I'm curious to hear your thoughts, as always - I like to take this as far beyond the horizon as possible. This is just one way to look at "Dynamic Systems" and "Dynamic Data Warehousing", if you have other ideas, please let me know.

Have a wonderful day,
Daniel Linstedt
http://www.COBICC.org (Colorado Business Intelligence Community Connection)


Posted August 3, 2007 6:30 AM