Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW, and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best, Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5 and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

October 2006 Archives

Have you ever thought about Master Data as a Service? Well, some companies are thinking this way. If this happens, a major paradigm shift will occur. This entry looks at MDaaS and its possibilities for changing the way we do business entirely. Who knows, maybe EII vendors could play in this space very well. After all, they are the ones with niche technology that really fits this space to begin with.

I'll blog on Master Data, the hype - the shenanigans, and the fears in my next entry. For now, realize that master data is important to the enterprise for many reasons.

Master Data means a number of things to a number of people, and I'm no exception - master data, to me, is literally the keys to a standardized kingdom. The cycle repeats itself in everything we build: first there's a new idea; then everyone implements their own interpretation of that idea as a gain, a benefit - the fact that they are different seemingly gives them "an edge." Then some of these edges fail, best practices and lessons learned emerge, and all the smart industry implementers begin to follow what really works - common ground, standardization, convergence of thought - and then the real players emerge.

This is what is happening with Master Data Management solutions. However, I think there are a couple of companies who are thought leaders in this space and who are making a difference today. One (of course) is my company, Myers-Holum, Inc. Others include IBM Professional Services, Cap-Gemini, Intelligent Solutions, Object River, Kalido, and of course my respected peers (here on B-Eye Network) like David Loshin, who write about MDM implementations.

But something caught my eye the other day: Cap-Gemini was saying that as a best practice, they take their customers' master data and house it off-site, so that the customer is not impacted by the machines, hardware, and extra support that master data requires. They enable the master data set with web services for their customer, and they surround it with Enterprise Metadata (or my term: Master Metadata).

When I first saw this, I thought: no, it's not possible that a company would release their intellectual capital (which master data really is - golden keys to the kingdom when implemented properly) and allow it to be stored off-site. Then I started thinking about differentiation, and then about standardization.

I realized very quickly that the same thing applies to master data that applies to SaaS - standardization of particular parts, geographical locations, customers, and so on. As long as the data can be "secured," treated with integrity, delivered on time, standardized, and made available - why not put it out as a service? Data Warehouses as a Service never really took off, and I'm not sure it ever will (maybe one day), but MD as a service, that's different. Why? It's operational data when we look at it: we deal with transaction-based, "now" information - small numbers of rows going through a web-service request.

What a gold mine! Now imagine you get common data from Dun & Bradstreet, you clean it up, and you standardize it over a web-service request; then you get common local census data (like the post office uses) and address data; you intermix these as master data sets and release them as MDaaS - you've got an interesting solution for the industry.

Suppose you load company profiles, SIC codes, and other public information - what happens? You can serve many different customers at the same time with the same data (master data that is standardized). The result is a "virtually compressed" image of the data, because you don't have to store a different copy for each implementation that is built. Voilà - costs stay down for the customers of the service, and the master data is updated and pushed, when changed, to the customers who have signed an SLA with you.
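To make that "one golden copy, many subscribers" idea concrete, here is a minimal sketch of what a shared master data set might look like. All table and column names are hypothetical, invented for illustration, not taken from any vendor's product:

```sql
-- Hypothetical MDaaS schema: one golden copy of company master data,
-- shared by every subscriber instead of being duplicated per customer.
CREATE TABLE master_company (
    duns_number     CHAR(9)      NOT NULL PRIMARY KEY,  -- standardized business key
    company_name    VARCHAR(200) NOT NULL,
    sic_code        CHAR(4),
    address_line    VARCHAR(200),
    city            VARCHAR(100),
    postal_code     VARCHAR(20),
    last_updated    TIMESTAMP    NOT NULL
);

-- Subscribers reference the shared rows; nothing is copied per customer.
CREATE TABLE subscriber_profile (
    subscriber_id   INTEGER      NOT NULL PRIMARY KEY,
    subscriber_name VARCHAR(200) NOT NULL,
    sla_refresh_hrs INTEGER      NOT NULL   -- push frequency agreed in the SLA
);

CREATE TABLE subscriber_entitlement (
    subscriber_id   INTEGER NOT NULL REFERENCES subscriber_profile (subscriber_id),
    duns_number     CHAR(9) NOT NULL REFERENCES master_company (duns_number),
    PRIMARY KEY (subscriber_id, duns_number)
);

-- A web-service request for one subscriber resolves to a small, keyed read:
SELECT m.duns_number, m.company_name, m.sic_code, m.city
FROM   master_company m
JOIN   subscriber_entitlement e ON e.duns_number = m.duns_number
WHERE  e.subscriber_id = 42;
```

The point of the sketch is the "virtual compression": every subscriber's entitlement points at the same physical rows, so cleansing and updates happen once and are pushed to everyone under their SLA.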

I think Cap-Gemini takes this one step further by offering MDaaS for ALL the data sets the customer has, with an agreement to keep certain company information confidential. Of course, if Cap-Gemini or any other MDaaS system is compromised, there will be a lot of stir in the ethics community, and compliance will become an issue. Cap-Gemini must also abide by in-sourcing requirements and different countries' rules, particularly with a global enterprise.

I think transactional Master Data as a Service is one wave of the future that I would ride. It's potentially a huge wave if it can be implemented properly, and security concerns can be addressed with encryption, compression, and proper data access. After all, the true nature of SOA is services, regardless of whether they are in-house or out-of-house, and the true nature of master data is consolidation and standardization, regardless of the company utilizing that information.

If you have any thoughts on why this would work, or wouldn't work - or what you think it would take to make it successful, I'd love to hear from you.

Cheers for now,
Dan L
CTO, Myers-Holum, Inc (http://www.MyersHolum.com)


Posted October 28, 2006 6:13 AM

Everyone's blogged on this, and quite a few more have offered their two cents on the topic. After reading about all of it, I figured maybe my voice could add a half-cent of value to the noise out there. This is my opinion of what the Oracle-Sunopsis acquisition means, and what Oracle really needs to do to solve its ailing sales issues in the EDW / ADW space. I recently wrote an article in Bill's newsletter about Oracle and clustering that explains the core issue in a roundabout manner. This entry is more direct.

Many companies are finding out that volume is a huge deal for them, as are real-time loads into an already overloaded and overwhelmed RDBMS engine. The problem is that companies are left with several key decisions to make:

1. Should we invest in a hardware solution? Is a single big box better than many small boxes? We want to consolidate, lower cost, etc. - but big boxes are expensive.
2. We are on RDBMS X today; should we consider Oracle, DB2 UDB (DPF), Netezza, DatAllegro, Sand, Calpont, Teradata, or SQLServer2005? What world-class appliance or RDBMS should we move to, and why? Volume is pushing our current system to the brink of self-destruction...
3. Our ETL just doesn't seem to cut it anymore; running transformations in-stream with these volumes has overwhelmed the CPU and RAM resources. Or the other one: running ETL with all these "file staging" intermediate steps has overrun our CPU, RAM, and disk resources. It simply takes too long. BUT we're caught in a catch-22: when we put the data into the RDBMS and try ELT, our system runs even slower... HELP! WHAT DO WE DO?

Then there are the everlasting questions:
a. We have EAI - our business wants SOA and web services - should we just run all this stuff over EAI or a message queuing system? OUCH - like it or not, you still suffer the volume problem.

It's all a shell game: pushing the volume off legacy or mainframe systems, collecting volume off the web, or gaining volume from customer growth are all problems that lead to the same conclusion - upgrade and change the architecture. But which part? What changes? And what do Oracle and Sunopsis have to do with this?

OK - so that's what I've been hearing in the market space (amongst other things). Oracle has been under fire from Teradata, Netezza, IBM, DatAllegro, Sand, and now Microsoft with SQLServer2005. Oracle in the context of OLTP (transaction processing) is an awesome database, and clustering for OLTP works wonders, especially in large clustered systems - but it drives up the cost of support.

Oracle has had problem after problem with Oracle Warehouse Builder (so many have discussed this, I don't feel I should give it more time) - ranging from metadata management, to user interfaces, to the lack of complex transformation ability. But the real killer to using Oracle as an EDW / ADW has been volume. Processing volume in the context of history (with volume already in the database - I'm not talking about loading an empty DB here), coupled with complex transformation requirements, has basically put Oracle out on a limb.

In my opinion, the problem stems from their clustering approach, rather than an MPP solution. Now, it's been said that they never had a "world-class" data integration / data migration ETL / ELT (ETLT) engine - so what did they do? They bought an up-and-comer who had a strong relationship with Teradata.

Sunopsis' metadata leaves a lot to be desired, but as a young engine in the marketplace they effectively leveraged their ability to generate highly specific Teradata SQL code to run transformations in the database (as ELT), gaining market share and visibility, particularly among the volume crowd on big systems. If the same customer were to put that level of volume on Oracle and then try to run ELT through Sunopsis on it, it would be Oracle's engine that would choke.
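As an illustration only - this is not actual Sunopsis output, and the schema and table names are made up - the kind of set-based ELT statement such a tool generates looks roughly like this, with the "T" running as SQL inside the target database rather than row-by-row in an ETL engine:

```sql
-- Hypothetical generated ELT step: raw rows are bulk-loaded into a staging
-- table first; the transformation itself is pure SQL executed by the RDBMS.
INSERT INTO dw.customer_dim (customer_key, customer_name, region_code, load_date)
SELECT s.customer_id,
       UPPER(TRIM(s.customer_name)),              -- cleansing pushed into the database
       COALESCE(r.region_code, 'UNK'),            -- lookup done as a set-based join
       CURRENT_DATE
FROM   stage.customer_feed s
LEFT JOIN ref.region r
       ON r.country_code = s.country_code
WHERE  NOT EXISTS (                               -- only load new business keys
          SELECT 1
          FROM   dw.customer_dim d
          WHERE  d.customer_key = s.customer_id);
```

Whether a statement like this flies or chokes depends entirely on how well the underlying engine parallelizes the joins at volume - which is exactly the Teradata-versus-Oracle point above.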

Now keep this in mind: Oracle EDW on a SINGLE BIG IRON BOX (without clustering) hums along just fine at a multi-terabyte level and would probably meet this need. Also, big iron keeps getting bigger and faster (single SMP, or mainframe with LPARs) - we're talking 64 to 128 CPUs with 64 GB to 248 GB of RAM...

I'm not talking about 500 GB of data here; I'm talking about 50 TB or more of historical data, with incoming information at 2 TB a day - a situation where ELT is necessary due to volume constraints.

Bottom line: I'm not going to speculate as to why Oracle thought buying Sunopsis would save its EDW market, or gain market share for it... But I will say: Sunopsis is a life preserver sized for an individual, when the whole ocean liner appears to be sinking.

Oracle needs to wake up! They MUST REWRITE THEIR DATA WAREHOUSE CORE ENGINE to be an MPP-enabled solution and drop this cluster stuff (except for OLTP) - they need two separate core engines, one for data warehousing and one for transaction processing. Again, I'm talking about enterprise-class engines where this makes a difference. Then and only then will the engine begin to support the volumes needed (at the performance needed) to make the "T" in ELT work properly. At that point they need to add a host of OLAP functions that operate within SQL (some they already have), and everything needs to run in parallel ALL the time on the enterprise platform.

When this has happened, they could be a formidable force to be reckoned with - and they just might be able to regain a foothold by producing a single suite of products (like IBM). They need to add metadata foundations to the Sunopsis product, add more MPP-like functionality and parallel processing ability to the core engine, and so on.

But I ramble. I realize this is fairly one-sided, but after years of very large data warehousing experience on executive-level projects, I feel it had to be said.

I'd love to hear any counter-viewpoints, or other thoughts you might have. If you work for Oracle and are willing to share, let me know...

Thanks,
Dan L
CTO, Myers-Holum, http://www.MyersHolum.com


Posted October 19, 2006 6:05 AM

This is a long-overdue entry from way back when. I've recently received some good questions on my original entry about a data modeling technique that sits squarely between Inmon and Kimball - or should I say, squarely between 3NF and star schema. This technique is called the Data Vault data modeling architecture. In this entry we'll dive a little deeper and answer some of the questions posted on the last entry.

You can find my first entry (and the questions) here. You can find my web site on the Data Vault - here.

Are there healthcare and/or insurance companies that have successfully implemented an enterprise or patient care data vault? (Question from a reader).

Yes. Blue-Cross Blue-Shield (Wellpoint in Saint Louis, MO) is just such a company, and they have (last I'd heard) 18TB or more in their Data Vault on an IBM DB2 UDB platform, using Informatica as an ETL tool, and IBM Websphere for EAI purposes.

What are some of the Best Practices around a hybrid hub-and-spoke approach (i.e. data vault with supporting data marts)?

Please keep in mind that standard enterprise data integration / data warehousing best practices always apply, no matter which data modeling technique you choose.

Some of the best practices include the following:
* Use views for virtual data marts (to produce dimensions and facts virtually) until performance becomes an issue. When it does, "instantiate" denormalized tables underneath to help with query speed (see the sketch after this list).
* Produce an ERROR MART for poor-quality data, or data that breaks business rules; move your "bad" data to the error mart, not into the accountable marts. This increases business accountability and the visibility of bad data (and ultimately the business' desire to fix it).
* If you're going to use Data Quality engines to "clean and alter the data," then ensure you have audit trails somewhere - either from the tool itself or, better yet, kept inside the Data Vault by using Link tables with Satellites to answer these questions. This will help with the auditability and traceability (compliance) of the data sets within the Data Vault structures.
* Remember the Data Vault is a true data warehouse that supports both active and batch feeds at the same time. It is very good for data mining (inside the database) and not so good (due to high numbers of joins) at data delivery. It is not for the average user to "sit down and query"; the average user should go through managed views, or a data mart should be built for those types of questions. Power users and trained business users looking for specific anomalies can use the Data Vault in all its glory to achieve some extremely high-end results.
* Denormalize using materialized views, or Point-in-Time and Bridge tables, which are defined on my web site but not (to any great degree) in my articles.
* Don't break the architecture - one of the strengths of the Data Vault is its flexibility and scalability over time: its ability to continue to meet business needs without huge impacts when business changes need to be made to the model. As with any modeling technique, if the formalized rules are broken, the model deviates into "never-never land" and begins to take on the characteristics of a ball of band-aids, which eventually is left out in the rain and melts away. It eventually stops meeting business needs and increases impacts to the point where the business is constrained and "can't afford" the impact that their change requires.
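To illustrate the "views as virtual data marts" practice in the first bullet above, here is a minimal sketch - hypothetical names, simplified to a single hub and satellite - of a Data Vault pair and a view that presents it as a dimension until performance forces you to instantiate a real table underneath:

```sql
-- Hub: the business key only, with load metadata.
CREATE TABLE hub_customer (
    customer_hkey   INTEGER      NOT NULL PRIMARY KEY,   -- surrogate key
    customer_number VARCHAR(30)  NOT NULL UNIQUE,        -- business key
    load_date       TIMESTAMP    NOT NULL,
    record_source   VARCHAR(50)  NOT NULL
);

-- Satellite: descriptive attributes, historized by load date.
CREATE TABLE sat_customer_detail (
    customer_hkey   INTEGER      NOT NULL REFERENCES hub_customer (customer_hkey),
    load_date       TIMESTAMP    NOT NULL,
    customer_name   VARCHAR(200),
    customer_city   VARCHAR(100),
    record_source   VARCHAR(50)  NOT NULL,
    PRIMARY KEY (customer_hkey, load_date)
);

-- Virtual dimension: the current satellite row per hub key, served as a view.
CREATE VIEW dim_customer AS
SELECT h.customer_hkey   AS customer_key,
       h.customer_number,
       s.customer_name,
       s.customer_city
FROM   hub_customer h
JOIN   sat_customer_detail s
       ON s.customer_hkey = h.customer_hkey
      AND s.load_date = (SELECT MAX(s2.load_date)
                         FROM   sat_customer_detail s2
                         WHERE  s2.customer_hkey = s.customer_hkey);
```

When query times degrade, the view definition becomes the load logic for an instantiated (or materialized) dimension table, and the marts downstream never notice the switch.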

Is there a matrix for when to use which data warehouse modeling technique? I know there is a lot around Inmon vs. Kimball, but nothing that includes the Data Vault. What business questions need to be answered to determine the technique?

Business / Technical questions that will lead you to the Data Vault:
* Is your current EDW so large that volume and performance are binding up the SQL queries?
* Is your EDW currently batch-driven, and star-schema based, such that going to a real-time enterprise is proving to be a real hassle?
* Is your EDW mature, and is the business saying "no - we can't afford the time/cost to make the change to our star-schema based EDW"?
* Does the data model of your EDW look like a patch-work quilt? In other words, does it follow all the rules of good clean data modeling, or have "exceptions" to the rules been made to accommodate business changes in the past, and now - making changes results in a huge list of impacts / time / or cost?
* Has the volume of data in your EDW, or the number of tables, made the EDW model inflexible to change?
* Does your model contain Junk and Helper tables? Why? What is the helper table "helping"? Is it helping the architecture, or is it helping the data set? If it's truly junk, why is it in the EDW model to begin with? According to good data architecture practice, nothing should be in the data model without a defined purpose.
* Is your business struggling with compliance, auditability, and accountability within the EDW?
* Does your business have a need to recover from a disaster quickly? If so, how many times has your EDW data model changed over the past year - which leads to: can the data from a backup truly be restored to a parent or child table?
* Does your data model suffer from Dimensionitis - too many dimensions created for too many purposes, leading back to stove-piped solutions rather than something that resembles an EDW / ADW?
* Does your data model contain cascading change impacts when parent tables need to be inserted at the top of the chain? (This is a result of modified 3NF-style data warehouse modeling, where the surrogate and the time-based key are both part of the primary key - see the sketch after this list.)
* Does your ADW/EDW model support splits of Type of data and Rate Of Change?
* Does your Dimension have Dementia or a split personality? In other words, how many of the attributes within the dimension REALLY represent or describe the business key at the top of the table? How many attributes are tacked on to the dimension just because they were "joined" somewhere upstream, and they look like they belong there?
* Has the level of volume reached a point where nested sub-queries begin to be problematic across your star schema, especially when you're examining a wide slice of customers with all their history?
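To show what the "cascading change impact" bullet refers to, here is a hypothetical contrast (all names invented for illustration) between a modified-3NF design where the time-based key is part of the parent's primary key - so every new parent snapshot ripples into every child - and a design that keeps the parent key stable and historizes in a child table, which is the Data Vault instinct:

```sql
-- Modified-3NF style: the snapshot date is part of the parent's primary key.
CREATE TABLE customer_hist (
    customer_sk   INTEGER NOT NULL,
    snapshot_date DATE    NOT NULL,
    customer_name VARCHAR(200),
    PRIMARY KEY (customer_sk, snapshot_date)
);

-- Children must carry BOTH key parts; a new customer snapshot means
-- re-pointing (or re-inserting) every dependent row - the cascade.
CREATE TABLE order_hist (
    order_sk      INTEGER NOT NULL PRIMARY KEY,
    customer_sk   INTEGER NOT NULL,
    snapshot_date DATE    NOT NULL,
    order_total   DECIMAL(12,2),
    FOREIGN KEY (customer_sk, snapshot_date)
        REFERENCES customer_hist (customer_sk, snapshot_date)
);

-- Contrast: keep the parent key stable and put the time series in a
-- child (satellite-style) table, so parent inserts never cascade.
CREATE TABLE customer_key (
    customer_sk   INTEGER PRIMARY KEY,
    customer_num  VARCHAR(30) NOT NULL UNIQUE
);

CREATE TABLE customer_detail_hist (
    customer_sk   INTEGER NOT NULL REFERENCES customer_key (customer_sk),
    snapshot_date DATE    NOT NULL,
    customer_name VARCHAR(200),
    PRIMARY KEY (customer_sk, snapshot_date)
);
```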

There are quite a few more; this is just a start. If you have some of these issues and wish to share anonymously, that would be wonderful. I'm looking for alternative and opposing views as well.

Thank you kindly,
Dan Linstedt
CTO, http://www.MyersHolum.com

Feel free to ask more questions, I'll try to post the answers.


Posted October 18, 2006 6:36 AM

I've just blogged on ETL and ELT - the basics of some of the differences. While it starts with the volume of data within the system, it doesn't end there. While the power of SQL on the DBMS hardware can really show its muscle, it's the metadata at the end of the day that makes all the difference in the maintenance world down the road.

Knowing SQL is critical, and understanding each RDBMS's specific set of functionality is also critical, but more than that - building the solution within a tool that can capture the metadata is vital to the long-term success of handling huge volumes.

Why is metadata so important? Because coding in SQL for ELT can quickly get out of hand, especially on large-scale projects where hundreds of jobs are necessary to perform the proper transformations. Metadata also helps speed time to delivery and lets us take advantage of best practices and lessons learned in the industry. If we aren't allowed to leverage our knowledge in a manner that fits the metadata, then why use a tool at all?

ETLT (as I described in articles three years ago) is a requirement - the ability to do both ELT and ETL depending on the needs of the design. However, designing the transformation or data flow in a single stream, then pushing that into SQL within the database - or even within the RDBMS engine, as SSIS (from Microsoft SQLServer2005) does - is a necessary task for future use. See my articles in Teradata Magazine: A Symphony of Two, Additional Thoughts #1, Additional Thoughts Part 2, and Tomorrow's Real-Time ETL.

In order for ELT to work, the power and functionality must be there within the RDBMS. As hardware becomes stronger and is bundled with RDBMS software to become an "appliance," it becomes critical that these vendors begin to cross-implement functionality. We (as enterprises) will have multiple levels of appliances from different vendors across the playing field in order to support different requirements. The "T" - the transformations that are available - should follow some standard, like ANSI SQL, in their functions.

Today, each vendor offers individual functions specific to their own database environment, which makes it nearly impossible to construct ELT for all environments from a single metadata source. The vendor that begins to implement cross-functionality will have a jump on the other vendors, because their "appliance" will fit in with the rest of the enterprise and can be better leveraged by the existing data flow diagramming (ETL/ELT) tools that are out there.
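A small example of the dialect problem: the same null-handling and substring logic has to be generated differently per engine, so one metadata definition cannot emit a single statement that runs everywhere (the staging table name below is hypothetical):

```sql
-- Oracle-flavored transformation:
SELECT NVL(region_code, 'UNK')       AS region_code,
       SUBSTR(postal_code, 1, 5)     AS zip5,
       SYSDATE                       AS load_date
FROM   stage.customer_feed;

-- ANSI-style equivalent of the same logic:
SELECT COALESCE(region_code, 'UNK')  AS region_code,
       SUBSTRING(postal_code, 1, 5)  AS zip5,   -- SUBSTRING(x FROM 1 FOR 5) in strict ANSI
       CURRENT_TIMESTAMP             AS load_date
FROM   stage.customer_feed;
```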

Remember, metadata is king - proper use of metadata (IN ANY TOOL) allows me to build systems 3 times faster than hand-coding.

Lastly, don't be fooled by RDBMS vendors: some would have you believe they are powerful enough to handle transformation (ELT) within the database when joining 100M rows to 100M rows to 100M rows. Unless the vendor can prove it - and can do it while returning a number of queries and running a batch load sequence at the same time - they are not worth their weight. Today, all of these things happen within the RDBMS while transformation is taking place; the larger the data set, the more active the system.

Thoughts?
Dan L
CTO, Myers-Holum, Inc


Posted October 17, 2006 4:03 AM