Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor at The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

September 2008 Archives

VLDB (very large databases) and VLDW (very large data warehousing) are two different terms in the industry that stir up a lot of debate. The terms have been changed, altered, re-defined, and brought back to the table many times by many people. There are many problems associated with implementing "big systems" and not very many solutions (although vendors are trying). There are also some major business questions around these data sets and the application of such large data sets.

In this entry I will explore the business questions, and the technical challenges faced by big systems. I will attempt to hold my opinion, and see what the responses are - what do you think are issues faced by your business?

First, as always, let's level-set the terms by defining what we mean by "big systems".

VLDB - A large database, with large amounts of information being loaded by a trickle feed, and large amounts of information being queried 24x7x365 (always up). This creates a mixed workload environment. An example system might be a telephone switch data capture system hit by quality control and financial analysts looking to see where they are losing and making money NOW (all current information). Typically sized in the range of 50TB to 150TB of operational-type data.

VLDW - A large database, inclusive of history (making it a data warehouse) at a granular level. Typically loaded at anywhere between 3-minute and 24-hour intervals, with queries against large amounts of history mixed in with queries that are "wide" but not "deep" - mixed workload, 24x7x365, detailed data set, raw data set. An example might be all the history of the telephone switching systems mentioned above, so the analysts can determine over time which switches/hosting facilities have the most problems, where bandwidth is frequently overloaded, and what the patterns of overload actually are. Typically sized in the range of 150TB to well over 800TB of historical information (that is ACCESSED).

I'm not discussing systems where "I have 800TB, but it's all on storage, and we load weekly..." - no, that's not what I'm talking about.
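Before the questions, here is a minimal sketch to make the "mixed workload" distinction concrete. The table and column names are hypothetical, and the SQL is generic pseudo-SQL held in Python strings rather than any specific vendor's dialect: a tactical "wide but not deep" query and a strategic deep-history query, both of which have to run while the trickle feed keeps loading.

```python
# Hypothetical table/column names; a sketch of the two query shapes a VLDW
# must serve concurrently with trickle-feed loading -- not vendor-specific SQL.

TACTICAL_QUERY = """
-- "wide but not deep": current state of one switch, many columns, little history
SELECT *
FROM   switch_call_detail
WHERE  switch_id = :switch_id
AND    call_start_ts >= CURRENT_DATE          -- today's records only
"""

STRATEGIC_QUERY = """
-- "deep": years of granular history, aggregated to find overload patterns
SELECT switch_id,
       EXTRACT(HOUR FROM call_start_ts) AS hour_of_day,
       COUNT(*)                         AS calls,
       SUM(dropped_flag)                AS dropped_calls
FROM   switch_call_detail
WHERE  call_start_ts >= CURRENT_DATE - INTERVAL '3' YEAR
GROUP  BY switch_id, EXTRACT(HOUR FROM call_start_ts)
"""
```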

The business questions that are under controversy include: (remember, I'm going to hold my opinionated answers until later)
1) Do we really NEED all this data? What does it buy the business? What can be learned from this?

2) What could possibly be hidden in 800TB that the business users access?

3) What tactical questions are answered by having raw data (transactions) loaded to the VLDW?

4) Why can't the operational system (VLDB) serve as the system of record?

5) What does the VLDW have that the VLDB doesn't? Why do I need to justify the existence of both?

6) How do I mitigate risk of failure of either system?

7) Do I need replication technology instead of "backup" technology for fail-over and recoverability?

8) Is there a SINGLE RDBMS engine that will answer these questions AND scale beyond?

9) Do I need to scale beyond 300/400/800TB? What will that buy me?

And the technical questions:
1) How do I manage backups and restores of this much information?
2) Is data modeling really necessary?
3) Why can't I just cluster my machines together? Why do I need MPP or big-iron SMP to make this work?
4) How do I get the DBMS to handle mixed-workload queries?
5) Why does the system "go down" when I fire up massive loads WHILE querying?
6) Why do vendors continue to push TPC-H performance when that isn't my "real world"?
7) What's the difference between systems at 300TB and systems at 800TB?
8) What changes to my architecture/network/OS do I need to make to accommodate this scale?
9) Why can't the users get along with "LESS DATA"? Do they really use all of this?

Love to hear your thoughts,
Dan L
DanL@DanLinstedt.com


Posted September 25, 2008 12:15 PM

In part 7 of this series I mentioned that I would share how to number the deliverables of a project to assist in monitoring progress and managing metrics (the KPAs and KPIs of the project). In this entry I provide some very simplistic starting blocks for how this is done within a project methodology, and hopefully there's light at the end of the tunnel where we can begin to see the impact on risk, hour estimates, cost measurements/forecasts, and delivery actuals. This entry is all about project management and deliverables - how to tie them together.

If you are SEI/CMMI certified, or are an auditor, I would love to have your feedback and comments on these subjects. If you are familiar with PMP, Six Sigma, or TQM, it would be good to hear from you as well - please, tear down these ideas if they do not fit with your experience. I can only learn from you if you respond. :)

In my past 15+ years (now going on 20 years) of IT experience, I've spent maybe 6 to 8 years managing projects (technical project management). Among the hats I wore were business analyst, full project manager, and team enabler. Under the guise of lean initiatives and cycle-time reduction, I learned a few things that seemed to make sense at the time.

I got tired of estimates that didn't match the project plan, couldn't be scoped properly, or couldn't be attached to actuals and deliverables. I got tired of risk running rampant and killing projects before they started. I got tired of always being asked "how complete are you and your team on task X?" when the real question was: "How close to done are you with the requirement... you know, the requirement that discusses ZZZZZ...."

I needed a way to track all of this, and furthermore to be able to press a button and run some analytical reports/graphs (every day) on the project as we moved along. So, borrowing from technical writing requirements, from SEI/CMMI, and from the legal profession (of which all I know is that they number every paragraph)... I started numbering everything I could find.

For instance, I went through the business requirements, numbered all TITLES and SUB-TITLES, and paragraphs.
1.0 Requirements Overview
1.1 Requirement 1
1.1.1 Response time for req. 1
1.1.1.1 The expected response time........ (paragraph)
1.1.2 Types of queries...

You get the idea. Next I numbered the technical requirements to mesh with the business requirements. I aligned the requirements to match up in a matrix of "this is what they want, and this is how we propose to build it." This was appropriately called "IT alignment" (at least it was in the '80s...). Then, I took the technical requirements and began numbering EVERY line-item in the project plan. By the way, this became a GREAT way to spot (stated) requirements that were missed in the project plan... an interesting loophole catch.
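As an illustration only (the numbering scheme and names below are hypothetical, not from the original project), here is a minimal sketch of what that requirement-to-plan traceability matrix can look like once everything is numbered. The payoff is exactly the loophole catch: requirements with no plan line item behind them fall out immediately.

```python
# Hypothetical numbering: business requirements (1.x), technical requirements
# (T-x), and project-plan line items (PP-x). A sketch of the traceability
# matrix described above -- unmapped requirements are the "loophole catch".

business_reqs = {
    "1.1":   "Requirement 1",
    "1.1.1": "Response time for req. 1",
    "1.1.2": "Types of queries",
}

# "this is what they want, and this is how we propose to build it"
tech_to_business = {
    "T-1.1": ["1.1", "1.1.1"],   # tech req T-1.1 satisfies these business reqs
    "T-1.2": ["1.1.2"],
}

# project plan line items tied back to technical requirements
plan_to_tech = {
    "PP-010": ["T-1.1"],
    "PP-020": [],                 # a plan task with no requirement behind it
}

covered = {b for techs in plan_to_tech.values()
             for t in techs
             for b in tech_to_business.get(t, [])}
missed = set(business_reqs) - covered
print("Business requirements not covered by any plan line item:", sorted(missed))
```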

I then thought to myself: Self.... (just kidding)
I took the project plan and assigned roles rather than people (as responsible parties). For this, I developed a roles & responsibilities document, and then numbered that too (independently of the requirements): RR1.0, RR1.1, RR2.0, etc. I took the R&R numbering system and attached it back to the technical requirements, then assigned resources to the roles and responsibilities, and to each of the resources I assigned resource loading. This ended up becoming the work-breakdown structure (Project + Tech requirements + RR).

Next, I created an organizational breakdown structure (org chart), and developed an escalation path for each role, numbering each element in the org-chart as I went, assigning RR1.1 to a specific org unit. Now we knew where the risk would be handled, or escalated as things got hot (if they got hot). Next on the list was a process breakdown structure (AS-IS process flows). We needed to know how the data currently moved from one business unit to another, from one system to another. We developed process flows at 30,000 feet and above - then numbered all of them with the appropriate business requirements number (which tied the artifacts to specific components of the project plan).

Then, we immediately began designing new (to-be) process flows, which re-defined some of the interfaces and how the data would flow to the warehouse, out to the marts, and back to the reports in the business users' hands. We then numbered each of these to-be process flows with the "original process flow numbers," tying them together. As we built the "to-be flows" and completed the process re-design, we could attach these components to milestones reached within the project and produce deliverables consisting of data and process to the business users.

Finally, we went through each major section of the technical requirements and assigned risk analysis templates by applying expected skill sets (balanced against the R&R, the org chart, and availability) - we developed a low, medium, and high risk score. We then set a threshold for warning (approaching high risk) at which we would begin escalation procedures up the org breakdown structure.
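A minimal sketch of that scoring idea follows (the weights, thresholds, and org-unit numbers are hypothetical, not the ones we actually used): each numbered requirement gets a risk score derived from a couple of scored dimensions, and anything that crosses the warning threshold gets flagged for escalation up the org breakdown structure.

```python
# Hypothetical weights/thresholds; a sketch of per-requirement risk scoring
# with a warning threshold that triggers escalation up the org chart.

RISK_LEVELS = {"low": 1, "medium": 2, "high": 3}
WARNING_THRESHOLD = 2.5   # "approaching high risk" -> begin escalation

# requirement -> (skill-availability risk, schedule risk), scored low/medium/high
requirement_risk = {
    "1.1.1": ("medium", "high"),
    "1.1.2": ("low",    "low"),
}

# requirement -> org unit that owns escalation (from the numbered org chart)
escalation_path = {"1.1.1": "OBS-2.1", "1.1.2": "OBS-2.3"}

for req, (skill, schedule) in requirement_risk.items():
    score = (RISK_LEVELS[skill] + RISK_LEVELS[schedule]) / 2
    if score >= WARNING_THRESHOLD:
        print(f"Req {req}: score {score:.1f} -> escalate to {escalation_path[req]}")
    else:
        print(f"Req {req}: score {score:.1f} -> OK")
```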

Needless to say, there were many other deliverables (all docs were versioned in keeping with CMM), all processes were measured, quantified, and then optimized, and the Data warehouse (now some 15 years later) is still running strong.

Ok - you think this is a lot of work? We did this with a team of 3 people, plus 1 person from the PMO (Project Management Office), plus 1 DBA part time, plus 1 senior/expert data modeler/data architect part time. And we accomplished delivery of the full production warehouse inside 6 months for 3 source systems (Planning, Manufacturing, and Finance). The EDW consisted of 60+ tables; the source systems were around 300+ tables, with manufacturing bills of materials.

It can be done, with the right people, the right training and the right expertise - and the benefits can be enormous. It doesn't take a huge bankroll to institute this type of "project governance" or maturity model for EDW projects, just dedication and consistency.

Hope this helps,
Dan Linstedt
danL@DanLinstedt.com


Posted September 24, 2008 8:51 PM

I've blogged about this topic for many years now; my first mention of it was in my www.TDAN.com articles regarding the Data Vault modeling architecture. That said, I've been blogging on everything from autonomic data models to dynamic data warehousing, but in my research I've come to realize I've left out some very critical components. I've lately been experimenting with building a self-adapting structured data warehouse. There are many moving pieces and not all the experiments are finished, so I cannot write (yet) about any of the findings. But here, I'll expose some more of the underbelly, as it were, that is necessary to make DDW a reality (in my labs anyhow)....

I've tried and tried to find a new name for this thing, but alas, it just seems to elude me. Dynamic Data Warehousing seems to have a nice ring and is quite a nice fit. The term, however, evokes all kinds of different meanings for different companies and different people. So much so, that I've had open discussions with IBM in the past about their use of the term! Oh well, water under the bridge.

But that brings me to my next point. There are missing components to my definition of DDW; I didn't get it all, and I'm sure this is just another step in the definition (and that the definition will not be completed for another year or two). If I look back at what's going on, I see the following:

Convergence of:
* Operational Processing and Data Warehousing.
* Master Data and Metadata to use the Master Data Properly
* Tactical decisions backed by strategic result sets
* Business, Technical, Architectural, and Process Metadata
* Real-Time and Batch processing
* Standard reporting technologies and "live animated scenarios" with walk-throughs and 3D imagery
* Human-machine interfaces
* MPP RDBMS systems and Column Based Database solutions

Why then shouldn't we see convergence of "data models" and "business processes"?
or "Data Models" and "Systems Architecture"?

The point is: WE ARE. (Or at least I am.) Not only is this happening in my labs, but it's being requested of me when I visit client sites. The customers want "1 solution", or better yet, they want a solution that "appears to learn" based on the demands put upon the system.

Why do I say "appears to learn"?
Because machine learning and the appearance of a machine translating context are two totally different things. I cannot and will not claim to have made a machine that thinks. However, I can and have made a machine's enterprise data warehouse responsive to external stimuli - at least when it comes to the data model, loading routines, and queries. Please do NOT mistake this for anything more than AI applied in a new manner - mining metadata (structure and queries and load-code and web-services) rather than just mining the data sets themselves. (More on that later, much later --- I still have a LOT of research to do.)

Ok - so what's missing from the Dynamic Data Warehouse definition?
* Use of metadata: business, technical, and process during the model learning/adaptation phase
* Use of an ontology (part of business and technical metadata as described above)
* Use of a training model; all good neural nets need to be trained over time and then corrected.
* Use of the queries to examine and compare HOW the data sets are being used and accessed against the current data model
* Use of a minimal load-code parser, again to assist in training the neural net to recognize the correct structure.

Anyhow, you get the point. Dynamic Data Warehousing is about a back-office system that responds to changes in the structured data world - as the queries change, the indexes change. As the incoming data set changes, the model needs to change. Some queries (if consistent enough) can actually express new relationships that need to be built.
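To make that last point less abstract, here is a minimal sketch (hypothetical structures, not my research code) of what "mining the queries against the current data model" can mean: tally which columns are filtered on and which table pairs are joined without a declared relationship, and flag candidate indexes or relationships the model doesn't have yet.

```python
# Hypothetical query-log structure; a sketch of mining query metadata
# (rather than the data itself) to suggest structural changes.

from collections import Counter

# parsed query log: which columns each query filtered on, and which
# table pairs it joined that have no declared relationship in the model
query_log = [
    {"filters": ["customer.region"], "undeclared_joins": [("customer", "complaint")]},
    {"filters": ["customer.region"], "undeclared_joins": [("customer", "complaint")]},
    {"filters": ["order.ship_date"], "undeclared_joins": []},
]

existing_indexes = {"order.ship_date"}
FILTER_THRESHOLD = 2   # "if consistent enough"
JOIN_THRESHOLD = 2

filter_counts = Counter(f for q in query_log for f in q["filters"])
join_counts   = Counter(j for q in query_log for j in q["undeclared_joins"])

new_indexes = [c for c, n in filter_counts.items()
               if n >= FILTER_THRESHOLD and c not in existing_indexes]
new_links   = [j for j, n in join_counts.items() if n >= JOIN_THRESHOLD]

print("Candidate indexes:", new_indexes)        # ['customer.region']
print("Candidate relationships:", new_links)    # [('customer', 'complaint')]
```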

This is an adaptable system, this is a dynamic system, this will eventually become a true Dynamic Data Warehouse.

Thoughts?
Dan Linstedt
DanL@DanLinstedt.com


Posted September 21, 2008 9:52 PM

Column based databases/appliances are making headway in the VLDB/VLDW world. There is no doubt that there are benefits to this approach, but there are also drawbacks. In this entry I explore some of the articles, links, facts, and figures - as related to my personal experience. Then I compare what different authors are saying against row-based MPP technologies to see what the differences and similarities are. This is by no means a complete research paper, just a peek into what the future may hold for RDBMS vendors and the new column based data stores. Of course, solid state disk and RAM/flash based data sets will change things again shortly. I'll also touch on the impacts to data modeling and what it may mean going forward.

Let's first set the table by defining what the terms mean:

1) for VLDB/VLDW I'm referring specifically to a 300TB and above system.
2) I'm also referring to LIVE data sets, where it isn't JUST 300TB sitting in a storage disk somewhere, but there's a significant amount of information being loaded AND queried at the same time, utilization is somewhere around 100TB "used/accessed/referenced/loaded" per week.
3) I'm also referring to a MIXED WORKLOAD system, meaning real-time transactions are streaming in, batch loads are occurring, and both tactical and strategic queries are taking place at the same time.

By MPP I mean Massively Parallel Processing capabilities, like DPF from DB2 UDB (IBM - running a shared-nothing architecture), and Teradata with independent nodes to scale out. I'm also referring to these traditional database systems as "row-based" database engines.

For Column Based "appliances" I am referring to Sybase IQ, Vertica, Dataupia, and others which provide column based data storage. NOTE: Netezza is NOT a column based store, rather it is a flat-wide appliance with hardware that figures out exactly what data set you need before hitting disk to retrieve it.

Quoting from a recent article on The Database Column (linked below):

"Thus, one might expect column-stores to perform similarly to a row-store with an index on every column without the corresponding negatives of creating many indices. In fact, this is a common argument we have often heard regarding column-stores and their expected performance relative to carefully designed row-stores -- both approaches provide good read performance, with the column store providing lower total cost of ownership (since you don't have to figure out what indexes to create anymore).

Though this argument sounds reasonable, it is completely incorrect. It is also dangerous since it might cause you to end up choosing a row-store when what you really need is a column-store."

http://www.databasecolumn.com/2008/07/debunking-a-myth-columnstores.html

If you're interested in furthering your knowledge on indexing versus column compression, the article: http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf is a very good source for examining the mathematics behind the tuple sets and joins.

Most of the articles I've located discuss indexing, and differences between indexing and column based tuple access. Unfortunately they don't tend to address the loading speeds and performance of getting the data "IN" to the database in the first place.
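To show why the two layouts behave so differently, here is a minimal sketch using toy data, with in-memory Python lists standing in for disk pages (this is an illustration of the access patterns, not any vendor's engine): a row store touches every column of every row to aggregate one column, while the column store reads just the one column it needs - but appending a new row touches every column list, which hints at the load-speed issue above.

```python
# Toy illustration of row-oriented vs column-oriented storage; lists stand in
# for disk pages. Not any vendor's engine -- just the access-pattern contrast.

row_store = [
    {"call_id": 1, "switch_id": "A", "duration_sec": 42, "dropped": 0},
    {"call_id": 2, "switch_id": "B", "duration_sec": 17, "dropped": 1},
    {"call_id": 3, "switch_id": "A", "duration_sec": 65, "dropped": 0},
]

column_store = {
    "call_id":      [1, 2, 3],
    "switch_id":    ["A", "B", "A"],
    "duration_sec": [42, 17, 65],
    "dropped":      [0, 1, 0],
}

# Analytical query: total duration. The row store scans whole rows (all columns);
# the column store scans only the one column it needs.
total_row    = sum(r["duration_sec"] for r in row_store)
total_column = sum(column_store["duration_sec"])
assert total_row == total_column == 124

# Insert: the row store appends one record; the column store must touch
# every column vector -- one reason trickle-feed loading is harder.
row_store.append({"call_id": 4, "switch_id": "C", "duration_sec": 5, "dropped": 0})
for col, value in zip(column_store, (4, "C", 5, 0)):
    column_store[col].append(value)
```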

Column based data stores bring benefits to the table:
* Rapid Query, less overhead (according to the math that I've read through)
* No need for PHYSICAL data modeling (as long as you don't need/want GOVERNANCE or MANAGEMENT in your data store).
* No "seemingly physical" limit to the number of columns PER TABLE.
* Automatic data compression/removal of duplicates on insert
* IF the grid / cloud computing works properly, then they should be able to scale out
* They appear to achieve anywhere from a 3:1 to a 7:1 compression ratio on the data slammed into the box (see the sketch after this list).
* Raw data can be loaded quickly (in native format) without "stopping to normalize, or assign sequence number surrogate keys"
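Here is a minimal sketch of the dictionary-encoding idea behind those ratios (toy values and a hypothetical column; the per-code byte size is an assumption for illustration): duplicate values collapse into a small dictionary plus narrow codes, which is roughly where the 3:1 to 7:1 figures come from on repetitive columns.

```python
# Toy dictionary encoding of one column -- the kind of duplicate removal a
# column store does on insert. Values and the resulting ratio are illustrative.

column = ["DENVER", "DENVER", "BOULDER", "DENVER", "BOULDER", "DENVER"] * 1000

dictionary = {v: code for code, v in enumerate(sorted(set(column)))}
encoded    = [dictionary[v] for v in column]          # small integer codes

raw_bytes     = sum(len(v) for v in column)           # ~6-7 bytes per raw value
encoded_bytes = len(encoded) * 1 + sum(len(v) for v in dictionary)  # assume 1-byte codes + dictionary
print(f"compression ratio ~ {raw_bytes / encoded_bytes:.1f}:1")
```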

Now let's take a look at some of the issues that they bring to the table (simple issues)
* Most column based databases have yet to solve massive load performance issues
* Most column based databases have to "STOP" the data stream to compress it, and assign it to the right column post-loading.
* In order to achieve high speed trickle feed (8,000 transactions per second or better) they need to have a significant RAM cache somewhere on one of the nodes to load the data.
* Splitting the data over multiple gridded nodes might take more work than originally thought
* Load balancing with spreading the data set across multiple gridded nodes might be an issue.
* Today, most column based data stores work extremely well on big iron SMP boxes, but struggle to take full advantage of Grid technologies and shared-nothing architectures.
* To handle "BATCH LOADS," most column based data stores use a "staging area" internally to load the batch data, then split it across and push it in to the column database (this may NOT be such a bad thing... we do this in MPP environments too!)
* Column based databases have "come and gone"; the only one that has stuck around over the years has been Sybase IQ, and for the first time in many years we are beginning to see announcements from the company that it is putting money back into R&D for this product.

Let's take a look at the physical nature of MPP:
Pros:
* Provides mechanisms for governance and management through physical data modeling
* Provides high-speed batch loads, and high-speed trickle feeds (real-time transactions)
* Provides balanced queries, and can easily handle mixed workload components (loading while querying, and both tactical and strategic queries at the same time).
* Has grown up, is based on mature proven technology.
* Scales out very easily, allows MASSIVE sets of data (because it's not locked in to a single SMP environment).

Cons:
* Usually requires good physical data modeling (normalization) in order to load-balance the data sets across the nodes (see the sketch after this list).
* Usually requires a staging area inside the MPP platform before re-distributing the data ** caveat: some MPP platforms have architected their bulk-loaders to overcome this problem.
* Usually requires JOIN INDEXES or some materialized table to assist with the Tuple Joins
* Usually requires column based compression to be turned on by the operator to achieve benefits.
* Requires enough nodes to "split the workload evenly"
* Requires all nodes to be running at the same speed in order to achieve maximum performance gains.
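A minimal sketch of why the distribution key (a physical modeling decision) matters in a shared-nothing MPP layout - the keys, node count, and hash choice are hypothetical: a low-cardinality key skews the data onto a handful of nodes, while a high-cardinality key spreads it evenly so the workload can actually be "split evenly".

```python
# Hypothetical keys/node count; a sketch of hash distribution across
# shared-nothing MPP nodes and the skew a poor distribution key causes.

from collections import Counter
import zlib

NODES = 8

def node_for(key: str) -> int:
    """Deterministic hash distribution of a row to a node by its key."""
    return zlib.crc32(key.encode()) % NODES

rows = [{"call_id": i, "state": ["CO", "CA", "TX"][i % 3]} for i in range(100_000)]

# Poor choice: distribute on a 3-value column -> at most 3 nodes get any data.
skewed = Counter(node_for(r["state"]) for r in rows)

# Better choice: distribute on the high-cardinality key -> even spread.
even = Counter(node_for(str(r["call_id"])) for r in rows)

print("rows per node, keyed on state:  ", dict(skewed))
print("rows per node, keyed on call_id:", dict(even))
```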

So these are just a FEW of the points made both for and against column based databases when comparing them to MPP designs. They both work well for their own purposes. Customers of mine continue to look for a "single solution to do it all"; however, today it just doesn't seem possible. This is why (I think) we continue to hear vendors like IBM and Teradata advertise "we partner with..." - fill in the blank with your favorite column based database...

However, watch the vendors closely - this market space is heating up, and over the next year I expect new technologies to be released from all vendors that will converge some functionality and blur the lines between row-based MPP RDBMS engines and column based stores on a grid.

Thoughts? What do you see in the market?
Dan Linstedt
DanL@DanLinstedt.com


Posted September 21, 2008 2:56 PM