Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW, and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in IT. I can't wait to hear from you in the comments on my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5 and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

July 2008 Archives

When I teach, I frequently discuss temporally based data sets - after all, that's a big piece of what data warehousing and BI are about - Data Over Time. But when examining the database engine's ability to "retrieve" specific data sets as a snapshot in time, there is a problem. There appears to be no "consistent" manner in which to retrieve these layers for use by the business. We are left to create physical dimensions and physical fact tables - aggregating our data up to higher levels (to shrink the amount of data) so that joins can execute cleanly and efficiently across information. So why, after all these years, haven't vendors properly implemented the temporal "PERIOD" construct proposed for the ANSI SQL3 standard?

Database Vendors, are you listening? There is a serious revenue gain to be had by implementing these feature sets...

First, take a look at the temporal standard proposed for ANSI SQL3, called "PERIOD." You can find some of the preliminary work here:

http://www.cs.arizona.edu/~rts/sql3.html

And a full research paper here:
http://www-db.deis.unibo.it/~fgrandi/papers/NGITS95.pdf

By the way, speaking of temporally based data models, the Data Vault model is built on these concepts (as most EDW systems are, through start/end, begin/end, or effective dates). But that's beside the point.

Now, according to the first reference I provided, Oracle 9i and 10g have this capability, as do IBM and a few other vendors. Mr. Snodgrass references slides for Oracle that discuss Oracle's implementation of the temporal components, called "Flashback," which once again (sadly) is for TRANSACTIONAL SYSTEMS ONLY!

Why am I so upset? Well, if database vendors were truly catering to the enterprise data warehouse, they would allow the database architect / data modeler to pre-determine a single field to use as the "PERIOD" field, and end-dates would no longer be needed. They would then implement these components in such a way as to not require "updates" or "deletes" of the information in order to make it accessible to a time-variant query.
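
To make this concrete, here is a rough sketch of what such a declaration might look like. This is purely hypothetical syntax (no vendor offers it today), and the table and column names are mine:

    -- Hypothetical DDL: the modeler designates one column as the temporal "PERIOD" driver.
    -- The engine tracks the time windows internally - no end-date column, no updates needed.
    CREATE TABLE customer_address
    (
        customer_id   INTEGER       NOT NULL,
        address_line  VARCHAR(255),
        load_dts      TIMESTAMP     NOT NULL,
        PRIMARY KEY (customer_id, load_dts)
    )
    PERIOD ON load_dts;   -- hypothetical clause

The engine, not the modeler, would then be responsible for resolving which row image applies at any point in time.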

"Flashback" from Oracle is extremely powerful; I'm interested in the way the feature set is implemented. IBM's Data Propagator Log entries also appear to be extremely powerful. The problem is again, they are transactional mechanisms that only trigger based on update and delete or DROP TABLE.

These vendors are completely missing the boat in data warehousing if they can't bring to the table an ENGINE-OPTIMIZED, engine-defined temporal notion for enterprise data warehousing models to use. After all these years, one would think that the "temporality of data warehouses" would have been noticed by the engineering staff of database vendors, that they would have sought out optimizations in the core engine to react to table definitions that are defined by time, and that they would have built a query engine capable of returning snapshots of data for a specific point in time.

The problem is that we (the implementation specialists and data architects) have had to resort to "work-arounds" for all these years. The work-around goes like this: put begin/end dates in your table; when a new image arrives, insert it and then update the old row (end-date it); then query with a BETWEEN to get what should be a simple AS OF command.
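
Here is that work-around spelled out in plain SQL, as a sketch (the table, columns, and key values are mine, purely for illustration):

    -- 1. End-date the current image for the key.
    UPDATE customer_address
    SET    end_dt = CURRENT_TIMESTAMP
    WHERE  customer_id = 1001
    AND    end_dt = TIMESTAMP '9999-12-31 00:00:00';

    -- 2. Insert the new image with an open-ended window.
    INSERT INTO customer_address (customer_id, address_line, begin_dt, end_dt)
    VALUES (1001, '12 New Street', CURRENT_TIMESTAMP, TIMESTAMP '9999-12-31 00:00:00');

    -- 3. "AS OF" simulated with BETWEEN - the part the engine should be doing for us.
    SELECT *
    FROM   customer_address
    WHERE  customer_id = 1001
    AND    TIMESTAMP '2008-07-01 00:00:00' BETWEEN begin_dt AND end_dt;

Every load cycle pays for that UPDATE, and every query pays for the BETWEEN.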

It's clear that Oracle understands what needs to be done, judging from this presentation: http://www.oracle.com/technology/deploy/availability/pdf/40109_Bednar_ppt.pdf But it's not clear that they know they should apply this technology to warehousing.

So, to summarize: in my view, RDBMS engines (SQL Server, Teradata, IBM DB2 UDB EEE, Oracle, Sybase ASE, and MySQL) are NOT temporally aware when it comes to data warehousing. The following features should have been implemented in 2004; I hope we can find them in 2009... Furthermore, these features should be DEFAULT BEHAVIOR when operating as a data warehouse.

* AS OF queries with built-in date-time stamping based on the insert date/time of the data set (see the sketch after this list)

* Automatic column compare - at the optimizer level, an option switch on the table definition that allows selected tables to "be run through a delta check before inserts occur," followed by option switches on each column that turn "delta on/off" for specific columns.

When a delta is spotted, the insert takes place automatically. We should no longer be FORCED to execute these comparisons outside the RDBMS engines.

The point here is: THE OPTIMIZER EXECUTES THIS AT THE ENGINE LEVEL.

* SQL delta commands (executed at the SQL level) - in other words, a SELECT that can "show me all rows between X and Y AS OF Z (max date)," or "show me the FIRST and LAST rows between X and Y as of Z," or "show me the [FIRST or LAST] row as of Z."

* In keeping with the ANSI SQL standard, these AS OF rows should be joinable on the same primary key, producing a "geological layer" of data AS OF a specific point in time.

* When a DELETE is issued, the option of a "DELTA COMPARE" across the remaining time windows should be available to the delete command... so that the engine automatically removes duplicated data (if present).

* When an UPDATE is issued, the query should be given an option, WITH HARD UPDATE, where the default is a "soft update." Hard updates execute against the exact row at that point in time. Soft updates actually issue an INSERT at the core level, producing a new delta for that point in time.
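
Pulling those wishes together, here is a rough sketch of the kind of SQL I have in mind. Again, this is hypothetical syntax that no engine supports today; the table and column names are mine:

    -- Hypothetical AS OF join: one "geological layer" of data across related tables.
    SELECT c.*, a.*
    FROM   customer c
    JOIN   customer_address a ON a.customer_id = c.customer_id
    AS OF  TIMESTAMP '2008-07-01 00:00:00';

    -- Hypothetical delta selection: first and last images in a window, as of a max date.
    SELECT FIRST ROW, LAST ROW
    FROM   customer_address
    BETWEEN TIMESTAMP '2008-01-01 00:00:00' AND TIMESTAMP '2008-06-30 00:00:00'
    AS OF   TIMESTAMP '2008-07-01 00:00:00';

    -- Hypothetical soft update (the default): the engine inserts a new image internally,
    -- producing a new delta for that point in time instead of overwriting history.
    UPDATE customer_address
    SET    address_line = '12 New Street'
    WHERE  customer_id = 1001;

    -- Hypothetical hard update: physically change the exact row at that point in time.
    UPDATE customer_address
    SET    address_line = '12 New Street'
    WHERE  customer_id = 1001
    WITH   HARD UPDATE;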

These queries, and this insert/update/delete behavior, should be built in and execute automatically. The designers and implementers should NOT have to think about this. By the way, COLUMN-BASED TECHNOLOGY APPLIANCES are in a PERFECT position to execute on this vision TODAY. The big RDBMS engines are too, but they don't seem to be nimble enough to get it done quickly (in the next 3 months!).

Keep in mind: the ENGINE CORES should be optimized to make use (high performance, parallelism, partitioning) of the TEMPORAL-based logic. I was hoping (against all odds) that the RDBMS vendors would have seen this years ago, but it just didn't happen (sorry folks).

There are TONS of good articles to be found on search engines: "temporal SQL" or "temporal database" will pull up many of the articles around the mathematics of temporal data. I still wonder why we are left to use a 1992-vintage "BETWEEN date_field_1 AND date_field_2," why we are left to compare our own row sets (outside the core engine), and why we are left to JOIN all of our temporally defined data ourselves (again, without core optimizations)...

It's a sad story to me, but the first "engine" to get there will break some serious performance barriers facing both ETL/ELT loading cycles and SQL queries for warehousing.

Cheers,
Dan Linstedt
DanL@RapidACE.com


Posted July 18, 2008 5:22 AM
Permalink | 1 Comment |

Any time we get back to secrets, we seem to fall right back into the category of standards, standardization, measurement, and enablement. The old saying is: "If you can't measure it, you can't monitor it, and if you can't monitor it, you don't know when it's broken and you can't optimize or fix it." Something like that, anyhow.

The common feedback from the general project implementation community is usually: "Why do I need to standardize? Why should I document? Won't it take more time to follow standards than to build rapidly?"

Well, yes and no. If you don't standardize (or your team doesn't standardize), then your project usually cannot be repeated successfully. Looking back at "what you did right/wrong" is still worthwhile and can be done, but without standards it doesn't provide any sort of "metrics enablement or measurement" ability against what was done, versus what was estimated, versus what "should" have been done.

Furthermore, documenting the process usually doesn't occur - and when it does happen, it's retrofitted to the project that has just been released to production. That can require a herculean effort to "reverse engineer" and understand what was built, just to build up the documentation.

One more side-effect of these efforts (JAD/RAD, typically) is a hit on flexibility, scalability, and maintainability. In other words, without standards, the project had better be a "one-off," never to be repeated in the near future. Reusability is extremely tough in an environment where standards have been tossed to the wind. IT usually ends up losing its agility.

Ok - enough of this; this is all project-based stuff. We learned all of it in PMP/PMI, Six Sigma, TQM, and so on... What about the templates? How are they useful? Can a project be successful using them? How can a project proceed without the "standards" being seen as a hindrance?

Well, there's always a slight hindrance to issuing and following defined procedures. There's always a hindrance to defining standard processes and procedures that are acceptable to the team and the organization. You just can't get away from that. So in this entry we will explore enabling tools and libraries of templates that will help you on your way.

ITIL (on the web at http://www.itlibrary.org/) has a plethora of templates, best practices, and standards for projects, including EDW projects. You need to order the books for these.

ISACA (http://www.isaca.org/) also has a large array of standards, templates, implementation paradigms, and guidance based on SEI/CMMI Level 5.

Or, of course, you could seek out the Data Vault methodology and approach, which has distilled the templates down to those specific to enterprise data warehousing and enterprise data integration. These templates have also been optimized for quick and easy build-outs of your projects. The Data Vault approach (when followed appropriately) helps you meet your goals of following lean initiatives, business process management, and cycle-time reduction.

A few of the different templates that you should have in your project folder include the following:
* Statement of Work
* Service Level Agreement (templated, so you can fit the topic in appropriately)
* Roles and Responsibilities sheet (numbered in accordance with the project plan)
* Organizational Breakdown sheet (numbered in accordance with the project plan)
* Data Breakdown Structure (numbered in accordance with the project plan)
* Project plan (numbered - you guessed it - technically - 1.1.1, 1.2.1, etc...)
* Process Breakdown Structure (numbered in accordance with the project plan)
* Risk Estimation, Mitigation, and responsibilities sheets
* AS-IS and TO-BE data flow documents, and process flow documents
* AS-IS and TO-BE system architecture documents
* Project release plan
* Bug tracking/Enhancement tracking plan

and so on. There are a number of other documents required to make a project successful, including the Statement of Work, possibly a letter of intent, goals and objectives, phased-approach definitions, and the definition of "success" criteria for development, test, and production releases - along with estimated person-hours, the level of experience on the team (mapped to the Roles and Responsibilities), and a training plan.

A good set of templates, coupled with a solid project approach, can be utilized on any project from 800 person-hours to 50,000 person-hours. It can be used repeatably, its effectiveness can be measured, and when a specific "template" is left out, the RISK of removing that process from the project plan can be accurately assessed.

To be successful in one's endeavors, one of the final ingredients is the ability and desire to teach the client to fish, rather than implementing what you have and walking out with your own methodology... But then again, no one does that to you, do they? :)

In the next Secrets entry, we'll get into what one of these numbering systems looks like and why it helps solve the pain in business today. We'll also address some of the issues plaguing IT and keeping it from being "agile" in the business environment.

As always, comments are welcome.

Hope this helps,
Dan Linstedt
DanL@rapidACE.com


Posted July 16, 2008 3:15 PM
Permalink | 1 Comment |