When I teach, I frequently discuss temporal based data sets - after all, that's a big piece of what data warehousing and BI is about - Data Over Time. But when examining the database engines ability to "retrieve" specific data sets as a snapshot in time, it seems there is a problem. There appears to be no "consistent" manner in which to retrieve these layers for use by the business. We are left to create physical dimensions and physical fact tables - aggregate our data up to higher levels (to shrink the amount of data) so that joins can execute cleanly and efficiently across information. So why then, after all these years haven't vendors properly implemented the ANSI-SQL-92 standard of "PERIOD"?
Database Vendors, are you listening? There is a serious revenue gain to be had by implementing these feature sets...
First, take a look at the standard defined for ANSI-SQL-92 called "PERIOD" You can find some of the preliminary work here:
And a full research paper here:
By the way, speaking of temporally based Data Models, the Data Vault model is based on usable concepts (which most EDW systems are, by using start/end, or begin/end, or effective dates). But that's beside the point.
Now according to the first reference I provided, Oracle 9i and 10g have this capability, as does IBM and a few other vendors. Mr. Snodgrass references slides for Oracle that discuss Oracle's implementation of the temporal components called "Flashback" which once-again (sadly) is for TRANSACTIONAL SYSTEMS ONLY!
Why am I so upset? Well, if database vendors were truly catering to the enterprise data warehouse, they would allow the database architect / data modeler to pre-determine a single field to use as the "PERIOD" field, and end-dates would no longer be needed. They would then implement these components in such a way as to not require "updates" or "deletes" of the information in order to make it accessible to a time-variant query.
"Flashback" from Oracle is extremely powerful; I'm interested in the way the feature set is implemented. IBM's Data Propagator Log entries also appear to be extremely powerful. The problem is again, they are transactional mechanisms that only trigger based on update and delete or DROP TABLE.
These vendors are completely missing the boat in Data Warehousing if they can't bring to the table an ENGINE OPTIMIZED/engine defined temporal notion for enterprise data warehousing models to use. After all these years, one would think that "temporality of data warehouses" would have been noticed by the engineering staff of database vendors, and that they would have sought out optimizations in the core engine to react to table definitions that are defined by time; and that they would have built a query responsive engine to returning snapshots of data for a specific point in time.
The problem is: we (the implementation specialists and data architects) have had to result to "work-arounds" for all these years. Work-arounds are: put your begin/end dates in your table, when a new image arrives, insert it - then update the old one (end-date it), followed by a query that executes a BETWEEN to get what should be an easy AS OF command.
It's clear that Oracle understands what needs to be done, with this presentation: http://www.oracle.com/technology/deploy/availability/pdf/40109_Bednar_ppt.pdf But it's not clear that they know they should apply this technology to warehousing.
So to summarize: in my view, RDBMS engines (SQLServer, Teradata, IBM DB2 UDB EEE, Oracle, Sybase ASE, and MySQL) are NOT temporally aware when it comes to data warehousing. The following features should have been implemented in 2004, I hope we can find these features in 2009... Furthermore, these features should be DEFAULT BEHAVIOUR when operating as a data warehouse.
* AS OF queries with built-in date-time stamping based on insert date/time of the data set
* Automatic column compare - at the optimizer levels, option switch for table definitions that allow some tables to "be run through a delta before inserts occur", followed by option switches on each column that allow "delta on/off" for specific columns.
When a delta is spotted, the insert takes place automatically. We should no longer be FORCED to execute these comparisons outside the RDBMS engines.
The point here is THE OPTIMIZER EXECUTES THIS AT ENGINE LEVEL
* SQL Delta commands (to be executed at the SQL level) - in other words a SELECT to "show me all rows between X and Y AS OF Z (max date). OR show me the FIRST and LAST row between X and Y as of Z, or show me the [FIRST or LAST] row as of Z.
* In keeping with the ANSI-SQL Standard, these rows AS OF should be able to be joined together by the same primary key, producing a "geological layer of data" AS OF a specific point in time.
* When a DELETE is issued, the option of "DELTA COMPARE" across remaining time windows should be available to the delete command... so that the engine automatically removes duplicated data (if there).
* When an UPDATE is issued, the query should be given an option: WITH HARD UPDATE, where the default is a "soft-update". Hard updates execute against the exact row at that point in time. Soft-updates, actually issue an INSERT at the core-level, producing a new delta for that point in time.
These queries, and this insert/update/delete behavior should be built-in, automatic execution. The designers and implementers should NOT have to think about this. By the way, COLUMN BASED TECHNOLOGY APPLIANCES are in a PERFECT position to execute on this vision TODAY. Big RDBMS engines are too, but they don't seem to be nimble enough to get it done quickly (in the next 3 months!)
Keep in mind: that the ENGINE CORES should be optimized to make use (high performance, parallelism, partitioning) of the TEMPORAL based logic. I was hoping (against all odds) that the RDBMS vendors would have seen this years ago, but it just didn't happen (sorry folks).
There are TONS of good articles on search engines: "temporal SQL", or "temporal database" will pull many of the articles around the mathematics of temporal data. I still wonder why we are left to use a 1992 standard "BETWEEN date_field_1 and date_field_2", and why we are left to compare our own row-sets (outside the core engine), and why we are left to JOIN all of our temporally defined data ourselves (again without core optimizations) ourselves...
It's a sad story to me, but the first "engine" to get here will break some serious performance barriers facing both ETL / ELT loading cycles, and SQL queries for warehousing.
Posted July 18, 2008 5:22 AM
Permalink | 1 Comment |