
This entry is a candid look (opinionated, mind you) at what I see as the future of transformations themselves.  We will cross several subjects in this entry, as it is meant to be a look at where transformations currently happen, where they need to happen, and what's actually happening in the marketplace.

ETL, or Extract, Transform and Load, has been around a long, long time.  ELT (sometimes referred to as in-database, or push-down, processing) is new to the ETL vendor world, but it is a very old concept.  On the other hand, the RDBMS vendors have heard the cry and have responded by continually adding new features and functionality for in-database transformation logic.
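To make the distinction concrete, here is a minimal sketch in Python (using SQLite and made-up table names purely for illustration) of the same business rule done ETL-style in the tool's engine versus ELT-style as pushed-down SQL:

    import sqlite3  # any RDBMS would do; SQLite keeps the sketch self-contained

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_sales (customer_id TEXT, amount REAL)")
    conn.execute("CREATE TABLE fact_sales (customer_id TEXT, amount_with_tax REAL)")
    conn.execute("INSERT INTO stg_sales VALUES ('C-101', 100.0)")

    # ETL style: rows come OUT of the database, the rule runs in the tool's engine.
    rows = conn.execute("SELECT customer_id, amount FROM stg_sales").fetchall()
    transformed = [(cid, round(amt * 1.08, 2)) for cid, amt in rows]
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", transformed)

    # ELT / push-down style: the same rule expressed as set-based SQL.
    # The data never leaves the database; the engine does the work.
    conn.execute(
        """INSERT INTO fact_sales (customer_id, amount_with_tax)
           SELECT customer_id, ROUND(amount * 1.08, 2) FROM stg_sales"""
    )
    conn.commit()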

Now, enter real-time.  OK, EAI (enterprise application integration) and message queuing have both been around a long time too, and they are also growing and changing.  Then along came BPM (business process management), which morphed into BPEL (business process execution language) and BPW (business process workflows).  All of these engage real-time flows and manage transactions at the user level.  Oh yes, I almost forgot: the middle-tier technology known as EII (enterprise information integration), which never really caught on, but is valuable nonetheless when embedded in other technologies like web services and SOA.

Down to brass tacks...

When we look at what's around the corner we have to ask ourselves the following questions:

1. What do compliance and auditability mean to our transformation efforts?

2. What really and truly is so difficult about transforming the data?

3. What do some of these complex business rules look like in transformation logic?

4. WHY do we fundamentally rely on machines and programmatic (static) rules to alter data sets?  In other words, why do we "write" rules into SQL or transformation logic to make data "usable" by the business?

5. Just what is considered "usable data" anyway?

OK, enough of the esoteric stuff - I just thought we needed to ask these questions.  Of course, if you have concrete answers, I'd love to hear them in your replies to this blog entry.  Now, on to more serious stuff: where is transformation going to happen?  Especially given ever-growing data sets and ever-decreasing latency of arrival...

I would argue that ETL is still partially viable; however, there comes a time when in-stream transformation simply falls down and is no longer feasible to execute - ESPECIALLY when loading data from the source systems INTO the EDW.  The exception to this rule is when the application is built directly on top of the business process rules application - or the business workflow management system.  THEN, as the data is entered and submitted to the application, the data is "edited" or transformed before it is placed on the transaction stream.

Likewise, this might occur over web services and streaming services for data sets.
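For illustration, here's a rough sketch of that "edit at capture" idea (the rule set is hypothetical, and an in-memory queue stands in for a real message bus): the record is conformed by the business-rule layer before it ever reaches the transaction stream.

    import json, queue

    transaction_stream = queue.Queue()  # stand-in for a real message queue

    def edit_at_capture(record: dict) -> dict:
        """Apply business edits as the user submits the record."""
        if not record.get("customer_id"):
            raise ValueError("customer_id is required")
        record["country"] = record.get("country", "").strip().upper() or "UNKNOWN"
        return record

    def submit(record: dict) -> None:
        # The edit happens BEFORE the record reaches the stream,
        # so the transaction system captures already-conformed data.
        transaction_stream.put(json.dumps(edit_at_capture(record)))

    submit({"customer_id": "C-101", "country": " us "})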

Now this raises the question again: WHAT exactly is auditable data?  WHEN is it compliant or auditable, even for the operational systems?  Is it when the user enters the data on the screen?  Is it when it's first captured by the transaction system?

Ok - back to brass tacks.

In order to handle volumes of data in the EDW (flowing in and out) and decrease loading cycle times, it is absolutely imperative that the business rules or transformation logic be moved downstream of the EDW - that it *NOT* be placed upstream between the source system and the staging area or EDW (as generally architected).  Placing it upstream incurs significant re-engineering costs and creates an ETL bottleneck with larger data sets.

Some of this bottleneck can be solved with larger hardware or 64-bit systems.  HOWEVER, that's not enough anymore.
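Rather than just buying bigger iron, the alternative is to stop transforming on the way in.  A minimal sketch of what an untransformed, auditable load might look like (table and column names are mine, loosely in the spirit of carrying a load date and record source):

    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE edw_sales_raw
           (customer_id TEXT, amount REAL, load_dts TEXT, record_source TEXT)"""
    )

    def load_raw(rows, source_name):
        """EL only: land source rows in the EDW untouched, plus audit metadata."""
        load_dts = datetime.now(timezone.utc).isoformat()
        conn.executemany(
            "INSERT INTO edw_sales_raw VALUES (?, ?, ?, ?)",
            [(cid, amt, load_dts, source_name) for cid, amt in rows],
        )
        conn.commit()

    # No business rule touches amount or customer_id on the way in,
    # so the EDW remains a faithful, auditable record of the source.
    load_raw([("C-101", 19.99), ("C-102", 250.00)], "CRM_EXTRACT")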

So what are you saying?

By moving the transformations downstream of the EDW (between the EDW and the data marts), we have now created an architectural OPTION.  We can CHOOSE to use ETL or ELT and leverage the RDBMS for transformation - especially if both the EDW and the data marts reside on the same database instance.  This allows us to apply the technology in the right place at the right time.  Furthermore, it makes the data in the EDW more "compliant and auditable" because it is not subject to change before loading.  (See http://www.DataVaultInstitute.com for more information.)
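When the EDW and the marts share an instance, that downstream transformation can collapse into a single set-based statement.  Again, a sketch with hypothetical table names:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # pretend the EDW and the mart share this instance
    conn.execute("CREATE TABLE edw_sales_raw (customer_id TEXT, amount REAL, load_dts TEXT)")
    conn.execute("CREATE TABLE mart_sales (customer_id TEXT, net_amount REAL, load_dts TEXT)")
    conn.execute("INSERT INTO edw_sales_raw VALUES ('C-101', 100.0, '2009-02-05')")

    # The business rule lives downstream of the EDW: raw, auditable data in,
    # conformed mart data out -- one set-based statement, no re-extract needed.
    conn.execute(
        """INSERT INTO mart_sales (customer_id, net_amount, load_dts)
           SELECT customer_id, ROUND(amount * 0.92, 2), load_dts
           FROM edw_sales_raw"""
    )
    conn.commit()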

Alright - the future stuff...  So what do we need from ETL "vendors" in the future?

* ETL vendors must support both ETL and ELT (in-database)

* Fully configurable temporary tables, block-style processing, and in-database control - all from an ETL metadata and visual GUI perspective

* FULL 100% push-down must be supported, and if "EL" needs to be added to the chain, so be it - the ETL tool should automatically set that up and do its best to provide 100% push-down where necessary (see the push-down sketch after this list)

* For advanced developers, the ability to control HOW the push-down will be executed, with full overrides and step-by-step debugging IN THE DATABASE

* Many more, which I don't have time to post now...  these are the major ones.
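To be clear about what I mean by push-down generation: conceptually, the tool holds mapping metadata and decides at run time whether to execute it row-by-row in its own engine or render it to SQL for the database.  A toy sketch (the mapping structure is invented, not any vendor's format):

    # The same mapping metadata could drive either row-by-row processing in
    # the ETL engine or a rendered, fully pushed-down SQL statement.
    mapping = {
        "source": "stg_sales",
        "target": "fact_sales",
        "columns": {
            "customer_id": "customer_id",
            "amount_with_tax": "ROUND(amount * 1.08, 2)",
        },
    }

    def render_pushdown_sql(m: dict) -> str:
        """Render the mapping as a single in-database INSERT ... SELECT."""
        cols = ", ".join(m["columns"])
        exprs = ", ".join(m["columns"].values())
        return f"INSERT INTO {m['target']} ({cols}) SELECT {exprs} FROM {m['source']}"

    print(render_pushdown_sql(mapping))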

What does this mean to the Database Vendors?

* Ever-increasing support for "faster API calls"

* More parallel API calls

* Dedicated "step-by-step" debugging interfaces

* A whole lot more in-core coded transformations and complex SQL statements

* MORE BATCH-oriented SQL statements, where a "batch processing size" can be set and then the statements manage themselves (see the batching sketch after this list)

* MORE interconnection (high speed) with remote database instances.

* MORE metadata, inclusive of versioning of every single piece of executing code

* Versioning of the TABLE structures and INDEXES

* On-the-fly indexing

* Parallel index builds DURING high speed load or batch operations

* NO MORE "TABLE COPY SWITCHING" for high-volume and high-availability.
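On the batch-oriented SQL point from the list above: here is a sketch of the chunking that tools and developers have to hand-roll today, which is exactly what I'd like the database to manage itself once a "batch processing size" is set (SQLite and the table names are illustrative only):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edw_sales_raw (customer_id TEXT, processed INTEGER DEFAULT 0)")
    conn.executemany(
        "INSERT INTO edw_sales_raw (customer_id) VALUES (?)",
        [("C-%d" % i,) for i in range(25_000)],
    )

    BATCH_SIZE = 10_000  # the "batch processing size" the engine would ideally honor itself

    # Today the tool (or the developer) has to chunk the work by hand; the
    # wish is for the database to manage this internally once the size is set.
    while True:
        cur = conn.execute(
            """UPDATE edw_sales_raw SET processed = 1
               WHERE rowid IN (SELECT rowid FROM edw_sales_raw
                               WHERE processed = 0 LIMIT ?)""",
            (BATCH_SIZE,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break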

 

Please add some of your own thoughts to this party, I'd like to hear what you think.

As always,

Dan Linstedt
DanL@RapidACE.com  - check out a 3D Data Model Visualizer Demo!


Posted February 5, 2009 7:22 AM

1 Comment

I think automation of design will become more important. The upstream ETL becomes much simpler and can be made faster by automating the way the ETL tool maps and builds the ETL jobs.

As far as downstream goes, where the transformation occurs, the ETL vendors could make life easier with better management of rules and of the manually maintained tables that support those rules. You don't want to have to pay for a full-blown business rules engine or MDM solution - more in the way of a light version of each that adds some control and flexibility to complex transformations for ETL projects.


    