This entry is a candid (and, mind you, opinionated) look at what I see as the future of transformations themselves. We will cross several subjects in this entry, as it is meant to be a look at where transformations currently happen, where they need to happen, and what's actually happening in the marketplace.
ETL, or Extract, Transform and Load, has been around a long, long time. ELT (sometimes referred to as in-database or push-down processing) is new to the ETL vendor world, but it is a very old concept. For their part, RDBMS vendors have heard the cry and responded by continually adding new features and functionality for in-database transformation logic.
Now, enter real-time. OK, EAI (enterprise application integration) and message queuing have both been around a long time too, and they are also growing and changing. Then along came BPM (business process management), which morphed into BPEL (business process execution language) and BPW (business process workflows). All of these exist to engage real-time flows and manage transactions at the user level. Oh yes, I almost forgot: the middle-tier technology known as EII (enterprise information integration), which never really caught on, but is valuable nonetheless when embedded in other technologies like web services and SOA.
Down to brass tacks...
When we look at what's around the corner we have to ask ourselves the following questions:
1. What do compliance and auditability mean to our transformation efforts?
2. What really and truly is so difficult about transforming the data?
3. What do some of these complex business rules look like in transformation logic?
4. WHY do we fundamentally rely on machines and programmatic (static) rules to alter data sets? In other words, why do we "write" rules into SQL or transformation logic to make data "usable" by the business?
5. Just what is considered "usable data" anyway?
Ok, enough of the esoteric stuff - I just thought we needed to ask these questions. Of course, if you have concrete answers, I'd love to hear them in your replies to this blog entry. Now, on to more serious stuff: where is transformation going to happen? Especially given ever-growing data sets and ever-decreasing latency of arrival...
I would argue that ETL is still partially viable; however, there comes a time when in-stream transformation simply falls down and is no longer feasible to execute, ESPECIALLY when loading data from the source systems IN to the EDW. The exception to this rule is when the application is built directly on top of the business process rules application, or the business workflow management system. THEN, as the data is entered and submitted to the application, the data is "edited" or transformed before being placed on the transaction stream.
Likewise, this might occur over web services and streaming services for data sets.
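To make the "edit at entry" idea concrete, here is a minimal sketch, assuming a simple rules-at-submission setup; every name here (the fields, the rules, the handler) is hypothetical and purely illustrative:

```python
# Hypothetical sketch: apply business rules at data entry, BEFORE the
# record ever reaches the transaction stream. All names are illustrative.

RULES = [
    ("customer_name", lambda v: v.strip().title()),  # standardize casing
    ("country_code",  lambda v: v.strip().upper()),  # normalize codes
]

def edit_on_entry(record: dict) -> dict:
    """Transform a submitted record per business rules; reject incomplete input."""
    cleaned = dict(record)
    for field, rule in RULES:
        if not cleaned.get(field):
            raise ValueError(f"missing required field: {field}")
        cleaned[field] = rule(cleaned[field])
    return cleaned  # now ready to be placed on the transaction stream

print(edit_on_entry({"customer_name": " ada lovelace ", "country_code": "gb"}))
```

The point is only where the transformation lives: at the moment of capture, so what lands on the transaction stream is already "edited."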
Now this raises the question again: WHAT exactly is auditable data? WHEN is it compliant or auditable, even for the operational systems? Is it when the user enters the data on the screen? Is it when it's first captured by the transaction system?
Ok - back to brass tacks.
In order to handle the volumes of data flowing in and out of the EDW, and to decrease loading cycle times, it is absolutely imperative that the business rules or transformation logic be moved downstream of the EDW, and that it *NOT* be placed upstream between the source system and the staging area or EDW (as generally architected). Placing it upstream incurs significant re-engineering costs and creates an ETL bottleneck with larger data sets.
Some of this bottleneck can be solved with larger hardware or 64-bit systems; HOWEVER, that's not enough anymore.
So what are you saying?
By moving the transformations downstream of the EDW (between the EDW and the data marts), we have created an architectural OPTION. We can now CHOOSE between ETL and ELT and leverage the RDBMS for transformation, especially if both the EDW and the data marts reside on the same database instance. This allows us to apply the technology in the right place at the right time. Furthermore, it makes the data in the EDW more "compliant and auditable" because it is not subject to change before loading. (See http://www.DataVaultInstitute.com for more information.)
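As a toy illustration of that ETL-versus-ELT choice, here is a sketch using SQLite as a stand-in for an EDW and data mart sharing one instance; all table and column names are invented for the example:

```python
import sqlite3

# Pretend the EDW and the data marts live on the same database instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edw_sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO edw_sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# ETL style: pull the rows OUT of the database, transform them in the
# tool's own engine, then load the result into the mart.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM edw_sales"):
    totals[region] = totals.get(region, 0.0) + amount
conn.execute("CREATE TABLE mart_etl (region TEXT, total REAL)")
conn.executemany("INSERT INTO mart_etl VALUES (?, ?)", totals.items())

# ELT style: push the transformation DOWN into the database as set-based
# SQL, so the data never leaves the instance.
conn.execute("""CREATE TABLE mart_elt AS
                SELECT region, SUM(amount) AS total
                FROM edw_sales GROUP BY region""")

print(conn.execute("SELECT * FROM mart_elt ORDER BY region").fetchall())
```

Both paths produce the same mart; the architectural option is about *where* the work runs, and when the EDW and marts share an instance, the ELT path avoids moving the data at all.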
Alright - the future stuff... so what do we need from ETL "vendors" in the future?
* ETL vendors must support both ETL and ELT (in-database)
* Fully configurable temporary tables, block-style processing, and in-database control - all from an ETL metadata and visual GUI perspective
* FULL 100% push-down must be supported; if "EL" steps need to be added to the chain, so be it - the ETL tool should set that up automatically and do its best to provide 100% push-down wherever possible.
* For advanced developers, the ability to control HOW the push-down will be executed, with full overrides and step-by-step debugging IN THE DATABASE.
* Many more, which I don't have time to post now... these are the major ones.
What does this mean to the Database Vendors?
* Ever increasing support of "faster API calls"
* More parallel API calls
* dedicated "step-by-step" debugging interfaces
* a whole lot more in-core coded transformations and complex SQL statements
* MORE BATCH-oriented SQL statements, where a "batch processing size" can be set and the statements then manage themselves
* MORE interconnection (high speed) with remote database instances.
* MORE metadata
  * inclusive of versioning of every single piece of executing code
  * versioning of the TABLE structures and INDEXES
* on-the-fly indexing
* Parallel index builds DURING high speed load or batch operations
* NO MORE "TABLE COPY SWITCHING" for high-volume and high-availability.
Please add some of your own thoughts to this party, I'd like to hear what you think.
DanL@RapidACE.com - check out a 3D Data Model Visualizer Demo!
Posted February 5, 2009 7:22 AM