Sometimes I wonder just a bit - why we have so many different mechanisms to solve the same problem. Oh yes, but what exactly is the problem to begin with?
It all boils down to this: moving data from point A to point B.
Yes, it really is that simple!
The options (byproducts of having the data available)? Integrating, changing, recording, merging, matching, and cleansing are all by-products. We have it in-stream, in-transit, in-route - now we need to do something with it.
Is ETL dead? Yes, I believe so - in it's current form it won't last much longer as a paradigm. It needs to morph if it will survive, change into something more "becoming" of the integration age we are currently feeling (which by the way, the wave or movement started over 4 years ago).
ETL = extract transform and load.
ELT = extract, load to the RDBMS then transform
EAI = Enterprise application integration
EII = enterprise Information Integration
EIEIO = Old MacDonald had a farm... (sorry, did it again).
As I mention in the data modeling blog, the paradigm is shifting, the need to move data from "all sources" into an integrated business model that houses both current and historical views of consistent data is being sharply focused by business acumen.
So what? That means EL, doesn't it? Yep - from a data movement perspective, extract and load - basically detecting existence of new/changed data, and doing a delta comparison is all that's needed. Icing on the cake is having a visual no-code, drag and drop development paradigm that handles and manages metadata along the way.
Let's talk about the Transformation section for a minute, the Big "T". It's a bottleneck, in fact - it's THE bottleneck in most VLDW / VLDB and very large data integration systems. Over the years it has been more efficient to transform the data in-stream, because the RDBMS engines lacked the scalability, and sometimes the functionality to handle all the complex transformations that are necessary.
Today, all that has changed. RDBMS engines now contain highly complex optimizers, incredible business transformations on the SQL level including (but not limited to) object level transformation, in-database data mining engines, in-database data quality/cleansing/profiling plug-ins, statistical algorithms, hierarchical and drill-down functionality, and on and on...
Along with the paradigm shift for bringing all the data to a single statement-of-fact (across the enterprise), the nature of convergence, and consolidation are now saying: it's more efficient to perform any type of "transformation" within the bounds of the RDBMS engine itself. After all, they have begun growing up and offering Multi-Terabyte solutions, some hundred+ terabyte solutions have been around for a long long time.
If we shift the "T" bottleneck from the ETL into ELT, we have a very strong case for scalability - the resulting engine leveraging the best-of-breed, latest RDBMS capabilities, and taking advantage of every ounce of scalability and parallelism (and load-balancing) that the RDBMS can muster up.
So, ETL is "dead". There, I said it. ETL vendors MUST re-tool towards EL with a focus on "T", or better yet - why not make the transition EASY, make the tool "ETL-T". Give the designers the option to convert to "EL-T" where it makes sense. After all, we have sunk-costs and development time into intensive ETL routines, let them stand for a while and earn back their keep.
Now what? What about ELT, EAI, and EII?
Ok, EAI is a 10+ year old paradigm that focuses on integrating applications. There are vendors like Tibco that "run wall-street". EAI is going strong as long as there are applications to integrate, but does EAI overlap into the world of EL? Let's first define EAI: Every time a change happens in an application "plugged in" to the EAI tool, it pushes the change to the message bus, and looks for business rules and other "listeners" that need to be notified of the change.
These business rules can consist of manual intervention, is this really necessary? or is EAI just another "band-aid" for overcoming source system capture problems and integration problems that exist in (for instance) mainframe interchange protocols? I would argue that EAI is more than that, because it focuses on the business processes of the data - goes above and beyond simple integration and begins to look at HOW the data becomes information, and where/when/why it should be utilized.
I will say this though: EAI as a paradigm is also dead. What? How can I possibly say this? This is blasphemous. In my opinion, EAI is a technology who's time has come - who wants to "push" all this traffic onto centralized busses, especially if there's no-one listening? What I mean is, simply pushing data out onto a bus or into a queuing system just because we have a change in the application doesn't mean it's vibrant and desired data that needs to be absorbed down-stream. Off-topic: how many times during the day do you hear noises that you "tune-out"? What if that noise were never made in the first place?
Besides that, EAI focuses only on ONE aspect of the business: the APPLICATION making the data change. There are many more places that data changes within a corporate environment, and some of them are not application based (take unstructured data for instance)...
What if I don't have applications to integrate? What if my picture is bigger than that? Say, web-services or SOA? Ok, EAI vendors must also adapt to meet the needs of SOA - so they have a paradigm shift to undertake if they wish to survive. Even though the paradigm has outlived it's usefulness, as long as there are new applications to "install locally" within a company, there will exist a need for EAI.
There is a shift afoot: "Applications are being out-sourced" says CIO magazine, DMReview, and couple other sources. Software and app providers are taking up the SOA provisions, and EAI (like it's predecessor EDI (electronic data interchange) will be lost in the fray).
However, if the EAI vendors take heed, and re-tool they have TREMENDOUS business value proposition already built into their "routing and business flow management" side of the house, so why lose all that investment? They could (if they wanted) take the SOA management of components and integration by storm, there is one such vendor I'm thinking of right now who could do this in a flash...
That leaves EII: EII picks up where ETL and EAI left off, it's a Pull on demand solution, which I evaluated 3 years ago (privately). I saw the EII paradigm as a niche player, and still do - it has a VERY limited life-span unless it too encompasses some additional technology and re-tool. EII is wonderful to get data at its' source, on-demand. It could very well fill some of the needs of an SOA if desired (and in some cases does very nicely). EII too, has some nifty capabilities to handle business metadata in it's form of meta-modeling. I've never seen such gracefulness in dealing with multiple modeling formats. Once I got over the horrendous learning curve (of one particular tool), it began to make sense.
Some of EII's problems are: it can't handle massive volumes of data, it performs transformations on the fly (row/by/row column by column). The tools that implement EII usually rely on a middle tier meta-model, some of them have the capability of defining business models and allow business users to actually CHANGE the model without IT intervention - nifty trick. However, the transformation is again, squarely in the way of scalability of these engines, as is Write-back capability.
Ouch, write-back? If I setup an EII query to source from a web-service, a stock ticker, and a data warehouse, how on earth can write-back be enforced? let alone a two phase commit? The rules to determine write back must be extremely complex, and again - volume and complexity are directly juxtaposed to each other, and inversely proportional to performance over a given constant of time.
So we're back to square one (whatever that was). I'm a bit dis-illusioned by the vendor hype and what is truly delivered, and I haven't even begun to discuss the XML, Web-services and their transformation ability. I will say this though: There is a LOT of good technology buried in each of these solutions which has been developed, now if each of these vendors could focus on solving some of these problems:
1. Move the transformation logic into the RDBMS engines (they can do it, they really can!) Maybe add some "RDBMS tuning wizards" to the tool set, metrics gathering and collection would be nice...
2. Move the EL logic into loaders and connectors with high-throughput and CLEAR network channels (to move big-data, fast, and in parallel). Maybe even offering COMPRESSED network traffic as a free add-on? Adding CDC on the SOURCE as a free add-on?
3. Leverage RDBMS BEST features and plug-ins, like built-in data mining, built-in-data quality.
4. Focus their tool set on ease of use (from a business user perspective, developing business process flows).
5. Focus their tool set on "staging data" in an RDBMS of choice, so that write-back becomes a reality (with the caveat, that not all sources can be written back to, only our "copy" or snapshot of a stock ticker feed can actually be changed).
6. Focus their tool set on metadata, BUSINESS METADATA, and how it works with the business process flows, the management, maintenance and reporting of that metadata.
7. Focus their tool set on managing, setting up, and maintaining web-services and the security around them.
I think there would be a winning paradigm, maybe it would be called:
E-L-A-I-I-T-I-BMM (sounds like a foreign language).
Extract
Load
Application
Information
Transformation (in-RDBMS)
Integrate
Business Metadata Manager
(now breathe deeply)
You never know, it might just show up on EggHead's shelves! (just kidding, horrible acronym, but it describes the concepts). Of course if Disk Vendors have their way, they will take over the EL portion soon, and pair it with the compliance packages.
These are just ramblings on the state of these technology areas. I beg your pardon, this is not meant to be an attack, just a very opinionated blog as to how I see this industry evolving. Invest in BMM today! Comments?