I've been asked about the pros and cons of ETL push-pull, so I thought I'd generalize the issue a little more into the pros and cons of Push-Pull technology in general. I'm including EII and EAI in this posting. It's not that push or pull is necessarily bad by itself; it's more about using the right notion for the right data access at the right time.
Push-Pull, a direction I find myself pulled in, many different times during the day. (Seriously folks... :)
Ok - down to brass tacks. The nature of PUSH technology is basically the realm of EAI and Message Queuing. In this realm we deal with the publish/subscribe model, or broadcasting a message to anyone listening.
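For the curious, here is a minimal sketch of what publish/subscribe looks like in code. It is a toy in-memory broker, not any particular EAI or message queuing product; the `Broker` class, the topic names, and the message fields are all made up for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Message = Dict[str, object]
Handler = Callable[[Message], None]


class Broker:
    """Toy in-memory publish/subscribe broker (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # topic name -> handlers that asked to hear about that topic
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Message) -> int:
        """Push the message to every current subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
received: List[Message] = []

# Two independent listeners on the same (assumed) "trades" topic.
broker.subscribe("trades", received.append)
broker.subscribe("trades", lambda m: received.append({"audit": m["symbol"]}))

delivered = broker.publish("trades", {"symbol": "ABC", "qty": 100})
ignored = broker.publish("weather", {"temp": 70})  # nobody is listening on this topic
```

Notice that the publisher never asks who is listening. It pushes the message and moves on, and a message with no subscribers simply goes nowhere.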
Really "easy" technology until you get to the engineering underneath. The real work is deciding WHICH transactions are important, and WHICH are not. Then there's the decision on how often, how fast, and how to write the drivers to "plug" in to each of the applications, or legacy apps that service transactions to begin with. Ok - enough of the engineering talk, let's get back to the business aspects.
Push technology is GREAT when you want to distribute transactions as-they-happen. Stock tickers and other types of financial institution transactions are prime candidates for push technology. How about disasters and notification? Again, important.
What about the different components?
For EAI: Push technology is its lifeblood. This is what it's built on: making the applications "talk" when the transactions are available.
For ETL/ELT: Not so important, even in an "Active Data Warehouse" it's not so important - ok, the PUSH of the transaction is important, but the ETL component? It gets in the way of getting the data to the warehouse at the right time for analysis.
Now wait just a minute - Aren't ETL/ELT engines getting stronger and faster? Yes - they are. But they still aren't "architected" for real-time dynamic data integration. The world's BEST ETL/ELT engine will focus on transforming as many transactions as possible (in batch) in the shortest amount of time. That's their strength - and they should STICK to it (Stick to your ticket Harry, very important that you STICK to your ticket... - Harry Potter) We could learn a few things from this line; no really!
ETL/ELT is GREAT at PULL technology - go get the data on a scheduled timing interval, not just the data - but ALL the data, en masse. Bring me everything that meets criteria X, across ALL disparate systems, then integrate it all en masse (batch style) - and do it as fast as possible so that I can replicate the system with new information, and transformed information.
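To make the batch PULL idea concrete, here is a rough sketch of one scheduled run: go to every source, grab everything that meets criteria X, transform it en masse, and land the result. The `batch_pull` function, the source names, and the row layout are assumptions for illustration; a real ETL/ELT engine would add scheduling, parallelism, and partitioning on top.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def batch_pull(
    sources: Dict[str, List[Row]],
    criteria: Callable[[Row], bool],
    transform: Callable[[Row], Row],
) -> List[Row]:
    """One batch run: extract matching rows from ALL sources, transform, and land them."""
    landed: List[Row] = []
    for source_name, rows in sources.items():
        for row in rows:
            if criteria(row):
                # Tag each transformed row with the system it came from.
                landed.append({**transform(row), "source": source_name})
    return landed


# Two disparate (hypothetical) source systems holding amounts as text.
sources = {
    "crm": [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "0"}],
    "billing": [{"id": 7, "amount": "99.00"}],
}

landing_area = batch_pull(
    sources,
    criteria=lambda r: float(r["amount"]) > 0,  # "criteria X"
    transform=lambda r: {"id": r["id"], "amount": float(r["amount"])},
)
```

The returned list plays the part of the landing area: the transformed data has to be put somewhere before anyone can use it.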
Ok - well, ETL/ELT engines will HAVE to process near real time in the near future in order to survive. While batch will not go away any time soon, the windows are shrinking, the data sets are growing, and the timeliness of critical data is becoming more important. ETL/ELT are GREAT at static rules, parallelism, partitioning, and performance - they require huge amounts of processing power to get the job done right (with very large data sets). This is the nature of PULL. I guess one could speculate that PULL technologies require a place to "land" the data once it's been transformed.
Not something that PUSH technology needs, nor wants. PUSH technology wants to ACT on the transaction as it stands, once it reaches its destination. This is a primary difference between PUSH and PULL.
Now let's not get confused! There's such a thing as IMMEDIATE PULL, or PULL ON DEMAND. This is new - it's called EII (as a paradigm).
EII in this nature offers many different things and is a _complementary_ technology to EAI and ETL/ELT. Pull on demand isn't (usually) interested in massive history sets, nor is it interested in "doing" something with the transaction, such as applying it to another system based on business process workflow (although this could change in the near future). It is more interested in managing the metadata layers between the business and the data set, and in immediate access to and immediate integration of CURRENT state, than it is in history.
Now hold on! Don't get me wrong - EII can be used to access warehouses just the same as it can be used to access current OLTP/ODS, Staging areas, and Stock Tickers. It's the FOCUS of what EII does that makes PULL ON DEMAND different from PULL on batch schedule. The focus is much different. That same focus makes it a complementary technology to the EAI and ETL/ELT world.
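Here is a small sketch of what pull on demand through a metadata layer might look like. Everything named here (`METADATA`, `SOURCES`, `pull_on_demand`, the attribute names) is hypothetical, and the sources are stubbed with fixed values; the point is only that the request is answered from CURRENT state at the moment it is asked, with nothing landed anywhere.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical metadata layer: business attribute -> (source system, field in that system).
METADATA: Dict[str, Tuple[str, str]] = {
    "customer_name": ("crm", "name"),
    "open_balance": ("billing", "balance"),
}

# Hypothetical source accessors returning the current state for a customer id.
SOURCES: Dict[str, Callable[[int], Dict[str, object]]] = {
    "crm": lambda customer_id: {"name": "Acme Corp"},
    "billing": lambda customer_id: {"balance": 125.0},
}


def pull_on_demand(customer_id: int, attributes: List[str]) -> Dict[str, object]:
    """Answer one request by integrating current state across sources, right now."""
    fetched: Dict[str, Dict[str, object]] = {}  # each source is hit at most once per request
    result: Dict[str, object] = {}
    for attribute in attributes:
        source_name, field = METADATA[attribute]
        if source_name not in fetched:
            fetched[source_name] = SOURCES[source_name](customer_id)
        result[attribute] = fetched[source_name][field]
    return result


view = pull_on_demand(42, ["customer_name", "open_balance"])
```

The caller asks in business terms, and the metadata layer decides which system and which field actually hold the answer.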
Using the right tool for the right job makes all the difference. EII also can transform/conform, and write-back. Something that EAI does (write-back), but ETL frequently is not "architected" for. Mostly because the "work" that ETL does must be checked before it is re-integrated with the source systems.
Now take Active or Right-Time Data Warehousing, there's a combination of technologies being utilized to get the data into the warehouse at the right-time, and there's a combination (including data mining, and scoring analysis) to re-deliver the transactions back to the source systems at the right time. Of course this is neither push nor pull, but rather "closed loop processing." Ok - it uses push to get the transaction to the warehouse, and push to get it back from the warehouse to the OLTP system.
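As a very rough sketch of that closed loop: a transaction is pushed into the warehouse, scored against the history already there, and the scored result is pushed back toward the OLTP side. The `closed_loop` function and the "above average" score are stand-ins I made up; real scoring would come from data mining models.

```python
from typing import Dict, List

Transaction = Dict[str, object]


def closed_loop(transaction: Transaction, warehouse: List[Transaction]) -> Transaction:
    """Push a transaction into the warehouse, score it against history, and return it."""
    prior_amounts = [float(t["amount"]) for t in warehouse]
    average = sum(prior_amounts) / len(prior_amounts) if prior_amounts else 0.0

    warehouse.append(transaction)  # push: the transaction lands in the warehouse

    # Stand-in for scoring analysis: compare against the prior history.
    scored = {**transaction, "above_average": float(transaction["amount"]) > average}
    return scored  # push: this is what goes back to the source system


warehouse: List[Transaction] = [{"id": 1, "amount": 100}, {"id": 2, "amount": 200}]
feedback = closed_loop({"id": 3, "amount": 600}, warehouse)
```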
So at the bottom of this blog entry, we are still left with the question, what are the pros and cons of push and pull? Let's see if we can sum it up (forgive me, I may forget a few):
PUSH Pros:
1. Instant transaction communication
2. Feedback on the transaction after the business processes are invoked.
3. Transaction by Transaction / Guaranteed delivery mechanisms
4. Mass Distribution, or publish subscribe to those that want it.
5. Visual Business Rule Processing Engines (are usually in place).
6. TACTICAL in nature (for solving business problems)
7. New sources can come on-line and push out new transactions (integrating with ease into existing layers).
PUSH Cons:
1. Independent transactions - meaning can't rely on "history", can't rely on "trends", and can't rely on an understanding.
2. Difficult to establish context
3. Can't transform "massive sets of data" at once - technology just isn't fast enough yet - this may change with Nanotech and DNA computing.
4. Once a transaction is sent - it's gone. No "recorded history", although some EAI engines actually have mitigated this point over the years.
5. Sometimes tends to be a highly code-driven environment under the covers.
6. The number of crisscrossing attachments to transactions means it's harder to "unhook" legacy systems that are providing the information...
PULL Pros:
1. Massive sets of transactions in parallel/partitioned can be handled in ever smaller execution windows.
2. Increase in processing power means increase in data set that can be dealt with.
3. We can get what we want when we want it via scheduling.
4. Predictive support, predictive failures, predictive model - leading to standardization, and automation.
5. STRATEGIC IN NATURE.
PULL Cons:
1. Requires a Landing Area for the transformed data sets.
2. Requires massive sets of processing power (for large data)
3. Batch Windows are continually shrinking while data sets are ever growing.
4. No "NOW" data available, in other words, little to no visibility into the transactions occurring RIGHT NOW.
5. Once a source, always a source (static SOURCING, static TARGETING)
PULL ON DEMAND Pros:
1. Focus on the metadata integration layer
2. Focus on the business rules of integration
3. Utilized by services to conform NOW transactions, WHEN requested (as opposed to WHEN they happen)
4. Provides access to previously inaccessible systems (like word docs, emails, power points, and so on).
5. Dynamic and Distributed query sets mean the queries and their plans can change in accordance with the data set changes (straight PULL is STATIC QUERY BASED - unless the RDBMS engine tunes the query under the covers).
6. Dynamic Sourcing, Dynamic Targeting - if one source isn't available, the metadata layer and engine can determine the "next source in line" and fire the query just the same.
7. TACTICAL IN NATURE!!
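The Dynamic Sourcing point in the list above can be sketched roughly as follows. This is only an illustration under my own assumptions: `query_with_fallback`, the source names, and the use of `ConnectionError` to signal an unavailable source are all hypothetical, and a real EII engine would pick the "next source in line" from its metadata layer.

```python
from typing import Callable, List, Tuple

Source = Tuple[str, Callable[[str], List[str]]]


def query_with_fallback(sources: List[Source], query: str) -> Tuple[str, List[str]]:
    """Fire the query at each source in priority order; return the first answer."""
    errors: List[str] = []
    for name, run in sources:
        try:
            return name, run(query)
        except ConnectionError as exc:
            # Source unavailable: remember why, then try the next source in line.
            errors.append(f"{name}: {exc}")
    raise ConnectionError("no source available (" + "; ".join(errors) + ")")


def primary(query: str) -> List[str]:
    raise ConnectionError("primary is down")  # simulate an outage


def replica(query: str) -> List[str]:
    return ["row-1", "row-2"]  # stubbed rows for the example


used, rows = query_with_fallback(
    [("primary", primary), ("replica", replica)], "current orders"
)
```

The same query is fired "just the same" at whichever source answers first, which is what a statically sourced batch job cannot do.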
PULL ON DEMAND Cons:
1. Requires STRICT adherence and agreement by the enterprise to metadata management, and development.
2. Requires (or forces the hand of) data quality initiatives ON THE SOURCE SYSTEMS.
3. Increases management costs, and required processing power. BUT DECREASES Long-Term costs of implementation of "Services", be-it B2B, B2C and so on.
4. Requires sources be defined and setup ahead of time (before accessing), but PULL strategic has the same requirement.
Ok, none of these are complete lists by any stretch of the imagination (some might say I have none :) But hopefully they give a peek into what might be some of the top differentiators across these technologies.
Thoughts? Comments? Have some pros/cons you'd like to add? Please, feel free.
Posted October 20, 2005 2:42 PM