Blog: Dan E. Linstedt Subscribe to this blog's RSS feed!

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author >

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including: IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata.  He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology, and the Data Vault Data modeling architecture. He has built expert training courses, and trained hundreds of industry professionals, and is the voice of Bill Inmons' Blog on http://www.b-eye-network.com/blogs/linstedt/.

Well well, lookie here - Old MacDonald had a farm, E-I-E-I-O. (sorry, on a bit of a funny kick today). What do all these things have in common? More over what problem are they trying to solve? Are some of these technology stacks "sun-setting"?

In this blog we explore some of these garbled acronyms, and no - I won't repeat the farm joke... We'll also take a hard look at some of the existing business issues that are forcing changes in the way we (IT) work. If nothing else, a bit of light reading - you might get a laugh or two out of this... :)

Sometimes I wonder just a bit - why we have so many different mechanisms to solve the same problem. Oh yes, but what exactly is the problem to begin with?

It all boils down to this: moving data from point A to point B.
Yes, it really is that simple!

The options (byproducts of having the data available)? Integrating, changing, recording, merging, matching, and cleansing are all by-products. We have it in-stream, in-transit, in-route - now we need to do something with it.

Is ETL dead? Yes, I believe so - in it's current form it won't last much longer as a paradigm. It needs to morph if it will survive, change into something more "becoming" of the integration age we are currently feeling (which by the way, the wave or movement started over 4 years ago).

ETL = extract transform and load.
ELT = extract, load to the RDBMS then transform
EAI = Enterprise application integration
EII = enterprise Information Integration
EIEIO = Old MacDonald had a farm... (sorry, did it again).

As I mention in the data modeling blog, the paradigm is shifting, the need to move data from "all sources" into an integrated business model that houses both current and historical views of consistent data is being sharply focused by business acumen.

So what? That means EL, doesn't it? Yep - from a data movement perspective, extract and load - basically detecting existence of new/changed data, and doing a delta comparison is all that's needed. Icing on the cake is having a visual no-code, drag and drop development paradigm that handles and manages metadata along the way.

Let's talk about the Transformation section for a minute, the Big "T". It's a bottleneck, in fact - it's THE bottleneck in most VLDW / VLDB and very large data integration systems. Over the years it has been more efficient to transform the data in-stream, because the RDBMS engines lacked the scalability, and sometimes the functionality to handle all the complex transformations that are necessary.

Today, all that has changed. RDBMS engines now contain highly complex optimizers, incredible business transformations on the SQL level including (but not limited to) object level transformation, in-database data mining engines, in-database data quality/cleansing/profiling plug-ins, statistical algorithms, hierarchical and drill-down functionality, and on and on...

Along with the paradigm shift for bringing all the data to a single statement-of-fact (across the enterprise), the nature of convergence, and consolidation are now saying: it's more efficient to perform any type of "transformation" within the bounds of the RDBMS engine itself. After all, they have begun growing up and offering Multi-Terabyte solutions, some hundred+ terabyte solutions have been around for a long long time.

If we shift the "T" bottleneck from the ETL into ELT, we have a very strong case for scalability - the resulting engine leveraging the best-of-breed, latest RDBMS capabilities, and taking advantage of every ounce of scalability and parallelism (and load-balancing) that the RDBMS can muster up.

So, ETL is "dead". There, I said it. ETL vendors MUST re-tool towards EL with a focus on "T", or better yet - why not make the transition EASY, make the tool "ETL-T". Give the designers the option to convert to "EL-T" where it makes sense. After all, we have sunk-costs and development time into intensive ETL routines, let them stand for a while and earn back their keep.

Now what? What about ELT, EAI, and EII?

Ok, EAI is a 10+ year old paradigm that focuses on integrating applications. There are vendors like Tibco that "run wall-street". EAI is going strong as long as there are applications to integrate, but does EAI overlap into the world of EL? Let's first define EAI: Every time a change happens in an application "plugged in" to the EAI tool, it pushes the change to the message bus, and looks for business rules and other "listeners" that need to be notified of the change.

These business rules can consist of manual intervention, is this really necessary? or is EAI just another "band-aid" for overcoming source system capture problems and integration problems that exist in (for instance) mainframe interchange protocols? I would argue that EAI is more than that, because it focuses on the business processes of the data - goes above and beyond simple integration and begins to look at HOW the data becomes information, and where/when/why it should be utilized.

I will say this though: EAI as a paradigm is also dead. What? How can I possibly say this? This is blasphemous. In my opinion, EAI is a technology who's time has come - who wants to "push" all this traffic onto centralized busses, especially if there's no-one listening? What I mean is, simply pushing data out onto a bus or into a queuing system just because we have a change in the application doesn't mean it's vibrant and desired data that needs to be absorbed down-stream. Off-topic: how many times during the day do you hear noises that you "tune-out"? What if that noise were never made in the first place?

Besides that, EAI focuses only on ONE aspect of the business: the APPLICATION making the data change. There are many more places that data changes within a corporate environment, and some of them are not application based (take unstructured data for instance)...

What if I don't have applications to integrate? What if my picture is bigger than that? Say, web-services or SOA? Ok, EAI vendors must also adapt to meet the needs of SOA - so they have a paradigm shift to undertake if they wish to survive. Even though the paradigm has outlived it's usefulness, as long as there are new applications to "install locally" within a company, there will exist a need for EAI.

There is a shift afoot: "Applications are being out-sourced" says CIO magazine, DMReview, and couple other sources. Software and app providers are taking up the SOA provisions, and EAI (like it's predecessor EDI (electronic data interchange) will be lost in the fray).

However, if the EAI vendors take heed, and re-tool they have TREMENDOUS business value proposition already built into their "routing and business flow management" side of the house, so why lose all that investment? They could (if they wanted) take the SOA management of components and integration by storm, there is one such vendor I'm thinking of right now who could do this in a flash...

That leaves EII: EII picks up where ETL and EAI left off, it's a Pull on demand solution, which I evaluated 3 years ago (privately). I saw the EII paradigm as a niche player, and still do - it has a VERY limited life-span unless it too encompasses some additional technology and re-tool. EII is wonderful to get data at its' source, on-demand. It could very well fill some of the needs of an SOA if desired (and in some cases does very nicely). EII too, has some nifty capabilities to handle business metadata in it's form of meta-modeling. I've never seen such gracefulness in dealing with multiple modeling formats. Once I got over the horrendous learning curve (of one particular tool), it began to make sense.

Some of EII's problems are: it can't handle massive volumes of data, it performs transformations on the fly (row/by/row column by column). The tools that implement EII usually rely on a middle tier meta-model, some of them have the capability of defining business models and allow business users to actually CHANGE the model without IT intervention - nifty trick. However, the transformation is again, squarely in the way of scalability of these engines, as is Write-back capability.

Ouch, write-back? If I setup an EII query to source from a web-service, a stock ticker, and a data warehouse, how on earth can write-back be enforced? let alone a two phase commit? The rules to determine write back must be extremely complex, and again - volume and complexity are directly juxtaposed to each other, and inversely proportional to performance over a given constant of time.

So we're back to square one (whatever that was). I'm a bit dis-illusioned by the vendor hype and what is truly delivered, and I haven't even begun to discuss the XML, Web-services and their transformation ability. I will say this though: There is a LOT of good technology buried in each of these solutions which has been developed, now if each of these vendors could focus on solving some of these problems:

1. Move the transformation logic into the RDBMS engines (they can do it, they really can!) Maybe add some "RDBMS tuning wizards" to the tool set, metrics gathering and collection would be nice...
2. Move the EL logic into loaders and connectors with high-throughput and CLEAR network channels (to move big-data, fast, and in parallel). Maybe even offering COMPRESSED network traffic as a free add-on? Adding CDC on the SOURCE as a free add-on?
3. Leverage RDBMS BEST features and plug-ins, like built-in data mining, built-in-data quality.
4. Focus their tool set on ease of use (from a business user perspective, developing business process flows).
5. Focus their tool set on "staging data" in an RDBMS of choice, so that write-back becomes a reality (with the caveat, that not all sources can be written back to, only our "copy" or snapshot of a stock ticker feed can actually be changed).
6. Focus their tool set on metadata, BUSINESS METADATA, and how it works with the business process flows, the management, maintenance and reporting of that metadata.
7. Focus their tool set on managing, setting up, and maintaining web-services and the security around them.

I think there would be a winning paradigm, maybe it would be called:
E-L-A-I-I-T-I-BMM (sounds like a foreign language).
Extract
Load
Application
Information
Transformation (in-RDBMS)
Integrate
Business Metadata Manager
(now breathe deeply)

You never know, it might just show up on EggHead's shelves! (just kidding, horrible acronym, but it describes the concepts). Of course if Disk Vendors have their way, they will take over the EL portion soon, and pair it with the compliance packages.

These are just ramblings on the state of these technology areas. I beg your pardon, this is not meant to be an attack, just a very opinionated blog as to how I see this industry evolving. Invest in BMM today! Comments?


Posted April 25, 2005 4:51 PM
Permalink | 11 Comments |

11 Comments

Have you looked at Sunopsis ELT solution? Seems to cover many of the points you raise here!
Chris

Chris,

Yes - I am currently looking into Sunopsis as we speak.

Thanks,
Dan L

LoL!

BTW you forgot ESB ;-)

You touched on something that I believe deserves more attention. ETL was an early paradigm for getting data from source to platform and came about because the databases and infrastructure didn't have the performance capabilities they do now. With the increase in power of databases and particularly systems like Netezza's NPS appliance a significant portion of the "T"ransformation work can and should be moved into the database. However, let's not let the pendulum swing too far. The ETLT model that you suggest takes advantage of doing record oriented transformation work outside the database and set oriented work inside the database. Using Ab Initio (selected because among other things it has the best parallel processing model) one can implemention the ETLT model very effectively.

Herb, excellent observations - I agree it needs more attention, and I hope to blog more on these things. I try to stay away from naming names, that would cause more trouble than it's worth. Hey, maybe I'll start a "vendor only" section of blogging... Hmmm.

I think your view of EII is limited and framed by the tools available today. EII is the new data management. Feel free to buy a copy of my book Enterprise Information Integration: A Pragmatic Approach and perhaps then you will understand the full depth and breadth of what EII is and why it is such a powerful paradigm for data management.

JP,

Thank-you for your comments, I will read through your book to see what your vision of EII is. Maybe I should have been more specific and stated at the top of the blog that the EII references are to today's tools, and do not reflect the EII vision. I will blog more on this topic going forward.

I think it's important to shed light on the differences and similarities of encompassing technology, and architectural paradigms.

Hello,
Is www.anycont.com an online ETL ?

I, too, look forward to future blogs of yours on EII. I think some of your observations, while perhaps intended to be provocative, might give some the wrong impression.

We seem to be in a very negative phase of technology these days. I think a good slant for a future blog would be to investigate where EII can save customers time and money - there are several.

Also, as another commenter pointed out, EII is really in its first generation. I think you're going to see some amazing innovations even in the next 12 months. The point is that EII can perform useful functions today, and it will of course, get better. Not too different from the origins in EAI and ETL when you think about it.

Hi,

i was wroking in Sunopsis with Teradata the project Called BASEL II. Its good tool for ELT with teradata, but not much developer compatable.

The the concept behind Transformation running under the Database (Teradata) means its grat one.

Thanks,
Rajendran
rajendran77@hotmail.com

I found this article about data-integration
It also explains very clearly the differences between ETL, EAI and EII

Leave a comment

    
Search this blog
Categories ›
Archives ›
Recent Entries ›