Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW, and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

September 2006 Archives

I've been working recently with the new release of a popular ETL tool that now handles ELT. It is an interesting approach, but one that bears discussion. The systems we are working with are very large in terms of data sets. I'm not talking about large historical data (although that's one issue); I'm talking about large daily data feeds. This customer has terabytes of information to deal with on a daily basis. So if you're like me, and you've used ETL for years and want to know more about ELT, read on. This is a technical entry in my blog. I'll get back to MDM very shortly.

ETL has its place in the data migration and integration world. ETL = extract, transform (in stream), then load to the target database. However, ETL is proving to have issues with super-huge volume sets (like 800 million rows in a single table): once the data has reached the target database and still needs transformation, what do you do? Simple math says that if you can run at a maximum of 50,000 or 100,000 rows per second, it will still take 4.4 hours or 2.2 hours (respectively) to process, and this is just ONE of the 50 or so tables to be processed.
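To make that back-of-the-envelope math concrete, here is a minimal sketch in Python, using the 800-million-row table and the 50,000 and 100,000 rows-per-second rates from the scenario above:

```python
# Back-of-the-envelope timing for transforming one 800-million-row table.
ROWS = 800_000_000

for rows_per_second in (50_000, 100_000):
    hours = ROWS / rows_per_second / 3600
    print(f"{rows_per_second:,} rows/sec -> {hours:.1f} hours for a single table")

# 50,000 rows/sec  -> 4.4 hours
# 100,000 rows/sec -> 2.2 hours
# ...and that is just ONE of the ~50 tables in the load window.
```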

When we talk about processing volumes like this, time is critical. That's not to say that ETL can't handle the load. On a large enough machine, with enough processing power, it certainly can be done. But extracting all that data (that you just loaded), then transforming it, then re-inserting it back into the database can be a bit unnerving.

Enter ELT - the ability to perform extremely heavy, parallel SQL and transformation work within the RDBMS. There are only certain RDBMS engines, and certain configurations (mostly in the MPP space or extremely large SMP space), that can handle this kind of load in single SQL statements. In ETL we try to push the data through an optimized parallel set of processes, and break the data stream up such that much of the decode, aggregation, filtering, and expression logic is held in RAM. In ELT the goal is different: simplify the SQL statements as much as possible, formulate intermediate tables along the way (there could be many of them), and use the power of the join operations within the database to resolve all the calculations.

If we are to use tools to gain the functionality of ELT, our goal is very simple: extremely small processes - but lots of them. Each step along the way takes care of one small piece of the transformation logic. Making multiple passes over the data is the secret to making ELT a huge success.
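As a rough illustration of those small incremental steps, here is a minimal sketch (Python driving simple SQL; the table and column names are hypothetical, and the statements are deliberately trivial) of an ELT-style flow where each step materializes an intermediate table inside the database:

```python
# Minimal ELT-style sketch: each step is one simple, set-based SQL statement
# that the database engine executes in parallel, materializing an intermediate
# table along the way. All table/column names here are hypothetical.
ELT_STEPS = [
    # Step 1: land the raw feed untouched (the "L" happens before any "T").
    """CREATE TABLE stg_customer_raw AS
       SELECT * FROM ext_customer_feed""",
    # Step 2: one tiny transformation - standardize the country code.
    """CREATE TABLE stg_customer_std AS
       SELECT customer_id, UPPER(country_code) AS country_code, sales_amount
       FROM stg_customer_raw""",
    # Step 3: another tiny step - a set-based aggregation via GROUP BY.
    """CREATE TABLE stg_customer_sales AS
       SELECT customer_id, country_code, SUM(sales_amount) AS total_sales
       FROM stg_customer_std
       GROUP BY customer_id, country_code""",
]

def run_elt(connection):
    """Run each tiny step in order; the heavy lifting stays in the RDBMS."""
    cursor = connection.cursor()
    for statement in ELT_STEPS:
        cursor.execute(statement)
    connection.commit()
```

The point is not the specific SQL; it is the shape of the work: many trivially simple, set-based statements the engine can optimize and parallelize, instead of one complex mapping.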

This denotes an architectural change in the way we code our "transformation" logic. ETL focuses on complex mapping designs split into parallel processes; ELT focuses on small incremental steps - very simple, with extremely fast results. Trying to do too much in a single ELT step will hurt performance and will move too much data across the MPP nodes within the environment. Complexity is the enemy of a single step within the ELT environment.

For instance, in ETL it may take 5 processes to completely cleanse and transform customer data. In ELT that same logic may take 32 processes (steps), but each one operates at a completely different performance level. In ETL those 5 steps may operate at 30,000 rows per second; in ELT, they operate at 100,000 rows per second or better, depending on the hardware and the dispersion of the data sets across the MPP environment.

The point is: if you're beginning to use ELT instead of ETL, you have to think differently - there are different standards, a different architecture, and a different approach to take.

If you've been in the ETL and ELT worlds, I'd love to hear your comments about what you've had to change to make it work.

Thanks,
Dan Linstedt
CTO, Myers-Holum, Inc


Posted September 29, 2006 5:24 AM

If you haven't heard of them already, you should! These guys are hot in the market space. They offer open source ETL, ELT, BI, dashboards, and analytics. Two awesome companies that should be checked out: one is Pentaho, and the other is Breadboard BI. For low-cost solutions, these guys are hot. There are major institutions out there currently examining the use of tools like these for prime time.

Apparently I'm not the only one to notice. Pentaho grabs $8M in funding according to Computer Business Review Online. Shawn Rogers had a post back in July referencing these two vendors.

The links to the open source (SourceForge) project for Pentaho are:
http://sourceforge.net/projects/pentaho
http://www.pentaho.org

"A complete business intelligence platform that includes reporting, analysis (OLAP), dashboards, data mining and data integration (ETL). Use it as a full suite or as individual components that are accessible via web services. Ranked #1 in open source BI. "

The link to Breadboard BI is:
http://www.breadboardBI.com
Breadboard BI has an online demo that hits real data in a real database:
http://24.5.207.85:7070/pentaho/Navigate?&solution=breadboard

"In a highly competitive global marketplace, knowing your customers, supply chain, finances, and workforce are critical to success. In the recent past, the complex systems required to consolidate and display this knowledge were extremely expensive and time-consuming to build. As a result, many systems never moved past the brainstorming stage, others were killed in development due to cost overruns.

Packaged Content + Open Source = The Breadboard Difference
Seasoned experts from Breadboard BI or one of our global partners match your needs to one or more Breadboard BI Solution Areas. Your solution is then iteratively customized using open source Solution Slices™. The result is a highly-tailored, scalable, enterprise solution built rapidly and inexpensively."


Posted September 13, 2006 8:07 AM

I've recently searched the web for references to my nanohousing articles, and I found two interesting references that I'd like to point out. I've taken a hiatus from blogging on nanohousing to learn more about nanotechnology; I hope you won't be disappointed with the efforts. Anyhow, this is a short entry; I thought you might enjoy these two items.

The first is a success story that uses one of my articles here on the B-Eye-Network on nanohousing; check out this slide show about a robot that was built to "LEARN" systems.

eTrium Corporation in the Czech Republic has created a robot based on nanotechnology, and has quoted the nanohousing paper.

The second is a reference to ComputerWire, in which an author from their team has plagiarized my work - word for word - and claimed it as their own. For the past several years, ComputerWire has been selling my article for $45 (of which I see NO compensation). They have promised to issue a press release retracting the claim that it was that author's original work, but I've never seen it.

What's even more interesting is that they've back-dated the material to 2003, when I wrote it in 2005!!

What's funny is that you can get the article for free, HERE on the B-Eye-Network. It was originally published in Bill Inmon's newsletter many years ago.

I think it's interesting that they still insist on selling my original work without paying me for it.

Cheers,
Dan Linstedt
CTO, Myers-Holum, Inc


Posted September 8, 2006 6:28 AM

Appliance is an over-used term. Face it: from a customer perspective, there is NO clear definition of what an appliance should be or should contain. To that end, I'll take an opinionated stand on this, because it's gotten to be a thorn in my side again. Platforms are what we have today; appliances are something we need to grow into. The EDW/ADW market needs to learn lessons from "accepted appliances" in the marketplace - things like firewalls, routers, hubs, etc.

With that said, in addition to a previous entry I blogged (an attempt at defining the EDW/ADW appliance), I will augment those definitions with these additional comments:

A TRUE Appliance (for EDW/ADW purposes):
* has not yet been developed.
* is NOT just an RDBMS slapped on top of or coupled with hardware; that is just another RDBMS engine on top of hardware.
* is NOT just a simple black box with a SQL API.

To me, these are platform combinations. Why? Because I can put a database engine on my laptop, but does that make my laptop an appliance? No. Likewise, I can put an RDBMS engine on a server with specific hardware, but does that make my server/hardware an appliance? NO.

So what is a TRUE appliance?
A TRUE APPLIANCE:
* has a Web-based Thin client GUI administration front-end (full BI capabilities)
* has a full API for reporting, logging, admin, reprogramming, configuration.
* has software embedded at the hardware / firmware levels.
* can do transformation, data mining, loading, and reporting.
* can notify end-users, administrators of security breaches, and "bad or suspect incoming data."
* has web-enabled firmware updates.
* is truly: plug and play.
* is NOT part of a cluster, it IS part of a grid.
* is self-contained.
* has 9-9's up-time, and never quits (see the quick arithmetic after this list).
* has (near) linear scalability.
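As a quick sanity check on how aggressive that nine-nines wish-list item is, here is the arithmetic (a sketch only; the 99.9999999% figure comes from the bullet above):

```python
# How much downtime does "nine nines" (99.9999999%) availability allow?
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # ~31.6 million seconds
availability = 0.999999999              # nine nines
downtime_seconds = (1 - availability) * SECONDS_PER_YEAR
print(f"Allowed downtime: {downtime_seconds * 1000:.0f} milliseconds per year")
# -> roughly 32 milliseconds of downtime per year
```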

Let's give it some background:
If you think about a router with a firewall built in, it has all these services. Granted, the BI is simplistic, but if an EDW or ADW is to call itself an appliance, these are the criteria by which I would judge it. Think about it: data mining and catching security breaches is BI at its best, and firewalls HAVE to stop unfriendly intrusions.

Why then is it SO HARD for the "data warehousing appliance market" to get a grip on this?
Because the data is 10x to 100x the volume (or more); it's quite simply a more complex task.

ADW/EDW is more complex, deals with more data (than a firewall), deals with different kinds of data - or does it? If we back away, or maybe get close enough to see the grass blades, the data and the APIs might have to be really simplistic to make an appliance work. Maybe there are classes of "actions," like usage and configuration. Underneath usage is SQL; underneath SQL are ETL/ELT, data mining, UDFs, data cleansing, data profiling, and REPORTING - all functions that should be driven through SQL ON THE APPLIANCE! Underneath configuration are intrusion detection (data mining), real-time analysis, alerting, and so on - everything the appliance has to "report".
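One way to picture that taxonomy of actions is as a simple nested structure. This is only a hypothetical sketch to organize the idea above; it does not reflect any vendor's actual API:

```python
# Hypothetical sketch of the appliance "action" classes described above.
# Nothing here reflects a real product API; it only organizes the idea.
APPLIANCE_ACTIONS = {
    "usage": {
        "SQL": [
            "ETL / ELT",
            "data mining",
            "UDF",
            "data cleansing",
            "data profiling",
            "reporting",
        ],
    },
    "configuration": [
        "intrusion detection (data mining)",
        "real-time analysis",
        "alerting",
    ],
}
```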

When we think about the firewall and security, we think about complexity in a different light. The data mining in the firewall must include intrusion detection, virus protection, anti-spam protection, and anti-Trojan-horse protection - but really, much of this is data driven through configuration. The firewall is an extremely advanced ANALYTIC APPLIANCE which we take for granted. It is an OPERATIONAL SYSTEM: even though it logs a historical trail of the items it catches, it still rolls them off. It provides alerts, reporting, fraud detection...

The ADW/EDW market should be listening and take a cue - perhaps even partner with or acquire firewall device companies. What can happen is that the ADW/EDW company adapts the firewall technology to become a true "point-collection ADW device" by adding storage. It would keep a rolling history of data for transactional analysis and roll the data into a centralized store. The centralized ADW/EDW would then be connected "live" - not just through software, but through a hardware/software combination specialized to do specific tasks.

Do you have thoughts on how you define APPLIANCE for ADW/EDW? Please share.

Thanks,
Dan Linstedt
CTO, Myers-Holum, Inc
http://www.MyersHolum.com


Posted September 7, 2006 8:53 AM