Blog: Dan E. Linstedt Subscribe to this blog's RSS feed!

Dan Linstedt

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including: IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata.  He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology, and the Data Vault Data modeling architecture. He has built expert training courses, and trained hundreds of industry professionals, and is the voice of Bill Inmons' Blog on http://www.b-eye-network.com/blogs/linstedt/.

About this blog >

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T. I can’t wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

Recently in Compliance and Integration Category

In this entry we explore the nature and notion of compliance - specifically Sarbanes-Oxley and what it means to your EDW. I've been working with compliant based systems for years. Over the years I've learned about data as an asset, that is: data in the EDW affecting the financial bottom line. I've learned about audits and auditability (been through a few of them myself). In this "series" I will first explore sarbanes oxley, then follow with CoBIT, ITIL, SEI/CMMI Level 5, and a few other things. Please let me know what you think of this entry/series and if you'd like to see more.

Let's start with a few definitions:

Sarbanes-Oxley_Act
1. Public Company Accounting Oversight Board (PCAOB)
2. Auditor Independence
3. Corporate Responsibility
4. Enhanced Financial Disclosures
5. Analyst Conflicts of Interest
6. Commission Resources and Authority
7. Studies and Reports
8. Corporate and Criminal Fraud Accountability
9. White Collar Crime Penalty Enhancement
10. Corporate Tax Returns
11. Corporate Fraud Accountability

How does this tie to my EDW/BI initiative?
Very interesting question, to which I have an opinion. My opinion is as follows: I believe that data is an asset within our organizations. Now before you run off to tell the world that "only good data is an asset", let me back up. Good, Bad, and Indifferent - data is an asset - regardless of how it's perceived. Data that is captured, or created on the fly is an asset. It doesn't matter if it's good or bad data. Besides, who determines which label to place on the data?

With data as an asset, it affects the bottom line financials. Financial decisions are made based on data every day, sometimes every second. In some cases (like NASA), data affects peoples lives. Clearly, data is worth something on the financial books.

Ok - so how do you value it?
That's a discussion for another day.

Now that data is seen as an asset to the corporation, and that it's considered tied to financials, it should be available for audits, and compliance. The compliance must come from the people themselves within the organization; however the data can shed light on the firm's compliance or non-compliance abilities. In other words, the data can tell the auditors: "what the company knows, and how they are reacting to the situation." The data can also help determine the "net-worth" of the organization.

Sarbanes-Oxley_Act (a little further down says)
Auditing Standard No. 5
* Assess both the design and operating effectiveness of selected internal controls related to significant accounts and relevant assertions, in the context of material misstatement risks;
* Understand the flow of transactions, including IT aspects, sufficient enough to identify points at which a misstatement could arise;
* Evaluate company-level (entity-level) controls, which correspond to the components of the COSO framework;
* Perform a fraud risk assessment;
* Evaluate controls designed to prevent or detect fraud, including management override of controls;
* Evaluate controls over the period-end financial reporting process;
* Scale the assessment based on the size and complexity of the company;
* Rely on management's work based on factors such as competency, objectivity, and risk;
* Conclude on the adequacy of internal control over financial reporting.

Ok, I can see how source systems are affected, but how does this tie to my EDW?
The EDW must house "A SINGLE VERSION OF THE FACTS for a specific point in time." (see Data Vault Modeling and Methodology e-learning on http://inmoninstitute.com) The Data must tell a story of what the company DID and how they REACTED to a specific situation that occurred within the organization. The data in the EDW must create an AUDIT TRAIL of decision making along the way. The EDW is crucial to uncovering the facts about what people knew when. It MUST become a system of record "capture mechanism" in order to meet compliance initiatives.

Wait a minute, that's a big leap - I don't follow...
You're not alone. Many people around the world are now discovering that the only way to uncover corruption, fraud, or pure misjudgment is to look at the good, the bad, and the ugly data in the EDW - and how it changed (or didn't) over time. The EDW tells the story of the companies' evolution, ranging from new source data, to changing of the business rules. Ok, back to the point:

How can you "assess the effectiveness of audit controls" without looking into the EDW for a data trail of how the company is operating? Especially if you are warehousing the financial systems...

How can you "Understand the flow of transactions" without tracking how the flow's business rules changed the transactions along the way? An EDW should capture the history of the raw transactions BEFORE and AFTER the changes in order to meet compliance.

SOX 404 compliance costs represent a tax on inefficiency, encouraging companies to centralize and automate their financial reporting systems. This is apparent in the comparative costs of companies with decentralized operations and systems, versus those with centralized, more efficient systems. For example, the 2007 FEI survey indicated average compliance costs for decentralized companies were $1.9 million, while centralized company costs were $1.3 million.[28] Costs of evaluating manual control procedures are dramatically reduced through automation.
http://en.wikipedia.org/wiki/Sarbanes-Oxley_Act

Regarding costs, the EDW is meant to be a centralized repository of information. The Sarbanes-Oxley auditor should be asking to view the financial reports from three directions - using triangulation to spot discrepancies.

Auditor to the firm:
Direction 1: Show me today's financial reports from today's data... (firms response: ok, either from the EDW or from the operational systems) - usually this will come from an "OPERATIONAL DATA WAREHOUSE" or a system using operational BI.

Direction 2: Show me yesterdays' financial reports - reproduce them for me using yesterdays' routines, and yesterday's data.... don't just grab your "old hard-copy"... (firms response: ok - from the EDW, and the backed-up routines, and yesterday's data mart).

Direction 3: I see errors, discrepancies between the two reports... Now, show me the RAW detail data that went in to yesterdays report, and the RAW detail data that went in to today's report. (Firms response with a "version of the truth" warehouse is: UH-OH, we're in trouble.... Firms response with a Data Vault is: No problem)

Data is an asset, data affects the financial bottom lines. RAW data needs to be tracked in the EDW in order to be compliant with Sarbanes-Oxley. Auditors will ask to see this information, and the EDW better have it.

*** Compliance initiatives are difficult (if not impossible) to meet without a historical tracking of RAW data sets, integrated, and stored in the EDW ***

Changing the data on the way IN to your EDW can cause a compliance audit failure in the future, especially if the source system is retired, is destroyed, or is unable to "restore" the system of record that created the data in question. The EDW is the ONLY place in the future to house this information.

I will be continuing my series on auditability, compliance here - but you can also find out more by registering, and watching new on-line courses on http://inmoninstitute.com - I will have some courses available by February 15th, 2009 about auditability and compliance and the EDW.

I will continue my series as well, in discussing governance controls, and accountability as we move forward.

I'd like to hear about your thoughts/experiences. Please reply with comments below.

Thank-you,
Dan Linstedt
CIO, Genesee Academy, LLC
DanL@GeneseeAcademy.com


Posted January 20, 2009 4:45 AM
Permalink | No Comments |

Well, it's happened again. IT is trying desperately to eliminate the value of the EDW from the business (at least this is what I see). Business is responding by demanding the creation of Master Data systems. There seems to be an age-old argument in the market space about the use of, definition of, and condition of: Business Keys. IT appears to be telling people to use surrogate keys and to ignore the business keys entirely. In this entry we will explore this single notion, and see what some folks have to say about it (me included!) Mind you, this is a bit of a rant; they seem to know how to "get my goat" as they say...

I start off by stating, Codd & Date designed normalized forms to have business meaning. They insisted that Business Keys be utilized in order to "make sense of" and "tag" the data sets appropriately so that relationships can be understood and maintained.

A simple link to a temporal database book houses a brief entry on the Information Principle: See it here.

I'd like to say a word or two about business keys (which by the way, you'll be able to find additional information on my videos on YouTube: http://www.youtube.com/dlinstedt/

What are business keys?
Business keys ARE the master information that unlocks context for business users. Business keys are (often) intelligent keys that have MEANING to the business. Business Keys are often alphanumeric, parts of which may be generated sequences, other parts have meaning based on position. In any event, business keys are USED by the business to locate, identify, and track information through the business life-cycle. Without them, business may not be able to "use" or apply the information properly.

What business keys are NOT:
Business keys are NOT surrogates, NOT sequences, NOT ordered numeric elements assigned based on technical insertion rates. Surrogate keys should NEVER be shown to the business, ever.... They should be used within a system (internally only) to identify rows to the machine, and provide optimal join paths, but they should NEVER appear on reports, screens, or anywhere that the business can see them.

Is there an argument around business keys versus surrogate keys?
You bet! Check out these comments:
http://www.mindfuldata.com/Modeling/modeling-pdf/DAMA%202008%20Speaker%20Notes.pdf
http://stackoverflow.com/questions/63090/surrogate-vs-naturalbusiness-keys


"Dimensions should always use a surrogate key that is generated within the warehouse. I went to a presentation a couple of years ago by Ralph Kimball (a data warehouse author), and he discussed the importance of removing the warehouse's dependency on business keys. The idea is a good one, because business keys change regularly and this will result in a long-term problem for the warehouse. However, when we discussed Slowly Changing Dimensions (especially ones that kept history), he said that we should use the business key to link them together. This went against what he had just said, so I decided that we needed to find another solution." http://expertanswercenter.techtarget.com/eac/blog/0,295203,sid63_tax298150,00.html

http://www.mindfuldata.com/Modeling/modeling-pdf/DAMA%202008%20Speaker%20Notes.pdf

http://www.infoadvisors.com/Home/tabid/36/EntryID/191/Default.aspx
http://www.cerebiz.com/blog/index.php/2007/08/06/use-of-surrogate-keys-in-data-warehousing/

WHY are these people demanding that there is no value to business keys?
Because it's a very tough problem for business to overcome. Yet the business today is ASKING, Begging, pleading for answers from Master Data Sets. I maintain that you cannot build a master data system without looking at and using business keys as a central HUB of information.

Why not surrogates?
If I ask you to look up surrogate key 5, do you understand what this is? where it came from? what it is bound to? Does it give you _any_ context at all as to which system generated the number? Do you even know where to begin to find this key?

Surrogate numbers are generated today in EACH source system. In the Data Warehousing world we are responsible for integrating MULTIPLE systems at once into a single place. If we rely solely on these "surrogate keys" and completely ignore business keys as has been suggested by the links above, our EDW would never mesh or align for the business. Furthermore trying to build a master data system would be impossible. Some of these individuals I listed even went so far as to say: "ignore the business keys in your dimension entirely, because it is unruly (null) most of the time".

I say rubbish. If your business is not properly synchronizing, populating, or utilizing business keys then they are hemorrhaging money along their business process. Business keys are vital to the traceability of information ACROSS lines of business and ACROSS systems.

Take a look at what I say about business keys:
http://www.danlinstedt.com/AboutDV.php
http://www.tdan.com/view-articles/5285
http://www.b-eye-network.com/blogs/linstedt/archives/2005/09/between_inmon_a.php

Bottom line, Business keys are imperative that they span the systems. If the business keys are changing, or are re-used, the business is LOSING MONEY. I will take that to the board of directors level every single time, and every time - I can find busted and broken business problems and lack of visibility ACROSS the organization in accordance with their lack of regard for business keys.

The ONLY thing one has to do is look at the businesses that want master data systems - how are you (IT) going to integrate the data sets by surrogate if the surrogates generated by source system ARE THE SAME across multiple sources? WHICH surrogate are you going to show to the business as the "MASTER KEY" for which pieces of information? It's a near impossible problem to solve, the business units will fight over the definition, and it will come down to politics as to who is right/wrong, when the business REALLY should be deciding how to fix the source of the problem: lack of a single business key.

Auto manufacturers figured it out long ago, they use VIN (vehicle identification numbers) to uniquely identify: make, model, manufacturer, date of manufacturer, size of engine, and so on. Unless you are doing something illegal, the VIN does not change, nor does it go away. What would happen to the world of car's if the VIN disappeared?

We have the SAME question in the world of counterfeit drugs... Unfortunately E-Pedigree as a country wide solution has been lobbied down, and pushed back. Each bottle was to be labeled and identified as a unique bottle using a very specific bar code. It would have allowed the entire industry to sort out the MOST of the counterfeit drug problem, and save people’s lives.

You can sit there and tell me that "Business Keys don't matter" but at the end of the day, I will say: you are losing money, and quite possibly people are dying without them.

Cheers,
Dan Linstedt
Check out WHY business keys are important, learn about the Data Vault Model.


Posted November 2, 2008 10:07 PM
Permalink | 3 Comments |

In part 7 of this series I mentioned that I would share how to number deliverables of the project to assist in monitoring progress, and managing metrics (KPA's and KPI's of the project). In this entry I provide some very simplistic starting blocks on how this is done within a project methodology, and hopefully - there's light at the end of the tunnel where we can begin to see the impact on the risk, estimations of hours, cost measurements/forecasts, and actuals for delivery. This entry is all about Project Management and deliverables - how to tie them together.

If you are SEI/CMMI certified, or are an auditor, I would love to have your feedback and comments regarding these subjects. If you are PMP or Six Sigma or TQM familiar, it would be good to hear from you as well - please, tear down these ideas if they do not fit with your experience. I can only learn from you if you respond. :)

In my past 15+ years (now going on 20 years) of IT experience I've spent maybe 6 to 8 years managing projects (technical project management). One of the hats I wore was not only business analyst, but also full project manager, and team enabler. Under this guise of lean-initiatives and cycle time reduction, I learned a few things that seemed to make sense at the time.

I got tired of estimates that didn't match the project plan, couldn't be scoped properly, or attached to actuals and deliverables. I got tired of risk running rampant and killing projects before they started. I got tired of always being asked: "how complete are you and your team on task X?" the real question I got was: "How close to done are you with the requirement... you know, the requirement that discusses ZZZZZ...."

I needed a way to track all of this, and furthermore to be able to press a button and run some analytical reports / graphs (every day) on the project as we moved along. So, taking from technical writing requirements, and from SEI/CMMI and from the legal profession (which I only know they number every paragraph).... I started numbering everything I could find.

For instance, I went through the business requirements, numbered all TITLES and SUB-TITLES, and paragraphs.
1.0 Requirements Overview
1.1 Requirement 1
1.1.1 Response time for req. 1
1.1.1.1 The expected response time........ (paragraph)
1.1.2 Types of queries...

You get the idea, next I numbered the technical requirements to mesh with the business requirements. I aligned the requirements to match up in a matrix of "this is what they want, and this is how we propose to build it." This was appropriately called "IT alignment" (at least it was in the '80s... Then, I took the technical requirements and began numbering EVERY line-item in the project plan. By the way, this became a GREAT way to spot requirements (stated) that were missed in the project plan... interesting loop-hole catch.

I then thought to myself: Self.... (just kidding)
I took the Project plan, and assigned roles rather than people (as responsible parties). To which I developed a roles & responsibilities document, and then numbered that too (independently of the requirements). RR1.0, RR1.1, RR2.0, etc.... I took the R&R numbering system and attached them back to the technical requirements, then assigned resources to the roles and responsibilities, and to each of the resources - I assigned resource loading. This ended up becomming the work-breakdown structure (Project + Tech requirements + RR)

Next, I created an organizational breakdown structure (org chart), and developed an escalation path for each role, numbering each element in the org-chart as I went, assigning RR1.1 to a specific org unit. Now we knew where the risk would be handled, or escalated as things got hot (if they got hot). Next on the list was a process breakdown structure (AS-IS process flows). We needed to know how the data currently moved from one business unit to another, from one system to another. We developed process flows at 30,000 feet and above - then numbered all of them with the appropriate business requirements number (which tied the artifacts to specific components of the project plan).

Then, we immediately began designing new (to-be) process flows, which re-defined some of the interfaces, and how the data would flow to the warehouse, out to the marts, and back to the reports in the business users' hands. We then numbered each of these to-be process flows with the "original process flow numbers" tying them together. As we built the "to-be flows" and completed the process re-design, we could attach these components to mile-stones reached within the project and produce deliverables consisting of data and process to the business users.

Finally, we went through each major section of the technical requirements and assigned risk analysis templates by applying expected skill sets (balanced against the R&R, and the org chart, and availablility) - we developed a low,medium, and high risk score. We then set a threshold for warning (approaching high risk) where we would begin escalation procedures up the Org Breakdown structure.

Needless to say, there were many other deliverables (all docs were versioned in keeping with CMM), all processes were measured, quantified, and then optimized, and the Data warehouse (now some 15 years later) is still running strong.

Ok - you think this is a lot of work? We did this with a team of 3 people, + 1 person from the PMO (proj. Mgmt Office), + 1 DBA part time, + 1 senior/expert data modeler/data architect part time. And we accomplished delivery of the full production warehouse inside 6 months for 3 source systems (Planning, Manufacturing and Finance). The EDW consisted of 60+ tables, source systems around 300+ tables with manufacturing bill of materials.

It can be done, with the right people, the right training and the right expertise - and the benefits can be enormous. It doesn't take a huge bankroll to institute this type of "project governance" or maturity model for EDW projects, just dedication and consistency.

Hope this helps,
Dan Linstedt
danL@DanLinstedt.com


Posted September 24, 2008 8:51 PM
Permalink | 2 Comments |

Any time we get back to secrets, we seem to fall right back to the category of standards, standardization, measurement and enablement. The old saying is: "if you can't measure it, you can't monitor it, and if you can't monitor it - you don't know when it's broke, or you can't optimize it/fix it." Something like this anyhow.

The common feedback from the general project implementation community is usually: "Why do I need to standardize? Why should I document? Won't it take more time to follow standards than to build rapidly?"

Well, yes and no. If you don't standardize (or your team doesn't standardize), then your project usually cannot be repeated successfully. If the team doesn't standardize then looking back at "what you did right/wrong" is good and can be done, but doesn't provide any sort of "metrics enablement or measurement" abilities against what was done, versus what was estimated, versus what "should" have been done.

Furthermore, documenting the process usually doesn't occur - and when it does happen it's retro-fitted to the existing project just released to production. This also can cause a herculean effort to "reverse engineer" and understand what was built just to build up the documentation.

One more side-effect to these efforts (JAD/RAD typically) include a hit on: flexibility, scalability, and maintainability. In other words, without standards - the project better be a "one-off" never to be repeated in the near future. Reusability is extremely tough in an environment where standards have been tossed into the wind. IT ends up (usually) loosing their agility.

Ok - enough of this, this is all project based stuff. We learned all this in PMP/PMI, Six Sigma, TQM, and so on... what about the templates, how are they useful, can a project be successful using them, how can a project proceed without the "standards" being seen as a hindrance?

Well, there's always a slight hindrance for issuing and following defined procedures. There's always a hindrance to defining standard processes and procedures that are acceptable to the team and the organization. You just can't get away from that. So in this entry we will explore enabling tools and libraries of templates that will help you on your way.

ITIL: on the web at: http://www.itlibrary.org/
Has a plethora of templates, best practices, and standards for projects (including EDW projects). You need to order the books for these.

http://www.isaca.org/
Also has a large array of standards, templates, implementation paradigms and guidance based on SEI/CMMI Level 5.

Or of course you could seek out the Data Vault methodology and approach which has distilled down the templates specific to enterprise data warehousing, enterprise data integration. These templates have also been optimized for quick and easy to use build-outs of your projects. The Data Vault approach (when followed appropriately) helps you instantiate your goals to follow lean-initiatives, business process management, and cycle time reduction.

A few of the different templates that you should have in your project folder include the following:
* Statement of Work
* Service Level Agreement (templated, so you can fit the topic in appropriately)
* Roles and Responsibilities sheet (numbered in accordance with the project plan)
* Organizational Breakdown sheet (numbered in accordance with the project plan)
* Data Breakdown Structure (numbered in accordance with the project plan)
* Project plan (numbered - you guessed it - technically - 1.1.1, 1.2.1, etc...)
* Process Breakdown Structure (numbered in accordance with the project plan)
* Risk Estimation, Mitigation, and responsibilities sheets
* AS-IS and TO-BE data flow documents, and process flow documents
* AS-IS and TO-BE system architecture documents
* Project release plan
* Bug tracking/Enhancement tracking plan

and so on. There are a number of other documents required to make a project successful including the Statement of Work, possibly a letter of intent, Goals and Objectives, Phased approach definitions, Definition of "Success" criteria for development - test and production releases. Estimated person-hours, level of experience on the team (according to the Roles and Responsibilities), and training plan.

A good set of templates, coupled with a solid project approach can be utilized on any project from 800 person-hours to 50,000 person-hours. It can be used repeatably, it can be measured as to it's effectiveness, and when a specific "template" is left out, the RISK of removing that process from the project plan can be accurately assessed.

To be successful in one's endeavors one of the final ingredients is: the ability and desire to teach the client to fish, rather than implement what you have and walk out with your own methodology.... But then again, no one does that to you do they? :)

On the next secrets, we'll get in to what one of these numbering systems looks like and why it helps solve the pain in business today. We'll also address some of the issues plaguing IT, and keeping them from being "agile" in the business environment.

As always, comments are welcome.

Hope this helps,
Dan Linstedt
DanL@rapidACE.com


Posted July 16, 2008 3:15 PM
Permalink | 1 Comment |

ETL (extract transform & load), and ELT (extract load and transform) have been getting interesting rubs lately from different people in the market space. The issue is that most of what I read / see doesn't include dealing with super-large volumes of data. For example: every DFD (data flow diagram) I build is architected to deal with a minimum of 80Million rows per source, and is considered to be a small load. The medium sized loads deal with 150 Million rows per source, and the large loads deal with 300 million and up (to 1 to 2 billion rows per load). In this entry we explore the nature of both ETL and ELT as they relate to this size of data set. We'll cover problems, issues, and architectural changes that need to happen.

We'll also attempt to address the changing nature of ETL, ELT and what really needs to be achieved in the market space (something called ETLT where CHOICE is dependent on the task and size/performance of available hardware resources). With that, let's dive in.

These pros and cons are limited to items which are not necessarily available in both paradigms at the same time. Things like Security, Version Control, Code Optimization, and dependency analysis are available in both paradigms, and thus will not be listed.

For further reading, see my previous entry: http://www.b-eye-network.com/blogs/linstedt/archives/2006/12/etlt_or_elt_eit.php

ETL Pros:
* Can handle transformation midstream, outside the database walls
* Is tied to the scale and size of the mid-stream transformation hardware it sits on
* Can be spread across a computing grid without the high cost of "DBMS grid technology"
* Is usually database agnostic (as far as the transformations are concerned)
* Usually can manage metadata (Technical / Process Metadata)
* Generally is GUI driven
* Can interface with EAI, and EII along with other component layers like web-services and reporting tools
* Most of the time doesn't require a heavy knowledge of SQL (however, in VLDW situations, it is highly preferable to have deep SQL knowledge).

ETL Cons:
* Is limited by the platform it sits on
* Is limited by the source and target throughput speeds (including network and dbms systems)
* Must bring the data off the RDBMS engine into the transformation engine in order to tie it into the transform stream.
* Can lead inexperienced developers to setup highly complex mappings (which drive performance down, down, down)
* Cannot take advantage of hardware or DBMS scalability within the database engine platform
* Can require a proprietary "transformation" engine to be run in a middle tier
* Usually houses it's metadata in a non-standard architecture (another database schema/model as opposed to directly in the RDBMS header or system tables).
* Has startup and shutdown time

ELT Pros:
* Can utilize the partitioning and performance strategies set forth by the RDBMS and table structures
* Can utilize RDBMS specific feature sets
* Keeps up to date with the RDBMS releases
* Can process all data within the RDBMS caches and buffers
* Can process near-real time data with ease
* Is not dependent on network speeds, nor client connectivity (cache sizing)

ELT Cons:
* Requires 98% SQL Select statements (raw code)
* If it accesses stored procs / functions / triggers, then it tends to be in danger of not executing in parallel at all.
* Can lead to highly complex, difficult to maintain SQL statements.
* Leads to scripts to manage "strings of serially executing SQL statements"
* Makes use of many "temporary tables" in the database, requiring a lot more storage and a lot more horsepower within the DB platform.
* Can "lock up" the RDBMS engine from doing any alternative work
* Usually does not have a lot of accompanying metadata (like process metadata)

ETL's performance is reliant on many different moving parts, while ELT's performance is reliant on just the RDBMS and it's ability to scale. The larger issue is as follows: usually when I tune ETL platforms, there is a lot of performance we can get from the system before any hardware needs to be upgraded to handle additional load. When I tune ELT, usually we hit the need (very quickly) for increasing the RAM caches, and making use of bigger hardware.

In other words, when I tune an 80 million row load map on ETL, we can optimize the ETL engine, network, disk speeds, and database buffers to get anywhere from 80,000 to 120,000 rows per second throughput really quickly.

When I tune the same functionality of ELT (SQL Scripts), we get much higher performance - but require twice the amount of RAM, or CPU, or DISK or all three in order to make it work. If the hardware is scaled properly, ELT can run in the ranges of 120,000 to 250,000 rows per second. The statement? Running batch through ELT in a database can require significant amounts of hardware to make it go, most clients have under-powered database platforms, this can cost a LOT of money to scale up, where-as throwing in another Windows / Linux box for ETL engines might be half or less the comparable cost.

Where's the market today?
Today, most engines are getting smart and offering ETLT (designer’s choice). This is the best of both worlds, as it incorporates the GUI building, metadata management, integration features - as well as an ability to tune performance where necessary. In other words, loads that are acceptable (via user requirements) can take place in the ETL engine, where loads that require an extra performance boost, or particular tuning can take advantage of the ELT components.

What's the recommendation for when to use what?
We suggest that ELT only be used for a) small near-real time transactions or b) very large batches needing CDC in the target - but for this, you MUST be able to scale the hardware of your database engine.

We suggest that ETL be used for everything else, from process flow, to metadata management, to GUI development, re-usability and so on. Apply where appropriate. I've always found that by achieving a balance across both architectures, I can achieve an optimal performance setup - of course I end up having to tune the platforms as well.

What about JAVA ETL / ELT engines can they work for us?
Most open-source engines are written in Java, and yes - as of Java 1.5.x they are fast, efficient, and can be run to solve problems. Are they capable of handling the scale of throughput that we suggest at the top? Not sure, I've never run that kind of load through any of these systems (Pentaho, Talend, etc..) If you've got experience, we welcome the information.

These Java engines are actually quite advanced, and work very well.

Is there anything new on the horizon?
Yes, there is always something new.. But in this case, the nature of Transformation and the fact that it seems to "need" to be done, will stick around for a long long time. For as long as the computer has been around, people have needed it to "manage and transform" data. The new components (as discussed previously) should show up in the form of hardware cards, plug & play pieces that sit right inside the RDBMS platform, and offer specialized transformation - possibly even "plugging" directly into the RDBMS platform as an extension to the database system.

These cards will house their own CPU's, their own memory, their own high-speed busses - and will make use of connecting directly to disk as well as the database engines.

It will be something to watch, the expected performance gain of this kind of solution should be anywhere from 10x to 200x faster than both ETL and ELT platforms that exist today.

Cheers,
Dan Linstedt
Do you want a Masters of Science in Business Intelligence?
Get one from Denver University: http://www.COBICC.org


Posted July 19, 2007 7:17 AM
Permalink | No Comments |

EII has been getting a lot of buzz lately, especially with the purchase of Meta Matrix by Red Hat. I want to turn your attention (instead) to where EII needs to go as an industry. These are my opinions, and I welcome you’re constructive comments. EII (enterprise Information Integration) is a pull technology - grabbing data on-demand when needed from all kinds of sources, and building a single integrated view of the current world of "transactional data." So what's left?

In the future as we progress towards heterogeneous appliances, we will need EII more and more, especially with it's persistence of data in a virtual world. But what we are missing today are a few feeds on metadata (both business and technical), infrastructure and management of multiple web-services domains (both inside and outside the company walls), and the ability to track changes to data models - be it web service structure changes or physical data model changes in source systems.

EII will become more and more important as a back-office integration system and "glue" providing the framework needed to run the back-office more efficiently. I would expect that the EII tool of the future will pick up and integrate the appliances, along with managing the network of appliances in the plug and play scope. The more we can virtualize the information on a transactional level (and integrate it on the fly) the better we can manage all the back-office systems.

Furthermore, I expect the GUI of the EII tools to be focused more on the front-end users, bringing the integration management out of the back-office and more into the business user world. I believe that by focusing the EII GUI on plug & play nature it will provide additional power to business rule engines, workflow engines, processing engines, metrics engines, and of course metadata engines.

The EII GUI will reach the front office, and be simplified (as it should be), the the advanced interface will still be available for the IT staff, however business users should be able to switch context within their portals and not know or care that they are using EII for data exploration. Plugging EII directly into source data systems and pumping the data into MS-OLAP cubes (MDB), or Excel will push utilization forward.

Metadata collection systems are being built and focused on, particularly over the past year by all kinds of vendors including Meta Integration, ASG Systems, CA and so on. However, the interfaces used to collect and manage (not to mention link together) the metadata leaves a bit to be desired. EII is a perfect fit for integrating all kinds of metadata in a visual format, and providing a repeatable metadata integration and management front-end. By leveraging EII's ability to connect to all kinds of sources, and by visualizing the metadata stores we can easily combine the metadata into a common data model and write the metadata back.

Not only should EII be providing visualization of Metadata, but it should also plug in to the Reporting Tools out there, and provide the metadata feed on the fly with all the security and accessibility that the reporting tools offer. Management of the metadata MUST be created into a GUI somewhere, and it should be leveraged with EII's ability to not only "allow alteration" but provide write-back of the metadata to a common repository.

Summary:
EII of the future will have a much more robust GUI. There will be two different components to the GUI - a business user interface, and a technical interface. The EII GUI will become a BI tool in it's own right, and should plug and play with business rules engines, business processing workflow engines, web-services, metadata engines, and front-end applications like Excel and Microsoft OLAP cubes. The write-back capabilities should be leveraged to manage change, and assist with producing a common data model for the data to reside in (which will eventually be an exploration warehouse of sorts).

Do you have any thoughts?

Thanks,
Dan Linstedt
Check out a Masters of Science in Business Intelligence at: http://www.COBICC.org


Posted May 10, 2007 5:55 AM
Permalink | 2 Comments |

My last post discussed the notion of unstructured data being as much as 80% of the data that we in IT will / should begin to deal with. One of the readers requested that I expand on what I'm including in Unstructured Data. This entry discusses the types of structured/unstructured and semi-structured data as I see it. As usual, this pertains to business knowledge, and is a huge part of DW2.0. As it turns out, it also is (or will become) a huge part of changing IT from a cost center into a profit center; why? Because if we can integrate unstructured information, and glean the knowledge from it (determine contextual linkages), we can better understand where our business holes are.

There are three terms being bantered about in our DW2.0 world: Structured, Semi-Structured, and Unstructured data. Let's take a look at defining these terms and what they mean to us going forward.

Structured Data
Data that is sitting in a data store, defined by a catalogue (table definitions), something accessible via SQL, or data models, or Cobol Copybooks, or Object definitions. Data in rows and columns. Furthermore, this data has a characteristic of being contextualized by the heading (field name), and possibly defined in relation to other "fields". This data is also capable of being processed in a simple manner, summed, and aggregated and so on. What this data is NOT: is images, blobs, binary fields, free form documents, and so on.

Semi-Structured Data
Semi-Structured data seems to be that which houses structure with free-form elements, things like e-mails for instance, which have structure and context to specific elements in the header, but are free-form text documents in the body. Semi-structured comes in many forms, but it depends on what you are looking at as to whether or not the data is semi-structured or unstructured. For instance, semi-structured data for a fire-wall might be TCP/IP packets, where they care about the contents of the individual packet, along with a string of packets from the same IP to establish a pattern, and so on.

Unstructured Data
Unstructured data typically is all that which is not semi-structured or structured. For instance, images, this blog entry, content of web documents, standard documents, movies, audio, and so on.

What's the big difference? Why the hoopla? I thought Word Docs were structured!
Well, it all depends on your perspective. If the application is MS-Word, then in fact, the document itself is structured, however the CONTENT is not. Just like a web-page, the tags are structured, as are CSS elements, and XML, and HTML, and so on, the CONTENT is not.

Free form text (content) is NOT structured, until you are looking at a document which has sentences, and punctuation. Then, from a grammatical standpoint it is structured at a lower level of grain. But do you care? This is the big question. Just like we care about the grain of structured data, we should care about the grain of unstructured data.

We need to separate the terms: in an unstructured or semi-structured world, we need to make the choice: do we care about the "encapsulating structure" or do we care about the content or both? This is where the knowledge is, buried in the content, and doing something meaningful with the content.

Why?
Because unstructured and semi-structured and structured data are "one-and-the-same" when we talk about the encapsulating structure. All word docs for instance have markers, metadata, and processing instructions for Word to follow (layout, borders, size, color, font, etc..) All emails have standardized "structure", all images have specific processing instructions for standardized rendering engines, all audio (the same), all blobs, etc...

But when we talk about CONTENT, the playing field changes. Not all content is "the same", in other words, when you process a series of images, detecting when one is a face, one is a human, one is a tree, one is an ocean, etc... Determining WHAT the image is where the knowledge lies, and how it relates to other data based on WHAT it is - that's where the unstructured data processing lives.

Content derivation, assimilation, and integration is part of the story, once the content can be parsed, then hopefully basic outliers of context (important points) can be derived. In other words, like a search engine looking for key terms, but take it further than that: key terms that make sense or have relevance, ok: one more step further: not only have relevance but actually tie together what's duplicate, what's not, and learn from "elimination" of search results that the context is not relevant for those particular search terms...

This is just one example. Anyhow, all of this relates to DW2.0 and the stack within. Unstructured and semi-structured and structured data are NOT the same within a contextual sense, but are the same from within a structural encapsulation sense. In DW2.0, we must integrate the contextual information (meaning mine, and link together) in order to increase our awareness of what's going on in both the external and internal worlds of the corporation.

In order to make money, increase profits in IT, and actually provide more business value back to the business we MUST as IT professionals, undertake automation, and data mining of unstructured information, along with contextual integration as a step forward or we will lose sight of valuable information (particularly competitive).

As always, in the next blog I'll talk a little more about approaching IT automation, and how to integrate unstructured information into your enterprise from a DW2.0 perspective.

Please don't hesitate to comment, or ask questions.

Thank-you,
Dan Linstedt
Get your Masters of Science in Business Intelligence at: http://www.COBICC.com


Posted March 21, 2007 4:38 AM
Permalink | No Comments |

I had some good questions come in recently, thank-you. In this entry I'll share my experiences with ETL and ELT with regards to metadata; I'll also try to elaborate on when it is right to use which type of technology. This also goes back to my original articles in Teradata Magazine on ETLT.

When to use which technology.
As it turns out (and this is a good thing), not all systems have the super-huge volumes that I've been discussing. Bill recently documented a specific case where too much granularity is a bad thing - which I agree with. There's a time and place for asking the question: what's the value of this data set? Is the value increased or decreased by going to finer grain of data? But that's for another blog.

Bottom line: ETL still has it's uses, and will continue to be useful moving forward. Here are some typical cases and reasons why you might still want to use ETL:
* When volumes are medium to low sized, in my estimation, a single feed (single file) to be processed, that is under 180M rows can easily be processed by ETL when architected properly.
* When the time-line for loading the data is not super critical, in other words even if I have 500M rows, but I have 2 or 3 days to load, because it's a monthly feed, then maybe this is Ok too.
* When the source systems provide data slower than 60,000 rows per second - or the networks are restricting data flow. In these cases, you really have no choice. You now have the time (due to slow sources) to do transformation in-stream, so why not take advantage of this?
* When the target system won't perform at more than 50,000 or 60,000 rows per second, even if it's been tuned. (see my last bullet)
* When code and reference tables are needed to populate default values or provide simple data replacement through key matching.
* When small micro-bursts of data are arriving in real-time, and we need to perform transformation in-line between systems, or between queue's.

In other circumstances we will use the ETL engine to provide EL in parallel and partitioned jobs, providing maximum throughput with minimum development time.

So what does ETL provide in the metadata space:
Generally ETL's metadata repositories have been around for a little bit longer than the ELT tool vendors' repositories, and those ETL tool sets now dabbling in ELT are just growing into this area.

If we look at ELT from a scripting point of view, and not using a tool to execute, we lose all kinds of metadata from process flow, to business logic, to full traceability. Particularly this is seen when scripting Teradata BTEQ, Oracle SQL Loader and Stored Procedures, SQLServer command line, and DB2 UDB Functions, and so on... Any time we return to CODE base for ELT, we're back to deciphering scripts to figure out metadata, and much of it is hidden. Unfortunately (today) most of the ELT that I run into is still hand-coded, although this is set to change in the near future.

This is where people stop me and say: well, what about Sunopsis? They were 100% ELT... True - but their metadata repository left a LOT to be desired, they were young in the field, and hadn't quite grown up yet.

Ok, so what about Informatica, Ascential, Ab-Initio, Business Objects Data Mover, and so on...
They grew up in the ETL space and have strong metadata components, however, they are just now completing their transition to ELT - or in some cases, they've completed it, but it's not 100% compliant with the all the database features available that I can see.

So then this argument kicks in: Well, ok - but what about when I override the SQL in these tools? That get's me 100% compliance with ELT and database feature sets... Yes, but then some tools simply don't have the SQL de-composition routines built into their engine yet. They can't pick these apart for the metadata lineage and achieve end-to-end lineage.

For those tools that offer end-to-end lineage with SQL, the next thing that I find is:
Some database engines do not record the number of "rows" which were operated on within these SQL statements that executed within the database. They record them, but the tool sets don't pick them up. Other database engines record this information (for every SQL statement) in metadata tables in the RDBMS - as it should be!

Finally, when it comes down to it - ELT requires much simpler processes, resulting in many more steps, resulting in a lot more "temporary" tables during execution cycles, metadata is bound to be lost through this process (today) - this is also due to change in the future, and get better as the tools find the gaps and fix them.

Now what are some of the pros and cons of ELT?
Pros:
* ELT leverages RDBMS engine hardware for scalability
* ELT keeps all data in the RDBMS all the time
* ELT is parallelized according to the data set, and disk I/O is usually optimized at the engine level for faster throughput.
* ELT Scales as long as the hardware and RDBMS engine can continue to scale.
* ETL can achieve 3x to 4x the throughput rates on the appropriately tuned MPP RDBMS platform.

Cons:
* ELT relies on proper database tuning, proper data model architecture, normalized data model architecture
* ELT relies on MPP hardware
* ELT can easily eat 100% of the hardware resources available for complex and huge operations
* ELT can't balance the workload
* ELT can't reach out to alternate systems (all data must exist in the RDBMS BEFORE ELT operations take place)
* ELT easily doubles, triples, and quadruples disk storage requirements (more processes, each simpler, each requiring intermediate temporary tables).
* ELT (sometimes) is not 100% metadata lineage traceable.
* ELT can take longer to design and implement, more steps, less complicated per step, but usually results in more custom SQL code (sometimes this is where metadata is lost).

Pros and Cons of ETL
Pros:
* ETL can balance the workload / share the workload with the RDBMS
* ETL can perform more complex operations in single data flow diagrams (data maps)
* ETL can scale with separate hardware.
* ETL can handle Partitioning and parallelism independent of the data model, database layout, and source data model architecture.
* ETL can process data in-stream, as it transfers from source to target
* ETL does not require co-location of data sets in order to do it's work.
* ETL captures huge amounts of metadata lineage today.
* ETL can run on SMP or MPP hardware

Cons:
* ETL requires separate and equally powerful hardware in order to scale.
* ETL can "bounce" data to and from the target database, requires separate caching mechanisms which sometimes don't scale to the magnitude of the data set - this can result in scalability and performance issues.
* ETL cannot perform as fast as ELT without twice the size of hardware (usually for RAM and CPU resources).

So, you see - it all depends on how you look at the picture and what you want. Ultimately the nirvana is that no metadata will be lost, regardless of database specific SQL, or ETL "functions / transformations" utilized. The other nirvana is that you will ultimately design an "ETL" stream, and when it no longer performs, flip a switch on a job that says: turn this whole thing into ELT for me without me re-designing; that is the day when ELT will have grown up on the job, and be our best friend.

Of course, it all depends on where you can afford to put your system load, and where you can spend your money to achieve scalability....

Hope this helps,
Dan Linstedt
CTO, http://www.MyersHolum.com


Posted December 15, 2006 10:53 AM
Permalink | 2 Comments |

Ok, I've said it before in previous entries, I've discussed ELT and ETL and loss of metadata here on the blog before. I've worked in both situations, I've worked in VLDW for 8 years, I've worked in ADW (Active or real-time data warehousing) for the past 3 to 4 years, I've been involved with non-data warehousing data integration projects using ETL and massive volumes. Now I'll say it again - there's a difference, a time a process, or set of processes, and singular points of architecture where everything converges. If you've seen me present VLDW, you've seen my Pyramid diagram that shows the impact of volume and latency on the number of ways to "execute". This entry redresses ELT or a term I wrote about in Teradata Magazine over 5 years ago called ETLT.

Warning: read on if you like a good "one-sided" rant... I lay it on the table here, after all my years of experience, I just need to share. As always, I'm open to other sides of the argument - and invite anyone with alternative views to comment on this entry. If you agree, I'd love to hear from you as well...

I've had the pleasure today of sitting through a two hour architecture meeting. The client we are working for at the moment has to move (consistently) 1 Billion rows during a single load in a single process, the maximum growth rate for this single target is potentially 1 Billion rows a month. Of course, this is the maximum rates. 80% of our data sits on average around 100 Million a day; we have maybe 10 to 50 source tables to work with at these volumes.

What would you do in this situation? We are working with a particular ETL tool which has ventured into ELT land recently. We are also working with the worlds leading VLDW database engine (ranked according to Gartner Magic Quadrant). By the way, if you think this is large - you should try dealing with Telco Switch Data - same problem, same size of data, different data, and different requirements - same issues.

Our requirements are to take "micro-batches", and real-time data feeds in the future, along side the standard volume batch updates. There's no question that the architecture must be accommodating to this volume, that the data model must be accommodating to the overall architecture and data arrival timings.

What am I saying when I state: the larger volume, and/or the lower the latency, the less architecture choices you have to solve the problem?

Well, here's an example: if I ask you to load 500 rows of data and tell you, you have 1 hour to get it done. You can probably tell me, 1000 different ways (architectures / designs) that will work - from the data model to the coding solution, to the architecture of the data flow (ETL or ELT - it won't matter). And I'll say great: make it happen. Now, if a year goes by, all's well, and I tell you, wait: you now have 5 minutes to load 500 rows - you might say... Hmmmm our current process runs in 8 minutes - we need to re-architect. Or maybe you'd say: we need a larger box / faster box to get it done. I might say: ok - make it happen.

Now if I tell you instead: you now have to load 100 million rows, oh - and you still only have 1 hour to get it done. You might say.... "You’re crazy" or "we have to re-architect", there might only be 5 ways instead of 1000 ways to get it done. Now if I change the scope and say (6 months later): 100 M rows must be loaded in 15 minutes... You might tell me, we have to change the data model, change the base-architecture, change the loading paradigm, and change the hardware and the tool sets we are using to get it done.

This is at the crux of what I say when I talk about the fact that "real-time" is really "right-time", and that the cost of VLDW (to implement) increases exponentially when your time factor is reduced below a 8 to 10 minute refresh rate, and your data set stays consistently large. But back to the point of this blog - how do you solve this problem?

Well, everything, and I mean everything starts and ends with architecture - first and foremost data model / data architecture, second in line is hardware _architecture_ i.e.: MPP, SMP, Clusters, Grid, etc... Third, is data processing architecture - everything converges on a single point: performance, performance, performance. All roads lead to a single architectural solution: ELT, with MPP, with a SINGLE SPECIFIC NORMALIZED DATA MODEL - no R.I.!! Not at the Billion+ row mark... no way.

Surprised? Maybe, maybe not.

My point: ETL is changing folks, to ETLT (ETL & ELT) mixed workload. Why? Because there's a COST involved in going 100% ELT, there's a loss of metadata as transformation pieces are broken up into smaller execution steps, and run as ELT (disappears into RDBMS complex SQL logic). The kicker? There's still a place for ETL, to take advantage of development and maintenance speed. There's still a place for where ETL is good enough, where Metadata in-stream is good enough, and massive volumes simply don't exist.

Now, let's look at this from a STANDARDS side of the house:
In order to maintain ETL and ELT separately, you quickly realize that this breeds dual-standards, and a divergence of execution states. Pretty soon it becomes clear that maintaining BOTH (regardless of volume) simply costs too much. But wait a minute; the opposite is the COST of executing one or the other methodology. The cost of maintaining ETL at this level is: * Performance (high cost) - the cost of maintaining ELT at this level is usually: * DISK SPACE, and HARDWARE to run the MPP, loss of Metadata - in some cases.

So you see, achieving a balance across the standards (due to cost) is vital in this time of "changing lines" and shifting sands. In the future ELT will be better supported by the remainng vendors, and it will become easier to "flip a switch" and let the traditional ETL tool make the decisions as to what to execute.

Back to standards: what do we really want with an architecture?
* Repeatability
* Consistency
* Reliability (up-time)
* Ease of maintenance
* Ease of build-out/development
* Performance (optimized - according to SEI CMMI level 5)

Ok - these are conflicting goals today. ELT provides most of these at a cost, ETL on the other hand also provides most of these (again at a cost), but they both lead to different architectural implementations. So why can't they co-exist? Because ETL runs on data that's cached within the ETL server - breaking coherency with "real-time" feeds that exist within a true active data warehouse (the cache is extremely difficult - if not impossible) to synchronize to the real-time changes occurring within the RDBMS. That's right - the closer to right-time the warehouse gets, the more it becomes operational in nature, the more operational, the more it requires transaction consistency.

ELT on the other hand requires tons of staged/cached data steps along the way within the RDBMS - particularly to execute against this level of volume.

Both can be created to be repeatable, and restartable, and consistent, and optimized. However what we learn (quickly) by mixing the two is that ETL has a different set of technical issues - mixing a cache with a right-time data warehouse is just one of the problems we see. The other is restartability begins to erode at a certain level. Jobs that were once restartable (in one form or the other) are now reliant on major checkpoints to re-establish caches, or to reset "staging tables." It's a bad mix...

So what should I do?
You need to weigh the costs and benefits of each approach against the volumes and latency time frames that your business can / will sustain within the next year - then decidedly document and issue decrees about how the architecture will work going forward. At some point volume and low latency push us all over the edge (today) to an ELT solution. Maybe in the future an "appliance" of sorts will address these issues and it will all be hidden in the background.

What are you're thoughts / Comments?

Thanks,
Dan Linstedt


Posted December 6, 2006 3:04 PM
Permalink | 2 Comments |

I just blogged on the need for allowing dirty data to flow through to an auditable reporting area. There are a lot of questions about WHERE the dirty data should reside, and where the dirty data should be cleaned up. In this blog entry we'll dive in to take a look at that in a short consolidated view. If you are fighting with compliance and auditability at a systems or data level, hopefully this will be helpful.

In Nirvana land, there would never be dirty data. Our source systems would be perfect, and would capture all clean data. And if the data wasn't clean it would go to great lengths to get it that way before releasing it to the rest of the corporation. Of course in Nirvana land we would have a SINGLE STACK (one operational system, one data store that had both - current / transactional and historical / warehouse like) information system.

In Nirvana land the single system would integrate with web-services external to our company. In Nirvana land, this "single system" would really be the be-all-end-all, dare I say it: Appliance / Platform / pre-configured with all that you need, everything from DSS to source system OLTP and edit checks to information quality, and reporting and dashboards. In Nirvana land, it would all be... well... nirvana.

But alas, this is not the case. In the real world, for years - the data warehousing industry has taught or preached that data quality should be between the source system and whatever the next "step" of integration is: which might be an ODS, an EDW, an ADW, a Kimball Bus Staging area, and from a strategic point of view - this may work nicely. As long as whatever tools are being used to change and alter the data produce the correct audit trail.

However from a tactical or operational point of view, or from a point of view where we are trying our best to "catch" broken source systems, it doesn't work so well. Cleansing the data or using data quality to produce a "single version of the truth" can have a negative impact on our ability to be accountable (as business users) and actually get the source systems fixed. That said: where oh-where do we put the IQ systems?

Between Sources and ODS?
Between Sources and Data Warehouses?
Between ODS and Data Warehouses?
Between Data Warehouses and Data Marts?

And where oh-where do we put the "bad" records? Do we kick them back to the source system? Do we report on them separately for end-users to clean up? Do we leave them in a daily-snapshot set of error tables? Do we let them through our ODS, and Data warehouse environments to end up in separate data mart areas?

Remember that TRUTH is subjective, and no one can tell "who's truth it is - or when". And also, that auditors have a very different definition of what truth is, they try to take an objective view of truth - one that is traceable and auditable.

Putting DQ or IQ between sources and the ODS allows Master Data to be built within the ODS - the caveats are that the bad data, pre-altered data trails, and any audit trails produced by the integration or "data alteration" stream MUST be warehoused somewhere over-time. Simply "improving the quality of the data" without showing the errors to the business causes all kinds of heartburn downstream. Kicking bad records back to the source system is fine (for a while), until more code must be hand-written on the source system to take it as a new feed, this drives maintenance costs up through the roof. Having quality / cleansed data in the ODS allows great interaction from a strategic standpoint, and possibly allows transactional feeds back to the source systems. Ultimately it greatly increases the complexity of the solution that is handling "BAD DATA" up-stream.

Putting DQ or IQ between the Sources and the Warehouse can (and often does) cause auditability problems. Not many of these ETL or ELT tool sets have the capacity to create data "as of before it was changed/aggregated/filtered", that must be architected and increases the difficulty of the load routines - not to mention slows down the overall load cycles. Furthermore, Bad data must be captured in "error tables" which are usually in the staging area, sometimes these are only daily refresh - no analytics can be run on these data sets, they disappear to fast. As we move towards real-time or near-real-time processing, there is NO TIME to process, integrate, aggregate the data (either between sources and ODS or between Sources and your DW). Real-time dictates that when the data is available, good bad or indifferently, that it be loaded to your warehouse AS IS - no delay's, otherwise the data decays and becomes useless, or the "attempt to detect fraud" is cleansed away, cleaned up and never detected.

Putting DQ or IQ between the Warehouse and the Data Marts is probably (in my opinion) the best way to handle this. The Data Warehouse then becomes a statement of integrated facts (integrated by business key), and the marts then become specific versions of the truth from which management can begin to dissect where they are losing money.

This all applies when loading "today’s" data, and in fact, cleansing with audit trails before reaching either the ODS or the DW applies when loading "today’s" data as well. What doesn't fly is the data model, once the data model has been constructed for "cleansed" data, it very rarely allows broken data (or error data) in, in other words, loading historical data wreaks havoc on the data model, or dirty historical data is cleansed - and through that process the "broken business process / broken business rules" are cleansed and washed away as well.

Now, is there value to cleaning up the source systems? You bet, this is the ultimate goal. Nirvana. Is there value to having cleansed data in your ODS? Absolutely! You get the idea, there's definite value in having cleansed and quality checked data, but there's also value in running analytics on the dirty data, when did it break the business rule (back in history), why? What was the pattern? Then the real work begins: what caused the pattern? Is the pattern still happening today? Is the business losing money or opportunities with customers because the source system is still broken? And so on.

So next time you think about where to PUT your DQ/IQ solution, consider what the impact is of NOT looking at the dirty data, ask the question: instead of kicking the data back to the source system, is there somewhere we can warehouse this information to examine it's patterns? I think you'll be surprised at all the business accountability you'll uncover.

Love to hear your thoughts on this.

Dan Linstedt
CTO
Myers-Holum, Inc


Posted August 29, 2006 6:37 AM
Permalink | 2 Comments |
Search this blog
Categories ›
Archives ›
Recent Entries ›