Blog: Dan E. Linstedt


Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for Master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

Recently my discussions in the field have centered on Information Quality (or the lack thereof), the EII tool set, and the Active Data Warehouse (right-time data warehouse). We'll explore this exceedingly dry (but hopefully interesting) topic in this entry, particularly in relation to Compliance and Integration - though I felt it fits under SOA as well - so here goes.

Information Quality (according to Larry English) includes the business processes, data, reporting, and people involved in interpreting the information. But Information Quality both helps and hurts compliance efforts, particularly when the corporation is audited.

One of the over-simplified definitions of SOX (at least at the data level) is: can your system show the "before it was changed, after it was changed, and when the change occurred" audit trail? Without being able to answer these questions, any "software product" that claims it is SOX compliant is flat-out wrong.
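To make that concrete, here's a minimal sketch (Python, with hypothetical names - not any particular vendor's schema) of the kind of audit record that can answer those three questions:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical audit record - one entry per change - answering the three
# questions: what was there before, what is there after, and when did
# the change occur (plus who or what made it).
@dataclass
class AuditRecord:
    table_name: str
    row_key: str
    before_value: str      # value prior to the change
    after_value: str       # value after the change
    changed_at: datetime   # when the change occurred
    changed_by: str        # who or what process made the change

def record_change(audit_log: list, table: str, key: str,
                  before: str, after: str, actor: str) -> None:
    """Append a before/after/when entry to the audit trail."""
    audit_log.append(AuditRecord(table, key, before, after,
                                 datetime.now(timezone.utc), actor))

# Usage: every change to a SOX-relevant value leaves a trail entry.
trail: list = []
record_change(trail, "customer", "42", "ACME Corp", "Acme Corporation", "etl_job_7")
```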

What does this have to do with EII?
EII can be quality-driven too - but in doing so, it can and will break compliance IF it is told to transform data in the middle without producing an audit trail of what it did, what it used, and when. This is where the write-back capability of EII comes in handy: we almost need an EII-SOX warehouse to record the information flowing THROUGH the EII tool set in order to meet compliance and auditability.
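As a rough sketch of what that write-back might look like - hypothetical names throughout; a real EII tool would do this through its own extension points:

```python
from datetime import datetime, timezone

# Hypothetical "EII-SOX warehouse": a store recording what the EII layer
# saw (before), what it produced (after), and when - so an in-stream
# transformation remains auditable.
baw_store: list = []   # in practice a durable table, not an in-memory list

def standardize_phone(raw: str) -> str:
    """Example in-stream quality transform: keep digits only."""
    return "".join(ch for ch in raw if ch.isdigit())

def eii_transform_with_writeback(record: dict) -> dict:
    """Transform in the middle, but write back before/after/when (BAW)."""
    before = dict(record)                                 # what it used
    record["phone"] = standardize_phone(record["phone"])  # what it did
    baw_store.append({"before": before,
                      "after": dict(record),
                      "when": datetime.now(timezone.utc).isoformat()})
    return record

# The consumer sees cleansed data; the BAW trail survives for the auditors.
row = eii_transform_with_writeback({"id": 42, "phone": "(802) 555-0100"})
```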

I've written an EAB (executive action brief) on this site that talks about "making your data integration processes compliant" - click on B-Eye-Network, go to the HOME page, and look for the "education" link in the lower left corner.

Now, if EII is _not_ transforming data, then the data set it pulls from should be "SOX compliant" - it shifts the onus back onto the source systems to maintain audit trails of any information they change. For source systems, this is a no-brainer: they are capture systems and are supposed to be "systems of record" for the business, which means the business is already supposed to "trust" these systems - even though the data quality may or may not be there.

Time out - this doesn't make a lot of sense. Where's the quality in all of this?
OK - compliance is one thing, but the reason I talk about it is this: Information Quality tools CHANGE DATA under the covers. Therefore, in order to meet compliance initiatives and be auditable, we must surround these tools with a before-and-after process. At the DATA level, this means that if we introduce "quality processes" in-stream with EII, we could be in serious trouble with compliance - again, unless we record the effects (before/after and when).

Quality tools are nothing more than transformation engines (OK - they do a LOT more than that), but when you get down to the bare bones, they are CHANGING DATA sets. Therefore, everything that applies to ETL/EAI and data mining (in accordance with compliance) also applies to EII, and to the processes that load active warehouses.

Wait a minute! Active warehouses have a refresh cycle that's too fast to put a quality trigger in play, right?
Right and wrong - remember, active warehouses are "right-time" warehouses; it's all about latency. Some active warehouses cannot use quality initiatives in-stream because the data decays too fast.

Now, what we will say is this: even ADWs still have "strategic" initiatives within them, which means that only the tactical side forgoes the quality settings (until the strategic quality engine cleanses the historical data - and sometimes that historical data is returned to the source during transactional/tactical processing).
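A rough sketch of that split, under hypothetical names: the tactical path lands data untouched to keep latency low, while the strategic pass cleanses history later and can return corrections to the source:

```python
source_system: dict = {}   # stand-in for the operational capture system
history: list = []         # historical data landed in the ADW

def tactical_load(row: dict) -> None:
    """Low-latency tactical path: no in-stream quality, just land the data."""
    history.append(dict(row))

def strategic_cleanse() -> None:
    """Later, strategic pass: cleanse history; optionally push fixes back."""
    for row in history:
        fixed = row["city"].strip().title()
        if fixed != row["city"]:
            row["city"] = fixed
            source_system[row["id"]] = fixed   # corrected value returned to source

tactical_load({"id": 7, "city": "  denver "})
strategic_cleanse()   # history now holds "Denver"; the source got the fix too
```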

Remember this: Information Quality is SUBJECTIVE; it is one version/one flavor of the truth. Truth is subjective and will change depending on the eye of the beholder (the end user). Therefore, quality engines MUST be held accountable and auditable by surrounding them with processes that capture before-after-when (BAW).

Can EII use IQ tools or data mining processes in-stream?
Sure, and it probably should - especially when sourcing external or freely available data. I'm just saying that EII will have to take the extra hit and write the BAW back somewhere to be compliant. The challenge here is deciding when to initiate a quality process, and keeping it from impacting query timing too significantly. Now, if EII is pulling from the strategic side of the warehouse - wonderful; it should be pulling quality data (already altered/cleansed/patched).

Can ADW use IQ tools or data mining processes in-stream?
Yes and no - it varies depending on the latency requirements. Most ADWs at five minutes or less of latency don't run IQ processes in-stream; companies at this level use IQ tools plugged directly into their source/capture systems, which raises other questions, like: how do I find my broken business processes? But that's for another day, another entry.

Quality should come "after" the load of the raw data into the data warehouse, or "after" the load of the raw data into the EII engine. It should be secondary, and applied only if there is an audit trail mechanism in place to trace back to the original data.
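Here's a minimal sketch of that ordering (hypothetical names): land the raw data first, then apply quality as a second pass that keeps a pointer back to the untouched raw row:

```python
from datetime import datetime, timezone

raw_table: list = []     # untouched data, landed first
clean_table: list = []   # quality-applied copy, built second
audit_trail: list = []   # links every cleansed row back to its raw original

def load_raw(row: dict) -> int:
    """Step 1: land the raw data exactly as received."""
    raw_table.append(dict(row))
    return len(raw_table) - 1            # raw row id is the trace-back key

def apply_quality(raw_id: int) -> None:
    """Step 2: cleanse afterward, keeping the pointer back to the raw row."""
    before = raw_table[raw_id]
    after = {**before, "name": before["name"].strip().title()}
    clean_table.append({**after, "raw_id": raw_id})
    audit_trail.append({"raw_id": raw_id, "before": before, "after": after,
                        "when": datetime.now(timezone.utc).isoformat()})

apply_quality(load_raw({"name": "  dan LINSTEDT "}))
```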

Thoughts and comments are welcome; I'll blog more on the subject if there's an interest.

Cheers,
Dan L


Posted October 26, 2005 6:33 AM

2 Comments

I think I'd have to beg to differ with you on the claim that EII will have to perform a write-back to satisfy an audit.

Consider this. EII, even if it is transforming data in-stream, is effectively stateless. The only state of the EII operation is the metadata that describes the data source, the transformation rules, the exact request being issued, when that request was issued, by whom, etc. So, given a compliant set of source systems that can meet the "able to reproduce a replica of what the state of data was at a point in time" requirement, the EII tool should be able to reuse its own metadata to produce the exact same output that it produced originally. In which case, you have all of the data (from the source system) and the process specification (from the metadata) to show the original results, explain how they were achieved, and explain why they might be different now.

Of course it isn't trivial to maintain metadata versioning like that. You've got to have a perfect record of the system configuration, either in source control or metadata backups. But I should think that you'd be able to fulfill compliance needs in an EII environment without forcing the system to actually log the data anywhere.

Hi Paul, thanks for your comment - I appreciate the feedback, and I understand why you'd think this way.

In my opinion, no system is "safe" from audits; the truth - even in operational systems - can be questioned in court. This raises the fundamental question: is there ANY data, stored anywhere, that can be called "fact" without a doubt?

But beyond that, the issue I have with EII not recording what it saw (before-after-when) is the case where the OLTP data changes in between queries - thereby nullifying any chance of EII actually producing the same result twice.

In all reality, though, if we make the assumption that 80% stays the same and 20% changes in OLTP systems, then I think we're safe with what you've stated, and I would agree: given a set of sourcing systems that can produce the data as it was at that point in time, write-back is not necessary within EII. In that case, it might just be an integrated warehouse that undertakes this task. When we get into unstructured data, we need to be careful about how and what is sourced - unless it too is included within the warehouse in one fashion or another.
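To illustrate Paul's metadata-replay idea, here's a minimal sketch (hypothetical names, and assuming the sources really can reproduce their state as of a timestamp):

```python
from datetime import datetime

# Sketch of the metadata-replay idea: if the EII tool keeps a versioned
# record of each request (source, rules, when, by whom), and the source
# can reproduce its state as of that time, the original output can be
# regenerated without the EII layer ever logging the data itself.
request_metadata = {
    "source": "crm.customers",
    "rule": lambda row: {**row, "name": row["name"].upper()},
    "issued_at": datetime(2005, 10, 26, 6, 33),
    "issued_by": "auditor",
}

def source_as_of(source: str, as_of: datetime) -> list:
    """Hypothetical: a compliant source replays its state at a point in time."""
    return [{"id": 1, "name": "dan"}]   # stand-in for the real snapshot

def replay(meta: dict) -> list:
    rows = source_as_of(meta["source"], meta["issued_at"])
    return [meta["rule"](r) for r in rows]

# Same metadata + same source snapshot => the same output as the original run.
print(replay(request_metadata))
```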

Thanks Paul, great comment! :)
Dan L
