Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW, and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

The market has been asking for EDWs to deal with more and more real-time data.  IT, on the other hand, has become "slower and less agile" as its current system of federated data marts grows larger and larger.  In this entry we will deal with some of the issues and some of the questions, and of course offer an opinion on dealing with true real-time data sets arriving at the doorstep of the EDW.

First, what exactly is real-time data?

Well, when we put it that way, there is no such thing.  To be honest, it's always near-real-time.  If it were truly real-time, we'd have the data the instant it was created.

So what is near-real-time data as opposed to batch data?

In my book, near-real-time (or streaming, or semi-streaming) data is anything that arrives at intervals of less than 5 seconds.   Even at a constant interval of every 5 seconds, I would have to say mini-batch might take place.  Quite honestly, there needs to be a continual flow of transactions; in some of the systems I've dealt with, the transaction rates are usually millisecond or even sub-millisecond based.  In any event, near-real-time means (to me) that data arrives too fast to do anything with it in flight (like apply business rules and cleansing), and that if you did apply these things, you would end up with a backed-up pipe on the inflow.

Now, everything else becomes a mini-batch, burst-rate-batch, or large-batch based system.
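To make those cutoffs concrete, here is a minimal sketch of the classification just described. The 5-second near-real-time boundary comes from the definition above; the mini-batch/large-batch boundary and the function name are purely illustrative assumptions on my part.

```python
# A minimal sketch of the latency classification described above.
# The 5-second near-real-time cutoff comes from the post; the other
# boundary (and the function name) is an illustrative assumption.

def classify_feed(arrival_interval_seconds: float) -> str:
    """Classify a feed by the average interval between arriving transactions."""
    if arrival_interval_seconds < 5:
        return "near-real-time (streaming / semi-streaming)"
    elif arrival_interval_seconds < 300:      # assumed boundary: under ~5 minutes
        return "mini-batch / burst-rate batch"
    else:
        return "large batch"

print(classify_feed(0.001))   # millisecond-rate feed -> near-real-time
print(classify_feed(60))      # once a minute -> mini-batch
print(classify_feed(3600))    # hourly -> large batch
```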

How does this affect me?

Well, it really affects your traditional batch loads, particularly if you "put your business rules upstream" of the data warehouse.  By having business rules (quality, cleansing, data alteration, etc.) upstream of the warehouse, you are immediately introducing a processing bottleneck and "disabling" your system from handling near-real-time feeds (as I've defined them).

Example: if the EDW receives 10,000 transactions a second, and it takes your "batch load process" 1 minute to load 80,000 rows, you have a bottleneck; as the quick calculation below shows, the backlog grows by hundreds of thousands of rows every minute.  You simply cannot run near-real-time without one of the following: new hardware, faster hardware, RE-ENGINEERING of the batch process, RE-ARCHITECTURE of the EDW, and so on.  If the reason you can only load 80,000 rows per minute is the business rules upstream, and you throw new hardware at it, you have just instituted a short-term stop-gap measure.  The risk that you will hit this bottleneck again in the very near future is very, very high.  Eventually you cannot afford to throw money at it anymore, and you are stuck re-engineering or re-architecting to solve the problem.
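To see how fast that bottleneck compounds, here is a back-of-the-envelope sketch using the numbers from the example (the rates are the ones quoted above; the variable names are mine):

```python
# A back-of-the-envelope check on the example above: 10,000 tx/sec arriving
# versus 80,000 rows/minute loaded through the upstream business rules.

arrival_rate_per_sec = 10_000
load_rate_per_min = 80_000

arrival_rate_per_min = arrival_rate_per_sec * 60          # 600,000 rows arrive per minute
backlog_growth_per_min = arrival_rate_per_min - load_rate_per_min

print(f"Arriving per minute: {arrival_rate_per_min:,} rows")
print(f"Loaded per minute:   {load_rate_per_min:,} rows")
print(f"Backlog growth:      {backlog_growth_per_min:,} rows/minute")
# -> the pipe backs up by 520,000 rows every minute; after one hour the
#    warehouse is more than 31 million rows behind the source systems.
```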

The business sometimes sees re-engineering and re-architecture as a sign of weak or incompetent IT; other times they call it an "EDW failure" when the costs to re-engineer grow too high.  So they shut us down, hire a new team, and start over.

Well, there's a simple solution to this problem: MOVE YOUR BUSINESS RULES DOWNSTREAM OF YOUR DATA WAREHOUSE!!  Move them to the OUTPUT side of the EDW, between the EDW and the data marts.  Then allow RAW data to arrive and land in the EDW; that is to say, allow the "good, the bad, and the ugly" data in as soon as it is ready.
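Here is a minimal sketch of the shape of that flow, assuming a toy feed: the load path does nothing but land raw rows, and the business rules run afterward, between the EDW and the mart. The queue, the function names, and the stand-in "rule" are all illustrative assumptions, not anyone's specific product or method.

```python
# A minimal sketch of landing raw data fast and applying business rules
# downstream, between the EDW and the data marts. All names and structures
# here are illustrative assumptions.

import queue
import threading

inbound = queue.Queue()     # stands in for the near-real-time feed
edw_raw = []                # stands in for the raw EDW landing area
data_mart = []              # stands in for a downstream data mart

def land_raw():
    """EDW load path: no rules, no cleansing -- just land what arrived."""
    while True:
        txn = inbound.get()
        if txn is None:
            break
        edw_raw.append(txn)                 # the good, the bad, and the ugly all land

def apply_rules_downstream():
    """Output side of the EDW: business rules run here, on the way to the mart."""
    for txn in edw_raw:
        if txn.get("amount", 0) >= 0:       # a stand-in 'business rule'
            data_mart.append({**txn, "amount_usd": round(txn["amount"], 2)})

# Feed a few transactions through the raw landing path.
loader = threading.Thread(target=land_raw)
loader.start()
for i in range(5):
    inbound.put({"id": i, "amount": (-50.0 if i == 3 else 100.0 + i)})
inbound.put(None)           # sentinel to stop the loader
loader.join()

apply_rules_downstream()
print(f"Landed raw: {len(edw_raw)} rows; passed rules into mart: {len(data_mart)} rows")
```

The point of the design choice is that the load path stays constant-time per row no matter how complex the rules get, so the inflow never backs up; all the expensive work happens downstream, where it can be re-run against the raw history at any time.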

This is critically important, and it is truly the only way to handle near-real-time data at sub-second arrival latencies.  It is the only way to scale the EDW appropriately.  Furthermore, it is the best method (I believe) for answering auditability and compliance questions in real time.  I implemented systems like this over 10 years ago, and they are still standing, and growing, today without the need for re-architecture or re-design!!  The business is HAPPY with the IT team, and the EDW is a "success" according to the business.

What interests you?  Do you have "real-time" needs?  How have you been able to successfully meet them?

Thanks,

Dan Linstedt

DanL@GeneseeAcademy.com


Posted May 9, 2009 5:25 AM