Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author >

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

October 2009 Archives

OK folks, I'm back. I've spent the last six months reading and writing (and arithmetic... that's a joke, ha-ha). Seriously, though: I've been implementing solutions, visiting customer sites, and seeing what's going on in the world; and to that end, the DW/BI world is changing. You've seen me write about this before, but now it's right on top of you.

Like it or not...  Your Data Warehouse IS a System Of Record, and if it isn't in 2009, it WILL BE before the end of 2010.  So what does that mean?

You've heard about it for the past 2 years (at least): it's called Operational BI, and my good friends Claudia Imhoff and Colin White have published many, many articles and blog entries on these topics. But have you given any consideration to exactly what that means for the Data Warehouse?

Come on people, get with the times!

OK, down to brass tacks. To have "operational BI" you must have operational data. Well, wait a minute - operational data means system of record, doesn't it?

Ding Ding Ding, you win the prize... come on down!

Wow, I didn't know there was a prize... (there's not). Anyhow, operational data is required - and if you're talking operational BI, then you're talking about business intelligence, and if you mention business intelligence, then you MUST consider data warehousing...

So what does that mean?

It means you've GOT NO CHOICE but to consider an Operational EDW under the covers. That's right folks, I've said it for years, and now I'm not going to hide this fact from the light anymore. YOUR DATA WAREHOUSE WILL, FROM NOW ON, HOUSE OPERATIONAL DATA RIGHT BESIDE YOUR HISTORICAL DATA. This means the operational data will be captured from applications sitting directly ON TOP of your Data Warehouse.

Think I'm foolin', do ya?

Think again... These are the systems I've been building, or helping folks build, for the past 3 years - you don't think I've been sitting around resting on my laurels, eh? Far from it. I've been traveling the world, talking to folks about accountability, credibility, and RAW OPERATIONAL DATA sitting right inside the data warehouse. The combined result produces Operational BI. It also means that the resulting Operational Data Warehouse (ODW for short) IS a system-of-record.

Don't like it?  Tough.  Take it to the bank!

Customers like JPMorgan Chase, Lockheed Martin, Raytheon, the World Bank, ABN AMRO, RBS (Royal Bank of Scotland), Sisters of Mercy Healthcare, Blue Cross Blue Shield, the Department of Defense, the Food and Drug Administration, the Federal Aviation Administration, the City of Charlotte (North Carolina), SNS Bank Netherlands, and more have all used these principles to create complete ODWs in house today.

What do they need?

Accountable, auditable, flexible, and scalable data stores that act as an EDW while at the same time absorbing real-time transactional (operational) data directly into the EDW - thus creating a RAW ODW. It's here today folks, fed by web services (SOA), batch, and ESBs; managed and edited by applications that control metadata, carrying changes and audit trails along with cube write-backs and data edits. That's what ODW and Operational BI are all about: making the systems more robust, more scalable, more complete.
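To make that concrete, here's a minimal sketch of what "absorbing operational data directly into the EDW" can look like at the loading layer. The table, columns, and psycopg2-style placeholders are my own assumptions, not any particular customer's design; the point is that every record lands untouched, stamped with a load timestamp and its source system so the audit trail survives.

  from datetime import datetime, timezone

  def land_transaction(cursor, txn, record_source):
      """Land one operational record in the ODW, verbatim, with audit columns."""
      cursor.execute(
          """
          INSERT INTO odw.customer_txn_raw
              (txn_id, customer_id, amount, load_dts, record_source)
          VALUES (%s, %s, %s, %s, %s)
          """,
          (txn["txn_id"], txn["customer_id"], txn["amount"],
           datetime.now(timezone.utc),   # audit: when the warehouse saw it
           record_source),               # audit: which feed (SOA, batch, ESB) it came from
      )

Whether the record arrives from a web service, an ESB, or a batch file, the same rule applies: no alteration on the way in.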

At the risk of upsetting some folks: an ODW is the underpinning of successful OBI.

The ODW contains and inherits all the responsibilities of the data warehouse. In addition, because it houses operational data, it also inherits all the responsibilities of an operational system! Yep, you guessed it: nine-nines up-time, reliability, fault-tolerance, fail-over, guaranteed access, and high-speed transaction handling, on top of strategic data handling, historical record keeping, integration, etc. And don't forget: system-of-record.

But why is it a system-of-record? I thought other operational systems handled this?

Each operational system handles its own system of record, but with an ODW - think about it - you're saying: my data warehouse/BI solution is an operational system in addition to being a data warehouse. So you're also saying: this is the only place where a complete, single picture of the integrated operational and historical data lives. Which in essence means: you have built what needs to be a system-of-record.

What's involved in creating a system-of-record for the ODW?

Good question, and I wish I had time to answer it fully... OK, here's my opinion: you need to worry about all the operational rules, regulations, and governance, and you also need to worry about storing RAW (uncleansed, unchanged, unaltered) data sets in your data warehouse. That means your "business rules" to manipulate data can no longer be executed upstream of your data warehouse. You must move your business rules downstream of the ODW; otherwise you're changing operational data to make it fit, and blowing up any chance of an auditable system-of-record - ESPECIALLY when 3 months go by and the business rules change!

Are you saying I need "RAW" data in my ODW to make this work?

That's exactly what I'm saying. I'm not saying get rid of the business rules, and I'm not saying put "bad" data in front of business users. No - I am saying "move the processing downstream," to POST-ODW (between the ODW and data mart delivery). It will make your life MUCH easier.
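Here's a minimal sketch of what "move the processing downstream" means in practice, with hypothetical tables and a made-up business rule. The raw ODW table is never touched; the rule runs only on the way out to the mart, so when the rule changes in 3 months you simply rebuild the mart from the untouched raw history.

  RAW_TO_MART_SQL = """
  INSERT INTO mart.customer_txn
  SELECT txn_id,
         customer_id,
         -- the business rule lives HERE, downstream of the ODW:
         CASE WHEN amount < 0 THEN 0 ELSE amount END AS amount_cleansed,
         load_dts
  FROM   odw.customer_txn_raw
  WHERE  load_dts > %(last_load)s
  """

  def refresh_mart(cursor, last_load):
      # Re-runnable: change the CASE rule, truncate the mart, replay history.
      cursor.execute(RAW_TO_MART_SQL, {"last_load": last_load})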

This is absurd - I've never heard of anything like this before. Why should I care?

You don't have to care if you don't want to, but the world is a changing place. If you want an auditable and compliant ODW, or you want to add operational data to your data warehouse, you will have to follow these rules. Dare I say it: you might find yourself out of a job if you don't!

All the companies I mentioned above, and more, are learning this the hard way every day (through failed audits, through corporate restructuring, through spread-marts whose costs have gotten out of hand). They have decided to build ODWs to solve these problems.

It just so happens they are using Data Vault modeling to make this happen. You don't need a Data Vault model to get here - you must, however, follow the principles outlined in this article - but the Data Vault is an enabler that will help you get here faster. You can read more about the Data Vault model at http://www.DataVaultInstitute.com
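For readers who haven't seen it, here's a deliberately stripped-down illustration of the Data Vault flavor: a hub carries the business key, and a satellite carries its insert-only history, each stamped with load date and record source. The names and types are hypothetical, and real designs follow the full modeling rules at the site above - this just shows why raw, auditable history falls out naturally.

  DATA_VAULT_DDL = """
  CREATE TABLE hub_customer (
      customer_sqn   INTEGER      NOT NULL PRIMARY KEY,  -- surrogate key
      customer_bk    VARCHAR(50)  NOT NULL UNIQUE,       -- business key
      load_dts       TIMESTAMP    NOT NULL,
      record_source  VARCHAR(50)  NOT NULL
  );

  CREATE TABLE sat_customer_details (
      customer_sqn   INTEGER      NOT NULL REFERENCES hub_customer,
      load_dts       TIMESTAMP    NOT NULL,              -- history: insert-only, never updated
      name           VARCHAR(100),
      address        VARCHAR(200),
      record_source  VARCHAR(50)  NOT NULL,
      PRIMARY KEY (customer_sqn, load_dts)
  );
  """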

or contact me directly: DanL@DataVaultInstitute.com

Cheers,

Dan Linstedt


Posted October 28, 2009 12:46 AM

There are many trends afoot in the industry, but I think there are two or three that are here to stay, and quite frankly, they will make a huge impact on the way BI/EDW is done. Some of the ground-shaking, game-changing technologies include solid state disk, database appliances on the cloud, and data services on the cloud. I've been watching the database market for several years now, and here are a few of my predictions for 2010.

Here's the skinny of it:

1) Solid State Disk. It's not your grandfather's "RAM on a USB stick" plugged into the computer - no, it's high-speed, ultra-parallel, screaming-bus-performance RAM. As one analyst recently put it: "CPUs have gained 5 million times faster performance over the last 30 years, while disk drives have improved only about five times... SSDs offer 150x improvement in random I/Os per second over spinning disk." (Information Week, Oct 26, 2009, Douglas Henschen)

So what does this really mean in business terms?

In business terms, this finally brings the ability to execute high-speed parallel operations for massive BI queries and massive bulk-loading systems. But wait, there's more (yes, it gets better)... It also means we can finally take advantage of extreme flexibility in the logical modeling layers and worry less about the physical modeling components - which of course leads us right to the Data Vault modeling methodology, or even 6th normal form (it can become a reality if you like).

In business terms, if we have extreme flexibility in our architecture, we can finally enable IT to become nimble and fast again. It means we can reduce delivery times and execution cycle times. It also means we can increase up-time, and... oh wait, HOT DATA? Did all of you forget about temperature-driven data? Shame on you! Hot data will reside on the SSD devices, allowing hyper-cubes to be built, AND allowing DYNAMIC changes to the data models, the queries, and the loading systems.
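Here's a small sketch of temperature-driven placement. The 30-day threshold, tablespace names, and Oracle-style partition syntax are all assumptions for illustration; the idea is simply that recently touched partitions live on SSD while cold history stays on cheap spinning disk.

  from datetime import datetime, timedelta, timezone

  HOT_WINDOW = timedelta(days=30)   # assumption: "hot" means touched in the last 30 days

  def choose_tier(last_accessed):
      """Pick a storage tier for a partition based on access recency."""
      age = datetime.now(timezone.utc) - last_accessed
      return "ssd_tablespace" if age <= HOT_WINDOW else "sata_tablespace"

  def rebalance(cursor, partitions):
      # partitions: iterable of (partition_name, last_accessed_datetime)
      for name, last_accessed in partitions:
          cursor.execute(
              f"ALTER TABLE sales MOVE PARTITION {name} "
              f"TABLESPACE {choose_tier(last_accessed)}"
          )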

Why is that important to me?

Well, I'll tell you... In my opinion, it's important because you should care about the architecture you are using, you should care about the accountability and auditability of the data set, and you should (and usually do) care about the performance of the system you're building your solution on. And by the way: you're going to have MORE and MORE data to deal with. Like it or not, the EDW/BI solutions of today are accountable (they have become a system of record in their own right).

Yeah, OK, so what... solid state disk, faster machines, phooey...

You can say what you like, but the IBM Smart Analytics System, Oracle's Exadata, and the Teradata Extreme Performance Appliance 4555 are just a few examples of these notions. What would happen (do you think) if ParAccel, or Vertica, or Sybase IQ jumped on this ship? Especially into the Amazon EC2 cloud... Hmmm, now it gets interesting!

Which brings me to my next point:

2) Database technology is changing for 2010. RDBMSs making use of SSD will change the game - it will be an uneven playing field - but then, once these devices are available on the EC2 cloud (whether it's Amazon or Google providing the services), based (mostly) on HADOOP, it won't matter. The RDBMS technology will get faster, and in-company network bandwidth will be OUTSHINED by cloud-provisioned bandwidth.

What? What does "outshined by cloud-provisioned bandwidth" mean? I don't get it...

You will have to get it if you want to lower cost and succeed in business going forward. What I mean is: Data Warehousing and BI solutions will be offered as an outsourced service. That's right folks, for the first time in DW/BI history (with just these two advancements coupled together), Data Warehousing and BI as a service will become a reality. Mark my words! OK, don't take my word for it...

All you need to do is call up a company called Quantivo: they are processing/ingesting 7 billion customer records and delivering 2 to 3 second query response times against this sizable data. What do they sell? Packaged data warehousing/BI analytics. In other words, they do the loading of the raw data, they manage their internal data model, they perform the cross-mapping and the indexing, and they build the "data mart" availability of the data set.

So what's the catch, there's always a catch, and I want to know what it is!

Of course there's a catch... today there's an ethical, security, and privacy dilemma. How do you "outsource" data to a company providing a service that is put into a cloud owned by a 3rd-party company (say Amazon or Google), and keep it all safe and secure? Well, I would argue there are several layers to making "DW/BI as a service" available.

Layer A is what Quantivo does: they don't keep (or receive) any specific or private information about individual customers that might risk identity theft if the security wall is broken. They provide a VPN service with dedicated encrypted channels for the customer to get at their own data sets. They also slice the data sets off for each customer, setting up different grids in the cloud to ensure all data stays separate.

Layer B: the hardware/RDBMS vendor offers use of their software/hardware within the cloud as a direct contract or managed service for upgrades; the customer contracts them to execute the cloud installation, and then the customer manages and owns the data, the data modeling, the queries, and all other components within the cloud over VPN.

I think beyond this, there may be some additional options for EDW/BI as a service; however, they have to be scrutinized carefully.

Ok, ok - so what's the real business benefit here?

Still don't see it, eh? Well, here are just a few of the business benefits to be gained by executing plans like this:

  • Lower cost of ownership
  • Faster turn-around time (as in Quantivo's case - they specialize in customer analytics and data mining)
  • More parallelism with "sloppier design" - wait, that's a risk! Yes, it is... but sloppy designs will get a significant performance boost from SSD and from cloud technology
  • More agile, more nimble (dynamic data marts) - here today, possibly trashed tomorrow, only to build a new one (see the sketch after this list)!
  • Cube write-back - yes, because of the nature of cubes, LARGE CUBES on SSD in a CLOUD can now offer FAST write-back. Hey Microsoft, you listening?
  • More power in the hands of the business user - direct access to create on-demand data mart solutions IN RAM. That's right: SSD on a cloud over VPN lets users PLAY with their data directly (if the outsourced service has been implemented properly). Finally, play with LARGE to HUMONGOUS sets of data on what is essentially a RAM drive! Pay as you go, pay for only what you need.
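Since marts become this cheap, here's a sketch of the "dynamic data mart" idea from the list above - all names hypothetical. Build a mart for today's question on the SSD/cloud-backed store, query it, and throw it away:

  def build_mart(cursor, mart_name, select_sql):
      """Materialize a throw-away mart from a query against the ODW."""
      cursor.execute(f"CREATE TABLE {mart_name} AS {select_sql}")

  def drop_mart(cursor, mart_name):
      """Tear the mart down when the analysis session is over."""
      cursor.execute(f"DROP TABLE IF EXISTS {mart_name}")

  # One analysis session, end to end:
  # build_mart(cur, "mart.todays_spend",
  #            "SELECT customer_id, SUM(amount) AS total "
  #            "FROM odw.customer_txn_raw GROUP BY customer_id")
  # ...run the BI queries...
  # drop_mart(cur, "mart.todays_spend")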

I don't mean to dig on Microsoft - I think just the opposite. They have cube write-back technology today, and their SQL Server 2008 R2 release is INCREDIBLE... it is screaming fast and super scalable (even on standard 32-bit Windows servers, if you know how to tune it). I just think it's an easy jump for them to implement something on SSD and in the cloud, and voila - what an INCREDIBLE solution. It's so easy, I don't know why their gold partners haven't done it yet... Anyone want to invest in me? I'll build you a company to do it.

Alright, what are the risks...

The risks include not managing the services effectively, and not setting the quotas and policies for each user appropriately (example: a low-rung business user fires off a query that expands the cloud to 12 to 20 machines for an hour, but can't authorize the payment for the service usage). OK: Governance, Governance, and more governance... Oh, did I mention GOVERNANCE over the solution? Absolutely required.
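What might one such governance control look like in code? At minimum, a policy check before a query is allowed to fan out. This is a toy sketch - the roles, quotas, and cost model are all invented - but it captures the "low-rung user asks for 20 machines" scenario above:

  QUOTAS = {"analyst": 5.00, "power_user": 50.00}   # max dollars per query, by role

  def authorize(user_role, est_nodes, est_hours, dollars_per_node_hour=0.85):
      """Refuse queries whose estimated cloud cost exceeds the role's quota."""
      est_cost = est_nodes * est_hours * dollars_per_node_hour
      return est_cost <= QUOTAS.get(user_role, 0.0)

  # authorize("analyst", est_nodes=20, est_hours=1.0)  ->  False (17.00 > 5.00)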

Another risk is letting yourself believe that this is the answer to "all the poor-quality data modeling" or "sloppy ETL" or "poorly behaved BI queries" because the performance is now incredible out of the gate. Wrong, wrong, wrong. Even though this opportunity affords HUGE gains in performance, it is no excuse for lack of GOVERNANCE and DATA MANAGEMENT best practices. Your data models can and should be improved, your sloppy ETL can and should be tuned, and your poorly behaved BI queries must be optimized. Then, instead of seeing only a 10x gain, you will be able to see a 150x gain.
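To make "must be optimized" concrete, here's one classic fix, on hypothetical tables: the same question asked two ways, where the rewrite stops wrapping the filter column in a function so an index (or partition pruning) can actually do its job - on SSD or off.

  SLOPPY = """
  SELECT customer_id, SUM(amount)
  FROM   odw.customer_txn_raw
  WHERE  SUBSTR(CAST(load_dts AS VARCHAR(30)), 1, 4) = '2009'  -- defeats the index
  GROUP BY customer_id
  """

  TUNED = """
  SELECT customer_id, SUM(amount)
  FROM   odw.customer_txn_raw
  WHERE  load_dts >= DATE '2009-01-01'
    AND  load_dts <  DATE '2010-01-01'                         -- index-friendly range scan
  GROUP BY customer_id
  """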

You know what they say: no PAIN, no GAIN. In this case, the pain is paying for the right architecture to be put in place, the right training for performance and tuning, and the right data modeling to be used under the covers.

You can find out more about Data Vault (550+ users, 2000 unique visitors a month, 150k hits a month) at http://www.DataVaultInstitute.com

You can find out more about ETL performance and Tuning at http://www.TrainOvation.com, and http://www.GeneseeAcademy.com

OR: Contact me directly: danL@DataVaultInstitute.com  (even if you're not using Data Vault)

Cheers for now,

Dan Linstedt



Posted October 28, 2009 12:03 AM