Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in IT. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

Recently in Business Intelligence Category

Let's face it: cloud computing, grid computing, and ubiquitous computing platforms are here to stay.  More and more data will make its way onto these platforms, and enterprises will continue to find themselves in a world of hurt if they suffer security breaches.  If we think today's hackers are bad, just wait... they're after the mother lode: all customer data, massive identity theft, etc.  I'm not usually one for doom and gloom - after all, we have good resources, excellent security, VPNs and firewalls, right?  In this entry we'll explore what it *might* take to protect your data in a cloud, distributed, or hosted environment.  It's a thought-provoking look at the future - maybe it would take a black swan?

Imagine your data on a cloud environment, or a hosted BI solutions vendor, or any other number of hosted environments. 

* How do you protect your data from getting stolen?

* How do you trace it if it is stolen?

* How do you track down the invader/hackers?

* How do you (or can you) stall the hackers long enough to take evasive action?

First off: we all know that NO security solution will ever be 100% fail-safe; one simply will never exist.  When someone creates a more secure solution, someone else comes up with a way to break it - that's just the way it goes.  BUT there is such a thing as "thinking ahead", thinking outside the box, thinking about what you CAN do to prevent and deter these types of problems - problems which may cost you your business, may cost you money (in lawsuits), may tangle you up in legalities with governments, and on and on.

There's no question about the issues that can arise if you can't show that you've "done everything in your power" to protect your consumers/customers.  So what can you do?

Cloud computing and hosted environments (that is, if they host your data) present truly unique challenges - everything from shared data on shared machines, to shared data on dedicated machines, to VPN access with non-shared, non-public machines, and so on.

Well, let's blow away the fog in the cloud for a minute and take a look at a simple case:

Suppose you outsource your BI, and the data behind it, to an analytics-as-a-service firm.  So far so good.  The BEST possible protection you can have (and they should tell you this up front) is to NOT release any personally identifiable data to the hosted service or cloud - only release aggregate data: rolled-up trends, trends of trends, and so on.  Then add to that all the standard, well-known security you can buy, and it's fairly decent.

Now let's say that, for whatever reason, you've outsourced the CLOUD environment and you've uploaded sensitive data.  What can you do?  What kinds of questions should you ask?  What should the vendors be willing to help with?

First, there's not much you can do - other than ask the vendors for new features... which leads me to the question: what can/should the vendors do in this new computing arena?

Cloud is interesting: it follows the notion that if you need more computing power, it becomes available on demand.  Ok, cool.  What if the extra computing power needed were for encryption/decryption of the data?  In other words, I believe that ALL outsourced data should be stored in encrypted format on disk.  That's a good start, especially if each DISK array carries its own encryption/decryption hardware to perform the task on the fly.  This helps prevent someone from hot-swapping a RAID 5 disk holding a copy of sensitive data and taking it home to crack it.  Well - not really: they can still hot-swap it and take it home, but cracking it (with the right algorithm in place) is another story.

The next step is encryption/decryption at the database level.  I'm assuming that since you're running BI/analytics in a cloud, you'll also need a database engine, right?  Ok, so the database engine should encrypt/decrypt the data as it works with it.  The only place the sensitive data should be visible is in the RAM of the database engine, on the machine where it currently resides.

Am I saying to encrypt ALL data?  No - that would be ludicrous, and would cost way too much money.  Only encrypt the data your organization deems sensitive or private in nature.  The database engine, however, should make that EASY, with little to no performance hit (thanks to cloud resource availability).
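To make that concrete, here's a minimal sketch of what column-level encryption might look like, assuming a PostgreSQL engine with the pgcrypto extension.  The table, columns, and key are hypothetical examples, and in practice the key would come from a key management service rather than being typed into SQL:

    -- Minimal sketch: encrypt only the sensitive column, assuming PostgreSQL + pgcrypto.
    -- Table, column, and key names are hypothetical.
    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    CREATE TABLE customer
    (
        customer_id   BIGINT PRIMARY KEY,
        region_code   VARCHAR(10),   -- non-sensitive: stored in the clear
        ssn_encrypted BYTEA          -- sensitive: stored encrypted on disk
    );

    -- Encrypt on the way in...
    INSERT INTO customer (customer_id, region_code, ssn_encrypted)
    VALUES (1001, 'NE', pgp_sym_encrypt('123-45-6789', 'demo-key-only'));

    -- ...and decrypt only when an authorized session reads it back.
    SELECT customer_id,
           region_code,
           pgp_sym_decrypt(ssn_encrypted, 'demo-key-only') AS ssn
    FROM   customer
    WHERE  customer_id = 1001;

Some engines also offer transparent data encryption below the SQL layer, which is closer to what I'm asking the vendors to make standard here.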

Next comes the out-of-the-box part.  What if data needed a DONGLE to be used properly?  What if the data were sent to your machine and exported to Excel, but to SEE the data it had to be decrypted on your desktop with a public or private key?  Now, this is interesting: a database manufacturer working with a cloud-based hosted service to produce DONGLEs instead of seat licenses - instead of selling "software clients," they'd sell "decryption" dongles for USB.  Hmmm - interesting.  The dongles would then talk to the cloud and report who viewed what, and when - you don't get a choice (sorry, big brother has been watching for a VERY long time).

Let's do one better.  What if, just what if, the data itself could be SIGNED - just like a certificate is signed - and this "watermark" went wherever the data went, couldn't be removed, and couldn't be seen, but was used as part of the key to decrypt and make sense of the data?  Then, if the data itself were stolen (even in encrypted format), it would be traceable.  No, it wouldn't call home - there's no "application" for that - but if it showed up somewhere, there would be forensic evidence (like a fingerprint) on the data pointing to its origin.  Now that's some cool science fiction stuff (I wish it were so).
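We can approximate a small piece of that idea today with a keyed hash stored alongside the data.  This is only a rough sketch (again assuming pgcrypto, with hypothetical names), and it is nothing like an unremovable watermark - but it gives the forensic flavor:

    -- Rough approximation of the fingerprint idea; assumes pgcrypto, names are hypothetical.
    ALTER TABLE customer ADD COLUMN origin_fingerprint BYTEA;

    UPDATE customer
    SET    origin_fingerprint =
           hmac(customer_id::text || COALESCE(region_code, ''),
                'corp-origin-key-2010',   -- secret known only to the data owner
                'sha256');

    -- If a suspect data set surfaces later, recompute the HMAC with your secret
    -- key; matching values are forensic evidence the rows originated with you.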

Anyhow, back to reality.  Dongles provide really good protection mechanisms today, and in fact can also be embedded with a fingerprint reader as part of the authentication mechanism.  This technology exists, and could be put to good use.

In some cases your data is worth more than your gold or the money in the bank, because it represents tomorrow's profitability.  Don't you have the right to ask vendors to help you protect it?  Of course, they have the right to ask you to pay for this service.

Just a thought.  If you have some other cool thoughts, reply in the comments to this blog - I'd love to hear them.

Thanks,
Dan Linstedt
DanL@DanLinstedt.com
http://www.DanLinstedt.com


Posted March 30, 2010 3:05 PM

Sorry folks, this one is a shameless plug (but I am trying to separate concepts and ideas).

I've decided to give my personal website a face-lift, and in doing so I've begun posting all things Data Vault Modeling and Methodology related there.  So if you're interested in reading new posts about the Data Vault, please see: http://www.DanLinstedt.com

I will reserve this blog for posts about cloud computing, MPP architectures in general, scalability, performance & tuning, vendors, database engines, data warehousing architecture, compliance, etc. - whatever I see making a difference in the BI/EDW world.

If you have an idea, or want to see me post about a specific subject, please reply with your comments here.

Thank you kindly,
Daniel Linstedt


Posted March 25, 2010 4:51 AM

Welcome to the next installment of Data Vault Modeling and Methodology.  In this entry I will attempt to address the comment I received on the last entry regarding Data Vault and Master Data.  I will continue posting as much information as I can to help spread the knowledge for those of you still questioning and considering the Data Vault.  I will also try to share more success stories as we go, as much of my industry knowledge has been accrued in the field - actually building systems that have turned into successes over the years.

Ok, let's discuss the health-care provider space, conceptually managed data and master data sets, and a few other things along the way.

I have a great deal of experience building Data Vaults to assist in managing health-care solutions.  I helped build a solution at Blue Cross Blue Shield (Wellpoint, St. Louis); another Data Vault was built and used for part of the Centers for Medicare and Medicaid facilities in Washington, DC.  Another Data Vault is currently being built for the congressionally mandated US Government health-care electronic records system that helps track US service personnel, and there are quite a few more in this space that I cannot mention or discuss.

Anyhow, what does this have to do with Data Vault Modeling and building data warehouses for chaotic systems or immature organizations?

Well, let's see if we can cover this for you.  First, realize that we are discussing the Data Vault data modeling constructs (hub and spoke) here; we are not addressing the methodology components - those can come later if you like (although, having said that, I will introduce the parts of the project that help parallel yet independent team efforts meet, or link together, at the end).

Ok, so how does Data Vault Modeling truly work?

It starts with the business key - or should I say the multiple business keys.  The business keys are the true identifiers of the information that lives and breathes at the fingertips of our applications.  These keys are what the business users apply to locate records and uniquely identify records across multiple systems.  There are plenty of keys to go around, and source systems often disagree as to what the keys mean, how they are entered, how they are used, and even what they represent.  You can have keys that look the same but represent two different individuals; you can have two of the same key that SHOULD represent the same individual, but whose details (for whatever reason) differ in each operational system; or you can have two of the same key representing duplicate records across multiple systems (the best-case scenario).

These business keys become the HUBs, or HUB entities.  If you want, the different project teams can build their own Data Vault models constructed from the business keys in their own systems.  Once the data is loaded to a historical store (or multiple stores), you can then build Links across the business keys to represent "same-as" keys: i.e., keys that look the same, and that the business user defines to be the same, but where the data disagrees with itself.
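For illustration only, here's a hedged sketch of what a Hub and a "same-as" Link might look like in plain SQL; the table and column names are mine, not a standard you must follow:

    -- Illustrative Hub: one row per unique business key, plus audit columns.
    CREATE TABLE hub_customer
    (
        customer_sqn  BIGINT       PRIMARY KEY,       -- surrogate sequence key
        customer_bkey VARCHAR(30)  NOT NULL UNIQUE,   -- the business key
        load_dts      TIMESTAMP    NOT NULL,          -- when we first saw the key
        record_source VARCHAR(50)  NOT NULL           -- which system it came from
    );

    -- Illustrative "same-as" Link: ties two Hub keys the business declares
    -- to be the same customer, even when the source data disagrees.
    CREATE TABLE link_customer_same_as
    (
        link_sqn      BIGINT      PRIMARY KEY,
        customer_sqn  BIGINT      NOT NULL REFERENCES hub_customer (customer_sqn),
        master_sqn    BIGINT      NOT NULL REFERENCES hub_customer (customer_sqn),
        load_dts      TIMESTAMP   NOT NULL,
        record_source VARCHAR(50) NOT NULL
    );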

Remember, Links are transitory; they represent the state of the current business relationships today.  They change over time - Links come and go.  They are the fluid, dynamic portion of the Data Vault, making "changes to structure" a real possibility... but I digress.

Getting the different teams building their own Data Vaults in parallel is the first step.  Once they have built Hubs, Links, and their own Satellites, and they are loading and storing historical data, a "master data" team can begin to attack the cross-links from a corporate standpoint.  This must be done in much the same manner as building a corporate ontology - different definitions for different parts of the organization, even for different levels within the organization.  The master data team can build the cross-links to provide the "corporate view" to the corporate customers, with the appropriate corporate definitions.

Think back to a scale-free architecture: it's often built like a B+ tree or binary tree, where nodes live inside of nodes, other nodes are stacked on top of nodes, and so on.  So... we have Data Vault Warehouse A and Data Vault Warehouse B; now we need Corporate Data Vault Warehouse C to span the two.  Links are the secret, followed by Satellites on the Links.  There may (as a result of a spreadsheet or two used at the corporate levels) even be some newly added Hub keys - again, business keys used at the corporate level that are not used at any other level of the organization.
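A hedged sketch of that corporate layer follows - a cross-Link joining Hub keys from warehouses A and B, with a Satellite on the Link carrying the corporate definition (names are illustrative, and I'm assuming sequence-style surrogate keys):

    -- Illustrative corporate Link spanning Hub keys from Data Vault A and Data Vault B.
    CREATE TABLE link_corp_customer
    (
        link_sqn          BIGINT      PRIMARY KEY,
        dv_a_customer_sqn BIGINT      NOT NULL,   -- Hub key from warehouse A
        dv_b_customer_sqn BIGINT      NOT NULL,   -- Hub key from warehouse B
        load_dts          TIMESTAMP   NOT NULL,
        record_source     VARCHAR(50) NOT NULL
    );

    -- Satellite on the Link: holds the corporate-level context and its history.
    CREATE TABLE sat_corp_customer
    (
        link_sqn        BIGINT       NOT NULL REFERENCES link_corp_customer (link_sqn),
        load_dts        TIMESTAMP    NOT NULL,
        load_end_dts    TIMESTAMP,                 -- NULL = current corporate view
        corp_definition VARCHAR(255),
        record_source   VARCHAR(50)  NOT NULL,
        PRIMARY KEY (link_sqn, load_dts)
    );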

Finally, at long last, a good use for marrying ontologies to enterprise data warehouses.  By the way, this is also the manner in which you develop a master data set.  Don't forget that MDM means Master Data Management - and MDM includes people, process, and technology.  The Data Vault only provides the means to easily construct master data; it is NOT an MDM solution, strictly an MD "Master Data" solution.

Governance doesn't have to be separate, and it doesn't have to come before or after the Data Vaults are built - and again, disparate EDW Data Vaults can be built by parallel teams.  That said, once you embark on building master data sets, you *MUST* have governance in place to define the ontology, the access paths, and the corporate view (corporate Links, Hubs, and Satellites) that you want in the master data components.

In essence you are using the Data Vault componentry (on the data modeling side) to bridge the lower-level Data Vaults - to feed back to operational systems (that's where master data begins to hit ROI if done properly), and to provide feeds to corporate data marts.

In fact, we are using this very same technique in a certain organization to protect specific data (keeping some data in the classified world while other data lives in the non-classified or commercial world).  Scale-free architecture works in many ways, and the Link table (aside from adding joins) is what makes this possible, and is what makes the Data Vault Model fluid.

It's also what helps IT be more agile and more responsive to business needs going forward.  The Link table houses the dynamic ability to adapt quickly and change on the fly.

I'm not sure if I mentioned it, but ING Real Estate is using Excel spreadsheets through Microsoft SharePoint to trigger Link changes and structural changes to the Data Vault on the fly.  Thus when spreadsheets change and the relationships change, the Link tables change - leaving the existing history intact and creating new joins/new Links for future historical collection.  This is yet another example of dynamic alteration of structure (on the fly) that is helping companies overcome many obstacles.

But I ramble.  There's another company, Tyson Foods, with a very small Data Vault absorbing XML and XSD information from 50 or so external feeds, most of which change on a bi-weekly basis.  They had one team build this as a pilot project using the Data Vault, and they are now adapting easily and quickly to any of the external feed changes coming their way.  In fact, they were able to apply the master data/governance concepts at the data level and "clean up" the XML quality of the feeds they were redistributing back to their suppliers.

So let me bring it home: are governance and clean-up required up-front to build a Data Vault?

No - not now, not ever.  Is it a good thing?  Well, maybe; but by loading the data you do have into disparate Data Vaults, you can quickly and easily discover just where the business rules are broken, and where the applications don't synchronize when they are supposed to.  Can the Data Vault Model help you in building your MDM?  Yes, but it's only a step on the Master Data side of the house.  You are still responsible for the "Data Management" part of MDM - the people, process, and technology, including the governance - all part of project management at a corporate level.

This brings the second segment to a close.  I'd love to have your feedback: what else about the Data Vault are you interested in?  Again, these entries are meant to be high level - to explain the concepts.  Let me know if I'm meeting your needs, or feel free to contact me directly.

Thank you kindly,

Dan L,  DanL@DanLinstedt.com


Posted March 15, 2010 6:39 PM

Most of you by now have heard the words: "Data Vault".  When you run it through your favorite search engine you get all kinds of different hits/definitions.  No surprise.  So what is it that I'm referring to when I discuss "Data Vault" with BI and EDW audiences?

This entry will try to answer such basic questions, just to provide a foundation of knowledge on which to build your fact finding.

Data Vault: Definitions vary - from security devices, to appliances that scramble your data, to other services that offer to "lock it up" for you...  That's NOT what I'm discussing.

I define the Data Vault as follows:  Two basic components:

COMPONENT 1: The Data Vault Model

The modeling component is really (quite simply) a mostly normalized hub-and-spoke data model design, with table structures that allow flexibility, scalability, and auditability at its core.
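As a quick taste of the "spoke" side (descriptive history with full auditability), here's a hedged sketch of a Satellite hanging off a customer Hub; the names are illustrative only:

    -- Illustrative Satellite: descriptive attributes of a Hub key, with full history.
    CREATE TABLE sat_customer_details
    (
        customer_sqn   BIGINT       NOT NULL,   -- surrogate key of the parent Hub
        load_dts       TIMESTAMP    NOT NULL,   -- when this version arrived
        load_end_dts   TIMESTAMP,               -- NULL = the current version
        customer_name  VARCHAR(100),
        customer_phone VARCHAR(30),
        record_source  VARCHAR(50)  NOT NULL,   -- audit trail back to the source system
        PRIMARY KEY (customer_sqn, load_dts)
    );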

COMPONENT 2: The Data Vault Methodology

I've written a lot less about this piece.  BUT: this piece is basically a project management component (project plan) + implementation standards + templates + data flow diagrams + statement of work objects + roles & responsibilities + dependencies + risk analysis + mitigation strategies + level-of-effort estimates + predicted/expected outcomes + project binder, and so on.

What's so special about that?

Well, what's special about the methodology is that it combines the best practices of Six Sigma, TQM, SEI/CMMI Level 5 (people and process automation/optimization), and PMP (project reviews, etc.).  Is it overkill?  For some projects, yes; for others, no.  It depends on how mature the culture of your organization is, and how far along the maturity path IT is - whether or not they are bound, or decreed, to create and then optimize the creation of enterprise data warehouses.

Ok - the project sounds a lot like "too huge to handle": old, cumbersome, too big, too massive an infrastructure, etc., etc.  Yeah, I've heard it all before, and quite frankly I'm sick of it.

I built a project this way in the 1990s for Lockheed Martin Astronautics called the Manufacturing Information Delivery System (MIDS/MIDW for short), which, last I heard, is still standing, still providing value, and still growing today.  I was an employee of theirs under their EIS (Enterprise Information Systems) company.  My funding came from project levels, specifically through contracts.  I couldn't get time from a fellow IT worker without giving them my project charge number (yes, CHARGEBACKS).  So every minute we burned was monitored and optimized.  We built this enterprise data warehouse in 6 months total with a core team of 3 people (me, a DBA, and a SME), with a part-time data architect/data modeler helping us out.  We wrote all our code in COBOL, SQL, and Perl scripts.  Our DEC Alpha was one of our web servers, so we wrote scripts that generated HTML every 5 minutes to let our users know when our reports were ready.

Ok, technology has come a long, long way since then, but the point is: we used this methodology successfully with limited time and limited resources.  We combined waterfall and spiral project methodologies to produce a repeatable project for enterprise data warehouse build-out.  At the end of the project we were able to scale out our teams from our lessons learned, optimize our IT processes, and produce more successes in an agile time frame.  We had a 2-page business requirements document; from the time the business user filled it in and handed it back to us to the time we delivered a new star schema was approximately 45 minutes to 1 hour (as long as the data was already in the data warehouse and we didn't have to source a new system).
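To give a feel for how a new star schema can roll out that fast when the data is already in the warehouse, here's a hedged, modern-day sketch: a dimension delivered as a simple view over hypothetical hub/satellite tables.  These names are mine for illustration, not artifacts from the MIDS project:

    -- Hypothetical: a customer dimension delivered as a view over a Hub and the
    -- current rows of its Satellite; no new ETL when the data is already loaded.
    CREATE VIEW dim_customer AS
    SELECT h.customer_sqn   AS customer_key,
           h.customer_bkey  AS customer_number,
           s.customer_name,
           s.customer_phone
    FROM   hub_customer h
    JOIN   sat_customer_details s
      ON   s.customer_sqn = h.customer_sqn
    WHERE  s.load_end_dts IS NULL;   -- current version only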

This is efficiency.  We had a backlog of work from around the company because we had quick turn-around.  Is this Agile?  Don't know - all I know is it was fast and Business Users Loved it.

Anyhow, off track - so let's get back.

The methodology is what drove the team to success - it allowed us to learn from our mistakes, correct and optimize our IT business processes, manage risk, and apply the appropriate mitigation strategies.  We actually got to a point where we began turning a profit for our initial stakeholders (they were re-selling our efforts to other business units, bringing in multiple funding projects across the company because of our turnaround time).  The first project integrated 4 major systems: Finance, HR, Planning, and Manufacturing.  The second project integrated Re-work, Contracts, and a few others like launch-pad parts.

Anyhow, at the heart of the methodology was, and is, a good (I like to think great) data architecture: the Data Vault Modeling components.

This is just the introduction; there is more to come.  I really am counting on your feedback to drive the next set of blog entries, so please comment on what you'd like to hear about, and what you have heard (good/bad/indifferent) about the Data Vault Model and/or methodology.  Or contact me directly with your questions - as always, I'll try to answer them.

Thanks,

Dan Linstedt

DanL@DanLinstedt.com


Posted March 14, 2010 8:52 PM

There are many things that come to mind when reading the title of this entry; it's a HUGE space with even larger prospects - from the app servers to the databases, from the tips of BI reporting all the way to ethics, security, and privacy laws.  And then there's the dreaded: "What if the company supporting my current cloud apps & data fails?"

Hmm.  In this entry we will explore the tip of the iceberg, as it were, and some of the notions to consider when looking at business intelligence and data warehousing in the cloud.  Why?  Because the CLOUD is big and getting bigger - and because it IS a central and important part of technology evolution in 2010.

Who can define "CLOUD" computing?  Not me, that's for sure.  It has the same problem as every other industry-changing paradigm shift over the past 5 years: multiple definitions, multiple meanings, multiple provisions - all correct for different reasons.

Well, for lack of a better definition, let's focus on the one provided by techtarget.com:

DEFINITION - Cloud computing is a general term for anything that involves delivering hosted services over the Internet. These services are broadly divided into three categories: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). The name cloud computing was inspired by the cloud symbol that's often used to represent the Internet in flow charts and diagrams.

http://searchcloudcomputing.techtarget.com/sDefinition/0,,sid201_gci1287881,00.html

Let's add to that a specific focus on 1) data warehouses and data marts in the cloud, and 2) business intelligence "reporting" and analytics in the cloud.

Is it a good thing?  Yes, I think so.  I believe it can help reduce complexity, it can help consolidate disparate data sets, it can increase efficiency and offer "pay by use" computing power as data sets grow. 

But where might it fall short?  Well, firms that have poor (or poorly optimized) data architectures coupled with ever-growing data sets will continue to see costs rise - quite quickly, and quite possibly beyond what they ever dreamed possible.  In other words, at the outset, costs will be reduced as migration into the cloud occurs.  Then, over a period of 6 to 9 to 12 months (as the data sets grow), costs for computing power will grow - exponentially or worse.  Why?  Because a bad data architecture, joined physically across engines without the ability to scale MPP at the core, will keep driving CPU-intensive, single-streamed data access.  So costs will continue to grow: for the cloud to meet end-user BI performance expectations, computing power will have to be added to run additional user logins, each with additional single-streaming power.

Regardless of what business users (or IT, for that matter) like to believe, performance is a MAJOR cost driver in cloud computing.  To mitigate the high cost, or the potential for exponential cost increases, the ONLY thing that really helps is a strong, parallel data architecture for the back-office enterprise data warehouse.  It MUST be parallel in design and by nature, as the cloud is inherently built around MPP scalability (on-demand MPP, that is).  Which means a GOOD or GREAT data model will be at the core of cost containment - which in turn means: call in a consultant who understands terabytes, petabytes, and MPP systems BEFORE transitioning, and optimize your BI data models and accessibility before transitioning.
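As one small example of "parallel by design," here's what declaring a distribution key looks like in Greenplum-style SQL.  The table and key are hypothetical, and other MPP engines (Teradata, Netezza, etc.) express the same idea with their own syntax:

    -- Hypothetical fact table spread across MPP segments by customer key, so joins
    -- on that key stay node-local instead of single-streaming through one CPU.
    -- (Greenplum-style syntax; other MPP engines declare this differently.)
    CREATE TABLE fact_sales
    (
        customer_key BIGINT,
        sale_date    DATE,
        sale_amount  NUMERIC(12,2)
    )
    DISTRIBUTED BY (customer_key);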

Ok, next point: what else can go wrong?  Well, let's put it this way: clouds and cloud computing services are just that - services.  And being services to the enterprise, they are equipped with APIs (application programmer interfaces), also well known these days as web services.  What this means is potential for hacks and security breaches.  But that's not all!  Now the computing resources that house YOUR corporate data are OUTSIDE your firewall.  You as a company no longer know where (physically) your data lives in the cloud, and as cloud services expand (on demand) to handle the need, the cloud service provider opens up more machines with visibility to your data sets, your queries, and your on-line access.

So security is a HUGE deal.  Here's an interesting thought (quite possibly a real engineering challenge): construct a cloud on a virtual private network, with a maximum number of dedicated private machines, with encrypted data on the back end.  Wow, that's a mouthful.  I'm not even sure if this is possible (although with enough money, anything is possible).  So what does that mean?  It might mean buying servers, hardware, and the fundamentals of internalizing cloud technology inside the corporate walls.  But where do you find IT resources with this knowledge?  Another tough question.  IF you make this kind of decision, then outsourcing clearly is not an option, unless the outsourced talent is in house.  But wait: yet another security breach waiting to happen.

If you have sensitive, protected data, then whoever you hire to work with it had better be trained and monitored closely.  You've probably heard the stories about Cisco and "back doors" in their routers/switches/networking devices.  I'm not saying all companies or all people are like this, but I've been in enough situations to understand that this is a security risk, and it is ripe for the picking in a cloud environment.

And there's one more thing to think about: privacy and ethics laws.  More data means more management costs; more data also means higher risk of losing data to the outside world, with the missing data going unnoticed or untraced.  If it's executed in a cloud, how do you know that the machine/hardware you are using doesn't have a virus on it, or some sophisticated monitoring program that, when the machine is added to the compute cluster in real time, finds and shares the sensitive data running through its hardware?

Again, traceability is in question.  Let me say this: cloud computing IS (I believe) the future, and we MUST find a way to work with it and leverage it in the proper manner.  I am just saying that we must tread carefully into these waters, and like every other project, we must put a few things in place up front - security tests, breach procedures, and yes, off-site backups of clouds in case the cloud vendor goes belly up.  What would you do if that happened and you were running your business on a cloud outside your organization?

Just a few things to think about.  Please reply - I'd love to hear what you are considering in your cloud implementation.

Cheers,

Dan Linstedt


Posted March 3, 2010 5:55 PM
