<?xml version="1.0" encoding="utf-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en">
<title>Blog: Dan E. Linstedt</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/" />
<modified>2009-01-20T12:27:47Z</modified>
<tagline>Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI.  Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models.  You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T.  I can't wait to hear from you in the comments of my blog entries.  Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com</tagline>
<id>tag:www.b-eye-network.com,2009:/blogs/linstedt/9</id>
<generator url="http://www.movabletype.org/" version="3.33">Movable Type</generator>
<copyright>Copyright (c) 2009, Dan Linstedt</copyright>
<entry>
<title>Intro to Compliance - Sarbanes Oxley and your EDW</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2009/01/post.php" />
<modified>2009-01-20T12:27:47Z</modified>
<issued>2009-01-20T11:45:30Z</issued>
<id>tag:www.b-eye-network.com,2009:/blogs/linstedt/9.2864</id>
<created>2009-01-20T11:45:30Z</created>
<summary type="text/plain">In this entry we explore the nature and notion of compliance - specifically Sarbanes-Oxley and what it means to your EDW. I&apos;ve been working with compliance-based systems for years. Over the years I&apos;ve learned about data as an asset,...</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>In this entry we explore the nature and notion of compliance - specifically Sarbanes-Oxley and what it means to your EDW. I've been working with compliance-based systems for years.  Over the years I've learned about data as an asset, that is: data in the EDW affecting the financial bottom line.  I've learned about audits and auditability (been through a few of them myself). In this "series" I will first explore Sarbanes-Oxley, then follow with COBIT, ITIL, SEI/CMMI Level 5, and a few other things.  Please let me know what you think of this entry/series and if you'd like to see more.</p>]]>
<![CDATA[<p>Let's start with a few definitions:</p>

<p><a href="http://en.wikipedia.org/wiki/Sarbanes-Oxley_Act">Sarbanes-Oxley Act</a>, whose eleven titles are:<br />
1. Public Company Accounting Oversight Board (PCAOB) <br />
2. Auditor Independence<br />
3. Corporate Responsibility<br />
4. Enhanced Financial Disclosures <br />
5. Analyst Conflicts of Interest<br />
6. Commission Resources and Authority<br />
7. Studies and Reports<br />
8. Corporate and Criminal Fraud Accountability<br />
9. White Collar Crime Penalty Enhancement<br />
10. Corporate Tax Returns <br />
11. Corporate Fraud Accountability</p>

<p>How does this tie to my EDW/BI initiative?<br />
Very interesting question, and I have an opinion on it: I believe that data is an asset within our organizations.  Now before you run off to tell the world that "only good data is an asset", let me back up.  Good, bad, or indifferent - data is an asset, regardless of how it's perceived.  Data that is captured or created on the fly is an asset.  It doesn't matter if it's good or bad data.  Besides, who determines which label to place on the data?</p>

<p>Because data is an asset, it affects the bottom-line financials.  Financial decisions are made based on data every day, sometimes every second.  In some cases (like NASA), data affects people's lives.  Clearly, data is worth something on the financial books.</p>

<p>Ok - so how do you value it?<br />
That's a discussion for another day.</p>

<p>Now that data is seen as an asset to the corporation, and is considered tied to financials, it should be available for audits and compliance.  The compliance must come from the people within the organization themselves; however, the data can shed light on the firm's compliance or non-compliance.  In other words, the data can tell the auditors: "what the company knows, and how they are reacting to the situation."  The data can also help determine the "net worth" of the organization.</p>

<p><a href="http://en.wikipedia.org/wiki/Sarbanes-Oxley_Act">Sarbanes-Oxley Act (a little further down, it says)</a><br />
Auditing Standard No. 5<br />
* Assess both the design and operating effectiveness of selected internal controls related to significant accounts and relevant assertions, in the context of material misstatement risks; <br />
* <strong>Understand </strong>the <strong>flow of transactions</strong>, including IT aspects, sufficient enough to identify points at which a misstatement could arise; <br />
* Evaluate company-level (entity-level) controls, which correspond to the components of the COSO framework; <br />
* Perform a fraud risk assessment; <br />
* Evaluate controls designed to prevent or detect fraud, including management override of controls; <br />
* Evaluate controls over the period-end financial reporting process; <br />
* Scale the assessment based on the size and complexity of the company; <br />
* Rely on management's work based on factors such as competency, objectivity, and risk; <br />
* Conclude on the adequacy of internal control over financial reporting. </p>

<p>Ok, I can see how source systems are affected, but how does this tie to my EDW?<br />
The EDW must house "A SINGLE VERSION OF THE FACTS for a specific point in time."  (see Data Vault Modeling and Methodology e-learning on <a href="http://inmoninstitute.com">http://inmoninstitute.com</a>)  The data must tell a story of what the company DID and how they REACTED to a specific situation that occurred within the organization.  The data in the EDW must create an AUDIT TRAIL of decision making along the way.  The EDW is crucial to uncovering the facts about what people knew, and when.  It MUST become a system-of-record "capture mechanism" in order to meet compliance initiatives.</p>

<p>Wait a minute, that's a big leap - I don't follow...<br />
You're not alone.  Many people around the world are now discovering that the only way to uncover corruption, fraud, or pure misjudgment is to look at the good, the bad, and the ugly data in the EDW - and how it changed (or didn't) over time.  The EDW tells the story of the company's evolution, ranging from new source data to changes in the business rules.  Ok, back to the point:</p>

<p>How can you "assess the effectiveness of audit controls" without looking into the EDW for a data trail of how the company is operating?  Especially if you are warehousing the financial systems...</p>

<p>How can you "Understand the flow of transactions" without tracking how the flow's business rules changed the transactions along the way?  An EDW should capture the history of the raw transactions BEFORE and AFTER the changes in order to meet compliance.</p>

<blockquote>SOX 404 compliance costs represent a tax on inefficiency, encouraging companies to centralize and automate their financial reporting systems. This is apparent in the comparative costs of companies with decentralized operations and systems, versus those with centralized, more efficient systems. For example, the 2007 FEI survey indicated average compliance costs for decentralized companies were $1.9 million, while centralized company costs were $1.3 million. Costs of evaluating manual control procedures are dramatically reduced through automation.</blockquote>  <a href="http://en.wikipedia.org/wiki/Sarbanes-Oxley_Act">http://en.wikipedia.org/wiki/Sarbanes-Oxley_Act</a>

<p>Regarding costs, the EDW is meant to be a centralized repository of information.   The Sarbanes-Oxley auditor should be asking to view the financial reports from three directions - using triangulation to spot discrepancies. </p>

<p>Auditor to the firm:<br />
Direction 1: Show me today's financial reports from today's data...   (firm's response: ok, either from the EDW or from the operational systems) - usually this will come from an "OPERATIONAL DATA WAREHOUSE" or a system using operational BI.</p>

<p>Direction 2: Show me yesterday's financial reports - reproduce them for me using yesterday's routines, and yesterday's data.... don't just grab your "old hard-copy"...  (firm's response: ok - from the EDW, and the backed-up routines, and yesterday's data mart).</p>

<p>Direction 3: I see errors, discrepancies between the two reports... Now, show me the RAW detail data that went into yesterday's report, and the RAW detail data that went into today's report.  (Firm's response with a "version of the truth" warehouse is: UH-OH, we're in trouble....  Firm's response with a Data Vault is: No problem)</p>
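
<p>To make the three directions concrete, here is a minimal Python sketch of an append-only raw history that can answer all of them. It is an illustration only, not the Data Vault model itself: the <code>RawRow</code> structure, the account keys, the amounts, and the dates are all hypothetical, and a real EDW would do this in the database rather than in memory.</p>

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RawRow:
    account: str      # hypothetical business key of the account
    amount: float     # value exactly as the source delivered it
    load_date: date   # when the warehouse received this version


# Append-only history: a restated value arrives as a NEW row.
# Nothing is ever updated in place, so earlier states stay reproducible.
HISTORY = [
    RawRow("ACC-100", 500.0, date(2009, 1, 18)),
    RawRow("ACC-200", 250.0, date(2009, 1, 18)),
    RawRow("ACC-100", 450.0, date(2009, 1, 19)),  # source restated ACC-100
]


def as_of(history, day):
    """Return the latest raw row per account loaded on or before `day`."""
    latest = {}
    for row in sorted(history, key=lambda r: r.load_date):
        if row.load_date <= day:
            latest[row.account] = row
    return latest


def report_total(history, day):
    """Directions 1 and 2: rebuild the report total as it stood on `day`."""
    return sum(row.amount for row in as_of(history, day).values())


def discrepancies(history, day_a, day_b):
    """Direction 3: the raw rows that differ between the two report dates."""
    a, b = as_of(history, day_a), as_of(history, day_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


yesterday, today = date(2009, 1, 18), date(2009, 1, 19)
print(report_total(HISTORY, yesterday), report_total(HISTORY, today))
print(discrepancies(HISTORY, yesterday, today))
```

<p>Because the restated row was appended rather than overwritten, yesterday's total can still be rebuilt, and the difference between the two reports traces back to the exact raw rows involved.</p>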

<p>Data is an asset; data affects the financial bottom line.  RAW data needs to be tracked in the EDW in order to be compliant with Sarbanes-Oxley.  Auditors will ask to see this information, and the EDW had better have it.</p>

<p>*** Compliance initiatives are difficult (if not impossible) to meet without a historical tracking of RAW data sets, integrated, and stored in the EDW ***</p>

<p>Changing the data on the way IN to your EDW can cause a compliance audit failure in the future, especially if the source system is retired, is destroyed, or is unable to "restore" the system of record that created the data in question.  The EDW is the ONLY place in the future to house this information.</p>

<p>I will be continuing my series on auditability and compliance here - but you can also find out more by registering and watching the new online courses on http://inmoninstitute.com - I will have some courses available by February 15th, 2009 about auditability, compliance, and the EDW.</p>

<p>I will also continue the series by discussing governance controls and accountability as we move forward.</p>

<p>I'd like to hear about your thoughts/experiences.  Please reply with comments below.</p>

<p>Thank you,<br />
Dan Linstedt<br />
CIO, Genesee Academy, LLC<br />
DanL@GeneseeAcademy.com</p>]]>
</content>
</entry>
<entry>
<title>Fun Programs to keep your computer running smooth</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/12/fun_programs_to.php" />
<modified>2008-12-18T22:54:03Z</modified>
<issued>2008-12-18T22:44:17Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2794</id>
<created>2008-12-18T22:44:17Z</created>
<summary type="text/plain">A fun entry on my favorite tools for managing my PC</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>This has nothing "really" to do with BI, but on the other hand, nearly all of us use Windows at some point in our careers, and over the years I have latched on to some vital programs that I really like to use to keep my machine running efficiently.  They are mostly fairly cheap, and they work really well.</p>]]>
<![CDATA[<p>Disk Defragmentation:<br />
  VOPT @ http://www.vopt.com  - best program I've had in years, runs in assembly, is fast, and will even defrag your Windows pagefile</p>

<p>Registry fixers:<br />
CCleaner - freeware, http://www.ccleaner.com - really good at cleaning up components, missing files, extensions, etc...<br />
WinASO - http://www.winASO.com - really good at fixing problems in the registry, will even compact the registry for you.</p>

<p>Manage your DRIVERS!<br />
Driver Detective:  http://www.drivershq.com/ - the BEST driver software for your Windows boxes that I've seen in many years, fixed problems on my old XP box too!</p>

<p>My personal favorites for anti-virus and software-based firewalls:<br />
Kaspersky - http://www.kaspersky.com  - doesn't invade your machine the way some other virus checkers do.<br />
ZoneAlarm - http://www.zonealarm.com - really good, but requires quite a bit of training before it can be super effective.</p>

<p>So there you have it, some simple tools, really great pieces of software to own and not too expensive.</p>

<p>Enjoy!<br />
Dan L</p>]]>
</content>
</entry>
<entry>
<title>The need for: IT Agility &amp; Data Warehousing/BI</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/12/the_need_for_it.php" />
<modified>2008-12-18T10:46:58Z</modified>
<issued>2008-12-18T10:01:34Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2789</id>
<created>2008-12-18T10:01:34Z</created>
<summary type="text/plain">Choosing the right data model for your EDW can make all the difference in the world, especially when it comes to IT Agility.  This entry explores IT Agility (or the lack thereof) and attempts to bring some form of relief by offering a possible solution.</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>In this day and age everyone is cutting costs; every customer and corporate client is looking for ways and means to become lean and efficient.  I've heard a lot about disparate "enterprise data junkyards" recently, especially when it comes to stove-piped solutions in which star schemas used as an EDW cause problems with IT agility.  As a result of problems with IT agility in the area of EDW/BI processing, business users continue to build spread-marts (Access databases alongside complicated Excel spreadsheets).  In this entry we will explore this phenomenon, and discuss what executives and business users can do about this growing issue.</p>]]>
<![CDATA[<p>It's no secret that the entire world is suffering from a financial "crisis".  Every company is struggling to make profits and keep its work-force.  But out of "crisis", I believe, comes opportunity: opportunity to break the molds, bust up the "old way" of doing business, and break down the barriers to entry and the "not invented here" syndrome.  Companies and business users must seek new ways of doing business in order to compete, and to stay in business.</p>

<blockquote>The pressure for companies to become more agile means enterprise IT has to become more agile too. Companies must quickly redirect IT resources and efforts to compete effectively in an increasingly competitive global marketplace.</blockquote> <a href="http://esj.com/Enterprise/article.aspx?EditorialsID=2135">http://esj.com/Enterprise/article.aspx?EditorialsID=2135</a>

<p>This is especially true of current generation 1 EDW / BI projects.  The problems are evident everywhere we look.  Generation 1 EDWs have been built around the notion of stove-piped answer sets (a set of answers for sales, another for HR, another for finance, and yet another for... you name it): IT has built it somewhere for a _specific_ business unit.  The end result is that IT calls this an EDW (loosely affiliated star schemas).  Business users continue to request changes, and IT continues to respond with ever-increasing costs and ever-increasing time to implement.</p>

<p>Business sees this as IT non-agility.  At some point the business begins to tell IT: you cost too much for a new subject-oriented star, you take too long to integrate my changes.  Then they (business) run off to create their own "marts"/EDW-like structures in Microsoft Access databases and Excel spreadsheets.  These are what we call "spread-marts".  Eventually the corporation bears the brunt of this directly.  Business users eventually "toss" the spread-mart over the wall to IT to handle, and "integrate" with the existing data mart solutions.</p>

<p>* Costs of maintenance steadily rise out of control (for IT to keep up and maintain all the different components)<br />
* Backward compatibility, and integration of new spread-marts requires re-engineering of existing load cycles into a number of different star schemas.<br />
* Business ends up with disparate answer sets<br />
* Staging areas turn into pseudo-warehouses because IT must put history into the staging areas to satisfy compliance initiatives.</p>

<p>The largest problems that face business are:<br />
1) The cost of "re-engineering" existing conformed dimensions rises out of control as "more and more conformity" is stuffed into ever more complex load routines<br />
2) The cost of "maintaining" multiple systems for different star-schemas rises out of control, and the time to implement "re-usable components" (conformed dimensions/federated marts) becomes unbearable.</p>

<p>All of this occurs because the WRONG DATA MODEL has been chosen for implementation purposes within an EDW vision.  IT is then seen as "slow to respond", or as "costing too much" to implement a solution, both of which lead business down the path of creating their own copies of data sets for BI analytics purposes.  This is a serious lack of IT agility.</p>

<blockquote>Looking at how long it takes to make required changes or enhancement from start to finish - even when some of the time lapse is outside the direct control of enterprise IT - gives the best picture of enterprise IT agility. 
[...]
Cost: Obviously, time should be tracked as part of a measure of enterprise IT agility, but why track cost?

<p>The answer is simple. Committing extra resources or dollars to reduce elapsed time isn't a good solution to the agility challenge. Paying a premium to reduce elapsed time might be practical under certain circumstances, but spending extra money to "buy" agility on a regular basis may not be a good investment.</p></blockquote> <a href="http://esj.com/Enterprise/article.aspx?EditorialsID=2135">http://esj.com/Enterprise/article.aspx?EditorialsID=2135</a>

<p>Well, take heart.  There is a solution out there.  Please note: I'm not here to tell you Star Schemas are bad, quite the contrary.  Star Schemas are awesome tools for OLAP and drill down, and discovery analysis.  Star schemas should be used as Data Mart architectures, and should NOT be used for enterprise data warehousing architectures.  </p>

<p>The data model chosen to act as the EDW is at the heart of the success or failure in IT agility within a BI project.</p>

<p>MODELS AFFECT IT AGILITY - CHOOSING THE RIGHT MODEL/ARCHITECTURE FOR THE RIGHT JOB IS CRITICAL.</p>

<blockquote>Based on the process maps, we used data modeling to define the logical data model for the system database. Once we knew the new business processes and the data model, we were able to create a prototype of the user interface and the technical architecture for the system. </blockquote> <a href="http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=112307">http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=112307</a>

<p>Do you (business user) find yourself asking IT to "please deliver this change to the 'EDW'" (fully expecting a 90-day time-boxed deliverable for $125k), only to have IT come back and say: well, the EDW/BI system requires retrofitting/re-engineering because we must integrate this data into existing conformed dimensions, so it will take 250 days and cost $500k?  At this point do you say:</p>

<p>* Never mind, it costs too much...  Hey, why don't you just "copy" the star schema models, change them, and build me my own...</p>

<p>* Never mind, it takes too long...  Hey, I'm going to build this in Microsoft Access, and get my own data feeds to make this work.</p>

<p>If you are experiencing this, it's because the wrong architecture has been chosen for the EDW (not the data marts, but the enterprise data warehouse) and it has reached its limits for agility.</p>

<p>When we build systems, what we want is a sure-fire way to deliver new data marts in about a 45-minute turn-around time (from the time the 2-page requirements document hits my desk to the time the user has a sample row set to play with in a data mart) - provided of course that we already have the data in the EDW.  This is IT agility and responsiveness.  The business user no longer has any reason to "roll their own".</p>

<p>So what's the secret?  How can we return to a good solid system?  How can IT get this accomplished especially given the pain of their existing "disparate and federated EDW" architecture?<br />
Part of the secret is in fact the Data Vault Model and the CMM compatible Data Vault Methodology.  Companies all over the world are actually seeing huge agility gains by implementing these components.  The Data Vault model is freely available (just like 3rd normal form and Star Schema) - you can read about it on www.TDAN.com.  We are doing work with intelligence agencies, government agencies, and commercial industries (very large companies) that are proving this today.</p>

<blockquote>An agile IT organization can lower its operating costs, can improve overall customer service, and can find new revenue opportunities. On the other hand, things such a disconnect between IT partners and the business, poor project management, and a large investment in legacy systems can deter IT from becoming agile. </blockquote><a href="http://enterpriseleadership.org/content.php?cid=1838">http://enterpriseleadership.org/content.php?cid=1838</a>

<p>This goes back to a differentiation between the definition of an Enterprise Data Warehouse and Data Marts (or data release areas).  </p>

<p>"You can catch all the minnows in the ocean and stack them together and they still do not make a whale," Bill Inmon, January 8, 1998.</p>

<p>You can also read my other entry on agility here: <a href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/08/it_agility_and_1.php">www.b-eye-network.com/blogs/linstedt/archives/2008/08/it_agility_and_1.php</a></p>

<p>Conclusion:<br />
The minnows are the data marts, cobbled together in an attempt to solve agility problems.  Today these systems are breaking down, and business users are losing (or have lost) faith in IT's ability to respond in a timely, cost-effective manner.  IT needs to get back on track: they NEED to be able to create new solutions, change with the business, and get costs under control in the EDW.  They MUST be flexible, scalable, and AUDITABLE all at the same time.  They need to choose the right architecture for the job.  </p>

<p>The Data Vault modeling techniques were created in 1990, and released to the public in 2001.  They are currently in use at the Belastingdienst (Netherlands tax authority), Central Bureau for Statistics in the Netherlands, SNS Bank, Daimler-Chrysler, <a href="http://www.hypotheker.nl/default.htm">Hypotheker</a> (Netherlands), oil & gas companies in Canada, banks around the globe, the Food & Drug Administration, the Federal Aviation Administration, and a number of other large institutions around the world.  The benefits are clear, the time is now.  Find out how to regain your agility and make business users happy again.</p>

<p>As always, I would love to hear from you.</p>

<p>Dan Linstedt<br />
danL@DanLinstedt.com<br />
</p>]]>
</content>
</entry>
<entry>
<title>Dimensionitis - Federated Stars as an EDW</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/11/dimensionitis_f.php" />
<modified>2008-11-18T11:13:23Z</modified>
<issued>2008-11-18T10:54:38Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2683</id>
<created>2008-11-18T10:54:38Z</created>
<summary type="text/plain">If you have a large number of star schemas, or a large federated star schema as an enterprise data warehouse, then you might or might not have this issue. This is one of the issues affecting business today. In this...</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>If you have a large number of star schemas, or a large federated star schema <em>as an enterprise data warehouse</em>, then you might or might not have this issue.  This is one of the issues affecting business today.  In this entry we will explore the issue called Dimensionitis from a business perspective, in other words: how much does it cost to maintain, what happens when... and so on.</p>]]>
<![CDATA[<p>Have you been confronted with "Silos" of information?  Does your IT team have a "logical box" drawn around a set of split up star schemas, and is it labeled EDW?</p>

<p>Does your "EDW" system look like this: <a href="http://intelligententerprise.com/channels/business_intelligence/showArticle.jhtml?articleID=206902663">http://intelligententerprise.com/channels/business_intelligence/showArticle.jhtml?articleID=206902663</a></p>

<p>or this:<br />
<a href="http://www.dwmantra.com/dwconcepts.html">http://www.dwmantra.com/dwconcepts.html</a></p>

<p>If you have one of these systems, then let me ask you this as a business user:<br />
a) Does it continually cost more money to build new stars? (add on to the logical EDW)<br />
b) Do you have "copies" of stars for different business units that produce different answers?<br />
c) Does your EDW contain silos of information that business is demanding be reconciled, and consolidated because of management costs?</p>

<p>OR: do you find yourself saying to IT: "Just create a copy of the existing dimension, modify the data fields so they contain just what I need...  Why try to conform it?  It costs too much, or it will take too long.  And by the way, if you (IT) can't do this, then I'll build it myself in Microsoft Access or Excel spreadsheets..."</p>

<p>If this is the case, then you may have Dimensionitis.  Dimensionitis is the desire to extend your "EDW" but because of cost or time being prohibitive, you suggest IT simply "copy" the dimension to create a new one.</p>

<p>This needs to be fixed at a business level.</p>

<p>Don't get me wrong, please... I'm not saying that dimensions and star schemas are bad - I believe they are the best mechanism for presenting data to the business users for OLAP, drill down, and so on.  What I am saying is that Star Schema modeling IS NOT SUITED to be an enterprise data warehouse.  The data modeling architecture was never built nor intended to be an EDW.  The original specifications did not have "type 1, type 2, or type 3" dimensions, nor did they define a "conformed dimension" - they only had a single star (no history) and were designed to be a subject-oriented answer set.</p>

<p>If your lines of business create new dimensions because re-engineering COSTS too much, or because IT takes too much TIME to "conform the new data to an existing dimension", then you've got a case of Dimensionitis running around.  This is also a loss of governance and control over the data in the EDW.</p>

<p>Next time, we'll discuss the impact of IT agility on your EDW projects.</p>

<p>Thoughts? Comments? Questions?<br />
Thanks,<br />
Dan Linstedt<br />
http://www.GeneseeAcademy.com<br />
</p>]]>
</content>
</entry>
<entry>
<title>People believe Surrogate keys are better than Business Keys</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/11/people_believe.php" />
<modified>2008-11-03T05:35:40Z</modified>
<issued>2008-11-03T05:07:05Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2650</id>
<created>2008-11-03T05:07:05Z</created>
<summary type="text/plain">Well, it&apos;s happened again. IT is trying desperately to eliminate the value of the EDW from the business (at least this is what I see). Business is responding by demanding the creation of Master Data systems. There seems to be...</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>Well, it's happened again.  IT is trying desperately to eliminate the value of the EDW from the business (at least this is what I see).  Business is responding by demanding the creation of Master Data systems.  There seems to be an age-old argument in the market space about the use of, definition of, and condition of: Business Keys.  IT appears to be telling people to use surrogate keys and to ignore the business keys entirely.  In this entry we will explore this single notion, and see what some folks have to say about it (me included!)  Mind you, this is a bit of a rant; they seem to know how to "get my goat" as they say...</p>]]>
<![CDATA[<p>I'll start off by stating that Codd & Date designed normalized forms to have business meaning.  They insisted that Business Keys be utilized in order to "make sense of" and "tag" the data sets appropriately so that relationships can be understood and maintained. </p>

<p>A simple link to a temporal database book houses a brief entry on the Information Principle:  <a href="http://books.google.com/books?id=grTubz0fjSEC&pg=PA35&lpg=PA35&dq=%22codd's+information+principle%22&source=web&ots=x75hjF6jLG&sig=1u36g5k5B6H1XfDqNpPF7wc6sWc&hl=en&sa=X&oi=book_result&resnum=5&ct=result">See it here.</a></p>

<p>I'd like to say a word or two about business keys (by the way, you can find additional information in my videos on YouTube: http://www.youtube.com/dlinstedt/).</p>

<p>What are business keys?<br />
Business keys ARE the master information that unlocks context for business users.  Business keys are (often) intelligent keys that have MEANING to the business.  Business Keys are often alphanumeric, parts of which may be generated sequences, other parts have meaning based on position.  In any event, business keys are USED by the business to locate, identify, and track information through the business life-cycle.  Without them, business may not be able to "use" or apply the information properly.</p>
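
<p>As a small illustration of a key whose parts "have meaning based on position", the Python sketch below splits a made-up part number into its pieces. The layout is purely an assumption for the example (2 characters of plant code, 3 characters of product line, then a 5-digit generated sequence); real business keys follow whatever convention the business itself defined.</p>

```python
from typing import NamedTuple


class PartNumber(NamedTuple):
    plant: str          # positions 1-2: hypothetical plant code
    product_line: str   # positions 3-5: hypothetical product line
    sequence: int       # positions 6-10: generated sequence number


def parse_part_number(key: str) -> PartNumber:
    """Split a hypothetical 10-character business key by position."""
    key = key.strip().upper()
    if len(key) != 10 or not key[5:].isdigit():
        raise ValueError(f"not a valid part number: {key!r}")
    return PartNumber(key[:2], key[2:5], int(key[5:]))


print(parse_part_number("dnaxl00042"))
```

<p>A business user can read "DN" and "AXL" straight off the key; nothing comparable can be read off a surrogate such as 5.</p>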

<p>What business keys are NOT:<br />
Business keys are NOT surrogates, NOT sequences, NOT ordered numeric elements assigned based on technical insertion rates.  Surrogate keys should NEVER be shown to the business, ever....  They should be used within a system (internally only) to identify rows to the machine, and provide optimal join paths, but they should NEVER appear on reports, screens, or anywhere that the business can see them.</p>

<p>Is there an argument around business keys versus surrogate keys?<br />
You bet!   Check out these comments:<br />
<a href="http://www.mindfuldata.com/Modeling/modeling-pdf/DAMA%202008%20Speaker%20Notes.pdf">http://www.mindfuldata.com/Modeling/modeling-pdf/DAMA%202008%20Speaker%20Notes.pdf</a><br />
<a href="http://stackoverflow.com/questions/63090/surrogate-vs-naturalbusiness-keys">http://stackoverflow.com/questions/63090/surrogate-vs-naturalbusiness-keys</a></p>

<p><br />
"Dimensions should always use a surrogate key that is generated within the warehouse. I went to a presentation a couple of years ago by Ralph Kimball (a data warehouse author), and he discussed the importance of removing the warehouse's dependency on business keys. The idea is a good one, because business keys change regularly and this will result in a long-term problem for the warehouse. However, when we discussed Slowly Changing Dimensions (especially ones that kept history), he said that we should use the business key to link them together. This went against what he had just said, so I decided that we needed to find another solution."   <a href="http://expertanswercenter.techtarget.com/eac/blog/0,295203,sid63_tax298150,00.html">http://expertanswercenter.techtarget.com/eac/blog/0,295203,sid63_tax298150,00.html</a></p>

<p><a href="http://www.mindfuldata.com/Modeling/modeling-pdf/DAMA%202008%20Speaker%20Notes.pdf">http://www.mindfuldata.com/Modeling/modeling-pdf/DAMA%202008%20Speaker%20Notes.pdf</a></p>

<p><a href="http://www.infoadvisors.com/Home/tabid/36/EntryID/191/Default.aspx">http://www.infoadvisors.com/Home/tabid/36/EntryID/191/Default.aspx</a><br />
<a href="http://www.cerebiz.com/blog/index.php/2007/08/06/use-of-surrogate-keys-in-data-warehousing/">http://www.cerebiz.com/blog/index.php/2007/08/06/use-of-surrogate-keys-in-data-warehousing/</a></p>

<p>WHY are these people insisting that there is no value to business keys? <br />
Because it's a very tough problem for business to overcome.  Yet the business today is ASKING, Begging, pleading for answers from Master Data Sets.  I maintain that you cannot build a master data system without looking at and using business keys as a central HUB of information.</p>

<p>Why not surrogates?<br />
If I ask you to look up surrogate key 5, do you understand what this is?  Where it came from?  What it is bound to?  Does it give you _any_ context at all as to which system generated the number?  Do you even know where to begin to find this key?</p>

<p>Surrogate numbers are generated today in EACH source system.  In the Data Warehousing world we are responsible for integrating MULTIPLE systems at once into a single place.  If we rely solely on these "surrogate keys" and completely ignore business keys as has been suggested by the links above, our EDW would never mesh or align for the business.  Furthermore trying to build a master data system would be impossible.  Some of these individuals I listed even went so far as to say: "ignore the business keys in your dimension entirely, because it is unruly (null) most of the time".</p>

<p>I say rubbish.  If your business is not properly synchronizing, populating, or utilizing business keys then it is hemorrhaging money along its business processes.  Business keys are vital to the traceability of information ACROSS lines of business and ACROSS systems.</p>

<p>Take a look at what I say about business keys:<br />
<a href="http://www.danlinstedt.com/AboutDV.php">http://www.danlinstedt.com/AboutDV.php</a><br />
<a href="http://www.tdan.com/view-articles/5285">http://www.tdan.com/view-articles/5285</a><br />
<a href="http://www.b-eye-network.com/blogs/linstedt/archives/2005/09/between_inmon_a.php">http://www.b-eye-network.com/blogs/linstedt/archives/2005/09/between_inmon_a.php</a></p>

<p>Bottom line: it is imperative that business keys span the systems.  If the business keys are changing, or are re-used, the business is LOSING MONEY.  I will take that to the board of directors level every single time, and every time I can find busted and broken business processes and a lack of visibility ACROSS the organization that tracks with their lack of regard for business keys.</p>

<p>The ONLY thing one has to do is look at the businesses that want master data systems - how are you (IT) going to integrate the data sets by surrogate if the surrogates generated by each source system ARE THE SAME across multiple sources?  WHICH surrogate are you going to show to the business as the "MASTER KEY" for which pieces of information?  It's a near-impossible problem to solve: the business units will fight over the definition, and it will come down to politics as to who is right/wrong, when the business REALLY should be deciding how to fix the source of the problem: lack of a single business key.</p>
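<p>To make the collision concrete, here is a minimal sketch in Python.  The two source systems, their column names, and the <code>integrate</code> helper are all made up for illustration; the only point is that each system hands out its own surrogate 5, while the customer number means the same thing in both.</p>

```python
# Hypothetical rows from two source systems. Each system generates its own
# surrogate ids, so the same number can point at different customers.
crm_rows = [
    {"surrogate_id": 5, "customer_number": "ACME-0042", "name": "Acme Corp"},
    {"surrogate_id": 6, "customer_number": "GLOB-0007", "name": "Globex"},
]
billing_rows = [
    {"surrogate_id": 5, "customer_number": "GLOB-0007", "balance": 1200},
    {"surrogate_id": 9, "customer_number": "ACME-0042", "balance": 350},
]


def integrate(sources, key):
    """Merge rows from several sources into one record per value of `key`."""
    merged = {}
    for source_name, rows in sources.items():
        for row in rows:
            record = merged.setdefault(row[key], {"sources": []})
            record["sources"].append(source_name)
            for column, value in row.items():
                # Surrogates are never carried into the merged record.
                if column != key and column != "surrogate_id":
                    record[column] = value
    return merged


sources = {"crm": crm_rows, "billing": billing_rows}
by_surrogate = integrate(sources, "surrogate_id")
by_business_key = integrate(sources, "customer_number")

# Surrogate 5 glues Acme's name to Globex's balance: a false match.
print(by_surrogate[5])
# The business key lines the two systems up correctly.
print(by_business_key["ACME-0042"])
```

<p>Joining on the surrogate quietly attaches Acme's name to Globex's balance; joining on the customer number does not.</p>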

<p>Auto manufacturers figured it out long ago: they use VINs (vehicle identification numbers) to uniquely identify make, model, manufacturer, date of manufacture, size of engine, and so on.  Unless you are doing something illegal, the VIN does not change, nor does it go away.  What would happen to the world of cars if the VIN disappeared?</p>
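<p>The VIN is a nice example of a key whose meaning depends on position.  A rough sketch in Python follows, assuming the common 17-character layout (manufacturer identifier in positions 1-3, descriptor section in 4-9, model year code in 10, plant code in 11, serial in 12-17; the year and plant positions follow the North American convention).  The function name and the dictionary keys are my own, and the sample string is just an illustrative value.</p>

```python
def split_vin(vin):
    """Split a 17-character VIN into its positional sections."""
    vin = vin.strip().upper()
    if len(vin) != 17 or not vin.isalnum():
        raise ValueError("expected exactly 17 letters and digits")
    if any(letter in vin for letter in "IOQ"):
        raise ValueError("the letters I, O and Q are not used in VINs")
    return {
        "wmi": vin[0:3],            # world manufacturer identifier
        "vds": vin[3:9],            # vehicle descriptor section
        "model_year_code": vin[9],  # position 10
        "plant_code": vin[10],      # position 11
        "serial": vin[11:17],       # positions 12-17
    }


parts = split_vin("1HGCM82633A004352")
print(parts)
```

<p>Nothing in the key has to be looked up in a surrogate table before a person can start reading it, which is the whole point of an intelligent business key.</p>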

<p>We have the SAME question in the world of counterfeit drugs...  Unfortunately E-Pedigree as a country-wide solution has been lobbied down, and pushed back.  Each bottle was to be labeled and identified as a unique bottle using a very specific bar code.  It would have allowed the entire industry to sort out MOST of the counterfeit drug problem, and save people's lives.</p>

<p>You can sit there and tell me that "Business Keys don't matter" but at the end of the day, I will say: you are losing money, and quite possibly people are dying without them.</p>

<p>Cheers,<br />
Dan Linstedt<br />
Check out WHY business keys are important, learn about the Data Vault Model.<br />
</p>]]>
</content>
</entry>
<entry>
<title>Many vendors Claim dynamic data warehousing...</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/10/many_vendors_cl.php" />
<modified>2008-10-30T21:05:30Z</modified>
<issued>2008-10-30T06:31:08Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2632</id>
<created>2008-10-30T06:31:08Z</created>
<summary type="text/plain">But very few (if any) actually execute on the vision that I am laying out here. This is a very short entry, but basically re-iterates some of the points of Dynamic Data Warehousing that I believe to be necessary before...</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>BI Vendors</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>But very few (if any) actually execute on the vision that I am laying out here.  This is a very short entry, but basically re-iterates some of the points of Dynamic Data Warehousing that I believe to be necessary before it (software/appliance/database) can be labeled as being something like this.</p>]]>
<![CDATA[<p>In my definition of dynamic data warehousing the software around the database is an artificially intelligent engine.  The database contains metadata about the structure, about the usage of the structure, and versions of all this metadata (producing a structural and usage life-cycle). </p>

<p>In other words, the AI engine is fed or kick started with an ONTOLOGY.  The ontology of terms defines the basic data model that is executed underneath.  The Ontology is driven by business terms, business definitions, functions, and descriptions (in accordance with OWL ontology).   Secondarily, the AI engine is fed many different data points including usage of the ontology:</p>

<p>* SQL Queries<br />
* Loading Code<br />
* Scripting Code<br />
* Application Code<br />
* Web Service Code</p>

<p>And all of the table references/usages/join criteria components within the code.</p>

<p>Dynamic Data Warehousing RESPONDS to changes BY ITSELF.  It responds to USAGE controls (ad-hoc queries, repeated queries, and so on).  It responds to LOADING controls (changes to structures, appearance of new attributes/fields, changes to loading code, volume and width).  It responds to length of processes (metrics driven), and responds to USER DRIVEN ONTOLOGY CHANGES (based on business requirements).</p>

<p>At the end of the day, the AI engine grades changes and figures out by itself how to: a) TUNE the structure, b) ADAPT or CHANGE the structure, including indexing, c) add new elements to the structure, d) retire old elements from the structure, e) OPTIMIZE the modeling paradigm for today's business execution cycles, and f) manage and propagate structural changes TO the loading code, TO the queries, and TO the ontologies.</p>

<p>Vendors may claim that they have "Dynamic Data Warehouses" all they want, but until they have automatic detection of structural changes, and automatic propagation of those changes - and these automated systems are associated to/with business ontologies, I will not agree that they in fact have a "Dynamic Data Warehouse".</p>

<p>This is just my opinion.  I believe that in order to become more fluid, and more appropriate to the business, and closer to business change, these are the kinds of systems that will evolve in the next 5 to 7 years.  </p>

<p>Cheers for now,<br />
Dan Linstedt<br />
Feel free to contact me directly: DanL@GeneseeAcademy.com, we teach custom Informatica courses, DW2.0 and Unstructured Data Courses, Zachman Framework courses, and Data Vault Data Modeling courses.</p>]]>
</content>
</entry>
<entry>
<title>VLDB/VLDW Expected Issues</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/09/vldbvldw_expect.php" />
<modified>2008-09-25T19:34:25Z</modified>
<issued>2008-09-25T19:15:52Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2467</id>
<created>2008-09-25T19:15:52Z</created>
<summary type="text/plain">A short Question list of VLDB/VLDW to introduce the ideas of complexity and scale, but not to offer any opinion on how to handle these situations.</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>VLDB (very large databases) and VLDW (very large data warehousing) are two different terms in the industry that cause quite a stir.  The terms have been changed, altered, re-defined, and brought back to the table many times by many people.  There are many problems associated with implementing "big systems" and not very many solutions (although vendors are trying).  There are some major business questions around the data sets and the application of such large data sets.</p>

<p>In this entry I will explore the business questions, and the technical challenges faced by big systems.  I will attempt to hold my opinion, and see what the responses are - what do you think are issues faced by your business?</p>]]>
<![CDATA[<p>First, as always, let's level-set the terms by defining what we mean by "big systems".</p>

<p>VLDB - A large database, with large amounts of information being loaded by a trickle feed, and large amounts of information being queried 24x7x365 (always up).  This creates a mixed workload environment.  An example system might be a telephone switch data capturing system hit by Quality control and financial analysts looking to see where they are losing and making money NOW (all current information).  Typically sized in the ranges from 50TB to 150TB of operational type data.</p>

<p>VLDW - A large database, inclusive of history (making it a data warehouse) at a granular level.  Typically loaded anywhere between 3 minute intervals and 24 hour intervals, with queries against large amounts of history, mixed in with queries that are "wide" but not "deep" - mixed workload, 24x7x365, detailed data set, raw data set.  An example might be all the history of the telephone switching systems mentioned above, so the analysts can determine over time which switches/hosting facilities have the most problems, and which bandwidth is frequently overloaded, and what the patterns of overload actually are.  Typically sized in the ranges of 150TB to well over 800TB of historical information (that is ACCESSED).</p>

<p>I'm not discussing systems where "I have 800TB, but it's all on storage, and we load weekly..."  - no, that's not what I'm talking about.</p>

<p>The business questions that are under controversy include:  (remember, I'm going to hold my opinionated answers until later)<br />
1) Do we really NEED all this data?  What does it buy the business? What can be learned from this?</p>

<p>2) What could possibly be hidden in 800TB that the business users access?</p>

<p>3) What tactical questions are answered by having raw data (transactions) loaded to the VLDW?</p>

<p>4) Why can't the operational system (VLDB) serve as the system of record?</p>

<p>5) What does the VLDW have that the VLDB doesn't?  Why do I need to justify the existence of both?</p>

<p>6) How do I mitigate risk of failure of either system?</p>

<p>7) Do I need replication technology instead of "backup" technology for fail-over and recoverability?</p>

<p>8) Is there a SINGLE RDBMS engine that will answer these questions AND scale beyond?</p>

<p>9) Do I need to scale beyond 300/400/800TB?  What will that buy me?</p>

<p>And the technical questions:<br />
1) How do I manage backups and restores of this much information?<br />
2) Is Data Modeling really necessary?<br />
3) Why can't I cluster my machines together?  Why do I need MPP or Big-Iron SMP to make this work?<br />
4) How do I get the DBMS to handle mixed-workload queries?<br />
5) Why does the system "go-down" when I fire up massive loads WHILE querying?<br />
6) Why do vendors continue to push TPC-H performance when that isn't my "real-world"?<br />
7) What's the difference in systems at 300TB and systems at 800TB?<br />
8) What changes to my architecture/network/OS do I need to make to accommodate this scale?<br />
9) Why can't the users get along with "LESS DATA?"  Do they really use all of this?</p>

<p>Love to hear your thoughts,<br />
Dan L<br />
DanL@DanLinstedt.com</p>]]>
</content>
</entry>
<entry>
<title>Part 9: Secrets of the Masters - Tracking &amp; Governance</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/09/part_9_secrets.php" />
<modified>2008-09-25T04:16:51Z</modified>
<issued>2008-09-25T03:51:38Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2464</id>
<created>2008-09-25T03:51:38Z</created>
<summary type="text/plain">In part 7 of this series I mentioned that I would share how to number deliverables of the project to assist in monitoring progress, and managing metrics (KPA&apos;s and KPI&apos;s of the project). In this entry I provide some very...</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>In part 7 of this series I mentioned that I would share how to number deliverables of the project to assist in monitoring progress, and managing metrics (KPA's and KPI's of the project).  In this entry I provide some very simplistic starting blocks on how this is done within a project methodology, and hopefully - there's light at the end of the tunnel where we can begin to see the impact on the risk, estimations of hours, cost measurements/forecasts, and actuals for delivery.  This entry is all about Project Management and deliverables - how to tie them together.</p>]]>
<![CDATA[<p>If you are SEI/CMMI certified, or are an auditor, I would love to have your feedback and comments regarding these subjects.  If you are PMP or Six Sigma or TQM familiar, it would be good to hear from you as well - please, tear down these ideas if they do not fit with your experience.  I can only learn from you if you respond. :)</p>

<p>In my past 15+ years (now going on 20 years) of IT experience I've spent maybe 6 to 8 years managing projects (technical project management).  One of the hats I wore was not only business analyst, but also full project manager, and team enabler.  Under this guise of lean-initiatives and cycle time reduction, I learned a few things that seemed to make sense at the time.</p>

<p>I got tired of estimates that didn't match the project plan, couldn't be scoped properly, or attached to actuals and deliverables.  I got tired of risk running rampant and killing projects before they started.  I got tired of always being asked: "how complete are you and your team on task X?" the real question I got was: "How close to done are you with the requirement...  you know, the requirement that discusses ZZZZZ...."</p>

<p>I needed a way to track all of this, and furthermore to be able to press a button and run some analytical reports / graphs (every day) on the project as we moved along.  So, taking from technical writing requirements, and from SEI/CMMI and from the legal profession (about which I only know that they number every paragraph)....  I started numbering everything I could find.</p>

<p>For instance, I went through the business requirements, numbered all TITLES and SUB-TITLES, and paragraphs.<br />
1.0 Requirements Overview<br />
1.1 Requirement 1<br />
1.1.1 Response time for req. 1<br />
1.1.1.1  The expected response time........ (paragraph)<br />
1.1.2 Types of queries...</p>

<p>You get the idea.  Next, I numbered the technical requirements to mesh with the business requirements.  I aligned the requirements to match up in a matrix of "this is what they want, and this is how we propose to build it."  This was appropriately called "IT alignment" (at least it was in the '80s).  Then, I took the technical requirements and began numbering EVERY line-item in the project plan.  By the way, this became a GREAT way to spot requirements (stated) that were missed in the project plan...  interesting loop-hole catch.</p>
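<p>Once everything carries a number, the "missed requirement" catch is just a lookup.  Here is a small Python sketch of the idea; the requirement numbers, the T-xx technical ids, and the task names are invented for the example and are not from the actual project.</p>

```python
# Hypothetical numbering: business requirements, the technical requirement
# that answers each one, and project plan line-items tagged with a
# technical requirement id.
business_reqs = ["1.1.1", "1.1.2", "1.2.1"]
tech_to_business = {"T-01": "1.1.1", "T-02": "1.1.2", "T-03": "1.2.1"}
plan_tasks = [
    ("Design summary tables", "T-01"),
    ("Tune load window", "T-01"),
    ("Build query layer", "T-03"),
]


def coverage_gaps(business_reqs, tech_to_business, plan_tasks):
    """Return technical and business requirements with no plan line-item."""
    planned_tech = {tech_id for _task, tech_id in plan_tasks}
    unplanned_tech = sorted(t for t in tech_to_business if t not in planned_tech)
    covered_business = {
        tech_to_business[t] for t in planned_tech if t in tech_to_business
    }
    missed_business = [r for r in business_reqs if r not in covered_business]
    return unplanned_tech, missed_business


unplanned_tech, missed_business = coverage_gaps(
    business_reqs, tech_to_business, plan_tasks
)
print("Technical requirements with no plan item:", unplanned_tech)
print("Business requirements not reached by the plan:", missed_business)
```

<p>In this made-up plan nobody scheduled any work for T-02, so business requirement 1.1.2 falls out of the report as uncovered.</p>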

<p>I then thought to myself: Self.... (just kidding)<br />
I took the project plan, and assigned roles rather than people (as responsible parties).  For these I developed a roles & responsibilities document, and then numbered that too (independently of the requirements).  RR1.0, RR1.1, RR2.0, etc....  I took the R&R numbering system and attached the numbers back to the technical requirements, then assigned resources to the roles and responsibilities, and to each of the resources - I assigned resource loading.  This ended up becoming the work-breakdown structure (Project + Tech requirements + RR)</p>

<p>Next, I created an organizational breakdown structure (org chart), and developed an escalation path for each role, numbering each element in the org-chart as I went, assigning RR1.1 to a specific org unit.  Now we knew where the risk would be handled, or escalated as things got hot (if they got hot).  Next on the list was a process breakdown structure (AS-IS process flows).  We needed to know how the data currently moved from one business unit to another, from one system to another.  We developed process flows at 30,000 feet and above - then numbered all of them with the appropriate business requirements number (which tied the artifacts to specific components of the project plan).</p>

<p>Then, we immediately began designing new (to-be) process flows, which re-defined some of the interfaces, and how the data would flow to the warehouse, out to the marts, and back to the reports in the business users' hands.  We then numbered each of these to-be process flows with the "original process flow numbers" tying them together.  As we built the "to-be flows" and completed the process re-design, we could attach these components to mile-stones reached within the project and produce deliverables consisting of data and process to the business users.</p>

<p>Finally, we went through each major section of the technical requirements and assigned risk analysis templates by applying expected skill sets (balanced against the R&R, the org chart, and availability) - we developed a low, medium, and high risk score.  We then set a threshold for warning (approaching high risk) where we would begin escalation procedures up the Org Breakdown structure.</p>

<p>Needless to say, there were many other deliverables (all docs were versioned in keeping with CMM), all processes were measured, quantified, and then optimized, and the Data warehouse (now some 15 years later) is still running strong.</p>

<p>Ok - you think this is a lot of work?  We did this with a team of 3 people, + 1 person from the PMO (proj. Mgmt Office), + 1 DBA part time, + 1 senior/expert data modeler/data architect part time.  And we accomplished delivery of the full production warehouse inside 6 months for 3 source systems (Planning, Manufacturing and Finance).  The EDW consisted of 60+ tables, source systems around 300+ tables with manufacturing bill of materials.</p>

<p>It can be done, with the right people, the right training and the right expertise - and the benefits can be enormous.  It doesn't take a huge bankroll to institute this type of "project governance" or maturity model for EDW projects, just dedication and consistency.</p>

<p>Hope this helps,<br />
Dan Linstedt<br />
danL@DanLinstedt.com</p>]]>
</content>
</entry>
<entry>
<title>Dynamic Data Warehousing Stepping Forward</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/09/dynamic_data_wa_2.php" />
<modified>2008-09-22T05:09:48Z</modified>
<issued>2008-09-22T04:52:48Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2449</id>
<created>2008-09-22T04:52:48Z</created>
<summary type="text/plain">I&apos;ve blogged about this topic for many years now, my first mention of it was in my www.TDAN.com articles regarding the Data Vault Modeling architecture. However, that said, I&apos;ve been blogging on everything from autonomic data models, to dynamic data...</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Dynamic Data Warehousing</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>I've blogged about this topic for many years now, my first mention of it was in my www.TDAN.com articles regarding the Data Vault Modeling architecture.  However, that said, I've been blogging on everything from autonomic data models, to dynamic data warehousing, but in my research, I've come to realize I've left out some very critical components.  I've lately been experimenting with building a self-adapting structured data warehouse.  There are many moving pieces and not all the experiments are finished, so I cannot write (yet) about any of the findings.  But here, I'll expose some more of the under-belly as it were that is necessary to make DDW a reality (in my labs anyhow)....</p>]]>
<![CDATA[<p>I've tried and tried to find a new name for this thing, but alas, it just seems to elude me.  Dynamic Data Warehousing seems to have a nice ring, and is quite the nice fit.  The term however evokes all kinds of different meanings to different companies and different people.  So much so, that I've had open discussions with IBM in the past about their use of the term!  Oh-well, water under the bridge.</p>

<p>But that brings me to my next point.  There are missing components to my definition of DDW, I didn't get it all, and I'm sure that this is just another step in the definition (that the definition will not be completed for another year or two).  If I look back at what's going on I see the following:</p>

<p>Convergence of:<br />
* Operational Processing and Data Warehousing.<br />
* Master Data and Metadata to use the Master Data Properly<br />
* Tactical decisions backed by strategic result sets<br />
* Business, Technical, Architectural, and Process Metadata<br />
* Real-Time and Batch processing<br />
* Standard reporting technologies and "Live animated scenarios" with walk-throughs and 3D imagery<br />
* Human-machine interfaces<br />
* MPP RDBMS systems and Column Based Database solutions</p>

<p>Why then shouldn't we see convergence of "data models" and "business processes"?<br />
or "Data Models" and "Systems Architecture"?</p>

<p>The point is: WE ARE.  (or at least I am).  Not only is this happening in my labs, but It's being requested of me when I visit client sites.  The customers want "1 solution", or better yet, they want a solution that "appears to learn" based on the demands put upon the system.</p>

<p>Why do I say "appears to learn?"<br />
Because machine learning and the appearance of machines translating context are two totally different things.  I cannot and will not claim to have made a machine think.  However, I can and have made a machine's enterprise data warehouse responsive to external stimuli - at least when it comes to the data model, loading routines, and queries.  Please do NOT mistake this as anything more than AI applied in a new manner - mining metadata (structure and queries and load-code and web-services) rather than just mining data sets themselves.  (more on that later, much later --- I still have a LOT of research to do).</p>

<p>Ok - so what's missing from the Dynamic Data Warehouse definition?<br />
* Use of metadata: business, technical, and process during the model learning/adaptation phase<br />
* Use of an ontology (part of business and technical metadata as described above)<br />
* Use of a training model, all good neural nets need to be trained over time, and then corrected.<br />
* Use of the queries to examine and compare HOW the data sets are being used and accessed against the current data model<br />
* Use of a minimal load-code parser, again to assist in training the neural net to recognize the correct structure.</p>

<p>Anyhow, you get the point.  Dynamic Data Warehousing is about a back office system that responds to changes in the structured data world - as the queries change, the indexes change.  As the incoming data set changes, the model needs to change.  Some queries (if consistent enough) can actually express new relationships that need to be built.</p>
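<p>As a toy illustration of "as the queries change, the indexes change", here is a Python sketch that counts how often columns show up in query predicates and compares that to the indexes that exist today.  This is a plain frequency count, NOT the neural net or the engine from my labs; the query log format, the <code>suggest_indexes</code> name, and the <code>min_uses</code> threshold are assumptions made up for the example.</p>

```python
from collections import Counter


def suggest_indexes(query_log, existing, min_uses=3):
    """Compare predicate-column usage with the current set of indexes.

    query_log: one list of (table, column) pairs per query, assumed to be
               extracted from WHERE / JOIN clauses by some parser upstream.
    existing:  set of (table, column) pairs that are indexed today.
    """
    usage = Counter()
    for predicate_columns in query_log:
        # Count each column once per query, however often it is repeated.
        usage.update(set(predicate_columns))
    wanted = {column for column, uses in usage.items() if uses >= min_uses}
    to_create = sorted(wanted - existing)
    to_retire = sorted(existing - wanted)
    return to_create, to_retire


query_log = [
    [("sales", "customer_id"), ("sales", "sale_date")],
    [("sales", "customer_id")],
    [("sales", "customer_id"), ("sales", "region")],
    [("sales", "sale_date")],
    [("sales", "sale_date")],
]
existing = {("sales", "region"), ("sales", "customer_id")}

to_create, to_retire = suggest_indexes(query_log, existing)
print("Candidate new indexes:", to_create)
print("Candidate indexes to retire:", to_retire)
```

<p>A real system would also have to weigh load cost, data volumes, and the ontology before touching anything; the sketch only shows where usage metadata enters the loop.</p>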

<p>This is an adaptable system, this is a dynamic system, this will eventually become a true Dynamic Data Warehouse.</p>

<p>Thoughts?<br />
Dan Linstedt<br />
DanL@DanLinstedt.com</p>]]>
</content>
</entry>
<entry>
<title>VLDB: Column Based versus Row Based</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/09/vldb_column_bas.php" />
<modified>2008-09-21T22:38:35Z</modified>
<issued>2008-09-21T21:56:49Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2448</id>
<created>2008-09-21T21:56:49Z</created>
<summary type="text/plain">Column based databases/appliances are making headway in the VLDB/VLDW world. There is no doubt that there are benefits to this approach, but there are also drawbacks. In this entry I explore some of the articles, links, facts and figures -...</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>Column based databases/appliances are making headway in the VLDB/VLDW world.  There is no doubt that there are benefits to this approach, but there are also drawbacks.  In this entry I explore some of the articles, links, facts and figures - as related to my personal experience.  Then I compare what different authors are saying against Row-Based MPP technologies to see what the differences and similarities are.  This by no means is a complete research paper, but just a peek into what the future may hold for RDBMS vendors and the new Column based data stores.  Of course, Solid state disk, and RAM/Flash based data sets will change things again shortly.  I'll also touch on the impacts to Data Modeling and what it may mean going forward.</p>]]>
<![CDATA[<p>Let's first set the table by defining what the terms mean:</p>

<p>1) for VLDB/VLDW I'm referring specifically to a 300TB and above system.<br />
2) I'm also referring to LIVE data sets, where it isn't JUST 300TB sitting in a storage disk somewhere, but there's a significant amount of information being loaded AND queried at the same time, utilization is somewhere around 100TB "used/accessed/referenced/loaded" per week.<br />
3) I'm also referring to a MIXED WORKLOAD system, meaning real-time transactions are streaming in, batch loads are occurring, and both tactical and strategic queries are taking place at the same time.</p>

<p>By MPP: I mean Massively Parallel Processing capabilities, like DPF from DB2 UDB (IBM - running shared-nothing architecture), and Teradata with independent nodes to scale out.  I'm also referring to these traditional database systems as "row-based" database engines.</p>

<p>For Column Based "appliances" I am referring to Sybase IQ, Vertica, Dataupia, and others which provide column based data storage.  NOTE: Netezza is NOT a column based store, rather it is a flat-wide appliance with hardware that figures out exactly what data set you need before hitting disk to retrieve it.</p>

<blockquote>Thus, one might expect column-stores to perform similarly to a row-store with an index on every column without the corresponding negatives of creating many indices. In fact, this is a common argument we have often heard regarding column-stores and their expected performance relative to carefully designed row-stores -- both approaches provide good read performance, with the column store providing lower total cost of ownership (since you don't have to figure out what indexes to create anymore).

<p>Though this argument sounds reasonable, it is completely incorrect.  It is also dangerous since it might cause you to end up choosing a row-store when what you really need is a column-store.</blockquote>  http://www.databasecolumn.com/2008/07/debunking-a-myth-columnstores.html</p>

<p>If you're interested in furthering your knowledge on indexing versus column compression, the article: http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf  is a very good source for examining the mathematics behind the tuple sets and joins.</p>

<p>Most of the articles I've located discuss indexing, and differences between indexing and column based tuple access.  Unfortunately they don't tend to address the loading speeds and performance of getting the data "IN" to the database in the first place.</p>

<p>Column based data stores bring benefits to the table:<br />
* Rapid Query, less overhead (according to the math that I've read through)<br />
* No need for PHYSICAL data modeling (as long as you don't need/want GOVERNANCE or MANAGEMENT in your data store).<br />
* No "seemingly physical" limit to the number of columns PER TABLE.<br />
* Automatic data compression/removal of duplicates on insert<br />
* IF the grid / cloud computing works properly, then they should be able to scale out<br />
* They appear to achieve anywhere from a 3:1 to a 7:1 compression ratio on the data slammed in to the box.<br />
* Raw data can be loaded quickly (in native format) without "stopping to normalize, or assign sequence number surrogate keys"</p>
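<p>To see where the compression and the duplicate removal come from, here is a simplified Python sketch that pivots a handful of rows into columns and run-length encodes each column.  Run-length encoding is only one of several encodings a column store might use, and the sample rows and function names are invented; the tiny ratio you get here says nothing about the 3:1 to 7:1 figures above.</p>

```python
from itertools import groupby

column_names = ("state", "status", "quantity")
rows = [
    ("CO", "shipped", 10),
    ("CO", "shipped", 12),
    ("CO", "shipped", 7),
    ("NY", "shipped", 7),
    ("NY", "returned", 3),
    ("NY", "returned", 3),
]


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows, column_names)
encoded = {name: run_length_encode(values) for name, values in columns.items()}

for name in column_names:
    print(name, encoded[name])

# A query on one column only has to read that column's runs.
shipped = sum(n for value, n in encoded["status"] if value == "shipped")
print("shipped rows:", shipped)
```

<p>Low-cardinality columns such as state and status collapse to a couple of runs, while the quantity column barely shrinks.  That is why the ratio you actually see depends so heavily on the data.</p>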

<p>Now let's take a look at some of the issues that they bring to the table (simple issues)<br />
* Most column based databases have yet to solve massive load performance issues<br />
* Most column based databases have to "STOP" the data stream to compress it, and assign it to the right column post-loading.<br />
* In order to achieve high speed trickle feed (8,000 transactions per second or better) they need to have a significant RAM cache somewhere on one of the nodes to load the data.<br />
* Splitting the data over multiple gridded nodes might take more work than originally thought<br />
* Load balancing with spreading the data set across multiple gridded nodes might be an issue.<br />
* Today, most column based data stores work extremely well on big iron SMP boxes, but struggle to take full advantage of Grid technologies and shared-nothing architectures.<br />
* To handle "BATCH LOADS" Most column based data stores use a "staging area" internally to load the batch data, then split it across and push it in to the column database (this may NOT be such a bad thing... we do this in MPP environments too!)<br />
* Column based databases have "come and gone", the only one that has stuck around over the years has been Sybase IQ, and finally for the first time in many years we are beginning to see announcements from the company that they are putting money back into R&D for this product.</p>

<p>Let's take a look at the physical nature of MPP:<br />
Pros:<br />
* Provides mechanisms for governance and management through physical data modeling<br />
* Provides high-speed batch loads, and high-speed trickle feeds (real-time transactions)<br />
* Provides balanced queries, and can easily handle mixed workload components (loading while querying, and both tactical and strategic queries at the same time).<br />
* Has grown up, is based on mature proven technology.<br />
* Scales out very easily, allows MASSIVE sets of data (because it's not locked in to a single SMP environment).</p>

<p>Cons:<br />
* Usually requires good physical data modeling (normalization) in order to load-balance the data sets across the nodes.<br />
* Usually requires a staging area inside the MPP platform before re-distributing the data ** caveat: some MPP platforms have architected their bulk-loaders to overcome this problem.<br />
* Usually requires JOIN INDEXES or some materialized table to assist with the Tuple Joins<br />
* Usually requires column based compression to be turned on by the operator to achieve benefits.<br />
* Requires enough nodes to "split the workload evenly"<br />
* Requires all nodes to be running at the same speed in order to achieve maximum performance gains.</p>
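
<p>The load-balancing point can be sketched in a few lines of Python.  This is not any vendor's distribution algorithm; the node count, the row layout and the column names are assumptions for illustration.  It shows why the choice of distribution key matters: a high-cardinality key spreads the rows out, while a low-cardinality key piles them onto a couple of nodes.</p>

```python
import hashlib
from collections import Counter


def node_for(key, node_count):
    """Map a distribution-key value to a node with a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def rows_per_node(rows, key_column, node_count):
    """Count how many rows land on each node for a given distribution key."""
    counts = Counter(node_for(row[key_column], node_count) for row in rows)
    return [counts.get(node, 0) for node in range(node_count)]


# Hypothetical data set: 10,000 orders, unique order_id, only two regions.
rows = [{"order_id": i, "region": "EAST" if i % 2 else "WEST"}
        for i in range(10000)]

by_order = rows_per_node(rows, "order_id", 8)   # high-cardinality key
by_region = rows_per_node(rows, "region", 8)    # low-cardinality key

print("by order_id:", by_order)
print("by region:  ", by_region)
```

<p>With only two distinct region values, at most two of the eight nodes receive any rows, so the remaining nodes sit idle no matter how fast they are.</p>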

<p>So these are just a FEW of the points made both for and against column based databases when comparing them to MPP designs.  They both work well for their own purpose.  Customers of mine continue to look for a "single solution to do it all"; today, however, that just doesn't seem possible.  This is why (I think) we continue to hear vendors like IBM and Teradata advertise: "we partner with...." fill in the blank of your favorite column based database...</p>

<p>However, watch the vendors closely - this market space is heating up, and over the next year I expect new technologies to be released from all vendors that will converge some functionality and blur the lines between RDBMS MPP, and Column based on a grid.</p>

<p>Thoughts?  What do you see in the market?<br />
Dan Linstedt<br />
DanL@DanLinstedt.com<br />
</p>]]>
</content>
</entry>
<entry>
<title>Self-adapting Data Models</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/08/selfadapting_da.php" />
<modified>2008-08-27T13:14:47Z</modified>
<issued>2008-08-27T12:54:29Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2381</id>
<created>2008-08-27T12:54:29Z</created>
<summary type="text/plain">Automorphic data models, self-healing, self-optimizing enterprise data warehouses.  Learning trends, and adapting/tuning the model as we ask questions and feed it data sets from live streams.  The ability to spot trends and associations we haven&apos;t considered before.</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Thought Experiments</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>In my last entry in this category, I described automorphic data models and how the Data Vault modeling components form one of the architectures/data models that will support dynamic adaptation of structure.  In this entry I will discuss a little bit about the research I'm currently involved in, and how I am working towards a prototype that makes this technology work.  </p>

<p>If you're not interested in the Data Vault model, or you don't care about "Dynamic Data Warehousing" then this entry is not for you.</p>]]>
<![CDATA[<p>The Data Vault model has reached the height of flexibility by applying the Link tables.  It is an architecture that is linearly scalable and is based on the same mathematics that MPP is based on.  Single Link tables represent associations, concepts linking two or more KEY ideas together at a point within the model.  They also represent the GRAIN of those concepts.</p>

<p>Because the link tables are always a Many To Many, they are abstracted away from the traditional relationship (1 to many, 1 to 1, and many to 1).  The Links become flexible, and in fact, dynamic.  By adding strength and confidence ratings to the link tables we can begin to gauge the STRENGTH of the relationship over time.</p>

<p>Dynamic mutability of data models is coming.  In fact, I'd say it's already here.  I'm working in my labs to make it happen, and believe me it's exciting.  (only a geek would understand that one...)  The ability to:</p>

<p>* Alter the model based on incoming where clauses in queries (we can LEARN from what people are ASKING of the data sets and how they are joining items together)<br />
* Alter the model based on incoming transactions in real-time (by examining the METADATA) and relative associativity / proximity to other data elements within the transaction.<br />
* Alter the model based on patterns DISCOVERED within the data set itself.  Patterns of data which were previously "un-connected" or not associated.</p>
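
<p>As a rough illustration of the first item (learning from what people ask), here is a minimal Python sketch that counts which tables are joined together in a query log.  The query log, the table names and the threshold are hypothetical, and a regular expression is only a stand-in for a real SQL parser; this is not my prototype, just the idea in its simplest form.</p>

```python
import re
from collections import Counter

# Matches equality predicates between two qualified columns, e.g. "a.id = b.id".
JOIN_PREDICATE = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")


def joined_pairs(sql):
    """Return the set of table pairs joined by equality predicates in one query."""
    pairs = set()
    for left_table, _, right_table, _ in JOIN_PREDICATE.findall(sql):
        if left_table != right_table:
            pairs.add(tuple(sorted((left_table.lower(), right_table.lower()))))
    return pairs


def candidate_links(query_log, min_count):
    """Count joined pairs over the log; keep those seen at least min_count times."""
    counts = Counter()
    for sql in query_log:
        counts.update(joined_pairs(sql))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical query log using plain table names (no aliases).
query_log = [
    "SELECT * FROM customer, orders WHERE customer.id = orders.customer_id",
    "SELECT * FROM orders, customer WHERE orders.customer_id = customer.id",
    "SELECT * FROM orders, product WHERE orders.product_id = product.id",
]

print(candidate_links(query_log, min_count=2))
```

<p>Pairs that keep showing up in the log are candidates for a Link table; pairs that stop showing up are candidates for retirement.</p>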

<p>The dynamic adaptability of the Data Vault modeling concepts shows up as a result of these discovery processes.  I'm NOT saying that we can make machines "think" but I AM suggesting that we can "teach" the machines HOW the information is interconnected through auto-discovery processes over time.  This mutability of the structure (without losing history) begins to create a "long term memory store" of notions and concepts that we've applied to the data over time.  </p>

<p>Through recording a history of our ACTIONS (what data we load, and how we query) we can GUIDE the neural network into better decision making and management over the structures underneath.  This ranges from the optimization of the model to the discovery of new relationships that we may not have considered in the past.</p>

<p>The mining tool is:<br />
* Mining the data set AND<br />
* Mining the ARCHITECTURE<br />
* Mining the queries AND<br />
* Mining the incoming transactions</p>

<p>to make this happen.  We've known for a very long time that Mining the data can reap benefits, but what we are starting to realize NOW is that mining these other components really drives home new benefits we've not considered before.  In the Data Vault Book (the new business supermodel) I show a diagram of convergence (which has been bought off on by Bill Inmon).  Convergence of systems is happening, Dynamic Data Warehousing is happening.</p>

<p>These neural networks work together to achieve a goal: creating and destroying link tables over time (dynamic mutability of the data model) while leaving the KEYS (Hubs) and the history of the keys (Satellites) intact.  Keep in mind that the Satellites surrounding Hubs and Links provide CONTEXT for the keys.</p>

<p>I've already prototyped this experiment at a customer, where I personally spent time mining the data, the relationships, and the business questions they wanted to ask.  I built 1 new link table as a result with a relationship they didn't have before.  We used a data mining process to populate the table where strength and confidence were over 80%.  The result?  Their business increased their gross profit by 40%.  They opened up a new market of prospects and sales that they didn't previously have visibility to.</p>
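
<p>For readers who want to picture the "strength and confidence over 80%" filter, here is a minimal Python sketch.  The definitions of strength and confidence below are my assumptions for illustration (borrowed from simple co-occurrence scoring); they are not the mining process used at the customer, and the business keys are made up.</p>

```python
from collections import Counter
from itertools import combinations


def score_pairs(transactions):
    """Score every pair of keys that appear together in the transactions.

    Assumed definitions (illustrative only):
      strength   = co-occurrences / transactions containing either key
      confidence = co-occurrences / transactions containing the first key
    """
    key_counts = Counter()
    pair_counts = Counter()
    for keys in transactions:
        unique = sorted(set(keys))
        key_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    scores = {}
    for (a, b), together in pair_counts.items():
        either = key_counts[a] + key_counts[b] - together
        scores[(a, b)] = (together / either, together / key_counts[a])
    return scores


def link_rows(transactions, threshold=0.8):
    """Keep only pairs whose strength and confidence both exceed the threshold."""
    return {pair: score for pair, score in score_pairs(transactions).items()
            if min(score) > threshold}


# Hypothetical business keys seen together in five source transactions.
transactions = [
    ["CUST-1", "PROD-9"],
    ["CUST-1", "PROD-9"],
    ["CUST-1", "PROD-9"],
    ["CUST-1", "PROD-9", "PROD-4"],
    ["CUST-1", "PROD-9"],
]

print(link_rows(transactions))
```

<p>Only the pair that clears both thresholds would be written to the new Link table; the scores travel with the row so the relationship can be re-rated later.</p>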

<p>Again, I'm building new neural nets, new algorithms using traditional off the shelf software and existing technology.  It can be done, we can "teach" systems at a base level how to interact with us.  They still won't think for themselves, but if they can discover relationships that might be important to us, then alert us to the interesting ones - then we've got a pretty powerful sub-system for back-offices.</p>

<p>More on the mathematics behind the Data Vault is on its way.  I'll be publishing a white paper on the mathematics behind the Data Vault Methodology and Data Vault Modeling on B-Eye-Network.com very shortly.</p>

<p>Cheers,<br />
Dan Linstedt<br />
</p>]]>
</content>
</entry>
<entry>
<title>Part 8: Secrets of the Masters - Business Requirements</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/08/part_8_secrets.php" />
<modified>2008-08-26T12:54:51Z</modified>
<issued>2008-08-26T12:32:21Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2377</id>
<created>2008-08-26T12:32:21Z</created>
<summary type="text/plain">Moving the business rules downstream to &quot;pulling from the warehouse&quot; can have a profound impact on your business requirements gathering phase.  Shortening the cycle time to build your EDW/BI solution, and improving your success rates by gathering requirements quickly.</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>Here is another installment of the secrets of the masters.  Quite frequently customers and IT alike complain about how difficult it is to gather business requirements.  They discuss the pain of having to "get together" for a day, or for a week-long process to write down and document business processes, and ultimately their needs and desires for a new BI/EDW system.  Any good analyst worth their salt has battle-scars from negotiating these treacherous grounds.</p>

<p>We've all walked in to an environment with a blank white-board and asked: business, please give me your requirements, only to be confronted with: "What can you provide to us?"</p>]]>
<![CDATA[<p>I'm here to describe a different way to you.  This is an ancient technique and requires a flak jacket to be worn by the IT participants at all times.  Remember: nothing is personal, this is only business.  No really, I'm not a Zen master, but a Zen master once said: it is far easier to tear something down than it is to build something up.  So with that in mind, here we go...  (I'm kidding about the ancient part...)</p>

<p>There are many of you in the BI and data integration world who handle BI requirements in a similar fashion: go through the months and months of drudgery discovering business processes, and "requests" for new systems design.  Then go through all kinds of long lasting meetings to pull together a buy-off, and write up a business requirements and a technical requirements document.  Then throughout the design and build phase of the project, have the users "slip in new requirements and larger scope because they forgot to tell you something."</p>

<p>But this process is incredibly painful, requires a lot of money up front before business users can "see" anything, and requires diligence on behalf of the IT and business folks to see it through.  Now don't get me wrong, I'm NOT saying this is a bad thing, I am saying that it simply takes too long and there is a better way up Mount Everest.</p>

<p>Ok - enough with the bad jokes already... How do I do this?<br />
For one, you need to shift your thinking about integration of data sets into your data warehouse.  I've blogged on this before.  There is a paradigm shift in the works for auditability and compliance, and it basically says: move your business rules DOWNSTREAM of the EDW.  That's right, take a deep breath and swallow.  Placing business rules upstream of the EDW will lead you back to the old techniques of waiting and waiting for business requirements.  Moving the business rules DOWNSTREAM and implementing them coming out of the EDW allows us to see new light on gathering business requirements.</p>

<p>Easy for you to say...  I still don't quite get it....<br />
Right.  Have you ever walked in to a room full of business users and started describing the data that their systems are "collecting today" from an integrated stand point?  If you haven't, you should try it.  If you have, then you know: Business users at that point are quick to the draw to point out why you're wrong, where the systems are wrong, what the problems are with the systems you are talking about, and of course - why they have their own special Excel spreadsheet that they built to FIX this problem. </p>

<p>The point is, you can literally start fires with your pen you are writing so fast...  These are the missing business requirements that you've been seeking so long and hard.  It requires a flak jacket because you cannot take it personally.  </p>

<p>By moving the business rules downstream of the EDW, we can load RAW and AUDITABLE data "as-it-stands" into the EDW.  From there we can produce something called an "AS-IS STAR SCHEMA."  The AS-IS Star shows raw level grain, with un-doctored and uncensored, and un aggregated data sets.  You can then share with the business users "this is the way your source systems (once integrated) are currently capturing data, and by the way, these are the results of your source systems executing your business processes."</p>

<p>They very quickly are more than happy to tear it down, shoot holes in it, tell you why it's wrong and why it won't work.  Again, if you're willing, you can gather nearly all the requirements you need for phase 1 within a one- or two-day session.  This reduces the cycle time to delivery of your EDW environment, and increases the visibility into all the "work-arounds" the business users are currently engaged in to "get the source systems to do what I want."</p>

<p>I've been using this technique for 16+ years, and it hasn't failed me yet.  But again, it requires a strong will, and to move the business rules downstream.  AFTER you've collected the business requirements, then you can build integration processes to take the data from the AS-IS stars into the "business release" star schemas.  Also, by moving the business rules downstream you can meet accountability, auditability and compliance in your EDW.</p>

<p>This is one of the most powerful secrets of the masters available within Integration projects, whether you are executing SOA, ESB, Web Services, or EDW / BI projects, it works, and yes - we teach this in our Data Vault Modeling and Certification course.</p>

<p>I'd love to hear your thoughts and comments.</p>

<p>Thank-you,<br />
Daniel Linstedt<br />
http://www.DanLinstedt.com</p>]]>
</content>
</entry>
<entry>
<title>IT Agility and Responsiveness to the Business</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/08/it_agility_and_1.php" />
<modified>2008-08-23T13:26:17Z</modified>
<issued>2008-08-23T12:53:03Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2367</id>
<created>2008-08-23T12:53:03Z</created>
<summary type="text/plain">Successes in IT using the Data Vault modeling and methodology for implementation, including data marts in about an hour, turning IT into a profit center, and building start-to-finish EDW and star schema deliveries in two weeks.</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>It's not often in our industry you get a chance to read about successes.  Too much press is given to negative types of issues.  This entry is about successful implementations.</p>

<p>Would you like your IT team to build "Data marts in about an Hour"?  How about full EDWs with AS-IS star schemas in 2 weeks (regardless of size of source systems, or number of systems to integrate)?  Would you as a business user like to hear that your new requirements for reporting can be met within a 2 day turnaround?  How about your IT team becoming a profit center for the stake-holder rather than a cost center?</p>

<p>Sound too good to be true?  It's NOT!  Honestly, this is the first time in a long time that I'm excited again to be in IT.  I'm working with several customers in which we have made these things a reality.  This entry is about how we did it, and how you can do it too.</p>]]>
<![CDATA[<p>The more I go, the more I realize that this is the key to success in any IT project.  Of course hind-sight is 20-20 so therefore I should have realized this years and years ago, wait a minute.... I did.  This is why I built the Data Vault Model, combined it with SEI/CMMI Level 5 approach (methodology for implementation) and now it's playing out.</p>

<p>What I've done over the years is combine, refine, and optimize everything from the project plan to the work breakdown structure, to the risk analysis and mitigation strategies.  We've also applied SEI/CMMI Level 5, and combined it with PMP, Six Sigma, TQM, and Lean Initiatives to end up with drastically reduced cycle time, increased quality of the project, and massively leveraged team resources.  We've reduced cost and improved delivery times of new projects 10x.</p>

<p>At specific customers that I've been visiting over the past 2 years, we've seen the Data Vault modeling and Methodology really take hold within customer sites.  We have happier customers, long-term relationships, and the CORPORATIONS and IT Teams are winning together.</p>

<p>Let me explain: these are real case studies.<br />
<strong>Customer A:</strong> Full as-is star schemas and EDW in 2 weeks<br />
Took 198 tables, 4 source systems, combined the models (with some manual effort up-front - about two weeks' worth of work before I came on-site), and within two days produced the following artifacts:<br />
* Staging models<br />
* Data Vault (EDW) Models<br />
* AS-IS Star Schema Models<br />
* Master Data Models<br />
* Exploration Mart Models<br />
* Oracle Stored procedures to Load Source to Stage<br />
* Oracle Stored Procedures to Load Stage to Data Vault<br />
* Oracle Stored Procedures to load Data Vault to AS-IS Star Schema.</p>

<p>At the end of two weeks, we had produced 3 cubes (that incorporate business rules) for the business users to access, see, feel and touch.   We did this all with 3 people (myself, and 2 others) on the team, keeping cost down, delivery high, and quality high.   The customer decided this was such a success that they wanted to make a change while I was on-site.  They fed us a new source system to combine.  We had the integration done and the new system in place within 5 days, again producing 1 new cube.</p>

<p>The business decided that they'd rather use our team for new deliveries instead of building their own analysis and integration projects.  We had successfully stemmed (or at least reduced) the tide of spread-marts.</p>

<p><strong>Company B:</strong>  (Data marts in about an hour)<br />
3 months in to production of the Data Vault EDW, for Manufacturing, we had delivered 5 reporting tables (report collections).  The business users wanted to build new "star schemas".  We created a two page requirements form with a sign-off.  The business users usually took 1 week to "fill in the business requirements", but once they handed it to our team, it only took us about 1 hour to turn it around and have a star schema available (prototype filled with 5000 rows of sample data).  </p>

<p>If they liked it, we would load the full complement overnight.  We did this over 15 years ago, and the Data Vault EDW is still in place today, and still running strong.  We proved that "Data Marts in about an hour" was possible.  Granted, the more complex the business rules, the longer it took to turn around a prototype.  Our longest time to deliver in this situation was 1 week from receiving the requirements to prototype.  The only other stipulation was that we already had the data, and it didn't require integrating a new system.</p>

<p>We had 3 people on this team; we supported 5000 production reports on a daily basis at the end of 3 months.</p>

<p><strong>Company C:</strong>  (Turning IT into a profit center), ANY good EDW should be able to do this!<br />
At this particular company, we started the project 6 months in the hole.  I was brought in to turn the project around. When I got there they had no requirements, no documentation, no tables, no loading systems, and almost no funding.  The business users were fighting with IT over how it should even begin.</p>

<p>Well, long story short - inside of 3 weeks we had business requirements written.  In addition, the stake holder was concerned with the overall cost of the project: he couldn't identify hard ROI, and he worried that the hardware needed to grow the warehouse would continue to be a "money pit."  We continued building the project this way for several months.</p>

<p>After 6 months we reached phase 1 production state and this is where our success begins.  We were a cost-center for our stake holder at this time.  We began receiving phone calls from other business units and other projects, could we build them a mart X, or a Mart Y, or could we provide them with reporting tables Z?</p>

<p>We said: yes, but here's our deal: we'll build the mart for you and we'll give you a 10 day or 30 day trial period.  After which, if you don't like it, or you don't use it, we'll tear it down and take it away.  If you like it/use it, we will begin charging you for disk space, and CPU load cycles to support the hardware necessary to grow your efforts.</p>

<p>We were able to make accurate projections of the hardware required for disk, and the CPU cycles required to load, along with the RAM used.  We also monitored their query usage to see what data and how much data they accessed, and how often.</p>
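
<p>The charge-back arithmetic itself is simple.  Here is a minimal Python sketch; the rates and the usage figures are hypothetical (I am not publishing what we actually charged), and the trial flag mirrors the free 10 or 30 day trial period described above.</p>

```python
from dataclasses import dataclass


@dataclass
class MartUsage:
    """Measured monthly resource usage for one business unit's data mart."""
    disk_gb: float
    load_cpu_hours: float
    query_cpu_hours: float


# Hypothetical internal rates, for illustration only.
RATE_PER_GB = 0.50
RATE_PER_CPU_HOUR = 2.00


def monthly_charge(usage, trial=False):
    """Return the monthly cross-charge; nothing is billed during the trial."""
    if trial:
        return 0.0
    cpu_hours = usage.load_cpu_hours + usage.query_cpu_hours
    return round(usage.disk_gb * RATE_PER_GB + cpu_hours * RATE_PER_CPU_HOUR, 2)


usage = MartUsage(disk_gb=400, load_cpu_hours=30, query_cpu_hours=45)
print(monthly_charge(usage, trial=True))
print(monthly_charge(usage))
```

<p>The hard part is not the formula but the measurement: you need the disk, load and query numbers per mart before anyone will accept the bill.</p>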

<p>What ended up happening was the new business unit would "sign up" for a data mart service, and begin paying our stake-holder for the privilege of "renting" the machine resources.  It got better from there, once they realized that this would work, they began asking for new systems to be incorporated.  We would then begin a project costing and estimating phase where they became the "stake-holder" of that part of the system and data set.</p>

<p>We replicated the business model across the entire enterprise.  We constantly had more projects than we could fill, the business users were happy, and actually able to cross-charge business units for use of their information.  Voilà, our main stake holder said we became a profit center for him.</p>

<p>Goes to show you, if you can run your IT team (no matter how big/small) like a business, you will get more business going forward...</p>

<p>I'd love to hear your success stories, if you'd care to share.</p>

<p>Cheers,<br />
Daniel Linstedt<br />
</p>]]>
</content>
</entry>
<entry>
<title>True Temporal Based RDBMS engines</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/07/true_temporal_b.php" />
<modified>2008-07-18T12:57:18Z</modified>
<issued>2008-07-18T12:22:06Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2262</id>
<created>2008-07-18T12:22:06Z</created>
<summary type="text/plain">When I teach, I frequently discuss temporal based data sets - after all, that&apos;s a big piece of what data warehousing and BI is about - Data Over Time. But when examining the database engines ability to &quot;retrieve&quot; specific data...</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>When I teach, I frequently discuss temporal based data sets - after all, that's a big piece of what data warehousing and BI is about - Data Over Time.  But when examining the database engines' ability to "retrieve" specific data sets as a snapshot in time, it seems there is a problem.  There appears to be no "consistent" manner in which to retrieve these layers for use by the business.  We are left to create physical dimensions and physical fact tables - aggregate our data up to higher levels (to shrink the amount of data) so that joins can execute cleanly and efficiently across information.  So why then, after all these years haven't vendors properly implemented the ANSI-SQL-92 standard of "PERIOD"?</p>

<p>Database Vendors, are you listening?  There is a serious revenue gain to be had by implementing these feature sets...</p>]]>
<![CDATA[<p>First, take a look at the standard defined for ANSI-SQL-92 called "PERIOD"  You can find some of the preliminary work here:</p>

<p><a href="http://www.cs.arizona.edu/~rts/sql3.html">http://www.cs.arizona.edu/~rts/sql3.html</a></p>

<p>And a full research paper here:<br />
<a href="http://www-db.deis.unibo.it/~fgrandi/papers/NGITS95.pdf">http://www-db.deis.unibo.it/~fgrandi/papers/NGITS95.pdf</a></p>

<p>By the way, speaking of temporally based Data Models, the Data Vault model is based on usable concepts (which most EDW systems are, by using start/end, or begin/end, or effective dates). But that's beside the point.</p>

<p>Now according to the first reference I provided, Oracle 9i and 10g have this capability, as do IBM and a few other vendors.  Mr. Snodgrass references slides for Oracle that discuss Oracle's implementation of the temporal components called "Flashback" which once-again (sadly) is for <em><strong>TRANSACTIONAL SYSTEMS ONLY!</strong></em></p>

<p>Why am I so upset?  Well, if database vendors were truly catering to the enterprise data warehouse, they would allow the database architect / data modeler to pre-determine a single field to use as the "PERIOD" field, and end-dates would no longer be needed.  They would then implement these components in such a way as to not require "updates" or "deletes" of the information in order to make it accessible to a time-variant query.</p>

<p>"Flashback" from Oracle is extremely powerful; I'm interested in the way the feature set is implemented.  IBM's Data Propagator Log entries also appear to be extremely powerful.  The problem is again, they are transactional mechanisms that only trigger based on update and delete or DROP TABLE.</p>

<p>These vendors are completely missing the boat in Data Warehousing if they can't bring to the table an ENGINE OPTIMIZED/engine defined temporal notion for enterprise data warehousing models to use.  After all these years, one would think that "temporality of data warehouses" would have been noticed by the engineering staff of database vendors, and that they would have sought out optimizations in the core engine to react to table definitions that are defined by time; and that they would have built a query responsive engine to returning snapshots of data for a specific point in time.</p>

<p>The problem is: we (the implementation specialists and data architects) have had to resort to "work-arounds" for all these years.  Work-arounds are: put your begin/end dates in your table, when a new image arrives, insert it - then update the old one (end-date it), followed by a query that executes a BETWEEN to get what should be an easy AS OF command.</p>
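
<p>For anyone who has not had to build it, the work-around looks roughly like the following.  This is a minimal sketch using Python's built-in sqlite3 module; the table name, the column names, the day-level granularity and the 9999-12-31 "open" end date are all assumptions for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_sat (
        customer_id INTEGER,
        city        TEXT,
        begin_date  TEXT,
        end_date    TEXT
    )
""")

OPEN_END = "9999-12-31"  # assumed convention marking the current row


def load_image(customer_id, city, load_date):
    """Insert the new image, then end-date the previous open row."""
    with conn:
        conn.execute(
            "INSERT INTO customer_sat VALUES (?, ?, ?, ?)",
            (customer_id, city, load_date, OPEN_END),
        )
        # Close the old row on the day before the new image takes effect.
        conn.execute(
            "UPDATE customer_sat SET end_date = date(?, '-1 day') "
            "WHERE customer_id = ? AND end_date = ? AND begin_date != ?",
            (load_date, customer_id, OPEN_END, load_date),
        )


def city_as_of(customer_id, as_of):
    """Emulate AS OF with a BETWEEN over the begin/end dates."""
    row = conn.execute(
        "SELECT city FROM customer_sat "
        "WHERE customer_id = ? AND ? BETWEEN begin_date AND end_date",
        (customer_id, as_of),
    ).fetchone()
    return row[0] if row else None


load_image(1, "Denver", "2008-01-01")
load_image(1, "Boulder", "2008-03-01")

print(city_as_of(1, "2008-02-15"))
print(city_as_of(1, "2008-03-01"))
```

<p>Note how much of this is bookkeeping: an insert, an update, a magic end date and a range predicate, all to answer a question the engine could answer from the insert date alone.</p>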

<p>It's clear that Oracle understands what needs to be done, with this presentation: <a href="http://www.oracle.com/technology/deploy/availability/pdf/40109_Bednar_ppt.pdf">http://www.oracle.com/technology/deploy/availability/pdf/40109_Bednar_ppt.pdf</a>  But it's not clear that they know they should apply this technology to warehousing.</p>

<p>So to summarize: in my view, RDBMS engines (SQLServer, Teradata, IBM DB2 UDB EEE, Oracle, Sybase ASE, and MySQL) are NOT temporally aware when it comes to data warehousing.  The following features should have been implemented in 2004; I hope we can find these features in 2009...  Furthermore, these features should be DEFAULT BEHAVIOUR when operating as a data warehouse.</p>

<p>* AS OF queries with built-in date-time stamping based on insert date/time of the data set</p>

<p>* Automatic column compare - at the optimizer levels, option switch for table definitions that allow some tables to "be run through a delta before inserts occur", followed by option switches on each column that allow "delta on/off" for specific columns.</p>

<p>When a delta is spotted, the insert takes place automatically.  We should no longer be FORCED to execute these comparisons outside the RDBMS engines.</p>

<p>The point here is THE OPTIMIZER EXECUTES THIS AT ENGINE LEVEL</p>
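
<p>Until an engine does this for us, the comparison lives outside the database.  Here is a minimal Python sketch of that external delta check, with a per-column "delta on/off" switch.  The row layout, the column names and the hashing approach are assumptions for illustration; this is the kind of code the engine should make unnecessary.</p>

```python
import hashlib


def delta_hash(row, delta_columns):
    """Hash only the columns that take part in the delta comparison."""
    payload = "|".join(str(row[col]) for col in delta_columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def apply_delta(history, incoming, key, delta_columns):
    """Append an incoming row only when it differs from the latest row for its key.

    history is an insert-only list; nothing is updated or deleted.
    Returns True when a new row was inserted.
    """
    latest = None
    for row in history:
        if row[key] == incoming[key]:
            latest = row
    if latest is not None and (
        delta_hash(latest, delta_columns) == delta_hash(incoming, delta_columns)
    ):
        return False
    history.append(dict(incoming))
    return True


history = []
delta_columns = ["city", "status"]   # "delta on"; load_ts is "delta off"

inserted = [
    apply_delta(history, {"id": 1, "city": "Denver", "status": "A", "load_ts": "t1"},
                "id", delta_columns),
    apply_delta(history, {"id": 1, "city": "Denver", "status": "A", "load_ts": "t2"},
                "id", delta_columns),
    apply_delta(history, {"id": 1, "city": "Boulder", "status": "A", "load_ts": "t3"},
                "id", delta_columns),
]
print(inserted)
```

<p>The second row is skipped because only load_ts changed, and that column is switched off for delta purposes.</p>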

<p>* SQL Delta commands (to be executed at the SQL level) - in other words a SELECT to "show me all rows between X and Y AS OF Z (max date).  OR show me the FIRST and LAST row between X and Y as of Z, or show me the [FIRST or LAST] row as of Z.</p>

<p>* In keeping with the ANSI-SQL Standard, these rows AS OF should be able to be joined together by the same primary key, producing a "geological layer of data" AS OF a specific point in time.</p>

<p>* When a DELETE is issued, the option of "DELTA COMPARE" across remaining time windows should be available to the delete command... so that the engine automatically removes duplicated data (if there).</p>

<p>* When an UPDATE is issued, the query should be given an option: WITH HARD UPDATE, where the default is a "soft-update".  Hard updates execute against the exact row at that point in time.  Soft-updates, actually issue an INSERT at the core-level, producing a new delta for that point in time.</p>
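
<p>To picture the "geological layer" join from the list above, here is a minimal Python sketch over two insert-only histories.  The table contents, the column names and the use of ISO date strings as load timestamps are assumptions for illustration; a temporally aware engine would do this inside the optimizer.</p>

```python
def as_of(history, key, as_of_ts):
    """Return the latest row per key loaded at or before as_of_ts."""
    layer = {}
    for row in sorted(history, key=lambda r: r["load_ts"]):
        if row["load_ts"] > as_of_ts:
            break
        key_value = row[key]
        layer[key_value] = row
    return layer


def join_as_of(left, right, key, as_of_ts):
    """Join two insert-only histories on their shared key as of one point in time."""
    left_layer = as_of(left, key, as_of_ts)
    right_layer = as_of(right, key, as_of_ts)
    return {k: (left_layer[k], right_layer[k])
            for k in left_layer if k in right_layer}


# Hypothetical insert-only histories (no end dates, no updates).
customer_name = [
    {"id": 1, "name": "Acme", "load_ts": "2008-01-01"},
    {"id": 1, "name": "Acme Corp", "load_ts": "2008-06-01"},
]
customer_addr = [
    {"id": 1, "city": "Denver", "load_ts": "2008-02-01"},
    {"id": 1, "city": "Boulder", "load_ts": "2008-07-01"},
]

layer = join_as_of(customer_name, customer_addr, "id", "2008-03-01")
print(layer)
```

<p>Notice there is no end date anywhere: the insert timestamp alone is enough to reconstruct the layer, which is exactly why end-dating should not be our job.</p>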

<p>These queries, and this insert/update/delete behavior should be built-in, automatic execution.  The designers and implementers should NOT have to think about this.  By the way, COLUMN BASED TECHNOLOGY APPLIANCES are in a PERFECT position to execute on this vision TODAY.  Big RDBMS engines are too, but they don't seem to be nimble enough to get it done quickly (in the next 3 months!)</p>

<p>Keep in mind: that the ENGINE CORES should be optimized to make use (high performance, parallelism, partitioning) of the TEMPORAL based logic.  I was hoping (against all odds) that the RDBMS vendors would have seen this years ago, but it just didn't happen (sorry folks).</p>

<p>There are TONS of good articles on search engines: "temporal SQL", or "temporal database" will pull many of the articles around the mathematics of temporal data.  I still wonder why we are left to use a 1992 standard "BETWEEN date_field_1 and date_field_2", and why we are left to compare our own row-sets (outside the core engine), and why we are left to JOIN all of our temporally defined data ourselves (again without core optimizations)...</p>

<p>It's a sad story to me, but the first "engine" to get here will break some serious performance barriers facing both ETL / ELT loading cycles, and SQL queries for warehousing.</p>

<p>Cheers,<br />
Dan Linstedt<br />
DanL@RapidACE.com<br />
</p>]]>
</content>
</entry>
<entry>
<title>Part 7: Secrets of the Masters, Templates for Projects</title>
<link rel="alternate" type="text/html" href="http://www.b-eye-network.com/blogs/linstedt/archives/2008/07/part_7_secrets.php" />
<modified>2008-07-16T22:38:59Z</modified>
<issued>2008-07-16T22:15:30Z</issued>
<id>tag:www.b-eye-network.com,2008:/blogs/linstedt/9.2256</id>
<created>2008-07-16T22:15:30Z</created>
<summary type="text/plain">Any time we get back to secrets, we seem to fall right back to the category of standards, standardization, measurement and enablement. The old saying is: &quot;if you can&apos;t measure it, you can&apos;t monitor it, and if you can&apos;t monitor...</summary>
<author>
<name>Dan Linstedt</name>
<url>http://www.myersholum.com</url>
<email>Daniel.Linstedt@myersholum.com</email>
</author>
<dc:subject>Business Intelligence</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.b-eye-network.com/blogs/linstedt/">
<![CDATA[<p>Any time we get back to secrets, we seem to fall right back to the category of standards, standardization, measurement and enablement.  The old saying is: "if you can't measure it, you can't monitor it, and if you can't monitor it - you don't know when it's broke, or you can't optimize it/fix it."  Something like this anyhow.</p>

<p>The common feedback from the general project implementation community is usually: "Why do I need to standardize?  Why should I document?  Won't it take more time to follow standards than to build rapidly?"</p>]]>
<![CDATA[<p>Well, yes and no.  If you don't standardize (or your team doesn't standardize), then your project usually cannot be repeated successfully.  If the team doesn't standardize, then looking back at "what you did right/wrong" is still good and can be done, but it doesn't provide any sort of "metrics enablement or measurement" abilities: no way to compare what was done against what was estimated, or against what "should" have been done.</p>

<p>Furthermore, documenting the process usually doesn't occur - and when it does happen it's retro-fitted to the existing project just released to production.  This also can cause a herculean effort to "reverse engineer" and understand what was built just to build up the documentation.</p>

<p>One more side-effect of these efforts (JAD/RAD typically) is a hit on flexibility, scalability, and maintainability.  In other words, without standards, the project had better be a "one-off" never to be repeated in the near future.  Reusability is extremely tough in an environment where standards have been tossed to the wind.  IT (usually) ends up losing its agility.</p>

<p>Ok - enough of this, this is all project based stuff.  We learned all this in PMP/PMI, Six Sigma, TQM, and so on...  what about the templates, how are they useful, can a project be successful using them, how can a project proceed without the "standards" being seen as a hindrance?</p>

<p>Well, there's always a slight hindrance for issuing and following defined procedures.  There's always a hindrance to defining standard processes and procedures that are acceptable to the team and the organization.  You just can't get away from that.  So in this entry we will explore enabling tools and libraries of templates that will help you on your way.</p>

<p>ITIL:  on the web at: http://www.itlibrary.org/<br />
Has a plethora of templates, best practices, and standards for projects (including EDW projects).  You need to order the books for these.</p>

<p>http://www.isaca.org/<br />
Also has a large array of standards, templates, implementation paradigms and guidance based on SEI/CMMI Level 5.</p>

<p>Or of course you could seek out the Data Vault methodology and approach which has distilled down the templates specific to enterprise data warehousing, enterprise data integration.  These templates have also been optimized for quick and easy to use build-outs of your projects.  The Data Vault approach (when followed appropriately) helps you instantiate your goals to follow lean-initiatives, business process management, and cycle time reduction.</p>

<p>A few of the different templates that you should have in your project folder include the following:<br />
* Statement of Work<br />
* Service Level Agreement (templated, so you can fit the topic in appropriately)<br />
* Roles and Responsibilities sheet (numbered in accordance with the project plan)<br />
* Organizational Breakdown sheet (numbered in accordance with the project plan)<br />
* Data Breakdown Structure (numbered in accordance with the project plan)<br />
* Project plan (numbered - you guessed it - technically - 1.1.1, 1.2.1, etc...)<br />
* Process Breakdown Structure (numbered in accordance with the project plan)<br />
* Risk Estimation, Mitigation, and responsibilities sheets<br />
* AS-IS and TO-BE data flow documents, and process flow documents<br />
* AS-IS and TO-BE system architecture documents<br />
* Project release plan<br />
* Bug tracking/Enhancement tracking plan</p>

<p>and so on.  There are a number of other documents required to make a project successful, including possibly a letter of intent, Goals and Objectives, Phased approach definitions, and the definition of "Success" criteria for development, test, and production releases, as well as estimated person-hours, the level of experience on the team (according to the Roles and Responsibilities), and a training plan.</p>

<p>A good set of templates, coupled with a solid project approach, can be utilized on any project from 800 person-hours to 50,000 person-hours.  It can be used repeatably, it can be measured as to its effectiveness, and when a specific "template" is left out, the RISK of removing that process from the project plan can be accurately assessed.</p>
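<p>As a rough illustration only, the idea of numbering the other sheets "in accordance with the project plan" can be pictured as a shared key across documents.  The task names, roles, and helper functions below are hypothetical, not part of any published template; the point is simply that a common numbering lets you spot what has been left out.</p>

```python
# Hypothetical project plan: the task numbers are the shared key.
project_plan = {
    "1.1.1": "Identify source systems",
    "1.2.1": "Model staging area",
    "1.2.2": "Build load routines",
}

# Hypothetical Roles and Responsibilities sheet, keyed by the same numbers.
roles = {
    "1.1.1": "Business Analyst",
    "1.2.1": "Data Modeler",
}


def sort_key(number):
    """Order task numbers by their integer parts, so 1.10.1 follows 1.2.1."""
    return tuple(int(part) for part in number.split("."))


def missing_entries(plan, sheet):
    """Return plan task numbers that have no matching entry in the sheet."""
    return sorted((n for n in plan if n not in sheet), key=sort_key)


print(missing_entries(project_plan, roles))
```

<p>Here task 1.2.2 shows up as having no role assigned, which is exactly the kind of gap that is hard to see when each document uses its own numbering.</p>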

<p>To be successful in one's endeavors, one of the final ingredients is the ability and desire to teach the client to fish, rather than implement what you have and walk out with your own methodology....  But then again, no one does that to you, do they?  :)</p>

<p>In the next Secrets entry, we'll get into what one of these numbering systems looks like and why it helps solve the pain in business today.  We'll also address some of the issues plaguing IT and keeping it from being "agile" in the business environment.</p>

<p>As always, comments are welcome.</p>

<p>Hope this helps,<br />
Dan Linstedt<br />
DanL@rapidACE.com<br />
</p>]]>
</content>
</entry>

</feed>