Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for Master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

November 2006 Archives

If you're like the rest of the world these days, you've got an ever-growing data set and, at the same time, an ever-shrinking processing window. This is not something you want to treat lightly. In most cases you are also experiencing severe performance problems and either don't know how to deal with them or haven't been able to solve them. Well, there are ways and means by which performance can be improved - I've been teaching and consulting on the performance of VLDW integration systems for 10+ years, and there are proven techniques for improving yours. The catch? You have to be willing to swallow the red pill (from the movie The Matrix). Let's just see how far down the rabbit hole goes...

So you're caught in performance issues, be it ETL, ELT, EII, EAI, or worse: the RDBMS itself. You have big data to move, or maybe lots of data in very short time windows. Maybe you have a huge data set to bounce it against (lookup, match, consolidate, quality-cleanse, etc.). Maybe you're pulling together 200M rows for one feed to a dimension. Your source is 20 to 40 tables, and you've got a window to worry about. How do you handle this situation and others like it? What do you look for?

There are many different facets to examine in performance and tuning of your systems and your architecture. The major facets that I examine are:
* Hardware: RAM, CPU, Network, Disk Speed and number, I/O Throughput
* Software Concepts: Co-location, over the network, block sizing, parallelism, partitioning abilities, caching, data sharing, piping, and data layout.
* Software: ETL, EAI, ELT, EII, and the big one: RDBMS.

Of course, we also consider where the data is coming from - ERP, CRM, BPM, financials, and other applications that lie upstream - and whether those systems deliver data in real time, batch, or both.

If there's one thing I rely on in many of these situations, it's the numbers. The throughput numbers usually tell the story as to where the problems originate, and then - by orders of scale - what we can do about it. The numbers tell me how much and how fast, and they give me the ability to re-factor the architecture where the pain points exist.

In plain English please...
What this means is that I look, for example, at the number of instances of an RDBMS on a single machine, balanced against the number of CPUs, the amount of RAM, the speed of the disks, and the number of disk controllers. "If you ain't got that I/O, you ain't got a thing..." (modified from Duke Ellington - sorry, Duke). Now the problem is that most people see I/O and immediately think "disk" - but that's only one area. I/O stands for input/output; nowhere does it say "disk". I/O represents the balanced speed of all the devices and systems we are trying to measure and improve. When we treat I/O as the one constant measure across every component, the problems become clear very fast.
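To make that concrete, here is a minimal, hypothetical sketch of the arithmetic I'm describing: measure each stage of a load in rows and megabytes per second, and the slowest stage is the effective I/O ceiling of the whole pipeline. The stage names and figures below are invented purely for illustration - plug in your own measurements.

```python
# Hypothetical throughput figures for one nightly load; replace with your own measurements.
stages = {
    "source extract":   {"rows": 200_000_000, "mbytes": 48_000, "seconds": 7_200},
    "network transfer": {"rows": 200_000_000, "mbytes": 48_000, "seconds": 3_600},
    "transform":        {"rows": 200_000_000, "mbytes": 48_000, "seconds": 14_400},
    "target write":     {"rows": 200_000_000, "mbytes": 48_000, "seconds": 10_800},
}

for name, s in stages.items():
    rows_per_sec = s["rows"] / s["seconds"]
    mb_per_sec = s["mbytes"] / s["seconds"]
    print(f"{name:18s} {rows_per_sec:12,.0f} rows/sec {mb_per_sec:10,.1f} MB/sec")

# The effective I/O of the pipeline is bounded by its slowest stage.
bottleneck = min(stages, key=lambda n: stages[n]["rows"] / stages[n]["seconds"])
print("Bottleneck stage:", bottleneck)
```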

I go through more than 100 points in an architecture to pinpoint the top 25 causes of the performance problems in a system (too numerous to mention here), but I'll give you a taste of what I usually look at on the RDBMS side of the house. By the way, the customers who go through my performance and tuning assessment have seen, on average, anywhere between 400% and 4000% performance improvements - taking run times from 48 hours down to 8 hours, from 8 hours to 2 hours, from 6 hours to 45 minutes, from 56 hours to just under 12 hours, and so on. But it means the customers were willing to take my suggestions to heart and implement them.

In the RDBMS world, here are some of the things I measure (a small sketch follows below):
1. Number of instances on a single box
2. Amount of parallelism set up for the engine
3. Number of partitions set up for the largest data sets
4. Data layout of the different tables across disk
5. Number of disk controllers, type of controllers, and database layout across the I/O channels
6. Amount of RAM cache available
7. Percentage of index kept in RAM from call to call
8. Number of read-ahead buffers
9. Size of blocks within the database
10. Cache hit to cache miss ratios
11. Positioning of the log and temp areas
12. Amount of swap space on the machine where the RDBMS resides
13. Amount of RAM the RDBMS is limited to
14. Number of CPUs the RDBMS is limited to

And so on... If you have an interest in a particular area, please post a question or comment, and I'll try to blog on it going forward. In the meantime, please be aware that we offer these assessments, with fantastic results.
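As one illustration of item 10, here is a minimal sketch that pulls the cache hit to cache miss ratio out of a database's statistics views. It assumes a PostgreSQL instance and the psycopg2 driver purely for the sake of example - other engines expose equivalent counters under different names - and the connection string is hypothetical.

```python
import psycopg2  # assumes PostgreSQL; other RDBMS engines expose similar statistics views

# Hypothetical connection string - point this at your own instance.
conn = psycopg2.connect("dbname=warehouse user=dba host=localhost")
cur = conn.cursor()

# blks_hit = blocks found in the buffer cache, blks_read = blocks fetched from disk.
cur.execute("""
    SELECT datname, blks_hit, blks_read
    FROM pg_stat_database
    WHERE blks_hit + blks_read > 0
""")

for datname, hits, reads in cur.fetchall():
    print(f"{datname:20s} cache hit ratio: {hits / (hits + reads):.2%}")

cur.close()
conn.close()
```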

Thank-you,
Dan Linstedt
CTO, Myers-Holum, Inc
http://www.MyersHolum.com


Posted November 28, 2006 6:23 AM
Permalink | 1 Comment |

In this entry, I return to Nanohousing(tm), the notion of utilizing nanotechnology for computing and business intelligence purposes. Remember that these writings are an attempt to go beyond the horizon; they are futuristic guesses about which specific points of nanotech might be applied within the DW/BI world. It will take years to get to these points, but rest assured - changes are happening. One of the areas that has really interested me in nanotech is the notion of DNA computing, that is, using a DNA strand's form and function (combined) to serve specific computational purposes and answer specific questions.

"The hope of this field is that the pattern matching and polymerization processes of DNA chemistry, combined with the enormous numbers of molecules in a pound, will make feasible computations that are now too hard for conventional computers." DNA Computing, http://www.fas.org/irp/agency/dod/jason/dna.pdf

First, I'd like to point out (as I have a few times before) that the notions of form and function are recombined at the DNA computing level. In the BI/DW world of today, we have separated form from function, and it is inhibiting our ability to move forward, not to mention that it is a severe drain on flexibility, scalability, and applicability. Form in our BI/DW world today consists of models: process models, business models, data models, architecture models, network models, and so forth. Function is what these models do with the data / information passing through them.

For instance, data models today hardly resemble the business processes through which the data sets flow - while there have been some advances, like UML and object-oriented modeling, they are still (for the most part) diverged from the true business functions. We strive to make sense of the data and the architectural modeling paradigms by assigning metadata - descriptive context. We are also now headed back toward convergence of business function and "architecture" with master data models and master data sets. Finally, we're beginning to get it - but still, the nature of the RDBMS engine in today's world is to apply common functionality to models designed by external means. They are not tightly coupled.

When we examine DNA computing as a function of nanotechnology, we find it to be a tightly coupled form-and-function process. The "model" in which the data sits - even where the information is encoded within the strand - becomes important. The "function" is built into the type of DNA strand created, in a bio-chemical sense.

"No arithmetical operations are performed, or have been envisioned, in DNA computing. Instead, the potential power of DNA computing lies in the ability to prepare and sort through an exhaustive library of all possible answers to problems of a certain size. ... A single strand of DNA can be abstracted as a string made up of the letters A, C, G, T. ... Complementary strands of DNA will form a doulbe strand (the famous double helix). Two strings are complementary if the second, read backwards is the same as the first, except that A and T are interchanged, and C and G are interchanged."

Now what happens in the BI/DW space if we were to follow this "wet-technology" model? What would happen if we combined form and function the way the DNA computation machine does? Would we see tremendous leaps over traditional computational power? I hypothesize that we would: if we were to simulate DNA computation in a newly designed DNA-type database engine, we would see a number of things happen. But remember, I'm not talking about traditional DNA modeling software running on a traditional CPU / computing engine - no, I'm talking about a machine that currently exists only in bio-tech labs, in the test tubes.

Ok, so what could we do better today that we haven't done in the past, and do it on conventional computing resources?
We can begin converging form and function: start small (with a web service, for example), combine it with security, access rules, metadata, and a definition of groups built from a common set of elements (taxonomies). Cross that with the functionality of a web service and make it available to the world. Self-encapsulated, it might interact (on its own) with other web services - in other words, discovery and decision logic are parts of this web service. It discovers other web services, decides whether an available service has information it can use and whether it has access, then pulls the information in and assimilates it automatically.
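As a purely hypothetical sketch of that behavior on today's technology - every service name, URL, and field below is invented for illustration - such a self-encapsulated service might loop over a registry, inspect each peer's published metadata, and assimilate only what it both understands (shared taxonomy terms) and is authorized to read.

```python
import json
import urllib.request

REGISTRY_URL = "https://example.org/registry"        # hypothetical service registry
MY_TAXONOMY = {"customer", "address", "product"}     # terms this service understands

def fetch_json(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)

def discover_and_assimilate() -> None:
    # Discover peers, then pull in only the data we can interpret and are allowed to read.
    for peer in fetch_json(REGISTRY_URL)["services"]:
        meta = fetch_json(peer["metadata_url"])
        shared_terms = MY_TAXONOMY.intersection(meta.get("terms", []))
        if shared_terms and meta.get("access") == "granted":
            data = fetch_json(peer["data_url"])
            print(f"Assimilating {sorted(shared_terms)} from {peer['name']}: {len(data)} records")
```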

Obviously, the web service is part of an extended neural network, one that is capable of being taught, learning on its own, and being corrected over time. So we still have some incorporation of traditional practices (due to the ultimate abstraction). This is a fundamental difference between the conventional computational world and the DNA computing world. DNA computing uses bio-chemistry to solve its problems and learn new things. Security is built in, as a function of what a DNA strand can and cannot "tie" to, bond with, cut, and merge with - and how it will execute these things.

As a matter of interest to DARPA, here is an interesting look at the applications of nanotech in today's world.

How do you see DNA computing affecting the future of BI / DW?

Cheers,
Dan Linstedt
CTO, Myers-Holum, Inc


Posted November 17, 2006 7:46 AM
Permalink | 1 Comment |

Metadata is an interesting piece; many corporations and individuals fight over the true meaning of metadata and the context to which it applies. This entry is a thought experiment that explores the question of context: deriving context and resolving contextual fights within an organization as they relate to enterprise metadata. I believe everyone can have a metadata sit-in and maybe finally work this thing out. Note: this is a tiny bit of light reading...

Why should I even have knickers? What are knickers anyway? And why would they be twisted? Well, if you've never visited England, I suggest you do so. It's a beautiful country - anyhow, knickers have multiple definitions depending on the time of reference and who's doing the referencing. For most of us who speak or understand English today, the term usually refers to undergarments worn around the waist.

Ok, so what's changed?
The TYPE of garment that knickers used to refer to, versus what they refer to today. This is an example of a time-sensitive, contextual piece of metadata.

According to Webster's Dictionary:

knick·ers [nik-erz]
–noun (used with a plural verb)
1. Also, knick·er·bock·ers [nik-er-bok-erz]. Loose-fitting short trousers gathered in at the knees.
2. Chiefly British. a. a bloomers-like undergarment worn by women. b. panties.
3. British Informal. a woman's or girl's short-legged underpants.
–Idiom
4. to get one's knickers in a twist, British Slang. to get flustered or agitated: Don't get your knickers in a twist every time the telephone rings.
[Origin: 1880-85; shortened form of knickerbockers, pl. of knickerbocker, special use of Knickerbocker]

Now notice something interesting: at the end of the definition, it doesn't even agree with itself - they've twisted their own knickers and said "see the word KNICKERBOCKER." Let's see what KNICKERBOCKER has to say:

Knick·er·bock·er [nik-er-bok-er] –noun 1. a descendant of the Dutch settlers of New York. 2. any New Yorker. [Origin: 1800-10, American; generalized from Diedrich Knickerbocker, fictitious author of Washington Irving's History of New York]

Which, not surprisingly, has NOTHING to do with knickers in the first place. Look at definition #1 in the first quote and definition #1 in the second quote - they DON'T MATCH!!! Yet they originate from close to the same time period. Ok, so we studied the root of the word; that alone is not so interesting...

But it gives rise to a contextual problem (one that we have throughout our enterprises today): we can't decide how to define our own terms, and furthermore, the metadata (the definitions and contextual understanding) 1) changes over time, and 2) changes based on the individual or line of business.

Our enterprise metadata (master metadata) needs to be set forth, and it needs to be built from an enterprise (top-down) view. That's not to say that we can't all have our cake / definitions and eat them too! We can, and we should. The best way to describe this type of effort is to look at existing semantic mapping technology, the Semantic Web, or semantic integration. Normally these things are done by hand, and if you choose to go that route, I would highly suggest an investment in a tool that can track, develop, and visualize taxonomies and ontologies of words.

In order to make this work you might need:
* A clear taxonomy - defined at the different work breakdown structures (WBS)
* A clear taxonomy - defined at the different organizational breakdown structures (OBS)
* A clear ontology to manage the taxonomies; cross the WBS with the OBS for big success.
* Clear version control - each piece of metadata MUST be versioned and tracked to the CHANGE REQUEST that triggered it within the business processes. Yes, (sigh) this too is tied to BPM and SEI/CMMI Level 4.

Yes, I'm suggesting metadata at CMMI Level 4, quantitatively tracked (see the sketch below). Quality scores could be included, but they are subjective to the individual scoring the metadata.
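As a rough illustration of what "versioned and tracked to the change request" might look like in practice - every field name and value here is invented, not a product or a standard - each metadata definition could carry its version, its effective date, the line-of-business context it applies to, its place in the taxonomy, and the change request that triggered it:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class MetadataVersion:
    term: str                          # the business term being defined
    definition: str                    # the agreed meaning of the term
    lob: str                           # line of business (OBS context) this definition applies to
    version: int                       # incremented with every approved change
    effective_from: date               # when this definition took effect
    change_request: str                # the CHANGE REQUEST that triggered this version
    parent_term: Optional[str] = None  # position in the taxonomy tree

history = [
    MetadataVersion("knickers", "loose-fitting short trousers gathered at the knees",
                    "Apparel Design", 1, date(1995, 1, 1), "CR-0001"),
    MetadataVersion("knickers", "women's undergarment",
                    "UK Retail", 2, date(2006, 11, 7), "CR-0142",
                    parent_term="undergarments"),
]

# The current definition for a given line of business is simply its latest version.
current = max((m for m in history if m.lob == "UK Retail"), key=lambda m: m.version)
print(f"{current.term} = {current.definition} (v{current.version}, per {current.change_request})")
```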

Now on to your knicker problem, uhhh I mean - the Knickers Twisting problem... I mean - don't wear tight pants and then exercise if you don't want your knickers in a twist... Ok - I digress (sorry).

In all honesty, knickers are _not_ knickerbockers, although the word may have been derived from the original term. Knickers at an enterprise level may be accepted by a pants-manufacturing corporation such as Levi Strauss as the definition of PANTS or UNDERPANTS... but which is it?

In the real world of metadata, this needs to be resolved by the executive team; they need to be the ones to define PRIMARY metadata. Using taxonomy trees, secondary and tertiary metadata can be defined based on LOB (lines of business) and work breakdown structures (roles and responsibilities, or uses of the metadata) - as long as the metadata is tied to the CURRENT VIEW of the organization and what the data set represents. That way, when data is delivered to the enterprise, the metadata goes with it, and the organization can drill up, down, and across the metadata meanings (provided they have the proper security).

Unfortunately, I do not know of any single tool that can accomplish this today. There is a set of open-source tools that manage semantic meaning, a set of other tools that manage taxonomies, and another set of tools that manage version control / document management, security, and so on. Metadata tool-set vendors are still in their infancy; hopefully someone will rise to the challenge - and hopefully I have not put your knickers in a twist!

We can help you sort out the metadata mess and establish a contextual, enterprise-based metadata system that will save you time and money. This is a serious issue and must be solved before the enterprise gives rise to an SOA initiative, or before the enterprise claims to have completed one.

As always, I'd love to hear from you - your thoughts, comments, poetry, haiku, and tall tales are all welcome.

Thanks,
Dan Linstedt
CTO, Myers-Holum, Inc
http://www.MyersHolum.com


Posted November 7, 2006 7:23 AM
Permalink | 1 Comment |

In this entry we'll dive a little further into the pros and cons of master data as a service (MDaaS). We'll bring to light the different kinds of master data and how it will evolve in the marketplace into a service-oriented architecture, housed off-site (generically). MDaaS follows the standard curve of new ideas: individual creation (decentralization), then centralization, and then commodity-based master data. I think the firm that undertakes master data as a commodity will be a hot property in the near future.

First, I'd like to discuss the definition of master data (which I've done in other blogs). From a 30,000-ft perspective, master data is operational, quality-cleansed, singular in nature, and descriptive about a business key - it is, in fact, an operational data store for the enterprise (with a few of the rules twisted). By the way, come see me at TDWI in Orlando next week - I'm teaching on master data (how to implement it within your enterprise).

Master data should not contain:
* Parent-child relationships (other than recursive hierarchies to itself).
* Degenerate dimensional information
* Junk
* Data that is unrelated or weakly related to the business key.
* Multi-part business keys that represent relationships in the business world.

Master data structures should contain (see the sketch after this list):
* The business key, the whole business key and nothing but the business key.
* In addition to the business key, all descriptive data ABOUT the business key (to give the business key its CURRENT CONTEXT)
* A 1-to-1 relationship between a generated surrogate number and the business key.
* Load date, create date, last updated date, original record source, updated record source
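Here is a minimal sketch of what such a structure might look like. The entity (a customer master) and every field name are illustrative assumptions, not a prescribed layout:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CustomerMaster:
    # Surrogate key: 1-to-1 with the business key, generated by the master data store.
    surrogate_id: int
    # The business key, the whole business key, and nothing but the business key.
    customer_number: str
    # Descriptive attributes ABOUT the business key (its current context only).
    legal_name: str
    segment: str
    country_code: str
    # Standard system columns.
    load_date: datetime
    create_date: datetime
    last_updated_date: datetime
    original_record_source: str
    updated_record_source: str

row = CustomerMaster(
    surrogate_id=1001,
    customer_number="CUST-000042",
    legal_name="Acme Widgets Ltd.",
    segment="Manufacturing",
    country_code="US",
    load_date=datetime(2006, 11, 3, 5, 0),
    create_date=datetime(2006, 11, 3, 5, 0),
    last_updated_date=datetime(2006, 11, 3, 5, 0),
    original_record_source="CRM",
    updated_record_source="CRM",
)
print(row.customer_number, row.legal_name)
```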

Basic rules:
* Master data can exist (as a historical record) within the warehouse.
* Master data in the ODS is always updated in place
* Master data can be built from a historical record in the warehouse (if done properly)
* Master data is NOT a materialized view within the warehouse
* Master data is usually stored in a separate data store for performance reasons. It is tuned to be operational in nature
* Each element or attribute within the master data tables is defined by master metadata (enterprise metadata, plus ontologies for further context).
* Master data is hooked to 24x7x365 services layers for bi-directional data streams (updates in, pushed update notifications out to subscribers of that service).
* Master data sets are cleansed prior to load into the ODS; this data is partially auditable as a system of record (once established, and once it is used to update source systems). However, the caveat is that the cleansing and quality routines MUST provide an auditable, traceable record of what happened to the master data on the way in. These audit logs MUST be reversible (see the sketch after this list).
* Master data updates are reversible.
* Master data is a single copy within the enterprise, hence the term MASTER. If it is copied locally across geographic regions, each local copy of the MD is read-only and force-fed (is a subscriber to) all updates.
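To illustrate the reversibility requirement, here is a small hypothetical sketch - not tied to any particular tool - in which every cleansing action records the field, the before and after values, and the rule applied, so the original value can always be restored:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CleansingAction:
    record_key: str   # business key of the master record touched
    field: str        # attribute that was changed
    before: str       # value as received from the source
    after: str        # value after the quality rule ran
    rule: str         # which cleansing rule made the change

audit_log: List[CleansingAction] = []

def cleanse_country(record: dict, key: str) -> None:
    # Example rule: standardize country names to ISO-style codes.
    mapping = {"United States": "US", "U.S.A.": "US"}
    raw = record["country"]
    if raw in mapping:
        audit_log.append(CleansingAction(key, "country", raw, mapping[raw], "country_to_iso"))
        record["country"] = mapping[raw]

def reverse(record: dict, key: str) -> None:
    # Reversibility: replay the log backwards and restore the original values.
    for action in reversed([a for a in audit_log if a.record_key == key]):
        record[action.field] = action.before

customer = {"country": "U.S.A."}
cleanse_country(customer, "CUST-000042")
print(customer)   # {'country': 'US'}
reverse(customer, "CUST-000042")
print(customer)   # {'country': 'U.S.A.'}
```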

Now, MDaaS requires that master data be housed off-site, on hosting services, in a remote database, connected through metadata and service layers. MDaaS can be specific to each client (much the way SalesForce.com houses each of its customers' sales data).

MDaaS attributes (a minimal client sketch follows the list):
* Must be off-site
* Must be accompanied by discovery services
* Must be accessible through web services
* Must be secured through authentication
* Must be encrypted when traveling over the WAN
* Must be accompanied by Master Metadata (Enterprise Metadata)
* Must allow discovery services to query metadata.
* Must be updatable through services
* Must have minimal latency even though it's over a WAN
* Must have constant quality engines running to cleanse the data on the way in.
* Must be accessible via web-browser user interface in order for the business to monitor and manually adjust master data.
* One stream of changes (the old record prior to a new update) must be pushed out to an EDW subscriber for recording purposes.
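As a rough sketch of what "accessible through web services, secured through authentication, encrypted over the WAN" could look like from the subscriber's side - the endpoint, token, and payload shape are all invented for illustration - a client might simply call the hosted service over HTTPS with a bearer token and read back the master records along with their accompanying metadata:

```python
import json
import urllib.request

# Hypothetical hosted MDaaS endpoint and credential - neither is a real service.
MDAAS_URL = "https://mdaas.example.com/v1/customers"
API_TOKEN = "replace-with-your-token"

def fetch_master_records(url: str = MDAAS_URL) -> dict:
    # HTTPS provides encryption over the WAN; the bearer token provides authentication.
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {API_TOKEN}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)

payload = fetch_master_records()
print("master metadata:", payload["metadata"])   # enterprise metadata shipped with the data
for record in payload["records"]:
    print(record["customer_number"], record["legal_name"])
```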

MDaaS must NOT:
* be locked away within an ERP or CRM system, unless that is the ONLY source system the enterprise is using.
* be down at any time - down-time will kill SLAs and the operations of a company.

Some interesting items: there are some general master data sets that can and should be available to paying subscribers as shared data sets. These include:
* Postal Information
* SIC Codes
* Public records, like patents, locations of buildings, maps, geo-spatial information, public financial calendars and so on, some (regulated) tax / levy data.
* Government registries of registered businesses, and their corresponding names

Any data currently reported to the public and available on the web should be turned into MDaaS - and in some cases already has been.

Types of Master Data Entities might include:
* Portfolio Master Lists
* Invoice Master Lists
* Location Master Lists
* Address Master Lists
* Accounts Master Lists
* Employee Master Lists
* Customer Data Hubs
* Product Master Lists
* Service Master Lists
* Supplier Master Lists
* Manufacturer Master Lists
* Parts Master Lists

Some of these are protected, encrypted, and gated behind authentication for access; some are not.

At long last, what are the pros and cons of MDaaS?
Pros:
* Centralized Master Data can improve global quality of information
* Off-site master data can reduce the cost for each customer wanting to get into the fray.
* Cycle time to attain Master Data for your enterprise is reduced as more vendors offer MDaaS (rapid build out)
* Standardized Metadata is hashed out for Master Data Sets that are shared. For instance, a zip code is a zip code is a zip code - no matter where in the world you live.
* It's already a proven technology (some companies, e.g. Acxiom, are providing customer master lists with addresses in this light).
* Low risk for implementation success

Cons:
* Could cost a lot of money to ensure 9x9 uptime in a global environment.
* A breach of security at your MD hosting provider may become an uphill ethical (and legal) battle across local governments.
* Round-trip time over the WAN for master data updates may be outside the desired or acceptable time-frame.
* A company hosting your master data may use it (without your knowledge) to help other companies achieve standardized master data.
* A question of "who owns the master data?" comes into play - contract negotiations should mitigate this.
* Requires your business to have metadata already defined for the master data sets, so that (basic) context can be established when surfing the available MD services.
* Requires your business to be services-enabled - you don't need to be at the SOA level (yet), but you need to have web services in play and operational within your organization. An SOA initiative already under way will help.

Do you have anything to add to this entry? Please share it. I'd love to hear your thoughts. Again, come see me next Friday at TDWI for Master Data Implementation.

Cheers,
Dan Linstedt
CTO, Myers-Holum, Inc
http://www.MyersHolum.com


Posted November 3, 2006 5:35 AM
Permalink | 4 Comments |