Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.

April 2005 Archives

I usually don't use a title that someone else has used, but I feel that this is a VERY important breakthrough. See this site for the story I'm blogging on: http://www.jefallbright.net/node/2616

In this blog I will go on an exploratory journey into what it would be like to establish computing power at the DNA level. This, again, is conjecture - pure speculation - so it's OK to let your mind wander a bit.

In one of my recent blogs, I predicted (and I'm not the first to say this) that DNA computing appears to be the strongest and most rapidly advancing field in terms of nanotechnology applied to computational ability. Other parts of nanotech are advancing rapidly as well, in other areas such as bioinformatics.

This is quite astounding: man-made, atomic-level structures built to do something specific, walking around in a liquid solution, attaching to molecular tracks and actually "walking" across the strands. All in parallel, all with atomic-level control. Fascinating.

Now let's step off the path for a minute and see what this might lead to in terms of computational power. Let's ask a few questions, and hope that someone in this field (familiar with this technology) will comment for us.

The first question that comes to my mind is: why was the wheel re-invented? In other words, there are enzymes that travel down a single DNA strand and unzip it, other enzymes that travel down the same strand and zip it back together, and additional enzymes that replicate DNA strands and sometimes introduce "changes" to DNA (evolution of DNA). It seems as though these walker molecules are an attempt to re-invent the wheel - or are they?

In this case, I suppose the walkers were built because: a) we still don't fully understand enzymes, b) we can't build our own controllable/programmable enzymes, and c) the walkers can do things the enzymes can't, like be non-intrusive (non-destructive), carry loads, attach and separate themselves, etc.

Now, let's talk about the computational power of the walkers. Let's assume for a minute that the researchers had advanced the walkers enough to carry a load (which they are working on). Can the walkers be programmed to release the load at a specific point? If two walkers "meet", can they join forces and combine their loads when the conditions are favorable? Would this make a single, larger walker that's twice as powerful? OK, let's assume for a minute that we take some of the "self-assembling" nanotech that has advanced in the area of crystalline structures and apply it to the walkers.

Then we'd have to program each walker with a specific set of identifying codes (business rules) that say when the walker can merge with another, when it can't, and when and with what it can combine or mix its loads. The end result is a tiny bit of "self-intelligence" (using the term intelligence very, very loosely here). Now, if we can program and encode walkers to interact, they may actually build or self-assemble something we've never seen before.
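To make that idea a little more concrete, here is a minimal sketch (Python, purely hypothetical - the walker codes, load names, and merge rules are invented for illustration) of how such identifying codes might gate when two walkers combine their loads:

```python
# Hypothetical sketch: rule-encoded "walkers" that merge only when their
# identifying codes say the combination is allowed. All names/rules are invented.

# Each rule maps a pair of load types to the combined load it may form.
MERGE_RULES = {
    frozenset({"A", "B"}): "AB",    # loads A and B may combine into AB
    frozenset({"AB", "C"}): "ABC",  # an AB walker may further absorb a C load
}

class Walker:
    def __init__(self, code, load):
        self.code = code   # identifying code (the "business rule" tag)
        self.load = load   # the cargo this walker carries

    def can_merge(self, other):
        # "Favorable conditions" here are simply: a rule exists for the two loads.
        return frozenset({self.load, other.load}) in MERGE_RULES

    def merge(self, other):
        combined = MERGE_RULES[frozenset({self.load, other.load})]
        return Walker(self.code + "+" + other.code, combined)

if __name__ == "__main__":
    w1, w2, w3 = Walker("w1", "A"), Walker("w2", "B"), Walker("w3", "C")
    if w1.can_merge(w2):
        w12 = w1.merge(w2)             # two walkers join forces
        print(w12.code, w12.load)      # -> w1+w2 AB
        if w12.can_merge(w3):
            print(w12.merge(w3).load)  # -> ABC, a larger "self-assembled" result
```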

They may actually perform "load combinations" or chemical experiments that we can't create in the lab; they may actually create new substances that we haven't seen before - provided the loads themselves can actually be combined to form specific results (that reaction is bound by the laws of chemistry and physics). Remember, this entry is hypothetical.

What would the combined walkers create? How would they structurally "merge" or self-assemble? How much load could combined walkers carry? My cousin, Adam Linstedt, is a microbiology professor at Carnegie Mellon University, and he's suggested to me that when conditions are right, the man-made structures (nanotech) can actually be separated from the combined chemicals. In other words, he said: making conditions favorable to binding allows the "walkers" to bind to their target molecules, and making conditions unfavorable can have the chemicals self-separate from their target molecules.

So what about the power of computing at this level? What can be said about this advancement? I believe that when this gets far enough, the walkers will indeed be encoded and instructed to carry different loads. I also predict that self-assembly of specific kinds of walkers is inevitable and only a matter of time. The self-assembly is an interesting point when it comes to modeling.

What if we could construct models with functions that understand the context of what they contain and what they can combine with? In other words: pair form with function, mimic the nanotech industry - we might discover new modeling approaches and new computational models we hadn't thought of before. Self-assembling processes? Yep.

Just a thought anyhow. Cheers for now.


Posted April 28, 2005 7:49 AM
Permalink | No Comments |

On one of my last blogs, I received an interesting comment. I've requested clarification of the term "redundant synonyms" via email, and am hopeful that I will yet receive a reply. However, I wish to expound a little on the nature of architecture and design, in terms of what I've seen and worked through in the past 13 years in the industry.

In this blog, I will explore the business meaning more deeply - the value of having a clear, consistent, and repeatable design architecture. The methods of applying standards to data modeling are also discussed here.

Welcome back to my fire-side chat. As I recline in my rocker, I invite you to sit back, put on your house-shoes, and relax a little with me. I always welcome your feedback, so please feel free to comment - I'm open to learning new things, particularly when it's an area of interest.

Down to business... Data modeling, information modeling, business modeling - can they, should they, be one and the same? No, probably not. There is certainly a difference between information and data, and there is a visible difference between information and business.

But what happens when these modeling techniques diverge too much from the business? Costs rise, the impact of needed changes rises, inconsistencies in architecture bubble up, band-aids appear, and then IT says: stop - we either need to buy an out-of-the-box solution to replace this mess, or we need to re-write this spaghetti from the ground up. It's as if we started with a single rose bush and ended up with a thicket - and we don't know where the starting point or ending point is to trim it.

By the same token, they cannot be exactly the same - hence the different modeling methods: business process modeling, data modeling, systems architecture modeling, and so on. Wait a minute... Where's information in all of this? Isn't there some standard on information modeling? No, there really isn't, probably because it's the grey area that crosses between data and the business, and its usefulness to the business.

They can, however, be bound together by similar needs and similar architecture, so that when the business needs change, the underlying data model can change without heavy business impacts, without the high cost of maintenance, and without severe divergence. But that's a topic for another day. Let's get on to STANDARDS in data modeling.

For years, people have insisted on telling me that "no two data warehouses can be modeled the same way." I beg to differ: why then do Universal Data Models enjoy such large success? It doesn't stop there: why then does CRM have competing vendors with very similar feature/functionality sets (probably with very different models under the covers)? They bear the high cost and brunt of changes... No wait, it's the customer who decides to upgrade who undergoes brain surgery every time a major upgrade is released...

Is it because the data warehousing or integration industry hasn't been bold enough to step forward and proclaim a data model as a standard starting point? But it's more than that: it's not just the data model that's important - it's the architecture and design of the data model. The architecture provides the guidelines for the infrastructure from which the enterprise builds its vision.

I liken it to this: a two-story house has many, many designs, and as long as the foundation and support beams are in the right place, it can be created very differently. Compare that to a 32-story high-rise office in the city: limited footprint, must rise straight up (usually), must be flexible in high winds at the top, must withstand earthquakes (in California), the glass must be shatterproof (extra weight), and of course the pylons and support structure (infrastructure) must be solid.

I'm no expert in high-rise buildings, but I'm going to assume that the building codes get stricter the larger the structure to be built. Now imagine that after they frame a high-rise, the owner wants to "move" where the elevator shaft is - how possible is this? What if the owner of a two-story house builds an elevator and wants it moved during framing? Compare the costs and the impacts - much different.

But that's where the similarities end. On the architecture side of the house there's a ton of planning, design, and standard, proven architecture (a reusable, redundant, and consistent set of standards) that governs the build of a high-rise for success. In a two-story house there's more leeway; standards still exist, but the job can be "thumbed" up to size. Can you take a bunch of two-story houses, stack them on top of one another, and make a high-rise out of it? Probably not going to work. Can a high-rise hold a bunch of "two-story" housing units? Yes, if it's partitioned that way.

The bottom line is, it's time to converge our modeling efforts; it's time to provide some consistency, repeatability, and standards to our designs and architecture. Whether we're modeling our data, our information, or our business - they should all tie together.

I've seen too many data models where the data modeler points to the top left and says: here we implemented X architecture to get around this problem; down here (bottom right) we built Y because the source system had an issue; and over here (top right) Z is in place because it was built before I got here... and the story goes on. Band-aids to the architecture, because the original architecture and data model no longer meet the needs of the business.

It's best to start with a foundational data modeling approach that lends itself to a repeatable and consistent design when it's being extended to meet new enterprise needs. In this manner, metadata can also be captured through naming conventions and architectural design.
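As a small illustration of that last point, here is a minimal sketch (Python; the prefixes and table names are hypothetical, not a prescribed standard) of how a consistent naming convention lets metadata be harvested directly from the model itself:

```python
# Hypothetical naming convention: a prefix encodes the role of each table.
# The prefixes and table names below are invented for illustration only.
PREFIX_ROLES = {"HUB_": "business key", "LNK_": "relationship", "SAT_": "descriptive history"}

tables = ["HUB_CUSTOMER", "HUB_PRODUCT", "LNK_CUSTOMER_PRODUCT", "SAT_CUSTOMER_ADDRESS"]

def classify(table_name):
    """Derive metadata (role + subject) purely from the naming convention."""
    for prefix, role in PREFIX_ROLES.items():
        if table_name.startswith(prefix):
            subject = table_name[len(prefix):].replace("_", " ").title()
            return {"table": table_name, "role": role, "subject": subject}
    return {"table": table_name, "role": "unknown", "subject": None}

for t in tables:
    print(classify(t))
# A consistent convention means this classification is repeatable for every new
# table added to the model - the metadata "falls out" of the design itself.
```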

More later... Hope this is a useful topic for everyone. Feedback?


Posted April 27, 2005 8:29 PM
Permalink | 4 Comments |

In my last blog in this category, I discussed repeatable architecture and repeatable process to build a solid, foundational enterprise architecture. I hope I did not give the impression that the model elements must be repeated, for that is not the case. Data modeling is definitely a cross-combination of understanding the business need (the practitioner) and the ability to represent the business in a structured format. NOTE: This is a biased blog entry, based on a new data modeling technique called the Data Vault. I'll be talking more about the architecture in coming blogs.

However, I believe that with an integrated, non-aggregated, low-level detail architecture, there is a mechanism by which to achieve a standard data modeling architecture - particularly when it comes to "integrating" different enterprises. Why else would something like Universal Data Models ever have taken off?

In this blog, I explore beyond the simple data model. I have suggested a new revolution in data modeling (available here: www.DanLinstedt.com) which is based on standard, repeatable architecture - an architecture that builds a granular, integrated, and foundational enterprise view of the facts (the data itself).

It doesn't mean that this model should be used for information dissemination. It just means this model should be used to construct an enterprise data warehouse, accessible only to power users and data miners. From that point, we can generate information stores and star schemas (turning data into information) through integration, quality, cleansing, and aggregation.

At the end of this modeling effort, we begin to realize that it's nothing more than a STANDARD set forth on how to build a decent model. With any standard, the next evolution is automation. Well, I've done it. I've built a "Data Modeling Wizard", one that takes in multiple source data models from a number of relational databases and spits out staging areas, Data Vault data models, and the ETL loading code to go with them. The next version of the software will actually produce all the mathematical combinations of "star schemas" that appear to be "useful" to the end user, allow the end user (the IT modeler) to pick the stars to generate, and then cross them with date/time aggregation options.
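I won't reproduce the wizard here, but a heavily simplified sketch of the core idea might look like this (Python; the input structure and derivation rules are illustrative assumptions on my part, not the actual product logic):

```python
# Illustrative only: derive candidate Data Vault structures from source metadata.
# The source model below and the derivation rules are simplified assumptions.
source_tables = {
    "CUSTOMER": {"pk": ["CUSTOMER_ID"], "fks": {}, "attrs": ["NAME", "ADDRESS"]},
    "ORDER":    {"pk": ["ORDER_ID"],    "fks": {"CUSTOMER_ID": "CUSTOMER"},
                 "attrs": ["ORDER_DATE", "TOTAL"]},
}

def derive_data_vault(tables):
    hubs, links, sats = [], [], []
    for name, meta in tables.items():
        hubs.append(f"HUB_{name}({', '.join(meta['pk'])})")         # hub per business key
        if meta["attrs"]:
            sats.append(f"SAT_{name}({', '.join(meta['attrs'])})")  # satellite per descriptive set
        for fk_col, parent in meta["fks"].items():
            links.append(f"LNK_{parent}_{name}({fk_col})")          # link per foreign-key relationship
    return hubs, links, sats

hubs, links, sats = derive_data_vault(source_tables)
print(hubs)   # ['HUB_CUSTOMER(CUSTOMER_ID)', 'HUB_ORDER(ORDER_ID)']
print(links)  # ['LNK_CUSTOMER_ORDER(CUSTOMER_ID)']
print(sats)   # ['SAT_CUSTOMER(NAME, ADDRESS)', 'SAT_ORDER(ORDER_DATE, TOTAL)']
```

The real work, of course, is in the quality of the source primary/foreign keys - which is exactly the limitation noted below.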

In other words, I've automated the process of building a back-end enterprise data warehouse data model. One step closer to the truly "Dynamic Data Warehouse" or Dynamic restructuring of information in near-real-time.

I can now produce a data model that is 60% to 80% of the final result that I want in under 10 minutes (from 11,000 source tables). Of course the software has limitations: it reads only relational source data models, and it relies heavily on primary/foreign keys. Also, the quality of the data model output depends directly on the quality of the data model input. But then again, if I can automate what used to take 3 months down to 1 day, then I can use the rest of the week to manually tweak the data model to my liking.

The point of this blog is not to sell the software (although I am looking for VCs/angels) or its usage, but to point out that there is another revolution coming: automated build-outs of enterprise information stores, and dynamic model changes. For the first time that I can recall, I can play "what-if" games with my architecture before I sink tons of cost and time into it.

After using and generating the Data Vault, a business has the responsibility of turning the DATA into INFORMATION, and of actually writing the correct business rules into the processing engine to accomplish this task. This also requires different modeling techniques like star schemas, and something Dave Wells recently wrote about in Flashpoint (November 2004): master dimensions and master fact tables.


Posted April 27, 2005 9:26 AM
Permalink | No Comments |

Well, well, lookie here - Old MacDonald had a farm, E-I-E-I-O. (Sorry, on a bit of a funny kick today.) What do all these things have in common? Moreover, what problem are they trying to solve? Are some of these technology stacks "sun-setting"?

In this blog we explore some of these garbled acronyms, and no - I won't repeat the farm joke... We'll also take a hard look at some of the existing business issues that are forcing changes in the way we (IT) work. If nothing else, a bit of light reading - you might get a laugh or two out of this... :)

Sometimes I wonder just a bit - why we have so many different mechanisms to solve the same problem. Oh yes, but what exactly is the problem to begin with?

It all boils down to this: moving data from point A to point B.
Yes, it really is that simple!

The options (byproducts of having the data available)? Integrating, changing, recording, merging, matching, and cleansing are all by-products. We have it in-stream, in-transit, en route - now we need to do something with it.

Is ETL dead? Yes, I believe so - in its current form it won't last much longer as a paradigm. It needs to morph if it is to survive, change into something more "becoming" of the integration age we are currently in (which, by the way, is a wave or movement that started over 4 years ago).

ETL = extract, transform, and load
ELT = extract, load to the RDBMS, then transform
EAI = enterprise application integration
EII = enterprise information integration
EIEIO = Old MacDonald had a farm... (sorry, did it again)

As I mentioned in the data modeling blog, the paradigm is shifting: the need to move data from "all sources" into an integrated business model that houses both current and historical views of consistent data is being brought sharply into focus by the business.

So what? That means EL, doesn't it? Yep - from a data movement perspective, extract and load: basically, detecting the existence of new/changed data and doing a delta comparison is all that's needed. Icing on the cake is having a visual, no-code, drag-and-drop development paradigm that handles and manages metadata along the way.
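For what it's worth, the "delta comparison" piece really is that simple at its core. Here is a minimal sketch (Python; the row structure and hashing choice are my own assumptions, not any vendor's implementation) of detecting new and changed rows between a source extract and the target:

```python
import hashlib

def row_hash(row, key_cols):
    """Hash the non-key columns so changes can be detected cheaply."""
    payload = "|".join(str(v) for k, v in sorted(row.items()) if k not in key_cols)
    return hashlib.md5(payload.encode()).hexdigest()

def detect_deltas(source_rows, target_rows, key_cols=("id",)):
    """Return (new, changed) rows - the EL core: no transformation, just comparison."""
    target_index = {tuple(r[k] for k in key_cols): row_hash(r, key_cols) for r in target_rows}
    new, changed = [], []
    for row in source_rows:
        key = tuple(row[k] for k in key_cols)
        if key not in target_index:
            new.append(row)
        elif target_index[key] != row_hash(row, key_cols):
            changed.append(row)
    return new, changed

# Tiny illustrative run with made-up rows:
source = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
target = [{"id": 1, "name": "ACME Corp"}]
print(detect_deltas(source, target))  # row 2 is new, row 1 has changed
```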

Let's talk about the transformation section for a minute - the Big "T". It's a bottleneck; in fact, it's THE bottleneck in most VLDW/VLDB and very large data integration systems. Over the years it has been more efficient to transform the data in-stream, because the RDBMS engines lacked the scalability, and sometimes the functionality, to handle all the complex transformations that are necessary.

Today, all that has changed. RDBMS engines now contain highly complex optimizers, incredible business transformations at the SQL level including (but not limited to) object-level transformation, in-database data mining engines, in-database data quality/cleansing/profiling plug-ins, statistical algorithms, hierarchical and drill-down functionality, and on and on...

Along with the paradigm shift of bringing all the data to a single statement of fact (across the enterprise), the forces of convergence and consolidation are now saying: it's more efficient to perform any type of "transformation" within the bounds of the RDBMS engine itself. After all, the engines have grown up and now offer multi-terabyte solutions; some hundred-plus-terabyte solutions have been around for a long, long time.

If we shift the "T" bottleneck from ETL into ELT, we have a very strong case for scalability - the resulting engine leverages the best-of-breed, latest RDBMS capabilities and takes advantage of every ounce of scalability, parallelism, and load-balancing that the RDBMS can muster.
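To make the EL-then-T idea concrete, here is a minimal sketch (Python, with SQLite standing in for the target RDBMS; table and column names are invented) of loading raw data first and then pushing a set-based transformation down into the database engine instead of transforming row by row in the tool:

```python
import sqlite3

# SQLite stands in for the target RDBMS here; any engine with set-based SQL works.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL, region TEXT)")

# E + L: extract and bulk-load the raw rows with no in-stream transformation.
raw_rows = [(1, 120.0, "east"), (2, 75.5, "west"), (3, 200.0, "east")]
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_rows)

# T: the transformation runs inside the engine as one set-based statement,
# letting the RDBMS optimizer and its parallelism do the heavy lifting.
conn.execute("""
    CREATE TABLE order_summary AS
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM stg_orders
    GROUP BY region
""")

print(conn.execute("SELECT * FROM order_summary ORDER BY region").fetchall())
# -> [('east', 2, 320.0), ('west', 1, 75.5)]
```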

So, ETL is "dead". There, I said it. ETL vendors MUST re-tool towards EL with a focus on "T" - or better yet, why not make the transition EASY and make the tool "ETL-T"? Give the designers the option to convert to "EL-T" where it makes sense. After all, we have sunk costs and development time in intensive ETL routines; let them stand for a while and earn back their keep.

Now what? What about ELT, EAI, and EII?

OK, EAI is a 10+ year old paradigm that focuses on integrating applications. There are vendors like Tibco that "run Wall Street". EAI is going strong as long as there are applications to integrate, but does EAI overlap into the world of EL? Let's first define EAI: every time a change happens in an application "plugged in" to the EAI tool, the tool pushes the change to the message bus, then looks for business rules and other "listeners" that need to be notified of the change.
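In code, that push-to-a-bus-and-notify-listeners pattern boils down to something like this (a generic publish/subscribe sketch in Python; it is not any particular EAI product's API):

```python
from collections import defaultdict

class MessageBus:
    """Toy EAI-style bus: applications publish changes, listeners get notified."""
    def __init__(self):
        self.listeners = defaultdict(list)   # topic -> callbacks

    def subscribe(self, topic, callback):
        self.listeners[topic].append(callback)

    def publish(self, topic, change):
        # The bus pushes every change, whether or not anyone downstream cares -
        # the very behavior questioned later in this entry.
        for callback in self.listeners[topic]:
            callback(change)

bus = MessageBus()
bus.subscribe("customer.updated", lambda c: print("warehouse feed got:", c))
bus.subscribe("customer.updated", lambda c: print("billing app got:", c))

# An application "plugged in" to the bus announces a change:
bus.publish("customer.updated", {"customer_id": 42, "field": "address"})
```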

These business rules can require manual intervention - is this really necessary? Or is EAI just another "band-aid" for overcoming source-system capture problems and integration problems that exist in (for instance) mainframe interchange protocols? I would argue that EAI is more than that, because it focuses on the business processes around the data - it goes above and beyond simple integration and begins to look at HOW the data becomes information, and where/when/why it should be utilized.

I will say this though: EAI as a paradigm is also dead. What? How can I possibly say this? This is blasphemous. In my opinion, EAI is a technology whose time has come and gone - who wants to "push" all this traffic onto centralized buses, especially if there's no one listening? What I mean is, simply pushing data out onto a bus or into a queuing system just because we have a change in the application doesn't mean it's vibrant and desired data that needs to be absorbed downstream. Off-topic: how many times during the day do you hear noises that you "tune out"? What if that noise were never made in the first place?

Besides that, EAI focuses on only ONE aspect of the business: the APPLICATION making the data change. There are many more places where data changes within a corporate environment, and some of them are not application-based (take unstructured data, for instance)...

What if I don't have applications to integrate? What if my picture is bigger than that - say, web services or SOA? OK, EAI vendors must also adapt to meet the needs of SOA, so they have a paradigm shift to undertake if they wish to survive. Even though the paradigm has outlived its usefulness, as long as there are new applications to "install locally" within a company, there will be a need for EAI.

There is a shift afoot: "applications are being out-sourced", say CIO magazine, DMReview, and a couple of other sources. Software and app providers are taking up the SOA provisions, and EAI (like its predecessor EDI, electronic data interchange) will be lost in the fray.

However, if the EAI vendors take heed and re-tool, they have a TREMENDOUS business value proposition already built into the "routing and business flow management" side of the house - so why lose all that investment? They could (if they wanted) take the SOA management of components and integration by storm; there is one such vendor I'm thinking of right now who could do this in a flash...

That leaves EII. EII picks up where ETL and EAI left off; it's a pull-on-demand solution, which I evaluated 3 years ago (privately). I saw the EII paradigm as a niche player, and still do - it has a VERY limited life-span unless it too encompasses some additional technology and re-tools. EII is wonderful for getting data at its source, on demand. It could very well fill some of the needs of an SOA if desired (and in some cases does so very nicely). EII, too, has some nifty capabilities for handling business metadata in its form of meta-modeling. I've never seen such gracefulness in dealing with multiple modeling formats. Once I got over the horrendous learning curve (of one particular tool), it began to make sense.

Some of EII's problems are that it can't handle massive volumes of data and that it performs transformations on the fly (row by row, column by column). The tools that implement EII usually rely on a middle-tier meta-model; some of them have the capability of defining business models and allow business users to actually CHANGE the model without IT intervention - a nifty trick. However, the transformation is again squarely in the way of scalability for these engines, as is write-back capability.

Ouch, write-back? If I set up an EII query to source from a web service, a stock ticker, and a data warehouse, how on earth can write-back be enforced - let alone a two-phase commit? The rules to determine write-back must be extremely complex, and again, volume and complexity are directly juxtaposed to each other and inversely proportional to performance over a given constant of time.

So we're back to square one (whatever that was). I'm a bit disillusioned by the gap between the vendor hype and what is truly delivered, and I haven't even begun to discuss XML, web services, and their transformation ability. I will say this though: there is a LOT of good technology buried in each of these solutions. Now, if each of these vendors could focus on solving some of these problems:

1. Move the transformation logic into the RDBMS engines (they can do it, they really can!). Maybe add some "RDBMS tuning wizards" to the tool set; metrics gathering and collection would be nice...
2. Move the EL logic into loaders and connectors with high throughput and CLEAR network channels (to move big data, fast, and in parallel). Maybe even offer COMPRESSED network traffic as a free add-on? Add CDC on the SOURCE as a free add-on?
3. Leverage the RDBMS's BEST features and plug-ins, like built-in data mining and built-in data quality.
4. Focus their tool set on ease of use (from a business user perspective, developing business process flows).
5. Focus their tool set on "staging data" in an RDBMS of choice, so that write-back becomes a reality (with the caveat that not all sources can be written back to; only our "copy" or snapshot of a stock ticker feed can actually be changed).
6. Focus their tool set on metadata - BUSINESS METADATA - and how it works with the business process flows, along with the management, maintenance, and reporting of that metadata.
7. Focus their tool set on managing, setting up, and maintaining web services and the security around them.

I think that would be a winning paradigm; maybe it would be called:
E-L-A-I-I-T-I-BMM (sounds like a foreign language).
Extract
Load
Application
Information
Transformation (in-RDBMS)
Integrate
Business Metadata Manager
(now breathe deeply)

You never know, it might just show up on EggHead's shelves! (Just kidding - horrible acronym, but it describes the concepts.) Of course, if disk vendors have their way, they will take over the EL portion soon and pair it with the compliance packages.

These are just ramblings on the state of these technology areas. I beg your pardon; this is not meant to be an attack, just a very opinionated blog about how I see this industry evolving. Invest in BMM today! Comments?


Posted April 25, 2005 4:51 PM
Permalink | 11 Comments |

Welcome to part two of this entry. Here we will discuss the impact of business rule changes as they pertain to an SOA, compared to the impact of changes on a data warehouse. We will also begin a discussion (that will continue for a while) on the impact of these changes on the metadata underneath, and on the other systems that an SOA might use, such as EII and EAI.

Again, this blog is open to comments and corrections - I'm always willing to learn new things, and SOA is a new adventure for me too. I'd be honored if an SOA heavyweight would weigh in and help clarify things.

Business rules change every day, and as I've suggested in a previous blog or white paper, the notion that I follow is that business rules constitute "today's version of the truth." But that again is for YABE (yet another blog entry).

In an SOA, as commented on in the previous blog entry, the SOA can be like a restaurant menu - describing choices, prices, and contents: enough for a customer to make an informed decision. Of course, all menus follow a similar paradigm: appetizers, soups/salads or lighter fare, main dishes (chicken, beef, pastas), desserts, and drinks. Sometimes they vary slightly, but if you've read one menu, you can read and understand the rest.

That says a lot about common and acceptable metadata in the restaurant world, and about how restaurants compete AND do business. They're not afraid to publish pricing, but you have to enter their establishment to see these things (sometimes their menus are now listed on the web). On the other hand, what can they change from a business rules perspective? They can change the layout of the restaurant, the nature of service, the number of tables and wait-staff, the number of cooks, the quality of ingredients, the pricing of products, product specials, and so on.

Let's take a look at it from a business intelligence standpoint. We can change the way we report the data, the type of data that's on the report, how the data is aggregated and loaded, and of course - the kicker - what it means to us (the interpretation). In a restaurant, a steak might taste one way to you, and another way to someone else - but there's no right or wrong "taste", it just is. Furthermore, how do you describe a "taste" to someone? It's an experience, nothing more.

Back to business intelligence. Interpreting what is "right and wrong" is up to the executive staff - they need to agree on what the real business rules are and how the data will be interpreted. But as far as businesses and humans go, we all interpret requirements differently, which leads to different implementations of the same business "rule". When these business rules change, the descriptions in the SOA need to change (metadata changes); furthermore, it may be that the operations behind the SOA that retrieve the data change. It may even be that we change WHERE we source the data from.

Today we might get the data from an EII feed, off a source system. Tomorrow the request is: get the driver of the vehicle, but also get all their history. Or maybe the request changes to: summarize their history and tell me whether the driver's current actions are in accordance with their previous actions (is there a pattern of activity here?).

Metadata is the great equalizer. In this case it allows a central point of tracking for impact analysis, discovery, and understanding. If implemented properly, it can support IT, the business, and the end customers using the SOA, all at the same time. IT can use it to determine which processes are changing, pulling, and referencing the data element that is to change or be added. The business can use it as a metrics measuring point and an impact analysis assistant - along with the same use as the customer: to understand and define what these elements, and combinations of elements, mean to the business.
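As a small, hypothetical illustration of that impact-analysis use (Python; the services, elements, and mapping are invented, and a real implementation would sit on a proper metadata repository), the question "which services touch the element I'm about to change?" can be answered directly from captured metadata:

```python
# Hypothetical metadata: which services reference which data elements.
# In practice this mapping would live in a metadata repository, not a dict.
service_metadata = {
    "GetDriverProfile":  {"elements": ["driver_id", "name", "license_status"]},
    "GetDriverHistory":  {"elements": ["driver_id", "incident_date", "incident_type"]},
    "SummarizeActivity": {"elements": ["driver_id", "incident_type", "pattern_score"]},
}

def impact_of_change(changed_element):
    """Return every service that references the element slated to change."""
    return [svc for svc, meta in service_metadata.items()
            if changed_element in meta["elements"]]

# IT asks: if "incident_type" changes, which SOA services are impacted?
print(impact_of_change("incident_type"))
# -> ['GetDriverHistory', 'SummarizeActivity']
```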

As far as implementing a change into an SOA architecture, it's the same process as implementing a change into the data warehouse, the EAI system, or even the source system. The only difference is that the service architecture is usually much larger than the data warehouse alone, and thus requires more resources to build and maintain. This is one reason we are seeing SOA providers spring up on the web (so that SOAs can be used without the overhead of the technology underneath).

From a data warehousing viewpoint, the SOA utilizes the DW as just another data source to meet its needs, thereby increasing value to the enterprise by re-using what has already been built and accepted.


Posted April 25, 2005 11:02 AM
Permalink | No Comments |