

Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best, Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

May 2005 Archives

Business should understand how decreasing cycle time, improving quality, and straightening out business processes all lead to increased profitability. Business should also understand that profitability is directly tied to traceability and accountability, both in the business itself and in the data the business deals with. In these entries we explore the connected notions of cycle time, quality (data and business process), business accountability, and success.

In math, what is the shortest distance between two points?
A straight line.

Can anyone tell me what the shortest distance between Customer Contact and Delivery of goods or services is?
Again, a straight line - through the business, that is.

What does the straight line represent?
Profitability. The basic formula: costs and overhead increase, and customer satisfaction decreases, as internal business processes become more complex or require more manual intervention than necessary.

Machines do a wonderful job of tracking data, massive amounts of it; humans do a wonderful job of turning that data into information and making it useful for organizations. Somewhere in the mix, however, the real "business" that earns profit is lost in translation when the machines are given complex tasks and dirty data. Information quality and location, along with business accountability/complexity, are two key factors in measuring profitability.

The straight line in business should run from first point of customer contact through all the business processes to delivery of the final goods or services. But I'm sure you already know this.

For instance, most extremely large manufacturing businesses have a cycle as follows:
Sales->Contracts->Finance->Planning->Manufacturing->Quality->Delivery.

Each step in the cycle is represented by a business unit. Each business unit typically owns its own "data" and operational systems, and each typically uses its own "customer key" to represent a customer throughout the life-cycle. Furthermore, there are many major and minor processes in each of these business units that alter and change the customer data. Finally, as the hand-off of the customer account occurs (from one business unit to the next), the customer account numbers frequently change.

What I'm saying is:
1. Most businesses do not have a straight line through their business processes, causing confusion, increased overhead costs, and delays in delivery downstream.
2. Most businesses change the data within their business units, creating "kingdoms and fiefdoms" of data ownership when, in fact, all that does is hurt and hinder the overall business effort.
3. Most business units in this situation don't talk to one another, believing their "version of the truth" is absolute and correct, even if the other units' financials are "off" by a little bit.

The bottom line for this series (the theme) is to answer: how does this affect my profitability? You may have heard of this approach in the '80s as Lean Initiatives or Cycle Time Reduction; these days it's called BAM (business activity monitoring) or BPM (business process management). However, these particular concepts roll up into something bigger: TBM (Total Business Management), which includes activities, processes, data, quality of data, accountability, profitability, overhead costs, and so on.

As this series progresses, we will discuss examples of problems and possible solutions. For now, if you care to sound off about what you see in your organization, that would be wonderful. In the meantime, I've been asked to talk about the data modeling and architecture sides of this house at the IQ conference in Houston, TX (September). Hope to see you there.


Posted May 25, 2005 7:23 AM

I can't decide if this fits under nanotech or if it fits here, but I'll put it in this category and focus on the business side of the house. The winds are blowing outside today as I sit here, anxious for a return call. I've contacted a few individuals at a university that is currently studying structural mining techniques, and I will hopefully be discussing some of their progress soon.

In this entry, we will explore the brave new world of what I like to call Dynamic Data Warehousing. I'm not referring to dynamic data sets; I'm referring to dynamic structuring and restructuring of the information systems as a whole.

What is Dynamic Data Warehousing?
I am defining the term as follows:
The ability of a system to:
1. Interrogate arriving information at run-time.
2. Discern new "structure" from old "structure".
3. Separate the new structure, and build or attach new structural elements to the existing structure.
4. Mine existing structural elements for unseen relationships.
5. Follow a series of "alert" patterns to notify operators that new nodes or elements have been added and need to be checked.

In a business sense, the simplistic definition is: the ability to add and/or change the structure of information on the fly based on "content analysis". The adaptation of the structure happens in near-real time, and it will result in learning things we didn't know before. It basically changes the data model under the covers by using neural net techniques and structural analysis ideas.

Why would I want Dynamic Data Warehousing?
Well, for one, it's convergence (see my nanotech articles here for the series on convergence) of both form and function. Why do we want to converge the two? The electronic computing world is already far behind other sciences and advancements; it's time to UPDATE. It almost feels like we're stuck in the '70s. OK, here's a reference to bio-informatics that talks about what nature does with DNA and convergence of form and function: IEEE Magazine.

What is structural mining and why would you want it?
We've been mining our data sets for years - why not our architecture? What kind of insights would we find from profiling and mining the architecture itself? We might find holes in the source system processing, we might find better ways to re-organize the data underneath (make more sense out of it), and we might find relationships between structures that today have no links built between them.

Structural mining, or structural analysis, is the ability to find out what's right and wrong with the architecture - the ability to discover new and different methods for storing, retrieving, and hooking data up. Structural mining is a key component of Dynamic Data Warehousing (it could be Dynamic Data Integration too) and of the ability to change structure on the fly.

Imagine this: you build a web service to accept incoming transactions from a provider. Today, it carries name and address. Tomorrow you ink a deal for the provider to also send city, state, and zip, and the new fields show up on the feed that night. Let's say that IT "hasn't gotten around to changing the structure" yet, and you have a structural analysis engine (SAE) applied to the service. No sweat: the new fields arrive in the context of the customer record, the SAE doesn't see any harm in automatically adding the fields to the data model, and the load proceeds.

This is a level 3 change (on a scale of 1 to 3: 1 means manual intervention is needed before the change, 2 means a warning that a change occurred with 60% to 80% confidence that it works, and 3 means no problem - context determined with a 90% or better confidence rating, change applied).
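To make that scale concrete, here is a minimal Python sketch of how such an engine might grade an arriving change. The field names, thresholds, and confidence heuristic are hypothetical illustrations of the 1-2-3 levels above, not any particular product's behavior.

# Minimal sketch of how an SAE might classify an arriving structural change.
# All names, thresholds, and the confidence heuristic are hypothetical.
from dataclasses import dataclass

KNOWN_FIELDS = {"customer_id", "name", "address"}   # structure the warehouse knows today

@dataclass
class StructuralChange:
    new_fields: set
    confidence: float        # 0.0 - 1.0: how sure we are of the business context
    level: int               # 1 = manual intervention, 2 = warn, 3 = auto-apply

def classify(record: dict, context_confidence: float) -> StructuralChange:
    """Compare an arriving record against the known structure and grade the change."""
    new_fields = set(record) - KNOWN_FIELDS
    if not new_fields:
        return StructuralChange(set(), 1.0, 3)       # nothing new, load as usual
    if context_confidence >= 0.90:
        level = 3                                    # auto-add columns, proceed with load
    elif context_confidence >= 0.60:
        level = 2                                    # load, but raise a warning alert
    else:
        level = 1                                    # hold for manual intervention
    return StructuralChange(new_fields, context_confidence, level)

# The city/state/zip feed from the example: new fields arrive in customer context.
change = classify(
    {"customer_id": 42, "name": "Acme", "address": "1 Main St",
     "city": "Boulder", "state": "CO", "zip": "80301"},
    context_confidence=0.93,
)
print(change.level, sorted(change.new_fields))       # 3 ['city', 'state', 'zip']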

From time to time (as with all neural nets) we'd have to correct the neural model that the SAE has built, but for the time being it becomes a central part of the glue for building a Dynamic Data Warehouse (or Dynamic Data Integration store).

On the flip side, it would mean learning some lessons about fraud detection and teaching those to the SAE as well, so that it can spot potentially fraudulent data trying to get into the system - a gatekeeper of sorts.

I believe that Dynamic Data Warehousing, or Dynamic Data Information Stores, are the next level of integration; however, getting there requires a data modeling technique that can be altered without losing existing information or corrupting existing structural integrity.

What might be the ROI on something like this?
Well, that's anyone's guess. But I would hazard a guess that if "cleaning up the data sets" can garner a 200% ROI or more, then cleaning up the architecture it lives in could be a 4x to 10x multiplier (pure speculation on my part).

Thoughts? Would love to hear your comments on this.

References:
Enterprise GIS Architecture, DDW
Dynamic View Alteration
Comments on Axiom Software DDW
Cross-Linkages with quite a few White Papers listed
Data Warehouse Configuration
Real Time Road Mapping
Percipio Tool for Dynamic Data Warehousing
ENTER GOOGLE SEARCH TERM: "Dynamic Data Warehouse"


Posted May 6, 2005 2:10 PM

There's a lot of talk in the industry today about VLDW/VLDB (very large data warehouses and very large databases) and how too much data might not be such a good thing. I take a different view on this subject. In this blog I hope to explore the following questions: What is VLDW/VLDB? What are some of the problems with it? What kinds of ROI multipliers might I find in a big data set?

I've recently had discussions with a major credit card processor, and as a result will share with you some of the common issues that they face daily.

VLDW/VLDB is defined to be big data - but does that mean we have a 1TB, 10TB, or 100TB data store just sitting there? No. If the data is sitting there and is not used for business purposes, then by all means it shouldn't be stored on-line (due to cost) - or the business may not be looking at its information hard enough, or with the right questions, to use all the data.

Something to think about: data mining has become a viable way of providing analytics, trend analysis, and forecasting above and beyond traditional statistics. In other words, companies with an extreme competitive advantage are using data mining to discover things about their business that they didn't previously know, or to predict future outcomes with a confidence rating that enables business decisions that make sense.

Having big data and using it are two different things. If you use 80% or more of your big data sets, then you have a VLDB or a VLDW. The base definition of big data means different things to different people: someone sitting at 500MB might think "big" is 2TB; someone at 2TB might think "big" is 8TB or 10TB, and so on. Instead of trying to define big data, I'll discuss the different levels of change that happen within terabyte-sized data sets.

Ranges:
500MB - 2TB
2TB - 5TB
5TB - 10TB
10TB - 50TB
50TB - 100TB
100TB - 200TB
200TB - 1PB+

The ranges are a rough guide. Things change within each range: data models, disk layouts, CPU-to-disk ratios, network speeds, node sizes, large SMP boxes vs. small MPP vs. clusters, queries, indexing, constraints, and so on. In other words, what works at 2TB doesn't work at 5-6TB, what works at 6TB won't work at 20TB, and so on. Of course, there are some hardware vendors out there who provide so much horsepower that these ranges don't apply, and as they progress and "data warehousing appliances" become more commonplace, they will handle most of these issues for us under the covers. But for now, assuming we are on existing systems, this is something to think about.

What are some of the problems with VLDB/VLDW?
When systems reach "live" data usage of 20TB to 100+TB, they experience everything from physical performance breakdowns to server crashes. The problems we have with 500GB of data seem small and are easy to overcome, but every minute problem becomes a very large problem when the data set grows above about 15TB.

List of potential problems: (assuming large SMP boxes)
* Data modeling breakdown: queries across joined tables no longer work at all, no matter how much RDBMS parallelism you throw at them.
* Indexing breakdown: no matter what you ask of the optimizer, there's just no improvement in query times, even with partitioning of the underlying data set.
* Backup and restore no longer fit within the desired time frames, and in some cases it is nearly impossible to back up and/or restore entire data sets. Disaster recovery is HIGH RISK!
* Traditional data-over-disk layouts STOP working completely, and in fact become a negative performance factor.
* Replication systems choke on bottlenecked I/O and networks.
* Maintaining distributed data centers becomes a 6-month project just to architect how it's going to work, and then there are negotiations for 24x7x365 bandwidth to keep the data flowing.
* At about 50TB, the cost of maintenance, machines, cooling systems, power grids, and IT resources begins to increase by a multiplier of 5x.
* Beyond about 50TB, there are no "canned standards" that companies can follow for a successful VLDB/VLDW.

As far as mitigation strategies go, relying on experts - those who have built and architected systems at these sizes - is paramount. Architecture is everything in these systems. Without long-term architecture and forward thinking, systems hit growing pains at around 20TB to 48TB, and then the company must either call an all-engines-stop and rebuild from the ground up (very costly) or migrate to a new platform (which can also be very costly).

Denormalization is one mitigation strategy that will help, but only in certain cases. Remember that denormalizing data sets will instantly double or triple the storage requirements. Here's a fallacy for you: storage is cheap. NOT SO at big data levels. If you buy cheap storage, you get poor performance or a lack of parallelism. Furthermore, the more performance you want to drive out of a VLDB/VLDW, the more storage you may actually need.
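As a quick back-of-envelope sketch of that storage math - the starting size, denormalization factor, and parallelism overhead below are purely assumed numbers, not measurements:

# Back-of-envelope arithmetic for the "storage is cheap" fallacy.
# The base size, denormalization factor, and copy count are assumptions for illustration.
base_tb = 10.0               # normalized data set
denorm_factor = 2.5          # denormalization typically doubles or triples the footprint
parallel_copies = 2          # e.g. mirroring / striping overhead to keep I/O parallel

raw_needed = base_tb * denorm_factor * parallel_copies
print(f"Usable data: {base_tb} TB -> raw storage to provision: {raw_needed} TB")
# Usable data: 10.0 TB -> raw storage to provision: 50.0 TB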

So what about the data sets? Why can't we/shouldn't we reduce them?
I agree with the experts when they say that too much unused data is a bad thing. I disagree with the experts when they say that too much bad or poor-quality data is a bad thing.

There are two basic types of information in a VLDB/VLDW:
1. Good data (aggregated, cleansed, merged)
2. Transactional data (auditable/compliant/traceable)

The business users are divided into multiple groups: 80%-90% use the good data, or moderately good data ("good" data is open to the end user's interpretation), and 10%-20% require transactional details.

In the good data set, there's no reason to keep "old" or unwanted/unused data around; it should be removed or placed on a rolling usage cycle. The transactional data set (transactions with history), however, is at the lowest possible grain, and there the more data the better - especially if the business is mining the data set and/or has audit requirements or federal/international mandates that state it must be kept on-line.

Data mining loves big data: the more data it can mine, the better its predictions and confidence ratings. The less granular detail it can mine, the worse its predictions are - you might as well go back to aggregates and standard statistics. In this case, the credit card processing company also has SLAs with its vendors, along with the need to detect fraudulent activity - they MUST (and do) use a data mining tool on the transactional historical data.

With all these headaches, why build a VLDW? Why not just go back to old-style analytics backed by aggregations, averages, and statistics? Won't that save cost?
Yes, it will save on cost - but here's what you can gain if you build one. This company has 6 months of transactional history on-line, dynamically accessible by end users. That equates to about 120TB. They are seeing at least a 5x multiplier on ROI (compared with the costs of maintaining and supporting it). They mentioned that if they could keep 12 months of data on-line, they would do it in a heartbeat, and their multiplier would go up to 15x or higher.

The reason? They are missing enough data to significantly impact their decision-making capabilities, especially with the data mining engine. In this game, the business must spend a little to gain a lot - especially if they know what questions to ask and have a firm grasp on how the answers will make them more effective and more competitive.
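Purely as a hedged back-of-envelope on those figures: only the terabyte counts and the 5x/15x multipliers come from the discussion above; the annual carrying cost is an assumed placeholder for illustration.

# Back-of-envelope on the credit card processor's numbers from this post.
# Annual cost is hypothetical; the TB figures and multipliers are the ones quoted above.
online_tb_6mo = 120                     # 6 months of transactional history on-line
annual_cost_6mo = 1_000_000             # assumed carrying cost, for illustration only

roi_6mo = 5 * annual_cost_6mo           # stated ~5x multiplier on what it costs to keep
roi_12mo = 15 * (annual_cost_6mo * 2)   # ~240 TB, roughly double the cost, ~15x multiplier

print(f"6 months ({online_tb_6mo} TB): return ~ {roi_6mo:,}")
print(f"12 months (~{online_tb_6mo * 2} TB): return ~ {roi_12mo:,}")
# The point: doubling the carrying cost more than sextuples the modeled return.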

There's more, a lot more - I discuss the details in my class, along with mitigation strategies. I'd love to meet you at TDWI in DC (May 19, 2005) should you wish to drop by. See you next time.


Posted May 5, 2005 4:34 AM

That's right, Terminator, as in T2 eyeballs. Well, not really that advanced (yet). I just read in the May issue of Scientific American about nanomorphing silicon implants that take the place of damaged light-recognition cells in the back of the eye, basically allowing a blind person to "see" images and outlines. They admit the resolution isn't that hot yet, but it will advance like everything else.

This article will explore Form and Function, and discuss the nature of adaptable neural models, and what it means to build a system that could potentially mimic the human brain.

According to the article, the brain can operate at 10 billion synapse firings per second. Who's Synapse? What's a synapse? And why does it fire? For answers to those questions and more, see your local brain surgeon. (Just kidding.)

Here's the poor man's definition: imagine for a minute a series of interlinked spider webs. Got the picture? OK. Now imagine a spider at the center of each web. Each center of a web represents what's called a neuron. Each strand of the web spanning outward, let's call that a synapse. Where one web attaches to another, let's call that a dendrite (receptor).

A spider catches prey by, first, having a sticky web and, second, feeling the vibrations caused on the web when something gets stuck there. Now imagine the neuron (the center of the web) building up a charge and sending that charge down one or more synapses (all at once). Once the charge gets high enough, it fires across the inhibitors to the dendrite receptors on the other side - in other words, it is capable of shaking another spider web with a directed charge.

Now imagine 15 layers of these webs, each interconnected with the others, and each layer responsible for a "part" of the coverage. The interconnectivity can provide a feedback loop to build up a charge, or to "morph" the neural structure and learn things - or in this case, focus on what's important, like edges and highlights.
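For the curious, here is a toy Python sketch of that threshold-and-feedback idea. The weights, inputs, and the 0.8 threshold are made up purely for illustration and have nothing to do with the actual implant.

# A toy threshold "neuron" in the spirit of the spider-web analogy above.
def fire(inputs, weights, threshold=0.8):
    """Accumulate charge from the incoming strands; fire only once it crosses the threshold."""
    charge = sum(i * w for i, w in zip(inputs, weights))
    return 1.0 if charge >= threshold else 0.0

def layer(inputs, weight_rows, feedback=None, gain=0.2):
    """One layer of webs; optional feedback from another layer nudges the charge up."""
    if feedback is not None:
        inputs = [x + gain * f for x, f in zip(inputs, feedback)]
    return [fire(inputs, row) for row in weight_rows]

vibration = [0.4, 0.9, 0.1]                            # stimulus landing on the first web
w1 = [[0.5, 0.8, 0.1], [0.2, 0.3, 0.9], [0.9, 0.1, 0.2]]
w2 = [[1.0, 0.2, 0.1], [0.1, 1.0, 0.4], [0.3, 0.3, 0.9]]

l1 = layer(vibration, w1)                              # first layer: only one neuron fires
l2 = layer(l1, w2, feedback=l1)                        # second layer, fed back from the first
print(l1, l2)                                          # [1.0, 0.0, 0.0] [1.0, 0.0, 0.0]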

Nanomorphing is changing the hardware layers to suit the needs of the situation, rather than changing the software layers. The nanotech part of this allows different chemical bonds to be "favored" and "unfavored" depending on the electrical current and stimulation, thus changing the configuration at "run-time".

This is an example of just how important it is to bind form and function closely together: the more specific and targeted the functions are, the more compact and efficient they can be. The more tightly the form is bound to that function, the more adaptable the form can be - thus more resilient, and quicker to respond or adapt to its environment. Also, surprisingly, the more standard, fault-tolerant, and redundant the architecture gets - which, by the way, leads to adapted efficiencies during run-time.

This eye piece (according to the article) is made up of transistors modeled in a neural net fashion, built with nanotechnology components and layered five layers thick. Each layer provides feedback loops to the last, allowing a charge to build up in a specific area and "fire" a nerve ending in the back of the eye to the brain, resulting in a perceived image.

Note to self: Where's the ACTIVE feedback loop in our Data Warehouses? Are we still in the cave-man stage here?

Sorry about that... Moving on. You think this stuff is too far out? Hasn't happened yet? Too difficult to build? Think again: there's a company "in my back yard" in Boulder, CO called Genobyte... Check them out: http://www.genobyte.com/ They are already building adaptable hardware and, quite surprisingly, have been doing so since 1997.

Anyhow, my point (which seems to take so long to get to) is this: CONVERGENCE IS EVERYTHING. When it comes to nanotechnology and nanohousing (the nano data warehousing of the future), we will be forced to combine form and function in order to build adaptable systems with virtually unlimited scalability.

If we can build a system of nanomorphing hardware, and compensating software with encapsulated dynamic feedback loops, we may have the beginning of something interesting.

Would love to hear your comments and thoughts or questions.

Cheers,
Dan L


Posted May 3, 2005 4:24 PM

The market is shifting. Vendors are packing more and more features and functionality into their devices, and they are also making those devices smaller and smaller. What does the future data warehouse look like? Can it be an appliance-like device? What kinds of partnerships or acquisitions can we expect? Why would we choose an appliance DW over our own component selections?

In this blog I look into the future, just to see if we can answer these questions. I believe there are changes coming, long overdue changes.

In the land of yesterday, we would go in search of "best-of-breed" software, pair it up with best-of-breed hardware, size it appropriately, install it all, and integrate it ourselves (within IT). I believe all that is changing. If it hasn't already, it certainly will shortly.

New vendors on the market are offering hardware coupled with built-in RDBMSs. This is just the start, and as good a start as it is, it still has a little way to go. Let's talk for a minute. What if you could walk out and buy an ADW appliance (active data warehouse) - self-configured to perform optimally on the machine, embedded within the BIOS, with encapsulated storage and a black-box interface? Would you do it? Especially at a cheaper cost than buying from RDBMS vendor 1 and hardware vendor 2?

So what does the future device look like?
It should contain not only the RDBMS but also ELT software. This software should be embedded in the machine for the fastest performance, along with optimized disk routines and mechanized load balancing. The ELT software should have two types of inputs: a flat-file loading process, and a real-time network plug that reads JMS queues after configuration. The ETLT should be fully self-contained on its own processor slot so that it doesn't interfere with the RDBMS operating in parallel, at high speed, on the disk.

There should also be a BI (reporting tool) card built in. It should have its own IP connections and reside on its own processor slot as well. The tool and box configuration should all be browser-based; administration could all be fat-client, I suppose, but why? Why not make it all web/app server? It's separated from the RDBMS and ETLT engine slots, again so that it can run in parallel - although the BI tool and the ETLT tool should be based on a common metadata framework.

Now, depending on the number of nodes purchased, hooking them together through a third pre-configured IP allows them to load-balance across a high-speed backbone. Again, the nodes have nothing to do with each other except distributing the workload.
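To pull the wish list together, here's a hypothetical "spec sheet" for the appliance described above, sketched as plain Python data. Every component and setting is an assumption drawn from this post's description, not any vendor's actual product.

# A hypothetical appliance spec, written as plain data; all values are assumptions.
appliance = {
    "rdbms":    {"slot": 0, "storage": "encapsulated", "tuning": "factory"},
    "etlt":     {"slot": 1, "inputs": ["flat_file", "jms_queue"], "load_balancing": "mechanized"},
    "bi":       {"slot": 2, "interface": "browser", "ip": "dedicated"},
    "cluster":  {"interconnect": "high_speed_backbone", "distribution": "workload_only"},
    "metadata": {"shared_by": ["etlt", "bi"]},
}

def validate(spec: dict) -> bool:
    """Check the one hard rule from the post: each engine runs on its own processor slot."""
    slots = [c["slot"] for c in spec.values() if "slot" in c]
    return len(slots) == len(set(slots))

print(validate(appliance))   # True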

What kind of partnerships or acquisitions can we expect?
I think in the future you'll see storage vendors partner more heavily with RDBMS vendors, who are already working on "blade servers" - some vendors have a pre-packaged solution there, and other vendors are coming up to speed. I also think you'll see a larger effort to integrate BI and ETLT software onto hardware platforms. It's getting cheaper to architect and build hardware - and most of the time we need the extra performance boost. Even if the BI application card is running on a dual CPU at 450 MHz, it's the RDBMS that needs the power.

That's all fine and dandy, but where's the value proposition?
The value comes as follows: automatic updates to the software and firmware over the web; little to no configuration needed (it all comes pre-installed and factory tuned); no fancy load-balancing or parallelism software needed to gain performance; no messy dual environments for upgrades; no multi-cost purchases; plug-and-play scalability; and speed and performance built into the hardware/firmware.

I think you may see compliance vendors entering this game too, they already are partnering with storage vendors for appliance based storage.

What makes this work and why?
The RDBMS engine on the appliance must be extremely fast and extremely scalable. It must focus on bulk-applying data sets, or images of data that are time-stamped. It must have high-quality, high-end compression, data quality, and delta-processing capabilities available. As we consolidate our resources, the environment becomes easier to manage, upgrade, and replace. In this environment you're paying for the engineering to be "done": out of the box, into the rack, load your data and away you go. All data is denormalized inside the box, so compression ratios are very, very high, storage needs are low, and performance is super fast - we don't have to worry about data modeling or indexing any more (the end goal).

There are a number of companies to watch out there who are moving in these directions. It won't be long before they can meet all these needs with one appliance.

Of course it wouldn't hurt for these companies to consider a metadata appliance either, or possibly incorporate that directly into the warehouse appliance.

Just a few random thoughts, See you next time.


Posted May 3, 2005 4:15 AM
