Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.

May 2005 Archives

As I discussed in my first articles, nanotechnology is not only here to stay; it has made it into the R&D labs of some of the hottest integrated circuit manufacturers, and now it is on the front page of a mass-circulation publication. Carbon nanotubes and carbon nanowires are being used to help "cool" and shrink the silicon processor environment.

Watch this space soon, I will begin a journey into what I view as the creation of a super-DNA computer.

Carbon Nanotube based computing devices

Nanotech continues to move at astounding speeds; just keeping up with it will be challenging, to say the least. More to come shortly.


Posted May 31, 2005 3:59 PM

So you're curious are you? Have I grabbed your attention yet, or is this not making any sense? In thinking about accountability, the data supply chain, and change requests there is one key component to making this all happen. That is: to show the bad data with the good, maybe not in the same reports, but to physically separate the bad data from the good depending on the severity of rules breakages.

There's one more thing to think about: if we accept Data Supply Chain as a paradigm, then we should be keeping the business key on the data unique, and the same - once assigned, always assigned. The concept goes back to RFIDs and the manufacturing supply chain.

RFIDs are used to clean up data and provide visibility into the manufacturing supply chain. They are creating accountability in business and providing the means and mechanisms with which to improve and measure supply chains around the world. If RFIDs (which are nothing more than unique identifier keys) can do this for our manufacturing supply chain, imagine what a CONSISTENT business key can do for our DATA SUPPLY CHAIN!

It can become the RFID for the Data within our systems. This means businesses MUST abide by the following rules:
1. A business key, once assigned, must be mechanical.
2. A business key, once assigned, must be unique.
3. A business key, once assigned, must never be re-used.
4. The application (CRM, ERP, HR, etc.) that assigns DUIDS (Data Unique Identifiers) will provide incredible metrics, visibility, and consistency for improving the Data Supply Chain. They can begin producing the ultimate application.
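As a rough illustration of rules 1 through 3, here is a minimal Python sketch of a key registry. Everything in it is hypothetical: the class name `BusinessKeyRegistry`, the `DUID-` prefix, and the reading of "mechanical" as "generated by the system rather than typed in by a person" are my assumptions for the example. It also uses a plain sequence counter, which is a simplification; a real scheme would need more than that.

```python
import itertools


class BusinessKeyRegistry:
    """Hypothetical in-memory registry that hands out business keys (DUIDS)."""

    def __init__(self, prefix="DUID"):
        self._prefix = prefix
        # Rule 1 (mechanical): the system, not a person, picks the next value.
        self._sequence = itertools.count(1)
        # Every key ever issued stays here forever, even after retirement.
        self._issued = {}
        self._retired = set()

    def assign(self, entity):
        """Issue a brand-new key for an entity (rules 2 and 3)."""
        key = f"{self._prefix}-{next(self._sequence):08d}"
        # The counter only moves forward, so a key can never be handed out twice.
        self._issued[key] = entity
        return key

    def retire(self, key):
        """Mark a key as no longer active without freeing it for re-use."""
        if key not in self._issued:
            raise KeyError(f"unknown business key: {key}")
        self._retired.add(key)

    def is_active(self, key):
        return key in self._issued and key not in self._retired


registry = BusinessKeyRegistry()
first = registry.assign("John Smith")
registry.retire(first)
second = registry.assign("Jane Doe")  # gets a fresh key, never the retired one
```

The point of keeping retired keys in `_issued` is that retirement changes a key's status, not its ownership: the identifier still means the same thing it always meant.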

I'm talking about more than just simple sequence numbers. We need an international numbering board. What if (just a thought) the RFIDs could have the same exact number or ID as the Data Supply Chain? Of course that would mean tagging the very smallest of parts in all of our assembly lines. Service companies would never have RFIDs to speak of, except maybe on invoices or paper contracts that are printed.

Hmmm, does this mean data is trackable when printed to hard-copy? You bet! Imagine a filing room filled with RFIDs - how easy would it be to track down a document? Maybe the US Patent Office or the Library of Congress could undertake something like this, saving billions of dollars a year (Yea! Less Taxes?)

Next step: Printers that stamp RFIDs on documents according to document numbers that come from DUIDS. So you get the point: data identifiers (business keys) are just as important as any hard-coded identifier tags we put on products. Curve ball: if we can place a value on the products we produce, and we can begin uniquely identifying our data and its elements, then there's no reason why we can't place a value on our data as well.

Back to the real world, since today we have no commonly accepted notion of truly unique identifiers, business keys will have to suffice. So what's the problem with today's business intelligence reports and systems?

The problem is, by the time the business user sees the integrated data set, every effort has been made to adjust, clean, alter, move, remove, and merge data to make it usable by the business. This is fine until we begin to question what is meant by "One version of the truth." As I've stated in several other entries, TRUTH is in the eye of the beholder and is subjective. It has NOTHING to do with the FACTS of the way the data is captured, stored, and moved around the organization.

While there is value to producing "usable information", we (as BI implementors) have long overlooked the fact that there's also value in producing the unusable facts - the raw data that is messed up, wrong, unmatched. However, everything starts with the analysis of the business keys. I propose that there are really two answers, both right at the same time, ahhh - a Conundrum? Yup.

I propose that along with DUIDS, we should be storing a single statement of FACT in our warehouses, then moving the FACTS into polarized/colorized versions of the truth in Data Marts. This means two basic principles apply:

1. Business rules move to the "output" side of the warehouse, between the warehouse and the marts.
2. Raw data that breaks business rules ends up in one or more ERROR MARTS.
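The two principles above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the rule names, the row layout, and the function `split_for_marts` are all made up for the example. The raw rows stand in for the warehouse's statement of FACT and are never modified; the rules run only on the way out.

```python
# Hypothetical business rules applied on the "output" side of the warehouse.
# Each rule is a (name, check) pair; check returns True when the row passes.
RULES = [
    ("address_present", lambda row: bool(row.get("address"))),
    (
        "amount_non_negative",
        lambda row: isinstance(row.get("amount"), (int, float))
        and row["amount"] >= 0,
    ),
]


def split_for_marts(raw_rows, rules=RULES):
    """Route copies of raw warehouse rows to a data mart or an error mart.

    The raw rows themselves are left untouched.
    """
    data_mart, error_mart = [], []
    for row in raw_rows:
        broken = [name for name, check in rules if not check(row)]
        if broken:
            # Keep the bad row, and record which rules it broke.
            error_mart.append({**row, "broken_rules": broken})
        else:
            data_mart.append(dict(row))
    return data_mart, error_mart


raw = [
    {"key": "SLS123", "address": "1 Main St", "amount": 100.0},
    {"key": "SLS124", "address": "", "amount": 50.0},
    {"key": "SLS125", "address": "2 Oak Ave", "amount": -5},
]
good_rows, rejected_rows = split_for_marts(raw)
```

Recording the names of the broken rules alongside each rejected row is what makes the error mart useful for accountability: it says not just that a row was rejected, but why.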

Physical separation of the data is absolutely necessary to begin pushing the accountability back into the business, to begin the IQ cycle and the business process clean up, to begin providing true visibility into ALL data that exists in the source systems, to begin showing the FULL level of rejects in our data supply chain.

Manufacturing supply chains don't throw away "bad parts", they put them in reject bins, record them, try to figure out why they went bad, and then try to improve them so they don't make the same mistake again (because it costs them money, time, and competitive advantage). Why shouldn't we treat our data this way? Why do so many implementation specialists INSIST on cleansing, mixing, merging, and constantly fine-tuning the "truth" so that these errors are hidden or disposed of?

By actually separating the bad data into "reject bins" at the lowest level of grain, before it is cleansed, mixed, merged, etc., we can really begin to take inventory of our source systems and the business processes - we can finally see where our businesses are HEMORRHAGING money, time and competitive advantage.

In our next entry, we'll walk through an example of how this worked at a real customer site. IT'S TIME for OUR DATA SUPPLY CHAIN to step up and begin working for us.

Comments?


Posted May 31, 2005 7:00 AM

We're here, dirty data, complex business processes, inconsistent integration points - sounds like what an EDW/ADW is supposed to help solve right? Parts of it anyhow are solved by EDW/ADW, other parts must be solved by accountability of end-users, still other parts must be solved through SOI (service oriented integration, under the SOA stamp).

We've established rule #1: in a sea of data throughout our enterprises, the single most important data point is the business key - the one and only reference across the company that means something to the business, and allows the business direct access to the data set they are after.

Are we ready for rule number two? Not quite yet. Let's explore dirty data further. Not to change track, but Information Quality is extremely important. It's not just about the data itself, but it's about the people, the business processes, the metadata, and the metrics and measurement all used to ensure continuous business improvements.

Dirty data, and broken business processes can make a company "bleed money." And that's just the START! Data Models that help increase accountability from end-users, and systems architectures that help raise the visibility of business process problems help stop the bleeding, and can save millions of dollars a year if done right. But to understand these statements, we must walk through just how the systems got this way.

So let's take the case of the broken business and customer SLS123: we just lost $30M to our big competitor because we took 5 weeks to respond, and our competitor took 3 weeks to respond. Please note, just because they responded quicker doesn't necessarily mean that the quality of their product is better - it just means they stream-lined a portion of their sales, finance, and contracts communications. Now if they deliver faster, with higher quality - then they've truly got us beat, and we will go out of business if we don't do something to correct the situation (keep up).

By the way, this is what ERP systems attempt to address, and sometimes do a good job of it, but obviously they leave a little bit to be desired (due to high levels of customization), hence the usage of additional tool sets like EAI, to move the customer into CRM systems and through even more complex business processes.

After examining our business process here's what we find:
Sally takes the first contact call
Sally assigns SLS123 to the customer record
Sally pre-qualifies and fills out some basic information, in the course of which she accidentally enters the wrong address, or uses special characters to represent information that she can't store in the source system.
But because Sally wants the bonus for this customer, and doesn't want her sales counterpart Joe to get the bonus, she uses her own special characters that only she understands and can interpret to management.
Sally then hands the account off to Finance, and sends an email to Jim, whom she also works closely with because she has a good business relationship with him.
Jim in Finance pulls up the customer record by name; an auto-synchronization routine in the source systems moved the record from Sales to Finance last night and changed the account number from SLS123 to FIN456.
Jim then walks through a series of checkpoints on the application, and has to call Sally to understand her encoding of the special characters (over time, Jim begins to understand it, but doesn't annotate any of the metadata).
Jim then changes parts of the application, sends the FIN456 customer to management for approval/disapproval.
Financial Management then approves the customer FIN456, calls Jim and says - pick up the customer, it's ready.
Jim then says, good to go - marks the record for upload to Contracts.
That night the synchronization system moves the record to contracts, and promptly changes the customer number again to CONT259.

And the cycle goes on, the complexity increases, the touch-points increase. When we look at this particular scenario we discover that there are critical touch points and manual approval mechanisms that must be in place, we also discover interesting auto-synchronization mechanisms hidden in our legacy systems, or even in our re-engineering of the legacy into ERP and CRM.

We finally discover that there are unnecessary processes that the data goes through which neither improve the quality nor speed the process up. These are the business processes we wish to eliminate to stop the bleeding. Now there's the data set. One customer: John Smith, 3 account numbers - SLS123, FIN456, CONT259. Can the business trace John Smith at an enterprise level? Not very effectively. Does the business have deep visibility into their data supply chain? No.
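To make the traceability problem concrete, here is a small Python sketch. It assumes something the scenario above does not have: that every key change was recorded as a (from, to) pair. The `KEY_CHANGES` list and both function names are hypothetical, built from the SLS123 / FIN456 / CONT259 example.

```python
# Hypothetical log of recorded key changes: (old_key, new_key, where it happened).
KEY_CHANGES = [
    ("SLS123", "FIN456", "sales-to-finance sync"),
    ("FIN456", "CONT259", "finance-to-contracts sync"),
]


def key_chain(start_key, changes=KEY_CHANGES):
    """Follow recorded changes forward from a key and return the full chain."""
    next_key = {old: new for old, new, _ in changes}
    chain = [start_key]
    while chain[-1] in next_key:
        following = next_key[chain[-1]]
        if following in chain:
            raise ValueError(f"cycle in key changes at {following}")
        chain.append(following)
    return chain


def original_key(any_key, changes=KEY_CHANGES):
    """Walk recorded changes backward to the key first assigned to the customer."""
    previous_key = {new: old for old, new, _ in changes}
    visited = {any_key}
    while any_key in previous_key:
        any_key = previous_key[any_key]
        if any_key in visited:
            raise ValueError(f"cycle in key changes at {any_key}")
        visited.add(any_key)
    return any_key


chain = key_chain("SLS123")
breaks = len(chain) - 1  # each link in the chain is one key change
```

Even with a perfect change log, every lookup has to walk the chain. If the key never changed, none of this code would need to exist.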

Business Rule #2 for effective profitability:
Once a key is assigned to a data point, it MUST NOT CHANGE.

Not in a box, not with a fox, not here nor there, not anywhere (Dr. Seuss) - the business key must stay as a consistent representation of the data point from this point forward.

Business Rule #3:
If the key changes, you can be certain that there is a break in the business process at that point, and that you are bleeding money.

Business Rule #4:
If the key changes, you can be certain that there is a flavor of ownership of data (kingdoms, fiefdoms) within your organization, and that there are parts of the organization who are guaranteed to produce different financial results - every time, and nearly on purpose (embedded in the culture of that business unit to say the other units are "wrong" in their view of the customer).

Business Rule #5:
Use of abstract character annotations to mean certain things in metadata format is usually an indication that the incentive from corporate is misplaced. It also means that the business users cannot be held accountable for poor audits, nor are they incented to improve the data quality, even though the data itself is "broken", as is the capture system.

As we continue down this track, we will discuss how an integrated data store (ADW/EDW) can help pinpoint some of these problems from a metrics driven perspective - but only if the right models are in place. We will also begin showing how to help business users become more accountable in their positions - and actually begin to issue change requests, and allocate dollars to fixing the source capture systems, thus stopping the hemorrhaging of the company, while making it more nimble and stream-lined.

Thoughts? Shout out, enter your comments below... I would love to hear from you.


Posted May 27, 2005 4:57 AM

I'm happy to see that Doug Laney has joined us here in the blog space. Not to take anything away from his valuable services, but I also would like to say that we are offering a free ETL score-carding mechanism. This is a very short entry to show you where the score-card lives. The downloadable score-card is free and empty, and has no vendors.

The free ETL/ELT scorecard is downloadable at: www.MyersHolum.com

My thoughts on the ETL/ELT scorecard are as follows: ETL and ELT are two different utilities and really shouldn't be compared except in areas of metadata, GUI development (no-code environment during development), flexibility, connectivity. Unfortunately comparing ETL to ELT in the transformation areas is unfair, but necessary. It is important to evaluate which transformations are provided to you by the RDBMS vendors, and which you have to add to the RDBMS (UDF - User Defined Functions) yourself.

However, the true nature of this scorecard looks at sourcing, targeting, metadata, transformations, market stability, cost, number of outside consulting firms, cost of available consulting knowledge, and a few other key metrics. Please feel free to download, and post your comments, questions, remarks or improvements here.

Thanks,
Dan Linstedt
CTO, Myers-Holum, Inc.
daniel.Linstedt@myersHolum.com


Posted May 27, 2005 4:29 AM

Ok, now that we've introduced the concept, let's walk through some examples of complex business processes and dirty data. Let's find out just what we can do about starting to solve some of these problems. Furthermore, let's explore the real issue of "broken" business processes. Do you have some of these in your organization?

So profitability is tied to complexity of business processes coupled with dirty data coupled with too much manual intervention. What exactly does this look like?

Here's an example:
Suppose a customer calls Sales, and says: I would like product X with the following configurations: CA, CB, CC. Sales begins tracking the customer, captures some information (hopefully not fat-fingered) about the customer and their contact point, along with the product and configuration. The customer is then assigned an account number: SLS123.

The customer wants to know approximately when this will be built and shipped, or if there are ways for them to track the product through its build cycle. The business says: well, we can only track it once it's shipped to you, and we can't estimate its cost or its build time until we have designed the custom parts. Customer says: fair enough, when will you have a design complete? Sales says: can we get back to you in a week?

Ok - Sales has the customer contact, and they qualify the lead through a number of manual intervention processes before passing it off to Finance. Finance takes SLS123 and changes the account number to FIN123. Now I ask you, is there any traceability in this simple example across Sales and Finance at a corporate level? No, not unless someone in Finance or Sales records the customer account number change (from/to).

Finance runs it through its paces, approves financial lending, and then passes it off to Contracts, who run it through a series of complex business processes with manual intervention. By the way, Contracts changes the account number from FIN123 to CON456. The customer finally gets a call 3 weeks later stating they have a contract for the customer to sign. But before they can give a delivery date they need Planning to run the manufacturing phase through their systems, so off it goes.

Another two weeks and Planning returns to Contracts to provide an estimated build plan and date. We're already 5 weeks from initial contact, and by the way the customer has put the same bid in to our competitors. Three weeks ago, our competitor returned the bid and build ETA to the customer. We call the customer back and they say: sorry, your competitor won the bid. We lose $300 million.

What happened? Our complex business process has not been optimized or stream-lined. There were unnecessary hand-offs, each with manual intervention, between business units on the way to winning the business. Imagine if Sales were empowered to a) check financial standing, b) run the contract up against previous builds of similar nature (data mining with confidence levels), and c) run this by a financial analyst and contracts approval individual - all within 2 days, and return to the customer.

This would a) make the business more profitable, b) make it cheaper to handle contracts and approve financials, c) single out contracts that are too difficult, not our sweet spot, or specialized enough to warrant higher prices, and d) make us highly nimble and competitive.

In order to get there, we must a) reduce the number of touch points on the data b) utilize data mining tools in an active warehouse to enable insight at the sales contact level c) simplify/streamline the business processes between customer contact, estimation, finance, and contracts approval - which means Cycle Time Reduction, and business process critical path analysis.

Think of the business processes, both mechanical data touch points and manual data touch points, as a graph of 2D lines (x,y coordinates). Complexity of the process going from A to B is the rise/run or Y coordinate. The X coordinate is the process number. Then graph the business processes as best as possible. Finally, begin to analyze the graph for critical path - attempting to eliminate touch points, and reducing complexity of the business processes (reducing the Y) to end up with as "straight a line as possible".

Keep in mind that changing keys to information doubles complexity, even if the changes are recorded. I think you'll be delightfully surprised. All companies who undertake this effort can save millions of dollars a year with 1/2 the investment. Furthermore, this drives quality up, profitability up, complexity down, overhead down, and time to deliver down. Result? More satisfied customers, and a more nimble business.

Now let's take a look at the dirty data problem (which we'll explore further in Part 3). The first problem is we need an enterprise view of this customer, even if it has to span business SECTORS, and not just companies within those sectors. This will be the ONLY way to roll up a single customer and pinpoint exactly where their deliveries are within the entire organization. Sometimes this is referred to as the Data Supply Chain (Jill Dyche, Baseline Consulting TDWI 2005).

What if we kept the SAME customer account number throughout all processes? We can pinpoint exactly where in the data supply chain their application is, and we can begin tracking and monitoring (metrics, KPA/KPI) on the efficiency of the business process. Ahh you say, we have that in place! Ok, but what happens when you re-bill a customer? Do your systems change the Invoice Number? It's the same problem, different data.

Paradigm Rule #1:
1. KEYS to information within the organization must remain consistent over time.

So business keys are extremely important to start with as a metric in business profitability. If you can start with pinpointing the places where keys are changed throughout the business, you can begin identifying major breaks in the data supply chain.

We'll dive deeper into these concepts in Part 3. Thanks. By the way, TDWI - November, Orlando - come see the Data Vault Data Modeling in play, or read about it at: www.DanLinstedt.com


Posted May 26, 2005 6:11 AM

Business should understand how decreasing cycle time, improving quality, straightening out business processes all lead to increased profitability. Business should also understand that profitability is directly tied to traceability and accountability both in Business and in the data that business deals with. In these entries we explore the connected notions of cycle time, quality (data and business process), business accountability, and success.

In math, what is the shortest distance between two points?
A straight line.

Can anyone tell me what the shortest distance between Customer Contact and Delivery of goods or services is?
Again, a straight line - through the business that is.

What does the straight line represent?
Profitability - basic formula: costs and overhead increase, customer satisfaction decreases as business processes internally become more complex, or require more than necessary manual intervention.

Machines do a wonderful job of tracking data, massive amounts of it - humans do a wonderful job of turning that data into information and making it useful for organizations. However somewhere in the mix, the real "business" that earns profit is lost in translation when the machines are given complex tasks, and dirty data. Information quality and location, along with business accountability/complexity are two key factors to profitability measurement.

The straight line in business should run from first point of customer contact through all the business processes to delivery of the final goods or services. But I'm sure you already know this.

For instance, most extremely large manufacturing businesses have a cycle as follows:
Sales->Contracts->Finance->Planning->Manufacturing->Quality->Delivery.

Each cycle is represented by business units. Each business unit typically owns its own "data" and operational systems, and each business unit typically uses its own "customer key" to represent a customer throughout the life-cycle. Furthermore there are many major and minor processes in each of these business units that alter and change the customer data. Finally, as the hand-off of the customer account occurs (from one business unit to the next), the customer account numbers frequently change.

What I'm saying is:
1. Most businesses do not have a straight line through their business processes, causing confusion, increased overhead costs, and delays in delivery downstream.
2. Most businesses change the data within their business units - creating "kingdoms, fiefdoms" of data ownership when in fact, all that does is hurt and hinder the overall business effort.
3. Most business units in this situation don't talk to one another, believing their "version of the truth" is absolute and correct, even if the other units' financials are "off" by a little bit.

Bottom line for this series (the theme) is to answer: how does this affect my profitability? You may have heard of this approach in the 80's, called Lean Initiatives, or Cycle Time Reduction - these days they call it BAM (business activity monitoring) or BPM (business process management). However, these particular concepts roll up into something bigger: TBM (Total Business Management - which includes activities, processes, data, quality of data, accountability, profitability, overhead costs, and so on).

As this series progresses, we will discuss examples of problems, and possible solutions. For now, if you care to sound off about what you see in your organization, that would be wonderful. In the meantime, I've been asked to talk about the data modeling and architecture sides of this house at the IQ conference in Houston, TX (September). Hope to see you there.


Posted May 25, 2005 7:23 AM

I can't decide if this fits under nanotech or if it fits here, but I'll put it in this category, and focus on the business sides of the house. The winds are blowing outside today, as I sit here anxious for a return call. I've contacted a few individuals at a university which is currently studying structural mining techniques, and will hopefully be discussing some of their progress soon.

In this entry, we will explore the brave new world of what I like to call: Dynamic Data Warehousing. I'm not referring to Dynamic Data sets, I'm referring to Dynamic Structuring and Restructuring of the information systems as a whole.

What is Dynamic Data Warehousing?
I am defining the term as follows:
The ability of a system to 1) interrogate arriving information at run-time 2) discern new "structure" from old "structure" 3) separate the new structure, and build or attach new structural elements to the existing structure, 4) mine existing structural elements for unseen relationships and finally 5) Follow a series of "alert" patterns to notify operators that new nodes or elements have been added, and need to be checked.

In a business sense, the simplistic definition is: to add and/or change the structure of information on the fly based on "content analysis". The adaptation of the structure is in near-real time, and will result in learning things we didn't know before. It basically changes the data model underneath the covers by using neural net techniques and structural analysis ideas.

Why would I want Dynamic Data Warehousing?
Well, for one - it's convergence (see my nanotech articles here for the series on convergence) of both form and function. Why do we want to converge the two? The electronic computing world is already far behind other sciences and advancements, it's time to UPDATE. It almost feels like we're stuck in the '70s. Ok, here's a reference to Bio-Informatics that talks about what nature does with DNA and convergence of form and function: IEEE Magazine.

What is structural mining and why would you want it?
We've been mining our data sets for years, why not our architecture? What kind of insights would we find from profiling and mining the architecture itself? We might find holes in the source system processing, we might find better methods to re-organize the data underneath (make more sense out of it), we might find relationships between structures that "today" don't have any built.

Structural Mining, or structural analysis is the ability to find out what's right and wrong with the architecture. The ability to discover new and different methods for storing, retrieving and hooking data up. Structural mining is a key component of Dynamic Data Warehousing (could be Dynamic Data Integration too), and the ability to change structure on the fly.

Imagine this: you build a web-service to accept incoming transactions from a provider. Today, it has name and address on it. Tomorrow you ink a deal for them to provide city, state, and zip. It shows up on the feed that night. Let's say that IT "hasn't gotten around to changing the structure" yet, and you have a structural analysis engine applied to the service. No sweat, the new fields arrive, and they are in context of the customer record - the SAE (structural analysis engine) doesn't see any harm in automatically adding the fields to the data model, and proceeding with the load.

This is a level 3 change (scale of 1 to 3, 1 being Manual intervention needed before change, 2 being warning: change occurred - 60% to 80% sure that it works, 3 being no problem, context determined with 90% or better confidence rating - change applied).
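Here is a minimal Python sketch of how the three change levels might gate a structural change. It is not an SAE: the confidence value is simply passed in, where a real engine would have to compute it. The function names are hypothetical, and since the scale above leaves the band between 80% and 90% undefined, the sketch assumes everything from 60% up to (but not including) 90% is level 2 and everything below 60% is level 1.

```python
def change_level(confidence):
    """Map a confidence rating (0.0 to 1.0) to the 1-to-3 change scale.

    Assumption: 0.60 up to 0.90 counts as level 2, below 0.60 as level 1.
    """
    if confidence >= 0.90:
        return 3  # no problem, change applied
    if confidence >= 0.60:
        return 2  # change applied, with a warning
    return 1      # manual intervention needed before any change


def apply_record(known_fields, record, confidence):
    """Return (fields after the load, change level or None if nothing new)."""
    added = set(record) - set(known_fields)
    if not added:
        return set(known_fields), None
    level = change_level(confidence)
    if level == 1:
        # Hold the structure as it is; an operator has to approve the change.
        return set(known_fields), level
    return set(known_fields) | added, level


model = {"name", "address"}
incoming = {
    "name": "John Smith",
    "address": "1 Main St",
    "city": "Denver",
    "state": "CO",
    "zip": "80202",
}
model, level = apply_record(model, incoming, confidence=0.95)
```

With a 0.95 rating the city, state, and zip fields are added and the load proceeds, which is the level 3 case from the web-service example. The function returns a new set rather than changing the old one, so the caller still has the previous structure to compare against or roll back to.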

From time to time (as with all neural nets) we'd have to correct the neural model that the SAE has built, but for the time being, it becomes a central part of the glue to building a Dynamic Data Warehouse (or Dynamic Data Integration store).

On the flip side, it would mean learning some lessons about Fraud detection, and teaching those to the SAE as well - so that it can spot potentially fraudulently added data trying to get in to the system. A gate-keeper of sorts.

I believe that Dynamic Data Warehousing or Dynamic Data Information Stores are the next level of integration, however to get there - it requires a data modeling technique that is capable of being altered without losing existing information or corrupting existing structural integrity.

What might be the ROI on something like this?
Well, that's anyone's guess. But I would hazard a guess that if "cleaning up the data sets" can garner 200% ROI or more, then cleaning up the architecture it lives in could be a 4x to 10x multiplier (pure speculation on my part).

Thoughts? Would love to hear your comments on this.



Posted May 6, 2005 2:10 PM

There's a lot of talk in the industry today about VLDW/VLDB (very large data sets), and how too much data might not be such a good thing. I take a different opinion on this subject. In this blog I hope to explore the following questions: What is VLDW/VLDB? What are some of the problems with it? What kinds of ROI multipliers might I find in a big-data set?

I've recently had discussions with a major credit card processor, and as a result will share with you some of the common issues that they face daily.

VLDW/VLDB is defined to be big data. Does that mean we have a 1TB, 10TB, or 100TB data store sitting there? No. If the data is just sitting there and is not used for business purposes, then either it shouldn't be stored on-line (due to cost), or the business may not be looking at their information hard enough, or with the right questions, to use all the data.

Something to think about: Data Mining has begun to be a viable solution to providing analytics, trend analysis, and forecasting above and beyond traditional statistics. In other words, companies with extreme competitive advantage are using Data Mining to reach and discover things about their business that they didn't previously know, or to predict future outcomes with a confidence rating that enables business decisions that make sense.

Having big data and using it are two different things. If you use 80% or better of your big-data sets, then you have a VLDB or a VLDW. The base-definition of Big-Data means different things to different people. Someone sitting at 500MB might think "big" is 2TB. Someone at 2TB might think "big" is 8TB or 10TB, and so on. Instead of trying to define big data, I'll discuss the different levels of changes that happen within terabyte sized data sets.

Ranges:
500MB - 2TB
2TB - 5TB
5TB - 10TB
10TB - 50TB
50TB - 100TB
100TB - 200TB
200TB - 1PB+

The ranges are defined as a rough guide. Things change within each range: data models, disk layouts, CPU-to-disk ratios, network speeds, node sizes, large SMP boxes vs. small MPP vs. clusters, queries, indexing, constraints, and so on. In other words: what works at 2TB doesn't work at 5-6TB. What works at 6TB won't work at 20TB, and so on. Of course there are some hardware vendors out there who provide so much horsepower that these ranges don't apply, and in fact as they progress and "data warehousing appliances" become more commonplace, they will handle most of these issues for us under the covers. But for now, assuming we are on existing systems, this is something to think about.
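
As a rough illustration of the ranges above and of the "80% or better" rule of thumb, here is a minimal Python sketch. The names `size_range` and `uses_enough_data` are hypothetical, and so is the choice to put a boundary value such as 2TB into the upper range; the list itself leaves that open.

```python
# Range boundaries in TB, copied from the list above (0.0005 TB = 500MB).
# Assumption: a boundary value (e.g. exactly 2 TB) belongs to the upper range.
RANGES_TB = [
    (0.0005, 2), (2, 5), (5, 10), (10, 50),
    (50, 100), (100, 200), (200, None),  # None = open-ended (1PB+)
]


def size_range(size_tb):
    """Return the (low, high) range that a data set of size_tb terabytes falls into."""
    for low, high in RANGES_TB:
        if size_tb >= low and (high is None or size_tb < high):
            return (low, high)
    return None  # smaller than the smallest listed range


def uses_enough_data(stored_tb, used_tb, threshold=0.80):
    """Rule of thumb from the text: at least ~80% of the stored data is actually used."""
    return stored_tb > 0 and used_tb / stored_tb >= threshold


print(size_range(6))                # the 5TB - 10TB range
print(uses_enough_data(120, 100))   # 100 of 120 TB in use
```

How "used" is measured (queried in the last quarter, referenced by a report, fed to a mining model) is left to the reader; the sketch only shows the comparison.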

What are some of the problems with VLDB/VLDW?
When the systems reach "live" data usage of 20TB to 100+TB, they experience everything from physical performance breakdowns, to servers crashing. The problems that we have with 500GB of data seem small and are easy to overcome, but all minute problems become very large problems when the data set grows above about 15 TB.

List of potential problems: (assuming large SMP boxes)
* Data modeling breakdown: queries across joined tables no longer work at all, no matter how much RDBMS parallelism you throw at them.
* Indexing breakdown: no matter what you ask of the optimizer, there's just no performance improvement to query times - even with partitioning of the underlying data set.
* Backup and restore no longer work within the desired time frames, and in some cases it is nearly impossible to back up and/or restore entire data sets. Disaster recovery is HIGH RISK!
* Traditional data-over-disk layouts STOP working completely, and in fact begin to hurt performance.
* Replication systems choke on bottlenecked I/O and networks.
* Maintaining distributed data centers becomes a 6 month project just to architect how it's going to work, and then there are negotiations for 24x7x365 bandwidth to keep the data flowing.
* At about 50TB, cost of maintenance, machines, cooling systems, power grids, and IT resources begins to increase by a multiplier of 5x.
* After about 50TB, there are no "canned standards" that companies can follow for a successful VLDB/VLDW.

As for mitigation strategies, relying on experts, or on those who have built and architected systems of these sizes, is paramount. Architecture is everything in these systems. Without long-term architecture and forward thinking, the systems experience growing pains at around 20TB to 48TB, and then the company must call an all-engines-stop and rebuild from the ground up (very costly), or migrate to a new platform (which can also be very costly).

Denormalization is one mitigation strategy that will help, but only in certain cases. Remember that denormalization of data sets will instantly double or triple the storage requirements. Here's a fallacy for you: Storage is cheap. NOT SO at big data levels. If you buy cheap storage, you get "poor performance" or lack of parallelism. Furthermore, the more "performance" you want to drive out of a VLDB/VLDW, the more storage you may actually need.

So what about the data sets? Why can't we/shouldn't we reduce them?
I agree with the experts when they say: too much unused data is a bad thing. I disagree with experts when they say: too much bad or poor quality data is a bad thing.

There are two basic types of information in VLDB/VLDW:
1. Good data (aggregated, cleansed, merged)
2. Transactional Data (Auditable/compliant/traceable)

The business users are divided into multiple user groups: 80%-90% use the good data, or moderately good data (what counts as "good" is open to the end user's interpretation), and 10%-20% require transactional details.

In the Good data set, there's no reason to keep around "old" or unwanted/unused data sets. They should be removed, or placed on a rolling usage cycle. However in the transactional data set (transactional with history), it's at the lowest possible grain. The more data the better! Especially if the business is mining the data set, and/or has audit requirements or federal/international mandates that state it must be kept on line.

Data mining loves big data: the more data it can mine, the better its predictions and confidence ratings. The less granular detail it can mine, the worse its predictions are - you might as well go back to aggregates and standard statistics. In this case, the credit-card processing company also has SLAs with its vendors, along with the need to detect fraudulent activity - they MUST (and do) use a data mining tool on the transactional historical data.

With all these headaches why build a VLDW? Why not just go back to the old-style analytics backed with aggregations, averages, and statistics? Won't that save cost?
Yes, it will save on cost - but here's what you can gain if you build one. This company has 6 months of transactional history on-line, dynamically accessible by end-users. This equates to about 120TB. They are seeing at least a 5x multiplier on ROI (compared with the costs for maintaining and supporting it). They mentioned that if they could keep 12 months of data on-line, they would do it in a heartbeat, and their multiplier would go up to 15x or higher.

The reason? They are missing enough data to significantly impact their decision making capabilities, especially with the data mining engine. In this game, the business must spend a little to gain a lot - especially if they know what questions to ask and have a firm grasp on how the answers will make them more effective and more competitive.
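
To put rough numbers on the storage side of that trade-off, the arithmetic can be sketched as below. The 6 months and 120TB come from the example above; the function names are hypothetical, and the projection assumes on-line volume grows linearly with the months retained, which is my simplification rather than something the company stated.

```python
def monthly_volume_tb(online_tb, months_online):
    """Average on-line volume per month of retained history."""
    return online_tb / months_online


def projected_online_tb(online_tb, months_online, target_months):
    """Projected on-line volume for a longer retention window.

    Assumption: volume scales linearly with months retained.
    """
    return monthly_volume_tb(online_tb, months_online) * target_months


per_month = monthly_volume_tb(120, 6)             # 120 TB over 6 months
twelve_months = projected_online_tb(120, 6, 12)   # same rate, 12 months

print(f"{per_month} TB per month, {twelve_months} TB for 12 months")
```

Under that assumption, doubling the retention window doubles the storage (roughly 20TB per month, 240TB for a year), while the ROI multiplier they quoted would triple from 5x to 15x.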

There's more, a lot more. I discuss the details in my class, along with mitigation strategies. I'd love to meet with you at TDWI in DC (May 19th, 2005) should you wish to drop by. See you next time.


Posted May 5, 2005 4:34 AM
Permalink | No Comments |

That's right, Terminator as in T2 Eyeballs. Well, not really that advanced (yet). I just read in May's issue of Scientific American about nanomorphing silicon implants that take the place of damaged light recognition cells in the back of the eye, basically allowing a blind person to "see" images and outlines. They admit the resolution isn't that hot yet, but it will advance like everything else.

This article will explore Form and Function, and discuss the nature of adaptable neural models, and what it means to build a system that could potentially mimic the human brain.

According to the article, the brain can operate at 10 billion synapse firings per second. Who's Synapse? What's a Synapse? and Why does it Fire? For answers to those questions and more, see your local brain surgeon. (just kidding).

Here's the poor man's definition: imagine for a minute a series of interlinked spider webs. Got the picture? Ok. Now, imagine the spider at the center of each web. Each center of the web represents a term called a neuron. Each part of the web spanning outward, let's call that a synapse. Where one web attaches to another, let's call that a dendrite (receptor).

A spider catches prey first by having a sticky web, and second by feeling the vibrations caused on the web when something gets stuck there. Now imagine the neuron (center of the web) building up a charge and sending that charge down one or more synapses (all at once). Once the charge gets high enough, it fires across the inhibitors to the dendrite receptors on the other side. In other words, one web is capable of shaking another spider web with a directed charge.

Now imagine 15 layers of these webs, each interconnected with the other, and each layer responsible for a "part" of coverage. The interconnectivity can provide a feedback loop to build up a charge, or to "morph" its neural structure and learn things - or in this case, focus on what's important like edges, highlights.
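
The "build up a charge, then fire" behavior described above can be sketched as a toy threshold neuron. This is a generic textbook-style simplification, not the design from the Scientific American article; the `Neuron` class, its threshold, and the weights are all hypothetical, and the sketch has no feedback loops (connecting neurons in a cycle would need extra care).

```python
class Neuron:
    """Toy neuron: accumulates charge and fires once a threshold is reached."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.charge = 0.0
        self.fired_count = 0
        self.targets = []  # (downstream neuron, charge passed on when firing)

    def connect(self, other, weight=1.0):
        """Attach this 'web' to another one."""
        self.targets.append((other, weight))

    def receive(self, amount):
        """Add charge; fire and reset if the threshold is reached.

        Returns True if this neuron fired on this input.
        """
        self.charge += amount
        if self.charge < self.threshold:
            return False
        self.charge = 0.0
        self.fired_count += 1
        for target, weight in self.targets:
            target.receive(weight)  # "shake" the connected web
        return True


web_a = Neuron(threshold=1.0)
web_b = Neuron(threshold=1.0)
web_a.connect(web_b, weight=0.5)

# Four small stimuli: web_a fires on every second one,
# and web_b fires once it has been shaken twice.
fired = [web_a.receive(0.5) for _ in range(4)]
print(fired, web_a.fired_count, web_b.fired_count)
```

Stacking several layers of such neurons, with later layers feeding charge back to earlier ones, is the part the sketch leaves out.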

Nanomorphing is changing the hardware layers to suit the needs of the situation, rather than changing the software layers. The nanotech part of this allows different chemical bonds to be "favored" and "unfavored" depending on the electrical current and stimulation, thus changing the configuration at "run-time".

This is an example of just how important it is to bind form and function closely together - the more specific and targeted the functions are, the more compact they can be, the more efficient they can be. The more bound the form is to that function, the more adaptable the form can be - thus more resilient, and quicker to respond or adapt to its environment. Also, surprisingly, the architecture becomes more standard, fault-tolerant, and redundant, which, by the way, leads to adapted efficiencies during run-time.

This eye piece (according to the article) is made up of transistors modeled in a neural net fashion, with nanotechnology components, layered 5 layers thick. Each layer provides feedback loops to the last, to allow a charge to build up in a specific area, and "fire" a nerve ending in the back of the eye to the brain, resulting in a perceived image.

Note to self: Where's the ACTIVE feedback loop in our Data Warehouses? Are we still in the cave-man stage here?

Sorry about that... Moving on. You think this stuff is too far out? Hasn't happened yet? too difficult to build? Think again, there's a company "in my back-yard" in Boulder, CO called Genobyte... Check them out: http://www.genobyte.com/ They are already building adaptable hardware, and quite surprisingly, have been doing this since 1997.

Anyhow, my point (that seems to take so long to get to) is this: CONVERGENCE IS EVERYTHING. When it comes to nanotechnology and nanohousing (nano data warehousing of the future), we will be forced to combine form and function in order to build adaptable systems with virtually unlimited scalability.

If we can build a system of nanomorphing hardware, and compensating software with encapsulated dynamic feedback loops, we may have the beginning of something interesting.

Would love to hear your comments and thoughts or questions.

Cheers,
Dan L


Posted May 3, 2005 4:24 PM
Permalink | No Comments |

The market is shifting. Vendors are packing more and more features and functionality into their devices, and they are also making those devices smaller and smaller. What does the future Data Warehouse look like? Can it be an appliance-like device? What kind of partnerships or acquisitions can we expect? Why would we choose an appliance DW over our own component selections?

In this blog I look into the future, just to see if we can answer these questions. I believe there are changes coming, long overdue changes.

In the land of yesterday we would have to go in search of "best-of-breed" software, and then pair that up with best-of-breed hardware. Size it appropriately, install it all, and integrate it ourselves (within IT). I believe all that is changing. If it hasn't already, it certainly will shortly.

New vendors on the market are offering coupled hardware with built-in RDBMSs. This is just the start, and as good a start as it is, it still has a little ways to go. Let's talk for a minute. What if you could walk out and buy an ADW appliance (active data warehouse) - self-configured to perform optimally on the machine, embedded within the BIOS, with encapsulated storage and a black-box interface... Would you do it? Especially at a cheaper cost than buying an RDBMS from vendor 1 and hardware from vendor 2?

So what does the future device look like?
It should contain not only the RDBMS, but also ELT software. This software should be embedded onto the machine for fastest performance, along with optimized disk routines, and mechanized load balancing. The ELT software should have two types of inputs: the flat file loading process, and the real-time network plug which reads JMS queues after configuration. The ETLT should be fully self-contained on its own processor slot so that it doesn't interfere with the RDBMS operating in parallel, at high speeds on the disk.

There should also be a BI (reporting tool) card built in. It should have its own IP connections, and reside on its own processor slot as well. The tool and the box configuration should all be browser based. All administration could be fat client I suppose, but why? Why not make it all web/app server? It's separated from the RDBMS and ETLT engine slots, again so that it can run in parallel. The BI tool and the ETLT tool should, however, be based on a common metadata framework.

Now, depending on the number of nodes purchased - hooking them together through a third pre-configured IP allows them to load-balance across a high-speed backbone. Again, the nodes have nothing to do with each other except to distribute the workload.

What kind of partnerships or acquisitions can we expect?
I think in the future, you'll see storage vendors partner more heavily with RDBMS vendors, who are already working on "blade servers" - some vendors have the pre-packaged solution there. Other vendors are coming up to speed. I also think you'll see a larger effort to integrate the BI and ETLT software onto hardware platforms. It's getting cheaper to architect and build hardware, and most of the time we need the extra performance boost. Even if the BI application card is running on a dual CPU at 450 MHz, it's the RDBMS that needs the power.

That's all fine and dandy, but where's the value proposition?
The value comes as follows: automatic updates to the software and firmware over the web, little to no configuration needed (all comes pre-installed, and factory tuned), no fancy load-balancing, or parallelism software needed to gain performance. No messy dual environments for upgrades, no multi-cost purchases, plug & play scalability, speed and performance built into the hardware/firmware.

I think you may see compliance vendors entering this game too; they are already partnering with storage vendors for appliance-based storage.

What makes this work and why?
The RDBMS engine on the appliance must be extremely fast, and extremely scalable. It must focus on bulk-applying data sets or images of data that are time-stamped. It must have high-quality, high-end compression, data quality, and delta-processing capabilities available. As we consolidate our resources, it becomes easier to manage, upgrade, and replace. In this environment you're paying for the engineering to be "done", out of the box, into the rack - load your data and away you go. All data is denormalized inside the box, so compression ratios are very very high, storage needs are low, performance is super fast - we don't have to worry about data modeling or indexing any more. (That's the end goal.)

There are a number of companies to watch out there who are moving in these directions. It won't be long before they can meet all these needs with one appliance.

Of course it wouldn't hurt for these companies to consider a metadata appliance either, or possibly incorporate that directly into the warehouse appliance.

Just a few random thoughts, See you next time.


Posted May 3, 2005 4:15 AM
Permalink | 1 Comment |

Now that I've blogged on the need for an ETL-T engine, I think it only fair to discuss what EL-T still leaves to be desired, and what is required to make EL-T perform. While ETL-T is the industry direction, EL-T has a ways to go before it can "take over". Of course, the notion of ELT "success" is highly dependent on the RDBMS engine that it puts its data in.

Let's explore these notions a little deeper...

EL-T (as I blogged recently) is where the integration industry is headed. Some of the comments I received were in regards to specific tool sets in the integration space. In another blog this week, I'll explore what these tools will need to have in order to survive the next couple years.

Let's start with the advantages of ETL over ELT:
1. ETL can off-load the transformation. This can be particularly helpful if you have a powerful ETL machine, or fast enough hardware and network pipes to perform parallel transformations.
2. ETL with 64 bit has nearly unlimited reach into physical memory, allowing most of the transformations to be optimally performed in stream, in memory.
3. ETL transformations in stream only have to pass the data once, if architected properly.
4. ETL can offer cross-RDBMS best-of-breed features that the databases sometimes don't have (although the RDBMS engines are now catching up).
5. ETL can circumvent a poorly tuned RDBMS server, or an overloaded RDBMS server, or an overloaded DISK I/O channel that the RDBMS is using.
6. ETL has a tremendous leverage point for metadata, consistent and re-usable metadata due to the in stream processing. This makes it easier to track dependencies and data changes.
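
Point 3 above (transforming in stream, with a single pass over the data) can be sketched with Python generators. This is only an illustration of the idea, not how any particular ETL engine works; the function names, the sample rows, and the specific transformations are hypothetical.

```python
def extract(rows):
    """E: read source rows one at a time."""
    for row in rows:
        yield row


def transform(rows):
    """T: apply transformations in memory while the rows stream past."""
    for row in rows:
        yield {"id": int(row["id"]), "name": row["name"].strip().upper()}


def load(rows, target):
    """L: write each transformed row to the target; returns the rows loaded."""
    count = 0
    for row in rows:
        target.append(row)
        count += 1
    return count


source = [{"id": "1", "name": " alice "}, {"id": "2", "name": "bob"}]
target = []

# The generators are chained, so each row is read, transformed,
# and loaded before the next row is touched: one pass, no staging.
loaded = load(transform(extract(source)), target)
print(loaded, target)
```

Because nothing is staged, the transformation work lands on whatever machine runs this pipeline rather than on the RDBMS server.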

Now, before we knock ETL, let's just say there are still some big benefits to being able to perform "T" in stream, even though the ETL paradigm is indeed "dead" or morphing into something else.


What ETL traditionally has trouble with is:
1. Near-Real Time Processing, these are typically "bolt-ons" to a batch engine architecture - unfortunately the engines simply are not equipped for high-performance NRT processing.
2. XML technology, this also is typically a "bolt-on" rather than an engineered core-view, mainly because batch processing at high speeds requires highly structured data, and XML just isn't so. Even though there are "structures", the parent-child relationships, one-to-many, and many-to-many relationships are data driven, along with optional structural components. Typical ETL engines fall down on high-performance XML, and/or ease-of-use XML.
3. Transaction processing, backup, distribution, logging, and all things transactional. Transaction processing isn't just near-real-time processing, it includes the business rules too. Unfortunately ETL has a HUGE hole in the business rules arena. ETL engines simply do not operate as "business process workflow engines"; today they operate more as "IT data integration process workflow engines." Transaction processing sometimes includes emails, escalation paths, time-outs, delays, queues, and business user manual interaction. The ETL architecture simply isn't flexible enough to handle these needs.

Ok, now let's talk about ELT and what its pros and cons are.
Pros:
1. ELT offers tremendous flexibility: as long as the RDBMS engine can be extended, the ELT engine can live on. When the RDBMS engine receives upgrades, performance tuning, or additional hardware, the ELT engine takes advantage of it right away.
2. ELT engines can offer extreme performance in terms of "copy-drop" data movement, they can parallelize the heck out of threads that move data from point A to point B, not much magic there to think about, except maybe fault-tolerance and recovery.
3. ELT engines don't need to move data out and back in to a single RDBMS, they can work with the data in the target RDBMS, within the RAM specifications and CPU that the database engine offers.
4. ELT technology can be trigger driven, real-time (near-real-time), and is more apt to recognize or be based on transactional processing.
5. ELT technology is truly the "next-generation" of data integration tools, of course EII and EAI are vying for this space as well.
6. ELT doesn't require middle tier (extra servers and so on) for deployment of jobs.
7. ELT engines typically have no trouble dealing with XML, mostly because the RDBMS engines have that handling built in now.
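
The load-first, transform-inside-the-database pattern behind these points can be sketched with Python's built-in `sqlite3` module. SQLite stands in for the target RDBMS purely for illustration, and the table names `stg_customer` and `dim_customer` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (id TEXT, name TEXT)")
conn.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")

# E + L: "copy-drop" the raw rows into the target database unchanged.
raw_rows = [("1", " alice "), ("2", "bob")]
conn.executemany("INSERT INTO stg_customer VALUES (?, ?)", raw_rows)

# T: the transformation runs as set-based SQL inside the RDBMS,
# so the data never leaves the database engine.
conn.execute(
    """
    INSERT INTO dim_customer (id, name)
    SELECT CAST(id AS INTEGER), UPPER(TRIM(name))
    FROM stg_customer
    """
)
conn.commit()

result = conn.execute("SELECT id, name FROM dim_customer ORDER BY id").fetchall()
print(result)
```

The Python code only issues statements; all of the transformation cost is carried by the database engine.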

Some of the cons:
1. ELT relies heavily on the performance and tuning of the RDBMS instance. If the instance is slow, ELT has nowhere to go! It will run only as fast as the RDBMS server allows.
2. ELT with huge batches of data can eat tremendous resources on an RDBMS server. If you're running extremely large data sets, you'd better have a super-duty RDBMS engine, and it had better be water-cooled, twin engine, air-intake with overhead cam shaft. In other words, your DBAs have to be the cream of the crop, and really be whizzes at making your RDBMS hum.
3. Some ELT engines don't allow control over the "array batch size" within the RDBMS, this could easily blow log segments/redo's/temp spaces.
4. Some ETL vendors will tell you that their engine is an ELT engine. That holds only if it generates optimized native RDBMS SQL code with advanced functionality.
5. ELT MUST stage the data in order to run deltas. If the vendor claims in-memory delta processing, then they are an ETL engine not an ELT engine (unless again they generate native RDBMS SQL code) - then they might be an ETL-T engine (new breed).
6. ELT software today usually doesn't have all the connectivity options that ETL has (but that will change soon).
7. ELT engines frequently stage to flat-file for bulk-loader processes. If your ELT engine loads through an OS PIPE, be careful! The OS Pipe sizes can be limited, and become a bottleneck in the flow. In other words: loading through a pipe directly into an RDBMS bulk-load facility can be slower than staging to a flat file and blasting the bulk-load with buffering mechanisms.
8. ELT engines REQUIRE extra RDBMS space to transform data, particularly when dealing with VLDB (very large databases). Why? Because READS must be processed in a batch form, so they don't conflict with WRITES, especially if the machine itself or the RDBMS cannot show a linear performance increase with the increase in the size of the hardware.
9. ELT vendors (most of them) need to show their integration to the business rules (this is where the EII vendors have really thrived lately - that and metadata).
10. If you can't tune your SQL, then you're better off with an ETL engine (today). ELT will require technicians with a high proficiency in SQL tuning, and RDBMS tuning.
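
Two of the cons above, control over batch size (point 3) and staging data in order to run deltas (point 5), can be sketched together. Again SQLite stands in for the RDBMS, and the `stg` and `tgt` tables, the `load_in_batches` helper, and the batch size of 2 are hypothetical choices for illustration.

```python
import sqlite3


def load_in_batches(conn, rows, batch_size):
    """Insert rows into the staging table, committing every batch_size rows.

    Keeping each transaction small is one way to avoid blowing out
    log or temp space on a real RDBMS. Returns the number of commits.
    """
    commits = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany("INSERT INTO stg VALUES (?, ?)", batch)
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg (id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE tgt (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO tgt VALUES (?, ?)", [(1, 100), (2, 200)])
conn.commit()

incoming = [(1, 100), (2, 250), (3, 300)]
commits = load_in_batches(conn, incoming, batch_size=2)

# Delta: staged rows that are new, or changed, relative to the target.
# The comparison runs inside the database, which is why the data
# has to be staged there first.
delta = conn.execute(
    """
    SELECT s.id, s.amount
    FROM stg s
    LEFT JOIN tgt t ON t.id = s.id
    WHERE t.id IS NULL OR t.amount <> s.amount
    ORDER BY s.id
    """
).fetchall()
print(commits, delta)
```

Note the extra space this takes: the staging table holds a full copy of the incoming batch alongside the target, which is the cost described in point 8.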

Ok, that said: at the end of the day, I would still like the option of an ETL-T engine that costs less and is flexible enough to deal with the situations that arise. More to come.

Cheers for now,
Dan L


Posted May 2, 2005 6:29 PM
Permalink | 2 Comments |