Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I sit on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

March 2006 Archives

MDM = Master Data Management, so why should you care what it means? Many vendors out there have defined which part of MDM they implement and, unfortunately, have called that piece MDM; it's just one piece of MDM that they are tackling. True Master Data Management is a much larger umbrella than "master conformed dimensions" or "master lists of quality/cleansed information." MDM includes the term Data Management, and we all agree (at least for the most part) that Data Management is all-encompassing, right? So why the fuss over defining such narrow implementations and then titling them MDM?

What is Master Data Management anyway?
Let's break it down into two pieces: Master Data, and Data Management. I think "Master Data" brings up quite a number of elements that need to be considered. Below is a checklist my firm uses during implementations of Data Governance and best practices:

Master Data Questions
1. Does the data need to be compliant and auditable?
If yes, then there are two copies of Master Data in the integration arena. The first copy is compliant and auditable, and is merged only according to the business key. Merged is really the wrong word; by merged I mean attached to the same key, defined at the same semantic level across the business, so that it essentially means the same thing to the business. This is the only level of integration that takes place, and it happens without a loss of grain. Each target zone carries a record source that allows maximum traceability and compliance of the data, and the data set is housed _exactly_ as it stood on the source system. Read up on Data Vault data modeling for more information.

The second copy of the Master Data lives on the data mart side of the house. Remember: any time the data is truly merged, quality-cleansed, altered, or prepared for user utilization, I call that a data mart (it can be a single table, a conformed dimension, or a master data table with "quality" data). It's all a data mart: a place where end users can get their information quickly, and where the model is tuned for speed and quality of data. In this case, the second copy of Master Data might be a "Master Dimension". A small sketch of both copies follows.
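To make the two-copy idea concrete, here is a minimal Python sketch assuming a simple in-memory store: copy one attaches raw rows to a shared business key and carries a record source for traceability, while copy two derives a cleansed "Master Dimension" from it. The system names, fields, and the "most recent record wins" survivorship rule are illustrative assumptions only, not a prescription; in a real build these structures would be Data Vault hubs and satellites feeding a mart.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List

@dataclass
class RawRecord:
    """One source row, kept exactly as it arrived (no cleansing, no loss of grain)."""
    record_source: str            # e.g. "CRM_SYSTEM" -- traceability for compliance/audits
    load_dts: datetime            # when the row landed in the integration area
    payload: Dict[str, str]       # raw attribute values, untouched

@dataclass
class BusinessKeyEntry:
    """Copy one: every raw record attached to one enterprise-wide business key."""
    business_key: str
    records: List[RawRecord] = field(default_factory=list)

integrated: Dict[str, BusinessKeyEntry] = {}   # hypothetical integration store

def attach(business_key: str, record_source: str, payload: Dict[str, str]) -> None:
    """Attach a source row to its business key -- the only integration performed."""
    entry = integrated.setdefault(business_key, BusinessKeyEntry(business_key))
    entry.records.append(RawRecord(record_source, datetime.utcnow(), payload))

def build_master_dimension() -> List[Dict[str, str]]:
    """Copy two: a cleansed, merged "Master Dimension" tuned for end users.
    Survivorship rule assumed here: the most recently loaded record wins."""
    dimension = []
    for entry in integrated.values():
        latest = max(entry.records, key=lambda r: r.load_dts)
        dimension.append({"business_key": entry.business_key, **latest.payload})
    return dimension

# Two systems describe the same customer; both raw versions are kept for auditability,
# while the dimension exposes a single merged row.
attach("CUST-1001", "CRM_SYSTEM", {"name": "ACME Corp.", "region": "West"})
attach("CUST-1001", "BILLING_SYSTEM", {"name": "Acme Corporation", "region": "W"})
print(build_master_dimension())
```

The point of the split: the raw copy never loses grain or traceability, while the dimension can be rebuilt with different cleansing rules at any time.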

2. Does the data need to exist as a System of Record (SoR)?
If yes, then the data set must be a "master" single consolidated instance of the data at the lowest level of grain; again, see my response above about compliance and auditability.

If no, then maybe all you have is the notion of "Master Data Sets" embedded in conformed dimensions. Maybe you've defined your warehouse as a conformed-dimension, quality-cleansed warehouse; again, that's ok if you don't want or need an SoR in your warehouse, or don't have compliance and accountability to deal with.

3. Is your Master Data housed on your source system?
If so, then you might have a single copy of the data set. If not, then the data set might be spread across multiple source systems.

Data Management Questions
Note: this area is huge, and due to time and space constraints I will not address the entire segment of Data Management; I simply hope to convey that MDM is a much larger umbrella than the "vendor hype" would lead you to believe.

1. What is Data Management?
That is a huge question with a huge answer. Data Management encompasses the act of understanding, monitoring, managing, maintaining, moving, integrating, synchronizing, dis-integrating, and delivering your data. It also includes items like security and access (data governance), growth (metrics), value (KPAs and KPIs), and mining (asking unknown questions). Master Data Management couples the "master data" notions with the Data Management notions; it is bigger (much bigger) than just data warehousing or data integration.

2. How does Data Management affect Master Data?
Data Management is an initiative, a journey, not a destination. It is an ongoing effort to track, manage, and unravel the information across the enterprise. Data Management stretches across all data sets, but the Master Data in particular shows high value and high return (when done properly). Managing that data set over time is what keeps the high-value proposition intact. Without data management strategies and best practices, "master data" sets decay over time; their usefulness slowly fades as the enterprise changes business rules and business processes. What is a master data set today may not constitute the "right" master data set tomorrow. Be careful not to fall into the trap of letting your Master Data Set be set in stone from this day forward.

Master data sets need to be managed. Ahh, but don't forget semi-structured and unstructured data sets; those too must be managed, and believe it or not, there are Master Data sets lurking in those areas as well.

Do you have an MDM initiative? What's worked for you and your organization? What differs from the vendor definitions?

Thanks,
Dan Linstedt


Posted March 21, 2006 5:49 AM

I've been applying performance and tuning techniques to systems world-wide for the past 15 years. I grew up as an assembly-level programmer on a CP/M Z-80/8080 Digital VT-180 computer, along with Unix System V and a few other machines. It used to be said that performance and tuning is more art than science; these days, it's the science that really makes this work. In this entry I'm going to introduce the top golden rules for performance and tuning your systems and architectures. This is just a peek at what I'm going to be teaching at the TDWI conference in San Diego in August. In my assessments I cover everything from hardware, to architecture, to systems, to overloading, to platform sizing.

Have you ever wondered how to reduce ETL/ELT processing times from 72 hours to 23 hours? Or from 18 hours to 2 hours? Have you wondered how to gain 400% to 4000% performance improvements from the systems you already have? How do you know whether your RAM/CPU, disk, IP/network, applications, and RDBMS are balanced and running at peak performance? Have you questioned when to buy new hardware, and what platform to move to?

The following golden rules are the top tips for performance across a systems architecture. They are all part of a workshop course and on-site assessment that I offer, which tailors the recommendations to your organization.

The top rules to performance and tuning any system are as follows:
1. Reduce the amount of data you're dealing with
2. Increase Parallelism

The rest of the rules fall into different categories and can include:
3. Balance load across the applications
4. Re-Architect the processes
5. Limit the RDBMS engines to their use of hardware
6. Partition, partition, partition
7. Manage restartability, fault-tolerance, fail-over
8. Classify Error categories
9. Do not overload hardware with too much parallelism
10. Tune disk access

There are about 250 such recommendations that go into tuning any sort of system, ranging from midrange to client/server based. Mainframes work slightly differently and (sometimes) require lifting the CU limitation placed on the login. But let me talk about the first two rules.

Decreasing the data set:
There are a multitude of ways to decrease the data set (see the sketch after this list):
1. The first is to identify which data you will actually use during processing, and then ensure that only that data passes through the process. Sometimes this requires re-architecture in order to see the performance gains, or to be able to reduce the data set at all.
2. The second is to partition the data and apply the second rule: increase parallelism. Once partitioned, each partition within the parallel set of processes deals with "less data"; therefore, if the hardware can handle it, performance will increase.
3. Vertical and horizontal partitioning are the two kinds of partitioning available. Vertical partitioning splits by number of "columns" or precision of the data set; horizontal partitioning is what we are used to with RDBMS table partitioning. The two are NOT mutually exclusive; unfortunately, most RDBMS engines today do NOT do a good job of vertical partitioning.
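As a concrete (and entirely hypothetical) illustration of the points above, the sketch below uses pandas to prune columns at read time (vertical partitioning), filter rows early, and split what remains into horizontal partitions ready for the parallel step. The file name, column names, date cutoff, and partition count are assumptions for illustration only, not a recommendation for your environment.

```python
import numpy as np
import pandas as pd

# Rule 1: reduce the data set before it ever reaches the heavy processing.
# usecols prunes columns at read time (a simple form of vertical partitioning).
df = pd.read_csv(
    "orders.csv",                               # hypothetical source extract
    usecols=["order_id", "customer_id", "order_date", "amount"],
    parse_dates=["order_date"],
)

# Filter rows as early as possible -- keep only the data the process actually uses.
df = df[df["order_date"] >= "2006-01-01"]

# Horizontal partitioning: split the remaining rows into chunks so each parallel
# worker (rule 2) deals with "less data".
partitions = np.array_split(df, 8)              # 8 is an assumed partition count

for i, part in enumerate(partitions):
    print(f"partition {i}: {len(part)} rows")   # each part is handed to a worker
```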

Increasing Parallelism:
1. Remember this: DBAs often make the mistake of setting only one switch in the RDBMS engine to engage parallelism; then, if the performance gain isn't seen, they change the switch back. This approach will NOT work. Most RDBMS engines require 10 to 16 switches to be set just to engage parallelism the proper way and allow the engine to rewrite queries, perform inserts and updates in parallel, and so on.
2. By simplifying (re-architecting) the processes, most processes can be made to run in parallel. Large, complex processes are bound (in most cases) to serialize unless they are constructed to execute block-style SQL. Some RDBMS vendors don't allow any other kind of processing because they execute everything in parallel for you.
3. WATCH YOUR PARALLELISM: too much of a good thing can overload your system; again, balance must be achieved. Watch your system resources. There are ways to baseline your system to gain the maximum performance for the minimum amount of change, and I can help you identify the quick changes to be made.
4. Remember: most RDBMS engines these days have parallel insert, update, and delete, but taking advantage of parallel updates and deletes usually requires a script executed within the RDBMS (as opposed to a direct connect). This is because most RDBMS vendors don't offer parallelism for updates/deletes in the APIs / SDKs they expose to applications (this should change in the near future). A hedged sketch of partitioned parallel processing follows this list.
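And here, as promised, is a hedged sketch of points 2 and 3: one worker per partition using Python's multiprocessing, with the worker count capped below the machine's core count so the hardware isn't overloaded. The transform() function and the sample partitions are stand-ins I've invented for illustration; swap in your real process and data.

```python
import multiprocessing as mp
import os

def transform(partition):
    """Stand-in for the real work performed on one partition of the data."""
    return sum(row["amount"] for row in partition)

def run_in_parallel(partitions, max_workers=None):
    """Run one worker per partition, but never more than the hardware can absorb."""
    cpu_count = os.cpu_count() or 1
    # Leave headroom rather than grabbing every core (rule 3: watch your parallelism).
    workers = min(len(partitions), max_workers or max(1, cpu_count - 1))
    with mp.Pool(processes=workers) as pool:
        return pool.map(transform, partitions)

if __name__ == "__main__":
    # Hypothetical partitions, as produced by the data-reduction step above.
    parts = [[{"amount": i}] * 100 for i in range(8)]
    print(run_in_parallel(parts))   # eight partial sums, computed in parallel
```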

I/Os can kill performance; balancing I/Os and caching activity can be a huge performance gain (or loss, if done improperly). One day, when we have nanotech storage devices, "disk" I/O will disappear. Until then, we must live with it.

I'd love to hear what you've done to tune your environments; if I use your story at TDWI I'll quote you as the source. Please let me know if you'd like to be quoted, and feel free to drop me an email in private as well. This entry is just a glimpse into the P&T world.

Thanks,
Dan L


Posted March 20, 2006 5:45 AM

From where I stand (ok - sit)... I was on a plane this morning and had the opportunity to view the captain's cockpit for a brief while as they ran through some of their pre-flight checks. As usual, my mind began to wander and ask the "what-if" questions: what if they didn't have a history of best practices, how would they know what to check before flight? Are all the gauges real-time, or do some gauges offer "historical" data? How many of these gauges "manage data" for a single context? And then it hit me: all the gauges and knobs are really a "visualization" of the information they need to prepare for flight, fly, land, and do all the things a captain and co-captain need to do to move an airplane through the air safely.

This entry is more about unanswered questions than it is about speculation. I'd love to hear about your experiences as a manager, an executive, or otherwise, and what you might do in this situation.

Well, that got me thinking. I know cockpits are complicated; I can see that. It takes hundreds of hours (if not thousands) to learn to fly a commercial jet safely, to understand all the switches and knobs and "heads-up" displays that constantly stream information at the pilots. I started to reason: if getting a commercial pilot's license requires all this training, should CEOs, executives, and boards of directors also go through rigorous training? Where are the instructors for "running a company"?

I also began to wonder: what would happen if some of these fancy "real-time read-out displays" were not computerized, or not visual? Maybe there's a pilot out there who can comment on what it's like to fly through a storm without visual aids, without knowing what's up or down, or with broken gauges that needed to be repaired.

I began to wonder - why isn't there a "cockpit" approach to running corporations? Would it or could it become that standardized? Is there a way to visualize all the information in a corporation? If you could visualize corporate business management in a cockpit manner, how would you describe the nature of the graphs, charts, landscape / horizon layouts? What kinds of knobs and dials would you have?

I began to think of the cockpit as Master Data Management (all data in the right place at the right time, attuned to the right purpose) for an airplane. Share with us how this might affect your visualization or MDM efforts.

Thanks,
Dan Linstedt


Posted March 13, 2006 8:58 PM

I've been granted permission by Bill to discuss DW2.0 on this blog and in other articles that I write. This entry is an introductory look at DW2.0: the overall definition, sectors, and components. If you wish to use the terms you will need to contact Bill directly. I've included Bill Inmon's stringent legal notice below:

"The definition of DW2.0 is intended for the non commercial use of anyone who wants to use the material. However, any commercial use of the material and the trademark is strictly forbidden and will be vigorously monitored and prosecuted. Commercial usage of DW2.0 specifically pertains to (but is not limited to) commercial usage in seminars, presentations, books, articles, speeches, web sites, white papers, panel discussions, reports, and other written and oral forms is forbidden. If you wish to use material about DW2.0 commercially, licensing can be arranged for a fee."

There are four sectors of DW2.0 which comprise the "data warehouse" in a disciplined format (note: all quoted material is from Bill Inmon's site and description of DW2.0):

Interactive Sector - The place where high performance data warehouse processing occurs
Integrated Sector - The place where integrated data resides
Near Line Sector - The place where data with a lower probability of access resides
Archival Sector - The place where data with a truly low probability of access resides

From a 3,000-foot perspective, each "sector" looks (at first) like a separate copy of the data, but this may not turn out to be the case. In fact, these can be made into logical divisions, particularly if the data model underneath supports the logical architecture in a physical format. I've created a public domain (freely available) data modeling architecture called the Data Vault which supports both the Interactive and Integrated Sectors. The Near Line and Archival Sectors appear (at first glance) to be more physically related to storage. I'll dive into these in future blog entries.
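Purely as an illustration of the "logical divisions" idea (and definitely not part of Bill's DW2.0 specification), here is a small sketch that routes a data set to a sector using access recency as a stand-in for probability of access. The age thresholds are invented assumptions; a real implementation would be driven by the business and the storage platform.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

class Sector(Enum):
    INTERACTIVE = "high-performance data warehouse processing"
    INTEGRATED = "integrated data"
    NEAR_LINE = "lower probability of access"
    ARCHIVAL = "truly low probability of access"

def route(last_accessed: datetime, now: Optional[datetime] = None) -> Sector:
    """Assign a data set to a DW2.0 sector, using how recently it was touched as a
    proxy for probability of access. The thresholds are illustrative assumptions."""
    age = (now or datetime.utcnow()) - last_accessed
    if age <= timedelta(days=7):
        return Sector.INTERACTIVE
    if age <= timedelta(days=365):
        return Sector.INTEGRATED
    if age <= timedelta(days=3 * 365):
        return Sector.NEAR_LINE
    return Sector.ARCHIVAL

print(route(datetime.utcnow() - timedelta(days=30)))    # -> Sector.INTEGRATED
```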

In my opinion, the RDBMS vendors should be the first to stand up and take notice (along with the appliance vendors). They should be rushing to the table to support DW2.0 from a mechanical standpoint, offering developers "seamless" integration across each of the four sectors. That would bring the reality of a logical model and metadata management to the implementation cycle. I long for the day when I can "logically model" the data and no longer care (or know) how the physical implementation takes place; the only additions to the logical model might be data types and field lengths from the physical world.

Let's switch gears and discuss DW2.0 compliance, auditability, and SOR (system of record) for a moment. Below is Bill's definition of SOR and the best place to identify data as arriving from an SOR.

Because the data that enters DW2.0 has its first appearance in the operational environment, great care needs to be taken with the data. In a word, the data that eventually finds its way into DW2.0 needs to be as accurate, up to date, and complete as possible. There needs to be defined what can be determined the source data system of record. The source data system of record is the data that is the best source of data.

I often ponder the question: what does SOR truly mean? Hmmm. Consider the following case study (which actually happened to me 10 years ago on a government data warehouse).

We built a data warehouse; it contained a master parts list and a few other master lists (hence my recent entries on Master Data Management). Our warehouse also contained integrated data organized by business key, but stored at the lowest level of grain. Furthermore, the information was not "transformed" except in raw data type, and defaults were assigned only in specific cases documented by SLAs with the business.

Three things happened. Auditors were brought in because naysayers were stating that the warehouse was "wrong", and they wanted the project stopped. The first issue centered on data auditability. The auditors asked: why do the reports from the data marts not match the operational reports? Our team demonstrated the value of raw integrated data (both bad and good) stored within the warehouse, and showed that the warehouse reflected what was in the source system. The auditor passed the warehouse, and then proceeded to tell the business that the operational report (a financial calculation) was wrong and needed to be corrected. The business would not have had "accountability", much less found or fixed the problem, if our data warehouse had not been deemed a "reliable and compliant" source of data.

The second thing happened at the same time: the auditor saw the parts list, employee list, work order list, and so on, and then asked: does this "vision of integrated data" exist in any one source system? The answer was clearly no. The auditor then checked the individual data elements for auditability and traced them back to their source systems; once satisfied, he labeled the warehouse suitable to become a "system of record", as it was the only place that data existed.

The third thing that happened: the auditor asked for the source system called "the master system" for bill of materials to be re-loaded with 5-year-old data. But the business had changed, the models in the source system had changed, and the restore could not take place, making it impossible for the "master system" to be a system of record for historical data. The only place that data could be loaded was the warehouse.

As I read through the DW2.0 specification, I believe there is a place for accountability, SOR, and compliance within the warehouse; again, it has a lot to do with the traceability of the data sets and creating audit trails where they didn't exist before. We'll dive into this more later.

For now, if you have thoughts or comments - I'd love to hear about them. What part of DW2.0 would you like to know about?

Thank-you,
Dan L


Posted March 9, 2006 6:59 PM

DNA computing is rapidly making strides in the nanotech industry. There is an interesting evolution with absolutely profound implications: control over a single DNA molecule via nano-crystal antennae. The presentation is available for a small fee, but it shows just what is possible. Imagine a massively parallel computing engine running at phenomenal speeds, controlling millions or billions of DNA molecules via radio signals. Wow! How about a thumb drive with 10^8 terabytes of computing power in a couple of grams of DNA solution? Searching that solution in less than 3 seconds for answers, computing within the solution in 3 to 10 seconds...

The presentation is on the MIT web site.
The implications are profound. The notion of controlling a single DNA molecule with a radio wave is incredible. Let's step off the edge and look into the future, over the horizon; let's see if we can think of applications and implications of this technology within the DW / BI space. Beyond the obvious applications in biotech and medical science, let's see what we can come up with.

The web blurb talks about the following:

Anyone can imagine controlling a model car or airplane with radio signals, remotely guiding the machine along a prescribed pathway. In this Knowledge Update, readers learn that the same is being done with DNA and other molecules. This Update describes the tools behind this molecular control, which relies on nanotechnology. In addition, readers learn how this technique can control the binding of DNA, which governs biological processes from cell division to switching genes on and off. Consequently, controlling bimolecular operations opens many possibilities, such as using this nano-control for genetic testing, building molecule-size devices that move on command, and much more.

Now, let's dive into nano-computing for a moment: imagine a computing system containing a few grams of DNA, say within the size of a thumb drive for a USB port. Within that thumb drive are two things: modified DNA with nano-crystal antennae, and a computing system that produces super-short, very "weak" radio transmission waves; just enough of a wave to reach the localized DNA. Of course the frequency must be localized as well, and the radio wave must be too weak to travel outside the bounds of the thumb drive; maybe the inside of the thumb drive is coated with a shielding material that keeps the radio waves within the device.

Power consumption would be low for this kind of thing, and it would be very easy to "program" the DNA, especially since the radio waves cut, splice, and switch the molecules on and off. The challenge would be in reading the DNA results. Suppose there are two mechanisms available to "read results". One possibility might be based on a solution, encouraging and discouraging bonding through ionization of the molecules; the reading mechanism might then be a beam of light passed through the entire solution, where the shadow and/or its intensity produces a read-out of the result. Or, instead of light and colors, maybe additional radio waves are passed through the solution, ones that don't interact with the antennae; whatever bounces back is read into an "imaging" device, and the image is then interpreted by standard programmatic methods.

It is possible, then, by combining existing technology with nanotechnology in a single device, to see how "exponentially hard" computational problems could be solved through a simple USB plug-and-play, with existing technology used to "read" the answers and to send the signals in parallel to the actual computation engine. However, now that I think of it, why not use this for simple problems too? Solved in parallel, all the DNA strands and programmable DNA molecules should come up with the same answer, every time.

Radio waves offer the dynamic of delivering the same signal to each programmable element at the same time, and using imaging and light/color/shadowing techniques, the solution could be "read". Localizing the radio waves and shielding the cover would minimize interference.

I'd love to hear from you, and see what you think of this future vision.

Thank-you,
Dan Linstedt


Posted March 8, 2006 8:01 AM
