Blog: Dan E. Linstedt

March 21, 2006

Demystifying the MDM definition

MDM = Master Data Management; why should you care what it means? Many vendors out there have defined which part of MDM they implement; unfortunately they've called it MDM, when it's just one piece of MDM that they are tackling. MDM, or true Master Data Management, is a much larger umbrella than just "master conformed dimensions" or "master lists of quality/cleansed information." MDM includes the term Data Management, and we all agree (at least for the most part) that Data Management is all-encompassing, right? So why the fuss over defining such narrow implementations and then titling them MDM? What is Master Data Management anyway?

Master Data Questions

The second copy of the Master Data is in the data mart side of the house. Remember: any time the data is truly merged, quality cleansed, altered, or prepared for user-utilization, I call that a data mart (it can be a single table, a conformed dimension, or a master data table with "quality" data). It's all a data mart, where end-users can get their information quickly, and the model is tuned for speed and quality of data. In this case, the second copy of Master Data might be a "Master Dimension".

2. Does the data need to exist as a System of Record? If no, then maybe all you have is the notion of "Master Data Sets" embedded in conformed dimensions. Maybe you've defined your warehouse as a Conformed Dimension quality-cleansed warehouse; again, this is OK if you don't want or don't need an SoR in your warehouse, or don't have compliance and accountability to deal with.

3. Is your Master Data housed on your source system?

Data Management Questions

1. What is Data Management?

2. How does Data Management affect Master Data? Master data sets need to be managed.
Ahh, but don't forget semi-structured and unstructured data sets; those too must be managed, and believe it or not there are Master Data sets lurking in these areas as well.

Do you have an MDM initiative? What's worked for you and your organization? What differs from the vendor definitions?

Thanks,

March 20, 2006

Golden Rules of Performance and Tuning

I've been applying performance and tuning techniques to systems world-wide for the past 15 years. I grew up as an assembly level programmer on a CPM-Z-80/Z-8080 Digital VT-180 computer, along with Unix System 5, and a few other machines. It used to be that many would say Performance and Tuning is an art more than a science; well, these days the science part of it is what really makes this work. In this entry I'm going to introduce to you the top golden rules for performance and tuning your systems and architectures - this is just a peek at what I'm going to be teaching at the TDWI conference in San Diego in August. In my assessments I cover everything from hardware, to architecture, to systems, to overloading, and platform sizing.

Have you ever wondered how to reduce ETL/ELT processing times from 72 hours to 23 hours? Or 18 hours to 2 hours? Have you wondered how to gain 400% to 4000% performance improvements from the systems you have? How do you know if your RAM/CPU, Disk, IP/Network, Applications and RDBMS are in balanced and peak performance modes? Have you questioned when to buy new hardware, and what platform to move to?

The following golden rules are the top tips of performance across systems architecture - these are all part of a workshop course and assessment that I offer on-site, which tailors the responses to your organization.

The top rules to performance and tuning any system are as follows:

The rest of the rules are assigned to meet different categories and can include:

There are about 250 such recommendations which go into tuning any sort of system ranging from midrange to client/server based.
Mainframes work slightly differently and require (sometimes) a lifting of the CU limitation put on the login. But let me talk about the first two rules: 1 & 2.

Decreasing the data set:

Increasing Parallelism:

I/Os can kill performance; balancing I/Os and caching activity can be a huge performance gain (or loss if done improperly). One day when we have nanotech storage devices, the "disk" I/O will disappear. Until then, we must live with it.

I'd love to hear what you've done to tune your environments; if I use your story at TDWI I'll quote you as the source. Please let me know if you'd like to be quoted, and feel free to drop me an email in private as well. This entry is just a glimpse into the P&T world.

Thanks,

March 13, 2006

Does MDM include Data Visualization?

From where I stand (ok - sit)....

I was on a plane this morning, and had the opportunity to view the captain's cockpit for a brief while, while they ran through some of their pre-flight checks. As usual, my mind began to wander and ask the "what-if" questions: what if they didn't have a history of best practices, how would they know what to check for pre-flight? Are all the gauges real-time or do some gauges offer "historical" data? How many of these gauges "manage data" for a single context? And then it hit me: all the gauges and knobs are really a "visualization" of the information they need to prepare for flight, fly, land, and do all the things a captain and co-captain need to do to move an airplane through the air safely.

This entry is more about unanswered questions than it is about speculation. I'd love to hear about your experiences as management, executive level, or otherwise - and what you might do in this situation.

Well, that got me to thinking. I know cockpits are complicated; I can see that. It takes hundreds of hours (if not thousands) to learn to fly a commercial jet safely, to understand all the switches and knobs, and the "heads-up" displays that constantly stream information at the pilots.
I started to reason: if getting a commercial pilot's license requires all this training, should CEOs, executives, and boards of directors also go through rigorous training? Where are the instructors for "running a company?"

I also began to wonder: what would happen if some of these fancy "real-time read-out displays" were not computerized, or visual? Maybe there's a pilot out there who can comment on what it's like to fly through a storm without visual aids, knowing what's up/down, or with broken gauges that needed to be repaired.

I began to wonder - why isn't there a "cockpit" approach to running corporations? Would it or could it become that standardized? Is there a way to visualize all the information in a corporation? If you could visualize corporate business management in a cockpit manner, how would you describe the nature of the graphs, charts, landscape / horizon layouts? What kinds of knobs and dials would you have?

I began to think of the cockpit as Master Data Management (all data in the right place at the right time, attuned to the right purpose) for an airplane. Share with us how this might affect your visualization or MDM efforts.

Thanks,

March 9, 2006

DW2.0 - Introductory Thoughts

I've been granted permission by Bill to discuss DW2.0 on the blog, and in other articles that I write. This entry is an introductory look at DW2.0, the overall definition, sections, and components. If you wish to use the terms you will need to contact Bill directly. I've included Bill Inmon's stringent legal ramifications below:

"The definition of DW2.0 is intended for the non commercial use of anyone who wants to use the material. However, any commercial use of the material and the trademark is strictly forbidden and will be vigorously monitored and prosecuted.
Commercial usage of DW2.0 specifically pertains to (but is not limited to) commercial usage in seminars, presentations, books, articles, speeches, web sites, white papers, panel discussions, reports, and other written and oral forms is forbidden. If you wish to use material about DW2.0 commercially, licensing can be arranged for a fee."

There are 4 sectors of DW 2.0 which comprise the "data warehouse" in a disciplined format (note: all quoted material is from Bill Inmon's site and description of DW2.0):

Interactive Sector - The place where high performance data warehouse processing occurs
Integrated Sector - The place where integrated data resides
Near Line Sector - The place where data with a lower probability of access resides
Archival Sector - The place where data with a truly low probability of access resides

From a 3000 ft perspective, each "sector" looks (at first) like a separate copy of the data; this may not turn out to be the case. In fact, these can be made into logical divisions - particularly if the data model underneath supports the logical architecture in a physical format. I've created a public domain (freely available) data modeling architecture called the Data Vault which supports both the interactive and integrated sectors. The notions of Near Line and Archival Sectors appear (at first glance) to be more physically related to storage. I'll dive into these in future blog entries.

In my opinion, the RDBMS vendors should be the first to stand up and take notice (along with the appliance vendors). They should be rushing to the table to support DW2.0 from a mechanical standpoint - offering the developers "seamless" integration across each of the four sectors. That would bring the reality of a logical model and metadata management to the implementation cycles.
I long for the day when I can "logically model" the data and no longer care (or know) how the physical implementation takes place - the only addition to the logical model might be data types and field lengths from the physical world.

Let's switch gears and discuss DW2.0 compliance, auditability, and SOR (system of record) for a moment. Below is Bill's definition of SOR and the best place to identify data as arriving from an SOR.

Because the data that enters DW2.0 has its first appearance in the operational environment, great care needs to be taken with the data. In a word, the data that eventually finds its way into DW2.0 needs to be as accurate, up to date, and complete as possible. There needs to be defined what can be determined the source data system of record. The source data system of record is the data that is the best source of data.

I often ponder the question: what does SOR truly mean? Hmmm - by that I wonder about the following case study (which actually happened to me 10 years ago on a government data warehouse). We built a data warehouse; it contained a master parts list, and a few other master lists (hence my recent entries on Master Data Management). Our warehouse also contained integrated data organized by business key, but stored at the lowest level of grain. Furthermore, the information was not "transformed" except in raw data type, and defaults were assigned in specific cases documented by SLAs with the business.

Three things happened. Auditors were brought in because naysayers were stating that the warehouse was "wrong", and they wanted the project stopped. The first thing that happened was around data auditability. The auditors asked: why do the reports from the data marts not match the operational reports?
Our team demonstrated the value of raw integrated data (both bad and good) stored within the warehouse, and that the warehouse reflected what was in the source system. The auditor passed the warehouse, and then proceeded to tell the business that the operational report (financial calculation) was wrong and needed to be corrected. The business would not have had "accountability", much less found or fixed the problem, if our data warehouse had not been deemed a "reliable and compliant" source of data.

The second thing that happened (at the same time): the auditor saw the parts list, employee list, work order list, and so on, and then asked: does this "vision of integrated data" exist in any one source system? The answer was clearly no. The auditor then checked the individual data elements for auditability and traced them back to their source systems; once satisfied, he labeled the warehouse suitable to become a "system of record", as it was the only place that data existed.

The third thing that happened: the auditor then asked for a source system that was called "the master system" for bill of materials to be re-loaded with 5-year-old data. But the business had changed, the models in the source system had changed, and the restore could not take place - making it impossible for the "master system" to be a system of record for historical data. The only place that data could be loaded was in the warehouse.

As I read through the DW2.0 specification, I believe there is a place for accountability, SOR, and compliance within the warehouse; again, it has a lot to do with the traceability of the data sets and creating audit trails where they didn't exist before. We'll dive into this more later. For now, if you have thoughts or comments, I'd love to hear about them. What part of DW2.0 would you like to know about?

Thank-you,

March 8, 2006

DNA Computing - Control over DNA Molecules

DNA computing is rapidly making strides in the nanotech industry.
There is an interesting evolution with absolutely profound implications: control over a single DNA molecule via nano crystal antennae. The presentation is available for a small fee, but shows just what is possible. Imagine a massively parallel computing engine at phenomenal speeds, controlling millions or billions of DNA molecules via radio signals. Wow! How about a thumb drive with 10^8 terabytes of computing power in a couple grams of DNA solution? Searching this solution in less than 3 seconds for answers, computing within the solution in 3 to 10 seconds...

The presentation is on the MIT web site. The web blurb talks about the following:

Anyone can imagine controlling a model car or airplane with radio signals, remotely guiding the machine along a prescribed pathway. In this Knowledge Update, readers learn that the same is being done with DNA and other molecules. This Update describes the tools behind this molecular control, which relies on nanotechnology. In addition, readers learn how this technique can control the binding of DNA, which governs biological processes from cell division to switching genes on and off. Consequently, controlling bimolecular operations opens many possibilities, such as using this nano-control for genetic testing, building molecule-size devices that move on command, and much more.

Now, let's dive into nano-computing for a moment: imagine a computing system containing a few grams of DNA - say within the size of a thumb drive for a USB port. Within that thumb drive are two things: modified DNA with nano crystal antennae, and a computing system that produces super short, very "weak" radio transmission waves; just enough of a wave to reach the localized DNA. Of course the frequency must be localized as well, and the radio wave must be too weak to travel outside the bounds of the thumb drive - maybe the inside of the thumb drive is coated with a shielding material that keeps the radio waves within the device.
Power consumption is low for this kind of thing. It would be very easy to "program" the DNA, especially since the radio waves cut, splice, and control on/off of the molecules. The challenge would be in reading the DNA results.

Suppose there are two mechanisms available to "read results". One possibility might be based on a solution, encouraging and discouraging bonding based on ionization of the molecules; the reading mechanism might then be a segment of light that passes through the entire solution, where either the shadow or the intensity of the shadow produces a read-out of the result. Alternatively, instead of light and colors, maybe additional radio waves are passed through the solution - ones that don't interact with the antennae. What bounces is read into an "imaging" device, and the image is then interpreted by standard programmatic methods.

It is possible then, by combining existing technology with nanotechnology into a single device, to see how "exponentially hard" computational problems can be solved through a simple USB plug and play, and that existing technology can be used to "read" the answers and send the signals in parallel to the actual computation engine.

However, now that I think of it, why not use this for simple solutions too? Solved in parallel, all the DNA strands and programmable DNA molecules should come up with the same answer, every time. Radio waves offer the dynamics of the same signal to each programmable element at the same time; using imaging and light/color/shadowing techniques, the solution could be "read". Localizing the radio waves and shielding the cover would minimize interference.

I'd love to hear from you, and see what you think of this future vision.

Thank-you,

March 7, 2006

Is it time to re-define your Data Warehouse?

I've commented in the past on my definition of the data warehouse, and recently, based on that definition, I've been commenting on Master Data Management.
In this blog I take a step back, and post the pros and cons of constructing a compliant (active) data warehouse. I would love to have everyone weigh in, and tell us what kind of a data warehouse your organization is implementing and why. I'd like to clear the air and see if compliance within a data warehouse is really an issue for the enterprise.

What exactly does a COMPLIANT data warehouse mean to you? Please tell us, we'd love to hear about it.

I've grown up in the industry believing that constructing auditable historical data stores is the proper way to build "data warehouses." I've had huge successes in passing audits, proving the warehouse contains correct data according to the source, and producing data marts of all shapes and sizes. In the environment we were in, with this approach, we've shown time and time again the flaws in the operational systems (including operational reports) which were costing the company millions of dollars a year. Without auditable historical data stores (what I call a compliant data warehouse), the nay-sayers would've been right when they blamed the warehouse for being "wrong" and our team would have been put "out of business."

This approach to defining data warehouses and the process of data warehousing has led me to new architectures (like the Data Vault data model), new methods of loading and validating data utilizing ETL/ELT routines, and writing articles on compliance and the nature of the data loads.

However, I understand from a number of sources that not all data warehouses need to be compliant - but is this really true? I'd like to hear from those who don't need the warehouse to be compliant or auditable within their organization. I'd like to know exactly what the enterprise is using the warehouse for, and how they justify the data within.
With that, let's take a look at the pros and cons (from my opinionated stance) of compliant versus non-compliant warehouses:

Compliant:

Cons:

Now, let's take a look at a traditionally defined data warehouse, or a non-compliant data warehouse.

Cons:

These are just my thoughts; I'd love to hear what you would add to the pros and cons of each of these lists - I want to know what you are experiencing in the market place. Many of the warehouses built with compliance in mind (as I've described it above) have had 10+ years of success and are in fact growing today, with buy-in from finance, HR, sales, and even the corporate board of directors.

Please let me know what you think. I'm also curious to know how many of you are seeing a request for a compliant data warehouse - and just what does that mean?

Hope to hear from you soon,

March 6, 2006

Hidden in the un-structured information...

Welcome again. Unstructured data is a hard thing to grasp, let alone to process; but if we (businesses) are going after it, then we MUST have a reason. That reason? There must be value in the information hidden in the unstructured layers - after all, what is "unstructured" data anyway? I think free-form text is still semi-structured; images, emails, word-docs, and other such elements are all structured to some degree, otherwise programmatic approaches would not be able to display the documents, search the images, allow alterations, or perform matches.

I think what we should be focusing on in the Data Warehousing / Data Integration industry is how to best leverage the "unstructured information" programs and algorithms already built. Think about it: with images there are all kinds of image processing programs for image matching, alteration, consolidation, over-lay, resizing, colorization, and so on. For drawings, there are CAD programs, and element tags at the end or in the middle of the image that explain all the components.
For chemical images there are sets of commands and tags that explain how to build a rotating 3D visual of the chemical elements and their associative parts. For word-docs and other docs there are "parsing and processing programs" like Microsoft Word, KDE KOffice (open source), Star Office, and so on. For e-mails, there are many different programs - but most of the email traffic can actually be "sniffed" off TCP/IP packets without much damage to the content (if any today).

Given this definition, the question I truly have is: WHAT IS unstructured data? I'm not so sure it's such a good term to use, but let's just accept (for the purposes of this entry) that unstructured data is everything that isn't defined (easily) by a standard RDBMS table structure - without BLOBs and CLOBs of course; let's pretend that everything defined by a BLOB or CLOB is considered "unstructured" for a minute and then return to the question above.

Ten years ago (or more) I worked as an employee for a government manufacturing corporation: big money, big contracts, compliance, and unstructured data. Our manufacturing plan was filled with unstructured data. At that time we needed (as a part of our effort) to integrate parts drawings, and to look for text within the CAD drawings to figure out what impact it had on the plan; in other words, annotations for specific parts drawings. Now back then, the CAD images were just that, CAD images - and picking the text out wasn't as simple as "looking for the text attached to the image". We literally had to process vector graphic commands.

Why? What was hidden in our unstructured information?

Why is this important for us?

So how do we access this information easily today?

What you do with the information after you have discovered it should actually be pre-determined by the business case, or the reason for purchasing and installing EII in the beginning. As usual, the business needs to drive the need for IT to solve the problem of accessibility.
Establish the value of "finding" and "using" the data in the unstructured world before you set out to implement.

What are some of the EII's strengths today?

What are some of the features that EII will need in the near future?

Remember, summarization and what is done with that summarization of unstructured and semi-structured information can often shed light on "how" these documents are utilized, or meet the business requirements set before them. EII is a tool that can and should help in these areas. Don't forget unstructured search tools as well - EII should partner up with these vendors in order to have a wider grasp of "tagging" technology and summarization/scoring technology. The best use of unstructured/semi-structured data is the one that has a predefined business question/business case to answer to.

Are you accessing unstructured/semi-structured data? I'd love to hear from you - what are your challenges or successes with what you've done?

Thanks,

March 3, 2006

VLDW: What happens in a scaled cluster?

I wrote a blog on this a while back, about MPP vs Clustering; now I'm going to discuss what happens in an Active cluster (to use an MS term) that usually causes problems. I'll also talk about clustering within a single node-group under an MPP option. While there are many issues surrounding clustering and volume, some are more prevalent than others. The golden rule is: Volume and Latency change everything! Come see me in August, at TDWI - where I teach a VLDW class and its technical aspects.

In my last entry on this topic, I stated that it is better to run with MPP than with Clustering and that the more volume the clusters contain, the costlier it is to keep them going. Here's what's happening under the covers.
This pertains to a completely clustered SMP system:

The network traffic between the nodes MUST be dedicated to server-to-server communication; if the network between the servers is mixed with disk traffic, client traffic, or other server traffic, the communication layers begin to break down. Maintaining or keeping "up with the nodes", in other words synchronizing each node in the cluster every millisecond for access to the shared master memory allocation table, becomes a bear. The more "nodes" that are added to the SMP cluster, the more network traffic there will be, and the harder it becomes (mathematically) to keep them all in sync - due to the limitations of speed of the network, speed of the CPUs, speed of RAM, and speed of disk. These upper limits are constantly being raised as speed of hardware increases; however, they lower back down with addition of RAM on the machines, or addition of data to manage on shared disk. In a clustered environment such as this, everything is shared across all machines.

The next thing that happens is the sharing of disk. The sharing of disk introduces I/O collisions across the I/O network (which should also be independent of every other kind of traffic). I/O contention must be managed; all nodes have access to the same data at the same time, and the trick is to (again) set up a master data access table, just like a master RAM access table, and then synchronize the master data access table across all nodes in the cluster.

The problem comes (again) as the data set grows, and localization of the information becomes a burden. In other words, the database that runs a cluster needs to scan 50 million rows, and run computations.
The table is partitioned, but it becomes a shared job - the process starts on a single node in the cluster, the database thinks it has 40GB of RAM to access, so it begins to load the RAM in each of the clustered machines with different data sets. As it does, this exponentially increases the network traffic between the machines (in order to synchronize the RAM and CPU actions across the machines), and increases the network traffic between the machines and the disk device.

Ultimately a second request for large data comes in while the first request hasn't finished yet - there's not much RAM left on each clustered node, so SWAPPING ensues. This is again an exponential increase in I/O (I/O includes everything from network to RAM to CPU to disk access). Again, the synchronization routines take over, and every single node in the cluster tries its best to balance the resources. This of course leads to extremely slow response times for both the first and the second access points, and so forth. The synchronization routines slow this process down, way down to a crawl. Operations begin to take on a sequential nature as opposed to a parallel nature because they run out of RAM, run out of CPU computing power, and the network gets so bogged down that it cannot handle any further requests.

Now, we think by adding a new clustered node that we'll solve the problem - but instead it only makes the problem worse. I think by now it's evident that clustering for a very large data warehousing solution is NOT desirable. Can you put small numbers (2 to 4) of clustered nodes on a single MPP solution? Yes, if you architect the nodes to operate independently. Clustered nodes in a single MPP solution are one way to handle this kind of volume growth; adding another MPP node of clusters is OK, because it maintains autonomy and scales linearly. BUT if you put too much data on a single "clustered" node within the MPP, you run into the same problems that large clusters present.
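
To put rough numbers on the synchronization argument above, here is a minimal sketch in Python. It is a toy model, not a measurement: it assumes that every node in a fully shared cluster must stay in step with every other node, while independent MPP nodes only report to a single coordinator. The function names and the coordinator layout are hypothetical, chosen only for illustration. Under these assumptions the growth is quadratic rather than strictly exponential, but the point is the same: each added cluster node adds more synchronization work than the one before it.

```python
def cluster_sync_links(nodes: int) -> int:
    """Pairwise sync links when every node must stay in step with every other node.

    Assumption for this sketch: a fully shared cluster behaves like a complete graph.
    """
    return nodes * (nodes - 1) // 2


def mpp_coordination_links(nodes: int) -> int:
    """Links when independent nodes only talk to one coordinator.

    Assumption for this sketch: a shared-nothing MPP layout with a single coordinator.
    """
    return nodes


if __name__ == "__main__":
    print("nodes  cluster_links  mpp_links")
    for n in (2, 4, 8, 16, 32):
        print(f"{n:5d}  {cluster_sync_links(n):13d}  {mpp_coordination_links(n):9d}")
```

In this model, going from 4 to 8 clustered nodes takes the sync links from 6 to 28, while the MPP count only goes from 4 to 8, which is one way to picture why small clusters inside an MPP node can stay manageable when large clusters do not.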
Large "clustering" of machines (in my opinion) won't necessarily be feasible until we have speed-of-light communication between the clusters, and we are using RAM-based or nanotech-based data storage rather than physical mechanical disk.

MPP, on the other hand, splits the load, and the trick with MPP is to avoid a "hot-node" which acts like a cluster in trouble. Balance of the data and the processing in the MPP world is EVERYTHING. But with balance and an appropriate "split" of the data sets, near-linear scalability can be achieved. Today, I nearly always choose the MPP option for data warehousing; in another entry at another time, we will explore MPP versus Clustering from an operational standpoint.

If you have success stories about clustering, I'd like to hear about them - please also include the estimated size of the data set, number of nodes, amount of RAM on each clustered node, and number of CPUs on each clustered node. If you have horror stories, I welcome those too. By sharing your experiences we can begin to shed light on this subject.

Hope to hear from you,

March 1, 2006

Data Warehouse Appliance, another look

Appliance-based data warehousing is on the rise, and no wonder - the costs per terabyte are lower, and for specific applications of the warehouse these platforms are sometimes blazingly fast. They offer plug and play technology with HA (high availability) and Fail Over just by plugging in another appliance. They offer remote management, self-updates to the BIOS and firmware, and most of them run on open operating systems like Linux. In this blog entry I'll discuss both the pros and cons of Appliance Based warehousing, but I still believe that this will be a market segment to watch, and will eventually flood the market with the backbone for high availability data integration and warehouses.

There was a comment a while back that discussed an article in DMReview about appliances.
It was written by Roger Gaskell of White-Cross systems; they build hardware for high-performance MPP and, lo and behold, produce a PROPRIETARY appliance.

What do appliances bring to the business?

We saw it with the disk market in the 80's, we saw it with other devices in the 90's like consolidation of the cell phone, with podcasts, downloads and now music on demand - appliances are everywhere. I've written on this subject of "CONVERGENCE" on B-Eye before; convergence is everywhere. The disk manufacturers have now grown up - the disks are no longer "just simply storage". They contain CPUs, RAM, caching algorithms, load-balancing mechanisms, reformatting (under the covers), hot-swapping, fail-over, dynamic traffic re-routing, hot-spot contention resolution, self-monitoring, remote updates, and more than that, they all adhere to common SAN or NASD standards, meaning we can plug in an IBM device next to an EMC device, and they don't care - they'll talk to each other over standard DISK I/O protocol.

What is missing from the Appliance today?

I do not believe that proprietary hardware will "stop" the flow of appliances, nor do I believe it's necessarily a bad thing. EMC has it, IBM has it, Fujitsu has it; just about every disk manufacturer out there has it in their appliance and they are well-received today. It's a matter of opening up the architecture to a STANDARDS BASED service exchange, one that is obviously high-speed. These standards do not exist today, but they will - particularly as companies purchase these solutions from multiple vendors. Just look at IBM's DB2 UDB - MPP option that sits on Data Blades; it shares similar concepts, although maybe it is not quite an "appliance" just yet.

Feel free to contact me for more information. I would also love to hear what you think - both positive and negative; add your bullets to the list of why or why not - appliances in the future.

Thanks,