Blog: Dan E. Linstedt« July 2006 | Main | September 2006 » August 29, 2006Where should Dirty Data be cleaned up?I just blogged on the need for allowing dirty data to flow through to an auditable reporting area. There are a lot of questions about WHERE the dirty data should reside, and where the dirty data should be cleaned up. In this blog entry we'll dive in to take a look at that in a short consolidated view. If you are fighting with compliance and auditability at a systems or data level, hopefully this will be helpful. In Nirvana land, there would never be dirty data. Our source systems would be perfect, and would capture all clean data. And if the data wasn't clean it would go to great lengths to get it that way before releasing it to the rest of the corporation. Of course in Nirvana land we would have a SINGLE STACK (one operational system, one data store that had both - current / transactional and historical / warehouse like) information system. In Nirvana land the single system would integrate with web-services external to our company. In Nirvana land, this "single system" would really be the be-all-end-all, dare I say it: Appliance / Platform / pre-configured with all that you need, everything from DSS to source system OLTP and edit checks to information quality, and reporting and dashboards. In Nirvana land, it would all be... well... nirvana. But alas, this is not the case. In the real world, for years - the data warehousing industry has taught or preached that data quality should be between the source system and whatever the next "step" of integration is: which might be an ODS, an EDW, an ADW, a Kimball Bus Staging area, and from a strategic point of view - this may work nicely. As long as whatever tools are being used to change and alter the data produce the correct audit trail. However from a tactical or operational point of view, or from a point of view where we are trying our best to "catch" broken source systems, it doesn't work so well. Cleansing the data or using data quality to produce a "single version of the truth" can have a negative impact on our ability to be accountable (as business users) and actually get the source systems fixed. That said: where oh-where do we put the IQ systems? Between Sources and ODS? And where oh-where do we put the "bad" records? Do we kick them back to the source system? Do we report on them separately for end-users to clean up? Do we leave them in a daily-snapshot set of error tables? Do we let them through our ODS, and Data warehouse environments to end up in separate data mart areas? Remember that TRUTH is subjective, and no one can tell "who's truth it is - or when". And also, that auditors have a very different definition of what truth is, they try to take an objective view of truth - one that is traceable and auditable. Putting DQ or IQ between sources and the ODS allows Master Data to be built within the ODS - the caveats are that the bad data, pre-altered data trails, and any audit trails produced by the integration or "data alteration" stream MUST be warehoused somewhere over-time. Simply "improving the quality of the data" without showing the errors to the business causes all kinds of heartburn downstream. Kicking bad records back to the source system is fine (for a while), until more code must be hand-written on the source system to take it as a new feed, this drives maintenance costs up through the roof. Having quality / cleansed data in the ODS allows great interaction from a strategic standpoint, and possibly allows transactional feeds back to the source systems. Ultimately it greatly increases the complexity of the solution that is handling "BAD DATA" up-stream. Putting DQ or IQ between the Sources and the Warehouse can (and often does) cause auditability problems. Not many of these ETL or ELT tool sets have the capacity to create data "as of before it was changed/aggregated/filtered", that must be architected and increases the difficulty of the load routines - not to mention slows down the overall load cycles. Furthermore, Bad data must be captured in "error tables" which are usually in the staging area, sometimes these are only daily refresh - no analytics can be run on these data sets, they disappear to fast. As we move towards real-time or near-real-time processing, there is NO TIME to process, integrate, aggregate the data (either between sources and ODS or between Sources and your DW). Real-time dictates that when the data is available, good bad or indifferently, that it be loaded to your warehouse AS IS - no delay's, otherwise the data decays and becomes useless, or the "attempt to detect fraud" is cleansed away, cleaned up and never detected. Putting DQ or IQ between the Warehouse and the Data Marts is probably (in my opinion) the best way to handle this. The Data Warehouse then becomes a statement of integrated facts (integrated by business key), and the marts then become specific versions of the truth from which management can begin to dissect where they are losing money. This all applies when loading "today’s" data, and in fact, cleansing with audit trails before reaching either the ODS or the DW applies when loading "today’s" data as well. What doesn't fly is the data model, once the data model has been constructed for "cleansed" data, it very rarely allows broken data (or error data) in, in other words, loading historical data wreaks havoc on the data model, or dirty historical data is cleansed - and through that process the "broken business process / broken business rules" are cleansed and washed away as well. Now, is there value to cleaning up the source systems? You bet, this is the ultimate goal. Nirvana. Is there value to having cleansed data in your ODS? Absolutely! You get the idea, there's definite value in having cleansed and quality checked data, but there's also value in running analytics on the dirty data, when did it break the business rule (back in history), why? What was the pattern? Then the real work begins: what caused the pattern? Is the pattern still happening today? Is the business losing money or opportunities with customers because the source system is still broken? And so on. So next time you think about where to PUT your DQ/IQ solution, consider what the impact is of NOT looking at the dirty data, ask the question: instead of kicking the data back to the source system, is there somewhere we can warehouse this information to examine it's patterns? I think you'll be surprised at all the business accountability you'll uncover. Love to hear your thoughts on this. Dan Linstedt August 20, 2006Dirty Data Sets = Hemorrhaging Money in BusinessIn my most recent blog entry I discuss temperature related data sets, near the bottom I bring up a lot of questions about large scale data sets and dirty data. Let's pick up where I left off... One thing is clear, as we march forward, our data sets will only grow, not shrink. Something to take note is: what exactly is "garbage data", what does it mean? Can you identify it and remove it from your systems without impact to audits? If you clear it out, are you removing the possibility of tying together or discovering a meaningful relationship across your business that you didn't have before? If you have garbage data, does it mean your business is hemorrhaging money? Data storage is growing, our requirements to "keep it all" is growing, not shrinking. But is it justified? Maybe, let's take a look at some of the other things causing our data sets to grow: That's right folks, dirty data or garbage data doesn't necessarily mean it can be washed away, tossed aside, or removed from the system anymore. A fully integrated view of the enterprise means: I have a view which tells me what I'm doing WRONG as much as it tells me what I'm doing RIGHT. If you ever are to get to a point where you can consider DATA AS AN ASSET or undergo asset evaluation, we must be able to quantify and qualify the following: 1. What is generating dirty data? Ok and the KICKER: Why must I (business owner) be accountable? The Dirty Data tells the story. Mining Dirty Data and understanding it's existence over time is just as important as "consolidating, collecting, and producing" high quality, cleansed data for the business (this data can be used to run the business in day-to-day operations. Dirty data can point out where the business is hemorrhaging $$ (money), or perhaps is suffering losses due to unseen fraud. The moral of the story? When dirty data begins to get cleaned up at the source system, at the business processes, and by business users, usually morale improves, excitement improves, new projects are granted, and productivity gains are seen. WHY? Because business users finally feel empowered to take charge and make the system work the way it always should have. Business users see an opportunity to spend less time worrying about getting around dirty data, and more time using "correct" data downstream. They finally feel justified in fixing 25 year old systems. I was there; I was in an environment where this happened. For the first time in 15 years, we saw more change requests to the source system than had ever been issued in the past. As the data cycled back through the organization and improved in quality, management took notice of the newly "agreeing" numbers coming through the business reports from the enterprise data integration store. They applauded the efforts, and overhauled business processes from the top down. It was refreshing. You said that Dirty Data causes hemorrhaging of money.. how does this work? This cycle starts all over again when new dirty data is entered, costing the business more time and more money (on an exponential curve). It get's tougher and tougher to fix as business rules change. IF A BUSINESS WANTS OR NEEDS TO BE NIMBLE, they must fix the SOURCE of these problems, but in order to do that, they must undertake a dirty data expedition (project). My good friend Larry English speaks of these things in is TQM courses. So, in your organization - where does dirty data get put? Out with the garbage? or is it addressed the way it should be, as an organizational or business process problem? Stop hemorrhaging money! Learn how to capture the dirty data and get it fixed now, so that over time, there is less and less dirty data. Don't just assume that the haystack is all bad, and that the data is throw-away, this will lead to bad business decisions in the future. My firm has a set of industry best practices that we are currently employing at large government, financial, and travel agencies where we can show the return on investment for cleaning up dirty data. Thoughts? Comments? Temperature of Data for RDBMS, and DW 2.0As you know Temperature of Data is one of the next "big things" to come from RDBMS engine vendors. In this blog entry we will discuss temperature of data and what it is, and how it ties to DW 2.0. The initial business problem is that data sets are continuing to grow, and grow - particularly as enterprises come under more scrutiny for compliance and auditing. In fact, I read yesterday (no surprise) that all email trails are capable of being under subpoena in a court of law - this means that email alone deserves it's own compliant data warehouse. But that's for another day. Temperature of Data, what is it? HOT = Accessed all the time, or extremely important data requiring sub-second response times. Hot data must reside in RAM continuously. The more HOT data, the larger the RAM requirements or so one thinks. In-RAM database engines have recently been acquired by most of the leading RDBMS technologies; some vendors are rolling their own. In-RAM RDBMS engines require lots of RAM, or require complete compression across every column without sorting, which can be done through an indirect access to a hash table of existing values. The problem is: as user requests grow across business, more and more data becomes "hot" requiring additional RAM resources to keep it in memory. The only thing that can answer this call is Nanotech Memory (which I will blog on shortly). HOT Data in DW2.0 is defined as Interactive or Integrated, depending on the need. In DW 2.0 the HOT data must include not only current transactional data sets, but some level of context as well - which usually means Master Data sets, Pointers to Metadata that describe Textual Data Sets, and Transactional Detail Data Sets, or minimum descriptive detail data sets. Medium Data = is data accessed most of the time, but where response times can be anywhere from 1 second to 10 or 15 seconds. In this category, aggregation analysis, strategic analysis, and some levels of master data can be requested. Certainly this is NOT a desirable area for transactional data, particularly in detection of Fraud. Detection of Fraud should be in the HOT data section. Medium Data is usually stored on slower storage, including internal disk, super-fast SAN, high-speed controlled attached disk - something with lots of I/O channels, high parallelism, and incredibly fast access times. Medium Data in DW 2.0 is defined to include: Some Master Data (that which is not HOT, Partial Textual Data, and partial descriptive data. It may also include partial aggregation layers for "current" snapshots (within the past 10 minutes, refreshed every 10 minutes) so that information action time is significantly reduced. Medium Data sits smack in the middle of integrated data sets. Luke-Warm Data = Data that is accessed rarely, rarely may be (for example) once every 30 minutes or twice every 4 hours. Data like this is usually SCANNED in nature. In other words, Luke-warm data is accessed but for other reasons - to be aggregated (mostly), or to be mined, cleansed, or updated. Luke Warm data will be tossed out of RAM as soon as the accessibility is complete. In DW2.0, Luke Warm data sits on slower external Disk, over the WAN possibly, or sits on near-line storage like WAN drives, DASD, or slow SAN drives. The Data sets in Luke-Warm data are defined for use in contextual queries, or strategic long-term queries, or when users are digging for answers over-night. Cold Data = historical context that is hardly ever accessed, but when requested, must have a response time equal to that of a couple minutes. Cold data is typically used by auditors once a year, or twice a year, or even once or twice a month. Cold data sits on Archival storage, like CD drives, worm drives, and slow tape libraries. Cold Data in DW 2.0 is considered Archival Data; Cold data sits out of play until requested. It might contain the full textual reference data (from emails, word docs, full images), it might contain full descriptive data like unused addresses, or historical references from data that was "relevant two years ago". Cold Data is inactive for the most part, but when needed - can be brought on to near-line or integrated storage temporarily (and automatically) to handle the request. There is no direct fine line between any of these types of data, and for the most part it will vary from corporation to corporation, until the definitions are nailed down across best practices of Active Data Warehousing and Real-Time data integration. What is clear is that METADATA MUST be defined in order to "classify" data components within the RDBMS engines, the thresholds that are set must be defined within the metadata as business rules, and these rules MUST be dynamic no matter which RDBMS engine you are using. Furthermore, the METADATA must be available to BI tool sets, and technical business users. With the help of IT, the technical business users can set and re-set the definitions of HOT, COLD, Medium, and Luke-Warm. This means that a tool like EII is primed for taking over this space - it can manage the metadata, interface with business users, and reset or re-write the thresholds back to the RDBMS technology that define where data sits. The RDBMS engines must then deal with where to put this data and how it fits. One thing is clear, as we march forward, our data sets will only grow, not shrink. Something to take note is: what exactly is "garbage data", what does it mean? Can you identify it and remove it from your systems without impact to audits? If you clear it out, are you removing the possibility of tying together or discovering a meaningful relationship across your business that you didn't have before? If you have garbage data, does it mean your business is hemorrhaging money? YES... more on these questions in another blog. What does this mean to business? Questions? Thoughts? Love to hear from you. Cheers, August 19, 2006MDM and Consolidation of Data SetsMDM data often is dispersed across the organization. This begs the question: how can the MDM be a viable asset to the business base? Is the Master Data reused throughout the organization the way it should be? Is it defined in the right context (Master Metadata)? But technically, Master Data should be consolidated into a single global data center. MDM is not a tool, not a toy, not a process - it's a way of doing business that includes tools, best practices, people, governance, metadata and single answers. Remember: MASTER DATA is NOT, I repeat: NOT a single version of the facts, rather it IS a single version of today's corporately accepted TRUTH. Also remember: what's true today was NOT true yesterday (last week, last year, 5 years ago...) Master Data IS auditable as a system of record, but it's one of 3 system of record definitions (see my blog on System of Record). Ok, so where does that leave me? To get to Master Data, I would strongly suggest that you have already defined enterprise wide conformed dimensions, or if you've got a 3rd normal form warehouse, that you've already defined enterprise wide accepted metadata definitions. In other words, I don't believe that you can get to Master Data Management successfully without FIRST going through a Master Metadata Management effort. So what is Master Data Exactly? Can the Master Data Tables be linked together? Keep in mind that Master Data CHANGES CONTEXT depending on who's using it!!! That means, that Master Metadata must be defined, and metadata definitions over-ridden at operational levels (as long as there exists a 100% dependancy on the parent metadata chain) in order to determine context of the Master Data Set. For example: A car is a car is a car, it has a VIN number - the VIN number doesn't change even though the car's color changes, or the seats change, or the radio is swapped out. The CAR CAN EXIST WITHOUT A DRIVER / OWNER! The business rule for "shipping" of that Car cares about a parent Container, and a Parent ship to that container. The Sales-floor cares about the "car" and the prospective buyer of that car, and the OWNER cares about the car itself. Context of the CAR and how the master data of the CAR changes depending on who's using the data and how it's viewed. Master Data MUST be consolidated in a single data center. If it's to be utilized (for performance purposes) in other systems, it must flow downstream from the central synchronization point to a local copy. It's a read-only local copy in order to avoid stove-piping in the industry again. Master Metadata must also be centralized, AND the master metadata MUST be delivered along with the Master Data in order to make sense of it at run-time or access time. Are there problems with the Master Data centralization effort? Before you tackle Master Data, please please please spend time in considering the architecture carefully, and I would suggest that you're ADW, EII, Metadata, and web-services components are in place for at least one component of the business. We are a world-class implementation firm who builds solutions that scale to the Petabyte ranges. Come see my MDM night school course at TDWI next week, followed by my VLDW class the next day. Thoughts? Questions? Comments? Thanks, The nature of Appliances in the DW SpaceMany are now discussing the "Appliance" for Enterprise Data Warehousing. I've blogged on this in the past about definitions (or lack thereof), and what it means to be an appliance. I ran a google search on "appliance+data warehousing" and came up with a few vendors, and their thoughts on what an appliance is. We'll try to sort through the claims in this entry - and drive it back to a simple level of understanding. DatAllegro: Calpont: Netezza: Now when we take a look at the "APPLIANCE" side of the market, given the definitions I've written in the past, let's keep in mind that something new is brewing: DBMS Vendors are now moving toward a mix of components, use of high-end for super-fast, and super critical data (hot data), low-end for slow data, non-critical, least accessed (cool data), and intermediate pricing and parts for luke-warm data. Temperature rating of data is what's just around the corner. The "Appliance" platform (including RDBMS, and hardware, and possible data mining applications) of the future will plug in to your network, provide self-configuration, and allow data to be temperature rated. In some cases the temperature of the data will be managed by the RDBMS software itself. The appliance will have LOTS of RAM, as RAM get's cheaper, and faster - the appliance of the future will invest in utilizing RAM solutions with compression technologies, and other components. Calpont is new on the scene, and we will have to wait and see how they handle things. One of the items I blogged on in the past is Samsung and it's use of Nanotech to create a 16GB RAM chip: "Samsung Announces It Has Developed World’s First 16Gb NAND Memory Chip" What this does, is allow "platform/appliance" manufacturers to extend their RAM reach in to huge amounts of storage, with tiny spaces - provided they are willing to pay for the expensive technology. Eventually Terabytes of storage will be available on RAM chips. All data will be RAM based once nanotech breaks the price barriers and it becomes efficient to produce. All of that aside, one must look very carefully before deciding on an appliance. Ask the questions: what types of parts are in the platform? What is the MTBF for the parts themselves? Including disk, RAM, CPU, motherboard, etc.. How can the appliance be upgraded? Is there a remote monitoring facility that can predict when failures occur? In my mind, a EMC has it spot-on. Their DISK is an appliance, they use high-quality parts in their high-end systems, they have remote monitoring, and are often on-site before the disk fails to fix it, and so on. RDBMS vendors venturing into the world of Appliances can take a lessons-learned approach from EMC, In fact - it would be interesting to see one of these vendors partner up with EMC to leverage its entire pre-built monitoring infrastructure. I wonder if EMC could even outsource something like that... Anyhow, don't be afraid to ask the tough questions - find out where the product roadmap is going, try to align your goals with a hands-off approach to the "appliance". It should be self-maintaining, self-tuning, and self-upgrading. I'd love to hear your questions. Cheers, |