Blog: Dan E. Linstedt« November 2006 | Main | January 2007 » December 19, 2006Necessities of GovernanceGovernance is an industry buzzword these days, with all the SOA initiatives going on, one would think that Governance would be on the top of the list as well. If you're not governing your enterprise consolidation, you probably are not taking full advantage of the benefits and cost savings that could be coming your way. Sure governance is an uphill battle in the beginning, sure everyone fights standards and agreed standards, and yes - absolutely - no one can seem to decide on how to define the common data sets (common data model). But if you're involved in, or working with SOA it is imperative to engage governance at the enterprise level. However it's not just governance that makes it work, a formal methodology should be utilized to assist with the governance as the organization organically grows its efforts. These include: ITIL, SEI/CMMI and a few others. I've defined different kinds of governance in my articles here on B-Eye Network in the past, just for re-iteration, I'll define the governance again: Governance IT Governance Data Governance SOA Governance So what is SEI/CMMI in the first place? And how does it affect my governance? What are some of the groups involved in governance? MDM Initiatives are like any other, they will require governance to be executed properly. In fact, any initiative that “serves” at the enterprise level should fall under an over-laying governance initiative. Master Data and Master Metadata are highly visible, therefore: high risk = high return = high visibility = high pressure to do it right. This means that Governance cannot and should not be ignored when addressing the MDM initiative. Again, MASTER DATA and MASTER METADATA serve the entire enterprise. Central governance: Distributed governance: Guiding principles:
Specific architectural principles:
Why should I utilize SEI/CMMI as my methodology guide for Governance?
CMMI Helps reduce and control IT Spending, in other words, CMMI is GOVERNANCE for IT in action! Come see my TDWI presentation in Las Vegas, February 2007 on Governance, Compliance, and CMMI principles. I'm also teaching "Defining and Understanding MDM" As always, I'd love to hear your experiences, positive and negative about governance principles. Please comment! Cheers, December 15, 2006ETL, ELT - Challenges and MetadataI had some good questions come in recently, thank-you. In this entry I'll share my experiences with ETL and ELT with regards to metadata; I'll also try to elaborate on when it is right to use which type of technology. This also goes back to my original articles in Teradata Magazine on ETLT. When to use which technology. Bottom line: ETL still has it's uses, and will continue to be useful moving forward. Here are some typical cases and reasons why you might still want to use ETL: In other circumstances we will use the ETL engine to provide EL in parallel and partitioned jobs, providing maximum throughput with minimum development time. So what does ETL provide in the metadata space: If we look at ELT from a scripting point of view, and not using a tool to execute, we lose all kinds of metadata from process flow, to business logic, to full traceability. Particularly this is seen when scripting Teradata BTEQ, Oracle SQL Loader and Stored Procedures, SQLServer command line, and DB2 UDB Functions, and so on... Any time we return to CODE base for ELT, we're back to deciphering scripts to figure out metadata, and much of it is hidden. Unfortunately (today) most of the ELT that I run into is still hand-coded, although this is set to change in the near future. This is where people stop me and say: well, what about Sunopsis? They were 100% ELT... True - but their metadata repository left a LOT to be desired, they were young in the field, and hadn't quite grown up yet. Ok, so what about Informatica, Ascential, Ab-Initio, Business Objects Data Mover, and so on... So then this argument kicks in: Well, ok - but what about when I override the SQL in these tools? That get's me 100% compliance with ELT and database feature sets... Yes, but then some tools simply don't have the SQL de-composition routines built into their engine yet. They can't pick these apart for the metadata lineage and achieve end-to-end lineage. For those tools that offer end-to-end lineage with SQL, the next thing that I find is: Finally, when it comes down to it - ELT requires much simpler processes, resulting in many more steps, resulting in a lot more "temporary" tables during execution cycles, metadata is bound to be lost through this process (today) - this is also due to change in the future, and get better as the tools find the gaps and fix them. Now what are some of the pros and cons of ELT? Cons: Pros and Cons of ETL Cons: So, you see - it all depends on how you look at the picture and what you want. Ultimately the nirvana is that no metadata will be lost, regardless of database specific SQL, or ETL "functions / transformations" utilized. The other nirvana is that you will ultimately design an "ETL" stream, and when it no longer performs, flip a switch on a job that says: turn this whole thing into ELT for me without me re-designing; that is the day when ELT will have grown up on the job, and be our best friend. Of course, it all depends on where you can afford to put your system load, and where you can spend your money to achieve scalability.... Hope this helps, December 13, 2006Does Big Data Equal More Business Intelligence?The question has been argued over the past two decades, is more data better? Do I really need more data? Where on earth is all this data coming from? How do I manage the ever-growing data sets? Does more data mean better business decisions? How can I reconcile these monstrous data sets? and so on... You've heard by now (I'm sure) many different folks in the industry offer their valued opinions. We can stand up on our feet and say: I'm on the fence - because half the time I hear it's the quality of the data that matters, the other half the time I have to defend the auditability and traceability of the data set in my warehouse. We can also stand on the fence because we can now "mine" for bad data patterns (only if we are collecting them), and learn where our mistakes are. This is the never ending story of a data set... (sorry, I'm punchy this morning). Really, it's a fantasy land... Ok - time to get real. Where is all this data coming from? Why do I need all this darned data? What about data quality? Doesn't it reduce the data set and improve my decision making abilities? Well if I let all this "bad" data into my warehouse, won't I overwhelm the users? Won't I paralyze my abilities to make good decisions? Use the data warehouse and the traceability of the data in the warehouse to "mine" the bad data for patterns - then use business discovery to find out why it's bad, and how much its' costing the enterprise. I think the dollar figures you can save may astound you. Keep in mind I am NOT advocating release of this "bad" data to the general end-user base. Rather, that it's a different kind of BI - one that is used to watch the metrics of business activity management, and business process improvement. As the data improves, one can actually (quantitatively) see the impact of business changes in the source data providers. The dollar cost can be measured, thus you've reached Level 4 of CMMI principles. From there, you can OPTIMIZE your business processes, and again, quantitatively measure the results as the bad data "subsides" from being loaded into your data warehouse. Take control of your business, stop spending ruthlessly, understand the critical path of business processing - a path to true enlightment... (not really, I just threw that in there for fun). Is more data really better? Do I really need more data? Where on earth is all this data coming from? * Note: Master Data As A Service is something that may help "consolidate" across organizations. For a prime example of customer MDaaS, look at the company Acxiom. How do I manage the ever growing data sets? This also requires proper hardware sizing, performance and tuning of applications to make it work. Does more data mean better business decisions? How do I reconcile these monstrous data sets? I specialize in big-data systems, big-data problems, and business intelligence. Feel free to comment, or contact me directly. Thank-you, December 12, 2006IBM - DB2 UDB 9.x - hot new technologyIBM is coming to the table with their RDBMS systems, finally making waves with their MPP options (labeled as DPF - data partitioning format), and multiple nodes. From a performance side of the house, scalability, and MPP are finally here. Not to mention new true XML interfaces, embedded XML within the RDBMS systems. In their new 9.x DB2 UDB database, they are finally coupling MPP with hardware and software drivers that increase performance, scalability and manageability. They've got a good showing with their latest releases in BCU technology. 1. JUST ANNOUNCED: BCU for AIX Version 2.1 based DB2 Version 9 On December 6, IBM announced the BCU for AIX 2.1 - a refresh of the initial launch of the Balanced Configuration Unit for AIX in June 2005. This refreshed version of the BCU is based on DB2 Version 9, the latest p5 processors and the DS4800 Storage Subsystem. DB2 Version 9 offers some new features that will be applicable to many data warehouse implementations like table partitioning and data compression. More information about DB2 Version 9 can be found at: ftp://ftp.software.ibm.com/software/data/pubs/brochures/db2_9_overview.pdf At one of our customer sites, the partitioning options are running at 300,000 rows per second on inserts - this is from a connected ETL tool. The performance is there, and no - the customer is not on a BCU, they are in fact, on Linux dual-core nodes. They are seeing linear performance increases when engaging multi-threaded connections. DB2 V9 brings to the table the combination that makes big systems tick: hardware, coupled with optimized software algorithms to utilize that hardware, and additional partitioning options that allow optimizers to take advantage of data model designs - of course those designs must be built properly. One of the hottest new features (I have yet to try) is the true XML (seamless embedding of XML information within the RDBMS) - allowing indexing, querying, updating, reading/writing / locking, etc... Seems to be very complete solution, and extends the Relational world into the semi-structured world. Well done IBM. I would say that this release closes the gap between Teradata's technology and IBM's technology. *Note: Teradata was rated as the top right quadrant (leader in execution, leader in visionary) by a recent Gartner survey. Now, if we could only get Updates and Deletes to be parallel / multi-threaded from an Application Programmers Interface (API) standpoint.... That would be awesome. This technology doesn't come cheap - but its well worth the investment - especially if you are launching SOA, or incorporating unstructured data, or looking at DW2.0. Cheers, December 8, 2006Data Integration - Performance NumbersI've been teaching and consulting on performance and tuning of systems architectures for 10+ years. I've seen the increases in performance across the board from many different vendors - hardware, software, network to disk to RAM to CPU and so on. This entry does not mention particular vendor names, but rather discusses the _nature_ of performance and tuning at the core of Very Large Systems - whether you're doing Business Intelligence / Data Warehousing, or simply data movement / data integration - these numbers (hopefully) will make sense to you. When I say very large environments, I'm talking about 1 billion rows+ _per file_ - handled within a single batch run, 1B rows per year of history loaded, with 5 years of history to load, that means the second through 5 years of history must manage "update detection" against an increasing set of rows on the target, on the order of billions. So what kind of performance do you want from your systems? With DW2.0 around the corner, and unstructured data creeping in - and near-line, active, and historical storage coming on-board, it's more important than ever before to get your systems in top shape to handle massive volumes. Here's a run-down of what I see in the market place - again, no vendor names will be mentioned. These are numbers that I see before tuning architectures: We're talking an average row size of 1250 bytes, same data in Unicode: 2500 bytes. Usually without parallelism, without partitioning, without tuning: Major data integration engines on the market * PLEASE NOTE: I CANNOT AND WILL NOT DISCLOSE THE VENDOR NAMES, IN A PUBLIC VENUE, IT IS A LEGAL ISSUE. My point is solely to give you (the reader) ideas as to what the best practice is, what performance numbers you should be shooting for regardless if vendor. Where are the common problems hidden within the architecture? 1. Common problem #1: Updates and Deletes and Inserts are commonly mixed into a single load stream - this can slow ETL / ELT loading processes by a factor of 4x to 12x. Using SET LOGIC and statistics alone, separate the data upstream as soon as possible (another reason why CDC has such a HUGE POSITIVE IMPACT on performance). In 95% of the customers I visit, out of billions of rows, usually only 10% to 20% of the incoming data sets are updates. Only 5% to 15% are deletes. The most costly and most complex operation in any database system and ETL / ELT is an UPDATE, by separating the logic and separating the data set, one can stream-line the updates and inserts and tune appropriately. So what are the magic numbers? What do we want to shoot for? What kind of performance are we after? Keep in mind that EVERY TIME we tune something, we begin to specialize the system further. Which means (in general terms) For every further tuning effort, we are customizing. When we customize we break standards, usually when we break standards we end up with "exception cases". When we have exception cases, we are increasing overal maintenance complexity, and our TCO increases - it's bound to happen. Another angle is: the cost of faster hardware and more bandwidth get's exponentially steeper, so picking the right time, balancing standards with the 80/20 rule (when good is good enough / fast is fast enough) is a KEY to our continuing success. Only tune what the end-users say is "broken" - in other words, notify them of the COST of tuning, and if they sign the SLA and justify the cost, then you're good to go. What do I look for? 1. ETL / ELT systems: Well, in systems (standard batch loads) of 100M per process or less, I look for and usually can live with 40,000 rows per second to 80,000 rows per second for inserts. I look for 20,000 to 30,000 rows per second for updates. For Deletes, I look for 10,000 to 100,000 rows per second (depending on the RDBMS). What's the math look like? 80% inserts of 100M rows = 80M rows of inserts, running at 80,000 rows per second = total run time of: 16.67 minutes. A livable piece, considering I would be running 10 to 20 of these types of loads concurrently, and considering I haven't even begun to partition yet. 18% updates: 100M Rows = 18M rows of updates, running at (slowly): 15,000 rows per second = 20 minutes. Deletes: 2% 2M rows / 10,000 rows per second = 3.3 minutes. If I ran all three (Inserts, updates, then deletes) in SERIAL: my total run time is: 16.67 minutes + 20 minutes + 3.3 minutes = 39.97 minutes (just over a half hour). If the ORIGINAL MIXED WORKLOAD data flow ran at: 8,000 rows per second (contains updates and inserts and deletes), then the total time would be: 100M rows / 8,000 rows per second = 208.33 minutes or 3.4 hours Now, 3.4 hours reduced to 39.97 minutes is almost an 80% performance increase. 2. With RDBMS Systems, I look for the same numbers as above - the RDBMS system must perform at the same rates as the ETL / ELT systems - or none of this works. What do I look for from DBMS systems when I employ ELT? For example: Suppose this: 5 sequential process in ETL move 1B rows to an RDBMS at: 50,000 rows per second. That’s: 3B rows / 50,000 rows per second = 16.67 hours total run time. Now suppose I have 42 steps of ELT for the same operational logic, that's 42B rows / 336,000 rows per second = 34.72 hours to process. Of course it's not always this way, because each "step" of the 42 steps reduces the data set, and not all of the 42 steps process all 42B rows, this is a worst case scenario. Oh and by the way, this is just ONE of the 10 batch processes that have to run concurrently. By the way, the RDBMS supporting this is not as large as you would think - it's about 8 dual core CPU's, and 12GB of RAM - that's all I can say about that... These are real numbers from real customer sites that I deal with. If you want to play in the big-data leagues, you've got to be able to produce in a mixed work load environment. Please let us know if we can help you with your performance and tuning, we'd be glad to take a look at your systems. Vendors take heed: these are not "numbers to shoot for", these are numbers that must be met within the next year or two - Especially with the unstructured data sets on the rise, and massive amounts of XML data being passed around and integrated. Vendors: I can work with you as well, as I sit on a number of technical advisory boards. Come see me the next time I teach VLDW at TDWI, I'd be happy to talk to you, or contact me at my company to schedule a performance review of your systems. What kinds of volumes are you dealing with today? Please let me know. Thank-you, December 6, 2006ETLT or ELT - Either way, pull back the sheets.Ok, I've said it before in previous entries, I've discussed ELT and ETL and loss of metadata here on the blog before. I've worked in both situations, I've worked in VLDW for 8 years, I've worked in ADW (Active or real-time data warehousing) for the past 3 to 4 years, I've been involved with non-data warehousing data integration projects using ETL and massive volumes. Now I'll say it again - there's a difference, a time a process, or set of processes, and singular points of architecture where everything converges. If you've seen me present VLDW, you've seen my Pyramid diagram that shows the impact of volume and latency on the number of ways to "execute". This entry redresses ELT or a term I wrote about in Teradata Magazine over 5 years ago called ETLT. Warning: read on if you like a good "one-sided" rant... I lay it on the table here, after all my years of experience, I just need to share. As always, I'm open to other sides of the argument - and invite anyone with alternative views to comment on this entry. If you agree, I'd love to hear from you as well... I've had the pleasure today of sitting through a two hour architecture meeting. The client we are working for at the moment has to move (consistently) 1 Billion rows during a single load in a single process, the maximum growth rate for this single target is potentially 1 Billion rows a month. Of course, this is the maximum rates. 80% of our data sits on average around 100 Million a day; we have maybe 10 to 50 source tables to work with at these volumes. What would you do in this situation? We are working with a particular ETL tool which has ventured into ELT land recently. We are also working with the worlds leading VLDW database engine (ranked according to Gartner Magic Quadrant). By the way, if you think this is large - you should try dealing with Telco Switch Data - same problem, same size of data, different data, and different requirements - same issues. Our requirements are to take "micro-batches", and real-time data feeds in the future, along side the standard volume batch updates. There's no question that the architecture must be accommodating to this volume, that the data model must be accommodating to the overall architecture and data arrival timings. What am I saying when I state: the larger volume, and/or the lower the latency, the less architecture choices you have to solve the problem? Well, here's an example: if I ask you to load 500 rows of data and tell you, you have 1 hour to get it done. You can probably tell me, 1000 different ways (architectures / designs) that will work - from the data model to the coding solution, to the architecture of the data flow (ETL or ELT - it won't matter). And I'll say great: make it happen. Now, if a year goes by, all's well, and I tell you, wait: you now have 5 minutes to load 500 rows - you might say... Hmmmm our current process runs in 8 minutes - we need to re-architect. Or maybe you'd say: we need a larger box / faster box to get it done. I might say: ok - make it happen. Now if I tell you instead: you now have to load 100 million rows, oh - and you still only have 1 hour to get it done. You might say.... "You’re crazy" or "we have to re-architect", there might only be 5 ways instead of 1000 ways to get it done. Now if I change the scope and say (6 months later): 100 M rows must be loaded in 15 minutes... You might tell me, we have to change the data model, change the base-architecture, change the loading paradigm, and change the hardware and the tool sets we are using to get it done. This is at the crux of what I say when I talk about the fact that "real-time" is really "right-time", and that the cost of VLDW (to implement) increases exponentially when your time factor is reduced below a 8 to 10 minute refresh rate, and your data set stays consistently large. But back to the point of this blog - how do you solve this problem? Well, everything, and I mean everything starts and ends with architecture - first and foremost data model / data architecture, second in line is hardware _architecture_ i.e.: MPP, SMP, Clusters, Grid, etc... Third, is data processing architecture - everything converges on a single point: performance, performance, performance. All roads lead to a single architectural solution: ELT, with MPP, with a SINGLE SPECIFIC NORMALIZED DATA MODEL - no R.I.!! Not at the Billion+ row mark... no way. Surprised? Maybe, maybe not. My point: ETL is changing folks, to ETLT (ETL & ELT) mixed workload. Why? Because there's a COST involved in going 100% ELT, there's a loss of metadata as transformation pieces are broken up into smaller execution steps, and run as ELT (disappears into RDBMS complex SQL logic). The kicker? There's still a place for ETL, to take advantage of development and maintenance speed. There's still a place for where ETL is good enough, where Metadata in-stream is good enough, and massive volumes simply don't exist. Now, let's look at this from a STANDARDS side of the house: So you see, achieving a balance across the standards (due to cost) is vital in this time of "changing lines" and shifting sands. In the future ELT will be better supported by the remainng vendors, and it will become easier to "flip a switch" and let the traditional ETL tool make the decisions as to what to execute. Back to standards: what do we really want with an architecture? Ok - these are conflicting goals today. ELT provides most of these at a cost, ETL on the other hand also provides most of these (again at a cost), but they both lead to different architectural implementations. So why can't they co-exist? Because ETL runs on data that's cached within the ETL server - breaking coherency with "real-time" feeds that exist within a true active data warehouse (the cache is extremely difficult - if not impossible) to synchronize to the real-time changes occurring within the RDBMS. That's right - the closer to right-time the warehouse gets, the more it becomes operational in nature, the more operational, the more it requires transaction consistency. ELT on the other hand requires tons of staged/cached data steps along the way within the RDBMS - particularly to execute against this level of volume. Both can be created to be repeatable, and restartable, and consistent, and optimized. However what we learn (quickly) by mixing the two is that ETL has a different set of technical issues - mixing a cache with a right-time data warehouse is just one of the problems we see. The other is restartability begins to erode at a certain level. Jobs that were once restartable (in one form or the other) are now reliant on major checkpoints to re-establish caches, or to reset "staging tables." It's a bad mix... So what should I do? What are you're thoughts / Comments? Thanks, December 4, 2006High Volume, Low Performance - Can CDC Help?In my last entry, I blogged on High Volume and Low Performance issues that you might run into. In this entry we'll talk a little about CDC (Change Data Capture) and how this is paramount to the success of your systems moving forward. If you've got high volume issues, and you don't have CDC in place today, you may be fighting an ever growing data set that will at some point become un-manageable. If CDC and Real-time / Right-Time processing (you know how I feel about the term Real-Time) are not implemented together, your right-time delivery system runs a high risk of pushing all kinds of extra traffic across the wire. Change Data Capture, or CDC should be a vital part to any back-office BI solution that is put in place today. It may mean getting over the hurdles of signing SLA's with your data service providers, but believe me, it will be worth it. The question as is often missed is: what is my ever-growing cost of NOT implementing a CDC solution? When we think about it this way, we end up at the right conclusion. Why? Because the data sets continue to grow, and when data sets grow, traffic on our network grows, the logic to decipher changes and transform / remove / record duplicates becomes more complex. With complexity comes system slow-down, with more network traffic also comes system slow-down. All in all, not implementing CDC causes costs to rise - and the faster the business changes / moves - the quicker the costs of not having CDC at the source, rises. Wait a minute! CDC At the SOURCE? How in the world can I do that? I don't even own all my sources... CDC is required on ALL source systems, and by the way - if you are BUILDING an SOA, or an MDM solution, or you're setting up data governance or a governance initiative, putting CDC in place will sooner or later become a necessity. Not just the source systems you own, but also with the data and service providers you don't. Let's take sales force for example. If you outsource your sales management to Sales Force, then you'll want them to implement CDC on any of the "changes" that take place. Change Data Capture systems become the "expert" logic for providing traceability and auditability demanded by auditors and compliance initiatives around the world. They provide safe and consistent means to extract every data set that changes, when it changed, and what it was versus what it changed to. The overhead on the source systems is often used as an excuse NOT to engage in CDC - this is the wrong way to look at the cost. A better question to ask is what is the cost of all that extra traffic on my network, traveling through my Transformation tools (be it: EAI, EII, or ETLT). I'm sure the cost of all that extra traffic is much much higher than the overhead cost of CDC on source systems, especially when the data set grows again, or when the frequency of delivery is reduced again. Now, what do you want CDC feeding? What kind of features should I look for in my CDC offering? Ultimately, if a record changes and has 360 fields (let's say it's a mainframe record or a Cobol based structure), can I gear the CDC to issue an UPDATE transaction with JUST THE PRIMARY KEY, and JUST THE CHANGES? maybe only 10% or 36 fields changed, I don't want all 360 fields running across my network... These are just some of the questions I would ask of CDC vendors, there are others - but this is a start. If you have CDC installed, I'd love to hear your comments as to how it helped your business, and what your headaches may have been in putting CDC in place. If you don't have CDC, and your business is fighting the concepts, I'd love to hear the arguments used (post anonymously if you wish) against CDC. Thank-you very much for your time, |