Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for Masters students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best, Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.

December 2006 Archives

Governance is an industry buzzword these days. With all the SOA initiatives going on, one would think that governance would be at the top of the list as well. If you're not governing your enterprise consolidation, you probably are not taking full advantage of the benefits and cost savings that could be coming your way. Sure, governance is an uphill battle in the beginning; sure, everyone fights standards and agreed-upon standards; and yes - absolutely - no one can seem to decide how to define the common data sets (the common data model). But if you're involved in, or working with, SOA, it is imperative to engage governance at the enterprise level. However, it's not just governance that makes it work: a formal methodology should be utilized to assist with the governance as the organization organically grows its efforts. These include ITIL, SEI/CMMI, and a few others.

I've defined different kinds of governance in my articles here on the B-Eye Network in the past; just for reiteration, I'll define them again:

Governance
n. 1: the persons (or committees or departments, etc.) who make up a body for the purpose of administering something; "he claims that the present administration is corrupt"; "the governance of an association is responsible to its members"; "he quickly became recognized as a member of the establishment" [syn: administration, governing body, establishment, brass, organization, organisation]
2: the act of governing; exercising authority; "regulations for the governing of state prisons"; "he had considerable experience of government" [syn: government, governing, government activity]
Webster's Definition of Governance, http://dictionary.reference.com/search?q=governance

IT Governance
IT governance ensures IT-related decisions match company-wide objectives by establishing mechanisms for linking objectives to measurable goals. IT governance is the decision rights and accountability framework for encouraging desirable behavior in the use of IT. IT Governance: http://www.faculty.de.gcsu.edu/~dyoung/MMIS-6393/Reading-IT-Governance-defined.htm

Data Governance
Data governance is the combination of people, process, and technology required to support the ongoing management of the enterprise-wide data that will be centralized. (my definition)

SOA Governance
SOA Governance is the ability to ensure that all of the independent efforts (whether in the design, development, deployment, or operations of a service) come together to meet the enterprise SOA requirements … including SOA policies, auditing and conformance, management (track, review, improve), and integration. SOA Governance: http://www.weblayers.com/gcn/whitepapers/Introduction_to_SOA_Governance.pdf

So what is SEI/CMMI in the first place?
Capability Maturity Model Integration (CMMI) is a process improvement approach that provides organizations with the essential elements of effective processes. http://www.sei.cmu.edu/cmmi/general/general.html

And how does it affect my governance?
It can BE your governance guidelines. SEI/CMMI has governance procedures built in. All governance requires that the efforts be monitored, measured, and of course estimated before they are started. There are all kinds of estimations that must take place, ranging from risk analysis to implementation ability, project lifecycle, and implementation time. If you aren't estimating and then measuring (something akin to Earned Value Management), then you are not exercising effective governance - and certainly, you cannot improve on what you cannot measure.
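To make the "estimate, then measure" point concrete, here is a minimal Earned Value Management sketch in Python. The work-package figures are invented for illustration; the ratios themselves (cost and schedule variance, CPI, SPI) are the standard EVM ones.

# Minimal Earned Value Management (EVM) sketch -- illustrative numbers only.
# PV = planned value, EV = earned value, AC = actual cost (all in dollars).

def evm_metrics(pv: float, ev: float, ac: float) -> dict:
    """Return the standard EVM ratios used to monitor a governed effort."""
    return {
        "cost_variance": ev - ac,        # > 0 means under budget
        "schedule_variance": ev - pv,    # > 0 means ahead of schedule
        "cpi": ev / ac,                  # cost performance index
        "spi": ev / pv,                  # schedule performance index
    }

if __name__ == "__main__":
    # Hypothetical MDM work package: $500K planned, $420K of value earned,
    # $480K actually spent at the measurement point.
    print(evm_metrics(pv=500_000, ev=420_000, ac=480_000))

If you can produce these four numbers for each governed effort on a regular cadence, you have the quantitative baseline that the rest of the governance discussion assumes.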

What are some of the groups involved in governance?
[Diagram: groups involved in SOA governance - adapted from http://www-128.ibm.com/developerworks/webservices/library/ws-improvesoa/]

MDM initiatives are like any other: they will require governance to be executed properly. In fact, any initiative that "serves" at the enterprise level should fall under an overarching governance initiative. Master Data and Master Metadata are highly visible; therefore: high risk = high return = high visibility = high pressure to do it right. This means that governance cannot and should not be ignored when addressing the MDM initiative. Again, MASTER DATA and MASTER METADATA serve the entire enterprise.

Central governance:
Best for the enterprise. The governance council has representation from each service domain (more on this later) and from subject matter experts who can speak to the people who implement key technological components of the solution. The central governance council reviews any additions or deletions to the list of services, along with changes to existing services, before authorizing the implementation of such changes.

Distributed governance:
Best for distributed teams. Each business unit has control over how it provides the services within its own organization. This requires a functional service domain approach. A central committee can provide guidelines and standards to different teams.

Guiding principles:


  • Reuse, granularity, modularity, composability, and componentization
  • Compliance to standards (both common and industry-specific)
  • Services identification and categorization, provisioning and delivery, and monitoring and tracking

Specific architectural principles:


  • Encapsulation
  • Separation of business logic from the underlying technology
  • Single implementation and enterprise-view of components
  • Leveraging existing assets wherever an opportunity exists
  • Life cycle management
  • Efficient use of system resources
  • Service maturity and performance

Why should I utilize SEI/CMMI as my methodology guide for Governance?


  • About 95% of companies have a formal IT strategy which in most cases is "reasonably" aligned to the business strategy (~80%). Source: The Compass World IT Strategy Census 2001
  • However, research concludes that there is no evidence that IT spending levels positively correlate to companies' productivity (the IT productivity paradox). Source: Information Productivity, P. Strassmann 1999, Information Economic Press
  • Paradoxically, 80% of strategic decisions related to IT are only based on "gut feeling". Source: Gartner Symposium News Preview 2002, Florence, Italy
  • Businesses typically waste 20% of corporate IT budgets on investments which fail to achieve their objectives. Source: Gartner Symposium News Preview 2002, Florence, Italy
  • The companies that manage their IT most successfully generate returns as much as 40% higher than their competitors. Source: Accenture project experience

CMMI helps reduce and control IT spending; in other words, CMMI is GOVERNANCE for IT in action!

Come see my TDWI presentation in Las Vegas, February 2007, on Governance, Compliance, and CMMI principles. I'm also teaching "Defining and Understanding MDM."

As always, I'd love to hear your experiences, positive and negative about governance principles. Please comment!

Cheers,
Dan Linstedt
CTO, Myers-Holum, Inc
http://www.MyersHolum.com


Posted December 19, 2006 6:53 AM
Permalink | No Comments |

I had some good questions come in recently - thank you. In this entry I'll share my experiences with ETL and ELT with regard to metadata; I'll also try to elaborate on when it is right to use which type of technology. This also goes back to my original articles in Teradata Magazine on ETLT.

When to use which technology.
As it turns out (and this is a good thing), not all systems have the super-huge volumes that I've been discussing. Bill recently documented a specific case where too much granularity is a bad thing - and I agree with him. There's a time and place for asking the question: what's the value of this data set? Is the value increased or decreased by going to a finer grain of data? But that's for another blog.

Bottom line: ETL still has its uses, and will continue to be useful moving forward. Here are some typical cases and reasons why you might still want to use ETL (a rough feasibility sketch follows the list):
* When volumes are low to medium sized; in my estimation, a single feed (single file) of under 180M rows can easily be processed by ETL when architected properly.
* When the timeline for loading the data is not super critical; in other words, even if I have 500M rows, if I have 2 or 3 days to load because it's a monthly feed, then maybe this is OK too.
* When the source systems provide data slower than 60,000 rows per second - or the networks are restricting data flow. In these cases, you really have no choice. You now have the time (due to slow sources) to do transformation in-stream, so why not take advantage of it?
* When the target system won't perform at more than 50,000 or 60,000 rows per second, even after it's been tuned (see my last bullet).
* When code and reference tables are needed to populate default values or provide simple data replacement through key matching.
* When small micro-bursts of data are arriving in real time, and we need to perform transformation in-line between systems, or between queues.
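To make the first two bullets concrete, here's a back-of-the-envelope feasibility check in Python. The volumes, sustained rate, and load window are hypothetical - plug in your own numbers.

# Back-of-the-envelope ETL load-window check -- all numbers are hypothetical.

def load_hours(rows: int, rows_per_second: float) -> float:
    """Hours needed to push `rows` through a stream at a sustained rate."""
    return rows / rows_per_second / 3600.0

if __name__ == "__main__":
    monthly_feed = 500_000_000          # 500M-row monthly file
    etl_rate = 50_000                   # sustained ETL throughput, rows/sec
    window_hours = 48                   # 2-day load window for a monthly feed

    hours = load_hours(monthly_feed, etl_rate)
    print(f"Estimated load time: {hours:.1f} hours")
    print("Fits the window" if hours <= window_hours else "Consider ELT / partitioning")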

In other circumstances we will use the ETL engine to provide EL in parallel and partitioned jobs, providing maximum throughput with minimum development time.

So what does ETL provide in the metadata space?
Generally, ETL metadata repositories have been around a little bit longer than the ELT tool vendors' repositories, and those ETL tool sets now dabbling in ELT are just growing into this area.

If we look at ELT from a scripting point of view, without using a tool to execute, we lose all kinds of metadata - from process flow, to business logic, to full traceability. This is particularly visible when scripting Teradata BTEQ, Oracle SQL*Loader and stored procedures, SQL Server command line, DB2 UDB functions, and so on... Any time we return to a code base for ELT, we're back to deciphering scripts to figure out the metadata, and much of it is hidden. Unfortunately (today) most of the ELT that I run into is still hand-coded, although this is set to change in the near future.

This is where people stop me and say: well, what about Sunopsis? They were 100% ELT... True - but their metadata repository left a LOT to be desired; they were young in the field, and hadn't quite grown up yet.

Ok, so what about Informatica, Ascential, Ab-Initio, Business Objects Data Mover, and so on...
They grew up in the ETL space and have strong metadata components; however, they are just now completing their transition to ELT - or in some cases, they've completed it, but it's not 100% compliant with all the database features available, from what I can see.

So then this argument kicks in: well, OK - but what about when I override the SQL in these tools? That gets me 100% compliance with ELT and the database feature sets... Yes, but then some tools simply don't have the SQL decomposition routines built into their engines yet. They can't pick the SQL apart for metadata lineage and achieve end-to-end lineage.
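For readers wondering what "SQL decomposition" means in practice, here is a deliberately naive Python sketch that pulls the target and source tables out of a simple INSERT ... SELECT. Real lineage engines use full SQL parsers; the statement, schema names, and regexes here are hypothetical and only handle the simplest case.

# Naive sketch of "SQL decomposition" for lineage: pull target and source
# tables out of a simple INSERT ... SELECT. Real lineage engines use full SQL
# parsers; this only illustrates the idea and handles the simplest case.
import re

def naive_lineage(sql: str) -> dict:
    target = re.search(r"insert\s+into\s+([\w\.]+)", sql, re.IGNORECASE)
    sources = re.findall(r"(?:from|join)\s+([\w\.]+)", sql, re.IGNORECASE)
    return {
        "target": target.group(1) if target else None,
        "sources": sorted(set(sources)),
    }

if __name__ == "__main__":
    # Hypothetical override SQL pushed down by an ETL/ELT tool.
    sql = """
        INSERT INTO edw.customer_dim (cust_key, cust_name, region)
        SELECT s.cust_key, s.cust_name, r.region
        FROM stage.customer s
        JOIN stage.region r ON r.region_id = s.region_id
    """
    print(naive_lineage(sql))

The gap I'm describing is exactly this: when the tool treats override SQL as an opaque string, nothing like the above ever runs, and the lineage chain breaks at that step.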

For those tools that offer end-to-end lineage with SQL, the next thing that I find is this:
Some database engines either do not record the number of rows operated on by the SQL statements executed within the database, or they record them but the tool sets don't pick them up. Other database engines record this information (for every SQL statement) in metadata tables in the RDBMS - as it should be!

Finally, when it comes down to it, ELT favors much simpler processes, which means many more steps and a lot more "temporary" tables during execution cycles; metadata is bound to be lost through this process (today). This too is due to change in the future, and to get better as the tools find the gaps and fix them.

Now what are some of the pros and cons of ELT?
Pros:
* ELT leverages RDBMS engine hardware for scalability
* ELT keeps all data in the RDBMS all the time
* ELT is parallelized according to the data set, and disk I/O is usually optimized at the engine level for faster throughput.
* ELT Scales as long as the hardware and RDBMS engine can continue to scale.
* ELT can achieve 3x to 4x the throughput rates of ETL on an appropriately tuned MPP RDBMS platform.

Cons:
* ELT relies on proper database tuning and a proper (usually normalized) data model architecture
* ELT relies on MPP hardware
* ELT can easily eat 100% of the hardware resources available for complex and huge operations
* ELT can't balance the workload
* ELT can't reach out to alternate systems (all data must exist in the RDBMS BEFORE ELT operations take place)
* ELT easily doubles, triples, and quadruples disk storage requirements (more processes, each simpler, each requiring intermediate temporary tables).
* ELT (sometimes) is not 100% metadata lineage traceable.
* ELT can take longer to design and implement: more steps, each less complicated, but usually resulting in more custom SQL code (sometimes this is where metadata is lost).

Pros and Cons of ETL
Pros:
* ETL can balance the workload / share the workload with the RDBMS
* ETL can perform more complex operations in single data flow diagrams (data maps)
* ETL can scale with separate hardware.
* ETL can handle Partitioning and parallelism independent of the data model, database layout, and source data model architecture.
* ETL can process data in-stream, as it transfers from source to target
* ETL does not require co-location of data sets in order to do its work.
* ETL captures huge amounts of metadata lineage today.
* ETL can run on SMP or MPP hardware

Cons:
* ETL requires separate and equally powerful hardware in order to scale.
* ETL can "bounce" data to and from the target database, requires separate caching mechanisms which sometimes don't scale to the magnitude of the data set - this can result in scalability and performance issues.
* ETL cannot perform as fast as ELT without twice the size of hardware (usually for RAM and CPU resources).

So, you see - it all depends on how you look at the picture and what you want. Ultimately the nirvana is that no metadata will be lost, regardless of database specific SQL, or ETL "functions / transformations" utilized. The other nirvana is that you will ultimately design an "ETL" stream, and when it no longer performs, flip a switch on a job that says: turn this whole thing into ELT for me without me re-designing; that is the day when ELT will have grown up on the job, and be our best friend.

Of course, it all depends on where you can afford to put your system load, and where you can spend your money to achieve scalability....

Hope this helps,
Dan Linstedt
CTO, http://www.MyersHolum.com


Posted December 15, 2006 10:53 AM
Permalink | 3 Comments |

The question has been argued over the past two decades: is more data better? Do I really need more data? Where on earth is all this data coming from? How do I manage the ever-growing data sets? Does more data mean better business decisions? How can I reconcile these monstrous data sets? And so on... You've heard by now (I'm sure) many different folks in the industry offer their valued opinions. We can stand up on our feet and say: I'm on the fence - because half the time I hear it's the quality of the data that matters, and the other half of the time I have to defend the auditability and traceability of the data set in my warehouse. We can also stand on the fence because we can now "mine" for bad data patterns (but only if we are collecting them), and learn where our mistakes are.

This is the never ending story of a data set... (sorry, I'm punchy this morning). Really, it's a fantasy land... Ok - time to get real.

Where is all this data coming from?
It's coming from unstructured, semi-structured, and more granular data feeds. It's coming in on web services, off the web-scraping of alternate sites; it's coming in from providers, suppliers, customers, consumers, builders... You get the picture.

Why do I need all this darned data?
How about:
* Compliance, accountability
* Metrics
* Business Discovery of "what's going wrong and when"

What about data quality? Doesn't it reduce the data set and improve my decision making abilities?
I've said it before, and I'll say it again: data quality - or, better put, information quality - definitely reduces the data set, and absolutely allows a smoother, better flow toward more accurate business decisions. But it does something else: it HIDES the broken business processes, it HIDES the problem areas in the source collection mechanisms which are COSTING YOU MONEY and time.

Well if I let all this "bad" data into my warehouse, won't I overwhelm the users? Won't I paralyze my abilities to make good decisions?
No. Absolutely not. Take a look at DW2.0 and the stack. Make your data warehouse (data integration store) a system of record; make it a granular, accountable data store integrated by horizontal business keys. Then produce marts of two classifications: error marts and data marts. Separate the BAD data from the GOOD data at the mart loading level. There are only a few power users who can take the bad data and find out WHY it's bad; this is the discovery process that keeps the business from hemorrhaging money at the source system or data provider level. The rest of the user base should access the GOOD data in the data mart - that's where the cleansed data should be put.
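As a minimal sketch of that split at mart-load time, here is what the routing might look like in Python. The business rules and row shapes are invented for illustration; your "good" test is whatever today's version of the truth says it is.

# Minimal sketch of splitting warehouse rows into a data mart (good data)
# and an error mart (bad data) at mart-load time. Rules and rows are made up.

def is_good(row: dict) -> bool:
    """A row is 'good' today if it passes the current business rules."""
    return bool(row.get("customer_id")) and row.get("order_total", 0) >= 0

def split_for_marts(rows):
    data_mart, error_mart = [], []
    for row in rows:
        (data_mart if is_good(row) else error_mart).append(row)
    return data_mart, error_mart

if __name__ == "__main__":
    warehouse_rows = [
        {"customer_id": "C001", "order_total": 125.50},
        {"customer_id": None,   "order_total": 80.00},   # missing business key
        {"customer_id": "C003", "order_total": -42.00},  # broken source process?
    ]
    good, bad = split_for_marts(warehouse_rows)
    print(f"data mart: {len(good)} rows, error mart: {len(bad)} rows")

The important design choice is that both outputs are kept: the warehouse stays auditable, the data mart stays clean, and the error mart feeds the discovery work described next.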

Use the data warehouse and the traceability of the data in the warehouse to "mine" the bad data for patterns - then use business discovery to find out why it's bad, and how much it's costing the enterprise. I think the dollar figures you can save may astound you.

Keep in mind I am NOT advocating release of this "bad" data to the general end-user base. Rather, it's a different kind of BI - one that is used to watch the metrics of business activity management and business process improvement. As the data improves, one can actually (quantitatively) see the impact of business changes in the source data providers. The dollar cost can be measured, and thus you've reached Level 4 of the CMMI principles. From there, you can OPTIMIZE your business processes, and again quantitatively measure the results as the bad data "subsides" from being loaded into your data warehouse.
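One simple way to make that measurement quantitative is to track the bad-data rate by source system and period, so process improvements show up as a visible trend. Here is a small Python sketch of the idea; the source names and counts are made up.

# Sketch of quantitatively tracking "bad data" by source system and month,
# so process improvements show up as a measurable trend. Figures are invented.
from collections import defaultdict

def error_rates(error_counts, row_counts):
    """Both args: {(source, month): count}. Returns the rate per (source, month)."""
    rates = defaultdict(float)
    for key, errors in error_counts.items():
        rates[key] = errors / row_counts[key]
    return dict(rates)

if __name__ == "__main__":
    rows = {("CRM", "2006-11"): 1_000_000, ("CRM", "2006-12"): 1_050_000}
    errs = {("CRM", "2006-11"): 42_000,    ("CRM", "2006-12"): 18_000}
    for key, rate in sorted(error_rates(errs, rows).items()):
        print(key, f"{rate:.2%}")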

Take control of your business, stop spending ruthlessly, understand the critical path of business processing - a path to true enlightenment... (not really, I just threw that in there for fun).

Is more data really better?
Yes - but it depends on what you do with it, and how you separate it into two major classes: "today's GOOD data, i.e., today's version of the truth," and "today's BAD data, the data that doesn't FIT in today's version of the truth." Bad data patterns can expose really broken business processes, even from a historical standpoint.

Do I really need more data? Where on earth is all this data coming from?
No, we never need MORE data (although for data mining, the more granular the data set, and the more of it, the better the mining algorithms can predict things you're looking for). On the other hand, we're being forced to use more data: Compliance, Unstructured, Semi-Structured, Web, and Web Services are all contributing data to this integration vision.

* Note: Master Data As A Service is something that may help "consolidate" across organizations. For a prime example of customer MDaaS, look at the company Acxiom.

How do I manage the ever growing data sets?
Good question - ARCHITECTURE, ARCHITECTURE, ARCHITECTURE - it's all in the models we use. If the models aren't scalable, flexible, repeatable and consistent in their design, they will fail - but then again, it's impossible to reach the holy grail of the perfect architecture all the time. Flexibility and standardization are key to making this work. The data model architecture is the crux of success for this kind of effort. You can read more about this on my web site (The Data Vault Data Modeling Architecture) at: http://www.DanLinstedt.com

This also requires proper hardware sizing, performance and tuning of applications to make it work.

Does more data mean better business decisions?
Maybe not externally facing business decisions, but internally facing business decisions about optimizing business processes - absolutely.

How do I reconcile these monstrous data sets?
It takes a couple of months, and some dedicated power users on the business side. The business drivers must be there to cut costs, cut delivery time, improve product/service quality, or drive to CMMI Level 5 across the business - or there's no way to pay for the time it takes to reconcile the data sets and find out why broken data is coming in from source systems in the first place.

I specialize in big-data systems, big-data problems, and business intelligence. Feel free to comment, or contact me directly.

Thank-you,
Dan Linstedt
CTO, Myers-Holum, Inc
http://www.MyersHolum.com


Posted December 13, 2006 7:52 AM
Permalink | No Comments |

IBM is coming to the table with their RDBMS systems, finally making waves with their MPP options (labeled DPF - the Database Partitioning Feature) and multiple nodes. On the performance side of the house, scalability and MPP are finally here. Not to mention new true XML interfaces, with XML embedded within the RDBMS.

In their new 9.x DB2 UDB database, they are finally coupling MPP with hardware and software drivers that increase performance, scalability and manageability. They've got a good showing with their latest releases in BCU technology.

JUST ANNOUNCED: BCU for AIX Version 2.1, based on DB2 Version 9. On December 6, IBM announced the BCU for AIX 2.1 - a refresh of the initial launch of the Balanced Configuration Unit for AIX in June 2005. This refreshed version of the BCU is based on DB2 Version 9, the latest p5 processors, and the DS4800 Storage Subsystem. DB2 Version 9 offers some new features that will be applicable to many data warehouse implementations, like table partitioning and data compression. More information about DB2 Version 9 can be found at: ftp://ftp.software.ibm.com/software/data/pubs/brochures/db2_9_overview.pdf

At one of our customer sites, the partitioning options are running at 300,000 rows per second on inserts - this is from a connected ETL tool. The performance is there, and no, the customer is not on a BCU; they are, in fact, on Linux dual-core nodes. They are seeing linear performance increases when engaging multi-threaded connections.

DB2 V9 brings to the table the combination that makes big systems tick: hardware, coupled with optimized software algorithms to utilize that hardware, and additional partitioning options that allow optimizers to take advantage of data model designs - of course those designs must be built properly.

One of the hottest new features (which I have yet to try) is the true XML support (seamless embedding of XML information within the RDBMS) - allowing indexing, querying, updating, reading/writing, locking, etc. It seems to be a very complete solution, and it extends the relational world into the semi-structured world. Well done, IBM.

I would say that this release closes the gap between Teradata's technology and IBM's technology. *Note: Teradata was rated in the top-right quadrant (leader in both execution and vision) by a recent Gartner survey.

Now, if we could only get updates and deletes to be parallel / multi-threaded from an Application Programming Interface (API) standpoint... That would be awesome.

This technology doesn't come cheap - but it's well worth the investment, especially if you are launching SOA, incorporating unstructured data, or looking at DW2.0.

Cheers,
Dan L
CTO, Myers-Holum, Inc
http://www.MyersHolum.com


Posted December 12, 2006 6:33 AM
Permalink | No Comments |

I've been teaching and consulting on performance and tuning of systems architectures for 10+ years. I've seen the increases in performance across the board from many different vendors - hardware, software, network to disk to RAM to CPU and so on. This entry does not mention particular vendor names, but rather discusses the _nature_ of performance and tuning at the core of very large systems - whether you're doing Business Intelligence / Data Warehousing, or simply data movement / data integration, these numbers (hopefully) will make sense to you. When I say very large environments, I'm talking about 1 billion rows+ _per file_, handled within a single batch run, and 1B rows per year of history loaded, with 5 years of history to load; that means the second through fifth years of history must manage "update detection" against an ever-increasing set of rows on the target, on the order of billions. So what kind of performance do you want from your systems?

With DW2.0 around the corner, and unstructured data creeping in - and near-line, active, and historical storage coming on-board, it's more important than ever before to get your systems in top shape to handle massive volumes.

Here's a run-down of what I see in the market place - again, no vendor names will be mentioned. These are numbers that I see before tuning architectures:

We're talking an average row size of 1250 bytes, same data in Unicode: 2500 bytes.

Usually without parallelism, without partitioning, without tuning:
RDBMS numbers (before tuning)
1. RDBMS A: 8,000 to 10,000 rows per second handled
2. RDBMS B: 12,000 to 15,000 rows per second
3. RDBMS C: 6,500 to 8,500 rows per second.
4. RDBMS D & E & F cannot be stated, they always run partitioning and parallelism by default.

Major data integration engines on the market
1. ETL / ELT Engine #1: 6,000 to 8,000 rows per second
2. ETL / ELT Engine #2: 1,250 to 4,000 rows per second
3. ETL / ELT Engine #3: 18,000 to 25,000 rows per second out of the box.

* PLEASE NOTE: I CANNOT AND WILL NOT DISCLOSE THE VENDOR NAMES IN A PUBLIC VENUE; IT IS A LEGAL ISSUE. My point is solely to give you (the reader) an idea of what best practice is and what performance numbers you should be shooting for, regardless of vendor.

Where are the common problems hidden within the architecture?
Here are the top 10 issues I run into ALL the time. I'd be happy to consult with you and your customer virtually to dig into each area and help you gain massive performance from your existing systems by showing HOW to address these issues - please contact me directly. Please note: there are over 50 such criteria I use to pinpoint all the bottlenecks in systems and data architectures.

1. Common Problem #1: Updates, deletes, and inserts are commonly mixed into a single load stream - this can slow ETL / ELT loading processes by a factor of 4x to 12x. Using set logic and statistics alone, separate the data upstream as soon as possible (another reason why CDC has such a HUGE POSITIVE IMPACT on performance). In 95% of the customers I visit, out of billions of rows, usually only 10% to 20% of the incoming data sets are updates, and only 5% to 15% are deletes. The most costly and most complex operation in any database system and in ETL / ELT is an UPDATE; by separating the logic and separating the data set, one can streamline the updates and inserts and tune each appropriately (see the sketch after this list).
2. Common Problem #2: Database Engines are tuned for OLTP, not for ADW / EDW.
3. Common Problem #3: Disk I/O is a huge source of pain (never fast enough)
4. Common Problem #4: The network is shared, rather than a dedicated private link (VPN) between the ETL server and the DBMS server.
5. Common Problem #5: Data Modeling Architecture in the ADW / EDW is flawed.
6. Common Problem #6: Too many serial processes / dependencies inside the ETL / ELT stream.
7. Common Problem #7: Too many instances of a Database Engine on a single machine
8. Common Problem #8: Not enough hardware to support the RDBMS ADW / EDW or ETL engines.
9. Common Problem #9: Block sizing is wrong in the RDBMS
10. Common Problem #10: Overly complex ETL / ELT processes; complexity drives dependencies, and mixing updates, inserts, and deletes in a single stream drives complexity. Mathematics and set logic always provide some of the largest instant gains in this situation.
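Here is the sketch promised in Common Problem #1: set logic on business keys, splitting a mixed feed into insert, update, and delete streams before loading. The key names and row structures are hypothetical, and the delete detection shown is only valid for full-snapshot feeds.

# Set-logic sketch for Common Problem #1: split a mixed incoming feed into
# insert, update, and delete streams using business keys, before loading.
# Key names and structures are hypothetical.

def split_feed(incoming: dict, target_keys: set):
    """incoming: {business_key: row}; target_keys: keys already in the target."""
    incoming_keys = set(incoming)
    insert_keys = incoming_keys - target_keys   # new keys -> inserts
    update_keys = incoming_keys & target_keys   # existing keys -> updates
    delete_keys = target_keys - incoming_keys   # missing keys -> deletes (full-snapshot feeds only)
    inserts = [incoming[k] for k in insert_keys]
    updates = [incoming[k] for k in update_keys]
    return inserts, updates, delete_keys

if __name__ == "__main__":
    target = {"K1", "K2", "K3"}
    feed = {"K2": {"k": "K2", "v": 10}, "K3": {"k": "K3", "v": 7}, "K4": {"k": "K4", "v": 1}}
    ins, upd, dels = split_feed(feed, target)
    print(len(ins), "inserts,", len(upd), "updates,", len(dels), "deletes")

Once the streams are separated, each can be loaded with the technique and tuning that suits it (bulk inserts, batched updates, set-based deletes) instead of forcing everything through the slowest path.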

So what are the magic numbers? What do we want to shoot for? What kind of performance are we after?
Well, here's the thing - what kind of performance is "livable" for your organization, and the next question is: for how long will that number be acceptable? Is your SLA to end processing at 6:30am every morning, and you can't start until 1am that morning? Or is your SLA that you have a 12 hour load window, but you need it to load in 5 hours? Or are you trickle feeding your warehouse in conjunction with batch processing? It's really about RIGHT-TIME for the business.

Keep in mind that EVERY TIME we tune something, we begin to specialize the system further. Which means (in general terms) that for every further tuning effort, we are customizing. When we customize we break standards, and usually when we break standards we end up with "exception cases". When we have exception cases, we are increasing overall maintenance complexity, and our TCO increases - it's bound to happen. Another angle is this: the cost of faster hardware and more bandwidth gets exponentially steeper, so picking the right time, and balancing standards with the 80/20 rule (when good is good enough / fast is fast enough), is a KEY to our continuing success. Only tune what the end-users say is "broken" - in other words, notify them of the COST of tuning, and if they sign the SLA and justify the cost, then you're good to go.

What do I look for?
I look first to a non-partitioned, non-parallel stream for performance gains. This way, once I've tuned the non-partitioned, non-parallel stream, I can get additional multiplied performance gains by using parallelism and partitioning up to the hardware limitations. Why? Because the hardware limitations are a FIXED resource.

1. ETL / ELT systems: Well, in systems (standard batch loads) of 100M rows per process or less, I look for and usually can live with 40,000 to 80,000 rows per second for inserts. I look for 20,000 to 30,000 rows per second for updates. For deletes, I look for 10,000 to 100,000 rows per second (depending on the RDBMS). What does the math look like? 80% inserts of 100M rows = 80M rows of inserts, running at 80,000 rows per second = a total run time of 16.67 minutes. A livable piece, considering I would be running 10 to 20 of these types of loads concurrently, and considering I haven't even begun to partition yet. 18% updates: 100M rows = 18M rows of updates, running at (slowly) 15,000 rows per second = 20 minutes. Deletes: 2% = 2M rows / 10,000 rows per second = 3.3 minutes. If I ran all three (inserts, updates, then deletes) in SERIAL, my total run time is 16.67 minutes + 20 minutes + 3.3 minutes = 39.97 minutes (roughly 40 minutes).

If the ORIGINAL MIXED WORKLOAD data flow ran at 8,000 rows per second (containing updates, inserts, and deletes), then the total time would be: 100M rows / 8,000 rows per second = 208.33 minutes, or about 3.5 hours.

Now, going from roughly 3.5 hours down to about 40 minutes is roughly an 80% reduction in run time (a bit over a 5x speedup).
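Here's the same arithmetic as a small, reusable Python calculator, so you can plug in your own mix and rates; the percentages and throughput figures below are the example numbers from above.

# The insert/update/delete arithmetic above, as a reusable back-of-the-envelope
# calculator. Rates and mix percentages are the example figures from this post.

def minutes(rows: int, rows_per_second: float) -> float:
    return rows / rows_per_second / 60.0

if __name__ == "__main__":
    total_rows = 100_000_000
    split_time = (
        minutes(int(total_rows * 0.80), 80_000)    # inserts
        + minutes(int(total_rows * 0.18), 15_000)  # updates
        + minutes(int(total_rows * 0.02), 10_000)  # deletes
    )
    mixed_time = minutes(total_rows, 8_000)        # original mixed-workload stream
    print(f"split streams: {split_time:.1f} min, mixed stream: {mixed_time:.1f} min")
    print(f"run-time reduction: {1 - split_time / mixed_time:.0%}")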

2. With RDBMS Systems, I look for the same numbers as above - the RDBMS system must perform at the same rates as the ETL / ELT systems - or none of this works.

What do I look for from DBMS systems when I employ ELT?
I shoot for performance rates of 100,000 to 400,000 rows per second - an order of magnitude difference; however, most RDBMS engines can't come close to this, and there are only a few that can. These kinds of rates make it possible to deal with 1B+ rows of incoming data on a single stream.

For example, suppose this: 5 sequential processes in ETL move 1B rows to an RDBMS at 50,000 rows per second. That's 3B rows / 50,000 rows per second = 16.67 hours total run time.

Now suppose I have 42 steps of ELT for the same operational logic; that's 42B rows / 336,000 rows per second = 34.72 hours to process. Of course it's not always this way, because each "step" of the 42 reduces the data set, and not all of the 42 steps process the full data volume - this is a worst-case scenario. Oh, and by the way, this is just ONE of the 10 batch processes that have to run concurrently. Also, the RDBMS supporting this is not as large as you would think - it's about 8 dual-core CPUs and 12GB of RAM - and that's all I can say about that...
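As a sanity check on those figures, here is the worst-case comparison as a tiny Python script, using the rates quoted above. The "worst case" assumes every ELT step touches the full 1B-row set, which real flows rarely do.

# Worst-case run-time comparison from the example above: a multi-step ETL chain
# versus a 42-step ELT flow, using the rates quoted in this post. The worst case
# assumes every ELT step touches the full 1B-row set, which real flows rarely do.

def hours(rows: float, rows_per_second: float) -> float:
    return rows / rows_per_second / 3600.0

if __name__ == "__main__":
    etl_rows_touched = 3_000_000_000     # total rows moved across the ETL steps
    elt_steps, rows_per_step = 42, 1_000_000_000
    print(f"ETL chain      : {hours(etl_rows_touched, 50_000):.2f} h")
    print(f"ELT worst case : {hours(elt_steps * rows_per_step, 336_000):.2f} h")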

These are real numbers from real customer sites that I deal with. If you want to play in the big-data leagues, you've got to be able to produce in a mixed-workload environment. Please let us know if we can help you with your performance and tuning; we'd be glad to take a look at your systems.

Vendors, take heed: these are not "numbers to shoot for"; these are numbers that must be met within the next year or two - especially with unstructured data sets on the rise, and massive amounts of XML data being passed around and integrated. Vendors: I can work with you as well, as I sit on a number of technical advisory boards.

Come see me the next time I teach VLDW at TDWI, I'd be happy to talk to you, or contact me at my company to schedule a performance review of your systems.

What kinds of volumes are you dealing with today? Please let me know.

Thank-you,
Dan Linstedt
CTO, Myers-Holum
http://www.MyersHolum.com


Posted December 8, 2006 5:46 AM
Permalink | No Comments |