Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

September 2007 Archives

I've blogged a little bit about this before. In this entry we'll explore the mathematics side of normalization and denormalization in the data model. Of course, this entry is specifically targeted at very large data sets; on small systems the degradation is modest and easy to miss. There's a rumor out there, an un-truth, a fallacy, that states: if you're having performance problems in your database, simply denormalize (flatten the tables, remove the joins). This works, but only to a point.

There is a physical upper limit to denormalization where, mathematically, it simply stops working and performance degrades on an exponential scale. That magical point is reached when the rows become so wide that a single row forces multiple blocks (chained together) to satisfy a query.

That said: VLDW databases should ALWAYS have 64k block sizes (bigger if the hardware can handle it), especially if the rows are 4k wide. If you're dealing with Unicode data (2 bytes per character), your rows are wide by definition, which makes it difficult to survive without 64k block sizes.

Let's take a look at the mathematics:
64k block sizes
4k per row
---------------
16 rows per block

This is a nice even number, but nonetheless, if you are dealing with 60 million, or maybe 500 million, rows, that's a lot of I/O's. Let's explore the I/O counts (simple man's math) based just on disk retrieval (ignoring network, RAM, caching, etc...) for a specific query.

This query wants 10 million rows out of 500 million; there is no parallelism and no partitioning on the table (the only DB where this example doesn't apply is Teradata - it's 100% parallel all the time).

Here's the math on I/O's:
10 million
16 rows per block
-------------------------
625,000 blocks (assume 1 I/O per block) = 625,000 disk reads.

Now, just for simplicity's sake, suppose we really needed only 800 bytes of each row. Would it make a difference in the I/O's in a denormalized solution? NO. The 800 bytes are completely buried within each 4k row, so even with the 10M-row return, the I/O's will hit the disk 625k times.

Now, suppose you normalize the table and break it into three groups of data: 800 bytes, 955 bytes, and 2245 bytes. Suppose you need only the first table to answer the query... Here's the math:
64k block size / 800 bytes = 80 rows per block
10M rows / 80 rows per block = 125000 blocks or 125,000 disk reads.
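To make that arithmetic concrete, here's a minimal sketch (Python, my own illustration rather than anything from the original post) of the block-read counts above. It treats "k" as 1,000 bytes, matching the simple-man's math in the figures, and assumes whole rows pack into blocks with no chaining.

import math

def blocks_needed(row_count, row_bytes, block_bytes):
    # Disk blocks (roughly = I/O's) needed to read row_count rows,
    # assuming whole rows are packed into blocks with no chaining.
    rows_per_block = block_bytes // row_bytes
    return math.ceil(row_count / rows_per_block)

BLOCK = 64_000       # "64k" block, treated as 64,000 bytes per the post's simple math
ROWS = 10_000_000    # rows the query actually needs

print(blocks_needed(ROWS, 4_000, BLOCK))   # denormalized 4k rows     -> 625,000 block reads
print(blocks_needed(ROWS, 800, BLOCK))     # normalized 800-byte rows -> 125,000 block reads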

I don't know about you, but in my book 125,000 is a lot less than 625,000 to get the same data... That's a 5x multiplier on the I/O's for this query alone.

Now suppose you needed ALL 4k to put the row back together - how does this work?
Well, without parallelism there is a speed/performance loss: if the database reads ONE TABLE at a time (800 bytes, then 955 bytes, then 2245 bytes) to put the rows back together, then the I/O's are (logically) 3x higher than a single table scan. Why? Because the database now has to read 30M rows (10M from each separate table) and then join them to answer the query, and the join operation introduces extra I/O's because it isn't done in parallel either.

Now, suppose I introduce a) parallel query and b) parallel join. How does the timing work?
The 800-byte rows are read at the same time as the 955-byte rows and the 2245-byte rows, and all of it happens at the same time as the join operation (as the rows are fed into RAM, they are joined together).

Parallelism doesn't necessarily reduce the I/O count; it does two different things: 1) it runs the I/O's at the same time, and 2) it reduces the amount of TIME to execute.

So what happens when I denormalize? Why is this wrong?
1. Most database block sizes default (on install) to 4k or 8k, causing I/O's to go through the roof.
2. Most EDW row sizes are between 2k and 4k themselves (just look at your average conformed dimension)... causing 1 or 2 rows per block - or worse yet, causing "chained blocks," where a single row has half its data in one block and half in another. THIS is where denormalization BREAKS THE PERFORMANCE BANK.

When rows are chained, the data has been denormalized beyond the boundaries of efficient I/O. This causes double, then triple, then quadruple the I/O's compared to a normalized solution with parallel joins. At that point denormalization has broken the performance curve, and the only way to get it back is to normalize the data AND introduce parallelism.

Without parallelism, yes - performance of normalized data sets IS usually slower than that of denormalized data sets. But with parallelism, the normalized design will blow away any denormalized set nearly every time - especially with a large set of data.

Keep in mind, this math only becomes a problem with larger sets of data. Usually anything under 10 million rows per table is not an issue (as a guideline; it varies from DB to DB, engine to engine) - and that's IF the block size is 8k, or even 16k.

In LARGE systems, there are only two ways to improve performance: a) reduce the data set (in this case, normalize), b) increase parallelism (done through tuning and hardware capability). By decreasing the amount of data that each parallel thread has to deal with, we improve performance on a factored scale.
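As a rough illustration of how the two levers compound, here's a back-of-the-envelope model (the I/O rate is hypothetical, not a benchmark): elapsed time is roughly the block count divided by the per-stream I/O rate times the number of parallel streams.

def scan_seconds(blocks, ios_per_sec_per_stream, parallel_streams):
    # Rough elapsed time for a scan: total block reads spread across
    # parallel streams, each sustaining a fixed I/O rate.
    return blocks / (ios_per_sec_per_stream * parallel_streams)

# hypothetical 2,000 block reads per second per stream
print(scan_seconds(625_000, 2_000, 1))   # denormalized, serial  -> 312.5 seconds
print(scan_seconds(125_000, 2_000, 4))   # normalized, 4 streams -> ~15.6 seconds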

Also be aware that indexes are smaller and more nimble on normalized tables, and that indexes play by the same mathematics as the data set itself.

NOTES OF INTEREST:
In terms of appliances (so to speak): there are some appliances in the marketplace that show marked performance gains on DENORMALIZED data sets only, but they too have their upper limits in terms of the sheer volume of data they can handle effectively.
There are other "appliances" and software entering the marketplace today that are based on column normalization, or column storage. This is a technique that was pushed to the public market by Sybase IQ many years ago, but IQ had other limitations.

Their secret sauce is in the hardware optimizations and the disk layouts. The one problem (again) with denormalization is that mathematically there is _always_ an upper limit (to the amount of data) before performance degrades exponentially.

Now, if we MODEL our data sets incorrectly in a normalized data warehouse, performance will also suffer. There are systems (MPP systems, that is) that can handle huge volumes extremely well, yet the data model that is put in place severely limits the parallelism and increases the width of the rows unnecessarily. Data models are the next important key to performance success at HUGE volumes. I urge you to check out the Data Vault (Common Foundational Integrated Data Modeling Architecture), which is built for petabytes. It's an architectural solution that is FREE (just like 3rd normal form or star-schema) and that solves performance and flexibility problems in data warehousing. You can find it at: http://www.DanLinstedt.com

Cheers,
Dan Linstedt


Posted September 29, 2007 6:54 AM

In my last entry I got caught up explaining how to reduce the data set and increase parallelism. That should provide a huge bang for the buck at most installations. Again, we must go against the grain of what we've learned when we deal with very large volumes of data: large data takes a specific architecture. In this entry we'll discuss multi-pass processing of data sets rather than single-pass processing, and why it makes a difference in performance.

The rumor, the un-truth, the fallacy:
"Run all your data through a single data flow, in a single pass - it's the fastest way to deal with it if you only touch it once."

Well, this isn't always true - it often leads to:
a) Poor design
b) Sacrificed performance
c) Complex data flow diagrams

These elements cause:
1) Complex data flows
2) Multi-targets
3) (Often) multi-sources that are inter-dependent
4) Extreme amount of caching within a single process

Now, we all know that the shortest distance between two points is a straight line (in standard time/space continuum, with today’s mathematics). Let's assume for a minute that our data flow diagram is a straight line (well, at least that's how it starts). If we introduce too many transformations, too much complexity, too much caching, and too much processing we interrupt the flow of the straight line, and it now becomes a very crooked and jagged line. For that reason, maintenance costs increase, performance drops, and inevitably we reach a point where "more processing" is not feasible without re-architecture / simplifying the process.

Well, the real truth (on today's hardware) is as follows:
The "one-passing" of the data is the fastest way ONLY WHEN:
a) all the data that must be cached can be
b) all the data (including in the caches) can be executed in parallel and in block style
c) The entire flow can be split from end-to-end in a multi-parallel fashion
d) _all_ the source data and _all_ the target data can be cached in RAM before interacting with the process itself or the target database

This usually requires a monster 64-bit box with 80+ GB of RAM and 64+ CPUs at high clock speeds, and even then the data set is likely to outgrow the hardware, such that one or more of the "caches" hits disk. Once that happens, all bets that this is the fastest architecture are off.

Now, back to the analogy: what is the "simplest and fastest" data flow diagram we can ever build?
You guessed it: a straight line - a "copy map." Pick up data at point "A", provide a double-buffering mechanism within the ETL engine, and drop it / insert it into point "B". This is the fastest mechanism (all data sets and structures being equal).

Here's another question: in real-life, what do we do to solve BIG PROBLEMS?
We break them into bite-sized, manageable chunks - pieces that we can solve one at a time... Well, common sense says we should do the same for VLDW data integration loads.

How do we get to multi-passing?
Suppose we have a mainframe table "MF-CUST-SOMETHING-ELSE" with 360 fields on it; the row size in the mainframe copybook equates to around 4k. This source table contains 5.5 billion records and grows by 1.5 billion rows a year, with a daily feed of roughly 41 million records. (I know, the table I used was not really customer, but I can't discuss the real table names that I've worked with.)

Now also suppose that this table could not, nor would it ever, have any sort of "trigger" to tell us which rows are new and which (out of the existing 5.5 billion) have been changed on a daily basis. Also suppose there was no hope of putting any sort of CDC engine on the mainframe... Is this beginning to sound like your situation (except maybe for the volume, of course)? The source table also did NOT have any sort of date, but it did have a sequencer column.

Ok - this is a real-life case that we had to deal with. The questions are:
a) How do we identify the NEW rows, so that we pull only the new rows from the source and compare them against the target? In other words, which 41 million rows are actually new - the ones we didn't get on yesterday's feed?
b) How do we identify the "changed" rows and separate those from the existing 5.5 billion? We can't possibly pull all 5.5 billion rows across to the target for comparison purposes, nor can we try to "copy" the target back to the source for comparison.

Using the sequencer we were able to determine the "last" sequence we pulled on the last successful feed, and pull only brand-new rows from that point forward. But the updates were a nightmare. Why? We didn't have a "date" to work with - and even with a date on the rows, it can still be an issue.

Let's make it a little easier and put an "updated_date" on the source table. Suppose this "updated_date" is really an "applied_date" - the point in time the data applies to - so when they update a row, they may dictate that the data really "applies" to 2 years, or 5 years, ago. Now we have a problem. Even if you don't have these volumes, you might have these issues.

Now suppose that updates made up about 20 million rows of historical data, how do we best:
a) identify the updates
b) pull only the updates over the network
c) run the updates through the complex logic
d) separate the "inserts" from the "updates"?

Ok, first:
1) we separate the inserts from the updates
2) we set up the insert "flow" to be partitioned and parallel; given the throughput rates I usually get (around 20,000 to 40,000 rows per second for a 4k row), we can handle the inserts (all 41 million rows) in about 17 minutes, and with parallelism we can double, triple, or quadruple those rates
3) we change the way the updates are processed: we can't simply "copy the whole table" to the target, and we can't even "copy just the keys" to the target to see what's changed - it's simply too big, and the target is growing too large as well.

We split the updates into a multiple-pass algorithm.
1) we ran a single scan of all source rows, pulling just two columns - the primary key (source sequence number) and the update date - in PK/sequence order
2) we joined (a sorted heterogeneous join, no caching) to the target table, pulling the "recorded" update date and the "recorded source" sequence number, sorted in the target database...
3) we bounced the two date columns against each other to see which dates had really changed
4) for the rows with detected date changes, we wrote ONE column BACK to a new table on the source system (truncated and re-loaded every cycle): the source sequence primary key, on which we built an index post-load.
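Here's a minimal sketch of that pass-1 change detection, assuming both feeds arrive sorted by sequence number and that the target date has already been rendered back into the same DDMMYY string for comparison (the Python form and names are mine; the real job did this with a sorted heterogeneous join inside the ETL engine):

def changed_keys(source, target):
    # Merge-join two streams sorted by sequence number and yield the keys
    # whose update date differs. Each stream carries only
    # (sequence_number, update_date) - nothing else crosses the network.
    tgt_iter = iter(target)
    tgt_row = next(tgt_iter, None)
    for seq, src_date in source:
        # advance the target stream until it catches up with the source key
        while tgt_row is not None and tgt_row[0] < seq:
            tgt_row = next(tgt_iter, None)
        if tgt_row is None or tgt_row[0] != seq:
            continue  # brand-new rows are handled by the separate insert flow
        if tgt_row[1] != src_date:
            yield seq  # this key is written back to the work table on the source side

# tiny usage example with made-up data
src = [(1, "010107"), (2, "020907"), (3, "150907")]
tgt = [(1, "010107"), (2, "010107"), (3, "150907")]
print(list(changed_keys(src, tgt)))  # -> [2]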

Now for the math:
Source width: PK sequence = number(12), could be between 8 and 12 bytes; update date: actually 6 bytes (stored in character format, DDMMYY, in a PIC statement)
Target width: PK sequence = number(12), could be between 8 and 12 bytes; update date: actually 4 bytes (date format in the target)

Ok, assuming 8 bytes for the number plus 6 bytes for the unconverted date, our row was 14 bytes wide for this particular part of the process. The process used 256k block sizes within the transformation engine (a 32-bit system, because that's all we had at the time - this was 10 years ago).

TCP/IP Packet: 1k, fit about 73 rows per packet
ETL engine: 256k, fit about 18,285 rows per block
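A quick sanity check of those two figures (Python, mirroring the post's arithmetic: a ~1k usable packet payload and a 256k engine buffer counted in decimal k):

ROW_BYTES = 14          # number(12) key (~8 bytes) + DDMMYY date (6 bytes)
PACKET_BYTES = 1_024    # ~1k of usable TCP/IP payload, as used in the post
BLOCK_BYTES = 256_000   # 256k ETL engine buffer, counted in decimal k to match the figure

print(PACKET_BYTES // ROW_BYTES)   # -> 73 rows per network packet
print(BLOCK_BYTES // ROW_BYTES)    # -> 18,285 rows per ETL block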

We scanned the source table - 5.5 billion rows on the mainframe - funnelling the data back as fast as the mainframe could read rows, but pulling only those two columns (not the 4k of each row, because at this STEP we didn't need all the data).
At the SAME TIME, we sourced the EDW data set (it arrived in sorted order because of the clustered, partitioned index we had on the EDW table).
That's already a parallelism factor of 2, plus whatever parallelism the source had, plus whatever parallelism the EDW table had.

The sorted join was an INNER JOIN MATCH on sequence number, arriving in sorted order followed by a filter which eliminated rows from the stream that had matching update dates.

Our "target" table was back on the mainframe, again consisted only of the sequence key. So we were writing only 8 bytes ber row which means our writer block size at 256k could now fit: 32,000 rows before being dumped to the main-frame.

This was just the FIRST PASS. In the second pass we "joined" the new sequence table to the original 5.5 billion row table to pull JUST THE UPDATES (we had dramatically reduced the data set and found ALL the updates); the total number of updates averaged around 14 million rows per day. We then pulled only the columns we needed, based on the PK sequence match, and loaded them to a staging area in the target database. This was the second pass.

The third pass compared columns in the staging area to the actual target table to determine the "REAL" updated rows.

Each "pass" of the data was setup to increase parallelism and partitioning, I think our total job stream ran to completion in about 30 to 45 minutes on a daily basis... remember, this was 10 years ago on a 32 bit ETL machine (64 bit target database/64 bit target DB OS) and a mainframe. We had to do some tuning to the mainframe "CPU utilization rates", and some tuning to the target DB to get this to work, but it all worked.

Why did 3 passes of the data set (some would say 4) work faster than if we had "scanned" the whole table ONCE?
1) Because each pass dealt with smaller and smaller chunks of data
2) Because each pass took advantage of parallelism
3) Because each pass could focus on JUST the data it needed to operate on

So by a) reducing the data set and b) increasing the parallelism, we significantly cut the run-time of this process. By the way, the original run-time of the single-pass process was well over 14 hours - trying to handle updates, inserts, and "all the data" in one pass. The client was very happy to see 14 hours reduced to 45 minutes... a huge time savings.

Feel free to contact me, I'd love to hear your stories about situations like this.

Thanks,
Dan Linstedt


Posted September 25, 2007 5:36 AM

I've been blogging on this topic now for at least the past week, and I've gotten some good feedback from a few of you out there in blog-land (thank you). I've seen a couple of questions on "how" to reduce the data set, what I mean by "multi-passing" the data, and so on. Keep in mind that it's much easier to teach these subjects in classes than it is to blog on them and get all the needed details in place for everyone to read. In this blog I provide additional information on how to make performance work for large data sets and data integration.

How do I reduce the data set?
In my last entry I made mention of "rows being too wide." The comment I hear from good people is: if you need the data, you need the data - how can you "reduce" the data set?

Mathematics provide the answer.
As I stated early on in this performance and tuning series, and in every class I teach: there are only two, maybe three, ways to improve performance: 1) reduce the data set, 2) increase parallelism, 3) a combination of the two - really, there's nothing else that can be done. Ok, improving the architecture really falls under #2 (providing the means within the architecture to improve parallelism); most architectures and data flow diagram designs are too restrictive and too complex to allow improved parallelism.

So, that said: suppose you have 150 columns, or even 360 columns (an SAP R/3 feed, for example)... or suppose the row is only 5 columns. What matters? The byte length of the row - that's what matters, not the number of columns. The first incorrect conclusion experts tend to jump to is to look only at the column count. That said, as a rule of thumb, the more columns, the larger the row size usually is.

Ok, that said: suppose I have 150 columns that amount to a 4k row size. This is a WIDE ROW, and it will cause performance issues to no end - problems in the network, CPU, RAM, disk, I/O, swapping, database, application - you name it, it happens...

For simplicity's sake, let's focus on the database for a minute... Most database vendors install their engines with a default 8k block size. Simple math says I can fit 2 rows per block; that means 2 rows per I/O on disk (simple man's math, because I'm a simple man). Now, suppose the row size is 4500 bytes - what happens to the rest of row 2 when it's inserted into the database? It carries over to another block... Now 2 rows = 2 I/O's (at a minimum). In some databases this is referred to as "block chaining" or "row chaining". This is a problem... But back to the focus...

How much data fits in a TCP/IP packet? Anyone? A standard Ethernet frame caps the payload at roughly 1,500 bytes, so call it about 1k of usable data per packet for simple math. (Hint: this is one of the reasons that network drives are slower than DASD or Fibre Channel attached drives.) How many I/O's are needed to get 2 rows of 4500 bytes of data across to the disk? Again, in simple math: 9000 bytes = 9 x 1k packets = 9 I/O's over the network. Ouch.
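Here's a small sketch of that arithmetic (Python; a simplified model that ignores block and packet headers, and uses the ~1k payload figure from the simple math above):

import math

def disk_blocks(row_count, row_bytes, block_bytes=8_192):
    # Blocks touched when rows are packed end-to-end and allowed to
    # chain across block boundaries (no allowance for block headers).
    return math.ceil(row_count * row_bytes / block_bytes)

def network_packets(row_count, row_bytes, payload_bytes=1_024):
    # Packets needed to ship the same rows at ~1k of payload per packet.
    return math.ceil(row_count * row_bytes / payload_bytes)

print(disk_blocks(2, 4_500))       # two 4.5k rows in 8k blocks -> 2 blocks (the second one chained)
print(network_packets(2, 4_500))   # the same 9,000 bytes       -> 9 packets over the wire
print(disk_blocks(2, 800), network_packets(2, 800))   # 800-byte slices -> 1 block, 2 packets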

I/O's cost time, cost money, and can quickly and easily overflow the hardware... but wait - there's another monkey wrench in the works... Lack of parallelism (serialized operation).

If your application (ETL tool?) sends the "block" of two rows over the network, most likely it does it in serial... HEY NETWORK: I HAVE 9k WORTH OF DATA, SEND IT TO THIS DISK AT THIS ADDRESS... Network: OK, 9 I/O's in serial... Processed. HEY DATABASE: COMMIT THESE ROWS TO THE DISK... Database: Ok... 2 I/O's on disk, that'll be 2 blocks please...

And so on.

Now, what can we do? We still need the 9k of data right? Right... But we have one piece left at our disposal: increase parallelism...

How do we do that?
1) Split the data flow into two components, or three, or four - whatever it takes to NORMALIZE the structures, INCLUDING IN THE DATABASE.
2) Carry the primary key with Flow 2, Flow 3, Flow 4... (let's say we split the 4500 bytes into an average of about 1k each across 4 data flows).
3) Run all 4 data flows from the SAME SOURCE to the NORMALIZED TARGETS in parallel.

Voila, we just increased the parallelism - and wait a minute!! We ALSO reduced the data set per flow! We get a huge multiplier in performance _every time_, I guarantee it, as long as the hardware can handle the parallelism...
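A minimal sketch of that split, with hypothetical table and column names (a real ETL tool would wire these up as four parallel target loads fed from the same source pipe):

import concurrent.futures

# Hypothetical column groups: each narrow flow carries the primary key
# plus roughly a quarter of the wide row's columns.
FLOWS = {
    "customer_name_tbl":    ["cust_id", "first_name", "last_name"],
    "customer_addr_tbl":    ["cust_id", "street", "city", "postal_code"],
    "customer_contact_tbl": ["cust_id", "phone", "email"],
    "customer_misc_tbl":    ["cust_id", "status", "segment", "notes"],
}

def load_flow(table, columns, rows):
    # Stand-in loader: a real job would bulk-insert the narrow slice
    # into its normalized target table here.
    narrow = [{c: r[c] for c in columns} for r in rows]
    return len(narrow)

def load_parallel(rows):
    # All four narrow flows read the same source rows and run at once.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(FLOWS)) as pool:
        futures = {pool.submit(load_flow, t, cols, rows): t for t, cols in FLOWS.items()}
        for f in concurrent.futures.as_completed(futures):
            print(futures[f], f.result(), "rows loaded")

if __name__ == "__main__":
    sample = [{"cust_id": 1, "first_name": "A", "last_name": "B",
               "street": "1 Main St", "city": "Denver", "postal_code": "80202",
               "phone": "555-1212", "email": "a@example.com",
               "status": "active", "segment": "retail", "notes": ""}]
    load_parallel(sample)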

Now wait a minute! Won't this run "slower" because now it has to read the same data 4 times? What about the amount of work over TCP/IP - isn't it the same number of packets? What about the target disk and database? Don't they have the same amount of work to do?

Ok, here's the low-down:
1) Systems (including databases) have been optimized over the years to "cache" data from disk into RAM the first time it's requested. More often than not, when all 4 data flows run at the same time, only 1 of them at any given point pays the disk I/O price, leaving the rest to grab the data from RAM.
2) Systems have also been optimized to handle small, fast transactions in parallel - it's what they DO!!! So normalizing the data allows the hardware to be utilized the way it was originally intended. If you want DENORMALIZED DATA, then buy an appliance - that's what they specialize in: big blocks, huge wide rows, etc. (more on this later).
3) Applications (like ETL engines) "usually" open a single connection to the database per TARGET by default. Databases (at the end of the day) STILL use round-robin style processing across ALL their connections, and give more weight to system processes. Suppose each connection is given 3 milliseconds to handle 4500-byte rows of data: no matter how you slice the data, 1 connection = 3 milliseconds of execution time. Now, if you "normalize" the data, you have 4 connections attached simultaneously, and the round-robin scheduler in the database gives you 12 milliseconds to process the same amount of data IN PARALLEL. More time + less data per flow = faster execution, guaranteed.

What else am I saying?
DENORMALIZATION OF DATA FOR PERFORMANCE IS A MYTH! Trust me, it works great to eliminate "joins" in the database, but once you've blown past the magic point in the ratio of rows per block to the I/O's it takes to read those rows, you've doubled, tripled, quadrupled your disk I/O activity alone - then add in all the serialization needed to deal with those rows across the network, CPU, and RAM, and you've got yourself a resource hog.

I've seen it time and time again... Database engines these days are TUNED to take advantage of parallelism; the more we denormalize, the LESS parallelism the engines can engage in. If joins are a "problem" for performance, then either a) the database isn't tuned properly, or b) the hardware isn't scaled properly, or c) both - to take advantage of parallelism.

I've hinted in earlier entries at the performance numbers I'm getting on a 1TB (raw data) SQLServer2005 EDW across a 4- or 5-table join in a 32-bit Windows environment. I hope this helps.

The problem, again: look at the mathematics of wide rows; they correlate directly with slow performance (because they exponentially increase the number of I/O's required to handle the data set). Do not believe the hype that denormalization will solve your performance problems - it doesn't. It is a stop-gap measure for another, deeper issue: usually the architecture or incorrect tuning.

Love to have your thoughts/comments...

Thanks,
Dan Linstedt
DanL@DanLinstedt.com


Posted September 24, 2007 10:16 AM

I hope you've enjoyed this series; I've not received any comments either way. I'll be finishing it up soon. In this entry I'll address "ETL" processing in general; in another entry I'll discuss "ELT" in general; then I'll begin to discuss BI queries and query engines; and finally, at the end of the series, I'll bring in a couple of "appliances" or "appliance-like" performance enhancers.

In this entry, I'm going to focus on super fast, high speed ETL. This entry is less about the tools, and more about the architectures that work. I hope you enjoy it.

ETL is frequently used to move huge batches of data around; in fact, it's now an "aging" technology, but it has proven itself time and time again as a necessity for the following reasons:

a) Managing metadata: technical metadata, process flows, process scheduling, structured data sets, and more recently: "semi-structured" data sets like XML and so on.
b) GUI Driven partitioning and parallelism
c) GUI Driven grid setup / administration and maintenance
d) Data Quality.
e) Access to unstructured data, and conversion to a structured world.
f) GUI driven connections to heterogeneous sources and targets

When we look at this list, there's not only a lot of power available, there are tremendous benefits to using the ETL tools on the market. There are certainly more, but some of the ones I'm familiar with are (hopefully in alphabetical order):
* Ab-Initio
* Business Objects - Data Integrator
* Hummingbird - Genio
* IBM - DataStage
* Informatica - PowerCenter
* Microsoft - SSIS
* Oracle - Sunopsis (which we haven't heard from since the acquisition)
* Pentaho - Kettle
* Talend
* Teradata - Warehousing Utilities

Within these tool sets, there are a variety of options to choose from, but what is most important in VLDW or large data sets is the architecture. The _manner_ in which we construct our loads greatly impacts performance.

Too often, the industry or the vendor suggests that simply throwing hardware at the problem will solve performance (oh, if only you would buy grid, or buy this, or buy that...). That helps for a little while, but ultimately doesn't solve the problem.

If your ETL routines (during peak operation) are not pushing the existing hardware beyond an average load of 60%, or are not sustaining a hardware load of 80%, then most likely the architectures of the individual "data flows" are not correct, especially for big data.

Now wait a minute! You're telling me that just because I can, doesn't mean I should?
Yep... Too many tools make it too easy to "add another source, add another target, mix inserts, with updates, with deletes" all in the same data flow. This adds unwanted and unneeded complexity. Any time complexity enters, performance leaves.

I've blogged about this before... If you're headed down a highway towards a brick wall, and you refuse to change direction, what will happen? You'll hit the brick wall.

If you've got performance problems, and you refuse to change the architecture to try new things, you'll still have performance problems. There's only so much tweaking of knobs that can help performance, then it's all up to the architecture.

With Big Data & ETL, the first thing to realize is the mathematics behind the arrival of the data sets. The mistakes that ETL designers make are as follows: (these get more painful, the larger the data set, and / or the lower the latency of arrival of the data)

1) Incorporating Inserts, Updates, and Deletes in to the _same_ data flow / same process.
2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.
3) Targeting more than 1 or 2 target tables
4) moving rows that are too wide through a single process
5) loading very large data sets to targets WITH INDEXES ON
6) not running a cost-based optimizer in the database
7) not keeping statistics up to date in the database
8) not producing the correct indexes on the sources / lookups that need to be accessed
9) not purchasing enough RAM for the ETL server to house the RAM caches in memory.
10) running on a 32 bit environment which causes significant OS swapping to occur
11) running on a 32 bit environment which causes significant OS swapping to occur
12) running on a 32 bit environment which causes significant OS swapping to occur
13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance
14) believing that "I need to process all the data in one pass because it's the fastest way to do it." This is completely false, multi-passing the data can actually improve performance by orders of magnitude. IF parallelism can be increased.
15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.
16) "THINKING" in a transactional mode, rather than a batch mode, and processing each row, one row at a time (like they would code a cursor in a database language).
17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).

Oh, and did I mention running on a 32 bit environment which causes significant OS Swapping to occur?
Let me explain this one. Any single process started in a 32-bit environment is immediately limited to 2GB of addressable RAM. Now take away the RAM needed for the threads and the executable engine code - usually around 300MB or more - leaving us with 1.5GB to 1.7GB of RAM to work with. Then take away any "caching" objects that might use this RAM, and multiply that "drop" in RAM by the number of caching objects that you have...

Now you are left with possibly 200MB, maybe 100MB, of RAM to allocate for data reading/data writing... and it drops from there.
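Roughly, the budget looks like this (a sketch; the 2GB and ~300MB figures come from the paragraph above, and the cache sizes are hypothetical):

GB = 1024 ** 3
MB = 1024 ** 2

addressable   = 2 * GB            # 32-bit per-process user address space (default)
engine_code   = 300 * MB          # threads + executable engine code (estimate from above)
cache_objects = 6 * (250 * MB)    # e.g. six lookup caches of ~250MB each (hypothetical)

left_for_io = addressable - engine_code - cache_objects
print(left_for_io // MB, "MB left for reader/writer buffers")   # -> 248 MB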

Let me explain this other piece as well: 32-bit Windows adds its own constraints on top of the 2GB address space. Heavy reliance on PAGEFILE.SYS (swap/temp) stalls I/O while pages are being written to disk, and large memory allocations end up partly backed by the pagefile rather than physical RAM - so the more caching you ask for, the more swapping you get.

By the way, here's a tid-bit that you have to try to believe:
If you are running 32-bit Windows with SQLServer (either version)...
1. Change your ETL "block size" to 8k to match the database (SQLServer2000 and 2005 both use 8k pages)
2. Change your row size to fit as many rows as possible into an 8k block; the more rows per block, the faster the performance
3. Change your commit point to 8,500 rows (this is the sweet spot)
4. Check the NETWORK PACKET SIZES between the ETL engine and the database, and increase them from 2k/4k to 8k - again, to match the block size
5. USE ODBC/OLE DB connectivity; do NOT use NATIVE libraries (except direct CT-LIB with SQLServer2005). If your tool uses DB-LIB as "native" connectivity, it will be slow, slow, slow - DB-LIB is 12-year-old technology and is NOT multi-threaded or parallel, where CT-LIB is.

Windows 64-bit does NOT have these limitations, though it requires about 30% more hardware to run "cool" and efficiently; it is FAST when installed and configured correctly on the right hardware.

Linux, and Unix do not have these limitations either.

So, architecture is everything. If you take the list I've published of what NOT to do, and change your architecture accordingly, you should see significant performance gains when running "the tuned data flow" by itself (as a test).

There's one more piece I'd like to share (for instance, why letting the DB handle errors slows down processing).
Most ETL engines handle "blocks" of rows, which is why, if you think transactionally or architect transactionally (row by row), performance will NEVER reach its maximum. Let's say the ETL engine puts 150 rows in a block and sends this block to the database. The database then caches this block until it reaches a commit point (in general).

The database is then "asked" to commit the rows it has cached (in TEMP, mind you). The database has to iterate over all the rows in each block (using an optimized bind array) and bounce the "rows in error" ONE AT A TIME back to the ETL engine. Now I ask you: is this transactional or "batch"-oriented thinking? Right - transactional... Of course it's going to be slow. The other thing that happens is that the ETL engine has to STOP PROCESSING NEW ROWS, pick up the "rows in error" one at a time, and flow them out to disk.

We've just quadrupled the I/O for that row and decreased our performance by a factor of 4x (at least). Why? The row traveled from the ETL buffer to the network, over the network to the database, was cached in TEMP in the database, then was read back from TEMP (possibly disk), checked against the real table (again disk), and bounced back over the network. The ETL engine at this point STOPS its parallel processing of all kinds just to handle the row in error, one at a time, and write it to local disk.

This same process is repeated whether we load directly from ETL or from a database loader. The process is worse when we mix inserts with updates with deletes. Why? Because the database can no longer use block-style bind arrays; it must process each row in its own individual statement (transactionally).

So already you've lost 4x to 10x the performance you COULD be achieving, which means that a "data flow" that COULD execute at 80,000 rows per second now executes at 8,000 rows per second. A significant price to pay when dealing with large data sets.

The math:
800 million rows / 80,000 rows per second = 2.77 hours (about)
800 million rows / 8,000 rows per second = 27.77 hours (about)
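The same arithmetic as a couple of lines of Python, just to show how directly the row rate translates into wall-clock hours:

rows = 800_000_000
for rate in (80_000, 8_000):                             # rows per second
    print(rate, "rows/sec ->", round(rows / rate / 3600, 2), "hours")
# 80000 rows/sec -> 2.78 hours
# 8000 rows/sec -> 27.78 hours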

A huge performance drain. Again, performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines.

Thoughts? Comments? I'd love to hear from you.

Cheers,
Dan L


Posted September 21, 2007 6:14 AM

I've got this series going on VLDW, and I'd like to continue it a little bit longer. In this entry I'll dive into the different types of applications in a VLDW environment and their impact on the machines underneath. Large data sets are indeed affected by the other applications (including database engines) running on the same machines. What some people tend to forget is that volume & latency change everything. When tuning performance for extremely large data, it is absolutely vital not to lose sight of which applications share the hardware.

The causes of problems with performance on large scale data sets or high volume combined with low latency are as follows:

1. Inter-application MIXED WORKLOAD
2. Hardware that is shared "in theory" but is, in reality, being fought over for resources
3. When it comes time for a SINGLE application to do its job, other applications have 60% or more of the resources already tied up.

I can't begin to tell you how many times I walk into a client that has a web server running on the same machine as their database engine, and then complains about performance because of volume in the database. This is very problematic. Their reasoning usually is: "well, when we installed it, it wasn't a problem..." or "we don't have very many users on the web server" or "we don't seem to think that's the issue." Or my personal favorite: "The vendor provided us with a web server; we just wanted to take advantage of it."

A prime example of mixed-workload and application clash is going on here. Web servers within the RDBMS engines are there so that "specific" instances of the database can be tied, with security to the web-server front-ends. This is a wonderful feature, until you have volume to deal with. NEVER in a volume solution should an RDBMS engine (responsible for managing the volume) be tied to a web-server component. IF any of the web-components are used, they should only be used to import/export data (i.e. XML).

Workload clash across applications can KILL performance right off the bat. Furthermore, do not use web-services directly on top of the RDBMS engine, why? The hardware isn't typically scaled to do so.

I'll give you an example of a system I am working on right now (I can't disclose the customer name, nor the performance I'm seeing due to legal licensing agreements).

I have a couple of SQLServer2005 instances on high-end dual-core machines with 8 GB RAM, a 32-bit Windows OS, and 32-bit SQLServer. I've got a Data Vault data model under the covers as the data warehouse, with over 1 TB of test data built on a single box (about 3 TB across multiple boxes), and a separate web server (also on Windows; we are running .NET) that performs extremely fast: I can query data sets of 1 billion rows (joined to 300 million rows) and produce a result set of 100,000 rows very quickly.

I have a huge load on SQLServer, 1 TB per machine, Raid 0+1, internal SCSI, 146GB Hot Swap drives at 10,000 RPM. The performance is phenomenal.

Why? Because I've cleared the machines of doing _anything_ except serving data. These machines are highly focused and highly tuned, and have nothing but the OS and SQLServer on them to run and process data sets. If I threw a web server onto the RDBMS engine itself, or if I turned on web-services processing/email processing, or used CLR or in-database stored procedure logic, my performance would slow down by factors of 4x to 10x or more.

Do not let your applications crowd out the performance of your RDBMS engines. You've paid good money for licenses to your database engines, and by now, if you haven't guessed it: separating the applications across machines (divide and conquer) and improving network speed will help improve parallelism, tuning, and functionality. These RDBMS engines really need to be on their own box. This is just one of the reasons why such high performance can be seen from APPLIANCE vendors: they LOCK the hardware down (in most cases) to perform data-servicing and management functions, and they also provide additional, sometimes proprietary, hardware to improve speed even further.

1. Web servers are the number one performance killers of VLDW / VLDB RDBMS engines.
2. OLTP applications are the number two performance killers of VLDW (data warehousing specific) RDBMS engines.
3. XML In and Out can be a performance killer of structured text loads. Try pre-processing the XML, normalizing or fully denormalizing "flattening" it.
4. Stored procedures, functions, and "programmatic" solutions like triggers, and logic within the RDBMS can be performance killers. The database engines MUST be clean, and simple - used to house the structures only. Keep your application logic in the application layers where it belongs. After all, the acronym RDBMS stands for "Relational Database Management System", not "relational application logic data manipulation system".... I realize that some databases (like Teradata) provide for high degrees of parallelism and application logic within the database. They go through great pains in engineering to make this work, hence the high cost, but I still say: processing should not necessarily happen within the database UNLESS it can happen on a "block by block" basis in parallel with linear scalability.

Whoa, hold-up, what are you saying?
Ok, here's where part of the rubber hits the road. Most programmers write logic this way:

(open a cursor over the source table)
(fetch one row at a time in a loop - like a macro in Excel)
(apply the logic to that single row, then insert/update the target)
(close the cursor)

This is SINGLE-ROW PROCESSING, and it is extremely slow in huge-volume environments. Remember the performance numbers I spoke of (in terms of rows per second) that you want to hit in large environments? 120,000 to 160,000 rows per second of processing... To my knowledge, a program written this way, as a row-by-row processing loop, will never perform as fast as block style (see below)...

The above code re-written in block style:

1. INSERT INTO ...
   SELECT ...
   WHERE ...

2. INSERT INTO ...
   SELECT ...
   FROM temp table #1
   WHERE ...

3. UPDATE ...
   WHERE ...

4. INSERT INTO ...
   SELECT ...
   FROM temp table #2
   WHERE ...
This is block style programming, you're telling the database to deal with BLOCKS of rows in RAM rather than a single row at a time. The performance factor of block style is 10x to 25x faster than single row hits, ESPECIALLY if the database is tuned for parallelism and partitioning, and has enough hardware to perform appropriately.

If, however, the database does NOT have enough hardware, OR it is not tuned properly, block-style processing might "never finish," where looping (while slow) actually completes. The looping process in that case may run at 600 rows per second - at best it might run at 1,200 rows per second - but nowhere near the 120,000 rows per second of block style.

Keep in mind that block style processing requires LOTS of use of TEMP/SWAP space.

Ok, I digress... back to the point:
* DO NOT mix workload on RDBMS engines in volume solutions
* DO divide and conquer: separate the application spaces from each other and dedicate each machine to its intended purpose... Besides, how much can it cost to get a Linux box, or a new Windows machine, just for the web-services logic - compared to how long you're currently waiting for your data to complete its processing?
* MOVE the application logic BACK into the application and out of the database; let the ETL/ELT tools do their jobs in high-speed, parallel fashion.
* Partition and tune your database, buy the hardware needed for the RDBMS engine to breathe.

Hope this helps,
Dan Linstedt


Posted September 17, 2007 5:10 AM
