Business Intelligence Network business intelligence resources

Blog: Dan E. Linstedt

« August 2007 | Main | October 2007 »

September 29, 2007

Normalizing / Denormalizing: VLDW & Architecture

I've blogged a little bit about this before. In this entry we'll explore the mathematics side of normalization and denormalization in the data model. Of course, this entry is also specifically targeted at very large data sets. On small systems that can't handle the mathematics, performance degrades to a certain degree. There's a rumor out there, an un-truth, a fallacy that states: if you're having performance problems in your database, simply denormalize (flatten the tables, remove the joins). This works, but only to a point.

There is a physical upper limit to denormalization where mathematically it simply stops working, and performance degrades on an exponential scale. That magical point is where the width of the row increases so much, that it forces multiple blocks (chained together) to meet the needs of queries.

That said: VLDW databases should ALWAYS have 64k block sizes (bigger if the hardware can handle it). Especially if the rows are 4k each (in width). If you're dealing with uni-code data, then you have wide rows (2 bytes per character) which make it difficult to survive without 64k block sizes.

Let's take a look at the mathematics:
64k block sizes
4k per row
---------------
16 rows per block

This is a nice even number, but none-the-less, if you are dealing with 60 million, or maybe 500 million rows, that's a lot of I/O's. Let's explore the I/O counts (simple mans math) just based on disk retrieval (ignoring network, RAM, caching, etc...) for a specific query.

This query wants 10 million rows out of 500 million, there is no parallelism, and no partitioning on the table (the only DB that this doesn't work for is Teradata - it's 100% parallel all the time).

Here's the math on I/O's:
10 million
16 rows per block
-------------------------
625,000 blocks (assume 1 I/O per block) = 625,000 disk reads.

Now, just for simplicity sake, suppose we really needed only 800 bytes of each row, would it make a difference in the I/O's in a denormalized solution? NO. The 800 bytes are completely buried within each 4k row, so even with the 10M row return, the I/O's will hit the disk 625k times.

Now, suppose you normalize the table, and you break it in to three groups of data: 800 bytes, 955 bytes, and 2245 bytes. Suppose you need only the first table now to answer the query... Here's the math:
64k block size / 800 bytes = 80 rows per block
10M rows / 80 rows per block = 125000 blocks or 125,000 disk reads.

I don't know about you, but in my book: 125,000 is a lot less than 625,000 to get the same data... It's a multiplier to performance, magnitudes faster.

Now suppose you needed ALL 4k to put the row back together, how does this work?
Well, without parallelism there would be only speed / performance loss - that means if the database reads ONE TABLE at a time (800 bytes, then 955 bytes, then 2245 bytes) to put the rows together, then the I/O's are (logically) 3x higher than a single table scan, why? Because the database now has to read 30M rows (10M from each separate table) and then join them to answer the query. The join operation introduces the extra I/O's because it's not done in parallel either.

Now, suppose I introduce a) parallel query, b) parallel join. How does the timing work?
800 bytes per row reads, at the same time 955 bytes per row reads at the same time as 2245 bytes per row reads, which all happen at the same time as the join operation (as the rows are fed in to RAM, they are joined together).

Parallelism doesn't necessarily reduce the I/O count, it does two different things: 1) it runs the I/O's at the same time, and 2) it reduces the amount of TIME to execute.

So what happens when I denormalize? Why is this wrong?
1. Most database block sizes default (on install) to 4k or 8k, causing I/O's to go through the roof.
2. Most EDW row sizes are between 2k and 4k themselves (just look at your average conformed dimension for this)... Causing 1 or 2 rows per block, or worse yet, causing "chained blocks" where a single row has half it's data in one block, and half it's data in another, THIS is where denormalization BREAKS THE PERFORMANCE BANK.

When chained rows are introduced, the data is denormalized beyond the boundaries of I/O execution time. This causes double, then triple, then quadruple I/O's compared to a normalized solution with parallel joins. Now, denormalization has broken the performance curve and the only way to get it back is to normalize the data AND introduce parallelism.

Without parallelism, yes - performance of normalized data sets IS usually slower than denormalized data sets, but with parallelism - it will blow away any denormalized set at nearly any time - especially with a large set of data.

Keep in mind, these mathematics only become a problem with larger sets of data. Usually anything under 10 million rows per table is not an issue (as a guideline, this varies from DB to DB, engine to engine). That is IF the block size is 8k, or even 16k.

In LARGE systems, there are only two ways to improve performance: a) reduce the data set (in this case, normalize), b) increase parallelism (done through tuning and hardware capability). By decreasing the amount of data that each parallel thread has to deal with, we improve performance on a factored scale.

Also be aware, that indexes are smaller and more nimble in normalized tables, and that indexes as well play by the same mathematics as the data set itself.

NOTES OF INTEREST:
In terms of appliances (so to speak). There are some appliances in the market place that show marked performance gains on DENORMALIZED data sets only, but they too have their upper limits in terms of the sheer volume of data they can handle effectively.
There are other "appliances" and software that is entering the market place today that is based on Column normalization, or column storage. This is a technique that was pushed to the public market by Sybase IQ many years ago, but IQ had other limitations.

Their secret sauce is in the hardware optimizations, and the disk layouts. The one problem (again) with denormalization, is mathematically there is _always_ an upper limit (to the amount of data) before performance degrades exponentially.

Now, if we MODEL our data sets incorrectly in a normalized data warehouse, performance will also be inversely affected. There are some systems that can handle huge volumes extremely well, (MPP Systems that is), but the data model that is put in place severely limits the parallelism and also increases the width of the rows unnecessarily. Data Models are the next important key to performance success at HUGE volumes. I urge you to check out the Data Vault (Common Foundational Integrated Data Modeling Architecture) - which is built for Petabytes. It's an architectural solution that is FREE (just like 3rd normal form or star-schema) that solves performance and flexibility problems in data warehousing. You can find it at: http://www.DanLinstedt.com

Cheers,
Dan Linstedt

  Posted by Dan Linstedt at 6:54 AM | | Comments (0)


September 25, 2007

ETL/ELT: VLDW and Multi-passing data

In my last entry I got caught up explaining how to reduce the data set and increase parallelism. That should provide a huge bang for the buck for most installations. Again, we must go against the grain of what we learn when we deal with very large volumes of data. What we find is that large data takes a specific architecture to solve these problems. In this entry we'll discuss the ability to multi-pass data sets rather than single pass data sets, and why this makes a difference in performance.

The rumor, the un-truth, the fallacy:
"Run all your data through a single data flow, in a single pass - it's the fastest way to deal with it if you only touch it once."

Well, this isn't always true - this often times leads to:
a) Poor design
b) Sacrificed performance
c) Complex data flow diagrams

These elements cause:
1) Complex data flows
2) Multi-targets
3) (Often) multi-sources that are inter-dependent
4) Extreme amount of caching within a single process

Now, we all know that the shortest distance between two points is a straight line (in standard time/space continuum, with today’s mathematics). Let's assume for a minute that our data flow diagram is a straight line (well, at least that's how it starts). If we introduce too many transformations, too much complexity, too much caching, and too much processing we interrupt the flow of the straight line, and it now becomes a very crooked and jagged line. For that reason, maintenance costs increase, performance drops, and inevitably we reach a point where "more processing" is not feasible without re-architecture / simplifying the process.

Well, the real truth (on today's hardware) is as follows:
The "one-passing" of the data is the fastest way ONLY WHEN:
a) all the data that must be cached can be
b) all the data (including in the caches) can be executed in parallel and in block style
c) The entire flow can be split from end-to-end in a multi-parallel fashion
d) _all_ the source data and _all_ the target data can be cached in RAM before interacting with the process itself or the target database

This usually requires a monster 64 bit box with 80+ GB of RAM, and 64+ CPU's at high speeds, but even then - the data set is likely to outgrow the hardware, such that one or more of the "caches" hits disk. Once this happens, all bets of this being the fastest architecture are off.

Now, back to the analogy: what is the "simplest and fastest" data flow diagram we can ever build?
You guessed it: a straight line - "copy map". Pick up data at point "A" provide a double buffering mechanism within the ETL engine, and drop it / insert it in to point "B". This is the fastest mechanism (all data sets and structures being equal).

Here's another question: in real-life, what do we do to solve BIG PROBLEMS?
We break them in to bite-sized chunks, manageable chunks, pieces that we can solve one at a time... Well, common sense says we should do this for VLDW data integration loads as well.

How do we get to multi-passing?
Suppose we have a mainframe table "MF-CUST-SOMETHING-ELSE" with 360 fields on it, the row size in the mainframe copybook equates to around 4k in size. This source table contains 5.5 Billion records, and grows on a yearly basis by 1.5 billion rows (average over 365 days is about: 41 million records a day). (I know, the table I used was not really customer, but I can't discuss the real table names that I've worked with).

Now also suppose that this table could not, nor would it ever have any sort of "trigger" to tell us what rows are new, and which (out of the existing 5.5 billion) have been changed on a daily basis. Also suppose that their were no hope of putting any sort of CDC engine on the mainframe... is this beginning to sound like your situation (except maybe for the volume of course?) The source table also, did NOT have any sort of date, but it did have a sequencer column.

Ok - this is a real-live case that we had to deal with. The questions are:
a) how do we identify the NEW rows to pull only the new rows from the source and compare them against the target. In other words, which 41 million rows are actually new? that we didn't get on yesterday's feed?
b) how do we identify the "changed" rows, and separate those from the existing 5.5 billion rows? We can't possibly pull all 5.5 billion rows across to the target for comparison purposes, nor can we try to "copy" the target back to the source for comparison purposes.

Using the sequencer we were able to determine the "last" sequence we pulled from the last successful feed, and only pull brand new rows from that point. But the updates were a night-mare. Why? we didn't have a "date" to work with, but even with a date on the rows, it still can be an issue.

Let's make it a little easier and put an "updated_date" on the source table, suppose this "updated_date" is really "applied_date" - when the data applied in time, so when they updated the row, they may dictate that the data really "applies" to 2 years, or 5 years ago - now we have a problem. Even if you don't have these volumes you might have these issues.

Now suppose that updates made up about 20 million rows of historical data, how do we best:
a) identify the updates
b) pull only the updates over the network
c) run the updates through the complex logic
d) separate the "inserts" from the "updates"?

Ok, first:
1) we separate inserts from updates
2) we setup the insert "flow" to be partitioned and parallel (although given the performance rates I usually get - around 20,000 rows per second to 40,000 rows per second for a 4k row) we can still handle the inserts (all 41 million rows) in about 17 minutes. With parallelism we can double, triple, and quadruple these throughput rates.
3) We need to change the way updates are processed, we can't simply "copy the whole table" to the target, we can't even "copy just the keys" to the target to see what's changed. It simply is too big, with the target growing too large as well.

We split the updates in to a multiple pass algorithm.
1) we went to a single scan of all source rows containing two columns: the primary key (source sequence number), and the update date, scanned in PK/sequence order
2) we joined (sorted heterogeneous join, no caching) to the target table, and pulled: the "recorded" update date, and the "recorded source" sequence number (sorted the target) in the target database...
3) we bounced the two columns against each other to see "which dates had really changed"
4) for the rows with detected "changes" to the dates, we wrote ONE column BACK to the source system to a new table (which was truncated and re-loaded every cycle), the one column was the source sequence primary key, upon which we built an index post-load.

Now for the math:
Source Width: PK Sequence = number(12), could be between 8 and 12 bytes, update date: actually 6 bytes (stored in character format: DDMMYY in a PIC statement)
Target width: PK sequence = number(12), could be between 8 and 12 bytes, update date: actually 4 bytes (date format in the target)

Ok, assuming 8 bytes for the numbers, + 6 bytes for the unconverted date, our row was: 14 bytes wide for this particular part of the process, this process had block sizes of 256k within the transformation engine (32 bit system because that's all we had at the time - this was 10 years ago).

TCP/IP Packet: 1k, fit about 73 rows per packet
ETL engine: 256k, fit about 18,285 rows per block

We scanned the source table: 5.5 billion rows on the mainframe - funnelling the data back as fast as the Mainframe could read rows, but only pulling those two columns (not the 4k for each row, because at this STEP we didn't need all the data).
At the SAME TIME, we sourced the EDW data set (it arrived in sorted order because of the clustered and partitioned index table we had on the EDW table).
Already a parallelism factor of 2 + what ever parallelism the source had + what ever parallelism the EDW table had.

The sorted join was an INNER JOIN MATCH on sequence number, arriving in sorted order followed by a filter which eliminated rows from the stream that had matching update dates.

Our "target" table was back on the mainframe, again consisted only of the sequence key. So we were writing only 8 bytes ber row which means our writer block size at 256k could now fit: 32,000 rows before being dumped to the main-frame.

This was just the FIRST PASS, in the second pass we "joined" the new sequence table to the original 5.5 billion row table to pull JUST THE UPDATES (we dramatically reduced the data set, and found ALL the updates) the total number of updates on average per day was around 14 million rows, we then pulled only the columns we needed based on the PK sequence match and loaded them to a staging area in the target database. This was the second pass.

The third pass compared columns in the staging area to the actual target table to determine the "REAL" updated rows.

Each "pass" of the data was setup to increase parallelism and partitioning, I think our total job stream ran to completion in about 30 to 45 minutes on a daily basis... remember, this was 10 years ago on a 32 bit ETL machine (64 bit target database/64 bit target DB OS) and a mainframe. We had to do some tuning to the mainframe "CPU utilization rates", and some tuning to the target DB to get this to work, but it all worked.

Why did 3 passes of the data set (some would say 4) work faster than if we had "scanned" the whole table ONCE?
1) Because each pass dealt with smaller and smaller chunks of data
2) Because each pass took advantage of parallelism
3) Because each pass could focus on JUST the data it needed to operate on

So by: a) reducing the data set and b) increasing the parallelism we significantly cut the run-time of this process. By the way, the original run-time of the single pass process was well over 14 hours - to try to handle updates, inserts, and "all the data" in a single pass. So the client was very happy to reduce 14 hours to 45 minutes... a huge time savings.

Feel free to contact me, I'd love to hear your stories about situations like this.

Thanks,
Dan Linstedt

  Posted by Dan Linstedt at 5:36 AM | | Comments (0)


September 24, 2007

ETL/ELT: VLDW & Data Integration Details

I've been blogging on this topic now for at least the past week. I've gotten some good feedback from a few of you out there in blog-land (thank-you). I've seen a couple questions on "how" to reduce the data set, and what I mean by "multi-passing" the data, and so on. Keep in mind that it's much easier to teach these subjects in classes, than it is to blog on them and get all the needed details put in place for everyone to read. In this blog I provide additional information on how to make performance work for large data sets and data integration.

How do I reduce the data set?
In my last entry I made mention of "rows being too wide." The comments I see from good people are: If you need the data, you need the data, how can you "reduce" the data set?

Mathematics provide the answer.
As I stated early on in performance and Tuning, and in every class I teach, I discuss this: there are only two / maybe three ways to improve performance: 1) reduce the data set, 2) increase parallelism, 3) combination of the two - really there's nothing else that can be done. Ok, improving the architecture really falls under #2 (providing the means within the architecture to improve parallelism), most architectures and data flow diagram designs are too restrictive and too complex too allow improved parallelism.

So, that said: Suppose you have 150 columns, or even 360 columns (SAP / R3 feed for example).... Worse yet, suppose the row is 5 columns... What matters? The byte length of the row - that's what matters, not the number of columns. The first incorrect conclusion that experts tend to make, is simply to look at the number of columns. However, as a rule of thumb, USUALLY the higher the number of columns, the larger the row size.

Ok, that said: suppose I have 150 columns that amount to a 4k row size. This is a WIDE ROW, and will cause performance issues to no end. Problems in the network, CPU, RAM, Disk, I/O, Swapping, Database, Application - you name it, it happens...

For simplicity sake, if we focus on the database for a minute... Most database vendors install their database engines at a default 8k block size... Simple math says: I can fit 2 rows per block, that means: 2 rows per I/O on disk (simple-mans math, because I'm a simple man)... Now, suppose that row size is 4500 bytes... What happens to the rest of row 2 when inserting to the database?? It carries over to another block... Now, 2 rows = 2 I/O's (at a minimum). In some databases this is referred to as "block chaining" or "row chaining". This is a problem... But back to the focus...

How large are most TCP/IP packets? In fact, what's the maximum size of a raw IP packet? Anyone? You might be surprised to know it's close to 1024 bytes, that's right: 1k. (Hint: this is one of the reasons that Network drives are slower than DASD, or Fibre Channel attached drives)... How many I/O's are needed to get 2 rows of 4500 bytes of data across to the disk? Again, in simple math: 9000 bytes = 9x1k packets = 9 I/O's over the network. Ouch.

I/O's cost time, cost money, and can quickly and easily overflow the hardware... but wait - there's another monkey wrench in the works... Lack of parallelism (serialized operation).

If your application (ETL tool?) sends the "block" of two rows over the network, most likely it does it in serial... HEY NETWORK: I HAVE 9k WORTH OF DATA, SEND IT TO THIS DISK AT THIS ADDRESS... Network: OK, 9 I/O's in serial... Processed. HEY DATABASE: COMMIT THESE ROWS TO THE DISK... Database: Ok... 2 I/O's on disk, that'll be 2 blocks please...

And so on.

Now, what can we do? We still need the 9k of data right? Right... But we have one piece left at our disposal: increase parallelism...

How do we do that?
1) Split the data flow in to two components, or three, or four... Whatever it takes to NORMALIZE the structures, INCLUDING IN THE DATABASE.
2) Run the Primary Key with Flow 2, Flow 3, Flow 4... (let's say we split 4500 bytes into an average of 1k each with 4 data flows...)
3) Run all 4 data flows from the SAME SOURCE to the NORMALIZED TARGETS in parallel.

Voila, we just increased the parallelism - wait a minute!! We ALSO reduced the data set! We receive a huge multiplier in performance _every time_ I guarantee it, as long as the hardware can handle the parallelism...

Now wait a minute! Won't this run "slower" because now it has to read the same data 4 times? What about the amount of work over TCP/IP - isn't it the same number of packets? What about the target disk and database? Don't they have the same amount of work to do?

Ok, here's the low-down:
1) Systems (including databases) have been optimized over the years to "Cache" data from disk into RAM the first time it's requested, more often than not - running all 4 data flows at the same time, only 1 of them at any given point will pay the Disk I/O price, leaving the rest to "grab data from RAM"
2) Systems have also been optimized to handle small fast transactions in parallel, it's what they DO!!! So normalizing the data allows the hardware to be utilized the way it was originally intended. If you want DENORMALIZED DATA, then buy an appliance - that's what they specialize in: big blocks, huge wide rows, etc... (more on this later).
3) Applications "usually" by default (like ETL engines) open single connections to the database per TARGET. Databases (at the end of the day) STILL use round-robin style processing against ALL their connections, and give more weight to system processes. Suppose your connections given 3 milliseconds to handle 4500 byte rows of data - no matter how you slice the data you still have 1 connection = 3 milliseconds of execution time. Now, if you "normalize" the data, you now have 4 connections simultaneously attached, now the round-robin scheduler in the database gives you 12 milliseconds to process the same amount of data IN PARALLEL. More Time + less Data = Faster Execution, guaranteed.

What else am I saying?
DENORMALIZATION OF DATA FOR PERFORMANCE IS A MYTH! Trust me, it works great to eliminate "joins" in the database, but once you've blown the magic point at which the ratio of: Number Of Rows per block to I/O's it takes to read that number of rows, you've doubled, tripled, quadrupled your Disk I/O activity alone - then add in all the serialization to deal with those rows across the network, CPU, RAM, and you've got yourself a resource hog.

I've seen it time and time again.... Database engines these days are TUNED to take advantage of parallelism, the more we Denormalize, the LESS parallelism the engines can engage in. If Joins are a "problem" for performance, then either: a) the database isn't tuned properly or b) the hardware isn't scaled properly or c) both - to take advantage of parallelism.

I've hinted at performance numbers in my blog entry that I'm receiving on a 1TB (raw data) SQLServer2005 EDW across a 4 / 5 table join in a 32 bit Windows environment. I hope this helps.

The problem again, is look at the mathematics of wide rows, they are in direct correlation to slow performance (because they exponentially increase the number of I/O's required to handle the data set). Do not believe the hype that denormalization will solve the performance problems, it doesn't. It is a stop-gap measure for another deeper issue: usually architecturally driven or incorrect tuning.

Love to have your thoughts/comments...

Thanks,
Dan Linstedt
DanL@DanLinstedt.com

  Posted by Dan Linstedt at 10:16 AM | | Comments (0)


September 21, 2007

ETL Engines: VLDW & Loading / Transforming

I hope you've enjoyed this series; I've not received any comments either way. I'll be finishing up this series soon. In this entry I'll address "ETL" processing in general, in another entry I'll discuss "ELT" in general, and then I'll begin to discuss BI Queries and Query engines going forward, finally at the end of this series I'll bring in a couple "appliances" or "appliance like" performance enhancers.

In this entry, I'm going to focus on super fast, high speed ETL. This entry is less about the tools, and more about the architectures that work. I hope you enjoy it.

ETL is frequently used to move huge batches of data around; in fact, it's now an "aging" technology, but has proven itself time and time again as a necessity for the following reasons:

a) Managing metadata: technical metadata, process flows, process scheduling, structured data sets, and more recently: "semi-structured" data sets like XML and so on.
b) GUI Driven partitioning and parallelism
c) GUI Driven grid setup / administration and maintenance
d) Data Quality.
e) Access to unstructured data, and conversion to a structured world.
f) GUI driven connections to heterogeneous sources and targets

When we look at this list, there's not only a lot of power available, there are tremendous benefits to using the ETL tools on the market (which I'm sure there are more, but some of the ones I'm familiar with are) - (hopefully in alphabetical order)
* Ab-Initio
* Business Objects - Data Integrator
* Hummingbird - Genio
* IBM - DataStage
* Informatica - PowerCenter
* Microsoft - SSIS
* Oracle - Sunopsis (whom we haven't heard from since the acquisition)
* Pentaho - KETLE
* Talend
* Teradata - Warehousing Utilities

Within these tool sets, there are a variety of options to choose from, but what is most important in VLDW or large data sets is the architecture. The _manner_ in which we construct our loads greatly impacts performance.

Too often, the industry or the vendor suggest that simply by throwing hardware at the problem, performance can be solved (oh if only you would buy grid, or buy this, or buy that...) That helps for a little bit, but ultimately doesn't solve the problem.

If you're ETL routines (during peak operation) are not pushing the existing hardware beyond an average load of 60%, or are not sustaining a hardware load of 80%, then most likely the architectures of individual "data flows" are not correct, especially for big data.

Now wait a minute! You're telling me that just because I can, doesn't mean I should?
Yep... Too many tools make it too easy to "add another source, add another target, mix inserts, with updates, with deletes" all in the same data flow. This adds unwanted and unneeded complexity. Any time complexity enters, performance leaves.

I've blogged about this before... If you're headed down a highway towards a brick wall, and you refuse to change direction, what will happen? You'll hit the brick wall.

If you've got performance problems, and you refuse to change the architecture to try new things, you'll still have performance problems. There's only so much tweaking of knobs that can help performance, then it's all up to the architecture.

With Big Data & ETL, the first thing to realize is the mathematics behind the arrival of the data sets. The mistakes that ETL designers make are as follows: (these get more painful, the larger the data set, and / or the lower the latency of arrival of the data)

1) Incorporating Inserts, Updates, and Deletes in to the _same_ data flow / same process.
2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.
3) Targeting more than 1 or 2 target tables
4) moving rows that are too wide through a single process
5) loading very large data sets to targets WITH INDEXES ON
6) not running a cost-based optimizer in the database
7) not keeping statistics up to date in the database
8) not producing the correct indexes on the sources / lookups that need to be accessed
9) not purchasing enough RAM for the ETL server to house the RAM caches in memory.
10) running on a 32 bit environment which causes significant OS swapping to occur
11) running on a 32 bit environment which causes significant OS swapping to occur
12) running on a 32 bit environment which causes significant OS swapping to occur
13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance
14) believing that "I need to process all the data in one pass because it's the fastest way to do it." This is completely false, multi-passing the data can actually improve performance by orders of magnitude. IF parallelism can be increased.
15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.
16) "THINKING" in a transactional mode, rather than a batch mode, and processing each row, one row at a time (like they would code a cursor in a database language).
17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).

Oh, and did I mention running on a 32 bit environment which causes significant OS Swapping to occur?
Let me explain this one. Any single process started within a 32 bit environment is limited (immediately) to 2GB of addressable RAM, now - take away the RAM needed for the threads or the executable engine code... usually around 300M or more, leaving us with 1.5GB to 1.7GB of RAM to work with. Then, take away any "caching" objects that might use this RAM, and multiply that "drop" in RAM by the number of caching objects that you have....

Now, you are left with possibly 200MB, maybe 100MB of RAM left to allocate for "data reading/data writing"... and it drops from there.

Let me explain this other piece as well: Windows 32 bit OS MUST RUN CODE UNDER THE 640k BOUNDARY!!! So there's additional Limitations there, Windows 32 Bit PAGEFILE.SYS (swap/temp) is SINGLE THREADED, AND BLOCKS I/O OPERATIONS when swapping pieces to disk. Oh yes, one more: Windows 32 Bit - RAM allocation will ONLY give the application 1/2 of the requested RAM, and AUTOMATICALLY put 1/2 in the pagefile.sys swap area.

By the way, here's a tid-bit for you that you have to try to believe:
If you are running Windows 32 bit, SQLServer (either version)....
1. Change your ETL "block size" to 8k to match the Database (4k I think in SQLServer2000)
2. Change your row size to fit as many rows as possible into an 8k block, the more rows per block the faster the performance.
3. Change your commit point to 8,500 rows (this is the sweet spot)
4. Check in on NETWORK PACKET SIZES and increasing those between the ETL engine and the Database, increase them from 2k/4k to 8k - again to match the disk.
5. USE ODBC/OLE DB connectivity; do NOT use NATIVE libraries (except in SQLServer2005 direct CT-LIB). If your tool uses DB-LIB as "native" connectivity, it will be slow, slow slow - DB-LIB is 12 year old technology, and is NOT multi-threaded, nor parallel, where CT-LIB is.

Windows 64 Bit does NOT have these limitations, but requires 30% more hardware to run "cool" and efficiently, but it is FAST when installed and configured correctly on the right hardware.

Linux, and Unix do not have these limitations either.

So, architecture is everything. If you take the list I've published of what NOT to do, and change your architecture accordingly, you should see significant performance gains when running "the tuned data flow" by itself (as a test).

There's one more piece I'd like to share (for instance, why letting the DB handle errors slows down the processing).
Most ETL engines handle "blocks" of rows, which is why if you think transactionally, or architect transactionally (row by row), then performance will NEVER reach it's maximum. Let's say the ETL engine puts 150 rows in a block, and sends this block to the database. The database then caches this block until it reaches a commit point (in general).

The database is then "asked" to commit the rows it has cached (in TEMP mind you). The database then has to iterate all the rows in each block (using optimized bind array) and bounce the "rows in error" ONE AT A TIME back to the ETL engine. Now I ask you: is this transactional or "batch" oriented thinking? Right! Transactional... Of course it's going to be slow. Now, the other piece that happens is the ETL engine has to STOP PROCESSING NEW ROWS, Pick up the "rows in error" one at a time, and flow them out to disk.

We've just quadrupled the I/O for that row, decreased our performance by a factor of 4x (at least). Why? The row traveled from the ETL buffer to the network, over the network to the database, was cached in TEMP in the database, then was read from TEMP (possibly disk), checked against the real table (again disk), and bounced back over the network. The ETL engine at this point STOPS it's parallel processing of all kinds, just to handle the row in error, one at a time, and write it to local disk.

This same process is repeated whether we load direct from ETL or we load from a database Loader. The process is worse when we mix inserts with updates with deletes, why? because the database can no longer use block style bind arrays, it now must process each row in it's own individual statement (transactionally based).

So already you've lost 4x to 10x the performance you COULD be achieving, which means that a "data flow" that COULD execute at 80,000 rows per second now executes at 8,000 rows per second. A significant price to pay when dealing with large data sets.

The math:
800 million rows / 80,000 rows per second = 2.77 hours (about)
800 million rows / 8,000 rows per second = 27.77 hours (about)

A huge performance drain. Again, performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines.

Thoughts? Comments? I'd love to hear from you.

Cheers,
Dan L

  Posted by Dan Linstedt at 6:14 AM | | Comments (5)


September 17, 2007

Applications In General: VLDW and Machine Play

I've got this series going on VLDW, and I'd like to continue it for a little bit longer. In this entry I'll dive into different types of applications in a VLDW environment, and their impact on the machines underneath. There are indeed impacts to large data sets passing through machines with specific types of applications (including database engines) enclosed within. What some people tend to forget is that Volume & Latency change everything. It is absolutely vital when performance and tuning for extremely large data that sight of the applications on the hardware is not lost.

The causes of problems with performance on large scale data sets or high volume combined with low latency are as follows:

1. Inter-application MIXED WORKLOAD
2. Dedication of hardware "in theory is shared" In reality, is fighting for resources
3. When it comes time for a SINGLE application to do its job, other applications have 60% or more of the resources already tied up.

I can't begin to tell you how many times I walk in to a client that has an a web server running on the same machine as their database engine, and then complains about performance because of volume in the database. This is very problematic. Their reasoning usually is: "well, when we installed it, it wasn't a problem..." or "we don't have very many users on the web-server." or "we don't seem to think that's the issue." Or my personal favorite: "The vendor provided us with a web server; we just wanted to take advantage of it."

A prime example of mixed-workload and application clash is going on here. Web servers within the RDBMS engines are there so that "specific" instances of the database can be tied, with security to the web-server front-ends. This is a wonderful feature, until you have volume to deal with. NEVER in a volume solution should an RDBMS engine (responsible for managing the volume) be tied to a web-server component. IF any of the web-components are used, they should only be used to import/export data (i.e. XML).

Workload clash across applications can KILL performance right off the bat. Furthermore, do not use web-services directly on top of the RDBMS engine, why? The hardware isn't typically scaled to do so.

I'll give you an example of a system I am working on right now (I can't disclose the customer name, nor the performance I'm seeing due to legal licensing agreements).

I have a couple SQLServer2005 instances on high-end dual core machines with 8 GB RAM, 32 bit Windows OS, and 32 Bit SQLServer. I've got a Data Vault data model under the covers as a data warehouse, and have built over 1 TB in test data on a single box, about 3 TB across multiple boxes, and have a separate web-server (also on Windows, we are running *.NET) which performs extremely fast, I can query the data sets of 1 Billion rows (joined to 300 million rows), and produce a result set of 100,000 rows very quickly.

I have a huge load on SQLServer, 1 TB per machine, Raid 0+1, internal SCSI, 146GB Hot Swap drives at 10,000 RPM. The performance is phenomenal.

Why? Because I've cleared the machines from doing _anything_ except serving data. These machines are highly focused, highly tuned and have nothing but the OS, and SQLServer on them to run and process data sets. If I through a web-server on the RDBMS engine itself, or if I turned on "web-services processing/email processing", or used "CLR" or in-database stored procedure logic, my performance would slow down by factors of 4x to 10x or more.

Do not let your applications "cloud" the performance of your RDBMS engines. You've paid good money for licenses to your database engines and by now if you haven't guessed it: separating (divide and conquer) the applications across machines, and improving network speed will help improve parallelism, tuning and functionality. These RDBMS engines really need to be on their own box. This is just one of the reasons why such high performance can be seen from APPLIANCE vendors. They LOCK the hardware down (in most cases) to perform data servicing and management functions, they also provide additional and sometimes proprietary hardware to improve speed even further.

1. Web servers are the number one performance killers of VLDW / VLDB RDBMS engines.
2. OLTP applications are the number two performance killers of VLDW (data warehousing specific) RDBMS engines.
3. XML In and Out can be a performance killer of structured text loads. Try pre-processing the XML, normalizing or fully denormalizing "flattening" it.
4. Stored procedures, functions, and "programmatic" solutions like triggers, and logic within the RDBMS can be performance killers. The database engines MUST be clean, and simple - used to house the structures only. Keep your application logic in the application layers where it belongs. After all, the acronym RDBMS stands for "Relational Database Management System", not "relational application logic data manipulation system".... I realize that some databases (like Teradata) provide for high degrees of parallelism and application logic within the database. They go through great pains in engineering to make this work, hence the high cost, but I still say: processing should not necessarily happen within the database UNLESS it can happen on a "block by block" basis in parallel with linear scalability.

Whoa, hold-up, what are you saying?
Ok, here's where part of the rubber hits the road. Most programmers write logic this way:

(open a cursor of some sort)
(like a macro in excel)

This is SINGLE ROW PROCESSING, and is extremely slow in huge volume environments. Remember the performance numbers I spoke of (in terms of rows-per-second) that you want to have in large environments? 120,000 to 160,000 rows per second in processing time... To my knowledge, writing any program this way (as stated above), in a row-by-row processing loop will never perform as fast as block style (see below)...

The above code re-written in block style:

1. INSERT INTO
Select
where....

2. INSERT INTO
select
from temp table #1
where...

3. UPDATE
where

4. INSERT INTO
select
from Temp table #2
where ...

This is block style programming, you're telling the database to deal with BLOCKS of rows in RAM rather than a single row at a time. The performance factor of block style is 10x to 25x faster than single row hits, ESPECIALLY if the database is tuned for parallelism and partitioning, and has enough hardware to perform appropriately.

If however, the database does NOT have enough hardware, OR it is not tuned properly, block style processing might "never finish" where looping (while slow) actually completes. The looping process in this instance may run at 600 rows per second, at best it might run at 1200 rows per second, but no where near the 120,000 rows per second of the block style.

Keep in mind that block style processing requires LOTS of use of TEMP/SWAP space.

Ok, I digress... back to the point:
* DO NOT mix workload on RDBMS engines in volume solutions
* DO: divide and conquer, separate the application spaces from each other, and from their intended purpose... besides: how much can it cost to get a Linux box, or a new windows machine just for web-services logic? Compared to how long you're currently waiting for your data to complete its processing?
* MOVE the application logic BACK in to the application, and out of the database, let the ETL/ELT tools do their jobs in high speed/parallel formats.
* Partition and tune your database, buy the hardware needed for the RDBMS engine to breathe.

Hope this helps,
Dan Linstedt

  Posted by Dan Linstedt at 5:10 AM | | Comments (0)


September 12, 2007

Database Specifics: VLDW & Switches, making it work

So just how do you get a particular Database to work for you in a growing data environment? There must be something you can do right? Yes, there are certain switches, and specific knowledge which you need to focus on in order to get the RDBMS engines to scale. This entry is all about those specific databases, and their switches. While unfortunately (due to legal ramifications) I cannot discuss publicly the performance numbers (it's all in the fine-print of the licensing agreements they have), I can discuss the switches that make things fast. Remember, that unless you have the proper hardware, and the right architecture, these switch settings won't make a difference. In fact, they may even slow your system down.

There are TWO fundamental tenants to making ANY system perform:
1. Increase parallelism (or the opportunity for parallelism)
2. Decrease the data set that you are working with or on.

Ok - the third, is to do both of the above. This is where the orders of magnitude GAIN is achieved, and yes - this is AFTER you have already obtained the desired I/O performance from your system.

Yea, man... So what... So how do I get my system to work properly?
In short: hire a very knowledgable DBA, their rare and hard to find (and usually aged to some degree). (just kidding).

Ok, back to the REAL issues:

Oracle 8 & Oracle 9i
1. upgrade to 10g to get the optimizer features (which they did a lot of work on)
2. Alter the default parameters for the ORACLE instance see the following references:
http://www.oracle-base.com/articles/10g/PerformanceTuningEnhancements10g.php
http://www.dbazine.com/olc/olc-articles/ault7

Oracle default parameters on INSTALL look at the "whole machine", but the installer does NOT run any type of machine performance testing. It simply counts the number of CPU's and sets the Number of CPU's available to the MAX number of CPU's on the box. It also sums up the total amount of RAM available, and sets the SGA and total RAM to the total RAM on the box.

What does this mean? It means that when the instance is pushed to 80% or more of utilization, it will take ALL the resources available on the box. What Oracle doesn't publish is that it's core routines change the "niceness" prioritization of it's core routines, and push the Operating System out of the way... This can cause machine hard-lock and re-boot.

I will suggest that the previous post covers pretty much all databases in general, and what you should do with the RAM & CPU and Instance settings for Oracle, and other RDBMS engines.
3. The other parameters you should be tweaking include: # of read-ahead buffers, SGA, Sort Area Size, Bitmap Area Size, Parallelism, Size of Parallel Packets, Number of Log Threads, Number of I/O buffers, Number of I/O threads, where the LOG/TEMP files sit in relation to the Data and Indexing. Think about the partitioning of the tables, max degrees of parallelism for the queries and how these queries work.
For instance: Oracle 32 bit or 64 bit on a 32 bit windows platform (OS) is limited to a maximum parallelism of 4. This is severe in terms of handling large data sets. You need a lot more parallelism from the query set in order to make this work properly within big data. Recommendation: move to LINUX or UNIX with Oracle, and increase the parallelism to 8, 10, or 12 for starters.
4. Move from 32 bit Oracle to 64 bit Oracle. Most definately, take advantage of BIG RAM (buy more RAM for your box).
5. If you can, LOCK the SGA in to RAM, refuse to allow the RAM to be swapped to disk under any circumstance. This can kill performance
6. Put the TEMP space on a RAM DISK (screaming performance), but yet again, you have to buy more RAM.
7. Change the BLOCK SIZE. Enterprise Data Warehouses should use a MINIMUM block size of 32k. EDW's today, should actually ALWAYS be allocated at 64k, however once again, windows & oracle maximum block size is around 8k (due to windows 32 bit limitations - not sure what the limitation (if any) is in 64 bit).
8. use Oracle Bitmap indexing WHENEVER POSSIBLE.
9. Partition your tables, do it, do it, do it... Manage the partitions
10. DE-FRAGMENT, de-frag, de-frag. For any database, this can be a performance killer - especially in big systems. The complaint usually is: "Defragging always takes a long time" Well, if you defrag more often, there is "less" defragging to do, and it will always take "less" time.
11. Turn on Query Re-Write
12. Increase the number of "estimated plans" that Oracle is allowed to use to make it run fast.
13. Read, read, read, there are hundreds of knobs to turn, and even more opinions on which ones to use...
14. RECOMPILE YOUR VIEWS, after every major data change, recompile your views - it takes a fraction of a second, but will help performance tremendously.
15. PRE-EXTEND your file size BEFORE loading more data, this will help establish contiguous portions of your disk space, and will help manage/minimize fragmentation.
16. Put a single instance of Oracle on BIG IRON. Oracle does best when it has at least 10 to 12 CPU's and 16 to 64 GB of RAM to work with. Especially with Big Data, and it SCREAMS on LINUX with super-fast DASD in 64 bit mode.
13. DO NOT CLUSTER!!!
14. use RAW format as a last resort, Raid 0+1 should be good enough
15. SWITCH TO COST BASED OPTIMIZATION!!!!!!
16. Increase the TDU and SDU sizing in the Oracle Client that is doing the loading, but only for the "connection" which uses the Bulk Loaders or ETL engines that do the batch loading.

DB2 UDB - Upgrade to 9 if you're not there.
1. DO NOT USE GRID (YET) FOR HIGH SPEED LOADING, ONLY FOR HIGH SPEED QUERYING. They (IBM) are currently working on resolving some issues with loading in parallel across grid machines, they _are_ being dilligent about their efforts. I expect to hear that they've solved this problem shortly.
2. Super-Fast I/O Disk (400MB to 500MB per second)
3. Use LINUX 64 bit, tune the switches for Linux.
4. Put it on a Z-series or P-Series machine to make it scream
5. Setup huge block sizes
6. setup RAID 0+1, do NOT use Raid 5 (without enough Disk RAM CACHING to acheive proper throughput)
7. Use Bitmap indexing where possible
8. PARTITION, PARTITION, PARTITION
9. Provide enough CPU's to run queries in Parallel (usually 6 to 8 CPU's, high speed, multi-core, 64 bit).
10. IF YOU HAVE A GRID SYSTEM (MPP aka: DPF), then Load to ONE NODE, then use their buffering technology to disperse the data inside the grid. There are special INSERT INTO buffering commands that can only be issued from the control node.

SQLServer 2000: UPGRADE, UPGRADE, UPGRADE!!! 2005 IS A BLAST, and REALLY is fast.
1. You cannot afford to stay on SQLServer2000, the performance gains from 2005 are enough to make your mouth water. SQLServer has finally grown up. They still need some additional engineering work to make partitioning easier (configuring the partitions within a table IS A CHALLENGE), but if you can make it work, it's awesome.
2. NORMALIZE, NORMALIZE, NORMALIZE. Stick with normalization of your tables, take advantage of Clustered index joins across SURROGATE sequence numbers. SCREAMING performance. Denormalizing in SQLServer2005 is the kiss of slow performance. Why? Because SQLServer 32 bit block size is locked in to 8k. This is a severe performance inhibitor. As I said earlier in the Oracle section, all RDBMS engines for Warehousing should operate on no-less than 32k block sizes.
3. move to dual-core CPU's at 3.0 GHZ with high speed internal disk (10,000 RPM a pop), with Raid 0+1, and if you can get it - move to 64 bit, with Windows OS 64 bit - but beware: UPGRADE THE CPU & THE RAM, otherwise your system _will_ slow down.
4. USE FOREIGN KEYS instead of separate indexes for your joins. Foreign keys in SQLServer2005 are much faster. They've improved the optimizer quite a bit.
5. Increase the NETWORK PACKET SIZES on the server, and on DTS, BCP, SSIS, and other client applications. Be-warned: only do this for LOADING mechanisms. Increasing packet sizes for Reporting tools or BI Queries will not only take your database down and will eat memory, but may also take down the client software. Only the bulk-loading mechanisms should request a large packet size. Recommended packet sizes is 32k & 64k
6. Increase the "query reserved for client", but ONLY while loading - once loading, reset it back to 8k. Why? Because _every_ client that connects will receive these chunks of memory, and maybe more, especially if they connect more than once, or run multiple threads.
7. Recompile your views, after every load
8. Use clustered indexes, but ONLY on sequence numbers (identity columns)
9. Use fill-factors of 90% or higher for indexes, or data blocks
10. PRE-Allocate the DISK space before loading, do NOT allow the data files to self-extend, this can cause severe fragmentation.
11. Manage fragmentation
12. Move to COST BASE OPTIMIZATION
13. Get RID of all the hints in the SQL statements (you don't need most of them in SQLServer2005).
14. REMOVE ALL OTHER APPLICATIONS from the SQLServer Machine, leaving only the database on that machine.
15. TURN DOWN the amount of memory that SQLServer is allowed to take, to 75% or slightly more (80%), leaving the OS room to breathe.
16. Turn on USE FIBERS (unless you are "remote linking servers")
17. Do NOT remote link servers, this is a HUGE performance detriment.
18. Turn up "CPU priority"
19. DURING LOAD: Change the Recovery Model to BULK, turn on Select Into/BulkCopy, Turn on "Trunc. Log on Chkpt" then reverse these settings once you're finished loading.

Teradata:
1. UNDERSTAND HOW YOUR DATA MUST BE EVENLY SPREAD OUT ACROSS THE NODES, once even distribution has been achieved, the only things left to tune are:
2. How many nodes? how old/new? how fast are they?
3. PRIORITY of execution by user, and types of queries
4. Logging mechanisms of the table for redundancies.

Sybase is mostly the same as SQLServer

Hope this helps somewhat, have ideas? Please share them.

Dan Linstedt
http://www.COBICC.org

  Posted by Dan Linstedt at 7:20 PM | | Comments (2)


September 9, 2007

Databases and VLDW: Petabyte Scalability

If you've been following along, you've noticed that I've been writing about Terabyte (50+) to Petabyte levels. In this entry, it's no different. I'll discuss database engines generically, and their usage as a VLDB component. I'll use another entry to discuss appliances (separately) that contain embedded database engines. Databases all have their quirks and notions, but there are some generalities that simply stick - that you cannot ignore, that most people have ignored over the years... I'll try to dispell these false-hoods, and bring the truth to the table.

What are the issues around BIG DATA and RDBMS engines, especially when it comes to VLDW (Data warehousing)?

1. Most database engines do _not_ support temporal mathematics within their query optimizers. The issue? EDW's contain TEMPORAL data!!
2. Most database engines do _not_ work well with NATURAL BUSINESS KEYS. The issue? Data warehousing SHOULD be all about the business and the business keys.
3. Most database engines perform in-memory caching. The issue? They cannot handle the volume of data that needs to be "compressed/cached/stored/referenced" in RAM when we get into VLDW.
4. Most database engines were WRITTEN (core engine that is) for OLTP (transaction processing), with EDW's as an after-thought, an add-on to meet growing demand. The issue? This _severely_ limits performance of the optimizers, and often lacks parallelism when needed by the queries.
5. Most database engines CANNOT handle FAST/PARALLEL updates. This is a REQUIREMENT of a star-schema, or any "data distribution system", or ODS. (see my comment about deletes, I'm not talking about one single mass-update...)
6. Most database engines CANNOT perform FAST/PARALLEL deletes. I'm not talking about a single grouped delete statement, I'm talking about 100,000 individual row specificiation deletes - there's no "fast interface" in which to QUEUE the data set/keys/where clauses, and run these in parallel.
7. Most database engines DO NOT provide an EASY way to compare column values on a row by row basis IN BULK, FAST LOADING. The don't even address duplicates on the incomming stream... The problem? 90% or better of the incomming data within a VLDW usually HAS duplicates, even if the source system/source CDC engine told us that there's an update.
8. Most database engines DO NOT provide row-based temporal compression/expansion, especially while keeping the data set useful. The problem? Very large data sets can usually be compressed (according to flat-file compression rates I get on incomming sets) anywhere between 60% and 90% for structured data. This could significantly improve performance.
9. Most database engines DO NOT tell you that the DISK I/O performance is too slow.
10. Most DBMS engines DO NOT automatically select "parallel paths" to the disk, nor do they (when the install) profile the system enough to figure out WHERE to put logs/temp and data.

* Appliances answer some of these questions for us... Others are "managing persons" related...

Ok, common operator/installation flaws or mistakes we make...

1. Allowing Default Installation parameters to RULE the instance. If you have 40GB of RAM, and 18 CPU's when you install, the RDBMS engine installer will setup to use 100% of the available resources IF or WHEN pushed to the load limit. The PROBLEM with this, is that there is NO ROOM for the operating system to breathe under load, and when your in a VLDW, you've always got LOAD!! (usually). The solution? Set the database to be 1/2 or possibly 80% of the available resources. ALWAYS leave 20% available for other applications INCLUDING the operating system. THEN tune the database to effectively use all 100% of that "80%" of the hardware when pushed to operating limits.

2. Installing "more than one" instance of a single database engine on an underpowered server. Typically too many people make this mistake in the name of "recovery" or "fault-tolerance." This is a problem, it not only eats up the resources, but does NOT handle fault-tolerance very well. Why? because what usually fails are the hardware/networking layers, or the CPU's quickly become over-loaded and rebooting the machine is necessary. PLEASE PLEASE PLEASE in a VLDW system, Consolodate the "instances" to a SINGLE instance, multiple databases, and manage a core set of services. The exceptions to this rule are: Teradata, and IBM DB2 which is smart enough to know to launch only ONE set of core-operating services.

What happens when I make this mistake?
(when pushed to limits, or operating at 80% or above in capacity) The RDBMS engines:
a. begin to fight with each other (see #1 on the list), thus starting "b" below...
b. can and WILL re-prioritize their CORE services "nice-ness" and WILL take over the machine, pushing even the operating system OUT of execution range. This results in a hard-lock and cold-boot...
c. No-one gets the "most" of the resources available, parallelism is severly damaged and cannot be tuned properly for any instance, the hardware (even a large box like I described above) quickly is over-whelmed.
d. Large data sets require LARGE RAM CACHES and BUFFERING MECHANISMS. The database cannot properly "load-balance" queries, query types, and data retrievals (I/O channels) if too many instances are running.

3. We fail to PROPERLY tune the database for the hardware we're on, and we do not make allowances for network speeds, disk speeds, buffering, network traffic, and so on. The result is: the longer this goes, the more "fragmentation" this leads to in our VLDW data set. Fragmentation of the queries, fragmentation of the data in RAM, non-parallel operations, and non-optimized operations (more blocking I/O operations). Properly tuning requires an intimate knowledge of the hardware platform on which our RDBMS engines sit, along with base-line performance testing.

4. We fail to setup parallelism and partitioning according to time or nodes. In VLDW, the only way to manage data sets effectively is to partition by time (that is "load-time" or "load cycle number"), or even distribution by available MPP nodes. If you are running in an MPP environment, then data proximity and layout across the MPP nodes is critical in accordance with the MOST USED join columns, or better yet - an even distribution across parallel MPP nodes for optimal parallelism in any query. Why? because the join technology can be orchestrated through alternative means (like Join Indexes).

5. We under-power the hardware and expect the RDBMS engine to perform. RDBMS engines NEED LARGE MEMORY CACHES, and BIG BUFFERS to perform decently well in terabyte sized solutions. DON'T put a Windows Dual-Core server, 32 bit with 2 GB RAM in place with Oracle 9i and then try to place 30TB of data into the database and USE it as a VLDW... It just simply won't scale. On the flip side, DON'T count out the Windows 64 bit OS, and SQLServer 2005 64 bit system, they can run SUPER FAST if tuned properly, and if the hardware is there to support it.

6. Ok, we place 32 bit systems in play to do what really requires a 64 bit system to run with.

7. We place the Logs / temp / and data and indexes on the same DISK ARRAY, then we further damage ourselves by using RAID 5. Then we further damage ourselves by adding CLIENT TRAFFIC to the disk device, or even the network under which our database REQUIRES dedicated network access to the disk device. Oh yes, it gets' worse... We sometimes outsource our disk arrays to a "storage farm" and don't hold the outsourcers to an SLA for accurate performance, then they put our critical RDBMS engines on a SHARED DISK with other companies, and the network traffic has to load balance on the way in and out.

The real problem? All of these things limit parallelism... Once parallelism is limited, the performance drops like a rock in water. Factors of 4x, 6x, 12x, all the way to 100x, 200x performance is negatively affected. In VLDW solutions, you CANNOT "absolutely CANNOT" limit parallelism in any way.

What about the settings in the RDBMS?
That's for another entry. There are tons of common mistakes we make when we tweak databases without an understanding of VLDW (50+ Terabytes). Chances are, if you make these changes EVEN TO YOUR 100GB RDBMS you'll see tremendous performance improvements without changing hardware.

Comments? Thoughts? Did I leave something out? Please post your ideas.

Thank-you kindly,
Daniel Linstedt
Come get a Masters of Science in BI and DW at Daniels College of Business, DU.
http://www.COBICC.org

  Posted by Dan Linstedt at 5:25 AM | | Comments (0)


September 8, 2007

Disk and VLDW: What you need, can use

In very large data warehousing, or VLDB for that matter, I am constantly asked: what kind of disk should I have? can I use a SAN, how about a NASD, what about DASD? I have RAID 5, is that good? Now there is Raid 7, 5+, S, 10, and so on. There are differences that DO make a performance difference in the disk that you are using, and when you are dealing with very large data sets you MUST have throughput. This is the optimal end-game.

The answers are quite simple really: Faster the better, but a minimum throughput of 300MB to 400 MB per second is required. (This was 3 years ago!) Today, I would suggest that 400 MB to 500 MB per second is better, faster the better.

Now that said, if your disk cannot achieve these throughput levels then it's
a) time to buy a new disk device
b) time to reconfigure your existing disk device
c) time to re-work the storage array and how it's attached to the server
d) time to add RAM cache to the storage device

and so on. Unitl you can reach these levels of throughput it, performance will be difficult if not impossible to achieve in a 30+ or even 100+ Terabyte system, Did you know that CERN produces 100TB of data every time they smash an atom? That's an interesting tid-bit to chew on, and to top it off, they capture it all....

Ok, so what do you need in your disk device?
1. BIG RAM CACHE, you should have write-through buffers, balanced load algorithms, smart-caching across the I/O channels, and fully controlled storage arrays.
2. High speed platters
3. RAID 0+1 should be used, Or better yet: Database RAW format (only for the fastest MPP databases on the planet). DO NOT use RAID 5 and expect performance UNLESS the disk device has on-board RAM cache, and can reach the throughput speeds mentioned above.
4. Multiple I/O channels working in parallel.
5. DEDICATED DISK ARRAY to SERVER TRAFFIC ONLY, there should NEVER be any client traffic on the network, or on the disk array when expecting performance from large volumes of data. YES: DEDICATED!!!!! Again, client traffic is a) too sporatic, b) typically used to "backup and restore local windows disks" which could be gigabytes at a time, and this is painful... c) drops performance of your high speed disk array by factors of 4x, 6x, and 10x.
6. DO NOT USE A SHARED, OUTSOURCED, DISK ARRAY FOR HIGH-SPEED, HIGH-VOLUME OPERATIONS, UNLESS: it's hosted on a garaunteed VPN, and the device is DEDICATED to your servers, and they can PROVE the performance on the transfer rates. Hosted I/O solutions in a HUGE environment usually are detrimental to performance by factors of 12x, 14x and 20x.

I don't mind whether it's SAN, NASD, DASD, but I will say this: Internal disks are fastest, next up are DASD (this is the preferred choice in the HUGE volumes), then NASD and SAN IF the network is VPN direct to the server, and has garaunteed throughput.

If you are working with a storage hosting vendor, then ensure you have an SLA in place for garaunteed throughput, then ask to see the throughput test results on a bi-monthly basis. This will keep them honest. I've been in places where the hosting service will "move" your data around to different disk arrays in their system based on their own internal needs. I've also been in places where they will _not_ garauntee dedicated access nor will they garauntee exclusive access to disk devices.

Think about it, if you're willing to spend that much money to have a high volume solution, shouldn't you be protecting yourself and getting your money's worth?

Love to hear your thoughts,
Dan Linstedt

  Posted by Dan Linstedt at 6:45 AM | | Comments (0)


September 6, 2007

Operating Systems: What you need for VLDW

So you read my post on Hardware right? If not, take a gander at it... I'd like to think it's mostly complete. This entry focuses on Operating Systems, what the machine needs to work under severe volume loads, and how the operating systems react. In the near future I'll have a post on Applications including ETL / ELT, and Database engines, and then I'll move back in to the business side: skill sets, requirements, defining/gathering/estimating, etc.. All of these things are items that I teach at TDWI in VLDW. I try to keep it fresh. This however, is a short posting.

Operating Systems require "room to breathe." And in high-volume and/or low-latency situations (like near-real-time), OS's are often left in the dust when considering the "operating" aspects of the system. i.e. how a system actually works. People tend to forget that an operating system can make or break the actual application layer on top.

Let's take a look at some of these systems and how they work (in general), then we'll see how to help them work better with very large data warehouses.

1) Mainframe type OS: Very good at high speed computation, extremely good at parallelism, and partitioning of the work, multi-threading operations based on complex algorithms built in to the firm-ware, even when there is no "multi-threading" in the applications. Extremely high-speed hardware bus underneath to support parallel applications. These Operating Systems typically run each application in their own "slice" of memory, CPU, and disk resources. They time-slice accurately and efficiently. The downside? High engineering costs, mean high costs to consumers. You're paying for all this fancy engineering. The upside? Things like LPAR and VPAR abilities (Logical and virtual partitioning) meaning you can split the machine into multiple components, and say: install windows on one partition, AIX on another, and "Mainframe/cobol" code on a third.

They can all talk to each other without ever hitting the network, they can "share" memory across boundaries, and are extremely fast and reliable.

Usually, what you get out of the box is not necessarily "tunable" per-say, and is already pretty high quality, and pretty fast.

2) Unix OS systems: Vary depending on the manufacturer, but there are three or four major players: IBM (AIX), HP (UX), SUN (Solaris), SCO (unix). These systems are highly tunable with many different knobs to tweak depending on the hardware they reside on. The largest thing to remember here is that HPUX and AIX and Solaris all manage memory differently, they also handle swap space and threading differently at the core-level. This means you'll get different levels of performance depending on the actual hardware configuration. With HP you might want a SUPERDOME with MPP built in to the scalable chip-set. With Solaris, you want a similar machine, with IBM AIX, you might even consider a P-series or Z-Series system, also capable of running Zos underneath with LPAR's and VPAR's.

The largest thing to remember about UNIX OS systems is the delegation of SWAP or TEMP space. It's all about the DISK I/O throughput at that point, and in each unix (except AIX) you MUST set the space available to 1.5x to 2x the size of RAM. So if I have 16GB of RAM, I'd really want: 32GB of temp or swap, so that disk fragmentation is kept to a minimum, and multiple threads don't constantly run at 80% of the machine resources. This is called "run-cool" setup and will serve you well.

3) Linux OS systems, are very similar to UNIX and have multiple knobs for tuning and tweaking, it's best to call in an expert for both Unix and Linux to tune to specific hardware platforms, or the OS simply will not perform _as well as it could_ in those circumstances. (same thing applies to swap/temp).

4) Windows OS systems, 32-bit, should not be utilized for VLDW systems. Why? Because of several reasons: 1) PageFile.sys (swap/temp) is SINGLE THREADED BLOCKING I/O. 2) ALL code in Windows 32 bit MUST execute under the 640k boundary!! Required!! 3) multi-TASKING is available, yes, but true multi-threading is not. Example: launch Windows Media Player, start one of your favorite songs, then launch Outlook, two or three good sized word documents, an Excel Document, and to top it off, start a "download" of a significant sized file. What happens to the playback of Windows Media Player? It skips, even if it's made high priority... This is the result of 1, 2, and 3 working against you. Imagine what happens in a VLDW environment...

Now: Windows 64 bit, yes - good choice, with SQLServer 2005 64 bit... BUT you MUST MUST MUST give it enough hardware to perform properly. Without the right hardware, it will only execute SLOWER than the 32 bit systems.

Cheers for now,
Dan L

  Posted by Dan Linstedt at 9:35 PM | | Comments (0)


Hardware: What do you need for VLDW?

In this entry we'll explore the requirements of VLDW, what you might need to build one, or to grow it going forward. Don't forget that everyone's definition of VLDW is different depending on the starting point. No matter if you're scaling from 30GB to 1 TB, or from 50 TB to 1 PB, these guidelines and simple pieces should help you move forward. I guess you could treat some of this as a "score-card" for your environment. This is a HARDWARE LOOK at what you need. In the future I'll blog on the software requirements, and then finally, I'll address appliances.

In this entry we'll explore the requirements of VLDW, what you might need to build one, or to grow it going forward. Don't forget that everyone's definition of VLDW is different depending on the starting point. No matter if you're scaling from 30GB to 1 TB, or from 50 TB to 1 PB, these guidelines and simple pieces should help you move forward. I guess you could treat some of this as a "score-card" for your environment. This is a HARDWARE LOOK at what you need. In the future I'll blog on the software requirements, and then finally, I'll address appliances.

What about the basics?
The basics are easy; I'd suggest you look at the following landscape of options:

1) High speed disk should be transferring data to / from disk (any disk) at I/O rates of 300 MB to 400 MB per second. Without this magic number, the whole scalability effort is sunk right out of the gate.

2) High speed VPN server to server networking, and Server to Disk Networking. Gigabit Ethernet is about as fast as you can get today, the faster the better. BUT the mistake that's often made is to put the DISK device on a common shared network, and then wire it to servers, and THEN expect it to perform. WRONG! SORRY! That simply won't work. You cannot achieve scalability to the Petabyte Levels with this approach. You _might_ make it to 20 to 40 TB before you see problems, but even then.... ISOLATE THE NETWORK TRAFFIC, Big data costs big money, if it's not bringing BIG BUSINESS VALUE in it's return, then DUMP some of the data, reduce the data set size until value can be achieved or justified.

3) RAM CACHES on Disk Devices, double buffering the I/O across the network channels provide all the performance you need - again FOCUS on the throughput numbers on an average day with average load on your Disk Devices. 300MB to 400MB per second, no less. MORE is better in this case. But if your disk device does not house a large RAM cache (to buffer you from the impact of writes) then you WILL see performance problems.

4) Network, network, network: VPN everywhere. Ensure dedicated network cards between the servers and the disk arrays, and server to server communication on a separate VPN network connection. The more "segmentation" the higher the degree of parallelism across each VPN networked devices. The more segmentation, the faster the processing, the easier it is to route correct network traffic for directed / specific tasks. If you are using Optical to reach the DISK, then ENSURE that there are enough optical up-links to handle the performance throughput requirements. Again, it's _all_ about the performance in volume situations.

5) CPU's and RAM... Ensure that you have 1.2x the amount of RAM (multiplied by) the number of CPU Core's on the machine. So If I have 2, dual-core CPU's that’s 4 cpu cores, I would need at LEAST 1.2 x 4 = 4.8 GB of RAM JUST to handle the executing code on the CPU's and maximize parallelism and threading on the CPU cores. I like to say, at the maximum: 2x cpu cores is what I run with. If I have 32 cores, I like to have 64GB of RAM just to handle the operations of the CPU's. People often forget that CPU's are constantly swapping threads in & out of RAM, and NEED absolutely NEED RAM to do this. Now, about on-board caches... Level 1 caches are most important, they are closest to the CPU, and THE LARGER THE BETTER!!! Level 1 caches are expensive due to miniaturization. Level 2 Caches are next in line, THE LARGER THE BETTER - this will help the multi-threading, and reduce execution cycles, particularly useful in Database Joins, string matches (where clauses), Transformation operations, etc...

6) RAM & Network Packet Sizes. If POSSIBLE, then increase the network packet sizes, look for network cards with high levels of "RAM CACHE" and feature sets like auto-compression, and double buffering. This means faster transfer rates over the Giga ether cables. Increasing the network packet sizes means that the network card itself will build "larger" packets, and hopefully with compression and double buffering you'll see more data transferred faster with less actual network traffic.

7) WATCH THE BOOT LOGS FOR HARDWARE ERRORS. Often times, administrators don't have time to construct the proper script to look at the boot logs for RAM, CPU, Device, and Network Errors. But this is CRITICAL, it can be a significant and negative impact on the performance. Especially bad checksums in the RAM chips, this can cause erroneous errors, AND performance degradation by a factor of 4x to 20x.

If you have some hardware tips to add to this list, I'd love to hear from you. Let's build some knowledge here.

Cheers,
Dan Linstedt

  Posted by Dan Linstedt at 10:53 AM | | Comments (0)