Blog: Dan E. Linstedt« August 2007 | Main | October 2007 » September 29, 2007Normalizing / Denormalizing: VLDW & ArchitectureI've blogged a little bit about this before. In this entry we'll explore the mathematics side of normalization and denormalization in the data model. Of course, this entry is also specifically targeted at very large data sets. On small systems that can't handle the mathematics, performance degrades to a certain degree. There's a rumor out there, an un-truth, a fallacy that states: if you're having performance problems in your database, simply denormalize (flatten the tables, remove the joins). This works, but only to a point. There is a physical upper limit to denormalization where mathematically it simply stops working, and performance degrades on an exponential scale. That magical point is where the width of the row increases so much, that it forces multiple blocks (chained together) to meet the needs of queries. That said: VLDW databases should ALWAYS have 64k block sizes (bigger if the hardware can handle it). Especially if the rows are 4k each (in width). If you're dealing with uni-code data, then you have wide rows (2 bytes per character) which make it difficult to survive without 64k block sizes. Let's take a look at the mathematics: This is a nice even number, but none-the-less, if you are dealing with 60 million, or maybe 500 million rows, that's a lot of I/O's. Let's explore the I/O counts (simple mans math) just based on disk retrieval (ignoring network, RAM, caching, etc...) for a specific query. This query wants 10 million rows out of 500 million, there is no parallelism, and no partitioning on the table (the only DB that this doesn't work for is Teradata - it's 100% parallel all the time). Here's the math on I/O's: Now, just for simplicity sake, suppose we really needed only 800 bytes of each row, would it make a difference in the I/O's in a denormalized solution? NO. The 800 bytes are completely buried within each 4k row, so even with the 10M row return, the I/O's will hit the disk 625k times. Now, suppose you normalize the table, and you break it in to three groups of data: 800 bytes, 955 bytes, and 2245 bytes. Suppose you need only the first table now to answer the query... Here's the math: I don't know about you, but in my book: 125,000 is a lot less than 625,000 to get the same data... It's a multiplier to performance, magnitudes faster. Now suppose you needed ALL 4k to put the row back together, how does this work? Now, suppose I introduce a) parallel query, b) parallel join. How does the timing work? Parallelism doesn't necessarily reduce the I/O count, it does two different things: 1) it runs the I/O's at the same time, and 2) it reduces the amount of TIME to execute. So what happens when I denormalize? Why is this wrong? When chained rows are introduced, the data is denormalized beyond the boundaries of I/O execution time. This causes double, then triple, then quadruple I/O's compared to a normalized solution with parallel joins. Now, denormalization has broken the performance curve and the only way to get it back is to normalize the data AND introduce parallelism. Without parallelism, yes - performance of normalized data sets IS usually slower than denormalized data sets, but with parallelism - it will blow away any denormalized set at nearly any time - especially with a large set of data. Keep in mind, these mathematics only become a problem with larger sets of data. Usually anything under 10 million rows per table is not an issue (as a guideline, this varies from DB to DB, engine to engine). That is IF the block size is 8k, or even 16k. In LARGE systems, there are only two ways to improve performance: a) reduce the data set (in this case, normalize), b) increase parallelism (done through tuning and hardware capability). By decreasing the amount of data that each parallel thread has to deal with, we improve performance on a factored scale. Also be aware, that indexes are smaller and more nimble in normalized tables, and that indexes as well play by the same mathematics as the data set itself. NOTES OF INTEREST: Their secret sauce is in the hardware optimizations, and the disk layouts. The one problem (again) with denormalization, is mathematically there is _always_ an upper limit (to the amount of data) before performance degrades exponentially. Now, if we MODEL our data sets incorrectly in a normalized data warehouse, performance will also be inversely affected. There are some systems that can handle huge volumes extremely well, (MPP Systems that is), but the data model that is put in place severely limits the parallelism and also increases the width of the rows unnecessarily. Data Models are the next important key to performance success at HUGE volumes. I urge you to check out the Data Vault (Common Foundational Integrated Data Modeling Architecture) - which is built for Petabytes. It's an architectural solution that is FREE (just like 3rd normal form or star-schema) that solves performance and flexibility problems in data warehousing. You can find it at: http://www.DanLinstedt.com Cheers, September 25, 2007ETL/ELT: VLDW and Multi-passing dataIn my last entry I got caught up explaining how to reduce the data set and increase parallelism. That should provide a huge bang for the buck for most installations. Again, we must go against the grain of what we learn when we deal with very large volumes of data. What we find is that large data takes a specific architecture to solve these problems. In this entry we'll discuss the ability to multi-pass data sets rather than single pass data sets, and why this makes a difference in performance. The rumor, the un-truth, the fallacy: Well, this isn't always true - this often times leads to: These elements cause: Now, we all know that the shortest distance between two points is a straight line (in standard time/space continuum, with today’s mathematics). Let's assume for a minute that our data flow diagram is a straight line (well, at least that's how it starts). If we introduce too many transformations, too much complexity, too much caching, and too much processing we interrupt the flow of the straight line, and it now becomes a very crooked and jagged line. For that reason, maintenance costs increase, performance drops, and inevitably we reach a point where "more processing" is not feasible without re-architecture / simplifying the process. Well, the real truth (on today's hardware) is as follows: This usually requires a monster 64 bit box with 80+ GB of RAM, and 64+ CPU's at high speeds, but even then - the data set is likely to outgrow the hardware, such that one or more of the "caches" hits disk. Once this happens, all bets of this being the fastest architecture are off. Now, back to the analogy: what is the "simplest and fastest" data flow diagram we can ever build? Here's another question: in real-life, what do we do to solve BIG PROBLEMS? How do we get to multi-passing? Now also suppose that this table could not, nor would it ever have any sort of "trigger" to tell us what rows are new, and which (out of the existing 5.5 billion) have been changed on a daily basis. Also suppose that their were no hope of putting any sort of CDC engine on the mainframe... is this beginning to sound like your situation (except maybe for the volume of course?) The source table also, did NOT have any sort of date, but it did have a sequencer column. Ok - this is a real-live case that we had to deal with. The questions are: Using the sequencer we were able to determine the "last" sequence we pulled from the last successful feed, and only pull brand new rows from that point. But the updates were a night-mare. Why? we didn't have a "date" to work with, but even with a date on the rows, it still can be an issue. Let's make it a little easier and put an "updated_date" on the source table, suppose this "updated_date" is really "applied_date" - when the data applied in time, so when they updated the row, they may dictate that the data really "applies" to 2 years, or 5 years ago - now we have a problem. Even if you don't have these volumes you might have these issues. Now suppose that updates made up about 20 million rows of historical data, how do we best: Ok, first: We split the updates in to a multiple pass algorithm. Now for the math: Ok, assuming 8 bytes for the numbers, + 6 bytes for the unconverted date, our row was: 14 bytes wide for this particular part of the process, this process had block sizes of 256k within the transformation engine (32 bit system because that's all we had at the time - this was 10 years ago). TCP/IP Packet: 1k, fit about 73 rows per packet We scanned the source table: 5.5 billion rows on the mainframe - funnelling the data back as fast as the Mainframe could read rows, but only pulling those two columns (not the 4k for each row, because at this STEP we didn't need all the data). The sorted join was an INNER JOIN MATCH on sequence number, arriving in sorted order followed by a filter which eliminated rows from the stream that had matching update dates. Our "target" table was back on the mainframe, again consisted only of the sequence key. So we were writing only 8 bytes ber row which means our writer block size at 256k could now fit: 32,000 rows before being dumped to the main-frame. This was just the FIRST PASS, in the second pass we "joined" the new sequence table to the original 5.5 billion row table to pull JUST THE UPDATES (we dramatically reduced the data set, and found ALL the updates) the total number of updates on average per day was around 14 million rows, we then pulled only the columns we needed based on the PK sequence match and loaded them to a staging area in the target database. This was the second pass. The third pass compared columns in the staging area to the actual target table to determine the "REAL" updated rows. Each "pass" of the data was setup to increase parallelism and partitioning, I think our total job stream ran to completion in about 30 to 45 minutes on a daily basis... remember, this was 10 years ago on a 32 bit ETL machine (64 bit target database/64 bit target DB OS) and a mainframe. We had to do some tuning to the mainframe "CPU utilization rates", and some tuning to the target DB to get this to work, but it all worked. Why did 3 passes of the data set (some would say 4) work faster than if we had "scanned" the whole table ONCE? So by: a) reducing the data set and b) increasing the parallelism we significantly cut the run-time of this process. By the way, the original run-time of the single pass process was well over 14 hours - to try to handle updates, inserts, and "all the data" in a single pass. So the client was very happy to reduce 14 hours to 45 minutes... a huge time savings. Feel free to contact me, I'd love to hear your stories about situations like this. Thanks, September 24, 2007ETL/ELT: VLDW & Data Integration DetailsI've been blogging on this topic now for at least the past week. I've gotten some good feedback from a few of you out there in blog-land (thank-you). I've seen a couple questions on "how" to reduce the data set, and what I mean by "multi-passing" the data, and so on. Keep in mind that it's much easier to teach these subjects in classes, than it is to blog on them and get all the needed details put in place for everyone to read. In this blog I provide additional information on how to make performance work for large data sets and data integration. How do I reduce the data set? Mathematics provide the answer. So, that said: Suppose you have 150 columns, or even 360 columns (SAP / R3 feed for example).... Worse yet, suppose the row is 5 columns... What matters? The byte length of the row - that's what matters, not the number of columns. The first incorrect conclusion that experts tend to make, is simply to look at the number of columns. However, as a rule of thumb, USUALLY the higher the number of columns, the larger the row size. Ok, that said: suppose I have 150 columns that amount to a 4k row size. This is a WIDE ROW, and will cause performance issues to no end. Problems in the network, CPU, RAM, Disk, I/O, Swapping, Database, Application - you name it, it happens... For simplicity sake, if we focus on the database for a minute... Most database vendors install their database engines at a default 8k block size... Simple math says: I can fit 2 rows per block, that means: 2 rows per I/O on disk (simple-mans math, because I'm a simple man)... Now, suppose that row size is 4500 bytes... What happens to the rest of row 2 when inserting to the database?? It carries over to another block... Now, 2 rows = 2 I/O's (at a minimum). In some databases this is referred to as "block chaining" or "row chaining". This is a problem... But back to the focus... How large are most TCP/IP packets? In fact, what's the maximum size of a raw IP packet? Anyone? You might be surprised to know it's close to 1024 bytes, that's right: 1k. (Hint: this is one of the reasons that Network drives are slower than DASD, or Fibre Channel attached drives)... How many I/O's are needed to get 2 rows of 4500 bytes of data across to the disk? Again, in simple math: 9000 bytes = 9x1k packets = 9 I/O's over the network. Ouch. I/O's cost time, cost money, and can quickly and easily overflow the hardware... but wait - there's another monkey wrench in the works... Lack of parallelism (serialized operation). If your application (ETL tool?) sends the "block" of two rows over the network, most likely it does it in serial... HEY NETWORK: I HAVE 9k WORTH OF DATA, SEND IT TO THIS DISK AT THIS ADDRESS... Network: OK, 9 I/O's in serial... Processed. HEY DATABASE: COMMIT THESE ROWS TO THE DISK... Database: Ok... 2 I/O's on disk, that'll be 2 blocks please... And so on. Now, what can we do? We still need the 9k of data right? Right... But we have one piece left at our disposal: increase parallelism... How do we do that? Voila, we just increased the parallelism - wait a minute!! We ALSO reduced the data set! We receive a huge multiplier in performance _every time_ I guarantee it, as long as the hardware can handle the parallelism... Now wait a minute! Won't this run "slower" because now it has to read the same data 4 times? What about the amount of work over TCP/IP - isn't it the same number of packets? What about the target disk and database? Don't they have the same amount of work to do? Ok, here's the low-down: What else am I saying? I've seen it time and time again.... Database engines these days are TUNED to take advantage of parallelism, the more we Denormalize, the LESS parallelism the engines can engage in. If Joins are a "problem" for performance, then either: a) the database isn't tuned properly or b) the hardware isn't scaled properly or c) both - to take advantage of parallelism. I've hinted at performance numbers in my blog entry that I'm receiving on a 1TB (raw data) SQLServer2005 EDW across a 4 / 5 table join in a 32 bit Windows environment. I hope this helps. The problem again, is look at the mathematics of wide rows, they are in direct correlation to slow performance (because they exponentially increase the number of I/O's required to handle the data set). Do not believe the hype that denormalization will solve the performance problems, it doesn't. It is a stop-gap measure for another deeper issue: usually architecturally driven or incorrect tuning. Love to have your thoughts/comments... Thanks, September 21, 2007ETL Engines: VLDW & Loading / TransformingI hope you've enjoyed this series; I've not received any comments either way. I'll be finishing up this series soon. In this entry I'll address "ETL" processing in general, in another entry I'll discuss "ELT" in general, and then I'll begin to discuss BI Queries and Query engines going forward, finally at the end of this series I'll bring in a couple "appliances" or "appliance like" performance enhancers. In this entry, I'm going to focus on super fast, high speed ETL. This entry is less about the tools, and more about the architectures that work. I hope you enjoy it. ETL is frequently used to move huge batches of data around; in fact, it's now an "aging" technology, but has proven itself time and time again as a necessity for the following reasons: a) Managing metadata: technical metadata, process flows, process scheduling, structured data sets, and more recently: "semi-structured" data sets like XML and so on. When we look at this list, there's not only a lot of power available, there are tremendous benefits to using the ETL tools on the market (which I'm sure there are more, but some of the ones I'm familiar with are) - (hopefully in alphabetical order) Within these tool sets, there are a variety of options to choose from, but what is most important in VLDW or large data sets is the architecture. The _manner_ in which we construct our loads greatly impacts performance. Too often, the industry or the vendor suggest that simply by throwing hardware at the problem, performance can be solved (oh if only you would buy grid, or buy this, or buy that...) That helps for a little bit, but ultimately doesn't solve the problem. If you're ETL routines (during peak operation) are not pushing the existing hardware beyond an average load of 60%, or are not sustaining a hardware load of 80%, then most likely the architectures of individual "data flows" are not correct, especially for big data. Now wait a minute! You're telling me that just because I can, doesn't mean I should? I've blogged about this before... If you're headed down a highway towards a brick wall, and you refuse to change direction, what will happen? You'll hit the brick wall. If you've got performance problems, and you refuse to change the architecture to try new things, you'll still have performance problems. There's only so much tweaking of knobs that can help performance, then it's all up to the architecture. With Big Data & ETL, the first thing to realize is the mathematics behind the arrival of the data sets. The mistakes that ETL designers make are as follows: (these get more painful, the larger the data set, and / or the lower the latency of arrival of the data) 1) Incorporating Inserts, Updates, and Deletes in to the _same_ data flow / same process. Oh, and did I mention running on a 32 bit environment which causes significant OS Swapping to occur? Now, you are left with possibly 200MB, maybe 100MB of RAM left to allocate for "data reading/data writing"... and it drops from there. Let me explain this other piece as well: Windows 32 bit OS MUST RUN CODE UNDER THE 640k BOUNDARY!!! So there's additional Limitations there, Windows 32 Bit PAGEFILE.SYS (swap/temp) is SINGLE THREADED, AND BLOCKS I/O OPERATIONS when swapping pieces to disk. Oh yes, one more: Windows 32 Bit - RAM allocation will ONLY give the application 1/2 of the requested RAM, and AUTOMATICALLY put 1/2 in the pagefile.sys swap area. By the way, here's a tid-bit for you that you have to try to believe: Windows 64 Bit does NOT have these limitations, but requires 30% more hardware to run "cool" and efficiently, but it is FAST when installed and configured correctly on the right hardware. Linux, and Unix do not have these limitations either. So, architecture is everything. If you take the list I've published of what NOT to do, and change your architecture accordingly, you should see significant performance gains when running "the tuned data flow" by itself (as a test). There's one more piece I'd like to share (for instance, why letting the DB handle errors slows down the processing). The database is then "asked" to commit the rows it has cached (in TEMP mind you). The database then has to iterate all the rows in each block (using optimized bind array) and bounce the "rows in error" ONE AT A TIME back to the ETL engine. Now I ask you: is this transactional or "batch" oriented thinking? Right! Transactional... Of course it's going to be slow. Now, the other piece that happens is the ETL engine has to STOP PROCESSING NEW ROWS, Pick up the "rows in error" one at a time, and flow them out to disk. We've just quadrupled the I/O for that row, decreased our performance by a factor of 4x (at least). Why? The row traveled from the ETL buffer to the network, over the network to the database, was cached in TEMP in the database, then was read from TEMP (possibly disk), checked against the real table (again disk), and bounced back over the network. The ETL engine at this point STOPS it's parallel processing of all kinds, just to handle the row in error, one at a time, and write it to local disk. This same process is repeated whether we load direct from ETL or we load from a database Loader. The process is worse when we mix inserts with updates with deletes, why? because the database can no longer use block style bind arrays, it now must process each row in it's own individual statement (transactionally based). So already you've lost 4x to 10x the performance you COULD be achieving, which means that a "data flow" that COULD execute at 80,000 rows per second now executes at 8,000 rows per second. A significant price to pay when dealing with large data sets. The math: A huge performance drain. Again, performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines. Thoughts? Comments? I'd love to hear from you. Cheers, September 17, 2007Applications In General: VLDW and Machine PlayI've got this series going on VLDW, and I'd like to continue it for a little bit longer. In this entry I'll dive into different types of applications in a VLDW environment, and their impact on the machines underneath. There are indeed impacts to large data sets passing through machines with specific types of applications (including database engines) enclosed within. What some people tend to forget is that Volume & Latency change everything. It is absolutely vital when performance and tuning for extremely large data that sight of the applications on the hardware is not lost. The causes of problems with performance on large scale data sets or high volume combined with low latency are as follows: 1. Inter-application MIXED WORKLOAD I can't begin to tell you how many times I walk in to a client that has an a web server running on the same machine as their database engine, and then complains about performance because of volume in the database. This is very problematic. Their reasoning usually is: "well, when we installed it, it wasn't a problem..." or "we don't have very many users on the web-server." or "we don't seem to think that's the issue." Or my personal favorite: "The vendor provided us with a web server; we just wanted to take advantage of it." A prime example of mixed-workload and application clash is going on here. Web servers within the RDBMS engines are there so that "specific" instances of the database can be tied, with security to the web-server front-ends. This is a wonderful feature, until you have volume to deal with. NEVER in a volume solution should an RDBMS engine (responsible for managing the volume) be tied to a web-server component. IF any of the web-components are used, they should only be used to import/export data (i.e. XML). Workload clash across applications can KILL performance right off the bat. Furthermore, do not use web-services directly on top of the RDBMS engine, why? The hardware isn't typically scaled to do so. I'll give you an example of a system I am working on right now (I can't disclose the customer name, nor the performance I'm seeing due to legal licensing agreements). I have a couple SQLServer2005 instances on high-end dual core machines with 8 GB RAM, 32 bit Windows OS, and 32 Bit SQLServer. I've got a Data Vault data model under the covers as a data warehouse, and have built over 1 TB in test data on a single box, about 3 TB across multiple boxes, and have a separate web-server (also on Windows, we are running *.NET) which performs extremely fast, I can query the data sets of 1 Billion rows (joined to 300 million rows), and produce a result set of 100,000 rows very quickly. I have a huge load on SQLServer, 1 TB per machine, Raid 0+1, internal SCSI, 146GB Hot Swap drives at 10,000 RPM. The performance is phenomenal. Why? Because I've cleared the machines from doing _anything_ except serving data. These machines are highly focused, highly tuned and have nothing but the OS, and SQLServer on them to run and process data sets. If I through a web-server on the RDBMS engine itself, or if I turned on "web-services processing/email processing", or used "CLR" or in-database stored procedure logic, my performance would slow down by factors of 4x to 10x or more. Do not let your applications "cloud" the performance of your RDBMS engines. You've paid good money for licenses to your database engines and by now if you haven't guessed it: separating (divide and conquer) the applications across machines, and improving network speed will help improve parallelism, tuning and functionality. These RDBMS engines really need to be on their own box. This is just one of the reasons why such high performance can be seen from APPLIANCE vendors. They LOCK the hardware down (in most cases) to perform data servicing and management functions, they also provide additional and sometimes proprietary hardware to improve speed even further. 1. Web servers are the number one performance killers of VLDW / VLDB RDBMS engines. Whoa, hold-up, what are you saying?
This is SINGLE ROW PROCESSING, and is extremely slow in huge volume environments. Remember the performance numbers I spoke of (in terms of rows-per-second) that you want to have in large environments? 120,000 to 160,000 rows per second in processing time... To my knowledge, writing any program this way (as stated above), in a row-by-row processing loop will never perform as fast as block style (see below)... The above code re-written in block style: 1. INSERT INTO 2. INSERT INTO 3. UPDATE 4. INSERT INTO This is block style programming, you're telling the database to deal with BLOCKS of rows in RAM rather than a single row at a time. The performance factor of block style is 10x to 25x faster than single row hits, ESPECIALLY if the database is tuned for parallelism and partitioning, and has enough hardware to perform appropriately. If however, the database does NOT have enough hardware, OR it is not tuned properly, block style processing might "never finish" where looping (while slow) actually completes. The looping process in this instance may run at 600 rows per second, at best it might run at 1200 rows per second, but no where near the 120,000 rows per second of the block style. Keep in mind that block style processing requires LOTS of use of TEMP/SWAP space. Ok, I digress... back to the point: Hope this helps, September 12, 2007Database Specifics: VLDW & Switches, making it workSo just how do you get a particular Database to work for you in a growing data environment? There must be something you can do right? Yes, there are certain switches, and specific knowledge which you need to focus on in order to get the RDBMS engines to scale. This entry is all about those specific databases, and their switches. While unfortunately (due to legal ramifications) I cannot discuss publicly the performance numbers (it's all in the fine-print of the licensing agreements they have), I can discuss the switches that make things fast. Remember, that unless you have the proper hardware, and the right architecture, these switch settings won't make a difference. In fact, they may even slow your system down. There are TWO fundamental tenants to making ANY system perform: Ok - the third, is to do both of the above. This is where the orders of magnitude GAIN is achieved, and yes - this is AFTER you have already obtained the desired I/O performance from your system. Yea, man... So what... So how do I get my system to work properly? Ok, back to the REAL issues: Oracle 8 & Oracle 9i Oracle default parameters on INSTALL look at the "whole machine", but the installer does NOT run any type of machine performance testing. It simply counts the number of CPU's and sets the Number of CPU's available to the MAX number of CPU's on the box. It also sums up the total amount of RAM available, and sets the SGA and total RAM to the total RAM on the box. What does this mean? It means that when the instance is pushed to 80% or more of utilization, it will take ALL the resources available on the box. What Oracle doesn't publish is that it's core routines change the "niceness" prioritization of it's core routines, and push the Operating System out of the way... This can cause machine hard-lock and re-boot. I will suggest that the previous post covers pretty much all databases in general, and what you should do with the RAM & CPU and Instance settings for Oracle, and other RDBMS engines. DB2 UDB - Upgrade to 9 if you're not there. SQLServer 2000: UPGRADE, UPGRADE, UPGRADE!!! 2005 IS A BLAST, and REALLY is fast. Teradata: Sybase is mostly the same as SQLServer Hope this helps somewhat, have ideas? Please share them. Dan Linstedt September 9, 2007Databases and VLDW: Petabyte ScalabilityIf you've been following along, you've noticed that I've been writing about Terabyte (50+) to Petabyte levels. In this entry, it's no different. I'll discuss database engines generically, and their usage as a VLDB component. I'll use another entry to discuss appliances (separately) that contain embedded database engines. Databases all have their quirks and notions, but there are some generalities that simply stick - that you cannot ignore, that most people have ignored over the years... I'll try to dispell these false-hoods, and bring the truth to the table. What are the issues around BIG DATA and RDBMS engines, especially when it comes to VLDW (Data warehousing)? 1. Most database engines do _not_ support temporal mathematics within their query optimizers. The issue? EDW's contain TEMPORAL data!! * Appliances answer some of these questions for us... Others are "managing persons" related... Ok, common operator/installation flaws or mistakes we make... 1. Allowing Default Installation parameters to RULE the instance. If you have 40GB of RAM, and 18 CPU's when you install, the RDBMS engine installer will setup to use 100% of the available resources IF or WHEN pushed to the load limit. The PROBLEM with this, is that there is NO ROOM for the operating system to breathe under load, and when your in a VLDW, you've always got LOAD!! (usually). The solution? Set the database to be 1/2 or possibly 80% of the available resources. ALWAYS leave 20% available for other applications INCLUDING the operating system. THEN tune the database to effectively use all 100% of that "80%" of the hardware when pushed to operating limits. 2. Installing "more than one" instance of a single database engine on an underpowered server. Typically too many people make this mistake in the name of "recovery" or "fault-tolerance." This is a problem, it not only eats up the resources, but does NOT handle fault-tolerance very well. Why? because what usually fails are the hardware/networking layers, or the CPU's quickly become over-loaded and rebooting the machine is necessary. PLEASE PLEASE PLEASE in a VLDW system, Consolodate the "instances" to a SINGLE instance, multiple databases, and manage a core set of services. The exceptions to this rule are: Teradata, and IBM DB2 which is smart enough to know to launch only ONE set of core-operating services. What happens when I make this mistake? 3. We fail to PROPERLY tune the database for the hardware we're on, and we do not make allowances for network speeds, disk speeds, buffering, network traffic, and so on. The result is: the longer this goes, the more "fragmentation" this leads to in our VLDW data set. Fragmentation of the queries, fragmentation of the data in RAM, non-parallel operations, and non-optimized operations (more blocking I/O operations). Properly tuning requires an intimate knowledge of the hardware platform on which our RDBMS engines sit, along with base-line performance testing. 4. We fail to setup parallelism and partitioning according to time or nodes. In VLDW, the only way to manage data sets effectively is to partition by time (that is "load-time" or "load cycle number"), or even distribution by available MPP nodes. If you are running in an MPP environment, then data proximity and layout across the MPP nodes is critical in accordance with the MOST USED join columns, or better yet - an even distribution across parallel MPP nodes for optimal parallelism in any query. Why? because the join technology can be orchestrated through alternative means (like Join Indexes). 5. We under-power the hardware and expect the RDBMS engine to perform. RDBMS engines NEED LARGE MEMORY CACHES, and BIG BUFFERS to perform decently well in terabyte sized solutions. DON'T put a Windows Dual-Core server, 32 bit with 2 GB RAM in place with Oracle 9i and then try to place 30TB of data into the database and USE it as a VLDW... It just simply won't scale. On the flip side, DON'T count out the Windows 64 bit OS, and SQLServer 2005 64 bit system, they can run SUPER FAST if tuned properly, and if the hardware is there to support it. 6. Ok, we place 32 bit systems in play to do what really requires a 64 bit system to run with. 7. We place the Logs / temp / and data and indexes on the same DISK ARRAY, then we further damage ourselves by using RAID 5. Then we further damage ourselves by adding CLIENT TRAFFIC to the disk device, or even the network under which our database REQUIRES dedicated network access to the disk device. Oh yes, it gets' worse... We sometimes outsource our disk arrays to a "storage farm" and don't hold the outsourcers to an SLA for accurate performance, then they put our critical RDBMS engines on a SHARED DISK with other companies, and the network traffic has to load balance on the way in and out. The real problem? All of these things limit parallelism... Once parallelism is limited, the performance drops like a rock in water. Factors of 4x, 6x, 12x, all the way to 100x, 200x performance is negatively affected. In VLDW solutions, you CANNOT "absolutely CANNOT" limit parallelism in any way. What about the settings in the RDBMS? Comments? Thoughts? Did I leave something out? Please post your ideas. Thank-you kindly, September 8, 2007Disk and VLDW: What you need, can useIn very large data warehousing, or VLDB for that matter, I am constantly asked: what kind of disk should I have? can I use a SAN, how about a NASD, what about DASD? I have RAID 5, is that good? Now there is Raid 7, 5+, S, 10, and so on. There are differences that DO make a performance difference in the disk that you are using, and when you are dealing with very large data sets you MUST have throughput. This is the optimal end-game. The answers are quite simple really: Faster the better, but a minimum throughput of 300MB to 400 MB per second is required. (This was 3 years ago!) Today, I would suggest that 400 MB to 500 MB per second is better, faster the better. Now that said, if your disk cannot achieve these throughput levels then it's and so on. Unitl you can reach these levels of throughput it, performance will be difficult if not impossible to achieve in a 30+ or even 100+ Terabyte system, Did you know that CERN produces 100TB of data every time they smash an atom? That's an interesting tid-bit to chew on, and to top it off, they capture it all.... Ok, so what do you need in your disk device? I don't mind whether it's SAN, NASD, DASD, but I will say this: Internal disks are fastest, next up are DASD (this is the preferred choice in the HUGE volumes), then NASD and SAN IF the network is VPN direct to the server, and has garaunteed throughput. If you are working with a storage hosting vendor, then ensure you have an SLA in place for garaunteed throughput, then ask to see the throughput test results on a bi-monthly basis. This will keep them honest. I've been in places where the hosting service will "move" your data around to different disk arrays in their system based on their own internal needs. I've also been in places where they will _not_ garauntee dedicated access nor will they garauntee exclusive access to disk devices. Think about it, if you're willing to spend that much money to have a high volume solution, shouldn't you be protecting yourself and getting your money's worth? Love to hear your thoughts, September 6, 2007Operating Systems: What you need for VLDWSo you read my post on Hardware right? If not, take a gander at it... I'd like to think it's mostly complete. This entry focuses on Operating Systems, what the machine needs to work under severe volume loads, and how the operating systems react. In the near future I'll have a post on Applications including ETL / ELT, and Database engines, and then I'll move back in to the business side: skill sets, requirements, defining/gathering/estimating, etc.. All of these things are items that I teach at TDWI in VLDW. I try to keep it fresh. This however, is a short posting. Operating Systems require "room to breathe." And in high-volume and/or low-latency situations (like near-real-time), OS's are often left in the dust when considering the "operating" aspects of the system. i.e. how a system actually works. People tend to forget that an operating system can make or break the actual application layer on top. Let's take a look at some of these systems and how they work (in general), then we'll see how to help them work better with very large data warehouses. 1) Mainframe type OS: Very good at high speed computation, extremely good at parallelism, and partitioning of the work, multi-threading operations based on complex algorithms built in to the firm-ware, even when there is no "multi-threading" in the applications. Extremely high-speed hardware bus underneath to support parallel applications. These Operating Systems typically run each application in their own "slice" of memory, CPU, and disk resources. They time-slice accurately and efficiently. The downside? High engineering costs, mean high costs to consumers. You're paying for all this fancy engineering. The upside? Things like LPAR and VPAR abilities (Logical and virtual partitioning) meaning you can split the machine into multiple components, and say: install windows on one partition, AIX on another, and "Mainframe/cobol" code on a third. They can all talk to each other without ever hitting the network, they can "share" memory across boundaries, and are extremely fast and reliable. Usually, what you get out of the box is not necessarily "tunable" per-say, and is already pretty high quality, and pretty fast. 2) Unix OS systems: Vary depending on the manufacturer, but there are three or four major players: IBM (AIX), HP (UX), SUN (Solaris), SCO (unix). These systems are highly tunable with many different knobs to tweak depending on the hardware they reside on. The largest thing to remember here is that HPUX and AIX and Solaris all manage memory differently, they also handle swap space and threading differently at the core-level. This means you'll get different levels of performance depending on the actual hardware configuration. With HP you might want a SUPERDOME with MPP built in to the scalable chip-set. With Solaris, you want a similar machine, with IBM AIX, you might even consider a P-series or Z-Series system, also capable of running Zos underneath with LPAR's and VPAR's. The largest thing to remember about UNIX OS systems is the delegation of SWAP or TEMP space. It's all about the DISK I/O throughput at that point, and in each unix (except AIX) you MUST set the space available to 1.5x to 2x the size of RAM. So if I have 16GB of RAM, I'd really want: 32GB of temp or swap, so that disk fragmentation is kept to a minimum, and multiple threads don't constantly run at 80% of the machine resources. This is called "run-cool" setup and will serve you well. 3) Linux OS systems, are very similar to UNIX and have multiple knobs for tuning and tweaking, it's best to call in an expert for both Unix and Linux to tune to specific hardware platforms, or the OS simply will not perform _as well as it could_ in those circumstances. (same thing applies to swap/temp). 4) Windows OS systems, 32-bit, should not be utilized for VLDW systems. Why? Because of several reasons: 1) PageFile.sys (swap/temp) is SINGLE THREADED BLOCKING I/O. 2) ALL code in Windows 32 bit MUST execute under the 640k boundary!! Required!! 3) multi-TASKING is available, yes, but true multi-threading is not. Example: launch Windows Media Player, start one of your favorite songs, then launch Outlook, two or three good sized word documents, an Excel Document, and to top it off, start a "download" of a significant sized file. What happens to the playback of Windows Media Player? It skips, even if it's made high priority... This is the result of 1, 2, and 3 working against you. Imagine what happens in a VLDW environment... Now: Windows 64 bit, yes - good choice, with SQLServer 2005 64 bit... BUT you MUST MUST MUST give it enough hardware to perform properly. Without the right hardware, it will only execute SLOWER than the 32 bit systems. Cheers for now, Hardware: What do you need for VLDW?In this entry we'll explore the requirements of VLDW, what you might need to build one, or to grow it going forward. Don't forget that everyone's definition of VLDW is different depending on the starting point. No matter if you're scaling from 30GB to 1 TB, or from 50 TB to 1 PB, these guidelines and simple pieces should help you move forward. I guess you could treat some of this as a "score-card" for your environment. This is a HARDWARE LOOK at what you need. In the future I'll blog on the software requirements, and then finally, I'll address appliances. In this entry we'll explore the requirements of VLDW, what you might need to build one, or to grow it going forward. Don't forget that everyone's definition of VLDW is different depending on the starting point. No matter if you're scaling from 30GB to 1 TB, or from 50 TB to 1 PB, these guidelines and simple pieces should help you move forward. I guess you could treat some of this as a "score-card" for your environment. This is a HARDWARE LOOK at what you need. In the future I'll blog on the software requirements, and then finally, I'll address appliances. What about the basics? 1) High speed disk should be transferring data to / from disk (any disk) at I/O rates of 300 MB to 400 MB per second. Without this magic number, the whole scalability effort is sunk right out of the gate. 2) High speed VPN server to server networking, and Server to Disk Networking. Gigabit Ethernet is about as fast as you can get today, the faster the better. BUT the mistake that's often made is to put the DISK device on a common shared network, and then wire it to servers, and THEN expect it to perform. WRONG! SORRY! That simply won't work. You cannot achieve scalability to the Petabyte Levels with this approach. You _might_ make it to 20 to 40 TB before you see problems, but even then.... ISOLATE THE NETWORK TRAFFIC, Big data costs big money, if it's not bringing BIG BUSINESS VALUE in it's return, then DUMP some of the data, reduce the data set size until value can be achieved or justified. 3) RAM CACHES on Disk Devices, double buffering the I/O across the network channels provide all the performance you need - again FOCUS on the throughput numbers on an average day with average load on your Disk Devices. 300MB to 400MB per second, no less. MORE is better in this case. But if your disk device does not house a large RAM cache (to buffer you from the impact of writes) then you WILL see performance problems. 4) Network, network, network: VPN everywhere. Ensure dedicated network cards between the servers and the disk arrays, and server to server communication on a separate VPN network connection. The more "segmentation" the higher the degree of parallelism across each VPN networked devices. The more segmentation, the faster the processing, the easier it is to route correct network traffic for directed / specific tasks. If you are using Optical to reach the DISK, then ENSURE that there are enough optical up-links to handle the performance throughput requirements. Again, it's _all_ about the performance in volume situations. 5) CPU's and RAM... Ensure that you have 1.2x the amount of RAM (multiplied by) the number of CPU Core's on the machine. So If I have 2, dual-core CPU's that’s 4 cpu cores, I would need at LEAST 1.2 x 4 = 4.8 GB of RAM JUST to handle the executing code on the CPU's and maximize parallelism and threading on the CPU cores. I like to say, at the maximum: 2x cpu cores is what I run with. If I have 32 cores, I like to have 64GB of RAM just to handle the operations of the CPU's. People often forget that CPU's are constantly swapping threads in & out of RAM, and NEED absolutely NEED RAM to do this. Now, about on-board caches... Level 1 caches are most important, they are closest to the CPU, and THE LARGER THE BETTER!!! Level 1 caches are expensive due to miniaturization. Level 2 Caches are next in line, THE LARGER THE BETTER - this will help the multi-threading, and reduce execution cycles, particularly useful in Database Joins, string matches (where clauses), Transformation operations, etc... 6) RAM & Network Packet Sizes. If POSSIBLE, then increase the network packet sizes, look for network cards with high levels of "RAM CACHE" and feature sets like auto-compression, and double buffering. This means faster transfer rates over the Giga ether cables. Increasing the network packet sizes means that the network card itself will build "larger" packets, and hopefully with compression and double buffering you'll see more data transferred faster with less actual network traffic. 7) WATCH THE BOOT LOGS FOR HARDWARE ERRORS. Often times, administrators don't have time to construct the proper script to look at the boot logs for RAM, CPU, Device, and Network Errors. But this is CRITICAL, it can be a significant and negative impact on the performance. Especially bad checksums in the RAM chips, this can cause erroneous errors, AND performance degradation by a factor of 4x to 20x. If you have some hardware tips to add to this list, I'd love to hear from you. Let's build some knowledge here. Cheers, |