Blog: Dan E. Linstedt« Disk and VLDW: What you need, can use | Main | Database Specifics: VLDW & Switches, making it work » Databases and VLDW: Petabyte ScalabilityIf you've been following along, you've noticed that I've been writing about Terabyte (50+) to Petabyte levels. In this entry, it's no different. I'll discuss database engines generically, and their usage as a VLDB component. I'll use another entry to discuss appliances (separately) that contain embedded database engines. Databases all have their quirks and notions, but there are some generalities that simply stick - that you cannot ignore, that most people have ignored over the years... I'll try to dispell these false-hoods, and bring the truth to the table. What are the issues around BIG DATA and RDBMS engines, especially when it comes to VLDW (Data warehousing)? 1. Most database engines do _not_ support temporal mathematics within their query optimizers. The issue? EDW's contain TEMPORAL data!! * Appliances answer some of these questions for us... Others are "managing persons" related... Ok, common operator/installation flaws or mistakes we make... 1. Allowing Default Installation parameters to RULE the instance. If you have 40GB of RAM, and 18 CPU's when you install, the RDBMS engine installer will setup to use 100% of the available resources IF or WHEN pushed to the load limit. The PROBLEM with this, is that there is NO ROOM for the operating system to breathe under load, and when your in a VLDW, you've always got LOAD!! (usually). The solution? Set the database to be 1/2 or possibly 80% of the available resources. ALWAYS leave 20% available for other applications INCLUDING the operating system. THEN tune the database to effectively use all 100% of that "80%" of the hardware when pushed to operating limits. 2. Installing "more than one" instance of a single database engine on an underpowered server. Typically too many people make this mistake in the name of "recovery" or "fault-tolerance." This is a problem, it not only eats up the resources, but does NOT handle fault-tolerance very well. Why? because what usually fails are the hardware/networking layers, or the CPU's quickly become over-loaded and rebooting the machine is necessary. PLEASE PLEASE PLEASE in a VLDW system, Consolodate the "instances" to a SINGLE instance, multiple databases, and manage a core set of services. The exceptions to this rule are: Teradata, and IBM DB2 which is smart enough to know to launch only ONE set of core-operating services. What happens when I make this mistake? 3. We fail to PROPERLY tune the database for the hardware we're on, and we do not make allowances for network speeds, disk speeds, buffering, network traffic, and so on. The result is: the longer this goes, the more "fragmentation" this leads to in our VLDW data set. Fragmentation of the queries, fragmentation of the data in RAM, non-parallel operations, and non-optimized operations (more blocking I/O operations). Properly tuning requires an intimate knowledge of the hardware platform on which our RDBMS engines sit, along with base-line performance testing. 4. We fail to setup parallelism and partitioning according to time or nodes. In VLDW, the only way to manage data sets effectively is to partition by time (that is "load-time" or "load cycle number"), or even distribution by available MPP nodes. If you are running in an MPP environment, then data proximity and layout across the MPP nodes is critical in accordance with the MOST USED join columns, or better yet - an even distribution across parallel MPP nodes for optimal parallelism in any query. Why? because the join technology can be orchestrated through alternative means (like Join Indexes). 5. We under-power the hardware and expect the RDBMS engine to perform. RDBMS engines NEED LARGE MEMORY CACHES, and BIG BUFFERS to perform decently well in terabyte sized solutions. DON'T put a Windows Dual-Core server, 32 bit with 2 GB RAM in place with Oracle 9i and then try to place 30TB of data into the database and USE it as a VLDW... It just simply won't scale. On the flip side, DON'T count out the Windows 64 bit OS, and SQLServer 2005 64 bit system, they can run SUPER FAST if tuned properly, and if the hardware is there to support it. 6. Ok, we place 32 bit systems in play to do what really requires a 64 bit system to run with. 7. We place the Logs / temp / and data and indexes on the same DISK ARRAY, then we further damage ourselves by using RAID 5. Then we further damage ourselves by adding CLIENT TRAFFIC to the disk device, or even the network under which our database REQUIRES dedicated network access to the disk device. Oh yes, it gets' worse... We sometimes outsource our disk arrays to a "storage farm" and don't hold the outsourcers to an SLA for accurate performance, then they put our critical RDBMS engines on a SHARED DISK with other companies, and the network traffic has to load balance on the way in and out. The real problem? All of these things limit parallelism... Once parallelism is limited, the performance drops like a rock in water. Factors of 4x, 6x, 12x, all the way to 100x, 200x performance is negatively affected. In VLDW solutions, you CANNOT "absolutely CANNOT" limit parallelism in any way. What about the settings in the RDBMS? Comments? Thoughts? Did I leave something out? Please post your ideas. Thank-you kindly, |