Blog: Dan E. Linstedt« Database Specifics: VLDW & Switches, making it work | Main | ETL Engines: VLDW & Loading / Transforming » Applications In General: VLDW and Machine PlayI've got this series going on VLDW, and I'd like to continue it for a little bit longer. In this entry I'll dive into different types of applications in a VLDW environment, and their impact on the machines underneath. There are indeed impacts to large data sets passing through machines with specific types of applications (including database engines) enclosed within. What some people tend to forget is that Volume & Latency change everything. It is absolutely vital when performance and tuning for extremely large data that sight of the applications on the hardware is not lost. The causes of problems with performance on large scale data sets or high volume combined with low latency are as follows: 1. Inter-application MIXED WORKLOAD I can't begin to tell you how many times I walk in to a client that has an a web server running on the same machine as their database engine, and then complains about performance because of volume in the database. This is very problematic. Their reasoning usually is: "well, when we installed it, it wasn't a problem..." or "we don't have very many users on the web-server." or "we don't seem to think that's the issue." Or my personal favorite: "The vendor provided us with a web server; we just wanted to take advantage of it." A prime example of mixed-workload and application clash is going on here. Web servers within the RDBMS engines are there so that "specific" instances of the database can be tied, with security to the web-server front-ends. This is a wonderful feature, until you have volume to deal with. NEVER in a volume solution should an RDBMS engine (responsible for managing the volume) be tied to a web-server component. IF any of the web-components are used, they should only be used to import/export data (i.e. XML). Workload clash across applications can KILL performance right off the bat. Furthermore, do not use web-services directly on top of the RDBMS engine, why? The hardware isn't typically scaled to do so. I'll give you an example of a system I am working on right now (I can't disclose the customer name, nor the performance I'm seeing due to legal licensing agreements). I have a couple SQLServer2005 instances on high-end dual core machines with 8 GB RAM, 32 bit Windows OS, and 32 Bit SQLServer. I've got a Data Vault data model under the covers as a data warehouse, and have built over 1 TB in test data on a single box, about 3 TB across multiple boxes, and have a separate web-server (also on Windows, we are running *.NET) which performs extremely fast, I can query the data sets of 1 Billion rows (joined to 300 million rows), and produce a result set of 100,000 rows very quickly. I have a huge load on SQLServer, 1 TB per machine, Raid 0+1, internal SCSI, 146GB Hot Swap drives at 10,000 RPM. The performance is phenomenal. Why? Because I've cleared the machines from doing _anything_ except serving data. These machines are highly focused, highly tuned and have nothing but the OS, and SQLServer on them to run and process data sets. If I through a web-server on the RDBMS engine itself, or if I turned on "web-services processing/email processing", or used "CLR" or in-database stored procedure logic, my performance would slow down by factors of 4x to 10x or more. Do not let your applications "cloud" the performance of your RDBMS engines. You've paid good money for licenses to your database engines and by now if you haven't guessed it: separating (divide and conquer) the applications across machines, and improving network speed will help improve parallelism, tuning and functionality. These RDBMS engines really need to be on their own box. This is just one of the reasons why such high performance can be seen from APPLIANCE vendors. They LOCK the hardware down (in most cases) to perform data servicing and management functions, they also provide additional and sometimes proprietary hardware to improve speed even further. 1. Web servers are the number one performance killers of VLDW / VLDB RDBMS engines. Whoa, hold-up, what are you saying?
This is SINGLE ROW PROCESSING, and is extremely slow in huge volume environments. Remember the performance numbers I spoke of (in terms of rows-per-second) that you want to have in large environments? 120,000 to 160,000 rows per second in processing time... To my knowledge, writing any program this way (as stated above), in a row-by-row processing loop will never perform as fast as block style (see below)... The above code re-written in block style: 1. INSERT INTO 2. INSERT INTO 3. UPDATE 4. INSERT INTO This is block style programming, you're telling the database to deal with BLOCKS of rows in RAM rather than a single row at a time. The performance factor of block style is 10x to 25x faster than single row hits, ESPECIALLY if the database is tuned for parallelism and partitioning, and has enough hardware to perform appropriately. If however, the database does NOT have enough hardware, OR it is not tuned properly, block style processing might "never finish" where looping (while slow) actually completes. The looping process in this instance may run at 600 rows per second, at best it might run at 1200 rows per second, but no where near the 120,000 rows per second of the block style. Keep in mind that block style processing requires LOTS of use of TEMP/SWAP space. Ok, I digress... back to the point: Hope this helps, |