Business Intelligence Network Business Intelligence Resources

Blog: Dan E. Linstedt

« Database Specifics: VLDW & Switches, making it work | Main | ETL Engines: VLDW & Loading / Transforming »

Applications In General: VLDW and Machine Play

I've got this series going on VLDW, and I'd like to continue it for a little bit longer. In this entry I'll dive into different types of applications in a VLDW environment, and their impact on the machines underneath. There are indeed impacts to large data sets passing through machines with specific types of applications (including database engines) enclosed within. What some people tend to forget is that Volume & Latency change everything. It is absolutely vital when performance and tuning for extremely large data that sight of the applications on the hardware is not lost.

The causes of problems with performance on large scale data sets or high volume combined with low latency are as follows:

1. Inter-application MIXED WORKLOAD
2. Dedication of hardware "in theory is shared" In reality, is fighting for resources
3. When it comes time for a SINGLE application to do its job, other applications have 60% or more of the resources already tied up.

I can't begin to tell you how many times I walk in to a client that has an a web server running on the same machine as their database engine, and then complains about performance because of volume in the database. This is very problematic. Their reasoning usually is: "well, when we installed it, it wasn't a problem..." or "we don't have very many users on the web-server." or "we don't seem to think that's the issue." Or my personal favorite: "The vendor provided us with a web server; we just wanted to take advantage of it."

A prime example of mixed-workload and application clash is going on here. Web servers within the RDBMS engines are there so that "specific" instances of the database can be tied, with security to the web-server front-ends. This is a wonderful feature, until you have volume to deal with. NEVER in a volume solution should an RDBMS engine (responsible for managing the volume) be tied to a web-server component. IF any of the web-components are used, they should only be used to import/export data (i.e. XML).

Workload clash across applications can KILL performance right off the bat. Furthermore, do not use web-services directly on top of the RDBMS engine, why? The hardware isn't typically scaled to do so.

I'll give you an example of a system I am working on right now (I can't disclose the customer name, nor the performance I'm seeing due to legal licensing agreements).

I have a couple SQLServer2005 instances on high-end dual core machines with 8 GB RAM, 32 bit Windows OS, and 32 Bit SQLServer. I've got a Data Vault data model under the covers as a data warehouse, and have built over 1 TB in test data on a single box, about 3 TB across multiple boxes, and have a separate web-server (also on Windows, we are running *.NET) which performs extremely fast, I can query the data sets of 1 Billion rows (joined to 300 million rows), and produce a result set of 100,000 rows very quickly.

I have a huge load on SQLServer, 1 TB per machine, Raid 0+1, internal SCSI, 146GB Hot Swap drives at 10,000 RPM. The performance is phenomenal.

Why? Because I've cleared the machines from doing _anything_ except serving data. These machines are highly focused, highly tuned and have nothing but the OS, and SQLServer on them to run and process data sets. If I through a web-server on the RDBMS engine itself, or if I turned on "web-services processing/email processing", or used "CLR" or in-database stored procedure logic, my performance would slow down by factors of 4x to 10x or more.

Do not let your applications "cloud" the performance of your RDBMS engines. You've paid good money for licenses to your database engines and by now if you haven't guessed it: separating (divide and conquer) the applications across machines, and improving network speed will help improve parallelism, tuning and functionality. These RDBMS engines really need to be on their own box. This is just one of the reasons why such high performance can be seen from APPLIANCE vendors. They LOCK the hardware down (in most cases) to perform data servicing and management functions, they also provide additional and sometimes proprietary hardware to improve speed even further.

1. Web servers are the number one performance killers of VLDW / VLDB RDBMS engines.
2. OLTP applications are the number two performance killers of VLDW (data warehousing specific) RDBMS engines.
3. XML In and Out can be a performance killer of structured text loads. Try pre-processing the XML, normalizing or fully denormalizing "flattening" it.
4. Stored procedures, functions, and "programmatic" solutions like triggers, and logic within the RDBMS can be performance killers. The database engines MUST be clean, and simple - used to house the structures only. Keep your application logic in the application layers where it belongs. After all, the acronym RDBMS stands for "Relational Database Management System", not "relational application logic data manipulation system".... I realize that some databases (like Teradata) provide for high degrees of parallelism and application logic within the database. They go through great pains in engineering to make this work, hence the high cost, but I still say: processing should not necessarily happen within the database UNLESS it can happen on a "block by block" basis in parallel with linear scalability.

Whoa, hold-up, what are you saying?
Ok, here's where part of the rubber hits the road. Most programmers write logic this way:

(open a cursor of some sort)
(like a macro in excel)

This is SINGLE ROW PROCESSING, and is extremely slow in huge volume environments. Remember the performance numbers I spoke of (in terms of rows-per-second) that you want to have in large environments? 120,000 to 160,000 rows per second in processing time... To my knowledge, writing any program this way (as stated above), in a row-by-row processing loop will never perform as fast as block style (see below)...

The above code re-written in block style:

1. INSERT INTO
Select
where....

2. INSERT INTO
select
from temp table #1
where...

3. UPDATE
where

4. INSERT INTO
select
from Temp table #2
where ...

This is block style programming, you're telling the database to deal with BLOCKS of rows in RAM rather than a single row at a time. The performance factor of block style is 10x to 25x faster than single row hits, ESPECIALLY if the database is tuned for parallelism and partitioning, and has enough hardware to perform appropriately.

If however, the database does NOT have enough hardware, OR it is not tuned properly, block style processing might "never finish" where looping (while slow) actually completes. The looping process in this instance may run at 600 rows per second, at best it might run at 1200 rows per second, but no where near the 120,000 rows per second of the block style.

Keep in mind that block style processing requires LOTS of use of TEMP/SWAP space.

Ok, I digress... back to the point:
* DO NOT mix workload on RDBMS engines in volume solutions
* DO: divide and conquer, separate the application spaces from each other, and from their intended purpose... besides: how much can it cost to get a Linux box, or a new windows machine just for web-services logic? Compared to how long you're currently waiting for your data to complete its processing?
* MOVE the application logic BACK in to the application, and out of the database, let the ETL/ELT tools do their jobs in high speed/parallel formats.
* Partition and tune your database, buy the hardware needed for the RDBMS engine to breathe.

Hope this helps,
Dan Linstedt

  Posted by Dan Linstedt on September 17, 2007 5:10 AM |

Post a comment