Business Intelligence Network Business Intelligence Resources

Blog: Dan E. Linstedt

Disk and VLDW: What you need, can use

In very large data warehousing (VLDW), or VLDB for that matter, I am constantly asked: What kind of disk should I have? Can I use a SAN? How about NASD? What about DASD? I have RAID 5, is that good? Now there is RAID 7, 5+, S, 10, and so on. The differences between these options DO affect performance, and when you are dealing with very large data sets you MUST have throughput. That is the end-game.

The answers are quite simple really: the faster the better, but a minimum throughput of 300 MB to 400 MB per second is required. (That was my number 3 years ago!) Today, I would suggest that 400 MB to 500 MB per second is a better minimum, and again, the faster the better.
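If you want a rough first look at where a mount point stands against those numbers, something like the following can help. It is a minimal sketch, not a benchmark: the function name and default sizes are my own, it times one sequential write stream only, and a small file like this can land entirely in the array's RAM cache, so treat the result as optimistic.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory, total_bytes=64 * 1024 * 1024,
                              block_size=1024 * 1024):
    """Write total_bytes sequentially to a temp file in `directory`.

    Returns the observed rate in MB per second (1 MB = 1,000,000 bytes).
    Hypothetical helper for illustration; sizes are arbitrary defaults.
    """
    block = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            written = 0
            while written < total_bytes:
                f.write(block)
                written += block_size
            f.flush()
            os.fsync(f.fileno())  # push data to the device before stopping the clock
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the temp file even if the write failed
    return (written / 1_000_000) / elapsed


if __name__ == "__main__":
    rate = sequential_write_mb_per_s(tempfile.gettempdir())
    print(f"sequential write: {rate:.0f} MB/s")
```

For a number you can trust, the test file should be far larger than both server RAM and the storage device's cache, and you should run several streams in parallel to exercise every I/O channel.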

Now that said, if your disk cannot achieve these throughput levels then it's
a) time to buy a new disk device
b) time to reconfigure your existing disk device
c) time to re-work the storage array and how it's attached to the server
d) time to add RAM cache to the storage device

and so on. Until you can reach these levels of throughput, performance will be difficult if not impossible to achieve in a 30+ or even 100+ terabyte system. Did you know that CERN produces 100 TB of data every time they smash an atom? That's an interesting tidbit to chew on, and to top it off, they capture it all....
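To see why the throughput floor matters at these volumes, here is the back-of-the-envelope arithmetic as a short sketch. It assumes decimal units (1 TB = 1,000,000 MB), a single full pass over the data, and a perfectly sustained rate, which real systems rarely deliver.

```python
def full_scan_hours(terabytes, mb_per_second):
    """Hours needed to read `terabytes` once at a sustained `mb_per_second`."""
    total_mb = terabytes * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return total_mb / mb_per_second / 3600


for tb, rate in [(30, 400), (100, 500)]:
    print(f"{tb} TB at {rate} MB/s: {full_scan_hours(tb, rate):.1f} hours")
# prints:
# 30 TB at 400 MB/s: 20.8 hours
# 100 TB at 500 MB/s: 55.6 hours
```

Even at the suggested rates, one pass over a 30 TB system takes most of a day, which is why anything below the floor makes performance goals so hard to reach.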

Ok, so what do you need in your disk device?
1. BIG RAM CACHE: you should have write-through buffers, balanced load algorithms, smart caching across the I/O channels, and fully controlled storage arrays.
2. High-speed platters.
3. RAID 0+1 should be used, or better yet: database RAW format (only for the fastest MPP databases on the planet). DO NOT use RAID 5 and expect performance UNLESS the disk device has on-board RAM cache and can reach the throughput speeds mentioned above.
4. Multiple I/O channels working in parallel.
5. DEDICATED DISK ARRAY to SERVER TRAFFIC ONLY: there should NEVER be any client traffic on the network, or on the disk array, when expecting performance from large volumes of data. YES: DEDICATED!!!!! Again, client traffic a) is too sporadic, b) is typically used to "backup and restore local windows disks", which could be gigabytes at a time, and this is painful... and c) drops the performance of your high-speed disk array by factors of 4x, 6x, and 10x.
6. DO NOT USE A SHARED, OUTSOURCED DISK ARRAY FOR HIGH-SPEED, HIGH-VOLUME OPERATIONS, UNLESS: it's hosted on a guaranteed VPN, the device is DEDICATED to your servers, and the vendor can PROVE the performance on the transfer rates. Hosted I/O solutions in a HUGE environment are usually detrimental to performance by factors of 12x, 14x and 20x.
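To put those slowdown factors in perspective, the sketch below applies a few of them to a dedicated array that delivers 500 MB/s and compares the result with the 400 MB/s floor suggested earlier. The 500 MB/s baseline is an assumed starting point for illustration, not a measurement.

```python
FLOOR_MB_PER_S = 400  # lower end of the suggested 400-500 MB/s range


def effective_throughput(dedicated_mb_per_s, slowdown_factor):
    """Throughput left after contention slows the array by `slowdown_factor`."""
    return dedicated_mb_per_s / slowdown_factor


for factor in (4, 10, 20):
    rate = effective_throughput(500, factor)  # 500 MB/s baseline is assumed
    print(f"{factor}x slowdown: {rate:.0f} MB/s, meets floor: {rate >= FLOOR_MB_PER_S}")
# prints:
# 4x slowdown: 125 MB/s, meets floor: False
# 10x slowdown: 50 MB/s, meets floor: False
# 20x slowdown: 25 MB/s, meets floor: False
```

Even the mildest factor in the list leaves the array well under the floor, which is the case for keeping it dedicated.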

I don't mind whether it's SAN, NASD, or DASD, but I will say this: internal disks are fastest; next up is DASD (this is the preferred choice for HUGE volumes); then NASD and SAN, IF the network is VPN direct to the server and has guaranteed throughput.

If you are working with a storage hosting vendor, then ensure you have an SLA in place for guaranteed throughput, and ask to see the throughput test results on a bi-monthly basis. This will keep them honest. I've been in places where the hosting service will "move" your data around to different disk arrays in their system based on their own internal needs. I've also been in places where they will _not_ guarantee dedicated access, nor will they guarantee exclusive access to disk devices.
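Checking a vendor's test results against the SLA can be as simple as flagging every sample that falls below the guaranteed rate. This is a minimal sketch; the function name and the sample numbers are made up for illustration, and a real report would come in whatever format your vendor supplies.

```python
def sla_violations(samples_mb_per_s, guaranteed_mb_per_s):
    """Return (index, value) pairs for samples below the guaranteed throughput."""
    return [(i, v) for i, v in enumerate(samples_mb_per_s)
            if v < guaranteed_mb_per_s]


# Hypothetical throughput test results from a vendor report, in MB/s.
report = [512.0, 498.5, 376.2, 455.0, 140.9]
print(sla_violations(report, guaranteed_mb_per_s=400))
# prints: [(2, 376.2), (4, 140.9)]
```

Look at the individual samples rather than an average: one very slow window, such as the last entry here, can hide behind a healthy mean.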

Think about it: if you're willing to spend that much money to have a high-volume solution, shouldn't you be protecting yourself and getting your money's worth?

Love to hear your thoughts,
Dan Linstedt

Posted by Dan Linstedt on September 8, 2007 6:45 AM
