With the explosion in data set sizes these days, it can be challenging, if not impossible, to sort out which DBMS vendor does what. Most vendors offer different sets of features and functionality, and they all leapfrog one another from one feature to the next, one version to the next - but at the end of the day, we as customers must decipher which solution fits our needs.
This blog is an attempt to suggest which features are critical (in a generic sense) to managing a VLDB/VLDW going forward. If you have features you'd like to suggest, or things your company really needs, please comment.
In the VLDB/VLDW world things change: strategic EDWs become tactical EDWs, and our world shifts into near-real-time this and that. Instant responses aren't always what they're cracked up to be. Vendors throw around the term "single version of the truth" when it really should be "single version of the FACTS," because "truth" is purely subjective, and squarely in the hands of the BI user.
Volume, however, does funny things to our systems. It forces us to shift paradigms from SMP and shared-everything architectures to MPP, or to MPP/SMP clusters with shared-nothing under the covers. It forces our architectures to change, our data models to change, and our data-loading latency to shrink. Of course - I'm assuming this is all business-driven, right?
Let's put it this way: you can wash 1 car six ways from Sunday and take all day to get it sparkling clean, but if you have 500 cars to wash, you need a system - a standardized system where each step takes X amount of time and multiple people work in parallel to get all the cars clean. Double that to 1,000 cars a day, then 2,000, then 4,000, then 8,000, and pretty soon you've overloaded the mechanism for cleaning all those cars in one day. You begin to need efficient machines that work on the standardized system - giant machines, all operating in parallel, that can wash 500 cars in two hours or so.
Just like in this example: there is a breaking point in the architectures of DBMS vendors that promote SMP clustering (without MPP controllers) and shared-X architectures. What you could do architecturally with 5,000 rows of data in 1 hour doesn't necessarily work at 500M rows in the same hour.
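The car-wash idea maps directly onto shared-nothing partitioning: route each row to exactly one node by hashing its key, and every node works its own slice in parallel with no shared disk or memory. Here's a minimal Python sketch of that routing and a coordinator merging local results - the node count, table shape, and function names are my own illustrations, not any vendor's API.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size

def node_for(key: str) -> int:
    """Deterministically route a row to one node by hashing its key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

def partition(rows):
    """Split rows across nodes; no node ever sees another node's data."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[node_for(row["customer_id"])].append(row)
    return partitions

rows = [{"customer_id": f"cust-{i}", "amount": i * 10} for i in range(1000)]
parts = partition(rows)

# Each node aggregates only its own slice; a coordinator merges the results.
local_sums = {node: sum(r["amount"] for r in rs) for node, rs in parts.items()}
total = sum(local_sums.values())
print(total)  # same answer as a single-node scan, computed in N parallel slices
```

The point is that adding nodes adds capacity linearly, because nothing is shared - which is exactly what breaks down in shared-X clustering once the data outgrows the shared resource.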
I would suggest that the following criteria are important when evaluating VLDW/VLDB vendors (I'm not just talking about storing 500M rows and never using them; I'm talking about active information that flows into the database and is queried, summarized, and acted on):
* Shared-nothing MPP
* Fail-over and fault-tolerant SMPs underneath
* Redundant networking, redundant disk, redundant CPU, redundant RAM
* High-speed throughput
* Dynamic and batch loading capabilities
* High-speed, redundant, dedicated I/O: 300-400 MB per second or better in raw data-copy speed
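On that last criterion, you can get a rough feel for your own raw copy speed with a crude userland test: write a big file sequentially, fsync it, and divide by elapsed time. This Python sketch is my own, uses a deliberately small file so it runs anywhere, and is no substitute for a proper vendor or storage benchmark.

```python
import os
import tempfile
import time

CHUNK = b"\0" * (4 * 1024 * 1024)  # 4 MB per write
TOTAL_MB = 64                      # kept small for illustration

def write_throughput_mb_s(path: str) -> float:
    """Time a sequential write of TOTAL_MB and return MB/s."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(TOTAL_MB // 4):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())       # force data to disk, not just the page cache
    elapsed = time.perf_counter() - start
    return TOTAL_MB / elapsed

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name
try:
    print(f"{write_throughput_mb_s(path):.1f} MB/s sequential write")
finally:
    os.remove(path)
```

If a number like this comes back an order of magnitude under the 300-400 MB/s mark on your current hardware, that's a hint about where your wall is.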
There are hundreds more criteria, but these should get you out of the blocks. If you're embarking on VLDW or VLDB, it would be wise to review your current architecture, and possibly to load test it by duplicating the data set you currently have. Vendors know where and when you'll hit the wall with your VLDB/VLDW - and in some cases, if they're called in to save the day, they'll jack up the prices (not always, and not all vendors) to help with the switchover, because they know you have no choice. Your systems will reach a point of no return and fail. I've seen it happen.
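The "duplicate your data set" load test can be as simple as repeatedly doubling a table with INSERT ... SELECT and timing a representative query at each size. Here's a hedged sketch using Python's built-in sqlite3 as a stand-in for whatever DBMS you actually run; the table and query are invented for illustration, and on a real system you'd do this on a copy, never on production.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

for doubling in range(4):
    conn.execute("INSERT INTO sales SELECT * FROM sales")  # double the rows
    (count,) = conn.execute("SELECT COUNT(*) FROM sales").fetchone()

    start = time.perf_counter()
    conn.execute(
        "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"
    ).fetchall()
    elapsed = time.perf_counter() - start
    print(f"{count:>9} rows: {elapsed * 1000:.1f} ms")
```

Watch how the timings grow as the row count doubles: roughly linear growth means you have headroom; anything worse means you're approaching the wall before the vendor tells you so.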
If you'd like to hear more about this subject, feel free to reply with thoughts and comments. If you disagree, I'd like to know why, and what your experience has been - especially if it's been positive with a particular vendor.
Posted July 28, 2005 7:04 AM