Column based databases/appliances are making headway in the VLDB/VLDW world. There is no doubt that there are benefits to this approach, but there are also drawbacks. In this entry I explore some of the articles, links, facts and figures - as related to my personal experience. Then I compare what different authors are saying against Row-Based MPP technologies to see what the differences and similarities are. This by no means is a complete research paper, but just a peek into what the future may hold for RDBMS vendors and the new Column based data stores. Of course, Solid state disk, and RAM/Flash based data sets will change things again shortly. I'll also touch on the impacts to Data Modeling and what it may mean going forward.
Let's first set the table by defining what the terms mean:
1) for VLDB/VLDW I'm referring specifically to a 300TB and above system.
2) I'm also referring to LIVE data sets, where it isn't JUST 300TB sitting in a storage disk somewhere, but there's a significant amount of information being loaded AND queried at the same time, utilization is somewhere around 100TB "used/accessed/referenced/loaded" per week.
3) I'm also referring to a MIXED WORKLOAD system, meaning real-time transactions are streaming in, batch loads are occurring, and both tactical and strategic queries are taking place at the same time.
By MPP: I mean Massively Parallel Processing capabilities, like DPF from DB2 UDB (IBM - running shared-nothing architecture), and Teradata with independent nodes to scale out, I'm also referring to theses traditional database systems as "row-based" database engines.
For Column Based "appliances" I am referring to Sybase IQ, Vertica, Dataupia, and others which provide column based data storage. NOTE: Netezza is NOT a column based store, rather it is a flat-wide appliance with hardware that figures out exactly what data set you need before hitting disk to retrieve it.
Thus, one might expect column-stores to perform similarly to a row-store with an index on every column without the corresponding negatives of creating many indices. In fact, this is a common argument we have often heard regarding column-stores and their expected performance relative to carefully designed row-stores -- both approaches provide good read performance, with the column store providing lower total cost of ownership (since you don't have to figure out what indexes to create anymore).http://www.databasecolumn.com/2008/07/debunking-a-myth-columnstores.html
Though this argument sounds reasonable, it is completely incorrect. It is also dangerous since it might cause you to end up choosing a row-store when what you really need is a column-store.
If you're interested in furthering your knowledge on indexing versus column compression, the article: http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf is a very good source for examining the mathematics behind the tuple sets and joins.
Most of the articles I've located discuss indexing, and differences between indexing and column based tuple access. Unfortunately they don't tend to address the loading speeds and performance of getting the data "IN" to the database in the first place.
Column based data stores bring benefits to the table:
* Rapid Query, less overhead (according to the math that I've read through)
* No need for PHYSICAL data modeling (as long as you don't need/want GOVERNANCE or MANAGEMENT in your data store).
* No "seemingly physical" limit to the number of columns PER TABLE.
* Automatic data compression/removal of duplicates on insert
* IF the grid / cloud computing works properly, then they should be able to scale out
* They appear to achieve anywhere from a 3:1 to a 7:1 compression ratio on the data slammed in to the box.
* Raw data can be loaded quickly (in native format) without "stopping to normalize, or assign sequence number surrogate keys"
Now let's take a look at some of the issues that they bring to the table (simple issues)
* Most column based databases have yet to solve massive load performance issues
* Most column based databases have to "STOP" the data stream to compress it, and assign it to the right column post-loading.
* In order to achieve high speed trickle feed (8,000 transactions per second or better) they need to have a significant RAM cache somewhere on one of the nodes to load the data.
* Splitting the data over multiple gridded nodes might take more work than originally thought
* Load balancing with spreading the data set across multiple gridded nodes might be an issue.
* Today, most column based data stores work extremely well on big iron SMP boxes, but struggle to take full advantage of Grid technologies and shared-nothing architectures.
* To handle "BATCH LOADS" Most column based data stores use a "staging area" internally to load the batch data, then split it across and push it in to the column database (this may NOT be such a bad thing... we do this in MPP environments too!)
* Column based databases have "come and gone", the only one that has stuck around over the years has been Sybase IQ, and finally for the first time in many years we are beginning to see announcements from the company that they are putting money back into R&D for this product.
Let's take a look at the physical nature of MPP:
* Provides mechanisms for governance and management through physical data modeling
* Provides high-speed batch loads, and high-speed trickle feeds (real-time transactions)
* Provides balanced queries, and can easily handle mixed workload components (loading while querying, and both tactical and strategic queries at the same time).
* Has grown up, is based on mature proven technology.
* Scales out very easily, allows MASSIVE sets of data (because it's not locked in to a single SMP environment).
* Usually requires good physical data modeling (normalization) in order to load-balance the data sets across the nodes.
* Usually requires a staging area inside the MPP platform before re-distributing the data ** caveat: some MPP platforms have architected their bulk-loaders to overcome this problem.
* Usually requires JOIN INDEXES or some materialized table to assist with the Tuple Joins
* Usually requires column based compression to be turned on by the operator to achieve benefits.
* Requires enough nodes to "split the workload evenly"
* Requires all nodes to be running at the same speed in order to achieve maximum performance gains.
So these are just a FEW of the points made both for and against column based databases when comparing them to MPP designs. They both work well for their own purpose. Customers of mine continue to look for a "single solution to do it all" however today, it just doesn't seem possible. This is why (I think) that we continue to hear vendors like IBM and Teradata advertise: "we partner with...." fill in the blank of your favorite column based database...
However, watch the vendors closely - this market space is heating up, and over the next year I expect new technologies to be released from all vendors that will converge some functionality and blur the lines between RDBMS MPP, and Column based on a grid.
Thoughts? What do you see in the market?
Posted September 21, 2008 2:56 PM
Permalink | 1 Comment |