Links to the Columnar Database series:
This final installment in my series on columnar will focus on what, to many, is the bottom line - when to use one. I have built up to this point from the ground up - starting with the major difference between columnar and row-based - the orientation of the data in its blocks/pages. Understanding those physical differences is the foundation for understanding the queries and workloads that will perform better or worse, and by how much. To be sure, you would use columnar for selective improved performance of data retrieval queries.
Seldom are such platform decisions based on a few queries. The overall workload would need to be analyzed. There may be a view to split the workload accordingly. This may mean duplicating some of the data across columnar and row-based systems.
So, anything is possible when it comes to system selection. It all depends on the value of performance. Since columnar databases will not run an enterprise, they are up against entrenched row-based skill sets and mindsets. It must be the value of performance that overrides this if columnar is going to be considered. Many times this comes after exhaustive tuning of the row-based DBMS.
Based on theory, client-side experience in platform selection and building information management systems, personal testing and benchmarks, I offer these observations. Clearly, these must be query observations and not workload observations. Block reads are a primary barometer of performance, although you have the post-read "gluing" process I spoke of in an earlier post to deal with in columnar. You also have prefetch strategies in both row-based and column-based, which speeds up scans when it is triggered. Also the row makeup (size, number/type columns), as discussed earlier, factors into what you're getting with the block reads.Some things are clear however.
A table scan of all columns will perform best in row-based. You already know why if you've read the other posts. However, just because the row-based DBMS decided to do a table scan does not mean one is necessary in columnar. This is hugely important when assessing columnar. Some row-based DBMS will decide to table scan at 15% selectivity. In columnar, it would only access those vectors containing the interesting columns. Single column scans can therefore be much faster in columnar.
The higher the number of columns in the query, the more it will approach row-based index performance (assuming the right indexes are in place in the row-based DBMS). For small databases, the performance differences are muted.
A single record retrieval is better in row-orientation, especially when there are a high number of columns in the table. If the workload is mostly point queries of full records, you will be better off in row-based.
The bottom line is column selectivity. Where it is high, those queries will perform better in columnar. This also means that longer rows (thereby creating a significant granularity to query selectivity) would lend themselves to columnar. Of course, you can argue that you can build a bunch of "mini-tables" in row-based that you can connect by key. Yes, and you could also simulate columnar with two-column (key and value) tables and /or directing the optimizers to leaf-level index scans. The maintenance on such systems will be very high and onerous. Plus, index leaf pages must contain record ID overhead (pointers to the data pages where the 'rest of' the records reside) and columnar structures do not.
The "devil's in the details" - the details of your workload and the engineering of t he columnar DBMS, that is. It's an emerging specialized future. If you're looking for the performance advantages of columnar, go columnar for the workloads of large data, long rows, batch (or nonintrusive to query) loads, with selective column queries and where the value of performance is there.
Posted June 24, 2010 10:59 AM
Permalink | No Comments |