
Blog: William McKnight

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and to connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope these entries can be of some modest benefit to that goal. Please share your thoughts and input on the topics.

About the author

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist, a frequent best-practices judge, has authored hundreds of articles and white papers, and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.

Editor's Note: More articles and resources are available in William's BeyeNETWORK Expert Channel. Be sure to visit today!

June 2010 Archives

Links to the Columnar Database series:


Understanding Columnar Database Fundamentals

More on the Row-Oriented Data Page

The Columnar Data Page

How Columnar Addresses Modern Bottlenecks

Columnar Materialization

The Value of Performance and When to Consider Columnar Databases

This final installment in my series on columnar will focus on what, to many, is the bottom line: when to use one.  I have built this up from the ground up, starting with the major difference between columnar and row-based databases - the orientation of the data in its blocks/pages.  Understanding those physical differences is the foundation for understanding the queries and workloads that will perform better or worse, and by how much.  To be sure, the reason to use columnar is selectively improved performance of data retrieval queries.

Seldom are such platform decisions based on a few queries.  The overall workload needs to be analyzed.  There may be a case for splitting the workload accordingly, which may mean duplicating some of the data across columnar and row-based systems.

So, anything is possible when it comes to system selection.  It all depends on the value of performance.  Since columnar databases will not run an entire enterprise by themselves, they are up against entrenched row-based skill sets and mindsets.  It must be the value of performance that overrides this if columnar is going to be considered.  Many times that case emerges only after exhaustive tuning of the row-based DBMS.

Based on theory, client-side experience in platform selection and building information management systems, personal testing and benchmarks, I offer these observations.  Clearly, these must be query observations and not workload observations.  Block reads are a primary barometer of performance, although in columnar you also have the post-read "gluing" process I spoke of in an earlier post to deal with.  Both row-based and column-based systems have prefetch strategies, which speed up scans when triggered.  Also, the row makeup (size, number and type of columns), as discussed earlier, factors into what you're getting with each block read.

Some things are clear, however.

A table scan of all columns will perform best in row-based.  You already know why if you've read the other posts.  However, just because the row-based DBMS decided to do a table scan does not mean one is necessary in columnar.  This is hugely important when assessing columnar.  Some row-based DBMSs will decide to table scan at as little as 15% selectivity.  Columnar, by contrast, would access only the vectors containing the interesting columns.  Single-column scans can therefore be much faster in columnar.
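The block-read arithmetic behind that difference can be sketched in a few lines. This is a back-of-the-envelope illustration only; the table size, column count, and page size are hypothetical, not drawn from any particular DBMS:

```python
# Rough block-read comparison for a single-column scan, assuming a
# hypothetical table: 10 million rows, 20 columns of 8 bytes each,
# 8 KB pages. All figures are illustrative.
ROWS = 10_000_000
COLS = 20
COL_BYTES = 8
PAGE_BYTES = 8 * 1024

# Row store: every page holds full rows, so even a one-column scan
# must read them all.
rows_per_page = PAGE_BYTES // (COLS * COL_BYTES)
row_store_pages = -(-ROWS // rows_per_page)  # ceiling division

# Column store: only the one interesting vector is read.
vals_per_page = PAGE_BYTES // COL_BYTES
col_store_pages = -(-ROWS // vals_per_page)

print(row_store_pages, col_store_pages)  # 196079 vs 9766: ~20x fewer reads
```

Under these toy assumptions the column store reads roughly one twentieth of the pages - not coincidentally, one page per twenty columns.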

The higher the number of columns in the query, the more it will approach row-based index performance (assuming the right indexes are in place in the row-based DBMS).  For small databases, the performance differences are muted.

A single record retrieval is better in row-orientation, especially when there are a high number of columns in the table.  If the workload is mostly point queries of full records, you will be better off in row-based.

The bottom line is column selectivity.  Where it is high, those queries will perform better in columnar.  This also means that longer rows (which make column selectivity all the more significant) lend themselves to columnar.  Of course, you can argue that you can build a bunch of "mini-tables" in row-based that you connect by key.  Yes, and you could also simulate columnar with two-column (key and value) tables and/or by directing the optimizer to leaf-level index scans.  The maintenance on such systems would be onerous.  Plus, index leaf pages must carry record ID overhead (pointers to the data pages where the 'rest of' the records reside), and columnar structures do not.
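To make the "two-column table" simulation concrete, here is a toy sketch using SQLite purely as a stand-in row store. The table and column names are hypothetical; note how every narrow table repeats the row key - exactly the per-record overhead a true columnar vector avoids - and how reassembling a "row" costs a join per column touched:

```python
# Mimicking columnar in a row store with one (key, value) table per
# column, joined back together by row key. Names and data are made up.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE price (rid INTEGER PRIMARY KEY, val REAL)")
cur.execute("CREATE TABLE storeid (rid INTEGER PRIMARY KEY, val INTEGER)")
cur.executemany("INSERT INTO price VALUES (?, ?)",
                [(1, 9.99), (2, 4.50), (3, 12.00)])
cur.executemany("INSERT INTO storeid VALUES (?, ?)",
                [(1, 1), (2, 2), (3, 1)])

# Getting price for rows in store 1 now requires a join, and the rid
# key is stored redundantly in every per-column table.
cur.execute("""SELECT p.rid, p.val FROM price p
               JOIN storeid s ON s.rid = p.rid
               WHERE s.val = 1 ORDER BY p.rid""")
print(cur.fetchall())  # [(1, 9.99), (3, 12.0)]
```

Multiply the joins and the key duplication by dozens of columns and the maintenance burden described above becomes obvious.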

The "devil's in the details" - the details of your workload and the engineering of the columnar DBMS, that is.  It's an emerging, specialized future.  If you're looking for the performance advantages of columnar, go columnar for workloads of large data, long rows, batch (or otherwise nonintrusive-to-query) loads, with selective column queries and where the value of performance is there.

Posted June 24, 2010 10:59 AM
Permalink | No Comments |


So far in this series on columnar, I have attempted to distinguish the columnar database from the row-wise database from the bottom up, drawing distinctions in how the data is stored on disk, which is the major physical difference.  Once the data is stored in its new and unusual way in a column-based DBMS, it, of course, must be retrieved.

In row-wise data retrieval, we frequently ask for multiple tables to be joined in the result set.  Of course, in columnar (which works with SQL too, by the way), there are joins, but there is also an analogous concept within a single table.  It's a process I call "gluing": pulling together the columns from within a table that are needed by the query, whether for projection or for selection (the WHERE clause).

In row-wise, unnecessary columns must be removed from the result set because inevitably (unless the query is a SELECT *) more columns were retrieved than are interesting to the query.  In columnar, since just enough columns were retrieved, that step is not necessary.  However, what is necessary is a strategy that determines when to glue (materialize) the columns together.  There are two choices: early materialization or late materialization.

Early materialization retrieves all the columns necessary and glues them together up front, effectively creating a "row" with which to do the necessary processing.  If you asked a query along these lines (the table name is illustrative):

SELECT CUSTID, PRICE
FROM SALES
WHERE PRODID = 4
AND STOREID = 1

...it would pull CUSTID, PRICE, PRODID, and STOREID from their respective vectors, glue them together*, apply the WHERE clause and return the selected columns by trimming PRODID and STOREID from the "row".  This strategy works best for complex joins.
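The early-materialization steps just described can be sketched with Python lists standing in for the single-column vectors. The column names match the example; the data itself is made up:

```python
# Toy column vectors; position i across all lists is "row" i.
CUSTID  = [101, 102, 103, 104]
PRICE   = [9.99, 4.50, 12.00, 7.25]
PRODID  = [4, 7, 4, 4]
STOREID = [1, 1, 2, 1]

# 1. Glue all needed columns into full "rows" up front...
rows = list(zip(CUSTID, PRICE, PRODID, STOREID))
# 2. ...then apply the WHERE clause (PRODID = 4 AND STOREID = 1)...
matches = [r for r in rows if r[2] == 4 and r[3] == 1]
# 3. ...and trim PRODID/STOREID, leaving the projected columns.
result = [(custid, price) for custid, price, _, _ in matches]
print(result)  # [(101, 9.99), (104, 7.25)]
```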

The same query could also look into the PRODID vector and return a bitmap of the rows that met the condition (=4), do the same for STOREID (=1), then "AND" the bitmaps to determine which rows met both conditions.  The columns that matched the rows in those vectors can then be pulled and glued together at the very end of the processing, just before presentation.  This is called late materialization and it is usually best for less complex and non-join queries.
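A toy sketch of that late-materialization path, again with Python lists as stand-in column vectors (names and data are illustrative):

```python
# Toy column vectors; position i across all lists is "row" i.
CUSTID  = [101, 102, 103, 104]
PRICE   = [9.99, 4.50, 12.00, 7.25]
PRODID  = [4, 7, 4, 4]
STOREID = [1, 1, 2, 1]

# 1. Scan each predicate's vector into a bitmap of matching positions.
bm_prod  = [v == 4 for v in PRODID]    # [True, False, True, True]
bm_store = [v == 1 for v in STOREID]   # [True, True, False, True]
# 2. "AND" the bitmaps to find rows meeting both conditions.
bm_both = [a and b for a, b in zip(bm_prod, bm_store)]
# 3. Only now touch the projected vectors, gluing values at the hit
#    positions just before presentation.
result = [(CUSTID[i], PRICE[i]) for i, hit in enumerate(bm_both) if hit]
print(result)  # [(101, 9.99), (104, 7.25)]
```

Same answer either way; the difference is that here the CUSTID and PRICE vectors are never touched for rows that fail the predicates.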

Ideally, the columnar optimizer is able to execute either strategy.  Ideally, it can analyze the query and determine and execute the best strategy.  It's important to analyze your workload in order to determine the value of this aspect of the columnar optimizer to your shop.  And that will lead to the next entry, which will be on the fit of columnar databases in an enterprise.

*Recall from an earlier post how the engine carefully connects columns from the same row.

Posted June 1, 2010 1:51 PM
Permalink | 1 Comment |

