Blog: William McKnight Subscribe to this blog's RSS feed!

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope the entries can be of some modest benefit to that goal. Please share your thoughts and input to the topics.

About the author >

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist, a frequent best-practices judge, has authored hundreds of articles and white papers, and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.

Editor's Note: More articles and resources are available in William's BeyeNETWORK Expert Channel. Be sure to visit today!

Links to the Columnar Database series:

 

Understanding Columnar Database Fundamentals

More on the Row-Oriented Data Page

The Columnar Data Page

How Columnar Addresses Modern Bottlenecks

Columnar Materialization

The Value of Performance and When to Consider Columnar Databases

This final installment in my series on columnar will focus on what, to many, is the bottom line - when to use one.  I have built up to this point from the ground up - starting with the major difference between columnar and row-based - the orientation of the data in its blocks/pages.  Understanding those physical differences is the foundation for understanding the queries and workloads that will perform better or worse, and by how much. To be sure, you would use columnar for selective improved performance of data retrieval queries.

Seldom are such platform decisions based on a few queries.  The overall workload would need to be analyzed.  There may be a view to split the workload accordingly.  This may mean duplicating some of the data across columnar and row-based systems. 

So, anything is possible when it comes to system selection.  It all depends on the value of performance.  Since columnar databases will not run an enterprise, they are up against entrenched row-based skill sets and mindsets.  It must be the value of performance that overrides this if columnar is going to be considered.  Many times this comes after exhaustive tuning of the row-based DBMS. 

Based on theory, client-side experience in platform selection and building information management systems, personal testing and benchmarks, I offer these observations.  Clearly, these must be query observations and not workload observations.  Block reads are a primary barometer of performance, although you have the post-read "gluing" process I spoke of in an earlier post to deal with in columnar.  You also have prefetch strategies in both row-based and column-based, which speeds up scans when it is triggered.  Also the row makeup (size, number/type columns), as discussed earlier, factors into what you're getting with the block reads.

Some things are clear however. 

A table scan of all columns will perform best in row-based.  You already know why if you've read the other posts.  However, just because the row-based DBMS decided to do a table scan does not mean one is necessary in columnar.  This is hugely important when assessing columnar.  Some row-based DBMS will decide to table scan at 15% selectivity.  In columnar, it would only access those vectors containing the interesting columns.  Single column scans can therefore be much faster in columnar.

The higher the number of columns in the query, the more it will approach row-based index performance (assuming the right indexes are in place in the row-based DBMS).  For small databases, the performance differences are muted.

A single record retrieval is better in row-orientation, especially when there are a high number of columns in the table.  If the workload is mostly point queries of full records, you will be better off in row-based.

The bottom line is column selectivity.  Where it is high, those queries will perform better in columnar.  This also means that longer rows (thereby creating a significant granularity to query selectivity) would lend themselves to columnar.  Of course, you can argue that you can build a bunch of "mini-tables" in row-based that you can connect by key.  Yes, and you could also simulate columnar with two-column (key and value) tables and /or directing the optimizers to leaf-level index scans.  The maintenance on such systems will be very high and onerous.  Plus, index leaf pages must contain record ID overhead (pointers to the data pages where the 'rest of' the records reside) and columnar structures do not.

The "devil's in the details" - the details of your workload and the engineering of t he columnar DBMS, that is.  It's an emerging specialized future.  If you're looking for the performance advantages of columnar, go columnar for the workloads of large data, long rows, batch (or nonintrusive to query) loads, with selective column queries and where the value of performance is there. 


Posted June 24, 2010 10:59 AM
Permalink | No Comments |

Leave a comment

    
   VISIT MY EXPERT CHANNEL

Search this blog
Categories ›
Archives ›
Recent Entries ›