We use cookies and other similar technologies (Cookies) to enhance your experience and to provide you with relevant content and ads. By using our website, you are agreeing to the use of Cookies. You can change your settings at any time. Cookie Policy.

Blog: William McKnight Subscribe to this blog's RSS feed!

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope the entries can be of some modest benefit to that goal. Please share your thoughts and input to the topics.

About the author >

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist, a frequent best-practices judge, has authored hundreds of articles and white papers, and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.

Editor's Note: More articles and resources are available in William's BeyeNETWORK Expert Channel. Be sure to visit today!

 Links to the Columnar Database series:


Understanding Columnar Database Fundamentals

More on the Row-Oriented Data Page

The Columnar Data Page

How Columnar Addresses Modern Bottlenecks

Columnar Materialization

The Value of Performance and When to Consider Columnar Databases

This is a continuation of my thread on columnar databases.  In the last entry, as a foundation, I introduced the row-wise data page.  In this post, I will address the changes at the data page level that occur for a columnar database. 

In the row-wise data page, all of the columns are found sequentially within the table for each row and rows are stored consecutively.  In a columnar database, each data page is populated with data from only one COLUMN.  I will refer to this data structure as Sybase IQ does, as a vector.  So, each vector contains a column's worth of data.  Vectors will take up more or less pages depending on the size of the column.  A table will have 1 vector per column.

Queries will usually need multiple columns from a table.  So those columns must be fit together appropriately.  It would not work to put column 1 from row 1 together with column 2 from row 2 and column 3 from row 3 and present it as a "row".  The columns have to be put back together (what I call "glued") appropriately.  To facilitate this, the columns are stored in the order of the rows they belong to.  Hence, a row ID map at the end of the page is not required in columnar and consequently, there can be less wasted space.*

For example purposes, I will use a 3-column table with a customer number, first name and last name columns.  The rows contain:

1.       123-William-McKnight

2.       456-Joe-Smith

3.       789-Joe-Doe

There are 4 ways that columnar databases store data:

1.       For each column, the row ID is stored alongside the column value.

a.       Vector 1:  1-123,2-456,3-789

b.      Vector 2: 1-William,2-Joe,3-Joe

c.       Vector 3: 1-McKnight, 2-Smith, 3-Doe

2.       For each column, a row "range" is used where applicable.

a.       Vector 1: 1-123,2-456,3-789

b.      Vector 2: 1-William,2-3-Joe

c.       Vector 3: 1-McKnight, 2-Smith, 3-Doe

3.       For each column, the row is indicated by position.

a.       Vector 1: 123,456,789

b.      Vector 2: William, Joe, Joe (or William, Joe (2))

c.       Vector 3: McKnight, Smith, Doe

4.       Bitmap representation with abstracted values corresponding to entries in a separate map structure are used.

The first way is wasteful since the row ID is unnecessary to store since it is implied by the order of the values.  The second way does not improve upon that, but does compress space when there are recurring values in sequential rows.  The third ways shows removal of the row ID.  Notice Vector 2 may or may not actually use the range representation for the recurring values.  It's usually better if it does.  Finally, while there is nothing new about columnar bitmaps (vs. row-wise bitmaps), the acceptable cardinality for the fit of a column as a bitmap is much higher in columnar than row-wise.  In columnar, up to 1500-2000 values is considered "low cardinality" and "worth it" to apply bitmapping to (as opposed to < 50-100 in row-wise DBMS).  This is useful when you have recurring values that are not stored in sequential records.  A column like "country" can reduce its size 30-fold with bitmap encoding.

In some columnar DBMS, the columns/vectors can utilize different storage methods. The 3rd and 4th ways shown here are best in a mixed strategy.

Again, columnar is attempting to give you more bang for the buck in your I/Os when you are seeking a minimal set of columns from the table.  In the next entry, I will describe why getting the most out of your I/Os has becoming increasingly very important.

*Wasted space can occur in row-wise when you run out of Row IDs before storage space on the page.  Since each row must have a map entry, the page will be considered full when the Row ID limit is reached.


Posted May 6, 2010 9:33 PM
Permalink | No Comments |

Leave a comment


Search this blog
Categories ›
Archives ›
Recent Entries ›