

Blog: William McKnight


Links to the Columnar Database series:

Understanding Columnar Database Fundamentals

More on the Row-Oriented Data Page

The Columnar Data Page

How Columnar Addresses Modern Bottlenecks

Columnar Materialization

The Value of Performance and When to Consider Columnar Databases

This is a continuation of my series on columnar databases. Today, I will talk about how columnar databases address modern architectural bottlenecks.

Row-based MPP (massively parallel processing) is a large cluster of up to thousands of processors. Each node can be an SMP machine, and the cluster can be arranged shared-nothing or shared-disk. The interconnect ties the nodes together, allowing processing to span nodes. Row-based MPP architectures support some of the largest and most successful data warehouses in the world, and this is not going to change any time soon.

In row-based MPP, one of the most important design choices we make is how to distribute data among the nodes. Typically, some form of 'randomization' is used, such as hash or round-robin distribution. We have also slowly made our block sizes larger over the years because the block is the unit of I/O (prefetch notwithstanding) and we want to get as much data as possible in a single I/O. I/O has become the bottleneck in our environments.
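
To make the distribution choice concrete, here is a minimal Python sketch of the two schemes; the node count, the key values and the use of an MD5 hash are illustrative assumptions, not any particular product's implementation.

import hashlib

NUM_NODES = 8  # hypothetical cluster size, an assumption for illustration

def hash_node(distribution_key, num_nodes=NUM_NODES):
    # Hash distribution: equal keys always land on the same node.
    digest = hashlib.md5(distribution_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def round_robin_node(row_number, num_nodes=NUM_NODES):
    # Round-robin distribution: deal rows out in turn, ignoring content.
    return row_number % num_nodes

for key in ("cust-1001", "cust-1002", "cust-1003"):
    print(key, "-> node", hash_node(key))

Note that hash distribution sends equal keys to the same node, which round-robin cannot promise; that property matters when co-locating rows for joins.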

While disk density has gone up significantly in the last 20 years or so, packing much more data into smaller spaces, I/O is still limited by the physical movement of the disk arm and head. Physics simply won't allow such a small component to move much faster without the arm flying right off its handle (which could ultimately be why solid state disk and phase change memory become the norm over time).

Other than the larger block sizes, our designs have not changed much over the years to accommodate this I/O-bound reality. We have tried to make the OLTP database more analytic with specialized indices, OLAP cubes, summary tables and partitioning, but we would need hundreds of drives to keep one CPU truly busy in a robust, complex, heavily utilized data warehouse environment. It's not feasible. Incidentally, because of this bottleneck, random I/O has sped up much more slowly over the years than sequential I/O, which doesn't require nearly as much head movement.
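
A back-of-the-envelope sketch of that claim, using assumed round numbers for the era's hardware (the throughput figures are illustrative, not measurements):

# Assumed, illustrative figures (not benchmarks).
cpu_scan_rate_mb_s = 1000       # data one busy CPU can chew through per second
seq_drive_mb_s = 100            # sequential throughput of one spinning disk
random_iops = 150               # random reads per second, limited by head movement
random_io_size_kb = 8           # size of each random block read

random_drive_mb_s = random_iops * random_io_size_kb / 1024.0  # about 1.2 MB/s

print("Drives to feed one CPU, sequential:", cpu_scan_rate_mb_s / seq_drive_mb_s)   # ~10
print("Drives to feed one CPU, random:", cpu_scan_rate_mb_s / random_drive_mb_s)    # ~850

Under these assumptions, roughly ten drives can feed a CPU with sequential reads, but head movement drops a drive to about a megabyte per second of random reads, and the drive count climbs into the hundreds.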

Naturally, on a project-by-project basis, you just make the best of the systems you have to work with. The reason to discuss this in my columnar series is that columnar is one technique that makes the most of the I/O.

Regardless of row or column orientation, the data is stored on disk, and various gates (disk, memory, cache) store and process the data on its way to the CPU. Each level works on ever-decreasing data sizes. Think of it as a pyramid.

[Figure: pyramid.JPG, the storage hierarchy drawn as a pyramid with disk at the base and the CPU at the top]

In row-based MPP, the entire row goes "up" the pyramid, creating bottlenecks at each level and, depending on the architecture, causing rows to "skip" the screening potential of each level, especially the L2 cache, so that the predicates end up being applied directly by the CPU. The processes are the same in columnar, but, as previously described in this series, only the needed columns are passed up the pyramid, creating a clearer path to the CPU. This is one way to work around the I/O bottleneck: ask the I/O to process only the data that it needs!
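
A simplistic model of that clearer path, assuming a 50-column table of fixed-width values and ignoring compression (which typically widens the gap further in columnar's favor):

# Bytes that must cross the I/O path to evaluate a predicate on one column.
num_rows = 10_000_000
num_columns = 50
bytes_per_column = 8            # fixed-width columns, for simplicity

row_store_bytes = num_rows * num_columns * bytes_per_column   # whole rows come up
column_store_bytes = num_rows * bytes_per_column              # only the one column

print("Row store scans:   ", row_store_bytes // 2**20, "MB")    # ~3814 MB
print("Column store scans:", column_store_bytes // 2**20, "MB")  # ~76 MB

The fifty-fold difference here is simply the column count; actual savings vary with schema width, the columns a query touches and the encoding used.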

In my next entry in this series, I will talk about the strategies columnar databases use to pull it all together and materialize a result set.

 


Posted May 17, 2010 9:01 PM

    