Blog: William McKnight

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here on the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and to connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope the entries can be of some modest benefit to that goal. Please share your thoughts and input on these topics.

About the author >

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist, a frequent best-practices judge, has authored hundreds of articles and white papers, and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.

Editor's Note: More articles and resources are available in William's BeyeNETWORK Expert Channel. Be sure to visit today!

November 2010 Archives

In my last post, I talked about Microsoft's new, upcoming columnar offering, Apollo.  I said it was designed to take some pressure off the core DBMS to do it all and do it all fast.  That's doubly true for Parallel Data Warehouse (PDW), the new MPP offering from Microsoft.  This is probably one of the last times you'll hear the word DataAllegro, but that technology, acquired in 2008 by Microsoft, is what PDW is based on.  Microsoft has spent the last two years replacing the core (Ingres) DBMS with SQL Server and the Linux/Java stack with Windows/C#.  PDW currently works on HP hardware and is in early release.

Microsoft is giving its users two major additional data storage options in Denali - columnar and MPP.  Microsoft is going down the path of functional parity between the core SMP offering and PDW, which is already integrated with the SQL BI stack.  It hopes to keep SMP customers who hit its scalability limits inside the Microsoft tent.

There is a lot of overlap in capabilities among SMP, columnar and MPP.  It's your job to sort through your workloads and make a plan.  I have found MPP increasingly advantageous as data volumes grow, and columnar useful for workloads with high column selectivity - queries that touch only a few of a wide table's columns.
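To make the columnar case concrete, here is a hypothetical query (the table and column names are my own invention, not anything from Microsoft) of the shape that tends to favor a column store: it aggregates two columns of a wide fact table, so a columnar engine reads only those two columns rather than whole rows.

-- Only sales_date and sales_amount are touched; a column store reads
-- just those two columns, while a row store drags every column of
-- every qualifying row through the I/O path.
SELECT YEAR(sales_date)  AS sales_year,
       SUM(sales_amount) AS total_sales
FROM   dbo.FactSales     -- imagine 50+ columns and hundreds of millions of rows
GROUP  BY YEAR(sales_date);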

I'll be part of a virtual seminar focused on PDW on Tuesday.  I'll be talking about data consolidation strategies, a topic Microsoft is ready to take on with PDW.

Abstract:

As budgets languish, data growth balloons and business demand intensifies, BI and data warehousing professionals are under immense pressure to squeeze every last dollar of value from existing investments, while providing 24/7 access to mission-critical business information. That's the bad news.

The good news is you're invited to join renowned visionaries Bill Inmon (the father of data warehousing) and William McKnight (leading information management consultant) for our LIVE, interactive virtual seminar on November 16th (9:00 AM - 1:30 PM EDT) - designed to help you leverage next-generation data warehousing technologies for maximum gain.


Posted November 14, 2010 8:44 AM
Permalink | 1 Comment |

It was columnar day for me at SQL PASS on Wednesday.  On Tuesday, Microsoft announced that Denali, the code name for its next release, would have a columnar data store option.  My talk was on columnar databases Wednesday.  Here are some of the details I shared about Denali's column store, which has a project name of Apollo.  If you're interested in columnar databases in general, see my blog entries here.

In Denali, there will be an index type of "COLUMN STORE".  I find this an interesting use of the term "index," because the resulting data stores are not like traditional indexes.  However, Microsoft has never been a conformist organization.  The column stores are non-clustered indexes.  No indexes can be created on top of the column stores.

Where the column store is like an index is that you still need the accompanying row-wise database.  The column stores are not created independently of the row-wise tables; they sit alongside them.  To my knowledge, this is the only database that requires this.  I don't expect it to be a long-term requirement.  While this may seem like it's expanding your storage needs (and it is), it may not be by as much as you initially think, because some non-clustered indexes may become redundant in this architecture.

The good news about this is that the optimizer has been updated to route queries to the column stores or the row store accordingly.  This could prove to be a competitive differentiator.  Few other database systems have this.  An intelligent hybrid optimizer will be key to the success of databases that are at least partly columnar. 

Apollo's vectors (to use the Sybase IQ language from my earlier posts) are called column segments, although there can be multiple segments per column, as explained below.  You can only have one column store index per table, but you can name as many columns as you want.  The column order doesn't matter, because each column forms its own independent set of segments.  Only single-column segments are supported in Apollo.
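As a sketch only - the exact keyword and DDL were still subject to change in pre-release builds, and the table and column names below are invented for illustration - defining a column store should look much like defining an ordinary non-clustered index:

-- Illustrative only: pre-release syntax, with made-up object names.
-- One column store index per table; list as many columns as you like,
-- in any order.
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactSales
ON dbo.FactSales (sales_date, store_id, product_id, sales_amount);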

Apollo leverages the intellectual property, patents, and good experiences that Microsoft has had with Vertipaq, the client-side, columnar, in-memory structure used with PowerPivot.  Columnar remains the preferred format for Vertipaq going forward - in fact, its only format.

In Apollo, no inserts, updates or deletes are allowed on tables that have a COLUMN STORE (this is the part of the talk where I did a mock exit).  You can, however, do incremental loads, and you can switch partitions in to add data.  You can also sandwich your updates between a DISABLE and a REBUILD of the column store.  I expect this will improve over time.
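A rough sketch of those two workarounds - again with invented object names, and with the caveat that pre-release syntax may shift:

-- Workaround 1: disable the column store, run the DML, then rebuild it.
ALTER INDEX csi_FactSales ON dbo.FactSales DISABLE;
UPDATE dbo.FactSales
   SET sales_amount = sales_amount * 1.05
 WHERE store_id = 42;
ALTER INDEX csi_FactSales ON dbo.FactSales REBUILD;

-- Workaround 2: load new rows into a staging table that has a matching
-- column store, then switch it in as a partition of the fact table.
ALTER TABLE dbo.FactSales_Staging SWITCH TO dbo.FactSales PARTITION 12;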

As long as I'm on limitations, the columns selected have data type restrictions.  The columns must be integer, real, string, money, datetime or a decimal that is 18 digits or less.  No other data types are supported. [11/19 update: Apollo will support all the date and time types (including datetime) except that datetimeoffset is only supported for precisions of 0 to 2.]

As we know, getting more (relevant) information into each I/O is one of the advantages of a columnar data store.  Microsoft has taken this to a new level.  While data is still stored in pages (blocks), the unit of I/O is actually one million data values.  That forms a "segment."  You read that right - the unit of I/O is not a certain number of "K" but is defined by the NUMBER of data values.  Inside those pages, the data is stored in blobs.  Bitmapping is part of the storage somehow as well, although the columnar data page layouts are not public information.  Neither is how it's doing materialization.  As for I/O, the compression algorithms have been reengineered for columnar.  These are not the same compression algorithms used in the row-wise database.

If there are more than a million data values represented, there will be multiple segments.  Apollo uses run-length encoding ("row-range" in this post) and the 'million' refers to the number of corresponding rows covered by the data values.  If there are 500,000 unique values in a 2,000,000 row table, there will be 500,000 entries but 2 segments.
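In other words (my arithmetic here, not Microsoft documentation), the segment count is driven by the rows a column covers, not by how many distinct values it holds:

-- Segments per column = rows covered / 1,000,000, rounded up.
SELECT CEILING(2000000 / 1000000.0) AS segments_per_column;  -- returns 2
-- ...even though run-length encoding may store only ~500,000 distinct entries.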

Apollo is not available in CTP1.  I have no information on its licensing, except that it probably won't be available in the lower-tier SQL Server licenses.  As for which columns to store this way, it's going to be per the recommendations from this post.

Get ready as Microsoft begins its foray into columnar and takes the pressure off the core DBMS to handle all workloads and handle them all fast.


Posted November 12, 2010 10:22 AM
Permalink | No Comments |
 

For much of the last decade, conventional theories surrounding decision support architectures have focused more on cost than business benefit. Lack of Return on Investment (ROI) quantification has resulted in platform selection criteria focused on perceived minimization of initial system cost rather than on maximizing lasting value to the enterprise. Often these decisions are made within departmental boundaries without consideration of an overarching data warehousing strategy.

This reasoning has led many organizations down the eventual path of data mart proliferation: the creation of non-integrated data sets developed to address specific application needs, usually with an inflexible design. In the vast majority of cases, data mart proliferation is not the result of a chosen architectural strategy, but the consequence of the lack of one.

To further complicate matters, the recent economic environment and ensuing budget reduction cycles have forced IT managers to find ways of squeezing every drop of performance out of their systems while still managing to meet users' needs. In other words, we're all being asked to do more with less. Wouldn't it be great to follow in others' footsteps and learn from their successes while still being considered a thought leader?

The good news is that the data warehousing market is now mature enough that there are successes and best practices to be leveraged. There are proven methods to reduce costs, gain efficiencies, and increase the value of enterprise data. Pioneering organizations have found a way to save millions of dollars while providing their users with integrated, consistent, and timely information. The path that led to these results started with a rapidly emerging trend in data warehousing today: Data Mart Consolidation (DMC). I've learned that companies worldwide are embracing DMC as a way to save large amounts of money while still providing high degrees of business value with ROI. DMC is an answer to the issues many face today.

My paper on this subject, available here, details the process of DMC at eight different organizations while capturing the keys to success from each. These case studies were specifically selected to demonstrate several variations on the concept of consolidation. While there is no such thing as a "cookie-cutter" DMC process, there are common best practices and lessons to be shared.

 


Posted November 7, 2010 4:44 PM
Permalink | No Comments |

