Blog: William McKnight

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and to connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope the entries can be of some modest benefit to that goal. Please share your thoughts and input on the topics.

About the author

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist, a frequent best-practices judge, has authored hundreds of articles and white papers, and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.

Editor's Note: More articles and resources are available in William's BeyeNETWORK Expert Channel. Be sure to visit today!

April 2010 Archives

Links to the Columnar Database series:

 

Understanding Columnar Database Fundamentals

More on the Row-Oriented Data Page

The Columnar Data Page

How Columnar Addresses Modern Bottlenecks

Columnar Materialization

The Value of Performance and When to Consider Columnar Databases

I received several direct questions on my latest blog posting regarding the row-oriented data page.  Before I move on to more discussion of columnar, I thought I'd answer those questions here.

1. How does the DBMS know where fields that follow variable length fields begin, since they will not be in the same place for every record?

It's correct that any field that follows a variable length field will not be in the same position in each record.  The DBMS actually reads the length of the variable length field from its first 2 bytes and uses that to "jump" to the start of the next field.  All subsequent fields, provided they are not also variable length, will begin at fixed offsets from the end of the variable length field.  This is why you should put variable length fields at or toward the end of the record, and why you should only use variable length fields when the field size really does vary.  Variable length fields save space, but keep in mind that 2 bytes are added to each such field for the length prefix.
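To make the offset arithmetic concrete, here is a small Python sketch of reading a record with a hypothetical schema of INT, VARCHAR, INT. This doesn't match any particular DBMS's on-disk format; it just illustrates reading the 2-byte length prefix and jumping past the variable length field to reach the field that follows it.

```python
import struct

# Hypothetical record layout: INT id (4 bytes), VARCHAR name, INT dept (4 bytes).
# The varchar stores its length in its first 2 bytes (little-endian here; real
# DBMSs vary), so the offset of 'dept' differs from record to record.

def read_record(buf: bytes):
    rec_id = struct.unpack_from("<i", buf, 0)[0]
    name_len = struct.unpack_from("<H", buf, 4)[0]         # 2-byte length prefix
    name = buf[6:6 + name_len].decode("ascii")
    dept = struct.unpack_from("<i", buf, 6 + name_len)[0]  # "jump" past the varchar
    return rec_id, name, dept

rec = struct.pack("<iH", 7, 5) + b"Smith" + struct.pack("<i", 42)
print(read_record(rec))  # (7, 'Smith', 42)
```

Note that the 5-character name costs 7 bytes of storage here, which is the 2-byte overhead mentioned above.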

2. If a field is null, wouldn't it be 0 or 1 bit in length?

First of all, nothing inside actual fields is a bit in length.  Everything in the DBMS is done on byte (8-bit) boundaries.  So there will be at least 1 byte for each field, even if that field were a decimal(1), which can only contain values up to 9 and could be held in 4 bits.  For nulls, there is an ADDITIONAL byte prepended to the field.  If that byte is 'on', the field is null and the DBMS will ignore whatever leftover data may happen to be in the field.  If the byte is 'off', the value is not null and the field value is legitimate.  So, if a value is set to null, space is not saved!  For any nullable field, you actually have to ADD a byte to storage.  However, nullability does not make the column variable length (see answer #1 above).  Of course, a variable length column can be nullable.
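The null indicator byte can be sketched the same way. This is a hypothetical encoding (1 = null, 0 = not null, followed by a 1-byte decimal(1) payload), not any vendor's actual format, but it shows why a nullable field stays fixed width and why the leftover payload bytes of a null are simply ignored.

```python
# A nullable field carries an extra leading byte: 0x01 means null (payload
# ignored), 0x00 means the payload byte is the real value.

def read_nullable_decimal1(buf: bytes, offset: int):
    is_null = buf[offset] == 1
    value = None if is_null else buf[offset + 1]
    return value, offset + 2   # always fixed width: null byte + 1 payload byte

print(read_nullable_decimal1(b"\x00\x07", 0))  # (7, 2)
print(read_nullable_decimal1(b"\x01\x07", 0))  # (None, 2) - leftover 0x07 ignored
```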

3. What happens when a record is deleted?

Not much, really.  There is a bit in the record header that is on or off depending on whether it's a valid record, and that bit is set to 'off' when a record is deleted.  I'll refer to invalid records as holes.  Holes just sit there taking up space.  Holes are actually linked together within the page!  Why?  Because if this is the page into which the DBMS wants to insert a record, it can use an existing hole or compact adjacent holes into a bigger one for the new record.  If you have a clustering index, the DBMS MUST put the records in order.  However, that 'order' is only according to the row ID map; physically on the page, the records can still be out of order.  This is fixed, along with the removal of all holes, during reorganization processes, either explicitly invoked or, in some DBMSs, run as background processes.
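Here's a toy model of that delete-and-reuse behavior. Everything about it (the class, the slot list, the free list) is invented for illustration; it simply mirrors the valid bit being flipped off and a hole being reused by a later insert.

```python
# Toy page: each record carries a 'valid' flag; delete just clears the flag and
# threads the slot onto a free list so an insert can reuse the space later.

class ToyPage:
    def __init__(self):
        self.records = []      # [valid_flag, payload] per slot
        self.free_slots = []   # indexes of holes, "linked" as a free list

    def delete(self, slot):
        self.records[slot][0] = False   # flip the valid bit; the bytes stay put
        self.free_slots.append(slot)

    def insert(self, payload):
        if self.free_slots:             # reuse an existing hole
            slot = self.free_slots.pop()
            self.records[slot] = [True, payload]
        else:                           # otherwise extend the page
            slot = len(self.records)
            self.records.append([True, payload])
        return slot

page = ToyPage()
first = page.insert(b"row-1")
page.insert(b"row-2")
page.delete(first)                      # record 1 becomes a hole
print(page.insert(b"row-3") == first)   # True: the hole was reused
```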

 


Posted April 30, 2010 8:08 AM
Permalink | No Comments |


For 99% of you, the database management systems you have been working with have been "row oriented".  You may never have applied that label to them, known why it matters or realized that alternatives exist.  And for many, even DBAs, the "need to know" down to the data page layout has not been there.  Even the vendors who are columnar don't necessarily promote their products that way, but rather promote their systems as solving "analytic" queries.

The current main alternative to "row oriented" is "column oriented".  "Column oriented" databases now permeate the landscape and claim to provide a better value proposition for a specialized analytic workload or even the entire data warehouse. 

This is the orientation of....

  • Sybase IQ
  • SAND
  • Vertica
  • ParAccel
  • InfoBright
  • Kickfire
  • Exasol
  • MonetDB
  • Microsoft SQL Server 2008 R2 Gemini/VertiPaq
  • Oracle Exadata2 (both row- and column-oriented)
  • Calpont's InfiniDB

Microsoft is using columnar in PowerPivot for SharePoint (VertiPaq) for client-side, collaborative result set manipulation.  Oracle's Exadata2 accommodates both columnar and row orientation, and Oracle has announced a "hybrid future".  While scant on details, this should merit some attention.  There is also at least one vendor, Illuminate, with a hybrid row-and-column implementation, which I'll discuss later.

 

So what is columnar?  Well, today I'm going to start on the literal answer, and in later posts I'll expand on the answer, its byproducts and its ramifications.  The literal answer builds from a knowledge of the row-based data page layout.

 

Row and column orientation have to do with how the data is laid out on a data page.  DBAs will know about data page size, which can range from 4K to 64K.  Every DBMS is slightly different, but inside the page is a page header, the records (with record headers), a row ID map and a small page footer.  The row ID map is significant because it contains the offsets from the start of the page to the beginning of each record (well, actually the record header).  If the DBMS wants to read the 4th column of the 3rd record on the 123rd page, it will go to the 123rd page and read the (let's say 2-byte) entry that is 8 bytes from the end of the page (2-byte footer plus 2 bytes for each row ID map entry times 3) to get the offset of the 3rd record.  That takes it to the record and, after (let's say) 6 bytes for the record header, the actual data begins.
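That arithmetic can be sketched in a few lines. The sizes here (4K page, 2-byte footer, 2-byte row ID map entries, 6-byte record header) are the illustrative ones from the text, not any particular DBMS's.

```python
import struct

PAGE_SIZE = 4096
FOOTER = 2
SLOT = 2          # bytes per row ID map entry, growing backward from page end
HEADER = 6        # assumed record header size

def record_data_offset(page: bytes, n: int) -> int:
    """Offset of the data of the n-th record (1-based) on the page."""
    entry_pos = PAGE_SIZE - FOOTER - SLOT * n      # 8 bytes from the end for n=3
    rec_start = struct.unpack_from("<H", page, entry_pos)[0]
    return rec_start + HEADER                      # skip the record header

# Build a dummy page whose row ID map says record 3 starts at byte 100.
page = bytearray(PAGE_SIZE)
struct.pack_into("<H", page, PAGE_SIZE - FOOTER - SLOT * 3, 100)
print(record_data_offset(bytes(page), 3))  # 106
```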

 

In a row-oriented DBMS, EVERY field of the record is then found there (in column order within the table).  If a certain column (say, the 4th column) is of interest, the DBMS will advance the number of bytes contained in the first 3 fields (according to the catalog, except for variable length fields, for which the field length is contained in the first 2 bytes of the field itself) and there the read engine finds its desired 4th column of the 3rd record on the 123rd page.
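The field-skipping walk might look like this. The schema and widths are hypothetical; the point is that fixed widths come from the catalog while a variable length field's width must be read from its own 2-byte prefix.

```python
import struct

# Hypothetical catalog: (column name, fixed width in bytes, None = varchar).
catalog = [("id", 4), ("qty", 4), ("name", None), ("dept", 4)]

def column_offset(rec: bytes, col: int) -> int:
    """Byte offset of the col-th column (1-based) within a record's data."""
    off = 0
    for _, width in catalog[:col - 1]:
        if width is None:   # variable length: read its 2-byte prefix to skip it
            off += 2 + struct.unpack_from("<H", rec, off)[0]
        else:               # fixed length: the catalog alone gives the width
            off += width
    return off

rec = struct.pack("<iiH", 1, 9, 3) + b"Bob" + struct.pack("<i", 55)
off = column_offset(rec, 4)
print(struct.unpack_from("<i", rec, off)[0])  # 55
```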

 

The time-consuming part of the operation was reading the data page in the first place.  If you're interested in just one column, it would be great to be able to find a bunch of that column's values on the page, without having to skip over those other columns that aren't interesting.  And that will set us up for the next entry in this series, which will be on the columnar data page.

 

[Figure: row-oriented data page layout (datapage.JPG)]


Posted April 26, 2010 7:45 PM
Permalink | 2 Comments |

Platform research and development in large enterprises is necessary today more than ever.  The reasons:

1. No shortage of data consolidation and re-platforming opportunities as a byproduct of the years and years of departmental and otherwise non-centralized data warehouse development

The drive for a reduced infrastructure footprint from the boardroom is finally giving the directors and vice presidents who preside over the various overlapping data warehouses in the environment reason to point out the redundancy.  Data consolidation projects have straightforward ROI at some level - the lower carrying costs of the multiple environments.  

2. The emerging specialized future, providing new and confusing opportunities

A new variety of legitimate options in database orientation, architecture and hosting mean that research is necessary more now than ever.

3. Growing cynicism toward vendor messages

However confused you may believe your end user shop to be, the newly consolidated enterprise vendors' sales forces are at least as confused.  The options can be overwhelming, and the advice from vendors largely needs to be updated and made consistent.  Likewise, the sales quotas need the same alignment.  Consequently, vendor messages are received with more cynicism and suspicion than ever.

4. An increased realization that 2010 will not be the last year

2010 is already turning out to be a progressive year for information management, and there is seemingly no end to these projects; as such, platform decisions made today may need to be revisited soon if not made optimally.  This could be costly.  The rate of data collection is beyond expectations.  Companies that never considered themselves to have terabyte-level needs now have 5 terabytes under management.  Fortune 50 enterprises have hundreds of terabytes.  At this rate, petabytes are not far off.  There's a need to get it optimal now.  The alternative could be fatal to the business, not just IT.

These reasons are leading to more and more internal research.  The need is for a huge sandbox that supports practical information management platform investigation, both as a matter of curiosity as well as pre-planning specific upcoming platform needs.  Decisions today around platform, modeling approach (dimensional/normalized), columnar, hosting, memory use, etc. will be highly leveraged for years and your internal research, supported by independent unbiased consulting, is required to make the best decisions today.


Posted April 12, 2010 6:50 AM
Permalink | No Comments |

