Blog: William McKnight

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and to connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope these entries can be of some modest benefit toward that goal. Please share your thoughts and input on the topics.

About the author

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist, a frequent best-practices judge, has authored hundreds of articles and white papers, and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.

Editor's Note: More articles and resources are available in William's BeyeNETWORK Expert Channel. Be sure to visit today!

Recently in DBMS Selection Category

Teradata Aster demonstrates its graphical "pathing" capabilities very nicely by showing the relationships between tweeters and their tweets at events, like the Teradata Third-Party Influencers Event I attended last week. 

The demonstration shows how to produce a read on the sentiment of the event, but more importantly it demonstrates relationships and influence power.  Customer relationships and influence power are becoming part of the set of derived data needed to fully understand a company's customers.  This leads to identifying engagement models and to the early identification of patterns of activity that lead to certain events - desired or otherwise.

One important point noted by Stephanie McReynolds, Director of Product Marketing at Teradata Aster, was that the sphere of relevant influence depends on the situation.  You can retweet hundreds of tweets, many from tweeters you do not even know.  However, when buying a car, only a handful of people would influence you.

One would need to take more heed of an influencer's opinion - or that of someone with a relationship to the influencer.  It can become quite a layered analysis, and influence power is hard to measure.  Grabbing various digital breadcrumbs is relatively easy, but is it indicative of influence?  Likewise, is a tweetstream indicative of the sentiment of an event?  I'm not sure.  It may not even be indicative of the sentiment of the tweeters.  Digital data is only a start.  The worlds of third-party data, real sentiment analysis and possibly sensor data are coming together.


Posted April 24, 2012 11:17 AM
Permalink | No Comments |

Teradata rolled out Teradata Data Labs (TDL) in Teradata 14.  Though it is not a high-profile enhancement, it is worth understanding not only for Teradata data warehouse customers, but for all data warehouse programs, as a signal of how program architectures now look. Teradata Data Labs supports how customers are actually playing with their resource allocations in production environments in an effort to support more agile practices under more control by business users.

TDL is part of Teradata Viewpoint, a portal-based system management solution.  TDL is used to manage the "analytic sandboxes" used by these non-traditional builders of data systems.  Companies can allocate a percentage of overall disk and other resources to the lab area, and the authorities can be managed with TDL.  By creating "data labs" and assigning them to requesting business users, TDL minimizes the potential dangers of the "can of worms" that has long been opened: supporting production create, alter and delete activity - not just select activity - by business users.

These sandboxes must be managed since resources are limited.  Queries can be limited, various defaults set and, obviously, disk space is limited for each lab.  Expiration dates can be placed on the labs, which is not dissimilar to how a public library works.  Timeframes can span from a week to a year.  The users may also send a "promotion" request to the labs manager, requesting that the entities within the lab be moved out of labs and into production.

Data labs can be joined to data in the regular data warehouse.  One Teradata customer has 25% of the data warehouse space allocated to TDL.

TDL can support temporary processing needs with strong resources - not what is usually found in development environments.  I can also see TDL supporting normal IT development.  Look into TDL, or home-grow the idea within your non-Teradata data warehouse environment.  It's an idea whose time has come.

TDL is backward-compatible with Teradata 13.
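
If you want to home-grow the idea, here is a minimal sketch in Python, with made-up names like DataLab and LabManager - this is not Teradata's API - of the concepts TDL manages: a slice of the warehouse carved out for labs, per-lab quotas, expiration dates and a promotion request back to the labs manager.

from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical model of a "data lab" allocation -- not Teradata's API, just a
# sketch of the concepts TDL manages: a disk quota carved out of the
# warehouse, an expiration date, and a promotion-request workflow.
@dataclass
class DataLab:
    owner: str
    quota_gb: float
    expires: date
    tables: list = field(default_factory=list)
    promotion_requested: bool = False

class LabManager:
    def __init__(self, total_warehouse_gb: float, lab_percentage: float):
        # e.g., 25% of the warehouse reserved for labs, as in the customer example
        self.lab_pool_gb = total_warehouse_gb * lab_percentage
        self.labs: dict[str, DataLab] = {}

    def create_lab(self, owner: str, quota_gb: float, lifespan_days: int) -> DataLab:
        allocated = sum(l.quota_gb for l in self.labs.values())
        if allocated + quota_gb > self.lab_pool_gb:
            raise ValueError("lab pool exhausted -- reclaim expired labs first")
        if not 7 <= lifespan_days <= 365:          # a week through a year
            raise ValueError("lab lifespan must be between a week and a year")
        lab = DataLab(owner, quota_gb, date.today() + timedelta(days=lifespan_days))
        self.labs[owner] = lab
        return lab

    def expired_labs(self) -> list:
        return [l for l in self.labs.values() if l.expires < date.today()]

    def request_promotion(self, owner: str) -> None:
        # Business user asks the labs manager to move lab entities into production.
        self.labs[owner].promotion_requested = True

mgr = LabManager(total_warehouse_gb=100_000, lab_percentage=0.25)
lab = mgr.create_lab("marketing_analytics", quota_gb=500, lifespan_days=90)
mgr.request_promotion("marketing_analytics")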


Posted April 17, 2012 9:38 AM
Permalink | No Comments |

One of my favorite blog entries was the one about the relational data page.  In that entry, I talked about how the space allocated to a database is formatted.  Some people agreed but pointed out that much storage is also dedicated to index pages.  And they are correct.  It depends on your index strategy.  If you add up all the index sizes on some tables, the total can exceed the size of the rows themselves.  In that case, the index pages would likewise outnumber the data pages in the database.  What about index pages and what do they look like?

There are two basic page formats for index pages, which, like data pages, have size options for the user.  I'll repeat my earlier admonition that knowing what goes on at the page level will help you better understand how your decisions affect your performance.

I'll start with the most prominent format - that of the leaf page.  The leaf page contains a series of key/RID(s) combinations (called entries).  The key is the actual value of the column.  RID stands for row/record ID and is composed of the data page number and the record number within the page.  The RID was explained in this post.  The RID is how the index is connected to the data pages.  A RID that connects to the 3rd record on the 123rd data page would be "123-3".  If the record there was a customer record for Brad Smith, who lives in Texas, and there was an index on state, the key/RID combination would be "Texas-(123-3)".

Naturally, you would have multiple customers who live in Texas, so there would be multiple RIDs in the state index associated with Texas.  It might look like "Texas-(123-3),(123-4),(125-6),(125-19),(127-10), etc.".  Any index key that shows up multiple times in the table would have multiple RIDs.  A unique index would have only one RID associated with each key.
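
To make the structure concrete, here is a tiny sketch of a leaf entry, using illustrative Python types rather than any DBMS's actual page layout: a RID is just (data page number, record number), and an entry maps one key to its list of RIDs.

from collections import namedtuple

# A RID is (data page number, record number within the page); an entry maps
# one key value to its RID list.
RID = namedtuple("RID", ["page", "record"])

# The state index entry from the example: every Texas customer's RID hangs
# off the single key "Texas".  A unique index would have exactly one RID per key.
state_index_leaf = {
    "Texas": [RID(123, 3), RID(123, 4), RID(125, 6), RID(125, 19), RID(127, 10)],
}

# Answering "which records hold Texas customers?" is just a key lookup;
# the RIDs then point the DBMS at the data pages holding the full rows.
for rid in state_index_leaf["Texas"]:
    print(f"fetch data page {rid.page}, record {rid.record}")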

The RIDs of successive entries in an index would not be in order except for the one clustered index on the table.  For example, an index on last name could have entries of:

Chamberlin-(234-2),(234-5),(336-3)

Chambers-(67-9)

Chambless-(900-33)

 

This is NOT for a clustered index. If it were, the RIDs would be in numerical order across entries.  Most indexes are non-clustered and it is normal for the RIDs to jump all around the table.  If you navigate quickly to the Chambers entry, data page 67, record 9 is where you would find "the rest of the record".  This is excellent for a query like "Select * from table where lastname = 'Chambers'".

But what about that navigation?  That comes about from the other index page format - creatively called the non-leaf page.  The non-leaf pages contain key ranges of the leaf pages so that an index navigation, which always begins at the root node, can quickly navigate to the correct leaf page.  That is the function of the non-leaf pages.

In practice, this quickly fills up a single index page (of a few "K" bytes) and then the entries are split into two non-leaf pages, and a "root" page now points to the ranges on those pages.  Eventually the root page fills up, splits again and a new level is created.  The more levels, the more non-leaf page accesses to get to the leaf page.  All index access involves getting to the leaf page.  These non-leaf pages are mostly kept in memory.
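
Here is a rough sketch of that navigation, with invented classes rather than any vendor's page format: each non-leaf page keeps the highest key covered by each child, and a lookup walks from the root down to the correct leaf.

import bisect

class LeafPage:
    def __init__(self, entries):          # entries: {key: [RIDs]}
        self.entries = entries

class NonLeafPage:
    def __init__(self, high_keys, children):
        self.high_keys = high_keys        # highest key in each child, in order
        self.children = children

    def find_leaf(self, key):
        # pick the first child whose high key is >= the search key
        i = bisect.bisect_left(self.high_keys, key)
        child = self.children[min(i, len(self.children) - 1)]
        return child if isinstance(child, LeafPage) else child.find_leaf(key)

leaf1 = LeafPage({"Chamberlin": ["234-2", "234-5", "336-3"], "Chambers": ["67-9"]})
leaf2 = LeafPage({"Chambless": ["900-33"], "Chambliss": ["12-1"]})
root = NonLeafPage(high_keys=["Chambers", "Chambliss"], children=[leaf1, leaf2])

print(root.find_leaf("Chambers").entries["Chambers"])   # ['67-9']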

That's index pages and the indexing process in a tiny nutshell.  It does set us up for talking about what DB2 is doing in 9.7 with indexes.  They're compressing them.

Compression in general is another way to circumvent the I/O problem.   When compression is spoken about, it is almost always about table, not index, compression.  In DB2 9.7, however, there are two forms of compression happening in the index leaf pages.  One is happening to the RID list.

As you may have noticed above, RIDs are in sequence within a key.  Instead of storing a full RID (which in actuality for many indexes is 6 bytes - 4 for the page number and 2 for the ID), DB2 9.7 can store the first RID and the offsets - both page and record number -  for all successive RIDs for the key.

For example, the Chamberlin entry would be "Chamberlin-(234-2),(3),(102,-2)".  The (3) represents adding 3 to the record number of the previous RID (while leaving the page number the same).  The (102,-2) represents adding 102 to the previous page number and subtracting 2 from its record number. This small example may not look like much savings, but consider that the 234 is really stored as 4 bytes and the (3) is stored as only 1.  And the record number, usually stored as 2 bytes, is replaced by -2, stored as one byte.  Bits internally help DB2 understand whether the byte it's looking at is a key, a RID, a compressed page number or a compressed record number.
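
A small sketch of the delta idea, using assumed byte sizes (6-byte full RIDs, roughly one byte per stored offset) rather than DB2's actual encoding, shows where the savings come from:

# A sketch of the RID-list delta idea -- hypothetical byte sizes and encoding,
# not DB2's actual on-disk format.  A full RID costs 6 bytes (4 for the page,
# 2 for the record); a delta can fit in roughly 1 byte each.
def compress_rid_list(rids):
    """rids: list of (page, record) tuples, already in sequence for one key."""
    first = rids[0]
    deltas = []
    prev_page, prev_rec = first
    for page, rec in rids[1:]:
        deltas.append((page - prev_page, rec - prev_rec))
        prev_page, prev_rec = page, rec
    return first, deltas

first, deltas = compress_rid_list([(234, 2), (234, 5), (336, 3)])
print(first, deltas)   # (234, 2) [(0, 3), (102, -2)]

# Very rough size comparison under the assumptions above:
full_bytes = 6 * 3                               # three full 6-byte RIDs
delta_bytes = 6 + sum(                           # one full RID ...
    (0 if dp == 0 else 1) + 1                    # ... plus ~1 byte per stored delta
    for dp, dr in deltas)
print(full_bytes, delta_bytes)                   # 18 vs 9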

This is obviously best for those indexes with many repeated values and long RID lists.

The other innovation in index compression in 9.7 is called Prefix Compression, which applies to the key itself.  DB2 essentially "bit maps" all the common prefixes on a leaf page in the header section of the leaf page.  So these last name values:

Chamberlin

Chambers

Chambless

with a common prefix of "Chamb" could be reduced to:

1erlin

1ers

1less

within the leaf page.

For keys with densely sequential values like this, with common prefixes, this is going to pack more, potentially a lot more, index keys down on a leaf page.  Many multi-column indexes have lots of repeated/common values in the leading columns of the index.  These would be compressed out with Prefix Compression. The upshot is more keys per dreaded I/O.
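
As a toy illustration of the prefix idea - with a made-up in-page dictionary, not DB2's real leaf-page header format - the common prefix can be stored once and only the suffixes kept in the entries:

import os

def prefix_compress(keys):
    """Factor the longest common prefix out of a run of sorted keys."""
    prefix = os.path.commonprefix(keys)          # "Chamb" for the example keys
    # Store the prefix once (as slot 1 of an in-page dictionary) and keep
    # only the suffixes in the individual entries.
    return {1: prefix}, ["1" + k[len(prefix):] for k in keys]

dictionary, compressed = prefix_compress(["Chamberlin", "Chambers", "Chambless"])
print(dictionary)    # {1: 'Chamb'}
print(compressed)    # ['1erlin', '1ers', '1less']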

Indexes will automatically be compressed on any table which is compressed.

Indexing is mightily important.  So is compression.  These two techniques are now coming together to provide more capabilities for database workload performance.


Posted June 15, 2011 7:31 AM
Permalink | No Comments |

Links to the Columnar Database series:

 

Understanding Columnar Database Fundamentals

More on the Row-Oriented Data Page

The Columnar Data Page

How Columnar Addresses Modern Bottlenecks

Columnar Materialization

The Value of Performance and When to Consider Columnar Databases

I received several direct questions on my latest blog posting regarding the row-oriented data page.  Before I move on to more discussion of columnar, I thought I'd answer those questions here.

1. How does the DBMS know where fields that come after variable length fields begin, since they will not be in the same place for every record?

It's correct that any field that follows a variable length field will not be in the same position in each record.  The DBMS actually reads the field length of the variable length field in its first 2 bytes and uses that to "jump" to the start of the next field.  All subsequent fields, provided they are not also variable length, will begin at fixed offsets from the end of the variable length field.  This is why you should put variable length fields at or towards the end of the record and why you should only use variable length fields when the field size really does vary.  Variable length fields save space, but keep in mind that 2 bytes are added to each field for the length bytes.
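
A short sketch, assuming a toy record layout (a 4-byte id, a varchar name with a 2-byte length prefix, then a fixed 2-character state code - real record formats differ in detail), shows the "read the length, then jump" behavior:

import struct

def parse_record(raw: bytes):
    rec_id = struct.unpack_from("<i", raw, 0)[0]          # fixed offset 0
    name_len = struct.unpack_from("<H", raw, 4)[0]        # 2-byte length prefix
    name = raw[6:6 + name_len].decode()
    state_offset = 6 + name_len                           # known only after reading the length
    state = raw[state_offset:state_offset + 2].decode()
    return rec_id, name, state

raw = struct.pack("<iH", 42, 10) + b"Brad Smith" + b"TX"
print(parse_record(raw))    # (42, 'Brad Smith', 'TX')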

2. If a field is null, wouldn't it be 0 or 1 bit in length?

First of all, nothing inside actual fields is a bit in length.  Everything in the DBMS is done on byte (8 bit) boundaries.  So, there will be at least 1 byte for each field, even if that field were a decimal(1), which could only contain values up to 9 and could be held in 4 bits.  For nulls, there is an ADDITIONAL byte pre-pended to the field.  If that byte is 'on', the field is null and the DBMS will ignore whatever leftover data may happen to be in the field.  If the byte is 'off', the value is not null and the field value is legitimate.  So, if a value is set to null, space is not saved!  For any nullable field, you actually have to ADD a byte to storage.  However, nullability does not make the column variable length (see answer #1 above).  Of course, a variable length column can be nullable.
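
Here is a minimal sketch of that null-indicator byte, with invented constants since actual encodings vary by DBMS; note the nullable field costs the same five bytes whether or not it is null:

NULL_ON, NULL_OFF = b"\x01", b"\x00"

def encode_nullable_int(value):
    # A nullable 4-byte int always costs 5 bytes: indicator + field,
    # whether or not the value is NULL -- NULLs do not save space.
    if value is None:
        return NULL_ON + b"\x00\x00\x00\x00"     # leftover field bytes are ignored
    return NULL_OFF + value.to_bytes(4, "little", signed=True)

def decode_nullable_int(raw):
    if raw[0:1] == NULL_ON:
        return None                               # ignore whatever is in the field
    return int.from_bytes(raw[1:5], "little", signed=True)

print(len(encode_nullable_int(None)), decode_nullable_int(encode_nullable_int(None)))   # 5 None
print(len(encode_nullable_int(7)), decode_nullable_int(encode_nullable_int(7)))         # 5 7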

3. What happens when a record is deleted?

Not much really happens when a record is deleted.  There is a bit in the record header that is on or off depending on whether it's a valid record or not.  That bit is set to 'off' when a record is deleted.  I'll refer to invalid records as holes.  Holes just sit there taking up space.  Holes are actually linked together within the page!  Why?  Because if this is the page into which the DBMS wants to insert a record, it can use an existing hole or compress holes together to make a bigger hole to use for the new record.  If you have a clustering index, the DBMS MUST put the records in order.  However, that 'order' is only according to the Row ID map.  Physically on the page, they can still be out of order.  This is fixed, and all holes removed, during reorganization processes, either explicitly called or, in the case of some DBMS, run as background processes.
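
A sketch of the hole behavior, with invented structures (real pages use slot arrays and free-space maps, which are more involved):

class Page:
    def __init__(self, capacity):
        self.slots = [None] * capacity      # None marks a hole ("valid bit off")

    def insert(self, record):
        for i, slot in enumerate(self.slots):
            if slot is None:                # reuse an existing hole if one fits
                self.slots[i] = record
                return i
        raise RuntimeError("page full -- would need a new page or a split")

    def delete(self, slot_no):
        self.slots[slot_no] = None          # just flip the record off; space stays

    def reorganize(self):
        # What an explicit or background REORG does: squeeze the holes out.
        live = [r for r in self.slots if r is not None]
        self.slots = live + [None] * (len(self.slots) - len(live))

page = Page(capacity=4)
for name in ("Chamberlin", "Chambers", "Chambless"):
    page.insert(name)
page.delete(1)                  # "Chambers" becomes a hole
page.insert("Chambliss")        # the hole is reused for the new record
print(page.slots)               # ['Chamberlin', 'Chambliss', 'Chambless', None]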

 


Posted April 30, 2010 8:08 AM
Permalink | No Comments |

Oops, what is that I tripped on? Oh, it’s another new column-oriented data warehouse appliance. If you haven’t noticed, in 2007, Vertica, ParAccel and Calpont have emerged with a column orientation to their DBMS and the appliance model to their delivery. By the way, that makes 12 data warehouse appliances by my count.

A phrase I saw on the internet recently - “Pioneer calls RDBMS technology obsolete” - caught my eye and the first thing that came to mind was “Michael Stonebraker?”. My suspicions were correct. Vertica is his new venture and he states “my prediction is that column stores will take over the warehouse market over time, completely displacing row stores”.

Most IS professionals do not know about column (or “vector”) oriented DBMS. Column-oriented DBMS have several major architectural differences from other relational database management systems. The main difference is their physical orientation of data in columns as opposed to rows. This allows them to perform very selective, high-ratio compression because all of a column’s values are physically together. It also provides for excellent performance when you select a small subset of the columns in a table, since you do not perform I/O for data that is not needed. Column orientation greatly assists a compression strategy due to the high potential for similar values in columns of adjacent rows in the table.
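
A toy contrast, purely illustrative, of the two orientations: the column store reads only the column a query needs, and adjacent duplicate values compress well (a simple run-length encoding here).

rows = [
    (1, "Brad Smith", "TX"),
    (2, "Ann Jones",  "TX"),
    (3, "Lee Chan",   "TX"),
    (4, "Mia Lopez",  "CA"),
]

# Row store: each "page" holds whole records, so reading just the state
# column still drags every other column through I/O.
row_store = rows

# Column store: each column's values sit together, so a query touching only
# `state` reads only that column -- and adjacent duplicate values compress well.
column_store = {
    "id":    [r[0] for r in rows],
    "name":  [r[1] for r in rows],
    "state": [r[2] for r in rows],
}

def run_length_encode(values):
    out, prev, count = [], None, 0
    for v in values:
        if v == prev:
            count += 1
        else:
            if prev is not None:
                out.append((prev, count))
            prev, count = v, 1
    out.append((prev, count))
    return out

print(run_length_encode(column_store["state"]))   # [('TX', 3), ('CA', 1)]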

Model 204 had some of this flavor, and Sybase IQ is definitely column oriented. There have been special occasions where they are more appropriate than row-oriented DBMS.

It will be interesting to see where and how these approaches find merit in DW, whether they have overcome some of the problems of the past such as those below (early indications are that they may have), and finally, whether they intend to compete for the EDW, as Michael Stonebraker suggests in his quote above.

Former challenges of column-oriented DBMS:

It is recommended and common practice to index all columns at least once and, for some columns, more than once
Lack of parallelism
Query performance disadvantages for any query other than columnar functions
Insert performance disadvantages
Overcoming lack of market resources, lack of vendor ports and industry row-oriented mindsets

Technorati tags: DBMS, ParAccel, Vertica, Calpont


Posted October 5, 2007 8:26 AM
Permalink | No Comments |