Blog: William McKnight

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope the entries can be of some modest benefit to that goal. Please share your thoughts and input to the topics.

About the author

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist, a frequent best-practices judge, has authored hundreds of articles and white papers, and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.

Editor's Note: More articles and resources are available in William's BeyeNETWORK Expert Channel. Be sure to visit today!

Recently in Business Intelligence/Data Warehousing Category

Potentially Teradata's most significant enhancement in a decade will be on display next week at the Teradata Partners conference.  And that is Teradata Columnar.  Few leading database players have altered the fundamental structure of having all of the columns of the table stored consecutively on disk for each record.  The innovations and practical use cases of "columnar databases" have come from the independent vendor world, where it has proven to be quite effective in the performance of an increasingly important class of analytic query.  Here is the first in a series of blogs where I discussed columnar databases. 

Teradata obviously is not a "columnar database" but would now be considered a hybrid, exhibiting columnar features upon those columns that are chosen to participate.  Teradata combines columnar capabilities with a feature-rich and requirements-matching DBMS already deployed by many large clients for their enterprise data warehouse.  Columnar is available in all Teradata platforms - Teradata Active Enterprise Data Warehouse, Teradata Data Warehouse Appliance, Teradata Extreme Data Appliance and Teradata Extreme Performance Appliance.

Teradata's approach allows for the mixing of row structures, column structures and multi-column structures directly in the DBMS in "containers."  The physical structure of each container can be either row format (with extensive page metadata, including a map to offsets), which is referred to as "row storage format," or columnar format (where the row "number" is implied by the value's relative position).  All rows of the table are treated the same way; i.e., a table cannot use column structure and columnar format for the first 1 million rows and row structure for the rest.  However, (row) partition elimination is still very much alive and, when combined with column structures, yields I/O that can retrieve a very focused set of data for the price of a few metadata reads to facilitate the eliminations.

Each column goes in one container.  A container can have one or multiple columns.  Columns that are frequently accessed together should be put into the same container.  Physically, multiple container structures are possible for columns with a large number of rows.
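To make the container idea concrete, here is a minimal Python sketch. It is purely illustrative and is not Teradata's implementation or syntax: the function names, the `sales` rows and the column groupings are all assumptions of mine. It shows how grouping columns into containers lets a query touch only the containers that hold the columns it asks for.

```python
from typing import Dict, List, Sequence, Tuple

Group = Tuple[str, ...]  # the column names stored together in one container


def build_containers(rows: Sequence[dict], groups: Sequence[Group]) -> Dict[Group, List[tuple]]:
    """Split each row across containers, one container per column group."""
    return {
        group: [tuple(row[col] for col in group) for row in rows]
        for group in groups
    }


def read_columns(containers: Dict[Group, List[tuple]], wanted: Sequence[str]):
    """Return the wanted columns, reading only the containers that hold them."""
    wanted_set = set(wanted)
    touched = [group for group in containers if wanted_set & set(group)]
    result = {col: [] for col in wanted}
    for group in touched:
        for values in containers[group]:
            for col, value in zip(group, values):
                if col in wanted_set:
                    result[col].append(value)
    return result, touched


# Hypothetical table: city and state are usually queried together,
# so they share a container; the other columns get their own.
sales = [
    {"sale_id": 1, "city": "Dallas", "state": "TX", "amount": 25.0},
    {"sale_id": 2, "city": "Tulsa", "state": "OK", "amount": 40.0},
]
groups = [("sale_id",), ("city", "state"), ("amount",)]
containers = build_containers(sales, groups)
values, touched = read_columns(containers, ["city", "state"])
```

In this toy case a query on city and state reads one container out of three; the `sale_id` and `amount` containers are never touched, which is the I/O saving the column grouping is after.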

Teradata Columnar utilizes several compression methods that take advantage of the columnar orientation of the data.  Methods include run-length encoding, dictionary encoding, delta compression, null compression, trim compression and the previously-available columnar-agnostic UTF8.  Multiple methods can be used with each column.
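For readers who have not met these techniques, three of the methods named above can be sketched in a few lines of Python. These are the textbook forms of run-length, dictionary and delta encoding, written by me for illustration; how Teradata implements or combines them internally is not something this sketch claims to show.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def dictionary_encode(values):
    """Replace each value with a small integer code into a per-column dictionary."""
    codes_by_value = {}
    codes = []
    for value in values:
        # A new value is assigned the next unused code.
        code = codes_by_value.setdefault(value, len(codes_by_value))
        codes.append(code)
    dictionary = list(codes_by_value)  # position in the list == code
    return dictionary, codes


def delta_encode(numbers):
    """Store the first number, then only the difference from the previous one."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


# Hypothetical column values.
state_runs = run_length_encode(["TX", "TX", "TX", "OK", "OK", "TX"])
state_dictionary, state_codes = dictionary_encode(["TX", "OK", "TX"])
order_deltas = delta_encode([100, 101, 103, 103])
```

All three work best when a column's values sit next to each other on disk, which is why they pair so naturally with a columnar layout.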

 

The dictionary representations are fixed length which allows the data pages to remain void of internal maps to where records begin.  This small fact saves calculations at run-time for page navigation, another benefit of columnar. Variable-length records are handled similarly.  Dictionaries are container-specific, which is advantageous in the usual case where column values are fairly unique to the column.   
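The benefit of fixed-length codes can be shown with a small Python sketch. The 2-byte code width and the page layout here are assumptions of mine for illustration, not Teradata's on-disk format. Because every entry has the same width, the position of row N is simple arithmetic and the page needs no internal offset map.

```python
import struct

CODE_WIDTH = 2  # bytes per dictionary code; an assumed width for this sketch


def pack_codes(codes):
    """Lay fixed-width codes end to end, with no per-record offset map."""
    return b"".join(struct.pack("<H", code) for code in codes)


def value_at(page, dictionary, row_number):
    """Find a row's value by position alone: offset = row_number * CODE_WIDTH."""
    (code,) = struct.unpack_from("<H", page, row_number * CODE_WIDTH)
    return dictionary[code]


# Hypothetical container-specific dictionary and a page of four encoded rows.
dictionary = ["TX", "OK", "NM"]
page = pack_codes([0, 1, 0, 2])
fourth_value = value_at(page, dictionary, 3)
```

A variable-width layout would need a stored map, or a scan from the start of the page, to locate the same row.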

The foundation for table creation, with its automatic compression, is laid by analyzing the workloads to be used with the data, focusing on column-specific workloads, and then grouping the columns that are accessed together.  Advantages will be seen in reduced storage needs and in improved performance for I/O-bound queries and scan operations.


Posted September 30, 2011 3:17 PM

In business intelligence, we all know and espouse the fact that data integration is the most time-consuming part of the build process.  This is undeniably true.  However, if one were to look at the long-term (me: not a full-time analyst, but observant of the implementations I've been in for a full lifecycle over the past few years), I believe most long-term costs clearly fall into the data access layer.   This is where the reports, dashboards, alerts, etc. are built.


This is true for a variety of reasons, not the least of which is a short-cutting of the data modeling process, which, when done well, minimizes the gap between design and usage.  This aspect of BI is receiving only modest recognition.  The focus instead is on a new breed of disruptive data access tools that are architecturally doing side-runs around the legacy tools in how they use memory and advanced visualization.  Specifically, these tools are Tableau, QlikTech, and Spotfire.  These tools attack a very important component of the long-term cost of BI - the cost of IT having to continue to do everything post-production.


There are a few areas where these tools are getting recognition:


  1. They perform faster - this allows a user, in the 30 minutes of time he has to do an analysis, to get to a deeper level of root cause analysis
  2. They are seen as more intuitive - this empowers the end user so they can do more, versus getting IT involved, which stalls a thought stream and introduces delay which can obliterate the relevancy
  3. They visualize data differently - I won't expound on it here and I don't think it's necessarily due to the tool architecture, but many claim it's better

So why do I bring it up in opposition to outsourced business intelligence?  Because to truly set up business intelligence to work in a self-service capacity, you would overweigh the idea of working closely with users in the build process, which is a lever that gets deemphasized in outsourced BI.  You would see business intelligence as less a technical exercise and more as an empowerment exercise.   You would keep the build closer to home, where the support would be.  And you would not gear up an offshore group to handle a laborious process of maintaining the data layer over the years in the way users desire.  You would invest in users - culture, education, information use - instead of outsourced groups.  And this is just what many are doing now. 


Posted August 14, 2011 10:52 AM

I was at Teradata Influencer's Days this week, an annual 3-day invitation-only event where Teradata catches us up on the latest offerings and company strategy.  We were in Las Vegas this year and we had a fascinating visit to the Switch data center where eBay stores their Teradata EDW, Hadoop clusters and another large system where the thousands of jobs run daily to keep eBay on top of their game.

Teradata is undoubtedly a long-standing leader in information management.  They have been preparing for the heterogeneous future (or is it the heterogeneous present?) and diversifying their offerings for several years.  Teradata's moves should have everyone reconsidering any notion of Teradata as a high-hurdle company that wants you to put everything online in a single data warehouse.  And it seems to be working.  Teradata released earnings Wednesday showing revenue growth of 24 percent in 2Q11.

Aster Data - A "big data" acquisition for the management of the multi-structured data with patented SQL/MapReduce

Active Data Warehousing - Abilities built into the Teradata 5000 EDW series that support and promote fast, active, intra-day loading of the data warehouse as opposed to a batch-loaded warehouse

Aprimo - Marketing applications that put the information to work and a software-as-a-service model to build some of their future on

Master Data Management - The "system of record" for subject areas that need governance and need to be integrated in real-time, operationally

Hot-Cold Data Placement - Less-used data placed into lower-cost storage, with accompanying degraded performance

Appliance Family - Pre-loaded machines of varying specification according to workload that can get your data access up and running quickly; some are using the appliance for their data warehouse

I noted that something could still be done where many analytics are going - the operational world.  Something in complex event processing would further the information ecosystem.

We also discussed Teradata 14, which will continue this theme of providing the range of platform options necessary today.

Now that some of these acquisitions are assimilated, we are seeing a reflection in the marketing.  With "Teradata Everywhere" as the imperative, the reference architecture is now the "Analytic Ecosystem," which is an environment that includes, but is not all-consumed by, the Enterprise Data Warehouse.  Consider the sizes of the markets Teradata is going after, as shared by Teradata: Data Warehousing ($27B), Business Applications ($15B) and Big Data Analytics ($2B).  Teradata is embracing the heterogeneous future as a focused leader in information management.


Posted August 6, 2011 8:45 AM

What do you think about when you hear the term "self-service"?  To some, it's a positive term connoting the removal of barriers to a goal.  I can, for example, go through the self-service checkout line at the grocery store and I'm limited only by my own scanning (and re-scanning) speed to getting out the door.  However, as we've seen with some chains eliminating self-service lines recently, self-service is not always desired by either party.  To some, "self-service" is a negative term, euphemistically meaning "no service" or "you're on your own."

In Claudia Imhoff and Colin White's excellent report, "Self-Service Business Intelligence: Empowering Users to Generate Insights", self-service BI is defined as "the facilities within the BI environment that enable BI users to become more self-reliant and less dependent on the IT organization."

If you put up a poor data warehouse, it is a copy of operational data, only lightly remodeled from source and usually carrying many of the same data quality flaws from the source.  It solves a big problem - making the data available - but after this copy of data, the fun begins with each new query being a new adventure into data sources, tools, models, etc.  What has inevitably happened in some environments is that users take what they need, like it's raw data, and do the further processing required for the business department or function. 

This post-warehouse processing is frequently very valuable to the rest of the organization, if the organization could only get access to it.  However, data that is generated and calculated post-data warehouse has little hope of reaching any kind of shared state.  This data warehouse is not ready for self-service BI.

According to Imhoff and White, the BI environment needs to achieve four main objectives for self-service BI:

1.       Make BI tools easy to use

2.       Make BI results easy to consume and enhance

3.       Make DW solutions fast to deploy and easy to manage

4.       Make it easy to access source data

To achieve these goals, you need a solid foundation and solid processes.  Take account of your BI environment.  While IT and consultancy practices have coined "self-service business intelligence" to put some discipline to the idea of user empowerment, some of it is mere re-labeling of "no service" BI and does not attain and maintain a healthy relationship with the user community and healthy exploitation of the data produced in the systems.  We all know that IT budgets are under pressure, but this is not the time to cut vital services of support that maintain multi-million dollar investments.


Posted July 28, 2011 7:07 PM

Increasingly, data warehouse components as well as many operational systems are moving to the cloud.  By the cloud, I mean systems that conform to the NIST definition: on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service.  The cloud has lowered barriers to entry in terms of the IT competencies that need to be employed as well as hardware, software, power, floor space, storage, network, procurement and accounting.  In addition, cloud providers provide more professional chargeback capabilities.  Obviously, much personal software has gone to the cloud - e.g., Microsoft 365, Dropbox, Google Docs, MobileMe and the impending Google Chrome device.  But what about core enterprise systems like data warehouses?

 

The cloud can be resisted there due to the loss of control.  However, we must lose the fear and facilitate the right cloud strategy for data warehousing.  Some of my clients are eagerly moving various systems components to the cloud and in so doing are going to apply different standards for cloud data warehouses than they would for in-house data warehouses.  For example, the usual 99.99% availability mark gets compromised in Amazon's public cloud, which offers 99.95% availability.  There have also been many public cloud relationships derailed, such as the one between Eli Lilly and Amazon.  However, even with Amazon's recent outage, I find that companies, despite the media FUD about it, are responding not by moving away from the cloud but by doubling down with high availability systems.  Upon further inspection, much availability, security and performance can be considered better than in-house systems.
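To put the gap between 99.99% and 99.95% in concrete terms, here is a small Python calculation. It assumes a 365-day year and treats the availability figure as a simple fraction of the year; actual service-level agreements may measure availability over other periods.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a system may be down and still meet the availability mark."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


in_house_target = allowed_downtime_minutes(99.99)   # the usual in-house mark
public_cloud = allowed_downtime_minutes(99.95)      # the public cloud figure cited above

print(f"99.99%: {in_house_target:.1f} minutes/year")
print(f"99.95%: {public_cloud:.1f} minutes/year ({public_cloud / 60:.1f} hours)")
```

That works out to roughly 53 minutes of permissible downtime a year at 99.99% versus about 4.4 hours at 99.95%, which is the size of the compromise a client is weighing.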

 

So which type of cloud and what services for data warehousing? As would be expected, initially there is a preference for the private cloud as it gives clients more comfort and more control over security compliance and integration - even though some accountants would prefer the public option so that expenses are fully capitalized without question.

 

I believe that over time data warehouse infrastructures in the cloud will evolve to a hybrid approach whereby there are public cloud components as well as private cloud components and they are connected with integration.  This reflects the fact that data warehouses today are integrated with data marts of various stripes and various alternative technologies to row-based relational systems.  One of the first ports for data in the cloud is simply data storage: fileservers, including content management systems, off-site backups, etc.  However, CIOs have plans to see 50% of their databases in the cloud in the next five years.  The cloud truly poses the biggest threat ever to how IT operates, as companies consider what their true core competencies are and the allure of the savings that can be realized with cloud done right.

 

So which part of the multi-component data warehouse will move to the cloud?  There will be many and varied paths, but one popular approach will be to move the databases first.  The information delivery layer is a natural follow-on, and then the integration layer, which must provide connectivity between the source systems and the data warehouse.

 

A great time for considering the cloud for data warehouses is during times of consolidation, which many companies are doing today.  Finding a mixture of data volume, concurrency, query complexity, data latency, and data sensitivity that works well for the cloud - public or private - is important in developing the cloud value proposition.  Developing a disaster plan will be important as well and may look very different from the one used with in-house data warehouses.

 

So should we have different, lower standards for a cloud data warehouse?  Some very well may because of the overarching value proposition that they see in the cloud.  Be open to considering all aspects of the cloud value proposition as you architect your data warehouse consolidation or next major data warehouse architecture change.


Posted May 26, 2011 9:50 AM