Blog: William McKnight

Hello and welcome to my blog!

I will periodically share my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information in support of company goals, and I'm thrilled to be a part of my clients' growth plans and to connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope these entries can be of some modest benefit toward that goal. Please share your thoughts and input on the topics.

About the author

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist, a frequent best-practices judge, has authored hundreds of articles and white papers, and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.

Editor's Note: More articles and resources are available in William's BeyeNETWORK Expert Channel. Be sure to visit today!

May 2011 Archives

Increasingly, data warehouse components, as well as many operational systems, are moving to the cloud. By the cloud, I mean systems that conform to the NIST definition: on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. The cloud has lowered barriers to entry in terms of the IT competencies that need to be employed as well as hardware, software, power, floor space, storage, network, procurement and accounting. In addition, cloud providers offer more professional chargeback capabilities. Obviously, much personal software has gone to the cloud - e.g., Microsoft 365, Dropbox, Google Docs, MobileMe and the impending Google Chrome device. But what about core enterprise systems like data warehouses?

 

The cloud can be resisted there due to the perceived loss of control. However, we must lose the fear and facilitate the right cloud strategy for data warehousing. Some of my clients are eagerly moving various system components to the cloud and, in so doing, are applying different standards for cloud data warehouses than they would for in-house data warehouses. For example, the usual 99.99% availability mark gets compromised in Amazon's public cloud, which offers 99.95% availability. There have also been public cloud relationships that derailed, such as the one between Eli Lilly and Amazon. However, even with Amazon's recent outage, I find that companies, despite the media FUD about it, are responding not by moving away from the cloud but by doubling down with high-availability systems. Upon further inspection, cloud availability, security and performance can often be considered better than those of in-house systems.
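To put those availability numbers in perspective, here is a quick back-of-the-envelope calculation. This is a simple Python sketch of the arithmetic only; the figures are generic and not tied to any provider's specific SLA terms.

```python
# Annual downtime allowed by two availability levels (illustrative arithmetic).
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

for availability in (0.9999, 0.9995):
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.2%} availability allows roughly {downtime_hours:.2f} hours "
          f"(about {downtime_hours * 60:.0f} minutes) of downtime per year")
```

That works out to roughly 53 minutes versus 4.4 hours of allowed downtime per year - a real gap, but one many shops decide is acceptable given the rest of the value proposition.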

 

So which type of cloud, and which services, for data warehousing? As would be expected, initially there is a preference for the private cloud, as it gives clients more comfort and more control over security, compliance and integration - even though some accountants would prefer the public option for its unambiguous expense treatment.

 

I believe that over time, data warehouse infrastructures in the cloud will evolve to a hybrid approach, with public cloud components as well as private cloud components connected through an integration layer. This reflects the fact that data warehouses today are integrated with data marts of various stripes and with various alternative technologies to row-based relational systems. One of the first ports of call for data in the cloud is simple data storage - file servers, content management systems, off-site backups, etc. However, CIOs have plans to see 50% of their databases in the cloud in the next five years. The cloud may be the biggest force ever to change how IT operates, as companies consider what their true core competencies are and the allure of the savings that can be realized with cloud done right.

 

So which parts of the multi-component data warehouse will move to the cloud? There will be many and varied paths, but a popular one will start with the databases themselves. The information delivery layer is a natural follow-on, and then the integration layer, which must provide connectivity between the source systems and the data warehouse.

 

A great time to consider the cloud for data warehousing is during consolidation, which many companies are undertaking today. Finding a mixture of data volume, concurrency, query complexity, data latency and data sensitivity that works well for the cloud - public or private - is important in developing the cloud value proposition. Developing a disaster recovery plan will be important as well, and it may look very different from the one used with in-house data warehouses.
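As a thought exercise, those workload attributes can be turned into a simple screening checklist. The Python sketch below is purely illustrative - the attributes mirror the ones above, but the thresholds and recommendation categories are my own placeholders, not a formal methodology.

```python
# Illustrative only: a toy screen for how well a data warehouse workload
# might fit a public or private cloud. Thresholds are invented placeholders.
from dataclasses import dataclass

@dataclass
class Workload:
    data_volume_tb: float
    peak_concurrent_users: int
    query_complexity: str   # "low", "medium", "high"
    latency_need: str       # "batch", "near-real-time", "real-time"
    data_sensitivity: str   # "public", "internal", "regulated"

def cloud_fit(w: Workload) -> str:
    score = sum([
        w.data_volume_tb < 50,
        w.peak_concurrent_users < 200,
        w.query_complexity in ("low", "medium"),
        w.latency_need != "real-time",
        w.data_sensitivity != "regulated",
    ])
    if score >= 4:
        return "good public cloud candidate"
    if score >= 2:
        return "consider a private or hybrid cloud"
    return "likely stays in-house for now"

print(cloud_fit(Workload(10, 50, "medium", "batch", "internal")))
```

A real evaluation would of course weigh regulatory constraints, integration effort and the disaster recovery plan far more carefully than a five-point score.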

 

So should we have different, lower standards for a cloud data warehouse? Some shops very well may, because of the overarching value proposition they see in the cloud. Be open to considering all aspects of that value proposition as you architect your data warehouse consolidation or your next major data warehouse architecture change.


Posted May 26, 2011 9:50 AM

Due to increasing data volumes and data's high utility, an explosion of new capabilities has been brought into the early mainstream in the past few years. While stalwarts of our information architecture, like the relational row-based enterprise data warehouse, remain well supported, it is widely acknowledged that no single solution will satisfy all enterprise data management needs.

Many are confused by the value of Hadoop, data warehouse appliances and stream processing. Their value propositions seemingly conflict with current information management infrastructure.

Costs for keeping "all data for all time" in an EDW are still escalating, even though storage remains historically inexpensive. That is driving some heterogeneity as well.  

The key to making the correct data storage selection is an understanding of your workloads - current, projected and envisioned.

Join me for the Break Free Tour session in a city near you. This practical session will organize and explore the major categories of information stores available and help you make the best choices to ensure information remains an unparalleled corporate asset.

You will learn:

  • The place for Relational Row-Oriented Data Warehouse and Data Marts
  • Efficient operation of RDBMS with I/O Bottleneck alleviation
  • How multidimensional databases fit into an organization
  • When data streams make an information store
  • Hadoop basics for Big Data, webscale and unstructured workloads
  • Cloud considerations for information storage and interaction

I'll be joined by IBM leaders drilling in on these topics and others relevant to the perspective of the IT decision maker. Register for one of these sessions, and bring your questions. We're looking forward to a fascinating series.

Tuesday, June 7 - Bellevue, WA
Thursday, June 9 - Boston, MA
Tuesday, June 14 - Los Angeles, CA
Wednesday, June 15 - Denver, CO
Thursday, June 16 - Atlanta, GA

Additional analysts and colleagues will be hosting Break Free sessions in Toronto and New York.

See full details on any of these sessions and register now.  I hope you can join us! 

And if you're near Bellevue, Boston, Los Angeles, Denver or Atlanta and you want to find some time together (end user organizations, software companies) while I'm in town, let me know.


Posted May 25, 2011 12:30 PM

Discovery Days kicked off last week in Indianapolis, and part of the focus was on Netezza. I gave a talk on the origins of appliances, based in part on the linear progression from uniprocessing to SMP to clusters to MPP, and made the point that I see appliances in that lineage. However, it came with the caveat that the progression is no longer linear and each appliance puts a different nuance on MPP. Appliances do represent something more than just bundled MPP systems.

 

Alan Edwards of IBM filled in some of the details of the Netezza story and value proposition. Netezza is very much a strong part of the IBM information management story today. Below are some points from Alan's talk. Netezza-aware professionals will already understand most of them, but keep in mind the audience's lack of familiarity. Everyone can at least be reminded of these high points for Netezza.

 

  • Nearly 70% of data warehouses experience performance constraints
  • Traditional systems are just too complex and take too long to deliver answers
  • Netezza means "results" in Urdu
  • TwinFin is a line of surfboards (and the name of Netezza's latest, third-generation appliance line)
  • Netezza is purpose-built for analytics
  • There are no "hints," etc. (none necessary)
  • They have an analytics package that runs in hardware (I believe this is a reference to the FPGA)
  • Appliances start at 1 TB and go up to 1 PB+
  • Netezza is truly massively parallel
  • "Streaming data": as data comes off disk, only what's needed is sent to the CPU (see the sketch after this list)
  • There is an SQL interface
  • Still no indexes (none needed)
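The "streaming data" point is worth a quick illustration. Conceptually, filtering and projection happen as data comes off disk, so only qualifying rows and columns ever reach the host CPU. The Python sketch below is a toy model of that idea only; it is not Netezza's FPGA implementation or API, and the file name and columns are hypothetical.

```python
# Toy model of scan-level filtering and projection ("streaming" the data):
# rows are filtered and trimmed as they are read, so only what is needed
# travels up to the rest of the query engine.
import csv
from typing import Callable, Dict, Iterator, List

def scan_with_pushdown(path: str, wanted_cols: List[str],
                       predicate: Callable[[Dict[str, str]], bool]) -> Iterator[Dict[str, str]]:
    """Stream a CSV file, applying the predicate and projection row by row."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if predicate(row):
                # Only the needed columns leave the scan layer.
                yield {col: row[col] for col in wanted_cols}

# Hypothetical usage: ship just two columns of the rows we actually need.
# for r in scan_with_pushdown("sales.csv", ["region", "amount"],
#                             lambda row: row["region"] == "WEST"):
#     print(r)
```

The win is not the loop itself but how much data never has to travel up the stack.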

 

More than half of accounts do not have a dedicated DBA, and one person can hold all the knowledge of how Netezza works that a shop needs. This is a key point, as Discovery Days - and other events and promotions by all major vendors today - have a strong cost-saving element.

 

  • Netezza still spreads all data evenly across the disk drives
  • Compression is 4x - 5x

 

Netezza is not a fit for:

  • OLTP
  • Workloads where the majority of queries are highly selective (i.e., a few rows out of billions)
  • Row-at-a-time processing (i.e., cursors)
  • Small data volumes, where it's overkill

 

The pitch: wheel it in and test it, and they will get your team up to speed in 2-4 weeks.

Most customers are in the tens of terabytes.

 

Analytic workloads are the sweet spot of what Netezza does well. They win business with POCs, which are done for 80% of initial purchases. 70% of POCs are onsite. More than half of customers have bought multiple systems.

 

Interestingly, of the audience - all in IT - 83% had "not at all" heard of Netezza. None knew it very well or had first-hand experience.

 

Netezza positions itself as competitive with Exadata, Teradata and "others": Greenplum, Vertica, Oracle software-only (SMP and sometimes RAC), conventional DBMSs and SAP HANA in-memory, which is emerging and not yet released. They didn't list Hadoop as a competitor.

 

600 customers are claimed

 

There are prebuilt, industry- and application-specific Cognos "blueprints" for Netezza.

 

All in all, it was a great overview of Netezza and representative of the kind of information being shared at Discovery Days.

 

I have been helping clients compare and contrast Netezza and other appliances with more flexible and customizable environments such as DBMSs, columnar databases and "big data" Hadoop environments, depending on the workload. It's not one-size-fits-all. It is a heterogeneous future, and appliances remain capable platforms for the important analytical workloads that enterprises clearly have.


Posted May 8, 2011 6:02 PM

Open Source.  Check.  Columnar.   Check.  Dallas' answer to a couple of others checking these boxes out there is InfiniDB from Calpont.  I had a chance to catch up with them this week as they announced their latest release, 2.1, about 15 months after their commercial launch.

 

InfiniDB has no indexes, does late materialization (or "just in time," as they call it) and performs multi-table hash joins.
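For readers new to the term, late materialization means the engine works on individual columns (and row positions) for as long as possible and only stitches full result rows together at the end, after filtering has shrunk the candidate set. The following is a toy Python illustration of the idea with made-up data; it is not InfiniDB's implementation.

```python
# Toy late ("just in time") materialization: filter one column first,
# then fetch the other columns only for the qualifying row positions.
order_date = ["2011-04-01", "2011-05-02", "2011-05-03", "2011-04-15"]
customer   = ["acme", "globex", "initech", "acme"]
amount     = [120.0, 75.5, 310.0, 99.0]

# Step 1: evaluate the predicate against a single column.
positions = [i for i, d in enumerate(order_date) if d.startswith("2011-05")]

# Step 2: materialize only the needed columns, only for those rows.
result = [(customer[i], amount[i]) for i in positions]
print(result)  # [('globex', 75.5), ('initech', 310.0)]
```

Early materialization would have assembled every full row before filtering; deferring that work is much of what makes columnar scans cheap.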

 

Some of the items they stressed were scalability, SQL extensions, compression and performance.

 

The modules that perform user functions and those that perform the performance (data-processing) functions can be scaled individually to accommodate need in either area. This is part of their plan to provide linear scalability.

 

Partitioning can be vertical or horizontal and comes by default with CREATE TABLE.

 

They put a lot of stress on predictable, linear performance.

 


Here are some SQL extensions, added over the last few releases, that were stressed:

 

  • Subqueries
  • LIMIT keyword
  • User-defined functions
  • STDDEV and related functions
  • Views
  • Auto-incrementing
  • Partition drop - to take individual partitions offline
  • INSERT INTO ... SELECT FROM ..., where the tables on either side can be InfiniDB or MySQL
  • Vertical and horizontal partitioning

 

Most of these are available in the enterprise edition.  If you want their syntax guide, drop me an email.

 

Compression is also new and improved.   Using the Piwik.org data set, they achieved 3x to 9x compression.  Compression can be set on individual columns.

 

Performance with compression was a key takeaway.

Several recommendations for effectively using InfiniDB were given:

 

  • Use tight data type declarations (4-byte instead of 8-byte); a rough sizing example follows this list
  • Use fixed-length rather than long strings
  • Loading can be slow, which is a byproduct of being columnar
  • Other advice centered on Cpimport, their fastest bulk loader for flat files
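On the data type recommendation, a rough sizing calculation shows why declared width matters. The row and column counts below are hypothetical, chosen only to make the arithmetic concrete.

```python
# Back-of-the-envelope raw footprint for 8-byte vs. 4-byte numeric columns.
rows = 1_000_000_000   # a hypothetical one-billion-row fact table
numeric_columns = 10

for bytes_per_value in (8, 4):
    raw_gb = rows * numeric_columns * bytes_per_value / 1e9
    print(f"{bytes_per_value}-byte columns: ~{raw_gb:,.0f} GB before compression")
```

Halving the declared width halves the raw footprint before compression even enters the picture, and there is simply less data to scan.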

 

Results from InfiniDB performance testing with the Star Schema Benchmark were also shared.

The sales model seems streamlined to maximize product reach in 2011. Like a lot of open source, it reflects a "try before you buy" approach, happy to work bottom-up in organizations, with short-term contracts and simple pricing ($6,000 per CPU core).

  

  

 


Posted May 6, 2011 7:35 AM

