Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author >

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation published by Addison-Wesley in 1997.

Over the past few years, Barry has extended his interest to cover the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT.

Barry has worked in the IT industry for more than 25 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

Recently in Data warehouse Category

The data warehouse has now been with us for a quarter of a century.  Its architecture and infrastructure have stood largely stable over that period.  A range of methodologies for designing and building data warehouses and data marts has evolved over the years.  And yet, time after time, in one project after another, one question is repeatedly asked: "why is it so difficult to accurately and reliably estimate the size and duration of data warehouse development projects?"

On Friday, 20 May, WhereScape launched their new product, WhereScape 3D, at the Boulder BI Brain Trust (BBBT) meeting.  3D, standing for "Data Driven Design", is a novel and compelling approach to supporting the design phase of data warehouse and data mart development projects.  It also supports the data-focused experts whose skills and knowledge are vital to avoiding the sizing and scoping issues that frequently plague the development phase of these projects.

I provided a white paper for WhereScape as part of the launch.  This paper first explores the issues that plague data warehouse development projects and the most common trade-off made by vendors and developers--choosing between speed of delivery and consistency of information delivered.  The conclusion is simple.  This trade-off is increasingly unproductive.  Advances in business needs and technological functions demand delivery of data warehouses and marts with both speed and consistency.  And reliable estimates of project size and duration.

One compelling solution to these issues emerges from taking a new look at the process of designing and building data warehouses and marts from a very specific viewpoint--data and the specific skills needed to understand it.  From this, the paper surfaces the concept of data driven design and a number of key recommendations on how data warehouse design and population activities can be best structured for maximum accuracy and reliability in estimating project scope and schedule.

So, what is different about data driven design?  Briefly, it focuses on the planning phases of a data warehouse or data mart development project, before we bring in the ETL tool and the experts who build ETL.  This planning phase documents all that is known and can be discovered about the two key components of the development--the source data and the target model or database--at both a logical and physical level.  The reason for this focus is simple: if you know the most you can about these two components, you have the best chance of avoiding the pitfalls so common in the development phase.
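To make the idea of documenting source and target a little more concrete, here is a minimal sketch in Python.  It is my own illustration and not how WhereScape 3D works; the class names, table names and column names are all hypothetical.  It records source-to-target column mappings and then reports two things a planner would want to know before any ETL is built: target columns with no known source, and mappings whose data types differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One physical column in a source system or the target model."""
    table: str
    name: str
    data_type: str


@dataclass(frozen=True)
class Mapping:
    """A documented link from a source column to a target column."""
    source: Column
    target: Column
    rule: str  # free-text note on the transformation required


def unmapped_targets(targets, mappings):
    """Return target columns that no documented mapping populates."""
    mapped = {m.target for m in mappings}
    return [t for t in targets if t not in mapped]


def type_mismatches(mappings):
    """Return mappings whose source and target data types differ."""
    return [m for m in mappings if m.source.data_type != m.target.data_type]


# Hypothetical source and target columns, for illustration only.
src_cust_id = Column("crm.customer", "cust_id", "varchar")
src_name = Column("crm.customer", "full_name", "varchar")
tgt_key = Column("dw.dim_customer", "customer_key", "integer")
tgt_name = Column("dw.dim_customer", "customer_name", "varchar")
tgt_segment = Column("dw.dim_customer", "segment", "varchar")

targets = [tgt_key, tgt_name, tgt_segment]
mappings = [
    Mapping(src_cust_id, tgt_key, "look up surrogate key"),
    Mapping(src_name, tgt_name, "copy as-is"),
]

gaps = unmapped_targets(targets, mappings)
mismatches = type_mismatches(mappings)

print("Target columns with no source:", [c.name for c in gaps])
print("Mappings needing conversion:",
      [(m.source.name, m.target.name) for m in mismatches])
```

Each gap or mismatch found at this stage is a piece of development work that can be sized up front rather than discovered halfway through the build.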

To me, that's money in the bank of IT!  And my only question to WhereScape is: why are you offering it for free?  There's no excuse for data warehouse project managers; go download it and try it out!

Posted May 30, 2011 6:29 AM
Wow!  Is this some sort of record?  No, I'm not talking about the acquisition just yet.  I'm actually referring to my posting two blog entries within a couple of days.

Given my time zone in Cape Town, I saw the announcement yesterday evening and have had a night to sleep on it, and let a few other analysts give their first impressions, before having to come up with a considered opinion.  Of course, that doesn't make it any more than an opinion!

I've been somewhat negative in the past about a number of acquisitions, notably SAP's purchase of Sybase and HP's acquisition of Vertica, mainly because of the implications for innovation in the broad business intelligence space.  That opinion stems from my view of the relative sizes, market positions, skill sets, cultures, and ambitions of the buyer and the bought.  In those cases, I felt the acquisitions would have a negative impact on innovation, not only in the short term (which, I believe, is the case in almost all acquisitions), but also in the medium and longer term.  I'm considerably more sanguine regarding innovation and other matters when it comes to the Teradata / Aster Data deal.

In their analyst call yesterday, the two companies provided very little information on the technical integration or roadmaps of their products and focused on a very broad market positioning.  To me, this is unsurprising.  Having been through the acquisition process myself in a large company, I know that, for legal reasons, only a very limited number of people are privy to the discussions before the announcement.  This is unfortunate, because broader company wisdom and collaborative input on architectural and technical possibilities, organizational approaches, and more is generally excluded from the initial decision.  This is hardly good strategic decision-making practice; gut feel, high-level vision and personal contacts probably play too strong a role.

That said, what I know about the two companies involved leads me to believe that a sensible and, indeed, exciting roadmap can be developed over the coming weeks before the deal closes.  Of course, I'm making some assumptions about the thinking involved.  If it's missing, I happily offer this post as strategic input :-)

Before going further, there is a general market problem around nomenclature that makes it difficult to discuss the technical information aspects of this deal.  I'm referring, of course, to the ill-defined and much abused phrases "big data", "structured data", "unstructured information", and so on.  I've been trying to introduce the terms "hard" and "soft information" for some time now but I'm beginning to feel that these may also prove inadequate.  Sigh!  Another piece of definitional work needs to be done...

With that, back to the business in hand.  I see this acquisition as being all about positioning the traditional world of enterprise data warehousing and the emerging world of so-called big data.  Big data is not necessarily big at all and, more to the point, includes a dog's dinner of information from different sources and with disparate characteristics.  For the moment, let's call it "nontraditional information" and characterize it as the information that people like to process with MapReduce, Hadoop and its associated menagerie of open source tools and techniques.  Teradata and Aster Data, like most database vendors, have a strong interest in the emerging MapReduce market.  Both have strong partnerships with Cloudera.  But, the most interesting point for Teradata, I suspect, is Aster Data's patent-pending integrated SQL-MapReduce function.  Porting that function into the Teradata database (assuming that's possible) would provide much-needed, seamless integration between the traditional EDW and nontraditional information.
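For readers from the traditional warehouse world who have not met the MapReduce style of processing, the following is a minimal sketch in plain Python.  It uses neither the Hadoop API nor Aster Data's SQL-MapReduce syntax, and the log format and function names are assumptions made up for illustration.  It shows only the three steps such frameworks distribute across many machines: map each record to key/value pairs, group the pairs by key, and reduce each group to a result.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def page_mapper(line):
    # Hypothetical log format: "<user> <page>"
    user, page = line.split()
    yield page, 1


def count_reducer(key, values):
    # Sum the 1s emitted for this page to get its hit count.
    return sum(values)


log_lines = ["ann /home", "bob /home", "ann /pricing"]
counts = reduce_phase(shuffle(map_phase(log_lines, page_mapper)), count_reducer)
print(counts)
```

In SQL terms, this particular job is a simple GROUP BY with a COUNT.  The attraction of MapReduce is that the mapper and reducer can be arbitrary code run over data that has never been loaded into tables, which is why invoking such functions from within SQL is an interesting proposition.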

Aster Data's other key selling point has been the concept of bringing the function to the data, arguing that creating multiple copies of data for analysis through different tools is an expensive and dangerous process.  Their answer has been in-database analytics.  The concept of minimizing copies of data is powerful, and is an approach that I have been preaching for some years.  It is also part of Teradata's philosophy.  Therefore, and again subject to technical feasibility, I imagine that Teradata will look to port in-database analytics into the Teradata DBMS.

None of this is to say that the Aster Data database will disappear soon.  But, it would make sense that its longevity would depend on its ability to attract further customers whose requirements, technical or otherwise, could not be met through a Teradata DBMS upgraded with the functions mentioned above.

Beyond the practicalities of porting function from Aster Data into the Teradata platform--which, at the end of the day, is only code, after all--what I see in this acquisition is two companies coming together with a shared understanding of data warehousing, analytics, and the emerging "big data" environment, who have demonstrated ongoing innovation in their separate existences.  Provided they abandon the Treaty of Tordesillas kind of positioning described by Curt Monash, I believe this acquisition has a better than average chance of succeeding.


Posted March 4, 2011 8:09 AM
So, another innovative start-up in the data warehousing space has succumbed to the blandishments of a richer, bigger suitor in a Valentine's Day marriage!  Sorry, couldn't resist the obvious parallels as HP and Vertica announced a match made in heaven on February 14th.

Vertica, founded in 2005 by Dr. Michael Stonebraker (Berkeley and MIT database guru) and Andrew Palmer, has had an enviable reputation for being a leading innovator and market success in the columnar database field in recent years.  Most recently, they have introduced a hybrid database model on top of the pure columnar database, giving some of the performance advantages of both row-based and column-based models.

On the other hand, HP's status in the data warehousing field is unfortunately most closely tied to the slow-motion train-wreck that was Neoview, whose demise was confirmed in late January.

Of course, we must assume (as presumably HP does) that the Vertica team will bring both their innovative thinking and their undoubted cachet in the specialized data warehouse analytic database market to HP.  The question is: is this a reasonable assumption?

Sadly, in many cases it proves impossible to take external innovation and reputation and successfully embed it into a larger organization.  Whether it is the large organization ethos, the incumbent power structures or the existing technical skills, most acquisitions of smaller technological assets tend to underperform on their buyers' expectations.  And the leadership team of the acquired company, both business and technical, often finds the new environment too challenging, and either leaves as soon as the golden handcuffs are undone or slips namelessly into the larger company culture and forgoes the innovative drive.

While fully understanding the economic realities involved for smaller companies in a market dominated by a relatively small number of enormous players, I am concerned that the trend towards increasing consolidation will kill the innovation we've seen blossoming in the data warehousing space in the past few years.  That would be a great shame, as the business needs that have emerged in recent times demand a significantly different model of business intelligence than we've followed over the past twenty years.  That model, as I've discussed elsewhere, requires advanced innovation from people who understand where we've come from in BI and both the possibilities and limitations of new technologies to solve tomorrow's information challenges.

HP's success with Vertica depends on economic and technological factors, for sure.  However, the most important factors will undoubtedly be organizational and political in nature.  Will HP, chastened by their clear failure with Neoview, step back and look at the market with fresh eyes and allow Vertica to be a change agent in a new and emerging view of business insight?  Or will they attempt to compete with the market incumbents and cast their new Vertica database in the old roles of either buying market presence or playing technical catchup?

In the analytic database marketplace, the mantle of independent innovation must now fall upon ParAccel, Aster Data and a few smaller, mostly open source vendors.

Posted February 15, 2011 3:17 PM

Having keynoted, spoken at and attended the inaugural O'Reilly Media Strata Conference in Santa Clara over the past few days, I wanted to share a few observations.

With over 1,200 attendees, the buzz was palpable.  This was one of the most energized data conferences I've attended in at least a decade.  Whether it was the tag line "Making Data Work", the fact it was an O'Reilly event or something else, it was clear that the conference captured the interest of the data community. 

The topics on the agenda were strongly oriented towards data science, "big data" and the softer (aka less structured) types of information.  This led me to expect that I'd be an almost lone voice for traditional data warehousing topics and thoughts.  I was wrong.  While there certainly were lots of experts in data analysis and Hadoop, there was no shortage of both speakers and attendees who did understand many of the principles of cleansing, consistency and control at the heart of data warehousing.

Given the agenda, I was also expecting to be somewhat of the "elder lemon" of the conference.  Unfortunately (in my personal view), in this I was correct.  It looked to me that the median age was well south of thirty, although I've done no data analysis to validate that impression.  Another observation, which was a bit more concerning, was that the gender balance of the audience was about the same as I've seen at data warehouse conferences since the mid-90s: about the same mid-90s percentage of males.  It seems that data remains largely a masculine topic.

The sponsor / vendor exhibitor list was also very interesting.  There were only a few of those that turn up at traditional data warehouse conferences.  Of course, the new "big data" vendors were there in force, as well as a few information providers.  Of the relational database vendors, only ParAccel and Aster Data were represented.  Jaspersoft and Pentaho represented the Open Source BI vendors, while Pervasive and Tableau rounded out the vendors I recognized from the BI space.

As a final point, I note that the next Strata Conference has already been announced: 19-21 September in New York.  Wish I could be there!


Posted February 3, 2011 7:02 PM
Just putting NoSQL in the title of a post on B-eye-Network might raise a few hackles ;-) but the growing popularity of the term and vaguely related phrases like big data, Hadoop and distributed file systems brings the topic regularly to the fore these days.  I'm often asked by BI practitioners: what is NoSQL and what can we do about it?

Broadly speaking, NoSQL is a rather loose term that groups together databases (and sometimes non-databases!) that do not use the relational model as a foundation.  And, like anything that is defined by what it's not, NoSQL ends up being on one hand a broad church and on the other a focal point for those who strongly resist the opposite view.  NoSQL is thus claimed by some not to be anti-SQL, and said to stand for "not only SQL".  But, let's avoid this particular minefield and focus on the broad church of data stores that gather together under the NoSQL banner.

David Bessemer, CTO of Composite Software, gives a nice list in his "Data Virtualization and NoSQL Data Stores" article: (1) Tabular/Columnar Data Stores, (2) Document Stores, (3) Graph Databases, (4) Key/Value Stores, (5) Object and Multi-value Databases and (6) Miscellaneous Sources.  He then discusses how (1) and (4), together with XML document stores--a subset of (2)--can be integrated using virtualization tools such as Composite.

There is another school of thought that favors importing such data (particularly textual data) into the data warehouse environment, either by first extracting keywords from it via text analytics or by converting it to XML or other "relational-friendly" formats.  In my view, there is a significant problem with this approach; namely that the volumes of data are so large and their rate of change so fast in many cases, that traditional ETL and data warehouse infrastructures will struggle to manage them.  The virtualization approach thus makes more sense as the mass access mechanism for such big data.

But, it's also noticeable that Bessemer covers only 2.5 of his 6 classes in detail, saying that they are "particularly suited for the data virtualization platform".  So, what about the others?

In my May 2010 white paper, "Beyond the Data Warehouse: A Unified Information Store for Data and Content", sponsored by Attivio, I addressed this topic in some depth.  BI professionals need to look to what is emerging in the world of content management to see that soft information (also known by the oxymoronic term "unstructured information") is increasingly being analyzed and categorized by content management tools to extract business meaning and value on the fly, without needing to be brought into the data warehouse.  What's needed now is for content management and BI tool vendors to create the mechanism to join these two environments and create a common set of metadata that bridges the two.

This is also a form of virtualization, but the magic resides in the joint metadata.  Depending on your history and preferences, you can see this as an extension of the data warehouse to include soft information or an expansion of content management into relational data.  But, whatever you choose, the key point is to avoid duplicating NoSQL data stores into the data warehouse.

I'll be speaking at O'Reilly Media's big data oriented Strata Conference - Making Data Work - 1-3 February in Santa Clara, California.  I'm giving a keynote, The Heat Death of the Data Warehouse, on Thursday, 3 February, at 9:25am and an Exec Summit session, The Data-driven Business and Other Lessons from History, on Tuesday, 1 February, at 9:45am.  O'Reilly Media are offering a 25% discount code on conference registration for readers, followers, and friends: str11fsd.


Posted January 21, 2011 9:03 AM