Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today covers the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with an holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

Recently in Data warehouse Category

In an era of "big data this" and "Internet of Things that", it's refreshing to step back to some of the basic principles of defining, building and maintaining data stores that support the process of decision making... or data warehousing, as we old-fashioned folks call it. Kalido did an excellent job last Friday of reminding the BBBT just what is needed to automate the process of data warehouse management. But, before the denizens of the data lake swim away with a bored flick of their tails, let me point out that this matters for big data too--maybe even more so. I'll return to this towards the end of this post.

In the first flush of considering a BI or analytics opportunity in the business and conceiving a solution that delivers exactly the right data needed to address that pesky problem, it's easy to forget the often rocky road of design and development ahead. More often forgotten, or sometimes ignored, is the ongoing drama of maintenance. Kalido, with their origins as an internal IT team solving a real problem for the real business of Royal Dutch Shell in the late '90s, have kept these challenges front and center.

All IT projects begin with business requirements, but data warehouses have a second, equally important, starting point: existing data sources. These twin origins typically lead to two largely disconnected processes. First, there is the requirements activity often called data modeling, but more correctly seen as the elucidation of a business model, consisting of the function required by the business and the data needed to support it. Second, there is the ETL-centric process of finding and understanding the existing sources of this data, figuring out how to prepare and condition it, and designing the physical database elements needed to support the function required.

Most data warehouse practitioners recognize that the disconnect between these two development processes is the origin of much of the cost and time expended in delivering a data warehouse. And they figure out a way through it. Unfortunately, they often fail to recognize that each time a new set of data must be added or an existing set updated, they have to work around the problem yet again. So, not only is initial development impacted, but future maintenance remains an expensive and time-consuming task. An ideal approach is to create an integrated environment that automates the entire set of tasks from business requirements documentation, through the definition and execution of data preparation, all the way to database design and tuning. Kalido is one of a small number of vendors who have taken this all-inclusive approach. They report build effort reductions of 60-85% in data warehouse development.

Conceptually, we move from focusing on the detailed steps (ETL) of preparing data to managing the metadata that relates the business model to the physical database design. The repetitive and error-prone donkey-work of ETL, job management and administration is automated. The skills required in IT change from programming-like to modeling-like. This has none of the sexiness of predictive analytics or self-service BI. Rather, it's about real IT productivity. Arguably, good IT shops always create some or all of this process- and metadata-management infrastructure themselves around their chosen modeling, ETL and database tools. Kalido is "just" a rather complete administrative environment for these processes.
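To make that shift concrete, here is a minimal sketch in Python of what metadata-driven generation can look like. It is purely illustrative--not Kalido's implementation, and the model, table and column names are all hypothetical. The point is simply that the business model lives as metadata, and the DDL and load logic are derived from it rather than hand-coded.

```python
# Illustrative only: a toy metadata-driven generator, not Kalido's product.
# The business model is captured as data; DDL and load SQL are derived from it,
# so adding or changing an attribute means editing metadata, not rewriting ETL code.

BUSINESS_MODEL = {
    "entity": "Customer",
    "target_table": "dw_customer",
    "attributes": [
        # (business name, physical column, type, source column in staging)
        ("Customer Number", "customer_nbr",  "INTEGER",      "cust_id"),
        ("Customer Name",   "customer_name", "VARCHAR(100)", "cust_nm"),
        ("Country",         "country_code",  "CHAR(2)",      "ctry"),
    ],
    "source_table": "stg_customer",
}

def generate_ddl(model):
    """Derive the physical table definition from the business model metadata."""
    cols = ",\n  ".join(f"{col} {typ}" for _, col, typ, _ in model["attributes"])
    return f"CREATE TABLE {model['target_table']} (\n  {cols}\n);"

def generate_load_sql(model):
    """Derive a simple staging-to-warehouse load statement from the same metadata."""
    targets = ", ".join(col for _, col, _, _ in model["attributes"])
    sources = ", ".join(src for _, _, _, src in model["attributes"])
    return (f"INSERT INTO {model['target_table']} ({targets})\n"
            f"SELECT {sources} FROM {model['source_table']};")

if __name__ == "__main__":
    print(generate_ddl(BUSINESS_MODEL))
    print(generate_load_sql(BUSINESS_MODEL))
```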

Which brings me finally back to the shores of the data lake. As described, the data lake consists of a Hadoop-based store of all the data a business could ever need, in its original structure and form, and into which any business user can dip a bucket and retrieve the data required without IT blocking the way. However, whether IT is involved or not, the process of understanding the business need and getting the data from the lake into a form that is useful and usable for a decision-making requirement is exactly the process described in my third paragraph above. The same problems apply. Trust me, similar solutions will be required.
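For the sceptical lake-dweller, a tiny sketch of why the work does not disappear. The field names and values below are invented and the conditioning is deliberately trivial; the point is that schema-on-read only defers the understand-prepare-condition effort, it does not remove it.

```python
import json
from datetime import datetime

# Hypothetical raw events as they might sit in a data lake: inconsistent keys,
# string-encoded numbers, missing values. Schema-on-read defers the work;
# it does not remove it.
raw_events = [
    '{"cust": "C001", "amount": "19.99", "ts": "2014-03-01T10:15:00"}',
    '{"cust": "C002", "amount": null, "ts": "2014-03-01T10:17:30"}',
    '{"customer_id": "C001", "amount": "5.00", "ts": "2014-03-02T09:00:00"}',
]

def condition(record):
    """Apply the same understand/prepare/condition steps a warehouse load would."""
    event = json.loads(record)
    return {
        "customer_id": event.get("cust") or event.get("customer_id"),
        "amount": float(event["amount"]) if event.get("amount") else 0.0,
        "event_time": datetime.fromisoformat(event["ts"]),
    }

cleaned = [condition(r) for r in raw_events]
spend_by_customer = {}
for row in cleaned:
    spend_by_customer[row["customer_id"]] = (
        spend_by_customer.get(row["customer_id"], 0.0) + row["amount"]
    )
print(spend_by_customer)  # {'C001': 24.99, 'C002': 0.0}
```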

Image: http://inhabitat.com/vikas-pawar-skyscraper/



Posted March 17, 2014 5:33 AM
Permalink | No Comments |
Operational analytics is making headlines in 2013. But why is it important? And why is it more likely to succeed now than in the mid-2000s, when it was called operational BI or the mid-1990s when it surfaced as the operational data store (ODS)? 
 
First, let's define the term. My definition, from two recent white papers (April 2012 and May 2013) is: "Operational analytics is the process of developing optimal or realistic recommendations for real-time, operational decisions based on insights derived through the application of statistical models and analysis against existing and/or simulated future data, and applying these recommendations in real-time interactions." While the language is clearly analytical in tone, the bottom line of the desired business impact is much the same as definitions we've seen in the past for the ODS and operational BI: real-time or near real-time decisions embedded into the operational processes of the business.

Anybody who has heard me speak in the 1990s or early 2000s will know that I was not a big fan of the ODS. So, what has changed? In short, two things: (1) businesses are more advanced in their BI programs and (2) technology has advanced to the stage where it can support the need for real-time operational-informational integration. 

[Figure: The evolution of BI]
The evolution of BI can be traced on two fronts shown in the accompanying figure: the behaviors driving business users and the responses required of IT providers. As this evolution proceeds apace, business demands increasing flexibility in what can be done with the data and increasing timeliness in its provision. In Phase I, largely fixed reports are generated perhaps on a weekly schedule from data that IT deem appropriate and furnish in advance. Such reporting is entirely backward looking, describing selected aspects of business performance. Today, few businesses remain in this phase because of its now limited return on investment; most have already moved to Phase II. 

This second phase is characterized by an increasing awareness of the breadth of information available collectively across the wider business and an emerging ability to use information to predict future outcomes. In this phase, IT is highly focused on integrating data from the multiple sources of operational data throughout the company. This is the traditional BI environment, supported by a data warehouse infrastructure. The majority of businesses today are at Phase II in their journey and leaders are beginning to make the transition to Phase III. 

Phase III marks a major step change in decision making support for most organizations. On the business side, the need moves from largely ad hoc, reactive and management driven to a process view, allowing the outcome of predictive analysis to be applied directly, and often in real time, to the business operations. This is the essence of the behavior called operational analytics. In this stage, IT must become highly adaptive in order to anticipate emerging business needs for information. Such a change requires a shift in thinking from separate operational and informational systems to a combined operational-informational environment. This is where the action is today. This is where return on investment for leading businesses is now to be found. And, simply put, this is why operational analytics is making headlines today--many businesses are ready for it; the leaders are already implementing it. 

This leads us to the second contention: that technology has advanced sufficiently to support the need. There are many ways that recent advances in technology can be combined to do this. Of the two white papers referenced above, one shows how two complementary technologies, IBM DB2 for z/OS and Netezza, can be integrated to meet the requirements. The other shows how the introduction of columnar technology and other performance improvements in DB2 Advanced Enterprise Edition can meet these same needs. Other vendors are improving their offerings in similar directions.

So, to paraphrase the "Six Million Dollar Man": we have the business waiting. We have the technology. We have the capability to build this... But, wait. There is one more hurdle. Most existing IT architectures strictly separate operational and informational systems, based on a data warehouse approach dating back to the mid-1980s. This split is a serious impediment to building the new environment, which demands a tight feedback loop between the two. Analyses in the informational environment must be transferred instantly into the operational environment to take immediate effect. Outcomes of actions in the operational systems must be copied directly to the informational systems to tune the models there. These requirements are difficult to satisfy in the current architecture; they demand a new approach. This is beginning to emerge, but is by no means widespread yet. I'll be discussing this topic further over the coming weeks.
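As a thought experiment, the skeleton of that loop looks something like the following. This is a deliberately toy sketch with invented names--no model server, no warehouse, just in-memory structures--showing only the shape of the operational-informational feedback: a model derived on the informational side, applied in the operational path, and retuned as outcomes flow back.

```python
import random

# Skeletal illustration of the operational-informational feedback loop:
# the informational side fits a (trivial) model, the operational side applies
# it inside the transaction path, and observed outcomes flow back to retrain.

outcome_history = []          # stands in for the informational store

def train_model(history):
    """Informational side: derive a decision threshold from past outcomes."""
    if not history:
        return 0.5            # default threshold before any evidence
    accepted = [score for score, converted in history if converted]
    return sum(accepted) / len(accepted) if accepted else 0.5

def score_interaction(customer_value, threshold):
    """Operational side: real-time decision embedded in the transaction."""
    return "make_offer" if customer_value >= threshold else "no_offer"

threshold = train_model(outcome_history)
for _ in range(1000):
    value = random.random()                    # stand-in for a customer score
    decision = score_interaction(value, threshold)
    converted = decision == "make_offer" and random.random() < value
    outcome_history.append((value, converted))     # outcome copied back
    if len(outcome_history) % 100 == 0:
        threshold = train_model(outcome_history)   # model retuned on fresh outcomes

print(f"threshold after 1,000 interactions: {threshold:.2f}")
```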

Posted May 13, 2013 8:41 AM
Permalink | No Comments |
The data warehouse has now been with us for a quarter of a century.  Its architecture and infrastructure have remained largely stable over that period.  A range of methodologies for designing and building data warehouses and data marts has evolved over the years.  And yet, time after time, in one project after another, one question is repeatedly asked: "why is it so difficult to accurately and reliably estimate the size and duration of data warehouse development projects?"

On Friday, 20 May, WhereScape launched their new product WhereScape 3D at the Boulder BI Brain Trust (BBBT) meeting.  3D, standing for "Data Driven Design", is a novel and compelling approach that specifically supports the design phase of data warehouse and data mart development projects, and the data-focused experts whose skills and knowledge are vital to avoiding the sizing and scoping issues that frequently plague the development phase of these projects.

I provided a white paper for WhereScape as part of the launch.  This paper first explores the issues that plague data warehouse development projects and the most common trade-off made by vendors and developers--choosing between speed of delivery and consistency of information delivered.  The conclusion is simple.  This trade-off is increasingly unproductive.  Advances in business needs and technological functions demand delivery of data warehouses and marts with both speed and consistency.  And reliable estimates of project size and duration.

One compelling solution to these issues emerges from taking a new look at the process of designing and building data warehouses and marts from a very specific viewpoint--data and the specific skills needed to understand it.  From this, the paper surfaces the concept of data driven design and a number of key recommendations on how data warehouse design and population activities can be best structured for maximum accuracy and reliability in estimating project scope and schedule.

So, what is different about data driven design?  Briefly, it focuses on the planning phase of a data warehouse or data mart development project, before we bring in the ETL tool and the experts who build the ETL.  This planning phase documents all that is known and can be discovered about the two key components of the development--the source data and the target model or database--at both a logical and physical level.  The reason for this focus is simple: if you know as much as you can about these two components, you have the best chance of avoiding the pitfalls so common in the development phase.
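To give a flavour of what "documenting all that is known and can be discovered" means in practice, here is a minimal profiling sketch. It is not WhereScape 3D; the source rows and column names are hypothetical, and a real tool records far more, but this is the kind of physical fact about source data that a data driven design captures before any ETL is written.

```python
# Hypothetical sample of a source table as it might be examined during design.
source_rows = [
    {"cust_id": "1001", "cust_nm": "Acme Ltd", "ctry": "IE"},
    {"cust_id": "1002", "cust_nm": None,       "ctry": "GB"},
    {"cust_id": "1003", "cust_nm": "Globex",   "ctry": "IE"},
]

def profile(rows):
    """Record basic physical facts about each source column: null counts,
    distinct values and maximum length -- raw material for a data-driven design."""
    stats = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        stats[col] = {
            "null_count": len(values) - len(non_null),
            "distinct": len(set(non_null)),
            "max_length": max((len(str(v)) for v in non_null), default=0),
        }
    return stats

for column, facts in profile(source_rows).items():
    print(column, facts)
```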

To me, that's money in the bank of IT!  And my only question to WhereScape is: why are you offering it for free?  There's no excuse for data warehouse project managers; go download it and try it out!

Posted May 30, 2011 6:29 AM
Permalink | 1 Comment |
Wow!  Is this some sort of record?  No, I'm not talking about the acquisition just yet.  I'm actually referring to my posting two blog entries within a couple of days.

Given my time zone in Cape Town, I saw the announcement yesterday evening and have had a night to sleep on it, and let a few other analysts give their first impressions, before having to come up with a considered opinion.  Of course, that doesn't make it any more than an opinion!

I've been somewhat negative in the past about a number of acquisitions, particularly SAP's purchase of Sybase and HP's acquisition of Vertica, especially around the implications for innovation in the broad business intelligence space.  That opinion stems from my view of the relative sizes, market positions, skill sets, cultures, and ambitions of the buyer and the bought.  In those cases in particular, I felt the acquisitions would have a negative impact on innovation, not only in the short term (which, I believe, is the case in almost all acquisitions), but also in the medium and longer term.  I'm considerably more sanguine regarding innovation and other matters when it comes to the Teradata / Aster Data deal.

In their analyst call yesterday, the two companies provided very little information on the technical integration or roadmaps of their products and focused on a very broad market positioning.  To me, this is unsurprising.  Having been through the acquisition process myself in a large company, I know that, for legal reasons, only a very limited number of people are privy to the discussions before the announcement.  This is unfortunate, because broader company wisdom and collaborative input on architectural and technical possibilities, organizational approaches, and more are generally excluded from the initial decision.  This is hardly good strategic decision-making practice; gut feel, high-level vision and personal contacts probably play too strong a role.

That said, what I know about the two companies involved leads me to believe that a sensible and, indeed, exciting roadmap can be developed over the coming weeks before the deal closes.  Of course, I'm making some assumptions about the thinking involved.  If it's missing, I happily offer this post as strategic input :-)

Before going further, there is a general market problem around nomenclature that makes it difficult to discuss the technical information aspects of this deal.  I'm referring, of course, to the ill-defined and much abused phrases "big data", "structured data", "unstructured information", and so on.  I've been trying to introduce the terms "hard" and "soft information" for some time now but I'm beginning to feel that these may also prove inadequate.  Sigh!  Another piece of definitional work needs to be done...

With that, back to the business in hand.  I see this acquisition as being all about the relative positioning of the traditional world of enterprise data warehousing and the emerging world of so-called big data.  Big data is not necessarily big at all and, more to the point, includes a dog's dinner of information from different sources and with disparate characteristics.  For the moment, let's call it "nontraditional information" and characterize it as the information that people like to process with MapReduce, Hadoop and its associated menagerie of open source tools and techniques.  Teradata and Aster Data, like most database vendors, have a strong interest in the emerging MapReduce market.  Both have strong partnerships with Cloudera.  But, the most interesting point for Teradata, I suspect, is Aster Data's patent-pending integrated SQL-MapReduce function.  Porting that function into the Teradata database (assuming that's possible) would provide much-needed, seamless integration between the traditional EDW and nontraditional information.
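For readers unfamiliar with the idea, the sketch below illustrates the kind of per-partition computation--sessionizing a clickstream is the classic example--that is awkward to express in plain SQL but natural as a MapReduce-style function. It is written in Python with invented data and is emphatically not Aster Data's SQL-MapReduce syntax; the commercial appeal lies in being able to invoke such a function from inside an ordinary SQL statement against warehouse data.

```python
from itertools import groupby

# Illustration only, not Aster Data's API: sessionization applies a per-user
# (map-style) function over ordered rows, something hard to express in set-based SQL.

clicks = [  # hypothetical clickstream: (user_id, timestamp in seconds)
    ("u1", 100), ("u1", 130), ("u1", 2000),
    ("u2", 50),  ("u2", 2500),
]

def sessionize(rows, timeout=1800):
    """Assign a session number per user; a gap longer than `timeout` starts a new session."""
    out = []
    for user, events in groupby(sorted(rows), key=lambda r: r[0]):
        session, last_ts = 0, None
        for _, ts in events:
            if last_ts is not None and ts - last_ts > timeout:
                session += 1
            out.append((user, ts, session))
            last_ts = ts
    return out

for row in sessionize(clicks):
    print(row)  # e.g. ('u1', 2000, 1) -- the third click starts u1's second session
```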

Aster Data's other key selling point has been the concept of bringing the function to the data, arguing that creating multiple copies of data for analysis through different tools is an expensive and dangerous process.  Their answer has been in-database analytics.  The concept of minimizing copies of data is powerful, and is an approach that I have been preaching for some years.  It is also part of Teradata's philosophy.  Therefore, and again subject to technical feasibility, I imagine that Teradata will look to port in-database analytics into the Teradata DBMS.
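The principle is easy to show in miniature. The sketch below uses SQLite purely as a stand-in for an analytic DBMS--it is not Teradata or Aster Data code--to show that when the aggregation runs where the data lives, only the small result set travels to the client; pulling every row out for client-side analysis is exactly the extra data copy that in-database analytics avoids.

```python
import sqlite3

# Minimal sketch of "bringing the function to the data", with SQLite standing in
# for an analytic DBMS: the aggregation runs where the rows live, and only the
# (small) result crosses the wire -- no extract, no second copy.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 200.0)])

# In-database computation: one row per region comes back to the client.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)

conn.close()
# The alternative -- SELECT * and aggregating in client code -- is the extra
# data movement that in-database analytics is designed to eliminate.
```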

None of this is to say that the Aster Data database will disappear soon.  But, it would make sense that its longevity would depend on its ability to attract further customers whose requirements, technical or otherwise, could not be met through a Teradata DBMS upgraded with the functions mentioned above.

Beyond the practicalities of porting function from Aster Data into the Teradata platform--which, at the end of the day, is only code, after all--what I see in this acquisition is two companies coming together with a shared understanding of data warehousing, analytics, and the emerging "big data" environment, who have demonstrated ongoing innovation in their separate existences.  Provided they abandon the Treaty of Tordesillas kind of positioning described by Curt Monash, I believe this acquisition has a better than average chance of succeeding.


Posted March 4, 2011 8:09 AM
Permalink | 2 Comments |
So, another innovative start-up in the data warehousing space has succumbed to the blandishments of a richer, bigger suitor in a Valentine's Day marriage!  Sorry, couldn't resist the obvious parallels as HP and Vertica announced a match made in heaven on February 14th.

Vertica, founded in 2005 by Dr. Michael Stonebraker (Berkeley and MIT database guru) and Andrew Palmer, has had an enviable reputation for being a leading innovator and market success in the columnar database field in recent years.  Most recently, they have introduced a hybrid database model on top of the pure columnar database, giving some of the performance advantages of both row-based and column-based models.

On the other hand, HP's status in the data warehousing field is unfortunately most closely tied to the slow-motion train-wreck that was Neoview, whose demise was confirmed in late January.

Of course, we must assume (as presumably HP does) that the Vertica team will bring both their innovative thinking and their undoubted cachet in the specialized data warehouse analytic database market to HP.  The question is: is this a reasonable assumption?

Sadly, in many cases it proves impossible to take external innovation and reputation and successfully embed it into a larger organization.  Whether it is the large organization ethos, the incumbent power structures or the existing technical skills, most acquisitions of smaller technological assets tend to underperform on their buyers' expectations.  And the leadership team of the acquired company, both business and technical, often finds the new environment too challenging, and either leaves as soon as the golden handcuffs are undone or slips namelessly into the larger company culture and forgoes the innovative drive.

While fully understanding the economic realities involved for smaller companies in a market dominated by a relatively small number of enormous players, I am concerned that the trend towards increasing consolidation will kill the innovation we've seen blossoming in the data warehousing space in the past few years.  That would be a great shame, as the business needs that have emerged in recent times demand a significantly different model of business intelligence than we've followed over the past twenty years.  That model, as I've discussed elsewhere, requires advanced innovation from people who understand where we've come from in BI and both the possibilities and limitations of new technologies to solve tomorrow's information challenges.

HP's success with Vertica depends on economic and technological factors, for sure.  However, the most important will undoubtedly be organizational and political in nature.  Will HP, chastened by their clear failure with Neoview, step back and look at the market with fresh eyes and allow Vertica to be a change agent in a new and emerging view of business insight?  Or will they attempt to compete with the market incumbents and cast their new Vertica database in the old roles of either buying market presence or playing technical catchup?

In the analytic database marketplace, the mantle of independent innovation must now fall upon ParAccel, Aster Data and a few smaller, mostly open source vendors.

Posted February 15, 2011 3:17 PM
Permalink | No Comments |