Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today extends to the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with an holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

Recently in Data management Category

For some time now, when it comes to big data, my mantra has been "big data is simply all data".  IBM's April 3 announcement served admirably to reinforce that point of view. Was it a big data announcement, a DB2 announcement, or a hardware announcement?  The short answer is "yes", to all the above and more.

Weaving together a number of threads, Big Blue created a credible storyline that can be summarized in three key thoughts: larger, faster and simpler.  As many of you may know, I worked for IBM until early 2008, so my views on this announcement are informed by my knowledge of how the company works or, perhaps, used to work.  Last Wednesday, I came away impressed.  Here were a number of diverse, individual product developments conforming to a single theme across different lines and businesses.

Take BLU acceleration as a case in point.  The headline, of course, is that DB2 10.5 for Linux, Unix and Windows (LUW) introduces a hybrid architecture.  Data can be stored in columnar tables with extensive compression, making use of in-memory storage and taking further advantage of the parallel and vector processing techniques available on modern processors.  The result is an improvement of up to 25 times in analytic and reporting performance (and considerably more in specific queries) and up to 90% data compression.  In addition, the elimination of indexes and aggregates considerably reduces the need for manual tuning and maintenance of the database.  This is a direction long signposted by smaller, newer vendors such as ParAccel and Vertica (now part of HP), so it is hardly a surprise.  IBM can claim a technically superior implementation, but more impressive is the successful retrofitting into the existing product base.  The technology is also being reused in the separate Informix TimeSeries code base to enhance analytics and reporting there, and IBM promises to extend it to other data workloads in the future.  It seems the product development organization is really pulling together across different product lines.  That's no mean feat within IBM.
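
For readers who haven't met column stores, here is a minimal sketch of the general idea--a toy illustration, not IBM's BLU implementation: dictionary-encoding a low-cardinality column shrinks it dramatically, and a predicate can then be evaluated as a tight, vector-friendly scan over the encoded values.

```python
# Toy illustration of columnar storage ideas (dictionary encoding + vectorized scan).
# This is NOT BLU acceleration -- just a sketch of the general technique.
import numpy as np

# A "table" of 1 million sales rows, stored column by column.
regions = np.random.choice(["EMEA", "AMER", "APAC"], size=1_000_000)
amounts = np.random.gamma(2.0, 50.0, size=1_000_000)

# Dictionary-encode the low-cardinality column: three distinct strings become
# a tiny dictionary plus one small integer code per row -- heavy compression.
dictionary, codes = np.unique(regions, return_inverse=True)
print(f"dictionary: {list(dictionary)}, codes stored as {codes.dtype}")

# A query like "total sales amount for EMEA" touches only the two columns
# it needs, and the predicate runs as a vectorized scan over the codes.
emea_code = np.searchsorted(dictionary, "EMEA")
mask = codes == emea_code          # SIMD-friendly comparison over packed codes
print("EMEA total:", amounts[mask].sum())
```

A production column store is, of course, far more sophisticated in its encodings, memory layout and SIMD evaluation, but the compression and scan-speed benefits stem from these same two moves.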

Another hint at the strength of the development team was the quiet announcement of a technology preview of JSON support in DB2 at the same time as the availability of 10.5.  JSON, one of the darlings of the NoSQL movement, provides significant agility to support unpredictable and changing data needs.  See my May 2012 white paper "Business Intelligence--NoSQL... No Problem" for more details.  As with its support for other NoSQL technologies, such as XML and RDF graph databases, IBM has chosen to incorporate support for JSON into DB2.  There are pros and cons to this approach.  Performance and scalability may not match a pure JSON database, but the ability to take advantage of the ACID and RAS characteristics of an existing, full-featured database like DB2 makes it a good choice where business continuity is a strong requirement.  IBM clearly recognizes that the world of data is no longer all SQL, but that for certain types of non-relational data the differences are sufficiently small that they can be handled as an adjunct to the relational model through a "subservient" engine, allowing easier joining of NoSQL and SQL data types.  This is a vital consideration for machine-generated data, one of three information domains I've defined in a recent white paper, "The Big Data Zoo--Taming the Beasts".
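
The "adjunct to the relational model" idea is easy to sketch.  The fragment below is purely illustrative--SQLite standing in for any relational engine, not DB2's actual JSON interface--showing schema-flexible JSON documents living alongside relational tables where they can be joined and governed together.  (It assumes an SQLite build with the JSON1 functions, which most recent builds include.)

```python
# Illustrative only: JSON documents stored and queried inside a relational
# database (SQLite here as a stand-in; this is not DB2's JSON API).
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, customer_id INTEGER, doc TEXT)")

conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
# The event payload is schema-flexible JSON -- fields can vary from row to row.
conn.execute("INSERT INTO events VALUES (1, 1, ?)",
             (json.dumps({"type": "click", "page": "/pricing", "ms": 134}),))

# Joining NoSQL-style documents with relational data in a single query.
row = conn.execute("""
    SELECT c.name, json_extract(e.doc, '$.page')
    FROM events e JOIN customers c ON c.id = e.customer_id
    WHERE json_extract(e.doc, '$.type') = 'click'
""").fetchone()
print(row)   # ('Acme Corp', '/pricing')
```

Even in this toy form, the trade-off described above is visible: the document keeps its flexibility, while transactions, joins and access control come from the host database.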

The announcement didn't ignore the little yellow elephant, either.  The PureData System family has been expanded with the PureData System for Hadoop, which has built-in analytics acceleration and archiving and provides significantly simpler and faster deployment of projects requiring the MapReduce environment.  And InfoSphere BigInsights 2.1 offers the Big SQL interface to Hadoop, as well as an alternative file system, GPFS-FPO, with enhanced security, high availability and no single point of failure.

While the announcement was clearly branded "Big Data at the Speed of Business", the underlying message, as seen above, is much broader.  What emerges is an information ecosystem that must be considered from a fully holistic viewpoint.  A key role, and perhaps even the primary role, for BigInsights / Hadoop is in exploratory analytics, where innovative, what-if thinking is given free rein.  But the useful insights gained there must eventually be transferred to production (and back) in a reliable, secure, managed environment--typically a relational database.  This environment must also operate at speed, with large data volumes and with ease of management and use.  These are characteristics that are clearly emphasized in this announcement.  They are also key components of the integrated information platform I described in the Data Zoo white paper already mentioned.  Still missing are some of the integration-oriented aspects, such as the comprehensive, cross-platform metadata management, data integration and virtualization required to tie it all together.  IBM has more to do to deliver on the full breadth of this vision, but this announcement is a big step in the right direction.


Posted April 8, 2013 9:14 AM
"Seven Faces of Data - Rethinking data's basic characteristics" - new White Paper by Dr. Barry Devlin.

We live in a time when data volumes are growing faster than Moore's Law and the variety of structures and sources has expanded far beyond those that IT has experience of managing.  It is simultaneously an era when our businesses and our daily lives have become intimately dependent on such data being trustworthy, consistent, timely and correct.  And yet, our thinking about and tools for managing data quality in the broadest sense of the word remain rooted in a traditional understanding of what data is and how it works.  It is surely time for some new thinking.

A fascinating discussion with Dan Graham of Teradata over a couple of beers last February at Strata in Santa Clara ended up with a picture of something called a "Data Equalizer" drawn on a napkin.  As often happens after a few beers, one thing led to another...

The napkin picture led me to take a look at the characteristics of data in the light of the rapid, ongoing change in the volumes, varieties and velocity we're seeing in the context of Big Data.  A survey of data-centric sources of information revealed almost thirty data characteristics considered interesting by different experts.  Such a list is too cumbersome to use, so I narrowed it down based on two criteria.  First was the practical usefulness of the characteristic: how does the trait help IT make decisions on how to store, manage and use such data?  What can users expect of this data based on its traits?  Second, can the trait actually be measured?

The outcome was seven fundamental traits of data structure, composition and use that enable IT professionals to examine existing and new data sources and respond to the opportunities and challenges posed by new business demands and novel technological advances.  These traits can help answer fundamental questions about how and where data should be stored and how it should be protected.  And they suggest how it can be securely made available to business users in a timely manner.

So what is the "Data Equalizer"?  It's a tool that graphically portrays the overall tone and character of a dataset so that IT professionals can quickly evaluate the data management needs of a specific set of data.  More generally, it clarifies how technologies such as relational databases and Hadoop can be positioned relative to one another and how the data warehouse is likely to evolve as the central integrating hub in a heterogeneous, distributed and expanding data environment.
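
To make that concrete, here's a toy sketch of the idea.  The trait names below are hypothetical stand-ins of my own, not the seven traits defined in the white paper; the point is simply that scoring a dataset on each trait and rendering the profile as sliders makes two very different datasets easy to compare at a glance.

```python
# A toy "Data Equalizer": score a dataset against a few data traits and
# render the profile. Trait names here are hypothetical placeholders, not
# the seven traits defined in the white paper.
def equalizer(name, scores, width=20):
    print(f"Data Equalizer for: {name}")
    for trait, value in scores.items():          # value in range 0.0 - 1.0
        filled = int(round(value * width))
        bar = "#" * filled + "-" * (width - filled)
        print(f"  {trait:<12} [{bar}] {value:.1f}")

# Two contrasting datasets: web clickstream vs. core financial transactions.
equalizer("Clickstream logs", {
    "structure": 0.3, "volatility": 0.9, "trust": 0.4, "timeliness": 0.9,
})
equalizer("GL transactions", {
    "structure": 0.9, "volatility": 0.2, "trust": 0.9, "timeliness": 0.5,
})
```

Even in this simplified form, the contrast between the two profiles suggests quite different storage, protection and latency decisions.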

Understanding the fundamental characteristics of data today is becoming an essential first step in defining a data architecture and building an appropriate data store.  The emerging architecture for data is almost certainly heterogeneous and distributed.  There is simply too large a volume and too wide a variety to insist that it all be copied into a single format or store.  The long-standing default decision--a relational database--may not always be appropriate for every application or decision-support need in the face of these surging data volumes and growing variety of data sources.  The challenge for the evolving data warehouse will be to retain a core set of information that ensures homogeneous and integrated business usage.  For this core business information, the relational model will remain central and likely mandatory; it is the only approach with the theoretical and practical foundation needed to link such core data to other stores.

"Seven Faces of Data - Rethinking data's basic characteristics" - new White Paper by Dr. Barry Devlin (sponsored by Teradata)


Posted November 17, 2011 6:07 AM
Last month, I blogged about Predixion's predictive analytics product, Insight.  Despite being very impressed with its extensive and powerful function as well as its price point, I mentioned my concerns about unleashing such power in an uncontrolled BI environment where users share dirty data like junkies share needles.  And I pointed to two white papers I wrote in 2008 and 2009 sponsored by Lyzasoft, where I described their collaborative analytic environment as the type of control and management that would be needed.  I'd love to think that the announcement this week that Predixion and Lyzasoft are partnering in delivering some of Predixion's analytic function through Lyza Commons is a result of that, but I suspect otherwise...

In any case, the link-up is a step in exactly the right direction.  BI departments' reactions to spreadsheets over the years have ranged from trying to rein them in to ignoring them.  Neither approach works.  Users love and, indeed, need to have control of their data when they are experimenting and sandboxing.  PC-based data, particularly spreadsheets, gave them that control--and they are unlikely to relinquish it any time soon.  Data in the Cloud is just another phase of distributed data.  The old IT approach of ignoring it or damning it is not going to work here either.

It's wonderful to see two small BI software companies showing how to technologically address this rapidly growing data management/governance issue. 

Posted October 21, 2010 9:04 AM
Simon Arkell and Jamie MacLennan briefed me over the past couple of days on their new cloud-based, self-service predictive analytics software, Predixion Insight, launched on 13 September.  Closely integrated with Microsoft Excel and PowerPivot, Predixion Insight offers business analysts a powerful and compelling set of predictive analytic functions in an environment that is familiar to almost every business person today.

To a first approximation, predictive analytics is a modern outgrowth of data mining.  Both areas have traditionally been associated with large (often MPP) machines, complex data preparation and manipulation processes, and PhDs in statistics.  What Predixion Software has done is to move the entire process within the reach of people who have little of any of those three things.  The heavy lifting is done in the cloud and, at a licensing cost of $99 per user per month, the required computing power and statistical algorithms are made readily available to most businesses.  While some knowledge of statistics and data manipulation is still needed, the use of the familiar Excel paradigm makes the whole process less threatening.  Predixion divides the tasks over two tabs on the Excel ribbon: Insight Analytics, aimed at those creating analyses, and Insight Now, which gathers the tasks related to running existing, parameterized analyses.

Bottom line is that Predixion brings to predictive analytics what spreadsheets brought to planning, and given its close integration with the spreadsheet behemoth, the case is very compelling.  In competition with the traditional data mining and predictive analytics tools, Predixion Insight has the potential to be a very disruptive influence in the market.

The other potential for disruption is in data management and quality programs in businesses.  Of course, this is neither new nor unique to Predixion; but looking at what it can do and how easily it does it brings the question right to the front of my mind.  The problem has been around since the 1980s, when spreadsheets first became popular, and has grown ever since.  Self-service BI exacerbates the issue still further.  It is sometimes argued that managing and controlling the data sources, usually through a data warehouse, can address these issues.  But the sad truth is that once the data is out of the centrally managed environment, the cat is out of the bag.  Spreadsheets enable users to manipulate data as they need (or want) with no easy way to audit or track what they've done.  Errors and indeed fraud cannot be easily detected.  And they are readily propagated through the organization as users share spreadsheets and build upon each other's work.  Self-service predictive analytics ups the ante even further.

Statistics are often lumped together with lies and damned lies, and with good reason.  Not only are they easily misused, but they are often misunderstood or misapplied.  The dangers inherent in making predictive analytics available to a wider audience in the business should not be underestimated by IT or by the audit functions of large businesses, particularly those in the financial industry.  It could be argued that the recent financial meltdown was caused in part by an overreliance on mathematical and statistical models--and those models were often in the hands of people with PhDs.  What is the danger of giving the same tools to marketing people and middle managers?  And, as we know, Sarbanes-Oxley carries possible jail penalties for C-level executives who sign off on misunderstood financials!

My advice to Predixion Software is to build much more mandatory tracking and auditability into their product.  And there is a huge upside to doing so.  It provides the basis for real social networking and collaboration in BI and for ensuring that the true sources of business innovation and insight are fully integrated into the core information provision infrastructure of the company, as I've written about here and here.  By the way, the same advice goes to Microsoft, if they're listening!
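
To illustrate what such mandatory tracking might look like--a generic sketch of my own, not Predixion's or Microsoft's design--each analysis step could be run through a wrapper that records who ran it, when, and a fingerprint of its inputs in an append-only audit log.

```python
# A generic sketch of built-in analysis auditing -- not any vendor's design.
import hashlib
import json
from datetime import datetime, timezone

audit_log = []   # in a real product this would be an append-only, shared store

def audited(step_name, func, *args, **kwargs):
    """Run an analysis step and record who/what/when plus a fingerprint of inputs."""
    fingerprint = hashlib.sha256(repr((args, kwargs)).encode()).hexdigest()[:12]
    result = func(*args, **kwargs)
    audit_log.append({
        "step": step_name,
        "user": "jbloggs",                      # hypothetical; would come from the sign-on context
        "at": datetime.now(timezone.utc).isoformat(),
        "inputs_sha256": fingerprint,
    })
    return result

# Example: an audited aggregation that a business user might share onwards.
total = audited("sum_q3_pipeline", sum, [120_000, 85_500, 240_250])
print(total)
print(json.dumps(audit_log, indent=2))
```

The point is less the mechanics than the principle: if every shared analysis carries its own provenance, collaboration and control stop being opposites.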

Posted September 17, 2010 4:59 AM