Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today covers the wider field of a fully integrated business, spanning informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.


August 2011 Archives

"We are running through the United States with dynamite and rock saws so an algorithm can close the deal three microseconds faster," according to algorithm expert Kevin Slavin at last month's TEDGlobal conference.  Kevin was describing the fact that Spread Networks is building a fiber optic connection to shave three microseconds off the 825 mile trading journey between Chicago and New York.  The above comes courtesy of a thought-provoking article called "When algorithms control the world" by Jane Wakefield on the BBC website.

As someone who follows the data warehousing or big data scene, you're probably familiar with some of the decisions that algorithms are now making, based largely on the increasing volumes of information that we have been gathering, particularly over recent years.  We all know by now that online retailers and search engines use sophisticated algorithms to decide what we see when we go to a web page.  Mostly, we're pretty pleased that what comes up is well-matched to our expectations and that we don't have to plow through a list of irrelevant suggestions.  We're thankful for the reduction in information overload.

But, there are a few downsides to our growing reliance on algorithms and the data they graze upon...

The carving up of the US for fiber highlights the ecological destruction that comes from our ever-growing addiction to information and the speed of getting it.  We tend to sit back comfortably and think that IT is clean and green.  And it probably is greener than most, but it is certainly far from environmentally neutral.  We need to pay more attention to our impact--carbon emissions, cable-laying and radio-frequency and microwave pollution are all part of IT's legacy.

But there's a much more subtle downside, and one that has been of concern to me for years.  Just because we can do some particular analysis, does it really make sense?  The classic case is in the use of BI and data mining in the insurance industry.  Large data sets and advanced algorithms allow insurers to discover subtle clues to risk in segments of the population and adjust premiums accordingly.  Now, of course, actuaries have been doing this since the 18th and 19th centuries.  But the principal driver in the past was to derive an equitable spread of risk in a relatively large population, such that the cost of a single event was effectively spread over a significantly larger number of people.  However, data mining allows ever more detailed segmentation of a population, and insurers have responded by identifying particularly high-risk groups and effectively denying them insurance or pricing premiums so high that such people cannot insure their risk.  While in some cases we can argue that this drives behavior changes that reduce overall risk (for example, safer driving practices among young males), in many other instances, no such change is possible (for example, for house owners living on flood plains).  I would argue that excessive use of data mining to segment risk in insurance eventually destroys the possibility of spreading risk equitably and thus undermines the entire industry.
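To make the pooling-versus-segmentation argument concrete, here is a minimal sketch in Python.  The two groups, their sizes, claim rates and claim cost are invented purely for illustration (they are not real actuarial figures), and the function names are my own.  It simply compares a single break-even premium shared by everyone with break-even premiums priced per segment.

```python
# Illustrative only: group sizes, claim rates and costs are made-up numbers.

def pooled_premium(groups):
    """Break-even premium when every policyholder pays the same amount."""
    total_expected_loss = sum(
        g["size"] * g["claims_per_1000"] * g["claim_cost"] / 1000 for g in groups
    )
    total_people = sum(g["size"] for g in groups)
    return total_expected_loss / total_people


def segmented_premiums(groups):
    """Break-even premium when each segment pays only for its own risk."""
    return {
        g["name"]: g["claims_per_1000"] * g["claim_cost"] / 1000 for g in groups
    }


# Hypothetical population: most homes are low risk, a minority sit on a flood plain.
groups = [
    {"name": "low_risk", "size": 9000, "claims_per_1000": 10, "claim_cost": 100000},
    {"name": "flood_plain", "size": 1000, "claims_per_1000": 100, "claim_cost": 100000},
]

print(pooled_premium(groups))      # one premium shared across the whole population
print(segmented_premiums(groups))  # one premium per mined segment
```

Under these assumed numbers, the pooled premium works out at 1,900 per household, while segmentation cuts the low-risk premium to 1,000 and raises the flood-plain premium to 10,000.  The total collected is the same in both cases; what changes is who can still afford to be insured.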

In a similar manner, the widespread use of sophisticated algorithms and technology to speed trading seems to me to threaten the underlying purpose of futures and other financial markets, which, in my simplistic view, is to enable businesses to effectively fund future purchases or investments.  The fundamental goal of the algorithms and speedy decision making, however, seems to be to maximize profits for traders and investors, without any concern for the overall purpose of the market.  We've seen the results of this dysfunctional behavior over the past few years in the derivatives market, where all sense of proportion and real value was lost in the pursuit of illusory financial gain.

But, it gets worse!  The BBC article reveals how a British company, Epagogix, uses algorithms to predict what makes a hit movie.  Using metrics such as script, plot, stars, location, and the box office takings of similar films, it tries to predict the success of a proposed production.  The problem here, and note that the same applies to book suggestions on Amazon and all similar approaches, is that the algorithm depends on past consumer behavior and predicts future preferences based upon that.  The question is: how do new preferences emerge if the only thing being offered is designed solely to satisfy past preferences?
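The feedback loop behind that question can be sketched in a few lines of Python.  This is a deliberately naive toy, not how Epagogix or Amazon actually work: the catalog, the purchase history and the rule that buyers choose only from what they are offered are all assumptions made for illustration.

```python
from collections import Counter


def recommend(purchase_history, catalog, k=2):
    """Return the k catalog items bought most often in the past."""
    counts = Counter(purchase_history)
    # sorted() is stable, so ties keep catalog order; unseen items count as 0.
    return sorted(catalog, key=lambda item: -counts[item])[:k]


def simulate(purchase_history, catalog, rounds=5, k=2):
    """Repeat the recommend-then-buy cycle and record everything ever offered."""
    history = list(purchase_history)
    offered = set()
    for _ in range(rounds):
        shortlist = recommend(history, catalog, k)
        offered.update(shortlist)
        # Assumed buyer behavior: one purchase of each item on the shortlist.
        history.extend(shortlist)
    return offered


# Hypothetical catalog and history, invented for this sketch.
catalog = ["action_sequel", "romantic_comedy", "experimental_drama"]
past = ["action_sequel", "action_sequel", "romantic_comedy"]

ever_offered = simulate(past, catalog)
print(ever_offered)
```

In this toy, the item with no sales history never makes the shortlist, so it never gets the chance to build a history.  Real systems are far more sophisticated, but the sketch shows why an offer driven solely by past behavior leaves no obvious route for a new preference to appear.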

I would argue that successful business strategy requires a subtle blend of understanding past buyer behavior and offering new possibilities that enable and encourage new behaviors to emerge.  If all that a business offers is that which has been successful in the past, it will be rapidly superseded by new market entrants that are not locked into the past.  The danger of an over-reliance on data mining and algorithms is that innovation is stifled for business.  More importantly, for civilization, imagination is suffocated and strangled for lack of new ideas and thoughts.

Do you want to live in such an algorithm-controlled world?  As Wakefield puts it so well: "In reality, our electronic overlords are already taking control, and they are doing it in a far more subtle way than science fiction would have us believe. Their weapon of choice - the algorithm."

Posted August 25, 2011 6:19 AM
The announcements by HP yesterday set the Web rippling with the opinion that HP is pulling out of the consumer-facing business by dropping WebOS and TouchPad and spinning off its PC business.  Probably of more interest to readers of BeyeNetwork, though, is HP's decision to acquire Autonomy for a cool $10.2B.  Following on from HP's February purchase of Vertica, it seems fair to say that HP is moving (or returning?) strongly into the enterprise information management business.

As a long-time proponent of the view that the divisions between different "types" of data are breaking down rapidly, I find the move unsurprising.  Autonomy uses the tag-lines "meaning based computing" and "human-friendly data" and focuses on what I call soft (or unstructured, as it's usually and misleadingly called) information.  As I discussed in my last couple of posts on IDC's Digital Universe Study, this type of information represents an enormous and rapidly growing proportion of the information resource of the world, and one that requires a very different way of thinking about and managing it.  And much of the interest in big data stems directly from the insight one can gain from mining and analyzing exactly this type of information.  The acquisition of Autonomy gives HP a significant foothold in this soft information space, given Autonomy's positioning as a leader in the content management and related spaces by Gartner and Forrester.

I have long characterized the traditional approach to computing as being partitioned between operational, informational and collaborative.  In the past, these areas have been developed separately, built on disparate platforms and supported by different parts of the IT organization, and have ended up on users' desks as three sets of dis-integrated applications.  Business intelligence, although receiving all its base data from the operational environment, operated as a stand-alone environment.  HP bought into that environment with its Vertica acquisition.  With the Vertica Connector for Hadoop, HP already has access to some of the big data / collaborative data area.  However, the Autonomy acquisition takes the use and analysis of soft, collaborative information to an entirely new level.  And we can speculate just how far HP will be able to go in aligning and perhaps integrating the functionality in these two areas.

While operational data is still very much the preserve of SAP and similar tools (not to mention home-grown applications from previous generations), the informational and collaborative worlds are growing ever more intertwined.  It's into this converging arena that HP is now clearly throwing its hat, competing against the big players such as IBM, Microsoft and Oracle, who already have offerings spanning both areas, although with varying levels of integration.  Teradata has also seriously entered this field with its recent acquisition of Aster Data.  This arena is already populated with strong players.

So, while HP has acquired a strong and well-respected tool with inventive developers in Vertica and now a major player in the content market, I believe there remains a serious question about how easy it will be for them to gain traction in the information management market.  I'll be looking out for some seriously innovative developments from HP to convince me that they can gain the respect of the BI and content communities and compete seriously with the incumbents.

Posted August 19, 2011 6:05 AM

In my last post, I discussed some of the key points in the 5th annual Digital Universe study from IDC, released by EMC in June.  Here, I consider a few more: some of the implications of the changes in sourcing on security and privacy, the importance of considering transient data, where volumes are a number of orders of magnitude higher, and a gentle reminder that bigger is not necessarily the nub of the problem.

Let's start with transient data.  IDC notes that "a gigabyte of stored content can generate a petabyte or more of transient data that we typically don't store (e.g., digital TV signals we watch but don't record, voice calls that are made digital in the network backbone for the duration of a call)".  Now, as an old data warehousing geek, that type of statement typically rings alarm bells: what if we miss some business value in the data that we never stored?  How can we ever recheck at a future date the results of an old analysis we made in real-time?  We used to regularly encounter this problem with DW implementations that focused on aggregated data, often because of the cost of storing the detailed data.  Over the years, decreasing storage costs meant that more warehouses moved to storing the detailed data.  But now, it seems like we are facing the problem again.  However, from a gigabyte to a petabyte is a factor of a million!  And, as the study points out, the "growth of the [permanent] digital universe continues to outpace the growth of storage capacity".  So, this is probably a bridge too far for hardware evolution.

The implication (for me) is that our old paradigm about the need to keep raw, detailed data needs to be reconsidered, at least for certain types of data.  This leads to the point about "big data" and whether the issue is really about size at all.  The focus on size, which is the sound-bite for this study and most of the talk about big data, distracts us from the reality that this expanding universe of data contains some very different types of data to traditional business data and comes from a very different class of sources.  Simplistically, we can see two very different types of big data: (1) human-generated content, such as voice and video and (2) machine metric data such as website server logs and RFID sensor event data.  Both types are clearly big in volume, but in terms of structure, information value per gigabyte, retention needs and more, they are very different beasts.  It is also interesting to note that some vendors are beginning to specialize.  Infobright, for example, is focusing on what they call "machine-generated data", a class of big data that is particularly suited to their technical strengths.

Finally, a quick comment on security and privacy.  The study identifies the issues: "Less than a third of the information in the digital universe can be said to have at least minimal security or protection; only about half the information that should be protected is protected."  Given how much information consumers are willing to post on social networking sites or share with businesses in order to get a 1% discount, this is a significant issue that proponents of big data and data warehousing projects need to consider.  As we bring this data from social networking sources into our internal information-based decision-making systems, we will increasingly expose our business to possible charges of misusing information, exposing personal information, and so on.

There are many more thought-provoking observations in the Digital Universe study.  Well worth a read for anybody considering integrating data warehouse and big data.

Posted August 12, 2011 11:45 AM