
Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today extends to the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry's latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

July 2012 Archives

It often worries me that much of the excitement about big data and business analytics relates to marketing initiatives. In the greater scheme of life, I personally feel that money spent trying to convince me to drink cola brand A rather than B could be put to better use. Promoting the health benefits of pure water, maybe. Tackling real-world problems like contaminated water sources, even better.

I suspect that may be a rather unpopular view in some circles, so in case you're tempted to stop reading now, I'd like to mention upfront a big data survey that is currently open for your input.  Shawn Rogers and John Myers of EMA and I have constructed a short survey to discover what companies are doing with big data and what challenges they are encountering.  We'd be delighted to hear from you.

But, on to big data and big money... and, in particular, off-shore investment money. Over the weekend, articles in the Guardian newspaper in the UK and on the BBC reported that a tiny global elite of extraordinarily rich people had some $21 trillion in off-shore tax havens as of the end of 2010, an amount equivalent to the US and Japanese economies combined. The work behind that estimate was commissioned by the Tax Justice Network and carried out by former McKinsey & Co. Chief Economist James Henry. A press release covering the highlights of the report "The Price of Offshore Revisited" notes that Henry "drew on data from the World Bank, the IMF, the United Nations, central banks, the Bank for International Settlements, and national treasuries, and triangulates his results against data reflecting demand for reserve currency and gold, and data on offshore private banking studies by consulting firms and others". The six-page press release reveals some truly staggering figures and is well worth a read.

You may contest the figures and the conclusions, and many will. But, as the report says--and this is where we get back on topic with big data--"This scandal is made worse by the fact that [official institutions like the Bank for International Settlements, the IMF, the World Bank, the OECD, and the G20] already have much of the data needed to estimate this sector more carefully". There is very little of the world's money that is not represented by and moved about as 1s and 0s in financial computing systems. There is little doubt that this is, indeed, big data and amenable to the collection and processing we talk about and carry out... when we need marketing information. We can now reliably detect petty fraud on the world's voluminous credit card transactions in flight; so I'm convinced that detecting, storing and analyzing the transactions that moved this wealth off-shore is, technically speaking, a piece of cake. Perhaps the question is: do we have the will to do so?

I'll leave you with a more positive spin from the report: "From another angle, this study is really good news. The world has just located a huge pile of financial wealth that might be called upon to contribute to the solution of our most pressing global problems. We have an opportunity to think not only about how to prevent some of the abuses that have led to it, but also to think about how best to make use of the untaxed earnings that it generates."

In the meantime, read some of the above coverage (I haven't found a link to the full report) and please take the big data survey, too.


Posted July 24, 2012 7:43 AM
There are still many illusions and unjustified expectations about big data. But one old belief--dating back to the early days of data warehousing--that it has shattered is the idea of a single store that can serve all BI needs. Given the volumes and variety of big data, any thought of routing it all through a relational database environment just doesn't make sense. And after the market's brief flirtation with the idea that all data could be handled in Hadoop (doh!), there is a general belief that IT needs to provide some sort of over-arching, integrating view for users across multiple data stores.

Cirro is among the latest players in this field, as I discovered talking to CEO Mark Theissen, previously data warehousing technical lead at Microsoft and a veteran of DATAllegro and Brio. Its basic value proposition is to offer users self-driven exploration--via Cirro's Excel plug-in and BI tools--of data across a wide variety of platforms through ad hoc federation. Cirro's starting point is big data scale and performance, offering a data hub with a cost-based federation optimizer, smart caching and a function library of low-level MapReduce and SQL functions. It also offers an optional "multi store" consisting of Hadoop and MySQL components that can be used as a temporary scratchpad area or a data mart.
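
As an aside, to make the phrase "cost-based federation optimizer" a little more concrete, the sketch below shows the kind of decision such a component makes for each source: push a filter down to the source, or pull the data and filter locally, based on estimated rows shipped. It is purely illustrative Python built on assumed metadata fields (row counts, selectivity, transfer cost); it is not Cirro's implementation.

```python
# Illustrative only: a toy cost model for the kind of push-down decision a
# federated query planner makes. Field names and the cost formula are
# assumptions for this sketch, not Cirro's design.
from dataclasses import dataclass

@dataclass
class SourceStats:
    name: str              # e.g. "hadoop_logs" (hypothetical source)
    row_count: int         # estimated rows in the referenced table or file
    selectivity: float     # estimated fraction of rows the filter keeps
    can_push_filter: bool  # can the source engine evaluate the predicate?
    ms_per_1k_rows: float  # rough network transfer cost

def plan_access(src: SourceStats) -> str:
    """Decide whether to push the filter down to the source or to fetch all
    rows and filter in the federating layer, based on rows shipped."""
    if src.can_push_filter:
        rows_shipped = src.row_count * src.selectivity
        strategy = "push filter down"
    else:
        rows_shipped = src.row_count
        strategy = "fetch all rows, filter locally"
    cost_ms = rows_shipped / 1000 * src.ms_per_1k_rows
    return f"{src.name}: {strategy} (~{cost_ms:.0f} ms estimated transfer)"

if __name__ == "__main__":
    for src in (SourceStats("hadoop_logs", 50_000_000, 0.001, True, 2.0),
                SourceStats("legacy_flat_file", 200_000, 0.05, False, 5.0)):
        print(plan_access(src))
```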

In our conversation, Theissen declared that Cirro does federation, whereas competitors like Composite and Denodo do virtualization. The difference, in his view, is that virtualization involves an expensive and time-consuming phase to create a semantic layer, while federation is done on the fly and, in the case of Cirro, uses existing metadata from BI tools, databases and so on. I wish it were that simple to differentiate between these two terms, which have become a marketing battleground for many of the vendors competing in this field, from majors like IBM and Informatica to newcomers such as Karmasphere and ClearStory.

I'd like to try to clarify the two terms... again.

The concept of federation (in data) goes back to the mid-1980s and the idea of federating SQL queries against the then-emerging relational databases. By 1991, IBM's Information Warehouse Framework included access to heterogeneous databases via EDA/SQL from Information Builders. By the early years of the new millennium, the need to join data from multiple, heterogeneous sources beyond traditional databases was widespread, often described as enterprise information integration (EII). But vendor offerings were poorly received, especially in BI, because of concerns about mismatched data meanings, security and query performance. I consider federation to be the basic technology of splitting a query in real time into component parts, distributing them to heterogeneous, autonomous sources, and retrieving and combining the results. Doing this requires access to technical metadata defining database (or file) locations and structures, data volumes, network performance and more, to enable query optimization for access and performance.
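
To make that definition concrete, here is a toy sketch in Python: two independent SQLite databases stand in for heterogeneous, autonomous sources, each receives only the component query it can answer, and the partial results are combined in the federating layer. The schemas and data are invented for illustration; real federation engines add the metadata-driven optimization described above.

```python
# A toy federated query: two autonomous SQLite databases stand in for
# heterogeneous sources; the "federator" splits the work and joins results.
import sqlite3

# Source 1: an "orders" system (hypothetical schema and data)
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (cust_id INTEGER, amount REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)",
                      [(1, 250.0), (2, 99.5), (1, 40.0)])

# Source 2: a separate "customers" system
cust_db = sqlite3.connect(":memory:")
cust_db.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
cust_db.executemany("INSERT INTO customers VALUES (?, ?)",
                    [(1, "Acme Ltd"), (2, "Globex")])

# Federation step 1: send each source only the component query it can answer.
totals = dict(orders_db.execute(
    "SELECT cust_id, SUM(amount) FROM orders GROUP BY cust_id"))
names = dict(cust_db.execute("SELECT cust_id, name FROM customers"))

# Federation step 2: combine the partial results in the federating layer.
for cust_id, total in totals.items():
    print(f"{names.get(cust_id, 'unknown')}: {total:.2f}")
```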

Data virtualization, in my view, builds on top of federation, adding the business-related metadata required to address the problem of disparate data meanings, relationships and currencies and to deliver high-quality results that are meaningful and consistent for the business user submitting the query. Simply put, there are two ways to address these problems and supply the needed metadata. The first, and easier, approach is to depend on the business user to understand data consistency and similar quasi-IT issues and to construct queries that are sensible in terms of data coherence and reliable results. The second is to model the data to some extent upfront and create a semantic layer, as it's often called, that ensures the quality of returned results.

The former approach typically leads to faster, cheaper implementations; the latter to longer-term quality at some upfront cost. The former works better if you're coming from a big data viewpoint, where much of the data is poorly defined, changing and of questionable accuracy and consistency in any case. The latter favors enterprise information management, where quality and consistency are key. The reality of today's world, however, is that we need both!
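
To make the contrast concrete, here is a minimal sketch (in Python, with invented table and column names) of what even a rudimentary semantic layer adds on top of raw federation: business terms such as "revenue" are mapped once to the physical columns and unit conversions of each source, so every query resolves them consistently instead of leaving that judgment to each user.

```python
# A rudimentary semantic layer: business terms mapped once to physical
# columns and units per source. Names and rules are invented for illustration.

SEMANTIC_LAYER = {
    "revenue": {
        "warehouse": {"table": "fact_sales", "column": "net_amount_usd", "scale": 1.0},
        "web_logs":  {"table": "clickstream_orders", "column": "amount_cents", "scale": 0.01},
    },
    "customer": {
        "warehouse": {"table": "dim_customer", "column": "customer_key", "scale": None},
        "web_logs":  {"table": "clickstream_orders", "column": "visitor_id", "scale": None},
    },
}

def resolve(term: str, source: str) -> str:
    """Translate a business term into the physical column for one source,
    applying any unit conversion so results are comparable across sources."""
    mapping = SEMANTIC_LAYER[term][source]
    col = f'{mapping["table"]}.{mapping["column"]}'
    if mapping["scale"] not in (None, 1.0):
        col = f'{col} * {mapping["scale"]}'
    return col

if __name__ == "__main__":
    for src in ("warehouse", "web_logs"):
        print(f"revenue in {src} -> {resolve('revenue', src)}")
```
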
Cirro, with its sights set on big data and its minimal formal structure, strongly favors the first approach. Allowing, indeed encouraging, users to build their explorations in the freeform environment that is Excel is a strong statement in itself. It's typically fast, easy and iterative, all highly valued qualities in today's breakneck business environment. However, when you link from there to the (hopefully) high-quality data warehouse, the need for a more formal and modeled approach becomes clear.

So, which approach to choose?  It depends on your starting point and initial drivers.  And your long-term needs.  Composite, for example, focuses more on the prior creation of business views to shield users from the technical complexity and inconsistencies in typical enterprise data.  Denodo, in contrast, talks of both bottom-up and top-down modeling to address both sets of needs.  In the long run, you'll probably need both approaches: the speed of an ad hoc approach for sandboxing and the quality of semantic modeling for production integration.


Posted July 12, 2012 7:47 AM