Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author >

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today covers the wider field of the fully integrated business, spanning informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry's latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

December 2010 Archives

I'm thrilled to be keynoting at O'Reilly Media's first Strata Conference - Making Data Work - on 1-3 February in Santa Clara, California. I also have a couple of sessions positioning data warehousing alongside "big data", as it's popularly known.


O'Reilly Media are making a 25% discount code on conference registration available to readers, followers and friends: str11fsd. So, there's an added incentive to sign up...

Researching big data in preparation for the conference has been a fascinating experience, and it has also brought on an intriguing sense of déjà vu. In fact, it reminds me of the early days of data warehouse tooling, when the emphasis was on the speeds and feeds of ETL into the warehouse and how everybody needed a new-fangled OLAP database to do the latest and greatest dimensional modeling. Today, the excitement is around Hadoop and MapReduce and the volumes of data they can chew through, and around the statistical and text analytics that will ultimately find that gold nugget of "unknown unknown" information in the data exhaust of your web site usage.

This is pioneering work, boldly going where no man has gone before in the data universe. It is exciting, and it produces great stories of challenges overcome, volumes never before processed and behaviors never before correlated. These are the big promises of big data. It should be a salutary lesson for us all, though, that the old data mining / data warehousing nugget from the 1990s - "Men who buy diapers on Friday evenings are also likely to buy beer" - is now widely believed to be an urban legend rather than a true story of unexpected and momentous business value.

What's more, those of us who were around big data for many years before the phrase became popular (back in the 1980s, a few hundred MB of data was BIG) know that big responsibilities soon catch up with big promises. It's a lot easier to run some experimental analyses on big data than to move the whole process into ongoing production. Playing with huge volumes of web log data is fine until you realize that you have to comply with privacy and other regulations, and that you have to store the data for seven years to enable future audits on your decision making. At that moment, you begin to realize why databases have the consistency and durability characteristics they do, and that clever parallel processing and distributed file systems have limitations as well as strengths.

That said, Hadoop and MapReduce are demonstrating some fascinating possibilities in the parallel processing of complex data, and as multi-core processors come out with ever more cores, we certainly need new ways to take advantage of them. Database vendors from Aster Data to Teradata are also exploring the possibilities, both on the analysis side and on the data sourcing side. Given the hype, it seems likely we'll hear a lot more about these types of usage in 2011. But in the long term, the big winners will be the companies that get serious about putting these techniques into production, rather than necessarily those with the biggest data.
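
To make the map-and-reduce idea concrete, here is a minimal sketch of the pattern - not Hadoop itself, just the same divide-count-merge shape on a single multi-core machine using Python's multiprocessing module. The log file name "access.log" and the Apache-style log format are assumptions purely for illustration.

# Illustrative only: a toy map/reduce over web-server log lines, spreading the
# "map" work across cores with Python's multiprocessing pool. The file name
# and the combined-log format are hypothetical.
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    # Map step: count the request paths seen in one chunk of log lines.
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) > 6:            # crude parse of a combined-format log line
            counts[parts[6]] += 1     # the request path is the seventh field
    return counts

def reduce_counts(partials):
    # Reduce step: merge the per-chunk counters into a single result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    with open("access.log") as f:                 # hypothetical web log
        lines = f.readlines()
    chunks = [lines[i::4] for i in range(4)]      # naive four-way split
    with Pool(processes=4) as pool:
        partials = pool.map(map_chunk, chunks)
    print(reduce_counts(partials).most_common(10))

Hadoop, of course, adds the distributed file system, fault tolerance and scheduling that let the same pattern run across hundreds of machines rather than four cores - which is exactly where the production questions above begin to bite.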

I wish you a Happy Christmas and a Peaceful and Prosperous New Year.


Posted December 20, 2010 8:46 AM
As an old proponent of the Enterprise Data Warehouse or EDW (well, let me stick my neck out and claim to be its first proponent, although I labeled it the BDW - Business Data Warehouse, or Barry Devlin's Warehouse!), I've had many debates over the years about the relative merits of consolidating and reconciling data in an EDW for subsequent querying vs. sending the query out to a disparate set of data sources. Unlike some traditionalists, I concluded as far back as 2002 that there are good use cases for both approaches, and I still stick with that belief. So, the current excitement around the topic, and the explosion of names for it, leaves me a touch bemused.

But I found myself more confused than bemused when I read Stephen Swoyer's article, Why Data Virtualization Trumps Data Federation Alone, in the Dec. 1 TDWI "BI This Week" newsletter. It quotes Philip Russom, research manager with TDWI Research and author of a new Checklist Report, Data Integration for Real-Time Data Warehousing and Data Virtualization: "[D]ata virtualization must abstract the underlying complexity and provide a business-friendly view of trusted data on demand. To avoid confusion, it's best to think of data federation as a subset or component of data virtualization. In that context, you can see that a traditional approach to federation is somewhat basic or simple compared to the greater functionality of data virtualization".

OK, maybe I'm getting old, but that didn't help me a lot to understand why data virtualization trumps data federation alone.  So, I went to the Checklist Report, where I found a definition: "For the purposes of this Checklist Report, let's define data virtualization as the pooling of data integration resources", whereas traditional data federation "only federates data from many different data sources in real time", the latter from a table sourced by Informatica, the sponsor of the report.  When I read the rest of the table, it finally dawned on me that I was in marketing territory.  Try this for size: "[Data virtualization] proactively identifies and fixes data quality issues on the fly in the same tool"!  How would that work?

Let me try to clarify the conundrum of virtualization, federation, enterprise information integration and even mash-ups, at least from my (perhaps over-simplified) viewpoint. They're all roughly equivalent - there may be highly nuanced differences, but the nuances depend on which vendor you're talking to. They all provide a mechanism for decomposing a request for information into sub-requests, sending those sub-requests to disparate and distributed data sources unbeknownst to the user, receiving the answers and combining them into a single response. To do that, they all need some amount of metadata that locates and describes the information sources, a set of adapters (often called by different names) that know how to talk to the different data sources, and, for want of a better description, a layer that insulates the user from all of the complexity underneath.
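
To make that description concrete, here is a minimal sketch of the pattern under my own assumptions: a trivial metadata catalogue, one adapter per source type, and a thin layer that splits the request and merges the answers. All the names (SqlAdapter, CsvAdapter, virtual_query, the crm and bank sources) are hypothetical; real products layer query optimization, security and caching on top of this skeleton.

# Illustrative sketch of the federation pattern described above: a catalogue
# of sources, per-source adapters, and a thin layer that decomposes a request,
# queries each source and merges the answers. All names are made up.
import csv
import sqlite3

class SqlAdapter:
    # Adapter for a relational source (here an in-memory SQLite table).
    def __init__(self, conn, table):
        self.conn, self.table = conn, table

    def fetch(self, columns, key):
        cur = self.conn.execute(
            "SELECT %s FROM %s WHERE cust_id = ?" % (", ".join(columns), self.table),
            (key,))
        row = cur.fetchone()
        return dict(zip(columns, row)) if row else {}

class CsvAdapter:
    # Adapter for a flat-file source, e.g. an extract dropped by another system.
    def __init__(self, path):
        self.path = path

    def fetch(self, columns, key):
        with open(self.path, newline="") as f:
            for row in csv.DictReader(f):
                if row.get("cust_id") == str(key):
                    return {c: row[c] for c in columns if c in row}
        return {}

# The "metadata" layer: which source owns which attributes.
CATALOGUE = {"name": ("crm", ["name"]), "balance": ("bank", ["balance"])}

def virtual_query(adapters, attributes, key):
    # Decompose the request per source, call each adapter, merge the answers.
    result = {"cust_id": key}
    for attr in attributes:
        source, columns = CATALOGUE[attr]
        result.update(adapters[source].fetch(columns, key))
    return result

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE crm (cust_id INTEGER, name TEXT)")
    conn.execute("INSERT INTO crm VALUES (42, 'Ada Lovelace')")
    with open("bank.csv", "w", newline="") as f:
        f.write("cust_id,balance\n42,1234.56\n")
    adapters = {"crm": SqlAdapter(conn, "crm"), "bank": CsvAdapter("bank.csv")}
    print(virtual_query(adapters, ["name", "balance"], 42))
    # {'cust_id': 42, 'name': 'Ada Lovelace', 'balance': '1234.56'}

The interesting work in a real product is, of course, everything this sketch leaves out: pushing predicates down to the sources, joining across them efficiently, and deciding what to do when the answers disagree.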

But, whatever you call it (and let's call it data virtualization for now - the term with allegedly the greatest cachet), is it a good idea? Should you do it? I believe the answer today is a resounding yes - there is far too much information of too many varieties to ever succeed in getting it all into a single EDW, and there is an ever-growing business demand for access to near real-time information that ETL, however trickle-fed, struggles to satisfy. And, yes, there are dangers and drawbacks to data virtualization, just as there are to ETL. The biggest drawback, despite Informatica's claim to the contrary, is that you have to be really, really careful about data quality.
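
To illustrate why, consider what happens when two live sources return different values for the same attribute. The virtualization layer has to notice the disagreement at query time, and somebody - or some rule - has to arbitrate it, which is exactly the reconciliation work an EDW load process would have done up front. A toy sketch, with made-up sources and data:

# Illustrative: merging one customer's record from two live sources and
# flagging the attributes on which they disagree. Sources and data are made up.
def merge_with_conflicts(records):
    # records: {source_name: {attribute: value}} -> (merged, conflicts)
    merged, origins, conflicts = {}, {}, {}
    for source, record in records.items():
        for attr, value in record.items():
            if attr in merged and merged[attr] != value:
                conflicts.setdefault(attr, {origins[attr]: merged[attr]})[source] = value
            else:
                merged[attr], origins[attr] = value, source
    return merged, conflicts

crm = {"cust_id": 42, "name": "A. Lovelace", "country": "UK"}
bank = {"cust_id": 42, "name": "Ada Lovelace", "country": "UK"}

merged, conflicts = merge_with_conflicts({"crm": crm, "bank": bank})
print(conflicts)   # {'name': {'crm': 'A. Lovelace', 'bank': 'Ada Lovelace'}}
# Detecting the clash is the easy part; deciding on the fly which value is
# "right" is the hard part that no tool does proactively for free.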

By the way, I am open to being proven wrong on this last point; it's only by our mistakes that we learn!  Personally, I could use a tool that "proactively identifies and fixes data quality issues on the fly".

Posted December 2, 2010 7:52 AM