Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation published by Addison-Wesley in 1997.

Over the past few years, Barry has extended his interest to cover the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

Recently in Data integration Category

There are still many illusions and unjustified expectations about big data.  But one old belief that it has shattered, dating back to the early days of data warehousing, is that a single store can serve all BI needs.  Given the volumes and variety of big data, any thought of routing it all through a relational database environment just doesn't make sense.  And after the market's brief flirtation with the idea that all data could be handled in Hadoop (doh!), there is a general belief that IT needs to provide some sort of over-arching, integrating view for users across multiple data stores.

Cirro is among the latest players in this field, as I discovered talking to CEO Mark Theissen, previously data warehousing technical lead at Microsoft and a veteran of DATAllegro and Brio.  Its basic value proposition is to offer users self-driven exploration--via Cirro's Excel plug-in and BI tools--of data across a wide variety of platforms via ad hoc federation.  Cirro's starting point is big data scale and performance, offering a data hub with a cost-based federation optimizer, smart caching and a function library of low level MapReduce and SQL functions.  It also offers an optional "multi store" consisting of Hadoop and MySQL components that can be used as a temporary scratchpad area or a data mart.

In our conversation, Theissen declared that Cirro does federation, whereas competitors like Composite and Denodo do virtualization.  The difference, in his view, is that virtualization involves an expensive and time-consuming phase to create a semantic layer, while federation is done on the fly and, in the case of Cirro, using existing metadata from BI tools, databases and so on.  I wish it were that simple to differentiate between these two phrases, which have become a marketing battleground for many of the vendors competing in this field from the majors like IBM and Informatica to the newcomers such as Karmasphere and ClearStory.

I'd like to try to clarify the two terms... again.

The concept of federation (in data) goes back to the mid-1980s with the concept of federating SQL queries against the then-emerging relational databases.  By 1991, IBM's Information Warehouse Framework included access to heterogeneous databases via EDA/SQL from Information Builders.  By the early years of the new millennium, the need to join data from multiple, heterogeneous sources beyond traditional databases was widespread, often described as enterprise information integration (EII).  But, vendor offerings were poorly received, especially in BI, because of concerns about mismatched data meanings, security and query performance.  I consider federation as the basic technology of being able to split up a query in real time into component parts, distribute it to heterogeneous, autonomous sources and retrieve and combine the results.  To do this, access to technical metadata that defines database (or file) locations and structures, data volumes, network performance and more is needed to enable query optimization for access and performance.
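The split, distribute and combine steps described above can be sketched in a few lines. This is a minimal illustration under my own assumptions, not any vendor's implementation: two in-memory SQLite databases stand in for heterogeneous, autonomous sources, and the table names, columns and the `federated_revenue_by_region` function are all hypothetical. The "technical metadata" is reduced to hard-coded knowledge of which table lives where.

```python
import sqlite3

def make_source(ddl, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one autonomous source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

# Hypothetical source 1: a CRM system that owns customer data.
crm = make_source(
    "CREATE TABLE customers (id INTEGER, region TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC"), (3, "EMEA")],
)

# Hypothetical source 2: an order system that owns sales data.
orders = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100.0), (2, 250.0), (3, 50.0), (1, 25.0)],
)

def federated_revenue_by_region(region):
    """Answer 'total order amount for customers in a region' across both sources."""
    # Sub-query 1: push the region filter down to the CRM source.
    ids = [row[0] for row in crm.execute(
        "SELECT id FROM customers WHERE region = ?", (region,))]
    if not ids:
        return 0.0
    # Sub-query 2: send only the matching keys to the order source.
    placeholders = ",".join("?" for _ in ids)
    partials = orders.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        f"WHERE customer_id IN ({placeholders}) GROUP BY customer_id", ids)
    # Combine: merge the partial results into a single answer.
    return sum(total for _, total in partials)

print(federated_revenue_by_region("EMEA"))  # prints 175.0
```

A real federation optimizer would choose the order and shape of the sub-queries from data volumes and network costs; here the order is simply fixed in the code.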

Data virtualization, in my view, builds on top of federation with knowledge of the business-related metadata required to address the problem of disparate data meanings, relationships and currencies and deliver high quality results that are meaningful and consistent for the business user submitting the query.  Simply put, there are two ways to address these problems and supply the needed metadata.  The easiest approach is to depend on the business user to understand data consistency and similar quasi-IT issues and to make sensible (in terms of data coherence and reliable results) queries.  The second way is to model the data to some extent upfront and create a semantic layer, as it's often called, that ensures the quality of returned results.
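To make the second option more concrete, here is a deliberately small sketch of what a semantic layer contributes, assuming two hypothetical sources that report revenue under different column names, units and currencies. The source names, field names and the fixed exchange rate are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class Mapping:
    """How one source exposes a business term."""
    column: str                            # physical column name in the source
    to_business: Callable[[float], float]  # conversion to the agreed definition

# Assumed fixed rate, for illustration only; a real layer would look rates up.
USD_PER_EUR = 1.25

# Semantic layer: business term -> per-source mapping, modeled upfront.
SEMANTIC_LAYER: Dict[str, Dict[str, Mapping]] = {
    "revenue_usd": {
        "us_sales": Mapping("rev_usd", lambda v: v),
        "eu_sales": Mapping("turnover_eur_k", lambda v: v * 1000 * USD_PER_EUR),
    }
}

# Hypothetical raw rows, as each source would return them.
SOURCES: Dict[str, List[dict]] = {
    "us_sales": [{"rev_usd": 1200.0}, {"rev_usd": 800.0}],
    "eu_sales": [{"turnover_eur_k": 2.0}],  # thousands of euros
}

def query_term(term: str) -> float:
    """Total a business term across every source that maps it."""
    total = 0.0
    for source, mapping in SEMANTIC_LAYER[term].items():
        for row in SOURCES[source]:
            total += mapping.to_business(row[mapping.column])
    return total

print(query_term("revenue_usd"))  # prints 4500.0
```

Without the mapping, a user adding the raw columns would get 2002.0, a number that looks plausible and means nothing. That is the risk the first option leaves with the business user.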

The former approach typically leads to faster, cheaper implementations; the latter to longer-term quality at some upfront cost.  The former works better if you're coming from a big data view point, where much of the data is poorly defined, changing and of questionable accuracy and consistency in any case.  The latter favors enterprise information management where quality and consistency are key.  The reality of today's world, however, is that we need both!

Cirro, with its sights set on big data and its minimal formal structure, strongly favors the first approach.  Allowing, indeed encouraging, users to build their explorations in the freeform environment that is Excel is a strong statement in itself.  It's typically fast, easy and iterative, all highly valued qualities in today's break-neck speed business environment.  However, when you link from there to the (hopefully) high-quality data warehouse, the need for a more formal and modeled approach becomes clear.

So, which approach to choose?  It depends on your starting point and initial drivers.  And your long-term needs.  Composite, for example, focuses more on the prior creation of business views to shield users from the technical complexity and inconsistencies in typical enterprise data.  Denodo, in contrast, talks of both bottom-up and top-down modeling to address both sets of needs.  In the long run, you'll probably need both approaches: the speed of an ad hoc approach for sandboxing and the quality of semantic modeling for production integration.


Posted July 12, 2012 7:47 AM

JackBe's CTO, John Crupi, and VP of Marketing, Chris Warner, created a definitional firestorm among BI experts at the BBBT on Friday.  A long-time Ajax and Enterprise Mashup Platform provider, JackBe has more recently begun to describe itself as a Real-Time Intelligence provider.  That was always going to be a phrase that generated excited discussion.

First, what is Real-Time?  In the case of JackBe, it relates more to immediate access, both in definition and use, to existing sources of data than to the more conventional BI use of the term, which focuses more on how current that data is.  As a mashup, JackBe's Presto product doesn't actually care how current the data it accesses is.  The source could be an operational application, a data warehouse, a spreadsheet, a web resource or whatever--clearly a wide range of data latency (and reliability, too!).  So, the important idea that BI practitioners have to get their heads around is that Real-Time in this context is about giving business users fast and nimble access to existing data sources.

As a mashup, and coming from the Web 2.0 world, the second thing we need to recognize is that JackBe allows end users to combine information in innovative ways into dashboard-like constructs themselves.  In function, mashups are similar to more traditional portals, but use the more flexible tooling and constructs of Web 2.0, enabling users to do more for themselves without calling on IT.  JackBe thus enables self-service BI, provided that accessible information resources already exist.  Presto provides the means to find those sources, the ability to link them together and the robust security required to ensure users can access only what they are allowed to.

As with all approaches to self-service business intelligence, the most challenging aspect for BI practitioners is to understand and even regulate the validity of the results produced.  Does it make logical business sense to combine sources A and B?  Does source A contain data from the same timeframe as source C?  Does profit margin in source B have the exact same definition as that in source D?  And so on.  These are the types of questions that lead to the creation of a data warehouse; resolving them leads to the typical delays in delivering data warehouses.

The bottom line is that JackBe provides a powerful tool to drive rapid innovation by end users in business intelligence.  Given the speed of change in today's business, that has to be a good thing.  But, as is the case when any powerful tool is put in the hands of a user, there is a danger of severely burnt fingers!  The BI department must therefore put processes in place to help users know if the information they want is really suitable for mashing up.  In practice, this will require either the creation of extensive metadata to describe the available information sources or the provision of a robust help desk facility to explain to users what's possible and even what went wrong.
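As a rough illustration of the metadata option, the sketch below records for each source the period it covers and the definition it uses for profit margin, then flags combinations that would not make business sense. The `SourceInfo` fields, the source names and the `mashup_warnings` function are hypothetical and are not part of Presto.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass(frozen=True)
class SourceInfo:
    """Descriptive metadata a BI team might publish for one source."""
    name: str
    period_start: date
    period_end: date
    margin_definition: str  # e.g. "gross" or "net"

def mashup_warnings(a: SourceInfo, b: SourceInfo) -> List[str]:
    """Return human-readable reasons why combining a and b may mislead."""
    warnings = []
    # Two periods overlap only if each one starts before the other ends.
    if a.period_start > b.period_end or b.period_start > a.period_end:
        warnings.append(f"{a.name} and {b.name} cover different timeframes")
    if a.margin_definition != b.margin_definition:
        warnings.append(
            f"profit margin is {a.margin_definition} in {a.name} "
            f"but {b.margin_definition} in {b.name}")
    return warnings

warehouse = SourceInfo("warehouse", date(2010, 1, 1), date(2010, 12, 31), "net")
spreadsheet = SourceInfo("spreadsheet", date(2011, 1, 1), date(2011, 1, 31), "gross")

for warning in mashup_warnings(warehouse, spreadsheet):
    print("WARNING:", warning)
```

Checks like these only cover what has been described in advance, which is why a help desk remains the fallback for everything the metadata does not capture.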


Posted January 30, 2011 10:16 AM

As an old proponent of the Enterprise Data Warehouse or EDW (well, let me stick my neck out and claim to be its first proponent, although I labeled it the BDW - Business Data Warehouse, or Barry Devlin's Warehouse!), I've had many debates over the years about the relative merits of consolidating and reconciling data in an EDW for subsequent querying vs. sending the query to a disparate set of data sources.  Unlike some traditionalists, I concluded as far back as 2002 that there existed good use cases for both approaches. I still stick with that belief.  So, the current excitement and name-space explosion about the topic leaves me a touch bemused.

But I found myself more confused than bemused when I read Stephen Swoyer's article Why Data Virtualization Trumps Data Federation Alone in the Dec. 1 TDWI "BI This Week" newsletter.  Quoting Philip Russom, research manager with TDWI Research, and author of a new Checklist Report from TDWI Research, Data Integration for Real-Time Data Warehousing and Data Virtualization, he says: "[D]ata virtualization must abstract the underlying complexity and provide a business-friendly view of trusted data on demand. To avoid confusion, it's best to think of data federation as a subset or component of data virtualization. In that context, you can see that a traditional approach to federation is somewhat basic or simple compared to the greater functionality of data virtualization".

OK, maybe I'm getting old, but that didn't help me a lot to understand why data virtualization trumps data federation alone.  So, I went to the Checklist Report, where I found a definition: "For the purposes of this Checklist Report, let's define data virtualization as the pooling of data integration resources", whereas traditional data federation "only federates data from many different data sources in real time", the latter from a table sourced by Informatica, the sponsor of the report.  When I read the rest of the table, it finally dawned on me that I was in marketing territory.  Try this for size: "[Data virtualization] proactively identifies and fixes data quality issues on the fly in the same tool"!  How would that work?

Let me try to clarify the conundrum of virtualization, federation, enterprise information integration and even mash-ups, at least from my (perhaps over-simplified) viewpoint.  They're all roughly equivalent - there may be highly nuanced differences, but the nuances depend on which vendor you're talking to.  They all provide a mechanism for decomposing a request for information into sub-requests that are sent to disparate and distributed data sources unbeknownst to the user, receive the answers and combine them into a single response.  In order to do that, they all have some amount of metadata that locates and describes the information sources, a set of adapters (often called by different names) that know how to talk with different data sources, and, for want of a better description, a layer that insulates the user from all of the complexity underneath.
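The three ingredients named above (metadata that locates sources, adapters, and an insulating layer) can be sketched as follows. This is an illustration under my own assumptions rather than any product's design: the `Adapter` protocol, the two adapter classes, the `CATALOG` dictionary and the sample data are all hypothetical.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterable, List, Protocol

class Adapter(Protocol):
    """Common interface every source-specific adapter must offer."""
    def fetch(self, entity: str) -> Iterable[dict]: ...

class SqliteAdapter:
    """Knows how to talk to a relational source."""
    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.row_factory = sqlite3.Row
    def fetch(self, entity: str) -> Iterable[dict]:
        # Entity names come from the trusted catalog, never from user input.
        return [dict(row) for row in self.conn.execute(f"SELECT * FROM {entity}")]

class CsvAdapter:
    """Knows how to talk to flat files (held in memory for this sketch)."""
    def __init__(self, files: Dict[str, str]):
        self.files = files  # entity name -> CSV text
    def fetch(self, entity: str) -> Iterable[dict]:
        return list(csv.DictReader(io.StringIO(self.files[entity])))

# Sample sources standing in for two disparate systems.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (sku TEXT, name TEXT)")
db.executemany("INSERT INTO products VALUES (?, ?)",
               [("A1", "Widget"), ("B2", "Gadget")])
files = {"prices": "sku,price\nA1,9.99\nB2,19.50\n"}

# Metadata: which adapter serves which entity.
CATALOG: Dict[str, Adapter] = {
    "products": SqliteAdapter(db),
    "prices": CsvAdapter(files),
}

def price_list() -> List[dict]:
    """Insulating layer: the caller never sees which source holds what."""
    prices = {r["sku"]: float(r["price"]) for r in CATALOG["prices"].fetch("prices")}
    return [{"name": p["name"], "price": prices[p["sku"]]}
            for p in CATALOG["products"].fetch("products")]

print(price_list())
```

Whatever the vendor calls the pieces, adding a new kind of source in this arrangement means writing one more adapter and one more catalog entry; `price_list` and its callers stay unchanged.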

But, whatever you call it (and let's call it data virtualization for now - the term with allegedly the greatest cachet), is it a good idea?  Should you do it?  I believe the answer today is a resounding yes - there is far too much information of too many varieties to ever succeed in getting it into a single EDW.  There is an ever growing business demand for access to near real-time information that ETL, however trickle-fed, struggles to satisfy.  And, yes, there are dangers and drawbacks to data virtualization, just as there are to ETL.  And the biggest drawback, despite Informatica's claim to the contrary, is that you have to be really, really careful about data quality.

By the way, I am open to being proven wrong on this last point; it's only by our mistakes that we learn!  Personally, I could use a tool that "proactively identifies and fixes data quality issues on the fly".

Posted December 2, 2010 7:52 AM

Pervasive Software presented at the Boulder BI Brain Trust (BBBT) last Friday, August 13.  What caught my attention was their DataRush product and technology, and particularly the technological driver behind it.  For a brief overview of the other aspects of Pervasive covered, check out Richard Hackathorn's blog on the day.

But, back to DataRush.  DataRush was originally conceived as a redesign of Pervasive's data integration tool, acquired from Data Junction in 2003.  However, it was soon recognized that the underlying function could be applied to other data-intensive tasks such as analytics.  Pervasive CTO, Mike Hoskins, described DataRush as a toolkit and engine that enables ordinary programmers to create parallel-processing applications simply and easily using data flow techniques to design them and without having to worry about the complexities of parallel-processing design, such as timing and synchronization between parallel tasks.
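The data flow idea can be illustrated without DataRush itself. The sketch below is not Pervasive's API; it is a generic pipeline in which each operator runs in its own thread and operators communicate only through queues, so the application programmer writes no locks or explicit synchronization. The `run_stage` and `pipeline` helpers and the three operators are hypothetical. In CPython, threads show the structure rather than a real multi-core speed-up; a production engine would schedule operators across cores.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream

def run_stage(func, inbox, outbox):
    """Start a thread that applies func to each item from inbox."""
    def worker():
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # propagate end-of-stream downstream
                return
            result = func(item)
            if result is not None:  # None means "drop this record"
                outbox.put(result)
    thread = threading.Thread(target=worker)
    thread.start()
    return thread

def pipeline(records, *operators):
    """Chain operators with queues and return the collected output."""
    first = queue.Queue()
    inbox = first
    threads = []
    for op in operators:
        outbox = queue.Queue()
        threads.append(run_stage(op, inbox, outbox))
        inbox = outbox  # this stage's output feeds the next stage
    for record in records:
        first.put(record)
    first.put(_DONE)
    results = []
    while True:
        item = inbox.get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results

# Three simple operators: parse, filter, transform.
def parse(line):
    return int(line.strip())

def keep_even(n):
    return n if n % 2 == 0 else None

def square(n):
    return n * n

print(pipeline(["1", "2 ", "3", "4"], parse, keep_even, square))  # prints [4, 16]
```

The developer only writes `parse`, `keep_even` and `square` and states how they connect; timing between the stages is handled entirely by the queues, which is the division of labor described above.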

Now, of course, there's nothing new about parallel processing or the inherent difficulties it presents to programmers.  It's been at the heart of large-scale data warehousing, particularly through the use of MPP (massively parallel processing) systems, for a number of years.  Mike's point, however, was that parallel processing is about to go mainstream.  The technology shift enabling that has been underway for a few years now--the growing availability of multi-core processors and servers since the mid-2000s.  4-core processors are already common on desktop machines, while processors with 32 cores and more are already available for servers.  Multiply that by the number of sockets in a typical server, and you have massive parallelism in a single box--if you can use it.  The problem is that with existing applications designed for serial processing, the only benefit to be gained from such multi-core servers at present is in supporting multiple concurrent users or tasks or in what's known as "embarrassingly parallel" applications where there are no inter-task dependencies.  DataRush's claim to fame is that it moves data-intensive parallel processing from high-end, expensive and complex MPP clusters and specialist programmers to commodity, inexpensive and simple SMP multi-core servers and ordinary developers.

Of course, Pervasive is not alone in trying to tackle the issues involved in software development for parallel-processing environments.  But their approach, coming from the large-scale data integration environment, makes a lot of sense in BI.

However, to see the really significant implications, we need to see this development in the context of other technological advances.  There is the emergence of solid-state disks (SSDs) and the growing sizes and dropping costs of core memory that remove or reduce the traditional disk I/O bottleneck.  The decades-old supremacy of traditional relational databases is being challenged by a variety of different structures, some broadly relational and others distinctly not.  Add to this the explosive growth of data volumes, especially soft or "unstructured" information.  Pervasive, along with other small and medium-sized software vendors, is pushing information processing to an entirely new level.


Posted August 18, 2010 7:58 AM