Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today covers the wider field of a fully integrated business, embracing informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry's latest book, Business unIntelligence: Insight and Innovation Beyond Analytics and Big Data, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

Recently in Data integration Category

Traditionally, BI has been a process-free zone. Decision makers are such free thinkers that suggesting their methods of working can be defined by some stodgy process is generally met with sneers of derision. Or worse. BI vendors and developers have largely acquiesced; the only place you see process mentioned is in data integration, where activity flow diagrams abound to define the steps needed to populate the data warehouse and marts.

I, on the other hand, have long held - since the turn of the millennium, in fact - that all decision making follows a process, albeit a very flexible and adaptive one. The early proof emerges in operational BI (or decision management, as it's also called), where decision-making steps are embedded in fairly traditional operational processes. As predictive and operational analytics have become increasingly popular, the intermingling of informational and operational work has reached the point where these once distinctly different business behaviors are becoming indistinguishable. A relatively easy thought experiment then leads to the conclusion that all decision making has an underlying process.

I was also fairly sure at an early stage that only a Service Oriented Architecture (SOA) approach could provide the flexible and adaptive activities and workflows required. I further saw that SOA could (and would need to) be a foundation for data integration as the demand for near real-time decision making grew. As a result, I have been discussing all this at seminars and conferences for many years now. But every time I'd mention SOA, the sound of discontent would rumble around the room. Too complex. Tried it and failed. And, more recently, isn't that all old hat now with cloud and mobile?

All of this is by way of introduction to a very interesting briefing I received this week from Pat Pruchnickyj, Director of Product Marketing at Talend, who restored my faith in SOA as an overall approach and in its practical application! Although perhaps best known for the open source ETL (extract, transform and load) and data integration tooling it first introduced in 2006, Talend today takes a broader view, offering data-focused solutions, such as ETL and data quality, as well as open source application integration solutions, such as enterprise service bus (ESB) and message queuing. These various approaches are united by common metadata, typically created and managed through a graphical, workflow-oriented tool, Talend Open Studio.

So, why is this important? If you follow the history of BI, you'll know that many well-established implementations are characterized by complex and often long-running batch processes that gather, consolidate and cleanse data from multiple internal operational sources into a data warehouse and then to marts. This is a model that scales poorly in an era where vast volumes of data are coming from external sources (a substantial part of big data) and analysis is increasingly demanding near real-time data. File-based data integration becomes a challenge in these circumstances. The simplest approach may be to move towards ever smaller files running in micro-batches. However, the ultimate requirement is to enable message-based communication between source and target applications/databases. This requires a fundamental change in thinking for most BI developers. So a starting point of ETL and an end point of messaging, both under a common ETL-like workflow, makes for easier growth. Developers can begin to see that a data transfer/cleansing service is conceptually similar to any business activity also offered as a service. And the possibility of creating workflows combining operational and informational processes emerges naturally to support operational BI.
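To make that ETL-to-messaging progression concrete, here is a minimal sketch, in plain Python rather than any Talend tooling, of the same cleansing service driven two ways: first by a micro-batch poll, then by a message-based consumer. The helper names (extract_changes, load_to_warehouse and so on) are hypothetical placeholders, not any product's API.

import queue
import time

def cleanse(record):
    # Hypothetical cleansing rule: trim strings and drop missing values.
    return {k: v.strip() if isinstance(v, str) else v
            for k, v in record.items() if v is not None}

def load_to_warehouse(records):
    # Placeholder for a bulk insert into the target warehouse or mart.
    print(f"loaded {len(records)} records")

# 1. Micro-batch: poll the source on a short interval and load small sets.
def micro_batch_loop(extract_changes, interval_seconds=60, max_cycles=3):
    for _ in range(max_cycles):
        batch = [cleanse(r) for r in extract_changes()]
        if batch:
            load_to_warehouse(batch)
        time.sleep(interval_seconds)

# 2. Message-based: the source publishes each change; the consumer applies
#    the very same cleansing service per message, as close to real time as the bus allows.
def message_consumer(bus, stop_marker=None):
    while True:
        record = bus.get()
        if record is stop_marker:
            break
        load_to_warehouse([cleanse(record)])

# Example wiring: a source publishes two changed records, then a stop marker.
bus = queue.Queue()
bus.put({"id": 1, "name": " Acme "})
bus.put({"id": 2, "name": " Zenith"})
bus.put(None)
message_consumer(bus)

The transformation logic is identical in both cases; only the trigger and the latency change, which is what makes a common, ETL-like workflow over both styles such a natural stepping stone.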

Is this to say that ETL tools are a dying species? Certainly not. For some types and sizes of data integration, a file-based approach will continue to offer higher performance or more extensive integration and cleansing functionality. The key is to ensure common, shared metadata (or, as I prefer to call it, context-setting information, CSI) between all the different flavors of data and application integration.

Process, including both business and IT aspects, is the subject of Chapter 7 of "Business unIntelligence: Insight and Innovation Beyond Analytics and Big Data".

Sunset Over Architecture (SOA) image: http://vorris.blogspot.com/2012/07/mr-cameron-you-are-darn-right-start.html

Posted December 12, 2013 3:46 AM
Permalink | No Comments |
Much has happened while I've been heads down over the past few months finishing my book. Well, "Business unIntelligence: Insight and Innovation Beyond Analytics and Big Data" went to the printer last weekend and should be in the stores by mid-October. And I can rejoin the land of the living. One of the interesting news stories in the meantime was Cisco's acquisition of Composite Software, which closed at the end of July. Mike Flannagan, Senior Director & General Manager, IBT Group at Cisco, and Bob Eve, Data Virtualization Marketing Director and long-time advocate of virtualization at Composite, turned up at the BBBT in mid-August to brief an eclectic bunch of independent analysts, including myself.

The link-up of Cisco and Composite is, I believe, going to offer some very interesting technological opportunities in the market, especially in BI and big data. 

BI has been slow to adopt data virtualization. In fact, I was one of the first to promote the approach with IBM Information Integrator (now part of InfoSphere), some ten years ago when I was still with IBM. The challenge was always that virtualization seems to fly in the face of traditional EDW consolidation and reconciliation via ETL tools. I say seems because the two approaches are more complementary than competitive. Way back in the early 2000s, it was already clear to me that there were three obvious use cases: (i) real-time access to operational data, (ii) access to non-relational data stores, and (iii) rapid prototyping. The advent of big data and the excitement of operational analytics have confirmed my early enthusiasm. No argument - data virtualization and ETL are mandatory components of any new BI architecture or implementation.

So, what does Cisco add to the mix with Composite? One of the biggest challenges for virtualization is to understand and optimize the interaction between databases and the underlying network. When data from two or more distributed databases must be joined in a real-time query, the query optimizer needs to know, among other things, where the data resides, the volumes in each location, the available processing power of each database, and the network considerations for moving the data between locations. Data virtualization tools typically focus on the first three database concerns, probably as a result of their histories. However, the last concern, the network, increasingly holds the key to excellent optimization. There are two reasons. First, processor power continues to grow, so database performance has proportionately less impact. Second, cloud and big data together mean that distribution of data is becoming much more prevalent. And growth in network speed, while impressive, is not in the same ballpark as that of processing, making for a tighter bottleneck. And who better to know about the network, and even tweak its performance profile to favor a large virtualization transfer, than a big networking vendor like Cisco? The fit seems just right.
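A back-of-the-envelope sketch illustrates why the transfer term comes to dominate; the sites, volumes and speeds below are invented purely for illustration and have nothing to do with Cisco's or Composite's actual cost models.

# Rough sketch of why the network matters to a distributed join optimizer.
# All figures are invented for illustration only.

def transfer_seconds(volume_gb, network_gbps):
    # 8 bits per byte: time to move the data between sites.
    return volume_gb * 8 / network_gbps

def scan_seconds(volume_gb, scan_gb_per_sec):
    # Time for the local database to read and process its share.
    return volume_gb / scan_gb_per_sec

def plan_cost(ship_gb, network_gbps, local_gb, scan_gb_per_sec):
    # Simplistic additive model: local processing plus data movement.
    return scan_seconds(local_gb, scan_gb_per_sec) + transfer_seconds(ship_gb, network_gbps)

# Plan A: ship the small table (2 GB) to the big table's site.
# Plan B: ship the big table (200 GB) to the small table's site.
cost_a = plan_cost(ship_gb=2, network_gbps=10, local_gb=200, scan_gb_per_sec=5)
cost_b = plan_cost(ship_gb=200, network_gbps=10, local_gb=2, scan_gb_per_sec=5)
print(f"Plan A: {cost_a:.1f}s, Plan B: {cost_b:.1f}s")
# Faster processors shrink the scan terms; only a faster (or smarter) network
# shrinks the transfer term - which is where Cisco's knowledge comes in.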

For this technical vision to work, the real challenge will be organizational, as is always the case with acquisitions.  Done well, acquisitions can be successful.  Witness IBM's integration of Lotus and Netezza, to name but two. Of course, strategic management and cultural fit always count. But, the main question usually is: Does the buyer really understand what the acquired company brings and is the buyer willing to change their own plans to accommodate that new value? It's probably too early to answer that question. The logistics are still being worked through and the initial focus is on ensuring that current plans and revenue targets are at least maintained. But, if I may offer some advice on the strategy... 

The Cisco network must recognize that the query optimizer in Composite will, in some sense, become another boss. The value for the combined company comes from the knowledge that resides in the virtualization query optimizer about what data types and volumes need to be accommodated on the network. This becomes the basis of how to route the data and how to tweak the network to carry it. In terms of company size, this may be the tail wagging the dog. But, in terms of knowledge, it's more like the dog with two heads. The Greek mythological image of Kyon Orthros, "the dog of morning twilight" and the burning heat of mid-summer, is perhaps an appropriate image.  An opportunity to set the network ablaze.

Posted September 7, 2013 12:21 AM
Permalink | 1 Comment |
JackBe's CTO, John Crupi, and VP of Marketing, Chris Warner, created a definitional firestorm among BI experts at the BBBT on Friday.  A long-time Ajax and Enterprise Mashup Platform provider, JackBe has more recently begun to describe itself as a Real-Time Intelligence provider.  That was always going to be a phrase that generated excited discussion.

First, what is Real-Time? In the case of JackBe, it relates more to immediate access, both in definition and use, to existing sources of data than to the more conventional BI use of the term, which focuses more on how current that data is. As a mashup, JackBe's Presto product doesn't actually care how current the data it accesses is. The source could be an operational application, a data warehouse, a spreadsheet, a web resource or whatever--clearly a wide range of data latency (and reliability, too!). So, the important idea that BI practitioners have to get their heads around is that Real-Time in this context is about giving business users fast and nimble access to existing data sources.

As a mashup, and coming from the Web 2.0 world, the second thing we need to recognize is that JackBe allows end users themselves to combine information in innovative ways into dashboard-like constructs. In function, mashups are similar to more traditional portals, but use the more flexible tooling and constructs of Web 2.0, enabling users to do more for themselves without calling on IT. JackBe thus enables self-service BI, provided that accessible information resources already exist. Presto provides the means to find those sources, the ability to link them together and the robust security required to ensure users can access only what they are allowed to.

As with all approaches to self-service business intelligence, the most challenging aspect for BI practitioners is to understand and even regulate the validity of the results produced.  Does it make logical business sense to combine sources A and B?  Does source A contain data from the same timeframe as source C?  Does profit margin in source B have the exact same definition as that in source D?  And so on.  These are the types of questions that lead to the creation of a data warehouse; resolving them leads to the typical delays in delivering data warehouses.

The bottom line is that JackBe provides a powerful tool to drive rapid innovation by end users in business intelligence.  Given the speed of change in today's business, that has to be a good thing.  But, as is the case when any powerful tool is put in the hands of a user, there is a danger of severely burnt fingers!  The BI department must therefore put processes in place to help users know if the information they want is really suitable for mashing up.  In practice, this will require either the creation of extensive metadata to describe the available information sources or the provision of a robust help desk facility to explain to users what's possible and even what went wrong.
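As a rough illustration of what that metadata might look like, here is a hypothetical catalog entry, sketched in Python and in no way a Presto or JackBe schema, that captures the timeframe and measure-definition questions raised above.

# Hypothetical source metadata to help users judge whether two sources can
# sensibly be mashed up. Fields and values are illustrative only.

SOURCE_CATALOG = {
    "finance_margin_feed": {
        "owner": "Finance BI team",
        "refresh": "daily, 06:00 local",
        "timeframe": "prior calendar day",
        "measures": {"profit_margin": "gross margin after rebates, %"},
    },
    "sales_forecast_sheet": {
        "owner": "Regional sales manager",
        "refresh": "ad hoc",
        "timeframe": "current quarter, estimates",
        "measures": {"profit_margin": "planning margin before rebates, %"},
    },
}

def same_definition(source_a, source_b, measure):
    # Warn when the same measure name hides different business definitions.
    def_a = SOURCE_CATALOG[source_a]["measures"].get(measure)
    def_b = SOURCE_CATALOG[source_b]["measures"].get(measure)
    return def_a == def_b

print(same_definition("finance_margin_feed", "sales_forecast_sheet", "profit_margin"))

Even something this simple lets a help desk, or the mashup tool itself, flag that two sources both calling a measure "profit_margin" do not actually mean the same thing.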


Posted January 30, 2011 10:16 AM
Permalink | No Comments |
As an old proponent of the Enterprise Data Warehouse or EDW (well, let me stick my neck out and claim to be its first proponent, although I labeled it the BDW - Business Data Warehouse, or Barry Devlin's Warehouse!), I've had many debates over the years about the relative merits of consolidating and reconciling data in an EDW for subsequent querying vs. sending the query to a disparate set of data sources.  Unlike some traditionalists, I concluded as far back as 2002 that there existed good use cases for both approaches. I still stick with that belief.  So, the current excitement and name-space explosion about the topic leaves me a touch bemused.

But I found myself more confused than bemused when I read Stephen Swoyer's article Why Data Virtualization Trumps Data Federation Alone in the Dec. 1 TDWI "BI This Week" newsletter. Swoyer quotes Philip Russom, research manager with TDWI Research and author of a new TDWI Checklist Report, Data Integration for Real-Time Data Warehousing and Data Virtualization: "[D]ata virtualization must abstract the underlying complexity and provide a business-friendly view of trusted data on demand. To avoid confusion, it's best to think of data federation as a subset or component of data virtualization. In that context, you can see that a traditional approach to federation is somewhat basic or simple compared to the greater functionality of data virtualization".

OK, maybe I'm getting old, but that didn't help me a lot to understand why data virtualization trumps data federation alone.  So, I went to the Checklist Report, where I found a definition: "For the purposes of this Checklist Report, let's define data virtualization as the pooling of data integration resources", whereas traditional data federation "only federates data from many different data sources in real time", the latter from a table sourced by Informatica, the sponsor of the report.  When I read the rest of the table, it finally dawned on me that I was in marketing territory.  Try this for size: "[Data virtualization] proactively identifies and fixes data quality issues on the fly in the same tool"!  How would that work?

Let me try to clarify the conundrum of virtualization, federation, enterprise information integration and even mash-ups, at least from my (perhaps over-simplified) viewpoint. They're all roughly equivalent - there may be highly nuanced differences, but the nuances depend on which vendor you're talking to. They all provide a mechanism for decomposing a request for information into sub-requests that are sent, unbeknownst to the user, to disparate and distributed data sources, receiving the answers and combining them into a single response. In order to do that, they all have some amount of metadata that locates and describes the information sources, a set of adapters (often called by different names) that know how to talk with different data sources, and, for want of a better description, a layer that insulates the user from all of the complexity underneath.
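For illustration only, a stripped-down sketch of that pattern might look like the following; the catalog and adapters are hypothetical stand-ins rather than any vendor's product.

# Minimal sketch of the pattern described above: metadata locates the sources,
# per-source adapters speak each dialect, and a thin layer hides the plumbing.

from typing import Protocol

class SourceAdapter(Protocol):
    def fetch(self, request):
        ...

class WarehouseAdapter:
    def fetch(self, request):
        # Would issue SQL against the EDW; stubbed here.
        return [{"customer": "A123", "revenue": 1500}]

class WebServiceAdapter:
    def fetch(self, request):
        # Would call a web service for, say, live order status; stubbed here.
        return [{"customer": "A123", "open_orders": 2}]

# "Metadata": which sources hold which subjects.
CATALOG = {
    "revenue": WarehouseAdapter(),
    "open_orders": WebServiceAdapter(),
}

def virtual_query(request, subjects):
    # Decompose into sub-requests, gather the answers, merge into one response.
    result = {}
    for subject in subjects:
        for row in CATALOG[subject].fetch(request):
            result.update(row)
    return result

print(virtual_query({"customer": "A123"}, ["revenue", "open_orders"]))

The real products add query optimization, caching, security and far richer metadata, but the decompose-fetch-combine skeleton is common to all of them, whatever name the vendor prefers.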

But, whatever you call it (and let's call it data virtualization for now - the term with allegedly the greatest cachet), is it a good idea?  Should you do it?  I believe the answer today is a resounding yes - there is far too much information of too many varieties to ever succeed in getting it into a single EDW.  There is an ever growing business demand for access to near real-time information that ETL, however trickle-fed, struggles to satisfy.  And, yes, there are dangers and drawbacks to data virtualization, just as there are to ETL.  And the biggest drawback, despite Informatica's claim to the contrary, is that you have to be really, really careful about data quality.

By the way, I am open to being proven wrong on this last point; it's only by our mistakes that we learn!  Personally, I could use a tool that "proactively identifies and fixes data quality issues on the fly".

Posted December 2, 2010 7:52 AM
Permalink | No Comments |
Pervasive Software presented at the Boulder BI Brain Trust (BBBT) last Friday, August 13.  What caught my attention was their DataRush product and technology, and particularly the technological driver behind it.  For a brief overview of the other aspects of Pervasive covered, check out Richard Hackathorn's blog on the day.

But, back to DataRush. DataRush was originally conceived as a redesign of Pervasive's data integration tool, acquired from Data Junction in 2003. However, it was soon recognized that the underlying function could be applied to other data-intensive tasks such as analytics. Pervasive CTO Mike Hoskins described DataRush as a toolkit and engine that enables ordinary programmers to create parallel-processing applications simply and easily, using data flow techniques to design them without having to worry about the complexities of parallel-processing design, such as timing and synchronization between parallel tasks.

Now, of course, there's nothing new about parallel processing or the inherent difficulties it presents to programmers. It's been at the heart of large-scale data warehousing, particularly through the use of MPP (massively parallel processing) systems, for a number of years. Mike's point, however, was that parallel processing is about to go mainstream. The technology shift enabling that has been underway for a few years now--the growing availability of multi-core processors and servers since the mid-2000s. Four-core processors are already common on desktop machines, while processors with 32 cores and more are already available for servers. Multiply that by the number of sockets in a typical server, and you have massive parallelism in a single box--if you can use it. The problem is that with existing applications designed for serial processing, the only benefit to be gained from such multi-core servers at present is in supporting multiple concurrent users or tasks, or in what's known as "embarrassingly parallel" applications where there are no inter-task dependencies. DataRush's claim to fame is that it moves data-intensive parallel processing from high-end, expensive and complex MPP clusters and specialist programmers to commodity, inexpensive and simple SMP multi-core servers and ordinary developers.
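As a rough illustration of the "embarrassingly parallel" end of that spectrum, the sketch below uses Python's standard multiprocessing pool, not DataRush itself, to fan independent per-record work out across the cores of a single SMP box; the point of a data-flow toolkit is to give developers comparable simplicity even when the steps do have dependencies.

# Each record is independent, so the pool can spread the work across every
# available core without the programmer handling timing, locks or synchronization.

from multiprocessing import Pool
import os

def score_record(record):
    # Hypothetical per-record, data-intensive work (e.g., scoring or cleansing).
    return {"id": record["id"], "score": record["value"] * 0.9}

if __name__ == "__main__":
    records = [{"id": i, "value": float(i)} for i in range(100_000)]
    with Pool(processes=os.cpu_count()) as pool:
        scored = pool.map(score_record, records, chunksize=10_000)
    print(len(scored), "records scored on", os.cpu_count(), "cores")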

Of course, Pervasive is not alone in trying to tackle the issues involved in software development for parallel-processing environments.  But their approach, coming from the large-scale data integration environment, makes a lot of sense in BI.

However, to see the really significant implications, we need to see this development in the context of other technological advances.  There is the emergence of solid-state disks (SSDs) and the growing sizes and dropping costs of core memory that remove or reduce the traditional disk I/O bottleneck.  The decades-old supremacy of traditional relational databases is being challenged by a variety of different structures, some broadly relational and others distinctly not.  Add to this the explosive growth of data volumes, especially soft or "unstructured" information.  Pervasive, along with other small and medium-sized software vendors, is pushing information processing to an entirely new level.


Posted August 18, 2010 7:58 AM
Permalink | No Comments |

