Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation published by Addison-Wesley in 1997.

Over the past few years, Barry has extended his interest to cover the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.


Recently in Database Category

Speeding up database performance for analytic work has been all the rage recently.  Most of the new players in the field tout a combination of hardware and software advances to achieve improvements of 10 to 100 times or more in query speeds.  Netezza's approach has been more hardware-oriented than most--their major innovation being the FPGA (field-programmable gate array) that sits between the disk and processor in each Snippet Blade (basically, an MPP node).  The FPGA is coded with a number of Fast Engines, two of which, in particular, drive performance: the Compress engine, which compresses and decompresses data to and from the disk, and the Project and Restrict engine, which removes unneeded data from the stream coming off the disk.  Netezza say that data volumes through the rest of the system can be reduced by as much as 95% in this manner.
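To make the Project and Restrict idea concrete, here is a minimal software sketch in Python.  It is only an illustration of the principle of filtering the stream before the rest of the system sees it; it is not Netezza's FPGA logic, and the function name, the row layout and the sample data are all assumptions of mine.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical row shape: column name -> value.
Row = dict[str, object]


def project_and_restrict(
    rows: Iterable[Row],
    wanted_columns: list[str],
    predicate: Callable[[Row], bool],
) -> Iterator[Row]:
    """Yield only the rows that satisfy the predicate, trimmed to the wanted columns."""
    for row in rows:
        if predicate(row):  # restrict: drop rows the query does not need
            yield {name: row[name] for name in wanted_columns}  # project: drop unneeded columns


# Invented sample rows standing in for the stream coming off the disk.
disk_stream = [
    {"order_id": 1, "region": "EU", "amount": 120.0, "notes": "gift wrap"},
    {"order_id": 2, "region": "US", "amount": 75.5, "notes": ""},
    {"order_id": 3, "region": "EU", "amount": 40.0, "notes": "expedite"},
    {"order_id": 4, "region": "APAC", "amount": 310.0, "notes": ""},
]

# Equivalent in spirit to: SELECT order_id, amount FROM orders WHERE region = 'EU'
reduced = list(
    project_and_restrict(
        disk_stream,
        wanted_columns=["order_id", "amount"],
        predicate=lambda row: row["region"] == "EU",
    )
)
```

In this toy case the stream shrinks from sixteen values to four before anything downstream has to handle it, which is the effect the engine is after.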

So, the FPGA is the magic ingredient.  Combine that with a re-architecting of Netezza in TwinFin, released last August, that more effectively layered the disk access and moved to Intel-based CPUs on IBM BladeCenter technology, and you can see why Daniel Abadi came to the very prescient conclusion a month ago that IBM would be a likely suitor to acquire Netezza.

It seems likely that the short-term intent of the acquisition is to boost IBM's presence in the appliance market, competing especially with Oracle Exadata, not to mention EMC Greenplum and Teradata.  Of more interest are the medium- and longer-term directions for the combined product line and for data warehousing in general.  Curt Monash has already given his well-judged thoughts on the product implications, to which I'd like to add some.

My thoughts relate to the broad parallel between FPGA programming and microcode.  You could argue that Netezza's FPGA is basically a microcoded accelerator for analytic access to data on commodity hard drives.  IBM has long been a proponent of microcoded dedicated components and accelerators in its systems, dating all the way back to the System/360, so its way of thinking and Netezza's approach align nicely.  The question, of course, is how transparently it could be done underneath DB2, and further, how willing DB2 for Linux, UNIX and Windows is to embrace the use of accelerators as DB2 for z/OS has.  The possible application of this approach under the Informix database shouldn't be forgotten either.

The interesting thing here is that the Netezza Fast Engine approach is inherently extensible.  The MPP node passes information to the FPGA as to the characteristics of the query, allowing it to perform appropriate preprocessing on the data streaming to or from the disk.  In theory, at least, there is no reason why such preprocessing couldn't be applied in situations other than analytic.  Using contextual metadata to qualify OLTP records?  Preprocessing content to mine implicit metadata?  Encryption / decryption?  It all lines up well with my contention that we are seeing the beginning of a convergence between the currently separate worlds of operational, informational and collaborative information.

But, what does this acquisition suggest for data warehousing in general?  Well, despite my long history with and high regard for IBM, I do fear that this acquisition is part of a trend that is reducing innovation in the industry.  The explosion of start-ups in BI over the past few years has resulted in a wonderful blossoming of new ideas, in stark contrast to the previous ten years, when traditional relational databases were the answer; now, what's the question?  Big companies find it very difficult to nurture innovation and their acquisitions often end up killing the spark that made the start-up worth acquiring in the first place.  IBM is by no means the worst in this regard, but I do hope that the inventions and innovations that characterized Netezza continue to live and thrive in Big Blue...  for the good of the data warehousing industry.


Posted September 22, 2010 3:02 PM
I was speaking to Susan Davis and Bob Zurek of Infobright the other day, and one statement that caught my attention was that they try to go to the actual data as little as possible.  An interesting objective for a product that's positioned as a "high performance database for analytic applications and data marts", don't you think?

It sounds somewhat counter-intuitive until you realize that in a world of exploding data volumes that need to be analyzed, you have only two choices if you want to maintain a reasonable response time for users: (1) throw lots of hardware at the problem--parallel processing, faster storage, and more--or (2) be a lot cleverer in what you access and when.  The first approach is pretty common and, based on recent developments, quite successful.  And as we move into solid-state disks (SSD) and in-memory databases, we'll see even more gains.  But, let's play with the second option a bit.

How can we minimize access (disk I/O) to the actual data?  We can say immediately that the minimum number of times we have to touch the actual data is once!  In the case of a data warehouse or mart, that is when we load it.  In a traditional row-based RDBMS, that's when we build any indexes we need to speed access for particular queries or further processes.  With column-based databases, we often hear that indexes are no longer needed or much reduced--reducing database size, load time and ongoing maintenance costs.  And it's certainly true that columnar databases improve query response time.  And yet, we might ask (and it applies in the case of row-based databases as well): is there anything else we could do on that single and mandatory access to all the data that could help reduce later data access during analysis?

Infobright's solution is the Knowledge Grid, a set of metadata based on Rough Set theory generated at load-time and used to limit the range of actual data a query has to retrieve in order to figure out which values match the query conditions.  Each Data Pack (a block of 64K items on disk) has a set of metadata, calculated at load-time, such as maximum and minimum values, sum and count for numerical items.  At query run-time, these statistics inform the database engine that some data packs are irrelevant because no item meets the query conditions.  Other data packs contain only data that meets the query conditions, and if the statistics contain the result needed by the query, the data here need not be accessed either.  The remainder of the data packs contain some data that matches the query and will have to be accessed.  Given the right statistics, the amount of disk I/O can be significantly reduced.  Infobright also create metadata for character items at load-time and for joins at query-time.
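The three cases described above (irrelevant packs, packs answered from statistics alone, and packs that must be opened) can be sketched in a few lines of Python.  This is a toy illustration under my own assumptions, not Infobright's implementation: the class and function names are hypothetical, the pack size is tiny rather than 64K, and only a single query shape (the sum of all values above a threshold) is handled.

```python
from dataclasses import dataclass


@dataclass
class DataPack:
    values: list[int]  # stands in for the actual data on disk
    minimum: int       # statistics computed once, at load time
    maximum: int
    total: int
    count: int


def build_packs(column: list[int], pack_size: int) -> list[DataPack]:
    """Split a column into packs and compute each pack's statistics at load time."""
    packs = []
    for start in range(0, len(column), pack_size):
        chunk = column[start:start + pack_size]
        packs.append(DataPack(chunk, min(chunk), max(chunk), sum(chunk), len(chunk)))
    return packs


def sum_where_greater(packs: list[DataPack], threshold: int) -> tuple[int, int]:
    """Return (sum of all values > threshold, number of packs whose data was read)."""
    result = 0
    packs_read = 0
    for pack in packs:
        if pack.maximum <= threshold:
            continue  # irrelevant pack: no item can meet the condition
        if pack.minimum > threshold:
            result += pack.total  # every item matches: the stored sum is the answer
            continue
        packs_read += 1  # mixed pack: the actual data has to be accessed
        result += sum(value for value in pack.values if value > threshold)
    return result, packs_read


# Invented sample column, loaded in packs of four items.
column = [1, 2, 3, 4, 50, 60, 70, 80, 5, 90, 6, 95]
packs = build_packs(column, pack_size=4)
total, packs_read = sum_where_greater(packs, threshold=10)
```

With this sample data, the first pack is skipped outright, the second is answered from its stored sum, and only the third is actually read, so one pack out of three incurs the "disk I/O".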

Generalizing from the above, we can begin to imagine other possibilities.  What if you didn't load the actual data into the database, but just left it where it was and crawled through it to create metadata of a similar nature to allow irrelevant data for a particular query to be eliminated en masse?  Of course, that sounds a bit like the indexing approach used by search engines and extended by Attivio and others to cover relational databases as well.  The problem with indexes and similar metadata is that they tend to grow in volume also, until they reach a significant percentage of the actual data size; then we're back to square one.

My mathematical skills are far too rusty (if they were ever bright and shiny enough in the first place) to know if Rough Set theory has anything to say about that issue or how it could be applied beyond the way that Infobright have implemented it, but it does seem like an interesting area for exploration as data volumes continue to explode.  Any bright PhDs out there like to give it a try?


Posted July 29, 2010 2:03 PM
Any acquisition in the database market, in this case, the July 6 announcement of EMC's plan to acquire Greenplum, generates a flurry of analyst activity speculating about the financial or technical rationale for the acquisition, winners and losers among other database vendors and the effect of the move on customers' buying patterns.  Personally, I find these opinions very interesting and highly informative.  And I invite you to check out, for example, Curt Monash or Merv Adrian to explore these aspects of the acquisition.

However, I'd like to take the opportunity to focus our minds once again on a more fundamental question: how is IT going to manage data quality and reliability in a rapidly expanding data environment, both in terms of data volumes and places to store the data?  I'm currently describing a logical enterprise architecture, Business Integrated Insight (BI2), that focuses on this.

So, for me, what the acquisition emphasizes, like that of Sybase by SAP, is that specialized databases, with their sophisticated features and functions, are rapidly entering the mainstream of database usage.  Their ability to handle large data volumes with vast improvements in query performance has become increasingly valuable in a wide range of industries that want to analyze enormous quantities of very detailed data at relatively low cost.  How to do this?  Vendors of these systems typically have a simple answer: copy all the required data into our machine and away you go!

My concern is that IT ends up with yet another copy of the corporate data, and a very large copy at that, that must be kept current in meaning, structure and content on an ongoing basis.  Any slippage in maintaining one or more of these characteristics leads inevitably to data quality problems and eventually to erroneous decisions.  Such issues typically emerge unexpectedly, in time-constrained or high-risk situations and lead to expensive and highly visible firefighting actions by IT.  Unfortunately, such occurrences are common in BI environments, but typically relate to unmanaged spreadsheets or relatively small data marts.  We have just jumped the problem size up by a couple of orders of magnitude.

So, am I suggesting that you shouldn't be using these specialized databases?  Would I recommend that you stand in front of a speeding freight train?  Clearly not!

There are two ways that these problems will be addressed.  One falls upon customer IT departments, while the other comes back to the database industry and the vendors, whether acquiring or acquired.  These paths will need to be followed in parallel.

IT departments need to define and adopt stringent "data copy minimization" policies.  The purist in me would like to say "elimination" rather than "minimization".  However, that's clearly impossible.  Minimization of data copies, in the real world, requires IT to evaluate the risks of yet another copy of data, the possibility of using an existing set of data for the new requirement and, if a new copy of the data is absolutely needed, whether existing analytic solutions could be migrated to this new copy of data and the existing data copies eliminated.

Meanwhile, it is incumbent upon the database industry to take a step back and look at the broader picture of data management needs in the context of emerging technologies and the explosive growth in data volumes.  The basic question that needs to be asked is: how can the enormous power and speed of these emerging technologies be crafted into solutions that equally support divergent data use cases on a single copy of data?  And, if not on a single copy, how can multiple copies of data be managed to complete consistency invisibly within the database technology?

Tough questions, perhaps, but ones that the acquirers in this industry, with their deep pockets, need to invest in.  As the database market re-converges, the vendors that solve this architectural conundrum will become the market leaders in highly consistent, pervasive and minimally duplicated data that enables IT to focus on solving real business needs rather than managing data quality.  Wouldn't that be wonderful?

Posted July 7, 2010 1:18 PM
Preparing materials for a seminar really forces you to think!  I just finished the slides for my two-day class in Rome next week, and after I got over my need for a strong drink (a celebration, of course), I got to reflect on some of what I had discovered.

Perhaps the most interesting were the amazing changes in the database area that have been happening over the past couple of years.  A combination of hardware advances and software innovations has come together with a recognition that data is no longer what it once was to pose some fundamental questions about how databases should be constructed.

Let's start on the business side - always a good place to start.  Users now think that their internal IT systems should behave like a combination of Google, Facebook and Twitter.  Want an answer to the CEO's question on plummeting sales?  Just do a "search", maybe "call a friend", join it all together and voila!  We have the answer. 

From an information viewpoint, this brings up some very challenging questions about the intersection of soft (aka unstructured) information and hard (structured) data and how one ensures consistency and quality in that set.  IT's problem is no longer just combining hard data from different sources; it's about parsing and qualifying soft information as well.  This is not a truly new problem.  Data modelers have struggled with it for years.  It's the speed with which it needs to be done that causes the problem.

So, what has this got to do with new software and hardware for databases?  Well, the key point is that database thinking has suddenly moved on from strict adherence to the relational paradigm.  The relational model is an extraordinarily structured view of data.  Relational algebra is a very precise tool for querying data.  You need to have a strong understanding of both to make valid queries, but do you really want your users to think that way?  Should you necessarily store the information physically in that model?  When you free yourself of these assumptions, you can begin to think in new ways.  Store the data in columns instead of rows?  Perfect!  A mix of row- and column-oriented data, and maybe some in memory only?  Yes, can do!  And then there's mixing searching (a soft information concept) with querying (a hard data thought) to create a hybrid result.  That's easy too!
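For readers who have not met the row versus column distinction before, a small Python sketch may help.  It is purely illustrative and assumes invented sample data; real databases lay this out in disk pages and memory structures, not Python lists.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"customer": "A", "region": "EU", "sales": 100},
    {"customer": "B", "region": "US", "sales": 250},
    {"customer": "C", "region": "EU", "sales": 175},
]

# Column-oriented layout: the same data, with one list per attribute.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytic query touches only the one column it needs.
total_sales = sum(columns["sales"])

# A transactional lookup is more natural against the row layout,
# because the whole record is already in one place.
customer_b = next(row for row in rows if row["customer"] == "B")
```

A hybrid design, in this simplified picture, is one that keeps both shapes available and picks whichever suits the workload.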

And on the edges of the field, there are even more fundamental questions being asked.  Do we always need consistency in our databases?  Can we do databases without going to disk for the data?  Could we do away with physically modeling the data and just let the computer look after it?  The answers to these questions and more like them are not what you might expect if you've been around the database world for 20 years.  And with those different answers, the overall architecture of your IT systems is suddenly open to dramatic change.

Believe me, the first businesses to adopt some of these approaches are going to gain some extraordinary competitive advantages.  Watch this space!

Posted April 8, 2010 9:58 AM
I'm presenting a two-day seminar for Technology Transfer in Rome in mid-April, entitled "BI2--From Business Intelligence to Enterprise IT Integration" and am currently researching and preparing the material.  And the more I research, the more excited I get about the prospects for the next wave of development from BI to... what?  Well, that's the real question for me!

It's my belief, and I've been writing and speaking about this for quite a while now, that the way we do BI today has reached its limits.  Business today demands ever closer to real-time information that must be consistent and meaningfully integrated across ever wider scopes.  These demands simply cannot be satisfied by our current concept of a layered, triplicated (and more) data warehouse of hard information--largely numerical data arranged in neat tables--along with some soft information thrown in as an afterthought.  The only way forward that I can see is to begin to treat all business information as a conceptually single, integrated, modelled resource with minimal duplication of data.  I've described this business information resource (BIR) to a first approximation elsewhere and my seminar will, among other things, dig deeper into the structure of the BIR and the technology needed to create and maintain it.

My current excitement stems from the growing reality of "hybrid" databases--combining the features and strengths of row-oriented and columnar relational databases.  Now, I know that academia proposed approaches to this as long as 8 years ago, but it's only in the last year that commercial databases have begun introducing it.  I wrote about Vertica's FlexStore feature, introduced in 2009, in my last post.  The latest announcement I found is of a technology preview program for Ingres VectorWise, the newest entrant in the hybrid database arena.  Add Oracle's Exadata V2, announced last year with typical modesty by Larry Ellison as the "fastest machine in the world for data warehousing, but now by far the fastest machine in the world for online transaction processing", and we can see that the approach is finally gaining market traction.

Why is this important?  Well, despite the hype, Larry hit the nail on the head.  If we finally have databases that can handle both operational and informational workloads equally well, we can begin to define an architecture that doesn't insist on copying vast quantities of data from one database to another.  That doesn't mean the death of the data warehouse any time soon, but it does mean that a much more integrated IT environment is coming your way.

Posted March 18, 2010 10:12 AM