Blog: Barry Devlin


As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today covers the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with an holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog.

Recently in Database Category

It seems to me that much of the drive behind NoSQL (whether No SQL or Not Only SQL) arose from a rather narrow view of the relational model and technology among web-oriented developers whose experience was constrained by the strengths and limitations of MySQL. Many of their criticisms of relational databases had actually been overcome, to varying extents and under certain circumstances, by commercial products like DB2, Oracle and Teradata. Of course, open source and commodity hardware pricing also continue to drive uptake.

A similar pattern can be seen with NewSQL in its original definition by Matt Aslett of the 451 Group, back in April 2011. So, when it comes to products clamoring for inclusion in either category, I tend to be somewhat jaundiced. A class defined by what it is not (NoSQL) presents some logical difficulties. And a class labeled "new", when today's new is tomorrow's obsolete, is not much better. I prefer to look at products in a more holistic sense. With that in mind, let's get to NuoDB, which announced version 2 in mid-October. With my travel schedule I didn't find time to blog then, but now that I'm back on terra firma in Cape Town, the time has come!

Back in October 2012, I encountered NuoDB prior to their initial launch, when they were positioned as part of the NewSQL wave. I also had a bit of a rant then about the NoSQL/NewSQL nomenclature (although no one listened then either), and commented on the technical innovation in the product, which quite impressed me, saying "NuoDB takes a highly innovative, object-oriented, transaction/messaging-system approach to the underlying database processing, eliminating the concept of a single control process responsible for all aspects of database integrity and organization. [T]he approach is described as elastically scalable - cashing in on the cloud and big data.  It also touts emergent behavior, a concept central to the theory of complex systems. Together with an in-memory model for data storage, NuoDB appears very well positioned to take advantage of the two key technological advances of recent years... extensive memory and multi-core processors."

The concept of emergent behavior (the idea that the database could be anything anybody wanted it to be, with SQL simply as the first model) was interesting technically but challenging in positioning the product. Version 2 is more focused, with a tagline of distributed database and an emphasis on scale-out and geo-distribution within the relational paradigm. This makes more sense in marketing terms, and the use case in a global VoIP support environment shows how the product can be used to reduce latency and improve data consistency. No need to harp on about "NewSQL" then...

Sales aside, the underlying novel technical architecture continues to interest me. A reading of the NuoDB Technical Whitepaper (registration required) revealed some additional gems. One, in particular, resonates with my thinking on the ongoing breakdown of one of the longest-standing postulates of decision support: the belief that operational and informational processes demand separate databases to support them, as discussed in Chapter 5 of my book. While there continue to be valid business reasons to build and maintain a separate store of core, historical information, real-time decision needs also demand the ability to support both operational and informational workloads on the primary data store. NuoDB's Transaction Engine architecture and use of Multi-Version Concurrency Control together enable good performance for both read/write transactions and the longer-running read-only operations seen in operational BI applications.
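To make the general idea concrete, here's a minimal Python sketch of how multi-version concurrency control lets a long-running read-only query see a stable snapshot while writers keep updating the same data. It illustrates the MVCC concept only; the class and method names are my own invention and say nothing about how NuoDB's Transaction Engines actually implement it (commit handling, in particular, is elided).

```python
# Minimal MVCC sketch: writers append new versions rather than overwriting
# in place, so a long-running read-only query sees a consistent snapshot
# without blocking concurrent updates. Illustrative only.

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (writer_txn_id, value), oldest first
        self.next_txn = 1

    def begin(self):
        """Start a transaction; it sees only versions written by earlier transactions."""
        txn_id = self.next_txn
        self.next_txn += 1
        return txn_id

    def write(self, txn_id, key, value):
        """Append a new version stamped with the writer's transaction id (no overwrite)."""
        self.versions.setdefault(key, []).append((txn_id, value))

    def read(self, txn_id, key):
        """Return the newest version visible to this transaction's snapshot."""
        visible = [v for (t, v) in self.versions.get(key, []) if t <= txn_id]
        return visible[-1] if visible else None


store = MVCCStore()
t1 = store.begin()
store.write(t1, "balance", 100)

reporting_txn = store.begin()            # long-running read-only query starts here
t2 = store.begin()
store.write(t2, "balance", 250)          # concurrent operational update

print(store.read(reporting_txn, "balance"))  # 100 - snapshot unaffected by the update
print(store.read(t2, "balance"))             # 250 - the writer sees its own change
```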

I will return to exploring the themes and messages of "Business unIntelligence: Insight and Innovation Beyond Analytics and Big Data" over the coming weeks.  You can order the book at the link above; it is now available. Also, check out my presentations at the BrightTALK Business Intelligence and Big Data Analytics Summit, recorded Sept. 11, and Beyond BI is... Business unIntelligence, recorded Sept. 26. Read my interview with Lindy Ryan in Rediscovering BI.

Susurration image: http://bigjoebuck.blogspot.com/2010_12_27_archive.html

Posted November 20, 2013 1:55 AM
Speeding up database performance for analytic work has been all the rage recently.  Most of the new players in the field tout a combination of hardware and software advances to achieve improvements of 10-100 times or more in query speed.  Netezza's approach has been more hardware-oriented than most--their major innovation being the FPGA (field-programmable gate array) that sits between the disk and processor in each Snippet Blade (basically, an MPP node).  The FPGA is coded with a number of Fast Engines, two of which, in particular, drive performance: the Compress engine, which compresses and decompresses data to and from the disk, and the Project and Restrict engine, responsible for removing unneeded data from the stream coming off the disk.  Netezza say that data volumes through the rest of the system can be reduced by as much as 95% in this manner.
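To illustrate why projecting and restricting so close to the storage layer pays off, here is a toy Python sketch of the concept: rows and columns that the query cannot use are dropped before the data ever reaches the rest of the system. The function and column names are invented for illustration; Netezza does this in FPGA hardware on the data stream, not in software like this.

```python
# Toy "project and restrict" pushdown: unneeded rows and columns are
# discarded at scan time, so everything downstream handles far less data.

def scan_with_pushdown(rows, needed_columns, predicate):
    """Yield only the needed columns of only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield {col: row[col] for col in needed_columns}


raw_rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0, "notes": "gift wrap"},
    {"order_id": 2, "region": "APAC", "amount": 75.0,  "notes": "expedite"},
    {"order_id": 3, "region": "EMEA", "amount": 310.0, "notes": "standard"},
]

# Query: total order amount for EMEA. Only two columns and two rows survive the scan.
filtered = scan_with_pushdown(raw_rows, ["order_id", "amount"],
                              lambda r: r["region"] == "EMEA")
print(sum(r["amount"] for r in filtered))   # 430.0
```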

So, the FPGA is the magic ingredient.  Combine that with a re-architecting of Netezza in TwinFin, released last August, that more effectively layered the disk access and moved to Intel-based CPUs on IBM BladeCenter technology, and you can see why Daniel Abadi came to the very prescient conclusion a month ago that IBM would be a likely suitor to acquire Netezza.

It seems likely that the short-term intent of the acquisition is to boost IBM's presence in the appliance market, competing especially with Oracle Exadata, not to mention EMC Greenplum and Teradata.  Of more interest are the medium- and longer-term directions for the combined product line and for data warehousing in general.  Curt Monash has already given his well-judged thoughts on the product implications, to which I'd like to add a few of my own.

My thoughts relate to the broad parallel between FPGA programming and microcode.  You could argue that Netezza's FPGA is basically a microcoded accelerator for analytic access to data on commodity hard drives.  IBM has long been a proponent of microcoded dedicated components and accelerators in its systems, dating all the way back to the System/360, so its way of thinking and Netezza's approach align nicely.  The question, of course, is how transparently this could be done underneath DB2, and, further, whether DB2 for Linux, UNIX and Windows is willing to embrace the use of accelerators as DB2 for z/OS has.  The possible application of this approach under the Informix database shouldn't be forgotten either.

The interesting thing here is that the Netezza Fast Engine approach is inherently extensible.  The MPP node passes information to the FPGA about the characteristics of the query, allowing it to perform appropriate preprocessing on the data streaming to or from the disk.  In theory, at least, there is no reason why such preprocessing couldn't be applied in situations other than analytic queries.  Using contextual metadata to qualify OLTP records?  Preprocessing content to mine implicit metadata?  Encryption / decryption?  It all lines up well with my contention that we are seeing the beginning of a convergence between the currently separate worlds of operational, informational and collaborative information.
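Purely as a thought experiment on that extensibility, here is a hypothetical Python sketch of a query-aware preprocessing pipeline: the request describes what it needs, and a matching chain of stream stages is assembled. The stage names are illustrative stand-ins, not anything Netezza actually exposes.

```python
# Hypothetical query-aware preprocessing pipeline: stages are chosen from the
# declared characteristics of the request before data flows downstream.
import zlib

def decompress_stage(blocks):
    # undo block-level compression as data streams off the disk
    return (zlib.decompress(b) for b in blocks)

def restrict_stage(blocks, keyword):
    # drop blocks that cannot contribute to the result of this request
    return (b for b in blocks if keyword in b)

def build_pipeline(blocks, request):
    """Assemble preprocessing stages from the declared characteristics of the request."""
    if request.get("compressed"):
        blocks = decompress_stage(blocks)
    if "keyword" in request:
        blocks = restrict_stage(blocks, request["keyword"])
    return blocks


raw = [zlib.compress(b"EMEA,120"), zlib.compress(b"APAC,75")]
out = list(build_pipeline(raw, {"compressed": True, "keyword": b"EMEA"}))
print(out)   # [b'EMEA,120'] - only relevant blocks flow downstream
```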

But, what does this acquisition suggest for data warehousing in general?  Well, despite my long history with and high regard for IBM, I do fear that this acquisition is part of a trend that is reducing innovation in the industry.  The explosion of start-ups in BI over the past few years has resulted in a wonderful blossoming of new ideas, in stark contrast to the previous ten years, when traditional relational databases were the answer; now, what was the question?  Big companies find it very difficult to nurture innovation, and their acquisitions often end up killing the spark that made the start-up worth acquiring in the first place.  IBM is by no means the worst in this regard, but I do hope that the inventions and innovations that characterized Netezza continue to live and thrive in Big Blue...  for the good of the data warehousing industry.


Posted September 22, 2010 3:02 PM
I was speaking to Susan Davis and Bob Zurek of Infobright the other day, and one statement that caught my attention was that they try to go to the actual data as little as possible.  An interesting objective for a product that's positioned as a "high performance database for analytic applications and data marts", don't you think?

It sounds somewhat counter-intuitive until you realize that in a world of exploding data volumes that need to be analyzed, you have only two choices if you want to maintain a reasonable response time for users: (1) throw lots of hardware at the problem--parallel processing, faster storage, and more--or (2) be a lot cleverer in what you access and when.  The first approach is pretty common and, based on recent developments, quite successful.  And as we move to solid-state disks (SSD) and in-memory databases, we'll see even more gains.  But, let's play with the second option a bit.

How can we minimize access (disk I/O) to the actual data?  We can say immediately that the minimum number of times we have to touch the actual data is once!  In the case of a data warehouse or mart, that is when we load it.  In a traditional row-based RDBMS, that's also when we build any indexes we need to speed access for particular queries or further processes.  With column-based databases, we often hear that indexes are no longer needed, or are much reduced--cutting database size, load time and ongoing maintenance costs.  And it's certainly true that columnar databases improve query response time.  And yet we might ask (and this applies to row-based databases as well): is there anything else we could do on that single, mandatory pass over all the data that could help reduce later data access during analysis?

Infobright's solution is the Knowledge Grid, a set of metadata based on Rough Set theory, generated at load-time and used to limit the range of actual data a query has to retrieve in order to figure out which values match the query conditions.  Each 64K-item block of data (Data Pack) on disk has a set of metadata calculated for it at load-time: maximum and minimum values, sum, count and so on for numerical items.  At query run-time, these statistics inform the database engine that some data packs are irrelevant because no item meets the query conditions.  Other data packs contain only data that meets the query conditions, and if the statistics contain the result needed by the query, the data here need not be accessed either.  The remainder of the data packs contain some data that matches the query and will have to be accessed.  Given the right statistics, the amount of disk I/O can be significantly reduced.  Infobright also create metadata for character items at load-time and for joins at query-time.
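A simplified sketch may help show how such load-time statistics pay off at query time. The Python below classifies each pack as irrelevant, wholly relevant (answerable from the statistics alone) or partially relevant (must be read); it illustrates the pruning concept only and is not Infobright's actual Knowledge Grid structure.

```python
# Data-pack pruning sketch: per-pack min/max/sum/count statistics decide
# which packs a query must actually read. Illustrative only.

PACK_SIZE = 4   # real Data Packs hold 64K items; tiny packs keep the example readable

def build_pack_stats(values, pack_size=PACK_SIZE):
    """Split values into packs and record per-pack statistics at load time."""
    packs = [values[i:i + pack_size] for i in range(0, len(values), pack_size)]
    return [{"min": min(p), "max": max(p), "sum": sum(p), "count": len(p), "data": p}
            for p in packs]

def count_greater_than(packs, threshold):
    """COUNT(*) WHERE value > threshold, touching as few packs as possible."""
    total, packs_read = 0, 0
    for pack in packs:
        if pack["max"] <= threshold:
            continue                       # irrelevant: no item can qualify
        elif pack["min"] > threshold:
            total += pack["count"]         # wholly relevant: statistics suffice
        else:
            packs_read += 1                # partially relevant: must read the data
            total += sum(1 for v in pack["data"] if v > threshold)
    return total, packs_read


packs = build_pack_stats([1, 2, 3, 4, 10, 11, 12, 13, 5, 9, 2, 14])
print(count_greater_than(packs, 8))   # (6, 1) - only one pack actually read
```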

Generalizing from the above, we can begin to imagine other possibilities.  What if you didn't load the actual data into the database at all, but just left it where it was and crawled through it to create metadata of a similar nature, allowing irrelevant data for a particular query to be eliminated en masse?  That sounds a bit like the indexing approach used by search engines and extended by Attivio and others to cover relational databases as well.  The problem with indexes and similar metadata, of course, is that they tend to grow in volume too, until they reach a significant percentage of the actual data size; then we're back to square one.

My mathematical skills are far too rusty (if they were ever bright and shiny enough in the first place) to know whether Rough Set theory has anything to say about that issue, or how it could be applied beyond the way that Infobright have implemented it, but it does seem like an interesting area for exploration as data volumes continue to explode.  Any bright PhDs out there like to give it a try?


Posted July 29, 2010 2:03 PM
Any acquisition in the database market, in this case, the July 6 announcement of EMC's plan to acquire Greenplum, generates a flurry of analyst activity speculating about the financial or technical rationale for the acquisition, winners and losers among other database vendors and the effect of the move on customers' buying patterns.  Personally, I find these opinions very interesting and highly informative.  And I invite you to check out, for example, Curt Monash or Merv Adrian to explore these aspects of the acquisition.

However, I'd like to take the opportunity to focus our minds once again on a more fundamental question: how is IT going to manage data quality and reliability in a rapidly expanding data environment, both in terms of data volumes and places to store the data?  I'm currently describing a logical enterprise architecture, Business Integrated Insight (BI2), that focuses on this.

So, for me, what the acquisition emphasizes, like that of Sybase by SAP, is that specialized databases, with their sophisticated features and functions, are rapidly entering the mainstream of database usage.  Their ability to handle large data volumes with vast improvements in query performance has become increasingly valuable in a wide range of industries that want to analyze enormous quantities of very detailed data at relatively low cost.  How to do this?  Vendors of these systems typically have a simple answer: copy all the required data into our machine and away you go!

My concern is that IT ends up with yet another copy of the corporate data, and a very large copy at that, one that must be kept current in meaning, structure and content on an ongoing basis.  Any slippage in maintaining one or more of these characteristics leads inevitably to data quality problems and eventually to erroneous decisions.  Such issues typically emerge unexpectedly, in time-constrained or high-risk situations, and lead to expensive and highly visible firefighting actions by IT.  Unfortunately, such occurrences are common in BI environments, but they typically relate to unmanaged spreadsheets or relatively small data marts.  We have just jumped the problem size up by a couple of orders of magnitude.

So, am I suggesting that you shouldn't be using these specialized databases?  Would I recommend that you stand in front of a speeding freight train?  Clearly not!

There are two ways that these problems will be addressed.  One falls upon customer IT departments, while the other comes back to the database industry and the vendors, whether acquiring or acquired.  These paths will need to be followed in parallel.

IT departments need to define and adopt stringent "data copy minimization" policies.  The purist in me would like to say "elimination" rather than "minimization".  However, that's clearly impossible.  Minimization of data copies, in the real world, requires IT to evaluate the risks of yet another copy of data, the possibility of using an existing set of data for the new requirement and, if a new copy of the data is absolutely needed, whether existing analytic solutions could be migrated to this new copy of data and the existing data copies eliminated.

Meanwhile, it is incumbent upon the database industry to take a step back and look at the broader picture of data management needs in the context of emerging technologies and the explosive growth in data volumes.  The basic question that needs to be asked is: how can the enormous power and speed of these emerging technologies be crafted into solutions that equally support divergent data use cases on a single copy of data?  And, if a single copy is not possible, how can multiple copies of data be kept completely consistent, invisibly, within the database technology?

Tough questions, perhaps, but ones that the acquirers in this industry, with their deep pockets, need to invest in.  As the database market re-converges, the vendors that solve this architectural conundrum will become the market leaders in highly consistent, pervasive and minimally duplicated data that enables IT to focus on solving real business needs rather than managing data quality.  Wouldn't that be wonderful?

Posted July 7, 2010 1:18 PM
Preparing materials for a seminar really forces you to think!  I just finished the slides for my two-day class in Rome next week, and after I got over my need for a strong drink (a celebration, of course), I got to reflect on some of what I had discovered.

Perhaps the most interesting was the amazing changes in the database area that have been happening over the past couple of years.  A combination of hardware advances and software innovations has come together with a recognition that data is no longer what it once was, posing some fundamental questions about how databases should be constructed.

Let's start on the business side - always a good place to start.  Users now think that their internal IT systems should behave like a combination of Google, Facebook and Twitter.  Want an answer to the CEO's question on plummeting sales?  Just do a "search", maybe "call a friend", join it all together and voila!  We have the answer. 

From an information viewpoint, this brings up some very challenging questions about the intersection of soft (aka unstructured) information and hard (structured) data and how one ensures consistency and quality in that set.  IT's problem is no longer just combining hard data from different sources; it's about parsing and qualifying soft information as well.  This is not a truly new problem.  Data modelers have struggled with it for years.  It's the speed with which it needs to be done that causes the problem.

So, what has this got to do with new software and hardware for databases?  Well, the key point is that database thinking has suddenly moved on from strict adherence to the relational paradigm.  The relational model is an extraordinarily structured view of data.  Relational algebra is a very precise tool for querying data.  You need to have a strong understanding of both to make valid queries, but do you really want your users to think that way?  Should you necessarily store the information physically in that model?  When you free yourself of these assumptions, you can begin to think in new ways.  Store the data in columns instead of rows?  Perfect!  A mix of row- and column-oriented data, and maybe some in memory only?  Yes, can do!  And then there's mixing searching (a soft information concept) with querying (a hard data thought) to create a hybrid result.  That's easy too!
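As a toy contrast of the row versus column point, the sketch below lays out the same data both ways: an analytic query that aggregates a single attribute only has to touch that one column when the data is stored by column. It's a conceptual illustration, not how any particular product lays out its pages.

```python
# Row-oriented vs column-oriented layouts for the same data, to show why
# column storage suits analytic aggregates. Conceptual sketch only.

rows = [
    {"customer": "A", "region": "EMEA", "revenue": 120.0},
    {"customer": "B", "region": "APAC", "revenue": 75.0},
    {"customer": "C", "region": "EMEA", "revenue": 310.0},
]

# Row store: the aggregate walks whole records, touching every attribute.
total_row_store = sum(r["revenue"] for r in rows)

# Column store: each attribute lives in its own array, so the aggregate
# reads the revenue column and nothing else.
columns = {key: [r[key] for r in rows] for key in rows[0]}
total_column_store = sum(columns["revenue"])

assert total_row_store == total_column_store == 505.0
print(total_column_store)
```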

And on the edges of the field, there are even more fundamental questions being asked.  Do we always need consistency in our databases?  Can we do databases without going to disk for the data?  Could we do away with physically modeling the data and just let the computer look after it?  The answers to these questions, and more like them, are not what you might expect if you've been around the database world for 20 years.  And with those different answers, the overall architecture of your IT systems is suddenly open to dramatic change.

Believe me, the first businesses to adopt some of these approaches are going to gain some extraordinary competitive advantages.  Watch this space!

Posted April 8, 2010 9:58 AM