Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation published by Addison-Wesley in 1997.

Over the past few years, Barry has extended his interest to cover the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with an holistic experience of the business through IT.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

Operational analytics is making headlines in 2013. But why is it important? And why is it more likely to succeed now than in the mid-2000s, when it was called operational BI, or the mid-1990s, when it surfaced as the operational data store (ODS)?
 
First, let's define the term. My definition, from two recent white papers (April 2012 and May 2013), is: "Operational analytics is the process of developing optimal or realistic recommendations for real-time, operational decisions based on insights derived through the application of statistical models and analysis against existing and/or simulated future data, and applying these recommendations in real-time interactions." While the language is clearly analytical in tone, the bottom line of the desired business impact is much the same as definitions we've seen in the past for the ODS and operational BI: real-time or near real-time decisions embedded into the operational processes of the business.

Anybody who heard me speak in the 1990s or early 2000s will know that I was not a big fan of the ODS. So, what has changed? In short, two things: (1) businesses are more advanced in their BI programs and (2) technology has advanced to the stage where it can support the need for real-time operational-informational integration.

[Figure: BI Evolution.jpg]
The evolution of BI can be traced on two fronts shown in the accompanying figure: the behaviors driving business users and the responses required of IT providers. As this evolution proceeds apace, business demands increasing flexibility in what can be done with the data and increasing timeliness in its provision. In Phase I, largely fixed reports are generated perhaps on a weekly schedule from data that IT deem appropriate and furnish in advance. Such reporting is entirely backward looking, describing selected aspects of business performance. Today, few businesses remain in this phase because of its now limited return on investment; most have already moved to Phase II. 

This second phase is characterized by an increasing awareness of the breadth of information available collectively across the wider business and an emerging ability to use information to predict future outcomes. In this phase, IT is highly focused on integrating data from the multiple sources of operational data throughout the company. This is the traditional BI environment, supported by a data warehouse infrastructure. The majority of businesses today are at Phase II in their journey and leaders are beginning to make the transition to Phase III. 

Phase III marks a major step change in decision making support for most organizations. On the business side, the need moves from largely ad hoc, reactive and management driven to a process view, allowing the outcome of predictive analysis to be applied directly, and often in real time, to the business operations. This is the essence of the behavior called operational analytics. In this stage, IT must become highly adaptive in order to anticipate emerging business needs for information. Such a change requires a shift in thinking from separate operational and informational systems to a combined operational-informational environment. This is where the action is today. This is where return on investment for leading businesses is now to be found. And, simply put, this is why operational analytics is making headlines today--many businesses are ready for it; the leaders are already implementing it. 

This leads us to the second contention: that technology has advanced sufficiently to support the need. There are many ways that recent advances in technology can be combined to do this. In the white papers referenced above, one shows how two complementary technologies, IBM DB2 for z/OS and Netezza, can be integrated to meet the requirements. The other shows how the introduction of columnar technology and other performance improvements in DB2 Advanced Enterprise Edition can meet these same needs. Other vendors are improving their offerings in similar directions. 

So, to paraphrase the "Six Million Dollar Man": we have the business waiting. We have the technology. We have the capability to build this... But, wait. There is one more hurdle. Most existing IT architectures strictly separate operational and informational systems based on a data warehouse approach dating back to the mid-1980s. This split is a serious impediment to building this new environment that demands a tight feedback loop between the two environments. Analyses in the informational environment must be transferred instantly into the operational environment to take immediate effect. Outcomes of actions in the operational systems must be copied directly to the informational systems to tune the models there. These requirements are difficult to satisfy in the current architecture; they demand a new approach. This is beginning to emerge, but is by no means widespread yet. I'll be discussing this topic further over the coming weeks.
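
The feedback loop described above can be made concrete with a small Python sketch. It is purely illustrative and not taken from any of the products mentioned: the class names `InformationalModel` and `OperationalSystem`, the 0.5 starting estimate and the 0.3 threshold are all assumptions chosen for the example. The point is only the shape of the loop, where a model's current estimate drives a decision inside the interaction and the outcome of that interaction immediately tunes the model.

```python
from dataclasses import dataclass, field


@dataclass
class InformationalModel:
    """Hypothetical stand-in for a model kept in the informational environment."""
    offered: int = 0
    accepted: int = 0

    def acceptance_rate(self) -> float:
        # Assumed starting estimate of 0.5 until real outcomes arrive.
        return self.accepted / self.offered if self.offered else 0.5

    def record_outcome(self, accepted: bool) -> None:
        # Outcome copied back from the operational side to tune the model.
        self.offered += 1
        self.accepted += int(accepted)


@dataclass
class OperationalSystem:
    """Hypothetical operational process that applies the model in real time."""
    model: InformationalModel
    threshold: float = 0.3
    decisions: list = field(default_factory=list)

    def handle_interaction(self, customer_accepts: bool) -> bool:
        # The recommendation is applied during the interaction itself.
        make_offer = self.model.acceptance_rate() >= self.threshold
        if make_offer:
            self.model.record_outcome(customer_accepts)
        self.decisions.append(make_offer)
        return make_offer


def run_interactions(customer_responses, threshold=0.3):
    """Feed a sequence of customer responses through the loop."""
    system = OperationalSystem(InformationalModel(), threshold)
    for response in customer_responses:
        system.handle_interaction(response)
    return system.decisions
```

In this toy version the two "environments" live in one process, which is exactly what a strictly separated architecture makes hard: with a nightly batch copy in between, the model would keep making offers long after the outcomes had turned against it.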

Posted May 13, 2013 8:41 AM
For some time now, when it comes to big data, my mantra has been "big data is simply all data".  IBM's April 3 announcement served admirably to reinforce that point of view. Was it a big data announcement, a DB2 announcement, or a hardware announcement?  The short answer is "yes", to all the above and more.

Weaving together a number of threads, Big Blue created a credible storyline that can be summarized in three key thoughts: larger, faster and simpler.  As many of you may know, I worked for IBM until early 2008, so my views on this announcement are informed by my knowledge of how the company works or, perhaps, used to work.  Last Wednesday, I came away impressed.  Here were a number of diverse, individual product developments that conform to a single theme across different lines and businesses.

Take BLU acceleration as a case in point. The headline, of course, is that DB2 LUW (on Linux, Unix and Windows) 10.5 introduces a hybrid architecture. Data can be stored in columnar tables with extensive compression, making use of in-memory storage and taking further advantage of parallel and vector processing techniques available on modern processors. The result is an up to 25% improvement in analytic and reporting performance (and considerably more in specific queries) and up to 90% data compression. In addition, the elimination of indexes and aggregates considerably reduces the need for manual tuning and maintenance of the database. This is a direction that has long been shown by smaller, newer vendors such as ParAccel and Vertica (now part of HP), so it is hardly a surprise. IBM can claim a technically superior implementation, but more impressive is the successful retrofitting into the existing product base, together with the re-use of the technology in the separate Informix TimeSeries code base to enhance analytics and reporting there too, and the promise that it will be extended to other data workloads in the future. It seems the product development organization is really pulling together across different product lines. That's no mean feat within IBM.
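
For readers less familiar with why columnar storage compresses so well, here is a minimal Python sketch. It says nothing about how BLU is actually implemented; run-length encoding is simply one well-known technique picked for illustration, and the sample rows and function names are invented for the example.

```python
from itertools import groupby

# Invented sample data: (sale_date, country, amount)
rows = [
    ("2013-04-01", "IE", 120),
    ("2013-04-01", "IE", 80),
    ("2013-04-01", "UK", 200),
    ("2013-04-02", "UK", 50),
]


def to_columns(rows):
    """Pivot row-oriented tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


dates, countries, amounts = to_columns(rows)
encoded_dates = rle_encode(dates)          # repeated values collapse into runs
encoded_countries = rle_encode(countries)
total = sum(amounts)                       # an aggregate reads only one column
```

Values within a single column tend to repeat, so they collapse into short runs, and an aggregate such as the total only has to touch the one column it needs rather than every row.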

Another hint at the strength of the development team was the quiet announcement of a technology preview of JSON support in DB2 at the same time as the availability of 10.5.  JSON is one of the darlings of the NoSQL movement that provides significant agility to support unpredictable and changing data needs.  See my May 2012 white paper "Business Intelligence--NoSQL... No Problem" for more details.  As in its support for other NoSQL technologies, such as XML and RDF graph databases, IBM has chosen to incorporate support for JSON into DB2.  There are pros and cons to this approach.  Performance and scalability may not match a pure JSON database, but the ability to take advantage of the ACID and RAS characteristics of an existing, full-feature database like DB2 makes it a good choice where business continuity is a strong requirement.  IBM clearly recognizes that the world of data is no longer all SQL, but that for certain types of non-relational data, the difference is sufficiently small that they can be handled as an adjunct to the relational model through a "subservient" engine, allowing easier joining of NoSQL and SQL data types.  This is a vital consideration for machine-generated data, one of three information domains I've defined in a recent white paper, "The Big Data Zoo--Taming the Beasts".
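
To give a feel for what joining NoSQL and SQL data inside one database looks like, here is a small Python sketch. It does not use DB2 or its JSON technology preview: SQLite (via Python's standard `sqlite3` module) stands in as the relational engine, the JSON documents are stored as plain text and parsed in Python, and the table names, columns and sample data are all invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE device_event (customer_id INTEGER, doc TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Machine-generated documents with differing fields; no schema change needed.
events = [
    (1, {"sensor": "temp", "reading": 21.5}),
    (1, {"sensor": "temp", "reading": 23.0, "unit": "C"}),
    (2, {"sensor": "door", "open": True}),
]
conn.executemany(
    "INSERT INTO device_event VALUES (?, ?)",
    [(customer_id, json.dumps(doc)) for customer_id, doc in events],
)


def temperature_readings(conn):
    """Join relational customer rows to JSON documents and extract readings."""
    cursor = conn.execute(
        "SELECT c.name, e.doc FROM customer c "
        "JOIN device_event e ON e.customer_id = c.id ORDER BY e.rowid"
    )
    readings = []
    for name, doc_text in cursor:
        doc = json.loads(doc_text)
        if doc.get("sensor") == "temp":
            readings.append((name, doc["reading"]))
    return readings
```

The documents and the customer table share one transaction scope and one backup regime, which is the trade-off discussed above: less raw flexibility and scale than a dedicated document store, in exchange for the guarantees of the host database.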

The announcement didn't ignore the little yellow elephant, either. The PureData System family has been expanded with the PureData System for Hadoop, which has built-in analytics acceleration and archiving and provides significantly simpler and faster deployment of projects requiring the MapReduce environment. And InfoSphere BigInsights 2.1 offers high availability, the Big SQL interface to Hadoop, and an alternative file system, GPFS-FPO, with enhanced security and no single point of failure.

While the announcement clearly targeted Big Data--at the Speed of Business, the underlying message, as seen above, is much broader.  This view is of an emerging information ecosystem that must be considered from a fully holistic viewpoint.  A key role, and perhaps even the primary role, for BigInsights / Hadoop is in exploratory analytics, where innovative, what-if thinking is given free rein.  But the useful insights gained here must eventually be transferred to production (and back) in a reliable, secure, managed environment--typically a relational database.  This environment must also operate at speed, with large data volumes and with ease of management and use.  These are characteristics that are clearly emphasized in this announcement.  They are also key components of the integrated information platform I described in the Data Zoo white paper already mentioned.  Missing still are some of the integration-oriented aspects such as the comprehensive, cross-platform metadata management, data integration and virtualization required to tie it all together.  IBM has more to do to deliver on the full breadth of this vision, but this announcement is a big step in the right direction.


Posted April 8, 2013 9:14 AM
[Image: stake-in-the-heart.jpg]

Wikibon's lovingly detailed Big Data Vendor Revenue and Market Forecast 2012-2017 provides an excellent list and positioning of players in the "Big Data" market.  Readers may be surprised to see that IBM tops the list as the biggest vendor in the market in 2012 with nearly 12% market share ($1,352 million), more than twice that of the second-placed HP.  Indeed, the names of the top ten in the list--IBM, HP, Teradata, Dell, Oracle, SAP, EMC, Cisco, Microsoft and Accenture--may also raise an eyebrow, given that all of them come from the "old school" of computer companies.  The top contender among the "new school Big Data" vendors is Splunk with revenue of $186 million.

Wikibon openly describes their methodology for calculating these figures, and one could describe it as more art than science, given the reluctance of vendors to share such data.  Furthermore, the authors have also revised their original 2011 market size estimate up from $5.1 to $7.2 billion.  So, one might dispute the figures and placements at length, but it's probably fair to say that this report is among the more useful publicly available data on this market.

Of more concern to me is the big, hairy, ugly question that has bothered me since "Big Data" attained celebrity status: what on earth is it? Furthermore, how can one evaluate the overall figures with Wikibon's two-part definition: (1) "those data sets whose size, type and speed of creation make them impractical to process and analyze with traditional database technologies and related tools in a cost- or time-effective way" and (2) "requires practitioners to embrace an exploratory and experimental mindset regarding data and analytics... Projects whose processes are informed by this mindset meet Wikibon's definition of Big Data even in cases where some of the tools and technology involved may not"? Part 1 is the fairly widespread definition of "Big Data", and one that is, in my view, so vague as to be meaningless. Part 2 is certainly creative but poses some interesting questions about how one might reliably access practitioners' mindsets and assess them as exploratory and experimental! The bottom line of this definition is that if somebody says a dataset or project is "Big Data" then it is so. I long ago came to the conclusion that, unless somebody can come up with a watertight definition, we should stop talking about "Big Data" and fooling ourselves that we can measure it. I've said this before, but the term won't go away. Hence, the reference to killing vampires in the title...

As an alternative, I'd like to point again to a white paper I wrote last year, The Big Data Zoo - Taming the Beasts, where I categorized information/data into three domains: (1) process-mediated data, (2) human-sourced information and (3) machine-generated data, as shown in the accompanying figure.  I suggest that this is a much more clearly defined way of breaking down the universe of information/data and of differentiating between data uses and projects that are part of what you might call classical data processing and those that have emerged or are emerging in the fields that first sprouted the term "Big Data".  These information domains are largely self-describing, relatively well-bounded and group together data that has similar characteristics in terms of structure and volatility.  Size actually has very little to do with it.

[Figure: Three information domains.jpg]

Returning to Wikibon's results and their companion piece, Big Data Database Revenue and Market Forecast 2012-2017, in database software, IBM again tops the list with $215 million in SQL-based revenue and is followed by 5 other SQL-based database vendors (SAP, HP, Teradata, EMC and Microsoft) until we reach MarkLogic as the top NoSQL (XML, in fact, so hardly part of the post-2008 NoSQL wave except by self-declaration) vendor with revenue of $43 million in 2012.  Wikibon's "bottom line: the top five vendors have about 2/3rds of the database revenue, all from SQL-only product lines. Wikibon believes that NoSQL vendors will challenge these vendors hard of the next five years. However SQL will continue to retain over half of revenues for the foreseeable future."  I personally don't know on what Wikibon based the growth projections, so I cannot comment, but I do have questions about the 2012 figures themselves, both including and beyond the definition of "Big Data".  Hadoop is not mentioned, and although I agree with its exclusion as a database, many vendors are incorporating it into their database environments by a variety of means.  Is this included or excluded and why?  HP and EMC grab third and fifth positions, based on their Vertica and Greenplum acquisitions respectively.  Judging by the fact they overshadow significant database players like Microsoft and Oracle, it would seem that most or all of their database revenue is classified as "big data".  Is this reasonable?  How did the survey apportion IBM's, Teradata's and Microsoft's database revenue between "big data" and the rest?  Is all of SAP HANA revenue called "big data" simply because it's an in-memory appliance... or how was it split?  And the list goes on...

I'm sure IBM is very happy to be placed top in both listings; I assume the Netezza figures loom large in the database placement. SAP will be pleased to take second place, based largely on HANA. HP Vertica can claim top pure play "big data" database. And Teradata can take pride in its placement, earned, I expect, in large part through its Aster acquisition. And so on... But the more interesting point is that these are all SQL databases. The highest-placed NoSQL (in the post-2008 wave sense) is 10gen, with attributed revenue of less than 10% of that attributed to IBM in the "big data" category. All this will drive marketing machines, but with more heat than light. Given the underlying dysfunction in the definitions, how will it help businesses who are trying to figure out what to do about the "big data" truck allegedly bearing down on them?

My suggestions are straightforward.  In terms of the three data domains outlined above, be aware that process-mediated data - the well-defined, -structured and -managed data residing in current operational and informational systems - is growing fast and can drive significant new value through operational analytic approaches.  Human-sourced information - currently mostly about social media - and machine-generated data are emerging and rapidly growing sources of knowledge about people's behaviors and intentions.  They enable new, extensive predictive analytics (the successor to data mining) that initially demands flexibility in exploration, such as that offered by Hadoop.  However, they will demand proper integration in the formal data management environment in the medium to long term.  This requires a well-defined and thoroughly thought-out infrastructure and platform strategy that embraces all types of data and processes.  Of all the vendors mentioned above, only IBM and Teradata are attempting to take such a holistic view, in my opinion.  

As for NoSQL databases (however you define them - aren't IMS and IDMS also NoSQL by definition?), I believe the post-2008 NoSQL databases have important roles in the emerging environment. They certainly drive substantial and long-absent innovation in the relational database market. In particular, they offer a level of flexibility in database design that is key in emerging markets and applications. And they have technical characteristics that are very useful in a variety of niches in the market, solving problems with which relational databases struggle.
Interesting data times.  But, let's just quietly drop the "big"...

Posted March 4, 2013 8:21 AM
[Image: Baby elephant.JPG]

The past year has been dominated by Big Data.  What it might mean and the way you might look at it.  The stories have often revolved around Hadoop and his herd of oddly-named chums.  Vendors and analysts alike have run away and joined this ever-growing and rapidly moving circus.  And yet, as we saw in our own EMA and 9sight Big Data Survey, businesses are on a somewhat different tour.  Of course, they are walking with the elephants, but many so-called Big Data projects have more to do with more traditional data types, i.e. relationally structured, but bigger or requiring faster access.  And in these instances, the need is for Big Analytics, rather than Big Data.  The value comes from what you do with it, not how big it happens to be.

Which brings us to Big Blue.  I've been reading IBM's PureSystems announcement today.  The press release headline trumpets Big Data (as well as Cloud), but the focus from a data aspect is on the deep analysis of highly structured, relational information with a substantial upgrade of the PureData for Analytics System, based on Netezza technology, first announced less than four months ago.  The emphasis on analytics, relational data and the evolving technology is worth exploring.

Back in September 2010, when IBM announced the acquisition of Netezza, there was much speculation about how the Netezza products would be positioned within IBM's data management and data warehousing portfolios that included DB2 (in a number of varieties), TM1 and Informix. Would the Netezza technology be merged into DB2? Would it continue as an independent product? Would it, perhaps, die? I opined that Netezza, with its hardware-based acceleration, was a good match for IBM, which understood the benefits of microcode and dedicated hardware components for specific tasks, such as the field programmable gate array (FPGA), used to minimize the bottleneck between disk and memory. It seems I was right in that; not only has Netezza survived as an independent platform, as the basis for the PureData System for Analytics, but it is also being integrated behind DB2 for z/OS in the IBM DB2 Analytics Accelerator.

Today's announcement of the PureData System for Analytics N2001 is, at heart, a performance and efficiency upgrade to the original N1001 product, offering a 3x performance improvement and 50% greater capacity for the same power consumption.  The improvements come from a move to smaller, higher capacity and faster disk drives and faster FPGAs.  With a fully loaded system capable of handling a petabyte or more of user data (depending on compression ratio achieved), we are clearly talking big data.  The technology is purely relational.  And a customer example from the State University of New York, Buffalo quotes a reduction in run time for complex analytics on medical records from 27 hours to 12 minutes (the prior platform is not named).  So, this system, like competing Analytic Appliances from other vendors, is fast.  Perhaps we should be using images of cheetahs?

[The photo is from my visit to Addo Game Reserve in South Africa last week.  For concerned animal lovers, she did eventually manage to climb out...]

Posted February 5, 2013 10:30 AM
[Image: MagnifyingGlassDataSpider.jpg]

As we begin a new year, we are promised a move from a focus on the meaning and technology of big data to the useful and worthwhile business applications it may offer.  A timely move indeed.  Hopefully, we'll begin to hear less about analyzing Twitter streams to optimize advertising spend and more about applications with the potential to improve people's lives or the environment.  And even more hopefully, people may begin to consider the risks they run when revealing or gathering personal data on our deeply interconnected Web.

With all of the synchronicity that is the Internet, I came across two articles from the New York Times published last week. The first, by Peter Jaret on January 14, describes how patient records, transcribed and digitized from scrawled (why do they write so poorly?) doctors' notes, anonymized and stored on the Web, can be statistically mined to discover previously unknown side-effects of and interactions between prescribed drugs. Clearly useful and valuable work. The second article, three days later by Gina Kolata, revealed how easily a genetics researcher was able to identify five individuals and their extended families by combining publicly-available information from the anonymized 1000 Genomes Project database, a commercial genealogy Web site and Google. Kolata quotes Amy L. McGuire, a lawyer and ethicist at Baylor College of Medicine in Houston: "To have the illusion you can fully protect privacy or make data anonymous is no longer a sustainable position". The underlying genetic data is used in medical research to good effect, of course, but what are the possible consequences for those individuals thus identified as insurance companies, governments or other interested parties make potentially negative assessments based on their once private genomes?

Such occurrences--and there are many of them--should be deeply disturbing to those of us involved in the business of big data and analytics.  Here are doctors, scientists and lawyers--with training in logic, ethics and law--who see the power of analytics to improve the human condition, but who seem to gloss over the wider privacy and security implications of making personal information widely available on the Web.  After all, the limits of data anonymization on the web were being discussed openly as long ago as May 2011 by Pete Warden on the O'Reilly Radar blog.  And as far back as 1997, Prof. Latanya Sweeney, now Director of the Data Privacy Lab at Harvard, could show that the combination of gender, ZIP code and birthdate was unique for 87% of the U.S. population.  
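
The arithmetic behind this kind of finding is easy to reproduce on any dataset. The Python sketch below is a minimal illustration, not Sweeney's method or data: the five records are made up, and the function name `unique_fraction` is invented for the example. It simply measures what share of records becomes unique once a few seemingly harmless attributes are combined.

```python
from collections import Counter

# Made-up "anonymized" records: no names, only quasi-identifiers.
records = [
    {"gender": "F", "zip": "02138", "birthdate": "1961-07-04"},
    {"gender": "F", "zip": "02138", "birthdate": "1961-07-04"},
    {"gender": "M", "zip": "02138", "birthdate": "1975-01-15"},
    {"gender": "F", "zip": "02139", "birthdate": "1982-11-30"},
    {"gender": "M", "zip": "02138", "birthdate": "1990-03-22"},
]


def unique_fraction(records, quasi_identifiers):
    """Share of records whose combination of the given fields occurs only once."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


by_gender = unique_fraction(records, ["gender"])
by_gender_zip = unique_fraction(records, ["gender", "zip"])
by_all_three = unique_fraction(records, ["gender", "zip", "birthdate"])
```

On this toy data nobody is unique by gender alone, one record in five is unique by gender and ZIP code, and three in five are unique once birthdate is added. Any record that is unique on such a combination can be linked to a named record in another dataset that carries the same three fields.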

Eben Moglen, professor of law and legal history at Columbia University and Chairman of the Software Freedom Law Center, warned at re:publica Berlin in May 2012 that "media that spies on and data-mines the public is destroying freedom of thought and only this generation, the last to grow up remembering the 'old way', is positioned to save this, humanity's most precious freedom". With media and medicine, government and retail, telecommunications and finance all gathering hoards of information about us, each for their own allegedly good purpose, the reality is now that the abuse of big data (as opposed to its use) is not only possible, but proceeding apace, even in largely democratic, Western states.

So, given that big data anonymity is "no longer a sustainable position", it should be clear that the analytics possible on today's high-powered computers is a double-edged sword; it serves us poorly to focus only on one, single, razor-sharp edge. As we evaluate and build useful and worthwhile business analytics applications this coming year, let us step back, even occasionally, to contemplate whether the profits to be earned or the discoveries to be made are worth the price of human freedom.

Posted January 23, 2013 9:06 AM