Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today extends to the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry's latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

Recently in Big data Category

Thoughts on the societal impact of the Internet of Things, inspired by a unique dashboard product.

[Image: VisualCue tile]
A newcomer to the BBBT, Kerry Gilger, founder of VisualCue, took the members by storm on 2nd May with an elegant, visually intuitive and, to me at least, novel approach to delivering dashboards. VisualCue is based on the concept of a tile that represents a set of metrics as icons colored according to their state relative to defined threshold values. The main icon in the tile shown here represents the overall performance of a call center agent, with the secondary icons showing other KPIs, such as total calls answered, average handling time, sales per hour worked, customer satisfaction, etc. Tiles are assembled into mosaics, which function rather like visual bar charts that can be sorted according to the different metrics, drilled down to related items and displayed in other formats, including tabular numbers.
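
To make the tile concept concrete, here is a minimal sketch of the kind of logic involved, with invented KPI names and threshold values; it is not VisualCue's implementation, just an illustration of coloring each metric by comparing it against defined thresholds.

```python
# Hypothetical thresholds per KPI: (good, warning) boundaries; anything worse is critical.
# "higher_is_better" indicates the direction in which a metric improves.
THRESHOLDS = {
    "calls_answered":        {"good": 90,  "warning": 70,  "higher_is_better": True},
    "avg_handle_time_sec":   {"good": 240, "warning": 360, "higher_is_better": False},
    "sales_per_hour":        {"good": 3.0, "warning": 1.5, "higher_is_better": True},
    "customer_satisfaction": {"good": 4.5, "warning": 3.5, "higher_is_better": True},
}

def color_for(metric, value):
    """Return the icon color for one KPI, relative to its thresholds."""
    t = THRESHOLDS[metric]
    better = (lambda v, bound: v >= bound) if t["higher_is_better"] else (lambda v, bound: v <= bound)
    if better(value, t["good"]):
        return "green"
    if better(value, t["warning"]):
        return "amber"
    return "red"

def tile(agent_metrics):
    """One agent's tile: a color per KPI, ready to be rendered as icons in a mosaic."""
    return {metric: color_for(metric, value) for metric, value in agent_metrics.items()}

print(tile({"calls_answered": 85, "avg_handle_time_sec": 400,
            "sales_per_hour": 3.2, "customer_satisfaction": 4.1}))
# -> calls_answered: amber, avg_handle_time_sec: red, sales_per_hour: green, customer_satisfaction: amber
```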

The product seems particularly useful in operational BI applications, with Kerry showing examples from call centers, logistics and educational settings. The response of the BBBT members was overwhelmingly positive. @rick_vanderlans described it as "revolutionary technology", while @gildardorojas asked "why we didn't have before something as neat and logical?" @marcusborba opined "@VisualCue's capability is amazing, and the data visualization is gorgeous!"

So, am I being a Luddite, or even a curmudgeon, to have made the only negative comments of the call? My concern was not about the product at all, but rather around the power it unleashes simply by being so good at what it does. Combine this level of ease-of-use in analytics with big data and, especially, data from the Internet of Things, and we take a quantum leap from measurement to invasiveness, from management to Big-Brother-like control.

Each of the three example use cases described by Gilger provided wonderful examples of real and significant business benefit; but, taken together, they also opened up appalling possibilities of abuse of privacy, misappropriation of personal information and disempowerment of the people involved. I'll briefly explore the three examples, realizing that in the absence of the full story, I'm undoubtedly imagining some aspects. Nor is this about VisualCue (who tweeted that "Privacy is certainly a critical issue! We focus on presenting data that an organization already has--maybe we make it obvious") or the companies using it; it's meant to be a warning that we who know some of the possibilities--positive and negative--offered by big data analytics must consider in advance the unintended consequences.

Detailed monitoring of call center agents' performance is nothing new. Indeed, it is widely seen as best practice and key to improving both individual and overall call center results. VisualCue, according to Gilger, has provided outstanding performance gains, including one center where agents in competition with peers have personally sought out training to improve their own metrics, something that is apparently unheard of in the industry. Based on past best practices and detailed knowledge of where the agent is weak, VisualCue can provide individually customized advice. In a sense, this example illustrates the pinnacle of such use of monitoring data and analytics to drive personnel performance. But within it lie the seeds of its own destruction. As the agent's job is broken down more and more into repeatable tasks, each measurable by a different metric, human innovation and empathy are removed and the job is prepared for automation. In fact, a 2013 study puts at 99% the probability that certain call center jobs, particularly telemarketing, will soon be eliminated by technology.

The old adage "what you can't measure, you can't manage" is at the heart of traditional BI. In an era when data was scarce and often incoherent, this focus made sense. However, applying it to all aspects of life today is, to me, ethically problematic. The example of monitoring the entire scope of an educational institution in a single dashboard--from financials through administration to student performance--is a case where our ability to analyze so many data points leads to the illusion that we can manage the entire process mechanically. The Latin root of "educate" means "to draw forth" from the student, the success of which simply cannot be gauged through basic numerical measures, and is certainly not correlated with the business measures of the institution.

The final example of tracking the operational performance of a waste management company's routes, trucks and drivers emphasizes our growing ability to measure and monitor the details of real life minute by minute. By continuously tracking the location and engine management signals from its trucks, the dashboard created by this company enabled it to make significant financial savings and improvements to its operational performance. However, it also enables supervisors to drill into the ongoing behavior of the company's drivers: deviations from planned routes, long stops with the engine running, extreme braking, exceeding the speed limit, etc. While presumably covered by their employment contract, such micromanagement of employees is at best disempowering and at worst open to abuse by increasingly all-seeing supervisors. Of much greater concern is the fact that these sensors are increasingly embedded in private automobiles and that such tracking capability is already being applied without owners' consent to smartphones. As far back as a year ago, Euclid Analytics had already tracked about 50 million devices in 4,000 locations according to a New York Times blog.
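
To give a sense of how straightforward such monitoring is to build once the telemetry exists, here is a hypothetical sketch of flagging the kinds of events mentioned above; the field names and thresholds are invented, and this is not the company's actual system.

```python
# Illustrative only: flag driver events from a stream of telemetry samples.
SPEED_LIMIT_KPH = 90
IDLE_LIMIT_SEC = 600           # long stop with the engine running
HARSH_BRAKE_KPH_PER_SEC = 15   # speed drop per second treated as extreme braking

def flag_events(samples):
    """samples: list of dicts with keys time_sec, speed_kph, engine_on, moving."""
    events = []
    idle_since = None
    for prev, cur in zip(samples, samples[1:]):
        dt = cur["time_sec"] - prev["time_sec"]
        if cur["speed_kph"] > SPEED_LIMIT_KPH:
            events.append((cur["time_sec"], "speeding"))
        if dt > 0 and (prev["speed_kph"] - cur["speed_kph"]) / dt > HARSH_BRAKE_KPH_PER_SEC:
            events.append((cur["time_sec"], "harsh_braking"))
        if cur["engine_on"] and not cur["moving"]:
            if idle_since is None:
                idle_since = cur["time_sec"]
            elif cur["time_sec"] - idle_since > IDLE_LIMIT_SEC:
                events.append((cur["time_sec"], "long_idle"))
                idle_since = cur["time_sec"]   # reset so the event is not re-reported every sample
        else:
            idle_since = None
    return events

# Usage: events = flag_events(samples_for_one_truck)
```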

I'm grateful to Kerry Gilger for sharing the use cases that inspired my speculations above. Of course, my point is beyond the individual companies involved and products used. At issue is the range of social and ethical dilemmas raised by rapid advances in sensor technology, the volumes of data gathered and the power of analytic software. Our every action online is already monitored by the likes of Google and Facebook for profit and by organizations like the NSA allegedly for security and crime prevention. The level of monitoring of our physical lives is now rapidly increasing. Anonymity is rapidly disappearing, if not already extinct. Our personal privacy rights are being usurped by the data gathering and analysis programs of these commercial and governmental organizations, as eloquently described by Shoshana Zuboff of Harvard Business and Law schools in a recent article in Frankfurter Allgemeine Zeitung.

It is imperative that those of us who have grown up with and nurtured business intelligence over the past three decades--from hardware and software vendors, to consultants and analysts, to BI managers and implementers in businesses everywhere--begin to consider deeply the ethical, legal and societal issues now being raised. We must take action to guide the industry and society appropriately, through the development of new codes of ethical behavior and use of information, and through input to national and international legislation.


Posted May 4, 2014 6:26 AM
I look up, and suddenly it's August.  I've been heads-down for the past three months, finishing my new book, which will be available in October.  The title is designed to be thought-provoking: "Business unIntelligence - Insight and Innovation beyond Analytics and Big Data".  More on that next week, starting with why I want to provoke you into thinking, and over the coming weeks, too... promise!

For now, let's talk Big Data again.  It's a topic that remains one part frustrating and one part energizing.  Let's start with the frustration.  Despite the best efforts of a number of thought leaders over the past months, the reality is stubbornly hard to pin down.  The technologists continue to push the boundaries, but in often perverse ways.  Take the announcement by Hortonworks at the recent Hadoop Summit, for example.  As reported by Stephen Swoyer, Apache YARN (Yet Another Resource Negotiator) will make it easier to parallelize non-MapReduce jobs: "It's the difference, argues Hortonworks founder Arun Murthy, between running applications 'on' and running them in Hadoop".  Sounds interesting, I thought, so I headed over to the relevant page and found this gem: "When all of the data in the enterprise is already available in HDFS, it is important to have multiple ways to process that data".  Really?  How many of you believe that there is the slightest possibility that all the data in the enterprise will ever be available in HDFS?  Of course, Hadoop does need a new and improved resource management approach (I suspect that studying IBM System z resource management would help).  But, let's not pretend that even a copy of all enterprise data will ever be in one place.  Wasn't that the original data warehouse thinking?  We are in a fully distributed and diversified IT world now.  And wasn't big data a major driver in that realization?

Now to the energizing part.  When EMA and I ran our first big data survey last summer, we found that big data projects in the real world exhibit a wide range of starting points.  Even when the projects are based on Hadoop (and many are not), the idea that all enterprise data should be in HDFS is simply not on the radar.  With this year's survey just recently opened up for input, you do have the opportunity to prove me wrong!  As in last year's work, our focus is on how businesses are translating the hype and opportunities of big data and the emerging technologies into actual projects.  It spans both business and technology drivers, because these two aspects are now intimately related, a concept I call the biz-tech ecosystem.  That is a foundation of Business unIntelligence and the topic of my next blog.

Until then, I encourage you to take the big data survey soon - it will close next week - especially  those of you based beyond North America.  We are very interested to see the global picture.

Picture: Sad Little Elephant by Katherine Devlin

Posted July 31, 2013 1:37 AM
For some time now, when it comes to big data, my mantra has been "big data is simply all data".  IBM's April 3 announcement served admirably to reinforce that point of view. Was it a big data announcement, a DB2 announcement, or a hardware announcement?  The short answer is "yes", to all the above and more.

Weaving together a number of threads, Big Blue created a credible storyline that can be summarized in three key thoughts: larger, faster and simpler.  As many of you may know, I worked for IBM until early 2008, so my views on this announcement are informed by my knowledge of how the company works or, perhaps, used to work.  Last Wednesday, I came away impressed.  Here were a number of diverse, individual product developments that conform to a single theme across different lines and businesses.

Take BLU acceleration as a case in point.  The headline, of course, is that DB2 LUW (on Linux, Unix and Windows) 10.5 introduces a hybrid architecture.  Data can be stored in columnar tables with extensive compression, making use of in-memory storage and taking further advantage of the parallel and vector processing techniques available on modern processors.  The result is an up to 25% improvement in analytic and reporting performance (and considerably more in specific queries) and up to 90% data compression.  In addition, the elimination of indexes and aggregates considerably reduces the need for manual tuning and maintenance of the database.  This is a direction that has long been shown by smaller, newer vendors such as ParAccel and Vertica (now part of HP), so it is hardly a surprise.  IBM can claim a technically superior implementation, but more impressive is the successful retrofitting into the existing product base.  Also notable is the re-use of the technology in the separate Informix TimeSeries code base to enhance analytics and reporting there too, as well as the promise that it will be extended to other data workloads in the future.  It seems the product development organization is really pulling together across different product lines.  That's no mean feat within IBM.
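
As a sketch of what this looks like from the developer's side, the snippet below uses the ibm_db Python driver against a hypothetical DB2 10.5 database with BLU acceleration enabled; the connection string, table and query are illustrative only. The point is simply that the table is declared column-organized, with no indexes or aggregates to maintain.

```python
import ibm_db

# Illustrative connection string -- substitute your own host, database and credentials.
conn = ibm_db.connect(
    "DATABASE=SALESDB;HOSTNAME=db2host;PORT=50000;UID=dbuser;PWD=secret;", "", "")

# With BLU acceleration, analytic tables are declared column-organized; typical
# reporting queries then need no secondary indexes or aggregate tables.
ibm_db.exec_immediate(conn, """
    CREATE TABLE sales_fact (
        sale_date   DATE,
        store_id    INTEGER,
        product_id  INTEGER,
        amount      DECIMAL(12,2)
    ) ORGANIZE BY COLUMN
""")

# A reporting query scans only the columns it references, which is where the
# columnar layout and compression pay off.
stmt = ibm_db.exec_immediate(conn,
    "SELECT store_id, SUM(amount) AS total FROM sales_fact GROUP BY store_id")
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)
```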

Another hint at the strength of the development team was the quiet announcement of a technology preview of JSON support in DB2 at the same time as the availability of 10.5.  JSON is one of the darlings of the NoSQL movement that provides significant agility to support unpredictable and changing data needs.  See my May 2012 white paper "Business Intelligence--NoSQL... No Problem" for more details.  As in its support for other NoSQL technologies, such as XML and RDF graph databases, IBM has chosen to incorporate support for JSON into DB2.  There are pros and cons to this approach.  Performance and scalability may not match a pure JSON database, but the ability to take advantage of the ACID and RAS characteristics of an existing, full-feature database like DB2 makes it a good choice where business continuity is a strong requirement.  IBM clearly recognizes that the world of data is no longer all SQL, but that for certain types of non-relational data, the difference is sufficiently small that they can be handled as an adjunct to the relational model through a "subservient" engine, allowing easier joining of NoSQL and SQL data types.  This is a vital consideration for machine-generated data, one of three information domains I've defined in a recent white paper, "The Big Data Zoo--Taming the Beasts".
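
As a generic illustration of that agility (and emphatically not DB2's JSON interface, which was only a technology preview at the time), schema-free documents let new attributes appear without any table changes; the records below are invented.

```python
import json

# Hypothetical customer-interaction records: later documents carry fields
# that earlier ones lack, and nothing needs to be migrated.
records = [
    {"customer": "C001", "channel": "phone", "duration_sec": 340},
    {"customer": "C002", "channel": "chat", "duration_sec": 125,
     "sentiment": "positive", "device": "mobile"},
]

for doc in records:
    stored = json.dumps(doc)        # persisted as-is, no fixed schema
    restored = json.loads(stored)
    # Consumers read optional fields defensively instead of relying on a rigid schema.
    print(restored["customer"], restored.get("sentiment", "n/a"))
```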

The announcement didn't ignore the little yellow elephant, either.  The PureData System family has been expanded with the PureData System for Hadoop, which has built-in analytics acceleration and archiving and provides significantly simpler and faster deployment of projects requiring the MapReduce environment.  And InfoSphere BigInsights 2.1 offers the Big SQL interface to Hadoop, as well as an alternative file system, GPFS-FPO, with enhanced security, high availability and no single point of failure.

While the announcement was clearly branded "Big Data--at the Speed of Business", the underlying message, as seen above, is much broader.  The view is of an emerging information ecosystem that must be considered holistically.  A key role, and perhaps even the primary role, for BigInsights / Hadoop is in exploratory analytics, where innovative, what-if thinking is given free rein.  But the useful insights gained here must eventually be transferred to production (and back) in a reliable, secure, managed environment--typically a relational database.  This environment must also operate at speed, with large data volumes and with ease of management and use.  These are characteristics that are clearly emphasized in this announcement.  They are also key components of the integrated information platform I described in the Data Zoo white paper already mentioned.  Missing still are some of the integration-oriented aspects such as the comprehensive, cross-platform metadata management, data integration and virtualization required to tie it all together.  IBM has more to do to deliver on the full breadth of this vision, but this announcement is a big step in the right direction.


Posted April 8, 2013 9:14 AM
Wikibon's lovingly detailed Big Data Vendor Revenue and Market Forecast 2012-2017 provides an excellent list and positioning of players in the "Big Data" market.  Readers may be surprised to see that IBM tops the list as the biggest vendor in the market in 2012 with nearly 12% market share ($1,352 million), more than twice that of the second-placed HP.  Indeed, the names of the top ten in the list--IBM, HP, Teradata, Dell, Oracle, SAP, EMC, Cisco, Microsoft and Accenture--may also raise an eyebrow, given that all of them come from the "old school" of computer companies.  The top contender among the "new school Big Data" vendors is Splunk with revenue of $186 million.

Wikibon openly describes its methodology for calculating these figures, and one could call it more art than science, given the reluctance of vendors to share such data.  The authors have also revised their original 2011 market size estimate up from $5.1 billion to $7.2 billion.  So, one might dispute the figures and placements at length, but it's probably fair to say that this report is among the more useful publicly available data on this market.

Of more concern to me is the big, hairy, ugly question that has bothered me since "Big Data" attained celebrity status: what on earth is it?  Furthermore, how can one evaluate the overall figures against Wikibon's two-part definition: (1) "those data sets whose size, type and speed of creation make them impractical to process and analyze with traditional database technologies and related tools in a cost- or time-effective way" and (2) "requires practitioners to embrace an exploratory and experimental mindset regarding data and analytics... Projects whose processes are informed by this mindset meet Wikibon's definition of Big Data even in cases where some of the tools and technology involved may not"?  Part 1 is the fairly widespread definition of "Big Data", and one that is, in my view, so vague as to be meaningless.  Part 2 is certainly creative but poses some interesting questions about how one might reliably access practitioners' mindsets and assess them as exploratory and experimental!  The bottom line of this definition is that if somebody says a dataset or project is "Big Data", then it is so.  I long ago came to the conclusion that, unless somebody can come up with a watertight definition, we should stop talking about "Big Data" and fooling ourselves that we can measure it.  I've said this before, but the term won't go away.  Hence the reference to killing vampires in the title...

As an alternative, I'd like to point again to a white paper I wrote last year, The Big Data Zoo - Taming the Beasts, where I categorized information/data into three domains: (1) process-mediated data, (2) human-sourced information and (3) machine-generated data, as shown in the accompanying figure.  I suggest that this is a much more clearly defined way of breaking down the universe of information/data and of differentiating between data uses and projects that are part of what you might call classical data processing and those that have emerged or are emerging in the fields that first sprouted the term "Big Data".  These information domains are largely self-describing, relatively well-bounded and group together data that has similar characteristics in terms of structure and volatility.  Size actually has very little to do with it.

[Figure: Three information domains]
Returning to Wikibon's results and their companion piece, Big Data Database Revenue and Market Forecast 2012-2017, in database software, IBM again tops the list with $215 million in SQL-based revenue and is followed by five other SQL-based database vendors (SAP, HP, Teradata, EMC and Microsoft) until we reach MarkLogic as the top NoSQL vendor (XML, in fact, so hardly part of the post-2008 NoSQL wave except by self-declaration) with revenue of $43 million in 2012.  Wikibon's "bottom line: the top five vendors have about 2/3rds of the database revenue, all from SQL-only product lines. Wikibon believes that NoSQL vendors will challenge these vendors hard over the next five years. However SQL will continue to retain over half of revenues for the foreseeable future."  I personally don't know on what Wikibon based the growth projections, so I cannot comment, but I do have questions about the 2012 figures themselves, both within and beyond the definition of "Big Data".  Hadoop is not mentioned, and although I agree with its exclusion as a database, many vendors are incorporating it into their database environments by a variety of means.  Is this included or excluded, and why?  HP and EMC grab third and fifth positions, based on their Vertica and Greenplum acquisitions respectively.  Judging by the fact that they overshadow significant database players like Microsoft and Oracle, it would seem that most or all of their database revenue is classified as "big data".  Is this reasonable?  How did the survey apportion IBM's, Teradata's and Microsoft's database revenue between "big data" and the rest?  Is all of SAP HANA revenue called "big data" simply because it's an in-memory appliance... or how was it split?  And the list goes on...

I'm sure IBM is very happy to be placed top in both listings; I assume the Netezza figures loom large in the database placement.  SAP will be pleased to take second place, based largely on HANA.  HP Vertica can claim top pure play "big data" database.  And Teradata can take pride in its placement, earned, I expect, in large part through its Aster acquisition.  And so on...  But the more interesting point is that these are all SQL databases.  The highest-placed NoSQL vendor (in the post-2008 wave sense) is 10gen, with attributed revenue of less than 10% of that attributed to IBM in the "big data" category.  All this will drive marketing machines, but with more heat than light.  Given the underlying dysfunction in the definitions, how will it help businesses that are trying to figure out what to do about the "big data" truck allegedly bearing down on them?

My suggestions are straightforward.  In terms of the three data domains outlined above, be aware that process-mediated data - the well-defined, well-structured and well-managed data residing in current operational and informational systems - is growing fast and can drive significant new value through operational analytic approaches.  Human-sourced information - currently mostly about social media - and machine-generated data are emerging and rapidly growing sources of knowledge about people's behaviors and intentions.  They enable new, extensive predictive analytics (the successor to data mining) that initially demands flexibility in exploration, such as that offered by Hadoop.  However, they will demand proper integration in the formal data management environment in the medium to long term.  This requires a well-defined and thoroughly thought-out infrastructure and platform strategy that embraces all types of data and processes.  Of all the vendors mentioned above, only IBM and Teradata are attempting to take such a holistic view, in my opinion.

As for NoSQL databases (however you define them - aren't IMS and IDMS also NoSQL by definition?), I believe the post-2008 NoSQL databases have important roles in the emerging environment.  They certainly drive substantial and long-absent innovation in the relational database market.  In particular, they offer a level of flexibility in database design that is key in emerging markets and applications.  And they have technical characteristics that are very useful in a variety of niches in the market, solving problems with which relational databases struggle.
Interesting data times.  But, let's just quietly drop the "big"...

Posted March 4, 2013 8:21 AM
The past year has been dominated by Big Data--what it might mean and the way you might look at it.  The stories have often revolved around Hadoop and his herd of oddly-named chums.  Vendors and analysts alike have run away and joined this ever-growing and rapidly moving circus.  And yet, as we saw in our own EMA and 9sight Big Data Survey, businesses are on a somewhat different tour.  Of course, they are walking with the elephants, but many so-called Big Data projects have more to do with more traditional data types, i.e. relationally structured, but bigger or requiring faster access.  And in these instances, the need is for Big Analytics, rather than Big Data.  The value comes from what you do with it, not how big it happens to be.

Which brings us to Big Blue.  I've been reading IBM's PureSystems announcement today.  The press release headline trumpets Big Data (as well as Cloud), but the focus from a data aspect is on the deep analysis of highly structured, relational information, with a substantial upgrade of the PureData System for Analytics, based on Netezza technology and first announced less than four months ago.  The emphasis on analytics, relational data and the evolving technology is worth exploring.

Back in September 2010, when IBM announced the acquisition of Netezza, there was much speculation about how the Netezza products would be positioned within IBM's data management and data warehousing portfolios, which included DB2 (in a number of varieties), TM1 and Informix.  Would the Netezza technology be merged into DB2?  Would it continue as an independent product?  Would it, perhaps, die?  I opined that Netezza, with its hardware-based acceleration, was a good match for IBM, which understood the benefits of microcode and dedicated hardware components for specific tasks, such as the field programmable gate array (FPGA) used to minimize the bottleneck between disk and memory.  It seems I was right in that; not only has Netezza survived as an independent platform, as the basis for the PureData System for Analytics, but it has also been integrated behind DB2 for z/OS in the IBM DB2 Analytics Accelerator.
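
For readers unfamiliar with the idea, here is a toy sketch of what "minimizing the bottleneck between disk and memory" means in practice: apply the predicate and projection as the data streams off storage, so that only qualifying rows and needed columns ever reach the query engine. It is purely conceptual, with an invented file layout, and says nothing about Netezza's actual FPGA design.

```python
# Conceptual sketch of near-storage filtering: the predicate and projection are applied
# while streaming rows off "disk", so only a fraction of the raw bytes reaches memory.
# File layout (CSV of sale_date,store_id,amount) and predicate are invented for the example.

def scan_from_disk(path):
    """Stream raw rows from storage; stands in for the disk subsystem."""
    with open(path) as f:
        for line in f:
            sale_date, store_id, amount = line.rstrip("\n").split(",")
            yield sale_date, int(store_id), float(amount)

def pushed_down_scan(path, wanted_store):
    """Filter and project as close to the data as possible, before it crosses to memory."""
    for sale_date, store_id, amount in scan_from_disk(path):
        if store_id == wanted_store:       # predicate applied during the scan
            yield sale_date, amount        # only the needed columns survive

# The query engine then aggregates a much smaller stream:
# total = sum(amount for _, amount in pushed_down_scan("sales.csv", wanted_store=42))
```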

Today's announcement of the PureData System for Analytics N2001 is, at heart, a performance and efficiency upgrade to the original N1001 product, offering a 3x performance improvement and 50% greater capacity for the same power consumption.  The improvements come from a move to smaller, higher capacity and faster disk drives and faster FPGAs.  With a fully loaded system capable of handling a petabyte or more of user data (depending on the compression ratio achieved), we are clearly talking big data.  The technology is purely relational.  And a customer example from the State University of New York at Buffalo quotes a reduction in run time for complex analytics on medical records from 27 hours to 12 minutes, roughly a 135-fold speedup (the prior platform is not named).  So, this system, like competing Analytic Appliances from other vendors, is fast.  Perhaps we should be using images of cheetahs?

[The photo is from my visit to Addo Game Reserve in South Africa last week.  For concerned animal lovers, she did eventually manage to climb out...]

Posted February 5, 2013 10:30 AM