Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today extends to the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with an holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

Recently in Big data Category

I look up, and suddenly it's August.  I've been heads-down for the past three months, finishing my new book, which will be available in October.  The title is designed to be thought provoking: "Business unIntelligence - Insight and Innovation beyond Analytics and Big Data".  More on that next week, starting with why I want to provoke you into thinking, and over the coming weeks, too... promise!  

For now, let's talk Big Data again.  It's a topic that remains one part frustrating and one part energizing.  Let's start with the frustration.  Despite the best efforts of a number of thought leaders over the past months, the reality is stubbornly hard to pin down.  The technologists continue to push the boundaries, but in often perverse ways.  Take the announcement by Hortonworks at the recent Hadoop Summit, for example.  As reported by Stephen Swoyer, Apache YARN (Yet Another Resource Negotiator) will make it easier to parallelize non-MapReduce jobs: "It's the difference, argues Hortonworks founder Arun Murthy, between running applications 'on' Hadoop and running them 'in' Hadoop".  Sounds interesting, I thought, so I headed over to the relevant page and found this gem: "When all of the data in the enterprise is already available in HDFS, it is important to have multiple ways to process that data".  Really?  How many of you believe that there is the slightest possibility that all the data in the enterprise will ever be available in HDFS?  Of course, Hadoop does need a new and improved resource management approach (I suspect that studying IBM System z resource management would help).  But let's not pretend that even a copy of all enterprise data will ever be in one place.  Wasn't that the original data warehouse thinking?  We are in a fully distributed and diversified IT world now.  And wasn't big data a major driver in that realization?

Now to the energizing part.  When EMA and I ran our first big data survey last summer, we found that big data projects in the real world exhibit a wide range of starting points.  Even when the projects are based on Hadoop (and many are not), the idea that all enterprise data should be in HDFS is simply not on the radar.  With this year's survey just recently opened up for input, you do have the opportunity to prove me wrong!  As in last year's work, our focus is on how businesses are translating the hype and opportunities of big data and the emerging technologies into actual projects.  It spans both business and technology drivers, because these two aspects are now intimately related, a concept I call the biz-tech ecosystem.  That is a foundation of Business unIntelligence and the topic of my next blog.

Until then, I encourage you to take the big data survey soon - it will close next week - especially those of you based beyond North America.  We are very interested to see the global picture.

Picture: Sad Little Elephant by Katherine Devlin

Posted July 31, 2013 1:37 AM
For some time now, when it comes to big data, my mantra has been "big data is simply all data".  IBM's April 3 announcement served admirably to reinforce that point of view. Was it a big data announcement, a DB2 announcement, or a hardware announcement?  The short answer is "yes", to all the above and more.

Weaving together a number of threads, Big Blue created a credible storyline that can be summarized in three key thoughts: larger, faster and simpler.  As many of you may know, I worked for IBM until early 2008, so my views on this announcement are informed by my knowledge of how the company works or, perhaps, used to work.  Last Wednesday, I came away impressed.  Here were a number of diverse, individual product developments that conform to a single theme across different lines and businesses.

Take BLU acceleration as a case in point.  The headline, of course, is that DB2 LUW (on Linux, Unix and Windows) 10.5 introduces a hybrid architecture.  Data can be stored in columnar tables with extensive compression, making use of in-memory storage and taking further advantage of the parallel and vector processing techniques available on modern processors.  The result is an improvement of up to 25x in analytic and reporting performance (and considerably more in specific queries) and up to 90% data compression.  In addition, the elimination of indexes and aggregates considerably reduces the need for manual tuning and maintenance of the database.  This is a direction that has long been shown by smaller, newer vendors such as ParAccel and Vertica (now part of HP), so it is hardly a surprise.  IBM can claim a technically superior implementation, but more impressive is the successful retrofitting into the existing product base.  The technology is also being re-used in the separate Informix TimeSeries code base to enhance analytics and reporting there, and IBM promises that it will be extended to other data workloads in the future.  It seems the product development organization is really pulling together across different product lines.  That's no mean feat within IBM.
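
To see why this architecture favours analytics, here's a minimal sketch (in Python/NumPy, purely illustrative and emphatically not IBM's BLU code) of the difference between row-at-a-time processing and a vectorized scan over column-organized data.  A single-column aggregate touches only one contiguous, single-typed array, which is also why such layouts compress so well.

```python
# Illustrative only -- not IBM's BLU implementation.  An aggregate over one
# column touches only that column's contiguous array, rather than walking
# every field of every row.
import numpy as np

N = 200_000

# Row-oriented layout: each record carries every field.
rows = [{"region": i % 50, "amount": float(i % 1000), "status": "OK"}
        for i in range(N)]

# Column-oriented layout: one contiguous, single-typed array per column.
region = np.fromiter((i % 50 for i in range(N)), dtype=np.int32, count=N)
amount = np.fromiter((float(i % 1000) for i in range(N)), dtype=np.float64, count=N)

# "SELECT SUM(amount) WHERE region = 7" in both layouts.
row_total = sum(r["amount"] for r in rows if r["region"] == 7)   # row at a time
col_total = amount[region == 7].sum()                            # vectorized scan

assert abs(row_total - col_total) < 1e-6
```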

Another hint at the strength of the development team was the quiet announcement of a technology preview of JSON support in DB2 at the same time as the availability of 10.5.  JSON is one of the darlings of the NoSQL movement, providing significant agility to support unpredictable and changing data needs.  See my May 2012 white paper "Business Intelligence--NoSQL... No Problem" for more details.  As with its support for other NoSQL technologies, such as XML and RDF graph databases, IBM has chosen to incorporate support for JSON into DB2.  There are pros and cons to this approach.  Performance and scalability may not match a pure JSON database, but the ability to take advantage of the ACID and RAS characteristics of an existing, full-featured database like DB2 makes it a good choice where business continuity is a strong requirement.  IBM clearly recognizes that the world of data is no longer all SQL, but that for certain types of non-relational data, the difference is sufficiently small that they can be handled as an adjunct to the relational model through a "subservient" engine, allowing easier joining of NoSQL and SQL data types.  This is a vital consideration for machine-generated data, one of three information domains I've defined in a recent white paper, "The Big Data Zoo--Taming the Beasts".
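
For those less familiar with the document model, here's a minimal sketch of that agility (plain Python dictionaries, illustrative only, and not DB2's JSON interface): because each document carries its own structure, new attributes can turn up without an upfront schema change, which is precisely the appeal for unpredictable and changing data.

```python
# Illustrative only -- not the DB2 JSON API.  Each document describes its own
# structure, so new attributes can appear without a schema change.
import json

orders = [
    {"id": 1, "customer": "Acme", "total": 120.0},
    # A later record adds a field the original "schema" never anticipated.
    {"id": 2, "customer": "Globex", "total": 75.5,
     "delivery": {"method": "courier", "tracking": "XY123"}},
]

# Consumers simply tolerate the variation instead of failing on it.
for doc in orders:
    tracking = doc.get("delivery", {}).get("tracking", "n/a")
    print(json.dumps({"id": doc["id"], "tracking": tracking}))
```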

The announcement didn't ignore the little yellow elephant, either.  The PureData System family has been expanded with the PureData System for Hadoop, which has built-in analytics acceleration and archiving and provides significantly simpler and faster deployment of projects requiring the MapReduce environment.  And InfoSphere BigInsights 2.1 offers the Big SQL interface to Hadoop, as well as an alternative file system, GPFS-FPO, with enhanced security, high availability and no single point of failure.

While the announcement clearly targeted "Big Data--at the Speed of Business", the underlying message, as seen above, is much broader: an emerging information ecosystem that must be considered from a fully holistic viewpoint.  A key role, and perhaps even the primary role, for BigInsights / Hadoop is in exploratory analytics, where innovative, what-if thinking is given free rein.  But the useful insights gained here must eventually be transferred to production (and back) in a reliable, secure, managed environment--typically a relational database.  This environment must also operate at speed, with large data volumes and with ease of management and use.  These are characteristics that are clearly emphasized in this announcement.  They are also key components of the integrated information platform I described in the Data Zoo white paper already mentioned.  Still missing are some of the integration-oriented aspects, such as the comprehensive, cross-platform metadata management, data integration and virtualization required to tie it all together.  IBM has more to do to deliver on the full breadth of this vision, but this announcement is a big step in the right direction.


Posted April 8, 2013 9:14 AM
Wikibon's lovingly detailed Big Data Vendor Revenue and Market Forecast 2012-2017 provides an excellent list and positioning of players in the "Big Data" market.  Readers may be surprised to see that IBM tops the list as the biggest vendor in the market in 2012 with nearly 12% market share ($1,352 million), more than twice that of the second-placed HP.  Indeed, the names of the top ten in the list--IBM, HP, Teradata, Dell, Oracle, SAP, EMC, Cisco, Microsoft and Accenture--may also raise an eyebrow, given that all of them come from the "old school" of computer companies.  The top contender among the "new school Big Data" vendors is Splunk with revenue of $186 million.

Wikibon openly describes their methodology for calculating these figures, and one could call it more art than science, given the reluctance of vendors to share such data.  The authors have also revised their original 2011 market size estimate up from $5.1 billion to $7.2 billion.  So, one might dispute the figures and placements at length, but it's probably fair to say that this report is among the more useful publicly available data on this market.

Of more concern to me is the big, hairy, ugly question that has bothered me since "Big Data" attained celebrity status: what on earth is it?  Furthermore, how can one evaluate the overall figures given Wikibon's two-part definition: (1) "those data sets whose size, type and speed of creation make them impractical to process and analyze with traditional database technologies and related tools in a cost- or time-effective way" and (2) "requires practitioners to embrace an exploratory and experimental mindset regarding data and analytics... Projects whose processes are informed by this mindset meet Wikibon's definition of Big Data even in cases where some of the tools and technology involved may not".  Part 1 is the fairly widespread definition of "Big Data", and one that is, in my view, so vague as to be meaningless.  Part 2 is certainly creative but poses some interesting questions about how one might reliably access practitioners' mindsets and assess them as exploratory and experimental!  The bottom line of this definition is that if somebody says a dataset or project is "Big Data", then it is so.  I long ago came to the conclusion that, unless somebody can come up with a watertight definition, we should stop talking about "Big Data" and stop fooling ourselves that we can measure it.  I've said this before, but the term won't go away.  Hence, the reference to killing vampires in the title...

As an alternative, I'd like to point again to a white paper I wrote last year, The Big Data Zoo - Taming the Beasts, where I categorized information/data into three domains: (1) process-mediated data, (2) human-sourced information and (3) machine-generated data, as shown in the accompanying figure.  I suggest that this is a much more clearly defined way of breaking down the universe of information/data and of differentiating between data uses and projects that are part of what you might call classical data processing and those that have emerged or are emerging in the fields that first sprouted the term "Big Data".  These information domains are largely self-describing, relatively well-bounded and group together data that has similar characteristics in terms of structure and volatility.  Size actually has very little to do with it.
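
As a rough illustration (the example sources and characteristic labels below are my own shorthand, not taken from the white paper), one might tag data sources by domain along the lines of structure and volatility:

```python
# A hypothetical, illustrative tagging of data sources by information domain.
# The domain names follow the white paper; the examples and characteristics
# are my own shorthand.
DOMAINS = {
    "process-mediated data": {
        "structure": "rigidly modelled", "volatility": "low",
        "examples": ["orders", "general ledger", "customer master"]},
    "human-sourced information": {
        "structure": "loose, self-described", "volatility": "high",
        "examples": ["tweets", "emails", "documents"]},
    "machine-generated data": {
        "structure": "regular but evolving", "volatility": "very high",
        "examples": ["sensor readings", "web logs", "RFID events"]},
}

def domain_of(source: str) -> str:
    """Return the domain whose example list mentions the given source."""
    for domain, traits in DOMAINS.items():
        if source in traits["examples"]:
            return domain
    return "unclassified"

print(domain_of("web logs"))   # machine-generated data
```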

[Figure: Three information domains]

Returning to Wikibon's results and their companion piece, Big Data Database Revenue and Market Forecast 2012-2017, in database software IBM again tops the list with $215 million in SQL-based revenue and is followed by five other SQL-based database vendors (SAP, HP, Teradata, EMC and Microsoft) until we reach MarkLogic as the top NoSQL (XML, in fact, so hardly part of the post-2008 NoSQL wave except by self-declaration) vendor with revenue of $43 million in 2012.  Wikibon's "bottom line: the top five vendors have about 2/3rds of the database revenue, all from SQL-only product lines. Wikibon believes that NoSQL vendors will challenge these vendors hard over the next five years. However SQL will continue to retain over half of revenues for the foreseeable future."  I personally don't know on what Wikibon based the growth projections, so I cannot comment, but I do have questions about the 2012 figures themselves, both within and beyond the definition of "Big Data".  Hadoop is not mentioned, and although I agree with its exclusion as a database, many vendors are incorporating it into their database environments by a variety of means.  Is this included or excluded, and why?  HP and EMC grab third and fifth positions, based on their Vertica and Greenplum acquisitions respectively.  Judging by the fact that they overshadow significant database players like Microsoft and Oracle, it would seem that most or all of their database revenue is classified as "big data".  Is this reasonable?  How did the survey apportion IBM's, Teradata's and Microsoft's database revenue between "big data" and the rest?  Is all of SAP HANA revenue called "big data" simply because it's an in-memory appliance... or how was it split?  And the list goes on...

I'm sure IBM is very happy to be placed top in both listings; I assume the Netezza figures loom large in the database placement.  SAP will be pleased to take second place, based largely on HANA.  HP Vertica can claim top pure-play "big data" database.  And Teradata can take pride in its placement, earned, I expect, in large part through its Aster acquisition.  And so on...  But the more interesting point is that these are all SQL databases.  The highest-placed NoSQL vendor (in the post-2008 wave sense) is 10gen, with attributed revenue of less than 10% of that attributed to IBM in the "big data" category.  All this will drive marketing machines, but with more heat than light.  Given the underlying dysfunction in the definitions, how will it help businesses that are trying to figure out what to do about the "big data" truck allegedly bearing down on them?

My suggestions are straightforward.  In terms of the three data domains outlined above, be aware that process-mediated data - the well-defined, -structured and -managed data residing in current operational and informational systems - is growing fast and can drive significant new value through operational analytic approaches.  Human-sourced information - currently mostly about social media - and machine-generated data are emerging and rapidly growing sources of knowledge about people's behaviors and intentions.  They enable new, extensive predictive analytics (the successor to data mining) that initially demands flexibility in exploration, such as that offered by Hadoop.  However, they will demand proper integration in the formal data management environment in the medium to long term.  This requires a well-defined and thoroughly thought-out infrastructure and platform strategy that embraces all types of data and processes.  Of all the vendors mentioned above, only IBM and Teradata are attempting to take such a holistic view, in my opinion.  

As for NoSQL databases (however you define them - aren't IMS and IDMS also NoSQL by definition?), I believe the post-2008 NoSQL databases have important roles in the emerging environment.  They certainly drive substantial and long-absent innovation in the relational database market.  In particular, they offer a level of flexibility in database design that is key in emerging markets and applications.  And they have technical characteristics that are very useful in a variety of niches in the market, solving problems with which relational databases struggle.
Interesting data times.  But, let's just quietly drop the "big"...

Posted March 4, 2013 8:21 AM
[Photo: Baby elephant]

The past year has been dominated by Big Data.  What it might mean and the way you might look at it.  The stories have often revolved around Hadoop and his herd of oddly-named chums.  Vendors and analysts alike have run away and joined this ever-growing and rapidly moving circus.  And yet, as we saw in our own EMA and 9sight Big Data Survey, businesses are on a somewhat different tour.  Of course, they are walking with the elephants, but many so-called Big Data projects have more to do with more traditional data types, i.e. relationally structured, but bigger or requiring faster access.  And in these instances, the need is for Big Analytics, rather than Big Data.  The value comes from what you do with it, not how big it happens to be.

Which brings us to Big Blue.  I've been reading IBM's PureSystems announcement today.  The press release headline trumpets Big Data (as well as Cloud), but the focus from a data perspective is on the deep analysis of highly structured, relational information, with a substantial upgrade of the PureData System for Analytics, based on Netezza technology and first announced less than four months ago.  The emphasis on analytics, relational data and the evolving technology is worth exploring.

Back in September 2010, when IBM announced the acquisition of Netezza, there was much speculation about how the Netezza products would be positioned within IBM's data management and data warehousing portfolios, which included DB2 (in a number of varieties), TM1 and Informix.  Would the Netezza technology be merged into DB2?  Would it continue as an independent product?  Would it, perhaps, die?  I opined that Netezza, with its hardware-based acceleration, was a good match for IBM, which understood the benefits of microcode and dedicated hardware components for specific tasks, such as the field programmable gate array (FPGA) used to minimize the bottleneck between disk and memory.  It seems I was right: not only has Netezza survived as an independent platform, as the basis for the PureData System for Analytics, but it has also been integrated behind DB2 for z/OS in the IBM DB2 Analytics Accelerator.

Today's announcement of the PureData System for Analytics N2001 is, at heart, a performance and efficiency upgrade to the original N1001 product, offering a 3x performance improvement and 50% greater capacity for the same power consumption.  The improvements come from a move to smaller, higher capacity and faster disk drives and faster FPGAs.  With a fully loaded system capable of handling a petabyte or more of user data (depending on compression ratio achieved), we are clearly talking big data.  The technology is purely relational.  And a customer example from the State University of New York, Buffalo quotes a reduction in run time for complex analytics on medical records from 27 hours to 12 minutes (the prior platform is not named).  So, this system, like competing Analytic Appliances from other vendors, is fast.  Perhaps we should be using images of cheetahs?

[The photo is from my visit to Addo Game Reserve in South Africa last week.  For concerned animal lovers, she did eventually manage to climb out...]

Posted February 5, 2013 10:30 AM
MagnifyingGlassDataSpider.jpgAs we begin a new year, we are promised a move from a focus on the meaning and technology of big data to the useful and worthwhile business applications it may offer.  A timely move indeed.  Hopefully, we'll begin to hear less about analyzing Twitter streams to optimize advertising spend and more about applications with the potential to improve people's lives or the environment.  And even more hopefully, people may begin to consider the risks they run when revealing or gathering personal data on our deeply interconnected Web.

With all of the synchronicity that is the Internet, I came across two articles from the New York Times published last week.  The first, by Peter Jaret on January 14, describes how patient records, transcribed and digitized from scrawled (why do they write so poorly?) doctors' notes, anonymized and stored on the Web, can be statistically mined to discover previously unknown side-effects of and interactions between prescribed drugs.  Clearly useful and valuable work.  The second article, three days later by Gina Kolata, revealed how easily a genetics researcher was able to identify five individuals and their extended families by combining publicly available information from the anonymized 1000 Genomes Project database, a commercial genealogy Web site and Google.  Kolata quotes Amy L. McGuire, a lawyer and ethicist at Baylor College of Medicine in Houston: "To have the illusion you can fully protect privacy or make data anonymous is no longer a sustainable position".  The underlying genetic data is used in medical research to good effect, of course, but what are the possible consequences for those individuals thus identified as insurance companies, governments or other interested parties make potentially negative assessments based on their once private genomes?

Such occurrences--and there are many of them--should be deeply disturbing to those of us involved in the business of big data and analytics.  Here are doctors, scientists and lawyers--with training in logic, ethics and law--who see the power of analytics to improve the human condition, but who seem to gloss over the wider privacy and security implications of making personal information widely available on the Web.  After all, the limits of data anonymization on the Web were being discussed openly as long ago as May 2011 by Pete Warden on the O'Reilly Radar blog.  And as far back as 1997, Prof. Latanya Sweeney, now Director of the Data Privacy Lab at Harvard, showed that the combination of gender, ZIP code and birthdate was unique for 87% of the U.S. population.
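
To make the point concrete, here's a small sketch on synthetic data (invented numbers of ZIP codes and records; it reproduces the idea of quasi-identifier uniqueness, not Prof. Sweeney's actual analysis).  It counts how many "anonymized" records remain unique on the gender/ZIP/birthdate combination and are therefore candidates for re-identification by joining to another data source.

```python
# A toy re-run of the quasi-identifier idea on synthetic data -- not
# Prof. Sweeney's study or her figures.  "Anonymized" records that still carry
# gender, ZIP code and birthdate are checked for uniqueness: any combination
# occurring exactly once can potentially be re-identified by joining against
# another dataset (voter rolls, a genealogy site, ...).
import random
from collections import Counter
from datetime import date, timedelta

random.seed(42)
ZIPS = [f"{z:05d}" for z in range(10000, 10050)]   # 50 hypothetical ZIP codes
START = date(1950, 1, 1)

def random_record():
    return (random.choice("MF"),
            random.choice(ZIPS),
            (START + timedelta(days=random.randrange(40 * 365))).isoformat())

records = [random_record() for _ in range(5000)]
counts = Counter(records)
unique = sum(1 for r in records if counts[r] == 1)

print(f"{unique / len(records):.0%} of records are unique on (gender, ZIP, birthdate)")
```

Even at this toy scale, nearly every record turns out to be unique; real populations are denser, but as the 87% figure shows, not nearly dense enough to hide in.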

Eben Moglen, professor of law and legal history at Columbia University and Chairman of the Software Freedom Law Center, warned at re:publica Berlin in May 2012 that "media that spies on and data-mines the public is destroying freedom of thought and only this generation, the last to grow up remembering the 'old way', is positioned to save this, humanity's most precious freedom".  With media and medicine, government and retail, telecommunications and finance all gathering hoards of information about us, each for their own allegedly good purpose, the reality is now that the abuse of big data (as opposed to its use) is not only possible, but proceeding apace, even in largely democratic, Western states.

So, given that big data anonymity is "no longer a sustainable position", it should be clear that the analytics possible on today's high-powered computers is a double-edged sword; it serves us poorly to focus on only one, single, razor-sharp edge.  As we evaluate and build useful and worthwhile business analytics applications in the coming year, let us step back, even if only occasionally, to contemplate whether the profits to be earned or the discoveries to be made are worth the price of human freedom.

Posted January 23, 2013 9:06 AM