Wikibon's lovingly detailed Big Data Vendor Revenue and Market Forecast 2012-2017 provides an excellent list and positioning of players in the "Big Data" market. Readers may be surprised to see that IBM tops the list as the biggest vendor in the market in 2012 with nearly 12% market share ($1,352 million), more than twice that of the second-placed HP. Indeed, the names of the top ten in the list--IBM, HP, Teradata, Dell, Oracle, SAP, EMC, Cisco, Microsoft and Accenture--may also raise an eyebrow, given that all of them come from the "old school" of computer companies. The top contender among the "new school Big Data" vendors is Splunk with revenue of $186 million.
Wikibon openly describes their methodology for calculating these figures, and one could describe it as more art than science, given the reluctance of vendors to share such data. Furthermore, the authors have also revised their original 2011 market size estimate up from $5.1 to $7.2 billion. So, one might dispute the figures and placements at length, but it's probably fair to say that this report is among the more useful publicly available data on this market.
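As a rough back-of-the-envelope check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. The function and variable names are illustrative, and treating "nearly 12%" as an upper bound on IBM's share is an assumption of this sketch, not something taken from Wikibon's methodology.

```python
# Figures quoted in the text above.
IBM_REVENUE_M = 1352.0       # IBM 2012 "Big Data" revenue, $ million
IBM_SHARE_CEILING = 0.12     # "nearly 12%": assumed here to mean at most 12%
SPLUNK_REVENUE_M = 186.0     # Splunk 2012 revenue, $ million
ESTIMATE_2011_OLD_B = 5.1    # original 2011 market size estimate, $ billion
ESTIMATE_2011_NEW_B = 7.2    # revised 2011 market size estimate, $ billion


def implied_market_floor(revenue_m: float, share_ceiling: float) -> float:
    """Smallest total market ($ million) consistent with a revenue figure
    and an upper bound on that vendor's market share."""
    return revenue_m / share_ceiling


def relative_revision(old: float, new: float) -> float:
    """Size of a revision as a fraction of the original estimate."""
    return (new - old) / old


market_floor_m = implied_market_floor(IBM_REVENUE_M, IBM_SHARE_CEILING)
# A larger total market would only shrink Splunk's share, so this is a ceiling.
splunk_share_ceiling = SPLUNK_REVENUE_M / market_floor_m
revision = relative_revision(ESTIMATE_2011_OLD_B, ESTIMATE_2011_NEW_B)

print(f"Implied 2012 market size: at least ${market_floor_m / 1000:.1f} billion")
print(f"Splunk's implied share: at most {splunk_share_ceiling:.2%}")
print(f"Upward revision of the 2011 estimate: {revision:.1%}")
```

On these assumptions, the revision to the 2011 estimate is on the order of 40% of the original figure, which gives a sense of how much latitude there is in the numbers.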
Of more concern to me is the big, hairy, ugly question that has bothered me since "Big Data" attained celebrity status: what on earth is it? Furthermore, how can one evaluate the overall figures with Wikibon's two-part definition: (1) "those data sets whose size, type and speed of creation make them impractical to process and analyze with traditional database technologies and related tools in a cost- or time-effective way" and (2) "requires practitioners to embrace an exploratory and experimental mindset regarding data and analytics... Projects whose processes are informed by this mindset meet Wikibon's definition of Big Data even in cases where some of the tools and technology involved may not". Part 1 is the fairly widespread definition of "Big Data", and one that is, in my view, so vague as to be meaningless. Part 2 is certainly creative but poses some interesting questions about how one might reliably access practitioners' mindsets and assess them as exploratory and experimental! The bottom line of this definition is that if somebody says a dataset or project is "Big Data" then it is so. I long ago came to the conclusion that, unless somebody can come up with a watertight definition, we should stop talking about "Big Data" and stop fooling ourselves that we can measure it. I've said this before, but the term won't go away. Hence, the reference to killing vampires in the title...
As an alternative, I'd like to point again to a white paper I wrote last year, The Big Data Zoo - Taming the Beasts, where I categorized information/data into three domains: (1) process-mediated data, (2) human-sourced information and (3) machine-generated data, as shown in the accompanying figure. I suggest that this is a much more clearly defined way of breaking down the universe of information/data and of differentiating between data uses and projects that are part of what you might call classical data processing and those that have emerged or are emerging in the fields that first sprouted the term "Big Data". These information domains are largely self-describing, relatively well-bounded and group together data that has similar characteristics in terms of structure and volatility. Size actually has very little to do with it.
Returning to Wikibon's results and their companion piece, Big Data Database Revenue and Market Forecast 2012-2017, in database software, IBM again tops the list with $215 million in SQL-based revenue and is followed by 5 other SQL-based database vendors (SAP, HP, Teradata, EMC and Microsoft) until we reach MarkLogic as the top NoSQL (XML, in fact, so hardly part of the post-2008 NoSQL wave except by self-declaration) vendor with revenue of $43 million in 2012. Wikibon's "bottom line: the top five vendors have about 2/3rds of the database revenue, all from SQL-only product lines. Wikibon believes that NoSQL vendors will challenge these vendors hard of the next five years. However SQL will continue to retain over half of revenues for the foreseeable future."
I personally don't know on what Wikibon based the growth projections, so I cannot comment, but I do have questions about the 2012 figures themselves, both including and beyond the definition of "Big Data". Hadoop is not mentioned, and although I agree with its exclusion as a database, many vendors are incorporating it into their database environments by a variety of means. Is this included or excluded and why? HP and EMC grab third and fifth positions, based on their Vertica and Greenplum acquisitions respectively. Judging by the fact they overshadow significant database players like Microsoft and Oracle, it would seem that most or all of their database revenue is classified as "big data". Is this reasonable? How did the survey apportion IBM's, Teradata's and Microsoft's database revenue between "big data" and the rest? Is all of SAP HANA revenue called "big data" simply because it's an in-memory appliance... or how was it split? And the list goes on...
I'm sure IBM is very happy to be placed top in both listings; I assume the Netezza figures loom large in the database placement. SAP will be pleased to take second place, based largely on HANA. HP Vertica can claim top pure-play "big data" database. And Teradata can take pride in its placement, earned, I expect, in large part through its Aster acquisition. And so on... But the more interesting point is that these are all SQL databases. The highest-placed NoSQL vendor (in the post-2008 wave sense) is 10gen, with attributed revenue of less than 10% of that attributed to IBM in the "big data" category. All this will drive marketing machines, but with more heat than light. Given the underlying dysfunction in the definitions, how will it help businesses that are trying to figure out what to do about the "big data" truck allegedly bearing down on them?
My suggestions are straightforward. In terms of the three data domains outlined above, be aware that process-mediated data - the well-defined, -structured and -managed data residing in current operational and informational systems - is growing fast and can drive significant new value through operational analytic approaches. Human-sourced information - currently mostly about social media - and machine-generated data are emerging and rapidly growing sources of knowledge about people's behaviors and intentions. They enable new, extensive predictive analytics (the successor to data mining) that initially demands flexibility in exploration, such as that offered by Hadoop. However, they will demand proper integration in the formal data management environment in the medium to long term. This requires a well-defined and thoroughly thought-out infrastructure and platform strategy that embraces all types of data and processes. Of all the vendors mentioned above, only IBM and Teradata are attempting to take such a holistic view, in my opinion.
As for NoSQL databases (however you define them - aren't IMS and IDMS also NoSQL by definition?), I believe the post-2008 NoSQL databases have important roles in the emerging environment. They certainly drive substantial and long-absent innovation in the relational database market. In particular, they offer a level of flexibility in database design that is key in emerging markets and applications. And they have technical characteristics that are very useful in a variety of niches in the market, solving problems with which relational databases struggle.
Interesting data times. But, let's just quietly drop the "big"...