I've attended numerous big data events during the past three years, but the Strata/Hadoop World in New York City two weeks ago had the most people and buzz of any so far. With more than 2,500 attendees crammed into the New York Hilton Hotel, the conference was a bit of a madhouse, exploding with energy and possibility. It's clear that after simmering for several years, big data has finally captured the imagination and attention of the industry as a whole.
In fact, this year's Hadoop World reminded me of the early years of data warehousing. In 1997, The Data Warehousing Institute (TDWI) held its largest conference ever in a ramshackle two-story hotel on the outskirts of San Diego alongside the freeway. At the time, data warehousing was a red-hot phenomenon, and there was so much interest in this new technology that I missed getting a room at the event hotel. So every morning, I walked under the freeway to learn about the latest and greatest in data warehousing. This daily 20-minute trek was a small sacrifice, knowing that I was on to something groundbreaking!
Since then, data warehousing has become a fixture in corporate IT environments. And although it no longer attracts the same buzz and has taken its punches over the years, it has also matured. With trial and error, the data warehousing market has refined the processes, methodologies, and technologies for collecting, integrating, and reporting on large volumes of disparate data. Data warehousing is here to stay.
Pace of evolution. My first question these days is whether big data (a.k.a. Hadoop) will evolve the same way as data warehousing (and most information technologies) or navigate a unique course, given its supercharged expectations. My guess is that in 2013 big data will slide down Gartner's trough of disillusionment as companies discover just how raw and costly the technologies are to manage in a production environment at scale.
But that doesn't mean Hadoop won't compete with established technologies and vendors for a growing portion of data processing and analytical workloads. In fact, there are indications that Hadoop may move through the maturity curve faster than most technologies. It certainly has the potential to replace technologies that companies use today to build and manage information-intensive applications. Vendors are quickly plugging the gaps in the Hadoop ecosystem with both open source and commercial software, and user organizations are moving rapidly from kicking Hadoop's tires to implementing it in production environments.
Dress code. One indication of the rapid evolution of the big data market is the changing dress code at big data events. Two years ago, the typical attendee at Hadoop World had a ponytail and wore jeans, Converse sneakers, and a t-shirt. But this year, I saw a sizable number of people in blue blazers or business suits sans ties with gray around the temples (including me!). Clearly, big data is no longer just a forum for Java developers and open source adherents. The buzz around big data has attracted the attention of commercial software vendors, venture capitalists, industry experts, and other market followers who are gearing up for the next big thing.
Complementary or Competitive?
The next question with Hadoop is whether it will complement or replace our existing analytical environments. Die-hard Hadoop advocates tout Hadoop as a replacement for data warehouses, relational databases, and data integration tools. However, top executives at Hadoop vendors, including Mike Olson at Cloudera and Rob Bearden at Hortonworks, are more diplomatic and publicly declare that Hadoop plays a complementary role to existing technologies. Some of this "play nice" verbiage is necessary: for Hadoop to grow, it needs applications to run on top of it, and the quickest way to get them is to play nice with existing database, ETL, and BI vendors. But some of it is also based on the current realities of Hadoop implementations.
Staging area and archive. Today, most companies use Hadoop as a staging area for semi-structured data. Most parse and aggregate Web logs and then load the results into a data warehouse for reporting and analysis. As such, Hadoop offers companies a cost-effective way to collect, report on, and analyze unstructured data, something that is not easily accomplished in the data warehousing world.
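To make the staging pattern concrete, here is a minimal sketch of the "parse and aggregate Web logs" step, written in the style of a Hadoop Streaming mapper and reducer. The combined-format log layout and field positions are illustrative assumptions, not any particular company's pipeline; in production this logic would be run across the cluster by Hadoop, with the sort between map and reduce handled by the framework.

```python
from itertools import groupby

def mapper(lines):
    """Emit (url, 1) for each well-formed access-log line."""
    for line in lines:
        fields = line.split()
        if len(fields) > 6:          # skip malformed lines instead of failing
            url = fields[6]          # request path in a combined-format log
            yield url, 1

def reducer(pairs):
    """Sum hit counts per URL; assumes input sorted by key,
    which Hadoop guarantees between the map and reduce phases."""
    for url, group in groupby(pairs, key=lambda kv: kv[0]):
        yield url, sum(count for _, count in group)

if __name__ == "__main__":
    # A few hypothetical log lines standing in for raw Web-server output.
    logs = [
        '1.2.3.4 - - [01/Oct/2012:10:00:00 -0400] "GET /index.html HTTP/1.1" 200 512',
        '1.2.3.4 - - [01/Oct/2012:10:00:01 -0400] "GET /about.html HTTP/1.1" 200 256',
        '5.6.7.8 - - [01/Oct/2012:10:00:02 -0400] "GET /index.html HTTP/1.1" 200 512',
    ]
    for url, hits in reducer(sorted(mapper(logs))):
        print(url, hits)
```

The aggregated (url, hits) pairs are what would get loaded into the warehouse; the raw log lines stay behind in Hadoop.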
Hadoop also serves as a cost-effective data archive, since you simply add commodity servers to house more data and you never move data offline. As one colleague likes to joke, "Hadoop is the new tape."
Load and go. Most importantly, Hadoop offers a new approach to collecting and managing data that is quite liberating. In the data warehousing world, you have to model and structure data before you load it into a data warehouse, a process that is time-consuming and expensive. Consequently, the mantra from data warehousing experts is, "Collect only the data you need." But since Hadoop is based on a file system, you don't have to do any upfront modeling. You just load and go. As a result, the mantra from Hadoop developers is, "Collect any data you might need." That's because there is no longer any significant cost to accumulating data. You load it, explore it, and only when you find something valuable do you structure it and load it into the data warehouse.
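The load-and-go idea is what's often called schema-on-read, in contrast to the warehouse's schema-on-write. The contrast can be sketched in a few lines of Python; the record fields and function names here are hypothetical, chosen purely for illustration.

```python
import json

# Schema-on-write (warehouse style): model and shape the record BEFORE storing.
# Anything outside the pre-modeled columns is discarded at load time.
def load_into_warehouse(raw, table):
    record = json.loads(raw)
    table.append({
        "user": record["user"],
        "amount": float(record["amount"]),   # typed, validated upfront
    })

# Schema-on-read (Hadoop style): store the raw bytes untouched...
def store_raw(raw, files):
    files.append(raw)                        # no modeling, no upfront cost

# ...and impose structure only when a question is finally asked.
def query_raw(files, field):
    for raw in files:
        record = json.loads(raw)             # structure applied at read time
        if field in record:                  # fields never modeled still exist
            yield record[field]

if __name__ == "__main__":
    raw_event = '{"user": "alice", "amount": "3.50", "device": "phone"}'

    table, files = [], []
    load_into_warehouse(raw_event, table)
    store_raw(raw_event, files)

    # The warehouse never kept "device"; the raw store can still answer for it.
    print("device" in table[0])              # False
    print(list(query_raw(files, "device")))  # ['phone']
```

The design consequence is exactly the two mantras above: the warehouse pays the modeling cost once, upfront, for every record; the file system defers that cost until a specific question makes it worth paying.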
Ultimately, companies need both models of collecting and processing data. I call the data warehousing model "top down" intelligence and the Hadoop model "bottom up" intelligence. In the top down world, you know the questions you want to ask in advance, while in the bottom up world, you do not. Each model requires different architectural approaches and technologies.
Replacement trends. As such, Hadoop currently complements a data warehousing environment. However, the real question is whether Hadoop will bust out of this complementary role and begin to subsume additional analytical workloads--or indeed, all of them.
There is ample indication that Hadoop may in fact cannibalize additional components of the analytical environment. For example, this spring, I conducted a survey of BI professionals who have implemented Hadoop. Most plan to significantly increase the amount of ad hoc queries, visual exploration, data mining, and reporting that they run directly against Hadoop. While most of this querying will undoubtedly run against new data (i.e. unstructured data), it may also run against traditional data sets in the future.
Exploiting this desire to query data directly in Hadoop, Cloudera announced two weeks ago a real-time query engine (Impala) that moves Hadoop out of the realm of batch processing and into the world of iterative querying and analytical workloads. In addition, Hortonworks earlier this year announced a metadata catalog called HCatalog that makes it easier for users to query data in Hadoop. Moreover, there is an Apache project under development called YARN that generalizes Hadoop's resource management beyond MapReduce to support other processing frameworks and engines. As such, YARN promises to make Hadoop better optimized to handle diverse sets of workloads, such as messaging, data mining, and real-time queries. In other words, Hadoop can do almost everything a relational database or data warehouse can do, and more.
Bleeding-edge companies. Not surprisingly, companies that operate on the bleeding edge of technology, such as Netflix, have seen the future and are moving aggressively toward it. Netflix now refers to its massive Teradata data warehouse as a "data mart" and Hadoop as its "data warehouse." Currently, only 10% of its master data resides in Hadoop and the rest in Teradata, but that ratio will soon be reversed. Since Netflix stages most of its data in Hadoop, it sees a natural progression toward moving most of its data and query processing to Hadoop as well. The data warehouse will either disappear or be relegated to handling certain types of dimensional reporting.
Surrounding Hadoop. Given the stakes, it's no wonder that established software vendors, like Teradata, Oracle, Microsoft, SAP, and IBM, as well as the multitude of traditional ETL and BI vendors are working overtime to court Hadoop. On one hand, big data represents a sizable new market opportunity for them--big data opens up a whole new set of applications running on unstructured data. But on the other hand, Hadoop is a threat, and they need to make sure that they don't cede their market hegemony to a slew of Hadoop startups.
So the battle lines are drawn: Hadoop vendors currently pitch the complementary nature of Hadoop, but keep releasing functionality that puts them in competition with established vendors. And established vendors hawk interoperability with Hadoop while aggressively trying to surround it to maintain control of customer accounts and market dynamics. They want to keep Hadoop as a niche technology for handling unstructured data only, while exploiting it for market gain.
We'll learn a lot about the future of Hadoop in 2013. We'll find out just how reliable, secure, and manageable it is for enterprise deployments; we'll better understand the true cost of implementing and maintaining Hadoop environments; and we'll know whether it can live up to its hype as a be-all and end-all for data and information processing.