I attended O'Reilly's Strata Summit and Conference on big data this week in New York City. It was a real eye opener. A culture shock, really.
I suspect many of you have experienced the enlightenment that comes upon returning home from spending several days or weeks in a foreign country. Armed with a new cultural context, you see your own beliefs, traditions, and cultural mores through a fresh lens. Often, we learn things about ourselves that we didn't know because the truth was too obvious to see.
That was the experience I had attending Strata this year and Hadoop World a year ago. Last year, after spending two days at Hadoop World surrounded by developers, I discovered how much of a data guy I am. This year, after two days at Strata surrounded by bright, young, hip open source people, I realize how much of an old, corporate guy I am. Maybe it's not too late to change?
If all you have is a hammer...
Two of the more enlightening presentations I attended were delivered by John Rauser, a software engineer at Amazon.com.
In his first presentation, John made a good case for using Hadoop and cloud services. (It was a great pitch for Amazon Web Services.) He argued that Hadoop is a godsend when you need to scale an application beyond one computer. His point was that Hadoop makes distributed computing easy. He also argued that Hadoop isn't just for big data but for any volume of data whose peak processing requirements demand a clustered environment. Finally, he explained the economic upside of running Hadoop in the cloud: you dynamically provision for current workloads, not peak usage.
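To make the provisioning argument concrete, here is a minimal back-of-the-envelope sketch in Python. The hourly demand figures are invented for illustration and did not come from John's talk. It also assumes capacity can be resized every hour with no startup overhead, which real clusters won't quite achieve.

```python
# Hypothetical number of nodes needed in each hour of one day.
# These figures are made up for illustration only.
hourly_demand = [2, 2, 2, 2, 2, 2, 4, 6, 8, 10, 12, 20,
                 20, 12, 10, 8, 6, 4, 4, 2, 2, 2, 2, 2]


def node_hours_fixed(demand):
    """Node-hours paid for when the cluster is sized for the peak all day."""
    return max(demand) * len(demand)


def node_hours_elastic(demand):
    """Node-hours paid for when capacity tracks demand hour by hour."""
    return sum(demand)


fixed = node_hours_fixed(hourly_demand)
elastic = node_hours_elastic(hourly_demand)

print(f"Sized for peak:   {fixed} node-hours")
print(f"Sized for demand: {elastic} node-hours")
print(f"Elastic share of fixed cost: {elastic / fixed:.0%}")
```

The spikier the workload, the wider the gap between the two numbers. With a flat workload the two are identical and the economic argument disappears.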
John is incredibly bright and articulate and clearly smitten with Hadoop and MapReduce. He and his fellow engineers at Amazon are using Hadoop/MapReduce to build (or rebuild) a lot of data-intensive applications that need greater horsepower at low cost. Although it's clear they are getting tremendous value out of Hadoop and MapReduce, I wonder how applicable Amazon's experience is to other companies.
John made a couple of assumptions while encouraging the audience to consider Hadoop as a distributed computing platform. His first assumption is that you are going to code applications rather than implement a software package. You don't need Hadoop if you buy an enterprise business application, most of which are designed to run in clustered environments already. Most companies I know are trying to reduce their dependence on custom applications because of the overhead of maintaining programmers and code. Second, John assumes that you have talented software programmers who have the skills to learn and build Hadoop/MapReduce applications. And he also assumes you have administrators who know how to operate and maintain a Hadoop cluster, which is still a relatively immature data processing environment. (Of course, this is a good argument for using AWS Elastic MapReduce services.)
So, the corollary to John's argument is that if your company likes to code software, employs lots of talented programmers, and either has an operations team versed in Hadoop or can afford to run Hadoop in the cloud, then consider Hadoop when you need to scale an application cost-effectively.
Dissing the DW
John was also pretty dismissive of data warehouses and the IT folks who administer them, but that's because he's a programmer. As a development platform, data warehouses aren't terrific, I'll admit. But then again, that's not what they are designed for. They are data repositories that support batch-oriented, SQL-based reporting applications and feed other data environments, including operational data stores that handle more run-the-business applications. To throw the baby out with the bathwater is a tad irresponsible. But I'll agree, we need to make data warehouses more analyst- and developer-friendly.
Data Scientist Characteristics
John's second presentation covered the qualities that make up a good data scientist, like himself. I thought his list was rather good. A data scientist needs to be 1) a good software engineer (i.e., programmer), 2) well versed in statistics and applied math, 3) a critical thinker who takes nothing at face value, and 4) someone who is perpetually curious and eager to learn new things.
This list is quite similar to the one I composed to define the characteristics of a good analytical modeler. During John's presentation, I tweeted that a data scientist is "an old-school SAS analyst who knows how to program." Most old-school SAS analysts are statisticians who are adept at extracting and preparing large data sets from a variety of internal and external sources, building models, and programming the output in SQL or C so the model can be inserted into a corporate database and run against all new and existing records. In essence, SAS analysts have been forced to augment their statistical knowledge with computer science skills to create and deploy their models.
However, David Smith took exception to my data scientist analogy. David is VP of marketing at Revolution Analytics, which sells a distribution of the popular R open source programming language for creating analytical models. David doesn't think SAS analysts have the computer science or social interaction (ouch!) skills that data scientists need. He likened them to mainframe developers (but that's not surprising given that David is immersed in the open source movement). David thinks the best data scientists have formal training or experience in both statistics and computer science: folks like, well, R programmers!
David emphasized that the goal of a data scientist is not to produce a model or report but to publish it so others can learn from it. To that end, data scientists need to know how to visualize quantitative data. (David then explained how R is terrific for visualizing data.) I wholeheartedly agree. But additionally, a data scientist needs to know how to deploy a model in an operational process where it can augment or automate decision making at the point of transaction or interaction. Embedded models that automate decisions should be the holy grail of a data scientist.
On the whole, my time at Strata was well spent. It was intellectually invigorating and culturally illuminating. In many ways, Strata is the TED of data conferences. I'll be back again!
Posted September 22, 2011 1:28 PM