- Keynote: The Heat Death of the Data Warehouse, Thursday, 3 February, 9:25am
- Exec Summit session: The Data-driven Business and Other Lessons from History, Tuesday, 1 February, 9:45am
O'Reilly Media are making available a 25% discount code for readers, followers, and friends on conference registration: str11fsd. So, an added incentive to sign up...
Researching big data in preparation for the conference has been a fascinating experience, and it has brought up an intriguing sense of déjà vu. In fact, it reminds me of the early days of data warehouse tooling, when the emphasis was on speeds and feeds of ETL into the warehouse and how everybody needed a new-fangled OLAP database to do the latest and greatest dimensional modeling. Today, the excitement is around Hadoop and MapReduce and the volumes of data they can chew through, and the statistical and text analytics that will ultimately find that gold nugget of "unknown unknown" information in the data exhaust of your web site usage.
This is pioneering work, to bravely go in the data universe where no man has gone before. It is exciting and produces great stories of challenges overcome, volumes never before processed and behaviors never before correlated. These are the big promises of big data. It should, however, be a salutary lesson for all that the old data mining / data warehousing nugget from the 1990s, "Men who buy diapers on Friday evenings are also likely to buy beer", is now widely believed to be an urban legend rather than a true story of unexpected and momentous business value.
Moreover, those of us who were around big data for many years before the phrase became popular (back in the 1980s, a few hundred MB of data was BIG) know that big responsibilities soon catch up with big promises. It's a lot easier to run some experimental analyses on big data than to move the whole process into ongoing production. Playing with huge volumes of web log data is fine until you realize that you have to comply with privacy and other regulations and that you have to store the data for seven years to enable future audits on your decision making. At that moment, you begin to realize why databases have the consistency and sustainability characteristics they do, and that clever parallel processing and distributed file systems have limitations as well as strengths.
That said, Hadoop and MapReduce are demonstrating some fascinating possibilities in parallel processing of complex data, and as multi-core processors come out with ever more cores, we certainly need new ways to take advantage of them. Database vendors from Aster Data to Teradata are also exploring the possibilities both on the analysis side and on the data sourcing side. Given the hype, it seems likely we'll hear a lot more about these types of usage in 2011. But in the long term, the big winners will be the companies who are serious about productionalizing these techniques rather than necessarily those who have the biggest data.
I wish you a Happy Christmas and a Peaceful and Prosperous New Year.
Posted December 20, 2010 8:46 AM



