

What Are They Doing with Big Data Technology?

Originally published October 25, 2012

The title of this article is somewhat tongue-in-cheek because if you have been reading the popular technology press, you’d think that at this point everything being done is somehow associated with "big data." Yet while there is a lot of buzz, I remain somewhat skeptical about the business problems that are actually being solved as a result of using big data. This month I am sharing some thoughts based on a small amount of “unscientific” research into the specific tasks that rely on at least one big data technique.

Clearly, much of the buzz has centered on the open source tools collectively referred to as Hadoop, and my starting point is a review of the self-reported descriptions of those organizations that are using Hadoop. Conveniently, the Apache Software Foundation provides a wiki website entitled “Powered By Hadoop,” in which organizations report what they are doing. For example, social network company Facebook reports that it is using Hadoop “to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.” Fortunately, on the order of 150 organizations have provided information about their applications, the size and configuration of their implementations, or both.

A scan through the project descriptions yields some interesting, yet not unexpected results. First, many of the projects involve document indexing, preparation of search indexes, or application of algorithms for search optimization. Second, a number of projects describe more traditional applications for reporting that rely on Hadoop for performance optimization. Third, many of the applications describe the use of the framework for archiving and storage.
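
As a concrete illustration of that second group, the sketch below shows what such a reporting job often looks like in practice: a Hadoop Streaming mapper and reducer, written here in Python, that aggregate page-hit counts from web log records. The three-field log layout and the file names mapper.py and reducer.py are hypothetical, chosen purely for illustration; they are not drawn from any of the reported projects.

    #!/usr/bin/env python
    # mapper.py -- emits one count per page hit.
    # Assumes a hypothetical tab-separated log line: timestamp, user_id, page.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            page = fields[2]
            print("%s\t1" % page)

    #!/usr/bin/env python
    # reducer.py -- sums the counts for each page. Hadoop sorts the mapper
    # output by key, so all records for a given page arrive contiguously.
    import sys

    current_page, count = None, 0
    for line in sys.stdin:
        page, value = line.rstrip("\n").split("\t")
        if page != current_page:
            if current_page is not None:
                print("%s\t%d" % (current_page, count))
            current_page, count = page, 0
        count += int(value)
    if current_page is not None:
        print("%s\t%d" % (current_page, count))

When run under Hadoop Streaming, the framework distributes the map tasks and performs the intermediate sort that groups each page's records before they reach the reducer; the notable point is how little of the code itself is new. The value lies in the scale at which it runs.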

I decided to segment the described applications into these general categories:

  • Indexing and search-related applications, such as web crawling, document (and presumably, other data artifact) scanning, filtering, indexing, creation of inverted indexes (a sketch of this pattern follows the list), and supplementing and then optimizing search.

  • Business intelligence, querying and reporting, with a number of project descriptions citing Hadoop to speed the data aggregation needed to answer queries for report generation, trend analysis, and general information retrieval.

  • Improved performance for common data management operations, with the majority focusing on storage of logs from multiple streams of transactions (such as web logs), data storage and archiving, as well as sorting, responding to queries, running joins, extraction/transformation/loading (ETL) processing, structural and semantic conversions and other types of data conversions, and duplicate analysis and elimination.

  • Data mining and analytical applications, including many recommendation and personalization algorithms, advertisement optimization, social network analysis, link analysis, facial recognition, profile matching, text analytics, web mining, machine learning, information extraction, and behavior analysis.

  • Non-database applications, such as image processing, natural language processing, RFID monitoring, scientific research, text processing in preparation for publishing, genome sequencing, protein sequencing and structure prediction, web crawling, and monitoring workflow processes.
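
As a concrete illustration of the first category, here is a minimal, single-process sketch of the inverted-index pattern referenced above; in a Hadoop job the map and reduce phases would be distributed across the cluster, but the underlying logic is the same. The document names and contents are invented purely for illustration.

    from collections import defaultdict

    # Toy document collection (invented for illustration).
    docs = {
        "doc1": "big data tools store log data",
        "doc2": "search indexes speed document retrieval",
        "doc3": "log analysis supports reporting",
    }

    # "Map" phase: emit (term, doc_id) pairs for every term occurrence.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in text.split()]

    # "Shuffle/reduce" phase: group document ids by term into posting lists.
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)

    for term in sorted(index):
        print(term, sorted(index[term]))
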
Reflecting on this microcosm of described applications provides a few interesting insights:

  • Many of the applications described are not much different from what people have been doing for a long time. This suggests that the driving forces for adopting innovative technology are still rooted in long-standing expectations for employing analytics to solve business problems, especially predictive analysis.

  • A lot of the focus is on performance improvement, largely in terms of cost optimization, especially with respect to making fuller use of available computational capacity. This is especially apparent with those organizations using big data for more than one type of analytical application, which suggests that much of the added value lies not in the development of new algorithms, but in enabling greater scalability while simultaneously providing a degree of elasticity to get increased efficiency from the resources.

  • The use of big data analytics and computation also supports existing “legacy” BI and analytical applications, suggesting better use of integrated platforms for mixed analytics workloads. This may better enable real-time adjustments to business processes, although that is really an extrapolation.

One of the more interesting insights one might draw from the review is that presumably there are many organizations that are building big data applications with Hadoop but choose not to report what they are doing, although this, too, is an extrapolation. The absence of what one might call “revolutionary” applications would seem somewhat surprising, unless the value being added is so great that publishing any details, or even acknowledging the application's existence, would lead to a competitive disadvantage.

But is this really the case? Perhaps time will tell, or perhaps we might ratchet back our collective hypnosis over the “bigness” of data and instead concentrate on how to use the available technologies: how value can be created, what types of information are necessary for creating that value, how the tools and technologies can extract that information from the selected data sources, and how to optimize resource utilization in ways that let a growing community of analytics customers benefit from the information.
