Big Data and Text

Originally published November 3, 2011

Let’s take a look at big data. Corporations have discovered that there is a lot more data out there than they had ever imagined. There are log tapes, emails and tweets. There are registration records, phone records and TV log records. There are images and medical images. In short, there is an amazing amount of data.

Back in the good old days, there was just plain old transaction data. Bank teller machines. Airline reservation data. Point of sale records. We didn’t know how good we had it in those days. Why, back in the good old days, a designer could create a data model and expect the data to fit reasonably well into the data model. Or the designer could define a record type to the database management system. The system would capture and store huge numbers of records that had the same structure. The only thing that was different was the content of the records.

Ah, the good old days – where there was at least a semblance of order when it came to managing and understanding data.

Take a look at the world now. There just is no structure to some of the big data types. Or if there is an order, it is well hidden. Really messing things up is the fact that much of big data is in the form of text. And text defies structure. Trying to put text into a standard database management system is like trying to put a really square peg into a really round hole.

Hadoop

Enter Hadoop. With a linear structuring of data and an ability to store very large amounts of data, Hadoop is the answer for big data. Or so we are told. With Hadoop we can store text to our hearts’ content. And that solves the problem.

Or does it solve the problem? Certainly one issue of handling text is the physical volume that it is stored on. And another issue of text is that it is extraordinarily irregular. But Hadoop addresses that.

Sort of.

Hadoop works until we look further into what is needed for truly understanding and managing text. It turns out that there are many facets to the ability to store and manage text. What about understanding text? Does Hadoop even come close to addressing the issues of understanding text? Let’s look at some really simple issues that relate to text:
  • Date standardization. We have ten documents stored in Hadoop. One document has the value: Dec 6, 2011. Another document has the value: 2011/12-06. Another document has the value: sixth of December in the year 2011. Another document has the value: mil novocientos noventa nueve, diciembre seis. Does Hadoop have any problem understanding and comparing these values?

  • Terminology. One document has “fractured tibia” and another document has the value “disarticulated ulna.” Does Hadoop understand that these documents are both talking about a broken bone?

  • Shorthand. Does Hadoop understand that “U B-F W H Inmon flt 367 DIA-LAX 2011/06/13” really means that Bill Inmon has been upgraded from business class to first class on flight 367 from Denver to Los Angeles on June 13, 2011?

The answer to these questions is that under the best of circumstances, Hadoop addresses only SOME of the issues of reading and handling text. There is another entirely different level of data management that is needed in order to claim that Hadoop “manages” text.
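
To make that extra level of data management a little more concrete, here is a minimal Python sketch of two of the issues above: date standardization and terminology. It is an illustration only, not a description of anything Hadoop provides. The names KNOWN_FORMATS, BROADER_TERM, standardize_date and broader_term are hypothetical, and the format list and vocabulary are assumptions that cover just the examples in this article.

```python
from datetime import date, datetime

# Assumed list of date layouts; real documents would need many more,
# plus handling for spelled-out and non-English dates.
KNOWN_FORMATS = ["%b %d, %Y", "%Y/%m-%d", "%Y/%m/%d"]

# Hypothetical vocabulary mapping specific terms to a broader category,
# following the article's example.
BROADER_TERM = {
    "fractured tibia": "broken bone",
    "disarticulated ulna": "broken bone",
}


def standardize_date(raw):
    """Return a date for a recognized layout, or None if unrecognized."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next layout
    return None


def broader_term(phrase):
    """Return the broader category for a phrase, or None if unknown."""
    return BROADER_TERM.get(phrase.strip().lower())


if __name__ == "__main__":
    for value in ["Dec 6, 2011", "2011/12-06",
                  "sixth of December in the year 2011"]:
        print(repr(value), "->", standardize_date(value))
    print(broader_term("fractured tibia") == broader_term("disarticulated ulna"))
```

Even this small sketch gives up on the spelled-out date, which is the point: the understanding has to come from rules and vocabularies layered on top of whatever stores the text.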

So let’s be clear about this. Hadoop is a storage mechanism – an infrastructure – not a solution.

  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

Comments

Posted November 22, 2011 by Prabhakar Aluri

Bill, point well taken that Hadoop can solve the infrastructural issues of handling big data but not the analytical portion of it. The emerging big data and text mining tools might well resolve the issues of big data analysis to some extent; the question is how capable they are of handling traditional transactional data. How do we integrate the analytics from the text data and the transaction data?

Not sure if these questions are answered yet in the big data analytics space.

Posted November 7, 2011 by Anonymous

Hi Bill – thanks for this informative post on big data and Hadoop’s role. Because there are so many new types of data out there like tweets, emails, status updates and more, companies have more potential to gain insight into their data and make smarter decisions using that intelligence. As you point out, however, it can be a challenge to make sense of semi-structured and largely unstructured data. That’s one of the reasons Teradata recently bought Aster Data, which offers a patented SQL-MapReduce framework. MapReduce processing lets companies analyze massive amounts of Internet clickstream data, sensor data and social-media content (in short, unstructured and multistructured data) and combine that information with the more traditional data sources. Big data integration is already happening — the next step is for organizations to develop best practices, think of new questions and figure out how to get the most value out of the new and emerging data sources.

- Jonathan Goldman, Director of Analytics and Applications, Teradata

 

Posted November 3, 2011 by Anonymous

Out of the box, a data warehouse is seldom useful to a business because each individual company is unique in the way that it does business. I have been working in data warehousing for some time now and have begun to enter the world of Hadoop. While I agree that it is a storage mechanism and an infrastructure out of the box, it has the ability to be much more than that once it's been customized and configured to meet specific business requirements.

The Hadoop Distributed File System (HDFS) is certainly the cornerstone of Hadoop's infrastructure, but the concept of Map-Reduce is Hadoop's real power. When there are massive amounts of data to process, it makes much more sense to "take the code to the data"--not the data to the code. It's simple enough for an individual programmer to customize but powerful enough for entire systems to be built on top of it.

I encourage you to check out the many related projects that are really the reason why Hadoop has become such a buzzword today. Hive is an example of this. It is a completely new way of approaching data warehousing while maintaining some familiarity with traditional data warehouse concepts.

Whether Hadoop works well as a data warehousing platform or not, one thing I think it can do extremely well is ETL. At its core, the concept of Map-Reduce is really the "T" in ETL. With its built-in libraries to access databases, file systems, and data streams, the "E" and the "L" are also relatively easy to work with. Linear scalability on commodity hardware for an ETL environment? That's sure to cause some waves.
