Let’s take a look at big data. Corporations have discovered that there is a lot more data out there than they had ever imagined. There are log tapes, emails and tweets. There are registration records, phone records and TV log records. There are images and medical images. In short, there is an amazing amount of data.
Back in the good old days, there was just plain old transaction data. Bank teller machines. Airline reservation data. Point of sale records. We didn’t know how good we had it in those days. Why, back in the good old days, a designer could create a data model and expect the data to fit reasonably well into that model. Or the designer could define a record type to the database management system. The system would capture and store huge numbers of records that all had the same structure. The only thing that differed was the content of the records.
Ah, the good old days – where there was at least a semblance of order when it came to managing and understanding data.
Take a look at the world now. There just is no structure to some of the big data types. Or if there is an order, it is well hidden. Really messing things up is the fact that much of big data is in the form of text. And text defies structure. Trying to put text into a standard database management system is like trying to put a really square peg into a really round hole.
Enter Hadoop. With a linear structuring of data and an ability to store very large amounts of data, Hadoop is the answer for big data. Or so we are told. With Hadoop we can store text to our heart’s content. And that solves the problem.
Or does it solve the problem? Certainly one issue in handling text is the sheer physical volume that has to be stored. And another issue is that text is extraordinarily irregular. Hadoop addresses both of those.
Hadoop works until we look further into what is needed for truly understanding and managing text. It turns out that there are many facets to the ability to store and manage text. What about understanding text? Does Hadoop even come close to addressing the issues of understanding text? Let’s look at some really simple issues that relate to text:
- Date standardization. We have ten documents stored in Hadoop. One document has the value: Dec 6, 2011. Another document has the value: 2011/12-06. Another document has the value: sixth of December in the year 2011. Another document has the value: mil novocientos noventa nueve, diciembre seis. Does Hadoop have any problem understanding and comparing these values?
- Terminology. One document has “fractured tibia” and another document has the value “disarticulated ulna.” Does Hadoop understand that these documents are both talking about a broken bone?
- Shorthand. Does Hadoop understand that “U B-F W H Inmon flt 367 DIA-LAX 2011/06/13” really means that Bill Inmon has been upgraded from business class to first class on flight 367 from Denver to Los Angeles on June 13, 2011?
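To make the date example concrete, here is a minimal Python sketch of the kind of standardization step that has to sit on top of the storage layer. It is an illustration only, not anything Hadoop provides: the `KNOWN_FORMATS` list and the `normalize_date` function are hypothetical names, the format list is an assumption that covers just two of the values above, and month abbreviations are assumed to be parsed under the default English/C locale.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of formats we expect to see; a real corpus would need
# many more, plus handling for spelled-out and non-English dates.
KNOWN_FORMATS = [
    "%b %d, %Y",   # Dec 6, 2011
    "%Y/%m-%d",    # 2011/12-06
    "%Y/%m/%d",    # 2011/12/06
]


def normalize_date(raw: str) -> Optional[date]:
    """Return a date if `raw` matches a known format, otherwise None."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    return None  # the value is stored, but not understood


if __name__ == "__main__":
    for value in ["Dec 6, 2011", "2011/12-06",
                  "sixth of December in the year 2011"]:
        print(repr(value), "->", normalize_date(value))
```

The first two values resolve to the same date and can be compared. The spelled-out value comes back as `None`: the text is stored perfectly well, but nothing has understood it. Every new way of writing a date means more work at this level, and none of that work is done by the storage mechanism.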
The answer to these questions is that under the best of circumstances, Hadoop addresses only SOME of the issues of reading and handling text. There is another entirely different level of data management that is needed in order to claim that Hadoop “manages” text.
So let’s be clear about this. Hadoop is a storage mechanism – an infrastructure – not a solution.