Sometimes what appears to be obvious and simple is anything but. Consider blood. Blood is red. Blood is wet. Nearly every animal we can name needs blood. So what is so difficult about blood? Well, consider this. There is human blood. There is horse blood. There is blood in a fish. And there is blood in a tiger or a lion. All of that blood is red and liquid, and it all has the same general properties.
But if you had to give someone a transfusion, would you want to put horse blood in a human? Or lion blood in a human? Of course not. You’d use human blood, but it gets even more specific. Would you want to put B negative blood into a person whose blood is type A or type O? You would not.
There is obviously more than meets the eye when it comes to understanding blood. The differences between types of blood are simple, but there are more than a few of them, and if you are a rank amateur who does not understand them, you are playing with fire. In short, blood is not universally interchangeable, regardless of its similarities.
Where else do we find simple differences like these? Let’s take data, for example. Data is data is data, or so it seems. But like blood, data is not nearly as simple as it appears to be on the surface.
For example, there is structured data and unstructured data. Unstructured data is just that – data laid out on a slab where there is just data and only data. And then there is structured data. When we lay structured data under the microscope and we start to examine it, what do we find? We find that in addition to data, there is some stuff called metadata that is tightly interwoven into our structured data.
What does metadata in a structured world look like? Metadata looks like attributes. In structured data, when we stumble across a name, there is metadata (an attribute) that tells us what we have. When we run across the name “Joe Foster,” there is something that tells us that Joe Foster is a customer, or a supplier, or a retired military officer. That attribute carries important information about Joe Foster.
Why is this attribution so important? Attribution (or metadata) allows us to do sophisticated processing/analysis against structured data. With structured data, we can ask questions such as:
- How many customers do I have?
- How many retired military officers are named “Joe”?
- How many people named “Joe” live in Alabama?
We can ask sophisticated questions when there are attributes attached to the data.
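As a minimal sketch of why attributes make these questions answerable, consider a handful of made-up records in Python. The field names (`name`, `role`, `state`), the sample people, and the helper functions are all assumptions chosen for illustration, not part of any real system.

```python
# Hypothetical structured records: every value is labeled by an attribute name.
people = [
    {"name": "Joe Foster", "role": "customer", "state": "Alabama"},
    {"name": "Joe Smith", "role": "retired military officer", "state": "Texas"},
    {"name": "Ann Lee", "role": "supplier", "state": "Alabama"},
    {"name": "Joe Brown", "role": "retired military officer", "state": "Alabama"},
]


def count_where(records, predicate):
    """Count the records for which predicate(record) is true."""
    return sum(1 for record in records if predicate(record))


def first_name(record):
    """Return the first word of the record's name attribute."""
    return record["name"].split()[0]


# How many customers do I have?
customers = count_where(people, lambda r: r["role"] == "customer")

# How many retired military officers are named "Joe"?
retired_joes = count_where(
    people,
    lambda r: r["role"] == "retired military officer" and first_name(r) == "Joe",
)

# How many people named "Joe" live in Alabama?
alabama_joes = count_where(
    people,
    lambda r: r["state"] == "Alabama" and first_name(r) == "Joe",
)

print(customers, retired_joes, alabama_joes)  # prints: 1 2 2
```

Each question becomes a filter on one or more attributes. Without the `role` and `state` labels, none of the three filters could be written.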
But when we have purely unstructured data, we can’t ask those questions. With unstructured data we can only ask, “How many times does the name ‘Joe Foster’ appear in the data?” We can’t ask about customers. Or retired military. Or residents of Alabama. There are no attributes in unstructured data that allow us to make anything but the most basic of queries. In short, because of the lack of attribution, we cannot do sophisticated analysis and processing of data that is unstructured.
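A minimal Python sketch of that limitation, assuming a made-up block of raw text and a hypothetical helper named `count_mentions`: with no attributes to filter on, about the only query available is a search for a literal string.

```python
# Hypothetical unstructured text: names appear, but nothing labels what they are.
text = (
    "Joe Foster called about an invoice. Later Joe Foster met Ann Lee. "
    "Joe Brown retired from the army and moved to Alabama."
)


def count_mentions(document, phrase):
    """Count non-overlapping occurrences of phrase in document."""
    return document.count(phrase)


joe_foster_mentions = count_mentions(text, "Joe Foster")
customer_mentions = count_mentions(text, "customer")

print(joe_foster_mentions, customer_mentions)  # prints: 2 0
```

The result counts mentions, not people, and a search for "customer" finds nothing because no attribute ever recorded who is a customer.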
There are indeed some important simple differences between data.
And why are those simple differences important? They are important because everyone is talking about big data today. You know, the kind of data found in Hadoop. Does anyone stop and realize that data landing in Hadoop without attributes is unstructured, and that you can do only the most basic of queries against that data?
Something to think about.
SOURCE: Data, Metadata and Big Data, by Bill Inmon