Two Similar Categorizations
There are two categories of animals: domestic animals and wild ones. Similarly, there are two categories of data: masterable data and challenging data. By the former, we mean data that can be treated efficiently and effectively with the technologies available and affordable today; by the latter, we mean the opposite.
A Remarkable Behavioral Phenomenon
"Big data," the most fashionable buzzword in the community today, represents a behavioral phenomenon. In fact, it is not a completely new one if we recall what happened with the term "data warehouse" in the past (see Is Yours Really a Data Warehouse?). This time, however, it goes a little further. By my observation, almost every article, interview, or product prospectus talking about "big data" provides its own definition of the term. How many articles on this topic have you already read? 1,001? That should equal the number of definitions of "big data."
Psychology Behind "Big"
"Big" in our context undoubtedly serves marketing and advertising purposes, since it attracts our attention to a remarkable degree (adhering to the first letter of the marketing acronym AIDA). Everything regarded as big is more or less difficult to handle, if handling it is possible at all.
Psychologically, it implies a certain dose of danger, risk, and uncertainty. Some people fear danger and dislike risk, while others are presumably stimulated by uncertainty. Our instinct, shaped by millions of years of evolution, compels us to pay attention to anything that could be dangerous or risky; thus a "big" thing draws our attention. Either way, we pay attention to "big," mostly unconsciously.
"Dad, Did You Speak English?"
The attention induced by "big," however, is mostly due to the volume, mass, or weight of the objects in question, rather than their velocity, variety, complexity, or anything else. For instance, if you hear that something is big, your first unconscious reaction would be a tensing of your arm and shoulder muscles, not a tensing of the legs or a quickening of the mind. Furthermore, if you told your little daughter Mary, who is watching a fast-running mouse, that the mouse is big, how would she react? "Dad, did you speak English?" might come out of her small mouth. Do you expect attention at such a cost? In fact, the associations and reactions induced by "big," i.e., the impact of its ultimate semantics, are not exactly those we expect and need for characterizing the data in question. There are at least three additional aspects of data, namely temporal, structural, and qualitative ones, that are not covered by the standard meaning of "big." In short, the word choice is unfortunate. If we insist on treating our basic vocabulary this way, we either wreck the language or madden all the people like little Mary, in favor of marketing.
An Effective Communication?
The meanings of terms are the results of conventions, documented in dictionaries or the like and propagated by teaching in schools. Any term that conveys a meaning other than the original convention inevitably induces inconvenient associations or mental discomfort. Although marketing can exploit this effect to attract attention, remedying it necessitates additional effort for explanation, which in turn can cause further misunderstandings. Today, almost everyone wants to use "big." Obviously, however, nobody is sure whether his "big" is her or their "big." To make sure that his "big" is understood, every author defines his "big" in his article or product prospectus. However, I am not sure whether he is sure that the readers, at the end of the article, still keep his intended "big" in mind. Is this the way we learned to communicate effectively?
Why Not Use "Challenging Data" Instead?
Apart from the marketing purpose, what do we want to express when we use "big" to denote data? We want to say that it is not possible to treat the data in question efficiently and effectively using the technologies available and affordable today, from diverse viewpoints such as volume, temporality, structure, form, quality, consistency, etc. In other words, doing so is still a challenge for us. If this is the case, why do we not use the term "challenging data" directly to express this distress, as argued at the beginning of this article? Even from the standpoint of marketing attention, "challenging" is just as attractive, since it implies danger, risk, and uncertainty just as "big" does, without wrecking a word's original meaning. We often claim that we like challenges. Do we really? On the other hand, do you really like big things? I am not quite sure.
The Poorer You Are, the Bigger Is Your Data
If you scan the interview reports provided by Ron Rowell in the past, and have a look at the article What Are They Doing with Big Data Technology by David Loshin or the blog New Technologies for Big Data by Wayne Eckerson, you will notice that almost all the technologies introduced there are aimed at making the solutions inexpensive and affordable. In other words, if money were not an issue, we would not have the issues expressed by "big," because all of them can be resolved by technologies already available today, at least to the Pentagon or the FBI. For instance, with massively parallel processing technology like that employed by Teradata, available for decades, all the challenges considered there can be mastered efficiently, although not necessarily inexpensively. Further support for this claim is the fact that the whole "big data" movement was triggered by the availability of Hadoop, an inexpensive open-source product. Which new "big data" technology under discussion is not related to Hadoop? If you are rich, your data is "masterable." Otherwise, it is "challenging." In essence, it is an economic challenge and struggle.
Actually, all such "big data" technologies could be called "inexpensive" technologies. Teradata assembled collections of "inexpensive" CPUs such as Intel's; RAID assembled collections of "inexpensive" disks. Now we have collections of "inexpensive" memory for in-memory processing, collections of "inexpensive" nodes for Hadoop and cloud computing, "inexpensive" software as the grout that makes these collections appear seamless and, generally, "inexpensive" technologies of all categories for mastering the challenging data.
II- and AA-Technologies (added on the last day of 2012)
As a matter of fact, almost all the inexpensive technologies mentioned here aim at infrastructures for mastering the challenging data. Therefore, we can consider them a category of inexpensive infrastructural technologies (II-technologies). This alone, however, is not sufficient for effectively mastering the challenging data. More importantly, we still need effective analytic algorithmic technologies (AA-technologies) for substantial tasks like pattern recognition and visualization to make the story complete. These are, in fact, classic topics in data mining and knowledge discovery and, in general, more challenging. The II-technologies are quantity-related and dependent on external circumstances, and thus usually have a relatively short life as a star, whereas the AA-technologies are quality-related and dependent on internal substance, and can therefore stay on the stage much longer if they are sufficiently smart.
Does the 80-20 Rule Work in This Case as Well?
To the question "How much big data does your company have?", most answers I have read indicate about 80% of all data. To the next question, "How much value does this big data contain compared with the 'small' data?", I would expect an answer of 20% of the whole value contained in data of all kinds, if we speculate according to the 80-20 rule (the Pareto principle). Compactly formulated, the relative value-density of the big data to the small data is 6.25%. The calculation is as follows:
- The big data contains 20% of the entire value with 80% of the entire data volume: The absolute value-density of the big data = 20/80
- The small data contains 80% of the entire value with 20% of the entire data volume: The absolute value-density of the small data = 80/20
- Thus, the relative value-density of the big data to the small one = (20/80)/(80/20) = 1/16 = 6.25%
This means that with the big data, you have to store and process 16 times as much as with the small data to obtain the same amount of value. Assume that your company has invested $8 million (USD) to extract all the value contained in the small data. To mine all the value hidden in the big data, the investment should then be no more than $2 million; otherwise, it is not economical. Is your favorite new "inexpensive" technology so cheap ($2 million) yet capable of efficiently and effectively treating 4 times the volume of the corresponding small data (whose treatment cost $8 million)?
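The arithmetic above can be sketched in a few lines of Python. The 80-20 split and the $8 million investment are the article's speculative figures, not measurements:

```python
# Speculative 80-20 split: big data is 80% of the volume
# but holds only 20% of the value.
big_volume_share = 0.80
big_value_share = 0.20

# Absolute value-densities (value per unit of volume).
density_big = big_value_share / big_volume_share                # 20/80 = 0.25
density_small = (1 - big_value_share) / (1 - big_volume_share)  # 80/20 = 4.0

# Relative value-density of the big data to the small data.
relative_density = density_big / density_small                  # 1/16
print(f"relative value-density: {relative_density:.2%}")        # 6.25%

# Break-even budget: if mining the small data's 80% of the value
# cost $8 million, the big data's 20% justifies at most a quarter of that.
small_data_investment = 8_000_000
max_big_data_budget = small_data_investment * big_value_share / (1 - big_value_share)
print(f"maximum economical big-data budget: ${max_big_data_budget:,.0f}")  # $2,000,000
```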
Questions That Should Be Answered Prior to Big Data Initiatives
- How much value does your big data contain relative to that contained in your small data? Is it denser or thinner than the 80-20 rule suggests? Experimenting with a sample of your big data should give you some idea.
- Is the "inexpensive" technology you are considering cheap enough in a production environment, yet sufficiently capable of treating this big data efficiently and effectively?
- Have you already extracted so much value from your small data, stored, for instance, in your enterprise data warehouse, that getting any more out of it would cost more than the investment planned for challenging your big data?
- How do you define your "big data" after all? Do you mean something other than "challenging data?"
Moore’s Law (added on January 3, 2013)
- According to Moore’s law, we assume, broadly, that the entire data-analyzing capability on the planet, including CPU, memory, storage, network, etc., doubles every two years without significant economic impact on our usual business.
- A recent IDC Digital Universe study found that 2.8 zettabytes of data were created and replicated in 2012, and predicts that the total amount of data on the planet will double every two years between now and 2020, coinciding exactly with Moore’s law.
- The same study also reports that less than 3% of this data can be analyzed with the analyzing capability available today.
- If both the analyzing capability and the data grow at the same rate according to Moore’s law, we will not be able to analyze all the data generated, even at the end of 2020.
- Assume that no data will be generated after 2020. Even so, with the analyzing capability continuing to grow, we will still not be able to analyze all the data generated before that point by the end of 2030.
- In other words, our analyzing capability can presumably never catch up with the growing data. This means that we will never know everything about our business that we could know, and we should never expect our decision making to be purely data-based. Should we strive further to catch up with the growing data in order to master it, or give up the hopeless perfectionist struggle?
- Last but not least: Is Moore’s law applicable to the growth of the value hidden in the data generated and still to be mined? If not, and if the costs of the analyzing capability grow somehow, then for whom are we busy?
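The catch-up argument in the bullets above can be checked with a short Python sketch. The 3% starting figure comes from the IDC study cited above; the lockstep doubling every two years is the assumption under test:

```python
# In 2012, less than 3% of the data on the planet can be analyzed (IDC figure).
data = 1.0
capability = 0.03

# Until 2020 both double every two years, so the ratio never improves.
year = 2012
while year < 2020:
    year += 2
    data *= 2
    capability *= 2
print(f"{year}: {capability / data:.0%} of data analyzable")  # still 3%

# Freeze data at its 2020 level; capability keeps doubling until 2030.
while year < 2030:
    year += 2
    capability *= 2
print(f"{year}: {capability / data:.0%} of data analyzable")
# 3% doubled five times is 96% -- still short of 100% at the end of 2030.
```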
Recent articles by Bin Jiang, Ph.D.