Data Scientists and Big Data

Originally published May 2, 2013

Almost everywhere you look, you see something about big data. The marketing machines of the big technology vendors have been working in overdrive, hyping and selling big data. Executives at every large corporation have been told that they need to invest in big data in order to stay competitive.

And because of this hype, there is much talk of the “data scientist,” a term that heretofore has gone unmentioned. Why has the role of the data scientist now come to the forefront? What is going on here?

Big data tends to arrive with a lot of unstructured data. In certain implementations of big data, unstructured data is all there is. Organizations that have been sold on big data therefore end up with a large collection of unstructured data, and in order to make sense of that collection, they are told they need a data scientist.

But why can’t you just query unstructured data? It turns out that unstructured data does not have those convenient pieces of metadata called attributes. Instead, there is just raw text. And you can only do basic queries on raw text. You can find out whether a predetermined value exists, and you can count the number of occurrences of that value. And that’s pretty much it. There is not a lot of business value to be gleaned from an existence-and-count query.
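To make the limitation concrete, here is a minimal sketch of the two basic queries described above, run against raw text with no attributes. The function names `exists` and `count_occurrences` and the sample sentence are hypothetical, chosen only for illustration.

```python
import re


def exists(text: str, value: str) -> bool:
    """Return True if the predetermined value occurs anywhere in the raw text."""
    return value.lower() in text.lower()


def count_occurrences(text: str, value: str) -> int:
    """Count non-overlapping, case-insensitive occurrences of value in the raw text."""
    return len(re.findall(re.escape(value), text, flags=re.IGNORECASE))


# Hypothetical raw text: no columns, no attributes, just characters.
sample = "Customer called about a refund. Refund was approved. Customer satisfied."

print(exists(sample, "refund"))             # existence query
print(count_occurrences(sample, "refund"))  # count query
```

Neither query can say who asked for the refund, how much it was, or when it was approved, because nothing in the raw text labels those values.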

An alternative is to write thousands of lines of code in MapReduce. Then you can do sophisticated queries. But writing code in MapReduce is like writing code in assembler. We learned a long time ago that assembler was not really a user-friendly tool. We also learned that assembler code was difficult to maintain, was slow to develop, and had to be written on a customized basis.

In short, writing tens of thousands of lines of code in MapReduce is not really a good or viable idea.
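For readers who have not seen the programming model, the following is a toy, single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not the Hadoop API, and the function names and sample documents are assumptions made for illustration. Even the simplest count has to be spelled out as three separate steps; anything more sophisticated grows from there.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one document."""
    for token in document.lower().split():
        word = token.strip(".,;:!?")
        if word:  # skip tokens that were only punctuation
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input documents.
docs = ["Big data is big.", "Data scientists are rare."]
counts = word_count(docs)
print(counts)
```

A real MapReduce job adds job configuration, serialization, and cluster concerns on top of this logic, which is where the maintenance burden described above comes from.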

Thus, in order to get our money’s worth out of big data, we are stuck with the data scientist. Now, “data scientist” is an interesting term. Which university is it that trains data scientists? And what is the curriculum of study for a data scientist? And exactly how many data scientists are there?

If data scientists are the answer to churning through unstructured data, then we are in real trouble. Why? Because there are tens of thousands of corporations that are being sold on big data, and there are not nearly enough data scientists to go around. A data scientist is as rare as a clear blue pool of water in the Sahara desert.

The Dilemma

We have a mountain of unstructured data coming our way. It is a bit like a tsunami, and we are poised at the edge of the beach. We can’t write code in MapReduce, and we can’t count on having enough data scientists to fill the demand.

As my dear mother used to say – it is a real poser.

  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

    Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!
