Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor neutral content to business intelligence (BI) professionals worldwide. Wayne’s consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI) where he oversaw the company’s content and training programs and chaired its BI Executive Summit. He can be reached by email at weckerson@techtarget.com.

After attending several big data conferences, I had to ask myself, "What's really new here?" After all, as a data warehousing practitioner, I've been doing "big data" for some 20 years. Sure, the scale and scope of the solutions have expanded along with the types of data that are processed. But much of what people are discussing seems a rehash of what we've already figured out.

After some deliberation, I came to the conclusion that there are six unique things about the current generation of "big data," which has become synonymous with Hadoop. Here they are:

  1. Unstructured data. Truth be told, the data warehousing community never had a good solution for processing unstructured and semi-structured data. Sure, we had workarounds, like storing this data as binary large objects or pointing to data in file systems. But we couldn't really query this data with SQL and combine it with our other data (although Oracle and IBM have pretty good extenders to do just this). But now with Hadoop, we have a low-cost solution for storing and processing large volumes of unstructured and semi-structured data. Hadoop has quickly become an industry "standard" for dealing with this type of data. Now we just have to standardize the interfaces for integrating unstructured data in Hadoop with structured data in data warehouses.
  2. HDFS. The novel element of Hadoop (at least to SQL proponents) is that it's not based on a relational database. Rather, under the covers, Hadoop is a distributed file system into which you can dump any data without having to structure or model it first. Hadoop Distributed File System or HDFS runs on low-cost commodity servers, which it assumes will fail regularly. To ensure reliability in a highly unreliable environment, HDFS automatically shifts reads and processing to a server holding a replica if one server fails. To do this, it requires that each block of data be stored as three copies placed on different servers, racks, and/or data centers. So with HDFS, your big data takes up three times the space of your raw data. But this data expansion helps ensure high availability in a low-cost processing environment based on commodity servers. (A minimal sketch of loading a file into HDFS with this replication factor follows the list.)
  3. Schema at Read. Because Hadoop runs on a file system, you don't have to model and structure the data before loading it, as you would with a relational database. Consequently, the cost of loading data into Hadoop is much lower than the cost of loading data into a relational database. However, if you don't structure the data up front at load time, you have to structure it at query time. This is what "schema at read" means: whoever queries the data has to know the structure of the data to write a coherent query. In practice, this means that only the people who load the data know how to query it. This will change once Hadoop gets some reasonable metadata, but right now, querying Hadoop is a buyer-beware proposition.
  4. MapReduce. Hadoop is a parallel processing environment, like most high-end, SQL-based analytical platforms. Hadoop spreads data across all its nodes, each of which has direct-attached storage. But writing parallel applications is complex. MapReduce is an API that shields developers from having to know the intricacies of writing parallel applications on a distributed file system. It takes care of all the underlying inter-nodal communication, error checking, and so on. All developers need to know is which elements of their application can be parallelized and which cannot. (The word-count sketch after this list illustrates both this point and the schema-at-read point above.)
  5. Open source. Hadoop is free; you can download it from the Apache Foundation and start building with it. For a big data platform, this is a radical proposition, especially since most commercial big data software easily carries a six- to seven-digit price tag. Google developed the precursor to Hadoop as a cost-effective way to build its Web search indexes and then made its intellectual property public so that others could benefit from its innovations. Google could have used relational databases to build its search indexes, but the cost of doing so would have been astronomical, and it would not have been the most elegant way to process Web data, which is not classically structured.
  6. Data scientist. You need data scientists to extract value from Hadoop. From what I can tell, data scientists combine the skills of a business analyst, a statistician, a business domain expert, and a Java coder. In other words, they really don't exist. And if you can find one, they are expensive to hire. But the days of the data scientist are numbered; soon, the Hadoop community will deliver higher level languages and interfaces that make it easier for mere mortals to query the environment. Meanwhile, SQL-based vendors are working feverishly to integrate their products with Hadoop so that users can query Hadoop using familiar SQL-based tools without having to know how to access or manipulate Hadoop data.
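To make the HDFS point (item 2) concrete, here is a minimal Java sketch that copies a raw file into HDFS and sets its replication factor to three using the standard Hadoop FileSystem API. The NameNode address and the file paths are hypothetical placeholders, so treat this as an illustration of the idea rather than production code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the NameNode; the host below is a placeholder.
            // (Older releases use the property name "fs.default.name" instead.)
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

            FileSystem fs = FileSystem.get(conf);
            Path local = new Path("/tmp/clickstream.log");   // hypothetical raw file
            Path remote = new Path("/raw/clickstream.log");  // destination in HDFS

            // Copy the raw file into HDFS as-is -- no modeling or schema required up front.
            fs.copyFromLocalFile(local, remote);

            // The default replication factor is already 3; setting it explicitly here just
            // illustrates that each block ends up as three copies on different nodes.
            fs.setReplication(remote, (short) 3);

            fs.close();
        }
    }

Once the file lands in HDFS, the NameNode spreads its three block replicas across different servers and racks, which is what lets the cluster shrug off routine hardware failures.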
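To ground schema at read (item 3) and MapReduce (item 4), here is a lightly commented sketch of the canonical word-count job from the Hadoop documentation, written against the Hadoop 2.x "mapreduce" API (older releases construct the Job object slightly differently). The mapper imposes its own "schema" on raw text lines at read time by splitting on whitespace, and the developer writes only the map and reduce functions; the framework handles the distribution, shuffle, and error handling across nodes.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // The mapper sees raw, unmodeled lines of text; any "schema" (here, splitting
        // on whitespace) is applied at read time by the code itself.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // The reducer receives all counts for a given word, no matter which node's
        // mapper produced them; the framework handles the intermediate shuffle.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would package this in a jar and submit it with "hadoop jar", passing the HDFS input and output paths as the two command-line arguments; everything else, from task scheduling to retrying work from failed nodes, is the framework's problem.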

So, those are the six unique things that Hadoop brings to the market. Probably the most significant is that Hadoop dramatically lowers the cost of loading data into an analytical environment. As such, organizations can now load all their data into Hadoop with financial or technical impunity. The mantra shifts from "load only what you need" to "load in case you need it." This makes Hadoop a much more flexible and agile environment, at least on the data loading side of the equation.


Posted February 19, 2013 11:59 AM

Comments

"As such, organizations can now load all their data into Hadoop without financial or technical impunity."

should be

"As such, organizations can now load all their data into Hadoop with financial or technical impunity."

right?

Hi Wayne,

just to stir up a little discussion on BeyeNETWORK ...

Mostly, I agree with your observations:

You wrote "After all, as a data warehousing practitioner, I've been doing 'big data' for some 20 years. Sure, the scale and scope of the solutions has expanded along with the types of data that are processed. But much of what people are discussing seems a rehash of what we've already figured out."

I support you with having written [1] "almost all technologies introduced there are engaging in making the solutions inexpensive and affordable. In other words, if money would not be an issue, we would not have the issues expressed by 'big. ' It is because all of them can be resolved by the technologies already available today, at least available at the Pentagon or FBI. For instance, with the massively parallel processing technology like that employed by Teradata, available for decades, all challenges considered there can be mastered efficiently, although not necessarily inexpensively."

You wrote "Probably the most significant is that Hadoop dramatically lowers the cost of loading data into an analytical environment."

I support you with having written [1] "The other support for this claim is the fact that the whole 'big data' movement was induced by the availability of Hadoop, an inexpensive open source product. Which new 'big data' technology in discussion is not related to Hadoop? If you are rich, the data is 'masterable.' Otherwise, it is 'challenging.' In essence, it is about an economical challenge and struggle."

You wrote "2. HDFS. … Hadoop Distributed File System or HDFS runs on low-cost commodity servers, … in a low-cost processing environment based on commodity servers." "5. Open source. Hadoop is free; … especially since most commercial big data software easily carries a six-to seven-digit pricetag."

I support you with having written [1] "Inexpensive Everything -- Actually, all such 'big data' technologies could be called 'inexpensive' technologies. There were collected 'inexpensive' CPUs like Intel used by Teradata. There were collected 'inexpensive' disks employed in RAID. Now, we have collected 'inexpensive' memory for in-memory processing, collected 'inexpensive' nodes for Hadoop and cloud computing, 'inexpensive' software as grout material making the collections appearing jointless and, generally, 'inexpensive' technologies of all categories for mastering the challenging data." "As a matter of fact, almost all inexpensive technologies mentioned here aim at infrastructures for mastering the challenging data. Therefore, we can consider them inexpensive infrastructural technologies (ii-technologies) as a category."

You wrote "1. Unstructured data. …" "4. MapReduce. …"

I support you with having written [1] "Actually, this is not sufficient for an effective mastering of the challenging data. More importantly, we still need effective analytic algorithmic technologies (aa-technologies) for substantial tasks like pattern recognition and visualization to make the story perfect. These are, in fact, classic topics of the areas of data mining and knowledge discovery and, in general, more challenging. The ii-technologies are quantity-related, external-circumstance-dependent and, thus, usually have a relatively short life as a star, whereas the aa-technologies are quality-related, internal-substance-dependent and, therefore, and can have a much longer stay on the stage if they are sufficiently smart."

However, I do not always agree with you:

You wrote "3. Schema at Read. …" I do not consider this point an advancement. The schema, the structure or the syntax are one of the most efficient means to treat meaning or semantics understandable by programs/machines. As soon as you "get some reasonable metadata" understandable by programs/machines, I call it "operative metadata" [2], you do put structure to the data.

You wrote "6. Data scientist. …" This term appeared mostly in the context of BI, instead of in that of "big data."

[1] "Metathink: Big Data or Challenging Data?" (http://www.b-eye-network.com/view/16753)
[2] "Data Warehouse Construction: Compiler, Interpreter and Operator" (http://www.b-eye-network.com/view/16816)

Yup! I'll change it. Thanks

Bin,

Thanks for your thoughtful reply. I agree with most of what you say, except that there are things that are too cumbersome to do in a SQL RDBMS that are easier in Hadoop. Plus, Schema at Read is rather primitive, but it's different from the way we process data in the SQL RDBMS world.

Wayne,

It's a pleasure to get such broad agreement from you! Now let's take a short look at the single "exception," namely the inherent differences between Hadoop and SQL RDBMSs.

The major inherent attraction of Hadoop is its inexpensiveness in comparison with commercial SQL RDBMSs. By the same comparison, the major inherent weakness of Hadoop is its read-only characteristic (i.e., its un-updatability). That is, we cannot modify any file stored in the system without first removing it completely and then writing the modified version of the file back into the system. In this sense, it could be regarded as a huge rewritable disk.

The file size limit of HDFS is about 512 yottabytes (one YB = 10^24 or 2^80 bytes). That of 64-bit Linux is 8 exabytes (one EB = 10^18 or 2^60 bytes). This is the object size limit to which a 64-bit-Linux-based SQL RDBMS could be extended with its BLOB or CLOB types for storage and processing (BLOB = binary large object; CLOB = character large object). Note that the recent IDC Digital Universe study found that 2.8 zettabytes (one ZB = 10^21 or 2^70 bytes) of data were created and replicated in 2012 on our planet. Clearly, there is no single file that exceeds the 8-exabyte limit supported by a SQL RDBMS. In other words, this difference is not very relevant to our concern, even if it could be inherent.

With BLOB and CLOB, we can store any type of data that Hadoop can. If there are special methods for processing data in the Hadoop world, why can we not import them into the SQL RDBMS world?

All SQL RDBMS’s are based on file systems, standard or native ones. All of them are updatable. The un-updatability of HDFS is an inherent bad thing for being the foundation of a DBMS.

