Blog: Rick van der Lans

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference, held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

Let's change the word big (in big data) to an acronym, so that BIG data stands for Business Intelligence Generated data. The reason for this proposal is that many are struggling with the term big data, myself included. There is a lot of confusion, because there is no generally accepted definition. We all know it's about large quantities of data, high velocity data, and/or a wide variety of data. But even then, what counts as a large quantity? When is velocity high, and when is it low? For some, big data is highly structured sensor data (machine generated data), for others it's textual unstructured data coming from social media, and there are those who say it's semi-structured data stored in, for example, weblogs.

The fact that the word big is a relative term doesn't help either. What is big for a midsize European company can be medium for a large US company. And is it really about the amount of data? Or is it more about what we do with it, for example, that we analyze the data (regardless of the quantity)? The V's (Volume, Velocity, Variety, Variation, Visibility, and Value--I've lost count of how many V's there are) are mentioned regularly to describe when something qualifies as big data.

Some have presented definitions, but I haven't seen an acceptable one yet. One author used the following definition: big data is data that is too much for a SQL database. This makes no sense. For example, there are plenty of multi-terabyte systems that everyone would classify as big data systems and that can be handled by SQL products more than satisfactorily.

Lastly, enough data is enough data. The quality of an analytical result doesn't always increase when the amount of data increases. Data quality is often more important than data quantity.

Conclusion: confusion rules when it relates to the concept of big data.

In this blog I look at big data systems from a different angle in the hope that this helps to clarify this muddled concept.

Undeniably, processing large quantities of data is a common characteristic of most big data systems, but there is another one: most such systems combine characteristics of production systems and of BI systems. In a sense, each big data system is a production system, because it collects and stores new data. It is also a BI system, because this new data is not collected to support business processes; the primary intention is to use it for some form of analytics, possibly embedded analytics (analytics embedded within production systems), operational analytics, or predictive analytics. By new data I mean data that the organization does not yet collect and store, and in many cases it's also a new type of data. For example, a big data system developed by a retail company may gather camera data for tracking customer routes through its stores. Or a big data system of a large international electronics firm may collect unstructured social media data for sentiment analysis.

Traditionally, new data is entered with and processed by production systems, such as general ledger, cash management, and claims processing systems. These systems are, however, designed to support business processes, not analytics. In fact, when they were designed, the focus was on supporting data entry, not on analytics. This is why, when developing BI systems, it's sometimes so hard to extract the right data from those production databases for analytical and reporting purposes--staging areas have to be developed, ETL and replication processes have to be designed, and so on. This is still true today: the designers of new production systems don't think about how the organization can use the data for analytical purposes.

In other words, big data systems are hybrid systems: they are both production systems and BI systems. In my opinion, this is what makes big data applications special--and, evidently, most of them collect massive amounts of data to support the required forms of analytics.

So, maybe we should redefine the term big data. Let's begin by no longer associating the word big with a relative quantity, and instead change it to an acronym, so that BIG data stands for Business Intelligence Generated data--data generated and stored with the primary purpose to analyze it. Thus, a big data system is a system that generates, collects, stores, and processes data specifically to support business intelligence. Consequently, big data is data managed by a big data system.

Hopefully, redefining the term big data makes it more obvious what is meant by this promising category of systems and gets rid of some of the confusion.


Posted October 16, 2012 1:45 PM

2 Comments

Rick,

When we organize seminars about big data, we see the classic data-oriented vs document-oriented divide that we also saw in the early days of XML. There are probably fewer document-oriented big data solution developers, and they probably don't visit this b-eye-network.com site, but they can't be ignored. They are looking at graph databases and document stores, and are generally not interested in BI or in analytics.

For them, the Business Intelligence Generated Data definition will be hard to swallow, although I agree that most big data projects these days aim at analytics of the BIG data. As such, they will also have an easier time proving ROI than the document and graph-oriented big data projects.

Patrick @itworks

Rick, I fully agree with your conclusion that confusion rules when it relates to the concept of big data. But I start to disagree when you say that the primary intention of data stored by big data systems is to use it for some form of analytics. As I understand it, the confusing term 'big data' originates from the Googles, LinkedIns, Twitters and Facebooks of this world. I don't think that all the data stored by the systems these companies built is strictly for analytics. Using the acronym BIG for Business Intelligence Generated data--data generated and stored with the primary purpose to analyze it--does not make the confusion go away for me, because it suggests that big data is the product of BI, which from my understanding is definitely not always the case.
