Creating a Big Data Platform: A Q&A with Billy Bosworth of DataStax

Originally published February 27, 2012

BeyeNETWORK Spotlights focus on news, events and products in the business intelligence ecosystem that are poised to have a significant impact on the industry as a whole; on the enterprises that rely on business intelligence, analytics, performance management, data warehousing and/or data governance products to understand and act on the vital information that can be gleaned from their data; or on the providers of these mission-critical products.

Presented as Q&A-style articles, these interviews conducted by the BeyeNETWORK present the behind-the-scenes view that you won’t read in press releases.

This BeyeNETWORK spotlight features Ron Powell's interview with Billy Bosworth, CEO of DataStax. Ron and Billy discuss why enterprises should consider using the NoSQL database Apache Cassandra for big data analytics and talk about the products and services DataStax provides for Apache Cassandra.

Billy, everyone's talked about big data. Can you give us your definition of “big data” and explain why it's so important?

Billy Bosworth: This is a really important question, and it's one that I think people are still coming to grips with in their own minds. If you talk to five different people and ask them this question, there's a good chance you're going to get maybe three to four different answers out of the five.

Personally, I think that Gartner's definition is the most accurate and the most complete. They define it with what they call the three V's and the C: variety, velocity, volume and complexity. I think it's very important to understand that this is a multifaceted definition. If you just think about it in terms of perhaps what would be considered the layman's definition of big data – oh, it must be big as in volume or it must be big as in petabytes – that's just one of several aspects of big data that are vastly important. If you think about variety, you're talking about not just structured data, but also semi-structured data and unstructured data and that creates a challenge for the traditional systems that have been around for a long time. Then if you think about velocity, you're talking about the speed at which you have to acquire your data and/or read your data. And again, some of the systems just weren't built to handle that type of linear scalability with data coming in on a seemingly never-ending basis. Then complexity speaks to the demands that are often implemented around the architecture for these new data requirements. For example, multi data center spanning is becoming more of an issue. It's become an issue not just for performance reasons with co-locating the data where the application is, but also for some disaster recovery and legal scenarios where people want their databases to span across multiple data centers or perhaps into the cloud. And so, I really think the best definition is the one that encompasses those four different elements of the data: the variety, the velocity, the volume, and the complexity.

Billy, that is a great definition, and I believe that Gartner really understands the importance of big data. DataStax provides products and services for Apache Cassandra. For those in our audience who are not familiar with Apache Cassandra, can you tell us why an enterprise should consider this NoSQL database for big data analytics in their enterprise?

Billy Bosworth: Maybe the best way to start that is to give you a definition of what Cassandra is just to make sure that everybody is on the same page. Cassandra is an open source so-called NoSQL database that provides continuous availability for real-time big data applications that tend to require extreme performance. It has a fully distributed architecture that was designed from the ground up for single or multiple data centers including the cloud, and it has optimization for the latency that can result from a WAN connection into the cloud. It's actually optimized for that. It provides something that we call location transparency to the application, which means that because it's fully distributed, you get this concept of read/write anywhere. You don't have to worry about where the master node is that does all the coordination for this. You simply write to any node in the cluster, and then it will handle the distribution throughout the database for you.

Now, the reason customers tend to turn to it is that Cassandra has been known for a long time to be a very robust, extremely scalable system. In fact, in the early days, it got that reputation based on its very high write throughput. So because of the way it's architected, it handles very high velocity throughput on the write side where it can keep up even with today's most challenging demands. Then people also like the fact that it is a fully distributed peer-to-peer architecture, which means every node is the same, and this introduces a concept of operational simplicity. With every node type being the same, when you go to grow the cluster or manage the cluster, you don't have to worry architecturally about when to add a master node or add a coordinator node. That concept doesn't exist. It’s continuously available, always failing over. You don't test failovers because failovers are inherent in Cassandra. So when companies are looking for that mission-critical, real-time big data system for those applications that have to be continuously available, that's when they tend to turn to Cassandra.
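The "write to any node" behavior described above can be illustrated with a toy model. The sketch below is not Cassandra code and uses no Cassandra API. The class name ToyRing, its methods, the MD5-based token, and the replication factor of 3 are all assumptions made for illustration. It only shows the general idea that, when every node is a peer, any node can accept a request and work out which nodes should hold the data.

```python
import hashlib
from bisect import bisect_right


class ToyRing:
    """Toy model of a masterless cluster (illustrative only, not Cassandra)."""

    def __init__(self, node_names, replication_factor=3):
        # Hypothetical setting: never ask for more replicas than there are nodes.
        self.replication_factor = min(replication_factor, len(node_names))
        # Place each node on a ring at a position derived from its name.
        self.ring = sorted((self._token(name), name) for name in node_names)
        self.tokens = [token for token, _ in self.ring]
        # Each node keeps its own local key/value storage.
        self.storage = {name: {} for name in node_names}

    @staticmethod
    def _token(value):
        # Map a string to a large integer position on the ring.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas_for(self, key):
        """Walk clockwise from the key's position and take the next N nodes."""
        start = bisect_right(self.tokens, self._token(key)) % len(self.ring)
        return [
            self.ring[(start + offset) % len(self.ring)][1]
            for offset in range(self.replication_factor)
        ]

    def write(self, coordinator, key, value):
        """Any node may coordinate; it forwards the write to the replicas."""
        if coordinator not in self.storage:
            raise KeyError(f"unknown node: {coordinator}")
        owners = self.replicas_for(key)
        for owner in owners:
            self.storage[owner][key] = value
        return owners

    def read(self, coordinator, key):
        """Any node may coordinate a read; it asks the replicas for the key."""
        if coordinator not in self.storage:
            raise KeyError(f"unknown node: {coordinator}")
        for owner in self.replicas_for(key):
            if key in self.storage[owner]:
                return self.storage[owner][key]
        return None


ring = ToyRing(["node1", "node2", "node3", "node4", "node5"])
owners = ring.write("node1", "user:42", "cart contents")  # write through node1
value = ring.read("node4", "user:42")                     # read through node4
```

The write goes through one node and the read through another, yet both resolve to the same replica set because placement depends only on the key and the ring, not on a designated master.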

Can you tell us a little bit about DataStax Enterprise and give us some examples of how your customers are using your product with Cassandra?

Billy Bosworth: With DataStax Enterprise, we offer several things. The first is that we offer a stable, fully supported version of Apache Cassandra. When a brand-new version of Cassandra comes out, it is almost always followed by several patches that happen in rapid succession as the release settles down. Some customers want more stability than that. They want to make sure that they're getting a version that is stabilized in the community, very much like the Red Hat model in that respect, and that’s what we offer.

Then we offer a layer around Cassandra where we take all the power of Cassandra, and we leverage it to create a big data platform. In that big data platform, we're going to bring other technologies that will reside on top of Cassandra so that you never have to move your data. It becomes a data-centric big data platform.

In DataStax Enterprise Version 1, you get the ability to take your Cassandra data with a Hadoop layer on top of it so that you can run your MapReduce jobs, your Hive jobs, and your Pig jobs right against the same data that's in Cassandra. But, and this is the most important point, we do so by guaranteeing workload isolation so that your long-running batch jobs will never conflict with your real-time jobs. This all runs on top of Cassandra's architecture, which, as I said earlier, is fully distributed and continuously available, so you don't have to worry about any of the challenges of a master/slave architecture. It's all built on top of Cassandra's architecture. In future releases, we're going to be adding more popular technologies that people are using for big data into this same platform to again provide that single data-centric big data platform.

The final piece we offer is a product called DataStax OpsCenter, which allows you to visually monitor and manage your Cassandra environment and your DataStax Enterprise environment.

You make the statement that ETL is not required with DataStax Enterprise. Can you give us a little insight into that?

Billy Bosworth: Yes, and that actually was the initial driver of why this platform was necessary because customers would come back and say, “I love Cassandra, I'm using it for my mission-critical system, and I'm now in a position where I have all of my system-of-record data in one spot.” But for anybody who has been around architectures for any length of time, you know that as soon as you have to ETL the data, the degree of separation that is created often breaks what is desired for most applications, which is a virtuous cycle of taking the data, learning from the data and adapting the application to the data, especially when it comes to analytics. Doing that when you have to do things that are ETL'd in batch windows becomes very difficult. What we decided was to take the inherent ability of Cassandra to handle this fully distributed database with what’s called a masterless architecture. There's no concept of a master node or a slave node. All the data is fully distributed for you, and we added the Hadoop layer on top of it so that you can then say, perhaps in a ten-node cluster, that you would like these four nodes to be your Hadoop nodes and the other six nodes to be your Cassandra nodes.

Well, the data is already there because Cassandra is going to handle keeping that data fully distributed across the cluster for you, and you don't do anything to make that happen. That is exactly how the system is designed to work from the moment that you do the installation. The data is already where it needs to be; now it's just a matter of bringing the interfaces to the data, bringing the compute power to the data. We can do that in Cassandra by putting our desired workload on any nodes and letting Cassandra do the data distribution. It all happens under the covers because of the Cassandra architecture, which eliminates that need to ETL it to a separate system that also requires a separate operations team that usually will then require a separate data team to figure out how to access that data and what to do with that data. With DataStax Enterprise, you can start to build a single data team for your big data needs with a single operator for the Cassandra environment, and then you can do everything that you need to on top of it.

In our next release that's coming out in March, we are going to give you the ability to change those node types. For example, on that same ten-node cluster, perhaps during the day you have a very high transactional system running and you need those six nodes devoted to the Cassandra real-time piece. But at night, maybe you're going to do some very long-running, highly intensive MapReduce jobs. You could actually take those six Cassandra nodes and convert three of them to MapReduce. So at night you could have seven nodes running your MapReduce jobs in batch, keep the three nodes running for your real-time system that may not be under heavy load, and then in the morning you can flip them back again. Why can you do that? Because the data is already distributed underneath the system for you. So eliminating that ETL opens up a whole new world of possibilities, and that's what we're delivering to the market with DataStax Enterprise.
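The day and night arithmetic in that ten-node example can be written down as a small sketch. This is only a bookkeeping illustration, not a DataStax Enterprise API. The function convert_nodes, the node names, and the role labels "analytics" and "realtime" are hypothetical.

```python
from collections import Counter


def convert_nodes(roles, from_role, to_role, count):
    """Return a new role map with `count` nodes moved from one role to another."""
    candidates = [node for node, role in roles.items() if role == from_role]
    if count < 0 or count > len(candidates):
        raise ValueError(f"cannot convert {count} of {len(candidates)} nodes")
    updated = dict(roles)
    # Take nodes from the end of the candidate list so a flip is easy to undo.
    for node in candidates[len(candidates) - count:]:
        updated[node] = to_role
    return updated


# Daytime layout from the example: four batch nodes and six real-time nodes.
day = {
    f"node{i}": ("analytics" if i <= 4 else "realtime")
    for i in range(1, 11)
}

# At night, three of the six real-time nodes take on batch work.
night = convert_nodes(day, "realtime", "analytics", 3)

# In the morning, the same three nodes flip back.
morning = convert_nodes(night, "analytics", "realtime", 3)

night_counts = Counter(night.values())
```

Here night_counts comes out as seven "analytics" nodes and three "realtime" nodes, and morning matches the daytime layout again. Only the role labels change between the three layouts, which reflects the point that no data has to be moved.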

So it would seem to me that you are substantially reducing the cost of processing big data as a result of your platform if you don't have to do ETL.

Billy Bosworth: You're doing a couple of things. Yes, you are absolutely eliminating cost in the sense of the administration cost, which is kind of the easiest to figure out on an ROI statement. But there are some hidden costs that emerge when you ETL, and they are really in your loss of ability to respond very quickly to what you learn in your MapReduce jobs. So people are trying to leverage big data as a strategic weapon. That is our vision statement at DataStax. We want to help people take big data and use it as a strategic weapon. Well, to do that, it has to be easy to access and it has to be available. It has to be in a virtuous cycle of taking your batch piece and having it feed your real-time piece. Those have to remain interconnected to really start to use this as a strategic weapon. And so, yes, it eliminates the simple ROI cost of the overhead of a separate system, separate teams, separate administrators, but it also propels you forward, not just in what you can save but what you can do. That's where it really gets exciting.

What other tools and support do you provide for Apache Cassandra, and can you give us some examples of the types of applications that are developed with DataStax Enterprise?

Billy Bosworth: Yes. As I mentioned earlier, our DataStax OpsCenter is our web-based visual monitoring and management tool for your DataStax Enterprise and your Cassandra environments, and it does automation, alerting, performance, all those things you would expect from a monitoring and management tool when you have an infrastructure database. It's very elegant, and it's very easy to use in the sense that it's extremely visual.

The second part of the question about what types of applications are being developed is very interesting. There are some great examples, and they cross a wide spectrum of use cases, which is great to see.

The first could be categorized under what we'll call time-series data. Time-series data tends to be data that is coming in at a high velocity at ingestion, so it’s going to require a high volume of writes, but also, and equally important, you're going to want to read that data back in a time-slice analysis. I want to know what happened in my system with user X during time periods Y through Z. That’s time-series data. In those scenarios, we have customers using it for things like SNMP logs that send back all the data of all the devices across all their data centers, and they have a single cluster actually spanning 12 different data centers, so this goes back to location transparency. It can write the data in its own data center, but then it gets replicated as needed all across the large Cassandra cluster. We also have a customer that does fleet management. They have tens of thousands of customers who own fleet vehicles with devices in them, and those devices are constantly feeding data back on where those vehicles are. That's time-series data.

We have website logs that look at the real-time stats and analysis of how people are navigating the websites. We have financial customers using it for stock exchange transactions. There are lots of different examples of that time-series implementation.
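The time-slice pattern described above ("what happened with user X during time periods Y through Z") can be sketched in a few lines. This is a toy in-memory model, not a Cassandra schema or query. The class ToyTimeSeriesStore, its method names, and the sample vehicle data are assumptions for illustration. The idea it shows is that events are grouped by a series key and kept in time order, so a slice read is a cheap range lookup.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyTimeSeriesStore:
    """Toy time-series store: one time-ordered list of events per series key."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # series key -> sorted timestamps
        self._payloads = defaultdict(list)    # series key -> payloads, same order

    def append(self, series_key, timestamp, payload):
        """Insert an event while keeping the series sorted by timestamp."""
        stamps = self._timestamps[series_key]
        index = bisect_right(stamps, timestamp)
        stamps.insert(index, timestamp)
        self._payloads[series_key].insert(index, payload)

    def time_slice(self, series_key, start, end):
        """Return (timestamp, payload) pairs with start <= timestamp <= end."""
        stamps = self._timestamps.get(series_key, [])
        payloads = self._payloads.get(series_key, [])
        low = bisect_left(stamps, start)
        high = bisect_right(stamps, end)
        return list(zip(stamps[low:high], payloads[low:high]))


store = ToyTimeSeriesStore()
# Hypothetical fleet data; events may arrive out of order.
store.append("vehicle-7", 100, "depot")
store.append("vehicle-7", 160, "highway")
store.append("vehicle-7", 130, "downtown")
store.append("vehicle-7", 220, "customer site")

window = store.time_slice("vehicle-7", 120, 200)
```

With this sample data, window holds the two events at timestamps 130 and 160, in time order, even though they were written out of order.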

We also have several customers using it for shopping carts because Cassandra has a very rich data model underneath it. It's very flexible so it's a great app for a shopping cart because not only does the data model handle it, but now you're into that high availability multi-data center kind of need. We have gaming customers that use it for in-game messaging. We have people using it for e-mailing and marketing campaign management. They blast a campaign and all of a sudden they're flooded with results. There is a big spike in activity requiring quick analysis, quick response, campaign adjustment – it is very fast, very high velocity. In the hospitality industry it’s used for hotel bookings and preferences. Media streaming companies use it to determine how you interact with the media stream, where you pause, what are you playing, when do you leave. I could go on and on but that's a pretty good sampling of the wide variety of applications that we're seeing used with Apache Cassandra.

Streaming media must really place high demands on systems because media files are very large. Could you elaborate further on that?

Billy Bosworth: Absolutely. But interestingly, it's probably not what you first thought. When you talk about streaming media, the files themselves are certainly big, and I think that's the first thought in everybody's mind. But the big data aspects of these streaming companies that are turning to Cassandra and DataStax are actually around the velocity and variety problems. They want to capture all your interactions with that data. The data itself actually doesn't reside in Cassandra. The big heavy media file sits somewhere else. That's delivered through a content manager somewhere or sits on some other storage system. What is stored inside of Cassandra is all the interaction.

Again, what did I pick, what did I choose, what did I choose prior, when did I pause, how often did I pause, where did I jump around, what did I watch immediately after, what did I listen to immediately after, how long did I listen to this song before I turned it off? On, and on, and on. That is an extremely high velocity problem because you have millions of data points coming in very, very rapidly from your user base because you have hundreds, or thousands, or millions of users all hitting these systems. It creates a very challenging problem, and typically you want to look at the data in time slices. You want to know how long over a certain period of time somebody listened to this number of songs or watched this number of videos, and so on. So sometimes it's not the obvious big data problem that you might think.

Considering the complexity of big data applications, how does a customer get started with DataStax Enterprise?

Billy Bosworth: It’s very easy to get. You can come to the website and click the download button. We also have a fast way to get you started with Cassandra called DataStax Community Edition. There's no registration required; no strings attached. We don't start spamming you right away when you download. This is truly a way to get somebody up and running with Cassandra in just a few minutes, and we actually have startup videos that can show you how to get up and running in less than five minutes. We have a sample app that comes with it with a sample schema already in place. We have all the right documentation; we have the right drivers and connectors. It’s everything you need in a single bundle. And the reason we do that is because one of the challenges of all of the wonderful Apache projects is you're limited sometimes in what you can do on the actual Apache site. What we do is take a version of Cassandra, the same Apache Cassandra that you find on the Apache site, and we just make it very easy to get up and running. Otherwise, it's not a super easy experience, and we want it to be. Obviously, the more that Cassandra is adopted, the more success we’ll have in the long term as a company, so part of this is to make sure that the community has everything they need to be up and running very quickly, reliably, with what they need to be productive fast. That's our goal. This is really just our way to help accelerate Cassandra adoption in the wild, so to speak, and our gift back to the community. We want to make it very fast, easy, and reliable to get started.

Sounds great. Thank you, Billy, for taking time to talk to our audience about DataStax and Cassandra.

  • Ron Powell
    Ron is an independent analyst, consultant and editorial expert with extensive knowledge and experience in business intelligence, big data, analytics and data warehousing. Currently president of Powell Interactive Media, which specializes in consulting and podcast services, he is also Executive Producer of The World Transformed Fast Forward series. In 2004, Ron founded the BeyeNETWORK, which was acquired by TechTarget in 2010. Prior to the founding of the BeyeNETWORK, Ron was cofounder, publisher and editorial director of DM Review (now Information Management). He maintains an expert channel and blog on the BeyeNETWORK and may be contacted by email at rpowell@powellinteractivemedia.com.

    More articles and Ron's blog can be found in his BeyeNETWORK expert channel.
