NewSQL Provides Power, Scalability and Performance: A Spotlight Q&A with Mark Sarbiewski of Clustrix

Originally published May 30, 2012

BeyeNETWORK Spotlights focus on news, events and products in the business intelligence ecosystem that are poised to have a significant impact on the industry as a whole; on the enterprises that rely on business intelligence, analytics, performance management, data warehousing and/or data governance products to understand and act on the vital information that can be gleaned from their data; or on the providers of these mission-critical products.

Presented as Q&A-style articles, these interviews conducted by the BeyeNETWORK present the behind-the-scenes view that you won’t read in press releases.

This BeyeNETWORK spotlight features Ron Powell's interview with Mark Sarbiewski, CMO of Clustrix. Mark and Ron talk about how Clustrix has created a revolutionary new platform that provides massively parallel processing software in an appliance.

Mark, let's begin by having you give us an overview of Clustrix and the types of problems Clustrix is solving today.

Mark Sarbiewski: Absolutely. Clustrix is a venture-backed company, backed by Sequoia and U.S. Venture Partners. It has actually been in existence for close to six years, but the vast majority of that time was spent building what we consider a revolutionary new database solution and platform. We talk about that platform as a distributed NewSQL database.

What do you mean by a NewSQL database?

Mark Sarbiewski: Five years ago our founders, Paul Mikesell and Sergei Tsarev, got together and asked themselves, “What would you build if you had to build a database today that would scale in the way that the world demands now and had all the power of the legacy SQL databases but some great new capabilities?” They obviously started with an idea around massively parallel processing. So, number one, this is a massively parallel processing database with a shared-nothing architecture, married to all the latest and greatest thinking around how to maximize the capabilities of multicore processors, blindingly fast flash drives, etc. It's built from the ground up for that.

NewSQL to us means SQL plus, plus, plus. It has all the power of a relational SQL database: transactions are guaranteed, they're consistent, they're durable, etc. It's a relational system, but it has flexibility that systems of the past didn't have. What I mean by that is you can change schema on the fly. For the agile world, that really matters. If I'm building and adding features, I don’t want to be held back by my database. If I need to make a change to the database, I just want to be able to have it done within minutes and not bring things down. It also has inherent fault tolerance. Databases are the engines of the apps. The apps and the databases are, in many cases, such a huge part of the business that fault-tolerant and self-healing systems are almost table stakes now. That's what people really need, demand and want, and it has to be super easy.

Then the last point – and this is probably the point that I think distinguishes NewSQL from legacy SQL more than any other – is the unbelievable scalability that's required. In our case, we've proven that our architecture delivers linear and incremental scalability. That means you start with three nodes in a cluster. If you go to six nodes, you get twice the performance, 12 is twice again, etc., etc. Very importantly from our standpoint, when you need to add capacity, you need to be able to add it on the fly without bringing anything down. The system should figure out that there's more capacity and begin to utilize it immediately.
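To make the linear-scaling claim concrete, here is a minimal arithmetic sketch. It is not Clustrix code: the function name and the three-node baseline parameter are illustrative assumptions, and it models only the ideal case described above, where throughput grows in direct proportion to node count.

```python
def relative_throughput(nodes: int, base_nodes: int = 3) -> float:
    """Throughput relative to the base cluster, assuming ideal linear scaling.

    Hypothetical helper for illustration only; real clusters are measured,
    not computed, and may fall short of the ideal.
    """
    if base_nodes <= 0:
        raise ValueError("base_nodes must be positive")
    if nodes < base_nodes:
        raise ValueError("cluster cannot be smaller than the base size")
    return nodes / base_nodes


if __name__ == "__main__":
    # The progression quoted in the interview: 3 -> 6 -> 12 nodes.
    for n in (3, 6, 12):
        print(f"{n:>2} nodes: {relative_throughput(n):.0f}x the 3-node baseline")
```

Under this assumption, each doubling of the node count doubles throughput, which is the "six is twice, twelve is twice again" progression described above.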

I'd add just two points to how we deliver this to the market. One is that we've married this massively parallel processing software to an appliance model. Paul Mikesell, one of the founders of Clustrix, also founded a company called Isilon, which had great success with a similar idea in the file storage and file system space. We took that idea and married it to solid-state flash, to high-speed InfiniBand interconnects and switches, and we deliver it to our customers as a “set it and forget it” kind of an appliance. They get started, it expands when they need it, or they can expand it simply when they need it.

The very last point is our database. Even though it's entirely our own code base, entirely massively parallel designed, it speaks MySQL. To the application world, developers, and database administrators, it essentially looks like the world's fastest, most scalable single-instance MySQL database. That matters because you don't ever want to push complexity to the developers. Your database should do everything it's expected to do and more, and the app folks can focus on features and have a database that works.

Mark, I understand the difference between legacy SQL and NewSQL as you have described it, but how does NewSQL compare to NoSQL?

Mark Sarbiewski: Well, this is a great question. There's a huge disruption happening across the landscape in the information architectures at companies. With that disruption come lots of new entrants and the requirement for many new capabilities. I would say the differences are relatively straightforward. MySQL was an open-source dialect of standard SQL that Sun acquired a number of years ago. It's really, by every measure, the most widely deployed relational database out there. There are a few branches of it. I think MariaDB is a nice one and there are others. Oracle, which acquired Sun, has continued to develop MySQL, and the community contributes to it too. In and of itself, it is essentially a relational SQL database that has a particular sort of dialect and query language that's MySQL. We chose it very specifically because it was so widely deployed, so well understood, and has so much skill set built up around it.

NewSQL, which is what we talk about, is about taking the power of any SQL system or SQL database and giving it the scale, flexibility, fault tolerance, and performance that the big data world needs. That’s essentially taking an SQL system and creating a capability that didn’t exist before.

NoSQL, which to some is “not only SQL” and to others is thought of as “No, I'm not doing SQL,” is not about relational capabilities. Even the acronym is in debate, but essentially it's more of a file system than a database. It does have really nice scalability. It's very simple. It essentially is a system where you don't have to understand much at all about how the data is structured, or laid out, or tables, or anything. You can essentially create a very simple interface and store documents and all kinds of information in it, and the system to some degree kind of just deals with it for you. It's nice for lots of use cases. It's nice for analytics and tasks where you want to have the simplest sort of architecture around your data. It's used for things where, we would say, transactions don't matter. In other words, not a lot of e-commerce, healthcare or banking systems are built on this, because what NoSQL doesn't really give you, and what SQL always has and NewSQL promises as well, is that the transactions that happen are somewhat sacred. In other words, if I'm debiting money from my account, and it's going to yours and we're doing a payment, that transaction is sacred. It's consistent, it happens, and it’s durable. If the power goes out, bad things don't happen. That’s not the promise of NoSQL. Their promise is that it’s very flexible and can accommodate many different data types, but for any capabilities you want that the database previously provided, you more or less have to write that into your applications when using NoSQL.
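The payment example above can be sketched in a few lines. This is an illustration only, not Clustrix or MySQL code: it uses Python's built-in sqlite3 module as a stand-in relational database, and the accounts table, the names and the balances are all hypothetical. It shows the all-or-nothing behavior of a transaction: if the debit fails, the credit that was already applied is rolled back.

```python
import sqlite3


def open_bank() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"  # no overdrafts allowed
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst as one transaction; True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            # Credit first, so a failed debit leaves something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back.
        return False


def balances(conn: sqlite3.Connection) -> dict:
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    bank = open_bank()
    print(transfer(bank, "alice", "bob", 30), balances(bank))   # succeeds
    print(transfer(bank, "alice", "bob", 500), balances(bank))  # fails, no change
```

Because the database here lives in memory, the sketch demonstrates atomicity and consistency but not durability. Without transactions in the database, the application itself would have to detect the failed debit and undo the credit.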

The choice between NoSQL and NewSQL is driven by the application workload and use case. The point I make to people is that NoSQL gets great attention because it overcomes some of the historic problems, but there's a reason why the relational database market is $23 billion. It's because most of the business value of data is in how it relates to other data. Most of the world is relational in that regard. I want to know if you've done all your orders, how do I get a sort on that, how do I know what to talk to you about next, how do you and I relate as customers? Clearly, the opportunity that Clustrix sees is the huge demand for relational capabilities that scale and perform in a big data context.

When we look at the world today, it is predominantly powered by relational databases. We now have big data, which is going to surpass what we saw when relational databases first came into the market. How does Clustrix fit into this big data picture and what makes you different from other database products?

Mark Sarbiewski: That’s another really good question because there's so much noise and activity in this space. I think at the highest level, big data means that there are two critical dimensions to understand. Most of the attention goes to what we talk about as the big analytics, which is throwing a whole mess of data from lots of different places into a pot and crunching it off-line. It's not necessarily being done in real-time, and it’s going after new business insight, what patterns are going to emerge in this data and how can that be incorporated into the thinking around business to exploit that insight.

The other dimension is big data applications. These are real-time systems. These are operational applications that deal with data at orders of magnitude more than we've ever seen before. Some of the earliest and obvious examples are Facebook and Twitter. These are not analytic systems; these are real-time systems where transactions are everything. Did you post something, did everyone else see it, did they react, is it keeping the conversation going? These applications have unbelievable scale, and they are things we didn't even imagine a couple of years ago. That's just in the web space, and they're everywhere on the Web.

We're starting to see a whole slew of other applications begin to be imagined and built that require what I would talk about as “web scale” but for other purposes. These include things like an RFID-based supply chain. We used to know where the boat was between here and, say, South America, and now we know the temperature of every pallet in the hold of the ship every two seconds. What are you going to do with that data? How do you do routing and supply chains and distribution with that kind of data? Smart meters on utilities provide opportunities to create very dynamic billing and power consumption models that they can now imagine because they have the data coming in but haven't had a database that could deal with it. Now they can, and as a result, maybe this summer when it's 115 degrees where you are, they can send a message to your phone indicating that if you don’t use your air conditioner for the next hour, you’ll receive a $5 rebate. That’s a real-time system so they don’t have to stand up another power plant. We're seeing opportunities in fleet management, logistics and in-home medical care. There are billions of devices exploding all over the world throwing off data that people need to understand in real-time. I'm just touching on the surface of those apps; thousands more will pop up. The key will be how to deal with that amount of relational data in real-time, and I think that's the huge opportunity for us.

Mark, you talk about speed, and scale, and simplicity. What specifically do you do to make that happen?

Mark Sarbiewski: There are two aspects to the speed part. Part of it is our flexibility to allow people to expand and change schema. We're a database that lets teams be very agile. The key to winning in the market is to be agile with development and new features. If I can get out there with a great idea faster than everybody else can, I win because speed wins in the market. We also, of course, give speed in the performance of our database. Without going into all the details, the smallest system we ship has 24 processors working at maximum parallelization all the time and high-speed SSD. The scale is how big you need to grow. We have reference customers in a variety of sectors and spaces that have proven that we deliver linear scalability where they can expand on the fly.

Whenever you need processing power, capacity, or more transactions, historically, you had the very awful choice of what's called sharding your application, basically splitting your database apart into lots of little databases and then trying to deal with the complexity that falls out of that. What our system does is expand, and you never have to shard. You never have to change your application. You only think about features and innovation, and the database will keep up and perform at whatever level you need.
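For readers unfamiliar with sharding, the sketch below shows the kind of logic an application ends up owning when it shards by hand. Everything in it is a hypothetical illustration: the shard count, the in-memory dictionaries standing in for separate databases, and the helper names are assumptions, and hash-modulo routing is only one common scheme, not a description of any particular product.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical number of separate databases


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a key to a shard with a stable hash (CRC32 modulo shard count)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_shards


# Each dict stands in for one independent database holding orders per user.
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]


def add_order(user_id: str, amount: int) -> None:
    """Single-key write: the application must pick the right shard itself."""
    shards[shard_for(user_id)][user_id].append(amount)


def total_revenue() -> int:
    """Cross-shard query: the application visits every shard and merges."""
    return sum(
        amount
        for shard in shards
        for orders in shard.values()
        for amount in orders
    )


def moved_keys(user_ids, old_n: int, new_n: int) -> int:
    """Count keys whose shard changes when the shard count changes."""
    return sum(1 for u in user_ids if shard_for(u, old_n) != shard_for(u, new_n))


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for u in users:
        add_order(u, 10)
    print("total revenue:", total_revenue())
    print("keys that move going from 4 to 5 shards:", moved_keys(users, 4, 5))
```

Two of the costs mentioned above show up directly: any query that spans users has to be fanned out and merged in application code, and changing the number of shards relocates many keys, which means migrating data while the application is live.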

Last, and maybe the most important, is simplicity. We made a very deliberate choice to do a couple of things. One was to package this as an appliance, and this has been proven in lots of sectors from Isilon, to NetApp, to Netezza, and others with the appliance model. The appliance model idea, plug it in and it just works and does everything I want it to do, has clearly taken hold, and we believe the time is right now for a great appliance in the relational market.

Our appliance is fault-tolerant and self-healing. A great customer quote that I have from Massive Media in Europe is, "It takes longer to get it out of the box than to set it up or expand this system." It's just that simple, and that was our idea. We looked to companies like Apple for the inspiration. Not only do people need this scale and agility, but also they need to have lean operations. They need to spend less on just maintaining stuff and more on innovation. That is really at the heart of our whole model: let's speak MySQL, and let's have this be a simple appliance that never goes down. Here’s an example of the simplicity. When a drive does go bad, the system immediately reacts, and re-protects everything. You get two emails from our system. One is that there was a drive failure, and within a couple of minutes you get another email saying you're fully protected. We immediately recognize it, we re-protect everything, and we can keep working with the remaining elements of the system.

Can you provide one or two real-world examples of users who are solving their scalability challenges with Clustrix and tell us what they are able to achieve?

Mark Sarbiewski: One example, which I mentioned earlier, is Massive Media. They're a website that is exploding in Europe. They're a social site with 7, 8 or 9 million users. They grow so fast it's hard to keep track. What they've been able to do in just one year is to grow from zero to pushing probably ten million users, and they've never had to worry about their database; they've never sharded. They're a great reference and case study for us because they actually have another web property that they started years ago that's more Facebook-like where they did have to shard. They did have to reconfigure their database and do all that complex stuff. They promised themselves they would never do it again. The second time, for this other property, they said, “We're going to go with Clustrix. We think that's the answer.” And we've proven our scalability. They started with a three-node system, they moved to a six-node and then a nine-node, and as they grow, they'll be able to add more.

And there's one other example I'd like to talk about because it's kind of a different use case. It's a biomedical research institute. In this case, they don't have a giant database administration staff. They have many scientists, and they're writing applications to do research. They use MySQL as their database backend, and what they needed was a way to consolidate all their databases onto one ridiculously simple appliance that one part-time person could manage, and that's exactly what they've done with Clustrix. They're on their way to pushing 100 different databases onto one appliance with a hundred applications, and they have a part-time admin that manages the whole thing. What they have with Clustrix is a system that scales to whatever they need. Maybe 80 of these 100 apps don't need scale, but maybe 20 do. They all might need fault tolerance and high availability because if you have 100 apps and 100 databases, if two of those go down in the middle of the night, the beeper's going off and somebody has to get up and deal with that. With our system, they're always protected; it's completely fault-tolerant and self-healing. Even if an element within the appliance does fail, they're still going and nobody's being beeped in the middle of the night.

We have other use cases too. People are using it for metadata management on giant file systems. People are also using it purely for the high availability piece. Usually it’s some combination of those, but those are some examples of the great use cases that we have.

Well, Mark, this has been great. I really appreciate your insights, and I'm looking forward to seeing how NewSQL moves forward in the market.

  • Ron Powell
    Ron is an independent analyst, consultant and editorial expert with extensive knowledge and experience in business intelligence, big data, analytics and data warehousing. Currently president of Powell Interactive Media, which specializes in consulting and podcast services, he is also Executive Producer of The World Transformed Fast Forward series. In 2004, Ron founded the BeyeNETWORK, which was acquired by TechTarget in 2010. Prior to the founding of the BeyeNETWORK, Ron was cofounder, publisher and editorial director of DM Review (now Information Management). He maintains an expert channel and blog on the BeyeNETWORK and may be contacted by email at rpowell@powellinteractivemedia.com.

    More articles and Ron's blog can be found in his BeyeNETWORK expert channel.
