SPOTLIGHT: Columnar Databases and Big Data Q&A with Don DeLoach, CEO of Infobright
by Ron Powell
Originally published July 6, 2011
BeyeNETWORK Spotlights focus on news, events and products in the business intelligence ecosystem that are poised to have a significant impact on the industry as a whole; on the enterprises that rely on business intelligence, analytics, performance management, data warehousing and/or data governance products to understand and act on the vital information that can be gleaned from their data; or on the providers of these mission-critical products.
Presented as a Q&A-style article, these interviews with leading voices in the industry including software vendors, end users and independent consultants are conducted by the BeyeNETWORK and present the behind-the-scene view that you won’t read in press releases.
This BeyeNETWORK spotlight features Ron Powell's interview with Don DeLoach, CEO of Infobright. Ron and Don have an interesting conversation that includes how columnar data stores are changing the landscape, big data solutions and Hadoop, Infobright’s Rough Query capability and more.
Don, BeyeNETWORK Spotlights focus on what we consider newsworthy and groundbreaking information. In reviewing your latest release, the announcement centered around big data. That is a very hot topic within the BeyeNETWORK, and there's no denying that big data is the latest industry buzzword. But that term is also causing a lot of confusion because there are so many definitions for it. Can you tell us how Infobright defines big data?
Don DeLoach: I agree with you – there is a ton of buzz and a ton of confusion around the notion of big data. We see it as the convergence of analysis, of structured and unstructured data, and generally it's on a very, very large scale. So when people think of big data, they think of Hadoop, Google, Facebook, and all of these massive sprawling databases.
The problem with this is that the interpretation of the problem as well as the strategies and tools to address the problem really run far and wide, really far and wide. The most common tool is clearly Hadoop, and everybody is talking about Hadoop connectivity, as well they should. At this point in time, anybody who is without the ability to integrate is at a disadvantage. We'll talk a little bit more about that, but it's becoming definitely mainstream these days as opposed to the “shiny penny” thing that nobody knew about.
A growing percentage of our customers using Infobright are using it in conjunction with Hadoop. But the thing that's happening that's very interesting is that there is a whole new landscape of data management tools that are out there that loosely fall into one of three versions of the NoSQL camp. These are very interesting tools, things like Mongo and Couch and Cassandra, and they're generating a lot of buzz, but what I don't find is a lot of people that seem to really understand what they are. And as is often the case, when there's something new on the market, everybody is all excited about the promise of the new technology. Early on, when a technology is in its infancy, it is generally reasonably immature, and people don't understand the shortcomings of the technology. But as the market matures and as more people put more and more of a spotlight on what this emerging landscape looks like, it will force maturity into these tools. Some will survive and some won't, but the landscape will definitely change. That's part of the confusion that's out there.
The other thing that we see is that within big data, there's the unstructured side and the structured side. The structured side is most definitely influenced by the massive growth in machine-generated data. That's quite honestly where a lot of our focus comes – things like weblogs, call data records, financial pricing data, online gaming data, mobile and video advertising, RFID and other sensor-generated data, and more. So anything that has those types of characteristics, we care a lot about.
As big data is the convergence of structured and unstructured data, the high growth of machine-generated data poses its own challenges. There's an interesting link between the nuances of processing machine-generated data and the efforts associated with utilizing analytics databases to process and analyze the result set coming out of something like Hadoop. This is where we have a good amount of experience with our customers. They use us in conjunction with Hadoop, and then pass the result set into Infobright to process the analytics. We think it’s a very, very powerful strategy.
It turns out that using Infobright along with Hadoop can provide the same extended capabilities with the same benefits in terms of time and money as we do with processing machine-generated data.
Let me just step back for a minute. The key thing that we do is not just that we can process machine-generated data. There are a number of enterprise class data warehousing solutions out there that do that. But the architecture of Infobright, the Knowledge Grid architecture, is such that it allows us to address that challenge in far less time and with far less money. We're able to deliver a much more compelling way to answer that challenge than using something that's overkill. And again, as it turns out, the same type of techniques that we use to bring to bear this solution for analyzing machine-generated data happen to be very conducive to linking it with something like Hadoop to provide further analytics from the result set of a Hadoop cluster.
Well, that's a great segue into the recent release, Infobright 4.0. One of the things your press materials talk about is DomainExpert technology. Could you tell us what that is and why you are calling it the newest breakthrough in analytics?
Don DeLoach: Sure. So our Knowledge Grid architecture is our secret sauce, and that's what has enabled us to do what we do. It essentially establishes a columnar environment for analytics, just like a lot of our competitors do, but we do it without having to apply very significant administrative resources to make it work. That means no projections or indices, no partitioning, no balancing, no schema migration. It’s fundamentally an automatic process, and that happens at the time of ingesting the data where knowledge about the data is automatically created including contextual information about the data. This is where we get huge leverage. That information is used when processing the query to reduce or eliminate the need to access data to answer the query. So, in some cases, we just have an extreme unfair advantage because we can get to the answer much, much faster and much, much more efficiently.
For a large number of technical reasons, this approach works particularly well with machine-generated data, and it also yields a very aggressive compression capability. To our knowledge, it’s more aggressive compression than anything else that's out there in the industry. Consequently, the savings are truly meaningful in terms of time, administrative overhead, and hardware resources.
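Editor's note: the idea of creating knowledge about the data at load time and using it to avoid reading data at query time can be illustrated with a small sketch. The interview does not describe Infobright's internal layout, so everything here is an assumption made for illustration: the fixed-size blocks, the `BlockStats` record, and the three labels (`irrelevant`, `relevant`, `suspect`) are hypothetical names, and only min, max and row count are kept per block.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BlockStats:
    """Summary recorded for one block of column values (hypothetical layout)."""
    minimum: int
    maximum: int
    count: int


def build_stats(values: List[int], block_size: int) -> List[BlockStats]:
    """Split a column into fixed-size blocks and record min/max/count for each."""
    stats = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        stats.append(BlockStats(min(block), max(block), len(block)))
    return stats


def classify(stats: BlockStats, threshold: int) -> str:
    """Classify a block for the predicate `value > threshold` from metadata only."""
    if stats.maximum <= threshold:
        return "irrelevant"  # no row can match, so the block is skipped
    if stats.minimum > threshold:
        return "relevant"    # every row matches, so the stored count is enough
    return "suspect"         # some rows may match, so the block must be read


def count_greater_than(values: List[int], block_size: int, threshold: int) -> Tuple[int, int]:
    """Count rows with value > threshold; also report how many blocks were read."""
    total = 0
    blocks_read = 0
    for index, stats in enumerate(build_stats(values, block_size)):
        kind = classify(stats, threshold)
        if kind == "relevant":
            total += stats.count
        elif kind == "suspect":
            blocks_read += 1
            block = values[index * block_size:(index + 1) * block_size]
            total += sum(1 for v in block if v > threshold)
    return total, blocks_read
```

For a column holding the integers 0 through 99 in blocks of ten, counting the values greater than 54 in this sketch reads only the one block that straddles the threshold; the other nine are resolved from their summaries alone.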
I’ll explain what has happened with DomainExpert technology. Our engineers figured out that they could extend the intelligence at a much deeper level than what we had been doing for machine-generated data. So rather than look at decimals, and integers, and characters, and varchars, the database would see them as things like URLs, email addresses, IP addresses, geospatial coordinates, or tick data for the financial world. And by using this contextual information and the understanding of the patterns in the data, it results in much, much faster queries and even more significant and aggressive disk compression. And again, all of that saves further time and further money.
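Editor's note: a rough way to see why knowing that a character column holds IP addresses helps is to compare a generic text encoding with a type-aware one. This is only a sketch of the general principle, not Infobright's DomainExpert implementation, which the interview does not detail. The function names are hypothetical, and the example assumes IPv4 addresses only.

```python
import ipaddress
import struct
from typing import List


def encode_ipv4_column(addresses: List[str]) -> bytes:
    """Pack dotted-quad strings into 4 bytes each (big-endian unsigned integers)."""
    ints = [int(ipaddress.IPv4Address(a)) for a in addresses]
    return struct.pack(f">{len(ints)}I", *ints)


def decode_ipv4_column(packed: bytes) -> List[str]:
    """Recover the original dotted-quad strings from the packed form."""
    count = len(packed) // 4
    ints = struct.unpack(f">{count}I", packed)
    return [str(ipaddress.IPv4Address(i)) for i in ints]


# Hypothetical sample column, as it might appear in a web log.
column = ["192.168.1.10", "192.168.1.11", "10.0.0.1"]

as_text = "\n".join(column).encode("ascii")  # generic character storage
as_typed = encode_ipv4_column(column)        # domain-aware storage

print(len(as_text), len(as_typed))
```

The typed form takes exactly four bytes per address and round-trips without loss, and because neighboring addresses become neighboring integers, it also tends to suit further compression and range comparisons better than the raw text does.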
Now the fact that Infobright reduces the need for all of this administrative work, is it much more tailored to the business user? Once the data is there, is it very easy for them to work with it?
Don DeLoach: It is. And in fact, that’s a very interesting question. We have sort of a bifurcation of the users. We have people within IT organizations that are responsible for database administration and standards that for reasons that should be obvious to many will have selected Infobright because we fit into the architecture in a more effective way for solving analytics of machine-generated data. But sometimes what we find is that it's the end users that come to us for exactly that reason. For example, we have a very, very large retailer that is using us to analyze Omniture data, and they're basically just downloading it into the business analyst environment with no help from IT whatsoever. There's also a bank that's doing virtually the same thing, and these are end users who have a very specific challenge that was well addressed by what Infobright could provide. It wasn't necessarily part of the corporate IT standards, but they found that they could bring it in themselves, get it up and running, and solve the problem in a fraction of the time and for a fraction of the cost. So I would say while we do have appeal on the IT side, your observation is a good one, and we definitely are brought in sometimes directly by the end users.
Don, that leads into my next question. I understand that users can add their own domain knowledge to enhance the results that they receive. Does the product require extensive domain expertise, or can organizations that don’t have someone capable of adding that to the product still get effective results?
Don DeLoach: The absolute answer would be that they can get effective results using Infobright without adding their own specific domain knowledge. But by providing the capability of adding their own specific domain knowledge, they're able to extend those capabilities. Prior to the release of 4.0, we didn't offer this capability, but we still have hundreds of very, very successful customers out there so the value is there for sure.
But what we're doing with DomainExpert is taking it to a whole other level, and we will be providing some domain knowledge sort of in the can, if you will – URLs, or email addresses, or IP addresses. There'll be certain predefined DomainExpert definitions that we will bring forward, but we're going to extend the capability so that users can add their own custom definitions. And the thing about it is that it's going to be pretty easy to do. They can add their own domain expertise; but if, for whatever reason, they're incapable of adding their own, we have a services organization that can do that for them pretty quickly. Like everything we do, it's all about making the technology as automated as possible so it requires as little administrative overhead as possible and is as accessible as possible. Our approach has always been to make it easy to use, make it cost-effective, and get people up and running quickly without having to have a lot of extra hardware and administrative overhead. We carried that approach through to what we're doing with DomainExpert. We just are extending the leverage we get from our architecture by utilizing this concept.
Well, I will tell you, as technology moves into an enterprise, ease of use and the fact that you don't have to have certain IT knowledge for a business user is really key to success. I think it's the right move by Infobright in this case. According to your press release, you say that this is the first database with built-in intelligence for near real-time analysis of machine-generated data. Now that's a bold claim. How do you substantiate that you're the first?
Don DeLoach: Again, it's all about our Knowledge Grid architecture. There are always going to be people making claims about what they have or haven't done. The thing about us relative to the machine-generated data is that architecturally our approach is just entirely different than the principal approaches utilized in the market. There are some variations to the standard columnar store and some clever stuff that's out there. For example, the approach that was used by Vertica, now part of HP, had some nuances to it that were very clever for preplanned queries, and there are various architectures that are primarily focused on enterprise data warehouses.
What makes us so different and what substantiates our claims is really the suitability of the architecture specifically for machine-generated data. It's not so much that it's better, worse, or what have you, but rather it's a function of the purpose-built nature and what it can do.
So as an example, I recently visited the website of a police department. This site talked about the department’s motorcycle unit. That got me thinking.
A motorcycle is one mode of transportation and an automobile is another, and most police departments have both motorcycles and automobiles. It doesn't make a motorcycle better or worse than an automobile; it just makes it different. The motorcycle costs less and it isn't going to be suitable for some of the things that the car does. You're not going to transport people to jail on a motorcycle, you're not going to have two people riding on the motorcycle, and you're not going to carry stuff around on the motorcycle.
On the other hand, what I found very interesting was that the website touted that the motorcycle unit is specifically deployed to get to crimes more quickly, especially when it’s necessary to go down bike paths, or narrow alleys, or places that aren't as appropriate for cars.
In a way, this is a lot like the characterization of Infobright versus some of the more general-purpose alternatives. The general-purpose alternatives out there are very, very good. In some ways, they can accomplish what Infobright accomplishes, but they just can't do it as quickly or as cost-effectively for the specific use case. What we do and what we're extending with our offering is fabulous if the business problem is analyzing machine-generated data. The other thing that I would say that substantiates this is the testimonial after testimonial from our customers who've evaluated other technologies and come to the very conclusions that I was talking about.
Don, your press release talks about enabling users to quickly find the needles in the haystack within large volumes of data. What types of needles in the haystack do your customers discover and why would they want to look for these needles? Wouldn't it be faster and easier to just discover and act upon the low-hanging fruit?
Don DeLoach: What we're talking about here is what we call Rough Query. Let me step back. On the most basic level, it speaks to the suitability of a columnar data store to service ad hoc queries and data mining.
A columnar data store is now generally accepted in the industry as the architecture of choice for getting data out of a database. If I want to analyze and mine that data, a columnar store is very good for that. The breakthrough that exists here – and this is truly a really cool breakthrough – is through the exploitation of the Knowledge Grid architecture through what we're calling Rough Query. And the reason we call it Rough Query is that our Knowledge Grid architecture is based on rough set mathematics, and this enables us to store information about the data in a metadata layer. What we're doing with Rough Query is really tapping into that. For the first time, we're delivering this capability, and it is unbelievable.
In a nutshell, it provides for basically instantaneous results to narrow in on an answer. Before we were looking specifically for the explicit finite end. Think of it as saying you want to find out where Mayor Bloomberg is at 10:30 on a Tuesday morning in New York. Instead of asking it that way and looking at every office, and every house, and every room, what you instantaneously get back is a range like he's either in SoHo or Harlem. That rules out 90% of the looking that you would have done for Mayor Bloomberg.
This illustration applies very well to what Rough Query does. It basically understands the nature of the query. And because we keep information about the data stored in the Knowledge Grid, we're able to go to the Knowledge Grid and instantaneously narrow the scope. Going back to the Mayor Bloomberg illustration, from there you can ask again until you narrow to the building or the block where you want to go door to door as opposed to having to look again in every room, every house, every hotel or whatever to find out where he is.
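Editor's note: the "narrow the scope first" behavior described above can be sketched as a query that returns a range rather than an exact answer, computed purely from stored summaries. This is an illustrative assumption about how such a rough answer might be derived, not a description of Infobright's Rough Query implementation. The `BlockSummary` tuple and the function name are hypothetical.

```python
from typing import List, Tuple

# Each entry summarizes one block of a column: (minimum, maximum, row_count).
BlockSummary = Tuple[int, int, int]


def rough_count_greater_than(summaries: List[BlockSummary], threshold: int) -> Tuple[int, int]:
    """Return (lower, upper) bounds on the number of rows with value > threshold.

    Only the per-block summaries are consulted; no row data is read.
    """
    lower = 0
    upper = 0
    for minimum, maximum, row_count in summaries:
        if minimum > threshold:
            # Every row in the block matches.
            lower += row_count
            upper += row_count
        elif maximum > threshold:
            # Mixed block: the row holding the maximum matches and the row
            # holding the minimum does not, so between 1 and row_count - 1 match.
            lower += 1
            upper += row_count - 1
        # Otherwise maximum <= threshold and no row in the block matches.
    return lower, upper


summaries = [(0, 9, 10), (5, 20, 10), (30, 40, 10)]
print(rough_count_greater_than(summaries, 15))
```

With the sample summaries, the first block is ruled out, the third is fully counted, and only the second remains uncertain, so the rough answer is a count between 11 and 19. An exact follow-up query would then only need to look inside that one block, which is the door-to-door step in the illustration above.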
The net effect is that you can mine through extensive amounts of data in a way that’s significantly faster than using the traditional tools. And the other benefit is actually something that one of our customers in the data program pointed out to me – you free up your other resources. In the Mayor Bloomberg example, that would be like your colleagues in the vehicles they're in that are going door to door. You free up your other resources significantly while you speed up the time to get to the right answer, which really isn't a bad deal.
It sounds like a great deal. Don, based on your experience and interaction with your customers and potential customers, what do you see as the biggest analytic challenges they're facing today?
Don DeLoach: This is actually not a new challenge, but I still see it as the biggest one and that is getting the right information without spending an undue amount of time and money. Most people see a trade-off. They see the trade-off as they have to have the right answer but it's going to take a long time or they're going to spend too much. Enterprise data warehouse initiatives are certainly fabulous examples of projects that run multimillions and run behind time and over budget and are often very painful. Or the other thing is they'll get the wrong answer or less precise answer but for less time and less money. But what people really want – and the big challenge, I believe – is to get to the right answer without having to break the bank or the calendar getting there.
Delivering the right answer for less time and less money is definitely the challenge, and we don't do that for everybody but, again, for machine-generated data we're pretty good.
If I were to ask you your prediction for the future of analytics and big data what would you say?
Don DeLoach: I have to think back to when I was much younger. I was with Hitachi Data Systems in the mainframe business, and that's when client/server was coming into its own. A lot of the people in the mainframe business really discounted client/server as being sort of a new toy or something that would be a fad and would go away. I remember one of the executives at Hitachi Data Systems at the time saying to me people would grow weary of client/server and nothing would ever displace mainframe technology. I think that's a very dangerous position to take. Technology is constantly evolving. When there's a real groundswell of interest in an emerging technology, generally speaking the market will force that technology to a level of maturity that will make it meaningful, and I would say big data is not going to go away. The explosion of data is unbelievable, especially machine-generated data.
For example, I recently wrote a blog about this. In Internet traffic alone, Cisco recently predicted that the data volumes between 2010 and 2015 will increase almost fivefold. If you think about the massive data volumes that exist now, and then think that in a very short period of time it's going to have that type of multiple, it’s huge. Just as the market has driven the evolution of numerous technologies – everything from steam shovels, to disk technology, to communications technology, or even what we're seeing in terms of the evolution of energy, batteries and cell phones and things like that – we see that market demand forcing the maturity of database technology and taking this landscape from the “shiny penny” of some of the new NoSQL variants into a more mature, well vetted place in the market, and we definitely see this as a good thing.
I do believe that people are starting to realize that there's not going to be one silver bullet. It's not going to be one specific technology that solves everybody's problems, but it's more likely to be a coexistence of technologies put together in the right way, in the right combination, to solve the right problems tomorrow better than they could have today. We think that's a good thing, and certainly we think that in the context of big data – aligning structured and non-structured and particularly on the structured side where it includes machine-generated data – that our opportunity is significant, and that has us definitely excited about the future.
I think you're right on. I go back to the mainframe days as well. Our platforms keep changing, and we went from the mainframe, to the mini, to client/server, to the Internet and now to the cloud. When you really look at it, we're doing more things with more data, bigger, faster and cheaper. It's just amazing how technology moves and the validation especially in your market has been proven out. There is a definite need for this type of technology, and we see analytics as the next major wave so I think you're well-positioned, Don, and I really appreciate you taking the time with me for this interview.
Don DeLoach: Well, thanks so much. I look forward to doing it again.
Copyright 2004 — 2020. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC