Imagine this: Would Google have built the predecessor to Hadoop in the mid-2000s if IBM's InfoSphere Streams (a.k.a. "IBM Streams") had been available? Since IBM Streams can ingest tens of thousands to millions of discrete events a second and perform complex transformations and analytics on those events with sub-second latency, why would Google have bothered to invest the man-hours into building a home-grown, distributed system to build its Web indexes?
Like Hadoop, IBM Streams runs on a cluster of commodity servers, parallelizes programming logic, and handles node outages, relieving developers from having to worry about many of these low-level tasks. Moreover, contrary to what some may think about Big Blue products, Streams works on just about any data, including text, images, audio, voice, VoIP, video, web traffic, email, GPS data, financial transaction data, satellite data, and sensor data.
OK. I suspect Google has a "not invented here" mentality and needs to put its oodles of Java, Python, and C programmers to work doing something innovative to harness the massive reams of data that it collects daily from its sprawling Web and communications empire. And since it was an internet startup at the time, Google probably didn't want to pay for commercial software, and probably still doesn't. (Google was only six years old when it began developing Bigtable, a predecessor to Hadoop, in 2004.) An entry-level implementation of Streams will set you back about $300,000, according to Roger Rea, Streams Product Manager at IBM.
Origins of CEP
I suppose you could argue that Hadoop inspired Streams. But that's probably not true. Streams emanates from a rather esoteric domain of computing known as Complex Event Processing (CEP), which has been around for more than two decades and has been the focus of a sizable amount of academic research. In fact, renowned database guru and MIT professor Michael Stonebraker threw his hat into the CEP ring in 2003 when he commercialized technology that he had been working on with colleagues from several other universities. His company, StreamBase Systems, was actually a latecomer to the CEP landscape, preceded by companies such as Tibco, Sybase, Progress Software, Oracle, Microsoft, and Informatica, all of which have developed CEP software in-house or acquired it from startups.
CEP technology is about to move from the backwaters of data management technology to center stage. The primary driver is Big Data, which is at the height of the hype cycle today. Open source technologies, such as Hadoop, have finally made it cost-effective for companies not only to amass mountains of Web and other data, but also to do something valuable with it. And no data is off limits: Twitter feeds, smart meter data, sensor feeds, video surveillance, systems logs, as well as voluminous transaction data. Much of this data arrives so fast and in such volume that you have to process it in real time or you can never catch up.
Unfortunately, Hadoop is a very young technology at this stage. It's also batch-oriented. That means you have to dump big, static files of data into a Hadoop cluster and then launch another big job to process or query the data. (Apache does support a project called Flume that is supposed to stream Web log data into Hadoop, but early users report it doesn't work very well.)
Data in Motion
But what if you could process data as events happen? In other words, analyze data in motion instead of data at rest? This is where CEP technologies come into play.
Say you are a manager at a telecommunications company who wants to count the number of dropped calls each day, track customer calling patterns, and identify individuals with prepaid calling plans who might churn. Every night, you could dedupe and dump all six billion of your company's call detail records (CDRs) into your Hadoop cluster. Then you could issue queries against that entire data set to calculate the summaries and run the churn model. Given the volume of data, it might take more than 12 hours to process everything, and by then it would be two days since the calls were made.
But if our telecommunications manager had a CEP system, he wouldn't have to load anything or run massive queries to get the answers he wants. He would create some rules, point his CEP engine at the CDR event stream, and let it work its magic. The CEP system would first dedupe the data as it comes in by checking each incoming CDR against billions of existing CDRs in a data warehouse. It would then calculate a running summary of dropped calls, summarize call activity by customer, compute the churn model, and deposit the summaries into a SQL database. And it would do all that work in a fraction of a second per event record. A marketing manager could monitor the data on a real-time dashboard and, within minutes of a call, send promotional offers to customers on a prepaid plan who are likely to churn.
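To make the per-event flow concrete, here is a minimal Python sketch of the dedupe-then-summarize steps described above. It is an illustration only, not how IBM Streams or any CEP product is implemented: the CallRecord fields and the RunningSummary class are hypothetical names, an in-memory set stands in for the lookup against the data warehouse, and the churn model and SQL write are left out.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CallRecord:
    # Hypothetical CDR shape; real call detail records carry many more fields.
    call_id: str
    customer_id: str
    dropped: bool


@dataclass
class RunningSummary:
    # In-memory stand-in for the dedupe check against the warehouse.
    seen_ids: set = field(default_factory=set)
    dropped_calls: int = 0
    calls_by_customer: Counter = field(default_factory=Counter)

    def process(self, cdr: CallRecord) -> bool:
        """Handle one event as it arrives; return False for a duplicate."""
        if cdr.call_id in self.seen_ids:
            return False
        self.seen_ids.add(cdr.call_id)
        # Update the running totals incrementally, one record at a time.
        if cdr.dropped:
            self.dropped_calls += 1
        self.calls_by_customer[cdr.customer_id] += 1
        return True


events = [
    CallRecord("c1", "alice", False),
    CallRecord("c2", "bob", True),
    CallRecord("c1", "alice", False),  # duplicate of the first record
    CallRecord("c3", "alice", True),
]

summary = RunningSummary()
for event in events:
    summary.process(event)
```

The point of the sketch is that the summaries are always current after each record, so nothing has to be loaded and queried in bulk afterward.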
Now, if that isn't powerful computing, I'm not sure what is. That's certainly worth $300,000 or even ten times that amount for an enterprise deployment like I've just described. Google be damned!
CEP Use Cases
In a broad sense, CEP software creates a sophisticated notification system that works on high-volume streams of incoming data. You use it to detect anomalies and issue alerts. Fraud detection systems are a classic example of CEP systems in action. But, in reality, CEP offers more value than just pure notification. In fact, in the age of Big Data, other use cases may come to the forefront and even give Hadoop a run for its money.
According to Neil McGovern, who heads worldwide strategy at Sybase, CEP has four use cases:
- Situational Detection. This is the traditional use case in which CEP applies calculations and rules to streams of incoming data and identifies exceptions or anomalies.
- Automated Response. This is an extension of situational detection in which CEP automatically takes predefined actions in response to an event or combination of events that exceeds thresholds.
- Stream Transformation. Here, CEP transforms incoming events to offload the processing burden from Hadoop, ETL tools, or data warehouses. In essence, CEP becomes the transformation layer in an enterprise data environment. It can filter, dedupe, and calculate data, including running data mining algorithms on a record-by-record basis.
- Continuous Intelligence. Here, CEP powers a real-time dashboard environment that enables managers or administrators to keep their fingers on the pulse of an organization or mission-critical process.
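The first two use cases in the list above can be sketched in a few lines of Python: a rule watches a sliding time window of events (situational detection) and calls a predefined action when a threshold is exceeded (automated response). This is a simplified illustration under stated assumptions, not any vendor's rule language; the DroppedCallDetector class, the 60-second window, and the threshold of 3 are all made up for the example.

```python
from collections import deque


class DroppedCallDetector:
    """Fire a callback when too many events fall inside a sliding window."""

    def __init__(self, window_seconds, threshold, on_alert):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.on_alert = on_alert      # the predefined automated response
        self.timestamps = deque()     # event times currently in the window

    def observe(self, timestamp):
        self.timestamps.append(timestamp)
        # Evict events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        # Situational detection: the count in the window exceeds the threshold.
        if len(self.timestamps) > self.threshold:
            self.on_alert(timestamp, len(self.timestamps))


alerts = []
detector = DroppedCallDetector(
    window_seconds=60,
    threshold=3,
    on_alert=lambda ts, count: alerts.append((ts, count)),
)

# Hypothetical dropped-call timestamps, in seconds.
for ts in [0, 10, 20, 30, 100, 110]:
    detector.observe(ts)
```

Here the callback only records the alert; in practice it might page an operator or trigger a promotional offer.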
In many applications, CEP supports all four use cases at once. Certainly, in the era of Big Data, companies would be wise to implement CEP technology as a stream transformation engine that minimizes the size of data they have to land in Hadoop or a data warehouse. This would reduce their hardware footprint and software execution costs. Even though it's commercial software, a CEP product would provide high ROI in a Big Data environment.
CEP is a technology that offers many valuable uses and is currently being adopted by leading-edge companies. SAP plans to embed Sybase's CEP engine in all of its applications. So, if you are an SAP user, you'll be benefiting from CEP whether you know it or not. If you are a BI architect, it's time you gave it a look to see how it can streamline your existing data processing and analytical operations.
Posted March 26, 2012 12:42 PM