

Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is founder and principal consultant at Eckerson Group, a research and consulting company focused on business intelligence, analytics, and big data.

Recently in Business Analytics Category


For all the talk about analytics these days, there has been little mention of one of the most powerful techniques for analyzing data: location intelligence.

It's been said that 80% of all transactions embed a location. A sale happens in a store; a call connects people in two places; a deposit happens in a branch; and so on. When we plot objects on a map, including business transactions and metrics, we can see critical patterns with a quick glance. And if we explore relationships among spatial objects imbued with business data, we can analyze data in novel ways that help us make smarter decisions more quickly.

For instance, a location intelligence system might enable a retail analyst working on a marketing campaign to identify the number of high-income families with children who live within a 15-minute drive of a store. An insurance company can assess its risk exposure from policy holders who live in a flood plain or within the path of a projected hurricane. A sales manager can visually track the performance of sales territories by products, channels, and other dimensions.

Geographic Information Systems. Location intelligence is not new. It originated with cartographers and mapmakers in the 19th and 20th centuries and went digital in the 1980s. Companies such as Esri, MapInfo, and Intergraph offer geographic information systems (GIS), which are designed to capture, store, manipulate, analyze, manage, and present all types of geographically referenced data. If this sounds similar to business intelligence, it is.

Unfortunately, GIS have evolved independently of BI systems. Even though both groups analyze and visualize data to help business users make smarter decisions, there has been little cross-pollination between them and little, if any, data exchange between their systems. This is a missed opportunity, since GIS analysts need business data to provide context for the spatial objects they define, and BI users benefit tremendously from spatial views of business data.

Convergence of GIS and BI

However, many people now recognize the value of converging GIS and BI systems. This is partly due to the rise in popularity of Google Maps, Google Earth, global positioning systems, and spatially-aware mobile applications that leverage location as a key enabling feature. These consumer applications are cultivating a new generation of users who expect spatial data to be a key component of any information delivery system. And commercial organizations are jumping on board, led by industries that have been early adopters of GIS, including utilities, public safety, oil and gas, transportation, insurance, government, and retail.

The range of spatially-enabled BI applications is endless, and the applications themselves are powerful. "When you put location intelligence in front of someone who has never seen it before, it's like a Bic lighter to a caveman," says Steve Trammel, head of corporate alliances and IT marketing at Esri.

Imagine this: an operations manager at an oil refinery will soon be able to walk around a facility and view alerts based on his proximity to under-performing processing units. His mobile device shows a map that depicts the operating performance of all processing units based on his current location. This enables him to view and troubleshoot problems first-hand rather than being tethered to a remote control room. (See figure 1.)

Figure 1. Mobile Location Intelligence.

A spatially-aware mobile BI application configured by Transpara for an oil refinery in Europe. Transpara is a mobile BI vendor that recently announced integration with Google Maps.

GIS Features. Unlike BI systems, GIS specialize in storing and manipulating spatial data, which consists of points, lines, and polygons. A line connects two points, and a polygon is a closed shape formed by three or more points. Each point or object can be imbued with properties or rules that govern its behavior. For example, a road (i.e., a line) has a surface condition and a speed limit, and the only points that can be located in the middle of the road are traffic lights. In many ways, a GIS is like computer-aided design (CAD) software for spatial applications.

Most spatial data is represented as a series of X/Y coordinates that can be plotted on a map. The most common coordinate system is latitude and longitude, which enables mapmakers to plot objects on geographical maps. But GIS developers can create maps of just about anything, from the inside of a submarine or office building to a geothermal well or cityscape. Spatial engines can then run complex calculations against coordinate data to determine relationships among spatial objects, such as the driving distance between two cities or the shadows that a proposed skyscraper would cast on surrounding buildings.
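
To make the coordinate math concrete, here is a minimal sketch in Python of the kind of calculation a spatial engine runs constantly: the great-circle ("as the crow flies") distance between two latitude/longitude points, using the haversine formula. The coordinates are illustrative.

    from math import radians, sin, cos, asin, sqrt

    def haversine_miles(lat1, lon1, lat2, lon2):
        """Great-circle distance in miles between two lat/lon points."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + \
            cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 3956 * asin(sqrt(a))  # 3956 = Earth's radius in miles

    # Illustrative coordinates: roughly Boston and New York City
    print(round(haversine_miles(42.36, -71.06, 40.71, -74.01)))  # ~190 miles

A real GIS goes much further--drive times over road networks, containment within a flood-plain polygon, line of sight--but each of those calculations starts from coordinate arithmetic like this.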

Approaches for Integrating GIS and BI

There are two general options for integrating GIS and BI systems: 1) integrate business data within GIS systems or 2) integrate GIS functionality within BI systems. GIS administrators already do the former when creating maps, but their applications are very specialized. Moreover, most companies purchase only a handful of GIS licenses, which are expensive, and the tools are too complex for general business users.
The more promising approach, then, is to integrate GIS functionality into BI tools, which have a broader audience. There are several ways to do this, and they vary greatly in the level of GIS functionality they support.

  • BI Map Templates. Most BI tools come with several standard map images, such as a global view with country boundaries or a North American view with state boundaries. A report designer can place a map in a report, link it to a standard "geography" dimension in the data (e.g., a "state" field), and assign a metric to govern the shading of boundaries. For example, a report might contain a color-coded map of the U.S. that shows sales by state. This is the most elementary form of GIS-BI integration since these out-of-the-box map templates are not interactive.
  • BI Mashups. An increasingly popular approach is to integrate a BI tool with a GIS Web service, such as those provided by Google or Microsoft Bing. Here, the BI tool integrates static and interactive maps via a Web service (e.g., REST API) or a software development toolkit. An AJAX or other Web client renders the map and any geocoded KPIs or objects in the data. (Geocoding assigns a latitude and longitude to a data object.) End users can then pan and zoom on the maps as well as hover over map features to view their properties and click to view underlying data. (See figure 1 above.) This approach requires a developer to write JavaScript or other code.
  • GIS Mashups. GIS mashups are similar to BI mashups above but go a step further because they integrate with a full-featured GIS server, either on premise or via a Web service. Here, a BI tool embeds a special GIS connector that integrates with a mapping server and gives the report developer a point-and-click interface for adding interactive maps to reports and dashboards. In this approach, the end user gains additional functionality, such as the ability to interact with custom maps created by in-house GIS specialists and to "lasso" features on a map and use those selections to query or filter other objects in a report or dashboard. Some vendors, such as Information Builders and MicroStrategy, built custom interfaces to GIS products, while others, such as IBM Cognos and SAP BusinessObjects, embed third-party connectors (e.g., SpotOn and APOS, respectively).
  • GIS-enabled Databases. Although GIS function like object-relational databases, they store data in relational format. Thus, there is no reason that companies can't store spatial data in a data warehouse or data mart and make it available to all users and applications that need it. Many relational databases, such as Oracle, IBM DB2, Netezza, and Teradata, support spatial data types and SQL extensions for querying spatial data. Here, both BI systems and GIS can access the same spatial data set, providing economies of scale, greater data consistency, and broader adoption of location intelligence functionality. However, you will still need a map server for spatial presentation. (A minimal sketch of this approach follows this list.)
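
To show what the last option looks like in practice, here is a minimal, runnable sketch that keeps geocoded stores in an ordinary relational table and queries them by distance. It uses SQLite in memory with a user-defined distance function purely as a stand-in; a real warehouse (Oracle, DB2, Netezza, Teradata) would use its native spatial data types and SQL functions instead. The table, data, and radius are invented for illustration.

    import sqlite3
    from math import radians, sin, cos, asin, sqrt

    def haversine_miles(lat1, lon1, lat2, lon2):
        """Straight-line distance in miles between two lat/lon points."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + \
            cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 3956 * asin(sqrt(a))

    conn = sqlite3.connect(":memory:")
    conn.create_function("dist_miles", 4, haversine_miles)  # expose distance to SQL
    conn.executescript("""
        CREATE TABLE store_sales (store TEXT, lat REAL, lon REAL, sales REAL);
        INSERT INTO store_sales VALUES
            ('Downtown', 42.355, -71.060, 1200000),
            ('Suburban', 42.480, -71.150,  800000),
            ('Remote',   41.700, -70.300,  300000);
    """)

    # Stores within 15 miles of a geocoded point of interest (made-up coordinates)
    rows = conn.execute("""
        SELECT store, ROUND(dist_miles(lat, lon, 42.36, -71.06), 1) AS miles, sales
        FROM store_sales
        WHERE dist_miles(lat, lon, 42.36, -71.06) < 15
        ORDER BY sales DESC
    """).fetchall()
    print(rows)  # the 'Remote' store drops out of the result

With a spatially-enabled warehouse, the same pattern scales from this toy table to the full customer and transaction history, and both BI tools and GIS can query the same data.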

Recommendations

As visual analysis in all shapes and forms begins to permeate the world of BI, it's important to begin thinking about how to augment your reports and dashboards with location intelligence. Here are a few recommendations to get you started:

  1. Identify BI applications where location intelligence could accelerate user consumption of information and enhance their understanding of underlying trends and patterns.
  2. Explore the GIS capabilities of your BI and data warehousing vendors to see if they can support the types of spatial applications you have in mind.
  3. Identify GIS applications that already exist in your organization and get to know the people who run them.
  4. Investigate Web-based mapping services from GIS vendors as well as Google and Bing, since these obviate the need for an in-house GIS.
  5. Start simply, by using existing geography fields in your data (e.g., state, county, and zip) to shade the respective boundaries in a baseline map based on aggregated metric data (see the sketch after this list).
  6. Combine spatial and business data in a single location, preferably your data warehouse, so you can deliver spatially-enabled insights to all business users.
  7. Geocode business data, including customer records, metrics, and other objects that you might want to display on a map.
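
To illustrate recommendation 5, the sketch below shows the aggregation step that feeds a shaded (choropleth) baseline map: roll a metric up by an existing geography field and hand the totals to whatever map template or mapping service you use. The records are invented; it is the pattern, not the data, that matters.

    from collections import defaultdict

    # Invented transaction records; "state" is the existing geography field.
    transactions = [
        {"state": "CA", "sales": 1200.0},
        {"state": "CA", "sales":  450.0},
        {"state": "NY", "sales":  980.0},
        {"state": "TX", "sales":  310.0},
    ]

    # Aggregate the metric by state; these totals drive the map shading.
    sales_by_state = defaultdict(float)
    for t in transactions:
        sales_by_state[t["state"]] += t["sales"]

    print(dict(sales_by_state))  # {'CA': 1650.0, 'NY': 980.0, 'TX': 310.0}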

Location intelligence is not new but it should be a key element in any analytics strategy. Adding location intelligence to BI applications not only makes them visually rich, but surfaces patterns and trends not easily discerned in tables and charts.


Posted September 19, 2011 2:51 PM

In a recent blog ("What's in a Word: The Evolution of BI Semantics"), I discussed the evolution of BI semantics and end-user approaches to business intelligence. In this blog, I will focus on technology evolution and vendor messaging.

Four Market Segments. The BI market comprises four sub-markets that have experienced rapid change and growth since the 1990s: BI tools, data integration tools, database management systems (DBMS), and hardware platforms. (See bottom half of figure 1.)

Figure 1. BI Market Evolution.

Compute Platform. BI technologies in these market segments run on a compute infrastructure (i.e., the diagonal line in figure 1) that has changed dramatically over the years, evolving from mainframes and mini-computers in the 1980s and client/server in the 1990s to the Web and Web services in the early 2000s. Today, we see the advent of mobile devices and cloud-based platforms. Each change in the underlying compute platform has created opportunities for upstarts with new technology to grab market share and forced incumbents to respond in kind or acquire the upstarts. With an endless wave of new companies pushing innovative new technologies, the BI market has been one of the most dynamic in the software industry during the past 20 years.

BI Tools. Prior to 1990, companies built reports using 3GL and 4GL reporting languages, such as Focus and Ramis. In the 1990s, vendors began selling desktop or client/server tools that enabled business users to create their own reports and analyses. The prominent BI tools were Windows-based OLAP, ad hoc query, and ad hoc reporting tools, and, of course, Excel, which still is the most prevalent reporting and analysis tool in the market today.

In the 2000s, BI vendors "rediscovered" reporting, having been enraptured with analysis tools in the 1990s. They learned the hard way that only a fraction of users want to analyze data and that the real market for BI lies in delivering reports and, subsequently, dashboards, which are essentially visual exception reports. Today, vendors have moved to the next wave of BI, which is predictive analytics, while offering support for new channels of delivery (mobile and cloud). In the next five years, I believe BI search will become an integral part of a BI portfolio, since it provides a super easy interface for casual users to submit ad hoc queries and navigate data without boundaries.

BI Vendor Messaging. In the 1990s, vendors competed by bundling together multiple types of BI tools (reporting, OLAP, query) into a single "BI Suite." A few years later, they began touting "BI Platforms," in which once-distinct BI tools in a suite became modules within a unified BI architecture that all use the same query engine, charting engine, user interface, metadata, administration, security model, and application programming interface. In the late 1990s, Microsoft launched the movement toward low-cost BI tools geared to the mid-market when it bundled its BI and ETL tools in SQL Server at no extra charge. Today, a host of low-cost BI offerings, including open source BI tools, cloud BI tools, and in-memory visual analysis tools, have helped bring BI to the mid-market and lower the cost of departmental BI initiatives.

Today, BI tools have become easier to use and are tailored to a range of information consumption styles (i.e., viewer, interactor, lightweight author, professional author). Consequently, the watchword is now "self-service BI," in which business users meet their own information requirements rather than relying on BI professionals or power users to build reports on their behalf. Going forward, BI tool vendors will begin talking about "embedded BI," in which analytics (e.g., charts, tables, models) are embedded in operational applications and mission-critical business processes.

Data Integration Tools. In the data integration market, Informatica and Ascential Software (now IBM) led the charge toward the use of extract, transform, and load (ETL) engines to replace hand-coded programs that move data from source systems to a data warehouse. The engine approach proved superior to coding because its graphical interface meant you didn't have to be a hard-core programmer to write ETL code and, more importantly, it captured metadata in a repository instead of burying it in code.

But vendors soon discovered that ETL tools are only one piece of the data integration puzzle and, following the lead of their BI brethren, moved to create data integration "suites" consisting of data quality, data profiling, master data management, and data federation tools. Soon, these suites turned into data integration "platforms" running on a common architecture. Today, the focus is on using data federation tools to "virtualize" data sources behind a common data services interface and cloud-based data integration tools to migrate data from on premises to the cloud and back again. Also, data integration vendors are making their tools easier to use, thanks in large part to cloud-based initiatives, which now has them evangelizing the notion of "self-service data integration" in which business analysts, not IT developers, build data integration scripts.

DBMS Engines and Hardware. Throughout the 1990s and early 2000s, the database and hardware markets were sleepy backwaters of the BI market, despite the fact that they consumed a good portion of BI budgets. True, database vendors had added cubing, aggregate-aware optimizers, and various types of indexes to speed query performance, but that was the extent of the innovation.

But in the early 2000s, as data warehouse data volumes began to exceed the terabyte mark and query complexity grew, many data warehouses hit the proverbial wall. Meanwhile, Moore's law continued to make dramatic strides in the price-performance of processing, storage, and memory, and soon a few database entrepreneurs spotted an opportunity to overhaul the underlying BI compute infrastructure.

Netezza opened the floodgates in 2002 with the first data warehousing appliance (unless you count Teradata back in the 1980s!), which offered orders-of-magnitude better query performance for a fraction of the cost and soon gained a bevy of imitators. These new systems offer innovative storage-level filtering, column-based compression and storage, massively parallel processing architectures, expanded use of memory-based caches, and, in some cases, solid-state disks to bolster performance and availability for analytic workloads. Today, these "analytic platforms" are turbo-charging BI deployments and, in many cases, enabling BI professionals to deliver solutions that weren't possible before.

As proof of the power of these new purpose-built analytical systems, the biggest vendors in high-tech have invaded the market, picking off leading pureplays before they've even fully germinated. In the past nine months, Microsoft, IBM, Hewlett Packard, Teradata, SAP, and EMC purchased analytic platform vendors, while Oracle built its own with hardware from Sun Microsystems, which it acquired in 2009. (See "Jockeying for Position in the Analytic Platform Market.")

Mainstream Market. When viewed as a whole, the BI market has clearly moved from the early-adopter phase into the early mainstream. The watershed moment was 2007, when the biggest software vendors in the world--Oracle, SAP, and IBM--acquired the leading BI vendors--Hyperion, Business Objects, and Cognos, respectively. Also, the plethora of advertisements about BI capabilities that appear on television (e.g., IBM's Smarter Planet campaign) and in major consumer magazines (e.g., SAP and SAS Institute ads) reinforces the maturity of BI as a mainstream market. BI is now front and center on the radar screen of most CIOs, if not CEOs, who want to better leverage information to make smarter decisions and gain a lasting competitive advantage.

The Future. At this point, some might wonder if there is much headroom left in the BI market. The last 20 years have witnessed a dizzying array of technology innovations, products, and methodologies. It can't continue at this pace, right? Yes and no. The BI market has surprised us in the past. Even in recent years, as the BI market consolidated--with big software vendors acquiring nimble innovators--we've seen a tremendous explosion of innovation. BI entrepreneurs see a host of opportunities, from better self-service BI tools that are more visual and intuitive to use to mobile and cloud-based BI offerings that are faster, better, and cheaper than current offerings. Search vendors are making a play for BI, as are platform vendors that promise data center scalability and availability for increasingly mission-critical BI workloads. And we still need better tools and approaches for querying and analyzing unstructured content (e.g., documents, email, clickstream data, Web pages) and for delivering data faster as our businesses increasingly compete on velocity and as our data volumes become too large to fit inside shrinking batch windows.

Next week, Beye Research will publish a report of mine that describes a new BI Delivery Framework for the next ten years. In that report, I describe a future-state BI environment that contains not just one intelligence (business intelligence) but four--business, analytic, continuous, and content intelligence--that BI organizations will need to support or interoperate with in the near future. Stay tuned!


Posted March 18, 2011 2:29 PM


I don't think I've ever seen a market consolidate as fast as the analytic platform market.

By definition, an analytic platform is an integrated hardware and software data management system geared to query processing and analytics that offers dramatically higher price-performance than general purpose systems. After talking with numerous customers of these systems, I am convinced they represent game-changing technology. As such, major database vendors have been tripping over themselves to gain the upper hand in this multi-billion dollar market.

Rapid Fire Acquisitions. Microsoft made the first move when it purchased Datallegro in July 2008. But it's taken two years for Microsoft to port the technology to Windows and SQL Server, so, ironically, it finds itself trailing the leaders. Last May, SAP acquired Sybase, largely for its mobile technology but also for its Sybase IQ analytic platform, which has long been the leading column-store database on the market and has done especially well in financial services. And SAP is sparking tremendous interest within its installed base for HANA, an in-memory appliance designed to accelerate query performance of SAP BW and other analytic applications.

Two months after SAP acquired Sybase, EMC snapped up Greenplum, a massively parallel processing (MPP) database vendor, and reportedly has done an excellent job executing new deals. Two months later, in September 2010, IBM purchased the leading pureplay, Netezza, in an all-cash deal worth $1.8 billion that could be a boon to Netezza if IBM can clearly differentiate between its multiple data warehousing offerings and execute well in the field.

And last month, Hewlett Packard, whose NeoView analytic platform died ingloriously last fall, scooped up Vertica, a market-leading columnar database with many interesting scalability and availability features. And finally, Teradata this week announced it was purchasing AsterData, an MPP shared-nothing database with rich SQL MapReduce functions that can perform deep analytics on both structured and unstructured data.

So, in the past nine months, the world's biggest high tech companies purchased five of the leading, pureplay analytic platforms. This rapid pace of consolidation is dizzying!

Consolidation Drivers

Fear and Loathing. Part of this consolidation frenzy is driven by fear. Namely, fear of being left out of the market. And perhaps fear of Oracle, whose own analytic platform, Exadata, has gathered significant market momentum, knocking unsuspecting rivals back on their heels. Although pricey, Exadata not only fuels game-changing analytic performance, it now also supports transaction applications--a one-stop database engine that competitors may have difficulty derailing (unless Oracle shoots itself in the foot with uncompromising terms for licensing, maintenance, and proofs of concept).

Core Competencies. These analytic platform vendors are now carving out market niches where they can outshine the rest. For Oracle, it's a high-performance, hybrid analytic/transaction system; SAP touts its in-memory acceleration (HANA) and a mature columnar database that supports real-time analytics and complex event processing; EMC Greenplum targets complex analytics against petabytes of data; Aster Data focuses on analytic applications in which SQL MapReduce is an advantage; Teradata touts its mixed workload management capabilities and workload-specific analytic appliances; IBM Netezza focuses on simplicity, fast deployments, and quick ROI; Vertica trumpets its scalability, reliability, and availability now that other vendors have added columnar storage and processing capabilities; and Microsoft is pitching its PDW along with a series of data mart appliances and a BI appliance.

Pureplays Looking for Cover. The rush of acquisitions leaves a number of viable pureplays out in the cold. Without a big partner, these vendors will need to clearly articulate their positioning and work hard to gain beachheads within customer accounts. ParAccel, for example, is eyeing Fortune 100 companies with complex analytic requirements, targeting financial services where it says Sybase IQ is easy pickings. Dataupia is seeking cover in companies that have tens to hundreds of petabytes to query and store. Kognitio likes its chances with flexible cloud-based offerings that customers can bring in-house if desired. InfoBright is targeting the open source MySQL market, while Sand Technology touts its columnar compression, data mart synchronization, and text parsing capabilities. Ingres is pursuing the open source data warehousing market, and its new Vectorwise technology makes it a formidable in-memory analytics processing platform.

Despite the rapid consolidation of the analytic platforms market, there is still obviously lots of choice left for customers eager to cash in on the benefits of purpose-built analytical machines that deliver dramatically higher price-performance than database management systems of the past. Although the action was fast and furious in 2010, the race has only just begun. So, fasten your seat belts as players jockey for position in the sprint to the finish.


Posted March 8, 2011 8:20 AM

I had the pleasure this week of talking about performance dashboards and analytics to more than 100 CFOs and financial managers at CFO Magazine's Corporate Performance Management Conference in New York City. They were a terrific audience: highly engaged with great questions and many were taking copious notes. Many were from mid-size companies. The dashboard topic was so popular that the event organizers scheduled a second three-hour workshop to accommodate demand.

I normally talk about business intelligence (BI) to IT audiences, so it was refreshing to address a business audience. Not surprisingly, they came at the topic from a slightly different perspective. Here's a sample of what they were thinking about:

  • Scorecards. The CFOs were more interested in scorecards than I anticipated. Since there are entire conferences devoted to scorecards (e.g. Palladium and the Balanced Scorecard Collaborative), which have largely attracted a financial audience, I thought that this would be old news to them. But I was wrong. They were particularly interested in how to cascade metrics throughout an organization and across scorecard environments.
  • Metrics. Not surprisingly, many found it challenging to create metrics in the first place. Most found that "the business" couldn't decide what it wanted or achieve consensus among various departmental heads. We talked about the challenges of "top-down" metrics-driven BI versus bottom-up ad hoc BI, and the tradeoffs of each approach.
  • The "Business." Since I've always considered finance to be part of the "business" it was surprising to hear finance refer to the "business" as a group separate from them. But, then it dawned on me that finance, like IT, is a shared service that is desperately trying to move from the back-office to the front-office and deliver more value to the business. Many CFOs in the audience have astutely recognized that providing consistent information and metrics via a dashboard is a great way to add value.
  • Project Management. The CFOs didn't have much perspective on how to organize a dashboard project. They didn't realize that you need a steering committee (e.g., sponsors), a KPI team (e.g., subject matter experts plus one IT person), and a development team, and that the team doesn't disband after the project ends (i.e., project versus program management).
  • Two to Tango. They also seemed to recognize that the business, not the IT team, is the primary reason for failed BI projects. If the business says it wants a new dashboard but the sponsor doesn't devote enough time to see the project through or free up the time of key subject matter experts to work with the BI team, the project can't succeed. Performance dashboards must be business-owned and business-driven to succeed.
  • Requirements. Many CFOs also didn't realize that you need to develop requirements (i.e., define metrics) before purchasing a tool. They admit that many projects they've been involved in have put the cart before the horse, so to speak.
  • Technology. Not surprisingly, the CFOs had little understanding of the tools and architecture required to drive various types of dashboards. I don't talk much about dashboard technology and architectures to IT audiences because they know most of it already. But it's all new to the business, even basic things like how the data gets into a dashboard screen.
  • Build Once, Deploy Many Times. Perhaps the biggest revelation for many business people was the notion that you build a dashboard once and configure the views based on user roles and permissions. They didn't understand that one dashboard could consist of separate and distinct views for sales, marketing, finance, etc., and that within each of those views, the data could vary based on your level in the organization and your permissions (a minimal sketch of the idea follows this list).
  • Change Management. Most recognized change management as a huge issue. Most had experienced internal resistance to new performance measurements and were eager to share stories and swap ideas for ensuring adoption.
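
To make the "build once, deploy many times" point concrete, here is a minimal sketch of one dashboard definition rendered as role-specific views, with row-level data further filtered by the user's scope. The roles, metrics, and numbers are invented for illustration.

    # One dashboard definition, many role-specific views (invented example).
    DASHBOARD_VIEWS = {
        "sales":     ["pipeline", "bookings_by_region"],
        "marketing": ["campaign_response", "lead_conversion"],
        "finance":   ["revenue_vs_plan", "operating_margin"],
    }

    METRICS = [  # rows tagged with the scope allowed to see them
        {"metric": "bookings_by_region", "region": "East", "value": 4.2},
        {"metric": "bookings_by_region", "region": "West", "value": 3.7},
        {"metric": "revenue_vs_plan",    "region": "All",  "value": 0.97},
    ]

    def render(role, region):
        """Return only the views and rows this user's role and scope permit."""
        views = DASHBOARD_VIEWS.get(role, [])
        rows = [m for m in METRICS
                if m["metric"] in views and m["region"] in (region, "All")]
        return {"views": views, "rows": rows}

    print(render("sales", "East"))   # an eastern sales manager's slice
    print(render("finance", "All"))  # the CFO's company-wide slice

In a real deployment, the roles, scopes, and row-level filters come from the BI tool's security model rather than hand-written code, but the principle is the same.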

What I Learned

I learned a few things, too. First, three hours is not enough time to address all the topics that business people need to learn to have a working knowledge of performance dashboards. Thankfully, I covered the most important topics in my book, "Performance Dashboards: Measuring, Monitoring, and Managing Your Business" which just came out in its second edition.

Second, I realized I have a lot to offer a business audience. Although I've been addressing IT audiences for the past 22 years, the way I present information resonates better with a business audience. It's not that I avoid technical issues; rather, I place technology in a business and process context and provide pragmatic examples and advice so people can apply the information back in the office.

Hopefully, I'll be delivering more business-oriented presentations in the coming months and years!


Posted February 2, 2011 11:23 AM

As companies grapple with the gargantuan task of processing and analyzing "big data," certain technologies have captured the industry limelight, namely massively parallel processing (MPP) databases, such as those from Aster Data and Greenplum; data warehousing appliances, such as those from Teradata, Netezza, and Oracle; and, most recently, Hadoop, an open source distributed file system that uses the MapReduce programming model to process key-value data in parallel across large numbers of commodity servers.

SMP Machines. Missing in action from this list is the venerable symmetric multiprocessing (SMP) machine, which parallelizes operations across multiple CPUs (or cores). The industry today seems to favor "scale out" parallel processing approaches (where processes run across commodity servers) rather than "scale up" approaches (where processes run on a single server). However, with the advent of multi-core servers that today can pack upwards of 48 cores into a single machine, the traditional SMP approach is worth a second look for processing big data analytics jobs.

The benefits of applying parallel processing within a single server rather than across multiple servers are obvious: reduced processing complexity and a smaller server footprint. Why buy 40 servers when one will do? MPP systems require more boxes, which require more space, cooling, and electricity. Also, distributing data across multiple nodes chews up valuable processing time, and overcoming node failures--which are more common when you string together dozens, hundreds, or even thousands of servers into a single, coordinated system--adds overhead that reduces performance.

Multi-Core CPUs. Moreover, since chipmakers maxed out the processing frequency of individual CPUs in 2004, the only way they can deliver improved performance is by packing more cores into a single chip. Chipmakers started with two-core chips, then quad-cores, and now eight- and 16-core chips are becoming commonplace.

Unfortunately, few software programs that can benefit from parallelizing operations have been redesigned to exploit the tremendous amount of power and memory available within multi-core servers. Big data analytics applications are especially good candidates for thread-level parallel processing. As developers recognize the untold power lurking within their commodity servers, I suspect that next year SMP processing will gain an equivalent share of attention among big data analytics proselytizers.
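
As a concrete taste of what thread- or process-level parallelism on one SMP box looks like, the sketch below uses nothing but the Python standard library to fan a computation out across all available cores and merge the partial results. The workload (summing squares) is a stand-in for a real analytic kernel; a parallel dataflow engine does this kind of partitioning and scheduling for you automatically.

    from concurrent.futures import ProcessPoolExecutor
    import os

    def partial_sum(chunk):
        """The per-core unit of work; a stand-in for a real analytic computation."""
        return sum(x * x for x in chunk)

    def parallel_sum_of_squares(values, workers=None):
        workers = workers or os.cpu_count()            # one worker per core
        size = max(1, len(values) // workers)
        chunks = [values[i:i + size] for i in range(0, len(values), size)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(partial_sum, chunks))  # merge the partial results

    if __name__ == "__main__":  # required for process pools on some platforms
        data = list(range(1_000_000))
        print(parallel_sum_of_squares(data))

The point is not the toy workload but the pattern--partition, process on separate cores, merge--which is the same pattern a parallel dataflow engine applies to data preparation and analytics operators.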

Pervasive DataRush

One company that is at the forefront of exploiting multi-core chips for analytics is Pervasive Software, a $50 million software company that is best known for its Pervasive Integration ETL software (which it acquired from Data Junction) and Pervasive PSQL, its embedded database (a.k.a. Btrieve).

In 2009, Pervasive released a new product, called Pervasive DataRush, a parallel dataflow platform designed to accelerate performance for data preparation and analytics tasks. It fully leverages the parallel processing capabilities of multi-core processors and SMP machines, making it unnecessary to implement clusters (or MPP grids) to achieve suitable performance when processing and analyzing moderate to heavy volumes of data.

Sweet Spot. As a parallel data flow engine, Pervasive DataRush is often used today to power batch processing jobs, and is particularly well suited to running data preparation tasks (e.g. sorting, deduplicating, aggregating, cleansing, joining, loading, validating) and machine learning programs, such as fuzzy matching algorithms.

Today, DataRush will outperform Hadoop on complex processing jobs that address data volumes ranging from 500GB to tens of terabytes. It is not geared to handling hundreds of terabytes to petabytes of data, which is the territory of MPP systems and Hadoop. However, as chipmakers continue to add more cores to chips and Pervasive releases DataRush 5.0, which supports small clusters, later this year, DataRush's high-end scalability will continue to increase.

Architecture. DataRush is not a database; it's a development environment and execution engine that runs in a Java Virtual Machine. Its Eclipse-based development environment provides a library of parallel operators for developers to create parallel dataflow programs. Although developers need to understand the basics of parallel operations--such as when it makes sense to partition data and/or processes based on the nature of their application--DataRush handles all the underlying details of managing threads and processes across one or more cores to maximize utilization and performance. As you add cores, DataRush automatically readjusts the underlying parallelism without forcing the developer to recompile the application.

Versus Hadoop. To run DataRush, you feed the execution engine formatted flat files or database records and it executes the various steps in the dataflow and spits out a data set. As such, it's more flexible than Hadoop, which requires data to be structured as key-value pairs and partitioned across servers, and MapReduce, which forces developers to use one type of programming model for executing programs. DataRush also doesn't have the overhead of Hadoop, which requires each data element to be duplicated in multiple nodes for failover purposes and requires lots of processing to support data movement and exchange across nodes. But like Hadoop, it's focused on running predefined programs in batch jobs, not ad hoc queries.
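
For readers who haven't worked with MapReduce, the toy sketch below shows the constraint being described: every job has to be expressed as a map step and a reduce step over key-value pairs, with a shuffle in between (which Hadoop performs across the cluster). The data is invented; the point is the shape of the programming model, which is exactly the rigidity a general-purpose dataflow engine avoids.

    from collections import defaultdict

    records = ["error", "ok", "error", "warn", "ok", "error"]  # invented log data

    # Map phase: emit a (key, value) pair for each input record.
    mapped = [(status, 1) for status in records]

    # Shuffle phase: group values by key (Hadoop does this across the cluster).
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce phase: collapse each key's values into a single result.
    counts = {key: sum(values) for key, values in grouped.items()}
    print(counts)  # {'error': 3, 'ok': 2, 'warn': 1}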

Competitors. Perhaps the closest competitors to Pervasive DataRush are Ab Initio, a parallel ETL tool, and Syncsort, a high-speed sorting engine. But these tools were developed before the advent of multi-core processing and don't exploit it to the same degree as DataRush. Plus, DataRush is not focused just on back-end processing; it can handle front-end analytic processing as well. Its dataflow development environment and engine are generic. DataRush actually makes a good complement to MPP databases, which often suffer from a data-loading bottleneck. When used as a transformation and loading engine, DataRush can achieve 2TB/hour throughput, according to company officials.

Despite all the current hype about MPP and scale-out architectures, it could be that scale-up architectures that fully exploit multi-core chips and SMP machines will win the race for mainstream analytics computing. Although you can't apply DataRush to existing analytic applications (you have to rewrite them), it will make a lot of sense to employ it for most new big data analytics applications.


Posted January 4, 2011 11:17 AM
