Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor-neutral content to business intelligence (BI) professionals worldwide. Wayne's consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI), where he oversaw the company's content and training programs and chaired its BI Executive Summit. He can be reached by email at weckerson@techtarget.com.

Recently in Appliances Category

The prior article in this series discussed the human side of analytics. It explained how companies need to have the right culture, people, and organization to succeed with analytics. The flip side is the "hard stuff"--the architecture, platforms, tools, and data--that makes analytics possible. Although analytical technology gets the lion's share of attention in the trade press--perhaps more than it deserves for the value it delivers--it nonetheless forms the bedrock of all analytical initiatives. This article examines the architecture, platforms, tools, and data needed to deliver robust analytical solutions.

Architecture

The term "analytical architecture" is an oxymoron. In most organizations, business analysts are left to their own devices to access, integrate, and analyze data. By necessity, they create their own data sets and reports outside the purview and approval of corporate IT. By definition, there is no analytical architecture in most organizations--just a hodge-podge of analytical silos and spreadmarts, each with conflicting business rules and data definitions.

Analytical sandboxes. Fortunately, with the advent of specialized analytical platforms (discussed below), BI architects have more options for bringing business analysts into the corporate BI fold. They can use these high-powered database platforms to create analytical sandboxes for the explicit use of business analysts. These sandboxes, when designed properly, give analysts the flexibility they need to access corporate data at a granular level, combine it with data that they've sourced themselves, and conduct analyses to answer pressing business questions. With analytical sandboxes, BI teams can transform business analysts from data pariahs to full-fledged members of the BI community.

There are four types of analytical sandboxes:


  • Staging Sandbox. This is a staging area for a data warehouse that contains raw, non-integrated data from multiple source systems. Analysts generally prefer to query a staging area that contains all the raw data rather than query each source system individually. Hadoop increasingly serves as a staging area for large volumes of unstructured data that a growing number of companies are adding to their BI ecosystems.

  • Virtual Sandbox. A virtual sandbox is a set of tables inside a data warehouse assigned to individual analysts. Analysts can upload data into the sandbox and combine it with data from the data warehouse, giving them one place to go to do all their analyses. The BI team needs to carefully allocate compute resources so analysts have enough horsepower to run ad hoc queries without interfering with other workloads running on the data warehouse. (A minimal sketch of this pattern follows the list.)

  • Free-standing sandbox. A free-standing sandbox is a separate database server that sits alongside a data warehouse and contains its own data. It's often used to offload complex, ad hoc queries from an enterprise data warehouse and give business analysts their own space to play. In some cases, these sandboxes contain a replica of data in the data warehouse, while in others, they support entirely new data sets that don't fit in a data warehouse or run faster on an analytical platform.

  • In-memory BI sandbox. Some desktop BI tools maintain a local data store, either in memory or on disk, to support interactive dashboards and queries. Analysts love these types of sandboxes because they connect to virtually any data source and enable analysts to model data, apply filters, and visually interact with the data without IT intervention.
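
To make the virtual sandbox pattern concrete, here is a minimal sketch in Python, using SQLite as a stand-in for the warehouse platform. The table, column, and file names are hypothetical, and a real implementation would rely on the warehouse's own security and workload management features rather than a local database file.

```python
import csv
import sqlite3

# Stand-in for the data warehouse; a real deployment would connect to the
# warehouse platform itself and create the sandbox as a separate schema.
# Assumes warehouse.db already contains an orders(customer_id, order_amount) table.
conn = sqlite3.connect("warehouse.db")

# Hypothetical sandbox table owned by a single analyst.
conn.execute("""
    CREATE TABLE IF NOT EXISTS sandbox_jsmith_campaign_targets (
        customer_id INTEGER,
        segment     TEXT
    )
""")

# Step 1: the analyst uploads locally sourced data into the sandbox.
with open("campaign_targets.csv", newline="") as f:
    rows = [(int(r["customer_id"]), r["segment"]) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO sandbox_jsmith_campaign_targets VALUES (?, ?)", rows)

# Step 2: combine the uploaded data with governed warehouse data in one query,
# so the analysis never has to leave the warehouse environment.
result = conn.execute("""
    SELECT s.segment, SUM(o.order_amount) AS revenue
    FROM sandbox_jsmith_campaign_targets s
    JOIN orders o ON o.customer_id = s.customer_id
    GROUP BY s.segment
""").fetchall()

print(result)
conn.close()
```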

Next-Generation BI Architecture. Figure 1 depicts a BI architecture with the four analytical sandboxes colored in green. The top half of the diagram represents a classic top-down, data warehousing architecture that primarily delivers interactive reports and dashboards to casual users (although the streaming/complex event processing (CEP) engine is new.) The bottom half of the diagram depicts a bottom-up analytical architecture with analytical sandboxes along with new types of data sources. This next-generation BI architecture better accommodates the needs of business analysts and data scientists, making them full-fledged members of the corporate BI ecosystem.

Figure 1. The New BI Architecture

The next-generation BI architecture is more analytical, giving power users greater options to access and mix corporate data with their own data via various types of analytical sandboxes. It also brings unstructured and semi-structured data fully into the mix using Hadoop and nonrelational databases.

Analytical Platforms

Since the beginning of the data warehousing movement in the early 1990s, organizations have used general-purpose data management systems to implement data warehouses and, occasionally, multidimensional databases (i.e., "cubes") to support subject-specific data marts, especially for financial analytics. General-purpose data management systems were designed for transaction processing (i.e., rapid, secure, synchronized updates against small data sets) and only later modified to handle analytical processing (i.e., complex queries against large data sets.) In contrast, analytical platforms focus entirely on analytical processing at the expense of transaction processing.

The analytical platform movement. In 2002, Netezza (now owned by IBM) introduced a specialized analytical appliance, a tightly integrated, hardware-software database management system designed explicitly to run ad hoc queries against large volumes of data at blindingly fast speeds. Netezza's success spawned a host of competitors, and there are now more than two dozen players in the market. (See Table 1.)

Table 1. Types of Analytical Platforms

Today, the technology behind analytical platforms is diverse: appliances, columnar databases, in-memory databases, massively parallel processing (MPP) databases, file-based systems, nonrelational databases, and analytical services. What they all have in common, however, is that they provide significant improvements in price-performance, availability, load times, and manageability compared with general-purpose relational database management systems. Every analytical platform customer I've interviewed has cited order-of-magnitude performance gains that most initially didn't believe.

Moreover, many of these analytical platforms contain built-in analytical functions that make life easier for business analysts. These functions range from fuzzy matching algorithms and text analytics to data preparation and data mining functions. By putting functions in the database, analysts no longer have to craft complex, custom SQL or offboard data to analytical workstations, which limits the amount of data they can analyze and model.
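
To illustrate the difference, here is a minimal sketch in Python. The first approach offboards the data and matches it on the analyst's workstation; the second pushes the same work into the database. The fuzzy_score function and the customers table are hypothetical stand-ins, not any particular vendor's built-in function.

```python
import difflib
import sqlite3

conn = sqlite3.connect("warehouse.db")  # stand-in for an analytical platform
# Assumes a customers(customer_name) table already exists.

# Approach 1: offboard the data and match it locally.
# This limits the analysis to whatever fits on the workstation.
names = [row[0] for row in conn.execute("SELECT customer_name FROM customers")]
local_matches = [
    n for n in names
    if difflib.SequenceMatcher(None, n.lower(), "acme corp").ratio() > 0.8
]

# Approach 2: push the work down to the database. Here we register a Python
# function as a stand-in for a built-in analytical function; on a real
# analytical platform the equivalent function ships with the database and
# runs in parallel next to the data.
conn.create_function(
    "fuzzy_score", 2,
    lambda a, b: difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio(),
)
pushed_down = conn.execute(
    "SELECT customer_name FROM customers WHERE fuzzy_score(customer_name, ?) > 0.8",
    ("acme corp",),
).fetchall()

print(local_matches, pushed_down)
```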

Companies use analytical platforms to support free-standing sandboxes (described above) or as replacements for data warehouses running on MySQL and SQL Server, and occasionally major OLTP databases from Oracle and IBM. They also improve query performance for ad hoc analytical tools, especially those that connect directly to databases to run queries (versus those that download data to a local cache.)

Analytical Tools

In 2010, vendors turned their attention to meeting the needs of power users after ten years of enhancing reporting and dashboard solutions for casual users. As a result, the number of analytical tools on the market has exploded.

Analytical tools come in all shapes and sizes. Analysts generally need one of every type of tool. Just as you wouldn't hire a carpenter to build an addition to your house with just one tool, you don't want to restrict an analyst to just one analytical tool. Like a carpenter, an analyst needs a different tool for every type of job they do. For instance, a typical analyst might need the following tools:

  • Excel to extract data from various sources, including local files, create reports, and share them with others via a corporate portal or server (managed Excel).
  • BI search tools to issue ad hoc queries against a BI tool's metadata.
  • Planning tools (including Excel) to create strategic and tactical plans, each containing multiple scenarios.
  • Mashboards and ad hoc reporting tools to create ad hoc dashboards and reports on behalf of departmental colleagues.
  • Visual discovery tools to explore data in one or more data sources and create interactive dashboards on behalf of departmental colleagues.
  • Multidimensional OLAP (MOLAP) tools to explore small and medium-sized data sets dimensionally at the speed of thought and run complex dimensional calculations.
  • Relational OLAP tools to explore large data sets dimensionally and run complex calculations.
  • Text analytics tools to parse text data and put it in a relational structure for analysis.
  • Data mining tools to create descriptive and predictive models.
  • Hadoop and MapReduce to process large volumes of unstructured and semi-structured data in a parallel environment (see the sketch below).
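
To show what the MapReduce model behind that last item looks like, here is a minimal word-count sketch in plain Python. Hadoop runs the same map and reduce logic in parallel across many machines, but the shape of the program is the same.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (key, value) pair for every word in a line of raw text.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Combine all the values emitted for a given key.
    return (key, sum(values))

def mapreduce(lines):
    # Shuffle: group intermediate pairs by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return [reduce_phase(key, values) for key, values in groups.items()]

print(mapreduce(["big data is big", "data beats opinion"]))
```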

Figure 2. Types of Analytical Tools

Figure 2 plots these tools on a graph where the x axis represents calculation complexity and the y axis represents data volumes. Ad hoc analytical tools for casual users (or more realistically super users) are clustered in the bottom left corner of the graph, while ad hoc tools for power users are clustered slightly above and to the right. Planning and scenario modeling tools cluster further to the right, offering slightly more calculation complexity against small volumes of data. High-powered analytical tools, which generally rely on machine learning algorithms and specialized analytical databases, cluster in the upper right quadrant.

Data

Business analysts function like one-man IT shops. They must access, integrate, clean, and analyze data, and then present it to other users. Figure 3 depicts the typical workflow of a business analyst. If an organization doesn't have a mature data warehouse that contains cross-functional data at a granular level, analysts often spend an inordinate amount of time sourcing, cleaning, and integrating data (steps 1 and 2 in the analyst workflow). They then create a multiplicity of analytical silos (step 5) when they publish data, much to the chagrin of the IT department.

Figure 3. Analyst Workflow

In the absence of a data warehouse that contains all the data they need, business analysts must function as one-man IT shops where they spend an inordinate amount of time iterating between collecting, integrating, and analyzing data. They run into trouble when they distribute their hand-crafted data sets broadly.

Data Warehouse. The most important way that organizations can improve the productivity and effectiveness of business analysts is to maintain a robust data warehousing environment that contains most of the data that analysts need to perform their work. This can take many years. In a fast-moving market where the company adds new products and features continuously, the data warehouse may never catch up. Nonetheless, it's important for organizations to continuously add new subject areas to the data warehouse; otherwise, business analysts have to spend hours or days gathering and integrating this data themselves.

Atomic Data. The data warehouse also needs to house atomic data, or data at the lowest level of transactional detail, not summary data. Analysts generally want the raw data because they can repurpose it in many different ways depending on the nature of the business questions they're addressing. This is the reason that highly skilled analysts like to access data directly from source systems or a data warehouse staging area. At the same time, less skilled analysts appreciate the heavy lifting done by the IT group to clean and integrate disparate data sets using common metrics, dimensions, and attributes. This base level of data standardization expedites their work.

Once a BI team integrates a sufficient number of subject areas in a data warehouse at an atomic level of data, business analysts can have a field day. Instead of downloading data to an analytical workstation, which limits the amount of data they can analyze and process, they can now run calculations and models against the entire data warehouse using analytical functions built into the database or that they've created using database development toolkits. This improves the accuracy of their analyses and models and saves them considerable time.

Summary

The technical side of analytics is daunting. There are many moving parts that all have to work in concert. However, the most important part of the technical equation is the data. The old adage holds true: "garbage in, garbage out." Analysts can't deliver accurate insights if they don't have access to good-quality data. And it's a waste of their time to spend days trying to prepare the data for analysis. A good analytics program is built on a solid data warehousing foundation that embeds analytical sandboxes tailored to the requirements of individual analysts.


Posted November 15, 2011 7:44 AM
Permalink | No Comments |

Teradata has undergone a long overdue conversion. It is no longer a dogmatic proponent of a central data warehousing ideology. Although it still advocates integrated data, it is no longer hostile to the notion of distributed computing.

"We no longer give the EDW sermon," said Ed White, newly appointed general manager of appliances at Teradata , at an analyst briefing in Las Vegas last week. "And it was a sermon. "

During a short presentation, White described how Teradata stopped fighting the reality of the federated data stores that exist at most companies and began embracing it. "We used to argue with companies about departmental systems and why they should avoid them. Now we sell them." Awakening to both architectural and economic realities, White added, "We began to realize that organizations had this entire data ecosystem, and we were only competing for the data warehousing portion of it."

Winning Customers

New Accounts. Today, Teradata has a family of appliances that all run on the Teradata database. These range from the one-node Data Mart Appliance to the nine-node Teradata Data Warehouse Appliance, which is expandable to six cabinets. Teradata now has 225 appliance customers, many of whom purchased the Data Warehouse Appliance to replace Oracle or Microsoft SQL Server as the customer's data warehouse platform. As such, this product is helping Teradata win new accounts at the lower end of the data warehousing spectrum, something it couldn't do in the past because it didn't have a competitively priced offering. Priced at roughly $26,000 per terabyte, the Data Warehouse Appliance is helping dispel the notion that Teradata is out of reach for smaller deployments.

Cannibalization? Although Teradata has been selling its appliances for several years, it has only recently embraced them as a key part of its overall strategy. Executives needed to be convinced that the appliances wouldn't cannibalize the company's flagship product, the Teradata Active DW. Although some customers have replaced their Active DW platform with a Data Warehouse Appliance, White says Teradata would have lost that business to competitors if it didn't offer a more appropriately sized system.

More encouraging, 75 Active EDW customers are also running a Data Warehouse Appliance, reinforcing the notion that customers have multi-faceted data ecosystems and that a one-size-fits-all strategy doesn't always align with customer realities. In the end, Teradata's appliances have become a key wedge to unseat competitors at new accounts and an effective way to retain existing customers and compete for a greater share of a customer's wallet.

Managing an Ecosystem

However, Teradata has some work to do to make an appliance strategy work. As a newcomer to distributed computing, it needs to supply tools for managing a multi-faceted ecosystem and dynamically moving data among complementary systems. Staying true to its ideology of centralized computing, Teradata has established the goal of making a Teradata ecosystem function as if it were a single system. For example, this might mean automatically distributing queries to the appropriate system based on workload and availability.
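
As a rough sketch of what "function as a single system" implies, the toy router below sends each query to the least-loaded available system. It is purely illustrative and not Teradata's actual tooling.

```python
# Toy query router: send each query to the least-loaded available system.
# Purely illustrative; real multi-system management also weighs data
# placement, workload priorities, and service-level rules.
systems = [
    {"name": "active_dw",   "available": True,  "load": 0.85},
    {"name": "appliance_1", "available": True,  "load": 0.30},
    {"name": "appliance_2", "available": False, "load": 0.10},
]

def route(query: str) -> str:
    candidates = [s for s in systems if s["available"]]
    if not candidates:
        raise RuntimeError("no system available for query: " + query)
    target = min(candidates, key=lambda s: s["load"])
    return target["name"]

print(route("SELECT region, SUM(sales) FROM orders GROUP BY region"))
```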

Teradata has already taken a few steps in this direction by releasing Teradata Viewpoint and Teradata Multi-System Manager. Viewpoint provides a single console for monitoring, managing, and controlling Teradata systems in a customer ecosystem. Multi-System Manager goes beyond Teradata products and provides a view of the entire BI stack, from source systems to ETL and BI tools. Expect Teradata to announce additional utilities at its Teradata Partners conference this fall.

On the whole, it's great to see Teradata remake itself, and just in the nick of time. Teradata has always delivered excellent technology and great customer service, and it sincerely desires to make its customers successful. Now that it's taken off its ideological blinders, it has become a rejuvenated competitor as well.


Posted August 6, 2011 5:51 AM
Permalink | No Comments |

In a recent blog ("What's in a Word: The Evolution of BI Semantics"), I discussed the evolution of BI semantics and end-user approaches to business intelligence. In this blog, I will focus on technology evolution and vendor messaging.

Four Market Segments. The BI market comprises four sub-markets that have experienced rapid change and growth since the 1990s: BI tools, data integration tools, database management systems (DBMS), and hardware platforms. (See the bottom half of Figure 1.)

Figure 1. BI Market Evolution

Compute Platform. BI technologies in these market segments run on a compute infrastructure (i.e., the diagonal line in Figure 1) that has changed dramatically over the years, evolving from mainframes and mini-computers in the 1980s and client/server in the 1990s to the Web and Web services in the early 2000s. Today, we see the advent of mobile devices and cloud-based platforms. Each change in the underlying compute platform has created opportunities for upstarts with new technology to grab market share and forced incumbents to respond in kind or acquire the upstarts. With an endless wave of new companies pushing innovative new technologies, the BI market has been one of the most dynamic in the software industry during the past 20 years.

BI Tools. Prior to 1990, companies built reports using 3GL and 4GL reporting languages, such as Focus and Ramis. In the 1990s, vendors began selling desktop or client/server tools that enabled business users to create their own reports and analyses. The prominent BI tools were Windows-based OLAP, ad hoc query, and ad hoc reporting tools, and, of course, Excel, which still is the most prevalent reporting and analysis tool in the market today.

In the 2000s, BI vendors "rediscovered" reporting, having been enraptured with analysis tools in the 1990s. They learned the hard way that only a fraction of users want to analyze data and the real market for BI lies in delivering reports, and subsequently, dashboards, which are essentially visual exception reports. Today, vendors have moved to the next wave of BI, which is predictive analytics, while offering support for new channels of delivery (mobile and cloud.) In the next five years, I believe BI search will become an integral part of a BI portfolio, since it provides a super easy interface for casual users to submit ad hoc queries and navigate data without boundaries.

BI Vendor Messaging. In the 1990s, vendors competed by bundling together multiple types of BI tools (reporting, OLAP, query) into a single "BI Suite." A few years later, they began touting "BI Platforms" in which once distinct BI tools in a suite became modules within a unified BI architecture that all use the same query engine, charting engine, user interface, metadata, administration, security model, and application programming interface. In the late 1990s, Microsoft launched the movement towards low-cost BI tools geared to the mid-market when it bundled its BI and ETL tools in SQL Server at no extra charge. Today, a host of low-cost BI vendors, including open source BI tools, cloud-BI tools, and in-memory visual analysis tools have helped bring BI to the mid-market and lower the costs of departmental BI initiatives.

Today, BI tools have become easier to use and tailored to a range of information consumption styles (i.e., viewer, interactor, lightweight author, professional author). Consequently, the watchword is now "self-service BI," where business users meet their own information requirements rather than relying on BI professionals or power users to build reports on their behalf. Going forward, BI tool vendors will begin talking about "embedded BI," in which analytics (e.g., charts, tables, models) are embedded in operational applications and mission-critical business processes.

Data Integration Tools. In the data integration market, Informatica and Ascential Software (now IBM) led the charge towards the use of extract, transform, and load (ETL) engines to replace hand-coded programs that move data from source systems to a data warehouse. The engine approach proved superior to coding because its graphical interface meant you didn't have to be a hard-core programmer to write ETL code and, more importantly, it captured metadata in a repository instead of burying it in code.
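
For readers who never wrote one of those hand-coded programs, here is a minimal extract-transform-load sketch in Python. An ETL engine generates equivalent logic from a graphical design and, crucially, keeps the source-to-target mappings in a metadata repository rather than burying them in code. The file, table, and column names here are hypothetical.

```python
import csv
import sqlite3

# Metadata an ETL engine would keep in its repository instead of in code:
# source column -> (target column, transformation).
mappings = {
    "cust_id":   ("customer_id", int),
    "cust_name": ("customer_name", str.strip),
    "amt":       ("order_amount", float),
}

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(row):
    return {target: fn(row[source]) for source, (target, fn) in mappings.items()}

def load(rows, conn):
    conn.executemany(
        "INSERT INTO dw_orders (customer_id, customer_name, order_amount) "
        "VALUES (:customer_id, :customer_name, :order_amount)",
        rows,
    )

conn = sqlite3.connect("warehouse.db")  # assumes a dw_orders table already exists
load((transform(r) for r in extract("orders_extract.csv")), conn)
conn.commit()
```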

But vendors soon discovered that ETL tools are only one piece of the data integration puzzle and, following the lead of their BI brethren, moved to create data integration "suites" consisting of data quality, data profiling, master data management, and data federation tools. Soon, these suites turned into data integration "platforms" running on a common architecture. Today, the focus is on using data federation tools to "virtualize" data sources behind a common data services interface and on cloud-based data integration tools to migrate data from on-premises systems to the cloud and back again. Also, data integration vendors are making their tools easier to use, thanks in large part to cloud-based initiatives, which now has them evangelizing the notion of "self-service data integration," in which business analysts, not IT developers, build data integration scripts.
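
As a rough illustration of what "virtualizing" sources behind a data services interface means, the sketch below assembles one logical customer view on the fly from two hypothetical back-end sources. Real data federation tools add query optimization, caching, and security on top of this basic idea.

```python
import json
import sqlite3

# Two independent back-end sources the federation layer hides from consumers.
crm = sqlite3.connect("crm.db")          # assumes a customers(id, name) table
billing = sqlite3.connect("billing.db")  # assumes an invoices(customer_id, balance) table

def get_customer(customer_id: int) -> dict:
    """One logical 'data service' call; the join across sources happens here,
    at query time, rather than in a physically integrated store."""
    name = crm.execute(
        "SELECT name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    balance = billing.execute(
        "SELECT SUM(balance) FROM invoices WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    return {
        "customer_id": customer_id,
        "name": name[0] if name else None,
        "outstanding_balance": (balance[0] or 0) if balance else 0,
    }

print(json.dumps(get_customer(42)))
```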

DBMS Engines and Hardware. Throughout the 1990s and early 2000s, the database and hardware markets were sleepy backwaters of the BI market, despite the fact that they consumed a good portion of BI budgets. True, database vendors had added cubing, aggregate-aware optimizers, and various types of indexes to speed query performance, but that was the extent of the innovation.

But in the early 2000s, as data warehouse data volumes began to exceed the terabyte mark and query complexity grew, many data warehouses hit the proverbial wall. Meanwhile, Moore's law continued to make dramatic strides in the price-performance of processing, storage, and memory, and soon a few database entrepreneurs spotted an opportunity to overhaul the underlying BI compute infrastructure.

Netezza opened the floodgates in 2002 with the first data warehousing appliance (unless you count Teradata back in the 1980s!), which soon gained a bevy of imitators offering orders-of-magnitude better query performance for a fraction of the cost. These new systems offer innovative storage-level filtering, column-based compression and storage, massively parallel processing architectures, expanded use of memory-based caches, and, in some cases, solid state disk to bolster performance and availability for analytic workloads. Today, these "analytic platforms" are turbo-charging BI deployments and, in many cases, enabling BI professionals to deliver solutions that weren't possible before.
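
As a toy illustration of why columnar storage helps analytic workloads, the sketch below lays out the same table row-wise and column-wise; a single-column aggregate only has to touch one array in the columnar layout, and repetitive columns compress well. It is purely conceptual, not how any particular product is implemented.

```python
# Toy comparison of row-oriented vs column-oriented layouts for one table.
rows = [
    {"order_id": 1, "region": "east", "amount": 100.0},
    {"order_id": 2, "region": "west", "amount": 250.0},
    {"order_id": 3, "region": "east", "amount": 75.0},
]

# Column store: one array per column. An aggregate over 'amount' reads only
# that array, and long runs of repeated values (e.g., region) compress well.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region":   [r["region"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}

row_total = sum(r["amount"] for r in rows)  # touches every column of every row
col_total = sum(columns["amount"])          # touches a single column
assert row_total == col_total
```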

As proof of the power of these new purpose-built analytical systems, the biggest vendors in high-tech have invaded the market, picking off leading pureplays before they've even fully germinated. In the past nine months, Microsoft, IBM, Hewlett Packard, Teradata, SAP, and EMC purchased analytic platform vendors, while Oracle built its own with hardware from Sun Microsystems, which it acquired in 2009. (See "Jockeying for Position in the Analytic Platform Market.")

Mainstream Market. When viewed as a whole, the BI market has clearly emerged from an early adopter phase into the early mainstream. The watershed moment was 2007, when the biggest software vendors in the world--Oracle, SAP, and IBM--acquired the leading BI vendors--Hyperion, Business Objects, and Cognos, respectively. Also, the plethora of advertisements about BI capabilities that appear on television (e.g., IBM's Smarter Planet campaign) and in major consumer magazines (e.g., SAP and SAS Institute ads) reinforces the maturity of BI as a mainstream market. BI is now front and center on the radar screen of most CIOs, if not CEOs, who want to better leverage information to make smarter decisions and gain a lasting competitive advantage.

The Future. At this point, some might wonder if there is much headroom left in the BI market. The last 20 years have witnessed a dizzying array of technology innovations, products, and methodologies. It can't continue at this pace, right? Yes and no. The BI market has surprised us in the past. Even in recent years, as the BI market consolidated--with big software vendors acquiring nimble innovators--we've seen a tremendous explosion of innovation. BI entrepreneurs see a host of opportunities, from better self-service BI tools that are more visual and intuitive to use, to mobile and cloud-based BI offerings that are faster, better, and cheaper than current offerings. Search vendors are making a play for BI, as are platform vendors that promise data center scalability and availability for increasingly mission-critical BI loads. And we still need better tools and approaches for querying and analyzing unstructured content (e.g., documents, email, clickstream data, Web pages) and for delivering data faster as our businesses increasingly compete on velocity and as our data volumes become too large to fit inside shrinking batch windows.

Next week, Beye Research will publish a report of mine that describes a new BI Delivery Framework for the next ten years. In that report, I describe a future-state BI environment that contains not just one intelligence (i.e., business intelligence) but four intelligences (analytic, continuous, and content intelligence, in addition to business intelligence) that BI organizations will need to support or interoperate with in the near future. Stay tuned!


Posted March 18, 2011 2:29 PM
Permalink | No Comments |


I don't think I've ever seen a market consolidate as fast as the analytic platform market.

By definition, an analytic platform is an integrated hardware and software data management system geared to query processing and analytics that offers dramatically higher price-performance than general purpose systems. After talking with numerous customers of these systems, I am convinced they represent game-changing technology. As such, major database vendors have been tripping over themselves to gain the upper hand in this multi-billion dollar market.

Rapid Fire Acquisitions. Microsoft made the first move when it purchased Datallegro in July 2008. But it's taken two years for Microsoft to port the technology to Windows and SQL Server, so, ironically, it finds itself trailing the leaders. Last May, SAP acquired Sybase, largely for its mobile technology, but also for its Sybase IQ analytic platform, which has long been the leading column-store database on the market and has done especially well in financial services. And SAP is sparking tremendous interest within its installed base for HANA, an in-memory appliance designed to accelerate query performance of SAP BW and other analytic applications.

Two months after SAP acquired Sybase, EMC snapped up the massively parallel processing (MPP) database vendor Greenplum, and reportedly has done an excellent job executing new deals. Two months later, in September 2010, IBM purchased the leading pureplay, Netezza, in an all-cash deal worth $1.8 billion that could be a boon to Netezza if IBM can clearly differentiate between its multiple data warehousing offerings and execute well in the field.

And last month, Hewlett Packard, whose NeoView analytic platform died ingloriously last fall, scooped up Vertica, a market-leading columnar database with many interesting scalability and availability features. And finally, Teradata this week announced it was purchasing Aster Data, an MPP shared-nothing database with rich SQL MapReduce functions that can perform deep analytics on both structured and unstructured data.

So, in the past nine months, the world's biggest high tech companies purchased five of the leading, pureplay analytic platforms. This rapid pace of consolidation is dizzying!

Consolidation Drivers

Fear and Loathing. Part of this consolidation frenzy is driven by fear. Namely, fear of being left out of the market. And perhaps fear of Oracle, whose own analytic platform, Exadata, has gathered significant market momentum, knocking unsuspecting rivals on their heels. Although pricey, Exadata not only fuels game-changing analytic performance, it now also supports transaction applications--a one-stop database engine that competitors may have difficulty derailing (unless Oracle shoots itself in the foot with uncompromising terms for licensing, maintenance, and proofs of concept.)

Core Competencies. These analytic platform vendors are now carving out market niches where they can outshine the rest. For Oracle, it's a high-performance, hybrid analytic/transaction system; SAP touts its in-memory acceleration (HANA) and a mature columnar database that supports real-time analytics and complex event processing; EMC Greenplum targets complex analytics against petabytes of data; Aster Data focuses on analytic applications in which SQL MapReduce is an advantage; Teradata touts its mixed workload management capabilities and workload-specific analytic appliances; IBM Netezza focuses on simplicity, fast deployments, and quick ROI; Vertica trumpets its scalability, reliability, and availability now that other vendors have added columnar storage and processing capabilities; and Microsoft is pitching its PDW along with a series of data mart appliances and a BI appliance.

Pureplays Looking for Cover. The rush of acquisitions leaves a number of viable pureplays out in the cold. Without a big partner, these vendors will need to clearly articulate their positioning and work hard to gain beachheads within customer accounts. ParAccel, for example, is eyeing Fortune 100 companies with complex analytic requirements, targeting financial services where it says Sybase IQ is easy pickings. Dataupia is seeking cover in companies that have tens to hundreds of petabytes to query and store. Kognitio likes its chances with flexible cloud-based offerings that customers can bring inhouse if desired. InfoBright is targeting the open source MySQL market, while Sand Technology touts its columnar compression, data mart synchronization, and text parsing capabilities. Ingres is pursuing the open source data warehousing market, and its new Vectorwise technology makes it a formidable in-memory analytics processing platform.

Despite the rapid consolidation of the analytic platforms market, there is still obviously lots of choice left for customers eager to cash in on the benefits of purpose-built analytical machines that deliver dramatically higher price-performance than database management systems of the past. Although the action was fast and furious in 2010, the race has only just begun. So, fasten your seat belts as players jockey for position in the sprint to the finish.


Posted March 8, 2011 8:20 AM
Permalink | 2 Comments |

As companies grapple with the gargantuan task of processing and analyzing "big data," certain technologies have captured the industry limelight, namely massively parallel processing (MPP) databases, such as those from Aster Data and Greenplum; data warehousing appliances, such as those from Teradata, Netezza, and Oracle; and, most recently, Hadoop, an open source distributed file system that uses the MapReduce programming model to process key-value data in parallel across large numbers of commodity servers.

SMP Machines. Missing in action from this list is the venerable symmetric multiprocessing (SMP) machine, which parallelizes operations across multiple CPUs (or cores). The industry today seems to favor "scale out" parallel processing approaches (where processes run across commodity servers) rather than "scale up" approaches (where processes run on a single server). However, with the advent of multi-core servers that today can pack upwards of 48 cores into a single box, the traditional SMP approach is worth a second look for processing big data analytics jobs.

The benefits of applying parallel processing within a single server versus multiple servers are obvious: reduced processing complexity and a smaller server footprint. Why buy 40 servers when one will do? MPP systems require more boxes, which require more space, cooling, and electricity. Also, distributing data across multiple nodes chews up valuable processing time, and overcoming node failures--which are more common when you string together dozens, hundreds, or even thousands of servers into a single, coordinated system--adds overhead and reduces performance.

Multi-Core CPUs. Moreover, since chipmakers maxed out the processing frequency of individual CPUs in 2004, the only way they can deliver improved performance is by packing more cores into a single chip. Chipmakers started with two-core chips, then quad-cores, and now eight- and 16-core chips are becoming commonplace.

Unfortunately, few software programs that could benefit from parallelizing operations have been redesigned to exploit the tremendous amount of processing power and memory available within multi-core servers. Big data analytics applications are especially good candidates for thread-level parallel processing. As developers recognize the untold power lurking within their commodity servers, I suspect that next year SMP processing will gain an equivalent share of attention among big data analytics proselytizers.
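
Here is a minimal sketch of the kind of single-server parallelism the article has in mind, using Python's standard library to spread work across all the cores in one box. The scoring function is a hypothetical stand-in for real analytic work.

```python
from concurrent.futures import ProcessPoolExecutor
import os

def score(record: int) -> float:
    # Hypothetical stand-in for a CPU-heavy analytic calculation on one record.
    return sum((record * i) % 97 for i in range(10_000)) / 10_000

if __name__ == "__main__":
    records = range(100_000)
    # Scale up: use every core in the server instead of a single thread,
    # with no cluster, no data distribution, and no node-failure handling.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        scores = list(pool.map(score, records, chunksize=1_000))
    print(len(scores), max(scores))
```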

Pervasive DataRush

One company that is on the forefront of exploiting multi-core chips for analytics is Pervasive Software, a $50 million software company that is best known for its Pervasive Integration ETL software (which it acquired from Data Junction) and Pervasive PSQL, its embedded database (a.k.a. Btrieve.)

In 2009, Pervasive released a new product, called Pervasive DataRush, a parallel dataflow platform designed to accelerate performance for data preparation and analytics tasks. It fully leverages the parallel processing capabilities of multi-core processors and SMP machines, making it unnecessary to implement clusters (or MPP grids) to achieve suitable performance when processing and analyzing moderate to heavy volumes of data.

Sweet Spot. As a parallel data flow engine, Pervasive DataRush is often used today to power batch processing jobs, and is particularly well suited to running data preparation tasks (e.g. sorting, deduplicating, aggregating, cleansing, joining, loading, validating) and machine learning programs, such as fuzzy matching algorithms.

Today, DataRush will outperform Hadoop on complex processing jobs that address data volumes ranging from 500GB to tens of terabytes. It is not yet geared to handling hundreds of terabytes to petabytes of data, which is the territory of MPP systems and Hadoop. However, as chipmakers continue to add more cores to chips, and when Pervasive releases DataRush 5.0 later this year with support for small clusters, DataRush's high-end scalability will continue to increase.

Architecture. DataRush is not a database; it's a development environment and execution engine that runs in a Java Virtual Machine. Its Eclipse-based development environment provides a library of parallel operators for developers to create parallel dataflow programs. Although developers need to understand the basics of parallel operations--such as when it makes sense to partition data and/or processes based on the nature of their application--DataRush handles all the underlying details of managing threads and processes across one or more cores to maximize utilization and performance. As you add cores, DataRush automatically readjusts the underlying parallelism without forcing the developer to recompile the application.
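
DataRush's own API is Java-based and is not reproduced here. As a rough illustration of the dataflow idea--a graph of operators composed into a pipeline, with the engine responsible for parallelizing them--here is a hypothetical sketch in Python; the operators run sequentially in this sketch, whereas a dataflow engine would schedule each one across threads and cores.

```python
# Hypothetical dataflow pipeline: each operator is a small, composable stage.
# In a dataflow engine the equivalent operators would be parallelized by the
# engine itself; here they simply run in sequence to show the structure.
from collections import defaultdict

def read_records(path):
    # Assumes a simple "customer,amount" text file.
    with open(path) as f:
        for line in f:
            customer, amount = line.rstrip("\n").split(",")
            yield {"customer": customer.strip(), "amount": float(amount)}

def deduplicate(records):
    seen = set()
    for r in records:
        key = (r["customer"], r["amount"])
        if key not in seen:
            seen.add(key)
            yield r

def aggregate(records):
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

# Compose the dataflow: read -> deduplicate -> aggregate.
print(aggregate(deduplicate(read_records("transactions.csv"))))
```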

Versus Hadoop. To run DataRush, you feed the execution engine formatted flat files or database records and it executes the various steps in the dataflow and spits out a data set. As such, it's more flexible than Hadoop, which requires data to be structured as key-value pairs and partitioned across servers, and MapReduce, which forces developers to use one type of programming model for executing programs. DataRush also doesn't have the overhead of Hadoop, which requires each data element to be duplicated in multiple nodes for failover purposes and requires lots of processing to support data movement and exchange across nodes. But like Hadoop, it's focused on running predefined programs in batch jobs, not ad hoc queries.

Competitors. Perhaps the closest competitors to Pervasive DataRush are Ab Initio, a parallelizable ETL tool, and Syncsort, a high-speed sorting engine. But these tools were developed before the advent of multi-core processing and don't exploit it to the same degree as DataRush. Plus, DataRush is not focused just on back-end processing, but can handle front-end analytic processing as well. Its data flow development environment and engine are generic. DataRush actually makes a good complement to MPP databases, which often suffer from a data loading bottleneck. When used as a transformation and loading engine, DataRush can achieve 2TB/hour throughput, according to company officials.

Despite all the current hype about MPP and scale-out architectures, it could be that scale-up architectures that fully exploit multi-core chips and SMP machines will win the race for mainstream analytics computing. Although you can't apply DataRush to existing analytic applications (you have to rewrite them), it will make a lot of sense to employ it for most new big data analytics applications.


Posted January 4, 2011 11:17 AM
Permalink | No Comments |