Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor neutral content to business intelligence (BI) professionals worldwide. Wayne’s consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI) where he oversaw the company’s content and training programs and chaired its BI Executive Summit. He can be reached by email at weckerson@techtarget.com.

Recently in Hadoop and NoSQL Category

The prior article in this series discussed the human side of analytics. It explained how companies need to have the right culture, people, and organization to succeed with analytics. The flip side is the "hard stuff"--the architecture, platforms, tools, and data--that makes analytics possible. Although analytical technology gets the lion's share of attention in the trade press--perhaps more than it deserves for the value it delivers--it nonetheless forms the bedrock of all analytical initiatives. This article examines the architecture, platforms, tools, and data needed to deliver robust analytical solutions.

Architecture

The term "analytical architecture" is an oxymoron. In most organizations, business analysts are left to their own devices to access, integrate, and analyze data. By necessity, they create their own data sets and reports outside the purview and approval of corporate IT. By definition, there is no analytical architecture in most organizations--just a hodge-podge of analytical silos and spreadmarts, each with conflicting business rules and data definitions.

Analytical sandboxes. Fortunately, with the advent of specialized analytical platforms (discussed below), BI architects have more options for bringing business analysts into the corporate BI fold. They can use these high-powered database platforms to create analytical sandboxes for the explicit use of business analysts. These sandboxes, when designed properly, give analysts the flexibility they need to access corporate data at a granular level, combine it with data that they've sourced themselves, and conduct analyses to answer pressing business questions. With analytical sandboxes, BI teams can transform business analysts from data pariahs to full-fledged members of the BI community.

There are four types of analytical sandboxes:


  • Staging Sandbox. This is a staging area for a data warehouse that contains raw, non-integrated data from multiple source systems. Analysts generally prefer to query a staging area that contains all the raw data rather than query each source system individually. Hadoop is increasingly used as a staging area for large volumes of unstructured data, and a growing number of companies are adding it to their BI ecosystems.

  • Virtual Sandbox. A virtual sandbox is a set of tables inside a data warehouse assigned to individual analysts. Analysts can upload data into the sandbox and combine it with data from the data warehouse, giving them one place to go to do all their analyses (see the sketch following this list). The BI team needs to carefully allocate compute resources so analysts have enough horsepower to run ad hoc queries without interfering with other workloads running on the data warehouse.

  • Free-standing sandbox. A free-standing sandbox is a separate database server that sits alongside a data warehouse and contains its own data. It's often used to offload complex, ad hoc queries from an enterprise data warehouse and give business analysts their own space to play. In some cases, these sandboxes contain a replica of data in the data warehouse, while in others, they support entirely new data sets that don't fit in a data warehouse or run faster on an analytical platform.

  • In-memory BI sandbox. Some desktop BI tools maintain a local data store, either in memory or on disk, to support interactive dashboards and queries. Analysts love these types of sandboxes because they connect to virtually any data source and enable analysts to model data, apply filters, and visually interact with the data without IT intervention.
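
To make the virtual sandbox idea concrete, here is a minimal sketch in Python. It assumes nothing about any particular product: sqlite3 stands in for the warehouse connection, and the table and column names are made up.

```python
import sqlite3

# sqlite3 stands in for the corporate data warehouse; tables are illustrative.
conn = sqlite3.connect(":memory:")

# Granular warehouse data the BI team already manages.
conn.execute("CREATE TABLE orders (customer_id INT, order_amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 120.0), (2, 75.0), (1, 40.0), (3, 200.0)])

# 1. The analyst uploads his or her own data into a sandbox table.
conn.execute("CREATE TABLE sandbox_campaign (customer_id INT, campaign TEXT)")
conn.executemany("INSERT INTO sandbox_campaign VALUES (?, ?)",
                 [(1, "spring_promo"), (3, "spring_promo"), (2, "control")])

# 2. One place to go: join the uploaded data with warehouse data in situ.
query = """
    SELECT s.campaign, SUM(o.order_amount) AS revenue
    FROM sandbox_campaign s
    JOIN orders o ON o.customer_id = s.customer_id
    GROUP BY s.campaign
"""
for campaign, revenue in conn.execute(query):
    print(campaign, revenue)
```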

Next-Generation BI Architecture. Figure 1 depicts a BI architecture with the four analytical sandboxes colored in green. The top half of the diagram represents a classic top-down, data warehousing architecture that primarily delivers interactive reports and dashboards to casual users (although the streaming/complex event processing (CEP) engine is new.) The bottom half of the diagram depicts a bottom-up analytical architecture with analytical sandboxes along with new types of data sources. This next-generation BI architecture better accommodates the needs of business analysts and data scientists, making them full-fledged members of the corporate BI ecosystem.

Figure 1. The New BI Architecture
Part IV - BI Architecture of Future.jpg

The next-generation BI architecture is more analytical, giving power users greater options to access and mix corporate data with their own data via various types of analytical sandboxes. It also brings unstructured and semi-structured data fully into the mix using Hadoop and nonrelational databases.

Analytical Platforms

Since the beginning of the data warehousing movement in the early 1990s, organizations have used general-purpose data management systems to implement data warehouses and, occasionally, multidimensional databases (i.e., "cubes") to support subject-specific data marts, especially for financial analytics. General-purpose data management systems were designed for transaction processing (i.e., rapid, secure, synchronized updates against small data sets) and only later modified to handle analytical processing (i.e., complex queries against large data sets.) In contrast, analytical platforms focus entirely on analytical processing at the expense of transaction processing.

The analytical platform movement. In 2002, Netezza (now owned by IBM) introduced a specialized analytical appliance--a tightly integrated hardware-software database management system designed explicitly to run ad hoc queries against large volumes of data at blindingly fast speeds. Netezza's success spawned a host of competitors, and there are now more than two dozen players in the market (see Table 1).

Table 1. Types of Analytical Platforms
Part IV - Tools Table.jpg

Today, the technology behind analytical platforms is diverse: appliances, columnar databases, in-memory databases, massively parallel processing (MPP) databases, file-based systems, nonrelational databases, and analytical services. What they all have in common, however, is that they provide significant improvements in price-performance, availability, load times, and manageability compared with general-purpose relational database management systems. Every analytical platform customer I've interviewed has cited order-of-magnitude performance gains that most initially found hard to believe.

Moreover, many of these analytical platforms contain built-in analytical functions that make life easier for business analysts. These functions range from fuzzy matching algorithms and text analytics to data preparation and data mining functions. By putting these functions in the database, analysts no longer have to craft complex, custom SQL or download data to analytical workstations, which limits the amount of data they can analyze and model.
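
To illustrate the push-down idea, here is a hedged sketch in Python: sqlite3 stands in for an analytical platform, and the sales table is made up. The point is simply how much data moves in each approach.

```python
import sqlite3

# sqlite3 stands in for an analytical platform; "sales" is an illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# Workstation-bound approach: ship every row to the analyst, aggregate locally.
local_totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    local_totals[region] = local_totals.get(region, 0.0) + amount

# In-database approach: only a small result set leaves the platform, and any
# built-in analytic functions run where the data lives.
db_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

print(local_totals == db_totals)  # same answer, very different data movement
```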

Companies use analytical platforms to support free-standing sandboxes (described above) or as replacements for data warehouses running on MySQL and SQL Server, and occasionally major OLTP databases from Oracle and IBM. They also improve query performance for ad hoc analytical tools, especially those that connect directly to databases to run queries (versus those that download data to a local cache.)

Analytical Tools

In 2010, vendors turned their attention to meeting the needs of power users after ten years of enhancing reporting and dashboard solutions for casual users. As a result, the number of analytical tools on the market has exploded.

Analytical tools come in all shapes and sizes. Analysts generally need one of every type of tool. Just as you wouldn't hire a carpenter to build an addition to your house with just one tool, you don't want to restrict an analyst to just one analytical tool. Like a carpenter, an analyst needs a different tool for every type of job they do. For instance, a typical analyst might need the following tools:

  • Excel to extract data from various sources, including local files, create reports, and share them with others via a corporate portal or server (managed Excel).

  • BI Search tools to issue ad hoc queries against a BI tool's metadata.

  • Planning tools (including Excel) to create strategic and tactical plans, each containing multiple scenarios.

  • Mashboards and ad hoc reporting tools to create ad hoc dashboards and reports on behalf of departmental colleagues.

  • Visual discovery tools to explore data in one or more sources and create interactive dashboards on behalf of departmental colleagues.

  • Multidimensional OLAP (MOLAP) tools to explore small and medium data sets dimensionally at the speed of thought and run complex dimensional calculations.

  • Relational OLAP tools to explore large data sets dimensionally and run complex calculations.

  • Text analytics tools to parse text data and put it in a relational structure for analysis.

  • Data mining tools to create descriptive and predictive models (see the sketch following this list).

  • Hadoop and MapReduce to process large volumes of unstructured and semi-structured data in a parallel environment.
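
To give a flavor of the data mining piece mentioned above, here is a minimal sketch of a predictive model. scikit-learn and the churn numbers are purely illustrative--an example of this class of tool, not a recommendation of any particular workbench.

```python
from sklearn.linear_model import LogisticRegression

# Historical data: [tenure_months, support_calls] -> churned (1) or stayed (0).
X = [[2, 5], [30, 0], [4, 7], [48, 1], [6, 6], [36, 2]]
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X, y)

# Score a new customer: estimated probability of churn.
print(model.predict_proba([[3, 4]])[0][1])
```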

Figure 2. Types of Analytical Tools
Part IV - Types of Tools.jpg

Figure 2 plots these tools on a graph where the x axis represents calculation complexity and the y axis represents data volumes. Ad hoc analytical tools for casual users (or more realistically super users) are clustered in the bottom left corner of the graph, while ad hoc tools for power users are clustered slightly above and to the right. Planning and scenario modeling tools cluster further to the right, offering slightly more calculation complexity against small volumes of data. High-powered analytical tools, which generally rely on machine learning algorithms and specialized analytical databases, cluster in the upper right quadrant.

Data

Business analysts function like one-man IT shops. They must access, integrate, clean, and analyze data, and then present it to other users. Figure 3 depicts the typical workflow of a business analyst. If an organization doesn't have a mature data warehouse that contains cross-functional data at a granular level, analysts often spend an inordinate amount of time sourcing, cleaning, and integrating data (steps 1 and 2 in the analyst workflow). They then create a multiplicity of analytical silos (step 5) when they publish data, much to the chagrin of the IT department.

Figure 3. Analyst Workflow

In the absence of a data warehouse that contains all the data they need, business analysts must function as one-man IT shops where they spend an inordinate amount of time iterating between collecting, integrating, and analyzing data. They run into trouble when they distribute their hand-crafted data sets broadly.

Data Warehouse. The most important way that organizations can improve the productivity and effectiveness of business analysts is to maintain a robust data warehousing environment that contains most of the data that analysts need to perform their work. This can take many years. In a fast-moving market where the company adds new products and features continuously, the data warehouse may never catch up. Nonetheless, it's important for organizations to continuously add new subject areas to the data warehouse; otherwise, business analysts have to spend hours or days gathering and integrating this data themselves.

Atomic Data. The data warehouse also needs to house atomic data, or data at the lowest level of transactional detail, not summary data. Analysts generally want the raw data because they can repurpose it in many different ways depending on the nature of the business questions they're addressing. This is the reason that highly skilled analysts like to access data directly from source systems or a data warehouse staging area. At the same time, less skilled analysts appreciate the heavy lifting done by the IT group to clean and integrate disparate data sets using common metrics, dimensions, and attributes. This base level of data standardization expedites their work.
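
A small, made-up illustration of why atomic data matters: the same raw transactions can be re-aggregated to answer different questions, which a pre-summarized table cannot support. pandas is used here purely for illustration.

```python
import pandas as pd

# Atomic transactions: one row per purchase (illustrative data).
transactions = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "C"],
    "store":    ["X", "Y", "X", "X", "Y"],
    "amount":   [20.0, 35.0, 15.0, 40.0, 60.0],
})

# Question 1: revenue by store.
print(transactions.groupby("store")["amount"].sum())

# Question 2: average purchase size by customer -- impossible to recover
# from a store-level summary alone.
print(transactions.groupby("customer")["amount"].mean())
```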

Once a BI team integrates a sufficient number of subject areas in a data warehouse at an atomic level of data, business analysts can have a field day. Instead of downloading data to an analytical workstation, which limits the amount of data they can analyze and process, they can now run calculations and models against the entire data warehouse using analytical functions built into the database or that they've created using database development toolkits. This improves the accuracy of their analyses and models and saves them considerable time.

Summary

The technical side of analytics is daunting. There are many moving parts that all have to work synergistically together. However, the most important part of the technical equation is the data. The old adage holds true: "garbage in, garbage out." Analysts can't deliver accurate insights if they don't have access to good quality data. And it's a waste of their time to spend days trying to prepare the data for analysis. A good analytics program is built on a solid data warehousing foundation that embeds analytical sandboxes tailored to the requirements of individual analysts.


Posted November 15, 2011 7:44 AM

Advanced analytics promises to unlock hidden potential in organizational data. If that's the case, why have so few organizations embraced advanced analytics in a serious way? Most organizations have dabbled with advanced analytics, but outside of credit card companies, online retailers, and government intelligence agencies, few have invested sufficient resources to turn analytics into a core competency.

Advanced analytics refers to the use of machine learning algorithms to unearth patterns and relationships in large volumes of complex data. It's best applied to overcome various resource constraints (e.g., time, money, labor) where the output justifies the investment of time and money. (See "What is Analytics and Why Should You Care?" and "Advanced Analytics: Where Do You Start?")

Once an organization decides to invest in advanced analytics, it faces many challenges. To succeed with advanced analytics, organizations must have the right culture, people, organization, architecture, and data. (See figure 1.) This is a tall task. This article examines the "soft stuff" required to implement analytics--the culture, people, and organization--the first three dimensions of the analytical framework in figure 1. A subsequent article examines the "hard stuff"--the architecture, tools, and data.

Figure 1. Framework for Implementing Advanced Analytics
Part III - Implementation Challenges.jpg

The Right Culture

Culture refers to the rules--both written and unwritten--for how things get done in an organization. These rules emanate from two places: 1) the words and actions of top executives and 2) organizational inertia and behavioral norms of middle management and their subordinates (i.e., "the way we've always done it.") Analytics, like any new information technology, requires executives and middle managers to make conscious choices about how work gets done.

Executives. For advanced analytics to succeed, top executives must first establish a fact-based decision making culture and then adhere to it themselves. Executives must consciously change the way they make decisions. Rather than rely on gut feel alone, executives must make decisions based on facts or intuition validated by data. They must designate authorized data sources for decision making and establish common metrics for measuring performance. They must also hold individuals accountable for outcomes at all levels of the organization.

Executives also need to evangelize the value and importance of fact-based decision making and the need for a performance-driven culture. They need to recruit like-minded executives and continuously reinforce the message that the organization "runs on data." Most importantly, they not only must "talk the talk," they must "walk the walk." They need to hold themselves accountable for performance outcomes and use certifiable information sources, not resort to their trusted analyst to deliver the data view they desire. Executives who don't follow their own rules send a cultural signal that this analytics fad will pass and so it's "business as usual."

Managers and Organizational Inertia. Mid-level managers often pose the biggest obstacles to implementing new information technologies because their authority and influence stem from their ability to control the flow of information, both up and down organizational ladders. Mid-level managers have to buy into new ways of capturing and using information for the program to succeed. If they don't, they, too, will send the wrong signals to lower-level workers. To overcome organizational inertia, executives need to establish new incentives for mid-level managers and hold them accountable for performance metrics aligned with strategic goals for decision making and the use of information.

The Right People

It's impossible to do advanced analytics without analysts. That's obvious. But hiring the right analysts and creating an environment for them to thrive is not easy.

Analysts are a rare breed. They are critical thinkers who need to understand a business process inside and out and the data that supports it. They also must be computer-literate and know how to use various data access, analysis, and presentation tools to do their jobs. Compared to other employees, they are generally more passionate about what they do, more committed to the success of the organization, more curious about how things work, and more eager to tackle new challenges.

But not all analysts do the same kind of work, and it's important to know the differences. There are four major types of analysts:


  • Super Users. These are tech-savvy business users who gravitate to reporting and analysis tools deployed by the business intelligence (BI) team. These analysts quickly become the "go to" people in each department for colleagues who need an ad hoc report or dashboard and don't want to wait for the BI team. While super users don't normally do advanced analytics, they play an important role because they offload ad hoc reporting requirements from more skilled analysts.

  • Business Analysts. These are Excel jockeys whom executives and managers call on to create and evaluate plans, crunch numbers, and generally answer any question that can't be addressed by a standard report or dashboard. With training, they can also create analytical models.

  • Analytical Modelers. These analysts have formal training in statistics and a data mining workbench, such as those from IBM (i.e., SPSS) or SAS. They build descriptive and predictive models that are the heart and soul of advanced analytics.

  • Data Scientists. These analysts specialize in analyzing unstructured data, such as Web traffic and social media. They write Java and other programs to run against Hadoop and NoSQL databases and know how to write efficient MapReduce jobs that run in "big data" environments.

Where You Find Them. Most organizations struggle to find skilled analysts. Many super users and business analysts are self-taught Excel jockeys, essentially tech-savvy business people who aren't afraid to learn new software tools to do their jobs. Many business school graduates fill this role, often as a stepping stone to management positions. Conversely, a few business-savvy technologists can grow into this role, including data analysts and report developers who have a proclivity toward business and working with business people.

Analytical modelers and data scientists require more training and skills. These analysts generally have a background in statistics or number crunching. Statisticians with business knowledge or social scientists with computer skills tend to excel in these roles. Given advances in data mining workbenches, it's not critical that analytical modelers know how to write SQL or code in C, as in the past. However, data scientists aren't so lucky. Since Hadoop is an early stage technology, data scientists need to know the basics of parallel processing and how to write Java and other programs in MapReduce. As such, they are in high demand right now.

The Right Organization

Business analysts play a key role in any advanced analytics initiative. Given the skills required to build predictive models, analysts are not cheap to hire or easy to retain. Thus, building the right analytical organization is key to attracting and retaining skilled analysts.

Today, most analysts are hired by department heads (e.g., finance, marketing, sales, or operations) and labor away in isolation at the departmental level. Unless given enough new challenges and opportunities for advancement, analysts are easy targets for recruiters.

Analytics Center of Excellence. The best way to attract and retain analysts is to create an Analytics Center of Excellence. This is a corporate group that oversees and manages all business analysts in an organization. The Center of Excellence provides a sense of community among analysts and enables them to regularly exchange ideas and knowledge. The Center also provides a career path for analysts so they are less tempted to look elsewhere to advance their careers. Finally, the Center pairs new analysts with veterans who can give them the mentoring and training they need to excel in their new position.

The key with an Analytics Center of Excellence is to balance central management with process expertise. Nearly all analysts should be embedded in departments and work side by side with business people on a daily basis. This enables analysts to learn business processes and data at a granular level while immersing the business in analytical techniques and approaches. At the same time, the analyst needs to work closely with other analysts in the organization to reinforce the notion that they are part of a larger analytical community.

The best way to accommodate these twin needs is by creating a matrixed analytical team. Analysts should report directly to department heads and indirectly to a corporate director of analytics or vice versa. In either case, the analyst should physically reside in his assigned department most or all days of the week, while participating in daily "stand up" meetings with other analysts so they can share ideas and issues as well as regular off-site meetings to build camaraderie and develop plans. The corporate director of analytics needs to work closely with department heads to balance local and enterprise analytical requirements.

Summary

Advanced analytics is a technical discipline. Yet, some of the keys to its success involve non-technical facets, such as culture, people, and organization. For an analytics initiative to thrive in an organization, executives must create a fact-based decision making culture, hire the right people, and create an analytics center of excellence that attracts, trains, and retains skilled analysts.


Posted November 7, 2011 9:45 AM

Business intelligence is changing. I've argued in several reports that there is no longer just one intelligence--i.e., business intelligence--but multiple intelligences, each with its own architecture, design framework, end users, and tools. But all these intelligences are still designed to help business users leverage information to make smarter decisions and support the creation of either reporting or analysis applications.

The four intelligences are:


  1. Business Intelligence. Addresses the needs of "casual users," delivering reports, dashboards, and scorecards tailored to each user's role, populated with metrics aligned with strategic objectives and powered by a classic data warehousing architecture.

  2. Analytics Intelligence. Addresses the needs of "power users," providing ad hoc access--via spreadsheets, desktop databases, OLAP tools, data mining tools, and visual analysis tools--to any data inside or outside the enterprise to answer business questions that can't be identified in advance.

  3. Continuous Intelligence. Collects, monitors, and analyzes large volumes of fast-changing data to support operational processes. It ranges from near real-time delivery of information (i.e., hours to minutes) in a data warehouse to complex event processing and streaming systems that trigger alerts.

  4. Content Intelligence. Gives business users the ability to analyze information contained in documents, Web pages, email messages, social media sites and other unstructured content using NoSQL and semantic technology.

You may wonder how all these intelligences fit together architecturally. They do, but it's not the clean, neat architecture that you may have seen in data warehousing books of yore. Figure 1 below depicts a generalized architecture that supports the four intelligences.

Figure 1. BI Ecosystem of the Future
BI Ecosystem of Future.jpg

The top half of the diagram represents the classic top-down, data warehousing architecture that primarily delivers interactive reports and dashboards to casual users (although the streaming/complex event processing (CEP) engine is new.) The bottom half of the diagram adds new architectural elements and data sources that better accommodate the needs of business analysts and data scientists and make them full-fledged members of the corporate data environment.

A recent report I wrote describes the components of this architecture in some detail and provides market research on the adoption of analytic platforms (e.g. DW appliances and columnar and MPP databases), among other things. The report is titled: "Big Data Analytics: Profiling the Use of Analytical Platforms in User Organizations." You can download it for free at Bitpipe by clicking on the hyperlink in the previous sentence.

Since "Multiple Intelligences" framework and BI ecosystem that supports it represent what I think the future holds for BI, I'd love to get your feedback.


Posted October 21, 2011 9:35 AM

As companies grapple with the gargantuan task of processing and analyzing "big data," certain technologies have captured the industry limelight, namely massively parallel processing (MPP) databases, such as those from Aster Data and Greenplum; data warehousing appliances, such as those from Teradata, Netezza, and Oracle; and, most recently, Hadoop, an open source distributed file system that uses the MapReduce programming model to process key-value data in parallel across large numbers of commodity servers.

SMP Machines. Missing in action from this list is the venerable symmetric multiprocessing (SMP) machine that parallelizes operations across multiple CPUs (or cores). The industry today seems to favor "scale out" parallel processing approaches (where processes run across commodity servers) rather than "scale up" approaches (where processes run on a single server). However, with the advent of multi-core servers that today can pack upwards of 48 cores into a single box, the traditional SMP approach is worth a second look for processing big data analytics jobs.

The benefits of applying parallel processing within a single server versus multiple servers are obvious: reduced processing complexity and a smaller server footprint. Why buy 40 servers when one will do? MPP systems require more boxes, which require more space, cooling, and electricity. Also, distributing data across multiple nodes chews up valuable processing time, and overcoming node failures--which are more common when you string together dozens, hundreds, or even thousands of servers into a single, coordinated system--adds overhead that reduces performance.

Multi-Core CPUs. Moreover, since chipmakers maxed out the processing frequency of individual CPUs in 2004, the only way they can deliver improved performance is by packing more cores into a single chip. Chipmakers started with two-core chips, then quad-cores, and now eight- and 16-core chips are becoming commonplace.

Unfortunately, few software programs that can benefit from parallelizing operations have been redesigned to exploit the tremendous amount of power and memory available within multi-core servers. Big data analytics applications are especially good candidates for thread-level parallel processing. As developers recognize the untold power lurking within their commodity servers, I suspect that next year SMP processing will gain an equivalent share of attention among big data analytics proselytizers.
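
As a rough illustration of what exploiting a multi-core SMP box looks like, here is a sketch using Python's multiprocessing module; the workload and chunking scheme are made up.

```python
from multiprocessing import Pool, cpu_count

def summarize(chunk):
    # Stand-in for a CPU-heavy analytic step on one partition of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    cores = cpu_count()
    size = len(data) // cores + 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    with Pool(processes=cores) as pool:   # one worker per core
        partials = pool.map(summarize, chunks)
    print(sum(partials))                  # combine the per-core results
```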

Pervasive DataRush

One company that is on the forefront of exploiting multi-core chips for analytics is Pervasive Software, a $50 million software company that is best known for its Pervasive Integration ETL software (which it acquired from Data Junction) and Pervasive PSQL, its embedded database (a.k.a. Btrieve.)

In 2009, Pervasive released a new product, called Pervasive DataRush, a parallel dataflow platform designed to accelerate performance for data preparation and analytics tasks. It fully leverages the parallel processing capabilities of multi-core processors and SMP machines, making it unnecessary to implement clusters (or MPP grids) to achieve suitable performance when processing and analyzing moderate to heavy volumes of data.

Sweet Spot. As a parallel data flow engine, Pervasive DataRush is often used today to power batch processing jobs, and is particularly well suited to running data preparation tasks (e.g. sorting, deduplicating, aggregating, cleansing, joining, loading, validating) and machine learning programs, such as fuzzy matching algorithms.
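
The code below is not DataRush--its actual API differs--but a generic Python sketch of the dataflow style described here: small operators (validate, dedupe, aggregate) chained together, each an obvious candidate for parallel execution. The order records are made up.

```python
from collections import defaultdict

records = [
    {"order_id": 1, "customer_id": "A", "amount": "25.00"},
    {"order_id": 1, "customer_id": "A", "amount": "25.00"},  # duplicate
    {"order_id": 2, "customer_id": "",  "amount": "10.00"},  # fails validation
    {"order_id": 3, "customer_id": "B", "amount": "40.00"},
]

def validate(rows):
    for r in rows:
        if r["customer_id"] and float(r["amount"]) >= 0:
            yield r

def dedupe(rows, key="order_id"):
    seen = set()
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            yield r

def aggregate(rows):
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer_id"]] += float(r["amount"])
    return dict(totals)

# Chain the operators into a dataflow and run it.
print(aggregate(dedupe(validate(records))))  # {'A': 25.0, 'B': 40.0}
```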

Today, DataRush will outperform Hadoop on complex processing jobs that address data volumes ranging from 500GB to tens of terabytes. It is not geared to handling hundreds of terabytes to petabytes of data, which is the territory of MPP systems and Hadoop. However, as chipmakers continue to add more cores to chips, and once Pervasive releases DataRush 5.0 later this year with support for small clusters, DataRush's high-end scalability will continue to increase.

Architecture. DataRush is not a database; it's a development environment and execution engine that runs in a Java Virtual Machine. Its Eclipse-based development environment provides a library of parallel operators for developers to create parallel dataflow programs. Although developers need to understand the basics of parallel operations--such as when it makes sense to partition data and/or processes based on the nature of their application--DataRush handles all the underlying details of managing threads and processes across one or more cores to maximize utilization and performance. As you add cores, DataRush automatically readjusts the underlying parallelism without forcing the developer to recompile the application.

Versus Hadoop. To run DataRush, you feed the execution engine formatted flat files or database records and it executes the various steps in the dataflow and spits out a data set. As such, it's more flexible than Hadoop, which requires data to be structured as key-value pairs and partitioned across servers, and MapReduce, which forces developers to use one type of programming model for executing programs. DataRush also doesn't have the overhead of Hadoop, which requires each data element to be duplicated in multiple nodes for failover purposes and requires lots of processing to support data movement and exchange across nodes. But like Hadoop, it's focused on running predefined programs in batch jobs, not ad hoc queries.
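
For readers unfamiliar with the MapReduce model that Hadoop imposes, here is a single-process Python sketch of it (counting words in made-up records). Real MapReduce jobs distribute the map, shuffle/sort, and reduce phases across the cluster; only the programming model is illustrated here.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    for line in records:                      # each record -> (key, value) pairs
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))  # stands in for the shuffle/sort stage
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))

lines = ["big data big analytics", "big cluster"]
print(dict(reduce_phase(map_phase(lines))))   # {'analytics': 1, 'big': 3, ...}
```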

Competitors. Perhaps the closest competitors to Pervasive DataRush are Ab Initio, a parallelizable ETL tool, and Syncsort, a high-speed sorting engine. But these tools were developed before the advent of multi-core processing and don't exploit it to the same degree as DataRush. Plus, DataRush is not focused just on back-end processing, but can handle front-end analytic processing as well. Its data flow development environment and engine are generic. DataRush actually makes a good complement to MPP databases, which often suffer from a data loading bottleneck. When used as a transformation and loading engine, DataRush can achieve 2TB/hour throughput, according to company officials.

Despite all the current hype about MPP and scale-out architectures, it could be that scale-up architectures that fully exploit multi-core chips and SMP machines will win the race for mainstream analytics computing. Although you can't apply DataRush to existing analytic applications (you have to rewrite them), it will make a lot of sense to employ it for most new big data analytics applications.


Posted January 4, 2011 11:17 AM

I recently spoke with James Phillips, co-founder and senior vice president of products at Membase, an emerging NoSQL provider that powers many highly visible Web applications, such as Zynga's Farmville and AOL's ad targeting applications. James helped clarify for me the role of NoSQL in today's big data architectures.

Membase, like many of its NoSQL brethren, is an open source, key-value database. Membase was designed to run on clusters of commodity servers so it could "solve transaction problems at scale," says Phillips. Because of its transactional focus, Membase is not technology that I would normally talk about in the business intelligence (BI) sphere.

Same Challenges, Similar Solutions

However, today the transaction community is grappling with many of the same technical challenges as the BI community--namely, accessing and crunching large volumes of data in a fast, affordable way. Not coincidentally, the transactional community is coming up with many of the same solutions--namely, distributing data and processing across multiple nodes of commodity servers linked via high-speed interconnects. In other words, low-cost parallel processing.

Key-Value Pairs. But the NoSQL community differs in one major way from a majority of analytics vendors chasing large-scale parallel processing architectures: it relinquishes the relational framework in favor of key-value pair data structures. For data-intensive, Web-based applications that must dish up data to millions of concurrent online users in the blink of an eye, key-value pairs are a fast, flexible, and inexpensive approach. For example, you just pair a cookie with its ID, slam it into a file with millions of other key-value pairs, and distribute the files across multiple nodes in a cluster. A read works in reverse: the database finds the node with the right key-value pair to fulfill an application request and sends it along.
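
Here is a toy Python sketch of that idea--hash the key to pick a node, store an opaque value, and read it back the same way. Real stores such as Membase layer replication, persistence, and memory management on top of this; everything below is illustrative.

```python
import hashlib

NODES = [dict() for _ in range(4)]            # four pretend cluster nodes

def node_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def put(key, value):
    node_for(key)[key] = value                # the value is just an opaque blob

def get(key):
    return node_for(key).get(key)

put("cookie:8f3a", '{"last_page": "/pricing", "visits": 7}')
print(get("cookie:8f3a"))
```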

The beauty of NoSQL, according to Phillips, is that you don't have to put data into a table structure or use SQL to manipulate it. "With NoSQL, you put the data in first and then figure out how to manipulate it," Phillips says. "You can continue to change the kinds of data you store without having to change schemas or rebuild indexes and aggregates." Thus, the NoSQL mantra is "store first, design later." This makes NoSQL systems highly flexible but programmatically intensive, since you have to build programs to access the data. But since most NoSQL advocates are application developers (i.e., programmers), this model aligns with their strengths.

In contrast, most analytics-oriented database vendors and SQL-oriented BI professionals haven't given up on the relational model, although they are pushing it to new heights to ensure adequate scalability and performance when processing large volumes of data. Relational database vendors are embracing techniques, such as columnar storage, storage-level intelligence, built-in analytics, hardware-software appliances, and, of course, parallel processing across clusters of commodity servers. BI professionals are purchasing these purpose-built analytical platforms to address performance and availability problems first and foremost and data scalability issues secondarily. And that's where Hadoop comes in.

Hadoop. Hadoop is an open source analytics architecture for processing massively large volumes of structured and unstructured data in a cost-effective manner. Like its NoSQL brethren, Hadoop abandons the relational model in favor of a file-based, programmatic approach based on Java. And like Membase, Hadoop uses a scale-out architecture that runs on commodity servers and requires no predefined schema or query language. Many Internet companies today use Hadoop to ingest and pre-process large volumes of clickstream data which are then fed to a data warehouse for reporting and analysis. (However, many companies are also starting to run reports and queries directly against Hadoop.)

Membase has a strong partnership with Cloudera, one of the leading distributors of open source Hadoop software. Membase wants to create bidirectional interfaces with Hadoop to easily move data between the two systems.

Membase Technology

Membase's secret sauce--the thing that differentiates it from its NoSQL competitors, such as Cassandra, MongoDB, CouchDB, and Redis--is that it incorporates Memcache, an open source caching technology. Memcache is used by many companies to provide reliable, ultra-fast performance for data-intensive Web applications that dish out data to millions of concurrent customers. Today, many customers manually integrate Memcache with a relational database that persists the cached data to disk so transactions or activity are stored for future use.

Membase, on the other hand, does that integration upfront. It ties Memcache to a MySQL database which stores transactions to disk in a secure, reliable, and highly performant way. Membase then keeps the cache populated with working data that it pulls rapidly from disk in response to application requests. Because Membase distributes data across a cluster of commodity servers, it offers blazingly fast and reliable read/write performance required by the largest and most demanding Web applications.
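
The pattern Membase packages up resembles the classic cache-plus-durable-store approach sketched below. Plain Python dicts stand in for the Memcache layer and the MySQL-backed store, so this is an illustration of the pattern, not Membase's internals.

```python
cache = {}        # stand-in for the in-memory (Memcache) layer
disk_store = {}   # stand-in for the persistent, MySQL-backed layer

def write(key, value):
    disk_store[key] = value       # durable write first
    cache[key] = value            # keep the working set hot

def read(key):
    if key in cache:              # fast path: memory
        return cache[key]
    value = disk_store.get(key)   # slow path: pull from disk on a cache miss
    if value is not None:
        cache[key] = value        # repopulate the cache for the next read
    return value

write("player:42", {"farm_level": 9})
print(read("player:42"))
```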

Document Store. Membase will soon transform itself from a pure key-value database to a document store (a la MongoDB.) This will give developers the ability to write functions that manipulate data inside data objects stored in predefined formats (e.g. JSON, Avro, or Protocol Buffers.) Today, Membase can't "look inside" data objects to query, insert, or append information that the objects contain; it largely just dumps object values into an application.
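
A small illustration of what the move from opaque values to documents buys: once the store can parse the object (JSON here, one of the formats mentioned above), it can read or update individual fields rather than just hand back the whole blob. This is a generic sketch, not Membase's planned API.

```python
import json

opaque = '{"name": "Ann", "visits": 7}'  # key-value store: just an opaque blob
# A pure key-value store can return `opaque`, but cannot answer "what is visits?"

doc = json.loads(opaque)                 # document store: the structure is known
doc["visits"] += 1                       # read or update an individual field
print(doc["visits"])
```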

Phillips said the purpose of the new document architecture is to support predefined queries within transactional applications. He made it clear that the goal isn't to support ad hoc queries or compete with analytics vendors: "Our customers aren't asking for ad hoc queries or analytics; they just want super-fast performance for pre-defined application queries."

Pricing. Customers can download a free community edition of Membase or purchase an annual subscription that provides support, packaging, and quality assurance testing. Pricing starts at $999 per node.


Posted December 23, 2010 9:38 AM
