
Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor neutral content to business intelligence (BI) professionals worldwide. Wayne’s consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI) where he oversaw the company’s content and training programs and chaired its BI Executive Summit. He can be reached by email at weckerson@techtarget.com.

Recently in Data Management Category


I just finished writing the first draft of my upcoming report titled, "Creating an Enterprise Data Strategy: Managing Data as a Corporate Asset." This is a broad topic these days, even broader than just business intelligence (BI) and data warehousing. It's really about how organizations can better manage an enterprise asset--data--that most business people don't value until it's too late.

After spending more than a week reviewing notes from many interviews and trying to formulate a concise, coherent, and pragmatic analysis without creating a book, I can distill my findings into the following bullet points. And since I am still collecting feedback from sponsors and others, I welcome your input as well!

  • Learn the Hard Way. Most business executives don't perceive data as a vital corporate asset until they've been badly burned by poor quality data. It could be that their well-publicized merger didn't deliver promised synergies due to a larger-than-anticipated overlap in customers, or that customer churn is increasing but they have no idea who is churning or why.
  • The Value of Data. Certainly, there are cost savings from consolidating legacy reporting systems and independent data marts and spreadmarts. But the only way to really calculate the value of data is to understand the risks poor quality data poses to strategic projects, goals, partnerships, and decisions. And since risk is virtually invisible until something bad happens, selling a data strategy is hard to do.
  • Project Alignment. Even with a catastrophic data-induced failure, the only way to cultivate data fastidiousness is one project at a time. Data governance for data governance's sake does not work. Business people must have tangible, self-evident reasons to spend time on infrastructure and service issues rather than immediate business outcomes on which they're being measured.
  • Business Driven. This goes without saying: data strategy and governance are not IT projects or programs. Any attempt by executives to put IT in charge of this asset is doomed to fail. The business must assign top executives, subject matter experts, and business stewards to define the rules, policies, and procedures required to maintain the accuracy, completeness, and timeliness of critical data elements.
  • Sustainable Processes. The ultimate objective for managing any shared service is to embed its care and tending into business processes that are part of the corporate culture. At this point, managing data becomes everyone's business and no one questions why it's done. If you try to change the process, people will say, "This is the way we've always done it." That is a sustainable process.
  • Data Defaults. In the absence of strong data governance, data always defaults to the lowest common denominator, which is first and foremost, an analyst armed with a spreadsheet, and secondly, a department head with his own IT staff and data management systems. This is kind of like the law of entropy: it takes a lot of energy to maintain order and symmetry but very little for it to devolve into randomness.
  • Reconciling Extremes. The key to managing data (or any shared services or strategy) is to balance extremes by maintaining a free interplay between polar opposites. A company in which data is a free-for-all needs to impose standard processes to bring order to chaos. On the other hand, a company with a huge backlog of data projects needs to license certain people and groups to bend or break the rules for the benefit of the business.
  • A Touch of Chaos. Instead of trying to beat back data chaos, BI managers should embrace it. Spreadmarts are instantiations of business requirements, so use them (and the people who create them) to flesh out the enterprise BI and DW environment. "I don't think it's healthy to think that your central BI solution can do it all. The ratio I'm going for is 80% corporate, 20% niche," says Mike Masciandaro, BI Director at Dow, talking about the newest incarnation of spreadmarts: in-memory visualization tools.
  • Safety Valves. Another approach to managing chaos is to coopt it. If users threaten to create independent data marts while they wait for the EDW to meet their needs, create a SWAT team to build a temporary application that meets their needs. If they complain about the quick-and-dirty solution (and you don't want to make it too appealing), they at least know a better solution is in the offing.
  • Data Tools. There has been a lot more innovation in technology than in processes. So, today, organizations should strive to arm their data management teams with the proper tool for every task. And with data volumes growing and data types proliferating, IT professionals need every tool they can get.

So what did I miss? If you send me some tantalizing insights, I just might have to quote you in the report!


Posted May 19, 2011 9:25 AM
Permalink | 2 Comments |

In a recent blog ("What's in a Word: The Evolution of BI Semantics"), I discussed the evolution of BI semantics and end-user approaches to business intelligence. In this blog, I will focus on technology evolution and vendor messaging.

Four Market Segments. The BI market comprises four sub-markets that have experienced rapid change and growth since the 1990s: BI tools, data integration tools, database management systems (DBMS), and hardware platforms. (See bottom half of figure 1.)

[Figure 1: BI Market Evolution]

Compute Platform. BI technologies in these market segments run on a compute infrastructure (i.e., the diagonal line in figure 1) that has changed dramatically over the years, evolving from mainframes and mini-computers in the 1980s and client/server in the 1990s to the Web and Web services in the early 2000s. Today, we see the advent of mobile devices and cloud-based platforms. Each change in the underlying compute platform has created opportunities for upstarts with new technology to grab market share and forced incumbents to respond in kind or acquire the upstarts. With an endless wave of new companies pushing innovative new technologies, the BI market has been one of the most dynamic in the software industry during the past 20 years.

BI Tools. Prior to 1990, companies built reports using 3GL and 4GL reporting languages, such as Focus and Ramis. In the 1990s, vendors began selling desktop or client/server tools that enabled business users to create their own reports and analyses. The prominent BI tools were Windows-based OLAP, ad hoc query, and ad hoc reporting tools, and, of course, Excel, which still is the most prevalent reporting and analysis tool in the market today.

In the 2000s, BI vendors "rediscovered" reporting, having been enraptured with analysis tools in the 1990s. They learned the hard way that only a fraction of users want to analyze data and the real market for BI lies in delivering reports, and subsequently, dashboards, which are essentially visual exception reports. Today, vendors have moved to the next wave of BI, which is predictive analytics, while offering support for new channels of delivery (mobile and cloud.) In the next five years, I believe BI search will become an integral part of a BI portfolio, since it provides a super easy interface for casual users to submit ad hoc queries and navigate data without boundaries.

BI Vendor Messaging. In the 1990s, vendors competed by bundling together multiple types of BI tools (reporting, OLAP, query) into a single "BI Suite." A few years later, they began touting "BI Platforms" in which the once distinct BI tools in a suite became modules within a unified BI architecture that all use the same query engine, charting engine, user interface, metadata, administration, security model, and application programming interface. In the late 1990s, Microsoft launched the movement towards low-cost BI tools geared to the mid-market when it bundled its BI and ETL tools in SQL Server at no extra charge. Today, a host of low-cost BI offerings, including open source BI tools, cloud BI tools, and in-memory visual analysis tools, have helped bring BI to the mid-market and lower the costs of departmental BI initiatives.

Today, BI tools have become easier to use and tailored to a range of information consumption styles (i.e., viewer, interactor, lightweight author, professional author). Consequently, the watchword is now "self-service BI," where business users meet their own information requirements rather than relying on BI professionals or power users to build reports on their behalf. Going forward, BI tools vendors will begin talking about "embedded BI," in which analytics (e.g., charts, tables, models) are embedded in operational applications and mission-critical business processes.

Data Integration Tools. In the data integration market, Informatica and Ascential Software (now IBM) led the charge towards the use of extract, transform, and load (ETL) engines to replace hand-coded programs that move data from source systems to a data warehouse. The engine approach proved superior to coding because its graphical interface meant you didn't have to be a hard-core programmer to write ETL code and, more importantly, it captured metadata in a repository instead of burying it in code.

But vendors soon discovered that ETL tools are only one piece of the data integration puzzle and, following the lead of their BI brethren, moved to create data integration "suites" consisting of data quality, data profiling, master data management, and data federation tools. Soon, these suites turned into data integration "platforms" running on a common architecture. Today, the focus is on using data federation tools to "virtualize" data sources behind a common data services interface and cloud-based data integration tools to migrate data from on premises to the cloud and back again. Also, data integration vendors are making their tools easier to use, thanks in large part to cloud-based initiatives, which now has them evangelizing the notion of "self-service data integration" in which business analysts, not IT developers, build data integration scripts.

DBMS Engines and Hardware. Throughout the 1990s and early 2000s, the database and hardware markets were sleepy backwaters of the BI market, despite the fact that they consumed a good portion of BI budgets. True, database vendors had added cubing, aggregate-aware optimizers, and various types of indexes to speed query performance, but that was the extent of the innovation.

But in the early 2000s, as data warehouse data volumes began to exceed the terabyte mark and query complexity grew, many data warehouses hit the proverbial wall. Meanwhile, Moore's law continued to make dramatic strides in the price-performance of processing, storage, and memory, and soon a few database entrepreneurs spotted an opportunity to overhaul the underlying BI compute infrastructure.

Netezza opened the floodgates in 2002 with the first data warehousing appliance (unless you count Teradata back in the 1980s!), which offered orders of magnitude better query performance for a fraction of the cost and soon gained a bevy of imitators. These new systems offer innovative storage-level filtering, column-based compression and storage, massively parallel processing architectures, expanded use of memory-based caches, and, in some cases, solid state disk to bolster performance and availability for analytic workloads. Today, these "analytic platforms" are turbo-charging BI deployments, and in many cases, enabling BI professionals to deliver solutions that weren't possible before.

As proof of the power of these new purpose-built analytical systems, the biggest vendors in high-tech have invaded the market, picking off leading pureplays before they've even fully germinated. In the past nine months, Microsoft, IBM, Hewlett Packard, Teradata, SAP, and EMC purchased analytic platform vendors, while Oracle built its own with hardware from Sun Microsystems, which it acquired in 2009. (See "Jockeying for Position in the Analytic Platform Market.")

Mainstream Market. When viewed as a whole, the BI market has clearly moved from the early adopter phase to the early mainstream. The watershed moment was 2007, when the biggest software vendors in the world--Oracle, SAP, and IBM--acquired the leading BI vendors--Hyperion, Business Objects, and Cognos, respectively. Also, the plethora of advertisements about BI capabilities that appear on television (e.g., IBM's Smarter Planet campaign) and in major consumer magazines (e.g., SAP and SAS Institute ads) reinforces the maturity of BI as a mainstream market. BI is now front and center on the radar screen of most CIOs, if not CEOs, who want to better leverage information to make smarter decisions and gain a lasting competitive advantage.

The Future. At this point, some might wonder if there is much headroom left in the BI market. The last 20 years have witnessed a dizzying array of technology innovations, products, and methodologies. It can't continue at this pace, right? Yes and no. The BI market has surprised us in the past. Even in recent years, as the BI market consolidated--with big software vendors acquiring nimble innovators--we've seen a tremendous explosion of innovation. BI entrepreneurs see a host of opportunities, from self-service BI tools that are more visual and intuitive to use, to mobile and cloud-based BI offerings that are faster, better, and cheaper than current offerings. Search vendors are making a play for BI, as are platform vendors that promise data center scalability and availability for increasingly mission-critical BI workloads. And we still need better tools and approaches for querying and analyzing unstructured content (e.g., documents, email, clickstream data, Web pages) and for delivering data faster as our businesses increasingly compete on velocity and as our data volumes become too large to fit inside shrinking batch windows.

Next week, Beye Research will publish a report of mine that describes a new BI Delivery Framework for the next ten years. In that report, I describe a future-state BI environment that contains not just one intelligence (business intelligence) but four--business, analytic, continuous, and content intelligence--that BI organizations will need to support or interoperate with in the near future. Stay tuned!


Posted March 18, 2011 2:29 PM
Permalink | No Comments |

As companies grapple with the gargantuan task of processing and analyzing "big data," certain technologies have captured the industry limelight, namely massively parallel processing (MPP) databases, such as those from Aster Data and Greenplum; data warehousing appliances, such as those from Teradata, Netezza, and Oracle; and, most recently, Hadoop, an open source distributed file system that uses the MapReduce programming model to process key-value data in parallel across large numbers of commodity servers.

SMP Machines. Missing in action from this list is the venerable symmetric multiprocessing (SMP) machine, which parallelizes operations across multiple CPUs (or cores). The industry today seems to favor "scale out" parallel processing approaches (where processes run across commodity servers) rather than "scale up" approaches (where processes run on a single server). However, with the advent of multi-core servers that today can pack upwards of 48 cores in a single machine, the traditional SMP approach is worth a second look for processing big data analytics jobs.

The benefits of applying parallel processing within a single server rather than across multiple servers are obvious: reduced processing complexity and a smaller server footprint. Why buy 40 servers when one will do? MPP systems require more boxes, which require more space, cooling, and electricity. Also, distributing data across multiple nodes chews up valuable processing time, and overcoming node failures--which are more common when you string together dozens, hundreds, or even thousands of servers into a single, coordinated system--adds overhead that reduces performance.

Multi-Core CPUs. Moreover, since chipmakers maxed out the clock frequency of individual CPUs in 2004, the only way they can deliver improved performance is by packing more cores into a single chip. Chipmakers started with dual-core chips, then quad-cores, and now eight- and 16-core chips are becoming commonplace.

Unfortunately, few software programs that could benefit from parallelized operations have been redesigned to exploit the tremendous amount of processing power and memory available within multi-core servers. Big data analytics applications are especially good candidates for thread-level parallel processing. As developers recognize the untold power lurking within their commodity servers, I suspect that next year SMP processing will gain an equivalent share of attention among big data analytic proselytizers.
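
To make the scale-up idea concrete, here is a minimal sketch of thread-level parallelism in plain Java using the standard parallel streams API: one large in-memory array is aggregated across all available cores of a single box, with no cluster involved. The workload and numbers are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal sketch: aggregating a large in-memory array across all cores
// of a single SMP machine using Java's parallel streams (fork/join under the hood).
public class SmpAggregation {
    public static void main(String[] args) {
        // Hypothetical workload: twenty million sales amounts held in memory.
        double[] sales = new Random(42).doubles(20_000_000, 0, 500).toArray();

        long start = System.nanoTime();
        // The runtime splits the array into chunks and schedules them across
        // the available cores -- scale-up parallelism, no cluster required.
        double total = Arrays.stream(sales).parallel().sum();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.printf("Total: %.2f (computed on %d cores in %d ms)%n",
                total, Runtime.getRuntime().availableProcessors(), elapsedMs);
    }
}
```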

Pervasive DataRush

One company that is on the forefront of exploiting multi-core chips for analytics is Pervasive Software, a $50 million software company that is best known for its Pervasive Integration ETL software (which it acquired from Data Junction) and Pervasive PSQL, its embedded database (a.k.a. Btrieve.)

In 2009, Pervasive released a new product, called Pervasive DataRush, a parallel dataflow platform designed to accelerate performance for data preparation and analytics tasks. It fully leverages the parallel processing capabilities of multi-core processors and SMP machines, making it unnecessary to implement clusters (or MPP grids) to achieve suitable performance when processing and analyzing moderate to heavy volumes of data.

Sweet Spot. As a parallel data flow engine, Pervasive DataRush is often used today to power batch processing jobs, and is particularly well suited to running data preparation tasks (e.g. sorting, deduplicating, aggregating, cleansing, joining, loading, validating) and machine learning programs, such as fuzzy matching algorithms.

Today, DataRush will outperform Hadoop on complex processing jobs that address data volumes ranging from 500GB to tens of terabytes. It is not geared to handling hundreds of terabytes to petabytes of data, which is the territory of MPP systems and Hadoop. However, as chipmakers continue to add more cores to their chips and as Pervasive releases DataRush 5.0, which supports small clusters, later this year, DataRush's high-end scalability will continue to increase.

Architecture. DataRush is not a database; it's a development environment and execution engine that runs in a Java Virtual Machine. Its Eclipse-based development environment provides a library of parallel operators that developers use to create parallel dataflow programs. Although developers need to understand the basics of parallel operations--such as when it makes sense to partition data and/or processes based on the nature of their application--DataRush handles all the underlying details of managing threads and processes across one or more cores to maximize utilization and performance. As you add cores, DataRush automatically readjusts the underlying parallelism without forcing the developer to recompile the application.
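
The sketch below is illustrative only and does not use the DataRush API; it shows the general dataflow idea in plain Java: a pipeline of operators (validate, dedupe, aggregate) composed declaratively, with the runtime free to spread the work across cores. The Order record and its fields are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only -- NOT the DataRush API. A generic example of the dataflow
// idea: a pipeline of operators (filter -> dedupe -> aggregate) composed
// declaratively, with the runtime free to execute stages in parallel.
public class DataflowSketch {
    // Hypothetical record type for the sketch.
    record Order(String customerId, String orderId, double amount) {}

    public static Map<String, Double> revenueByCustomer(List<Order> orders) {
        return orders.parallelStream()                     // partition work across cores
                .filter(o -> o.amount() > 0)               // validate
                .distinct()                                // dedupe (record equality)
                .collect(Collectors.groupingBy(            // aggregate
                        Order::customerId,
                        Collectors.summingDouble(Order::amount)));
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order("c1", "o1", 120.0),
                new Order("c1", "o1", 120.0),              // duplicate row, removed by distinct()
                new Order("c2", "o2", 75.5));
        System.out.println(revenueByCustomer(orders));     // e.g. {c1=120.0, c2=75.5}
    }
}
```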

Versus Hadoop. To run DataRush, you feed the execution engine formatted flat files or database records and it executes the various steps in the dataflow and spits out a data set. As such, it's more flexible than Hadoop, which requires data to be structured as key-value pairs and partitioned across servers, and MapReduce, which forces developers to use one type of programming model for executing programs. DataRush also doesn't have the overhead of Hadoop, which requires each data element to be duplicated in multiple nodes for failover purposes and requires lots of processing to support data movement and exchange across nodes. But like Hadoop, it's focused on running predefined programs in batch jobs, not ad hoc queries.

Competitors. Perhaps the closest competitors to Pervasive DataRush are Ab Initio, a parallelizable ETL tool, and Syncsort, a high-speed sorting engine. But these tools were developed before the advent of multi-core processing and don't exploit it to the same degree as DataRush. Plus, DataRush is not focused solely on back-end processing; it can handle front-end analytic processing as well, since its dataflow development environment and engine are generic. DataRush actually makes a good complement to MPP databases, which often suffer from a data loading bottleneck. When used as a transformation and loading engine, DataRush can achieve 2TB/hour throughput, according to company officials.

Despite all the current hype about MPP and scale-out architectures, it could be that scale-up architectures that fully exploit multi-core chips and SMP machines will win the race for mainstream analytics computing. Although you can't apply DataRush to existing analytic applications (you have to rewrite them), it will make a lot of sense to employ it for most new big data analytics applications.


Posted January 4, 2011 11:17 AM
Permalink | No Comments |

I recently spoke with James Phillips, co-founder and senior vice president of products at Membase, an emerging NoSQL provider that powers many highly visible Web applications, such as Zynga's FarmVille and AOL's ad targeting applications. James helped clarify for me the role of NoSQL in today's big data architectures.

Membase, like many of its NoSQL brethren, is an open source, key-value database. Membase was designed to run on clusters of commodity servers so it could "solve transaction problems at scale," says Phillips. Because of its transactional focus, Membase is not technology that I would normally talk about in the business intelligence (BI) sphere.

Same Challenges, Similar Solutions

However, today the transaction community is grappling with many of the same technical challenges as the BI community--namely, accessing and crunching large volumes of data in a fast, affordable way. Not coincidentally, the transactional community is coming up with many of the same solutions--namely, distributing data and processing across multiple nodes of commodity servers linked via high-speed interconnects. In other words, low-cost parallel processing.

Key-Value Pairs. But the NoSQL community differs in one major way from a majority of analytics vendors chasing large-scale parallel processing architectures: it relinquishes the relational framework in favor of key-value pair data structures. For data-intensive, Web-based applications that must dish up data to millions of concurrent online users in the blink of an eye, key-value pairs are a fast, flexible, and inexpensive approach. For example, you just pair a cookie with its ID, slam it into a file with millions of other key-value pairs, and distribute the files across multiple nodes in a cluster. A read works in reverse: the database finds the node with the right key-value pair to fulfill an application request and sends it along.
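
Here is a minimal sketch of that routing idea, with hypothetical node names: hash the key and map it to one node in the cluster. Production systems use consistent hashing or similar bucketing schemes so that adding or removing nodes doesn't reshuffle most of the data; this sketch shows only the basic principle.

```java
import java.util.List;

// Minimal sketch of how a key-value store routes a key (e.g., a cookie ID)
// to one node in a cluster: hash the key, take it modulo the node count.
// Node names are hypothetical; real systems use consistent hashing or
// similar bucketing schemes to limit data movement when the cluster changes.
public class KeyRouter {
    private final List<String> nodes;

    public KeyRouter(List<String> nodes) { this.nodes = nodes; }

    public String nodeFor(String key) {
        int bucket = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(bucket);
    }

    public static void main(String[] args) {
        KeyRouter router = new KeyRouter(List.of("node-a", "node-b", "node-c"));
        // A read works in reverse: hash the same cookie ID to find the node
        // that holds its value.
        System.out.println("cookie:8f3a9 -> " + router.nodeFor("cookie:8f3a9"));
    }
}
```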

The beauty of NoSQL, according to Phillips, is that you don't have to put data into a table structure or use SQL to manipulate it. "With NoSQL, you put the data in first and then figure out how to manipulate it," Phillips says. "You can continue to change the kinds of data you store without having to change schemas or rebuild indexes and aggregates." Thus, the NoSQL mantra is "store first, design later." This makes NoSQL systems highly flexible but programmatically intensive, since you have to build programs to access the data. But since most NoSQL advocates are application developers (i.e., programmers), this model aligns with their strengths.
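
"Store first, design later" in miniature: the sketch below uses an in-memory map as a stand-in for a key-value store and writes values of different shapes side by side, with no schema change or index rebuild. All keys and fields are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// "Store first, design later," sketched with an in-memory map standing in for
// a key-value store. Values with different shapes sit side by side -- no
// schema change, no index rebuild -- and the application decides how to
// interpret what it reads back. Keys and fields are hypothetical.
public class StoreFirst {
    public static void main(String[] args) {
        Map<String, String> store = new ConcurrentHashMap<>();

        // An early version of the app stores only a score.
        store.put("user:1001", "{\"score\": 42}");

        // A later release stores richer values -- no migration needed.
        store.put("user:2002",
                "{\"score\": 77, \"badges\": [\"gold\"], \"lastSeen\": \"2010-12-01\"}");

        store.forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```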

In contrast, most analytics-oriented database vendors and SQL-oriented BI professionals haven't given up on the relational model, although they are pushing it to new heights to ensure adequate scalability and performance when processing large volumes of data. Relational database vendors are embracing techniques, such as columnar storage, storage-level intelligence, built-in analytics, hardware-software appliances, and, of course, parallel processing across clusters of commodity servers. BI professionals are purchasing these purpose-built analytical platforms to address performance and availability problems first and foremost and data scalability issues secondarily. And that's where Hadoop comes in.

Hadoop. Hadoop is an open source analytics architecture for processing massively large volumes of structured and unstructured data in a cost-effective manner. Like its NoSQL brethren, Hadoop abandons the relational model in favor of a file-based, programmatic approach based on Java. And like Membase, Hadoop uses a scale-out architecture that runs on commodity servers and requires no predefined schema or query language. Many Internet companies today use Hadoop to ingest and pre-process large volumes of clickstream data which are then fed to a data warehouse for reporting and analysis. (However, many companies are also starting to run reports and queries directly against Hadoop.)
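
For readers unfamiliar with the MapReduce model, here is a minimal sketch in the spirit of the clickstream use case above: a mapper that emits (URL, 1) for each log line and a reducer that sums the counts per URL. It assumes the standard Hadoop MapReduce client libraries on the classpath and a hypothetical log format in which the URL is the third whitespace-separated field; job setup and driver code are omitted.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal MapReduce sketch: count page views per URL from raw log lines.
// Assumes the URL is the third whitespace-separated field (hypothetical format).
public class PageViewCount {

    public static class ViewMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 2) {
                ctx.write(new Text(fields[2]), ONE);       // emit (url, 1)
            }
        }
    }

    public static class ViewReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text url, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) total += c.get();
            ctx.write(url, new IntWritable(total));        // emit (url, total views)
        }
    }
}
```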

Membase has a strong partnership with Cloudera, one of the leading distributors of open source Hadoop software. Membase wants to create bidirectional interfaces with Hadoop to easily move data between the two systems.

Membase Technology

Membase's secret sauce--the thing that differentiates it from its NoSQL competitors, such as Cassandra, MongoDB, CouchDB, and Redis--is that it incorporates Memcache, an open source caching technology. Memcache is used by many companies to provide reliable, ultra-fast performance for data-intensive Web applications that dish out data to millions of concurrent users. Today, many of those companies manually integrate Memcache with a relational database that persists the cached data to disk, storing transactions or activity for future use.

Membase, on the other hand, does that integration upfront. It ties Memcache to a MySQL database which stores transactions to disk in a secure, reliable, and highly performant way. Membase then keeps the cache populated with working data that it pulls rapidly from disk in response to application requests. Because Membase distributes data across a cluster of commodity servers, it offers blazingly fast and reliable read/write performance required by the largest and most demanding Web applications.
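
What Membase packages up is essentially the manual "cache-aside" wiring described above. The sketch below shows that pattern in generic Java: check the cache, fall back to the slower disk-backed store on a miss, then repopulate the cache. The Cache and Database interfaces are hypothetical stand-ins for Memcache and a relational database, not Membase's API.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the manual "cache-aside" pattern: read from the cache first,
// fall back to the database on a miss, then repopulate the cache. The Cache
// and Database interfaces are hypothetical stand-ins, not a real client API.
public class CacheAside {
    interface Cache    { Optional<String> get(String key); void put(String key, String value); }
    interface Database { String load(String key); }

    private final Cache cache;
    private final Database db;

    public CacheAside(Cache cache, Database db) {
        this.cache = cache;
        this.db = db;
    }

    public String read(String key) {
        return cache.get(key).orElseGet(() -> {
            String value = db.load(key);   // cache miss: hit the slower disk-backed store
            cache.put(key, value);         // repopulate so the next read is served from memory
            return value;
        });
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, String> mem = new ConcurrentHashMap<>();
        Cache cache = new Cache() {
            public Optional<String> get(String k) { return Optional.ofNullable(mem.get(k)); }
            public void put(String k, String v)   { mem.put(k, v); }
        };
        Database db = key -> "profile-for-" + key;   // stand-in for a SQL lookup

        CacheAside store = new CacheAside(cache, db);
        System.out.println(store.read("user:42"));   // miss -> database -> cached
        System.out.println(store.read("user:42"));   // served from the cache
    }
}
```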

Document Store. Membase will soon transform itself from a pure key-value database to a document store (a la MongoDB.) This will give developers the ability to write functions that manipulate data inside data objects stored in predefined formats (e.g. JSON, Avro, or Protocol Buffers.) Today, Membase can't "look inside" data objects to query, insert, or append information that the objects contain; it largely just dumps object values into an application.
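
To illustrate what "looking inside" a data object means, here is a small client-side sketch using the Jackson databind library as a stand-in JSON parser: a pure key-value store hands the value back as an opaque blob, while a document store can parse it and evaluate a predefined query against fields inside it. The document fields are hypothetical, and this is not Membase functionality.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Illustrates the difference the document model makes, with Jackson as a
// stand-in JSON parser. A pure key-value store returns the value as an opaque
// blob; a document store can look inside it and evaluate a predefined query
// such as city == "Boston". Field names are hypothetical.
public class DocumentLookup {
    public static void main(String[] args) throws Exception {
        String blob = "{\"userId\":\"u-17\",\"city\":\"Boston\",\"visits\":12}";

        // Key-value view: the value is just text handed back to the application.
        System.out.println("opaque value: " + blob);

        // Document view: the store (or here, the client) can look inside.
        JsonNode doc = new ObjectMapper().readTree(blob);
        if ("Boston".equals(doc.get("city").asText())) {
            System.out.println("visits for Boston user: " + doc.get("visits").asInt());
        }
    }
}
```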

Phillips said the purpose of the new document architecture is to support predefined queries within transactional applications. He made it clear that the goal isn't to support ad hoc queries or compete with analytics vendors: "Our customers aren't asking for ad hoc queries or analytics; they just want super-fast performance for pre-defined application queries."

Pricing. Customers can download a free community edition of Membase or purchase an annual subscription that provides support, packaging, and quality assurance testing. Pricing starts at $999 per node.


Posted December 23, 2010 9:38 AM
Permalink | No Comments |