Blog: Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor neutral content to business intelligence (BI) professionals worldwide. Wayne’s consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI) where he oversaw the company’s content and training programs and chaired its BI Executive Summit. He can be reached by email at weckerson@techtarget.com.

Recently in Data Integration Category


I just finished writing the first draft of my upcoming report titled, "Creating an Enterprise Data Strategy: Managing Data as a Corporate Asset." This is a broad topic these days, even broader than just business intelligence (BI) and data warehousing. It's really about how organizations can better manage an enterprise asset--data--that most business people don't value until it's too late.

After spending more than a week reviewing notes from many interviews and trying to formulate a concise, coherent, and pragmatic analysis without creating a book, I can distill my findings into the bullet points below. And since I am still collecting feedback from sponsors and others, I welcome your input as well!

  • Learn the Hard Way. Most business executives don't perceive data as a vital corporate asset until they've been badly burned by poor quality data. Perhaps their well-publicized merger didn't deliver the promised synergies because of a larger-than-anticipated overlap in customers, or customer churn is rising but they have no idea who is churning or why.
  • The Value of Data. Certainly, there are cost savings from consolidating legacy reporting systems, independent data marts, and spreadmarts. But the only way to really calculate the value of data is to understand the risks that poor quality data poses to strategic projects, goals, partnerships, and decisions. And because risk is virtually invisible until something bad happens, selling a data strategy is a hard thing to do.
  • Project Alignment. Even with a catastrophic data-induced failure, the only way to cultivate data fastidiousness is one project at a time. Data governance for data governance's sake does not work. Business people must have tangible, self-evident reasons to spend time on infrastructure and service issues rather than immediate business outcomes on which they're being measured.
  • Business driven. This goes without saying: data strategy and governance is not an IT project or program. Any attempt by executives to put IT in charge of this asset is doomed to fail. The business must assign top executives, subject matter experts, and business stewards to define the rules, policies, and procedures required to maintain accuracy, completeness, and timeliness of critical data elements.
  • Sustainable Processes. The ultimate objective for managing any shared service is to embed its care and tending into business processes that are part of the corporate culture. At that point, managing data becomes everyone's business and no one questions why it's done. If you try to change the process, people will say, "This is the way we've always done it." That is a sustainable process.
  • Data Defaults. In the absence of strong data governance, data always defaults to the lowest common denominator: first and foremost, an analyst armed with a spreadsheet, and secondly, a department head with his own IT staff and data management systems. This is kind of like the law of entropy: it takes a lot of energy to maintain order and symmetry but very little for it all to devolve into randomness.
  • Reconciling Extremes. The key to managing data (or any shared services or strategy) is to balance extremes by maintaining a free interplay between polar opposites. A company in which data is a free-for-all needs to impose standard processes to bring order to chaos. On the other hand, a company with a huge backlog of data projects needs to license certain people and groups to bend or break the rules for the benefit of the business.
  • A Touch of Chaos. Instead of trying to beat back data chaos, BI managers should embrace it. Spreadmarts are instantiations of business requirements, so use them (and the people who create them) to flesh out the enterprise BI and DW environment. "I don't think it's healthy to think that your central BI solution can do it all. The ratio I'm going for is 80% corporate, 20% niche," says Mike Masciandaro, BI Director at Dow, talking about the newest incarnation of spreadmarts: in-memory visualization tools.
  • Safety Valves. Another approach to managing chaos is to co-opt it. If users threaten to create independent data marts while they wait for the EDW to meet their needs, create a SWAT team to build a temporary application that meets their needs. If they complain about the quick-and-dirty solution (and you don't want to make it too appealing), at least they know a better solution is in the offing.
  • Data Tools. There has been a lot more innovation in technology than in processes. So, today, organizations should strive to arm their data management teams with the proper tool for every task. And with the volume and variety of data growing rapidly, IT professionals need every tool they can get.

So what did I miss? If you send me some tantalizing insights, I just might have to quote you in the report!


Posted May 19, 2011 9:25 AM

In a recent blog ("What's in a Word: The Evolution of BI Semantics"), I discussed the evolution of BI semantics and end-user approaches to business intelligence. In this blog, I will focus on technology evolution and vendor messaging.

Four Market Segments. The BI market comprises four sub-markets that have experienced rapid change and growth since the 1990s: BI tools, data integration tools, database management systems (DBMS), and hardware platforms. (See the bottom half of figure 1.)

[Figure 1: BI Market Evolution]

Compute Platform. BI technologies in these market segments run on a compute infrastructure (i.e., the diagonal line in figure 1) that has changed dramatically over the years, evolving from mainframes and mini-computers in the 1980s and client/server in the 1990s to the Web and Web services in the early 2000s. Today, we see the advent of mobile devices and cloud-based platforms. Each change in the underlying compute platform has created opportunities for upstarts with new technology to grab market share and forced incumbents to respond in kind or acquire the upstarts. With an endless wave of new companies pushing innovative new technologies, the BI market has been one of the most dynamic in the software industry during the past 20 years.

BI Tools. Prior to 1990, companies built reports using 3GL and 4GL reporting languages, such as Focus and Ramis. In the 1990s, vendors began selling desktop or client/server tools that enabled business users to create their own reports and analyses. The prominent BI tools were Windows-based OLAP, ad hoc query, and ad hoc reporting tools, and, of course, Excel, which still is the most prevalent reporting and analysis tool in the market today.

In the 2000s, BI vendors "rediscovered" reporting, having been enraptured with analysis tools in the 1990s. They learned the hard way that only a fraction of users want to analyze data and that the real market for BI lies in delivering reports and, subsequently, dashboards, which are essentially visual exception reports. Today, vendors have moved to the next wave of BI, which is predictive analytics, while offering support for new channels of delivery (mobile and cloud). In the next five years, I believe BI search will become an integral part of a BI portfolio, since it provides a super easy interface for casual users to submit ad hoc queries and navigate data without boundaries.

BI Vendor Messaging. In the 1990s, vendors competed by bundling multiple types of BI tools (reporting, OLAP, query) into a single "BI Suite." A few years later, they began touting "BI Platforms," in which the once distinct BI tools in a suite became modules within a unified BI architecture that all use the same query engine, charting engine, user interface, metadata, administration, security model, and application programming interface. In the late 1990s, Microsoft launched the movement towards low-cost BI tools geared to the mid-market when it bundled its BI and ETL tools in SQL Server at no extra charge. Today, a host of low-cost BI offerings, including open source BI tools, cloud BI tools, and in-memory visual analysis tools, have helped bring BI to the mid-market and lower the costs of departmental BI initiatives.

Today, BI tools have become easier to use and tailored to a range of information consumption styles (i.e., viewer, interactor, lightweight author, professional author). Consequently, the watchword is now "self-service BI," where business users meet their own information requirements rather than relying on BI professionals or power users to build reports on their behalf. Going forward, BI tool vendors will begin talking about "embedded BI," in which analytics (e.g., charts, tables, models) are embedded in operational applications and mission-critical business processes.

Data Integration Tools. In the data integration market, Informatica and Ascential Software (now IBM) led the charge towards the use of extract, transform, and load (ETL) engines to replace hand-coded programs that move data from source systems to a data warehouse. The engine approach proved superior to coding because its graphical interface meant you didn't have to be a hard-core programmer to write ETL code and, more importantly, it captured metadata in a repository instead of burying it in code.
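
To make that concrete, here is a minimal sketch (in Python, purely for illustration; the file, table, and column names are hypothetical) of the kind of one-off, hand-coded job that ETL engines replaced. Notice that the source-to-target mapping lives only in the code itself, which is exactly the knowledge an ETL engine would instead capture as metadata in its repository.

    # Hypothetical hand-coded ETL job: the mapping knowledge (which source
    # column feeds which target column, and how values are cleansed) is buried
    # in code rather than captured as metadata in a repository.
    import csv
    import sqlite3

    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        out = []
        for r in rows:
            out.append({
                "order_id": int(r["ORD_NO"]),                 # rename and cast
                "customer": r["CUST_NAME"].strip().upper(),   # standardize
                "amount_usd": round(float(r["AMT"]), 2),      # cleanse
            })
        return out

    def load(rows, db_path="warehouse.db"):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                    "(order_id INTEGER, customer TEXT, amount_usd REAL)")
        con.executemany("INSERT INTO fact_orders VALUES "
                        "(:order_id, :customer, :amount_usd)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("orders_extract.csv")))

Multiply this by hundreds of feeds and you can see why a graphical engine with a shared metadata repository won out.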

But vendors soon discovered that ETL tools are only one piece of the data integration puzzle and, following the lead of their BI brethren, moved to create data integration "suites" consisting of data quality, data profiling, master data management, and data federation tools. Soon, these suites turned into data integration "platforms" running on a common architecture. Today, the focus is on using data federation tools to "virtualize" data sources behind a common data services interface and cloud-based data integration tools to migrate data from on premises to the cloud and back again. Also, data integration vendors are making their tools easier to use, thanks in large part to cloud-based initiatives, which now have them evangelizing the notion of "self-service data integration," in which business analysts, not IT developers, build data integration scripts.
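
For readers unfamiliar with the term, data virtualization simply means the consumer queries one logical interface while the federation layer assembles the answer from multiple physical sources at request time, without copying the data into a warehouse first. The toy sketch below (hypothetical sources and fields, not any vendor's API) shows the idea.

    # Toy data federation example: one logical "customer" service; behind it,
    # attributes are fetched live from two hypothetical physical sources (a CRM
    # system and a billing system) and merged at request time. No data is copied.
    CRM = {101: {"name": "Acme Corp", "segment": "Manufacturing"}}
    BILLING = {101: {"balance_usd": 12500.00, "terms": "NET30"}}

    def get_customer(customer_id):
        """A single virtual view over two physical sources."""
        record = {"customer_id": customer_id}
        record.update(CRM.get(customer_id, {}))       # live lookup, source 1
        record.update(BILLING.get(customer_id, {}))   # live lookup, source 2
        return record

    if __name__ == "__main__":
        # The consumer never knows (or cares) where each attribute lives.
        print(get_customer(101))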

DBMS Engines and Hardware. Throughout the 1990s and early 2000s, the database and hardware markets were sleepy backwaters of the BI market, despite the fact that they consumed a good portion of BI budgets. True, database vendors had added cubing, aggregate-aware optimizers, and various types of indexes to speed query performance, but that was the extent of the innovation.

But in the early 2000s, as data warehouse data volumes began to exceed the terabyte mark and query complexity grew, many data warehouses hit the proverbial wall. Meanwhile, Moore's law continued to make dramatic strides in the price-performance of processing, storage, and memory, and soon a few database entrepreneurs spotted an opportunity to overhaul the underlying BI compute infrastructure.

Netezza opened the floodgates in 2002 with the first data warehousing appliance (unless you count Teradata back in the 1980s!), which soon gained a bevy of imitators offering orders-of-magnitude better query performance at a fraction of the cost. These new systems offer innovative storage-level filtering, column-based compression and storage, massively parallel processing architectures, expanded use of memory-based caches, and, in some cases, solid state disks to bolster performance and availability for analytic workloads. Today, these "analytic platforms" are turbo-charging BI deployments and, in many cases, enabling BI professionals to deliver solutions that weren't possible before.

As proof of the power of these new purpose-built analytical systems, the biggest vendors in high-tech have invaded the market, picking off leading pureplays before they've even fully germinated. In the past nine months, Microsoft, IBM, Hewlett Packard, Teradata, SAP, and EMC purchased analytic platform vendors, while Oracle built its own with hardware from Sun Microsystems, which it acquired in 2009. (See "Jockeying for Position in the Analytic Platform Market.")

Mainstream Market. When viewed as a whole, the BI market has clearly moved from the early adopter phase to the early mainstream. The watershed moment came in 2007, when the biggest software vendors in the world--Oracle, SAP, and IBM--acquired the leading BI vendors--Hyperion, Business Objects, and Cognos, respectively. Also, the plethora of advertisements about BI capabilities on television (e.g., IBM's Smarter Planet campaign) and in major consumer magazines (e.g., SAP and SAS Institute ads) reinforces the maturity of BI as a mainstream market. BI is now front and center on the radar screen of most CIOs, if not CEOs, who want to better leverage information to make smarter decisions and gain a lasting competitive advantage.

The Future. At this point, some might wonder if there is much headroom left in the BI market. The last 20 years have witnessed a dizzying array of technology innovations, products, and methodologies. It can't continue at this pace, right? Yes and no. The BI market has surprised us in the past. Even in recent years as the BI market consolidated--with big software vendors acquiring nimble innovators--we've seen a tremendous explosion of innovation. BI entrepreneurs see a host of opportunities, from better self-service BI tools that are more visual and intuitive to use to mobile and cloud-based BI offerings that are faster, better, and cheaper than current offerings. Search vendors are making a play for BI, as are platform vendors that promise data center scalability and availability for increasingly mission-critical BI workloads. And we still need better tools and approaches for querying and analyzing unstructured content (e.g., documents, email, clickstream data, Web pages) and for delivering data faster as our businesses increasingly compete on velocity and our data volumes become too large to fit inside shrinking batch windows.

Next week, Beye Research will publish a report of mine that describes a new BI Delivery Framework for the next ten years. In that report, I describe a future-state BI environment that contains not just one intelligence (business intelligence) but four (business, analytic, continuous, and content intelligence) that BI organizations will need to support or interoperate with in the near future. Stay tuned!


Posted March 18, 2011 2:29 PM

It's been a while since I've had the opportunity to examine Informatica as a company and its entire product portfolio, which has blossomed into a broad-based data integration platform. After spending a day with company executives in Menlo Park, CA, this week and learning about the company's business and technology strategy, I am bullish on Informatica.

Business Strategy

From a business perspective, the numbers speak for themselves. With $650 million in annual revenue, Informatica has grown at a healthy clip (20% CAGR for the past five years) and maintained high operating margins (20%+ during the past 10 quarters and 31% in the most recent quarter), well above the industry average. And it believes it can continue to grow license revenue 20% a year for the foreseeable future.

Financial analysts repeatedly ask me "Who will buy Informatica?" At this point, the better question is "Who will Informatica buy?" and "Will Informatica expand beyond its core data integration market?"

Informatica has already made about a half-dozen strategic acquisitions in recent years, expanding its portfolio with data quality, master data management, B2B data exchange, messaging, lifecycle management, and complex event processing. CEO Sohaib Abbasi believes there are still many potential growth opportunities in the data integration market. His response to the above questions: "There is no need for us to consider ourselves anything other than a data integration vendor."

The Cloud. For one, Informatica sees the Cloud as a huge growth market for the company. Informatica made an early bet on the Cloud in 2007, and it's already paying dividends. Informatica has 1,300 companies using its Informatica Cloud service, which is a lightweight version of its on-premises data quality and data integration software. Most Informatica Cloud customers are small- and medium-sized businesses (SMBs), and three-quarters are new customers. Informatica believes there is significant upsell opportunity with this new customer base.

"Given the potential growth of the Cloud, we believe we are in the same position Oracle was in 1982 right before the relational database market took off," said Abbasi.

Future Investments. Beyond the Cloud, Informatica sees significant new data integration opportunities in the emerging markets for mobile computing and social networking. James Markarian, Informatica's chief technology officer, hinted that Informatica may deliver end-user applications that exploit these new deployment channels beyond offering core data integration services. In addition, Markarian mentioned other areas of potential investment, including Hadoop, search, semantics, security, workflow, Edge computing, and business process management.

Technology Trends

Integration. On the technology side, I was impressed with the degree to which Informatica is integrating the products in its portfolio. The mantra is design once and deploy anywhere. For example, one customer, Smith and Nephew, built rules in Informatica's data quality product, and then reused the same rules when it purchased and deployed Informatica's MDM product. With reusable rules, customers can safeguard their earlier investments as they expand their Informatica footprint.

Self-Service DI. I was also intrigued by its desire to improve business-IT collaboration and promote "self service data integration." Last year, it deployed Informatica Analyst, a browser-based application that makes it possible for business analysts to profile data sets and create data quality rules. It will soon expand Informatica Analyst to support data mapping. Informatica Cloud also empowers business analysts to move and transform data using a very friendly, wizard-driven interface.

Self service is great in theory, but almost always falters in reality, unless there is sufficient governance to ensure that end-user development doesn't spiral out of control and wreak havoc with information consistency and manageability. So, I was very happy to see that both products had sufficient controls (e.g., role-based access control and audit trails) to manage end-user involvement without restricting their ability to create new objects and rules.
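
To illustrate what I mean by "sufficient controls," here is a generic sketch of the two mechanisms mentioned above: a role-based check on who may do what, and an audit trail of what was done. The roles and actions are hypothetical and are not meant to represent Informatica's actual security model.

    # Generic sketch of self-service governance controls: role-based access
    # checks plus an audit trail. Roles, actions, and targets are hypothetical.
    from datetime import datetime, timezone

    PERMISSIONS = {
        "business_analyst": {"create_rule", "edit_rule", "profile_data"},
        "developer": {"create_rule", "edit_rule", "deploy_mapping"},
        "viewer": {"profile_data"},
    }
    AUDIT_LOG = []

    def perform(user, role, action, target):
        allowed = action in PERMISSIONS.get(role, set())
        AUDIT_LOG.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user, "role": role, "action": action,
            "target": target, "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{role} may not {action}")
        return f"{action} applied to {target}"

    if __name__ == "__main__":
        print(perform("jdoe", "business_analyst", "create_rule", "customer_email_format"))
        print(AUDIT_LOG[-1])

The point is that the analyst stays free to create new objects and rules, while every action is checked and recorded.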

More importantly, I saw the early germination of governance mechanisms that enable a business analyst and IT developer to collaborate on data integration tasks, eliminating much of the back-and-forth and miscommunication that currently hampers and slows down data integration work. For example, Informatica Analyst makes it possible for business analysts to create "specifications" for data quality rules (and soon data mappings), test the output, and then pass them to IT developers to flesh out in virtual tables. The developers, in turn, pass their work back to the business analyst to evaluate and test before anything moves into production. Essentially, Informatica Analyst is the basis for a good prototyping environment.

It would be ideal if Informatica added workflow to both the Informatica Analyst and Informatica Cloud products to further cement the ties between analyst and developer and foster a true collaborative and managed environment for creating data integration objects. It also would be nice to take these objects and turn them into services available to other users with appropriate permissions and other applications in the Informatica portfolio. This shouldn't be too hard.

Summary

While Informatica faces strong competition in the data integration market from behemoths such as IBM, Oracle, and SAP, as well as a host of small, nimble vendors, such as SnapLogic, Jitterbit, Boomi, Denodo, Kapow, Pentaho, Expressor, DataFlux, and Pervasive, the company is in a strong market position. Perhaps Informatica's greatest asset is not its products, but its people and processes. It has a capable, stable management team that has had little turnover compared to other Silicon Valley firms. And its disciplined, focused approach to the market means it has resisted the temptation to pursue new opportunities outside its core competency and take shortcuts on the path to growth.


Posted February 11, 2011 6:55 AM

As companies grapple with the gargantuan task of processing and analyzing "big data," certain technologies have captured the industry limelight, namely massively parallel processing (MPP) databases, such as those from Aster Data and Greenplum; data warehousing appliances, such as those from Teradata, Netezza, and Oracle; and, most recently, Hadoop, an open source distributed file system that uses the MapReduce programming model to process key-value data in parallel across large numbers of commodity servers.
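
For those who haven't looked under Hadoop's hood, the MapReduce model itself is simple: a map function emits key-value pairs, the framework shuffles and groups them by key, and a reduce function aggregates each group. The single-machine sketch below (with made-up sales records) shows just the model; Hadoop's contribution is running these same phases in parallel across many commodity servers with built-in failover.

    # Single-machine sketch of the MapReduce model: map emits (key, value) pairs,
    # a shuffle groups them by key, and reduce aggregates each group. Hadoop runs
    # the same three phases distributed across many commodity servers.
    from collections import defaultdict

    RECORDS = [
        {"region": "EMEA", "sales": 120.0},
        {"region": "AMER", "sales": 200.0},
        {"region": "EMEA", "sales": 80.0},
    ]

    def map_phase(record):
        yield (record["region"], record["sales"])    # emit key-value pairs

    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        return key, sum(values)                       # aggregate per key

    if __name__ == "__main__":
        pairs = [kv for rec in RECORDS for kv in map_phase(rec)]
        for key, values in shuffle(pairs).items():
            print(reduce_phase(key, values))          # ('EMEA', 200.0), ('AMER', 200.0)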

SMP Machines. Missing in action from this list is the venerable symmetric multiprocessing (SMP) machine, which parallelizes operations across multiple CPUs (or cores). The industry today seems to favor "scale out" parallel processing approaches (where processes run across many commodity servers) rather than "scale up" approaches (where processes run on a single server). However, with the advent of multi-core servers that can pack upwards of 48 cores into a single box, the traditional SMP approach is worth a second look for big data analytics jobs.

The benefits of applying parallel processing within a single server versus multiple servers are obvious: reduced processing complexity and a smaller server footprint. Why buy 40 servers when one will do? MPP systems require more boxes, which require more space, cooling, and electricity. Also, distributing data across multiple nodes chews up valuable processing time, and recovering from node failures, which are more common when you string together dozens, hundreds, or even thousands of servers into a single coordinated system, adds overhead that further reduces performance.

Multi-Core CPUs. Moreover, since chipmakers maxed out the processing frequency of individual CPUs in 2004, the only way they can deliver improved performance is by packing more cores into a single chip. Chipmakers started with two-core chips, then quad-cores, and now eight- and 16-core chips are becoming commonplace.

Unfortunately, few software programs that can benefit from parallelizing operations have been redesigned to exploit the tremendous amount of power and memory available within multi-core servers. Big data analytics applications are especially good candidates for thread-level parallel processing. As developers recognize the untold power lurking within their commodity servers, I suspect next year that SMP processing will gain an equivalent share of attention among big data analytic proselytizers.
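
As a simple illustration of the "scale up" idea, the sketch below fans a CPU-bound scoring job out across all the cores of a single box. I'm using Python's process pool purely for convenience (CPython threads won't parallelize CPU-bound work); an engine like DataRush, discussed below, does the equivalent with fine-grained threads inside the JVM. The scoring function is a stand-in for real analytic work.

    # Fan a CPU-bound job out across all cores of a single server ("scale up").
    # A process pool stands in here for the thread-level parallelism a JVM-based
    # engine would use; the scoring function is a placeholder for real analytics.
    import os
    from concurrent.futures import ProcessPoolExecutor

    def score_chunk(chunk):
        # Placeholder for an expensive per-record computation
        return sum(x * x for x in chunk)

    def parallel_score(data, workers=None):
        workers = workers or os.cpu_count()
        size = max(1, len(data) // workers)
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(score_chunk, chunks))

    if __name__ == "__main__":
        data = list(range(1_000_000))
        print(f"{os.cpu_count()} cores, total score: {parallel_score(data)}")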

Pervasive DataRush

One company that is on the forefront of exploiting multi-core chips for analytics is Pervasive Software, a $50 million software company best known for its Pervasive Integration ETL software (which it acquired from Data Junction) and Pervasive PSQL, its embedded database (a.k.a. Btrieve).

In 2009, Pervasive released a new product, called Pervasive DataRush, a parallel dataflow platform designed to accelerate performance for data preparation and analytics tasks. It fully leverages the parallel processing capabilities of multi-core processors and SMP machines, making it unnecessary to implement clusters (or MPP grids) to achieve suitable performance when processing and analyzing moderate to heavy volumes of data.

Sweet Spot. As a parallel data flow engine, Pervasive DataRush is often used today to power batch processing jobs, and is particularly well suited to running data preparation tasks (e.g. sorting, deduplicating, aggregating, cleansing, joining, loading, validating) and machine learning programs, such as fuzzy matching algorithms.
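
To show the shape of that work, here is a stripped-down data preparation dataflow: validate, deduplicate, then aggregate a stream of (hypothetical) transaction records. Each step is just a composable operator; the point of an engine like DataRush is to run operators like these in parallel across cores, so treat this as an illustration of the dataflow style rather than of the DataRush API.

    # The shape of a data-preparation dataflow: validate -> deduplicate -> aggregate.
    # Each step is a simple operator over a stream of records; a dataflow engine
    # parallelizes operators like these across cores. Fields are hypothetical.
    from collections import defaultdict

    def validate(records):
        return (r for r in records if r.get("customer_id") and r.get("amount", 0) >= 0)

    def deduplicate(records, key="txn_id"):
        seen = set()
        for r in records:
            if r[key] not in seen:
                seen.add(r[key])
                yield r

    def aggregate(records, by="customer_id", measure="amount"):
        totals = defaultdict(float)
        for r in records:
            totals[r[by]] += r[measure]
        return dict(totals)

    if __name__ == "__main__":
        raw = [
            {"txn_id": 1, "customer_id": "A", "amount": 10.0},
            {"txn_id": 1, "customer_id": "A", "amount": 10.0},  # duplicate
            {"txn_id": 2, "customer_id": "B", "amount": -5.0},  # fails validation
            {"txn_id": 3, "customer_id": "A", "amount": 2.5},
        ]
        print(aggregate(deduplicate(validate(raw))))  # {'A': 12.5}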

Today, DataRush will outperform Hadoop on complex processing jobs that address data volumes ranging from 500GB to tens of terabytes. It is not yet geared to handling hundreds of terabytes to petabytes of data, which is the territory of MPP systems and Hadoop. However, as chipmakers continue to add more cores to chips and when Pervasive releases DataRush 5.0 later this year, which will support small clusters, DataRush's high-end scalability will continue to increase.

Architecture. DataRush is not a database; it's a development environment and execution engine that runs in a Java Virtual Machine. Its Eclipse-based development environment provides a library of parallel operators for developers to create parallel dataflow programs. Although developers need to understand the basics of parallel operations--such as when it makes sense to partition data and/or processes based on the nature of their application--DataRush handles all the underlying details of managing threads and processes across one or more cores to maximize utilization and performance. As you add cores, DataRush automatically readjusts the underlying parallelism without forcing the developer to recompile the application.

Versus Hadoop. To run DataRush, you feed the execution engine formatted flat files or database records and it executes the various steps in the dataflow and spits out a data set. As such, it's more flexible than Hadoop, which requires data to be structured as key-value pairs and partitioned across servers, and MapReduce, which forces developers to use one type of programming model for executing programs. DataRush also doesn't have the overhead of Hadoop, which requires each data element to be duplicated in multiple nodes for failover purposes and requires lots of processing to support data movement and exchange across nodes. But like Hadoop, it's focused on running predefined programs in batch jobs, not ad hoc queries.

Competitors. Perhaps the closest competitors to Pervasive DataRush are Ab Initio, a parallel ETL tool, and Syncsort, a high-speed sorting engine. But these tools were developed before the advent of multi-core processing and don't exploit it to the same degree as DataRush. Plus, DataRush is not focused just on back-end processing; it can handle front-end analytic processing as well. Its dataflow development environment and engine are generic. DataRush actually makes a good complement to MPP databases, which often suffer from a data loading bottleneck. When used as a transformation and loading engine, DataRush can achieve 2TB/hour throughput, according to company officials.

Despite all the current hype about MPP and scale-out architectures, it could be that scale-up architectures that fully exploit multi-core chips and SMP machines will win the race for mainstream analytics computing. Although you can't apply DataRush to existing analytic applications (you have to rewrite them), it will make a lot of sense to employ it for most new big data analytics applications.


Posted January 4, 2011 11:17 AM