Disruptor: The Rise of the Open Source Data Warehouse by Claudia Imhoff - BeyeNETWORK


Disruptor: The Rise of the Open Source Data Warehouse

Originally published December 2, 2008

Since data warehouses arrived on the enterprise scene more than a decade ago, they have held out the tantalizing promise of concentrating critical data in easy-to-find, centralized locations, enabling business people at all levels to base their decisions on concrete, analytical facts rather than on intuition and guesswork.

However, this vision has not yet come to pass for most organizations. It’s not because the data warehouse concept is flawed – data warehouses have passed muster by giving organizations a means to archive and analyze data from multiple sources across the enterprise. Rather, data warehouse technology and the associated methodologies remain out of reach for most companies.

For many organizations, data warehouses are simply too costly to buy, too costly to implement and too costly to maintain. Data warehousing is still a luxury of deep-pocketed organizations, which have the budgets, staff and patience needed to purchase, install and maintain these platforms.

Disruptive Force: Open Source

Many of the challenges we’re seeing with data warehousing have a familiar ring to them, however. Other types of systems and technologies over the years have also proven to be costly, slow-to-implement beasts that sapped IT budgets and organizational resources and delivered little in the end. Operating systems, for one, were once many times more expensive than the hardware they ran on, and held companies captive to particular vendors’ ecosystems. Middleware was a luxury reserved for well-heeled enterprises, and databases required entire staffs of trained and certified specialists to manage the data within.

These and many more markets have been completely turned upside down by the disruptive force known as open source. Open source not only delivered low or no-cost, liberally licensed software that opened up capabilities to even the smallest companies, but also opened up code and functionality to a community process that ensured solutions remained true to prevailing standards. In the process, solutions became far more straightforward and flexible to implement, unencumbered by unwanted bells and whistles.

Now the open source revolution has reached the data warehousing space. Not only are tools and technologies available for the building blocks of data warehouse implementations – such as database management systems, extract, transform and load (ETL) tools, and analytic tools – but the data warehouse itself has also been open sourced.

Open source solutions that support data warehouse projects include the following:

Database Systems

The impressive growth of the open source building blocks of successful data warehouse deployments suggests that the time is ripe for the open source data warehouse. Gartner, for example, reports that there has been a significant uptake of open source DBMS engines in recent years. The analyst firm found that 47% of the companies it surveyed have already adopted open source databases, and another 19% are considering adopting them within a 12-month period.

In many cases, open source databases are being adopted within markets that have long been ignored by large database vendors. However, there is even an open source database presence within organizations dominated by active commercial database implementations. A study conducted among 226 members of the Independent Oracle Users Group (IOUG) found that more than one-third of these sites, 35%, also had an open source database such as MySQL running on their premises.

ETL Tools

Accompanying the emergence of open source databases are open source ETL and analytic/business intelligence tools that are gaining enterprise adoption. Gartner estimates that about 11% of the companies it surveyed are using open source ETL tools, with another 16% considering such tools over the coming months. Open source ETL tools include Pentaho’s Kettle, Talend, Clover.ETL and Octopus.

Business Intelligence

In the business intelligence (BI) and analytic tools space, Gartner says 9% of the companies it surveyed already use open source BI solutions, and another 18% are considering adoption within a 12-month period. There are now a number of open source BI or analytic applications on the market, led by vendors such as Pentaho and JasperSoft. In a survey of 500 companies, Ventana Research also confirmed that interest in open source BI is widespread and growing, with 21% of organizations involved with open source business intelligence having already deployed applications. And, tellingly, a majority also stated that there were no future projects for which they would not consider open source business intelligence.

With such widespread adoption of, and comfort with, open source databases and open source analytics, it seems only natural that the open source data warehouse should also now emerge. What began with vendors building proprietary data warehouse products based on open source databases (such as MySQL, PostgreSQL, and Ingres) has now evolved into the introduction of full-fledged open source data warehouse solutions and their accompanying communities.

A good example is the recent introduction of a product called ICE (Infobright Community Edition) and the accompanying community at www.infobright.org. Forum posts suggest a growing community of users, including users who are knowledgeable about databases but new to data warehousing. Just as MySQL expanded the market for databases, open source products such as ICE may expand the market for data warehouses, given rapidly growing data volumes and the growing need for analytics.


With open source data warehousing now an option, the barriers that keep small and medium-size businesses from adopting data warehousing technology will fall, just as the barriers to server operating system and middleware adoption fell a few years ago as open source solutions such as Linux and Apache gained momentum. Open source data warehouses also provide new, affordable options to departments or business units of larger organizations that need more lightweight, rapid deployments of analytical applications.

The open source data warehouse employs the same licensing model, the same community development process and the same degree of openness as other types of open source software. Simply put, most leading open source products are offered either as a free download or, for a nominal fee, as a fully supported system. In either case, there is no limit to the number of licenses and implementations a company may make with the software, and users have large, committed communities they can turn to for additional support or upgrades.

As is the case with the other popular open source technologies, end users of open source data warehouses will have more access to a knowledgeable and committed community of developers and analysts that can share and deliver the latest innovations for the software. New open source data warehouses also overcome the current limitations of open source databases, which typically do not have the scalability required for handling multi-terabyte environments.

Open source data warehouses can address these issues with smaller footprints and fewer administrative resources to operate. Consider the benefits of the open source model as applied to data warehouses:

  • Open source data warehouses cost less up front, and cost less in terms of maintenance and support. Open source software products on the market today are, as a rule, far less expensive than their commercially licensed counterparts. In addition, developers and IT managers can download the original source code for these products and customize it or make changes that further streamline their operations.

  • Open source data warehouses employ skill sets that are widely available in the market. As a result, an organization with existing database or data warehouse expertise will not have to look further when a new open source data warehouse project is put into place.

  • Open source data warehouses promote greater standardization. Since open source code is transparent and community supported, it’s likely that important standards will consistently be supported across all versions and implementations. Proprietary formats cannot and will not be supported within such community settings.

  • Open source data warehouses are far more flexible. Open source licenses enable enterprises to expand the solutions to an unlimited number of users, unlike the typical per-user or per-processor charges of proprietary software packages. Companies can add users or extend projects at little or no additional cost. In addition, an end-user company does not need to fear being locked into a single vendor’s upgrade path with forced – and costly – migrations to new versions of the system.

  • Open source data warehouses benefit from the community effect. Open source solutions leverage communities of developers and innovators to advance development. New code and features are contributed back to the community, constantly increasing the range of new options available to end users. The application of a community approach to data warehousing – while breaking new ground – suits this environment well because there is a wide assortment of systems and data types that need to be integrated within data warehouse settings. It’s difficult for a single vendor to offer solutions for every integration problem that may exist. In addition, companies can look to their community to fix bugs or security flaws rapidly, often within days, instead of waiting weeks or months for the next security patch or service pack from a vendor.

  • Open source data warehouses can be implemented incrementally. There isn’t a need to “boil the ocean” with a mega project. A data manager needing to build new functions need not go to a budget committee seeking funding for a capability the business needed yesterday.  Projects can start small and build upon the success of implementations. This also alleviates the tendency to “overpromise,” which is all too often a necessary evil for acquiring optimal levels of funding for data warehouse projects. Open source data warehouses don’t require a lot of start-up funds and can be targeted at the most pressing business problems, growing as they deliver value.


An open source data warehouse is well suited for small to mid-sized companies that need to manage and provide insights for a large volume of data, but don’t have the budgets or resources needed to implement and support a large proprietary data warehouse implementation. In addition, open source data warehouses offer a well-targeted solution to departments or business units of larger organizations seeking rapidly deployable solutions for business issues as they arise. Here are ways to get the most out of an open source data warehouse implementation:

  • Open source and proprietary data warehouses need to co-exist. Open source data warehouses will augment, not replace, proprietary enterprise data warehouses. As described earlier, more than a third of the Oracle shops in a survey run open source databases such as MySQL. Often, these open source databases can be put in place to serve more tactical purposes, to complement or fill new needs that cannot be quickly or efficiently met by the more strategic proprietary database.

  • Look for large, active communities behind the product. Data warehouses – whether open source or proprietary – are complex projects since they involve touching data from every corner of the enterprise. Well-engaged, vibrant communities are an essential resource.

  • The open source data warehouse should be invisible to end users. Operational data warehouses – in which data is closely aligned with production data – are the fastest growing part of the data warehouse market. Data from the warehouse needs to be incorporated in real-time fashion with front-end applications, with little input from end users. In many cases, these “pervasive BI” users do not have technical backgrounds, and need as much ease of use as possible. By contrast, the analysts or “power users” who typically have been the main users of data warehouses in the past were adept at building massive queries. An open source data warehouse should be able to support the pervasive BI users with little tweaking required.

  • The open source warehouse should always support open standards. Some earlier “open source” data warehouses on the market, while based on the open source database PostgreSQL, added proprietary interfaces on top of it and backed away from their open source efforts. Open source data warehouses should be compatible with related open source environments.

  • Look for rapid deployability and ease of use. Look for open source data warehouse tools and platforms that employ data compression features, and have small hardware and software footprints, requiring minimal server and storage space to support terabytes’ worth of data.  Otherwise, maintenance costs may rise to levels seen with those of proprietary data warehouses.

  • Weigh the costs of transition. While open source data warehouses may be far less expensive than proprietary data warehouses in side-by-side comparisons, it’s still important to weigh transition costs and training costs as this new technology and approach are offered for the first time.

With the arrival of open source, data warehouse solutions can be applied to green-field environments that have never been able to take advantage of the technology. Now, the underserved and unserved portions of the market – from small to medium-size businesses to departments within large organizations – can make the promise of data warehousing a reality.


  • Claudia Imhoff
    A thought leader, visionary, and practitioner, Claudia Imhoff, Ph.D., is an internationally recognized expert on analytics, business intelligence, and the architectures to support these initiatives. Dr. Imhoff has co-authored five books on these subjects and writes articles (totaling more than 150) for technical and business magazines.

    She is also the Founder of the Boulder BI Brain Trust, a consortium of independent analysts and consultants (www.BBBT.us). You can follow them on Twitter at #BBBT.


