Originally published December 2, 2008
Since the arrival of data warehouses on the enterprise scene more than a decade ago, these platforms have held out the tantalizing promise of concentrating critical data into easy-to-find, centralized locations, enabling business people at all levels to base their decisions on concrete, analytical facts rather than on intuition and uninformed guesswork.
However, this vision has not yet come to pass for most organizations. It’s not because the data warehouse concept is flawed: data warehouses have passed muster by giving organizations a means to archive and analyze data from multiple sources across the enterprise. Rather, data warehouse technology and associated methodologies remain out of reach for most companies.
For many organizations, data warehouses are simply too costly to buy, too costly to implement and too costly to maintain. Data warehousing is still a luxury of deep-pocketed organizations, which have the budgets, staffs and patience needed to purchase, install and maintain these platforms.
Many of the challenges we’re seeing with data warehousing have a familiar ring to them, however. Other types of systems and technologies over the years have also proven to be costly, slow-to-implement beasts that sapped IT budgets and organizational resources and delivered little in the end. Operating systems, for one, were once many times more expensive than the hardware they ran on, and held companies captive to particular vendors’ ecosystems. Middleware was only a luxury for well-heeled enterprises, and databases required entire staffs of trained and certified specialists to manage the data within.
These and many more markets have been completely turned upside down by the disruptive force known as open source. Open source not only delivered low or no-cost, liberally licensed software that opened up capabilities to even the smallest companies, but also opened up code and functionality to a community process that ensured solutions remained true to prevailing standards. In the process, solutions became far more straightforward and flexible to implement, unencumbered by unwanted bells and whistles.
Now the open source revolution has reached the data warehousing space. Not only are there tools and technologies available for the building blocks of data warehouse implementations – such as database management systems; extract, transform and load (ETL) tools; and analytic tools – but the data warehouse itself has been open sourced.
Open source solutions that support data warehouse projects include the following:
The impressive growth of the open source building blocks of successful data warehouse deployments suggests that the time is ripe for the open source data warehouse. Gartner, for example, reports that there has been a significant uptake of open source DBMS engines in recent years. The analyst firm found that 47% of companies it surveyed have already adopted open source databases, and another 19% are considering adopting within a 12-month period.
In many cases, open source databases are being adopted within markets that have long been ignored by large database vendors. However, there is even an open source database presence within organizations dominated by active commercial database implementations. A study conducted among 226 members of the Independent Oracle Users Group (IOUG) found that more than one-third of these sites, 35%, also had an open source database such as MySQL running on their premises.
Accompanying the emergence of open source databases are open source ETL and analytic/business intelligence tools that are gaining enterprise adoption. Gartner estimates that about 11% of the companies it surveyed are using open source ETL tools, with another 16% considering such tools over the coming months. Open source ETL tools include Pentaho’s Kettle, Talend, Clover.ETL and Octopus.
In the business intelligence (BI) and analytic tools space, Gartner says 9% of the companies it surveyed already use open source BI solutions, and another 18% are considering adoption within a 12-month period. There are now a number of open source BI or analytic applications on the market, led by vendors such as Pentaho and JasperSoft. In a survey of 500 companies, Ventana Research also confirmed that interest in open source BI is widespread and growing, with 21% of organizations involved with open source business intelligence having already deployed applications. And, tellingly, a majority also stated that they would consider open source business intelligence for any future project.
With such widespread adoption of, and comfort with, open source databases and open source analytics, it seems only natural that the open source data warehouse should also now emerge. What began with vendors building proprietary data warehouse products based on open source databases (such as MySQL, PostgreSQL, and Ingres) has now evolved into the introduction of full-fledged open source data warehouse solutions and their accompanying communities.
A good example is the recent introduction of a product called ICE (Infobright Community Edition) and the accompanying community at www.infobright.org. Forum posts suggest a growing community of users, including users who are knowledgeable about databases but new to data warehousing. Just as MySQL expanded the market for databases, open source products such as ICE may do the same for data warehouses, given rapidly growing data volumes and the need for analytics.
With open source data warehousing now an option, the barriers to small to medium-size business adoption of data warehousing technology will fall, just as the barriers to server operating system and middleware adoption fell a few years ago as open source solutions such as Linux and Apache gained momentum. Open source data warehouses provide new, affordable options to departments or business units of larger organizations that need more lightweight, rapid deployments of analytical applications.
The open source data warehouse employs the same licensing model, the same community development process and the same degree of openness as other types of open source software. Simply put, most leading open source products are offered either as a free download or, for a nominal fee, as a fully supported system. In either case, there is no limit to the number of copies a company may install or implementations it may build with the software, and users have large, committed communities they can turn to for additional support or upgrades.
As is the case with the other popular open source technologies, end users of open source data warehouses will have more access to a knowledgeable and committed community of developers and analysts that can share and deliver the latest innovations for the software. New open source data warehouses also overcome the current limitations of open source databases, which typically do not have the scalability required for handling multi-terabyte environments.
Open source data warehouses can address these issues with smaller footprints and fewer administrative resources to operate. Consider the benefits of the open source model as applied to data warehouses:
An open source data warehouse is well suited for small to mid-sized companies that need to manage and provide insights for a large volume of data, but don’t have the budgets or resources needed to implement and support a large proprietary data warehouse implementation. In addition, open source data warehouses offer a well-targeted solution to departments or business units of larger organizations seeking rapidly deployable solutions for business issues as they arise. Here are ways to get the most out of an open source data warehouse implementation:
With the arrival of open source, data warehouse solutions can be applied to green-field environments that have never been able to take advantage of the technology. Now, the underserved and unserved portions of the market – from small to medium-size businesses to departments within large organizations – can make the promise of data warehousing a reality.