Open Source Data Warehousing: Get Ready for Disruption

Originally published November 27, 2007

Business intelligence and data warehousing are in the path of open source development. Over the years, open source has moved from basic developer tools to infrastructure to application tools, and it is now moving into applications.

Open source has been around for a long time. The official incarnation started with the founding of the Free Software Foundation in 1985, followed by the first GNU General Public License (GPL) in 1989. The layers of open source that underlie what is needed for business intelligence (BI) have matured to the point that they are starting to take on the major players in our market.

Maturity Level

Tools at different layers of the data warehouse architecture vary in maturity. As you would expect, basic development tools and operating systems are the most mature elements. Linux accounts for almost ten times as many server installations as the mainstream Unix vendors.

While open source databases like MySQL, Ingres and Postgres are mature when deployed in very large transaction-processing environments, they are not as mature for use in the data warehouse. There are two problem areas that open source databases need to address in order to work well as data warehouse platforms.

Open source databases are missing many of the features we take for granted in analytic environments, such as multiple indexing options, join optimizations, and good query optimizers for complex queries. Most BI tools have come to rely on these features, and the features are rare to non-existent in open source databases.

The databases also need to address weaknesses in handling large data volumes. In single-node configurations, the databases tend to have trouble dealing with very large tables, particularly when we throw more than a handful of users at them. The shared-nothing implementations in the open source world tend to rely on data partitioning across nodes, which is not a good solution for query environments where data from multiple nodes must be brought together. This is reminiscent of the database problems that warehouse DBAs encountered in the mid-1990s.
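To make the partitioning problem concrete, here is a minimal sketch. It assumes a hypothetical four-node cluster, a made-up orders table and a simple modulo placement rule; none of these come from a specific database. It counts how many rows would have to travel between nodes when a query joins or groups on a column other than the one used to partition the table.

```python
NUM_NODES = 4  # hypothetical cluster size, chosen for illustration


def node_for(key, num_nodes=NUM_NODES):
    """Place a row on a node using a simple modulo of its integer key."""
    return key % num_nodes


def rows_to_move(rows, stored_by, join_by, num_nodes=NUM_NODES):
    """Count rows stored on one node but needed on another for the join."""
    moved = 0
    for row in rows:
        home = node_for(row[stored_by], num_nodes)    # where the row lives
        target = node_for(row[join_by], num_nodes)    # where the join needs it
        if home != target:
            moved += 1
    return moved


# Hypothetical fact table: partitioned by order_id, queried by customer_id.
orders = [{"order_id": i, "customer_id": (i * 7) % 10} for i in range(1000)]

# Joining on the partitioning column needs no data movement at all.
same_key = rows_to_move(orders, "order_id", "order_id")

# Joining on any other column forces rows to cross the network.
other_key = rows_to_move(orders, "order_id", "customer_id")

print(same_key, other_key)
```

In this toy setup the first count is zero and the second is not. That gap is the cost that ad hoc warehouse queries run into, because they rarely line up with a single partitioning key.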

Managing Open Source Databases

The operational side of managing a large warehouse is improving in the open source database world. We need the ability to load, unload, archive and back up these large databases and maintain high uptime. With commercial databases, much of this is done with the help of third-party vendors who have integrated tools. Their support of open source databases is trailing but has picked up over the past two years. Enterprise DBAs need this support in order to manage their complex environments.

Open Source BI Tools

Business intelligence tools have developed relatively quickly. Users report that they are mostly satisfied with the features in the tools. The elements they want to see added are largely on the administration and management side, such as easier server administration or metadata-based query generation. While their features are not yet comparable to those of the commercial BI products, the open source BI tools are good enough for many people to use today.

ETL Tools

ETL tools received the least attention until relatively recently. Just two years ago, the selection was extremely limited and not very robust. Today there are several ETL tools working their way up to the level of reliability, performance and features needed to support a data warehouse environment. The speed with which these products have matured is surprising, given their late start relative to other tools.

Now that there are open source projects for every piece of the data warehouse from the operating system to end user analysis and data mining, IT shops are starting to pay attention. It’s rare to see a company deploy fully on open source, or to change from a commercial BI product to open source. More companies are selecting open source BI for specific needs or point projects where the commercial products either don’t have the right architecture and features or where the cost of a commercial deployment would be prohibitive.

Integration

One area where open source BI tools tend to be better is their integration capabilities. It is often easier to add BI features to OLTP applications using open source tools. They are more configurable and embeddable than their commercial counterparts. Java is the prevalent language for most enterprise IT developers, and many of the tools are designed to work well in this environment. Contrast this with commercial BI products, where embedding and integration into applications is challenging as well as expensive.

Open Source Adoption

Many surveys have been done on open source adoption in IT. It’s no surprise that the number one reason cited in every survey is cost reduction. Beyond cost, the reasons most often cited are avoiding proprietary vendor lock-in, the ability to adapt or customize, easier integration, better performance and extending the life of existing hardware.

If we look at cost reduction, we see that for most of the open source stack the focus is on infrastructure technologies like operating systems, databases and middleware. The open source tools for these have been around longer, are more mature and comparable to their commercial counterparts.

At the same time, those products are all becoming commodities and facing price pressure and easier substitution. According to economic theory, in a market for a perfect commodity like software (meaning the cost of producing each unit after the first is zero), where copying features from a competitor is trivial, the predicted price approaches a minimum. Open source is proving this theory to be true.

The lower limit on price is defined by the minimum amount needed to support ongoing maintenance of the software. Broadly speaking, this is relatively low. Most open source projects have a very small number of people actively working on the code. A much broader community contributes their help in finding and explaining bugs, or voting on new features.

In corporate economic terms, feature R&D and maintenance become externalities. The cost is pushed outside the organization or shared with a large outside group. Enterprise software vendors generally don’t have this option so they are locked into higher prices and higher maintenance costs.
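The price-floor argument can be put into rough arithmetic. The sketch below assumes the floor is simply the annual maintenance cost spread across paying customers. The team sizes, the cost per engineer and the customer count are hypothetical figures chosen for illustration, not data from any vendor or project.

```python
def price_floor(annual_maintenance_cost, paying_customers):
    """Lowest sustainable yearly price per customer: maintenance cost shared evenly."""
    if paying_customers <= 0:
        raise ValueError("need at least one paying customer")
    return annual_maintenance_cost / paying_customers


COST_PER_ENGINEER = 150_000  # hypothetical fully loaded yearly cost
CUSTOMERS = 1_000            # hypothetical number of paying customers

# Small core team, with bug reports and feature feedback coming from the community.
small_team = price_floor(5 * COST_PER_ENGINEER, CUSTOMERS)

# Vendor carrying all R&D and maintenance in-house.
large_org = price_floor(200 * COST_PER_ENGINEER, CUSTOMERS)

print(small_team)  # 750.0
print(large_org)   # 30000.0
```

With the same customer base, the floor scales directly with the cost the organization carries internally, which is why pushing R&D and maintenance outside the organization lowers the minimum price.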

The logical conclusion is that, over time, every software product that becomes a commodity will have an open source counterpart. We’re seeing this today with BI and ETL tools. Most commercial BI and ETL tools do much more than a broad segment of the market requires and are priced accordingly.

This means there’s an over-served market segment of people who simply want a tool that is good enough to meet their needs – without all the extras. This is where we are today with open source BI and ETL tools. Over time, these tools will improve and gain broader adoption.

If you want proof of this, look to the database vendors who are giving away ETL tools with their databases. The clue that business intelligence is a commodity comes from the biggest commercial software vendor: Microsoft. It competes only in commodity markets, so Microsoft’s entrance into business intelligence is the best indicator that enterprise BI is becoming a mass-market commodity. This is good news for adopters of open source because it means they’re moving in the same direction as the market.

  • Mark Madsen

    Mark, President of Third Nature, is a former CTO and CIO with experience working both in IT and for vendors, including a stint at a company used as a Harvard Business School case study. Over the past decade, Mark has received awards for his work in data warehousing, business intelligence and data integration from the American Productivity & Quality Center, the Smithsonian Institution and TDWI. He is co-author of Clickstream Data Warehousing and lectures and writes about data integration, business intelligence and emerging technology.

