
Blog: Mark Madsen

Mark Madsen

Open source is becoming a required option for consideration in many enterprise software evaluations, and business intelligence (BI) isn't exempt. This blog is the interactive part of my Open Source expert channel for the Business Intelligence Network where you can suggest and discuss news and events. The focus is on open source as it relates to analytics, business intelligence, data integration and data warehousing. If you would like to suggest an article or link, send an e-mail to me at open_source_links@ThirdNature.net.

About the author

Mark, President of Third Nature, is a former CTO and CIO with experience working both in IT and for vendors, including a stint at a company used as a Harvard Business School case study. Over the past decade, Mark has received awards for his work in data warehousing, business intelligence and data integration from the American Productivity & Quality Center, the Smithsonian Institute and TDWI. He is co-author of Clickstream Data Warehousing and lectures and writes about data integration, business intelligence and emerging technology.


Talend announced an open source data quality offering this week at the TDWI conference in San Diego. The company is rapidly filling out the basic components needed in a complete data integration suite. In June they added change data capture (CDC) features to Open Studio, their ETL tool. They also added Talend Open Profiler for data profiling. While Talend doesn’t offer a complete suite yet, these new offerings are a big expansion of functionality in a short time. The ETL and data profiling tools are available today, but Data Quality won’t be ready for download until September.

Talend Open Profiler offers many of the features you would expect, similar to what you find in the tools from Oracle and Microsoft that ship with their databases.

The data quality product will offer basic functionality for data de-duplication, standard formatting requirements (as with phone numbers and addresses) and address validation. I spoke with Yves de Montcheuil, Talend’s VP of marketing, before the announcement, and he indicated that they are still working on partnerships to provide more advanced features via external data cleansing products and data providers. Expect some partnerships to be announced in the next few months as they work out the details.
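To make "standard formatting" and "de-duplication" concrete, here is a minimal Python sketch of the general idea. It is not Talend's implementation and does not use any Talend API; the function names, the US-style phone format and the sample records are all assumptions made up for illustration.

```python
import re

def normalize_phone(raw, default_country="1"):
    """Reduce a US-style phone number to digits and format it as +1-AAA-EEE-NNNN.

    Hypothetical helper: real tools handle many more formats and countries.
    """
    digits = re.sub(r"\D", "", raw)          # strip everything except digits
    if len(digits) == 10:                    # no country code supplied
        digits = default_country + digits
    if len(digits) != 11 or not digits.startswith(default_country):
        return None                          # unparseable: leave for manual review
    return f"+{digits[0]}-{digits[1:4]}-{digits[4:7]}-{digits[7:]}"

def deduplicate(records):
    """Keep the first record seen for each (normalized name, normalized phone) key."""
    seen = set()
    unique = []
    for rec in records:
        name_key = " ".join(rec["name"].lower().split())  # case- and whitespace-insensitive
        key = (name_key, normalize_phone(rec["phone"]))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Invented sample data: the first two rows describe the same person.
records = [
    {"name": "Ann Lee", "phone": "(619) 555-0100"},
    {"name": "ann  lee", "phone": "619.555.0100"},
    {"name": "Bob Ray", "phone": "1-619-555-0101"},
]
cleaned = deduplicate(records)
```

Exact-key matching like this only catches duplicates that become identical after formatting. The more advanced features mentioned above, such as fuzzy matching and address validation against reference data, are where external products and data providers come in.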

Since Open Studio / Integration Suite can make web service calls, you can also use third-party cleansing services from within your jobs. StrikeIron offers a number of commercial data cleansing services, as well as reference data services.

Talend Data Quality will follow the same licensing model as Talend Integration Suite, with an open edition and a commercial edition provided as a subscription. No word yet on what the feature differences are between the two. Personally, I don’t like these feature holdback models. I understand the rationale, but I still believe that they can lead to conflicts with contributors and generate the perception that a product is crippleware.

While this isn’t the first open source data profiling or data quality project available, it’s the first that is integrated into a single suite and, more important to many IT shops, commercially supported. It’s also arguably the most functional. There are a handful of other open source projects in this area, so if you’re not concerned about commercial support, it can’t hurt to explore the following:
Open Source Data Quality
Mural standardization and match engines

I’ll do a more detailed look at Mural another time. It’s relatively new and focuses on all of the technical capabilities underlying MDM. My reading of the documentation indicates that there’s a lot of interesting stuff here, but there may be some pretty big problems as well.

A few people have asked me about InfoSolve over the past year. They offer OpenDQ, which, despite claims to the contrary, is not open source. I call this “fauxpen source”. The source isn’t available unless you do a project with them, and it is unsupported. In essence, you are buying a source code license as part of a project - where you get to support the product you purchase. If that’s the case, you may as well go buy a regular commercial product since that offers more value for the price. If the capital cost of a commercial product is too high, then look to database offerings or open source, where you do have support and you also have a community of users and developers.

Posted August 20, 2008 9:00 AM