Data Warehousing After the Bubble

Originally published December 4, 2008

One thing is clear – Basel II and Sarbanes-Oxley did not work. Instead, credit default swaps (CDSs) were regarded as “free money” rather than insurance against which reserves had to be set aside to cover inevitable losses. This data was not captured in the underlying risk data warehouse because the mortgage and CDS products were not supposed to be a hazard. Oops. For my part, I am still trying to understand how a hardworking house cleaner and her house painter husband got a $624K mortgage – a paltry quarter of a million dollars was not enough? Mortgages were considered safe, long-term investments on the part of banks and other long-term loan originators. However, at the risk of 20-20 hindsight, of which there is no shortage, this approach dates from the days when banking was supposed to be boring and innovative financial instruments such as CDSs and packages of subprime mortgages were just a glimmer in the investment banker’s eye.

So what is the lesson here? Meaning is use, and data in isolation is worthless. Information is useful because it eliminates or reduces uncertainty. No data warehouse can contain all dimensions of a business, industry or market; and when so many variables are changing simultaneously, blind spots happen. The inside of a bubble is a lot more comfortable than the reality outside it. Things were not as rosy as they seemed, nor are they as dark as they now appear. People still need food to eat; transportation, including cars, to get around; and places to live. Mark Twain is reported to have said, “Buy land. They aren’t making any more of it.” I believe he added – “but make sure it is not under water.” The bursting of the bubble and the ongoing economic challenges will reinforce several trends in data warehousing.

Open source data warehousing. The downturn will accelerate open source’s readiness for enterprise deployment. High-availability functions such as “heartbeat,” which make data warehouses capable of supporting mission-critical applications that require mirroring, rollback, automatic failover, redo and related components, are now a requirement. These capabilities are a work in progress, but they are advancing at an accelerating pace, given the urgency of the situation. Open source databases – in particular MySQL from Sun and Postgres Plus from EnterpriseDB – are working their way through the enterprise as components of appliances, column-oriented data marts and related applications. These applications are often support-oriented rather than mission critical, but they provide a testing and training ground for the next generation of frontline infrastructure.

Open source is not for the technologically faint of heart. It is optimally deployed in connection with a support package from a vendor that is going to be available on weekends, holidays and when you least expect to need it. Still, the potential for disruption of the existing installed base of the standard relational databases is significant. While information technology is holding up relatively well amidst the recession, it is hard to see how that can continue when customers in finance, retail, hospitality, travel and manufacturing are taking it on the chin. In contrast, open source remains a bright spot where innovation promises to return improved productivity, which after all is the best way of creating new opportunities for cost reduction, efficiency and profitability.

Data warehousing in the clouds. As noted in my recent article, cloud computing has come up fast, with companies such as Amazon and Google stealing a march on information technology infrastructure stalwarts such as Dell, HP, IBM, Microsoft and Oracle. Cloud computing differs from all the usual suspects – the grid, software as a service (SaaS), and simple web hosting – by providing virtualization of the entire technology stack and a retail interface for the purchase of business applications in small increments with which to run a medium-sized business or perhaps an enterprise. The service level agreement (SLA) for the application in the cloud is one that a business person can understand, accommodating data persistence, system reliability, redundancy, security and business continuity. However, the catch is that the SLA is still in the process of being defined in an enterprise context. Thus, cloud computing is best suited for small and medium-sized organizations that can afford to be flexible about their requirements in order to save a few nickels on infrastructure.

Back to basics: data quality. Data quality remains an issue as some customers just disappear, leaving only an entry in the data warehouse to be cleared up. As rapid seismic changes in consumer behavior occur, other customers move into demographically different categories and no longer have the same marketing, buying or shopping profiles. Retail discovers the returning popularity of “lay away” plans, which require their own application profile. The basic question of data warehousing remains more important than ever – who is buying or using what product or service, and when and where are they doing so? In any crisis and breakdown of what is ordinary – in this case, the end of living beyond one’s means – the natural tendency is to overreact. In that respect, a single, high quality data point (“fact”) from a data warehouse is worth a thousand opinions. Stay the course.

Back to basics: front end. Perception of business value migrates in the direction of the user interface. In addition to enterprise front ends from SAP/Business Objects or IBM/Cognos, upstarts are offering engaging variations on the dashboard theme. For example, for those interested in new options, check out LogiXML and SiSense. Lyza Software is a front end vendor that blurs the front/back distinction in an interesting way, reaching back to data sources and providing ETL-like access in addition to analytics. Now layer open source on top of this market. Pentaho is more than an open source front end since it aspires to data mining and data integration results. However, it did get its start in reporting and dashboards. When successful, all the laborious work of upstream data integration will result in an "Aha!" experience as the business analyst gains an insight about customer relations, product offerings or market dynamics. A new and better user interface is not in itself the cause of the breakthrough. Without the work of integrating the upstream data, the result would not have been possible.

Back to basics: middle layer. Data integration is arguably a trend with many of the enterprise application integration (EAI), extract, transform and load (ETL) and customer data integration (CDI) service vendors leading the charge. Now layer open source on top of it. An interesting approach to open source ETL is provided by Apatar. In addition to a compelling pricing proposition, Apatar is building a community of users by means of a shareable database of ETL maps submitted and maintained, in the spirit of open ETL, in its Forge database. While it is improbable that anyone else’s application is exactly like yours, business problems fall into categories and one is likely to be close to what you are seeking. This is a great way to avoid reinventing the wheel and to jumpstart a project. Data integration requires schema integration. A schema is a database model (structure) that accurately represents the data in such a way that it is meaningful. To compare entities such as customers, products, sales or store geography across different data stores, the schemas must be reconciled in terms of consistency and meaning. If the meanings differ, then translation (transformation) rules must be designed and implemented. The point is that IT developers cannot “plug into” data integration by purchasing a “plug in” for a tool without also undertaking the design work to integrate (i.e., map and translate) the schemas representing the targets and sources.
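As a rough illustration of what “map and translate” involves, here is a minimal Python sketch. Everything in it is hypothetical: the two source layouts (a CRM row and a billing row), the field names, the target schema and the state lookup table are invented for the example and do not come from Apatar or any other tool. It shows one way to express translation rules as a per-source map from target field to a small function.

```python
# Hypothetical source records: the same customer as two systems describe it.
crm_row = {"cust_id": "C-1001", "full_name": "Ada Lovelace", "st": "il"}
billing_row = {"acct_no": 1001, "first": "Ada", "last": "Lovelace",
               "state_name": "Illinois"}

# Translation rule for a case where meanings differ: billing stores state
# names, while the (assumed) target schema stores postal codes.
STATE_CODES = {"Illinois": "IL", "Indiana": "IN"}

# One map per source: target field -> function deriving it from a source row.
CRM_MAP = {
    "customer_id": lambda r: int(r["cust_id"].split("-")[1]),
    "name": lambda r: r["full_name"].strip(),
    "state": lambda r: r["st"].upper(),
}
BILLING_MAP = {
    "customer_id": lambda r: int(r["acct_no"]),
    "name": lambda r: f'{r["first"]} {r["last"]}'.strip(),
    "state": lambda r: STATE_CODES[r["state_name"]],
}


def to_target(row, schema_map):
    """Apply every mapping rule to produce a row in the target schema."""
    return {field: rule(row) for field, rule in schema_map.items()}


crm_customer = to_target(crm_row, CRM_MAP)
billing_customer = to_target(billing_row, BILLING_MAP)

# Once both rows share one schema, they can be compared directly.
same_customer = crm_customer == billing_customer
print(crm_customer, same_customer)
```

The tool only executes the maps. Deciding that `st` and `state_name` mean the same thing, and how to translate one into the other, is the design work that no plug-in supplies.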

Back to basics: back end. Design consistent and unified definitions of product, customer, channel, sales or store geography, etc. This is the single most important action an IT department can undertake regarding a data warehousing architecture. Front line data warehousing with clickstream applications is here to stay, and key data dimensions and attributes now also include those relevant to the Web such as page hierarchies, sessions, user IDs and shopping carts. Every department (finance, marketing, inventory, production) wants the same data in different form – that’s why the star schema design and its data warehouse implementation were invented. Extensive research is available on how to avoid the religious wars between data warehouses and data marts by means of a flexible data warehouse design. The previously cited comments on open source databases and data warehousing in the clouds are relevant here. According to my calculation, that constitutes front end, middle, and back end open source options from which to assemble a complete system. Obviously, enterprise customers will find value in having even more choices, and those are coming. In my opinion, economic uncertainty will be a benefit to open source and its users. IT benefits from the available bandwidth that developers may have now to start something really engaging (“cool”), and open source limits the downside financial risk. Win-win.
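To make the “same data in different form” point concrete, here is a small sketch of a star schema using Python’s built-in sqlite3 module. The table names, columns and sample rows are all assumptions made up for the illustration, not a recommended production design. One fact table is surrounded by shared dimension tables, and each department’s view is the same join pointed at a different dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, month TEXT);
-- The fact table holds measures plus one key per dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    units INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Paint', 'Home'), (2, 'Mop', 'Cleaning');
INSERT INTO dim_store VALUES (1, 'Chicago', 'Midwest'), (2, 'Austin', 'South');
INSERT INTO dim_date VALUES (20081101, '2008-11'), (20081201, '2008-12');
INSERT INTO fact_sales VALUES
    (1, 1, 20081101, 10, 250.0),
    (2, 1, 20081201, 4, 60.0),
    (1, 2, 20081201, 6, 150.0);
""")


def revenue_by(dim_table, key, label):
    """Total revenue grouped by one attribute of one dimension table.

    The identifiers are interpolated directly, so callers must pass only
    trusted, hard-coded names (as below), never user input.
    """
    query = f"""
        SELECT d.{label}, SUM(f.revenue)
        FROM fact_sales f JOIN {dim_table} d ON f.{key} = d.{key}
        GROUP BY d.{label} ORDER BY d.{label}
    """
    return conn.execute(query).fetchall()


# Same facts, three departmental forms.
finance_view = revenue_by("dim_date", "date_key", "month")
marketing_view = revenue_by("dim_product", "product_key", "category")
inventory_view = revenue_by("dim_store", "store_key", "region")
print(finance_view, marketing_view, inventory_view)
```

Because every view reads the same fact rows through agreed dimension definitions, the departmental totals reconcile with one another, which is the practical payoff of the unified definitions urged above.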

Plenty of blame and finger-pointing is available as the responsibilities for the housing bubble, credit default swaps (CDSs), and packages of toxic mortgage debt get passed around like hot potatoes. Self-scrutiny on the part of Barney Frank, Chuck Schumer and Henry Waxman, members of Congress who urged on the excesses of mortgage lenders Freddie Mac and Fannie Mae, is noticeably absent. It is true that Alan Greenspan in testimony before Congress indicated that one of the problems was that he had “bad data”; but that was in the context of acknowledging that his point of view on regulation was in need of more work.[1] This is the moral equivalent, when decoded from “central banker talk,” of the former Fed chairman saying that he was wrong, the recognition of which I shall cherish no matter how long I live. Fortunately for Greenspan, he already published his book, because his reputation now looks to have been as inflated as the price of housing in the year 2006. However, the one thing that no one has yet done is blame it on the data warehouse. Accurate, timely data is more important than ever before, and the data warehouse is one of the best ways of assuring it. Seriously, I expect there to be more work in the public sector building data warehouses (as well as transactional systems) to navigate through the economic and political dynamics.

End Note:

  1. Neil Irwin and Amit Paley, "Greenspan says he was wrong on regulation." The Washington Post, October 24, 2008.

  • Lou Agosta
    Lou Agosta is an independent industry analyst, specializing in data warehousing, data mining and data quality. A former industry analyst at Giga Information Group, Agosta has published extensively on industry trends in data warehousing, business and information technology. He is currently focusing on the challenge of transforming America’s healthcare system using information technology (HIT). He can be reached at
