I used to think that data virtualization tools were good only for niche applications, such as creating a quick-and-dirty prototype, augmenting a data warehouse with real-time data from an operational system, or accessing data outside the corporate firewall. But now I think that data virtualization is the key to creating an agile, cost-effective data management infrastructure. In fact, data architects should design and deploy a data virtualization layer before building any data management or delivery artifacts.
What is Data Virtualization? Data virtualization software makes data spread across physically distinct systems appear as a set of tables in a local database. Business users, developers, and applications query this virtualized view and the software automatically generates an optimized set of queries that fetch data from remote systems, merge the disparate data on the fly, and deliver the result to users. Data virtualization software consumes virtually any type of data, including SQL, MDX, XML, Web services, and flat files and publishes the data as SQL tables or Web services. Essentially, data virtualization software turns data into a service, hiding the complexity of back-end data structures behind a standardized information interface.
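To make the idea concrete, here is a minimal sketch of federation using Python's built-in sqlite3 module: two physically separate databases are presented to consumers as a single virtual view. This is an illustration only; the database files, table names, and view are hypothetical, and real data virtualization products do this across heterogeneous systems (SQL, Web services, flat files) with sophisticated query optimization.

```python
# Sketch: two "remote" systems, simulated as separate SQLite files,
# unified behind one virtual view. All names are illustrative.
import sqlite3

# Simulated source system 1: a CRM database with customers.
crm = sqlite3.connect("crm.db")
crm.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT)")
crm.execute("DELETE FROM customers")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])
crm.commit(); crm.close()

# Simulated source system 2: an ERP database with orders.
erp = sqlite3.connect("erp.db")
erp.execute("CREATE TABLE IF NOT EXISTS orders (customer_id INTEGER, amount REAL)")
erp.execute("DELETE FROM orders")
erp.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 250.0), (1, 100.0), (2, 75.0)])
erp.commit(); erp.close()

# The "virtualization layer": attach both sources, expose one view.
hub = sqlite3.connect(":memory:")
hub.execute("ATTACH 'crm.db' AS crm")
hub.execute("ATTACH 'erp.db' AS erp")
hub.execute("""
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.name, SUM(o.amount) AS revenue
    FROM crm.customers c JOIN erp.orders o ON o.customer_id = c.id
    GROUP BY c.name
""")

# Consumers query the view as if it were one local table; the join
# across physically distinct databases happens behind the interface.
for name, revenue in hub.execute("SELECT * FROM customer_revenue ORDER BY name"):
    print(name, revenue)
```

The consumer never knows (or cares) that `customers` and `orders` live in different systems, which is exactly the abstraction data virtualization sells.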
With data virtualization, organizations can integrate data without physically consolidating it. In other words, they don't have to build a data warehouse or data mart to deliver an integrated view of data, which saves considerable time and money. In addition, data virtualization lets administrators swap out or redesign back-end databases and systems without affecting downstream applications.
The upshot is that IT project teams can significantly reduce the time they spend sourcing, accessing, and integrating data, which is the lion's share of the work in any data warehousing project. In other words, data virtualization speeds project delivery, increases business agility, reduces costs, and improves customer satisfaction. What's not to like?
Long Time Coming. Data virtualization has a long history. In the early days of data warehousing (circa 1995), it was called virtual data warehousing (VDW), and advocates positioned it as a legitimate alternative to building expensive data warehouses. However, data warehousing purists labeled VDW "voodoo and witchcraft" and chased it from the scene. During the next 10 years, data virtualization periodically resurfaced, each time with a different moniker, including enterprise information integration (EII) and data federation, but the technology never got much traction, and vendors came and went.
Drawbacks. One reason data virtualization failed to take root is politics. Source system owners don't want BI tools submitting ad hoc queries against their operational databases. And these administrators have the clout to lock out applications they think will bog down system performance.
Other traditional drawbacks of data virtualization are performance, scalability, and query complexity. The engineering required to query two or more databases is complex and becomes exponentially more challenging as data volumes and query complexity grow. As a result, data virtualization tools historically have been confined to niche applications involving small volumes of clean, consistent data that require little to no transformation and no complex joins.
Today, however, data virtualization is making a resurgence. Advances in network speeds, CPU performance, and available memory have significantly increased the performance and scalability of data virtualization tools, expanding the range of applications they can support. Moreover, data virtualization vendors continue to enhance their query optimizers to handle more complex queries and larger data volumes. Also, thanks to the popularity of data center virtualization, many organizations are open to exploring the possibility of virtualizing their data as well.
But does this mean data virtualization is ready to take center stage in your data management architecture? I think yes. Last week, I listened to data architects from Qualcomm, BP, Comcast, and Bank of America discuss their use of data virtualization tools at "Data Virtualization Day," a one-day event hosted by Composite Software, a leading data virtualization vendor. After hearing their stories, I am convinced that data virtualization is the missing layer in our data architectures.
These architects reported no performance or scalability issues with data virtualization. If they encounter a slow query, they simply persist or cache the target data in a traditional database and reconfigure the semantic layer accordingly. In other words, they virtualize everything they can and persist when they must. This approach overcomes the physical and political obstacles to data virtualization while improving query performance and project agility. And some hard-core "data virtualizers" do away with the data warehouse altogether, preferring to persist snapshots of only the data that requires a historical or time-series view.
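The "virtualize everything, persist when you must" policy can be sketched as a thin routing layer: run a query virtually, and if it exceeds a latency budget, materialize its result and serve the cached copy from then on. This is a toy illustration, not any vendor's actual mechanism; the threshold, registry, and table names are assumptions.

```python
# Sketch: serve queries virtually; materialize and cache the ones
# that blow the latency budget. All names and limits are illustrative.
import sqlite3
import time

hub = sqlite3.connect(":memory:")
# Stand-in for an expensive federated view over remote systems.
hub.execute("CREATE TABLE remote_sales (region TEXT, amount REAL)")
hub.executemany("INSERT INTO remote_sales VALUES (?, ?)",
                [("EMEA", 10.0), ("EMEA", 20.0), ("APAC", 5.0)])

SLOW_THRESHOLD_SECS = 0.5   # assumed SLA; tune per workload
materialized = {}           # sql text -> local cache table name

def query(sql):
    """Run the query virtually; persist its result if it proves slow."""
    if sql in materialized:
        # Previously flagged as slow: serve the persisted snapshot.
        return hub.execute(f"SELECT * FROM {materialized[sql]}").fetchall()
    start = time.perf_counter()
    rows = hub.execute(sql).fetchall()
    if time.perf_counter() - start > SLOW_THRESHOLD_SECS:
        # Too slow to federate on the fly: materialize it locally.
        cache_name = f"cache_{len(materialized)}"
        hub.execute(f"CREATE TABLE {cache_name} AS {sql}")
        materialized[sql] = cache_name
    return rows

print(query("SELECT region, SUM(amount) FROM remote_sales GROUP BY region"))
```

In a real deployment the "persist" step would land in a warehouse or mart and the cached copy would need a refresh schedule, but the routing decision, virtual by default, persistent by exception, is the heart of the pattern the architects described.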
Today, the biggest obstacles to the growth of data virtualization are perceptions and time. Given the innate bias among data warehousing professionals to persist everything, most data architects doubt that data virtualization tools offer adequate performance for their query workloads. In addition, it takes time to introduce data virtualization tools into an existing data warehousing architecture. The tools must prove their worth in an initial application and build on their success. Since enterprise-caliber data virtualization tools cost several hundred thousand dollars, they need a well-respected visionary to advocate for their usage. In most organizations, it's easier to go with the flow than buck the trend.
Nonetheless, the future looks bright for data virtualization. Since most of our data environments are heterogeneous (and always will be), it just makes sense to implement a virtualization layer that presents users and applications with a unified interface to access any back-end data, no matter where it's located or how it's structured. A layer of abstraction that balances federation and persistence can do two things that every IT department must deliver: lower costs and quicker deployments.