The Enterprise Data Mountain, Part 1 How Many Data Stores Do We Need to Manage It?
by Colin White
Originally published September 18, 2006
A wealth of articles and papers has been written about enterprise data integration and about how to create an integrated and consistent view of your enterprise-wide data. The result is a confusing set of buzzwords and conflicting ideas. This confusion and complexity arises because there is no one-size-fits-all solution to enterprise data integration – the approach used will vary by project and the complexity of the data sources involved. It will also depend on the politics and organizational boundaries of the IT department.
I have already written and presented extensively about the need to build a cohesive enterprise integration architecture that can support the various techniques, technologies and products that are involved in data integration projects, and I will not repeat this material here. Instead, I want to look at data integration from the perspective of the types of data that exist in organizations and review the various ways we can store and manage that data.
In this article, I will use the following terms:
When evaluating the types of data and data store that exist in organizations we need to look at the five main types of business processes that create and maintain data in an IT system (see Figure 1): collaborative processes, business planning processes, operational transaction processes, master data processes, and business intelligence processes. Let’s examine each in turn.
Figure 1: Types of Business Process
There are many different types of collaborative application, and even in environments where the direction is to integrate collaborative data in shared content data stores, there are likely to be multiple stores, which, in turn, are likely to be managed in multiple databases and by multiple database management systems (DBMSs). To provide business users with easy and personalized access to these dispersed data stores, companies are employing a business portal, federated data integration and search techniques, and business taxonomies.
The coming challenge for companies is how to deal with the increasing amount of private (and thus unmanaged) data being generated by new Web technologies such as blogs and wikis.
Business Planning Processes
In larger organizations, planning data is likely to remain split between ERP and independent planning applications. ERP planning data is maintained in a similar fashion to operational data, which is discussed later in this article. Many independent planning applications come from BI vendors, and the PDS data in these cases will be managed by the same DBMS being used for supporting BI processing.
Business Transaction Processes
Access to current, but dispersed, business transaction data can be done using data federation techniques; but to provide an integrated view of this dispersed data, many companies integrate business transaction data into an operational data store (ODS). The data in the ODS is updated asynchronously from business transaction processing, which adds a certain amount of data latency to the ODS data.
The data in the ODS is used for operational reporting, to supply information to other downstream business transaction applications, for building new business transaction applications and to help in migrating legacy applications to modern technology.
How ODS data is physically managed in a database and by a DBMS is dependent on the BI strategy and data architecture of a company; and we will, therefore, defer this discussion until after BI processes have been reviewed.
Business Intelligence Processes
There are many different approaches to building a data warehouse. This topic is well documented elsewhere and will not be repeated here. Suffice it to say that a common architecture consists of an enterprise data warehouse (EDW) containing detailed historic and aged data, and one or more underlying data marts containing summarized information of the data in the EDW. The age of the data in an EDW or data mart is often denoted by a time field in each data store record.
A debate that frequently occurs between BI specialists and vendors is whether ODS, EDW and data mart data should be stored together or separately from each other. This discussion frequently confuses logical and physical storage models. Data stores, by definition, are logical objects that have their own data model and are clearly separate from each other. The storage debate should instead focus on whether data stores can be kept in the same database, on the same database server or managed by the same DBMS.
The decision to use the same DBMS and database server to manage both an ODS and an EDW is primarily concerned with workload. An ODS workload can be dominated by the transaction update rate, by the BI query rate, by the downstream operational application query rate, or may split evenly between updates and queries. Some DBMSs by their nature perform very well with transaction and mixed workloads, whereas others are better suited to query workloads. The workload and the ability of the DBMS and the database server to handle the workload is a key deciding factor here.
Most ODS and EDW data stores are implemented using a relational DBMS. With the relational approach, data in these data stores is mapped to one or more relational tables. These “logical” tables are, in turn, mapped to the physical storage objects of the DBMS. These physical objects vary by product. Let’s assume the DBMS maps a table to a tablespace, which, in turn, is mapped to a database. Further, let’s assume a tablespace can be split into multiple partitions for performance and management reasons.
Some people argue that the ODS data can be stored in one partition of a tablespace and the EDW data in the remaining partitions. With this technique, current data moves from the ODS partition to the EDW partitions as it is updated or ages. This approach works if:
Some application designs will work with this approach, whereas others will not. The data models of the ODS and EDW, for example, may not always be the same. As I said in the beginning of this article, one size does not fit all.
Depending on functional and performance needs, data mart data may be handled by a relational DBMS, a multidimensional DBMS or a hybrid DBMS. Key deciding factors here are the amount and complexity of the data involved, and whether the data mart workload consists of ad hoc queries or pre-planned ones. Data mart data, therefore, may be managed by the same DBMS and database server as the ODS and EDW, or by a completely different system. Again, one size does not fit all applications.
Master Data Processes
The direction of the industry is to separate master data from business transaction data and to maintain it using separate master data processes. This trend will have a dramatic effect on the way master data is stored and managed, not only in the operational environment, but also in the BI environment. This topic will be the focus of Part 2 of this article in next month’s newsletter.
Recent articles by Colin White
Copyright 2004 — 2020. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC