The Enterprise Data Mountain, Part 1 How Many Data Stores Do We Need to Manage It?

Originally published September 18, 2006

A wealth of articles and papers has been written about enterprise data integration and about how to create an integrated and consistent view of your enterprise-wide data. The result is a confusing set of buzzwords and conflicting ideas. This confusion and complexity arises because there is no one-size-fits-all solution to enterprise data integration – the approach used will vary by project and the complexity of the data sources involved. It will also depend on the politics and organizational boundaries of the IT department.

I have already written and presented extensively about the need to build a cohesive enterprise integration architecture that can support the various techniques, technologies and products that are involved in data integration projects, and I will not repeat this material here. Instead, I want to look at data integration from the perspective of the types of data that exist in organizations and review the various ways we can store and manage that data. 

In this article, I will use the following terms:

  • Private data – data belonging to an individual person that is not shared or managed

  • Shared data – managed workgroup and corporate data that is shared by multiple individuals, groups or companies

  • Internal business data – private or shared data that is created and maintained by a company’s employees and internal applications

  • External business data – data that can be supplied and/or modified by people and applications external to an organization

  • Structured data – a set of one or more data records with a fixed field structure that is defined by an external data model

  • Semi-structured data – a set of one or more data records with a variable field structure that can be defined by an external data model

  • Unstructured data – a set of one or more data records that have no externally defined field structure

  • Current data – the latest version of a data record

  • Historic data – a record of an earlier version of a data record

  • Aged data – current data that is no longer required by operational processes

  • Data store – a logical set of shared data

  • Database – a group of physical data files containing the data from one or more data stores

  • Database server – a software and hardware system used to manage one or more databases

  • Database management system (DBMS) – the data management software component of a database server

When evaluating the types of data and data store that exist in organizations we need to look at the five main types of business processes that create and maintain data in an IT system (see Figure 1): collaborative processes, business planning processes, operational transaction processes, master data processes, and business intelligence processes. Let’s examine each in turn.

Figure 1: Types of Business Process

Collaborative Processes
Collaborative processes involve people-centric activities and tasks that create and maintain unstructured and semi-structured data. This data may include workgroup information such as word processing documents, spreadsheets, presentations, business intelligence (BI) reports, images, videos, e-mail and operational enterprise information such as electronic forms, digitized images of purchase orders, faxes and Web pages. A significant percentage of this collaborative data, especially at the workgroup level, is private data. The direction of companies is to integrate this private data into shared content data stores (CDS) using content management software. The CDS is used for storing both current and historic data. Versioning is used frequently for identifying historic data.

There are many different types of collaborative application, and even in environments where the direction is to integrate collaborative data in shared content data stores, there are likely to be multiple stores, which, in turn, are likely to be managed in multiple databases and by multiple database management systems (DBMSs). To provide business users with easy and personalized access to these dispersed data stores, companies are employing a business portal, federated data integration and search techniques, and business taxonomies.

The coming challenge for companies is how to deal with the increasing amount of private (and thus unmanaged) data being generated by new Web technologies such as blogs and wikis.

Business Planning Processes
Planning processes are used for strategic, tactical and operational planning, budgeting and forecasting. The semi-structured and structured data created by these processes may be maintained in a private data store such as a spreadsheet, or in a shared data store managed by an independent planning application or enterprise resource planning (ERP) system. The direction of companies is to integrate private planning data into a shared planning data store (PDS) managed by independent planning applications. The PDS is used to maintain both current and historic data. Versioning is frequently used for controlling historic data in a PDS.

In larger organizations, planning data is likely to remain split between ERP and independent planning applications. ERP planning data is maintained in a similar fashion to operational data, which is discussed later in this article. Many independent planning applications come from BI vendors, and the PDS data in these cases will be managed by the same DBMS being used for supporting BI processing.

Business Transaction Processes
Business transaction processes involve application-centric processes that are responsible for handling daily business operations such as order entry, inventory control, billing and shipping. The structured data associated with these processes is usually maintained in a variety of dispersed business transaction data stores, which, in turn, are managed in multiple databases and by multiple DBMSs.

Access to current, but dispersed, business transaction data can be done using data federation techniques; but to provide an integrated view of this dispersed data, many companies integrate business transaction data into an operational data store (ODS). The data in the ODS is updated asynchronously from business transaction processing, which adds a certain amount of data latency to the ODS data.

The data in the ODS is used for operational reporting, to supply information to other downstream business transaction applications, for building new business transaction applications and to help in migrating legacy applications to modern technology.

How ODS data is physically managed in a database and by a DBMS is dependent on the BI strategy and data architecture of a company; and we will, therefore, defer this discussion until after BI processes have been reviewed.

Business Intelligence Processes
Business intelligence processes are responsible for monitoring, analyzing, measuring and managing the key business operations of a company. Traditionally the information used by BI processes is based on historic and aged data captured into a data warehouse from business transaction data stores and/or an ODS. There is a trend, however, for companies to capture data from content and planning data stores to provide additional information and business context to BI processing.

There are many different approaches to building a data warehouse. This topic is well documented elsewhere and will not be repeated here. Suffice it to say that a common architecture consists of an enterprise data warehouse (EDW) containing detailed historic and aged data, and one or more underlying data marts containing summarized information of the data in the EDW. The age of the data in an EDW or data mart is often denoted by a time field in each data store record.

A debate that frequently occurs between BI specialists and vendors is whether ODS, EDW and data mart data should be stored together or separately from each other. This discussion frequently confuses logical and physical storage models. Data stores, by definition, are logical objects that have their own data model and are clearly separate from each other. The storage debate should instead focus on whether data stores can be kept in the same database, on the same database server or managed by the same DBMS.

The decision to use the same DBMS and database server to manage both an ODS and an EDW is primarily concerned with workload. An ODS workload can be dominated by the transaction update rate, by the BI query rate, by the downstream operational application query rate, or may split evenly between updates and queries. Some DBMSs by their nature perform very well with transaction and mixed workloads, whereas others are better suited to query workloads. The workload and the ability of the DBMS and the database server to handle the workload is a key deciding factor here.

Most ODS and EDW data stores are implemented using a relational DBMS. With the relational approach, data in these data stores is mapped to one or more relational tables. These “logical” tables are, in turn, mapped to the physical storage objects of the DBMS. These physical objects vary by product. Let’s assume the DBMS maps a table to a tablespace, which, in turn, is mapped to a database. Further, let’s assume a tablespace can be split into multiple partitions for performance and management reasons.

Some people argue that the ODS data can be stored in one partition of a tablespace and the EDW data in the remaining partitions. With this technique, current data moves from the ODS partition to the EDW partitions as it is updated or ages. This approach works if:

  • The DBMS and the database server can handle the workload

  • The data models of the ODS and EDW are the same

  • It is acceptable to partition the data by date and time

  • It is acceptable for the primary key to contain date and time and for applications and queries to be time aware

Some application designs will work with this approach, whereas others will not. The data models of the ODS and EDW, for example, may not always be the same. As I said in the beginning of this article, one size does not fit all.

Depending on functional and performance needs, data mart data may be handled by a relational DBMS, a multidimensional DBMS or a hybrid DBMS. Key deciding factors here are the amount and complexity of the data involved, and whether the data mart workload consists of ad hoc queries or pre-planned ones. Data mart data, therefore, may be managed by the same DBMS and database server as the ODS and EDW, or by a completely different system. Again, one size does not fit all applications.

Master Data Processes
Master data processes are responsible for managing the data associated the key business entities of an organization such as customers, employees, suppliers, parts, ledgers and locations. Traditionally this master data has been managed by business transaction processes and is maintained in business transaction data stores. This master data is then usually propagated into an ODS and an EDW. Master data is, in fact, used in the business intelligence environment to organize the way data is presented to business users who may need, for example, to view information by customer, by region and by part or product number.

The direction of the industry is to separate master data from business transaction data and to maintain it using separate master data processes. This trend will have a dramatic effect on the way master data is stored and managed, not only in the operational environment, but also in the BI environment. This topic will be the focus of Part 2 of this article in next month’s newsletter.

  • Colin WhiteColin White

    Colin White is the founder of BI Research and president of DataBase Associates Inc. As an analyst, educator and writer, he is well known for his in-depth knowledge of data management, information integration, and business intelligence technologies and how they can be used for building the smart and agile business. With many years of IT experience, he has consulted for dozens of companies throughout the world and is a frequent speaker at leading IT events. Colin has written numerous articles and papers on deploying new and evolving information technologies for business benefit and is a regular contributor to several leading print- and web-based industry journals. For ten years he was the conference chair of the Shared Insights Portals, Content Management, and Collaboration conference. He was also the conference director of the DB/EXPO trade show and conference.

    Editor's Note: More articles and resources are available in Colin's BeyeNETWORK Expert Channel. Be sure to visit today!

Recent articles by Colin White

 

Comments

Want to post a comment? Login or become a member today!

Be the first to comment!