Techno Babble: Components of a Business Intelligence Architecture

Originally published March 18, 2008

This article is an excerpt from Successful Business Intelligence: Secrets to Making BI a Killer App by Cindi Howson (McGraw-Hill, 2007, ISBN 0071498516). Reprinted with permission from the McGraw-Hill Companies.

Every business intelligence (BI) deployment has an underlying architecture. The BI architecture is much like the engine of a car – a necessary component, often powerful, but one that users, like drivers, don’t always understand. For some companies new to business intelligence, the BI architecture may consist primarily of the operational systems and the BI front-end tools. For more mature BI deployments, and particularly for enterprise customers, it will involve ETL (extract, transform, and load) tools, a data warehouse, data marts, BI front-end tools, and other such components.

When IT discusses BI with users, we readily fall into techno babble, and senseless acronyms abound. Most car drivers know that cars have a battery, a transmission, and a fuel tank – an adequate level of knowledge for having a conversation with a mechanic or salesperson, but not nearly enough expertise to begin rebuilding an engine. In this chapter, then, I’ll present the major technical components that make up a BI architecture and that business users should understand at least at a high level to participate in discussions about building and leveraging a BI solution. If you are a technical expert, you might find this chapter overly simplified – and it is. If you are looking for a reference on any one of these components, consult the list of resources in Appendix B of Successful Business Intelligence.

Operational and Source Systems

Operational systems are the starting point for most quantitative data in a company. Operational systems may also be referred to as “transaction processing systems,” “source systems,” and “enterprise resource planning” (ERP) systems. As Figure 1 illustrates:

  • Manufacturing system
    When a product is produced, the production order is entered in the manufacturing system. The quantity of raw material used and the finished product produced are recorded.

  • Sales system
    When a customer places an order, the order details are entered in an order entry system.

  • Supply chain system
    When the product is available, the product is shipped and order fulfillment details are entered.

  • Accounting system
    Accounting then invoices the customer and collects payment. The invoices and payments may be recorded in an operational system that is different from the order entry system.


Figure 1: Operational systems record data from operational tasks.

In each step of this process, users are creating data that can eventually be used for business intelligence. Conversely, to complete a task, operational users may themselves need business intelligence. For example, before an order can be accepted, the product may need to be available in inventory. As is the case with many online retailers, customers cannot place an order for a product combination (color, size) that is not available; a report immediately appears with a list of alternative sizes or colors.
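
To make the inventory example concrete, here is a minimal Python sketch of an availability check embedded in an order entry flow. The product data and function names are hypothetical, not from any particular retailer’s system:

    # Hypothetical inventory for (product, color, size) combinations.
    inventory = {
        ("shirt", "blue", "M"): 0,
        ("shirt", "blue", "L"): 12,
        ("shirt", "red", "M"): 4,
    }

    def check_availability(product, color, size):
        """Return (in_stock, alternatives) for the requested combination."""
        if inventory.get((product, color, size), 0) > 0:
            return True, []
        # Embedded operational BI: immediately report in-stock alternatives.
        alternatives = [(c, s) for (p, c, s), qty in inventory.items()
                        if p == product and qty > 0]
        return False, alternatives

    in_stock, alternatives = check_availability("shirt", "blue", "M")
    if not in_stock:
        print("Out of stock; available alternatives:", alternatives)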


The operational systems shown in Figure 1 may be custom-developed transaction systems or a purchased package from companies such as Oracle (Oracle E-business, PeopleSoft, J.D. Edwards), SAP, or Microsoft (Dynamics GP). With custom-developed operational systems or with modules coming from different vendors, data may be manually entered into each system. A better approach is to systematically transfer data between the systems or modules. However, even when data is systematically transferred, the Customer ID entered in the order system may not, for example, be the same Customer ID entered in the accounting system – even though both IDs refer to the same customer!

Ideally, consistent information flows through the process seamlessly, as shown in Figure 2. Enterprise resource planning (ERP) systems ensure adherence to standard processes and are broader in scope than the custom operational systems of the past. From a data perspective, ERPs reduce duplicate data entry and thus improve data quality (see Chapter 7 of Successful Business Intelligence). With an integrated ERP, a common set of reference tables with consistent customer IDs, product codes, and charts of accounts is shared across the modules or applications.

Within the business intelligence life cycle, the operational systems are the starting point for the data you will later want to analyze. If you do not capture the data in the operational system, you can’t analyze it. If the operational system contains errors, those errors only get compounded when the data is later aggregated and combined with other data.


Figure 2: ERP systems reduce duplicate data entry and ensure adherence to standard processes.

Additional Source Systems

While much of the data warehouse (described in the next section) is populated by operational systems, data may also come from additional data sources such as:

  • Distributors who supply sales and inventory information

  • Click-stream data from web logs that show the most frequently viewed products or online shopping cart analysis for partially completed orders

  • Market prices from external research firms

Whether this additional data gets loaded into a central data warehouse will depend on how consistently it can be merged with corporate data, how common the requirement is, and politics. If the data is not physically stored in the data warehouse, it may be integrated with corporate data in a specific data mart. Disparate data sources may, in some cases, also be accessed or combined within the BI front-end tool.

Data Transfer: From Operational to Data Warehouse

BI often involves analyzing summary data and combining data from multiple operational systems. To facilitate this, data will be extracted from the operational systems and loaded into a data warehouse, as shown in Figure 3.

This process is referred to as extract, transform, and load (ETL). More recently, some data warehouse teams have changed the order of these steps – loading the raw data into the warehouse first and transforming it there – an approach referred to as ELT (extract, load, transform).

The “transform” step of ETL is often the most time-consuming, particularly when multiple, disparate systems are involved. Reconciling inconsistent codes (product IDs, customer IDs), handling incomplete data, and changing cryptic codes to meaningful terms (1 = not shipped, 2 = shipped) are all part of the transform process.
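
The following Python sketch illustrates these transform steps on a single extracted record. The code mappings and field names are hypothetical; a real ETL tool applies the same ideas declaratively and at far larger scale:

    # Change cryptic source codes to meaningful terms.
    SHIP_STATUS = {1: "not shipped", 2: "shipped"}

    # Reconcile inconsistent customer IDs across source systems
    # (here, order-system IDs mapped to accounting-system IDs).
    CUSTOMER_ID_MAP = {"ORD-0042": "CUST-1001", "ORD-0043": "CUST-1002"}

    def transform(record):
        """Transform one extracted order record before loading."""
        return {
            "customer_id": CUSTOMER_ID_MAP.get(record["customer_id"],
                                               record["customer_id"]),
            "ship_status": SHIP_STATUS.get(record["status"], "unknown"),
            # Handle incomplete data: default a missing quantity to zero.
            "quantity": record.get("quantity") or 0,
        }

    extracted = [{"customer_id": "ORD-0042", "status": 2, "quantity": None}]
    loaded = [transform(r) for r in extracted]
    print(loaded)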


Figure 3: Major components in the business intelligence life cycle.

Early data warehouse efforts usually relied on custom-coded ETL, and many still do. More recently, as packaged ETL solutions have come on the market and become integrated with BI front ends, more customers have adopted purchased ETL solutions. Popular ETL products include Informatica PowerCenter, IBM WebSphere DataStage, Oracle Data Integrator, Ab Initio, and Microsoft Integration Services (a component of SQL Server).

NOTE: Throughout this book, I will mention specific vendor products as a way of providing you with concrete examples. These listings are not exhaustive, and exact product names frequently change amid vendor acquisitions and product releases.

Why Not Extract Everything?

In designing a data warehouse, requirements analysts will ask users what they need so that the ETL specialists can figure out what should be extracted from the source systems. Because much of BI is unpredictable in nature and users often don’t know what they want until they see it, you might ask, “Why not extract everything, in case the data is needed one day?”

There are a number of reasons why all the data should not be extracted:

  • High data replication and storage costs.

  • The time window in which data can be ETL’d (extracted, transformed, and loaded) is increasingly small, especially since many companies and data warehouses serve a global user base.

  • Negative impact on query performance when too much detailed data is stored in the data warehouse.

  • Limited time, money, and human resources force a prioritization of what data to extract and include in the data warehouse.

Enterprise Information Management

As the data warehouse industry has matured and ETL tools have evolved, this market segment is increasingly referred to as enterprise information management (EIM). EIM includes ETL tools but also data modeling, data quality, data profiling, metadata management, and master data management (MDM) tools.

Metadata

IT professionals talk a lot about metadata and take great pains to make the business understand its importance. So with a chuckle, I will give you the classic definition: metadata is data about the data. Helpful, isn’t it?

Metadata is similar to a card file in a library or the book details on Amazon.com: it tells you which category a book belongs to, when it was published, and so on. Metadata may describe such things as:

  • When the data was extracted from the source system

  • When the data was loaded into the data warehouse

  • From which source system an item originated

  • From which physical table and field in the source system it was extracted

  • How something was calculated—for example, revenue = (price × quantity sold) – discounts

  • What the item means in a business context (revenue is based on the amount invoiced and does not include returns or bad debts)

The first few bullets in this list may not be all that interesting to many business users, but they are critical to the design and functioning of a data warehouse. These items also tell you how timely the information you are analyzing is. If, for example, the data warehouse did not fully load because of an error, you need to be aware of this and consider the incomplete data in your reports.

As you move down the list, the items become much more important to all business users. A salesperson, for example, may have a different definition of revenue than a finance person would. As more people use BI, metadata is critical in ensuring a common business terminology and in ensuring users really know what the data means.
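
As a concrete illustration, here is a minimal sketch of the kind of record a metadata repository might hold for one warehouse item. The field names and values are assumptions for illustration, not any product’s actual metadata format:

    # Hypothetical metadata record for the "revenue" item.
    revenue_metadata = {
        "item": "revenue",
        "extracted_from_source": "2008-03-17T23:05:00",  # when extracted
        "loaded_into_warehouse": "2008-03-18T01:30:00",  # when loaded
        "source_system": "order entry",
        "source_table_field": "ORDER_LINES.NET_AMOUNT",
        "calculation": "(price * quantity_sold) - discounts",
        "business_definition": "Amount invoiced; excludes returns and bad debts",
    }

    # A BI tool can surface the business definition alongside a report so
    # that sales and finance interpret "revenue" the same way.
    print(revenue_metadata["business_definition"])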

Master Data Management

David Loshin, president of Knowledge Integrity and a specialist in information quality and master data management (MDM), defines MDM as follows:

Master data management is comprised of the business applications, methods, and tools that implement the policies, procedures, and infrastructure to support the capture, integration, and subsequent shared use of accurate, timely, consistent, and complete master data.1

Master data is the codes and descriptions for customers, products, charts of accounts, regions, and so on. Master data management is what ensures that the product ID in the product table shown earlier in Figure 2 is, ideally, the same ID across all the applications. This product ID is stored and maintained in one common place so that the relevant operational and business intelligence systems can access and share it.

In practice, there is rarely a single product ID, for a variety of technical and organizational reasons. In this case, master data will include the mappings between the different product IDs that represent the same product in different systems. Master data also includes hierarchies of how individual products, customers, and accounts aggregate to form the dimensions by which you analyze various facts (see the “Data Warehouse Tables” section later in this chapter). If this all sounds a little boring and unimportant, read the story of how pivotal a role master data played in Dow Chemical’s business intelligence success in Chapter 7.
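
A minimal Python sketch of such mappings and hierarchies, using made-up system names and IDs, might look like this:

    # Hypothetical master record: one product, three system-specific IDs.
    master_products = {
        "P-100": {
            "description": "Widget, small",
            "system_ids": {"order_entry": "W-SM-01",
                           "accounting": "4711",
                           "manufacturing": "WGT-S"},
            # Hierarchy: how this product rolls up in the product dimension.
            "hierarchy": ["All Products", "Widgets", "Small Widgets"],
        },
    }

    def to_master_id(system, local_id):
        """Resolve a system-specific product ID to the master ID."""
        for master_id, record in master_products.items():
            if record["system_ids"].get(system) == local_id:
                return master_id
        return None

    print(to_master_id("accounting", "4711"))  # P-100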

The Data Warehouse

A data warehouse is the collection of data extracted from various operational systems, transformed to make the data consistent, and loaded for analysis. With some business users, “data warehouse” has become a dirty word, associated with “expensive” and “monolithic” and assumed to deliver no business value. Other terms, such as reporting database and data mart, may sound less monolithic to some business stakeholders. In reality, all of these terms describe systems that serve similar purposes but may differ in scope and technical architecture.

Do I Need a Data Warehouse?

Many ERP implementations were sold on the promise of delivering business insight. They don’t. Having a single operational system that ensures consistent business processes and that uses consistent reference data (customer, product codes) will make business analysis significantly easier. But there are a number of fundamental differences between operational systems and data warehouses, highlighted in Table 1.

It is because of these myriad differences that I would argue all companies need a data warehouse, regardless of the size of the company. The technical architecture of the data warehouse may vary, but its necessity does not. I have worked with customers with fewer than 20 employees and less than $1 million in revenues who needed a “reporting database,” and I have worked with customers with more than $20 billion in revenues who needed a “data warehouse.”

Why Bother with a Data Warehouse at All?

Many customers new to BI want to skip the data warehouse and deploy a BI tool directly against the operational system. This may seem like a faster approach to business intelligence. In some instances, it may be an acceptable way to start with BI, and this approach addresses operational BI needs. However, for most companies, you will want a data warehouse when:

  • You need to perform cross-subject or cross-functional analysis, such as products ordered versus inventory on hand. Such information may exist in two different systems, or in different modules within an ERP system, and is thus combined in the data warehouse.

  • You want to perform analysis on summary information, aggregated by time (month, quarter) or by some other hierarchy (product groupings). These hierarchies often don’t exist in transaction systems, and even when they do, running such voluminous queries within a transaction system can slow it to the point of interfering with data entry.

  • You need consistently fast reporting and analysis times. Because of their different purposes and design, data warehouses allow for faster queries than operational systems.

Table 1: Comparison of Operational Systems with Data Warehouses

Data Marts

A data mart is a subset of the data coming from a central data warehouse. A data mart may also be used to feed a central data warehouse. Whereas a data warehouse is designed to serve the needs of the enterprise, a data mart may serve the needs of a particular business unit, function, process, or application. Because a data mart is aligned with a particular business requirement, some businesses may want to skip the data warehouse and build an independent data mart. According to industry research, fewer companies now do this,2 as independent data marts have met with limited success and carry a higher implementation cost over time.

Data Warehouse Tables

Within the data warehouse, data is physically stored in individual tables within a relational database. Your company may use the same relational database software for your ERP system as for your data warehouse (for example, Oracle, Microsoft SQL Server, IBM DB2) or a relational database specifically designed for business intelligence (Teradata, SAS Intelligence Storage).

Experts will deploy a number of different table design approaches to support diverse business needs, performance requirements, and storage constraints. Most data warehouses have two types of tables: (1) fact tables, which contain keys into the dimension tables and the numeric information to analyze, such as sales, inventory, or calls – such facts are often referred to as measures; and (2) dimension tables, which allow analysis of measures from different perspectives, such as product, time, or geography.

A fact table can have millions of detailed rows of data, commonly referred to as having a “finer granularity,” or can be significantly smaller, containing mainly summary numbers. To improve the performance of queries, database designers may choose to create aggregate or summary tables around a fact table such that there may be a DAILY_SALES_FACT table, MONTHLY_SALES_FACT table, and YEARLY_SALES_FACT table. One fact table together with its associated dimension tables is referred to as a star schema, as shown in Figure 4.

Dimension tables are also referred to as lookup tables or reference tables. The dimension tables can be broken into more than one table; for example, detailed material IDs may reside in a MATERIAL_ID table. The groupings and product hierarchy for the material IDs may reside in a separate table such as PRODUCT_GROUPING. This type of structure is referred to as a snowflake design and is used in data warehouses that have extremely large dimensions. You can think of dimensions as the ways by which you want to analyze facts, for example, sales by geography or sales by product.

Figure 4: Star schema

Figure 5: Snowflake design
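
To make the star and snowflake designs concrete, here is a minimal sketch in Python using the built-in sqlite3 module. The table and column names are illustrative assumptions; a production warehouse would add surrogate keys, indexes, and aggregate tables such as MONTHLY_SALES_FACT:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE TIME_DIM (time_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE PRODUCT_GROUPING (group_id INTEGER PRIMARY KEY, group_name TEXT);
    -- Snowflaked product dimension: group details live in PRODUCT_GROUPING.
    CREATE TABLE PRODUCT_DIM (product_id INTEGER PRIMARY KEY, material_id TEXT,
                              group_id INTEGER REFERENCES PRODUCT_GROUPING);
    -- Fact table: keys into the dimensions plus the measures to analyze.
    CREATE TABLE ORDERS_FACT (time_id INTEGER, product_id INTEGER,
                              quantity INTEGER, sales REAL);

    INSERT INTO TIME_DIM VALUES (1, 'Jan', 2008);
    INSERT INTO PRODUCT_GROUPING VALUES (10, 'Widgets');
    INSERT INTO PRODUCT_DIM VALUES (100, 'WGT-S', 10);
    INSERT INTO ORDERS_FACT VALUES (1, 100, 5, 250.0);
    """)

    # Analyze a fact (sales) by dimensions: product group and year.
    query = """
        SELECT g.group_name, t.year, SUM(f.sales) AS total_sales
        FROM ORDERS_FACT f
        JOIN PRODUCT_DIM p      ON f.product_id = p.product_id
        JOIN PRODUCT_GROUPING g ON p.group_id   = g.group_id
        JOIN TIME_DIM t         ON f.time_id    = t.time_id
        GROUP BY g.group_name, t.year
    """
    for row in con.execute(query):
        print(row)  # ('Widgets', 2008, 250.0)

The GROUP BY query shows the typical analysis pattern: a measure summed up along attributes of the dimension tables.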

In a transaction system, data is stored in a way that allows for fast data entry with a minimal amount of data duplicated across the physical tables. Data in a transaction system is said to be stored in normalized tables when a minimal amount of data is replicated in each table and a data element needs to be updated in only one place. For example, the same customer name does not appear in multiple rows of a table.

In a data warehouse or data mart, the emphasis is instead on storing data in ways that facilitate analysis and speed query performance. Data redundancy is less of a concern, and because the data warehouse is a read-only environment, there is less concern about having to change multiple instances of the same value across thousands of tables and rows. Normalization in an operational system means the facts and the dimensions are spread across many tables. For example, order information may exist in both an ORDER_HEADER table and an ORDER_LINES table, as shown next. Reporting on which customers bought which products then means joining and aggregating across multiple tables, which is slower and, if the joins are not constructed carefully, can produce incorrect query results. In Figure 4, by contrast, all of the order information was extracted into a single ORDERS_FACT table, making it easier to query.

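
Here is a minimal sketch of such a normalized layout in Python with sqlite3, using illustrative column names. Reporting on which customers bought which products requires joining the two tables:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE ORDER_HEADER (order_id INTEGER PRIMARY KEY,
                               customer_id TEXT, order_date TEXT);
    CREATE TABLE ORDER_LINES  (order_id INTEGER, line_no INTEGER,
                               product_id TEXT, quantity INTEGER, price REAL);
    INSERT INTO ORDER_HEADER VALUES (1, 'CUST-1001', '2008-03-01');
    INSERT INTO ORDER_LINES  VALUES (1, 1, 'WGT-S', 5, 50.0),
                                    (1, 2, 'WGT-L', 2, 80.0);
    """)

    # Which customers bought which products: a join across normalized
    # tables that must be constructed carefully to avoid double counting.
    for row in con.execute("""
        SELECT h.customer_id, l.product_id, SUM(l.quantity * l.price)
        FROM ORDER_HEADER h
        JOIN ORDER_LINES  l ON h.order_id = l.order_id
        GROUP BY h.customer_id, l.product_id"""):
        print(row)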

Dimensions and hierarchies often do not exist in the transaction system. For example, the transaction system may store a plant ID for the individual facility that produces a product, but it may not contain information about where the plants are located and to which business units they belong. This hierarchical information is often only stored in a data warehouse or in a separate master data management system.

In some respects, business users may not care how the data is physically stored, whether in the data warehouse or in the transaction system. A business view in the BI tool (see Successful Business Intelligence, Chapter 3, the section “A Business View of the Data”) will often hide such technical issues. However, the better business users can define their requirements in advance, the better data modelers can store the data in a way that facilitates analysis. For example, if a user wants to analyze something like staffing levels versus sales performance and these two subjects exist in different fact tables and data marts, such analysis can be a challenge with certain BI tools. If users routinely want to analyze these two subject areas together, the data modeler may ultimately decide to store them in one common fact table.

The Data Warehouse Technology Platform

To drive a car, you need roads, highways, and a transportation infrastructure, just as in a BI environment a number of servers and networks may be involved:

  • The server(s) on which the relational database management system (RDBMS) is running

  • The server(s) that run the ETL software and processes

  • The web server(s) that provide the entry point into the BI environment

  • The BI server(s) that process queries, dashboards, and reports (see Chapter 3 of Successful Business Intelligence)

As part of this technical infrastructure, multiple servers may mirror each other for performance, load balancing, failover support, and so on. The network between the servers and the end users (whether internal business users or external customers) is also a critical pinch point in the BI environment. Much of the time, this infrastructure gets taken for granted unless there are performance or reliability issues.

For smaller businesses, targeted BI deployments, or those with scalability issues, two emerging technologies are worth noting: appliances and Software as a Service (SaaS).

Appliances

Data warehouse appliances combine the server, the database, and the data storage into one system.3 Appliances that also include BI capabilities are even newer; Cognos Now! is one such business intelligence appliance. Leading data warehouse appliance vendors include Netezza and DATAllegro, and hardware vendors such as IBM and HP increasingly offer data warehouse appliances as well. The promise of an appliance is a complete, optimized solution that delivers better performance at a lower cost of ownership than if a company were to purchase and install the individual components itself.

Software as a Service

In a SaaS model, BI deployment gets even simpler: customers simply subscribe to a solution, and a third-party vendor hosts the technical infrastructure that customers access via the Web. Customer relationship management (CRM) vendor salesforce.com uses a SaaS model. Business intelligence solutions available as SaaS include Business Objects CrystalReports.com, SeaTab’s Pivot Link, and LucidEra.

Best Practices for Successful Business Intelligence

The BI architecture consists of the ETL tools and processes, the data warehouse, the technical infrastructure, and the BI user tools. The operational systems provide the basic data that feed the data warehouse, either in real time or on a periodic basis. The underlying foundation of a BI architecture is complex. The implementation can either facilitate business intelligence or be so monolithic and inflexible that it becomes a technical data wasteland. To ensure the BI architecture meets the business requirements:

  • Business users should have a working understanding of the technical issues, components, and terminology that affect their requirements and ability to access data.

  • IT personnel should minimize techno babble and avoid overemphasizing the technical architecture for technology’s sake.

 

References:

  1. Loshin, David, “Master Data and Master Data Management: An Introduction,” DataFlux white paper.
  2. Watson, Hugh, “Which Data Warehouse Architecture Is Most Successful?” Business Intelligence Journal, Q1 2006.
  3. “Enhancing the Customer Experience and Improving Retention Using Powerful Data Warehousing Appliances,” TDWI What Works, November 2006.
