Originally published March 18, 2008
This article is an excerpt from Successful Business Intelligence: Secrets to Making BI a Killer App by Cindi Howson (McGraw-Hill; 2007; 0071498516). Reprinted with permission from the McGraw-Hill Companies.
Every business intelligence (BI) deployment has an underlying architecture. The BI architecture is much like the engine of a car – a necessary component, often powerful, but one that users, like drivers, don’t always understand. For some companies new to business intelligence, the BI architecture may primarily be the operational systems and the BI front-end tools. For more mature BI deployments and particularly for enterprise customers, it will involve ETL (extract, transform, and load) tools, a data warehouse, data marts, BI front-end tools, and other such components.
When IT discusses BI with users, we readily fall into techno babble, and senseless acronyms abound. Most car drivers know that cars have a battery, a transmission, a fuel tank – an adequate level of knowledge for having a conversation with a mechanic or salesperson, but not nearly enough expertise to begin rebuilding an engine. In this chapter, then, I’ll present the major technical components that make up a BI architecture and that business users should understand at least at a high level to participate in discussions about building and leveraging a BI solution. If you are a technical expert, you might find this chapter overly simplified, and it is. If you are looking for a reference on any one of these components, consult the list of resources in Appendix B of Successful Business Intelligence.
Operational systems are the starting point for most quantitative data in a company. Operational systems may also be referred to as “transaction processing systems,” “source systems,” or “enterprise resource planning” (ERP) systems. As Figure 1 illustrates:
Figure 1: Operational systems record data from operational tasks.
In each step of this process, users create data that can eventually be used for business intelligence. Operational users may also need business intelligence to complete a task. For example, to accept an order, the product must be available in inventory. As is the case with many online retailers, customers cannot place an order for a product combination (color, size) that is not available; instead, a report immediately appears with a list of alternative sizes or colors.
The operational systems shown in Figure 1 may be custom-developed transaction systems or a purchased package from companies such as Oracle (Oracle E-business, PeopleSoft, J.D. Edwards), SAP, or Microsoft (Dynamics GP). With custom-developed operational systems or with modules coming from different vendors, data may be manually entered into each system. A better approach is to systematically transfer data between the systems or modules. However, even when data is systematically transferred, the Customer ID entered in the order system may not, for example, be the same Customer ID entered in the accounting system – even though both IDs refer to the same customer!
Ideally, consistent information flows through the process seamlessly, as shown in Figure 2. Enterprise resource planning (ERP) systems ensure adherence to standard processes and are broader in scope than custom operational systems of the past. From a data perspective, ERPs reduce duplicate data entry and thus improve data quality (see Chapter 7 of Successful Business Intelligence). With an integrated ERP, a common set of reference tables with consistent customer IDs, product codes, and chart of accounts are shared across the modules or applications.
Within the business intelligence life cycle, the operational systems are the starting point for data you will later want to analyze. If you do not capture the data in the operational system, you can’t analyze it. If the operational system contains errors, those errors will only get compounded when you later aggregate the data and combine it with data from other sources.
Figure 2: ERP systems reduce duplicate data entry and ensure adherence to standard processes.
While much of the data warehouse (described in the next section) is populated by operational systems, data may also come from additional sources such as spreadsheets, flat files, and external or third-party data.
Whether this additional data gets loaded into a central data warehouse will depend on how consistently it can be merged with corporate data, how common the requirement is, and politics. If the data is not physically stored in the data warehouse, it may be integrated with corporate data in a specific data mart. Disparate data sources may, in some cases, also be accessed or combined within the BI front-end tool.
BI often involves analyzing summary data and combining data from multiple operational systems. To facilitate this, data will be extracted from the operational systems and loaded into a data warehouse, as shown in Figure 3.
This process is referred to as extract, transform, and load (ETL). More recently, some data warehouse teams have reversed the last two steps, loading data before transforming it, an approach referred to as ELT (extract, load, transform).
The “transform” step of ETL is often the most time-consuming, particularly when multiple, disparate source systems are involved. Reconciling inconsistent codes (product IDs, customer IDs), handling incomplete data, and translating codes into meaningful terms (1 = not shipped, 2 = shipped) are all part of the transform process.
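To make the transform step concrete, here is a minimal sketch of the kinds of rules just described. The field names, status codes, and ID cross-reference are invented for illustration, not taken from any particular ETL tool:

```python
# Illustrative "transform" step: decode status codes, standardize customer IDs
# via a cross-reference table, and set aside incomplete records.
STATUS = {1: "not shipped", 2: "shipped"}

def transform(row, customer_xref):
    if row.get("customer_id") is None:   # incomplete data: reject for review
        return None
    return {
        # Map the source system's customer ID to the warehouse's standard ID.
        "customer_id": customer_xref.get(row["customer_id"], row["customer_id"]),
        # Change codes into meaningful terms (1 = not shipped, 2 = shipped).
        "status": STATUS.get(row["status"], "unknown"),
        # Coerce amounts to a consistent numeric type, defaulting blanks to 0.
        "amount": float(row.get("amount") or 0.0),
    }

xref = {"C-0042": "CUST42"}  # hypothetical source-ID-to-warehouse-ID mapping
print(transform({"customer_id": "C-0042", "status": 2, "amount": "19.95"}, xref))
# {'customer_id': 'CUST42', 'status': 'shipped', 'amount': 19.95}
```

Real ETL tools express these rules graphically or declaratively, but the work they perform is essentially this.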
Figure 3: Major components in the business intelligence life cycle.
Early data warehouse efforts usually relied on custom-coded ETL, and many still do. More recently, as packaged ETL solutions have come on the market and become integrated with BI front ends, more customers use purchased ETL solutions. Popular ETL solutions include Informatica PowerCenter, IBM WebSphere DataStage, Oracle Data Integrator, Ab Initio, and Microsoft SQL Server Integration Services.
NOTE: Throughout this book, I will mention specific vendor products as a way of providing you with concrete examples. These listings are not exhaustive, and exact product names frequently change amid vendor acquisition and product releases.
In designing a data warehouse, requirements analysts will ask users what they need so that the ETL specialists can figure out what should be extracted from the source systems. Because much of BI is unpredictable in nature and users often don’t know what they want until they see it, you might ask “why not extract everything?” in the event that you might one day need that data.
There are a number of reasons why all the data should not be extracted:
As the data warehouse industry has matured and ETL tools have evolved, this market segment is increasingly referred to as enterprise information management (EIM). EIM includes ETL tools but also will include data modeling tools, data quality, data profiling, metadata management, and master data management (MDM).
IT professionals talk a lot about metadata and go to great pains to make the business understand its importance. So with a chuckle, I will give you the classic definition: metadata is data about the data. Helpful, isn’t it?
Metadata is similar to a card file in a library or book details on Amazon.com. A card file in a library (or the book details on Amazon) tells you which category a book belongs to, when it was published, and so on. Metadata may describe such things as:

- When the data warehouse was last loaded
- Whether the load completed successfully or encountered errors
- Which source system and fields the data comes from
- How a business term such as “revenue” is defined and calculated
The first few bullets in this list may not be all that interesting to many business users but they are critical in the design and functioning of a data warehouse. These items are also important in knowing how timely the information you are analyzing is. If, for example, the data warehouse did not fully load due to an error, you need to be aware of this and consider this incomplete data in your reports.
As you move down the list, the items become much more important to all business users. A salesperson, for example, may have a different definition of revenue than a finance person would. As more people use BI, metadata is critical in ensuring a common business terminology and in ensuring users really know what the data means.
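A simple way to picture metadata is as a record attached to each warehouse table. The sketch below is hypothetical; real BI platforms store this in dedicated repository tables, but the content is of this kind:

```python
from datetime import datetime, timezone

# A hypothetical metadata record for one warehouse table: technical items
# (load time, row counts, load status) alongside a business definition.
table_metadata = {
    "table": "ORDERS_FACT",
    "source_system": "order entry",
    "last_load_utc": datetime(2008, 3, 18, 2, 0, tzinfo=timezone.utc).isoformat(),
    "rows_loaded": 1_250_000,
    "load_status": "complete",
    "business_definitions": {
        "revenue": "Invoiced amount net of returns, excluding tax and freight",
    },
}

def is_current(meta, status="complete"):
    """Reports can check the load status before presenting the numbers."""
    return meta["load_status"] == status

print(is_current(table_metadata))  # True
```

The technical items answer “can I trust this load?”; the business definitions answer “what does this number actually mean?” – the two questions raised above.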
Master Data Management
David Loshin, president of Knowledge Integrity and a specialist in information quality and master data management (MDM), defines MDM as follows:
Master data management is comprised of the business applications, methods, and tools that implement the policies, procedures, and infrastructure to support the capture, integration, and subsequent shared use of accurate, timely, consistent, and complete master data.1
Master data consists of the codes and descriptions for customers, products, the chart of accounts, regions, and so on. Master data management is what ensures that the product ID in the product table shown earlier in Figure 2 is, ideally, the same ID across all the applications. This product ID is stored and maintained in one common place so that the relevant operational and business intelligence systems can access and share it.
In practice, rarely is there a single product ID for a variety of technical and organizational reasons. In this case, master data will include the mappings of the different product IDs that really are the same product represented in different systems. Master data also includes hierarchies of how individual products, customers, and accounts aggregate and form the dimensions by which you analyze various facts (see the “Data Warehouse Tables” section later). If this all sounds a little boring and unimportant to you, read the story of how pivotal a role master data has played in Dow Chemical’s business intelligence success in Chapter 7.
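The mapping idea can be sketched in a few lines. All the IDs and system names here are invented for illustration; a real MDM hub stores such cross-references in managed tables rather than code:

```python
# One "golden" master record per product, with a cross-reference of the
# different IDs that represent the same product in different systems.
master_products = {
    "P-100": {"name": "Widget", "group": "Hardware",
              "source_ids": {"orders": "WID-1", "accounting": "4711"}},
}

def to_master_id(system, source_id):
    """Resolve a source system's product ID to the shared master ID."""
    for master_id, rec in master_products.items():
        if rec["source_ids"].get(system) == source_id:
            return master_id
    return None  # unmapped ID: a data-quality issue to resolve

print(to_master_id("orders", "WID-1"))     # P-100
print(to_master_id("accounting", "4711"))  # P-100
```

Both source IDs resolve to the same master record, which is exactly what lets a report combine order and accounting data for one product.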
A data warehouse is the collection of data extracted from various operational systems, transformed to make the data consistent, and loaded for analysis. With some business users, “data warehouse” has become a dirty word, associated with “expensive,” “monolithic,” and of no business value. Other terms, such as reporting database and data mart, are also used and may sound less monolithic to some business stakeholders. In reality, all of these serve similar purposes but may differ in scope and technical architecture.
Many ERP implementations were sold on the promise of delivering business insight. They don’t. Having a single operational system that ensures consistent business processes and that uses consistent reference data (customer, product codes) will make business analysis significantly easier. But there are a number of fundamental differences between operational systems and data warehouses, highlighted in Table 1.
It is because of these myriad differences that I would argue all companies need a data warehouse, regardless of the size of the company. The technical architecture of the data warehouse may vary, but its necessity does not. I have worked with customers with fewer than 20 employees and less than $1 million in revenues who needed a “reporting database,” and I have worked with customers with greater than $20 billion in revenues who needed a “data warehouse.”
Many customers new to BI want to skip the data warehouse and deploy a BI tool directly against the operational system. This may seem like a faster approach to business intelligence. In some instances, it may be an acceptable way to start with BI, and this approach addresses operational BI needs. However, for most companies, you will want a data warehouse when:
Table 1: Comparison of Operational Systems with Data Warehouses
A data mart is a subset of the data coming from a central data warehouse. A data mart also may be used to feed a central data warehouse. Whereas a data warehouse is designed to serve the needs of the enterprise, a data mart may serve the needs of a particular business unit, function, process, or application. Because a data mart is aligned with a particular business requirement, some businesses may want to skip the data warehouse and build an independent data mart. According to industry research, fewer companies now do this2 as independent data marts have been met with limited success and over time have a higher implementation cost.
Within the data warehouse, data is physically stored in individual tables within a relational database. Your company may use the same relational database software for your ERP system as for your data warehouse (for example, Oracle, Microsoft SQL Server, IBM DB2) or a relational database specifically designed for business intelligence (Teradata, SAS Intelligence Storage).
Experts will deploy a number of different table design approaches to support the diverse business needs, performance requirements, and storage constraints. Most data warehouses have two types of tables: fact tables, which contain keys into the dimension tables plus the numeric information to analyze, such as sales, inventory, or calls (such facts are often referred to as measures); and dimension tables, which allow analysis of measures from different perspectives such as product, time, or geography.
A fact table can have millions of detailed rows of data, commonly referred to as having a “finer granularity,” or can be significantly smaller, containing mainly summary numbers. To improve the performance of queries, database designers may choose to create aggregate or summary tables around a fact table such that there may be a DAILY_SALES_FACT table, MONTHLY_SALES_FACT table, and YEARLY_SALES_FACT table. One fact table together with its associated dimension tables is referred to as a star schema, as shown in Figure 4.
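A summary table is simply the detailed fact table pre-aggregated to a coarser grain. The sketch below, using SQLite with invented table and column names, builds a MONTHLY_SALES_FACT from a DAILY_SALES_FACT so that month-level queries scan far fewer rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE daily_sales_fact (sale_date TEXT, product_id INTEGER, sales REAL);
INSERT INTO daily_sales_fact VALUES
    ('2008-01-05', 1, 100.0), ('2008-01-20', 1, 50.0), ('2008-02-02', 1, 75.0);
-- Pre-aggregate to a monthly grain: one row per product per month.
CREATE TABLE monthly_sales_fact AS
    SELECT substr(sale_date, 1, 7) AS sale_month, product_id, SUM(sales) AS sales
    FROM daily_sales_fact
    GROUP BY substr(sale_date, 1, 7), product_id;
""")
rows = cur.execute(
    "SELECT * FROM monthly_sales_fact ORDER BY sale_month").fetchall()
print(rows)  # [('2008-01', 1, 150.0), ('2008-02', 1, 75.0)]
```

A query against the monthly table reads two rows instead of three here; against millions of daily rows, the savings are what makes summary tables worthwhile.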
Dimension tables are also referred to as lookup tables or reference tables. The dimension tables can be broken into more than one table; for example, detailed material IDs may reside in a MATERIAL_ID table. The groupings and product hierarchy for the material IDs may reside in a separate table such as PRODUCT_GROUPING. This type of structure is referred to as a snowflake design and is used in data warehouses that have extremely large dimensions. You can think of dimensions as the ways by which you want to analyze facts, for example, sales by geography or sales by product.
Figure 4: Star schema
Figure 5: Snowflake design
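A star schema like the one in Figure 4 can be sketched in a few lines of SQL. The table names, IDs, and values below are invented for illustration, using SQLite purely because it runs anywhere:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
-- Dimension tables: the perspectives for analysis.
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY,
                          product_name TEXT, product_group TEXT);
CREATE TABLE time_dim    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- Fact table: keys into the dimensions plus the measure to analyze.
CREATE TABLE orders_fact (product_id INTEGER, date_id INTEGER, sales REAL);
INSERT INTO product_dim VALUES (1, 'Widget', 'Hardware'), (2, 'Gizmo', 'Hardware');
INSERT INTO time_dim    VALUES (101, 2008, 1), (102, 2008, 2);
INSERT INTO orders_fact VALUES (1, 101, 500.0), (2, 101, 300.0), (1, 102, 200.0);
""")

# "Sales by product": join the fact table to one dimension and aggregate.
rows = cur.execute("""
    SELECT p.product_name, SUM(f.sales)
    FROM orders_fact f JOIN product_dim p ON f.product_id = p.product_id
    GROUP BY p.product_name ORDER BY p.product_name
""").fetchall()
print(rows)  # [('Gizmo', 300.0), ('Widget', 700.0)]
```

Every analytic question against a star schema follows this same shape: join the fact to whichever dimensions frame the question, then group and aggregate the measures.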
In a transaction system, data is stored in a way that allows for fast data entry with minimal amounts of data duplicated across the physical tables. Data is said to be stored in normalized tables in a transaction system when a minimal amount of data is replicated in each table and a data element needs to be updated in only one place. For example, the same customer name does not appear in multiple rows in a table. In a data warehouse or data mart, the emphasis is on storing data in ways that facilitate analysis and that speed query performance. Data redundancy is less of a concern, and as the data warehouse is a read-only environment, there is less concern about having to change multiple instances of the same value in thousands of tables and rows. Normalization in an operational system means the facts and the dimensions will be spread across many tables. For example, order information may exist in both an ORDER_HEADER table and an ORDER_LINES table as shown next. Trying to report on which customers bought which products means joining and aggregating information across multiple tables, which is slower and, if the joins are constructed incorrectly, can produce incorrect query results (for example, double-counting order-level amounts across the order lines). Earlier, in Figure 4, all of the order information was extracted into a single ORDERS_FACT table, making it easier to query.
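The double-counting risk of querying normalized tables directly is easy to demonstrate. In this sketch (tables and values invented for illustration), a single order-level freight charge repeats on every joined order line, so a naive sum overstates it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
-- Normalized transaction tables: order-level facts in the header,
-- line-level facts in the lines.
CREATE TABLE order_header (order_id INTEGER PRIMARY KEY, customer TEXT, freight REAL);
CREATE TABLE order_lines  (order_id INTEGER, product TEXT, amount REAL);
INSERT INTO order_header VALUES (1, 'Acme', 10.0);
INSERT INTO order_lines  VALUES (1, 'Widget', 100.0), (1, 'Gadget', 50.0);
""")

# Naive join: the one freight charge appears once per order line,
# so summing across the joined rows double-counts it.
naive = cur.execute("""
    SELECT SUM(h.freight)
    FROM order_header h JOIN order_lines l ON h.order_id = l.order_id
""").fetchone()[0]
print(naive)  # 20.0, though the actual freight charged was 10.0
```

A well-designed fact table avoids this by storing each measure at a single, consistent grain, which is one reason the warehouse restructures transaction data rather than simply copying it.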
Dimensions and hierarchies often do not exist in the transaction system. For example, the transaction system may store a plant ID for the individual facility that produces a product, but it may not contain information about where the plants are located and to which business units they belong. This hierarchical information is often only stored in a data warehouse or in a separate master data management system.
In some respects, business users may not care about how the data is physically stored, whether in the data warehouse or in the transaction system. A business view in the BI tool (see Successful Business Intelligence, Chapter 3, the section “A Business View of the Data”) will often hide such technical issues. However, the better business users can define their requirements in advance, the better data modelers can store the data in a way that facilitates analysis. For example, if a user wants to analyze something like staffing levels versus sales performance and these two subjects exist in different fact tables and data marts, such analysis can be a challenge with certain BI tools. If users want to routinely analyze these two subject areas together, then the data modeler may ultimately decide to store them in one common fact table.
To drive a car, you need roads, highways, and a transportation infrastructure, just as in a BI environment a number of servers and networks may be involved:
As part of this technical infrastructure, multiple servers may mirror each other for performance, load balancing, failover support, and so on. The network between the servers and the end users (whether internal business users or external customers) is also a critical pinch point in the BI environment. Much of the time, this infrastructure gets taken for granted unless there are performance or reliability issues.
For smaller businesses, targeted BI deployments, or those with scalability issues, two emerging technologies are worth noting: appliances and Software as a Service (SaaS).
Data warehouse appliances combine the server, the database, and the data storage into one system.3 Business intelligence appliances that also include BI capabilities are even newer. Leading data warehouse appliance vendors include Netezza and DATAllegro. Hardware vendors such as IBM and HP also increasingly offer data warehouse appliances. Cognos Now! is a new business intelligence appliance. The promise of an appliance is a complete, optimized solution that delivers better performance at a lower cost of ownership than if a company were to purchase and install these individual components.
In a SaaS model, BI deployment gets even simpler, as customers simply subscribe to a solution. A third-party vendor hosts the technical infrastructure that customers then access via the Web. Customer relationship management (CRM) vendor salesforce.com uses a SaaS model. Business intelligence solutions available as SaaS include Business Objects’ CrystalReports.com, SeaTab’s PivotLink, and LucidEra.
The BI architecture consists of the ETL tools and processes, the data warehouse, the technical infrastructure, and the BI user tools. The operational systems provide the basic data that feed the data warehouse either in real-time or on a periodic basis. The underlying foundation of a BI architecture is complex. The implementation can either facilitate business intelligence or become so monolithic and inflexible that it becomes a technical data wasteland. To ensure the BI architecture meets the business requirements: