This study examines the nature and importance of the data warehouse (DW) appliance to the area of business intelligence (BI) and specifically to data warehousing. We consider a data warehouse to be a critical part of the infrastructure to support business intelligence within an enterprise.
A data warehouse appliance is an integrated package of hardware and software that supports data warehousing for a company. The objective for a data warehouse appliance is one purpose, one package, one install and one service. In other words, when deploying a data warehouse appliance, the purchaser should expect to purchase the product from a single vendor, receive one shipment of an integrated pre-configured and pre-tested system, install it as a single deployment, operate it as a single system-management unit, and service it from one vendor support group.
To explore this topic, this research paper focuses on the following questions:
It is critical to understand your company’s requirements for data warehousing. In particular, will the data warehouse appliance being considered serve as a single-function data mart, or as an enterprise data warehouse (EDW) supporting multiple subject areas? As your company evolves, your system requirements will change significantly, requiring periodic reevaluation of architecture and technology. Many companies that are beginning a data warehousing effort will start with a data mart and then evolve toward an enterprise data warehouse. Thus, their evolving requirements may call for significant changes in hardware and software.
The data warehouse appliance is an evolution in technology with incrementally faster processors, new chip technologies, larger disks and the like. However, the data warehouse appliance has been a revolution in the practice of data warehousing, enabling enterprises to conduct business in entirely new ways. As a result, companies are reexamining the role of business intelligence to remold critical business processes throughout the enterprise.
Everything should be made as simple as possible, but not one bit simpler.
– Albert Einstein
Data warehouse appliances are about making data warehousing simple and powerful. As the quote from Albert Einstein suggests, any technology (including data warehousing) should be made simple. That simplicity, in turn, should drive the power to analyze data in new ways for new uses within the enterprise.
In the global economy, doing business as usual is no longer an option. For all businesses and governments, it is essential to understand and respond quickly and properly to unexpected changes. The objective of data warehousing is to maintain a consistent view of business reality. The objective of business intelligence is to leverage that consistent view for smart decisions at all levels of the organization. This combination is enabling enterprises to conduct business in entirely new ways.
Many consider a data warehouse appliance to be just a clever marketing ploy of the vendors. However, our experience over the last three years indicates that there is something new and different about data warehouse appliances. This is due to better performance results for certain query processing, simplicity of installation and operation, and a lower total cost of ownership (TCO) for a multi-terabyte data warehouse. Hence, data warehouse appliances have had a disruptive influence in the marketplace and on the practice of data warehousing.
Whether or not the deployment of a data warehouse appliance is suitable for your company, an understanding of data warehouse appliance technology is required. This new technology is changing the way that we think about large-scale data warehousing. In the past, data warehouses were often complex and expensive. By eliminating these constraints, data warehouse appliances are causing a rethinking of the role of data warehousing in enterprise architectures, having a significant impact on the core business of an enterprise.
It is important to realize that a data warehouse appliance is an evolution of technology that started decades ago. In the early 1980s Britton-Lee and Teradata sold self-contained database machines that had simple SQL functionality.1 Although their capabilities were small compared with today’s appliances, these early database machines were used quite effectively for executive decision support.
Today’s business requirements far exceed the requirements of those early years. The importance of data warehouse appliances is being driven by the need for both extreme performance and extreme scalability. The reason for the term extreme is that the demands of global business are continually pushing the limits of technology.
In studies by WinterCorp2, the largest data warehouse was 110TB in 2005. Now, just two years later, data warehouses exceeding a hundred terabytes are common, and some are being designed for petabyte capacities. “From 2003 to 2005, the size of the largest data warehouse tripled, breaking the 100TB barrier. The number of rows rose five-fold and peak workload exceeded 1 billion SQL statements per hour.” In general, WinterCorp has observed that the size of the largest databases has tripled every two years since 2001, as illustrated in Figure 1.
Figure 1: Size of the Largest Data Warehouses
The size of data is one of many factors in the performance of a data warehouse. However, size correlates positively with data loading rates, report/query loads, heavy analytical loads, and other performance factors.
Experience dictates that regardless of the initial business requirements, future requirements are likely to rapidly increase by an order of magnitude. For example, an initial 10TB data warehouse is likely to grow to 50 or even 100TB in a few years, and an initial 100 users could expand to include most of your employees. Therefore, the ability to handle extreme scalability is as important as extreme performance.
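The growth figures above can be turned into a rough projection. The following is a minimal sketch in Python, assuming (for illustration only) that an individual data warehouse grows at the same rate WinterCorp observed for the largest databases, that is, tripling every two years. The function names and the 10TB starting size are hypothetical and are not part of the WinterCorp studies.

```python
import math


def projected_size_tb(initial_tb, years, factor=3.0, period_years=2.0):
    """Projected size after `years`, assuming the size multiplies by
    `factor` every `period_years` (assumption: tripling every two years)."""
    return initial_tb * factor ** (years / period_years)


def years_to_reach(initial_tb, target_tb, factor=3.0, period_years=2.0):
    """Years needed to grow from `initial_tb` to `target_tb` at the same rate."""
    return period_years * math.log(target_tb / initial_tb) / math.log(factor)


if __name__ == "__main__":
    # Hypothetical starting point: a 10TB data warehouse.
    for year in (0, 2, 4, 6):
        print(year, "years:", round(projected_size_tb(10, year), 1), "TB")
    print("years to reach 100TB:", round(years_to_reach(10, 100), 2))
```

Under this assumption a 10TB data warehouse reaches 30TB after two years and 90TB after four, which is consistent with the 50 to 100TB range mentioned above. Actual growth depends on the business, so the rate should be treated as a planning input rather than a prediction.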
When used in the context of a data warehouse, the term appliance is often misunderstood and misused. This section examines a set of general criteria for evaluating an appliance by starting with the basics.
The term appliance emerged in the early 1900s when electricity started to power household devices, such as washing machines. Before then, electric motors were used as generic sources of power for many devices. An appliance was initially considered a luxury because an expensive multipurpose component (electric motor) was used for a limited purpose.
As shown in Figure 2, the Webster Dictionary defines appliance as the “act of applying.” This implies that an appliance is being put to use for a practical purpose. The second definition is a piece of equipment (tool or device) “designed for a particular use or function” to enable the act of applying.
Figure 2: Webster Dictionary Definition of Appliance
The use of a standardized energy source (electricity or gas) is also an aspect of the definition. This aspect is probably why we usually do not consider a pot, sink, bathtub, bicycle, chair or the like to be an appliance.
Hence, an appliance is “a machine that is designed for a specific purpose.”
Household appliances, such as a toaster, can be used to illustrate this definition. An electric toaster serves the specific purpose of browning bread, so that butter and jam can be properly supported. Before the advent of the toaster, bread was toasted by placing it into a metal frame and holding it over a fire4.
A search of the websites for Sears, Target, and Wal-Mart produced some interesting observations about toasters:
An oven can also enable us to brown bread and thus make toast. However, an oven can perform many other functions, such as baking and roasting. So, is an oven also an appliance? Is a toaster more of an appliance than an oven? This illustrates the importance of defining the intended purpose of an appliance. Both the toaster and oven are examples of kitchen appliances, which are also part of household appliances. Thus, the scope of the purpose is important.
This discussion highlights the complexity of applying the term appliance meaningfully to ordinary products. This is especially true when trying to understand a data warehouse appliance. A vendor can label any product as an appliance. The challenge is to understand the implications behind that label.
For instance, data warehousing is a specific type of information processing. At the same time, data warehousing is a general approach for supporting business intelligence. Hence, from an IT perspective, a data warehouse appliance serves a specific purpose, but from a BI viewpoint it is more generalized.
For a data warehouse appliance, its purpose alone is insufficient to understand its objectives. We must add additional criteria. In this research paper, we will use three criteria – purpose, convenience and value – for understanding and evaluating a data warehouse appliance.
From the classical definition, an appliance is designed for a specific purpose. The challenge is to define its purpose in terms of scope, design and ends – not means.
First, the purpose should have the proper scope. Echoing the quote of Albert Einstein, the scope should be as narrow as possible for its purpose, but not one bit narrower. In other words, the proper scope should provide a total solution for what is needed; no more and no less.
Second, the purpose should be reflected in its design. The expression purpose-built is often used to imply that an appliance in its design focuses on its purpose. A related example is Danish furniture, whose function and form together display an artistic harmony. The design of an appliance can vary from a native appliance (all components are unique and proprietary) to a packaged appliance (standard components but assembled and configured uniquely).
Third, the purpose of an appliance should be defined in terms of what it does, rather than on how the function is achieved. In other words, the emphasis should be on the ends, rather than the means. A toaster browns bread. Its purpose should be described in terms of the quality of browning and the types of bread, rather than the way the electricity heats the heating elements or how the timer pops the bread up at the right moment. Stating the ends, instead of the means, may be simple for toasters, but imagine the complexity of a data warehouse.
The best appliance is the one that gets the job done (i.e., fulfills its purpose) with the greatest convenience and value. The next section covers these topics.
In addition to fulfilling its purpose, an appliance should offer ease of use, simplicity and compatibility.
First, ease of use is defined as the amount of time, effort and resources required to utilize an appliance across the various phases of its life cycle – acquisition, deployment, operation, maintenance and retirement. Simply put, an appliance should be simple to install and use.
Second, an appliance should be simple to understand, even though its mechanism may be very complex. It is quite a challenge for the designer to encapsulate the functionality of an appliance in the simplest manner possible and not one bit simpler. All components should appear as a single integrated package, like that of a black box. Only those features necessary to control the purpose of the appliance should be visible to the user.
Third, an appliance should be compatible, which is defined as living in harmony with its environment6. A toaster should be compatible with its power source. Plugging a U.S. appliance (using 110 volts) into a European power outlet (supplying 220 volts) is not an example of compatibility.
The best appliance is one that is convenient in its ease of use, simplicity and compatibility.
The last criterion is the value that an appliance provides to the business, which is directly related to alignment, cost and performance.
First, the purpose of the appliance should align with the intended requirements of the business. For instance, a toaster that toasts bread but not bagels does not align with the requirements of a bagel shop.
Second, an appliance should have low cost in terms of its total cost of ownership (TCO). An appliance may be very inexpensive to purchase. However, the total cost over its lifetime could be quite high for many factors, such as maintenance, power/space requirements, education/training, and so on.
Third, an appliance should have high performance in achieving its purpose. For instance, the performance of the toaster should be related to toasting a slice of bread, such as the number of slices toasted per hour. Cost must be considered in the context of TCO. Likewise, performance must be considered in the larger context of reliability, scalability, security and the like.
In summary, the general criteria for an appliance are shown in Figure 3. It shows the three main criteria of purpose, convenience and value, along with the breakdown of each.
Figure 3: General Criteria for an Appliance
This section extends the general criteria – purpose, convenience and value – to address the specific criteria for a data warehouse appliance. In particular, we assert that a data warehouse appliance should adhere to the following requirements:
Examples of products that are not data warehouse appliances are generalized DBMS products that are supported on a variety of commercial hardware platforms, such as Microsoft SQL Server and Oracle Database. These products can be used for a variety of purposes (such as for both business transaction and business intelligence processing), require several other separate components to build a system, may be installed separately from other components, and may require several support points. However, there is no reason why these products could not be used in the construction of a data warehouse appliance solution.
The remainder of this section explains the purpose and architecture of a data warehouse appliance. Next, the various types of a data warehouse appliance are listed. Finally, the distinguishing criteria that should be considered when selecting a data warehouse appliance solution are given.
The purpose of a data warehouse appliance is best understood as part of a stack of BI functions, as shown in Figure 4.
Figure 4: Stack of Business Intelligence Functions
The top layer is the information delivery function, which supports dashboards, portals, desktop integration, collaboration, and search. The next layer is the BI applications/tools that support integrated planning methodologies, predictive analytics, and operational BI processes. Below that is the data integration services layer, which supports data integration and business content across the enterprise.
Data warehousing is encapsulated within the last two layers of the BI stack, which are data management and system software/hardware. Data management for data warehousing increasingly involves massively parallel processing (MPP), along with high-performance analytics and the ability to handle complex workloads. Likewise, system software/hardware may involve open source components, commodity hardware, and reliable mass storage.
In particular, we assert that the purpose of a data warehouse appliance is:
To enable high performance data warehousing with a total cost-of-ownership (TCO) that enables a rapid return-on-investment (ROI) to the business.
The two important phrases to note in this definition are “high performance data warehousing” and “rapid ROI to the business.” There is only a business case for using a data warehouse appliance if its total installation and operating costs are less than the business benefits achieved by its use. Current business pressures also make it important that this ROI be achieved as rapidly as possible.
A data warehouse appliance provides high performance data warehousing at a lower TCO. This enables business cases for BI projects to be built that were previously not possible for technology reasons (performance was unsatisfactory) or cost reasons (there was a negative ROI). The TCO of a data warehouse appliance is lower not only because the cost of the hardware and software is lower, but also because the simplicity and ease of use of the environment reduces installation, administration and support costs. The improved usability of a data warehouse appliance also has the benefit that projects can be developed and deployed faster. This reduces the time to business value. Data warehouse appliances therefore provide a reduced TCO (and thus increased business ROI) with a better time to business value.
All the customers interviewed for the study were achieving significant ROI and were very satisfied with the appliances they were using. All of them also commented, however, that the total cost of a data warehousing project involves more than just the cost of the data warehouse appliance. The largest cost of many projects is the effort involved in acquiring and integrating source data for loading into a data warehouse. A data warehouse appliance does not reduce this cost.
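The relationship between TCO, ROI and time to value described above can be sketched in a few lines of Python. This is an illustrative model only: the class name, the cost categories and every figure in the usage note are hypothetical, and the data integration cost is kept as a separate item because, as noted above, the appliance does not reduce it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostModel:
    """Hypothetical cost model for a data warehouse appliance project."""
    acquisition: float        # hardware and software purchase
    annual_operations: float  # administration, support, power/space, training
    data_integration: float   # acquiring and integrating source data
    years: int                # planning horizon

    def cost_through(self, year: int) -> float:
        """Cumulative cost at the end of the given year."""
        return self.acquisition + self.data_integration + self.annual_operations * year

    def tco(self) -> float:
        """Total cost of ownership over the whole planning horizon."""
        return self.cost_through(self.years)


def roi(total_benefit: float, total_cost: float) -> float:
    """Return on investment as a fraction of total cost."""
    return (total_benefit - total_cost) / total_cost


def payback_year(model: CostModel, annual_benefit: float) -> Optional[int]:
    """First whole year in which cumulative benefit covers cumulative cost,
    or None if that does not happen within the planning horizon."""
    for year in range(1, model.years + 1):
        if annual_benefit * year >= model.cost_through(year):
            return year
    return None
```

For example, with hypothetical figures of 1,000,000 for acquisition, 200,000 per year for operations and 500,000 for data integration over five years, the TCO is 2,500,000. A total benefit of 4,000,000 then gives an ROI of 0.6, and an annual benefit of 800,000 pays back in the third year.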
The following sections amplify on the objectives of a data warehouse appliance by explaining the architecture that enables higher performance, lower TCO, and quicker return on investment (or faster time to value).
A variety of architectures are employed in a data warehouse appliance. The discussion below, however, should provide sufficient detail to help readers understand data warehouse appliance operations and identify important features that distinguish products.
The main components of a data warehouse appliance are the server, disk storage, and database software, as shown in Figure 5.
Figure 5: Data Warehouse Appliance Architecture
All data warehouse appliance servers provide some variation of MPP. The MPP details are not as important as a demonstrated capability for good performance and scalability. A data warehouse appliance typically consists of one or more front-end processors (running the UNIX or Linux operating systems) connected by a high-speed interconnect (gigabit Ethernet, InfiniBand, or fibre channel, for example) to a set of back-end processors. Data warehouse appliances, however, are sometimes built using other hardware configurations.
Each back-end processor has its own disk storage, maintaining a portion of the database. When a front-end processor receives a database query, it splits the query into sub-components and routes the sub-components to the appropriate back-end processors. Each back-end processor runs its query sub-component against its part of the data warehouse database. The results are returned across the interconnect to the front-end processor. The front-end processor then combines the result data coming across the interconnect and returns the final query result to the requesting application.
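The query flow described above (split, route, run locally, combine) can be illustrated with a small Python sketch. It is a simplified model rather than any vendor's implementation: threads stand in for back-end processors, in-memory lists stand in for each processor's disk storage, and the table contents and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Each back-end processor owns one portion of the table (hypothetical rows).
PARTITIONS = [
    [{"region": "east", "amount": 100}, {"region": "west", "amount": 50}],
    [{"region": "east", "amount": 25}],
    [{"region": "west", "amount": 75}, {"region": "east", "amount": 10}],
]


def backend_scan(partition, region):
    """Query sub-component: scan the local partition sequentially and
    return a partial sum for the requested region."""
    return sum(row["amount"] for row in partition if row["region"] == region)


def frontend_query(region):
    """Front-end processor: route one sub-component to each back end,
    then combine the partial results into the final answer."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        partials = list(pool.map(lambda part: backend_scan(part, region), PARTITIONS))
    return sum(partials)


if __name__ == "__main__":
    print("east total:", frontend_query("east"))
    print("west total:", frontend_query("west"))
```

In a real appliance the sub-components travel across the interconnect and the combine step may involve sorting or joining rather than a simple sum, but the division of work between front-end and back-end processors follows the same pattern.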
Data warehouse appliance vendors often advertise an appliance as being built using off-the-shelf commodity hardware. The meaning of the term commodity varies. In some cases, it means the vendor purchases standard components, such as microprocessor chips and hard disk drives and then builds its own custom appliance. Commodity may also mean that the vendor employs standard assemblies, such as server units and disk storage units. This latter definition is used in this report. The terms custom and commodity can also be applied to the system and database software used in a data warehouse appliance.
The advantage of a custom solution is that the server hardware can be optimized for data warehousing workloads. The advantage of a commodity hardware approach, on the other hand, is that it is more likely to satisfy an organization’s hardware standards and preferred supplier purchasing policies. Likewise, the more customized the hardware, the less likely it can be used for other purposes.
The hardware and software used to build a data warehouse appliance will affect both the performance and the reliability of the server. Performance is determined by factors such as the speed of the processors, the amount of memory, the interconnect speed, disk drive performance, and degree of parallelism. Reliability is affected by the mean time between failures (MTBF) of the individual components, the level of redundancy built into the hardware (power supplies, processor boards, interconnects, disk controllers and drives, for example), and the use of features such as RAID storage.
One consideration when evaluating a data warehouse appliance server is its flexibility for expansion. Can additional processors, memory, and disk storage be added without having to completely replace the hardware, and without a major disruption to processing? Will the data warehouse appliance data and workload be automatically redistributed to take advantage of the expanded configuration?
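To make the redistribution question concrete, the sketch below shows what adding a processor can imply for data placement. It assumes a simple hash-modulo distribution scheme, which is an assumption for illustration only; actual appliances may distribute data quite differently. The function names are hypothetical.

```python
import zlib
from collections import defaultdict


def node_for(key, node_count):
    """Assign a row key to a back-end node using a deterministic hash
    (assumption: hash modulo the number of nodes)."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def distribute(keys, node_count):
    """Return a mapping of node number to the row keys stored on it."""
    placement = defaultdict(list)
    for key in keys:
        placement[node_for(key, node_count)].append(key)
    return placement


def rows_to_move(keys, old_count, new_count):
    """Row keys whose node changes when the configuration is expanded."""
    return [k for k in keys if node_for(k, old_count) != node_for(k, new_count)]


if __name__ == "__main__":
    keys = range(10_000)
    moved = rows_to_move(keys, old_count=4, new_count=5)
    print(f"{len(moved)} of {len(keys)} rows change nodes after expansion")
```

Under a plain modulo scheme a large share of rows changes nodes when one node is added, which is why it is worth asking a vendor how redistribution is performed and how disruptive it is to ongoing processing.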
New vendors in the data warehouse appliance marketplace frequently custom-build data warehouse appliances based on modified and sometimes extended versions of an open source relational DBMS, such as Ingres or PostgreSQL. Appliances that come from more traditional large system suppliers, such as HP, IBM and Teradata, usually employ their own relational DBMS product that has undergone many years of development. These latter DBMS products provide a richer, but sometimes more proprietary, set of DBMS features. As is often the case, this richer functionality increases the complexity of the product.
When data warehouse appliance vendors claim support for an open source DBMS, they usually imply that the appliance supports the SQL interfaces of the open source product. Under the covers, the open source relational DBMS may be highly modified to support the hardware configuration of the appliance; and, in some cases, little remains of the original product except for the SQL interfaces, SQL parser, and some elements of the relational optimizer. The more modified the DBMS is, the more difficult it will be for the vendor to support new features being added to the original open source codebase.
All data warehouse appliance vendors provide support for ODBC and JDBC SQL interfaces and a specific level of the SQL standard. It is quite common to see a vendor advertise support for the SQL-92 standard and some functions from SQL-99. It is important to realize that these standards are large and have several different levels of possible support. In addition, recent additions have been made to the SQL standard and published as SQL-2003 and SQL-2006. It is important, therefore, to understand the SQL functionality provided with any given appliance. Existing custom-built applications and third-party vendor tools that take advantage of advanced SQL features and proprietary DBMS extensions may need to be modified when they are migrated to an appliance.
SQL features to consider when evaluating a data warehouse appliance include stored procedures, triggers, referential integrity, analytical functions, complex data types (especially text and XML), temporary tables, and materialized views. Further, database administration functions should be considered, such as workload management, load and backup utilities, data encryption and compression, data security, administration tools, tuning of cache sizes, and data partitioning.
Lack of these features simplifies the environment, which is important for rapid data warehouse building, but may reduce performance and flexibility as workloads become larger and more complex.
Categorizing data warehouse appliances is not an easy task, and it is becoming more complicated as new vendors enter the marketplace, trying to create a niche for themselves.
For this study, we considered four types of data warehouse appliance:
Figure 6 shows how these various solutions have evolved from early database machines and relational DBMS software.
Figure 6: Data Warehouse Appliance Evolution
The dividing line between these types can be blurred. For example, Teradata fits within the native data warehouse appliance type but has aspects of the packaged data warehouse appliance type. The five sponsors (as noted in the figure) represent a good spectrum of offerings, as described in the Appendix.
Many companies make the big mistake of focusing on the entry cost of an appliance product. However, that price may be only one piece of the total cost of ownership (TCO) for a new system. Consider all the varied cost items of the system, realizing that there are many real costs hidden in intangible items. For example, following are some of the cost items within a typical TCO for a data warehouse appliance project:
Where is the bang for the buck? Data warehouse appliance products often display amazing performance results for certain data warehouse workloads. The issue is whether high performance will be produced for your unique workload that supports your business requirements. Without a detailed proof of concept (POC), it is impossible to resolve this issue. Even then, the real impacts may not be revealed until the system is several months into production.
Performance must be evaluated along multiple dimensions, such as query complexity, response times, number of queries per hour, number of users supported, database load/update time and so on. In addition, the data warehouse appliance must handle the demands of scalability, such as increases in the number of users, database size and number of concurrent queries.
Most data warehouse appliances achieve good performance through high-speed sequential processing of the data. Each sequential scan is performed in parallel across the many processors. This approach reduces or even eliminates the need for indexes (improving data load and update performance) and simplifies SQL optimization. Data warehouse appliances perform particularly well when processing SQL queries that involve table scans, star schema joins, and operations that involve tables that only have a single index. This is why a data warehouse appliance is highly suited for building a data warehouse that involves the processing of large amounts of data using simple queries, and for transforming and aggregating data before it is loaded into an enterprise data warehouse.
Many of the new data warehouse appliance vendors are now extending their products to support more complex workloads. The degree to which they have achieved success varies and depends largely on the workload being performed. This is why a realistic POC benchmark is important when evaluating data warehouse appliance solutions.
When selecting a data warehouse appliance for other than simple query data warehouse workloads and data staging applications, the key performance differentiators between data warehouse appliance vendors are support for complex physical schemas, complex queries, multi-user workloads, and mixed workloads (simple and complex queries, or intermixed updates and queries). Scalability in terms of data size, number of users, query volume, and number of parallel queries should also be considered. For operational BI applications, the ability to support low-latency data, event-driven processing, and service-oriented architecture are important factors. In fact, operational BI may be a suitable marketplace for another, more specialized, type of data warehouse application appliance.
In any given application workload, transactions either query (select) data or modify (insert, update, delete) data. Application workloads traditionally divide into two types: business transaction (BTx) operational workloads that support the daily running of business operations, and business intelligence (BI) workloads that monitor, report, and analyze those operations.
BI and BTx are types of application processing, in contrast to OLAP and OLTP, which are specific technologies. One source of confusion among these terms is that the word transaction has two meanings. In both BI and BTx, everything is a transaction (a so-called unit of work), including queries. However, the term transaction also gets equated specifically to BTx processing (hence the association with OLTP). Typically, BTx processes run short-running online transactions that may do a series of data query and data modification operations. As a matter of interest, BTx processing may also consist of long-running batch applications that consist of one or more transactions (units of work) depending on how many commit points they have.
BTx workloads support applications such as order entry, shipping and billing and usually consist of short-running transactions that select and modify small amounts of data. From a performance perspective, the focus is on workload management and on processing multiple transactions in parallel.
BI workloads usually involve simple to highly complex queries and very large amounts of data. Performance is achieved via parallel query processing. Also, the relational database optimizer plays a much more important role here. For high performance when using traditional DBMS technologies, more indexes and aggregations are created on the data as compared with business transaction workloads. Data warehouse appliances overcome the performance overheads of handling queries by moving processing and record selection toward the storage subsystem, tuning for sequential processing and often eliminating the need for most indexes.
In most enterprises, the two types, or styles, of application processing have blended together, as business requirements become more complex. This situation is shown in Figure 7.
Figure 7: Application Workload Spectrum
The efficient operation of a data warehouse becomes more difficult when organizations implement mixed complex workloads. There are two types of mixed workload.
The first type of mixed workload is a query workload involving both long-running complex queries and short simple queries. This is often the workload that occurs when the number of users (and thus query mix) increases, and as the processing moves from accessing a data mart to an enterprise data warehouse. Mixed query workloads usually require more indexes and better relational optimization to handle queries efficiently. These workloads also need intelligent workload management so that complex queries that may run for several minutes or even hours do not swamp machine resources and lock out the handling of the short simple ones that run in a few seconds.
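One simple form of the workload management described above is to cap the number of slots that long-running queries may occupy, so that short queries always find capacity. The Python sketch below illustrates that idea only; it is not how any particular product schedules work, and the class name, the slot counts and the 60-second threshold are hypothetical.

```python
from collections import deque


class WorkloadManager:
    """Admits queries into a fixed number of execution slots while capping
    the slots that long-running queries may hold at the same time."""

    def __init__(self, total_slots, max_long_slots, long_threshold_seconds=60):
        self.total_slots = total_slots
        self.max_long_slots = max_long_slots
        self.long_threshold_seconds = long_threshold_seconds
        self.running = []       # list of (name, is_long)
        self.waiting = deque()  # queries not yet admitted, in arrival order

    def submit(self, name, estimated_seconds):
        """Queue a query, classify it by its estimated run time, and try to admit it."""
        is_long = estimated_seconds >= self.long_threshold_seconds
        self.waiting.append((name, is_long))
        self._admit()

    def finish(self, name):
        """Release the slot held by a completed query and admit waiting work."""
        self.running = [q for q in self.running if q[0] != name]
        self._admit()

    def _admit(self):
        still_waiting = deque()
        while self.waiting:
            name, is_long = self.waiting.popleft()
            long_running = sum(1 for _, running_long in self.running if running_long)
            has_slot = len(self.running) < self.total_slots
            if has_slot and (not is_long or long_running < self.max_long_slots):
                self.running.append((name, is_long))
            else:
                # Keep arrival order; short queries may pass blocked long ones.
                still_waiting.append((name, is_long))
        self.waiting = still_waiting
```

With three slots and at most one long-running query, a second multi-hour query waits while short queries continue to be admitted, which is the behavior the paragraph above asks of intelligent workload management. Production systems typically add priorities, resource estimates and time limits on top of this basic admission control.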
A second type of mixed workload involves modifying data warehouse data concurrently with complex query processing. This type of workload may arise when trickle feeding updates to a data warehouse to overcome reduced batch windows or when supporting low-latency data for operational business intelligence.
A mixed query/update workload consists of short- and long-running transactions that may perform a mixture of updates, deletes, inserts, and selects. They can hit any part of the database. If the database is partitioned on time, then new records (not updates or deletes) are likely to be added to the last partition in the database. However, the database may be partitioned by other than date (e.g., hash). It may not even be partitioned. This type of workload has some similar characteristics to a BTx workload and, therefore, has similar technology requirements. In these workloads, technologies such as locking and concurrency schemes matter more. Workload management is also important, but workload management by itself is not sufficient. The overall quality of the product’s workload management, locking protocols, concurrency scheme, cache management, data and index management and relational optimization play an important role in achieving good performance in such an environment.
If you have a simple data warehouse workload, choosing the proper system is easier. It becomes more difficult when deploying a mixed query workload and especially when using workloads involving mixed query and data modification operations. A POC effort that is specific to your business requirements is the only way to understand the tradeoffs inherent in the application workload spectrum.
From the perspective of data warehouse appliance vendors who offer native appliances (i.e., designed from scratch as a data warehouse appliance) and software appliances, their evolution is to move from the right side toward the middle of the application spectrum where simple queries against large amounts of data are extended into an environment of a mixture of complex and simple queries and even concurrent updates. Some vendors, such as Teradata, have been moving in this direction for many years. Newer appliance vendors are just beginning to make this move, and their progress varies.
As workloads move toward the middle of the application spectrum and enterprise data warehousing, the data warehouse appliance must become less stand-alone. It must support a broader set of workloads and must coexist with other IT hardware and software systems. The new data warehouse appliance vendors are beginning to address these requirements, but it is in this area that the differences between the various products are most distinct.
From the perspective of vendors who offer packaged appliances (i.e., designed from pre-existing system components), their evolution is to provide cost-effective solutions for the right side of the application spectrum. Their legacy in workload management and query optimization is extended into super-fast complex query processing. The challenge for these vendors is reducing cost and providing a single install as a single package of hardware and software components. Further, any change to a component should be installed as part of a single system update rather than as a separate update for each component, because separate updates often result in system malfunctions.
The bottom line for all appliance vendors is that they must maintain their competitive advantage with the right balance of low TCO, high performance, easy administration and expandability.
For data warehousing, a major item in TCO is the proper staffing for the deployment, configuration, operation, reconfiguration, and maintenance of the data warehouse appliance. It is often the case that the proper staffing cannot be hired as employees, thus requiring outsourcing to remote service providers and long-term professional services by the vendor. This is not an easy issue to resolve. Ideally, the data warehouse appliance should be a plug-and-play appliance – just turn it on, and the data warehouse appliance performs optimally for your requirements.
An expected characteristic of any large-scale data warehouse project is that its requirements will change. The more successful the project is, the more change will be required. Expandability is related to the ease of scalability.
Flexibility is the ability to incrementally add hardware (e.g., disks, processors) without the need to do a major reconfiguration or a complete replacement of the existing system. In other words, the system should be field-upgradeable with automated reconfiguration.
Therefore, plan for expandability and flexibility, particularly to accommodate major increases in data sizes, data flow rates, query complexity, concurrent user demands, and workload variety, as well as demands for lower data latency.
The data warehouse appliance platform must be housed and operated within a managed data center. The implication is that the data warehouse appliance must be compatible with hundreds of standards, dealing with electrical power, cooling limitations, wiring conventions, cabinet constraints, lightning, operating system, storage system, system management monitoring, networking security, and so on. For a data warehouse appliance product to be friendly in this complex environment, careful planning and expertise are required.
Tracy Abdo of Network Solutions remarked, “With enterprise-level solutions, vendors must realize that they must support a mixed-vendor environment and that their solutions must integrate together. A company must also understand how enterprise data warehouse technologies can be used to solve its unique problems and be integrated into the overall IT environment.”
When evaluating appliances, both the vendor and the customer have specific responsibilities. The responsibility of the vendor is to clearly define the purpose, the design for that purpose, and delivery on specific value criteria. The responsibility of the customer is to clearly define their unique business requirements, state the required purpose of the appliance, select the best design for that purpose, and ensure the expected value is realized. If the two perspectives match, a win-win situation will result. If not, there will be great frustration for both parties.
Therefore, an evaluation should focus on the alignment of the business requirements to the capabilities of the data warehouse appliance. The person seeking to use a data warehouse appliance must understand precisely how it will be used within their business. There are no magical answers here. Tough tradeoffs must be made to best satisfy the requirements.
Along with defining your business requirements, you should also identify and prioritize your pain points – financial or political issues that engage the attention of senior management. For many enterprises, the pain points tend to cluster around the following issues.
First, the costs of business intelligence processing and data warehousing are constantly increasing, especially in the soft dollars of IT staffing and professional services. In some cases, the cost may prevent the launch of certain BI projects. One of the big advantages of data warehouse appliances is that they can enable BI projects that in the past have been too expensive to implement.
Second, data volumes are constantly increasing, causing capacities of the existing system to limit growth. This is why a data warehouse appliance solution must be flexible and allow for easy scalability.
Third, workloads on the data warehouse have become more complex, driven by the mixture of applications required for a large user base. As discussed earlier, the ability to support complex workloads is one of the main distinguishing features between different data warehouse appliance solutions.
It is important to understand how the various types of data warehouse appliances are positioned as to their strengths and applications. Here are some general guidelines.
The native and software data warehouse appliances have good performance for simple queries against large data stores. In addition, they have the potential to deliver this performance for a lower acquisition cost than packaged appliances. They are suited for single-function data marts and for data staging to an enterprise data warehouse (EDW) and large data marts, along with data cleanup, transformation, and aggregation. In some cases, these products may support more complex processing and an EDW environment, but this will vary by product.
The packaged data warehouse appliances have good performance for a wide range of workloads. They usually have good SQL support and database tools, but they tend to be more complex and have a higher cost than other data warehouse appliances.
The data management appliances require the additional overhead of a host computer and the host DBMS. Workload can be split between the host and the data management appliance, thus boosting performance. However, be careful of the SQL compatibility with the host DBMS.
In our interviews, some IT professionals remarked that custom-built approaches often have performance advantages over solutions built using commodity components, while others have commented that commodity components have better reliability and make it easier to integrate the data warehouse appliance into the enterprise IT environment.
As you consider a specific data warehouse appliance product, here is a checklist of detailed features that should be considered:
Because of the complexity of these features and of the business requirements, a methodical POC is required for successful selection and installation of a data warehouse appliance.
As part of the research study, interviews were conducted with a customer of each of the five vendor sponsors. This section highlights several insights that surfaced from the interviews about current experiences with data warehouse appliances.
DATAllegro Case Study of TEOCO Corporation – TEOCO Corporation is a 150-person company in Virginia providing revenue management to telecommunications firms. The problem was identifying and correcting billing errors with high data volumes from many data feeds requiring complex queries. The solution was a 120TB analytic data mart. The insight was that “DW appliances do supply massive horsepower, but getting the data ready for analysis is the tough part.”
IBM Case Study of The Pepsi Bottling Group – The Pepsi Bottling Group is a global beverage distributor with a 100-year history and nearly $13 billion in annual sales. The problem was to evolve a sales data mart into an enterprise data warehouse. The solution was a packaged data warehouse appliance solution architected in separate layers for data access and data storage. They “can do more with less” and stressed to “think single solution and insist that the vendor do the same.”
Netezza Case Study of Epsilon Data Management – Epsilon Data Management is a high-end service provider for CRM, having 2,200 clients. The problem was delivering advanced CRM to build brand awareness through 1-on-1 marketing, which required huge data collections that push the limits of traditional DBMS technology. The solution was to host custom data marts for their clients using data warehouse appliances. It took “one week to get the data warehouse appliance up and running,” with costs much less than traditional technology.
Sun/Greenplum Case Study of Frontier Airlines, Inc. – Frontier Airlines is a major U.S. air carrier with 5,000 employees. The problem was revenue management by analyzing trends based on many external data sources. The solution was an analytic data mart that was put into production in only four months. They are “doing business in a different way” with “a low cost solution that provided sound ROI for project.”
Teradata Case Study of Network Solutions – Network Solutions is a 900-person company providing a broad range of Internet services. The problem was the integration of data marts into an enterprise data warehouse with an emphasis on CRM. The solution was a “mature and open” data warehouse system. The theme was to “change that data into dollars.” The term “appliance” was avoided since it was viewed negatively as “an inflexible black box.”
The following sections highlight several insights that surfaced from the interviews about the current experiences of these customers with data warehouse appliances.
The perception of a data warehouse appliance varied. Most viewed the term positively as the next wave of technology in the data warehouse area. However, others were not as sure. “We don’t use the word appliance. It’s just another big computer to us,” said John Devolites of TEOCO. Another person felt that the term appliance was viewed negatively within their company because it implied that the black-box nature of an appliance was inflexible and unable to adapt to changing requirements.
A repeated recommendation in several interviews was to overcome the false belief that the data warehouse is just another database. “A data warehouse is a very different, ever-evolving animal, and it can be your critical competitive differentiator,” said Tracy Abdo of Network Solutions. “You must be able to reinvent the data warehouse by finding new uses every few months, sell your innovations to your internal customers and make their experience with the data warehouse constantly better.”
Technology itself is a small part of being successful with an enterprise data warehouse. A company must remain focused on its unique requirements, while being constantly aware of changing technology solutions for fulfilling those requirements. New low-cost data warehouse appliance solutions may suffice for departmental and data mart projects, but more mature high-cost solutions may be necessary for an enterprise data warehouse.
Conduct a thorough proof of concept based on your unique workload requirements, and not a standardized benchmark. This is a time-consuming effort, but one that results in a maturing of your business requirements, along with better data warehouse appliance product selection.
Mike Spies of The Pepsi Bottling Group (PBG) recommended architecting your data warehouse in two layers – a data access layer that is separate from and independent of the data storage layer. “Such architecture would avoid the dilemma of having to choose between designing data tables for access and designing them for efficient storage. If you have user and application interfaces tightly coupled to the storage layer, you lose flexibility, and the system becomes difficult to manage.” Such an approach makes it easier to move to new data management technologies (such as data warehouse appliances) without affecting existing applications.
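The two-layer idea can be illustrated with a minimal sketch. It assumes a hypothetical `StorageLayer` interface and hypothetical class and table names; none of these come from PBG's actual system. The point is only that reports call the access layer, so the storage technology behind it could be swapped without changing them.

```python
from typing import Protocol


class StorageLayer(Protocol):
    """Hypothetical interface: anything that can return the rows of a named table."""

    def fetch(self, table: str) -> list[dict]: ...


class InMemoryStorage:
    """Stand-in for whatever storage technology is currently deployed."""

    def __init__(self, tables: dict[str, list[dict]]) -> None:
        self._tables = tables

    def fetch(self, table: str) -> list[dict]:
        return list(self._tables.get(table, []))


class SalesAccess:
    """Access layer: the only interface that reports and applications use."""

    def __init__(self, storage: StorageLayer) -> None:
        self._storage = storage

    def sales_by_region(self) -> dict[str, float]:
        totals: dict[str, float] = {}
        for row in self._storage.fetch("sales"):
            totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
        return totals


storage = InMemoryStorage({"sales": [
    {"region": "East", "amount": 120.0},
    {"region": "West", "amount": 80.0},
    {"region": "East", "amount": 30.0},
]})
access = SalesAccess(storage)
print(access.sales_by_region())
```

Replacing `InMemoryStorage` with another class that offers the same `fetch` method would leave `SalesAccess` and its callers untouched, which is the flexibility the quotation describes.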
A clear statement of the benefits from a data warehouse appliance was from Tracy Abdo of Network Solutions when she said, “We had to be able to do more with our data assets. We needed to turn data into dollars.”
Mike Spies of PBG said that their main benefit was to “do more and more for less and less.” From 2000 to 2003, their BI and data warehousing investment and workload remained flat. Then in 2004, the new installation supported five times the workload for the same investment. At the beginning of 2007, new hardware led to a new three-year lease, which doubled the capacity of the configuration and saved $400,000 per year.
Mike Coakley of Epsilon estimated that using a data warehouse appliance took about one-third of the effort required with traditional database technology.
Shane Jackson of Frontier Airlines said that the business case for their data warehouse appliance project was based on achieving full payback within eighteen months. “ROI for the project was way beyond our expectations,” said Jackson. “We enjoyed a lower cost solution with better support than would have been obtained previously.”
Several companies reported that the most time/resource-consuming portion of their data warehouse project was data integration. Since the data warehouse appliance does not affect this part of the project, plan accordingly. Data integration involves:
John Devolites of TEOCO split their data warehouse appliance into a data staging data warehouse that receives the real-time data and an analytical reporting data warehouse that is normally read-only. Then once per day, the two are synchronized. This approach minimizes the performance impact of concurrent data updates on the reporting/query workload.
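As a rough illustration of this staging/reporting split, the sketch below models the two sides as in-memory dictionaries. The class and method names (`SplitWarehouse`, `ingest`, `query`, `daily_sync`) are hypothetical and say nothing about how TEOCO actually implemented the approach; a real system would synchronize database tables rather than copy a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class SplitWarehouse:
    """Hypothetical model of a staging side and a read-only reporting side."""

    staging: dict = field(default_factory=dict)    # receives real-time updates
    reporting: dict = field(default_factory=dict)  # read-only between syncs

    def ingest(self, key: str, row: dict) -> None:
        """Apply a trickle-fed update to the staging side only."""
        self.staging[key] = row

    def query(self, key: str):
        """Reports read the reporting side, so they never see in-flight updates."""
        return self.reporting.get(key)

    def daily_sync(self) -> None:
        """Once per day, publish the staged state to the reporting side."""
        self.reporting = dict(self.staging)


wh = SplitWarehouse()
wh.ingest("inv-1", {"amount": 100})
before_sync = wh.query("inv-1")   # update not yet visible to reports
wh.daily_sync()
after_sync = wh.query("inv-1")    # update visible after the daily sync
```

The tradeoff is data latency: reports can be up to a day behind the staging side in exchange for a query workload that is not disturbed by concurrent updates.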
Think in terms of a single solution when designing your data warehouse – and insist that your vendor do the same. Mike Spies of PBG noted, “A key success factor was making the vendor accountable for providing and supporting an integrated solution. Having a single integrated product set supporting the data warehouse has been a key and is particularly important for system upgrades in the future. The data warehouse is an enterprise asset, and it must be stable at all times. A vendor must have the mind-set of implementing and supporting a solution, rather than a set of products. Making the vendor responsible for supporting an integrated product set forces the vendor to have an investment in the final solution.”
Mike Coakley of Epsilon advised, “Setting and managing user expectations is also of critical importance. Because of the good performance that a data warehouse appliance initially exhibits with light workloads, user expectations are set very high. When performance slows because of the increased query loads, as will happen with any platform, users may complain. They usually don’t remember when the same queries ran much slower on their old systems.”
To complement the interviews with data warehouse appliance customers, we conducted an e-survey of more than 250 IT professionals. They were asked about their perceptions of the viability of data warehouse appliances, the features important to a data warehouse appliance, and their plans regarding the use of data warehouse appliances. Survey details are contained in Appendix A.
Data warehouse appliance viability was positive but guarded. Almost 40% thought that a data warehouse appliance was a great idea if care was taken in product selection and application usage. However, some (27%) thought that a data warehouse appliance was still unproven but showed great promise, while fewer (15%) thought that many companies are using a data warehouse appliance successfully. Therefore, IT professionals are beginning to regard data warehouse appliances as a valid technology, but they are not entirely convinced.
Data warehouse appliance features were broadly recognized without clear distinction of the most important features. The features of performance, scalability, and reliability scored the highest but were closely followed by many other features. Oddly, the feature that was perceived as least important was a “single vendor for product, education and support,” which seems to contradict the data warehouse appliance requirement for one purpose, one package, one install, one support. An explanation is that users do not directly relate the “single product” feature with TCO and performance features, which do have high perceived importance.
Nearly 50 percent of the respondents were actively involved with data warehouse appliances by evaluating, planning, deploying, or running these products. Of those who had no plans, reasons given were that other projects had higher priority and their data volumes were too low. Of those who had plans, it was again split evenly between those who were just evaluating versus those already planning, deploying or operating a data warehouse appliance.
Finally, for those who had a data warehouse appliance in production, half said that their product was performing well and meeting their needs. The other half was split among “too soon to judge,” “project is more difficult than expected” and “working but below expectations,” in that order.
Data warehouse appliances are not new and are not magic, but they certainly are changing the way that business thinks about data warehousing. And, there is no question that data warehouse appliances have had a disruptive impact on the marketplace.
We recommend that the most important criterion for evaluating a data warehouse appliance solution be its focus on one purpose, one package, one install, one support. As stated earlier, when you deploy a data warehouse appliance, you should expect to purchase and maintain one product item, and receive one shipment of a pre-configured and pre-tested system. This is a key factor in reducing TCO.
This is the good news. The bad news is that your job has just started. You must then layer on top of the data warehouse appliance the essential functions of data integration services, BI applications and tools, and information delivery. A general rule of thumb is that two-thirds of the costs are contained in the top three layers of a BI stack, implying that if the data warehouse appliance purchasing, maintenance, and administration costs are halved, the total solution costs are reduced by only 17%.
Figure 8: Data Warehouse Appliance Relationship to Total BI Cost
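The 17% figure follows directly from the rule of thumb: if two-thirds of total cost sits in the upper layers, the appliance accounts for the remaining one-third, and halving that third saves one-sixth of the total. A minimal sketch of the arithmetic, using a hypothetical helper function and treating the two-thirds share as the stated rule of thumb rather than measured data:

```python
def total_cost_reduction(appliance_share: float, appliance_savings: float) -> float:
    """Fraction by which total BI cost falls when only the appliance layer gets cheaper."""
    return appliance_share * appliance_savings


# Rule of thumb from the text: two-thirds of cost is in the top three BI layers,
# so the appliance layer carries the remaining one-third.
appliance_share = 1 - 2 / 3

# Halving appliance costs (purchase, maintenance, administration).
reduction = total_cost_reduction(appliance_share, 0.5)
print(f"{reduction:.1%}")  # 16.7%, which rounds to the 17% cited above
```

The same function shows why appliance savings alone have limited leverage: even a free appliance (`appliance_savings = 1.0`) would cut total cost by only about one-third under this assumption.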
Also, consider time to value for your system. You may install your data warehouse appliance in one week. However, it may take you several months to build a production system – with multi-year historical data, up-to-the-hour current data, adequate data cleansing and enhancement, the proper analytics, enterprise-wide reporting, business performance dashboards, and so forth. The analogy is: you just bought and installed a wonderful toaster, but that toaster does not make your kitchen fully functional.
John Devolites of TEOCO summarized the situation as follows: “We avoid the term appliance because it is misleading with respect to the amount of effort required to deploy a production data warehousing system. The actual amount of effort involved is actually no different from that of any system. Vendors try to make it sound simple, but it is not. It may be an appliance, but I still have to add all the doors and windows to it.”
Mike Coakley of Epsilon noted, “It takes about one week to get the data warehouse appliance up and running. Like most data warehousing projects, getting the data ready to load is a time-consuming process (particularly in the area of data cleansing).”
The first generation of data warehouse appliance offerings has proven itself to be suitable for a single-function multiterabyte data warehouse. This generation is also suited to serve as a staging area for transforming and filtering large volumes of data to offload a larger enterprise data warehouse.
The next generation of data warehouse appliance offerings is providing increased functionality in two critical areas. First, there is improved workload management to support a wider spectrum of BI applications from simple reporting to concurrent updates to complex queries. Second, there is better support for commodity hardware that improves flexibility and integration into the IT infrastructure. The challenge will be for data warehouse appliance vendors to maintain low TCO and high performance as this functionality is added.
There are considerable competitive pressures between the general system vendors and the specialized data warehouse appliance vendors. The former are taking their existing software and hardware components and delivering an integrated offering, such as the IBM Balanced Warehouse. The latter, such as Netezza and DATAllegro, are designing and assembling components from various suppliers.
The challenge is for both types of vendors to maintain the requirement of one purpose, one package, one install, one support as they deliver better performance and lower TCO.
The data warehouse appliance solution that you choose should be determined primarily by your business requirements. It is important to thoroughly understand those requirements and to match them to each vendor solution. To the extent that your requirements are poorly defined or are continually changing, your selection process will be more difficult.
For most situations, your selection will probably depend on whether you need a high-performance stand-alone data mart versus a multifunction enterprise data warehouse. Choosing a solution for a stand-alone data mart is much easier than it is for an enterprise data warehouse.
Because a data warehouse appliance selection has significant cost implications for the business, a full POC will usually be required to make the right decision. This POC should be designed based on your business requirements, rather than on some standardized benchmark. You should expect that the POC may take several weeks of planning and execution (depending on the complexity of your requirements). However, this effort should reap benefits of education for your staff, realistic planning for the production system, and wise selection of critical system components.
Acknowledgement
This report was published through the Business Intelligence Network with sponsorship from DATAllegro, Sun/Greenplum, IBM Corporation, Teradata, and Netezza. Editorial content was determined solely by the authors, who assume full responsibility for that content.
As part of the research study into data warehouse appliances for the Business Intelligence Network, an e-survey was conducted during March and April of 2007. E-mail invitations were sent to a selected set of persons, asking them to participate in the survey. Completed surveys were received from 254 persons.
Viability of Data Warehouse Appliance
The first question concerned their perception of the viability of data warehouse appliances: Do you consider data warehouse appliances to be a viable technology? They were given five closed-ended responses from which to choose:
The results shown in the following table indicate a typical bell-shaped response skewed toward the conservative. In other words, respondents were hopeful but cautious about data warehouse appliance viability.
Features of Data Warehouse Appliances
The second question probed the perceived importance of fifteen features provided by a data warehouse appliance. In particular, respondents were asked to rate each of the features on a five-point scale where 1 = not important and 5 = essential.
The results are indicated in the following table. The top three features were performance, scalability and reliability, while single-vendor was the least important. In general, the responses were fairly uniform, clustering within one unit on the scale, which implies that respondents want a wide spectrum of features.
Planning for a Data Warehouse Appliance
The third question concerned the plans for using a data warehouse appliance: Are you using or do you plan to use a data warehouse appliance for your company? They were given the following closed-ended responses:
The results are shown in the following table. Nearly 50 percent of the respondents were actively involved with data warehouse appliances, either evaluating, planning, deploying, or running these products.
Why a Data Warehouse Appliance is Not Considered
The survey probed further by asking those who were not planning to use a data warehouse appliance about their reasons: If you are not planning to use a data warehouse appliance, what is the primary reason for this? There were six choices along with an “other” option.
The results of the 133 respondents who were not considering data warehouse appliances are indicated in the following table. The priority of other IT projects (46%) was the most common reason, followed by data volumes being too low (22%).
Experience of Data Warehouse Appliance Adopters
The survey also asked the data warehouse appliance adopters about their experiences: If you are using a data warehouse appliance, what has been your experience so far?
The results of the 28 adopters are shown in the following table. Of those who had data warehouse appliances in production, 50 percent said the solution performed well and satisfied their needs; 29 percent said the project was more difficult than expected or the appliance was below their expectations. Only about 10 percent of respondents indicated they thought that a data warehouse appliance was not a viable approach.
Dr. Richard Hackathorn is founder and president of Bolder Technology, Inc. He has more than thirty years of experience in the information technology industry as a well-known industry analyst, technology innovator and international educator. He has pioneered many innovations in database management, decision support, client-server computing, database connectivity, associative link analysis, data warehousing, and web farming. Focus areas are: business value of timely data, real-time business intelligence (BI), data warehouse appliances, ethics of business intelligence, and globalization of BI.
Richard has published numerous articles in trade and academic publications, presented regularly at leading industry conferences, and conducted professional seminars in eighteen countries. He writes regularly for the Business Intelligence Network (BeyeNETWORK.com) and has a channel for his blog, articles and research studies. He has been a member of the IBM Gold Consultants since its inception and is also a member of the Boulder BI Brain Trust and the Independent Analyst Platform.
Dr. Hackathorn has written three professional texts, entitled Enterprise Database Connectivity, Using the Data Warehouse (with William H. Inmon), and Web Farming for the Data Warehouse.
Colin is the Founder of BI Research. He is well known for his in-depth knowledge of leading-edge business intelligence and business integration technologies, and how they can be used to build a smart and agile business. With more than 35 years of IT experience, he has consulted for dozens of companies throughout the world and is a frequent speaker at leading IT events. Colin has written numerous articles on business intelligence and enterprise business integration. Colin has an expert channel and blog on the B-Eye-Network and can be reached at cwhite@bi-research.com.