Workload-Driven Data Warehousing
Originally published May 3, 2012
The information explosion has rocked the world of data over the last five years. The advancement of mobile technology, the availability of tablets and smartphones, and the rapid growth of social media have all contributed to both the production and consumption of data at never-before-seen volumes. Other contributing factors include recommendation engines, cool new visualization capabilities for business intelligence (BI), advances in software technology that enable machine learning, and smarter systems in hospitals, airports, airplanes, automobiles and more.
Different types of data, with varying degrees of complexity, are produced at varying velocities. All of this data is now used to learn more about an enterprise: its customers, its brand, its equity and clout, the competition – all of the factors that drive a business, now quantifiable and available for use in decision making. This explosion of information has caused a flurry of hyperactivity in the enterprise, and business users are eager to adopt new and additional data for analysis and decision making.
As this story unfolds, IT and data warehouse (DW) teams will start dreading the very thought of more data. Why? Because all of this data means additional processing work for existing systems. This is the challenge, and this is where the nuance of workload is introduced.
Workload can be defined as the execution and completion of a task that utilizes a mix of resources (e.g., processing power, storage, disk input/output and network bandwidth). At any given time, any system that is processing information is executing a workload. This is common to the world of data warehousing and business intelligence, and very applicable to the underlying information architecture.
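To make this definition concrete, a workload can be modeled as the mix of resources a task consumes. This is a minimal sketch; the field names and numbers are illustrative assumptions, not figures from the article:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """A task described by the mix of resources it consumes."""
    name: str
    cpu_cores: float      # processing power
    storage_gb: float     # storage footprint
    disk_iops: int        # disk input/output
    network_mbps: float   # network bandwidth

# Two hypothetical tasks competing on the same system
nightly_load = Workload("nightly ETL load", cpu_cores=8, storage_gb=500,
                        disk_iops=20000, network_mbps=400)
dashboard = Workload("BI dashboard refresh", cpu_cores=2, storage_gb=1,
                     disk_iops=1500, network_mbps=50)

def total_iops(workloads):
    """Aggregate disk I/O demand across concurrent workloads."""
    return sum(w.disk_iops for w in workloads)

print(total_iops([nightly_load, dashboard]))  # 21500
```

Viewing every running task this way is what lets you reason about contention: when the summed demand exceeds what the shared tier can deliver, every workload on that tier slows down.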
A standard data warehouse typically carries three types of workload. Think for a minute about your data warehouse and the information processing associated with it: data from multiple sources being loaded and integrated, downstream systems and data marts accessing data from the same data warehouse, and analytical platforms executing and analyzing large, complex queries.
Now, let’s add complexity to this equation: you have jobs that load large fact tables and large history tables, and that move data from one tier to another depending on its age. While these specific jobs execute, your storage environment shows signs of slowdown. You find out that the storage area network (SAN) is shared across online transaction processing (OLTP), data warehouse and analytical databases – and that your data warehouse and OLTP system even share the same disks.
Before we proceed further down this path, let’s pause for a minute and look at what we have: a three-tier architecture of application → data → storage. In most cases, two of the three tiers are shared across the enterprise. When these systems are commissioned and designed, the basic premise is to provide sustained performance per service level agreements. In the first few months, given the volume of data and users, performance meets and exceeds all service level agreements. Within six months to a year, performance degrades, sometimes drastically. This is not uncommon, and it is not caused by user adoption of the data warehouse.
Figure 1 shows the different categories of activities that affect a typical data warehouse:
- Performance – measured by service level agreement attainment, network throughput, SAN throughput, input/output and memory consumption; affected by growth in data types, query volume, query types, data center size and user adoption.
- Growth – measured in increase of data sources, data types, query types, users and data volume.
- Complexity – measured in types of reports, analytics and dashboards built by the users.
Figure 1: Data Warehouse Parameters Over One Year
Different types of workloads include:
- Database-impacting workloads
  - Reporting workloads – static and interactive business intelligence reports
  - Analytical workloads – workloads generated by analytical queries and statistical models such as R, SAS and SPSS
  - Computing workloads – workloads generated by intensive SQL operations within the database
- Network-impacting workloads
  - Functional workloads – commodity tasks executed from web services as part of a service-oriented architecture
  - Data workloads – traffic generated by data movement across the network
- SAN-impacting workloads
  - Storage workloads – handling the storage and retrieval of data
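One way to make these categories operational is to tag each incoming query with a workload type before it runs. The sketch below is a hypothetical rule-based classifier; the keywords and the row-count threshold are illustrative assumptions, not part of any database's actual workload manager:

```python
def classify_query(sql: str, est_rows: int) -> str:
    """Assign a query to a workload category.

    The keyword rules and row-count threshold are illustrative
    assumptions; a real workload manager would use richer signals
    such as optimizer cost estimates and user/application tags.
    """
    s = sql.upper()
    # Statistical functions suggest an analytical workload
    if any(k in s for k in ("CORR(", "REGR_", "STDDEV", "PERCENTILE")):
        return "analytical"
    # Large scans or heavy join/aggregation SQL -> computing workload
    if est_rows > 1_000_000 or ("GROUP BY" in s and "JOIN" in s):
        return "computing"
    # Everything else is treated as a reporting workload
    return "reporting"

print(classify_query("SELECT STDDEV(amount) FROM sales", 10))       # analytical
print(classify_query("SELECT * FROM daily_report", 200))            # reporting
```

Once queries carry a workload label, the database, network and SAN tiers can each apply limits appropriate to that class instead of treating all traffic alike.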
Workloads can impact the overall data warehouse much like traffic on a highway. Here are two scenarios:
In the first scenario, traffic keeps moving – albeit slowly – as long as the vehicles stay in their appropriate lanes. This is similar to a data warehouse in which some queries are like fast cars with small result sets, and other queries are like trucks and buses, bringing back a lot of information.
In the second scenario, traffic is congested because everybody is driving across lanes. This is similar to a data warehouse that allows all types of queries, across all data types, without regard for the existing database structure, how the data is stored and how it will be processed. You end up with loss of adoption, lack of trust and overall failure. This is why understanding workload becomes mandatory for data warehouse design.
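The lane discipline in this analogy can be sketched as simple admission control: short queries and heavy queries get separate concurrency lanes so that neither class starves the other. The class, lane names and slot limits below are illustrative assumptions; real databases expose this as workload management or resource governor configuration:

```python
from collections import deque

class LaneScheduler:
    """Admission control with separate lanes for light and heavy queries.

    Slot limits are illustrative assumptions, not recommended settings.
    """
    def __init__(self, fast_slots=8, heavy_slots=2):
        self.slots = {"fast": fast_slots, "heavy": heavy_slots}
        self.running = {"fast": 0, "heavy": 0}
        self.waiting = {"fast": deque(), "heavy": deque()}

    def submit(self, query_id, lane):
        # Admit immediately if the lane has a free slot, else queue
        if self.running[lane] < self.slots[lane]:
            self.running[lane] += 1
            return "running"
        self.waiting[lane].append(query_id)
        return "queued"

    def finish(self, lane):
        # Free a slot and promote the next waiting query in that lane
        self.running[lane] -= 1
        if self.waiting[lane]:
            self.waiting[lane].popleft()
            self.running[lane] += 1

sched = LaneScheduler(fast_slots=2, heavy_slots=1)
print(sched.submit("q1", "fast"))   # running
print(sched.submit("q2", "heavy"))  # running
print(sched.submit("q3", "heavy"))  # queued -- heavy lane full, fast lane unaffected
print(sched.submit("q4", "fast"))   # running
```

Note that the third submission queues rather than slowing everyone down: the trucks wait in the truck lane while the fast cars keep moving.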
With the evolution of technology in the past decade, we now have the flexibility to design and, in most cases, re-architect existing data warehouses with isolated workloads. What exactly does this mean? To put it simply, you will be able to leverage the right tools and infrastructure for the purpose they were built to accomplish. For example, you can process unstructured or semi-structured data in its own isolated environment and integrate the result sets into the data warehouse as needed. You can process complex statistical functions within the database and generate a result set for use by the application. You can use in-memory technology to isolate report processing and complex calculations, offloading the reporting application server, and to load raw data into memory at high speed.
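Isolation, at its simplest, is a routing decision: each workload type is sent to the platform built for it, and only result sets flow back into the warehouse. The platform names in this sketch are illustrative placeholders, not products endorsed by the article:

```python
# Map each workload type to the isolated platform that handles it.
# Platform names are hypothetical placeholders for illustration.
ROUTES = {
    "unstructured": "hadoop_cluster",     # raw text/logs processed in isolation
    "analytical":   "in_database_stats",  # statistical functions run in-database
    "reporting":    "in_memory_engine",   # reports served from memory
    "etl":          "warehouse_loader",   # bulk loads stay on the warehouse tier
}

def route(workload_type: str) -> str:
    """Return the isolated platform for a workload, defaulting to the warehouse tier."""
    return ROUTES.get(workload_type, "warehouse_loader")

print(route("reporting"))     # in_memory_engine
print(route("unstructured"))  # hadoop_cluster
```

Because each platform scales on its own, adding reporting users or new unstructured sources grows one lane of the architecture without degrading the others – which is exactly the benefit the isolation layout in Figure 2 is meant to deliver.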
Popular frameworks such as Inmon’s DW 2.0 and the Zachman Framework 3.0 can help you design your own approach to isolating workloads. You might wonder why you can’t simply manage workloads with an information lifecycle management (ILM) technique, or just apply a database workload optimizer and call the solution finished; the answer is that these approaches optimize only one layer, e.g., the database.
Figure 2 shows the overall workload isolation layout for a data warehouse:
Figure 2: Workload Isolation Architecture
The architecture provides the following benefits:
- Scalability across all layers
- Sustained performance
- Minimized data movement across the layers
- Independent scalability within each layer
The critical success factors for this approach are:
- Metadata-driven architecture – a clear understanding of metadata across the different layers
- Clear data strategy, solution architecture and roadmap
- Clear understanding of system performance and architecture
- Clear, defined goals for performance and service level optimization
As you start thinking about workload-driven approaches for your data warehouse, ensure that all of your architecture teams are aligned and define the big picture. In my next article, I will discuss unstructured and structured workloads.
Recent articles by Krish Krishnan