The prior article in this series discussed the human side of analytics: the culture, people, and organization that companies need to succeed with analytics. The flip side is the "hard stuff" (the architecture, platforms, tools, and data) that makes analytics possible. Although analytical technology gets the lion's share of attention in the trade press (perhaps more than it deserves for the value it delivers), it nonetheless forms the bedrock of all analytical initiatives. This article examines the architecture, platforms, tools, and data needed to deliver robust analytical solutions.
The term "analytical architecture" is an oxymoron. In most organizations, business analysts are left to their own devices to access, integrate, and analyze data. By necessity, they create their own data sets and reports outside the purview and approval of corporate IT. By definition, there is no analytical architecture in most organizations--just a hodge-podge of analytical silos and spreadmarts, each with conflicting business rules and data definitions.
Analytical sandboxes. Fortunately, with the advent of specialized analytical platforms (discussed below), BI architects have more options for bringing business analysts into the corporate BI fold. They can use these high-powered database platforms to create analytical sandboxes for the explicit use of business analysts. These sandboxes, when designed properly, give analysts the flexibility they need to access corporate data at a granular level, combine it with data that they've sourced themselves, and conduct analyses to answer pressing business questions. With analytical sandboxes, BI teams can transform business analysts from data pariahs to full-fledged members of the BI community.
There are four types of analytical sandboxes:
- Staging Sandbox. This is a staging area for a data warehouse that contains raw, non-integrated data from multiple source systems. Analysts generally prefer to query a staging area that contains all the raw data rather than query each source system individually. Hadoop, which a growing number of companies are adding to their BI ecosystems, serves as a staging area for large volumes of unstructured data.
- Virtual Sandbox. A virtual sandbox is a set of tables inside a data warehouse assigned to individual analysts. Analysts can upload data into the sandbox and combine it with data from the data warehouse, giving them one place to go to do all their analyses. The BI team needs to carefully allocate compute resources so analysts have enough horsepower to run ad hoc queries without interfering with other workloads running on the data warehouse.
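The virtual sandbox idea can be sketched with any relational database that lets a second schema sit alongside the main one. Below is a purely illustrative Python/sqlite3 sketch (the table names and figures are invented): the analyst's uploaded targets live in their own sandbox schema yet join directly against a warehouse table, so there is one place to go for the whole analysis.

```python
import sqlite3

# Hypothetical sketch: a "warehouse" database plus a per-analyst sandbox
# attached alongside it, so one query can join corporate and uploaded data.
conn = sqlite3.connect(":memory:")                     # stands in for the warehouse
conn.execute("ATTACH DATABASE ':memory:' AS sandbox")  # the analyst's own tables

# Corporate data owned by the BI team
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("West", 250.0)])

# Data the analyst sourced themselves (e.g., an uploaded spreadsheet)
conn.execute("CREATE TABLE sandbox.targets (region TEXT, target REAL)")
conn.executemany("INSERT INTO sandbox.targets VALUES (?, ?)",
                 [("East", 120.0), ("West", 200.0)])

# One place to go: join warehouse data with the analyst's own data
rows = conn.execute("""
    SELECT s.region, s.amount, t.target, s.amount - t.target AS variance
    FROM sales s JOIN sandbox.targets t ON s.region = t.region
    ORDER BY s.region
""").fetchall()
print(rows)
```

In a real warehouse the "attach" step would be a schema or set of tables the BI team provisions per analyst, with workload management keeping such queries from starving other users.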
- Free-standing sandbox. A free-standing sandbox is a separate database server that sits alongside a data warehouse and contains its own data. It's often used to offload complex, ad hoc queries from an enterprise data warehouse and give business analysts their own space to play. In some cases, these sandboxes contain a replica of data in the data warehouse, while in others, they support entirely new data sets that don't fit in a data warehouse or run faster on an analytical platform.
- In-memory BI sandbox. Some desktop BI tools maintain a local data store, either in memory or on disk, to support interactive dashboards and queries. Analysts love these types of sandboxes because they connect to virtually any data source and enable analysts to model data, apply filters, and visually interact with the data without IT intervention.
Next-Generation BI Architecture. Figure 1 depicts a BI architecture with the four analytical sandboxes colored in green. The top half of the diagram represents a classic top-down data warehousing architecture that primarily delivers interactive reports and dashboards to casual users (although the streaming/complex event processing (CEP) engine is new). The bottom half depicts a bottom-up analytical architecture with analytical sandboxes and new types of data sources. This next-generation BI architecture better accommodates the needs of business analysts and data scientists, making them full-fledged members of the corporate BI ecosystem.
Figure 1. The New BI Architecture
The next-generation BI architecture is more analytical, giving power users greater options to access and mix corporate data with their own data via various types of analytical sandboxes. It also brings unstructured and semi-structured data fully into the mix using Hadoop and nonrelational databases.
Since the beginning of the data warehousing movement in the early 1990s, organizations have used general-purpose data management systems to implement data warehouses and, occasionally, multidimensional databases (i.e., "cubes") to support subject-specific data marts, especially for financial analytics. General-purpose data management systems were designed for transaction processing (i.e., rapid, secure, synchronized updates against small data sets) and only later modified to handle analytical processing (i.e., complex queries against large data sets). In contrast, analytical platforms focus entirely on analytical processing at the expense of transaction processing.
The analytical platform movement. In 2002, Netezza (now owned by IBM) introduced a specialized analytical appliance: a tightly integrated hardware-software database management system designed explicitly to run ad hoc queries against large volumes of data at blindingly fast speeds. Netezza's success spawned a host of competitors, and there are now more than two dozen players in the market (see Table 1).
Today, the technology behind analytical platforms is diverse: appliances, columnar databases, in-memory databases, massively parallel processing (MPP) databases, file-based systems, nonrelational databases, and analytical services. What they all have in common, however, is that they provide significant improvements in price-performance, availability, load times, and manageability compared with general-purpose relational database management systems. Every analytical platform customer I've interviewed has cited order-of-magnitude performance gains that most initially didn't believe.
Moreover, many of these analytical platforms contain built-in analytical functions that make life easier for business analysts. These functions range from fuzzy matching algorithms and text analytics to data preparation and data mining functions. By putting functions in the database, analysts no longer have to craft complex, custom SQL or offload data to analytical workstations, which limits the amount of data they can analyze and model.
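To make the in-database idea concrete, here is a hedged sketch using Python's sqlite3 module, whose create_function call stands in for a vendor's real in-database analytics toolkit. The z-score bucketing function and the transaction figures are invented for illustration; the point is that the classification runs where the data lives and only small results travel back to the analyst.

```python
import sqlite3

# Hypothetical in-database function: classify a value by how many
# standard deviations it sits from a given mean.
def zscore_bucket(value, mean, stddev):
    z = (value - mean) / stddev
    return "outlier" if abs(z) > 2 else "normal"

conn = sqlite3.connect(":memory:")
# Register the function with the database engine so SQL can call it
# directly, instead of exporting every row to a workstation.
conn.create_function("zscore_bucket", 3, zscore_bucket)

conn.execute("CREATE TABLE txns (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO txns VALUES (?, ?)",
                 [(1, 10.0), (2, 12.0), (3, 95.0)])

# Only the labels come back; the heavy lifting happens in the database.
labels = conn.execute(
    "SELECT id, zscore_bucket(amount, 12.0, 5.0) FROM txns ORDER BY id"
).fetchall()
print(labels)   # [(1, 'normal'), (2, 'normal'), (3, 'outlier')]
```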
Companies use analytical platforms to support free-standing sandboxes (described above) or as replacements for data warehouses running on MySQL and SQL Server, and occasionally for major OLTP databases from Oracle and IBM. They also improve query performance for ad hoc analytical tools, especially those that connect directly to databases to run queries (versus those that download data to a local cache).
In 2010, vendors turned their attention to meeting the needs of power users after ten years of enhancing reporting and dashboard solutions for casual users. As a result, the number of analytical tools on the market has exploded.
Analytical tools come in all shapes and sizes. Analysts generally need one of every type of tool. Just as you wouldn't hire a carpenter to build an addition to your house with just one tool, you don't want to restrict an analyst to just one analytical tool. Like a carpenter, an analyst needs a different tool for every type of job they do. For instance, a typical analyst might need the following tools:
- Excel to extract data from various sources, including local files, create reports, and share them with others via a corporate portal or server (managed Excel).
- BI search tools to issue ad hoc queries against a BI tool's metadata.
- Planning tools (including Excel) to create strategic and tactical plans, each containing multiple scenarios.
- Mashboards and ad hoc reporting tools to create ad hoc dashboards and reports on behalf of departmental colleagues.
- Visual discovery tools to explore data in one or more sources of data and create interactive dashboards on behalf of departmental colleagues.
- Multidimensional OLAP (MOLAP) tools to explore small and medium sets of data dimensionally at the speed of thought and run complex dimensional calculations.
- Relational OLAP tools to explore large sets of data dimensionally and run complex calculations.
- Text analytics tools to parse text data and put it in a relational structure for analysis.
- Data mining tools to create descriptive and predictive models.
- Hadoop and MapReduce to process large volumes of unstructured and semi-structured data in a parallel environment.
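The MapReduce pattern in the last item can be illustrated without Hadoop at all. The following single-process Python sketch walks through the same three phases (map, shuffle, reduce) on an invented word-count example; what Hadoop adds is running the map and reduce phases in parallel across a cluster of machines.

```python
from collections import defaultdict
from itertools import chain

# Map phase: turn each record into (key, value) pairs.
def map_phase(record):
    for word in record.split():
        yield (word.lower(), 1)

# Shuffle phase: group all values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: collapse each group to a single result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

records = ["Big data big analytics", "big insights"]
pairs = chain.from_iterable(map_phase(r) for r in records)
counts = reduce_phase(shuffle(pairs))
print(counts)   # {'big': 3, 'data': 1, 'analytics': 1, 'insights': 1}
```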
Figure 2. Types of Analytical Tools
Figure 2 plots these tools on a graph where the x axis represents calculation complexity and the y axis represents data volumes. Ad hoc analytical tools for casual users (or more realistically super users) are clustered in the bottom left corner of the graph, while ad hoc tools for power users are clustered slightly above and to the right. Planning and scenario modeling tools cluster further to the right, offering slightly more calculation complexity against small volumes of data. High-powered analytical tools, which generally rely on machine learning algorithms and specialized analytical databases, cluster in the upper right quadrant.
Business analysts function like one-man IT shops. They must access, integrate, clean, and analyze data, and then present it to other users. Figure 3 depicts the typical workflow of a business analyst. If an organization doesn't have a mature data warehouse that contains cross-functional data at a granular level, analysts often spend an inordinate amount of time sourcing, cleaning, and integrating data (steps 1 and 2 in the analyst workflow). They then create a multiplicity of analytical silos (step 5) when they publish data, much to the chagrin of the IT department.
Figure 3. Analyst Workflow
In the absence of a data warehouse that contains all the data they need, business analysts must function as one-man IT shops where they spend an inordinate amount of time iterating between collecting, integrating, and analyzing data. They run into trouble when they distribute their hand-crafted data sets broadly.
Data Warehouse. The most important way organizations can improve the productivity and effectiveness of business analysts is to maintain a robust data warehousing environment that contains most of the data analysts need to perform their work. This can take many years, and in a fast-moving market where the company adds new products and features continuously, the data warehouse may never catch up. Nonetheless, it's important for organizations to continuously add new subject areas to the data warehouse; otherwise, business analysts have to spend hours or days gathering and integrating this data themselves.
Atomic Data. The data warehouse also needs to house atomic data, or data at the lowest level of transactional detail, not summary data. Analysts generally want the raw data because they can repurpose it in many different ways depending on the nature of the business questions they're addressing. This is why highly skilled analysts like to access data directly from source systems or a data warehouse staging area. At the same time, less skilled analysts appreciate the heavy lifting done by the IT group to clean and integrate disparate data sets using common metrics, dimensions, and attributes. This base level of data standardization expedites their work.
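The value of atomic data can be shown with a small, invented example: the same raw transactions roll up along whichever dimension the business question demands, whereas a pre-built summary supports only the one question it was aggregated for.

```python
from collections import defaultdict

# Invented atomic transaction records (lowest level of detail).
transactions = [
    {"day": "Mon", "customer": "A", "amount": 10.0},
    {"day": "Mon", "customer": "B", "amount": 20.0},
    {"day": "Tue", "customer": "A", "amount": 5.0},
]

# Roll the same raw rows up along any dimension the question requires.
def rollup(rows, dimension):
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["amount"]
    return dict(totals)

by_day = rollup(transactions, "day")            # {'Mon': 30.0, 'Tue': 5.0}
by_customer = rollup(transactions, "customer")  # {'A': 15.0, 'B': 20.0}
# A daily summary table alone could never answer the per-customer question.
```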
Once a BI team integrates a sufficient number of subject areas in a data warehouse at an atomic level, business analysts can have a field day. Instead of downloading data to an analytical workstation, which limits the amount of data they can analyze and process, they can run calculations and models against the entire data warehouse using analytical functions built into the database or created with database development toolkits. This improves the accuracy of their analyses and models and saves them considerable time.
The technical side of analytics is daunting. There are many moving parts that all have to work together. However, the most important part of the technical equation is the data. The old adage holds true: "garbage in, garbage out." Analysts can't deliver accurate insights if they don't have access to good-quality data, and it's a waste of their time to spend days preparing data for analysis. A good analytics program is built on a solid data warehousing foundation that embeds analytical sandboxes tailored to the requirements of individual analysts.
Posted November 15, 2011 7:44 AM