Multivariate Analysis Using Parallel Coordinates
by Stephen Few
Originally published September 12, 2006
This article is part of a series that I began in July of this year with the article entitled "An Introduction to Visual Multivariate Analysis." In that initial article, I provided an overview of several approaches to analyzing multivariate data using visualization techniques. In this article, I am featuring an approach called parallel coordinates.
The first time that I saw a parallel coordinates visualization, I almost laughed out loud. My initial impression was "How absurd!" I couldn't imagine how anyone could make sense of the dense clutter caused by hundreds of overlapping lines (see Figure 1). This certainly isn't a chart that you would present to the board of directors or place on your Web site for the general public. In fact, the strength of parallel coordinates isn't in their ability to communicate some truth in the data to others, but rather in their ability to bring meaningful multivariate patterns and comparisons to light when used interactively for analysis.
Reading Parallel Coordinates
Figure 2 displays the same set of data, but this time the line representing a single county has been highlighted. I selected Alameda County in California, where I live, to see its multivariate profile.
In examining this Alameda County profile, we must be careful to read nothing of significance into the slope of each line segment or the overall pattern formed by the line as a whole. The slopes and overall pattern would look completely different if I rearranged the order of the variables. Instead, we should read the variables one by one to construct a composite profile of Alameda County. In doing so, because we can see Alameda County in the context of all counties, we can quickly determine that home values are higher than average but only about 40% of the value in the highest county, the number of farm acres is much lower than average, income is higher than in most counties but only about 55% of the income in the highest county, and so on.
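To make the idea of reading one axis at a time concrete, here is a minimal sketch of how a value can be placed on its own axis and compared with the highest value on that axis. The county names and numbers are invented for illustration, and scaling each axis from its minimum to its maximum is an assumption on my part; a given tool may scale its axes differently.

```python
# Hypothetical data: three counties, two variables (all values are invented).
counties = {
    "County A": {"home_value": 200_000, "farm_acres": 50_000},
    "County B": {"home_value": 500_000, "farm_acres": 10_000},
    "County C": {"home_value": 100_000, "farm_acres": 90_000},
}


def axis_positions(data):
    """Scale each variable to 0..1 independently, one axis per variable."""
    variables = list(next(iter(data.values())).keys())
    positions = {name: {} for name in data}
    for var in variables:
        values = [row[var] for row in data.values()]
        low, high = min(values), max(values)
        span = (high - low) or 1  # avoid dividing by zero on a constant variable
        for name, row in data.items():
            positions[name][var] = (row[var] - low) / span
    return positions


def share_of_max(data, name, var):
    """One county's value as a fraction of the highest value on that axis."""
    return data[name][var] / max(row[var] for row in data.values())


positions = axis_positions(counties)
home_share = share_of_max(counties, "County A", "home_value")
```

Because every axis is scaled separately, a position on one axis says nothing about the position on the next, which is why the slope between two axes carries no meaning of its own.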
The Big Picture
Useful Ways to Complement and Interact with Parallel Coordinates
Another way that I can easily highlight items is to simply draw a rectangle around values in the graph itself that interest me. In Figure 4, you can see the results of drawing a rectangle around the highest values on the College Graduate % axis. Given the resulting view, it only takes a moment to notice that counties with the highest percentages of college graduates all have very few acres of farmland, higher than average incomes, relatively small populations, and high life expectancies.
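Drawing a rectangle on an axis amounts to filtering the rows by a range on one variable and then looking at how the selected rows behave on the others. The following is a rough sketch of that idea only, not a description of how any product implements it. The field names and values are invented.

```python
# Hypothetical rows; every value here is invented for illustration.
rows = [
    {"name": "A", "college_pct": 45, "farm_acres": 5, "income": 70},
    {"name": "B", "college_pct": 12, "farm_acres": 80, "income": 35},
    {"name": "C", "college_pct": 50, "farm_acres": 3, "income": 80},
    {"name": "D", "college_pct": 20, "farm_acres": 60, "income": 40},
]


def brush(data, variable, low, high):
    """Return the rows whose value on `variable` falls inside the brushed range."""
    return [r for r in data if low <= r[variable] <= high]


def mean(data, variable):
    """Average of one variable across a set of rows."""
    return sum(r[variable] for r in data) / len(data)


# Brush the top of the college graduate axis, then compare another variable.
selected = brush(rows, "college_pct", 40, 100)
selected_names = [r["name"] for r in selected]
selected_income = mean(selected, "income")
overall_income = mean(rows, "income")
```

Comparing the mean of the selection with the mean of all rows is the numeric counterpart of noticing where the highlighted lines sit on each of the other axes.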
It is often helpful to split clusters of similar data into separate graphs, both to focus on specific groups independent of the others and to compare their multivariate profiles. In Figure 5, to pursue an interest in the relationship between the percentage of college graduates and the other variables, I used convenient functionality in Spotfire DXP to divide the data into five groups (or bins) based on the percentage of college graduates and to place each group into a separate graph. The top graph displays counties with the lowest percentages of college graduates and the bottom graph displays those with the highest. A quick comparison of these graphs reveals that counties with the lowest percentages of college graduates have the lowest home values and a much wider range of percentages of elderly residents than counties with the highest percentages of college graduates. Another difference that surfaces when the five groups are viewed in this fashion is that the distribution of values for each variable except home value and population tends to narrow with each graph: the top graph (lowest percentage of college graduates) displays a broad distribution of values across most variables, and the bottom graph (highest percentage of college graduates) displays a relatively narrow distribution of values for each variable. In other words, greater percentages of college graduates appear to correspond to greater homogeneity among the people in that county.
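The binning step can be sketched in a few lines. I don't know which binning method the software applied, so this example assumes five equal-width bins, and it uses the range (maximum minus minimum) as a simple stand-in for how broad a distribution looks on an axis. All names and values are invented.

```python
def bin_index(value, low, high, bins=5):
    """Zero-based equal-width bin number for a value within [low, high]."""
    if high == low:
        return 0
    idx = int((value - low) * bins / (high - low))
    return min(idx, bins - 1)  # the maximum value belongs in the last bin


def split_into_bins(data, variable, bins=5):
    """Group rows into equal-width bins on one variable."""
    values = [r[variable] for r in data]
    low, high = min(values), max(values)
    groups = [[] for _ in range(bins)]
    for r in data:
        groups[bin_index(r[variable], low, high, bins)].append(r)
    return groups


def spread(data, variable):
    """Range of a variable within a group, or None if the group is empty."""
    if not data:
        return None
    values = [r[variable] for r in data]
    return max(values) - min(values)


# Hypothetical counties: (college graduate %, elderly %), values invented.
rows = [
    {"college_pct": 5, "elderly_pct": 30},
    {"college_pct": 10, "elderly_pct": 8},
    {"college_pct": 15, "elderly_pct": 18},
    {"college_pct": 25, "elderly_pct": 16},
    {"college_pct": 35, "elderly_pct": 15},
    {"college_pct": 45, "elderly_pct": 12},
    {"college_pct": 55, "elderly_pct": 14},
]

groups = split_into_bins(rows, "college_pct")
group_sizes = [len(g) for g in groups]
lowest_bin_spread = spread(groups[0], "elderly_pct")
highest_bin_spread = spread(groups[-1], "elderly_pct")
```

Equal-width bins can leave some groups nearly empty when the data is skewed. Bins with equal numbers of rows are a reasonable alternative if the groups need to be comparable in size.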
Searching for Similar Profiles
After running the search for counties with similar profiles and viewing the results, I selected the 10 counties most similar to Alameda County and removed all but them from the display to eliminate distractions. You can see the results in Figure 7, which shows the 10 counties in the parallel coordinates graph along with Alameda County, which is highlighted. These counties also appear in the table, which now includes two new columns that were produced by the search operation: "Similarity to Active," which measures their correlation to Alameda County (from 0 for no correlation to 1 for an exact correlation), and "Similarity to Active (Rank)," which ranks the counties by degree of correlation.
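To give a feel for what a similarity search of this kind might compute, here is one possible approach. It is a stand-in, not the measure the product uses, which isn't described here. The sketch assumes that each variable is scaled to the range 0 to 1 and that similarity is 1 minus the Euclidean distance divided by the largest possible distance, which yields a score from 0 to 1. The function names and the data are hypothetical.

```python
import math


def scale(data, variables):
    """Min-max scale each variable to 0..1 so that no variable dominates."""
    scaled = [dict(r) for r in data]
    for var in variables:
        values = [r[var] for r in data]
        low, high = min(values), max(values)
        span = (high - low) or 1  # guard against a constant variable
        for s, r in zip(scaled, data):
            s[var] = (r[var] - low) / span
    return scaled


def rank_by_similarity(data, target_name, variables):
    """Return (name, similarity) pairs, most similar to the target first."""
    scaled = scale(data, variables)
    target = next(r for r in scaled if r["name"] == target_name)
    max_dist = math.sqrt(len(variables))  # farthest apart two scaled rows can be
    results = []
    for r in scaled:
        if r["name"] == target_name:
            continue
        dist = math.sqrt(sum((r[v] - target[v]) ** 2 for v in variables))
        results.append((r["name"], 1 - dist / max_dist))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Hypothetical profiles with two variables; all values are invented.
rows = [
    {"name": "Target", "x": 10, "y": 10},
    {"name": "A", "x": 10, "y": 10},
    {"name": "B", "x": 0, "y": 0},
    {"name": "C", "x": 5, "y": 10},
]

ranking = rank_by_similarity(rows, "Target", ["x", "y"])
```

Keeping only the first ten entries of such a ranking corresponds to the step of removing all but the most similar counties from the display.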
Variations on the Theme
The variable names appear across the top, including region, state, industry, and so on. This particular example includes both quantitative variables, such as revenue, and categorical variables, such as region. In addition to the gray lines that connect each customer's values across the variables, circles display the relative size of each value of a categorical variable, and a box plot displays the distribution of values for each quantitative variable. These circles (also known as bubbles) and box plots summarize each variable in a way that can't be seen merely by looking at the lines, which is a nice addition (although the 2-D areas of circles cannot be compared precisely).
Parallel coordinates can reveal correlations between multiple variables. This is particularly useful when you want to identify which conditions correlate highly with a particular outcome. For instance, this example can be used to examine which conditions seem to have contributed to the desired outcome of customers responding to a special marketing campaign named the "Gold Bundle Campaign," which appears on the rightmost axis of the graph. As you can see, relatively few customers responded (indicated by "Yes") to the campaign. It would be useful to know the characteristics of those customers who responded. Look at what happens when I select the "Yes" circle on the "Response Gold Bundle Campaign" axis (see Figure 9).
Now we can begin to look for predominant characteristics across the other variables. Before we do so, however, I'm going to eliminate some of the clutter by turning off the lines, resulting in the graph that appears in Figure 10.
Now it's easier to see the relationships. The first thing I notice is that, of the four regions (on the leftmost axis), a much greater percentage of customers in the east responded than anywhere else, which appears to be largely a result of a significant response in the state of New York. The industries that responded the most are manufacturing and real estate, with about the same number of responses, but a much higher percentage of real estate customers. Shifting attention to the quantitative variables, I can easily see that responders tended to have lower than average revenues, typical profit margins, and a much lower than average number of employees (that is, they are relatively small companies). Another interesting characteristic is that the customers who responded to this campaign usually respond much less favorably to marketing campaigns in general, as shown on the Campaign Responses axis. This is a good example of what can be discovered when exploring multivariate business data using a well-designed parallel coordinates display.
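The distinction between the number of responses and the percentage of customers who responded is easy to illustrate. In this small sketch, two industries produce the same number of responders but very different response rates. The customer records are invented and the field names are my own.

```python
from collections import Counter


def response_rates(customers, category):
    """Map each category value to (responder count, response rate)."""
    totals = Counter(c[category] for c in customers)
    responders = Counter(c[category] for c in customers if c["responded"])
    return {k: (responders[k], responders[k] / totals[k]) for k in totals}


# Hypothetical customers: 10 in manufacturing and 4 in real estate,
# with 2 responders in each industry.
customers = (
    [{"industry": "manufacturing", "responded": i < 2} for i in range(10)]
    + [{"industry": "real estate", "responded": i < 2} for i in range(4)]
)

rates = response_rates(customers, "industry")
```

Both industries have two responders, but the rate is 20% for manufacturing and 50% for real estate, which mirrors the pattern of equal counts with unequal percentages described above.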
I hope that you are beginning to get a sense of what can be seen and the useful questions that can be pursued and answered when using parallel coordinates. Multivariate analysis requires specialized visualizations and methods of interaction with data. Parallel coordinates is only one approach. Next month we'll look at what you can do with heatmaps.
Copyright 2004 — 2017. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC