One of the first concepts I talk about in my data visualization course is the idea of the elementary perceptual task (EPT), an idea explored in depth by visualization pioneers William S. Cleveland and Robert McGill. Essentially, EPTs are visual building blocks for comparing quantities. The EPTs are summarized nicely in Figure 1 from Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods:
For example, looking at the two dots in the upper-left pane, we perceive that the top dot represents a larger quantity than the bottom dot, because it is higher on a common scale than the bottom dot. In the middle panel (“Angle”), we perceive that the angle on the right represents a larger quantity than the angle on the left, since it is a larger angle.
A fundamental finding from the article is that, as humans, we are better at some elementary perceptual tasks than at others. We more accurately and easily compare quantities if they are mapped using position on a common scale than if they are mapped using length, angle or size. In fact, Cleveland and McGill came up with the following ranking of EPTs we perform most efficiently when visually comparing quantities:
- Position along a common scale;
- Position along misaligned scales;
Recently, I wrote a post discussing a few of the Gestalt principles and illustrating them with some ACS data on income, sex, and field of degree. The Gestalt principles describe how we perceive patterns, while the EPTs describe how we most efficiently compare quantities. In this post, we continue to use income data from the American Community Survey to illustrate EPTs.
Case study: employment level by sex
In my previous post, it was apparent that a gender gap in average annual income persisted even when accounting for year, highest degree, and field of degree. Another important factor that explains income is level of employment, as measured by average hours worked per week. The data for this example are once again aggregated ACS data from the IPUMS USA extract system, filtered to include only employed individuals with at least a Bachelor’s degree.
Let’s start by sticking yet another pitchfork in the favorite straw man of data visualization: the pie chart. With an understanding of EPTs, we can begin to understand why pie charts are so maligned in the data visualization community. Take a look at the graph below, and try to determine how the percent of females working full time changes over the years:
Pie charts are fine for visualizing parts of a single whole (it is easy to tell that, in any given year, a majority of women work full time), but they make it difficult to compare parts of different wholes (how the percent working full time changes over the years). This comparison is difficult to make because we are comparing angles, an elementary perceptual task that we do not perform very efficiently or well. Here are the same data, different graph:
Notice that it’s now much easier to discern that the percent of females working 40 or more hours per week is increasing over time, while the percent of females working part-time is decreasing. We now compare the position of the quantities denoted by points along a common scale (the vertical Y-axis), rather than the angle. We can also still clearly see that in each year, a majority of women work full time, by comparing the points along the common vertical scale within a single year.
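To make the contrast concrete, here is a minimal ggplot2 sketch of the two encodings. The data frame `toy`, its column names, and its percents are made up for illustration (they are not the ACS figures), and this is not necessarily how the graphs above were produced.

```r
library(ggplot2)

# Made-up percents for illustration only (NOT the ACS figures)
toy <- data.frame(
  Year  = rep(c(2009, 2012, 2016), each = 2),
  Level = rep(c("Full time", "Part time"), times = 3),
  pct   = c(60, 40, 62, 38, 64, 36)
)

# Pie charts, one per year: quantities are mapped to angle
p_pie <- ggplot(toy, aes(x = "", y = pct, fill = Level)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  facet_wrap(~Year)

# Same data as points: quantities are mapped to position on a common vertical scale
p_points <- ggplot(toy, aes(x = Year, y = pct, colour = Level)) +
  geom_point(size = 3) +
  geom_line() +
  ylim(0, 100)
```

Calling `print(p_pie)` and `print(p_points)` draws the two versions. Connecting the points with a line is reasonable here because year is an ordered variable.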
Let’s turn now to comparing the hours worked for females to males, in 2016 alone. We’ll use a bar chart for this:
What EPT are we using to compare the percents? It is tempting to think that when we compare quantities in bar charts, we are comparing the lengths of the bars. But this is not the primary EPT we are using here. We are really comparing position along a common scale once again. This is clearer if we represent the quantities not as bars, but as points. We are carrying out the exact same elementary perceptual task to compare the percents:
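In ggplot2 terms, the swap from bars to points is a one-layer change. The sketch below uses a hypothetical data frame `toy_hours` with made-up percents (not the ACS figures); only the geom differs between the two plots, while the mapping to the vertical scale stays the same.

```r
library(ggplot2)

# Made-up percents for a single group, for illustration only (NOT the ACS figures)
toy_hours <- data.frame(
  Hours.work = c("20 or fewer", "21-30", "31-39", "40 or more"),
  pct        = c(10, 15, 25, 50)
)

# Bar version: geom_col() is equivalent to geom_bar(stat = "identity")
p_bars <- ggplot(toy_hours, aes(x = Hours.work, y = pct)) +
  geom_col() +
  ylim(0, 100)

# Point version: same x and y mapping, so the same position-on-a-common-scale task
p_dots <- ggplot(toy_hours, aes(x = Hours.work, y = pct)) +
  geom_point(size = 3) +
  ylim(0, 100)
```

No `geom_line()` is added to the point version, so the plot itself does not suggest a trend across categories.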
This “point chart” has a much better data-ink ratio (a concept coined by Edward Tufte) than the bar chart. One could argue that it is cleaner and more succinct, better emphasizes the data, and exploits the exact same EPT as the bar chart. Why, then, are bar charts so pervasive while these “point charts” are not? I think the reason is twofold. First, we simply have a comfort level with bar charts. The point of a data visualization is communication: if we can communicate more quickly with a more familiar medium, that has advantages. Second, and perhaps more importantly, when we see points we immediately start looking for trends. Direction is one of Cleveland and McGill’s EPTs; we are more inclined to try to visually connect points than the tops of bars. However, we shouldn’t always connect points, especially if the categories on the horizontal axis have no sense of ordering (for example, if “race,” which has no ordering, were on the horizontal axis instead of employment level).
For these reasons, let’s go back to the bar chart.
What comparison does this encourage us to make? Because of the two separate panels for females and males (the Gestalt principle of enclosure), the encouraged comparisons seem to be within sex: among females, a much greater percent work 40 or more hours a week than fall in any of the other three categories; the same can be said for males. It is more likely that we want to compare across sex, assessing differences between males and females in work time. How can we better encourage this comparison? One approach is to group by hours worked, rather than sex:
This better encourages us to compare females to males, but it still requires us to jump from panel to panel to make that comparison for each employment level. Another major problem with the above graph is that the percents are calculated by conditioning on sex, whereas the facets make it appear the conditioning is on employment level. For example, when we see two bars representing percents in a single panel, we are tempted to think that the total height of the two bars is 100%, which is clearly not the case. The “whole” out of which the percents are taken is not at all clear.
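The difference between the two kinds of conditioning can be checked with a small base R sketch. The counts below are hypothetical (not the ACS figures) and the object names are illustrative only.

```r
# Hypothetical counts (NOT the ACS figures): rows are sex, columns are hours worked
counts <- matrix(
  c(100, 150, 250, 500,
     50, 100, 150, 700),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Sex   = c("Female", "Male"),
    Hours = c("20 or fewer", "21-30", "31-39", "40 or more")
  )
)

# Conditioning on sex: percents are taken within each row (each sex)
pct_by_sex <- 100 * prop.table(counts, margin = 1)

# Conditioning on employment level: percents are taken within each column
pct_by_hours <- 100 * prop.table(counts, margin = 2)

rowSums(pct_by_sex)    # 100 for each sex: the "whole" is everyone of that sex
colSums(pct_by_sex)    # 15, 25, 40, 120: a panel per hours category shows no whole
colSums(pct_by_hours)  # 100 for each hours category
```

The percents plotted in this post correspond to `pct_by_sex`, which is why the two bars inside a single employment-level panel do not add up to 100%.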
Here’s one last take: a stacked bar chart.
This accomplishes two objectives that the previous visualizations did not: A) making the most important comparison obvious (comparing females to males), and B) making obvious “the whole” out of which the percents are taken. Note that when comparing females to males, we use position along the common (vertical) axis to compare the percents who work 40 or more hours (by determining which dark blue bar extends highest on the vertical), and to compare the percents who work 20 or fewer hours (by determining which white bar extends lowest on the vertical). However, to compare the percent who work 21-30 or 31-39 across sex, we use length, since the middle bars do not share a common baseline. Although the stacked bar chart sacrifices position along a common scale for two of the employment levels, we make up for it with fewer facets, a more succinct presentation, and more obvious important comparisons.
Some of the averages in the data set I used for this post (for example, average income for Alabaman males with a doctoral degree in 2009) are based on very few observations, and should be taken with a very large grain of salt. Another brief data note: to aggregate this data set further, it is important to weight by the N column, which indicates how many individuals in the population the average for that row represents.
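As a small sketch of why the weighting matters, consider two hypothetical rows shaped like the aggregated file. The column names `avg.income` and `N` follow the data set; the values are made up.

```r
# Hypothetical rows in the shape of the aggregated file (values are made up)
rows <- data.frame(
  avg.income = c(50000, 90000),
  N          = c(9000, 1000)
)

# Unweighted: treats both rows as equally important
unweighted <- mean(rows$avg.income)                 # 70000

# Weighted by N: each row counts in proportion to the people it represents
weighted <- weighted.mean(rows$avg.income, rows$N)  # 54000
```

The unweighted mean overstates the combined average here because the small high-income row is given the same influence as the much larger row.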
If interested, here is the R code for creating the bar charts.

    library(ggplot2)
    library(dplyr)

    # Read in the data; assumes the current working directory houses the .csv file
    tmp <- read.csv('ACS-income-data-aggregated.csv')

    # Filter to only 2016; aggregate counts and incomes by sex and hours worked
    df <- tmp %>%
      filter(Year == 2016) %>%
      group_by(Sex, Hours.work) %>%
      # Weight by N so each row counts in proportion to the people it represents
      summarize(count = sum(N),
                avg.income = weighted.mean(avg.income, N)) %>%
      filter(Hours.work != "") %>%
      # summarize() leaves the data grouped by Sex, so pct is computed within sex
      mutate(pct = count / sum(count))

    # Side by side; faceted by sex:
    ggplot(data = df) +
      geom_bar(aes(x = Hours.work, y = 100 * pct, fill = Hours.work), stat = 'identity') +
      scale_fill_brewer(palette = 'Blues') +
      guides(fill = "none") +
      theme_dark() +
      ggtitle('Employment level by sex: Bar chart') +
      facet_wrap(~Sex) +
      theme(panel.grid = element_blank()) +
      xlab('Employment level') +
      ylab('Percent') +
      ylim(c(0, 100))

    # Side by side; faceted by employment level:
    ggplot(data = df) +
      geom_bar(aes(x = Sex, y = 100 * pct, fill = Hours.work), stat = 'identity') +
      scale_fill_brewer(palette = 'Blues') +
      guides(fill = "none") +
      theme_dark() +
      facet_grid(. ~ Hours.work) +
      theme(panel.grid = element_blank()) +
      xlab('Sex') +
      ylab('Percent') +
      ylim(c(0, 100))

    # Stacked:
    ggplot(data = df) +
      geom_bar(aes(x = Sex, y = 100 * pct, fill = Hours.work), stat = 'identity') +
      scale_fill_brewer(name = 'Employment level', palette = 'Blues') +
      theme_dark() +
      theme(panel.grid = element_blank()) +
      xlab('Sex') +
      ylab('Percent') +
      ylim(c(0, 100))