The fall semester is over and final grades are in, which means it’s time to reflect on what just took place and how to grow from here. Today, I reflect on my third time teaching the data visualization course. This course has come a long way since the first time I taught it in Fall 2015, and yet there are still so many improvements to make! One set of concepts I want to emphasize more next time I teach the course is the Gestalt principles, which Gestalt psychologist Kurt Koffka summarizes as the idea that “The whole is other than the sum of the parts.” I like to think of the Gestalt principles as ground rules for how to create meaningful patterns out of chaos.
Twain Taylor has an excellent post on how the Gestalt principles manifest in data visualization. He summarizes the principles in this chart:
Quoting from his post:
Here is what we notice from each of the illustrations:
- Proximity: We see three rows of dots instead of four columns of dots because they are closer horizontally than vertically.
- Similarity: We see similar looking objects as part of the same group.
- Enclosure: We group the first four and last four dots as two rows instead of eight dots.
- Symmetry: We see three pairs of symmetrical brackets rather than six individual brackets.
- Closure: We automatically close the square and circle instead of seeing three disconnected paths.
- Continuity: We see one continuous path instead of three arbitrary ones.
- Connection: We group the connected dots as belonging to the same group.
- Figure & ground: We either notice the two faces, or the vase. Whichever we notice becomes the figure, and the other the ground.
In my experience visualizing data, it seems the most pervasive principles are proximity, similarity, enclosure, and connection.
I was struck by the ubiquity of these principles as I was reviewing my students’ final visualization projects. One very fine project in particular inspired me to write this post. Next time I teach the visualization course, I hope to include the following example to illustrate these principles to my students.
Illustrating the Gestalt principles with ACS income data
As an example data set, I settled on data from 2009 to 2016 from the American Community Survey, which I requested from the IPUMS USA data request system. I specifically requested incomes by year, sex, educational attainment, and field of degree. I filtered the data to include only those with a Bachelor’s degree or higher who were employed at the time of sampling, and created a new variable to indicate whether or not the field was a STEM field (using a combination of this source and this source to help me decide).
The primary question I want to visualize is: What is the inequality in average income comparing males to females? Of course, average income depends on many other factors, including field of degree, type of degree, and year, to name just a few. We’ll focus on visualizing the gender income gap adjusting for these other factors as well.
Here’s a quick look at the first six rows of the data set.
##   Year    Sex              Educ    Field avg.income
## 1 2009 Female Bachelor's degree Non-STEM   47440.41
## 2 2009 Female Bachelor's degree     STEM   51423.20
## 3 2009 Female   Doctoral degree Non-STEM   77378.22
## 4 2009 Female   Doctoral degree     STEM   87257.88
## 5 2009 Female   Master's degree Non-STEM   60549.57
## 6 2009 Female   Master's degree     STEM   65621.16
Notably, the data set consists of one row per year/sex/education/field category, with avg.income indicating the average income for that combination.
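To make the "one row per combination" structure concrete, here is a minimal sketch that re-enters the six printed rows by hand and checks that no Year/Sex/Educ/Field combination is repeated. The object names (`first_rows`, `keys`, `one_row_per_cell`) are hypothetical; only the values come from the printout.

```r
# The six rows shown in the printout, typed in by hand.
first_rows <- data.frame(
  Year = rep(2009, 6),
  Sex = rep("Female", 6),
  Educ = rep(c("Bachelor's degree", "Doctoral degree", "Master's degree"),
             each = 2),
  Field = rep(c("Non-STEM", "STEM"), times = 3),
  avg.income = c(47440.41, 51423.20, 77378.22, 87257.88, 60549.57, 65621.16)
)

# "One row per category" means the four grouping columns form a unique key.
keys <- first_rows[, c("Year", "Sex", "Educ", "Field")]
one_row_per_cell <- !any(duplicated(keys))
print(one_row_per_cell)
```

The same check could be run on the full data set after reading it in.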
So let’s visualize! We want to investigate the gender income gap, so here’s a first go:
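A first attempt along these lines might be sketched in ggplot2 as follows. This assumes sex on the horizontal axis and one point per row of the data set; the code behind this particular figure is not given, `plot_by_sex` is a hypothetical name, and the file name is the one used in the code at the end of this post.

```r
library(ggplot2)

# Hypothetical helper: one point per row, with Sex on the horizontal axis.
plot_by_sex <- function(d) {
  ggplot(data = d) +
    geom_point(aes(x = Sex, y = avg.income / 1000)) +
    ylab("Average income (in thousand $)")
}

# Draw the plot only if the aggregated file is in the working directory.
if (file.exists("ACS-stem-aggregated.csv")) {
  stemdata <- read.csv("ACS-stem-aggregated.csv")
  print(plot_by_sex(stemdata))
}
```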
Mostly, this is chaos! But we do see the Gestalt principle of proximity manifest itself: we perceive the incomes on the left (belonging to females) as a group, and the incomes on the right (belonging to males) as a group. Let’s incorporate Year on the horizontal, since we are accustomed to identifying trends across time:
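Putting Year on the horizontal axis and color-coding by sex could look roughly like this. It is a sketch under the assumption that the points are simply colored by the Sex column; `plot_by_year` is a hypothetical name, and the file name is taken from the code at the end of this post.

```r
library(ggplot2)

# Hypothetical helper: Year on the horizontal axis, Sex mapped to color.
plot_by_year <- function(d) {
  ggplot(data = d) +
    geom_point(aes(x = Year, y = avg.income / 1000, color = Sex)) +
    ylab("Average income (in thousand $)")
}

# Draw the plot only if the aggregated file is in the working directory.
if (file.exists("ACS-stem-aggregated.csv")) {
  stemdata <- read.csv("ACS-stem-aggregated.csv")
  print(plot_by_year(stemdata))
}
```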
We now see year groupings by way of the proximity principle, and sex groupings via color-coding with the similarity principle. There’s still too much chaos, however. Let’s try again:
The amount of chaos is reduced dramatically, all by way of implementing the enclosure principle, specifically by enclosing each highest level of educational attainment in its own panel. This particular type of enclosure is often referred to as faceting in the data visualization realm. The drastic reduction in chaos, and improvement in clarity, comes from the fact that there are four educational attainment groupings. We would not have improved the clarity as much if we had, say, grouped by field of degree instead, which only has two groups:
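The two faceting choices compared here can be sketched with a single helper that takes the faceting column as an argument. This is an illustration rather than the code behind the figures; `plot_faceted` and its `by` argument are hypothetical, and the file name is the one used in the code at the end of this post.

```r
library(ggplot2)

# Hypothetical helper: enclose the points in one panel per level of `by`.
plot_faceted <- function(d, by = "Educ") {
  ggplot(data = d) +
    geom_point(aes(x = Year, y = avg.income / 1000, color = Sex)) +
    facet_grid(as.formula(paste(". ~", by))) +  # e.g. . ~ Educ
    ylab("Average income (in thousand $)")
}

# Draw both versions only if the aggregated file is available.
if (file.exists("ACS-stem-aggregated.csv")) {
  stemdata <- read.csv("ACS-stem-aggregated.csv")
  print(plot_faceted(stemdata, by = "Educ"))   # four panels
  print(plot_faceted(stemdata, by = "Field"))  # two panels
}
```

Passing the column name as a string keeps the two versions side by side, so the only thing that changes between them is the number of enclosures.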
Faceting by educational attainment is better, so let’s continue working with that one, now indicating field of degree:
The new principle is connection. Clearly, we perceive each line as an entity, representing now a Sex/Field combination (Male/STEM, for example). Another instance of similarity is in play, since the lines for STEM fields are dashed, while the lines for non-STEM fields are solid.
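A plot of this kind could be sketched as below, assuming color still encodes Sex and line type now encodes Field. ggplot2 groups a layer by the combination of its discrete aesthetics, so each line joins the points of one Sex/Field pair within a panel. The name `plot_connected` is hypothetical, and this is not necessarily the exact code behind the figure.

```r
library(ggplot2)

# Hypothetical helper: connect points over time, one line per Sex/Field pair.
plot_connected <- function(d) {
  ggplot(data = d) +
    geom_point(aes(x = Year, y = avg.income / 1000, color = Sex)) +
    geom_line(aes(x = Year, y = avg.income / 1000,
                  color = Sex, linetype = Field)) +
    facet_grid(. ~ Educ) +
    ylab("Average income (in thousand $)")
}

# Draw the plot only if the aggregated file is in the working directory.
if (file.exists("ACS-stem-aggregated.csv")) {
  stemdata <- read.csv("ACS-stem-aggregated.csv")
  print(plot_connected(stemdata))
}
```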
So, here we have it, a visualization that illustrates the principles of proximity, similarity, enclosure, and connection. We’ve implemented these principles to significantly reduce chaos and improve clarity. But there’s still an issue with this visualization. Remember the intent of the visualization is to explore gender income inequality. If we take another look at the above visualization, this is not the most obvious comparison to make. Rather, due to the proximity and similarity of the lines within each sex (they are close together, and of the same color), the comparison our brain is encouraged to make is to compare incomes of STEM to non-STEM, within sex. That’s not the most important comparison! It makes most sense to compare males to females, within field.
This brings us to a related concept: not all means of introducing similarity are created equal. When we group by similarity, we tend to first recognize similarities in color, then in shape. Angela Wright, a color psychologist, states that “color is noticed by the brain before shapes or wording.” Thus if we want to encourage the viewer to compare the sexes, we should probably color-code by field instead, so we compare income by gender within field. Here’s how that looks:
I’m still not convinced that this encourages the brain to first compare the sexes to each other. The proximity of the STEM and non-STEM lines is too hard to overcome! We might need to introduce another layer of enclosure (by way of faceting) to make the important comparison the most obvious:
Details on data collection and visualization
IPUMS stands for Integrated Public-Use Microdata Series. The suite of IPUMS tools is an excellent source of varied data, from health to education to international census data. I aggregated the data for this example from the microdata extract, and you can download a .csv of the aggregated data here.
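The aggregation step might be sketched as follows. This assumes a hypothetical microdata frame with one row per respondent, the same four grouping columns as the aggregated file, and an `income` column. It takes a plain unweighted mean; whether survey weights were applied in the original aggregation is not stated, so treat this as an illustration of the shape of the step only.

```r
# Hypothetical helper: collapse respondent-level rows to one row per
# Year/Sex/Educ/Field combination, holding the (unweighted) mean income.
aggregate_income <- function(micro) {
  out <- aggregate(income ~ Year + Sex + Educ + Field,
                   data = micro, FUN = mean)
  names(out)[names(out) == "income"] <- "avg.income"
  out
}

# Usage, for a data frame `micro` with the columns described above:
# stemdata <- aggregate_income(micro)
```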
R code for creating the “final two” visualizations is below:
library(ggplot2)

# Assuming the file is in the current working directory:
stemdata <- read.csv("ACS-stem-aggregated.csv")

# Order the education levels for the facet panels:
stemdata$Educ <- factor(stemdata$Educ,
                        levels = c("Bachelor's degree", "Master's degree",
                                   "Doctoral degree", "Professional degree"))

# Faceting by education; color-coding by field; different line types for Sex:
ggplot(data = stemdata) +
  geom_point(aes(x = Year, y = avg.income / 1000, color = Field)) +
  geom_line(aes(x = Year, y = avg.income / 1000, color = Field, linetype = Sex)) +
  facet_grid(. ~ Educ) +
  ylab("Average income (in thousand $)")

# Double faceting:
ggplot(data = stemdata) +
  # geom_point(aes(x = Year, y = avg.income / 1000, shape = Sex)) +
  geom_line(aes(x = Year, y = avg.income / 1000, linetype = Sex)) +
  facet_grid(Field ~ Educ) +
  ylab("Average income (in thousand $)")