Stat consulting: a data science playground

This past semester I taught our STAT 370 (statistical consulting and communication) for the first time. This course gave student experience consulting for real clients from the university and community and focused on communicating with a client as well as report and presentation preparation best practices. Most of the required analyses were simple: paired t-tests, simple linear regression, etc. What struck me was the nontrivality of the data tidying process! While STAT 370 is taken mostly by our statistics majors, so many of the examples we encountered would be beautiful case studies for our introductory DSCI (data science) curriculum. In this post I present an actual example from a client that illustrates this.

The data here concerned undergraduate nursing students in one of four terms of the nursing program. Of interest was measuring the students’ resiliency as measured by the Connor-Davidson Resiliency Scale (CDRS), both prior to and following impelementation of a Stress Management and Resiliency Training (SMART). The client was interested in determining for which of the four terms was there a significant change in resilience.

Here are the data we’re working with (link to pre | link to post). id stands for a unique student identifier. We also have responses to 18 of the CDRS items (each on a 5-point Likert scale). The post data set also contains the student terms; notably this information is not available in the pre data set.

library(dplyr)
library(tidyr)
library(ggplot2)
pre <- read.csv('cdrs_data_pre.csv')
post <- read.csv('cdrs_data_post.csv')
head(post)
names(pre) #missing term information!!
##    id  Term CDRS_Q1 CDRS_Q2 CDRS_Q3 CDRS_Q4 CDRS_Q5 CDRS_Q6 CDRS_Q7
## 1  23 Term1       4       4       3       4       4       3       3
## 2 183 Term1       3       4       4       3       4       3       4
## 3  80 Term1       3       4       4       3       4       3       4
## 4   1 Term1       3       4       4       3       3       2       4
## 5 166 Term1       3       4       3       4       4       4       4
## 6  15 Term1       3       3       2       3       3       2       3
##   CDRS_Q8 CDRS_Q9 CDRS_Q10 CDRS_Q11 CDRS_Q12 CDRS_Q20 CDRS_Q21 CDRS_Q22
## 1       4       4        4        4        3        2        3        3
## 2       2       3        3        3        3        3        3        3
## 3       3       4        4        4        4        4        4        4
## 4       4       4        4        4        4        2        3        3
## 5       4       4        4        4        4        3        4        4
## 6       3       2        3        3        3        2        3        3
##   CDRS_Q23 CDRS_Q24 CDRS_Q25
## 1        3        4        4
## 2        3        4        4
## 3        3        4        4
## 4        2        3        4
## 5        4        4        4
## 6        3        3        3
##  [1] "id"       "CDRS_Q1"  "CDRS_Q2"  "CDRS_Q3"  "CDRS_Q4"  "CDRS_Q5" 
##  [7] "CDRS_Q6"  "CDRS_Q7"  "CDRS_Q8"  "CDRS_Q9"  "CDRS_Q10" "CDRS_Q11"
## [13] "CDRS_Q12" "CDRS_Q20" "CDRS_Q21" "CDRS_Q22" "CDRS_Q23" "CDRS_Q24"
## [19] "CDRS_Q25"

Note that the data are not tidy in the sense that each row is a person, and we have variable information on the questions in columns. I’ll return to this in a bit.

Here ultimately is a visualization that we could use to determine for which terms are the SMART effects strongest, and for which terms is the effect statistically significant:

We can see that the average resilience pre-SMART was lower than average resilience post-SMART, and that these differences were most extreme for students in Terms 2 and 3 (which were also the only statistically significant differences). Additionally, students in Term 4 had very high pre- and post-SMART resilience (they’re seasoned veterans, after all!)

A simple plot, with a simple interpretation. But the path to get there is anything but! To create this plot we need:

  1. to average the 18 CDRS items for each student;
  2. join the data sets;
  3. compute paired t-tests for each term;
  4. prepare data for plotting;
  5. plot

So, let’s proceed!

Average the CDRS items

This is perhaps the most interesting step in the process. As mentioned earlier, the data are not tidy in the sense that we have variable information in columns instead of rows. We could reshape (“gather” or “melt”) to average CDRS score by term. Doing this for the post data set:

post_melt <- post %>% 
  gather(key = 'Question',value='Response',CDRS_Q1:CDRS_Q25)
head(post_melt)
##    id  Term Question Response
## 1  23 Term1  CDRS_Q1        4
## 2 183 Term1  CDRS_Q1        3
## 3  80 Term1  CDRS_Q1        3
## 4   1 Term1  CDRS_Q1        3
## 5 166 Term1  CDRS_Q1        3
## 6  15 Term1  CDRS_Q1        3
post_melt %>% 
  group_by(id,Term) %>%
  summarize(post_mean = mean(Response))
## # A tibble: 67 x 3
## # Groups:   id [?]
##       id   Term post_mean
##    <int> <fctr>     <dbl>
##  1     1  Term1  3.333333
##  2     2  Term3  3.000000
##  3     4  Term2  2.833333
##  4     8  Term2  3.111111
##  5     9  Term1  2.722222
##  6    13  Term4  2.722222
##  7    15  Term1  2.777778
##  8    20  Term1  3.000000
##  9    23  Term1  3.500000
## 10    26  Term4  2.555556
## # ... with 57 more rows

We could also forego the melting, since ultimately all we want is the average response for each student. We can use the original post data set, and an application of the rowMeans() function within which is nested the select() command:

post2 <- post %>%
  mutate(post_mean = rowMeans(select(.,CDRS_Q1:CDRS_Q25))) %>%
  select(id, Term, post_mean)
head(post2)
##    id  Term post_mean
## 1  23 Term1  3.500000
## 2 183 Term1  3.277778
## 3  80 Term1  3.722222
## 4   1 Term1  3.333333
## 5 166 Term1  3.833333
## 6  15 Term1  2.777778

One could argue that the first approach (using gather()) is the best, since it first creates a “tidy” data set and is arguably more readable. But the second appears more succinct. I’ll use the second approach to carry out the same task on the pre data:

pre2 <- pre %>%
  mutate(pre_mean = rowMeans(select(.,CDRS_Q1:CDRS_Q25))) %>%
  select(id, pre_mean)

Join pre and post

This step is easy!

both <- inner_join(pre2,post2,by='id')
head(both)
##   id pre_mean  Term post_mean
## 1  1 3.333333 Term1  3.333333
## 2  2 2.944444 Term3  3.000000
## 3  4 2.888889 Term2  2.833333
## 4  8 2.944444 Term2  3.111111
## 5  9 2.444444 Term1  2.722222
## 6 13 2.944444 Term4  2.722222

Carry out paired t-tests by term

Here is the lone “formal” statistical aspect of the whole problem! We can carry out paired t-tests by term as follows using some ugly base-R code: the by() command. I won’t show the output, but here we can see that only for Terms 2 and 3 is the SMART effect statistically significant:

by(both, both$Term, function(df) t.test(df$pre_mean,df$post_mean,paired=TRUE))

Prepare data for plotting

Referring back to the figure, the geometric mapping includes four points for the four pre-SMART CDRS averages (one for each term); four points for the post-SMART CDRS averages (one for each term); and a line segment connecting them. We can use the joined data set to form our “plotting” data set.

toplot <- both %>%
  group_by(Term) %>%
  summarize(avg_pre = mean(pre_mean),avg_post = mean(post_mean)) %>% 
  mutate(sig = c('no','yes','yes','no'))
toplot
## # A tibble: 4 x 4
##     Term  avg_pre avg_post   sig
##   <fctr>    <dbl>    <dbl> <chr>
## 1  Term1 2.965608 3.100529    no
## 2  Term2 2.977778 3.161111   yes
## 3  Term3 2.952675 3.255144   yes
## 4  Term4 3.211111 3.200000    no

Note that the avg_pre and avg_post columns are in fact “means of means,” and we have manually added a column to indicate whether the difference was statistically significant.

Plot!

And now the fun begins! Here is ggplot code to create the original figure:

ggplot(data = toplot) + 
  geom_point(aes(x = avg_pre, y = 1:4, shape='a',color=sig),size=2)  + 
  geom_point(aes(x = avg_post, y = 1:4,shape='b',color=sig),size=2) + 
  geom_segment(aes(x = avg_pre, xend = avg_post, y=1:4,yend=1:4,color=sig)) + 
  scale_y_reverse() + xlab('average CDRS') + ylab('Term') + 
  scale_shape_manual(name='',values=c('a'=19,'b'=17),labels=c('pre SMART','post SMART')) + 
  scale_color_discrete(name='significant?')

Moral

Wow! The only truly “statistical” aspect of this whole process was a paired t-test. And even that was a bit tricky: figuring out how to carry out these t-tests by term! In reflecting back on the semester, I am struck that our data science students would do well to take our stat consulting course. They would be kept very interested in data “wrangling” tasks such as this. On the flip side, it’s an invaluable experience for our STAT majors (who comprise the usual audience of this course) to realize that the formal statistical analysis in which they are most trained is ultimately the last in a long series of data wrangling steps.

Related