Monday morning, October 30, found me groggy and sandy-eyed. The culprit was the 5-hour, 17-minute, 10-inning thriller between the LA Dodgers and Houston Astros in Game 5 of the 2017 World Series the night before. Thanks to living in the Central Time Zone, I went to bed around 1am. The Astros ended up defeating the Dodgers 13-12 in an insane game that featured three comebacks from deficits of 3 runs or more. By the end, I was emotionally exhausted, and I didn’t even have a stake in either team! Plenty was written about the game over at Fangraphs, one of my favorite baseball analytics websites.
One post in particular by Craig Edwards caught my attention from a data visualization perspective. It compared the game to another epic: Game 6 of the 2011 World Series, where the St. Louis Cardinals staved off elimination by defeating the Texas Rangers 10-9 in 11 innings. The crux of the article was a table that listed the top-20 “most exciting events” of each game side-by-side. Here is a screenshot of said table:
An “exciting event” is one that yields a large swing in the win expectancy of the game, as measured by Win Probability Added, or WPA. Read Fangraphs’ excellent glossary entry on WPA for more detail, but essentially, the bigger the swing in win expectancy an event produces (in either direction), the more exciting it is.
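To make the idea concrete, here is a minimal sketch in R of how a swing in win expectancy turns into a ranking of events. The win-expectancy values are made up for illustration, and the sketch assumes that the WPA of a play is simply the change in one team's win expectancy from before the play to after it.

```r
# Hypothetical win expectancies (in %) for one team.
# The first value is the win expectancy before the first play;
# each later value is the win expectancy after a play.
we <- c(50, 46, 61, 58, 83)

# Assumed definition: WPA of a play = win expectancy after - before
wpa <- diff(we)

# Excitement ignores direction, so rank plays by the absolute swing
wpa_abs <- abs(wpa)
ranked <- order(wpa_abs, decreasing = TRUE)

data.frame(play = ranked, WPA = wpa[ranked], WPA_abs = wpa_abs[ranked])
```

Taking the absolute value is what lets a big swing against a team count as just as exciting as a big swing in its favor.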
The table is great, but it’s hard to see which of the two games has the more exciting top-20 events. I wanted to visualize! But I needed the data. Fangraphs has play logs for every single Major League game, which list (among other metrics) the events of the game, the win expectancy following each event, and the WPA of each event. I needed data from the Game 6, 2011 and the Game 5, 2017 play logs. But I needed them both in the same data source!
R has a great package, rvest, that makes web-scraping (especially scraping HTML tables) quite easy. Here’s the code I wrote to scrape the Game 6, 2011 data, do some cleaning, and write the cleaned data into a .csv file. I wrote very similar code to get the Game 5, 2017 data:
    library(rvest)
    library(dplyr)

    # Read in the data, find the right table:
    url <- 'https://www.fangraphs.com/plays.aspx?date=2011-10-27&team=Cardinals&dh=0&season=2011'
    raw <- read_html(url) %>%
      html_table(fill = TRUE)
    mytable <- raw[][,1:12]

    # Use dplyr to clean it up. By code row:
    #   Create absolute value of WPA
    #   Create new column to indicate the game
    #   Remove the "%" from the win expectancy, create Event number
    #   Arrange in descending order by WPA
    #   Create rank column
    cleantable <- mytable %>%
      mutate(WPA_abs = abs(WPA)) %>%
      mutate(Game = rep('Game 6, 2011', nrow(mytable))) %>%
      mutate(WE = as.numeric(gsub('%', '', WE)),
             Event = 1:nrow(mytable)) %>%
      arrange(-WPA_abs) %>%
      mutate(WPA_Rank = 1:nrow(mytable))

    # Write cleaned data to csv file
    write.csv(cleantable, file = 'WPA_Game6_2011_all.csv', row.names = FALSE)
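One way to get both games into a single data source is to stack the two cleaned files. This is only a sketch: `combine_games` is a hypothetical helper, the 2017 file name is assumed to mirror the 2011 one, and the output file name is made up.

```r
library(dplyr)

# Hypothetical helper: read several cleaned play-log files and stack them.
# Assumes every file has the same columns, including the Game label.
combine_games <- function(files) {
  files %>%
    lapply(read.csv, stringsAsFactors = FALSE) %>%
    bind_rows()
}

# Assumed file names; only the 2011 name appears in the scraping code.
files <- c('WPA_Game6_2011_all.csv', 'WPA_Game5_2017_all.csv')

# Only run when both cleaned files are present in the working directory
if (all(file.exists(files))) {
  both <- combine_games(files)
  write.csv(both, file = 'WPA_both_games.csv', row.names = FALSE)
}
```

Because each cleaned file already carries a Game column, the stacked file can be split back out by game in whatever tool reads it.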
Then on to visualizing! Using Tableau, I created a win-expectancy graph for each of the games. The red dots indicate “exciting events,” meaning events with a WPA of 15% or more:
Clearly, both games were crazy, with wild swings in win expectancy! If you count up the red dots, Game 6 in 2011 had 8 “exciting events” while Game 5 in 2017 had 10. However, 15% is quite an arbitrary threshold for “exciting,” and different thresholds would likely change the comparison. Comparing total WPA flips the story: the total WPA of 7.2 in Game 6, 2011 was larger than the total WPA of 6.2 in Game 5, 2017.
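The sensitivity to the threshold is easy to check in R. The sketch below uses toy numbers, not the real play logs, and assumes absolute WPA is stored as a fraction (so 0.15 means 15%); `summarise_excitement` is a hypothetical helper name.

```r
library(dplyr)

# Toy data, NOT the real play logs: absolute WPA as a fraction
plays <- data.frame(
  Game = rep(c('Game 6, 2011', 'Game 5, 2017'), each = 5),
  WPA_abs = c(0.42, 0.31, 0.12, 0.08, 0.02,
              0.25, 0.22, 0.18, 0.16, 0.03)
)

# Count "exciting events" per game at a given threshold, plus total WPA
summarise_excitement <- function(df, threshold = 0.15) {
  df %>%
    group_by(Game) %>%
    summarise(exciting = sum(WPA_abs >= threshold),
              total_WPA = sum(WPA_abs),
              .groups = 'drop')
}

summarise_excitement(plays, threshold = 0.15)
summarise_excitement(plays, threshold = 0.30)
```

In this toy data the 2017 game has more events above 15% but none above 30%, while the 2011 game has the larger total, which is the same kind of disagreement between metrics described above.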
My primary interest, however, was to compare the top-20 most exciting events of both games in a clearer visual manner, respecting principles of human perception:
From this graph, we can see four points on the right that lie below the reference line. These indicate that the top four most exciting events in Game 6, 2011 were more exciting than the top four most exciting events from Game 5, 2017. On the other hand, the events ranked 5-10 in Game 5, 2017 lie above the reference line, indicating they were more exciting than the events with WPA ranked 5-10 in Game 6, 2011. The remaining events, ranked 11-20 in WPA, go back to lying below the line.
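The same rank-by-rank comparison can be sketched in R. This assumes the graph pairs events by WPA rank, with Game 6, 2011 on the horizontal axis, Game 5, 2017 on the vertical axis, and a y = x reference line. The values are toy numbers, not the real play logs.

```r
library(dplyr)

# Toy data, NOT the real play logs: top-5 absolute WPA values per game
top_2011 <- c(0.42, 0.31, 0.20, 0.12, 0.10)
top_2017 <- c(0.35, 0.28, 0.22, 0.15, 0.08)

# Pair the events by rank and label which game wins at each rank
paired <- data.frame(
  WPA_Rank = seq_along(top_2011),
  WPA_2011 = sort(top_2011, decreasing = TRUE),
  WPA_2017 = sort(top_2017, decreasing = TRUE)
) %>%
  mutate(more_exciting = case_when(
    WPA_2017 > WPA_2011 ~ 'Game 5, 2017',  # point above the line
    WPA_2017 < WPA_2011 ~ 'Game 6, 2011',  # point below the line
    TRUE ~ 'tie'
  ))

# Scatter plot with the y = x reference line
plot(paired$WPA_2011, paired$WPA_2017,
     xlab = 'WPA, Game 6, 2011', ylab = 'WPA, Game 5, 2017')
abline(0, 1, lty = 2)
```

A point below the dashed line means the 2011 event at that rank had the larger swing, and a point above it means the 2017 event did.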
So, it does appear that Game 6, 2011 was truly a more thrilling game than Game 5, 2017! The total WPA of Game 6, 2011 was greater than the total WPA of Game 5, 2017, and the top-20 “most exciting” events tended to be more exciting in 2011. Oh well, Game 5, 2017 was still worth losing a few hours of sleep! I think….