Tidy Tuesday: NFL Stadium Attendance

Feb 4, 2020 6 min read tidytuesday

Dip into the new ggttext package to get some rich text formatting in a plot’s title. Also: sports.

Load in data
The question: to super bowl winners have harder schedules?
Summarize data
Test the difference in schedule strength
Plot
Bonus: Color by Team Name
Bonus: What is the bump in attendance the year after winning a superbowl?

Load in data

attendance <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv', 
                       col_types = cols())

glimpse(attendance)

## Observations: 10,846
## Variables: 8
## $ team              <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizon…
## $ team_name         <chr> "Cardinals", "Cardinals", "Cardinals", "Cardinals",…
## $ year              <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 200…
## $ total             <dbl> 893926, 893926, 893926, 893926, 893926, 893926, 893…
## $ home              <dbl> 387475, 387475, 387475, 387475, 387475, 387475, 387…
## $ away              <dbl> 506451, 506451, 506451, 506451, 506451, 506451, 506…
## $ week              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ weekly_attendance <dbl> 77434, 66009, NA, 71801, 66985, 44296, 38293, 62981…

standings <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv', 
                      col_types = cols())
glimpse(standings)

## Observations: 638
## Variables: 15
## $ team                 <chr> "Miami", "Indianapolis", "New York", "Buffalo", …
## $ team_name            <chr> "Dolphins", "Colts", "Jets", "Bills", "Patriots"…
## $ year                 <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
## $ wins                 <dbl> 11, 10, 9, 8, 5, 13, 12, 9, 7, 4, 3, 12, 11, 7, …
## $ loss                 <dbl> 5, 6, 7, 8, 11, 3, 4, 7, 9, 12, 13, 4, 5, 9, 10,…
## $ points_for           <dbl> 323, 429, 321, 315, 276, 346, 333, 321, 367, 185…
## $ points_against       <dbl> 226, 326, 321, 350, 338, 191, 165, 255, 327, 359…
## $ points_differential  <dbl> 97, 103, 0, -35, -62, 155, 168, 66, 40, -174, -2…
## $ margin_of_victory    <dbl> 6.1, 6.4, 0.0, -2.2, -3.9, 9.7, 10.5, 4.1, 2.5, …
## $ strength_of_schedule <dbl> 1.0, 1.5, 3.5, 2.2, 1.4, -1.3, -2.5, -0.2, -1.4,…
## $ simple_rating        <dbl> 7.1, 7.9, 3.5, 0.0, -2.5, 8.3, 8.0, 3.9, 1.1, -1…
## $ offensive_ranking    <dbl> 0.0, 7.1, 1.4, 0.5, -2.7, 1.5, 0.0, 0.6, 3.2, -8…
## $ defensive_ranking    <dbl> 7.1, 0.8, 2.2, -0.5, 0.2, 6.8, 8.0, 3.3, -2.1, -…
## $ playoffs             <chr> "Playoffs", "Playoffs", "No Playoffs", "No Playo…
## $ sb_winner            <chr> "No Superbowl", "No Superbowl", "No Superbowl", …

The question: to super bowl winners have harder schedules?

I am curious if teams that win the super bowl have an easier or harder schedule than the other teams that make it to the playoffs. The strength_of_schedule variable provides a simple way to test this.

The only data manipulation we need to do is combine playoffs and sb_winner:

standings <- standings %>%
  mutate(final_position = if_else(sb_winner == "Won Superbowl", sb_winner, playoffs))

Summarize data

standing_summary <- standings %>%
  group_by(final_position) %>%
  summarize(less_zero = sum(strength_of_schedule < 0), 
            total = n(), 
            percent_less_zero = less_zero / total, 
            median_strength = median(strength_of_schedule)
            )

standing_summary

## # A tibble: 3 x 5
##   final_position less_zero total percent_less_zero median_strength
##   <chr>              <int> <int>             <dbl>           <dbl>
## 1 No Playoffs          170   398             0.427             0.2
## 2 Playoffs             132   220             0.6              -0.5
## 3 Won Superbowl          8    20             0.4               0.2

It looks like most teams that make it to the playoffs have an easier schedule (60% of them hace an average opponent rating less than zero) than those that win the super bowl (only 40% have average opponent rating less than zero). Teams that don’t make the playoffs also seem to have more difficult schedule.

Test the difference in schedule strength

The Kruskal-Wallis is suitable for multiple rank-based comparisons of groups:

standings %>%
  kruskal.test(strength_of_schedule ~ final_position, data = .)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  strength_of_schedule by final_position
## Kruskal-Wallis chi-squared = 27.31, df = 2, p-value = 1.174e-06

But sometimes, doing pairwise Wilcox tests can be more interpretable.

standings %>%
 filter(final_position != "Won Superbowl") %>%
  wilcox.test(strength_of_schedule ~ final_position, data = .)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  strength_of_schedule by final_position
## W = 54732, p-value = 2.549e-07
## alternative hypothesis: true location shift is not equal to 0

standings %>%
  filter(final_position != "No Playoffs") %>%
  wilcox.test(strength_of_schedule ~ final_position, data = .)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  strength_of_schedule by final_position
## W = 1599, p-value = 0.04333
## alternative hypothesis: true location shift is not equal to 0

Plot

Manually specify some colors (taken from the fun wesanderson package)

final_position_pal = c(
  "No Playoffs" = "#F2AD00", 
  "Playoffs" = "#00A08A", 
  "Won Superbowl"  = "#FF0000"
)

We will use the features of the great new ggtext package to color text in the subtitle of the plot, obviating the need for a figure legend.

library(ggtext)

standings %>%
  ggplot(aes(x = final_position, y = strength_of_schedule, fill = final_position)) + 
  geom_hline(yintercept = 0, lwd = 0.5, lty = 2, color = 'grey 50') + 
  geom_boxplot() +
  #geom_jitter(width = 0.25) + 
  geom_text(data = standing_summary, 
            inherit.aes = FALSE, 
            nudge_y = 0.22, 
            mapping = aes(x = final_position, y = median_strength, label = median_strength)) + 
  scale_fill_manual(values = final_position_pal,
                      guide = 'none'
                      ) + 
  theme_minimal() + 
  theme(axis.title = element_blank(), 
        axis.text.x = element_blank(), 
        panel.grid.minor = element_blank(), 
        panel.grid.major.x = element_blank(),
        plot.title = element_textbox(), 
        plot.subtitle = element_markdown(linewidth = 20)
        ) +
  labs(title = "**No Easy Road to Super Bowl Victory**", 
       subtitle = "Since 2000, NFL <b style='color:#F2AD00'>teams that miss the playoffs</b> have a harder schedule (an average opponent  
rating greater than the dashed line at zero) than <b style='color:#00A08A'>teams that make it to the playoffs</b>.  
A <b style='color:#FF0000'>Super Bowl winner's</b>  schedule is much tougher by comparison."
) + 
  ggsave(filename = 'featured.png', width = 7, height = 7)

Bonus: Color by Team Name

How does it look to plot individual points for each team, coloring them accoring to their team colors (thanks to the teamcolors package)?

team_fill = teamcolors::league_pal('nfl', which = 2)
team_color = teamcolors::league_pal('nfl', which = 1)

## Use just team name, not the home city (which changes for a couple teams)
names(team_fill)  <- str_remove(names(team_fill), "^.* ")
names(team_color)  <- str_remove(names(team_color), "^.* ")

standings %>%
  ggplot(aes(x = final_position, y = strength_of_schedule, fill = team_name, color = team_name)) + 
  geom_jitter(width = 0.4, pch = 23, size  = 3) + 
  scale_fill_manual(values = team_fill, 
                    guide = 'none'
                    ) + 
  scale_color_manual(values = team_color,
                     guide = 'none'
                    )   + 
  theme_minimal() + 
  theme(axis.title = element_blank(), 
        panel.grid.minor = element_blank(), 
        panel.grid.major.x = element_blank(),
        plot.title = element_textbox(), 
        plot.subtitle = element_markdown(linewidth = 20)
        )

Ugly. Too many of the teams have colors that are close to one another.

Bonus: What is the bump in attendance the year after winning a superbowl?

Making use of the lag() function, if you account for the fact that each team has a different “baseline” of attendance, it looks like the year after a superbowl win, attendence to away games may go up a bit. Home games and total attendance doesn’t really change though.

attendance %>%
  select(team, team_name, year, total, home, away) %>%
  distinct() %>%
  group_by(team) %>%
  ## adjust attendance on a per-team basis to account for a difference baseline for each team
  mutate_at(vars(total, home, away), .funs = ~scale(., scale = FALSE) ) %>%
  ungroup() %>%
  full_join(standings) %>%
  group_by(team) %>%
  mutate_all(.funs = list(last_year = ~lag(.))) %>%
  filter(sb_winner_last_year == "Won Superbowl") %>%
  filter(!is.na(sb_winner_last_year)) %>%
  ungroup() %>%
  arrange(team, year) %>%
  gather(attendance_the_year_after_winning_superbowl, value, total, total_last_year, home, home_last_year, away, away_last_year) %>%
  ggplot(aes(x = attendance_the_year_after_winning_superbowl, y = value)) + 
  geom_boxplot() + 
  ggpubr::stat_compare_means() + 
  theme_minimal()

## Joining, by = c("team", "team_name", "year")

## `mutate_all()` ignored the following grouping variables:
## Column `team`
## Use `mutate_at(df, vars(-group_cols()), myoperation)` to silence the message.

R ggtext

Tanner Koomar, PhD

Postdoctoral Research Scholar

My research interests include computational genetics, machine learning, and science communication