(source: AFL Nation)
The focus of this post is on creating AFLW team matchplay profiles, using match data from the 2017 and 2018 seasons. By drawing on match data from the 2 completed AFLW seasons thus far, this post also outlines how I think about, analyse, and present insights from small data sets.
Load up those R packages and import the data. Here, for convenience, I am importing from a cleaned CSV file that includes match data from the 2017 and 2018 AFLW seasons.
Ordinarily, my preference is to obtain data from APIs (live connections) or to ‘re-scrape’ the data to ensure I am accessing the most up-to-date version.
library(readr)
library(dplyr)
library(stringr)
library(knitr)
library(kableExtra)
library(tidyr)
library(ggplot2)
library(ggridges)

# Import aflw_merged data set from GitHub repo
aflw_merged <- read_csv(
  "https://raw.githubusercontent.com/jacquietran/aflw_data_retrieval/master/output/aflw_merged.csv",
  col_names = TRUE, col_types = NULL)
Inspecting the data
This data set is small in the sense that we have:
- 2 seasons’ worth of matches played,
- 8 teams in each season,
- 7 rounds in the regular season,
- Every team playing 1 match per round, for a total of 4 matches per round,
- Plus 1 grand final each year.
We can do the simple math:
2 seasons x ((7 rounds x 4 matches per round) + 1 grand final) = 58 matches
…and we can check that we have the same number of matches represented in the data set:
## [1] 58
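The code that produced this count is not shown here. As a minimal sketch of one way to run the check, the snippet below computes the expected number of matches and then counts distinct match IDs. It assumes that `Match.Id` uniquely identifies a match and that the data hold one row per team per match; `toy_matches` is a made-up stand-in for `aflw_merged`, so the sketch runs without downloading anything.

```r
# Expected match count from the season structure described above
n_seasons <- 2
n_rounds <- 7
matches_per_round <- 4
grand_finals_per_season <- 1
expected_matches <- n_seasons *
  (n_rounds * matches_per_round + grand_finals_per_season)

# Hypothetical stand-in for aflw_merged: one row per team per match,
# so every match ID appears twice (assumption about the data layout)
toy_matches <- data.frame(
  Match.Id = rep(seq_len(expected_matches), each = 2),
  side = rep(c("home", "away"), times = expected_matches)
)

# Count distinct matches rather than rows
observed_matches <- length(unique(toy_matches$Match.Id))

expected_matches
observed_matches
nrow(toy_matches)
```

With the real data, the same idea would be `length(unique(aflw_merged$Match.Id))`. Counting rows instead would give twice the number of matches.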
By documenting simple features and constraints of a data set, we as responsible analysts can rule out (or temporarily park) certain analyses since they won’t produce reliable results.
For example, we can’t use this data to identify common patterns when Team A plays Team B. A pairing of Team A and Team B will appear at most twice in a season, and only if both teams make it to the grand final. All other pairings will appear only once per season.
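The once-per-season claim can be sanity-checked with a little counting. This sketch assumes the regular season is a single round robin (every team meets every other team once), which is consistent with 8 teams playing 7 rounds. The team labels are placeholders, not real club names.

```r
n_teams <- 8
n_rounds <- 7
matches_per_round <- n_teams / 2

# Number of distinct team pairings: "8 choose 2"
n_pairings <- choose(n_teams, 2)

# Number of regular-season matches
n_regular_matches <- n_rounds * matches_per_round

# If the two counts agree and no pairing repeats,
# every pairing meets exactly once in the regular season
n_pairings == n_regular_matches

# Enumerate the pairings explicitly (placeholder team labels)
pairings <- t(combn(paste("Team", LETTERS[1:n_teams]), 2))
nrow(pairings)
```

With so few meetings per pairing, even pooling both seasons leaves only a handful of observations for any head-to-head question.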
So what can we do with small data sets like this one?
With small samples, the key is to use as much of the data as possible - don’t slice it up in too many ways! More specifically, this means:
- Addressing whole-of-group questions;
- Working with continuous variables in their continuous form (and resisting the temptation to categorically define continuous variables where this is not warranted by the data structure); and
- Using or defining categorical variables with a small number of levels or ‘bins’ or subgroups.
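To see why slicing hurts, here is a rough back-of-the-envelope sketch of how quickly the average subgroup shrinks. The 116 records come from 58 matches x 2 teams per match. The win/loss split is a hypothetical extra cut added only for illustration, and real subgroups will not be perfectly even.

```r
n_records <- 116  # 58 matches x 2 teams per match
n_teams <- 8
n_seasons <- 2

# Average number of records per subgroup under progressively finer slicing
slicing <- data.frame(
  split_by = c("none (whole league)",
               "team",
               "team x season",
               "team x season x win/loss"),  # hypothetical extra split
  n_groups = c(1,
               n_teams,
               n_teams * n_seasons,
               n_teams * n_seasons * 2)
)
slicing$avg_records_per_group <- n_records / slicing$n_groups

slicing
```

Each additional grouping variable multiplies the number of subgroups, so the records available per subgroup fall away fast.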
One typical form of whole-group analysis is to produce league-wide summary statistics. By including all match data from both seasons, we get to include 116 records in this analysis (58 matches x 2 teams per match = 116).
These days, my ‘default’ approach is to calculate medians and interquartile ranges, rather than means and standard deviations. You may recall from your high school maths classes: The former measures (median and interquartile range) are less affected by outliers that exist in skewed data sets, and in normally distributed data sets, the median will approximate the mean anyway.
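A quick illustration of that robustness, using made-up scores rather than AFLW data: adding a single extreme value moves the mean and standard deviation far more than the median and interquartile range. The helper `summarise_scores()` is defined here purely for the example.

```r
# Made-up points totals for illustration only (not AFLW data)
typical_scores <- c(22, 25, 28, 30, 32, 33, 35, 38, 41)
with_outlier <- c(typical_scores, 95)  # one unusually high score added

# Helper defined for this example: both pairs of summary measures
summarise_scores <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
}

comparison <- rbind(
  without_outlier = summarise_scores(typical_scores),
  with_outlier = summarise_scores(with_outlier)
)

round(comparison, 2)
```

Comparing the two rows shows the median barely shifting while the mean is dragged towards the outlier.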
# Calculate summary stats for goals, behinds, and points scored
# Both seasons combined
aflw_league_scoring <- aflw_merged %>%
  summarise(num_matches = length(Match.Id),
            goals_median = median(goals),
            goals_IQR = IQR(goals),
            behinds_median = median(behinds),
            behinds_IQR = IQR(behinds),
            points_for_median = median(points_for),
            points_for_IQR = IQR(points_for)) %>%
  mutate(team = "All teams combined") %>%
  select(team, everything()) # Reorder the columns

# Display table
aflw_league_scoring %>%
  kable("html") %>%
  kable_styling()
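|team|num_matches|goals_median|goals_IQR|behinds_median|behinds_IQR|points_for_median|points_for_IQR|
|---|---|---|---|---|---|---|---|
|All teams combined|116|5|3|5|4|32.5|20.25|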
We can adapt this code chunk by using the group_by() and ungroup() functions to generate team-specific summary statistics, like so:
# Calculate team-specific scoring summary stats
aflw_team_scoring <- aflw_merged %>%
  group_by(team) %>%
  summarise(num_matches = length(Match.Id),
            goals_median = median(goals),
            goals_IQR = IQR(goals),
            behinds_median = median(behinds),
            behinds_IQR = IQR(behinds),
            points_for_median = median(points_for),
            points_for_IQR = IQR(points_for)) %>%
  # Sort by median points scored, descending
  arrange(desc(points_for_median))

# Display table
aflw_team_scoring %>%
  kable("html") %>%
  kable_styling()
We could also supplement these data summaries with a plot.
Note that we are using the aflw_merged data set for the plot, rather than plotting the summary statistics stored in aflw_team_scoring. Tracey Weissgerber explains the thinking behind this with respect to bar and line charts, but the underlying concepts that she emphasises apply here too.
(Tweet from @T_Weissgerber, January 22, 2019)
# Create an ordered data set for plotting purposes
aflw_scoring_team_ordered <- aflw_merged %>%
  group_by(team) %>%
  mutate(points_for_median = median(points_for)) %>%
  ungroup() %>%
  arrange(desc(points_for_median)) %>%
  mutate(team = factor(team, levels = rev(unique(team))))

# Build plot
p <- ggplot(aflw_scoring_team_ordered,
            aes(x = points_for, y = team, fill = team))
p <- p + geom_vline(
  xintercept = aflw_league_scoring$points_for_median,
  size = 2, colour = "grey")
p <- p + geom_density_ridges(
  quantile_lines = TRUE, quantiles = 2, scale = 0.9,
  jittered_points = TRUE,
  position = position_points_jitter(width = 0.05, height = 0),
  point_shape = '|', point_size = 3, point_alpha = 1, alpha = 0.7)
p <- p + geom_text(
  aes(label = paste0("League median = ",
                     aflw_league_scoring$points_for_median,
                     " points"),
      x = 32, y = "Melbourne"),
  hjust = -0.1, vjust = -4.75, colour = "darkgrey")
p <- p + labs(
  title = "Distribution of total points scored by AFLW teams",
  subtitle = "All matches, 2017 & 2018 AFLW seasons",
  x = "Points scored")
p <- p + scale_y_discrete(
  expand = expand_scale(add = c(0.5, 1.25)))
p <- p + theme_minimal()
p <- p + theme(
  legend.position = "none",
  panel.grid.minor = element_blank(),
  plot.title = element_text(size = 14),
  plot.subtitle = element_text(size = 12),
  axis.title.x = element_text(size = 14),
  axis.title.y = element_blank(),
  axis.text = element_text(size = 12))

# Display plot
p
Team matchplay profiles
A common way of profiling team attributes using game statistics is to organise actions into different matchplay categories. For this analysis, I’m grouping the actions as follows:
- Scoring: Goals, behinds, points for, goal accuracy, goal assists, goal efficiency, shot efficiency, shots at goal.
- Offensive actions: Clearances, marks, possessions, disposals, disposal efficiency, handballs, hitouts, inside 50s, kicks, rebound 50s.
- Defensive actions: Tackles, intercepts, points against.
- Errors and penalties: Clangers, turnovers, frees for and against.
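One way to encode this grouping in R is a named list that maps each category to its variables. Treat the following as a sketch: the scoring names match the columns used later in this post, but the names in the other three categories are my guesses at how the columns might be labelled and are not confirmed against the data set.

```r
matchplay_categories <- list(
  # Scoring names match columns selected later in this post
  scoring = c("points_for", "goals", "behinds", "goal_accuracy",
              "goal_assists", "goal_efficiency", "shot_efficiency",
              "shots_at_goal"),
  # Names below are assumed, not confirmed against the data set
  offensive = c("clearances", "marks", "possessions", "disposals",
                "disposal_efficiency", "handballs", "hitouts",
                "inside_50s", "kicks", "rebound_50s"),
  defensive = c("tackles", "intercepts", "points_against"),
  errors_penalties = c("clangers", "turnovers", "frees_for",
                       "frees_against")
)

# Long lookup table: one row per variable, labelled with its category
category_lookup <- stack(matchplay_categories)
names(category_lookup) <- c("variable", "category")

# Number of variables in each category
lengths(matchplay_categories)
```

A lookup table like `category_lookup` could then be joined onto long-format match data to label each variable with its category.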
Grouping game statistics in this way allows us to retain detailed information from variable-specific analyses, while also creating a framework to ‘zoom out’ and compare teams in terms of general areas of strength and weakness. For example, a team’s profile might suggest that they are particularly strong (compared to other teams) in their defensive actions, but are sub-par when it comes to scoring actions.
I like to analyse data in this manner as one of many elements in analysing opposition teams. This approach is useful because you can narrow the focus of your scouting through video. If the data indicates that a team’s key strength is their defense, then you might choose to focus your attention on studying clips of their defensive tactics. The numbers alone don’t tell the whole story, especially in the case of match data that is currently available in AFLW; the category of ‘defensive actions’ only includes measures related to tackling, intercepts, and points against (i.e., points allowed). The quantitative analysis gives you guiding information for follow-up analyses and efficiency gains, while watching and analysing game video provides important contextual information such as player positioning, team shape, timing, decisions made, and skill performance. Using the two together can provide a more well-rounded picture of a team’s tactics and strategies.
# Subset to scoring variables
aflw_scoring_wide <- aflw_merged %>%
  select(contains(".Id"), Round.Number, Round.Abbreviation, team,
         points_for, goals, behinds, goal_accuracy, goal_assists,
         goal_efficiency, shot_efficiency, shots_at_goal)
# Retain wide format for data summary

# Reshape wide to long for plotting
aflw_scoring_long <- aflw_scoring_wide %>%
  gather(points_for, goals, behinds, goal_accuracy, goal_assists,
         goal_efficiency, shot_efficiency, shots_at_goal,
         key = "variable", value = "value") %>%
  # Order factor levels within $variable
  # to specify order of appearance when plotted
  mutate(variable = factor(variable,
                           levels = c("points_for", "goals", "behinds",
                                      "goal_accuracy", "goal_assists",
                                      "goal_efficiency", "shot_efficiency",
                                      "shots_at_goal")))

# TODO: Create data summary
# TODO: Identify top 3 teams for each metric

# Plot
p <- ggplot(aflw_scoring_long, aes(x = team, y = value))
p <- p + facet_wrap(~variable, nrow = 2, scales = "free_x")
p <- p + geom_boxplot(aes(group = team), outlier.shape = NA)
p <- p + geom_point(alpha = 1/3, size = 3, aes(colour = team))
p <- p + coord_flip()
p <- p + theme(legend.position = "none")
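The "top 3 teams for each metric" TODO in the chunk above is left open. Purely as a sketch of how it might be approached, the snippet below ranks teams by their median value on each metric. It uses a made-up long-format data frame (`toy_long`) with the same `team`, `variable`, and `value` columns as `aflw_scoring_long`, and it assumes that "top" means "highest median", which will not suit every metric. The helper `top_n_teams()` is defined here for the example.

```r
# Toy long-format data mimicking aflw_scoring_long (values are made up)
toy_long <- data.frame(
  team = rep(c("Team A", "Team B", "Team C", "Team D"), each = 3, times = 2),
  variable = rep(c("goals", "behinds"), each = 12),
  value = c(5, 6, 7,  3, 4, 5,  8, 9, 10,  1, 2, 3,   # goals
            4, 4, 5,  6, 7, 8,  2, 3, 3,   5, 5, 6),  # behinds
  stringsAsFactors = FALSE
)

# Median value per team for each metric
team_medians <- aggregate(value ~ variable + team,
                          data = toy_long, FUN = median)

# Helper for this example: keep the n rows with the highest medians
top_n_teams <- function(df, n = 3) {
  df <- df[order(-df$value), ]
  head(df, n)
}

# Apply the helper within each metric and stack the results
top3 <- do.call(rbind,
                lapply(split(team_medians, team_medians$variable),
                       top_n_teams))
rownames(top3) <- NULL

top3
```

With only around 14 or 15 records per team, small differences between medians should be read cautiously, and ties would need an explicit rule.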
Errors and penalties
Summary of team profiles
Perhaps the most important part of working with small data sets is to take a conservative approach when interpreting your results and communicating your observations to others. The goal of data analysis is to use information to approximate the truth (Weston, 1987), to understand ‘the way things are’.